Assignment 03 Fun with Spark Part II

Assignment 03 Description


Fun with Spark Lab

Prepare Sample Data

For this assignment we are going to use two sample data sets.  One is a text file version of the classic book “War and Peace”, the other is the same housing sales data from Week 2.

If you have not already done so, please put home_data.csv onto your VM and then also war_and_peace.txt.  To get the data onto your VM you can either download to your local machine and then use the SCP command to copy the files e.g: “scp home_data.csv username@ip:   (the colon is necessary)
Or use the wget command and pass in the link directly, similar to how you downloaded the HDP sandbox.   Note that the “stylized” single quote marks in this doc might cause problems if you copy/paste directly into command line.



If you would like to, you can instead add to HDFS and access from there.  To copy home_data.csv to HDFS can either scp into sandbox and then use HDFS commands to copy to HDFS:

Scp home_data.csv root@localhost:
Then SSH into sandbox and run
Hadoop fs -put home_data.csv /tmp/home_data.csv

Or use the Ambari Web UI “Files View” (located same place you find the “Hive2 View” and can upload directly from local machine via web UI

Run Spark Shell

From within the sandbox (after you ssh -p root@localhost) run the command “spark-shell
(or “pyspark” for a Python shell)

You can safely ignore warning about “SparkUI can’t bind to port…”

If all goes well you should see a screen with some excellent ASCII art like this:


Congratulations, now you are ready to Spark!

To get some familiarity and make sure Spark is working properly, I’d recommend running through a few of the examples from the slides this week.

Assignment 3

Please answer the following questions using the sample data provided, providing code that you used.  Either Scala or Python is acceptable.  If you are unable to come up with code to answer a question please describe how you think you would solve the problem in a “sparkified” way based on what we have learned so far.

You will likely find the String method “split” in both Scala and Python, and “contains” in Scala (“in” in Python)  useful in many of these exercises.  This week’s material shows some sample usage.

  1. Which lines from the book war_and_peace.txt contain both words “war” and “peace”?  
  2. Approximately how many sentences are in the book war_and_peace.txt?  HINT: Let’s define a sentence as a sequence of text that ends with a period,and note that an approximation is ok.
  3. Save a random sampling of 500 lines from war_and_peace.txt either to local or HDFS storage as text
  4. From home_data.csv, how many houses sold were built prior to 1979?
  5. From home_data.csv, how many houses sold had a lot size (in sq. ft) that was greater than triple the living space (in sq. ft)?  Save this output to file either locally or on HDFS.  Example: lot size 500 living size 100 would count (500 > 300) but 500/200 would not (500 < 600).

Bonus Exercise

Remember, bonus exercise is meant for those either with prior knowledge on the topic or the interest (and time) to do some research beyond what we covered in class and is optional.

Exercise 1:

Let’s again use the zipcode list from Assignment 2, you can access via

How many homes were sold with a zipcode being defined as in “Seattle”?

Hint: how do you go about doing this in SQL?



Exercise 2:

Using home_data.csv create an RDD of key/value pairs where the key is the “id” of a row and the value is everything else.  

For Example: "7129300520","20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650

The key would be "7129300520" and the value would be "20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650

Hint:  Most of the transformations we have done so far are simple one liners; But remember that you can use any arbitrary function that you define in your transformation, like we did with the square root example in the functional programming demo.


Assignment 03 Answers

Tim Pauley 
BI Data 201A Assignment 03 
Update on Answers Assignment 3

 Please answer the following questions using the sample data provided, providing code that you used. Either Scala or Python is acceptable. If you are unable to come up with code to answer a question please describe how you think you would solve the problem in a ìsparkifiedî way based on what we have learned so far. You will likely find the String method ìsplitî in both Scala and Python, and ìcontainsî in Scala (ìinî in Python) useful in many of these exercises. This weekís material shows some sample usage. 

 1. Which lines from the book war_and_peace.txt contain both words ìwarî and ìpeaceî? 

 ###Process: Convert everything to uppercase, issue can arise with words like warm bause there is no spack in the search 

val lines = sc.textFile("file:///data/war_and_peace.txt") 
val combined = lines.map(line => line.toUpperCase()).filter(line => line.contains("WAR") && line.contains("PEACE")) 
combined.count 

 Answer res0: Long = 9 

Answer Print combined.collect.mkString("\n") res1: String =

 WAR AND PEACE HAD FOLLOWED, AND ONLY THE HEAVENS PROMISED PEACE. TOWARD MORNING ALL AT HOME IN A QUIET HAVEN. IN MOSCOW HE FELT AT PEACE, AT HOME, WARM AND "TO ENTER RUSSIA WITHOUT DECLARING WAR! I WILL NOT MAKE PEACE AS LONG AS THEY ARE NEITHER FIT FOR WAR NOR PEACE! BARCLAY IS SAID TO BE THE MOST PEACE AND THE ABOLITION OF WAR, AND SECONDLY, BY THE FACT THAT WHEN HE WAR AND THE PEACE THAT HAD BEEN CONCLUDED. "YES, I HAVE BEEN MUCH BLAMED," HE SAID, "BOTH FOR THAT WAR AND THE PEACE... BUT EVERYTHING MADE PEACE WITH TURKEY AND SHOULD NOT HAVE BEEN THROUGH WITH THAT WAR. 

 2. Approximately how many sentences are in the book war_and_peace.txt? HINT: Letís define a sentence as a sequence of text that ends with a period,and note that an approximation is ok.
 ###Process: First define words as flat map, then filter words based on critera of sentance with period.  

val lines = sc.textFile("file:///data/war_and_peace.txt") val words = lines.flatMap(line => line.split(" ")) 
val periods = words.filter(word => word.contains(".")) 
periods.count 

 Answer res3: Long = 26652 

 3.Save a random sampling of 500 lines from war_and_peace.txt either to local or HDFS storage as text 

 ###Process:first take sample, then parallelize, finally save to temp folder, ouput is 4 files. If you want one do (sample,1) 

 val lines = sc.textFile("file:///data/war_and_peace.txt") 
val sample = lines.takeSample(false, 500) sc.parallelize(sample).saveAsTextFile("file:///tmp/ouput2") 

 ***val sample returns sample: Array[String] = Array(wouldn't have room to turn round there. If you must go, go to the, "", Nat·sha., domestic serf, about forty-five years old, with grizzled hair and a, manservant to the Vozdvizhenka to ascertain whether he had come. He had, thing really.", were seven of us. We lived well. We were real peasants. It so, wrath. Having snapped at an officer for an unpolished badge, at another, uplifted whip meant that he saw a sitting hare.), couple of bottles of wine by himself. The process in his mind went on, have a look at it" óand Kut˙zov with an ironical smile about the corners, armies made their presence known, but later when they reached the, at it, and stamped his foot twice on the floor as a signal. The door, Perhaps you think you have invented a novelty? You have been,... Answer: Returns (4) files to directory 

 4. From home_data.csv, how many houses sold were built prior to 1979? ###Process: Similar to a Vlookup(), we seperate date by column, then count over 14 columns to capture (yr_built) val lines = sc.textFile("file:///data/home_data.csv") val filtered = lines.filter(line => line.split(",")(14) < "1979" ) filtered.count Answer: res0 Long = 11991 

 5. From home_data.csv, how many houses sold had a lot size (in sq. ft) that was greater than triple the living space (in sq. ft)? Save this output to file either locally or on HDFS. Example: lot size 500 living size 100 would count (500 > 300) but 500/200 would not (500 < 600). ###Process: header is remove row (1), then replaced row (1) with row (0), ### Next: call out value doesn't = header ### Next: like Vlookup(), go over 6 columns to [sqFt lot] > 3, then multiply over 5 columns to [sqFt Living Room] ***note: toInt is converting string to interger val homes = sc.textFile("file:///data/home_data.csv") val header = homes.take(1)(0) val noHeader = homes.filter(line => !(line contains header)) val sqFt = noHeader.filter(line => {val splits = line.split(","); splits(6).toInt > 3 * splits(5).toInt}) sqFt.count Answer = Long = 14267 Bonus 

Exercise Exercise 1: 

 1. How many homes were sold with a zipcode being defined as in ìSeattleî? Hint: how do you go about doing this in SQL? 

 ### Process: remove header 
### Go over 1 column to [city] = Seattle 
### Go over 0 column to colect [zipcode] Array val zipcodes = sc.textFile("file:///data/wa_zipcodes.csv") 

val seattleLines = zipcodes.filter(line => line.split(",")(1) contains ("Seattle")) 
val seattleZips = seattleLines.map(line => line.split(",")(0)).collect 


Answer:Note: Seattle Zips becomes a locckup table seattleZips: Array[String] = Array(98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98111, 98112, 98113, 98114, 98115, 98116, 98117, 98118 , 98119, 98121, 98122, 98124, 98125, 98126, 98127, 98129, 98131, 98132, 98133, 98134, 98136, 98138, 98139, 98141, 98144, 98145, 98146, 98148, 98151 , 98154, 98155, 98158, 98160, 98161, 98164, 98165, 98166, 98168, 98170, 98171, 98174, 98175, 98177, 98178, 98181, 98184, 98185, 98188, 98190, 98191 , 98194, 98195, 98198, 98199) 

 ###Process matching seattleZip (lookup table), with seattleHomes over 16 columns to [zipcode], then taking substring because it is in quotes, hense (6) int val homes = sc.textFile("file:///data/home_data.csv") val seattleHomes = homes.filter(line => seattleZips contains line.split(",")(16).substring(1,6)) seattleHomes.count Answer: res2 Long = 8977 **Note: There is a more efficent way to do this Exercise 2: [Stop 19:32:23] look at tomorrow Using home_data.csv create an RDD of key/value pairs where the key is the ìidî of a row and the value is everything else. For Example: "7129300520","20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650 The key would be "7129300520" and the value would be "20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650 

 ###Process: it drops the header, grabs the first tubble and strings it 

val homes = sc.textFile("file:///data/home_data.csv") 
val header = homes.take(1)(0) val noHeader = homes.filter(line => !(line contains header)) val mapped = noHeader.map(home => {val spilts = home.split(","); (splits(0), splits.drop(1).mkString(","))}) mapped.take(2) 

Answer res4: Array[(String, String)] = Array(("7129300520","20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650) , ("6414100192","20141209T000000",538000,"3","2.25",2570,7242,"2",0,0,3,7,2170,400,1951,1991,"98125",47.721,-122.319,1690,7639))



Comments

Popular posts from this blog

Final Project: NYC Parking Tickets

Assignment 05 Fun with Spark III