Assignment 03 Fun with Spark Part II
Assignment 03 Description
Fun with Spark Lab
Prepare Sample Data
For this assignment we are going to use two sample data sets. One is a text file version of the classic book “War and Peace”, the other is the same housing sales data from Week 2.
If you have not already done so, please put home_data.csv onto your VM and then also war_and_peace.txt. To get the data onto your VM you can either download to your local machine and then use the SCP command to copy the files e.g: “scp home_data.csv username@ip: (the colon is necessary)
Or use the wget command and pass in the link directly, similar to how you downloaded the HDP sandbox. Note that the “stylized” single quote marks in this doc might cause problems if you copy/paste directly into command line.
wget ‘https://drive.google.com/a/uw.edu/uc?authuser=2&id=0B0Ntj7VtxrluZG9xRkc0NmZ4Q0E&export=download’ -O war_and_peace.txt
wget ‘https://drive.google.com/a/uw.edu/uc?authuser=2&id=0B0Ntj7VtxrluN1dFWlRiY0pHNHM&export=download’ -O home_data.csv
If you would like to, you can instead add to HDFS and access from there. To copy home_data.csv to HDFS can either scp into sandbox and then use HDFS commands to copy to HDFS:
Scp home_data.csv root@localhost:
Then SSH into sandbox and run
Hadoop fs -put home_data.csv /tmp/home_data.csv
Or use the Ambari Web UI “Files View” (located same place you find the “Hive2 View” and can upload directly from local machine via web UI
Run Spark Shell
From within the sandbox (after you ssh -p root@localhost) run the command “spark-shell”
(or “pyspark” for a Python shell)
You can safely ignore warning about “SparkUI can’t bind to port…”
If all goes well you should see a screen with some excellent ASCII art like this:
Congratulations, now you are ready to Spark!
To get some familiarity and make sure Spark is working properly, I’d recommend running through a few of the examples from the slides this week.
Assignment 3
Please answer the following questions using the sample data provided, providing code that you used. Either Scala or Python is acceptable. If you are unable to come up with code to answer a question please describe how you think you would solve the problem in a “sparkified” way based on what we have learned so far.
You will likely find the String method “split” in both Scala and Python, and “contains” in Scala (“in” in Python) useful in many of these exercises. This week’s material shows some sample usage.
- Which lines from the book war_and_peace.txt contain both words “war” and “peace”?
- Approximately how many sentences are in the book war_and_peace.txt? HINT: Let’s define a sentence as a sequence of text that ends with a period,and note that an approximation is ok.
- Save a random sampling of 500 lines from war_and_peace.txt either to local or HDFS storage as text
- From home_data.csv, how many houses sold were built prior to 1979?
- From home_data.csv, how many houses sold had a lot size (in sq. ft) that was greater than triple the living space (in sq. ft)? Save this output to file either locally or on HDFS. Example: lot size 500 living size 100 would count (500 > 300) but 500/200 would not (500 < 600).
Bonus Exercise
Remember, bonus exercise is meant for those either with prior knowledge on the topic or the interest (and time) to do some research beyond what we covered in class and is optional.
Exercise 1:
Let’s again use the zipcode list from Assignment 2, you can access via
wget ‘https://drive.google.com/a/uw.edu/uc?authuser=0&id=0B0Ntj7VtxrluSXhiakdXLWx0N3c&export=download’ -O wa_zipcodes.csv
How many homes were sold with a zipcode being defined as in “Seattle”?
Hint: how do you go about doing this in SQL?
Exercise 2:
Using home_data.csv create an RDD of key/value pairs where the key is the “id” of a row and the value is everything else.
For Example: "7129300520","20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650
The key would be "7129300520" and the value would be "20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650
Hint: Most of the transformations we have done so far are simple one liners; But remember that you can use any arbitrary function that you define in your transformation, like we did with the square root example in the functional programming demo.
Assignment 03 Answers
Tim Pauley
BI Data 201A Assignment 03
Update on Answers
Assignment 3
Please answer the following questions using the sample data provided, providing code that you used. Either Scala or Python is acceptable. If you are unable to come up with code to answer a question please describe how you think you would solve the problem in a ìsparkifiedî way based on what we have learned so far.
You will likely find the String method ìsplitî in both Scala and Python, and ìcontainsî in Scala (ìinî in Python) useful in many of these exercises. This weekís material shows some sample usage.
1. Which lines from the book war_and_peace.txt contain both words ìwarî and ìpeaceî?
###Process: Convert everything to uppercase, issue can arise with words like warm bause there is no spack in the search
val lines = sc.textFile("file:///data/war_and_peace.txt")
val combined = lines.map(line => line.toUpperCase()).filter(line => line.contains("WAR") && line.contains("PEACE"))
combined.count
Answer res0: Long = 9
Answer Print
combined.collect.mkString("\n")
res1: String =
WAR AND PEACE
HAD FOLLOWED, AND ONLY THE HEAVENS PROMISED PEACE. TOWARD MORNING ALL
AT HOME IN A QUIET HAVEN. IN MOSCOW HE FELT AT PEACE, AT HOME, WARM AND
"TO ENTER RUSSIA WITHOUT DECLARING WAR! I WILL NOT MAKE PEACE AS LONG AS
THEY ARE NEITHER FIT FOR WAR NOR PEACE! BARCLAY IS SAID TO BE THE MOST
PEACE AND THE ABOLITION OF WAR, AND SECONDLY, BY THE FACT THAT WHEN HE
WAR AND THE PEACE THAT HAD BEEN CONCLUDED. "YES, I HAVE BEEN MUCH
BLAMED," HE SAID, "BOTH FOR THAT WAR AND THE PEACE... BUT EVERYTHING
MADE PEACE WITH TURKEY AND SHOULD NOT HAVE BEEN THROUGH WITH THAT WAR.
2. Approximately how many sentences are in the book war_and_peace.txt? HINT: Letís define a sentence as a sequence of text that ends with a period,and note that an approximation is ok.
###Process: First define words as flat map, then filter words based on critera of sentance with period.
val lines = sc.textFile("file:///data/war_and_peace.txt")
val words = lines.flatMap(line => line.split(" "))
val periods = words.filter(word => word.contains("."))
periods.count
Answer res3: Long = 26652
3.Save a random sampling of 500 lines from war_and_peace.txt either to local or HDFS storage as text
###Process:first take sample, then parallelize, finally save to temp folder, ouput is 4 files. If you want one do (sample,1)
val lines = sc.textFile("file:///data/war_and_peace.txt")
val sample = lines.takeSample(false, 500)
sc.parallelize(sample).saveAsTextFile("file:///tmp/ouput2")
***val sample returns
sample: Array[String] = Array(wouldn't have room to turn round there. If you must go, go to the, "",
Nat·sha., domestic serf, about forty-five years old, with grizzled hair and a, manservant to the Vozdvizhenka to ascertain whether he had come.
He had, thing really.", were seven of us. We lived well. We were real peasants. It so, wrath. Having snapped at an officer for an unpolished badge,
at another, uplifted whip meant that he saw a sitting hare.), couple of bottles of wine by himself. The process in his mind went on, have a look at it"
óand Kut˙zov with an ironical smile about the corners, armies made their presence known, but later when they reached the, at it, and stamped his foot twice
on the floor as a signal. The door, Perhaps you think you have invented a novelty? You have been,...
Answer: Returns (4) files to directory
4. From home_data.csv, how many houses sold were built prior to 1979?
###Process: Similar to a Vlookup(), we seperate date by column, then count over 14 columns to capture (yr_built)
val lines = sc.textFile("file:///data/home_data.csv")
val filtered = lines.filter(line => line.split(",")(14) < "1979" )
filtered.count
Answer: res0 Long = 11991
5. From home_data.csv, how many houses sold had a lot size (in sq. ft) that was greater than triple the living space (in sq. ft)?
Save this output to file either locally or on HDFS. Example: lot size 500 living size 100 would count (500 > 300) but 500/200 would not (500 < 600).
###Process: header is remove row (1), then replaced row (1) with row (0),
### Next: call out value doesn't = header
### Next: like Vlookup(), go over 6 columns to [sqFt lot] > 3, then multiply over 5 columns to [sqFt Living Room] ***note: toInt is converting string to interger
val homes = sc.textFile("file:///data/home_data.csv")
val header = homes.take(1)(0)
val noHeader = homes.filter(line => !(line contains header))
val sqFt = noHeader.filter(line => {val splits = line.split(","); splits(6).toInt > 3 * splits(5).toInt})
sqFt.count
Answer = Long = 14267
Bonus
Exercise
Exercise 1:
1. How many homes were sold with a zipcode being defined as in ìSeattleî? Hint: how do you go about doing this in SQL?
### Process: remove header
### Go over 1 column to [city] = Seattle
### Go over 0 column to colect [zipcode] Array
val zipcodes = sc.textFile("file:///data/wa_zipcodes.csv")
val seattleLines = zipcodes.filter(line => line.split(",")(1) contains ("Seattle"))
val seattleZips = seattleLines.map(line => line.split(",")(0)).collect
Answer:Note: Seattle Zips becomes a locckup table
seattleZips: Array[String] = Array(98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98111, 98112, 98113, 98114, 98115, 98116, 98117, 98118
, 98119, 98121, 98122, 98124, 98125, 98126, 98127, 98129, 98131, 98132, 98133, 98134, 98136, 98138, 98139, 98141, 98144, 98145, 98146, 98148, 98151
, 98154, 98155, 98158, 98160, 98161, 98164, 98165, 98166, 98168, 98170, 98171, 98174, 98175, 98177, 98178, 98181, 98184, 98185, 98188, 98190, 98191
, 98194, 98195, 98198, 98199)
###Process matching seattleZip (lookup table), with seattleHomes over 16 columns to [zipcode], then taking substring because it is in quotes, hense (6) int
val homes = sc.textFile("file:///data/home_data.csv")
val seattleHomes = homes.filter(line => seattleZips contains line.split(",")(16).substring(1,6))
seattleHomes.count
Answer: res2 Long = 8977
**Note: There is a more efficent way to do this
Exercise 2: [Stop 19:32:23] look at tomorrow
Using home_data.csv create an RDD of key/value pairs where the key is the ìidî of a row and the value is everything else.
For Example: "7129300520","20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650
The key would be "7129300520" and the value would be "20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650
###Process: it drops the header, grabs the first tubble and strings it
val homes = sc.textFile("file:///data/home_data.csv")
val header = homes.take(1)(0)
val noHeader = homes.filter(line => !(line contains header))
val mapped = noHeader.map(home => {val spilts = home.split(","); (splits(0), splits.drop(1).mkString(","))})
mapped.take(2)
Answer
res4: Array[(String, String)] = Array(("7129300520","20141013T000000",221900,"3","1",1180,5650,"1",0,0,3,7,1180,0,1955,0,"98178",47.5112,-122.257,1340,5650)
, ("6414100192","20141209T000000",538000,"3","2.25",2570,7242,"2",0,0,3,7,2170,400,1951,1991,"98125",47.721,-122.319,1690,7639))
Comments
Post a Comment