Posts

Final Project: NYC Parking Tickets

Final Project, Big Data Technologies 201A
The big data file and the sample data are here: https://www.kaggle.com/new-york-city/nyc-parking-tickets/data
This is a cool map I couldn't get to work. Maybe I will attempt it next course: http://www.bigendiandata.com/2017-06-27-Mapping_in_Jupyter/
This is a way to split CSV files on a Windows machine: https://www.addictivetips.com/windows-tips/csv-splitter-for-windows/
Moving a local CSV file from the local drive to the VM (Docker container)
Command Prompt: move to the Linux server
pscp -i c:\BigDataTechnolgies\ServerKey\tjpauley_azure.ppk C:\data\DateDim.csv tjpauley@IPAddress:
Note: I used the FAQ on PuTTY's website.
Linux command: copy to the sandbox data folder
sudo cp DateDim.csv /data
Zeppelin Notebook: create the data frame
case class DateDim(DateNo: String, Weekday: String, Year: Integer, Month: Integer, Day: Integer)
val ParkingFourteens = spark.read.option("inferSchema", "true...
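
The post links to a Windows tool for splitting CSV files. As an alternative, here is a minimal sketch of the same idea in Python, using only the standard library. The function name `split_csv`, the part-file naming scheme, and the `rows_per_file` parameter are all assumptions made for illustration; none of them come from the original project.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, rows_per_file):
    """Split a CSV into numbered part files, repeating the header in each.

    Hypothetical helper: the name, arguments, and output naming are
    illustrative assumptions, not part of the original project.
    """
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be at least 1")

    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    parts = []        # paths of the part files written so far
    out_file = None   # currently open part file, if any
    writer = None

    try:
        with source.open(newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:
                return parts  # empty input: nothing to write

            for i, row in enumerate(reader):
                # Start a new part file every rows_per_file data rows.
                if i % rows_per_file == 0:
                    if out_file is not None:
                        out_file.close()
                    name = f"{source.stem}_part{len(parts) + 1:03d}.csv"
                    part_path = out_dir / name
                    out_file = part_path.open("w", newline="", encoding="utf-8")
                    writer = csv.writer(out_file)
                    writer.writerow(header)
                    parts.append(part_path)
                writer.writerow(row)
    finally:
        if out_file is not None:
            out_file.close()

    return parts
```

Calling `split_csv("C:/data/tickets.csv", "C:/data/parts", 1000000)` (both paths are placeholders) would write files such as `tickets_part001.csv`, each starting with the original header row, so every piece can be copied to the VM and read by Spark on its own.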

Assignment 05 Fun with Spark III

Week 5 Lab
This week’s work outside of class is optional, but highly recommended. We will walk through installing Jupyter and make some sandbox modifications to support running Jupyter alongside Zeppelin and the Spark Web UI. I’ve also provided some additional fun Spark exercises for those of you who are especially interested in the software part of our class.
Docker Port Forwarding Modifications
In this step we are going to go through the process we talked about in class this week for adding additional port forwards to your sandbox-hdp Docker container. Remember that we can only set up these proxies when the container is created, so we need to commit an image of our current setup and create a new container with the new mappings.
SSH into your VM.
Stop the sandbox-hdp Docker container:
sudo docker stop sandbox-hdp
Confirm it is not running using Docker “ps”:
sudo docker ps
Create an image of the current container state:
sudo docker com...

Assignment 04 Fun with Spark Part III

Assignment 04 Description
Assignment 4 - More Fun with Spark
No environment setup this week! Use the Spark shell (or a notebook if you have set one up) to do this week’s assignment in either Scala or Python. Unless specified otherwise, please use the DataFrame API for this week’s work; do not just use SQL string queries. If you are unable to come up with code to answer a question, please describe how you think you would solve the problem in a “sparkified” way based on what we have learned so far.
From home_data.csv, how many houses sold were built prior to 1979? (I promise that is the last time I’m going to ask this question.)
From home_data.csv, what is the most expensive zipcode in the data set, defined as highest average sales price?
How many unique zipcodes have sales data in the home_data.csv data set?
Demonstrate how to drop the “sqft_living15” and “sqft_lot15” columns from your dataset.
Access the zipcode table stored in Hive (ok to use a SQL string...

Assignment 03 Fun with Spark Part II

Assignment 03 Description
Fun with Spark Lab
Prepare Sample Data
For this assignment we are going to use two sample data sets. One is a text file version of the classic book “War and Peace”; the other is the same housing sales data from Week 2. If you have not already done so, please put home_data.csv and war_and_peace.txt onto your VM. To get the data onto your VM you can either download the files to your local machine and then use the scp command to copy them, e.g.:
scp home_data.csv username@ip:   (the colon is necessary)
Or use the wget command and pass in the link directly, similar to how you downloaded the HDP sandbox. Note that the “stylized” single quote marks in this doc might cause problems if you copy/paste directly into the command line.
wget ‘https://drive.google.com/a/uw.edu/uc?authuser=2&id=0B0Ntj7VtxrluZG9xRkc0NmZ4Q0E&export=download’ -O war_and_peace.txt
wget ‘ https://drive.google.com/a/uw.edu/uc?authuser=2...