Posts

Showing posts from November, 2017

Assignment 05 Fun with Spark III

Week 5 Lab

This week’s work outside of class is optional, but highly recommended. We will walk through installing Jupyter and make some sandbox modifications to support running Jupyter as well as Zeppelin and the Spark Web UI. I’ve also provided some additional fun Spark exercises for those of you who are especially interested in the software part of our class.

Docker Port Forwarding Modifications

In this step we are going to go through the process we talked about in class this week: adding additional port forwards to your sandbox-hdp Docker container. Remember that we can only set up these proxies when the container is created, so we need to commit an image of our current setup and recreate a new container with the new mappings.

SSH into your VM, then:
- Stop the sandbox-hdp Docker container: sudo docker stop sandbox-hdp
- Confirm it is not running with “docker ps”: sudo docker ps
- Create an image of the current container state: sudo docker commit sandbox-hd
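The stop/commit/recreate sequence can be sketched as a small helper that assembles the docker commands. This is a hedged sketch only: the committed image name (sandbox-hdp-jupyter), the port numbers (8888 for Jupyter, 9995 for Zeppelin, 4040 for the Spark UI, all typical defaults), and the final docker run flags are assumptions, not the sandbox’s exact invocation (the real sandbox container is started with additional options).

```python
# Hedged sketch: build (but do not run) the docker commands for the
# commit-and-recreate workflow described above. Image name and port
# numbers are assumptions, not the sandbox's exact configuration.
def recreate_with_ports(container="sandbox-hdp",
                        image="sandbox-hdp-jupyter",
                        ports=(8888, 9995, 4040)):
    """Return the docker commands to commit the container's current state
    and rerun it with extra -p host:container port forwards."""
    forwards = " ".join(f"-p {p}:{p}" for p in ports)
    return [
        f"sudo docker stop {container}",          # stop the running container
        "sudo docker ps",                          # confirm it is gone
        f"sudo docker commit {container} {image}", # snapshot current state
        f"sudo docker rm {container}",             # free up the container name
        f"sudo docker run -d --name {container} {forwards} {image}",
    ]

for cmd in recreate_with_ports():
    print(cmd)
```

The helper only returns strings, so you can review the commands before pasting them into your VM's shell; port mappings must be present at docker run time, which is exactly why the commit/recreate dance is needed.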

Assignment 04 Fun with Spark Part III

Assignment 04 Description

Assignment 4 - More Fun with Spark

No environment setup this week! Use the Spark shell (or notebook if you have set it up) to do this week’s assignment in either Scala or Python. Unless specified otherwise, please use the DataFrame API for this week’s work; do not just use SQL string queries. If you are unable to come up with code to answer a question, please describe how you think you would solve the problem in a “sparkified” way based on what we have learned so far.

- From home_data.csv, how many houses sold were built prior to 1979? (I promise that is the last time I’m going to ask this question)
- From home_data.csv, what is the most expensive zipcode in the data set, defined as highest average sales price?
- How many unique zipcodes have sales data in the home_data.csv data set?
- Demonstrate how to drop the “sqft_living15” and “sqft_lot15” columns from your dataset.
- Access the zipcode table stored in Hive (ok to use a SQL string query he
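To make the aggregation logic behind the first three questions concrete, here is a minimal plain-Python sketch on a toy in-memory sample, with the rough PySpark DataFrame API equivalent noted in comments. The column names (price, zipcode, yr_built) are assumptions about home_data.csv’s schema, and the toy rows are invented for illustration; this is not the assignment solution.

```python
# Toy stand-in for home_data.csv rows; the column names (price, zipcode,
# yr_built) are assumptions about the real schema.
from collections import defaultdict

rows = [
    {"zipcode": "98103", "price": 500_000, "yr_built": 1955},
    {"zipcode": "98103", "price": 700_000, "yr_built": 1990},
    {"zipcode": "98004", "price": 1_200_000, "yr_built": 1979},
]

# Q1: houses built prior to 1979.
# Rough DataFrame equivalent: df.filter(df.yr_built < 1979).count()
built_before_1979 = sum(1 for r in rows if r["yr_built"] < 1979)

# Q2: most expensive zipcode, defined as highest average sales price.
# Rough DataFrame equivalent: df.groupBy("zipcode").avg("price"),
# then order by the average descending and take the first row.
totals = defaultdict(lambda: [0, 0])          # zipcode -> [price sum, count]
for r in rows:
    totals[r["zipcode"]][0] += r["price"]
    totals[r["zipcode"]][1] += 1
avg_price = {z: s / n for z, (s, n) in totals.items()}
top_zip = max(avg_price, key=avg_price.get)

# Q3: number of unique zipcodes with sales data.
# Rough DataFrame equivalent: df.select("zipcode").distinct().count()
unique_zips = len({r["zipcode"] for r in rows})

print(built_before_1979, top_zip, unique_zips)  # -> 1 98004 2
```

The point of the sketch is the shape of each computation (filter/count, group-and-average, distinct-count); in the actual assignment the same shapes should be expressed with DataFrame operations rather than Python loops.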

Assignment 03 Fun with Spark Part II

Assignment 03 Description

Fun with Spark Lab

Prepare Sample Data

For this assignment we are going to use two sample data sets. One is a text file version of the classic book “War and Peace”; the other is the same housing sales data from Week 2. If you have not already done so, please put home_data.csv onto your VM, and then also war_and_peace.txt. To get the data onto your VM you can either download to your local machine and then use the scp command to copy the files, e.g.:

scp home_data.csv username@ip:

(the colon is necessary)

Or use the wget command and pass in the link directly, similar to how you downloaded the HDP sandbox. Note that the “stylized” single quote marks in this doc might cause problems if you copy/paste directly into the command line.

wget 'https://drive.google.com/a/uw.edu/uc?authuser=2&id=0B0Ntj7VtxrluZG9xRkc0NmZ4Q0E&export=download' -O war_and_peace.txt
wget 'https://drive.google.com/a/uw.edu/uc?authuser=2&id=0B0Ntj7VtxrluN1dFWlRiY
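If you prefer to script the download instead of using wget or scp, the same fetch can be done with Python’s standard library. A minimal sketch, assuming the URL serves the file directly (the function name fetch is just an illustrative helper, not part of the course materials):

```python
# Minimal wget-style download using only the Python standard library.
from urllib.request import urlopen

def fetch(url, dest):
    """Fetch url and write the response body to the local file dest."""
    with urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())

# Usage (same effect as the wget command above):
# fetch("https://drive.google.com/a/uw.edu/uc?...&export=download",
#       "war_and_peace.txt")
```

Note that this avoids the stylized-quote copy/paste problem entirely, since the URL lives in a Python string rather than a shell command line.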

Assignment 02 Install Docker Container

Assignment Description

Install Docker and Hortonworks HDP

Install Docker Community Edition

For reference, we’re going to essentially follow the directions from here: https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/#install-using-the-repository

- Start VM
- SSH to server
- Run the following commands (hit “Y” to confirm apt-get commands if prompted):
- sudo apt-get update
- sudo apt-get install apt-transport-https curl software-properties-common
- curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
- sudo add-apt-repository \
  "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) \
  stab