Assignment 05 Fun with Spark III

Week 5 Lab


This week’s work outside of class is optional but highly recommended.  We will walk through installing Jupyter and make some sandbox modifications to support running Jupyter, Zeppelin, and the Spark Web UI.  I’ve also provided some additional fun Spark exercises for those of you who are especially interested in the software part of our class.


Docker Port Forwarding Modifications


In this step we are going to go through the process we talked about in class this week for adding additional port forwards to your sandbox-hdp Docker container.  Remember that we can only set up these port mappings when the container is created, so we need to commit an image of our current setup and create a new container with the new mappings.


SSH into your VM:
  • Stop the sandbox-hdp Docker container
    • sudo docker stop sandbox-hdp
  • Confirm it is not running using docker ps
    • sudo docker ps
  • Create an image of the current container state
    • sudo docker commit sandbox-hdp sandbox-hdp
      • Note: Normally you would want to use a new name, but we are short on space and the script is hardcoded to use sandbox-hdp
      • This will take a while since it’s effectively making a copy of your container
  • Edit start-sandbox.sh to add additional port forwardings
    • When you edit the file you will see a list of lines with the format “-p PORT:PORT \”
    • Add one line for Jupyter and update the existing Spark UI one (4040) to cover a range, as in the following lines
-p 9999:9999 \ (for jupyter)
-p 4040-4050:4040-4050 \ (for spark web ui)


  • Remove old container
sudo docker rm sandbox-hdp
  • Create new container and start sandbox
sudo start-sandbox-hdp.sh
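
The port-forwarding edit can also be scripted rather than done by hand. The following is a minimal sketch, assuming the script holds one “-p PORT:PORT \” flag per line as described above; the function name add_port_forwards and the sample script text are hypothetical and not taken from the real start-sandbox.sh.

```python
import re


def add_port_forwards(script_text, jupyter_port=9999, spark_ports=(4040, 4050)):
    """Add a Jupyter forward and widen the Spark UI forward to a range.

    Assumes (hypothetically) that the script has a line of the form
    '-p 4040:4040' followed by a trailing backslash.
    """
    first, last = spark_ports
    # Matches the existing single-port Spark UI line, keeping its indentation.
    single = re.compile(r"^(\s*)-p\s+{0}:{0}\s*\\\s*$".format(first))

    out = []
    replaced = False
    for line in script_text.splitlines():
        match = single.match(line)
        if match:
            indent = match.group(1)
            out.append("{0}-p {1}:{1} \\".format(indent, jupyter_port))
            out.append("{0}-p {1}-{2}:{1}-{2} \\".format(indent, first, last))
            replaced = True
        else:
            out.append(line)

    if not replaced:
        raise ValueError("no '-p {0}:{0}' line found".format(first))
    return "\n".join(out) + "\n"


# Hypothetical stand-in for the relevant part of the script.
sample_script = "\n".join([
    "docker run --name sandbox-hdp \\",
    "-p 8080:8080 \\",
    "-p 4040:4040 \\",
    "sandbox-hdp",
])
print(add_port_forwards(sample_script))
```

Running it a second time on its own output raises ValueError, because the single-port 4040 line is already gone; that keeps the edit from being applied twice.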

Log into Azure Web Portal:
  • Add Port Forwarding Rules on the VM Networking page (same as we did to allow access to the Ambari Web UI during initial setup)
    • Jupyter
      • 9999
    • Support up to 10 simultaneous Spark applications
      • 4040-4050
    • Zeppelin (if interested; if not, leave the port firewalled)
      • 9995
  • Assuming the sandbox is running, you can verify the port mapping by going to ip:4040, where you should see a Spark UI
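
If you would rather check the forwarded ports from your own machine than click through a browser, a small TCP probe works. This is a minimal sketch using only the Python standard library; the helper names port_is_open and check_ports are made up for illustration, and the IP address in the usage comment is a placeholder, not a real VM.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether something accepted a connection on it."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Example usage (replace the placeholder with your VM's public IP):
#   print(check_ports("203.0.113.10", [4040, 9995, 9999]))
```

A port only reports as open when something is actually listening behind it, so a False for 9999 before Jupyter has been started does not by itself mean the forwarding rule is wrong.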


Install Jupyter


Normally we do NOT install Jupyter as root, but for this sandbox it is the most straightforward way.  We are going to use a Python distribution and environment manager called Anaconda that, besides including Jupyter, also installs many useful data analysis libraries in one package.


SSH into sandbox (from VM)
  • Download anaconda installation script
  • Run installation script
    • bash Anaconda3-5.0.1-Linux-x86_64.sh
  • Follow prompts
    • Accept license
    • Choose to install in “/opt/anaconda3” when prompted for path
  • A bunch of packages will be downloaded and installed
  • At the end you will be prompted to add Anaconda to your PATH
    • Make sure to select “no” (the default) so that it is not added to your PATH
  • Anaconda + Jupyter is now installed
  • Generate a config file and update it to allow access via the VM
    • /opt/anaconda3/bin/jupyter notebook --generate-config --allow-root
  • Edit /root/.jupyter/jupyter_notebook_config.py and make the following changes
    • For each setting, remove the comment character (#) at the beginning of the line and update the value as shown
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 9999


If you are only going to use Python, you are done installing things.


To start Jupyter using just plain Python (no Spark)
  • /opt/anaconda3/bin/jupyter notebook --allow-root
  • Copy the URL containing the “token” string, paste it into your browser, and replace localhost with your VM’s IP; you should then see the Jupyter home screen
    • It should look something like this
    • http://localhost:9999/?token=810134bc844491a627de49ced8805a96c85edacb88151137
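
Swapping localhost for the VM address is easy to get wrong by hand when the token is long. Here is a minimal sketch of the same substitution using urllib.parse; the function name externalize_jupyter_url and the IP address are hypothetical placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def externalize_jupyter_url(url, vm_ip):
    """Replace the host in a Jupyter URL with vm_ip, keeping port and token."""
    parts = urlsplit(url)
    # Keep the original port (9999 in this setup) if the URL carries one.
    netloc = vm_ip if parts.port is None else "{}:{}".format(vm_ip, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


local_url = "http://localhost:9999/?token=810134bc844491a627de49ced8805a96c85edacb88151137"
# 203.0.113.10 is a placeholder; use your VM's public IP instead.
print(externalize_jupyter_url(local_url, "203.0.113.10"))
```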



To use pyspark in Jupyter


  • We need to tell pyspark to use Jupyter as its Python driver environment
  • PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --allow-root' pyspark
    • This command is one line
    • Alternatively, you could export these environment variables at login to avoid typing them every time (if you know what that means)
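
Another way to avoid retyping the variables is a tiny launcher script. The sketch below builds the same two environment variables from the command above and hands them to pyspark; the function names are hypothetical, and it assumes the pyspark launcher is on the PATH inside the sandbox.

```python
import os
import subprocess


def jupyter_pyspark_env(base_env=None, jupyter="/opt/anaconda3/bin/jupyter"):
    """Return a copy of the environment with the pyspark driver set to Jupyter."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = jupyter
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook --allow-root"
    return env


def launch_pyspark_notebook():
    """Start pyspark with Jupyter as the driver; blocks until Jupyter exits."""
    # Assumption: 'pyspark' resolves on PATH in the sandbox shell.
    return subprocess.call(["pyspark"], env=jupyter_pyspark_env())


# Example usage inside the sandbox:
#   launch_pyspark_notebook()
```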


If using Scala, you need to install a kernel for Spark:




  • Update spark-env.sh to include the correct HDP version
    • This is a finicky sandbox quirk that only needs to be handled for this environment
    • Edit /usr/hdp/current/spark2-client/conf/spark-env.sh
    • Add an additional “export” line like the following
      • export HDP_VERSION=2.6.1.0-129
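
If you end up redoing this edit (for example after recreating the container), it helps to make it repeatable so the export line is not appended twice. This is a minimal sketch that works on the file's text; the helper name ensure_export is hypothetical, and reading and writing the actual spark-env.sh is left to you.

```python
def ensure_export(conf_text, name, value):
    """Return conf_text with exactly one 'export NAME=value' line at the end.

    Any earlier 'export NAME=...' line for the same variable is dropped,
    so running this repeatedly gives the same result.
    """
    prefix = "export {}=".format(name)
    kept = [ln for ln in conf_text.splitlines() if not ln.strip().startswith(prefix)]
    kept.append("export {}={}".format(name, value))
    return "\n".join(kept) + "\n"


# Hypothetical existing file contents, for illustration only.
existing = "export SPARK_CONF_DIR=/etc/spark2/conf\n"
print(ensure_export(existing, "HDP_VERSION", "2.6.1.0-129"))
```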


You can start Jupyter with the command “/opt/anaconda3/bin/jupyter notebook” and then select “Apache Toree - Scala” to start a Spark Scala kernel.


Assignment 5


This is a course about big data, right?  So far we’ve only looked at data that was a few MB.  Let’s see if anything changes when we move up.


I’ve made available a dataset of loan data from LendingTree from 2007-2015.  The data is a CSV file with approximately 800K rows and 75 columns, total size a little over 400MB.


Download the sample data from here:




There is also a data dictionary that explains the columns in excel format here:




Use this data for the following questions:

  1. Create a “histogram” of the counts for each type defined in the “home_ownership” column.
  2. This is a very wide data set, and we will frequently be querying only a subset of the columns.
    a. Currently it is in CSV format.  Is there a more appropriate file format?  If so, explain why.
    b. Convert the CSV data to the format you specified in part a) by saving it in that format, and experiment with queries to show the difference in performance.
  3. For each category of “home_ownership”, what is the count of each “loan_status”?
  4. Did any of these loans originate in King County, based on the data we have in wa_zipcodes.txt? (The file is misleadingly named: it does not contain all zip codes in WA, only King County zip codes.)  Note: since the loan data only has the first 3 digits of each zip code, count a loan as a match as long as the first 3 digits match.
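
To make the counting and matching rules in these questions concrete, here is a small plain-Python sketch on toy rows. It is not Spark code and not a solution; the same ideas would carry over to a groupBy/count and a filter or join in Spark. The sample rows, the zip strings such as “981xx”, and the helper names are all assumptions made up for illustration.

```python
from collections import Counter


def count_by(rows, *columns):
    """Count rows per distinct combination of the given column values."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def zip_prefix(zip_code, length=3):
    """Return the first `length` characters of a zip code string."""
    return str(zip_code).strip()[:length]


def matching_zips(loan_zips, county_zips):
    """Keep loan zips whose first 3 characters match any county zip's first 3."""
    prefixes = {zip_prefix(z) for z in county_zips}
    return [z for z in loan_zips if zip_prefix(z) in prefixes]


# Toy data (hypothetical values, not taken from the real files).
rows = [
    {"home_ownership": "RENT", "loan_status": "Current"},
    {"home_ownership": "RENT", "loan_status": "Fully Paid"},
    {"home_ownership": "OWN", "loan_status": "Current"},
]
print(count_by(rows, "home_ownership"))
print(count_by(rows, "home_ownership", "loan_status"))
print(matching_zips(["981xx", "945xx", "980xx"], ["98101", "98004"]))
```

Note that matching on a 3-digit prefix can overcount: a prefix shared with King County zip codes may also cover zip codes outside the county, which is why the question asks only whether the prefixes match.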
