I have been using PySpark recently to quickly munge data. My workflow involves taking lots of JSON data from S3, transforming it, filtering it, then post-processing the filtered output. At Spark Summit East, I got turned on to using Parquet files as a way to store the intermediate output of my ETL process.
A sketch of the code looks something like this:
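Something along these lines, where the bucket layout, column names, and filter condition are hypothetical stand-ins for the real job:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="etl")
sqlContext = SQLContext(sc)

day = "2016-10-01"  # hypothetical: one day's partition of the raw data

# read the raw JSON for that day; Spark infers the schema
raw = sqlContext.read.json("s3n://bucket/raw/%s/*.json" % day)

# filter and trim down to the records of interest
filtered = (raw
            .filter(raw.event == "click")
            .select("user_id", "event", "ts"))

# store the intermediate output as compressed Parquet, one directory per day
filtered.write.parquet("s3n://bucket/data/%s" % day)
```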
The output is a set of files in s3n://bucket/data/ of the form:
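With Spark's Parquet writer, each output directory contains a _SUCCESS marker, a pair of metadata files, and one gzipped part-file per partition; the exact names below are illustrative:

```
s3n://bucket/data/2016-10-01/_SUCCESS
s3n://bucket/data/2016-10-01/_common_metadata
s3n://bucket/data/2016-10-01/_metadata
s3n://bucket/data/2016-10-01/part-r-00000-a1b2c3d4.gz.parquet
s3n://bucket/data/2016-10-01/part-r-00001-a1b2c3d4.gz.parquet
```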
With the data in S3 as compressed Parquet files, it can be quickly ingested back into PySpark for further processing. Conveniently, we can even use wildcards in the path to select a subset of the data. For instance, if I want to look at all the October data, I could run:
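Something like this, assuming the dated directory layout from the sketch above:

```python
# glob over the daily directories for October 2016
october = sqlContext.read.parquet("s3n://bucket/data/2016-10-*")
```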
I like to work in Jupyter Notebooks when I am doing exploratory data analysis. After a couple of runs at trying to set up Jupyter to run PySpark, I finally found a low-pain method here.
Assuming you have Spark, Hadoop, and Java installed, you only need to pip install findspark by running pip install -e . in the root of that repo. From there, we can run:
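A minimal sketch; the sanity check at the end is my own addition:

```python
import findspark
findspark.init()  # locates the Spark install and puts pyspark on sys.path

import pyspark
sc = pyspark.SparkContext(appName="jupyter")

# quick sanity check: sum the integers 0 through 9
print(sc.parallelize(range(10)).sum())
```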
The output of the sanity check should be:
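```
45
```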
Hooking up AWS
With PySpark running, the next step is to get the S3 Parquet data into pandas DataFrames:
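A sketch along these lines, with the same hypothetical paths as above; note that toPandas() collects the whole result to the driver, so filter or limit first if the data is large:

```python
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read one day of the Parquet output back into a Spark DataFrame
spark_df = sqlContext.read.parquet("s3n://bucket/data/2016-10-01")

# pull it down to a pandas DataFrame for exploratory analysis
pandas_df = spark_df.toPandas()
```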
I initially ran into an error where Spark couldn't see my Hadoop
core-site.xml file, so it was throwing the following exception:
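The exact traceback depends on your versions, but with s3n a missing-credentials failure typically surfaces as:

```
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3n URL,
or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey
properties (respectively).
```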
There are all kinds of kludgy ways to add your AWS keys to your system. But if you have already set up Hadoop, you should be able to execute:
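For example, listing the bucket from the Hadoop CLI (the bucket name is a placeholder):

```
hadoop fs -ls s3n://bucket/data/
```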
If that works, you only have to set the
HADOOP_CONF_DIR environment variable (the final step below). If you can't get to S3 from Hadoop, you need to set up
core-site.xml by following the directions here.
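The relevant pieces of core-site.xml are the s3n credential properties; a sketch with placeholder values:

```xml
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```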
Once Hadoop is working, the final step is to add the following line to
~/.bash_profile (OS X) or
~/.bashrc (Linux) in order to use S3 paths in PySpark:
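Presumably an export pointing at your Hadoop configuration directory; the exact path depends on your install (the one below is hypothetical):

```
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
```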