Recently I had the opportunity to use Apache Spark as an ETL tool for a project I am working on here at GrowthIntel.
It was my first time working with a cluster computing framework, and the experience was a good one, if slightly challenging.
A cluster computing framework, also called a distributed or parallel computing system, lets you break your data or calculation into parts that are each sent to a different ‘computer’. The calculations can then run in parallel, which can speed up the computing time significantly. It does, however, also mean that the code I was writing needed to be optimised for parallel execution. Having come from a way of thinking heavily centred on Python Pandas, switching to Spark was a fun challenge.
As is usual in a distributed setup, Spark consists of a master node and a set of workers. The master, or driver, node sends the instructions and makes sure that the workers are keeping up. The worker nodes contain executors, which is where all the work is done, and it is possible to have more than one executor per worker node. The underlying data structure in Spark is called an RDD (Resilient Distributed Dataset). The RDD abstraction is exposed to numerous languages, including Python (through the PySpark library), Java, Scala and R. Spark itself is written in Scala, and RDDs allow you to perform in-memory computations with high fault tolerance.
Spark works through transformations and actions, and it does this lazily: it waits for an action before doing any work. In other words, you can apply as many transformations as you like to an RDD, but Spark only does any work once you call an action. When an action is called, Spark looks back through all the transformations and works out an efficient way to parallelise the processing.
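To make the idea of lazy evaluation concrete, here is a small plain-Python toy model. It is only an analogy and not the Spark API: the `LazyDataset` class and the `double` helper are made up for illustration. Transformations merely record what should happen, and the recorded steps only run when an action such as `collect` is called.

```python
# Toy model of lazy evaluation. This is NOT the Spark API; the class and
# helper names are hypothetical and exist only for illustration.
class LazyDataset:
    """Records transformations and only runs them when an action is called."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # pending transformations, in order

    # Transformations: return a new dataset and do no work.
    def map(self, func):
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    # Actions: replay the recorded steps and return a result.
    def collect(self):
        rows = iter(self._data)
        for kind, func in self._steps:
            # Inside a method, map/filter refer to the Python builtins.
            rows = map(func, rows) if kind == "map" else filter(func, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset(range(5))
pipeline = numbers.map(double).filter(lambda x: x > 2)

calls_before_action = len(calls)  # nothing has been computed yet
result = pipeline.collect()       # the action triggers the work
calls_after_action = len(calls)
```

Spark goes much further than this sketch, since it also plans how to distribute the recorded steps across executors, but the ordering is the same: define the pipeline first, pay for it when an action runs.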
About a year ago, the dataframe data structure was introduced to Spark, inspired by the Pandas dataframe. The actions available on an RDD are fairly basic: count, collect, reduce, lookup, save and a handful of others. The dataframe structure allows for a much richer data analysis process. On a Spark dataframe, you have access to functions like filter, map, groupby, flatMap, show, printSchema, and many more.
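Here’s a quick example of how to apply a self-defined Python function to every row of a dataset (it’s not quite as simple as it is in Pandas):
- Use a udf (user-defined function, imported from pyspark.sql.functions):
from pyspark.sql.functions import udf
func_udf = udf(lambda x: self_defined_python_function(x))
# 'old_column' is a placeholder for the column that holds the input values
df = df.withColumn('new_column', func_udf(df['old_column']))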
Most of the functionality in Pandas is available on the Spark dataframes, but tread carefully – there are some differences:
- Underlying a dataframe object are RDDs, hence dataframes are immutable. This means it’s not possible to access or change individual rows in place. It also means that you can’t iterate over them directly as you would in Pandas.
- Joining two dataframes in Spark is not as easy as in Pandas. When you join on an expression such as df1.col == df2.col, the column that you are joining on is kept twice, which means that you can’t just refer to that column by name once you have the joined dataframe. To get the same behaviour as in Pandas, you need to explicitly select the columns you want to keep:
joined = (df1.join(df2, df1.col == df2.col, 'left_outer')
          .select(df1.col, df1.col2, df2.col3))
As the main reason for using Spark is speed, it is important to optimise memory usage. This meant that I needed to gain a deeper understanding of Spark’s internals, such as how much memory the Java virtual machine (JVM) on each executor requires. This is not something I’ve had to worry about before, and it’s not easy to choose the right executor memory given the size of each worker node.
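As a rough illustration of the arithmetic involved, the following plain-Python sketch estimates a heap size per executor from the memory of a worker node. The function name, the 1 GB reserved for the operating system, and the example node sizes are all assumptions made for this sketch. The overhead rule (10% of the heap, with a 384 MB minimum) mirrors what I understand to be Spark's documented default for executor memory overhead on YARN, so check it against the documentation for your version and cluster manager.

```python
def executor_heap_mb(node_memory_mb, executors_per_node,
                     reserved_for_os_mb=1024,
                     overhead_fraction=0.10, min_overhead_mb=384):
    """Estimate a heap size (in MB) per executor for a single worker node.

    Hypothetical helper: each executor is assumed to need its heap plus an
    off-heap overhead of max(min_overhead_mb, overhead_fraction * heap).
    """
    if executors_per_node < 1:
        raise ValueError("executors_per_node must be at least 1")

    # Memory left for Spark after setting some aside for the OS (assumption).
    usable_mb = node_memory_mb - reserved_for_os_mb
    per_executor_mb = usable_mb // executors_per_node

    # Solve heap + overhead_fraction * heap <= per_executor_mb for the heap.
    heap_mb = int(per_executor_mb / (1 + overhead_fraction))

    # For small executors the fixed minimum overhead applies instead.
    if heap_mb * overhead_fraction < min_overhead_mb:
        heap_mb = per_executor_mb - min_overhead_mb

    if heap_mb <= 0:
        raise ValueError("not enough memory for this many executors")
    return heap_mb


# Example: a 16 GB node with two executors, and a 4 GB node with two executors.
large_node_heap = executor_heap_mb(16384, 2)
small_node_heap = executor_heap_mb(4096, 2)
```

The resulting number is only a starting point for a setting such as `spark.executor.memory`; the web UI is where you find out whether it holds up under a real workload.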
Spark comes with a web UI, which helps you understand what is going on under the hood. It shows which jobs are active and queued, how much memory has been cached, how each executor is performing, and more. It is a little frustrating that you have to keep refreshing the page to stay up to date with the calculations, but it’s still a very useful and, in my opinion, necessary tool.
Moving from Pandas to PySpark was a fun and worthwhile challenge. It is, however, important to remember that you shouldn’t try to force all tasks through Spark. If Pandas can handle the size of the data, then it’s currently more flexible and user-friendly. However, if the dataset is too large for Pandas, Spark with PySpark is a technology worth considering.