This post continues the introductory series “Getting Started with Spark in Python”, this time covering UDFs and window functions.
- Part 1, Getting Started: covers the basics of Spark's distributed architecture, along with its data structures (including the good old RDD collections (!), whose use has been largely superseded by DataFrames)
- Part 2, Intro to DataFrames
- Part 3, Intro to UDFs and Window Functions
- Part 4, Unit testing in PySpark environments
Note: For this post I’m using Spark 1.6.1. There are some minor differences compared with the upcoming Spark 2.0, such as using a SparkSession object as the entry point instead of the HiveContext I use here. Nonetheless, the important parts are common to both.
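To make that difference concrete, here is a minimal sketch of the two initialization styles. The function names and the application name are hypothetical, chosen only for illustration. The imports sit inside each function because `SparkSession` does not exist in Spark 1.6, so only the function matching your installed version should be called.

```python
def make_hive_context(app_name="udf-window-demo"):
    """Spark 1.6 style: create a SparkContext, then wrap it in a HiveContext."""
    # Imported lazily so this module also loads where only Spark 2.x idioms are used.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName=app_name)
    return HiveContext(sc)


def make_spark_session(app_name="udf-window-demo"):
    """Spark 2.0 style: a single SparkSession entry point with Hive support."""
    # SparkSession is only available from Spark 2.0 onwards.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )
```

On Spark 1.6.1 you would call `sqlContext = make_hive_context()` and build DataFrames from `sqlContext`. On Spark 2.0 you would call `spark = make_spark_session()` instead, and the underlying context is available as `spark.sparkContext`.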