Getting Started with Spark (part 4) – Unit Testing

Quite a while ago (we are already counting in years), I published a tutorial series aimed at helping people get started with Spark. Here is an outline of the previous posts:

In the meantime, Spark has not lost any popularity, so I thought I would keep updating the series. In this post we cover an essential part of any ETL project, namely unit testing.

For that, I created a sample repository, which is meant to serve as boilerplate code for any new Python Spark project.

Let us browse through the main job script. Please note that the repository might contain an updated version, which might differ in its details from the following gist.

The previous gist revisits the same example used in the previous post on UDFs and Window Functions.

Here is an example of how we could test our “amount_spent_udf” function:

Now note the first line of the unit test script, which is the secret sauce for loading a Spark context in your unit tests. Below is the code that creates the “spark_session” object passed as an argument to the “test_amount_spent_udf” function.
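One common way to provide such an object is a session-scoped pytest fixture, typically placed in a “conftest.py” file so that every test module can request it by name. The following is a minimal sketch under that assumption, not the repository's code; the master URL, the application name, and the settings in “TEST_SPARK_CONF” are illustrative choices.

```python
import pytest

# Assumed settings that keep a local test session light; tune as needed.
TEST_SPARK_CONF = {
    "spark.sql.shuffle.partitions": "1",  # tiny test data needs no wide shuffles
    "spark.ui.enabled": "false",          # skip starting the web UI
}


@pytest.fixture(scope="session")
def spark_session():
    """Yield one local SparkSession shared by the whole test run."""
    # Imported when the fixture is first used.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[2]").appName("unit-tests")
    for key, value in TEST_SPARK_CONF.items():
        builder = builder.config(key, value)

    spark = builder.getOrCreate()
    yield spark
    # Teardown: runs once, after the last test that used the fixture.
    spark.stop()
```

With “scope="session"”, the session is started once and reused across tests, which matters because starting Spark is slow compared to the tests themselves.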

We strongly encourage you to have a look at the corresponding Git repository, where we give detailed instructions on how to run it locally.

And that is it for today, hope it helped!
