Getting Started with Spark (part 4) – Unit Testing

Alright, quite a while ago (we are already counting in years), I published a tutorial series focused on helping people get started with Spark. Here is an outline of the previous posts:

In the meantime, Spark has not decreased in popularity, so I thought I would continue updating the same series. In this post we cover an essential part of any ETL project, namely unit testing.

For that I created a sample repository, which is meant to serve as boilerplate code for any new Python Spark project.

Decrypting correctly parameters from AWS SSM

Today's post is yet another short one, but ideally it will save some people a whole lot of headaches.

Scenario: You have stored the contents of a string using AWS SSM Parameter Store (side note: if you are not using it yet, you should definitely have a look), but when retrieving it decrypted via the CLI, you notice that the new lines (‘\n’) in the string have been replaced by spaces (‘ ‘).

In my case, I was storing an encrypted private SSH key to integrate with some Ansible scripts triggered via AWS CodePipeline + CodeBuild. CodeBuild makes it really easy to access secrets stored in SSM Parameter Store; however, it was retrieving my key incorrectly, which in turn domino-crashed my Ansible scripts.

Here you can also confirm that more people are facing this issue. After following the suggestion of using the AWS SDK – in my case Python's boto3 – it finally worked. So here is a gist to overwrite an AWS SSM parameter and then retrieve it back:
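A minimal sketch of that boto3 round trip could look like the following. It assumes the caller passes in a client created with `boto3.client("ssm")`; the function names and the parameter name used in the usage note are hypothetical, chosen for illustration only.

```python
def store_secret(ssm, name, value):
    """Create or overwrite an encrypted (SecureString) SSM parameter."""
    ssm.put_parameter(
        Name=name,
        Value=value,
        Type="SecureString",  # no KeyId given, so the default KMS key is used
        Overwrite=True,       # replace the parameter if it already exists
    )


def load_secret(ssm, name):
    """Fetch the parameter and return its decrypted value unchanged."""
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
```

With real credentials, usage would be along the lines of `ssm = boto3.client("ssm")`, followed by `store_secret(ssm, "/demo/ssh-key", key_text)` and `load_secret(ssm, "/demo/ssh-key")`. The returned string should keep its new lines intact, so it can be written straight to a key file.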

Hope this helps!

Container orchestration in AWS: comparing ECS, Fargate and EKS

Before rushing into the new cool kid, namely AWS EKS, the AWS-hosted offering of Kubernetes, you might want to understand how it works underneath and how it compares to the existing offerings. In this post we focus on distinguishing between the different AWS container orchestration solutions out there, namely AWS ECS, Fargate, and EKS, as well as comparing their pros and cons.


Before we dive into comparisons, let us summarize what each product is about.

ECS was the first container orchestration tool offered by AWS. It essentially consists of EC2 instances which have Docker already installed and run a Docker container (the ECS agent) that talks to the AWS backend. Via the ECS service you can launch either tasks – unmonitored containers usually suited for short-lived operations – or services, containers which AWS monitors and guarantees to restart if they go down for any reason. Compared to Kubernetes, it is considerably simpler, which has advantages and disadvantages.
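To make the task versus service distinction concrete, here is a minimal sketch using boto3's ECS client. It assumes a client created with `boto3.client("ecs")` is passed in and that the cluster and task definition already exist; the function names and the default of two copies are hypothetical.

```python
def run_one_off_task(ecs, cluster, task_definition):
    """Start a single task; ECS does not replace it once it stops."""
    response = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        count=1,
    )
    return [task["taskArn"] for task in response["tasks"]]


def create_long_running_service(ecs, cluster, name, task_definition, copies=2):
    """Create a service; ECS keeps `copies` tasks running and replaces failed ones."""
    response = ecs.create_service(
        cluster=cluster,
        serviceName=name,
        taskDefinition=task_definition,
        desiredCount=copies,
    )
    return response["service"]["serviceArn"]
```

The first function fits batch-style jobs such as a nightly ETL run, while the second fits something like a web API that has to stay up.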

Fargate was the second service offering to come, and is intended to abstract everything below the container (the EC2 instances where containers run) away from you. In other words, it is a pure Container-as-a-Service, where you do not care where that container runs. Fargate followed two core technical advancements made in ECS: the ability to assign a dedicated ENI directly to a container, and the integration of IAM at the container level. We will get into more detail on this later.
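Both advancements are visible in a Fargate task definition. The sketch below builds the keyword arguments one might pass to boto3's `ecs.register_task_definition()`; the function name, the CPU and memory sizes, and using the family name as the container name are assumptions made for illustration.

```python
def fargate_task_definition(family, image, task_role_arn, execution_role_arn):
    """Build keyword arguments for ecs.register_task_definition()."""
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",       # each task gets its own dedicated ENI
        "cpu": "256",                  # 0.25 vCPU
        "memory": "512",               # MiB
        "taskRoleArn": task_role_arn,  # IAM role assumed by the containers
        "executionRoleArn": execution_role_arn,  # role ECS uses, e.g. to pull images
        "containerDefinitions": [
            {"name": family, "image": image, "essential": True},
        ],
    }
```

Registering it would then be a matter of `ecs.register_task_definition(**fargate_task_definition(...))`. Note that in the API both the ENI and the IAM role are attached per task, which is a group of one or more containers.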

The following image, sourced from the AWS blog, illustrates the difference between the ECS and Fargate services.


EKS is the latest offering, and is still only available in some regions. With EKS you can abstract some of the complexities of launching a Kubernetes cluster, since AWS will now manage the master nodes – the control plane. Kubernetes is a much richer container orchestrator, providing features such as a network overlay, which allows you to isolate container communication, and storage provisioning. Needless to say, it is also much more complex to manage, with a bigger investment in terms of DevOps effort.

As with any Kubernetes cluster, you can use kubectl to communicate with an EKS cluster. You will need to configure the AWS IAM authenticator locally so that kubectl can authenticate against your EKS cluster.
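As a rough sketch of what that local configuration amounts to, the function below assembles a kubeconfig whose user entry delegates authentication to the `aws-iam-authenticator` binary. The function name, the user and context name `aws`, and the `apiVersion` of the exec section are assumptions; the latter in particular depends on your kubectl and authenticator versions.

```python
import json


def eks_kubeconfig(cluster_name, endpoint, ca_data):
    """Build a kubeconfig dict that authenticates via aws-iam-authenticator."""
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": cluster_name,
            "cluster": {
                "server": endpoint,                     # EKS API server endpoint
                "certificate-authority-data": ca_data,  # base64-encoded cluster CA
            },
        }],
        "users": [{
            "name": "aws",
            "user": {"exec": {
                "apiVersion": "client.authentication.k8s.io/v1beta1",
                "command": "aws-iam-authenticator",
                "args": ["token", "-i", cluster_name],  # token for this cluster
            }},
        }],
        "contexts": [{
            "name": "aws",
            "context": {"cluster": cluster_name, "user": "aws"},
        }],
        "current-context": "aws",
    }


def render_kubeconfig(config):
    """Serialize the config as JSON, which is also valid YAML."""
    return json.dumps(config, indent=2)
```

The endpoint and certificate data come from the cluster description in the EKS console or API. Saving the rendered output to a file and pointing the `KUBECONFIG` environment variable at it should let commands such as `kubectl get nodes` reach the cluster.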

