Airflow: create and manage Data Pipelines easily

This bootstrap guide was originally published at GoSmarten but as the use cases continue to increase, it’s a good idea to share it here as well.

What is Airflow

The need to perform operations or tasks, either simple and isolated or complex and sequential, is present in all things data nowadays. If you or your team work with lots of data on a daily basis there is a good chance you’re struggled with the need to implement some sort of pipeline to structure these routines. To make this process more efficient, Airbnb developed an internal project conveniently called Airflow which was later fostered under the Apache Incubator program. In 2015 Airbnb open-sourced the code to the community and, albeit its trustworthy origin played a role in its popularity, there are many other reasons why it became widely adopted (in the engineering community). It allows for tasks to be set up purely in Python (or as Bash commands/scripts).

What you’ll find in this tutorial

Not only we will walk you through setting up Airflow locally, but you’ll do so using Docker, which will optimize the conditions to learn locally while minimizing transition efforts into production. Docker is a subject for itself and we could dedicate a few dozen posts to, however, it is also simple enough and has all the magic needed to show off some of its advantages combining it with Airflow. Continue reading “Airflow: create and manage Data Pipelines easily”

Summary Terraform vs Spinakker

I recently needed to build this summary, so thought I’d rather share with more people as well. Please feel free to add any points you see fitting.


Rather then putting on versus another assuming mutual exclusivity, many companies are adopting both tools simultaneously.Terraform is usually used for static cloud Infrastructure setup and updates, such as networks/VLANs, Firewalls, Load Balancers, storage buckets, etc. Spinakker is used for setting up more complex deployment pipelines, mainly orchestration of software packages and application code to setup on servers.Though there is intersection (Spinakker can also deploy App environment), Terraform provides an easy and clean way to setup Infrastructure-as-Code.


  • Hashicorp product focused on allowing you to cleanly describe and deploy infrastructure in Cloud and on premise environments;
  • Allows to describe your infrastructure as code, as well as to deploy it. Although complexer deployments replying on server images is possible, man-power effort starts growing exponentially
  • The sintax used to describe resources is own Hashicorp Configuration Language (HCL), which may be a turn off at first sight; however, after seeing how clear human readable it is, you won’t regret it;
  • Deployment form varies; from immutable infrastructure style deployments (e.g. new update = complete new resource) to updates, depending on the nature of resources. For example, server instances require immutable approach, where firewall rules do not;
  • Failure to deploy resources does not roll back and destroy “zombie” resources; instead they are marked as tainted, and left as it is. On a next execution plan run, tainted resources will be removed, to give place to new intended resources. More detail here;
  • Multi-cloud deployments, currently supporting AWS, GCP, Azure, OpenStack, Heroku, Atlas, DNSimple, CloudFlare.
  • Doesn’t require server deployment, e.g. can be launched from your own local machine. However, Hashicorp provides Atlas, to provide remote central deployments.


  • Is a Netflix open source tool for managing deployment of complex workflows (Continuous Delivery). It is kind of a second generation/evolution of Asgard.
  • Orchestrator for deploying full Application environments, from sourrounding infrastructure environment (networks, firewalls and load balancers), along with server images;
  • Works hand-in-hand with tools such as Jenkins and packer.
  • As an orchestrator, it focuses on Pipelines, a sequence of stages. A stage is an atomic building block; examples of stages:
    • Build an image (example AMI in case of AWS) in a specific Region
    • Deploy the image
    • Run a bash script
    • Resize a server group
  • Multi-cloud deployments, currently supporting AWS, GCP, Azure, OpenStack, Kubernetes, and CloudFoundry.
  • It is by itself a product with several functionality, requiring itself server deployemnt. Images for easy deployment are available both in AWS, Azure, GCP, and Kubernetes;
  • Provides Web GUI for full pipeline configuration, as well as an API
  • Allows to setup webhooks for notifications via email, SMS and chat clients (HipChat/Slack)


More Resources