Build a Data Pipeline with AWS Athena and Airflow (part 2)

After learning the basics of Athena in Part 1 and understanding the fundamentals of Airflow, you should now be ready to integrate this knowledge into a continuous data pipeline.

The idea is for it to run on a daily schedule, checking if there’s any new CSV file in a folder-like structure matching the day for which the task is running. For example, if the task is running for 2010-01-31, then it will check if there is any file in s3://data/year=2010/month=01/day=31/*. If it finds a file there, it will add the “folder” as a partition to Athena so we can keep querying it.
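
To make this concrete, here is a minimal sketch of such a task, not the exact code from the post: an Airflow DAG with a single PythonOperator that lists the S3 prefix for the execution date and, if anything is there, registers the partition through Athena. The bucket, database, table and query-output location are hypothetical placeholders.

```python
# Minimal sketch (assumptions: bucket "data", the Athena database/table names
# and the query-output location below are placeholders, not the post's real ones).
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

BUCKET = "data"
DATABASE = "my_database"                        # hypothetical Athena database
TABLE = "my_table"                              # hypothetical Athena table
ATHENA_OUTPUT = "s3://data/athena-results/"     # hypothetical query output location


def add_partition_if_data_exists(ds, **_):
    """For execution date ds, check the year=/month=/day= prefix and register it as a partition."""
    year, month, day = ds.split("-")
    prefix = "year={}/month={}/day={}/".format(year, month, day)

    # Is there at least one file under s3://data/year=YYYY/month=MM/day=DD/ ?
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=1)
    if listing.get("KeyCount", 0) == 0:
        return  # nothing landed for this day, nothing to register

    ddl = (
        "ALTER TABLE {table} ADD IF NOT EXISTS "
        "PARTITION (year='{y}', month='{m}', day='{d}') "
        "LOCATION 's3://{bucket}/{prefix}'"
    ).format(table=TABLE, y=year, m=month, d=day, bucket=BUCKET, prefix=prefix)

    boto3.client("athena").start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )


dag = DAG(
    dag_id="athena_daily_partitions",
    start_date=datetime(2010, 1, 1),
    schedule_interval="@daily",
)

PythonOperator(
    task_id="add_partition_if_data_exists",
    python_callable=add_partition_if_data_exists,
    provide_context=True,   # Airflow 1.x: pass `ds` and friends to the callable
    dag=dag,
)
```

Because the DDL uses ADD IF NOT EXISTS, re-running the task for a day that was already registered is harmless, which keeps the daily schedule idempotent.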

Remind me again: why Athena?

At this point, if you are still wondering why Athena is so useful when you already have a pipeline in place to dump data somewhere (maybe a DB?), remember that Athena is a “pay as you go” solution that scales automatically to the queries you are running. The underlying costs are only the S3 file hosting itself plus the execution of queries. Such queries, when combined with the Hive Metastore, provide a fast solution for querying heavy loads of data stored in several different file formats in an S3 bucket. On the other hand, provisioning a database for dumping data comes with fixed costs, such as processing power, memory and storage, which will surpass the former if you are not using or needing the full-blown features of a proper database engine.

Before proceeding, there are three important assumptions.

Setting up Airflow remote logs to S3 bucket

Today’s post is a short one, but hopefully a valuable DevOps tip if you are currently setting up remote logging of Airflow logs to S3 with Airflow version 1.9.0.

Basically, this stackoverflow post provides the main solution. However, the current template incubator-airflow/airflow/config_templates/airflow_local_settings.py in the master branch contains a reference to the class “airflow.utils.log.s3_task_handler.S3TaskHandler”, which is not present in the apache-airflow==1.9.0 Python package. The fix is simple: use this base template instead: https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/config_templates/airflow_local_settings.py (and follow all the other instructions in the mentioned answer)
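
For reference, and assuming you follow that answer, the airflow.cfg side of the setup boils down to roughly the entries below; the connection name and the config module path are placeholders, so double-check them against the linked answer and template:

```ini
[core]
# Airflow connection holding your S3 credentials (the name is a placeholder)
remote_log_conn_id = MyS3Conn
# Make the webserver read task logs back from S3
task_log_reader = s3.task
# Your copy of the v1-9-stable template (e.g. config/log_config.py on the
# PYTHONPATH), with S3_LOG_FOLDER pointing at your bucket/prefix
logging_config_class = log_config.LOGGING_CONFIG
```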

Hope this helps!

Build a Data Pipeline with AWS Athena and Airflow (part 1)

In this post, I build on the knowledge shared in the post for creating Data Pipelines on Airflow and introduce new technologies that help in the Extraction part of the process, with cost and performance in mind. I’ll go through the options available and then introduce a specific solution using AWS Athena. First we’ll establish the dataset and organize our data in S3 Buckets. Afterwards, you’ll learn how to make this information queryable through AWS Athena, while making sure it is updated daily.

Data dump files of not-so-structured data are a common byproduct of Data Pipelines that include extraction. This happens by design: business-wise, and as Data Engineers, there is no such thing as too much data. From an investment standpoint, however, object-relational database systems can become increasingly costly to keep, especially if we aim at keeping performance while the data grows.

That said, this is not a new problem. Both Apache and Facebook have developed open-source software that is extremely efficient at dealing with extreme amounts of data. While such software is written in Java, it exposes an abstracted interface to the data that relies on traditional SQL to query data stored on filesystem-like storage, such as HDFS or, in our example, S3, in a wide range of different file formats such as CSV.

Today we have many options to tackle this problem, and I’m going to go through how to address it in today’s serverless world with AWS Athena. For that, we first need to quickly rewind back in time and go through the underlying technology.
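
As a preview of where Part 1 is headed, the “make it queryable” step amounts to registering the S3 location as a partitioned external table in Athena. Below is a hedged sketch of that registration via boto3; the database, table, columns and output location are illustrative placeholders, not the ones used in the post.

```python
# Illustrative only: register s3://data/ as a partitioned, CSV-backed external
# table in Athena. Database, table and column names are placeholders.
import boto3

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_table (
    id      string,
    payload string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://data/'
"""

boto3.client("athena").start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://data/athena-results/"},
)
```

Note that creating the table does not load any partitions by itself; each year=/month=/day= prefix still has to be added (via ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE), which is exactly the step Part 2 automates with Airflow.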

Airflow: create and manage Data Pipelines easily

This bootstrap guide was originally published at GoSmarten but as the use cases continue to increase, it’s a good idea to share it here as well.

What is Airflow

The need to perform operations or tasks, either simple and isolated or complex and sequential, is present in all things data nowadays. If you or your team work with lots of data on a daily basis, there is a good chance you’ve struggled with the need to implement some sort of pipeline to structure these routines. To make this process more efficient, Airbnb developed an internal project conveniently called Airflow, which was later fostered under the Apache Incubator program. In 2015 Airbnb open-sourced the code to the community and, although its trustworthy origin played a role in its popularity, there are many other reasons why it became widely adopted in the engineering community. It allows for tasks to be set up purely in Python (or as Bash commands/scripts).
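
To illustrate that last point, here is a minimal, hypothetical DAG (not taken from the tutorial) with one Bash task and one Python task chained together:

```python
# A toy DAG: one Bash command and one Python callable, run daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="hello_airflow",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

say_hello = BashOperator(task_id="say_hello", bash_command="echo 'hello'", dag=dag)


def log_date(ds, **_):
    print("Running for {}".format(ds))


print_date = PythonOperator(
    task_id="print_date",
    python_callable=log_date,
    provide_context=True,   # Airflow 1.x: pass `ds` to the callable
    dag=dag,
)

say_hello >> print_date  # run the Bash task before the Python one
```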

What you’ll find in this tutorial

Not only will we walk you through setting up Airflow locally, but you’ll do so using Docker, which optimizes the conditions for learning locally while minimizing the effort of transitioning into production. Docker is a subject in itself, and we could dedicate a few dozen posts to it alone; however, it is also simple enough, and has all the magic needed, to show off some of its advantages when combined with Airflow.

Consuming Kinesis Streams with Lambda functions locally

This blog post was originally published at the GoSmarten website. As the number of projects where we use this setup keeps increasing, we thought we might as well share it here. Let me know if it was helpful!

Motivation

AWS offers the cool possibility of consuming from Kinesis streams in real time in a serverless fashion via AWS Lambda. However, it can become extremely annoying to have to deploy a Lambda function to AWS just to test it. Moreover, it is also expensive to keep a Kinesis stream (essentially a queue) up and running just to test code.

Thus, by combining the Kinesis Client Library (KCL) with AWS Kinesis and DynamoDB Docker containers, we were able to recreate locally everything that happens in the background when you plug a Lambda function into a Kinesis stream on AWS. Besides saving costs, this allows developers to substantially reduce development time, as well as develop higher-quality code thanks to the ease and flexibility of testing different scenarios locally.
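
As a rough sketch of the idea (not the repository’s actual code: the local endpoint port, stream name and handler below are assumptions), you can point boto3 at the local Kinesis container, pull records from a shard, and feed them to your Lambda handler in the same event shape AWS would use:

```python
# Sketch only: assumes a local Kinesis container on port 4567 and an existing
# stream named "test-stream" with a single shard.
import base64
import json

import boto3

kinesis = boto3.client(
    "kinesis",
    endpoint_url="http://localhost:4567",   # assumed local Kinesis port
    region_name="eu-west-1",
    aws_access_key_id="local",
    aws_secret_access_key="local",
)

STREAM = "test-stream"
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"hello": "world"}).encode("utf-8"),
    PartitionKey="1",
)

# Read the record back from the (single) shard.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator)["Records"]


def handler(event, context):
    """The Lambda function under test: decode each Kinesis record and print it."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(payload)


# Wrap the raw records into the event structure Lambda receives from Kinesis.
event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(r["Data"]).decode("utf-8")}} for r in records
    ]
}
handler(event, context=None)
```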

Feel free to check out the code supporting this blog post in our repository.

Context: Event Sourcing/CQRS

Event sourcing is not a new concept, but as the available streaming technologies have evolved, its widespread use has gained the attention it deserves. Thanks to “publish-subscribe” types of queues, it has become much easier to build streams of events available to multiple consumers at the same time. This democratization of access to an immutable, append-only stream of events is essential, as it decouples the responsibility of modelling an event schema from any particular consumer logic. It is also the reason why so many people argue either that CQRS and event sourcing are the same thing, or that they have a symbiotic relationship.

Using Akka Streaming for “saving alerts” – part 2

This blog post is the second and final part of the post Using akka streaming for “saving alerts”. Part 1 is available here. In this part we go into the details of how the application was designed.

Full disclosure: this post was initially published at the Bonial tech blog here. If you are looking for positions in tech, I would definitely recommend checking their career page.

Application Actor System

The following illustration gives you a schematic view of all the actors used in our application, and (hopefully) some of the mechanics of their interaction:

[Figure: akka-streaming-pipeline – schematic of the application’s actors and their interactions]

As previously mentioned, one could logically divide the application lifecycle into three main stages. We will get into more detail about each of them next, but for now let us walk through them briefly and try to map them to our application.

Using Akka Streaming for “saving alerts” – part 1

Full disclosure: this post was initially published at the Bonial tech blog here. Bonial is one of my favorite companies, at the heart of Berlin, and where I have been fortunate enough to be working for 2+ years as a freelance Data Engineer. If you are looking for positions in tech, I can’t help but recommend checking their career page.

Overview

Some months ago I was working on an internal project at Bonial using Akka Streaming (in Scala) to provide additional features to our current push notification system. The main goal of the project was to enhance the speed with which the client is able to notify its end users of available discount coupons. In this case, we wanted to notify the users in real time of coupons available in store, so that they could use them more effectively on the spot and save money. Hence our project code name: “saving alerts”!

After some architectural discussions where we compared several technical options, we decided to give akka streaming a go. It has been a fun ride, so we thought we might as well share some of the lessons learned.