AWS Server-less data pipelines with Terraform to Redshift – Part 2

Alright, it’s time for the second post of our sequence focusing on AWS options to setup pipelines in a server-less fashion. The topics that we are covering throughout this series are:

In this post we complement the previous one, by providing infrastructure-as-code with Terraform for deployment purposes. We are strong believers of a DevOps approach also to Data Engineering, also known as “DataOps”. Thus we thought it would make perfect sense to share a sample Terraform module along with Python code.

To recap, so far we have Python code that, if triggered by a AWS event on a new S3 object, will connect to Redshift, and issue SQL Copy command statement to load that data into a given table. Next we are going to show how to configure this with Terraform code.

As usual, all the code for this post is available publicly in this github repository. In case you haven’t yet, you will need to install terraform in order follow along this post.

Continue reading “AWS Server-less data pipelines with Terraform to Redshift – Part 2”

Get AppScaled ECS Tasks served by AWS Network Load Balancer

This article is intended to be a quick and dirty snippet for anyone going to through the struggle of getting your ECS service, which might have one or more containers running the same App (being part of an Auto Scaling Group), with a Network Load Balancer (instead of the more common ELB or ALB).

ECS Service/Task Definition

Another particularity of this implementation is that I also decided to use the ECS task’s network mode as awsvpc. In the case that you are not acquainted with this new option, this means that:

  • Your container will get its own network interface and its own IP address;
  • The Host port and the Container port need to be the same, since there is not middleware managing port match between the two entities.

The cherry on top is that the ECS Service now has the option of automatically registering and deregistering LB targets by their IP address, which fits perfectly on the intention described.

Network Load Balancer

This post isn’t concretely about describing the technical details of what is a Network Load Balancer but about the caveats of using it in this scenario: because NLB is a layer 4 load balancer, you won’t be able to define Security Groups at the NLB level. Instead, you’ll have to make sure you make your tasks/containers secure by attaching the security groups to them – remember that with the awsvpc network mode, each container will get its own NIC.

Implementation

As for the actual code snippet to support what I’m trying to achieve: Continue reading “Get AppScaled ECS Tasks served by AWS Network Load Balancer”

Getting started with AWS IoT and a dockerized device

IoT isn’t a new term and has, actually, been one of the buzz words of the XXI century, right alongside Crypto Currency. The two of them are seen holding hand in hand in the good and the bad, from interesting crypto projects focused on IoT usage, or actual IoT devices being hacked to mine crypto currencies. But this article is actually about IoT: it seems the last few years have given us the perfect momentum between three main ingredients for its popularity: the exponential increase of network capable devices, the easily available processing power and, finally, the hungry-for-data dynamic we from most parties nowadays.

For someone working as a Data Engineer (or related) there isn’t there much of a more end-to-end project than one which goes from setting up an actual no-so-intelligence device, design the methods to connect these devices to a network and, still, go about designing all streaming/batch processing infrastructure needed to deal with all the data. It really is a huge challenge.

In the following lines, I’ll focus on how AWS IoT has come to help in the first and second challenge and show an example of how to emulate an actual IoT device with the help of docker and walk through some of the nice and easy features AWS IoT offers such as IoT Core, management, topics and rules as well as integration with other AWS services such as Kinesis, Firehose or Lambdas.

Continue reading “Getting started with AWS IoT and a dockerized device”

AWS Server-less data pipelines with Terraform to Redshift – Part 1

This post is the first of sequence of posts focusing on AWS options to setup pipelines in a serverless fashion. The topics that we all cover throughout the whole series are:

In this post we lean towards another strategy to setup data pipelines, namely event triggered. That is, rather than being scheduled to execute with a given frequency, our traditional pipeline code is executed immediately triggered by a given event. Our example consists of a demo scenario for immediately and automatically loading data that is stored in S3 into Redshift tutorial. Continue reading “AWS Server-less data pipelines with Terraform to Redshift – Part 1”

Build a Data Pipeline with AWS Athena and Airflow (part 1)

In this post, I build up on the knowledge shared in the post for creating Data Pipelines on Airflow and introduce new technologies that help in the Extraction part of the process with cost and performance in mind. I’ll go through the options available and then introduce to a specific solution using AWS Athena. First we’ll establish the dataset and organize our data in S3 Buckets. Afterwards, you’ll learn how to make it so that this information is queryable through AWS Athena, while making sure it is updated daily.

Data dump files of not so structured data are a common byproduct of Data Pipelines that include extraction. dumps of not-so-structured data. This happens by design: business-wise and as Data Engineers, it’s never too much data. From an investment stand point, object-relational database systems can become increasingly costly to keep, especially if we aim at keeping performance while the data grows.

Having this said, this is not a new problem. Both Apache and Facebook have developed open source software that is extremely efficient in dealing with extreme amounts of data. While such softwares are written in Java, they maintain an abstracted interface to the data that relies on traditional SQL language to query data that is stored on filesystem storage, such as S3 for our example and in a wide range of different formats from HFDS to CSV.

Today we have many options to tackle this problem and I’m going to go through on how to welcome this problem in today’s serverless world with AWS Athena. For this we need to quickly rewind back in time and go through the technology Continue reading “Build a Data Pipeline with AWS Athena and Airflow (part 1)”

Consuming Kinesis Streams with Lambda functions locally

This blog post was originally published at GoSmarten website. As the number of projects where we use it was increasing, we thought we might as well share it. Let me know if it was helpful!

Motivation

AWS offers the cool possibility to consume from Kinesis streams in real time in a serverless fashion via AWS Lambda. However in can become extremely annoying to have to deploy a Lambda function in AWS just to test it. Moreover, it is also expensive to hold a Kinesis stream (e.g. queue) up and running just to test code.

Thus, by combining Kinesis Client Library (KCL) with AWS Kinesis and DynamoDB docker containers we were able to recreate locally everything that happens on the background when you plug a Lambda function to a Kinesis stream on AWS. Besides saving costs, this allows developers to substantially reduce development time, as well as develop higher quality code due to the ease and flexibility of testing different scenarios locally.

Feel free to checkout the code supporting this blog post on our repository.

Context: Event Sourcing/CQRS

 

Event sourcing is not a new concept, but as available streaming technologies have evolved, its widespread use has gained the attention it deserves. Thanks to “publish-subscribe” type of queues, it has become much easier to build streams of events available to multiple consumers at the same time. This democratization of access to an immutable, append-only stream of events is essential, as it separates the responsibility of modelling an event schema to a particular logic. It is also the reason why so many people either argue CQRS and event sourcing are the same thing, or have a symbiotic relationship. Continue reading “Consuming Kinesis Streams with Lambda functions locally”

Using Akka Streaming “saving alerts” – part 1

Full disclosure: this post was initially published at Bonial tech blog here, one my favorite companies at the heart of Berlin, and where I have been fortunate enough to be working for 2+ years as a freelance Data Engineer. If you are looking for positions in tech, I can’t help to recommend checking their career page.

Overview

Some months ago I was working on an internal project at Bonial using Akka Streaming (in scala) to provide additional features to our current push notification system. The main goal of the project was to enhance the speed to which the client is able to notify its end users of available discount coupons. In this case, we wanted to notify the users in a real time fashion of available coupons on store, so that they could use them more effectively on the spot and save money. Hence our project code name: “saving alerts”!

After some architectural discussions where we compared several technical options, we decided to give akka streaming a go. It has been a fun ride, so we thought we might as well share some of the lessons learned. Continue reading “Using Akka Streaming “saving alerts” – part 1″