Getting started with AWS IoT and a dockerized device

IoT isn’t a new term and has, actually, been one of the buzz words of the XXI century, right alongside Crypto Currency. The two of them are seen holding hand in hand in the good and the bad, from interesting crypto projects focused on IoT usage, or actual IoT devices being hacked to mine crypto currencies. But this article is actually about IoT: it seems the last few years have given us the perfect momentum between three main ingredients for its popularity: the exponential increase of network capable devices, the easily available processing power and, finally, the hungry-for-data dynamic we from most parties nowadays.

For someone working as a Data Engineer (or related) there isn’t there much of a more end-to-end project than one which goes from setting up an actual no-so-intelligence device, design the methods to connect these devices to a network and, still, go about designing all streaming/batch processing infrastructure needed to deal with all the data. It really is a huge challenge.

In the following lines, I’ll focus on how AWS IoT has come to help in the first and second challenge and show an example of how to emulate an actual IoT device with the help of docker and walk through some of the nice and easy features AWS IoT offers such as IoT Core, management, topics and rules as well as integration with other AWS services such as Kinesis, Firehose or Lambdas.

Continue reading “Getting started with AWS IoT and a dockerized device”

Build a Data Pipeline with AWS Athena and Airflow (part 1)

In this post, I build up on the knowledge shared in the post for creating Data Pipelines on Airflow and introduce new technologies that help in the Extraction part of the process with cost and performance in mind. I’ll go through the options available and then introduce to a specific solution using AWS Athena. First we’ll establish the dataset and organize our data in S3 Buckets. Afterwards, you’ll learn how to make it so that this information is queryable through AWS Athena, while making sure it is updated daily.

Data dump files of not so structured data are a common byproduct of Data Pipelines that include extraction. dumps of not-so-structured data. This happens by design: business-wise and as Data Engineers, it’s never too much data. From an investment stand point, object-relational database systems can become increasingly costly to keep, especially if we aim at keeping performance while the data grows.

Having this said, this is not a new problem. Both Apache and Facebook have developed open source software that is extremely efficient in dealing with extreme amounts of data. While such softwares are written in Java, they maintain an abstracted interface to the data that relies on traditional SQL language to query data that is stored on filesystem storage, such as S3 for our example and in a wide range of different formats from HFDS to CSV.

Today we have many options to tackle this problem and I’m going to go through on how to welcome this problem in today’s serverless world with AWS Athena. For this we need to quickly rewind back in time and go through the technology Continue reading “Build a Data Pipeline with AWS Athena and Airflow (part 1)”

Airflow: create and manage Data Pipelines easily

This bootstrap guide was originally published at GoSmarten but as the use cases continue to increase, it’s a good idea to share it here as well.

What is Airflow

The need to perform operations or tasks, either simple and isolated or complex and sequential, is present in all things data nowadays. If you or your team work with lots of data on a daily basis there is a good chance you’re struggled with the need to implement some sort of pipeline to structure these routines. To make this process more efficient, Airbnb developed an internal project conveniently called Airflow which was later fostered under the Apache Incubator program. In 2015 Airbnb open-sourced the code to the community and, albeit its trustworthy origin played a role in its popularity, there are many other reasons why it became widely adopted (in the engineering community). It allows for tasks to be set up purely in Python (or as Bash commands/scripts).

What you’ll find in this tutorial

Not only we will walk you through setting up Airflow locally, but you’ll do so using Docker, which will optimize the conditions to learn locally while minimizing transition efforts into production. Docker is a subject for itself and we could dedicate a few dozen posts to, however, it is also simple enough and has all the magic needed to show off some of its advantages combining it with Airflow. Continue reading “Airflow: create and manage Data Pipelines easily”

Using Akka Streaming for “saving alerts” – part 2

This blog post is the second and final part of the post Using akka streaming for “saving alerts”. Part 1 is available here.  In this part we enter the details on how the application was designed.

Full disclosure: this post was initially published at Bonial tech blog here. If you are looking for positions in tech, I would definitely recommend checking their career page.

Application Actor System

The following illustration gives you a schematic view of all the actors used in our application, and (hopefully) some of the mechanics of their interaction:

akka-streaming-pipeline

As previously mentioned, one could divide the application lifecycle logically into three main stages. We will get into more detail about each one of them next, but for now let us walk through them narrowly and try to map to our application Continue reading “Using Akka Streaming for “saving alerts” – part 2″

Getting Started with Spark (part 3) – UDFs & Window functions

This post attempts to continue the previous introductory series “Getting started with Spark in Python” with the topics UDFs and Window Functions. Unfortunately it stayed marinating in my Word press for quite a while (was already more less 70% complete) and only now had the opportunity to complete it.

Note: For this post I’m using Spark 1.6.1. There are some minor differences in comparison to the new coming Spark 2.0, such as using a SparkSession object to initialize the Spark Context, instead of HiveContext as I do here. Nonetheless, the important parts are common in both.

Continue reading “Getting Started with Spark (part 3) – UDFs & Window functions”

Spark 2.0: From quasi to proper-streaming?

This post attempts to follow the relatively recent new Spark release – Spark 2.0 – and understand the differences regarding Streaming applications. Why is streaming in particular?, you may ask. Well, Streaming is the ideal scenario most companies would like to use, and the competition landscape is definitely heating up, specially with Apache Flink and Google’s Apache Beam.

Why is streaming so difficult

There are three main problems when it comes to building real time applications based on streaming data:

  • Data consistency:  due to the distributed architecture nature it is possible that at any given point in time some events have been processed in some nodes and not  in other nodes, even though these events might actually have occurred before than others. For example, you may have a Logout event without a Login event, close app event without open app, etc.
  • Fault tolerance (FT): related to the first problem, on node failure processing engines may try to reprocess an event that had actually already been processed by the failed node, leading to duplicated records and/or inaccurate metrics
  • Dealing with late data/out-of-order data: either because you are reading from a message bus system such as Kafka or Kinesis, or simply because mobile app might only be able to sync data hours later, developers simply must tackle this challenge in application code.

See this post for an excellent detailed explanation. Continue reading “Spark 2.0: From quasi to proper-streaming?”

Overview HP Vertica vs AWS Redshift

While working in HP some years ago, I was exposed to not only internal training materials, but also a demo environment. I still remember the excitement when HP acquired Vertica Systems in 2011, and we had a new toy to play with… Come on, you can’t blame me, distributed DBs was something only the cool kids were doing.

Bottom line is that it’s been a while since I laid eyes on it… Well recently, while considering possible architectural solutions, I had the pleasure  to revisit Vertica. And since AWS Redshift has been gaining a lot of popularity and we’re also using it at some of our clients, I thought I might give some easy summary to help others.

Now if you’re expecting a smack down post, then I’m afraid I’ll disappoint you – for that you have the internet. If experience has taught me  something is that – in the case of top-notch solutions there are only use cases, and one finds the best fitting one. Continue reading “Overview HP Vertica vs AWS Redshift”