Robinhood is a very popular California based FinTech, included by Forbes in the top 50 FinTechs to watch in 2019. Their primary mission is to bring down stock trading fees for the common Joe/Jane, although their roadmap also includes cryptocurrency trading.
Due to the nature of the bidding market, their data stack probably includes a lot of stream tooling. Also (probably) due to the lack of quick and easy tooling for streaming with Python, supported with the growing demand for Python in the developer community, they launched their own port of Kafka Streams, called Faust.
In this post, we’re going to show how to easy it is to bootstrap a project with Faust to put your stream related business logic needs in practice very quickly. The demo we prepared is of an app which filters words from a Kafka topic and then keeps a count of how many times it has seen the colors “red”, “green” and “blue”.
In a nutshell, Faust is:
Continue reading “Faust, your Python-based streaming library”
Picking up on Diogo’s last post on how to obliterate all resources on your AWS Account, I thought it could also be useful to, instead, list all you have running.
Since I’m long overdue on a Go post, I’m going to share a one file app that uses the Go AWS SDK for to crawl each region for all taggable resources and pretty printing it on stdout, organised by Service type (e.g. EC2, ECS, ELB, etc.), Product Type (e.g. Instance, NAT, subnet, cluster, etc.).
The AWS SDK allows to retrieve all ARNs for taggable resources, so that’s all the info I’ll use for our little app.
Note: If you prefer jumping to full code code, please scroll until the end and read the running instructions before.
The main goal is to get structured information from the ARNs retrieved, so the first thing is to create a type that serves as a blue print for what I’m trying to achieve. Because I want to keep it simple, let’s call this type
Also, since we are taking care of the basics, we can also define the TraceableRegions that we want the app to crawl through.
Finally, to focus the objective, let’s also create a function that accepts a slice of
*SingleResource and will convert will print it out as a table to stdout:
Continue reading “List all your AWS resources with Go”
This article is intended to be a quick and dirty snippet for anyone going to through the struggle of getting your ECS service, which might have one or more containers running the same App (being part of an Auto Scaling Group), with a Network Load Balancer (instead of the more common ELB or ALB).
ECS Service/Task Definition
Another particularity of this implementation is that I also decided to use the ECS task’s network mode as awsvpc. In the case that you are not acquainted with this new option, this means that:
- Your container will get its own network interface and its own IP address;
- The Host port and the Container port need to be the same, since there is not middleware managing port match between the two entities.
The cherry on top is that the ECS Service now has the option of automatically registering and deregistering LB targets by their IP address, which fits perfectly on the intention described.
Network Load Balancer
This post isn’t concretely about describing the technical details of what is a Network Load Balancer but about the caveats of using it in this scenario: because NLB is a layer 4 load balancer, you won’t be able to define Security Groups at the NLB level. Instead, you’ll have to make sure you make your tasks/containers secure by attaching the security groups to them – remember that with the awsvpc network mode, each container will get its own NIC.
As for the actual code snippet to support what I’m trying to achieve: Continue reading “Get AppScaled ECS Tasks served by AWS Network Load Balancer”
IoT isn’t a new term and has, actually, been one of the buzz words of the XXI century, right alongside Crypto Currency. The two of them are seen holding hand in hand in the good and the bad, from interesting crypto projects focused on IoT usage, or actual IoT devices being hacked to mine crypto currencies. But this article is actually about IoT: it seems the last few years have given us the perfect momentum between three main ingredients for its popularity: the exponential increase of network capable devices, the easily available processing power and, finally, the hungry-for-data dynamic we from most parties nowadays.
For someone working as a Data Engineer (or related) there isn’t there much of a more end-to-end project than one which goes from setting up an actual no-so-intelligence device, design the methods to connect these devices to a network and, still, go about designing all streaming/batch processing infrastructure needed to deal with all the data. It really is a huge challenge.
In the following lines, I’ll focus on how AWS IoT has come to help in the first and second challenge and show an example of how to emulate an actual IoT device with the help of docker and walk through some of the nice and easy features AWS IoT offers such as IoT Core, management, topics and rules as well as integration with other AWS services such as Kinesis, Firehose or Lambdas.
Continue reading “Getting started with AWS IoT and a dockerized device”
After learning the basics of Athena in Part 1 and understanding the fundamentals or Airflow, you should now be ready to integrate this knowledge into a continuous data pipeline.
The idea is for it to run on a daily schedule, checking if there’s any new CSV file in a folder-like structure matching the day for which the task is running. For example, if the task is running for 2010-01-31, then then it will check if there is any file in
s3://data/year=2010/month=01/day=31/*. If it finds a file there, it will add the “folder” as a partition to Athena so we can keep querying it.
Remind me again: why Athena?
At this point, if you are still wondering why Athena is so useful when you already have a pipeline in process to dump data somewhere (maybe a DB?) well, remember Athena is a “pay as you go” solution that will scale automatically for the desired queries you are running. The underlying costs are only associated with the S3 file hosting itself plus the execution of queries. Such queries, when combined with the Hive Metastore will provide a fast solution for querying heavy loads of data stored in several different types of files on an S3 bucket. On the other hand, provisioning a Database for dumping data will have fixed costs such as processing power, memory and storage amount which will surpass the first ones in case you are not using/needing the full blown features of having in place a proper database engine.
Before proceeding, there are three important assumptions: Continue reading “Build a Data Pipeline with AWS Athena and Airflow (part 2)”
In this post, I build up on the knowledge shared in the post for creating Data Pipelines on Airflow and introduce new technologies that help in the Extraction part of the process with cost and performance in mind. I’ll go through the options available and then introduce to a specific solution using AWS Athena. First we’ll establish the dataset and organize our data in S3 Buckets. Afterwards, you’ll learn how to make it so that this information is queryable through AWS Athena, while making sure it is updated daily.
Data dump files of not so structured data are a common byproduct of Data Pipelines that include extraction. dumps of not-so-structured data. This happens by design: business-wise and as Data Engineers, it’s never too much data. From an investment stand point, object-relational database systems can become increasingly costly to keep, especially if we aim at keeping performance while the data grows.
Having this said, this is not a new problem. Both Apache and Facebook have developed open source software that is extremely efficient in dealing with extreme amounts of data. While such softwares are written in Java, they maintain an abstracted interface to the data that relies on traditional SQL language to query data that is stored on filesystem storage, such as S3 for our example and in a wide range of different formats from HFDS to CSV.
Today we have many options to tackle this problem and I’m going to go through on how to welcome this problem in today’s serverless world with AWS Athena. For this we need to quickly rewind back in time and go through the technology Continue reading “Build a Data Pipeline with AWS Athena and Airflow (part 1)”
This bootstrap guide was originally published at GoSmarten but as the use cases continue to increase, it’s a good idea to share it here as well.
What is Airflow
The need to perform operations or tasks, either simple and isolated or complex and sequential, is present in all things data nowadays. If you or your team work with lots of data on a daily basis there is a good chance you’re struggled with the need to implement some sort of pipeline to structure these routines. To make this process more efficient, Airbnb developed an internal project conveniently called Airflow which was later fostered under the Apache Incubator program. In 2015 Airbnb open-sourced the code to the community and, albeit its trustworthy origin played a role in its popularity, there are many other reasons why it became widely adopted (in the engineering community). It allows for tasks to be set up purely in Python (or as Bash commands/scripts).
What you’ll find in this tutorial
Not only we will walk you through setting up Airflow locally, but you’ll do so using Docker, which will optimize the conditions to learn locally while minimizing transition efforts into production. Docker is a subject for itself and we could dedicate a few dozen posts to, however, it is also simple enough and has all the magic needed to show off some of its advantages combining it with Airflow. Continue reading “Airflow: create and manage Data Pipelines easily”