Reproducibility is a pillar in science, and version control via git has been a blessing to it. For pure Software Engineering it works perfectly. However, machine learning projects are not just only about code, but rather also about the data. The same model trained with two distinct data sets can produce completely different results.
So it comes with no surprise when I stumble with csv files on git repos of data teams, as they struggle to keep track of code and metadata. However this cannot be done for today’s enormous datasets. I have seen several hacks to solve this problem, none of them bullet proof. This post is not about those hacks, rather about an open source solution for it: DVC.
Let us exemplify by using a kaggle challenge: predicting house prices with Advanced Regression Techniques
Continue reading “Versioning in data projects”
Alright, it’s time for the second post of our sequence focusing on AWS options to setup pipelines in a server-less fashion. The topics that we are covering throughout this series are:
In this post we complement the previous one, by providing infrastructure-as-code with Terraform for deployment purposes. We are strong believers of a DevOps approach also to Data Engineering, also known as “DataOps”. Thus we thought it would make perfect sense to share a sample Terraform module along with Python code.
To recap, so far we have Python code that, if triggered by a AWS event on a new S3 object, will connect to Redshift, and issue SQL Copy command statement to load that data into a given table. Next we are going to show how to configure this with Terraform code.
As usual, all the code for this post is available publicly in this github repository. In case you haven’t yet, you will need to install terraform in order follow along this post.
Continue reading “AWS Server-less data pipelines with Terraform to Redshift – Part 2”
This article is intended to be a quick and dirty snippet for anyone going to through the struggle of getting your ECS service, which might have one or more containers running the same App (being part of an Auto Scaling Group), with a Network Load Balancer (instead of the more common ELB or ALB).
ECS Service/Task Definition
Another particularity of this implementation is that I also decided to use the ECS task’s network mode as awsvpc. In the case that you are not acquainted with this new option, this means that:
- Your container will get its own network interface and its own IP address;
- The Host port and the Container port need to be the same, since there is not middleware managing port match between the two entities.
The cherry on top is that the ECS Service now has the option of automatically registering and deregistering LB targets by their IP address, which fits perfectly on the intention described.
Network Load Balancer
This post isn’t concretely about describing the technical details of what is a Network Load Balancer but about the caveats of using it in this scenario: because NLB is a layer 4 load balancer, you won’t be able to define Security Groups at the NLB level. Instead, you’ll have to make sure you make your tasks/containers secure by attaching the security groups to them – remember that with the awsvpc network mode, each container will get its own NIC.
As for the actual code snippet to support what I’m trying to achieve: Continue reading “Get AppScaled ECS Tasks served by AWS Network Load Balancer”
IoT isn’t a new term and has, actually, been one of the buzz words of the XXI century, right alongside Crypto Currency. The two of them are seen holding hand in hand in the good and the bad, from interesting crypto projects focused on IoT usage, or actual IoT devices being hacked to mine crypto currencies. But this article is actually about IoT: it seems the last few years have given us the perfect momentum between three main ingredients for its popularity: the exponential increase of network capable devices, the easily available processing power and, finally, the hungry-for-data dynamic we from most parties nowadays.
For someone working as a Data Engineer (or related) there isn’t there much of a more end-to-end project than one which goes from setting up an actual no-so-intelligence device, design the methods to connect these devices to a network and, still, go about designing all streaming/batch processing infrastructure needed to deal with all the data. It really is a huge challenge.
In the following lines, I’ll focus on how AWS IoT has come to help in the first and second challenge and show an example of how to emulate an actual IoT device with the help of docker and walk through some of the nice and easy features AWS IoT offers such as IoT Core, management, topics and rules as well as integration with other AWS services such as Kinesis, Firehose or Lambdas.
Continue reading “Getting started with AWS IoT and a dockerized device”
This post is the first of sequence of posts focusing on AWS options to setup pipelines in a serverless fashion. The topics that we all cover throughout the whole series are:
In this post we lean towards another strategy to setup data pipelines, namely event triggered. That is, rather than being scheduled to execute with a given frequency, our traditional pipeline code is executed immediately triggered by a given event. Our example consists of a demo scenario for immediately and automatically loading data that is stored in S3 into Redshift tutorial. Continue reading “AWS Server-less data pipelines with Terraform to Redshift – Part 1”
After learning the basics of Athena in Part 1 and understanding the fundamentals or Airflow, you should now be ready to integrate this knowledge into a continuous data pipeline.
The idea is for it to run on a daily schedule, checking if there’s any new CSV file in a folder-like structure matching the day for which the task is running. For example, if the task is running for 2010-01-31, then then it will check if there is any file in
s3://data/year=2010/month=01/day=31/*. If it finds a file there, it will add the “folder” as a partition to Athena so we can keep querying it.
Remind me again: why Athena?
At this point, if you are still wondering why Athena is so useful when you already have a pipeline in process to dump data somewhere (maybe a DB?) well, remember Athena is a “pay as you go” solution that will scale automatically for the desired queries you are running. The underlying costs are only associated with the S3 file hosting itself plus the execution of queries. Such queries, when combined with the Hive Metastore will provide a fast solution for querying heavy loads of data stored in several different types of files on an S3 bucket. On the other hand, provisioning a Database for dumping data will have fixed costs such as processing power, memory and storage amount which will surpass the first ones in case you are not using/needing the full blown features of having in place a proper database engine.
Before proceeding, there are three important assumptions: Continue reading “Build a Data Pipeline with AWS Athena and Airflow (part 2)”
Today is a short one, but hopefully a valuable devOps tip, if you are currently setting up remote logging integration to S3 of Airflow logs using Airflow version 1.9.0.
Basically this stackoverflow post provides the main solution. However, the current template incubator-airflow/airflow/config_templates/airflow_local_settings.py present in master branch contains a reference to the class “airflow.utils.log.s3_task_handler.S3TaskHandler”, which is not present in apache-airflow==1.9.0 python package. The fix is simple – use rather this base template: https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/config_templates/airflow_local_settings.py (and follow all other instructions in the mentioned answer)
Hope this helps!