AWS Server-less data pipelines with Terraform to Redshift – Part 2

Alright, it’s time for the second post of our sequence focusing on AWS options to setup pipelines in a server-less fashion. The topics that we are covering throughout this series are:

In this post we complement the previous one, by providing infrastructure-as-code with Terraform for deployment purposes. We are strong believers of a DevOps approach also to Data Engineering, also known as “DataOps”. Thus we thought it would make perfect sense to share a sample Terraform module along with Python code.

To recap, so far we have Python code that, if triggered by a AWS event on a new S3 object, will connect to Redshift, and issue SQL Copy command statement to load that data into a given table. Next we are going to show how to configure this with Terraform code.

As usual, all the code for this post is available publicly in this github repository. In case you haven’t yet, you will need to install terraform in order follow along this post.

Continue reading “AWS Server-less data pipelines with Terraform to Redshift – Part 2”

AWS Server-less data pipelines with Terraform to Redshift – Part 1

This post is the first of sequence of posts focusing on AWS options to setup pipelines in a serverless fashion. The topics that we all cover throughout the whole series are:

In this post we lean towards another strategy to setup data pipelines, namely event triggered. That is, rather than being scheduled to execute with a given frequency, our traditional pipeline code is executed immediately triggered by a given event. Our example consists of a demo scenario for immediately and automatically loading data that is stored in S3 into Redshift tutorial. Continue reading “AWS Server-less data pipelines with Terraform to Redshift – Part 1”

Overview HP Vertica vs AWS Redshift

While working in HP some years ago, I was exposed to not only internal training materials, but also a demo environment. I still remember the excitement when HP acquired Vertica Systems in 2011, and we had a new toy to play with… Come on, you can’t blame me, distributed DBs was something only the cool kids were doing.

Bottom line is that it’s been a while since I laid eyes on it… Well recently, while considering possible architectural solutions, I had the pleasure  to revisit Vertica. And since AWS Redshift has been gaining a lot of popularity and we’re also using it at some of our clients, I thought I might give some easy summary to help others.

Now if you’re expecting a smack down post, then I’m afraid I’ll disappoint you – for that you have the internet. If experience has taught me  something is that – in the case of top-notch solutions there are only use cases, and one finds the best fitting one. Continue reading “Overview HP Vertica vs AWS Redshift”

Spark – Redshift: AWS Roles to the rescue

If you are using AWS to host your applications, you probably heard that you can apply IAM Roles also to ec2 instances. In a lot of cases this can be a really cool way to avoid passing AWS credentials to your applications, and having the pain of having to manage key distribution among servers, as well as ensuring key rotation mechanisms for security purposes.

This post is about a simple trick on how to take advantage of this feature when your Spark job needs to interact with AWS Redshift.

As can be read in Databricks repo for Spark-redshift library the are three (3) strategies for setting up AWS credentials: either setup in hadoop configuration (how many people are used to so far with Cloudera or HortonWorks), encoding the keys in a tempdir (by far not the best option if you ask me), or using temporary keys. The last strategy is the one being discussed here, and its based on AWS own documentation how to use temporary keys. Continue reading “Spark – Redshift: AWS Roles to the rescue”