Build a Data Pipeline with AWS Athena and Airflow (part 1)

In this post, I build on the knowledge shared in the previous post about creating Data Pipelines on Airflow and introduce new technologies that help with the Extraction part of the process, with cost and performance in mind. I’ll go through the options available and then introduce a specific solution using AWS Athena. First we’ll establish the dataset and organize our data in S3 buckets. Afterwards, you’ll learn how to make this information queryable through AWS Athena, while making sure it is updated daily.

Data dump files of not-so-structured data are a common byproduct of Data Pipelines that include extraction. This happens by design: business-wise, and for us as Data Engineers, there is never too much data. From an investment standpoint, though, object-relational database systems can become increasingly costly to keep, especially if we aim to maintain performance as the data grows.

That said, this is not a new problem. Both Apache and Facebook have developed open source software that is extremely efficient at dealing with extreme amounts of data. While such software is written in Java, it exposes an abstracted interface to the data that relies on traditional SQL to query data stored on filesystem-like storage – such as S3 in our example, or HDFS – and in a wide range of formats such as CSV.

Today we have many options to tackle this problem, and I’m going to go through how to address it in today’s serverless world with AWS Athena. For this, we need to quickly rewind back in time and go through the technology that led us here.
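Before we rewind, here is a minimal sketch of what querying S3-backed data through Athena looks like from Python with boto3. The region, database, table and bucket names are placeholders of my own, and it assumes the table has already been registered in Athena (we’ll get to that):

```python
# A minimal sketch, assuming boto3 is configured with AWS credentials.
# The database, table and bucket names below are hypothetical placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Kick off a query against data that lives in S3 and is registered in Athena.
response = athena.start_query_execution(
    QueryString="SELECT * FROM events LIMIT 10",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```

You pay per data scanned, and results land in the S3 output location you configure – which is exactly the kind of serverless behaviour we’ll be exploring.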

Overview: HP Vertica vs AWS Redshift

While working at HP some years ago, I was exposed not only to internal training materials, but also to a demo environment. I still remember the excitement when HP acquired Vertica Systems in 2011, and we had a new toy to play with… Come on, you can’t blame me, distributed DBs were something only the cool kids were doing.

Bottom line is that it’s been a while since I laid eyes on it… Well, recently, while considering possible architectural solutions, I had the pleasure of revisiting Vertica. And since AWS Redshift has been gaining a lot of popularity and we’re also using it at some of our clients, I thought I might put together an easy summary to help others.

Now if you’re expecting a smackdown post, then I’m afraid I’ll disappoint you – for that you have the internet. If experience has taught me anything, it is that with top-notch solutions there are only use cases, and one finds the best-fitting one.

Architectural overview of HP’s Vertica DB

If you start digging for more info on HP Vertica, chances are high that you will stumble upon an image summarizing its key selling points.

So the first thing one should talk about is the origin of Vertica’s name. Instead of a horizontal data layout, Vertica stores data in a vertical disposition – a columnar storage of data (imagine a 90-degree turn on your DB, where the properties of each row are now the main agents to consider). The goal is to benefit both query and storage capabilities. The columnar allocation allows data to be sorted (no need for indexing, since the data is already sorted), which is then followed by an encoding process that stores each unique value only once, together with how many times that value repeats itself.
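To make that sorted-plus-encoded idea concrete, here is a tiny, purely illustrative Python sketch of run-length encoding a sorted column – each unique value is kept once, together with its repeat count. This is the mental model, not Vertica’s actual implementation:

```python
from itertools import groupby

def run_length_encode(sorted_column):
    """Store each unique value once, plus how many times it repeats.

    Purely illustrative of the columnar/encoding idea, not Vertica internals.
    """
    return [(value, sum(1 for _ in group)) for value, group in groupby(sorted_column)]

# A sorted column with many repeated values compresses very well this way.
country = ["BR", "BR", "BR", "PT", "PT", "US", "US", "US", "US"]
print(run_length_encode(country))
# [('BR', 3), ('PT', 2), ('US', 4)]
```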

Vertica prefers encoding over compressing data: you can still query data while it is in its encoded state. Vertica applies different encoding mechanisms to different columns in the same table, according to the type of data stored, and you can also apply both encoding and compression. Besides the additional query speed that encoding provides, storage reduction is also at stake: Vertica claims a 4:1 reduction of the storage footprint.

Another important differentiator is Vertica’s distributed, grid-like architecture. As in other Big Data architectures, instead of one herculean, muscled machine, you gain the advantages of multiple smaller and, most importantly, less expensive computing nodes working in parallel. In other words, you use a Massively Parallel Processing (MPP) system – a grid-based architecture clustering a bunch of x86-64 servers. It is a shared-nothing architecture: each node functions as if it were the only node, and uses its own processing capabilities to return the part of a query’s answer that involves the data it owns.
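Purely as an illustration of the shared-nothing, scatter-gather idea (my own toy model, not Vertica’s engine), each “node” below aggregates only the rows it owns, and a coordinator merges the partial results:

```python
# Toy illustration of shared-nothing, scatter-gather query execution.
# Each "node" only sees its own partition; the coordinator merges partials.
from concurrent.futures import ThreadPoolExecutor

partitions = [  # each inner list plays the role of the rows one node owns
    [("BR", 10), ("PT", 5)],
    [("BR", 7), ("US", 3)],
    [("US", 8), ("PT", 1)],
]

def node_partial_sum(rows):
    """Roughly: SELECT country, SUM(amount) ... computed locally on one node."""
    partial = {}
    for country, amount in rows:
        partial[country] = partial.get(country, 0) + amount
    return partial

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(node_partial_sum, partitions))

# The coordinator merges the partial aggregates into the final answer.
final = {}
for partial in partials:
    for country, amount in partial.items():
        final[country] = final.get(country, 0) + amount
print(final)  # {'BR': 17, 'PT': 6, 'US': 11}
```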

How do you design such a system? Each node has a minimum hardware recommendation of two quad-core CPUs, 16GB of RAM, 1TB of disk space and two VLANs assigned – one for client connections, another for private inter-node communication. The private inter-node VLAN is where nodes distribute queries among themselves, obtaining parallel processing.

You might be wondering by now what would happen if you were to lose one or more of those commodity hardware servers. The DB provides all the HA mechanisms you need by spreading data redundantly across several nodes. We are talking about a RAID-like storage of the same data at least twice on different nodes, thus eliminating SPOFs and tolerating node failures. To do so, Vertica does table segmentation, distributing tables evenly across different nodes, as sketched below. The result: you get clean HA, without any need for manual log-based recovery.
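As a rough sketch of the segmentation idea – hash each row to a primary node and keep a second copy on a neighbouring node – here is an illustrative Python snippet. The hashing scheme is my own simplification, not Vertica’s actual projection design:

```python
# Illustrative only: spread rows across nodes by hash, and keep a second
# "buddy" copy on the next node so a single node failure loses no segment.
import zlib

NUM_NODES = 4

def placement(row_key, num_nodes=NUM_NODES):
    primary = zlib.crc32(row_key.encode()) % num_nodes
    buddy = (primary + 1) % num_nodes  # redundant copy lives on a neighbour
    return primary, buddy

for order_id in ["A-100", "A-101", "A-102", "A-103"]:
    primary, buddy = placement(order_id)
    print(f"{order_id}: primary node {primary}, buddy copy on node {buddy}")
```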

Queries are indeed run on all nodes, so if one node is down, you will suffer on performance (compared to full multi-node performance, not to other DBs!), since at least one node will have to do double work during the failure. If you lose enough nodes to lose at least one segment of data, Vertica automatically shuts down so as not to corrupt the DB.

Also quite neat is the auto-recovery process: when a failing node comes back online, the other nodes automatically sync data to it.

As you can see, the answer is yes: you can forget about your centralized SAN environment and your big-ass storage array. Each node uses local disk, which also allows better performance, as redundancy is guaranteed at the DB level.

Another important aspect is that you can load your DB schema into Vertica, and it uses what they call the “Automatic DB Designer” to facilitate the whole process of DB design. You can use the designer repeatedly at different points in time, adjusting as your database evolves.

Finally, Vertica has a standard SQL interface, and supports SQL, ODBC, JDBC and the majority of ETL and BI reporting products.
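Because the interface is plain SQL over standard drivers, connecting from, say, Python is unremarkable. A minimal sketch using the open source vertica-python driver – host, credentials and table name below are placeholders, not a real environment:

```python
# Minimal sketch using the vertica-python driver; connection details are
# hypothetical placeholders for illustration only.
import vertica_python

conn_info = {
    "host": "vertica-node-1.example.com",
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "analytics",
}

with vertica_python.connect(**conn_info) as connection:
    cursor = connection.cursor()
    cursor.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id LIMIT 10"
    )
    for row in cursor.fetchall():
        print(row)
```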

Disclaimer: note that these are my own notes. HP is not responsible for any of the content provided here.