Build a Data Pipeline with AWS Athena and Airflow (part 1)

In this post, I build on the knowledge shared in the previous post on creating Data Pipelines with Airflow and introduce new technologies that help with the Extraction part of the process, with cost and performance in mind. I'll go through the options available and then introduce a specific solution using AWS Athena. First we'll establish the dataset and organize our data in S3 buckets. Afterwards, you'll learn how to make this information queryable through AWS Athena, while making sure it is updated daily.
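To give an idea of where we're heading, here is a minimal sketch of that daily refresh expressed as an Airflow DAG, assuming a recent Airflow with the Amazon provider package installed; the table, database, bucket, and connection names below are placeholders that we'll flesh out over the series:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator

default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="athena_daily_refresh",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Pick up any new S3 partitions so yesterday's dump becomes queryable in Athena.
    refresh_partitions = AthenaOperator(
        task_id="refresh_partitions",
        query="MSCK REPAIR TABLE events",
        database="analytics",
        output_location="s3://my-athena-results/airflow/",
        aws_conn_id="aws_default",
    )
```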

Dump files of not-so-structured data are a common byproduct of Data Pipelines that include extraction. This happens by design: business-wise and as Data Engineers, there is no such thing as too much data. From an investment standpoint, however, object-relational database systems can become increasingly costly to keep, especially if we aim to maintain performance while the data grows.

That said, this is not a new problem. Both Apache and Facebook have developed open-source software that is extremely efficient at dealing with extreme amounts of data. While these tools are written in Java, they maintain an abstracted interface to the data that relies on traditional SQL to query data stored on file storage, such as S3 in our example, and in a wide range of different formats, from HDFS files to CSV.

Today we have many options to tackle this problem, and I'm going to show how to approach it in today's serverless world with AWS Athena. For this we need to quickly rewind back in time and go through the technology that got us here.
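As a quick teaser of how little infrastructure this takes, here is a hedged sketch using boto3 to register a CSV dump sitting in S3 as an external table and fire a query at it; the bucket names, database, and schema are made up for illustration:

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Hypothetical layout: raw dumps live under s3://my-data-lake/events/,
# and Athena writes query results to s3://my-athena-results/.
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS events (
    user_id    string,
    action     string,
    created_at string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-data-lake/events/'
"""

response = athena.start_query_execution(
    QueryString=create_table,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Athena is asynchronous: poll get_query_execution() with this id until it succeeds,
# after which the same call can run plain SQL such as
# "SELECT action, count(*) FROM events GROUP BY action".
print(response["QueryExecutionId"])
```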

Overview: HP Vertica vs AWS Redshift

While working at HP some years ago, I was exposed not only to internal training materials, but also to a demo environment. I still remember the excitement when HP acquired Vertica Systems in 2011, and we had a new toy to play with… Come on, you can't blame me, distributed DBs were something only the cool kids were doing.

Bottom line is that it's been a while since I laid eyes on it… Well, recently, while considering possible architectural solutions, I had the pleasure of revisiting Vertica. And since AWS Redshift has been gaining a lot of popularity and we're also using it at some of our clients, I thought I might give an easy summary to help others.

Now if you're expecting a smackdown post, then I'm afraid I'll disappoint you; for that you have the internet. If experience has taught me anything, it is that with top-notch solutions there are only use cases, and the job is to find the best-fitting one.

They share some properties in terms of architecture and internal engine:

  • Massively Parallel Processing (MPP) architecture: data is distributed among distinct nodes in a shared-nothing architecture, leveraging scale-out; in case you're wondering how it compares to Hadoop Hive, Airbnb did a smackdown comparison, concluding there is around a 5x advantage of Redshift over Hive;
  • High availability (HA): this follows from the first, thanks to the data replication mechanism; in the case of Vertica, they call it "k-safety", which measures the replication factor; you may also want to check fault groups to control how data is replicated according to physical distribution (such as server rack location, power circuits, etc.); the same automatic data replication among nodes happens in Redshift under the hood (besides more goodies such as backups);
  • Columnar-oriented data store: for analytical applications/OLAP (where queries usually select only specific columns, as opposed to OLTP), it is usually much faster, mainly because a) the engine does not need to scan the whole row and then discard the unnecessary content, and b) compression/encoding is more efficient since similar data types are stored together; for a more detailed explanation, I suggest here. Both Vertica and Redshift are built with this architecture (see the toy sketch after this list);
  • Data compression: Vertica mixes encoding strategies, depending on column data type, table cardinality, and sort order; they do distinguish between encoding and compression, since the engine can operate directly on encoded data whenever possible, which does not hold true for compressed data; Redshift does not make such a distinction and recommends leaving compression in auto mode, although you can choose the encoding type;
  • SQL standard interface: as always, minor differences are present, but bottom line, you can use the SQL syntax you're already accustomed to;
  • User Defined Functions (UDFs): both Redshift and Vertica give you space for customization;
  • In-memory DB: Nope, none of these are like SAP HANA, Oracle TimesTen or IBM's SolidDB
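To make the columnar/compression argument a bit more concrete, here is a toy sketch in plain Python (not how either engine is actually implemented) of why a sorted column lends itself to run-length style encoding:

```python
from itertools import groupby

# One column of an imaginary 'page_views' table, one value per row, already sorted.
country_column = ["BR"] * 500_000 + ["PT"] * 300_000 + ["US"] * 200_000

# Run-length encoding: keep each distinct value once, together with its repeat count.
encoded = [(value, sum(1 for _ in run)) for value, run in groupby(country_column)]

print(encoded)  # [('BR', 500000), ('PT', 300000), ('US', 200000)]
print(len(country_column), "row values vs", len(encoded), "encoded pairs")
```

A row store would have to touch all one million rows to answer a count per country; the encoded column answers it from three pairs, which is the intuition behind both engines' storage choices.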

Where they differ:

  • Architecture: in Vertica all nodes are “created equally”, meaning they share similar functions; Redshift has the concept of a leader node, a dedicated node which manages workload and query coordination among worker nodes;
  • Management: (this is a key differentiator, and most likely the one with the biggest weight in the final decision) with Vertica you have to do all the ops work (install, upgrade/update, configure nodes, etc.); Redshift is a fully managed cloud solution, where you only have pure database-related ops work; note: yes, HP provides an AMI to easily kickstart projects in the AWS cloud, but come on, this is still not the same thing;
  • Freedom of environment: with Redshift you’re locked in to AWS; with Vertica you can run it wherever you feel like;
  • Schema Design: Vertica provides you with a Designer Tool to easily migrate from traditional RDBMS systems based on their schema (not saying this saves the world, but it can be helpful); this is especially important in the beginning, since columnar data warehouses don't support indexes; so in Vertica you play with the projection concept, in Redshift with distribution and sort keys (and you'd better do it well, as it will be key for performance and keeping things balanced; see the sketch after this list);
  • Payment scheme: in Vertica (for data bigger than 1TB, which will most certainly be the case) you pay upfront licensing, plus the cost of the machines you're running it on; with Redshift, all costs are diluted into an hourly rate, and that's it;
  • Compiled code: Redshift claims that the leader node compiles the code for optimal performance at execution time, which the guys at Cake also confirm in this excellent post;
  • Free trial/usage: Vertica lets you have up to 3 nodes and 1TB of data; Redshift, on the other hand, in case your account is still eligible for the free usage tier (in the first year), lets you try a total of 750 normalized instance hours per month, enough to run one DC1.Large single node continuously, with 160GB of SSD storage;
  • Add-Ons: Vertica Pulse for sentiment analysis and Place for geospatial data analysis;
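To make the projection vs. distribution/sort key point above a bit more tangible, here is a minimal sketch of creating a Redshift table over the PostgreSQL wire protocol with psycopg2; the cluster endpoint, credentials, columns, and encoding choices are placeholders rather than recommendations:

```python
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# Hypothetical cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="change-me",
)

with conn, conn.cursor() as cur:
    # Distribution and sort keys play the role that projections play in Vertica:
    # they decide how rows spread across nodes and how they are ordered on disk.
    cur.execute("""
        CREATE TABLE page_views (
            view_date DATE          ENCODE az64,
            user_id   BIGINT        ENCODE az64,
            url       VARCHAR(2048) ENCODE lzo
        )
        DISTSTYLE KEY
        DISTKEY (user_id)
        SORTKEY (view_date);
    """)
```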

Finally, you might want to go deeper. Again, I really suggest this excellent post by Cake, which provides performance benchmarks. Benchmarks are always disputable, but they are still an interesting and important comparison method.

Architectural overview of HP’s Vertica DB

If you start digging for more info on HP Vertica, chances are high that you will stumble upon this image, which summarizes its key selling points.

So the first thing one should talk about is the origin of Vertica's name. Instead of horizontal data population, Vertica stores data in a vertical disposition, that is, columnar storage (imagine a 90-degree turn on your DB, where the properties of each row are now the main agents to consider). The goal is to benefit both query and storage capabilities. Allocating data in a columnar fashion allows it to be sorted (no need for indexing, since the data is already sorted), which is then followed by an encoding process that stores each unique value only once, along with how many times that value repeats itself.

Vertica prefers encoding over compressing data, since you can still query data while it is in its encoded state. Vertica applies different encoding mechanisms to different columns in the same table, according to the type of data stored, and you can also apply both encoding and compression. Besides using encoding to provide additional query speed, storage reduction is also at stake: Vertica claims a 4:1 reduction of the storage footprint.

Another important differentiator is Vertica's distributed, grid-like architecture. As in other Big Data architectures, instead of having one herculean, muscled machine, you gain the advantages of multiple smaller and, most importantly, less expensive computing nodes working in parallel. In other words, you use a Massively Parallel Processing (MPP), grid-based architecture clustering a bunch of x86-64 servers. It is a shared-nothing architecture: each node functions as if it were the only node, and uses its own processing capabilities to return its part of the answer to a query involving the data it owns.

How do you design such a system? Each node has a minimum hardware recommendation of two quad-core CPUs, 16GB of RAM, 1TB of disk space, and two VLANs assigned: one for client connections, another for private inter-node communication. The private inter-node communication VLAN is where nodes distribute queries among themselves, obtaining parallel processing.

You might be wondering by now what would happen if you were to lose one or more of those commodity hardware servers. The DB provides all the HA mechanisms you need by spreading data redundantly across several nodes. We are talking about RAID-like storage of the same data at least twice on different nodes, thus eliminating SPOFs and tolerating node failures. To do so, Vertica performs table segmentation, distributing tables evenly across different nodes. The result: you get clean HA, without any need for manual log-based recovery.
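As a rough illustration of how sorting, encoding, segmentation, and k-safety all surface in the DDL, here is a sketch using the vertica_python client; the connection details, table, and projection are made up, and in practice you would typically let the Database Designer propose projections for you:

```python
import vertica_python

# Hypothetical connection details for one node of a Vertica cluster.
conn_info = {
    "host": "vertica-node-1.example.com",
    "port": 5433,
    "user": "dbadmin",
    "password": "change-me",
    "database": "analytics",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE sales (
            sale_id   INT,
            sale_date DATE,
            amount    NUMERIC(12, 2)
        );
    """)
    # The projection keeps rows sorted (ORDER BY), run-length encodes the sort column,
    # spreads segments across all nodes, and KSAFE 1 keeps a buddy copy of each
    # segment on another node for the HA behaviour described above.
    cur.execute("""
        CREATE PROJECTION sales_by_date (
            sale_id,
            sale_date ENCODING RLE,
            amount
        ) AS
        SELECT sale_id, sale_date, amount
        FROM sales
        ORDER BY sale_date
        SEGMENTED BY HASH(sale_id) ALL NODES
        KSAFE 1;
    """)
```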

Queries are indeed run on all nodes, so if one node is down, you will suffer on performance (compared to full multi-node performance, not to other DBs!), since at least one node will have to do double work during the failure. If you lose enough nodes to lose at least one segment of data, Vertica automatically shuts down so as not to corrupt the DB.

Also quite neat is the auto-recovery process: when a failing node comes back online, the other nodes automatically sync its data.

As you can see, the answer to the obvious question is yes: you can forget about your centralized SAN environment and your big-ass storage array. Each node uses local disk, which also allows better performance, as redundancy is guaranteed at the DB level.

Another important aspect is that you can load your DB schema into Vertica, and it uses what they call the "Automatic DB Designer" to facilitate the whole process of DB design. You can use the designer repeatedly at different points in time, adjusting as your usage evolves.

Finally, Vertica has a standard SQL interface, and supports ODBC, JDBC, and the majority of ETL and BI reporting products.

Disclaimer: note that these are my own notes. HP is not responsible for any of the content provided here.