Overview HP Vertica vs AWS Redshift

While working in HP some years ago, I was exposed to not only internal training materials, but also a demo environment. I still remember the excitement when HP acquired Vertica Systems in 2011, and we had a new toy to play with… Come on, you can’t blame me, distributed DBs was something only the cool kids were doing.

Bottom line is that it’s been a while since I laid eyes on it… Well recently, while considering possible architectural solutions, I had the pleasure  to revisit Vertica. And since AWS Redshift has been gaining a lot of popularity and we’re also using it at some of our clients, I thought I might give some easy summary to help others.

Now if you’re expecting a smack down post, then I’m afraid I’ll disappoint you – for that you have the internet. If experience has taught me  something is that – in the case of top-notch solutions there are only use cases, and one finds the best fitting one.

They share some properties in terms of architecture in internal engine:

  • Massively Parallel Processing (MPP) architecture: data is distributed among distinct nodes in shared nothing architecture, leveraging scale out and; in case you’re wondering how it compares to hadoop Hive, AirBnB did a smack-down comparison, concluding around 5x advantage of Redshift over Hive;
  • High availability (HA): this follows the first, thanks to to data replication mechanism; in the case of Vertica, they call it”k-safety” for measuring replication factor; and you may also want to check fault groups to control how data is replicated according to physical distribution (such as server rack location, power circuits, etc); same automatic data replication among nodes happens with Redshift under the hood (besides more goodies such as backups)
  • columnar oriented data store: for analytical applications/OLAP (where queries usually select only specific columns in opposition to OLTP), it is usually much faster mainly due to a) does not need to scan whole row and then discard the unnecessary content, b) higher efficiency in compression/encoding mechanisms due to similar data types; for more detailed explanation, I suggest here. Both Vertica and Redshift are built with this architecture;
  • Data compression: Vertica mixes encoding strategies, depending on column data type, table cardinality, and sort order; they do distinguish a difference between encoding and compression, since it will operate directly on encoded data whenever possible, which does not hold true for compression; Redshift does not make such a distinction, and recommends leaving compression to auto mode, although you can choose encoding type;
  • SQL standard interface: as always, minor differences are present, but bottom line you can use the SQL syntax that you’re already accustomed to
  • User Defined Functions (UDFs): both Redshift and Vertica give you space for customization;
  • In-memory DB: Nop, none of these are like SAP HANA, Oracle TimesTen or IBM’s SolidDB

Where they differ:

  • Architecture: in Vertica all nodes are “created equally”, meaning they share similar functions; Redshift has the concept of a leader node, a dedicated node which manages workload and query coordination among worker nodes;
  • Management: (this is a key differentiator that most likely that the biggest weight in the final decision) with Vertica you have to do all the ops work (install, upgrade/update, configure nodes, etc.);  Redshift is a fully managed Cloud solution, where you only have pure Database related ops work; Note: yes, HP provides an AMI to easily kickstart projects in AWS cloud, but come on, this is still not the same thing;
  • Freedom of environment: with Redshift you’re locked in to AWS; with Vertica you can run it wherever you feel like;
  • Schema Design: Vertica provides you with a Designer Tool to easily migrate from traditional RDBMS systems based on their schema (not saying this saves the world, but can be helpful); this is specially important in the beginning, since columnar data warehouses don’t support indexes; so in Vertica you play with the projection concept, in Redshift with distribution and sort keys (and you better do it well, as it will be key for performance and keeping things balanced)
  • Payment scheme: In Vertica (for data bigger than 1TB, which most certainly is the case) you pay upfront licensing, plus the cost of the machines where you’re running them; with Redshift costs all diluted into an hourly cost, and that’s it;
  • Compiled code: Redshift claims that the leader node compiles the code for optimal performance on execution time, which the guys at Cake also confirm with this excellent post;
  • Free Trial/usage: Vertica lets you have up to 3 nodes and 1TB of data;  Redshift, on the other hand, in case you’re account is still electable for the free usage tier (in the first year), you can try for a total of 750 normalized instance hours per month, enough for running continuously one DC1.Large single node, with 160TB SSD storage
  • Add-Ons: Vertica Pulse for sentiment analysis and Place for geospatial data analysis;

Finally, you might want to go deeper. Again, I really suggest this excellent post by Cake, which provides performance benchmarks. Benchmarks are always disputable; but still, it is always interesting and important comparison method.

Check out OpenFace

If you’re interested in using machine learning (ML) on image and video datasets, then you might be interested in heaving a look on a relatively new project called OpenFace (first released in October 2015), with  Brandon Amos, Ludwiczuk Bartosz and Mahadev Satyanarayanan as authors.

TL;DR: For the impatient

  • Pitch me: Open source project (aka free for you to use) developed inside Carnegie Mellon University  for face recognition with deep neural networks, with a Python API
  • What do I get from it: improved accuracy and reduced training time
  • Need to see to believe (and so one should)? You can start playing with it with Docker, and check the provided demos

What about it

Even though face recognition research has already started since the 1970s, it is still far from stagnant. The usual strategy for solving the problem has been divided into three main steps; given an image with a set of faces, first run face detection algorithm to isolate the faces from the rest, then preprocess this cropped part to reduce the high dimensionality, and finally classification. However, what makes this whole process so challenging is that many factors can create noise around this process, such as images can be taken from different angles, different lighting conditions, the face itself also suffers changes throughout time (for example due to age or style), etc.

Now one important fact to point out is that state of the art top performing algorithms are using convolutional neural networks. OpenFace is inspired by Facebook’s DeepFace and (mainly) Google’s FaceNet systems. The performance smack down that the authors present using the “Labeled Faces in the wild” dataset (LFW) for eveluation, and achieved some interesting results.

Another interesting point is that, as the authors state, the implementation is tuned for using the model in mobile devices, so the  “[…] key design consideration is a system that gives high accuracy with low training and prediction times“.

Note: In case you are wondering what’s the difference to OpenBiometrics (OpenBR). As stated by the authors of OpenFace in HackerNews, the main difference lies on the approach taken – deep convolutional networks – and could potentially be integrated into OpenBR’s pipeline.

Internal Guts

As you might imagine (as any image/video processing package), dependencies are complex and time consuming, so prepare yourself for some dependencies troubleshooting in case your machine is still new to this world.

The project’s API is written in Python 2 – entry point here – given its dependencies on OpenCV and DLib. OpenCV provides the computer vision base, DLib enhances OpenCV face detection ability, numpy for matrix algebra operations and scikit-learn for classification operations.

For training the convolutional network openface uses Torch, Lua and Luajit which is written in Lua programming language. In this case, Torch allows the neural networks to be executed either in CPU or CUDA enabled GPUs.

The following illustration was extracted from the recent technical report “OpenFace: A general-purpose face recognition

library with mobile applications“, by the authors, and provides interesting insight:

openface_diagram

So important to note is that you do have the option to use already pretrained models (which use the CASIA-WebFace and FaceScrub databases) to help with face detection, which you can find in the models directory. The provided bash script downloads them.

Where to get started

To setup either locally or with Docker you can check the provided documentation.

Finally, you might also be interested in having a look at other projects using deep neural networks for face recognition:  Visual Geometry Group (VGG) Face Descriptor, and Lightened Convolutional Neural Networks (CNNs)