Build a Data Pipeline with AWS Athena and Airflow (part 1)

In this post, I build on the knowledge shared in the previous post on creating Data Pipelines with Airflow, and introduce new technologies that help with the Extraction part of the process, with cost and performance in mind. I'll go through the options available and then introduce a specific solution using AWS Athena. First we'll establish the dataset and organize our data in S3 buckets. Afterwards, you'll learn how to make this information queryable through AWS Athena, while making sure it is updated daily.

Data dumps of not-so-structured data are a common byproduct of Data Pipelines that include extraction. This happens by design: business-wise, and as Data Engineers, there is never too much data. From an investment standpoint, however, object-relational database systems can become increasingly costly to maintain, especially if we aim to keep performance up while the data grows.

That said, this is not a new problem. Both Apache and Facebook have developed open source software that is extremely efficient at dealing with extreme amounts of data. While such software is written in Java, it exposes an abstracted interface to the data that relies on traditional SQL to query data stored on filesystem storage – such as S3 in our example – and in a wide range of different formats, from HDFS to CSV.

Today we have many options to tackle this problem, and I'm going to go through how to approach it in today's serverless world with AWS Athena. For this we need to quickly rewind back in time and go through the technology… Continue reading “Build a Data Pipeline with AWS Athena and Airflow (part 1)”

Airflow: create and manage Data Pipelines easily

This bootstrap guide was originally published at GoSmarten but as the use cases continue to increase, it’s a good idea to share it here as well.

What is Airflow

The need to perform operations or tasks, either simple and isolated or complex and sequential, is present in all things data nowadays. If you or your team work with lots of data on a daily basis, there is a good chance you've struggled with the need to implement some sort of pipeline to structure these routines. To make this process more efficient, Airbnb developed an internal project conveniently called Airflow, which was later fostered under the Apache Incubator program. In 2015 Airbnb open-sourced the code to the community and, although its trustworthy origin played a role in its popularity, there are many other reasons why it became widely adopted (in the engineering community). It allows for tasks to be set up purely in Python (or as Bash commands/scripts), as in the small illustrative example below.
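As a taste of what that looks like, here is a minimal, illustrative DAG sketch (task names, schedule and the toy logic are arbitrary, and it is written against the Airflow 1.x API of the time – not taken from the tutorial itself):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def transform():
    # placeholder for your Python transformation logic
    print("transforming data...")


# A tiny DAG with one Bash task and one Python task, scheduled daily
dag = DAG(
    dag_id="example_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(
    task_id="extract",
    bash_command="echo 'extracting data...'",
    dag=dag,
)

transform_task = PythonOperator(
    task_id="transform",
    python_callable=transform,
    dag=dag,
)

extract >> transform_task  # transform runs only after extract succeeds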

What you’ll find in this tutorial

Not only will we walk you through setting up Airflow locally, but you'll do so using Docker, which optimizes the conditions for learning locally while minimizing the effort of transitioning into production. Docker is a subject in itself, to which we could dedicate a few dozen posts; however, it is also simple enough and has all the magic needed to show off some of its advantages when combined with Airflow. Continue reading “Airflow: create and manage Data Pipelines easily”

Consuming Kinesis Streams with Lambda functions locally

This blog post was originally published at the GoSmarten website. As the number of projects where we use it kept increasing, we thought we might as well share it. Let me know if it was helpful!

Motivation

AWS offers the cool possibility to consume from Kinesis streams in real time in a serverless fashion via AWS Lambda. However, it can become extremely annoying to have to deploy a Lambda function in AWS just to test it. Moreover, it is also expensive to keep a Kinesis stream (i.e. a queue) up and running just to test code.

Thus, by combining the Kinesis Client Library (KCL) with AWS Kinesis and DynamoDB docker containers, we were able to recreate locally everything that happens in the background when you plug a Lambda function into a Kinesis stream on AWS. Besides saving costs, this allows developers to substantially reduce development time, as well as develop higher quality code due to the ease and flexibility of testing different scenarios locally.

Feel free to check out the code supporting this blog post on our repository.

Context: Event Sourcing/CQRS

 

Event sourcing is not a new concept, but as available streaming technologies have evolved, its widespread use has gained the attention it deserves. Thanks to “publish-subscribe” types of queues, it has become much easier to build streams of events available to multiple consumers at the same time. This democratization of access to an immutable, append-only stream of events is essential, as it separates the responsibility of modelling an event schema from any particular consumer logic. It is also the reason why so many people either argue CQRS and event sourcing are the same thing, or have a symbiotic relationship.

The consumer, on its side, can then choose how to represent a given fact and ingest it in light of its own business framing. Now, it is important to note that although consumers use events to constantly mutate their state representation and store data in a database, that does not mean they are locked to that interpretation forever. As raw events are stored in an immutable fashion and decoupled from the consumers' interpretation, state can always be replayed and reconstructed to any given point in time.

This is where AWS Kinesis shines, offering a queue as a service, offloading a lot of the maintenance effort and complexities from your development and/or operations teams.

[Figure: Kinesis architecture]

Furthermore, having serverless streaming consumption can further offload a substantial chunk of work from streaming projects. Lambda functions obviously have limitations, and cannot stand by themselves compared to proper streaming engines such as Apache Flink, Apache Spark, etc. The first limitation is that each execution is stateless and independent of the previous one. The implication is that, unless you query some external data source, you are left alone at the party with a collection of events, completely blind to what happened in the previous execution. To add to that, at the time of writing this blog post, lambda functions can only run for up to 5 minutes.

However, depending on your requirements, there are potential ways around their limitations. One example would be using materialized views, supported in all main relational databases. Depending on your write load, some people even consider database triggers, though this might be a ticking bomb in the medium/long term. Another strategy could be leveraging further AWS goodies, using DynamoDB as a stateful layer and DynamoDB streams to progressively evolve as potential out of order events arrive.

Thus, plugging AWS Lambda functions into your streaming execution can be a viable solution, depending on the complexity of the application you are building. Next, we dive into more detail on how AWS actually implements the plugging of Lambda functions into Kinesis streams.

AWS Lambda for streaming

AWS Lambda service has come a long way since it was launched, and it integrates with numerous other services. One of the ways it integrates with other services is by allowing you to specify other services as triggers for lambda execution. In our case, the service is Kinesis streams. The way this works is by having the AWS Lambda service constantly polling the stream and invoking a particular lambda function.

When using AWS services as a trigger for lambda invocation, that invocation is predetermined by the service. As stated in the AWS documentation, in the case of stream-based services – at the time of writing this blog post that means Kinesis streams and DynamoDB streams – the invocation will always be synchronous. However, the polling of the stream is done in parallel, where the parallelism level is determined by the number of shards a given stream has. In practice, that can result in as many lambda functions running simultaneously as there are shards. If this represents a potential issue for you, one way to minimize ordering issues is to adapt the partition key in your producer to group events conveniently.

Last but not least, one of the cool things about this solution is that the polling of the stream is done in the background by AWS. Every second AWS will poll the stream and, if there are records, it will pass that collection of records to your lambda function. However, don't fear: you can and should customize this batch size of records, given that, as previously mentioned, lambda functions can only run for up to 5 minutes.
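For reference, a minimal Python handler of the kind AWS invokes with such a batch could look like the sketch below; the record layout follows the standard Kinesis event format, while the processing logic is just a placeholder (this is not the TestLambda from our repo):

import base64
import json


def handler(event, context):
    """Invoked by AWS Lambda with a batch of Kinesis records."""
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # placeholder for your processing logic
        print("partition key:", record["kinesis"]["partitionKey"], "->", message)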

Running locally

The following steps assume that you have cloned our public repo locally.

1) Start the Docker environment

Although we are big fans of docker compose, we have chosen to bootstrap the docker containers via bash scripts instead, for two main reasons:
a) We wanted to give developers the flexibility to choose which containers to start. For example, the Java consumer implementation requires a local DynamoDB, whereas the Python one doesn't;
b) We wanted to have the flexibility to automate additional functionality, such as creating Kinesis streams and DynamoDB tables, listing them after boot, creating a local AWS CLI profile, etc.
To bootstrap all containers:

cd docker/bootstrapEnv && bash +x runDocker.sh

If you check the runDocker.sh bash script, you will see that it will:
a) start the DynamoDB docker container
b) locally set up a fake AWS profile to interact with the local Kinesis
c) start the Kinesis container
d) create a new Kinesis stream

Note that we are not persisting any data from these containers, so every time you start any of them it will be “brand new”.

Also relevant to point out is that we are creating the stream with three shards. In AWS Kinesis terminology, shards are how the queue is partitioned, and also how one would improve the read/write capacity of a given stream. However, in reality this is completely mocked, since we are running a single docker container, which “pretends” to have X amount of shards/partitions.

2) Publish fake data to Kinesis stream

To help you get started, we provide yet another bash script that pushes mock data (JSON-encoded) to the Kinesis stream previously created.

To start producing events:

cd docker/common && bash +x generateRecordsKinesis.sh

This will continuously publish events to the Kinesis stream until you decide to stop it (for example by hitting Ctrl+C). Note also that we are randomly publishing to any of the 3 available shards. For future reference, besides adapting our mock event data to your requirements, specifying the partition key might also be something you want to enforce, depending on how your producers publish data into Kinesis.
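If you later want to control the partition key explicitly from a Python producer, a minimal boto3 sketch could look like the following; the endpoint URL, port, region and stream name are assumptions about the local setup, not values taken from the repo:

import json

import boto3

# Point the client at the local Kinesis container rather than real AWS
# (the port below is an assumption about the local container)
kinesis = boto3.client(
    "kinesis",
    endpoint_url="http://localhost:4567",
    region_name="us-east-1",
)

event = {"user_id": 42, "action": "click"}

# Events with the same partition key always land on the same shard,
# which preserves their relative order
kinesis.put_record(
    StreamName="test-stream",
    Data=json.dumps(event),
    PartitionKey=str(event["user_id"]),
)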

3) Start consuming from kinesis stream

At this point, you have everything you need to test the execution of a Lambda function. We have provided an example of a basic Lambda – com.gosmarten.mockcloud.localrunner.TestLambda – that just prints each event. To actually test it running, you need to run the com.gosmarten.mockcloud.localrunner.RawEventsProcessorLambdaRunner class. This class continuously iterates over each stream shard and polls for data, which it then passes to our lambda as a collection of Records.

4) How to test your own Lambda functions

final KinesisConsumerProcessorManager recordProcessorFactory = new KinesisConsumerProcessorManager<>(TestLambda::new);
recordProcessorFactory.setAwsProfile(AWS_PROFILE).runWorker(STREAM_NAME, APP_NAME);

Instead of “TestLambda”, specify your own. And … that’s it!

Last but not least, stay tuned as we plan to update the original repo with the python version. Happy coding!


Using Akka Streaming for “saving alerts” – part 2

This blog post is the second and final part of the post Using Akka Streaming for “saving alerts”. Part 1 is available here. In this part we go into the details of how the application was designed.

Full disclosure: this post was initially published at Bonial tech blog here. If you are looking for positions in tech, I would definitely recommend checking their career page.

Application Actor System

The following illustration gives you a schematic view of all the actors used in our application, and (hopefully) some of the mechanics of their interaction:

[Figure: Akka streaming pipeline – actor system overview]

As previously mentioned, one could divide the application lifecycle logically into three main stages. We will get into more detail about each one of them next, but for now let us walk through them briefly and try to map them to our application.

Main Actors

The main actors in our application are KafkaConsumerActor, CouponRouterActor and PushNotificationRouterActor. They perform the core business tasks of this application:

  • Consume events from Kafka and validate them – this is done by KafkaConsumerActor. This is also the actor that controls the whole Akka Streaming pipeline/flow. The flow is controlled so that we can be sure not to overflow the main downstream Akka actors – CouponRouterActor and PushNotificationRouterActor – with more events than they can handle.
  • Query the Coupon API – for a given merchant and a given user, we query the Coupon API for available coupons. Those results are sent back to the Akka Streaming pipeline.
  • Apply business rules & fire (or not) a push notification – the last key stage involves sending the returned results to PushNotificationRouterActor for it to apply a given set of business rules. In case those rules consider the event valid, a push notification may be fired, provided none has been sent in the last X hours.

Not mentioned yet is MetaInfoRouterActor. It is used with the sole purpose of monitoring statistics throughout the whole Akka Streaming pipeline flow. As noted in the illustration, it is not a core feature of the application itself, and thus we send all messages to our monitoring service in a “fire and forget” manner – that is, we do not wait for acknowledgement. This of course implies that there is the possibility of messages not being delivered, and ultimately not landing in our monitoring system. However, this was considered a minor and negligible risk.

Secondary Actors

On the sidelines, and as secondary services serving the main actors, we have three actors: AppMasterActor, MetaInfoActor and RulesActor.

The AppMasterActor has two main functions: controlling the discovery protocol that we implemented, and hosting the healthcheck endpoint used for outside application monitoring.

The so-called discovery protocol basically makes sure that all actors know where – on which servers – the other actors are, so that, theoretically speaking, we could separate each actor onto different servers in a scale-out fashion. As a side note, we would like to highlight that this discovery protocol could have been implemented using the Distributed PubSub module from Akka – which would definitely be more advisable in case our application grew in the number of actors. Full disclosure: at the time, due to the simplicity of our current app, it seemed easier to implement it ourselves and keep the project smaller, which might be a questionable technical architecture decision.

Technically speaking, MetaInfoActor and RulesActor are almost identical in their implementation: they basically have a scheduled timer to remind them to check an S3 bucket for a given key, stream-load it into memory, and broadcast it to their respective client actors.

As explained in the previous section, routers host many workers (so-called “routees”) behind them, serving as a … well yes, a router in front that directs traffic to them. All the actors that are routers have it explicitly referenced in their name. Thus, when we say the MetaInfoActor or the RulesActor broadcasts a message, in fact we are just sending one single message to the respective router wrapped in a Broadcast() case class; the router then knows that it should broadcast the intended message to all its routees.

Scalability & HA

All the actors depicted in the illustration live on the same server. As a matter of fact, for the time being we are scaling out the application in a kind of “schizophrenic manner” – we deploy the application on different servers, and each server runs a completely isolated application, unaware of the existence of its twin applications. In other words, actors living inside server 1 do not communicate with any actor living in server 2. Thus we like to call our current deployment “Pod mode”. We are able to achieve this because all the application “Pods” consume events from Kafka using the same consumer group. Kafka intelligently assigns partition IDs to the several consumers. In other words, Kafka controls the distribution of partitions to each Pod, thus allowing us to scale out in a very simple manner:

[Figure: Kafka partitions distributed across application Pods]

To increase performance, we can scale out the number of KafkaConsumerActors up to the same number of Kafka partitions. So, for example, if we had a topic with three (3) partitions, we could improve consumption performance by scaling up to three (3) KafkaConsumerActors.

To address High Availability (HA), we could, theoretically speaking, run N+1 KafkaConsumerActors, where N is the number of partitions, if our application were mission critical and very sensitive to performance. This would, however, only potentially improve the HA of the application, as the additional KafkaConsumerActor would sit idle (i.e. not consuming anything from Kafka) until another KafkaConsumerActor node failed.

Moreover, in case you are wondering, not having N+1 KafkaConsumerActors does not severely harm the HA of our application, as Kafka will reassign partition IDs among the remaining consumers in the event of a consumer failure. However, this obviously means that consumers will be unbalanced, as one of them will be simultaneously consuming from two partitions.

Now, you may ask what happens in the case of failure of a node that was processing a given event. Well, at the end of the Akka Streaming pipeline each KafkaConsumerActor commits the message offset back to Kafka – thus acknowledging consumption. So in this case, after the default TTL of message consumption that is configured in Kafka passes, we know that the message was not successfully processed (no acknowledgement), and another KafkaConsumerActor will read that same message again from Kafka, thereby reprocessing it.

As mentioned previously, when an event has been processed by the system, KafkaConsumerActor will commit that event's offset back to Kafka, thereby acknowledging to Kafka that a given message has been successfully consumed for its Kafka Consumer Group. We can't stress this enough (and thus repeat ourselves): this is how we are able to guarantee at-least-once semantics when processing a given message. Note that in our case, since we are storing the offsets in Kafka, our implementation cannot guarantee exactly-once semantics. Nonetheless, this does not constitute a problem, as we later use a Redis cache to make sure events are not acted upon twice. For more information about the Akka Kafka consumer, please check here.
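To make this consume-process-commit flow more concrete, here is a minimal sketch using the akka-stream-kafka connector of the time and the Ask pattern. The router actors, topic name, parallelism and consumer settings are illustrative stand-ins, not our production code:

import akka.actor.{Actor, ActorSystem, Props}
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.pattern.ask
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import akka.util.Timeout
import org.apache.kafka.common.serialization.StringDeserializer

import scala.concurrent.duration._

// Stand-in for CouponRouterActor / PushNotificationRouterActor: just echoes back
class EchoRouter extends Actor {
  def receive: Receive = { case msg => sender() ! msg }
}

object SavingAlertsPipelineSketch extends App {
  implicit val system: ActorSystem = ActorSystem("saving-alerts")
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  implicit val askTimeout: Timeout = Timeout(5.seconds)
  import system.dispatcher

  val couponRouter           = system.actorOf(Props[EchoRouter], "couponRouter")
  val pushNotificationRouter = system.actorOf(Props[EchoRouter], "pushNotificationRouter")

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("saving-alerts-group")

  Consumer
    .committableSource(consumerSettings, Subscriptions.topics("tracking-events"))
    // Ask pattern: the stream only pulls more records from Kafka
    // once the routers have answered (back-pressure)
    .mapAsync(parallelism = 4) { msg =>
      (couponRouter ? msg.record.value()).map(coupons => (msg, coupons))
    }
    .mapAsync(parallelism = 4) { case (msg, coupons) =>
      (pushNotificationRouter ? coupons).map(_ => msg)
    }
    // Only once an event has gone through the whole flow do we commit its offset,
    // which is what gives us the at-least-once semantics described above
    .mapAsync(parallelism = 1)(msg => msg.committableOffset.commitScaladsl())
    .runWith(Sink.ignore)
}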

Let us address scalability in the rest of the application by taking the CouponRouterActor architecture into consideration.

[Figure: CouponRouterActor scalability via routees]

 

As shown in the previous illustration, performance is scaled by using Akka “routees” behind CouponRouterActor (as well as behind PushNotificationRouterActor). One of the beauties of Akka is that it allows us to code the CouponRouterActor 99% the same as if it were not operating as an Akka Router. We simply declare its Router nature on actor instantiation, and the rest is handled by Akka.

Final remarks

We have gone into more detail on each stage above. However, we would like to highlight the importance of the Akka Streaming pipeline. It is able to control how many messages should be read from Kafka because it sends messages to CouponRouterActor and PushNotificationRouterActor using the Ask pattern – which waits for a response (up to a given time-to-live (TTL)).

Also note that no matter how far an event goes down the flow (an event may, for example, be filtered right at the beginning in case it is considered invalid), we always log to Datadog that a given message was read from Kafka and successfully processed. Note that “successfully processed” can have different meanings – the event was considered invalid right at the beginning of the streaming pipeline, no available coupons were returned from the Coupon API, or the business rules determined that the system should not send a push notification to the Kepler API.

Moreover, please note that when an event's processing is finished – again, no matter how far it went down the stream pipeline – KafkaConsumerActor has the final task of committing that event's offset back to Kafka. In other words, we acknowledge back to Kafka that a given message has been processed. This is an important detail – in case of failure while processing a given event (let's say one of the application servers crashes), after the default TTL of message consumption that is configured in Kafka passes, another KafkaConsumerActor will read that same message again from Kafka, thus reprocessing it.

 

Docker environment

Currently we are only using Docker for local development, although this application would fit quite well in, say, a Kubernetes cluster.

We have set up a complete emulation of the production environment locally via Docker.

This is (extremely) useful not only to get a better grip of how the system works in day-to-day development, but also to run harder-to-emulate behavioural tests, such as High Availability (HA) tests.

Final Notes

Like any application, there are a number of things that could have been done better but, due to practical constraints (mainly time), were not. Let us start with some of the things we do not regret:

  • Using Akka: there are many ways we could have implemented this application. Overall, Akka is a mature, full-fledged framework – it contains modules for almost anything you may require while building distributed, highly available, asynchronous applications – and with very satisfactory performance.
  • Using Akka Streaming: there are many blogs out there with horror stories about constant performance issues with pure Akka implementations. The Akka Streaming module not only increases stability via back-pressure, it also provides a very intuitive and fun-to-work-with API.
  • Using Docker locally: this allowed us to test rarer scenarios very easily and, especially, rapidly on our local machines, such as simulating failures at all points in the application: Kafka nodes, Redis, S3, and of course, the Akka application itself.

Some open topics for further reflection:

  • Using our own discovery protocol ended up being a questionable technical decision. One possible alternative could have been using the Akka “DistributedPubSub” module.
  • Ideally, this application would be a very nice initial use case to start using Container orchestration tools, such as Kubernetes

 

And … that’s all folks. We hope that this post was useful to you.

Using Akka Streaming for “saving alerts” – part 1

Full disclosure: this post was initially published at the Bonial tech blog here – one of my favorite companies at the heart of Berlin, where I have been fortunate enough to be working for 2+ years as a freelance Data Engineer. If you are looking for positions in tech, I can't help but recommend checking their career page.

Overview

Some months ago I was working on an internal project at Bonial using Akka Streaming (in Scala) to provide additional features to our current push notification system. The main goal of the project was to enhance the speed with which the client is able to notify its end users of available discount coupons. In this case, we wanted to notify users in real time of available coupons in store, so that they could use them more effectively on the spot and save money. Hence our project code name: “saving alerts”!

After some architectural discussions where we compared several technical options, we decided to give akka streaming a go. It has been a fun ride, so we thought we might as well share some of the lessons learned.

This post has been divided into two parts:

  1. Part 1 – we provide an overview about the main tech building blocks being used in this project (mainly focusing on Akka)
  2. Part 2 – details how the system was designed and implemented

 

Without further ado, let us start with an overview of our “saving-alerts” application:

[Figure: “saving-alerts” application overview]

Logically speaking one can group the tasks executed by our application into three (3) main stages:

  1. Read events from a given Kafka topic (collected from our own tracking API) and perform basic validation; each event belongs to a unique user and is triggered by our mobile App;
  2. Query an internal service – the so-called “Coupon API” – to check if there are any available coupons for that given user;
  3. Apply a set of business logic rules – which at the moment are determined by our Product Managers – that determine whether, in the end, to send or not to send a push notification to that user's mobile app.

Besides these main logical tasks, we also run some other scheduled “offline” housekeeping processes, namely: loading updated versions of the meta information about our retailers and the business logic rules from an S3 bucket into memory, renewing an auth token to talk internally with the client's API, and, obviously, logging app statistics for monitoring purposes.

In terms of tech stack, relevant for this project is simply the Akka actor system, a dedicated single-node Redis instance, and some dedicated S3 buckets. All the rest – such as the tracking API, the Kafka queue, the Authentication API, the Coupon API, the Monitoring Service and the Push Notification API – is viewed as external services from the app's point of view, even though most of them live inside the company.

Though not particularly relevant for this project, the whole Akka application was deployed on AWS EC2 instances. As we state in our final conclusion notes, a good fit for this application would also be to use a Docker container orchestration service such as Kubernetes.

Before we dive deep into how we implemented this system, let us first review the main technical building block concepts used in the project.

System Architecture

The main building block of the current application is the Akka framework. Hopefully this section will guide you through some of the rationale behind our decisions, and ideally why we chose Akka for this project.

About Akka

Why akka

Let's start from the very basics: building concurrent and distributed applications is far from being a trivial task. In short, Akka comes to the rescue for this exact problem: it is an open source project that provides a simple and high-level abstraction layer in the form of the Actor model, to greatly simplify dealing with concurrent, distributed and fault-tolerant applications on the JVM. Here is a summary of Akka's purpose:

  • provide concurrency (scale up) and remoting (scale out)
  • easily get rid of race conditions and multi-threading locking problems, such as deadlocks (“[…] when several participants are waiting on each other to reach a specific state to be able to progress”), starvation (“[…] when there are participants that can make progress, but there might be one or more that cannot”), and livelocks (when several participants are granting each other a given resource and none ends up using it) – please see akka documentation, which does a fantastic job explaining it;
  • provide easy programming model to code a multitude of complex tasks
  • allow you to scale horizontally in your Application

 

There are three main Building blocks we are using from Akka framework in this project:

  • Akka actors
  • Akka remote & clustering
  • Akka streaming

Akka Actors

The actor system provides an asynchronous, non-blocking, highly performant, message-driven programming model for distributed environments. This is achieved via the Actor concept.

An Actor is a sort of safety container: a lightweight, isolated computation unit which encapsulates state and behaviour.

In fact, actor methods are effectively private – one cannot call methods on actors from the outside. The only way to interact with actors is by sending messages – this also holds true for inter-actor communication. Moreover, as stated in the Akka documentation: “This method (the “receive” method, which has to be overridden by every actor implementation) is responsible for receiving and handling one message. Let's reiterate what we already said: each and every actor can handle at most one message at a time, thus receive method is never called concurrently. If the actor is already handling some message, the remaining messages are kept in a queue dedicated to each actor. Thanks to this rigorous rule, we can avoid any synchronization inside actor, which is always thread-safe.”

Here is an example of a basic Actor class (written in Scala), retrieved from Akka's documentation and changed in minor details:
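Something along these lines – a minimal sketch matching the description in the next section; the original snippet may differ in details:

import akka.actor.{Actor, ActorLogging}

class MyActor extends Actor with ActorLogging {
  var counter = 0  // internal state, safely isolated inside the actor

  def receive: Receive = {
    case "test"   => log.info("received test")
    case "mutate" => counter += 1
    case _        => log.info("received unknown message")
  }
}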

 

The receive method

In order to create an actor, one needs to extend the Actor trait (sort of a Java interface), which mandates the implementation of the “receive” method – basically where all the action happens. In our example, in case MyActor receives the message “test”, the actor will log “received test”, and if it receives the message “mutate”, it will mutate its local variable by incrementing it by one. As each message is handled sequentially, and there is no other way to interact with an actor, it follows that you do not have to synchronize access to internal actor state variables, as they are protected from state corruption via isolation – this is what is meant when one says that actors encapsulate state and behaviour.

As mentioned before, the receive method needs to be implemented by every actor. The receive method is a PartialFunction, which accepts the “Any” type and returns Unit (void). You can confirm this in Akka's source code, namely the Actor object implementation:
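In essence it boils down to the following (paraphrased from the Actor companion object):

object Actor {
  // the message-handling behaviour is just a partial function from Any to Unit
  type Receive = PartialFunction[Any, Unit]
}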

By the way, as a side note, the receive method being an untyped PartialFunction is also one of the main points of criticism raised by Akka Streaming proponents, due to the lack of type safety.

In the provided example we are using simple strings as messages (“test” and “mutate”). However, usually one uses Scala case classes to send messages since, as a best practice, messages should be immutable objects which do not hold anything mutable. Finally, Akka will take care of serialization in the background. However, you can also implement custom serializers, which is recommended especially in the case of remoting, in order to optimize performance or handle complex cases. Here is an example of how two actors can communicate with each other:
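The sketch below illustrates the idea; the DudeA/DudeB names follow the discussion in the next paragraphs, while the message contents are arbitrary (in real code the messages would typically be immutable case classes rather than plain strings):

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

class DudeA(buddy: ActorRef) extends Actor {
  // fire-and-forget send: no acknowledgement is expected
  override def preStart(): Unit = buddy ! "hallo"

  def receive: Receive = {
    case reply: String => println(s"DudeA got an answer: $reply")
  }
}

class DudeB extends Actor {
  def receive: Receive = {
    // sender() is the ActorRef of whoever sent the current message,
    // so we can reply to it directly
    case "hallo" => sender() ! "hallo back"
  }
}

object DudesDemo extends App {
  val system = ActorSystem("dudes")
  val dudeB  = system.actorOf(Props[DudeB], "dudeB")
  val dudeA  = system.actorOf(Props(new DudeA(dudeB)), "dudeA")
}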

To send a message – including a reply to a message received – one uses the exclamation mark “!” notation. This is a “fire and forget” way of sending a message (which means there is no acknowledgement from the receiver that the message was indeed received). In order to have an acknowledgement, one can use the ask pattern instead, with the question mark “?”.

Also note that to retrieve the sender of the current message we call the “sender” method, which returns a so-called “ActorRef” – a reference to the address of the sender actor. Thus, if actor DudeB receives the message “hallo” from actor DudeA, it is able to reply just by calling the sender() method, which is provided in the Actor trait – exactly as DudeB does in the example above.

Finally, messages are stored in the recipient's Mailbox. Though there are exceptions (such as routers, which we will see later), every actor has a dedicated Mailbox. A Mailbox is a queue to store messages.

Message ordering

Important to note is that message order is not guaranteed across senders. That is, if Actor B sends a message to Actor A at a given point in time, and later Actor C sends a message to the same Actor A, Akka does not provide any guarantee that Actor B's message will be delivered before Actor C's message (even though Actor B sent its message before Actor C did). This would be fairly difficult for Akka to control, especially in the case where actors are not co-located on the same server (as we will discuss later) – if Actor B is experiencing high jitter on its network, for example, it might happen that Actor C gets its message through first.

Though order between different actors is not guaranteed, Akka does guarantee that messages from the same actor to another actor will be ordered. So, if Actor B sends one message to Actor A, and then later sends a second message again to Actor A, one has the guarantee that, assuming both messages are successfully delivered, the first message will be processed before the second.

 

Besides being processed sequentially, it is also relevant to note that messages are processed asynchronously, to avoid blocking the current thread where the actor resides. Each actor gets assigned a lightweight thread – you can have several million actors per GB of heap memory – which is completely shielded from other actors. This is the first fundamental advantage of Akka – providing a lightning-fast asynchronous paradigm/API for building applications.

As we will see next, Akka provides many more building blocks which enhance its capabilities. We will focus on how Akka benefits this application specifically, namely how it provides an optimized multi-threading scale-up (inside the same server) and scale-out (across several remote servers) environment for free.

Akka Routers (scale-up)

Router actors are a special kind of actor that make it easy to scale up. That is, with exactly the same code, one can simply launch an actor of a Router type, and it automatically starts child actors – so-called “routees” in Akka terminology.

The child actors will have their own Mailboxes; however, the Router itself will not. A router serves as a fast proxy, which just forwards messages to its own routees according to a given algorithm. For example, in our application we are simply using a round-robin policy. However, other (more complex) algorithms could be used, such as load balancing by routee CPU and memory statistics, a scatter-gather approach (for low-latency requirements, for example), or even simply broadcasting to all routees.

[Figure: Akka router architecture – router forwarding messages to its routees]

The main advantage of Routers is that they provide a very easy way to scale up the multi-threading environment. With the same class code, and simply by changing the way we instantiate the actor, we can transform an actor into a Router.
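For illustration, a minimal sketch along these lines (the Worker class, pool size and names are arbitrary, not taken from our application):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

class Worker extends Actor {
  def receive: Receive = {
    case msg => println(s"${self.path.name} handled: $msg")
  }
}

object RouterDemo extends App {
  val system = ActorSystem("router-demo")

  // Plain actor: a single instance with its own mailbox
  val single = system.actorOf(Props[Worker], "worker")

  // Same class, now instantiated as a round-robin router with 5 routees
  val router = system.actorOf(RoundRobinPool(5).props(Props[Worker]), "workerRouter")

  // Messages sent to the router are forwarded to one of its routees
  (1 to 10).foreach(i => router ! s"message $i")
}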

Akka Remote & clustering modules (scale-out)

To distribute actors across different servers one has two modules available: Akka Remoting and, dependent on the first, Akka Clustering. Akka Remoting provides location transparency to actors, so that the application code does not have to change (well, negligibly) whether an actor is communicating with another actor on the same server or on a different one.

Akka Clustering, on the other hand, goes one step further and builds on top of Akka Remoting, providing failure detection and potentially failover mechanisms. The way clustering works is by having a decentralized peer-to-peer membership service with no single point of failure nor single point of bottleneck. The membership is handled via a gossip protocol, inspired by Amazon's Dynamo design.

As we will see later, given the way we scale this application, the clustering advantages are not currently being used. That is, we are not extending a specific actor system across more than one node. However, the application is written in a way that is completely prepared for it.

Akka Streaming (back-pressure for streaming consumption)

Akka Streaming is yet another module from Akka, relatively recently released. The main point of it is to hide away the complexities of creating a streaming consumption environment, and to provide back-pressure for free. Akka has really good documentation explaining back-pressure in more detail, but in short, back-pressure ensures that producers slow down production in case consumers are lagging behind (for example, when some performance issue prevents them from consuming fast enough).

Last but not least, it is important to highlight that Akka Streaming works kind of like a black box (in a good way), doing all the heavy lifting in the background and freeing you to focus on other business-critical logic. The API is also intuitive to use, following a nice functional programming style. However, we should warn that as your operations graph grows in complexity, you will be forced to dive deeper into more advanced topics.
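As a toy illustration of back-pressure (the numbers and the artificial delay are arbitrary; nothing here comes from our application):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.Future

object BackPressureDemo extends App {
  implicit val system: ActorSystem = ActorSystem("backpressure-demo")
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // A very fast producer...
  Source(1 to 1000000)
    // ...connected to a slow, bounded-parallelism stage. The stage only signals
    // demand upstream when it has capacity, so the source is automatically
    // slowed down -- that is back-pressure, with no extra code from us.
    .mapAsync(parallelism = 4) { i =>
      Future { Thread.sleep(10); i }
    }
    .runWith(Sink.foreach(i => if (i % 1000 == 0) println(s"processed $i")))
}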

About Kafka

Kafka is a complex system – more specifically, in this case, a distributed publish-subscribe messaging system – and one of its many use cases is messaging. It is provided to the Akka application as a service, so from the application standpoint one does not need to care much about it. However, it is beneficial to understand how the application scales and how it ingests data. The following summary attempts to highlight some of the core differences that make Kafka different from other messaging systems, such as RabbitMQ:
  • Kafka implements a “dumb broker, smart consumer” philosophy: consumers are responsible for knowing from where they should consume data – Kafka does not keep track. This is a major distinction compared to, for example, RabbitMQ, where many sophisticated delivery checks are available to deal with dead-letter messages. In regards to our application, given Akka Streaming's back-pressure mechanism, our application will halt consumption in case consumers are not able to keep up with producers;
  • Persistent storage for a configurable amount of time; many clients can read from the same topic, for as long as Kafka is configured to persist the data;
  • Topics can be partitioned for scalability – in practice this means that we can distribute and store the same topic among several servers;
  • Data in each partition is added in append-only mode, creating an immutable sequence of records – a structured commit log; records are stored in a key-value structure, and in any format, such as String, JSON, Avro, etc.
  • It follows that order is only guaranteed on a partition basis; that is, inside the same partition, if event A was appended before event B, it will also be consumed first by the consumer assigned to that partition. However, across partitions order is not guaranteed; the following illustration, taken from Kafka's own project page, illustrates this concept better:

 

[Figure: anatomy of a Kafka topic – partitions as ordered, append-only logs]

  • Besides possibly being partitioned, topics can also be replicated among several nodes, thus guaranteeing HA;
  • Consumers can be assigned to consumer groups, thus scaling the number of times a topic can be consumed independently;

For more detail about Kafka, we recommend Kafka's own page, which has a really good intro. Finally, if you are indeed familiar with RabbitMQ, we would recommend reading the following article comparing Kafka with RabbitMQ, especially regarding which use cases each fits best.

 

Getting through Deep Learning – Tensorflow intro (part 3)

This post is part of a tutorial series:

  1. Getting through Deep Learning – CNNs (part 1)
  2. Getting through Deep Learning – TensorFlow intro (part 2)
  3. Getting through Deep Learning – TensorFlow intro (part 3)

Alright, let's move on to more interesting stuff: linear regression. Since the main focus is TensorFlow, and given the abundance of online resources on the subject, I'll just assume you are familiar with Linear Regression.

As previously mentioned, a linear regression has the following formula:

Y = b0 + b1 · X

Where Y is the dependent variable, X is the independent variable, and b0 and b1 are the parameters we want to adjust.

Let us generate random data and feed it into a linear function. Then, as opposed to using the closed-form solution, we use an iterative algorithm to progressively get closer to the minimal cost – in this case using gradient descent to fit the linear regression.

We start by initializing the weights – b0 and b1 – with random values, which naturally results in a poor fit. However, as one can see through the print statements as the model trains, b0 approaches the target value of 3, and b1 the target value of 5, with the last printed step being: [3.0229037, 4.9730182]
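A minimal sketch of the kind of training loop described above; the exact data generation, learning rate and number of steps in the original may differ (TensorFlow 1.x API):

import numpy as np
import tensorflow as tf  # written against the TensorFlow 1.x API

# Random training data for y = b0 + b1 * x, with target values b0 = 3, b1 = 5
x_data = np.random.rand(100).astype(np.float32)
y_data = 3.0 + 5.0 * x_data

# Weights initialized randomly, hence the initial poor fit
b0 = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b1 = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
y = b0 + b1 * x_data

# Mean squared error cost, minimized iteratively with gradient descent
cost = tf.reduce_mean(tf.square(y - y_data))
train = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(201):
        sess.run(train)
        if step % 20 == 0:
            print(step, sess.run(b0), sess.run(b1))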

The next figure illustrates the progressive fitting of lines of the model, as weights change:

[Figure: progressive fitting of the regression line as the weights change]

 


Getting through Deep Learning – Tensorflow intro (part 2)

Yes, I kind of jumped the gun on my initial Deep Learning post, going straight into CNNs. For me this learning path works best, as I dive straight into the fun part, and eventually stumble upon the fact that maybe I'm not that good of a swimmer, and it might be good to practice a bit before going out into deep waters. This post attempts to be exactly that: going back to the basics.

This post is part of a tutorial series:

  1. Getting through Deep Learning – CNNs (part 1)
  2. Getting through Deep Learning – TensorFlow intro (part 2)

TensorFlow is a great starting point for Deep Learning/Machine Learning, as it provides a very concise yet extremely powerful API. It is an open-source project created by Google, initially with numerical computation tasks in mind, and used for Machine Learning/Deep Learning.

TensorFlow provides APIs for both Python and C++, but its backend is written in C/C++, allowing it to achieve much greater performance. Moreover, it supports CPU, GPU, as well as distributed computing in a cluster.

The first thing to realize is that TensorFlow uses the concept of a session. A session is nothing more than a series of operations to manipulate tensors, organized in the structure of a data flow graph. This graph-building activity pretty much works like Lego building, by matching nodes and edges. Nodes represent mathematical operations, and edges multi-dimensional arrays – aka tensors. As the name hints, a tensor is the central data structure in TensorFlow, and is described by its shape. For example, one would characterize a 2-row by 3-column matrix as a tensor with shape [2, 3].

Important also to note is that the graph is lazily evaluated, meaning that computation will only be triggered by an explicit run order for that session graph. OK, enough talking, let us get into coding, by exemplifying how a basic graph session is built:
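The snippet below is an illustrative sketch (the constant values are arbitrary; written against the TensorFlow 1.x API used at the time):

import tensorflow as tf  # TensorFlow 1.x style API

# Construction phase: "a" and "b" are nodes in the graph,
# and the summation is the operation connecting them
a = tf.constant(2, name="a")
b = tf.constant(3, name="b")
total = a + b

# Execution phase: nothing is computed until the graph is run inside a session
with tf.Session() as sess:
    print(sess.run(total))  # -> 5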

In the previous example, variables “a” and “b” are the nodes in the graph, and the summation is the operation connecting both of them.

A TensorFlow program is typically split into two parts: a construction phase – where the graph is built – and a second one called the execution phase, when resources (CPU and/or GPU, RAM and disk) are actually allocated until the session is closed.

Typically, machine learning applications strive to iteratively update model weights. So, of course, one can also specify tensors of variable type, and even combine those with constants.

By defining the dtype of a node, one can gain/lose precision, and at the same time impact memory utilization and computation time.
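For instance, a small illustrative sketch (names and values are arbitrary; TensorFlow 1.x API):

import tensorflow as tf  # TensorFlow 1.x style API

# A variable-type tensor (e.g. a model weight) combined with a constant.
# The explicit dtype trades precision against memory and computation time.
w = tf.Variable(3.0, dtype=tf.float32, name="weight")
c = tf.constant(2.0, dtype=tf.float32, name="constant")
op = w * c

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(op))  # -> 6.0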

Note lines 35 to 37 now:

r1 = op1.eval()      # runs the graph to evaluate op1 (and the x, y it depends on)
r2 = op2.eval()      # a second, separate graph run to evaluate op2
result = f.eval()    # a third run: op1 and op2 are re-evaluated here, not reused

TensorFlow automatically detects which operations depend on each other. In this case, TensorFlow will know that op1 depends on x and y evaluation, op2 on a, b and c evaluation,  and finally that f depends on both op1 and op2. Thus internally the lazy evaluation is also aligned with the computation graph. So far so good.

However, all node values are dropped between graph runs, except for variable values, which are maintained by the session across graph runs. This has the important implication that the op1 and op2 evaluations will not be reused for the f graph run – meaning the code will evaluate op1 and op2 twice.

To overcome this limitation, one needs to instruct TensorFlow to evaluate those operations in a single graph run:
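Something along these lines, reusing op1, op2 and f from the snippet referenced above:

# Evaluate op1, op2 and f in one single graph run: the shared intermediate
# results are computed only once instead of being re-evaluated.
with tf.Session() as sess:
    r1, r2, result = sess.run([op1, op2, f])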

And yes, that is all for today. I want to blog more frequently, and instead of writing just once every couple of months (and in the meanwhile piling up a lot of draft posts that never see the light of day), I decided to keep it simple. See you soon 🙂
