Setting up parallel Spark2 installation with Cloudera and Jupyter Notebooks

Yes, this topic is far from new material. Especially considering how fast the cloud tech stack evolves, it has been a long time since Apache Spark version 2 was introduced (26-07-2016, to be precise). But moving to the cloud is not an easy option for every company, as data volumes alone can make such a move prohibitive. And in on-premises contexts, the pace of operational change is significantly slower.

This post summarizes the steps for deploying Apache Spark 2 alongside Spark 1 on Cloudera, and for installing Python Jupyter notebooks that can switch between Spark versions via kernels. Given that this is a very frequent setup in big data environments, I thought I would make life a little easier for “on-premise engineers” and, hopefully, speed things up just a bit.
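The full walkthrough is behind the link, but as a quick illustration of the end state, a notebook cell along the following lines (a minimal sketch, assuming each kernel’s environment points SPARK_HOME and PYTHONPATH at the intended installation) confirms which Spark version a given kernel is bound to:

# Sanity check from a notebook cell: create a context against the kernel's
# configured Spark installation and print its version.
from pyspark import SparkContext

sc = SparkContext(appName="version-check")
print(sc.version)  # expect 1.x or 2.x depending on the kernel selected
sc.stop()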

Continue reading “Setting up parallel Spark2 installation with Cloudera and Jupyter Notebooks”

Check out Video Understanding Using Temporal Cycle-Consistency Learning

It has been a long time since I last blogged. João and I have been busy getting things ready to launch a new project to streamline multi-cloud management, which we have been working on for the last couple of months. We will talk about it in more detail soon, so stay tuned.

Meanwhile, I just wanted to share this very interesting post from the Google AI blog – Video Understanding Using Temporal Cycle-Consistency Learning – where they propose a self-supervised learning method to classify different actions, postures, etc. in videos.

Continue reading “Check out Video Understanding Using Temporal Cycle-Consistency Learning”

Faust, your Python-based streaming library

Robinhood is a very popular California-based FinTech, included by Forbes in its list of the top 50 FinTechs to watch in 2019. Their primary mission is to bring down stock trading fees for the common Joe/Jane, although their roadmap also includes cryptocurrency trading.

Due to the nature of the bidding market, their data stack probably includes a lot of streaming tooling. Also, (probably) due to the lack of quick and easy tooling for streaming with Python, combined with the growing demand for Python in the developer community, they launched their own port of Kafka Streams, called Faust.

In this post, we’re going to show how easy it is to bootstrap a project with Faust and put your stream-related business logic into practice very quickly. The demo we prepared is an app that filters words from a Kafka topic and keeps a count of how many times it has seen the colors “red”, “green” and “blue”.
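As a teaser, here is a minimal sketch of what such an app can look like (broker address, topic name and variable names are illustrative, not the exact code from the demo):

import faust

# Broker address and topic name are illustrative placeholders.
app = faust.App("color-counter", broker="kafka://localhost:9092")

# Topic carrying plain string words.
words_topic = app.topic("words", value_type=str)

# Changelog-backed table keeping one counter per color.
color_counts = app.Table("color_counts", default=int)

COLORS = {"red", "green", "blue"}

@app.agent(words_topic)
async def count_colors(words):
    async for word in words:
        if word in COLORS:
            color_counts[word] += 1

if __name__ == "__main__":
    app.main()

Started with something like python app.py worker -l info, the agent consumes the topic and keeps the per-color counts up to date.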

In a nutshell, Faust is:

Continue reading “Faust, your Python-based streaming library”

List all your AWS resources with Go

Picking up on Diogo’s last post on how to obliterate all resources on your AWS account, I thought it could also be useful to, instead, list everything you have running.

Since I’m long overdue on a Go post, I’m going to share a one-file app that uses the Go AWS SDK to crawl each region for all taggable resources and pretty-print them on stdout, organised by service type (e.g. EC2, ECS, ELB, etc.) and product type (e.g. instance, NAT, subnet, cluster, etc.).

The AWS SDK allows you to retrieve the ARNs of all taggable resources, so that’s all the info I’ll use for our little app.

Note: if you prefer to jump straight to the full code, please scroll to the end and read the running instructions first.

The objective

The main goal is to get structured information out of the retrieved ARNs, so the first thing is to create a type that serves as a blueprint for what I’m trying to achieve. Because I want to keep it simple, let’s call this type SingleResource.

Also, while we are taking care of the basics, we can define the TraceableRegions that we want the app to crawl through.

Finally, to keep the objective in focus, let’s also create a function that accepts a []*SingleResource slice and prints it out as a table to stdout:

Continue reading “List all your AWS resources with Go”

Keeping AWS costs low with AWS Nuke

A common pattern in companies using AWS services is having several distinct AWS accounts, partitioned not only by team, but also by environment, such as development, staging and production.

This can very easily blow up your budget with unutilized resources. A classic example occurs when automated pipelines – think terraform apply, CI/CD procedures, etc. – fail or time out, and all the resources created in the meantime are left behind.

Another frequent example happens in companies recently moving to the cloud. They create accounts for the sole purpose of familiarizing and educating developers on AWS and doing quick and dirty experiments. Understandably, after clicking around and creating multiple resources, it becomes hard to track exactly what was instantiated, and so unused zombie resources are left lingering around.

Continue reading “Keeping AWS costs low with AWS Nuke”

Integrating IAM users/roles with EKS

To be completely honest, this article grew out of some troubleshooting frustration, so hopefully it will save others some headaches.

The scenario: after having configured an EKS cluster, I wanted to grant permissions to more IAM users. After creating a new IAM user that belonged to the intended IAM groups, the CLI threw the following errors:

kubectl get svc
error: the server doesn't have a resource type "svc"
kubectl get nodes
error: You must be logged in to the server (Unauthorized)


AWS profile config

First, configure your local AWS profile. This is also useful if you want to test different users and roles.

# aws configure --profile <profile-name>
# for example:
aws configure --profile dev

If this is your first time, this will generate two files:

~/.aws/config and ~/.aws/credentials

If they already exist, it will simply append to them, which means you can obviously also edit the files manually if you prefer. The way to alternate between these profiles in the CLI is:

# export AWS_PROFILE=<profile-name>
# for example:
export AWS_PROFILE=dev

Now before you move on to the next section, validate that you are referencing the correct user or role in your local aws configuration:

# aws --profile <profile-name> sts get-caller-identity
# for example:
aws --profile dev sts get-caller-identity

{
  "Account": "REDACTED",
  "UserId": "REDACTED",
  "Arn": "arn:aws:iam:::user/john.doe"
}


Validate AWS permissions

Validate that your user has the correct permissions; in particular, you need to be able to describe the cluster:

# aws eks describe-cluster --name=<cluster-name>
# for example:
aws eks describe-cluster --name=eks-dev

Add IAM users/roles to cluster config

If you managed to add worker nodes to your EKS cluster, then this part should already be familiar. As the AWS documentation describes, you probably applied an aws-auth ConfigMap along these lines:

kubectl apply -f aws-auth-cm.yaml

While troubleshooting, I saw some people trying to use the cluster’s role in the “-r” part. However, you cannot assume a role used by the cluster, as this is a role reserved/trusted for instances. You need to create your own role, add the root account as a trusted entity, and add permission for the user/group to assume it, for example as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks.amazonaws.com",
        "AWS": "arn:aws:iam:::user/john.doe"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}


Kubernetes local config

Then, generate a new kube configuration file. Note that the following command will create a new file at ~/.kube/config:

aws --profile=dev eks update-kubeconfig --name eks-dev

AWS suggests isolating your configuration in a file named “config-<cluster-name>”. So, assuming our cluster name is “eks-dev”:

export KUBECONFIG=~/.kube/config-eks-dev
aws --profile=dev eks update-kubeconfig --name eks-dev

This will then create the config file at ~/.kube/config-eks-dev rather than ~/.kube/config.

As described in the AWS documentation, your kube configuration should look similar to the example shown there.

If you want to make sure you are using the correct configuration:

export KUBECONFIG=~/.kube/config-eks-dev
kubectl config current-context

This will print whatever alias you gave the context in the config file.

Last but not least, update the new config file to add the profile you used.

The last step is to confirm you have permissions:

export KUBECONFIG=~/.kube/config-eks-dev
kubectl auth can-i get pods
# Ideally you get "yes" as the answer.
kubectl get svc


Troubleshooting

To make sure you are not working in an environment with hidden environment variables that you are not aware of and that may conflict, unset them as follows:

unset AWS_ACCESS_KEY_ID && unset AWS_SECRET_ACCESS_KEY && unset AWS_SESSION_TOKEN

Also, if you are getting an error like the following:

could not get token: AccessDenied: User arn:aws:iam:::user/john.doe is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam:::user/john.doe

Then it means you are specifying the “-r” flag in your kube config file. This flag should only be used for roles.

Hopefully this short article was enough to unblock you, but in case it was not, here is a collection of further potentially useful articles:

Getting Started with Spark (part 4) – Unit Testing

Alright, quite a while ago (already counting in years), I published a tutorial series focused on helping people get started with Spark. Here is an outline of the previous posts:

In the meantime, Spark’s popularity has not decreased, so I thought I would continue updating the same series. In this post we cover an essential part of any ETL project, namely unit testing.

For that, I created a sample repository that is meant to serve as boilerplate code for any new Python Spark project.
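To give an idea of the approach, here is a minimal, hypothetical sketch of a pytest-based PySpark test (the fixture and function names are illustrative and not taken from the sample repository):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession shared across the whole test session.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("unit-tests")
               .getOrCreate())
    yield session
    session.stop()

def keep_adults(df):
    # Example transformation under test: keep rows with age >= 18.
    return df.filter(df.age >= 18)

def test_keep_adults(spark):
    df = spark.createDataFrame([("alice", 23), ("bob", 12)], ["name", "age"])
    result = keep_adults(df).collect()
    assert [row.name for row in result] == ["alice"]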

Continue reading “Getting Started with Spark (part 4) – Unit Testing”