Terraforming in 2021 – new features, testing and compliance

 

Both João and I have been quiet on the blogging front for quite a while now. Not only due to our main projects, but also because we have been working on the makeops project, which is intended to simplify Terraform management across different cloud accounts. Working on this project made us explore a lot of other tooling for optimizing Terraform management, so we thought it would be useful to share some of the things we learned along the way.

We kick off with the latest updates in Terraform itself, from version 0.12 up to the latest 0.15, and then move on to the surrounding tooling, where we mainly focus on testing and compliance checking for infrastructure.

You can find all the code supporting this post here.

 

Overview of latest Terraform versions

Version 0.15 has just recently been released. Yet a lot of companies out there are still running environments on 0.11.x or earlier, and for a good reason: although version 0.12 was launched a while ago (2019), it brought great disruption. If this is your case, chances are that you gave up on keeping up to date with Terraform's progress. Thus, we thought it would make sense to first review some of the features you might be missing out on, before we try to convince you to consider further upgrades.

0.12

The key highlights are:

  • First-class expression syntax: probably the most noticeable change, no more string interpolation syntax (confetti blowing in the background)!
  • Generalized type system: (yay!) the ability to specify types for variables;
  • Iteration constructs: introduction of the for expression, bringing the DSL closer to programming languages and thus making it more expressive; a quick sketch of these changes follows below;

For more details, see the general availability announcement page.
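
To make these new constructs concrete, here is a minimal, illustrative sketch of 0.12-style code (names and the AMI are placeholders borrowed from the example environment used later in this post):

# Generalized type system: typed variables
variable "instance_count" {
  type    = number
  default = 2
}

resource "aws_instance" "web" {
  # First-class expressions: no more "${var.instance_count}" interpolation
  count         = var.instance_count
  ami           = "ami-0db61e5fa6d1a815a"
  instance_type = "t3.micro"
}

output "instance_ids" {
  # Iteration constructs: for expressions
  value = [for instance in aws_instance.web : instance.id]
}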

0.13

The key highlights are:

  • Module improvements – brings the ability to use, on modules, several meta-arguments that until now were only available for resources. We can now use depends_on for expressing dependencies between modules (finally!), along with the ability to instantiate multiple module instances with count or for_each; more details here;

Example specifying a module dependency (a personal favorite), source hashicorp blog
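
Since the original illustration is an image, here is a hedged sketch of what these module meta-arguments look like (module names and sources are made up):

module "vpc" {
  source = "./modules/vpc"
}

module "app" {
  source   = "./modules/app"
  # instantiate the same module multiple times
  for_each = toset(["blue", "green"])
  name     = each.key

  # explicit module dependency, new in 0.13
  depends_on = [module.vpc]
}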

  • Custom variable validation – from this version onwards, when specifying variables one can define custom rules for the accepted input, along with the respective error message, to allow failing fast; more details here;

Example variable validation, source hashicorp blog
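
Again, as the original example is an image, here is a hedged sketch of a validation block (the rule itself is purely illustrative):

variable "aws_region" {
  type        = string
  description = "AWS region to deploy into"

  validation {
    condition     = can(regex("^(eu|us)-", var.aws_region))
    error_message = "The aws_region value must start with eu- or us-."
  }
}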

  • Support for 3rd-party providers – terraform now allows one to reference your own providers in the terraform block's required_providers. The required_providers keyword already existed in terraform 0.12, though it was restricted to hashicorp's own providers; now you can reference your own DNS source to host your own registry, and upon terraform init it will be installed the same way as all other providers; more details here. Note that if you are already using the required_providers keyword in the terraform block, you should adapt it when migrating from version 0.12 to 0.13, as shown in the following example:

Example required_providers, source terraform upgrade guides
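
The upgrade-guide example being an image as well, here is a hedged sketch of the 0.13+ syntax (the second provider and its registry address are purely hypothetical):

terraform {
  required_version = ">= 0.13"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
    custom = {
      # a hypothetical in-house registry hosted under your own DNS name
      source  = "registry.example.com/acme/custom"
      version = ">= 1.0"
    }
  }
}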

For more details, see the general availability announcement page.

0.14

The key highlights are:

  • Sensitive variables & outputs – flagging a variable as “sensitive” will redact its value in CLI output (see the sketch after this list);
  • Improved diff – this addresses probably one of the most common complaints I hear, namely the difficulty of reading the output of terraform plan (and, along with it, apply and show); with this change, unchanged fields (irrelevant for the diff) are hidden, while a count of the hidden elements is displayed for better clarity; more details here;

Example excerpt diff, source hashicorp blog

  • Dependency lock file – as soon as you run terraform init with version 0.14 you will notice that a .terraform.lock.hcl file is created in that directory, outside the .terraform directory. This file is intended to be added to version control (git committed), to guarantee that the exact same provider versions are used everywhere;
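
Here is a minimal, illustrative sketch of the sensitive flag mentioned above (variable and output names are made up):

variable "db_password" {
  type      = string
  sensitive = true   # value is redacted in plan/apply output
}

output "db_connection_string" {
  # outputs derived from sensitive values must be marked sensitive as well
  value     = "postgres://app:${var.db_password}@db.internal:5432/app"
  sensitive = true
}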

For more details, see the general availability announcement page.

 

0.15

Finally, here are some of the main highlights of the latest version, announced just recently:

  • Remote state data source compatibility: to make it easier to upgrade to newer versions, you can reference remote state data objects that were written by older versions; note that this feature has been backported into previous releases as well, namely 0.14.0, 0.13.6, and 0.12.30;
  • Improvements in concealing sensitive values (passwords, for example): following the version 0.14 work, provider developers can now mark the properties that are sensitive by default and should be hidden in outputs; moreover, terraform also added a sensitive function for users to explicitly hide values (see the sketch below);
  • Improvements in logging behavior: the ability to control provider and terraform core log levels separately via TF_LOG_CORE=level and TF_LOG_PROVIDER=level.
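
A minimal, illustrative sketch of the new sensitive function (the referenced resource and output names are made up):

output "generated_password" {
  # explicitly redact a value that the provider does not mark as sensitive
  value = sensitive(random_password.db.result)
}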

For more details, see the general availability announcement page.

 

Handling multiple versions

Assuming these new features convinced you to upgrade your existing terraform environments, let us be realistic: this will not happen from one day to the next. You will have a transition period (if not a permanent one) where you have environments with different terraform versions. That is OK; you can still keep your sanity while hopping between all of them thanks to tools like the following:

  • TFEnv – a terraform version switcher inspired by rbenv (from the ruby world), written in shell scripts;
  • Terraform Switcher – yet another project doing essentially the same, written in Go;

Both of these projects overlap almost entirely, so we will simply exemplify with one of them, namely tfenv:

Here we show how you can switch between installed versions with tfenv use <local-version>, how you can check which versions are already installed locally with tfenv list, and all versions currently available with tfenv list-remote (minor detail: the current version of the library I'm using to record my terminal, terminalizer, does not capture me scrolling up and selecting terraform version 0.14.5).
Last but not least, we also show a cool feature of tfenv, namely the ability to automatically recognize the minimum required version in a given environment. The same goes for the latest allowed version, in case you are wondering. And yes, this is also available in the terraform switcher project.
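
Since the terminal recording does not translate well to text, here is roughly the sequence of commands shown above (version numbers are just examples):

tfenv list-remote            # all versions available for install
tfenv install 0.14.5         # install a specific version
tfenv list                   # versions installed locally
tfenv use 0.14.5             # switch the active terraform binary
tfenv install min-required   # scan the current directory for required_version constraints
tfenv use min-required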

Testing

Testing is probably the most confusing topic in the Infrastructure-as-Code (IaC) land, and terraform is no exception, as a lot of different tools and procedures get thrown into this same bag when, well, they probably should not. When talking about testing, people usually mean three different things:

  • static checks – validation of mainly the structure of the code without actually running it; a fast way of performing sanity checks, either locally and/or in your CI/CD pipelines;
  • integration tests – provided you are properly using inversion of control with variables, they give you the power to test your modules or environments for different generic cases;
  • compliance checks – tests done in the aftermath, after all resources have already been deployed; these can serve distinct purposes that we will explore later, but essentially the goal is to confirm that what is deployed is what was initially intended and expressed in terraform;

Yes, you got that right: if you were looking forward to some classic unit testing, you can forget that for now. Let us dive straight into more details.

Static checks

Again, these are not tests strictly speaking, but rather simple validation checks one can run to catch common errors without actually deploying anything. There is an array of tools out there, but let us start with the one provided out of the box in the terraform binary.

It is not uncommon for tool creators to provide their own validators: kubernetes provides kubectl apply -f <file> --dry-run --validate=true, helm provides helm lint <path>, and terraform provides terraform validate.

In the following example running terraform validate would catch two of the three issues:

# terraform validate will catch typo in resource reference
resource "aws_s3_bukcet" "wrong_resource" {
  name = "my-bucket"
}

# terraform validate will catch wrong CIDR
resource "aws_vpc" "default" {
  cidr_block = "0.0.0.0/0"
}

resource "aws_instance" "web" {
  # .. however, non existing instance type is not caught by terraform validate
  instance_type          = "t200.micro"
  ami                    = "ami-0db61e5fa6d1a815a"
  vpc_security_group_ids = [aws_security_group.default.id]
  subnet_id              = aws_subnet.default.id

  provisioner "remote-exec" {
    inline = [
      "sudo apt-get -y update",
      "sudo apt-get -y install nginx",
      "sudo service nginx start",
    ]
  }
}

Validate requires having previously run terraform init, so that it can leverage the providers' schemas. In our example it detects the typo in the aws bucket resource type, along with the invalid CIDR provided to create the VPC. What it does not detect, however, is the non-existent EC2 instance type.
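
For reference, the commands boil down to the following (the -backend=false flag simply avoids configuring a real backend for a pure syntax check):

terraform init -backend=false
terraform validate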

This is where TFLint comes to the rescue. Yet another open source tool written in Go, it ships as a single binary much like terraform, and does not even require terraform to be installed.

Running it with the deep checking option requires one to provide cloud credentials, and provides a more thorough inspection. In our case, it detected the incorrect instance type, as well as the wrong AMI.
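
Depending on the tflint version you are running, deep checking is enabled either via a command-line flag or through the AWS ruleset plugin in a .tflint.hcl file; a hedged sketch of the latter:

# .tflint.hcl
plugin "aws" {
  enabled    = true
  deep_check = true   # requires AWS credentials, e.g. via the usual environment variables
}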

You can also customize tflint to inject variables, define modules to ignore, etc. You can check the user guide for more details.

Besides linting for configuration issues in your code, another recommendation is to check for security issues, such as overly permissive security group rules, unencrypted resources, etc.

Here again more than one tool exists to assist. We will highlight two of the most popular ones: tfsec and checkov. Both provide a predefined set of checks used to inspect your code, allow explicitly opening exceptions (if you really want to) by annotating your code with comments, and let you adjust the configuration to ignore some modules, for example.

TFSec is written in Go and is probably the fastest to get started with, currently covering the main cloud providers (AWS, GCP, and Azure). The potential downside is that it works exclusively for Terraform, so you will need additional tools to inspect kubernetes/helm/cloudformation, etc.

Checkov, on the other hand, despite being a more recent tool, has seen stellar development speed (being developed by a startup with good funding rounds, and the Palo Alto Networks acquisition can't hurt). Not only does it have a really comprehensive number of checks across all the main cloud providers, but these also span multiple technologies, such as Kubernetes, Cloudformation, serverless and ARM templates. And the list keeps growing. Checkov gives you the option to run either a pure static check, by just pointing it at a terraform directory or file, or to run it against a terraform plan file. The nice thing about running it directly against the code is naturally the simplicity of not requiring the target environment to be accessible in order to test it.
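
For illustration, here is roughly how both tools are invoked (the directory path refers to our sample repository layout and is just an assumption):

# tfsec: point it at the directory containing your .tf files
tfsec ./environments/base-example

# checkov: static scan of a directory...
checkov -d ./environments/base-example

# ...or scan an actual plan, converted to JSON first
terraform plan -out tf.plan
terraform show -json tf.plan > tf-plan.json
checkov -f tf-plan.json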

These tools have been grabbing a lot of attention lately, as double checking for security issues was usually locked in the hands of devops/devsecops teams, which in practice constituted a development bottleneck. By injecting these checks early in CI/CD pipelines, a great deal of development speed is freed up without compromising security.

Integration tests

Terratest is probably the closest one can get nowadays to testing a specific piece of terraform code. It is a Go library, and requires one to write the tests in Go. This is obviously a potential limitation, as not all teams have Go knowledge. On the upside, I would argue that the learning curve for picking up enough Go to write terraform tests is not steep if you already know at least one programming language.

Having already worked on several Go projects in the context of mklabs, we are naturally favorably biased towards Terratest. We really like it. So, even if you have never tried Go, we would very much like to invite you to try out our sample repository, where we provide a Makefile and go mod setup to make getting started easy. And if this did not convince you yet, here is what might: you can also use Terratest to test Dockerfiles, Kubernetes and Packer setups.
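
For reference, running the tests in such a setup typically boils down to something like the following (module path and directory names are illustrative):

cd tests
go mod init github.com/your-org/terraform-tests   # only needed once
go mod tidy                                       # pulls terratest, testify, etc.
go test -v -timeout 30m                           # cloud resources are slow, so raise the default 10m timeout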

This might seem intimidating, but we would argue that the benefits are worth it. Let us look at an example skeleton of a test setup for AWS:

package tests

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestTerraformAwsEnvironment(t *testing.T) {
    myVarToInject := "my-var"
    awsRegion := "eu-west-1"

    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        // Variables to pass to our Terraform code using -var options
        Vars: map[string]interface{}{
            "my_variable": myVarToInject,
        },
        EnvVars: map[string]string{
            "AWS_DEFAULT_REGION": awsRegion,
        },
        TerraformDir: "../environments/base-example",
    })

    // cleanup created resources at the end of the test
    defer terraform.Destroy(t, terraformOptions)

    // run terraform init & apply
    terraform.InitAndApply(t, terraformOptions)

    // Perform assertions …
    expected := "some-id"
    result := terraform.Output(t, terraformOptions, "some_output")
    assert.Equal(t, expected, result)
}

Essentially the skeleton is always the same:

  1. start by defining the variables you want to inject as input for your terratest run; you might even want to inject random values, to test in greater depth;
  2. make sure you tell the setup to destroy all created resources after the terraform apply has been completed, using the defer statement; this is equivalent to the trap command in shell, and it will execute even if something fails in the meanwhile; the only prerequisite is that you declare it before calling the terraform.InitAndApply() method;
  3. let the test be deployed by passing the terraform config you previously declared in terraformOptions to the terraform.InitAndApply method;
  4. finally, declare the things you want to assert; that is, declare what you expect should have been deployed – the expected – and see if it matches what was actually deployed – the result. The simplest way is to check the terraform apply output, accessible via the terraform.Output() method. Alternatively, terratest also provides some methods out of the box for frequently checked things on a per-provider basis.

Sounds like fun, right? Here is an example of some of the tests we could write to our basic terraform example:

package tests

import (
    "fmt"
    "strings"
    "testing"

    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestTerraformAwsEnvironment(t *testing.T) {
    awsRegion := "eu-west-1"
    bucketName := strings.ToLower(fmt.Sprintf("my-bucket-%s", random.UniqueId()))
    securityGroupName := strings.ToLower(fmt.Sprintf("my-security-group-%s", random.UniqueId()))
    securityGroupDescription := strings.ToLower(fmt.Sprintf("security-group-description-%s", random.UniqueId()))

    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        // Variables to pass to our Terraform code using -var options
        Vars: map[string]interface{}{
            "bucket_name":                bucketName,
            "security_group_name":        securityGroupName,
            "security_group_description": securityGroupDescription,
            "aws_region":                 awsRegion,
        },
        EnvVars: map[string]string{
            "AWS_DEFAULT_REGION": awsRegion,
        },
        TerraformDir: "../environments/base-example",
    })

    // cleanup created resources at the end of the test
    defer terraform.Destroy(t, terraformOptions)

    // run terraform init & apply
    terraform.InitAndApply(t, terraformOptions)

    // Perform assertions regarding s3 bucket
    bucketNameResult := terraform.Output(t, terraformOptions, "bucket_name")
    bucketArnResult := terraform.Output(t, terraformOptions, "bucket_arn")
    expectedBucketArn := fmt.Sprintf("arn:aws:s3:::%s", bucketName)
    assert.Equal(t, expectedBucketArn, bucketArnResult)
    assert.Equal(t, bucketName, bucketNameResult)

    // Perform assertions regarding VPC
    vpcCidrResult := terraform.Output(t, terraformOptions, "vpc_cidr")
    assert.Equal(t, "10.0.0.0/16", vpcCidrResult)

    // Perform assertions regarding security group
    securityGroupNameResult := terraform.Output(t, terraformOptions, "security_group_name")
    assert.Equal(t, securityGroupName, securityGroupNameResult)
}

Here is the output of running these tests for our demo setup:

Final thoughts regarding terratest: we find it a great tool for implementing changes in your infrastructure with confidence, no matter if they are simple day-to-day changes or bigger and more complex upgrades or migrations.

We have just scratched the surface here of the tests one can develop with terratest – for example, combining SSH access into instances with confirming access to resources, etc. However, there are no free lunches, and this can come at a price: tests can take a long time (depending obviously on what you are testing), and require a non-trivial time investment – learning how to use the tool, writing the tests, and setting up the environment, as you will probably want to run them in an isolated environment. In AWS's case, this would ideally be a dedicated account. We recommend reading the best practices on how to perform testing from Gruntwork, the company behind terratest.

Compliance checks

The last mile is asserting that what you wanted to be deployed was indeed deployed exactly as you wanted. Abusing terraform's null_resource is a classic way of ending up with unintended surprises. One way of achieving this assertion would be to run these checks right after the terraform apply stage of your CI/CD pipeline.

But the next question you might have is: how do I know that these configurations stay that way, that no one changed things inadvertently? We have seen this situation arise in different forms: changes done manually by users via GUI or CLI; different terraform environments mutating overlapping resource properties; or as a by-product of using different IaC tools, for example configuring some bits with Cloudformation, some with K8s or Helm, etc.

Arguably more interesting would be scheduling these checks to run continuously and repeatedly, to make sure things stay as you expect. Let us look at some of the options out there to achieve this.

The rogue approach: in practice you can do this type of compliance check with any programming language you favor. As we have shown before, one can export terraform's state into JSON format, and then use, for example, python with the pytest and boto3 libraries to compare what is deployed with the desired output. You could even go further, and use boto3 to scan your accounts for different aspects that you consider to go against best practices, such as lack of encryption (in flight and at rest).
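
A minimal sketch of what this could look like, assuming you have exported the outputs first with terraform output -json > outputs.json (the output names used here are hypothetical):

import json

import boto3
import pytest


@pytest.fixture(scope="session")
def tf_outputs():
    # terraform output -json produces {"name": {"value": ..., ...}, ...}
    with open("outputs.json") as f:
        return {k: v["value"] for k, v in json.load(f).items()}


def test_bucket_exists_and_is_encrypted(tf_outputs):
    s3 = boto3.client("s3")
    bucket = tf_outputs["bucket_name"]

    # the bucket declared in terraform must actually exist...
    assert bucket in [b["Name"] for b in s3.list_buckets()["Buckets"]]

    # ...and have server-side encryption configured
    encryption = s3.get_bucket_encryption(Bucket=bucket)
    assert encryption["ServerSideEncryptionConfiguration"]["Rules"]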

While this might be tempting at first sight, you may end up writing far more testing code than actual terraform code. Not only does it seem kind of silly, it can also get hairy.

The second reason not to follow this approach is that there are already several fairly well-tested and intuitive solutions out there that can help you with this task, which we cover next:

  • Driftctl – open source Go tool from cloudskiff, dedicated to terraform;
  • Sentinel – Hashicorp's own solution;
  • terraform-compliance – open source BDD-based solution dedicated to terraform;
  • Conftest – test suite for multiple frameworks besides terraform, such as kubernetes and dockerfiles;
  • InSpec – Chef's compliance testing tool, written in ruby;
  • Built-in cloud provider tools – each cloud provider has its own inspection mechanisms in place;

You may be wondering why the first place in our list goes to a library still in beta. That is a fair point; we would argue that its progress seems promising, and we like to support startups. Moreover, we think its simplicity of usage deserves a highlight, as it delivers one thing and one thing only: checking whether what is in your terraform state file is what is actually deployed. Simply point to your terraform state file when you run the driftctl scan command and you will get a detailed report on whether you have drifted. Given the low effort required to adopt this library and the value provided, we think it deserves taking it for a spin.
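
For illustration, the invocation looks roughly like this (bucket and key names are made up):

# scan against a local state file...
driftctl scan --from tfstate://terraform.tfstate

# ...or against state stored in an S3 backend
driftctl scan --from tfstate+s3://my-tfstate-bucket/path/to/terraform.tfstate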

Next on our list is Sentinel, the enterprise solution from Hashicorp (the company behind terraform). This could make sense if you are already using other Hashicorp enterprise functionality, benefiting from Terraform Enterprise.

A directly comparable open source alternative would be terraform-compliance. It follows BDD directives so that you can specify your expectations in an easily human-readable way, using:

  • given: a given resource type;
  • when: an optional condition you might want to add;
  • then: what your expectation is;

Here is an example of a test file for AWS S3 buckets:

Feature: Buckets config

  Scenario: encryption at rest
    Given I have AWS S3 Bucket defined
    Then encryption at rest must be enabled

  Scenario: resources are tagged
    Given I have AWS S3 Bucket defined
    Then it must contain tags
    And its value must not be null

  Scenario Outline: tags convention
    Given I have AWS S3 Bucket defined
    Then it must contain tags
    And its value must match the "<value>" regex

    Examples:
      | tags        | value      |
      | Name        | terraform  |
      | Environment | compliance |
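
A hedged note on how this is typically run: terraform-compliance executes against a plan file, along the lines of (paths are illustrative):

terraform plan -out=plan.out
terraform-compliance -p plan.out -f ./compliance-features/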

Although we do get the appeal of easily understandable tests that do not require knowing how to code, we find that terraform-compliance lacks the flexibility to test various aspects, mainly due to its BDD nature. Moreover, other people have been disenchanted by it, with some valid points.

If you like terraform-compliance, Conftest might also be worth a look. It has its own DSL to write policies, and allows you to test multiple frameworks. We found this blog post from Lennard Eijsackers very informative, and would rather recommend you check it out.

Before we dive into the cloud providers' own compliance checking services, we want to highlight yet another open source tool, namely InSpec. It allows you to write tests in ruby, and was built on top of RSpec. If you already know awspec, then this should feel very similar, with the advantage that InSpec also supports GCP and Azure.

Even though neither of us is actively a ruby developer, we find InSpec very easy to get started with and, most importantly, very powerful. It allows you to combine IaC checks that target resources deployed and mapped in terraform state files with more general policies for your cloud account configuration. Moreover, it also allows you to add further security checks, such as on OS-level configurations and running services. Let us illustrate how you could write these checks. The following example shows how to define global policies that an account should obey:

# encoding: utf-8
# copyright: 2021, mk::labs

title 'general AWS IAM account best practices'

control 'All human users should have MFA enabled' do
  impact 0.7
  title 'Ensure all human users have MFA enabled'
  desc 'Ensure all human users have MFA enabled'
  tag "severity": 'high'
  tag "check": "Review your AWS console and note if any IAM users do not have an MFA device enabled"
  tag "fix": "Contact the relevant user(s) so that they activate their MFA device"

  exception_users_list = input('exception_users_list')

  aws_iam_users.usernames.each do |user|
    next if exception_users_list.include?(user)

    describe aws_iam_user(user) do
      it { should have_mfa_enabled }
    end
  end

  describe aws_iam_root_user do
    it { should have_mfa_enabled }
  end
end

control 'Account should have a strong password policy set' do
  impact 0.7
  title 'Ensure that a password policy is set up'
  desc 'Ensure that a password policy is set up'
  tag "severity": 'high'
  tag "check": "Review your AWS console and, in the IAM section, check if there is a password policy set"
  tag "fix": "Configure an account password policy"

  MIN_PASSWORD_LENGTH = input('min_password_length', value: 8)

  describe aws_iam_password_policy do
    it { should exist }
    it { should require_uppercase_characters }
    it { should require_lowercase_characters }
    it { should require_numbers }
    it { should require_symbols }
    its('minimum_password_length') { should be >= MIN_PASSWORD_LENGTH }
    it { should expire_passwords }
    it { should allow_users_to_change_password }
    it { should prevent_password_reuse }
  end
end

The next one illustrates generic networking checks:

# encoding: utf-8
# copyright: 2021, mk::labs

title 'Network related resources compliance checks'

control 'Security groups hardening default – port 22' do
  impact 0.7
  title 'Ensure default security groups do not allow port 22'
  desc 'Ensure default security groups do not allow port 22'
  tag "severity": 'high'
  tag "check": "Review your AWS console in the EC2 menu, and check your security groups' rules"
  tag "fix": "Ideally fix this in your Infrastructure-as-Code (such as terraform/cloudformation/etc)"

  aws_vpcs.vpc_ids.each do |vpc_id|
    describe aws_security_group(vpc_id: vpc_id, group_name: 'default') do
      it { should_not allow_in(port: 22) }
    end
  end
end

control 'Security groups hardening – default ingress open to everything' do
  impact 0.7
  title 'Ensure default security groups do not allow 0.0.0.0/0'
  desc 'Ensure default security groups do not allow 0.0.0.0/0'
  tag "severity": 'high'
  tag "check": "Review your AWS console in the EC2 menu, and check your security groups' rules"
  tag "fix": "Ideally fix this in your Infrastructure-as-Code (such as terraform/cloudformation/etc)"

  aws_vpcs.vpc_ids.each do |vpc_id|
    describe aws_security_group(vpc_id: vpc_id, group_name: 'default') do
      it { should_not allow_in(ipv4_range: '0.0.0.0/0') }
    end
  end
end

control 'Security groups hardening – do not allow FTP ingress' do
  impact 0.8
  title 'Ensure AWS Security Groups disallow FTP ingress from 0.0.0.0/0.'

  aws_vpcs.vpc_ids.each do |vpc_id|
    describe aws_security_group(vpc_id: vpc_id) do
      it { should_not allow_in(ipv4_range: '0.0.0.0/0', port: 21) }
    end
  end
end

control 'Security groups hardening – do not allow access to postgres from everywhere' do
  impact 0.8
  title 'Ensure AWS Security Groups disallow postgres ingress from 0.0.0.0/0.'

  aws_vpcs.vpc_ids.each do |vpc_id|
    describe aws_security_group(vpc_id: vpc_id) do
      it { should_not allow_in(ipv4_range: '0.0.0.0/0', port: 5432) }
    end
  end
end

control 'Consistency in VPCs config – DHCP option set' do
  impact 0.4
  title 'Ensure all VPCs use the same DHCP option set'
  desc 'Ensure all VPCs use the same DHCP option set'
  tag "severity": 'high'
  tag "check": "Review your AWS console and check your VPCs' state"
  tag "fix": "Ideally fix this in your Infrastructure-as-Code (such as terraform/cloudformation/etc)"

  describe aws_vpcs.where { dhcp_options_id != 'dopt-12345678' } do
    it { should_not exist }
  end
end

control 'Consistency in VPCs config – state and tenancy' do
  impact 0.4
  title 'Ensure VPCs have the correct state'
  desc 'Ensure VPCs have the correct state'
  tag "severity": 'high'
  tag "check": "Review your AWS console and check your VPCs' state"
  tag "fix": "Ideally fix this in your Infrastructure-as-Code (such as terraform/cloudformation/etc)"

  vpc_instance_tenancy = input('vpc_instance_tenancy', value: 'default')

  aws_vpcs.vpc_ids.each do |vpc_id|
    describe aws_vpc(vpc_id: vpc_id) do
      its('state') { should eq 'available' }
      its('instance_tenancy') { should eq vpc_instance_tenancy }
    end
  end
end

The previous examples only illustrate generic control checks that validate overall best practices in your account. However, you might also want to perform assertions regarding what is in your terraform state versus what is deployed – what the cloudskiff engineers rightly call drift. Christoph Hartmann – one of InSpec's creators – has a nice blog post explaining how to use InSpec and integrate it with terraform. The approach is essentially the rogue approach described previously – import the terraform state JSON file and use it as the expected value in your assertions.
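
As a hedged pointer on execution: profiles like the ones above are typically run against an AWS account along these lines (region and input file are illustrative):

# from the directory containing the InSpec profile
inspec exec . -t aws://eu-west-1 --input-file inputs.yml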

Built-in cloud provider policy tools

Each cloud provider has native tooling to address company-wide governance policies. Some examples of such services are AWS Config and AWS Security Hub, Azure Policy, and Google Cloud's Security Command Center.

These are just some of the services we know of that can be used for such enforcement. Most of them go beyond just checking cloud configuration, and also provide security inspections at the instance level, for example. The important point to keep in mind is that these checks are post-deployment compliance assertions, not mechanisms for preventing configuration issues in the first place.

Final thoughts

We find that there is no single silver bullet that solves all problems, and the best strategy is actually a combination of multiple tactics employed at different stages. The two main phases that require different approaches are pre- and post-deployment.

For pre-deployment, a combination of static checks and actual tests can be used. Static checks are a great starter, as they are easy to set up and allow you to enforce generally good practices across the whole organization. In other words, static checks provide a powerful easy win. With terratest, on the other hand, you mainly gain confidence that the IaC code will actually work, and that you will catch faux pas. However, terratest does not come for free: it requires a learning curve with Go, along with a time investment to develop the actual tests.

After your code gets deployed comes the next challenge: making sure things stay well architected. Regularly scheduled checks that assess whether your infrastructure is running as intended should also be part of your DevOps strategy, and for this two main routes exist: open source tooling or dedicated cloud services.

And that is it. Thank you for reading, we hope this has been useful. Feel free to reach out to us if you have questions or suggestions.

Once again, you can find all the code supporting this post here.

Sources

As usual, here is the summary of sources used for this post:

Setting up parallel Spark2 installation with Cloudera and Jupyter Notebooks

Yes, this topic is far from new material. Especially if you consider the speed at which the cloud tech stack evolves, it has been a long time since Apache Spark version 2 was introduced (26-07-2016, to be more precise). But moving to the cloud is not an easy solution for all companies, where data volumes can make such a move prohibitive. And in on-premises contexts, the speed of operational change is significantly slower.

This post summarizes the steps for deploying Apache Spark 2 alongside Spark 1 with Cloudera, and installing Python Jupyter Notebooks that can switch between Spark versions via kernels. Given that this is a very frequent setup in big data environments, I thought I would make life easier for “on-premises engineers” and, hopefully, speed things up just a little bit.

Continue reading “Setting up parallel Spark2 installation with Cloudera and Jupyter Notebooks”

Check out video Understanding Using Temporal Cycle-Consistency Learning

It has been a long time since I last blogged. João and I have been busy getting things ready to launch a new project to streamline multi-cloud management that we have been working on for the last couple of months. We will talk about it in more detail soon, so stay tuned.

Meanwhile, I just wanted to share this very interesting post from the Google AI blog – video understanding using temporal cycle-consistency learning – where they propose a self-supervised learning method to classify different actions, postures, etc. in videos.

Continue reading “Check out video Understanding Using Temporal Cycle-Consistency Learning”

Keeping AWS costs low with AWS Nuke

A common pattern in several companies using AWS services is having several distinct AWS accounts, partitioned not only by teams, but also by environments, such as development, staging and production.

This can very easily explode your budget with unutilized resources. A classic example occurs when automated pipelines – think of terraform apply, CI/CD procedures, etc. – fail or time out, and all the resources created in the meantime are left behind.

Another frequent example happens in companies that recently moved to the cloud. They create accounts for the sole purpose of familiarizing and educating developers on AWS and doing quick and dirty experiments. Understandably, after clicking around and creating multiple resources, it becomes hard to track exactly what was instantiated, and so unused zombie resources are left lingering around.

Continue reading “Keeping AWS costs low with AWS Nuke”

Integrating IAM user/roles with EKS

To be completely honest, this article spawned out of some troubleshooting frustration. Hopefully it will save others some headaches.

The scenario: after having configured an EKS cluster, I wanted to provide permissions to more IAM users. After creating a new IAM user which belonged to the intended target IAM groups, the following errors were thrown in the CLI:

kubectl get svc
error: the server doesn't have a resource type "svc"
kubectl get nodes
error: You must be logged in to the server (Unauthorized)

 

AWS profile config

First configure your local AWS profile. This is also useful if you want to test for different users and roles.

 

# aws configure --profile <profile-name>
# for example:
aws configure --profile dev

If this is your first time, this will generate two files:

~/.aws/config and ~/.aws/credentials

If the files already exist, it will simply append to them, which means that you can obviously also edit them manually if you prefer. The way you alternate between these profiles in the CLI is:

# export AWS_PROFILE=<profile-name>
# for example:
export AWS_PROFILE=dev

Now before you move on to the next section, validate that you are referencing the correct user or role in your local aws configuration:

# aws --profile <profile-name> sts get-caller-identity
# for example:
aws --profile dev sts get-caller-identity

{
"Account": "REDACTED",
"UserId": "REDACTED",
"Arn": "arn:aws:iam:::user/john.doe"
}

 

Validate AWS permissions

Validate that your user has the correct permissions; at a minimum, you should be able to describe the target cluster:

# aws eks describe-cluster --name=<cluster-name>
# for example:
aws eks describe-cluster --name=eks-dev

Add IAM users/roles to cluster config

If you managed to add worker nodes to your EKS cluster, then this part should already be familiar, as the AWS documentation describes applying the aws-auth ConfigMap to the cluster:

kubectl apply -f aws-auth-cm.yaml
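
For reference, a hedged sketch of what such a ConfigMap looks like when also mapping an IAM user (ARNs are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::<account-id>:role/<node-instance-role>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
  mapUsers: |
    - userarn: arn:aws:iam::<account-id>:user/john.doe
      username: john.doe
      groups:
        - system:masters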

While troubleshooting I saw some people trying to use the cluster's role in the “-r” part. However, you cannot assume a role used by the cluster, as this is a role reserved/trusted for instances. You need to create your own role, add the root account as a trusted entity, and add permission for the user/group to assume it, for example as follows:

{
  "Version": "2012-10-17",
  "Statement": [
     {
        "Effect": "Allow",
        "Principal": {
             "Service": "eks.amazonaws.com",
             "AWS": "arn:aws:iam:::user/john.doe"
         },
        "Action": "sts:AssumeRole"
    }
   ]
}

 

Kubernetes local config

Then, generate a new kube configuration file. Note that the following command will create a new file in ~/.kube/config:

aws --profile=dev eks update-kubeconfig --name eks-dev

AWS suggests isolating your configuration in a file named “config-<cluster-name>”. So, assuming our cluster name is “eks-dev”, then:

export KUBECONFIG=~/.kube/config-eks-dev
aws --profile=dev eks update-kubeconfig --name eks-dev

 

This will then create the config file in ~/.kube/config-eks-dev rather than ~/.kube/config.

As described in AWS documentation, your kube configuration should be something similar to the following:


apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <certificateAuthority.data from describe-cluster>
    server: <endpoint from describe-cluster>
  name: <cluster-name>
contexts:
- context:
    cluster: <cluster-name>
    user: aws
  name: aws
current-context: aws
kind: Config
preferences: {}
users:
- name: aws
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: heptio-authenticator-aws
      args:
        - token
        - -i
        - <cluster-name>
      env:
        - name: AWS_PROFILE
          value: <profile-name>

If you want to make sure you are using the correct configuration:

export KUBECONFIG=~/.kube/config-eks-dev
kubectl config current-context

This will print whatever alias you gave in the config file.

Last but not least, update the new config file and add the profile used.

The last step is to confirm you have permissions:

export KUBECONFIG=~/.kube/config-eks-dev
kubectl auth can-i get pods
# Ideally you get "yes" as the answer.
kubectl get svc

 

Troubleshooting

To make sure you are not working in an environment with hidden environment variables that you are not aware of and that may conflict, unset them as follows:

unset AWS_ACCESS_KEY_ID && unset AWS_SECRET_ACCESS_KEY && unset AWS_SESSION_TOKEN

Also, if you are getting an error like the following:

could not get token: AccessDenied: User arn:aws:iam:::user/john.doe is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam:::user/john.doe

then it means you are specifying the -r flag in your kube config file. This flag should only be used for roles.

Hopefully this short article was enough to unblock you, but in case it was not, here is a collection of further potentially useful articles:

Getting Started with Spark (part 4) – Unit Testing

Alright, quite a while ago (already counting years), I published a tutorial series focused on helping people get started with Spark. Here is an outline of the previous posts:

In the meanwhile Spark has not decreased in popularity, so I thought I would continue updating the same series. In this post we cover an essential part of any ETL project, namely unit testing.

For that I created a sample repository, which is meant to serve as boilerplate code for any new Python Spark project.

Continue reading “Getting Started with Spark (part 4) – Unit Testing”

Decrypting correctly parameters from AWS SSM

Today's post is yet another short one, but ideally it will save a whole lot of headaches for some people.

Scenario: you have stored the contents of a string using the AWS SSM parameter store (side note: if you are not using it yet, you should definitely have a look), but when retrieving it decrypted via the CLI, you notice that the string has new lines ('\n') substituted by spaces (' ').
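
For reference, the CLI retrieval that exhibits this behavior is along these lines (the parameter name is illustrative):

aws ssm get-parameter --name "some-secret-name" --with-decryption --query "Parameter.Value" --output text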

In my case, I was storing an encrypted private SSH key to integrate with some Ansible scripts triggered via AWS CodePipeline + CodeBuild. CodeBuild makes it really easy to access secrets stored in the SSM store; however, it was retrieving my key incorrectly, which in turn domino-crashed my Ansible scripts.

Here you can also confirm that more people are facing this issue. After following the suggestion of using the AWS SDK – in my case with python's boto3 – it finally worked. So here is a gist that overwrites an AWS SSM parameter, and then retrieves it back:


import boto3

my_string = """
your string \n separated \n by \n new \n lines.
"""
account_id = '12345678910'
region = 'eu-west-1'
parameter_name = 'some-secret-name'
key_id = 'your-key-id'
kms_key_id = 'arn:aws:kms:{region}:{account_id}:key/{key_id}'.format(
    region=region, account_id=account_id, key_id=key_id)

ssm = boto3.client('ssm')

# overwrite (or create) the parameter as an encrypted SecureString
response = ssm.put_parameter(
    Name=parameter_name,
    Description='My encrypted secret blob',
    Value=my_string,
    Type='SecureString',
    KeyId=kms_key_id,
    Overwrite=True,
)

# retrieve it back, decrypted
response = ssm.get_parameter(
    Name=parameter_name,
    WithDecryption=True
)
print(response.get('Parameter', {}).get('Value'))

Hope this helps!

Container orchestration in AWS: comparing ECS, Fargate and EKS

Before rushing into the new cool kid, namely AWS EKS, AWS's hosted Kubernetes offering, you might want to understand how it works underneath and how it compares to the already existing offerings. In this post we focus on distinguishing between the different AWS container orchestration solutions out there, namely AWS ECS, Fargate and EKS, as well as comparing their pros and cons.

Introduction

Before we dive into comparisons, let us summarize what each product is about.

ECS was the first container orchestration offering from AWS. It essentially consists of EC2 instances which have docker already installed and run a Docker container that talks to the AWS backend. Via the ECS service you can launch either tasks – unmonitored containers usually suited for short-lived operations – or services – containers which AWS monitors and guarantees to restart if they go down for any reason. Compared to Kubernetes it is quite a bit simpler, which has advantages and disadvantages.

Fargate was the second service offering to come, and is intended to abstract everything below the container (the EC2 instances where it runs) away from you. In other words, a pure Container-as-a-Service, where you do not care where that container runs. Fargate built on two core technical advancements made in ECS: the possibility to assign an ENI directly and dedicated to a container, and the integration of IAM at the container level. We will get into more detail on this later.

The following image, sourced from the AWS blog here, illustrates the difference between the ECS and Fargate services.


EKS is the latest offering, and is still only available in some regions. With EKS you can abstract away some of the complexities of launching a Kubernetes cluster, since AWS now manages the master nodes – the control plane. Kubernetes is a much richer container orchestrator, providing features such as network overlays, allowing you to isolate container communication, and storage provisioning. Needless to say, it is also much more complex to manage, with a bigger investment in terms of DevOps effort.

Like plain Kubernetes, you can also use kubectl to communicate with an EKS cluster. You will need to configure the AWS IAM authenticator locally to communicate with your EKS cluster.

Continue reading “Container orchestration in AWS: comparing ECS, Fargate and EKS”

Versioning in data projects

Reproducibility is a pillar of science, and version control via git has been a blessing to it. For pure software engineering it works perfectly. However, machine learning projects are not only about code, but also about the data. The same model trained with two distinct data sets can produce completely different results.

So it comes as no surprise when I stumble upon csv files in the git repos of data teams, as they struggle to keep track of code and metadata. However, this cannot be done for today's enormous datasets. I have seen several hacks to solve this problem, none of them bulletproof. This post is not about those hacks, but rather about an open source solution for it: DVC.


Let us exemplify by using a kaggle challenge: predicting house prices with advanced regression techniques.

Continue reading “Versioning in data projects”

AWS Server-less data pipelines with Terraform to Redshift – Part 2

Alright, it is time for the second post of our series focusing on AWS options to set up pipelines in a server-less fashion. The topics that we are covering throughout this series are:

In this post we complement the previous one by providing infrastructure-as-code with Terraform for deployment purposes. We are strong believers in applying a DevOps approach to Data Engineering as well, also known as “DataOps”. Thus we thought it would make perfect sense to share a sample Terraform module along with the Python code.

To recap, so far we have Python code that, if triggered by an AWS event on a new S3 object, will connect to Redshift and issue a SQL COPY command statement to load that data into a given table. Next we are going to show how to configure this with Terraform code.

As usual, all the code for this post is available publicly in this github repository. In case you have not yet, you will need to install terraform in order to follow along with this post.

Continue reading “AWS Server-less data pipelines with Terraform to Redshift – Part 2”