Create an AWS Auto Scaling Group with Terraform that scales based on Ubuntu memory utilization

January 25, 2016

In this post we will show you how to use Terraform to spin up an AWS Auto Scaling Group that uses instance memory utilization as the trigger for adding and removing instances from the group. We use Ubuntu 14.04 (trusty) as our OS of choice. Hopefully some of you find this useful; we could not find all of this information put together in one easy-to-follow place.

Background

At Testable our focus is on making it easy for our clients to quickly spin up large performance tests and execute them across a global set of stateless agents. Our cloud provider is AWS and we run a set of Docker containers on EC2 instances running Ubuntu 14.04 (trusty). In each location where tests can execute we have a set of agents running. Each agent works roughly as follows:

  • Connect to the central coordination service (load balanced via an ELB of course)
  • Perform a registration/handshake process
  • Listen for new “work” where “work” is defined as some part of a test to execute
  • Upon receiving new “work”, execute it and report back results/errors/logging

The agents are completely stateless and it is only the central coordination service that maintains any state about test progress, how many agents are running, etc.

This is a classic use case for an Auto Scaling Group! When demand is low we do not need many instances of the agent process running in each location. When a client comes with a large test or many clients come at the same time, we want to dynamically spin up more agents and then destroy them afterwards when they are no longer needed.

If you are unfamiliar with Auto Scaling Groups, try reading this introduction and information on dynamic scaling first.

The image above is a simple diagram of what we are aiming to set up. Let’s go through it step by step. We assume you already have an AWS account and have installed Terraform on your computer.

Build our AMI

We need to create an AMI to use in our Auto Scaling Group launch configuration. This instance will run Ubuntu 14.04 and be configured to report memory (and disk) utilization.

  • Launch a new EC2 instance and select Ubuntu 14.04 as your operating system.
  • Once the instance is running, connect via SSH
  • Install a monitoring script that reports memory utilization as a CloudWatch metric. Amazon provides monitoring scripts that work on Ubuntu. We have created an extension that also captures disk inode utilization (source code and binaries). We will use our extension for the purposes of this post, but memory utilization reporting works the same either way.
sudo apt-get install unzip
wget https://s3.amazonaws.com/testable-scripts/AwsScriptsMon-0.0.1.zip
unzip AwsScriptsMon-0.0.1.zip
rm AwsScriptsMon-0.0.1.zip

We now have the monitoring scripts installed on our instance; let’s try them out once:

cd aws-scripts-mon
./mon-put-instance-data.pl --verify --verbose --auto-scaling --mem-util --aws-access-key-id=[access-key-here] --aws-secret-key=[secret-key-here]
  • Add the monitoring script to crontab. If the trial run above succeeds, we can schedule the script via cron to run every 5 minutes and report memory utilization as a CloudWatch metric:
crontab -l | { cat; echo "*/5 * * * * ~/aws-scripts-mon/mon-put-instance-data.pl --from-cron --auto-scaling --mem-util --aws-access-key-id=[access-key-here] --aws-secret-key=[secret-key-here]"; } | crontab - 

Feel free to add other metrics like disk utilization, disk inode utilization, etc to improve your general instance monitoring.

NOTE: This script caches the instance id locally (6 hour TTL default, 24 hours for auto scaling groups). This means that new instances launched with this image will report metrics for the wrong instance id for up to 24 hours. To get around this we use user_data to delete the cached instance id during instance creation. See the Terraform code section for more details.

  • Install any other software required for your instance
  • In the AWS console, right-click your instance -> Image -> Create Image. Note the AMI ID of your image for the steps below.

Terraform Code

Now that we have the image and the metrics let’s set everything else up using Terraform. Let’s go through it part by part.

provider "aws" {
    access_key = "${var.access_key}"
    secret_key = "${var.secret_key}"
    region = "us-east-1"
}

This first part simply initializes the AWS provider with our access key, secret, and region.
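
The access_key and secret_key here, along with the ami and instance_type values used below, are Terraform variables, so they need to be declared somewhere in your configuration. A minimal sketch of what those declarations might look like (the default instance type is just an example; pass the real values via terraform.tfvars or -var flags):

variable "access_key" {}
variable "secret_key" {}
variable "ami" {}
variable "instance_type" {
    default = "t2.micro"
}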

resource "aws_launch_configuration" "agent-lc" {
    name_prefix = "agent-lc-"
    image_id = "${var.ami}"
    instance_type = "${var.instance_type}"
    user_data = "${file("init-agent-instance.sh")}"

    lifecycle {
        create_before_destroy = true
    }

    root_block_device {
        volume_type = "gp2"
        volume_size = "50"
    }
}

The launch configuration looks very similar to an instance configuration. See the documentation for the full set of options (e.g. availability zones, security groups, VPC settings, etc.). The above is just a simple example. The image_id should be the AMI ID you noted earlier.
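
For example, if your agents need to join a specific security group or be reachable over SSH, you might add something like the following inside the launch configuration block (agent-sg and agent-key are hypothetical names, not resources defined in this post):

    # agent-sg is a hypothetical security group resource, agent-key a hypothetical EC2 key pair
    security_groups = ["${aws_security_group.agent-sg.id}"]
    key_name = "agent-key"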

Note the user_data script. This script, as mentioned earlier, deletes the cached instance id used by the monitoring scripts to ensure that we report metrics using the right instance id from the moment the instance launches.

Contents of init-agent-instance.sh:

#!/bin/bash
rm -Rf /var/tmp/aws-mon

Next we configure the actual auto scaling group.

resource "aws_autoscaling_group" "agents" {
    availability_zones = ["us-east-1a"]
    name = "agents"
    max_size = "20"
    min_size = "1"
    health_check_grace_period = 300
    health_check_type = "EC2"
    desired_capacity = 2
    force_delete = true
    launch_configuration = "${aws_launch_configuration.agent-lc.name}"

    tag {
        key = "Name"
        value = "Agent Instance"
        propagate_at_launch = true
    }
}

This defines the group as containing between 1 and 20 instances, starting at a desired capacity of 2, and points at our earlier launch configuration for launching new instances. The tag is propagated to every instance the group launches.
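
If your instances run inside a VPC, you would typically replace availability_zones with vpc_zone_identifier inside the aws_autoscaling_group block and point it at your subnets. A sketch with placeholder subnet IDs:

    # placeholder subnet IDs - replace with your own VPC subnets
    vpc_zone_identifier = ["subnet-aaaaaaaa", "subnet-bbbbbbbb"]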

resource "aws_autoscaling_policy" "agents-scale-up" {
    name = "agents-scale-up"
    scaling_adjustment = 1
    adjustment_type = "ChangeInCapacity"
    cooldown = 300
    autoscaling_group_name = "${aws_autoscaling_group.agents.name}"
}

resource "aws_autoscaling_policy" "agents-scale-down" {
    name = "agents-scale-down"
    scaling_adjustment = -1
    adjustment_type = "ChangeInCapacity"
    cooldown = 300
    autoscaling_group_name = "${aws_autoscaling_group.agents.name}"
}

The above configures our scale-up and scale-down policies, which we will trigger with CloudWatch alarms. At this point we are only defining what each policy does: add one instance or remove one instance, with a 300 second cooldown between adjustments.

resource "aws_cloudwatch_metric_alarm" "memory-high" {
    alarm_name = "mem-util-high-agents"
    comparison_operator = "GreaterThanOrEqualToThreshold"
    evaluation_periods = "2"
    metric_name = "MemoryUtilization"
    namespace = "System/Linux"
    period = "300"
    statistic = "Average"
    threshold = "80"
    alarm_description = "This metric monitors ec2 memory for high utilization on agent hosts"
    alarm_actions = [
        "${aws_autoscaling_policy.agents-scale-up.arn}"
    ]
    dimensions {
        AutoScalingGroupName = "${aws_autoscaling_group.agents.name}"
    }
}

resource "aws_cloudwatch_metric_alarm" "memory-low" {
    alarm_name = "mem-util-low-agents"
    comparison_operator = "LessThanOrEqualToThreshold"
    evaluation_periods = "2"
    metric_name = "MemoryUtilization"
    namespace = "System/Linux"
    period = "300"
    statistic = "Average"
    threshold = "40"
    alarm_description = "This metric monitors ec2 memory for low utilization on agent hosts"
    alarm_actions = [
        "${aws_autoscaling_policy.agents-scale-down.arn}"
    ]
    dimensions {
        AutoScalingGroupName = "${aws_autoscaling_group.agents.name}"
    }
}

These create the CloudWatch metric alarms. The first triggers the scale-up policy when the group’s average memory utilization is >= 80% for two consecutive 5-minute periods. The second triggers the scale-down policy when it stays <= 40% for two consecutive 5-minute periods.
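
Optionally, you can also declare outputs so Terraform prints the names of the created resources after each apply. A small sketch:

output "agents_asg_name" {
    value = "${aws_autoscaling_group.agents.name}"
}

output "agents_launch_configuration" {
    value = "${aws_launch_configuration.agent-lc.name}"
}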

Save all of this to your main.tf file, optionally run terraform plan to review the changes, and then apply it to create the resources:

terraform apply

And that’s it. You now have an Auto Scaling Group that automatically adds and removes instances based on memory usage. It is easy to extend this setup to also monitor disk utilization, disk inode utilization, CPU, and so on.
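
As an example of such an extension, here is a sketch of an additional scale-up alarm based on the DiskSpaceUtilization metric that the monitoring scripts can publish when disk reporting is enabled. This assumes the metric is aggregated by Auto Scaling group name just like MemoryUtilization; check the dimensions the script actually reports in the CloudWatch console and adjust accordingly:

resource "aws_cloudwatch_metric_alarm" "disk-high" {
    alarm_name = "disk-util-high-agents"
    comparison_operator = "GreaterThanOrEqualToThreshold"
    evaluation_periods = "2"
    metric_name = "DiskSpaceUtilization"
    namespace = "System/Linux"
    period = "300"
    statistic = "Average"
    threshold = "80"
    alarm_description = "This metric monitors ec2 disk space for high utilization on agent hosts"
    alarm_actions = [
        "${aws_autoscaling_policy.agents-scale-up.arn}"
    ]
    dimensions {
        AutoScalingGroupName = "${aws_autoscaling_group.agents.name}"
    }
}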

Post in the comments if you have any questions/comments.
