Shoot for the moon with spot instances in AWS
As part of the DevOps engineering team at Transifex, we always strive for the right balance of resources between our infrastructure and our newest services. A few months ago, we introduced horizontal pod autoscaling to our Kubernetes cluster, which enabled our pods to scale better on application spikes.
It also gave us a good overview of which services and tasks get the most out of autoscaling, and a good indication of our next bottleneck. Autoscaling an application’s resources is great if you have enough nodes to handle all the additional traffic. Kubernetes supports node autoscaling out of the box; however, the cost can quickly get out of hand unless you introduce spot instances to the cluster, which bring it down a lot.
Our end goal? Improve our delivery times and boost our performance while reducing our infrastructure costs.
An introduction to spot instances
Spot instances in AWS (preemptible instances in Google Cloud) are unused instances you can bid on and use at steep discounts. The caveat? AWS will claim an instance back when a customer requests that capacity through the usual on-demand methods, giving you a two-minute termination warning to move your load to another node. Due to their nature, spot instances are better suited for stateless, fault-tolerant workloads. In our case, they are great for all our Python tasks, which Celery distributes through a RabbitMQ cluster.
An analytical view of what we are trying to achieve
Until now, we were using a fleet of M5 EC2 reserved instances for our workloads. We often found ourselves adding more instances, especially when we had to deploy new services or add new features to existing ones. Even with the increasing number of nodes, though, there were cases where we couldn’t scale fast enough. We were missing data to handle corner cases where a traffic spike could generate many thousands of tasks across a dozen different queues. In cases like these, the number of nodes was our bottleneck, preventing us from consuming everything promptly. Because we couldn’t add nodes indefinitely, we missed the data we needed to analyze our problem and provide a solution.
For instance, it was very difficult to answer questions like the ones below:
- How many more pods / nodes would we need to reduce the time spent consuming all the tasks in queue 1 on a spike?
- It seems, from our data, that horizontal pod autoscaling (HPA) on queue 2 triggers an HPA on queues 3 and 4. Do we have enough horsepower to handle a spike?
- A bug in service A decreased its performance and needs 1GB of additional memory on its pods. Until it gets fixed, can we still support 10 pods through its HPA?
This list can go on and on, and it gets incrementally harder to answer these questions as you add new services and extend the features of existing ones. This is where adding more nodes seems like a good idea, but more nodes only give you more room to expand; they tell you nothing about your performance and actual traffic.
A glimpse into our solution
So we decided to use AWS’s spot instances. Implementing our solution is easy in Terraform because we already use EKS’s Terraform module. We only have to create an additional worker group, tag it to include only spot instances, and set a bid price high enough to ensure we will always get an instance. Contrary to our reserved instances worker group, this time we want EKS to autoscale the nodes, so we confidently define a max size that allows us to run our scalability tests.
A small part of our Terragrunt code. We request many instance types of the same family to ensure one will always be available in a given region.
How does autoscaling work?
When HPA decides that more pods of a deployment should be created, it notifies the Kubernetes API. The API “creates” those pod objects, but the scheduler is responsible for placing them on a node. If no node has the free resources for them (memory and CPU, based on requests and limits), the pods are marked as pending.
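The scheduling step above can be sketched in a few lines of Python. This is an illustrative model, not the real scheduler: nodes and pods are plain dicts, and the fit check only covers CPU and memory requests.

```python
# Minimal sketch of the scheduler's fit check that leaves pods Pending.
# All field names and numbers are illustrative, not the real scheduler code.

def fits(node, pod):
    """A pod fits if the node's free CPU/memory cover the pod's requests."""
    return (node["cpu_free"] >= pod["cpu_request"]
            and node["mem_free"] >= pod["mem_request"])

def schedule(nodes, pods):
    """Place each pod on the first node with room; otherwise mark it Pending."""
    pending = []
    for pod in pods:
        for node in nodes:
            if fits(node, pod):
                node["cpu_free"] -= pod["cpu_request"]
                node["mem_free"] -= pod["mem_request"]
                pod["node"] = node["name"]
                break
        else:
            pending.append(pod)  # no node had room: this is the autoscaler's cue
    return pending
```

A pod that ends up in `pending` is exactly what the cluster autoscaler watches for in the next section.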
At this point, it is time to introduce the Kubernetes cluster autoscaler. The autoscaler looks for pods in a pending state, then adds nodes to the cluster to move these pods into. As only our spot instance worker group has the flexibility to scale, we want to ensure only tasks will fit onto our newly introduced nodes, as our main web component will remain on the reserved instances for now.
What about the two-minute termination time?
This is tricky. We make sure we only add fault-tolerant workloads to these instances, so if a task takes longer than the two-minute warning, it will fail and be rescheduled to another node. We have configured Celery to acknowledge a task only when it finishes; this ensures that if a task gets interrupted, it will get picked up by another worker.
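The relevant Celery settings look roughly like the sketch below. The setting names are real Celery configuration keys; the exact combination and values here are our assumption of a sensible spot-friendly setup, and in a real app they would go on `app.conf` rather than a bare dict.

```python
# Celery settings that make tasks safe to run on reclaimable spot nodes.
# Sketched as a plain mapping; apply them via app.conf.update(...) in practice.
CELERY_SPOT_SAFE_SETTINGS = {
    # Acknowledge the message only after the task finishes, so a task
    # killed mid-run by a spot termination goes back to the queue.
    "task_acks_late": True,
    # Also requeue tasks whose worker process dies outright
    # (e.g. the node is reclaimed before the task completes).
    "task_reject_on_worker_lost": True,
    # Don't prefetch a pile of messages onto a worker that may
    # disappear within two minutes.
    "worker_prefetch_multiplier": 1,
}
```

Note that late acknowledgement means a task can run twice (once interrupted, once replayed), so tasks routed to spot nodes should be idempotent.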
The missing piece to the puzzle is the node termination handler that lives inside our cluster and listens for these AWS events. When AWS sends the signal about the coming termination of a spot instance, the following take place:
- The termination handler receives the event.
- It sends a request to the Kubernetes API to cordon and drain the node in question. No new pods can be scheduled on it from then on.
- The autoscaler acknowledges the pending pods and sends a request for a new spot instance if there is no space in the existing nodes.
- Within a few minutes, the pending pods are scheduled onto the new node.
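The first two steps of that flow can be sketched as a small polling loop. The metadata URL below is the real EC2 spot interruption endpoint; the `cordon`/`drain` callables stand in for requests to the Kubernetes API (e.g. what `kubectl cordon` and `kubectl drain` do) and are illustrative.

```python
# Sketch of a termination handler's core logic. The metadata path is the
# real EC2 spot interruption endpoint; cordon/drain are illustrative stubs
# for the corresponding Kubernetes API calls.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def handle_termination(fetch, cordon, drain, node_name):
    """Check the instance metadata endpoint; on a termination notice,
    cordon the node (no new pods) and drain it (evict running pods)."""
    notice = fetch(SPOT_ACTION_URL)  # None until AWS schedules a termination
    if notice is None:
        return False
    cordon(node_name)  # e.g. kubectl cordon <node>
    drain(node_name)   # e.g. kubectl drain --ignore-daemonsets <node>
    return True
```

In the real cluster this runs as a DaemonSet-style handler; injecting `fetch`, `cordon`, and `drain` just keeps the sketch self-contained and testable.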
From our tests, the two minutes AWS gives are enough for a new node to be provisioned (it needs a couple more minutes to become ready), so we expect no delays in our workloads.
One final thing to implement is our cluster’s ability to scale down when HPA levels are back to normal and nodes have plenty of resources available. The autoscaler comes to the rescue once again: its configuration will take down a node if it uses less than 50% of its CPU resources and its pods can be rescheduled to another node.
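The scale-down test reduces to two conditions, which a short sketch makes explicit. The 50% threshold matches our configuration above; the field names are illustrative.

```python
# Illustrative sketch of the autoscaler's scale-down decision: a node is a
# removal candidate when its requested CPU sits under the utilization
# threshold AND every pod on it can be moved to another node.

def scale_down_candidate(node, threshold=0.5):
    utilization = node["cpu_requested"] / node["cpu_allocatable"]
    return (utilization < threshold
            and all(pod["reschedulable"] for pod in node["pods"]))
```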
With the above setup in place, the only thing left is the mechanism that places certain pods on certain nodes, and this is where taints, tolerations, and node affinity come into play. Node taints prevent a pod from being scheduled on a node unless the pod has the matching tolerations. Similarly, using node affinity, pods can be scheduled only to nodes with the correct labels. Combining the two ensures pods are placed on spot instances only when we want them to be.
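The both-directions logic can be sketched as two checks. The taint key, label, and pod fields below are illustrative, not our actual manifests; the point is that a pod lands on a spot node only if it tolerates the spot taint and also selects spot nodes by label.

```python
# Sketch of taint/toleration and label matching (illustrative keys/labels).
# A taint repels pods that lack a matching toleration; a node selector
# (the simplest form of node affinity) pins pods to matching labels.
SPOT_TAINT = {"key": "lifecycle", "value": "spot", "effect": "NoSchedule"}

def tolerates(pod, taint):
    """Does any of the pod's tolerations match the taint's key/value?"""
    return any(t["key"] == taint["key"] and t.get("value") == taint["value"]
               for t in pod.get("tolerations", []))

def selects(pod, node_labels):
    """Do the node's labels satisfy every entry in the pod's node selector?"""
    return all(node_labels.get(k) == v
               for k, v in pod.get("node_selector", {}).items())

def schedulable_on_spot(pod, node_labels):
    # Both must hold: the pod tolerates the spot taint AND actively
    # selects spot nodes via their labels.
    return tolerates(pod, SPOT_TAINT) and selects(pod, node_labels)
```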
Scale with data in mind
We already have an extensive monitoring suite using Grafana and Prometheus, so we only had to build on top of that. The idea is to give the cluster the ability to scale a lot, let HPA handle all the new pods, and measure how many resources we need to successfully deliver our SLAs.
Testing our Grafana dashboard in our demo cluster. These graphs will give us a great insight into our workload and help us plan for the future.
We have an on-demand fillup queue, which manually runs fillups on a customer project’s untranslated strings. Depending on the size of the project, it can take from 20′ to several hours. This is a one-off operation and can be very resource-heavy, so we couldn’t scale as much as we wanted with the previous setup: it was unpredictable, it could still take hours to complete, and we would have fewer resources to spend on other critical tasks.
Using spot instances, resources are no longer the limitation. The screenshot below shows the time spent using 20 pods instead of 5 in an extreme-case scenario: it drops to 45′ for a job that would otherwise take ~5h.
Another major benefit of implementing spot instances is the ability to use multiple EC2 instance types. Until now, we were using general-purpose M5 instances, but we knew certain workloads would benefit if we switched to compute-optimized ones.
With the new implementation, testing our theory is very simple: add another worker group using C5 instance types, point certain tasks to them, and measure their performance. Our tests indicate up to a 30% boost on certain tasks, which is great!
Now that the implementation is in place comes the difficult part: monitoring, comparing data, and iterating. Spot instances give us great flexibility to play around with other EC2 instance types, take better care of our reserved ones (now that they have room to breathe), and educate our engineers on choosing, testing, and monitoring the best instances for their workloads.