A couple of weeks ago, we completed our first internal Bugathon, part of our ongoing quality engineering training efforts. The whole engineering department was split into three teams, each aiming to solve as many issues as they could in three hours.
Here is how Bugathon works.
Each of the three teams gets a staging environment running modified code, specifically designed to introduce bugs and performance degradation.
Every team consists of a Guardian (a senior engineer) and a DevOps member to assist them throughout the event. The Guardian’s role is similar to a game master’s, giving occasional hints and guidance to the team. The DevOps engineer, on the other hand, shadows the event to ensure that all instances work as intended.
The event focused mainly on monitoring tools and DevOps best practices, and we can safely say it was a big success. It also turned out to be a great way of gathering interesting data on how people react to various production crises.
What Led Us to This Event
At Transifex, each quarter we aim to host at least one event that is relevant to our tools and procedures. For the DevOps team, implementing a monitoring tool is not enough. Training your people around it is just as crucial!
We are aware that during a production incident (any service degradation that affects our SLAs), engineers don’t feel very confident around our tools.
Events like this motivate people to dive deeper into our monitoring stack and experiment with our tools, without the stress of a service being down or a bug affecting our performance in production.
Our Bugathon introduced 10 regression incidents and bugs across various services, at varying difficulty levels. Many of these incidents were DevOps-oriented, as our end goal was for engineers to be able to identify complex dependencies between our clusters and our monitoring stack.
Without going into too much detail, we introduced issues where:
- A wrong service configuration, or some other fine-tuning, could bring a service down
- Two or more services couldn’t authenticate or talk to each other successfully
- Pods failed because of OOM kills or liveness/readiness probe issues
- Performance degraded due to pod shortages or CPU throttling
- Random delays in a single service caused unexpected behavior in its interconnected services
- On top of that, we also introduced several coding bugs around regular expressions, Celery tasks, and database transactions
Interestingly, all of those items represent real issues that we occasionally see in our systems; some had even occurred just a week before the event.
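To give a flavor of the regular-expression category, here is a hypothetical example (not one of the actual Bugathon issues): a greedy quantifier silently matching far more than intended.

```python
import re

# Hypothetical bug: a greedy pattern meant to capture a single
# attribute value swallows everything up to the *last* closing quote.
text = 'name="alpha" lang="en"'

greedy = re.search(r'name="(.*)"', text).group(1)   # buggy
lazy = re.search(r'name="(.*?)"', text).group(1)    # intended

print(greedy)  # alpha" lang="en
print(lazy)    # alpha
```

Bugs like this pass casual testing on simple inputs and only surface once the surrounding text happens to contain a second quoted value.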
The Technology Behind It
Transifex has a bunch of beta and staging instances. Nevertheless, until recently we lacked support for multiple, easy-to-deploy staging environments.
This changed in early Q1, when we decided to upgrade our internal Slack bot (famously called tx-sentinel). After the upgrade, tx-sentinel can launch ephemeral staging environments on specific branches, on the spot, in less than 5 minutes.
Those `team-betas` have seen great daily usage so far, and without their support, we couldn’t have easily progressed on a project as unique as the Bugathon.
Additionally, the event wouldn’t have been complete without a wide range of monitoring tools to support the engineers in their investigations.
We made sure that every issue could be discovered through at least two different tools, so that people could combine information, or still find the issue in case they ignored one of the tools.
For reference, we used:
- Kibana to access pod logs and various pre-made dashboards
- Grafana with a ton of application and cluster-specific dashboards, powered by Prometheus and CloudWatch
- Sentry for application errors
- New Relic with insightful metrics for, well... everything!
- Metabase for database queries
- Kubedash for a clean UI around the cluster
- Good old terminal with read-only access to the cluster resources
It’s worth mentioning that building and integrating all these tools requires many small steps and iterations.
We are fully aware that managing the complexity of a highly scalable microservice architecture is no small feat. People need to be familiar with the monitoring tools at their disposal in order to identify any issues quickly and efficiently.
I firmly believe that the responsibilities of the DevOps team are not limited to implementing those tools. Ensuring that engineers are highly skilled with them through recurring training is another part of our job.
While gathering data was not our main focus this time around, we gained some interesting, yet not that surprising insights.
More specifically, engineers:
- Felt more comfortable staying back and following whoever was sharing their screen
- Often followed their instinct and intuition, sometimes ignoring or completely forgetting to check the appropriate monitoring tools
- Quite often assumed the worst-case scenario instead of starting with the simplest assumption. As an example, if a service was misbehaving, they would often dive deep into the code instead of investigating its configuration and settings.
- Many times jumped to conclusions without fully reading an issue’s description. Each bug was written to look like a ticket from our customer support team, with small hints hidden in the content of each one.
The event was a big success and generated very useful feedback for the next iteration. Two of the three teams worked through half of the issues, while the third cleared a whopping 8/10, leaving out only the two issues marked as very high difficulty.
Congratulations to Pablo Sanchez, Aris Katsikaridis, Konstantinos Georgakilas, and Andreas Sgouros for their coordinated effort and success!
As a next step, the engineering managers are planning to work with their teams on some of the above findings. There have also been discussions about smaller, similar events within a single team’s scope. On top of that, we’re planning another big Bugathon before the end of the year!