Agility is all about trust, prompt decisions, and fast action within self-organizing teams. What happens, though, when this flow is jeopardized by technical limitations, bureaucracy, and a cumbersome process for deploying new features?
Transifex was originally built on top of a big, fat Django application, a monolith as it is commonly called, with all developers planning and contributing to the same Git repository regardless of the team or feature they were working on.
When a company scales and teams grow, the monolith pattern stops working and waste creeps into the delivery of new features. This waste surfaced more and more at Transifex through the following observations:
- It was taking more than 30 minutes for our test suite to complete, meaning that for the pull request of feature X to be merged, developers had to wait at least half an hour for CI tests to pass before being “allowed” to merge their work into the master branch.
- If another developer merged a pull request for feature Y in the meantime, the X branch had to be rebased and the tests had to run again, introducing more and more wasted time in a vicious cycle of rebasing and waiting.
- Refactoring core parts of the monolith was dangerous and time-consuming, as spaghetti code and untested side effects were leading to bugs or even downtime.
- Upgrading core technology, such as moving to newer Django versions, was a cumbersome task that was usually put aside in favor of features that would bring more business value, and as a result… here be dragons!
- Taking the monolith to production took about an hour, making deployments a scary and infrequent event within our sprints.
The DevOps Bottleneck
Our company had a dedicated DevOps team that owned the infrastructure. All feature teams were relying on the DevOps team to manage and troubleshoot deployments, creating an impediment for releasing new features at a fast pace.
Moreover, the DevOps team was already swamped with SysOps tasks, striving to keep the platform healthy, scale it, and make it work reliably for our customers overall. This put so much pressure on the team that they could not manage the load of tasks effectively and their time was being spent more on firefighting and support and less on innovation and evolution.
This put sprints at risk and company goals at stake.
At some point, we started decoupling a few features away from the monolith using new separate services. This made things much better and paved the road for the future.
These services were satellite components around the monolith, such as a new notification system, the API, and the Transifex Live service, which teams could develop, deploy, and maintain independently of one another.
However, even those satellite services suffered from some serious pathologies. They were coupled to the monolith through the database or the task queue system, so a single change in the database schema could affect two or three components, leading to what is known as a “distributed monolith”.
The distributed monolith pattern is bad, really bad, because it makes service maintenance even harder: core changes to one service can trigger changes in other services as well. In other words, the spaghetti code problem turns into a spaghetti infrastructure problem.
In late 2018 we decided it was time to do something about it. Following trends in the space we decided to start investing in a microservices architecture with the following goals for our engineering teams:
- Empower our teams to deploy in minutes and multiple times per day.
- Give the power to teams to own their infrastructure, from development to deployment.
- Shift the focus of the DevOps team to innovation and tooling instead of maintenance and troubleshooting.
- Automate all the things!
In order to achieve those goals we groomed an “Engineering Vision” and aligned our company around it to make it happen. The outcome of this vision was a new architecture that we would follow for the development of new services while migrating/refactoring existing functionality away from the monolith.
Regarding microservices, we decided NOT to go too “micro”, as hundreds of microservices in our stack would create huge maintenance overhead for our small engineering team. Instead, we split microservices by functional domain, following some design rules:
- Microservices should be cohesive and self-contained. Each service would have its own database, queue system, and other components, totally decoupled and isolated from other services. No more spaghetti, in either code or infrastructure.
- Freedom in the technology used. Pick the right tool for the job.
- All microservices would communicate with each other over an HTTP REST API, ideally following the JSON-API specification where possible for consistency.
- Use JWT for authentication and security between microservices.
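To make the JWT rule above concrete, here is a minimal sketch of how one service could issue a signed token and another could verify it before trusting a request. This is an illustration using only the Python standard library (HMAC-SHA256, i.e. the HS256 algorithm); the service name, claims, and shared secret are hypothetical, and a real deployment would use a maintained library such as PyJWT and load secrets from a secret store.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 with the padding stripped
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(segment: str) -> bytes:
    # Restore the padding that _b64url stripped
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_jwt(payload: dict, secret: bytes) -> str:
    """Build a compact HS256 JWT: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        _b64url(json.dumps(header, separators=(",", ":")).encode()),
        _b64url(json.dumps(payload, separators=(",", ":")).encode()),
    ]
    signing_input = ".".join(segments).encode()
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    segments.append(_b64url(signature))
    return ".".join(segments)


def verify_jwt(token: str, secret: bytes) -> dict:
    """Check the signature and expiry; return the claims on success."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims


# Hypothetical shared secret and issuing service name, for illustration only
SECRET = b"change-me"
token = sign_jwt({"iss": "notifications", "exp": int(time.time()) + 300}, SECRET)
claims = verify_jwt(token, SECRET)
```

The receiving service would run `verify_jwt` on the `Authorization` header of each incoming request and reject anything with a bad signature or an expired `exp` claim, so no service has to trust network location alone.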
Managed Services & Kubernetes
Our engineering vision would never be fulfilled unless we made deploying microservices a breeze. Thus we made the decision to invest in building our infrastructure from scratch on Kubernetes.
To make this happen, we decided to invest in managed services in AWS and we started a big project so that the new infrastructure would be provisioned entirely through Terraform, following the Infrastructure as Code pattern. This way, everything would be code: provisioning a new database would be as easy as opening a pull request to a Terraform Git repository.
Investing in managed services, and more specifically AWS, allowed our DevOps team to focus on building a platform for our development teams to work on, instead of being the bottleneck for deployment and troubleshooting. That platform included all the tools to deploy, monitor, and scale our apps. Investing in managed services gave us peace of mind.
Automate All The Things
Another important pillar of our engineering vision was automation. Repetitive, manual tasks should be automated as much as possible, removing wasted time from our teams.
We invested heavily in Jenkins as our CI tool and created pipelines for all of our services – running tests, deploying releases to production, and even running database migrations through the system.
An important component of the automation process was Kubernetes orchestration through Helm. Helm hid much of the complexity of Kubernetes and heavily simplified the deployment of web services, workers, and cron jobs through reusable and maintainable code.
Agility on Steroids
The above work took a little more than a year to complete and the outcome was outstanding for the productivity of our teams. In retrospect, this is how it affected the agility of our teams:
- Deployment frequency of the monolith increased from a few times per week to a few times per day.
- Experimental features are now super easy to develop and spin up in production for testing and evaluation.
- Development teams can now play around with cool, modern, and exotic technologies, boosting the creativity, productivity, and happiness of our people.
- Development teams can now own the entire lifecycle of a feature. No more cross-team impediments.
- Provisioning of new infrastructure is now owned by the development teams, and deployment to production is just a pull request away.
- Observability tools allow teams to troubleshoot in production and even set up alerting systems to monitor the health of the application.
All in all, teams are now fully self-organized and the platform is promoting rapid development and deployment of increments, with automation in place to help, allowing teams to focus on producing customer value.
With all these components in place, our DevOps team is now putting its strength towards making the new platform even better, with more exciting stuff on the way. For us, it was a perilous and cumbersome journey, but eventually it all paid off.