A little bug bites the dust: the anatomy of fixing performance issues
We shipped another update today with two small changes, but we would like to talk about one of them in particular: the change that improved the responsiveness of a project's access control page.
A few days ago we noticed that this page was responding very slowly under certain circumstances: an authenticated user had to wait a few seconds for the page to load.
When a user visits a page, a number of database queries are executed. We tracked this issue down to one of them:
SELECT DISTINCT "projects_project"."id", "projects_project"."private",
       "projects_project"."slug", "projects_project"."name",
       "projects_project"."description", "projects_project"."long_description",
       "projects_project"."homepage", "projects_project"."feed",
       "projects_project"."bug_tracker", "projects_project"."trans_instructions",
       "projects_project"."anyone_submit", "projects_project"."hidden",
       "projects_project"."enabled", "projects_project"."created",
       "projects_project"."modified", "projects_project"."tags",
       "projects_project"."outsource_id", "projects_project"."owner_id",
       "projects_project"."long_description_html"
FROM "projects_project"
LEFT OUTER JOIN "projects_project_maintainers"
    ON ("projects_project"."id" = "projects_project_maintainers"."project_id")
LEFT OUTER JOIN "teams_team"
    ON ("projects_project"."id" = "teams_team"."project_id")
LEFT OUTER JOIN "teams_team_coordinators"
    ON ("teams_team"."id" = "teams_team_coordinators"."team_id")
LEFT OUTER JOIN "teams_team_members"
    ON ("teams_team"."id" = "teams_team_members"."team_id")
WHERE ( ("projects_project"."private" = false)
        OR ("projects_project_maintainers"."user_id" = 7674
            OR "teams_team_coordinators"."user_id" = 7674
            OR "teams_team_members"."user_id" = 7674) )
ORDER BY "projects_project"."name" ASC;
which corresponds to the following Django queryset:
Project.objects.exclude(
    Q(private=True) & ~(Q(maintainers__in=[user]) |
                        Q(team__coordinators__in=[user]) |
                        Q(team__members__in=[user]))
).distinct().exclude(slug=project.slug).order_by('name')
Basically, this code populates the drop-down menu with the projects a user can choose from when outsourcing the access control of their project.
The generated query joins five tables (with LEFT OUTER JOINs) and sorts the output, so that must be why it took so long to execute, right?
Analyzing the issue
However, developers are notoriously bad at guessing where the bottleneck in their program lies. So, let's ask PostgreSQL why that query takes so long.
We used the EXPLAIN ANALYZE statement to learn where the query spent most of its time. The first few lines of the output follow (the rest looked normal):
Unique  (cost=1191.34..1373.29 rows=3639 width=281) (actual time=12117.871..12225.835 rows=1858 loops=1)
  ->  Sort  (cost=1191.34..1200.44 rows=3639 width=281) (actual time=12117.866..12162.041 rows=19903 loops=1)
        Sort Key: projects_project.name, projects_project.id, projects_project.private, projects_project.slug, projects_project.description, projects_project.long_description, projects_project.homepage, projects_project.feed, projects_project.bug_tracker, projects_project.trans_instructions, projects_project.anyone_submit, projects_project.hidden, projects_project.enabled, projects_project.created, projects_project.modified, projects_project.tags, projects_project.outsource_id, projects_project.owner_id, projects_project.long_description_html
        Sort Method: external merge  Disk: 14272kB
The query takes more than twelve seconds to execute and, apart from the first few milliseconds, all of that time is spent on sorting!
Two things struck us as odd: the sort method used and the sheer number of columns in the sort key.
Let's discuss the sort method first. The external merge method sorts the data on disk instead of in RAM; PostgreSQL chooses it when the memory available for sorting is not enough. That amount is controlled by the work_mem setting. Increasing the memory available for sorting dropped the execution time to about two seconds. Still not good enough.
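For reference, checking and raising the per-sort memory budget looks roughly like this in a psql session (the 64MB value here is illustrative, not the setting we ended up with):

```sql
-- Inspect the current per-sort memory budget (typically a few MB by default).
SHOW work_mem;

-- Raise it for the current session only; the value is illustrative.
SET work_mem = '64MB';

-- Re-run the slow query and check whether the plan now reports
-- "Sort Method: quicksort  Memory: ..." instead of "external merge  Disk: ...".
EXPLAIN ANALYZE SELECT ...;
```

Note that SET only affects the current session; a server-wide change goes in postgresql.conf.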
Too many columns
But why are all those columns in the sort key in the first place? Looking at our initial query again, we saw that it uses the DISTINCT keyword. Databases commonly implement DISTINCT by sorting on all of the selected columns, so that duplicate rows end up adjacent and can be removed from the result. This, of course, means extra overhead. But are all those columns really necessary?
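A minimal Python sketch of this sort-based DISTINCT strategy (an illustration of the idea, not PostgreSQL's actual implementation): sorting the rows by every selected column brings duplicates next to each other, so they can be dropped in a single pass.

```python
# Sketch of sort-based DISTINCT. The sort key is the full row tuple,
# which is why every selected column ends up in the sort key.
def sort_distinct(rows):
    """Return unique rows by sorting on all columns, then skipping adjacent duplicates."""
    out = []
    for row in sorted(rows):
        if not out or row != out[-1]:
            out.append(row)
    return out

# Duplicate rows like these appear because of the LEFT OUTER JOINs:
# one copy of a project row per matching maintainer/coordinator/member.
rows = [(2, "bar"), (1, "foo"), (2, "bar"), (1, "foo"), (3, "baz")]
print(sort_distinct(rows))  # → [(1, 'foo'), (2, 'bar'), (3, 'baz')]
```

The wider the rows, the more expensive this sort becomes, which is exactly what the query plan above showed.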
Hunting down the code in Transifex and Django, we found that the only fields needed for the drop-down menu were the id and the name of each project (as one would expect in a drop-down menu). So, the fix was easy: we only needed to chain an [only('id', 'name')](https://bitbucket.org/indifex/txo-live/changeset/c545109b162c) call onto our initial queryset. See the Django documentation for only() for the details.
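The effect is easy to see in plain Python. This sketch (illustrative only, not the actual Transifex code) mirrors what the only() call achieves for the DISTINCT step: projecting each joined row down to (id, name) before deduplicating shrinks the sort key from nineteen columns to two.

```python
# Illustrative sketch: project joined rows down to (id, name) and deduplicate,
# mirroring what .only('id', 'name') plus DISTINCT does for the drop-down menu.
def distinct_projects(rows):
    """Project each joined row to (id, name), then sort-and-dedup the two-column tuples."""
    return sorted({(r["id"], r["name"]) for r in rows})

# One row per (project, team member) pair, as produced by the joins.
rows = [
    {"id": 1, "name": "foo", "description": "long text..."},
    {"id": 1, "name": "foo", "description": "long text..."},
    {"id": 2, "name": "bar", "description": "other text..."},
]
print(distinct_projects(rows))  # → [(1, 'foo'), (2, 'bar')]
```

Deduplicating on (id, name) alone is safe here because id is the primary key, so no distinct projects collapse into one entry.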
Running EXPLAIN ANALYZE again showed that the execution time had dropped to a few milliseconds!
Lessons learned
First, you should always check your logs (you do have logging enabled, don't you?). Having a monitoring tool for your servers helps, too.
But how did we not catch this earlier? One reason is that the data in our database keeps growing: the same query could have taken a few milliseconds a while ago. The other is that it is very difficult to exercise every code path with live data; this particular issue would appear only for a plain authenticated user. That said, if you run into any other performance issues, let us know.