Everyone who writes SQL queries encounters them once in a while: those queries that just take too long. Recently, we ran into such an issue with one of our systems. In this blog post, I will first describe the system, then show how the problem could arise and lastly, how we solved it. Spoiler: the process now takes 24 hours… instead of a few weeks.
One of our applications provides our clients with a different view on their own data. Every few weeks, we receive a dataset from the client, which we import into our application. However, the representation that we use in our application differs from the representation of the received data dump. To be able to process the differences between the received dump and the data that is in our system, we use a mutation detector. The mutation detector uses SQL queries to find batches of differences between the data in our application and the received dump. Each batch is first processed before the next batch of changes will be fetched. An example of a mutation detection query can be found below. It detects all entries that do occur in table
A but not in table
B, under the assumption that for an entry which occurs in both
B it holds that
A.id = B.id.
SELECT A.* FROM A LEFT JOIN B ON A.id = B.id WHERE B.id IS NULL LIMIT 400
The mutation detector executes the same query over and over again, until no changes can be found. However, MySQL does not keep information about the last query it executed, i.e. MySQL does not know where it already looked for changes. It is very likely that MySQL looks at a record and finds that it isn’t different from the data in our system, because the entry was already covered in a previous query. The result is that the query becomes slower over time: the first changes are detected within milliseconds, but as more changes have been processed, the search for new mutations takes longer and longer.
The solution to this problem is straightforward: we need to give MySQL more information about where it has already looked for changes, so that it does not look at the same entry multiple times. To find out where to start, we run
EXPLAIN to find out how the
JOIN statements are resolved, and which table SQL would read first (we will call this table base from now on). This is the topmost entry in the output of the
EXPLAIN statement. Knowing where MySQL starts searching, we can introduce a "cheap
ORDER BY". Let
pk be the primary key of base; we can add
ORDER BY base.pk without introducing extra cost. Now that we have told MySQL in what order it should detect mutations, we can also keep track of where it detected the last one. Instead of querying for just mutations, we add
base.pk to the selected columns. In the mutation detector, we save the largest value for base.pk that was encountered, and add the following condition to the query:
base.pk > [largest base.pk encountered]. Because of the
ORDER BY and the condition on
base.pk, we are sure that MySQL does not cover the same entry multiple times.
We can now incorporate these techniques into the mutation detection query given above. Because we do a
LEFT JOIN from
B, we know that
A will be read first and thus corresponds to the base table we talked about earlier.
A.id is the primary key of
A and is already included in
A.*, so in this case, it is not needed to explicitly select
A.id. The resulting query is as follows:
SELECT A.* FROM A LEFT JOIN B ON A.id = B.id WHERE B.id IS NULL AND A.id > [largest encountered A.id in previous queries] ORDER BY A.id LIMIT 400
This mechanism (including some other small optimizations) reduced the time it took to import a certain dataset from weeks or even months* to 24 hours. The mechanism described above applies to a context where you want to detect and process batches, instead of the complete dataset at once. The key lesson is that adding extra information to a query can gain you a huge speedup. Another thing that each reader should take to heart: use
EXPLAIN to analyze your queries. It will deepen your understanding of how a database handles your query and you will learn how to deal with the database's query strategies.
*don’t worry, we did not actually wait for weeks: we optimized the query before the process finished.
Originally published at www.moxio.com.