In the last blog, I wrote about the physical joins used by the SQL engine. Before version 8, MySQL had only nested loop join, but hash join was included in version 8. Postgresql and sql server has nested loop join, hash join and merge join. In this blog post, I will explain how SQL engine decide which physical join to use.
When two tables joined then, both tables traversed once or multiple times to find the matched data. Merge join is best because it requires traversing both tables once. Merge join requires both inputs to be sorted on join keys or the join key is clustered index. It traverses the outer table and finds the matching data in the inner table. Once it found the data, then it moved to the next row. If data is not found, then move to the next row of the table that has the lowest value.
Since both tables are sorted on the join key, it can move to the next row of the table that has the lowest value because the value in the next row will be higher than the current row. There is a chance that the value in the other table has a high value that can be found in the next row. If the table is not already sorted, then the SQL engine may decide to sort it and do the inner join if it is more optimized. But for that, it is necessary that the SQL engine should be able to sort the data on the join key. Also, one more important thing for merge join is that there should be at least one equijoin(=) expression.
Nested loop join traverses through the inner table multiple times. For each value of the outer table, it traverses the inner table to find the matched value. If the join key is indexed in the inner table then it will use to find the data in a more optimized way. A nested inner join is used when one table is small and the other is large. The small table is used as the outer table and the large table is used as the inner table. The inner table is traversed for each row of the outer table that’s why it is more time-consuming.
If the outer table has n elements and the inner table has m elements then the time complexity can be n*m or n*logm(if the join key is indexed, then the index will use to find the data in the table). If it requires sorting the data on the join key and this operation is expensive then the loop through all rows to find the matching data, then nested loop join might prefer over merge join.
We discussed merge join and nested loop join and hash join need to discuss. There are two phases in hash join. One phase is the build phase and the other is the probe phase. One of the tables is used in the build phase and the other is used in the probe phase.
In the build phase, all the rows of the build table are scanned and used to generate the hash key(using a hash function) and save it in the in-memory hash table. In the probe phase, we go through all the rows of the probe table and generate the hash key using the hash function and the join key of the probe table. Find the hash key in the in-memory hash table. If the key is found then join the row of both tables. Key lookup is fast in the hash table. Like merge join, at least one equijoin(=) expression requires in hash join too.
The hash table generated during the build phase is saved in the in-memory that’s why we chose the small table for the build phase so that less memory is used. If there is a memory issue, some of the partitions of the hash table are swapped to tempdb and whenever there is a need it is loaded in the cache.
If the join table is large and there is no index on the join key then the hash join can be very efficient. Merge join works If the join key is indexed, nested loops join need to go through all the rows of the inner table again and again and if the table is very large then it can be expensive.
You don’t need to specify which join type SQL should use. SQL engine is intelligent enough to find the optimized physical join to use depending on the conditions. But you can make sure you are doing your best to design the database schema, index the column and do the join query using the correct columns.
Top comments (3)
Thank you for your blog. Can you give an example where hash join is prefered over merge join and nested join prefered over hash join?
Thanks for sharing informative infomation. dua for allah's immediate help
If MySQL use only nested loop before version 8, then how does it optimize the join? You mentioned that nested loop is prefer over merge join for large data.