DEV Community πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’»

Cover image for How sql engine handle join internally-part 1
Amit Tiwary
Amit Tiwary

Posted on • Updated on

How sql engine handle join internally-part 1

There are different types of logical join in SQL. Inner join and outer join are mainly used join. When we need to get the data from tables we use the join. But how do these joins work? Physical join is implemented inside RDMS. The user uses a logical join to write the query and a physical join is used by RDMS to do join operations. Part 2 of this blog post is here.

There are three different physical joins Nested loop join, Hash Join and merge join. Depending on the data in the joined tables, RDMS use any one physical join.

Nested loop joins: In nested loop join, out of the two tables, the table with a smaller number of records is selected, and it will loop through the second table until matches are found. This is available in MYSQL, Postgres and even in SQL servers too. This is not a scalable option for a large table. It is used mostly when the join operator doesn’t use equality. MySQL has only this option in versions lower than MYSQL 8.
For example lets say we need to get the students that height is smaller then the teacher.

SELECT 
    s.student_id, 
    t.teacher_id, 
    s.student_name, 
    t.teacher_name 
FROM 
    student s, teacher t 
WHERE 
    s.height < t.height
Enter fullscreen mode Exit fullscreen mode

Hash Join: Hash join is a way of executing a join where the hash table is used to find the match record. A Hash table creates in the memory. If the data is very large and memory is not sufficient to hold the data then it writes to disk. It is efficient compared to the Nested loop join. During the execution, RDMS build the in-memory hash table where rows from one of the join tables is stored using the join attributes as the key. Once it is done then the server starts reading rows from another table and finds the matching row from the hash table. This is used mostly when the join operators use equality.
Lets say we created two table. First table is student table and second table is class table. Each student belongs to one of the class. Now when we join these two table using the class id to get the student and details of class that this student belongs to, then sql server may use the hash join.

SELECT 
    s.student_Id ,
    s.student_name ,
    c.fund ,
    c.strength 
FROM 
    student AS s 
    JOIN
    class AS c ON 
        s.class_id = c.id;
Enter fullscreen mode Exit fullscreen mode

hash-table
When we do the join sql server create the hash table using the class table. 'class_id' is the key here. Now when sql go through student table, it get the class data from hash table using the class_id.

Merge Join: Merge join is used if the join condition uses an equality operator and both sides of the join are large. Merge join uses the sorted data inputs. So if there is an index on the expressions used in the join column then it is used to get the sorted data. But if the server is doing any operation to sort the data then please look at the indexes and better try to modify the indexes in order to achieve better results.

Read part 2 here.

Top comments (5)

Collapse
 
aarone4 profile image
Aaron Reese

Thank for the article. I think the title could be revised as I nearly skipped is as (yet another) comparison of inner, left and cross join and it is clearly more details than that
I think you could also expand (or have a second article) on which internal process are more effective and how to encourage the database to use a more efficient match strategy (e.g. indexing, comparable data types across tables, prefiltering, fully specifying join constraints to utilize indexes, not using non-SARGable predicates)
Also you said the database uses the smaller table as the driving a loop join. This is generally true but the engine optimization may choose a different table if statistics indicate it could be quicker due to any filter restrictions in the where clause, and/or the cardinality and index state of the fields being compared. Historically this query plan will be cached and you could only have one plan for a query so if the query accepts parameters (e.g. a date range) then the cached plan may be effective for some values (e.g. a week) but perform poorly on others (e.g. a decade)

Collapse
 
amitiwary999 profile image
Amit Tiwary

Thanks. I will add another blog with an explanation of which process is effective and how SQL chose one over the other.
Yes, the engine may decide to choose a different one. There are lots of factors that affect the engine decision. I tried to explain using the general scenario.

Collapse
 
amitiwary999 profile image
Amit Tiwary

@aarone4 I have written blog post on how sql decide physical join based on number of datas and indexes dev.to/amitiwary999/how-join-works....

Collapse
 
geraldew profile image
geraldew

Could you please reformat the code blocks for this piece.

For example in my browser, for the second code block I only initially see:

SELECT s.student_Id, s.student_name, c.fund, c.strength FROM stude

As that didn't seem to make sense I knew to visit the block with my mouse to then see the horizontal scroll control (because the current excuse for web design is to hide such vital visual elements).

By the way, that was on a 1920x1080 screen - it would be even more truncated on many smaller screens that I find myself using at times. As it happens, in my current browser mousing over to display the horizontal scroll also impedes on the code block a little, obscuring the bottom of lower case letter descenders. Yet another reason why single-line code blocks are a bad idea.

Of course, there are many ways to format SQL - so I'll just use my own:

SELECT 
    s.student_Id ,
    s.student_name ,
    c.fund ,
    c.strength 
FROM 
    student AS s 
    INNER JOIN
    class AS c ON 
        s.class_id = c.id
;
Enter fullscreen mode Exit fullscreen mode
Collapse
 
amitiwary999 profile image
Amit Tiwary

thanks. I will reformat the code.

An Animated Guide to Node.js Event Lop

Node.js doesn’t stop from running other operations because of Libuv, a C++ library responsible for the event loop and asynchronously handling tasks such as network requests, DNS resolution, file system operations, data encryption, etc.

What happens under the hood when Node.js works on tasks such as database queries? We will explore it by following this piece of code step by step.