DEV Community

Andrei Navumau
Andrei Navumau

Posted on

Example of complex SQL query to get as much data as possible from Database.

I'd like to share an example of a complex SQL query, which will return all possible data from Database.

DB Schema looks like this:

DB Schema

The task was "to displays all employees and their related info even if some info is missing. Get as much information as you can about the employees".

My final SQL query was like this:

SELECT
  e.employee_id AS "Employee #"
  , e.first_name || ' ' || e.last_name AS "Name"
  , e.email AS "Email"
  , e.phone_number AS "Phone"
  , TO_CHAR(e.hire_date, 'MM/DD/YYYY') AS "Hire Date"
  , TO_CHAR(e.salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') AS "Salary"
  , e.commission_pct AS "Comission %"
  , 'works as ' || j.job_title || ' in ' || d.department_name || ' department (manager: '
    || dm.first_name || ' ' || dm.last_name || ') and immediate supervisor: ' || m.first_name || ' ' || m.last_name AS "Current Job"
  , TO_CHAR(j.min_salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') || ' - ' ||
      TO_CHAR(j.max_salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''') AS "Current Salary"
  , l.street_address || ', ' || l.postal_code || ', ' || l.city || ', ' || l.state_province || ', '
    || c.country_name || ' (' || r.region_name || ')' AS "Location"
  , jh.job_id AS "History Job ID"
  , 'worked from ' || TO_CHAR(jh.start_date, 'MM/DD/YYYY') || ' to ' || TO_CHAR(jh.end_date, 'MM/DD/YYYY') ||
    ' as ' || jj.job_title || ' in ' || dd.department_name || ' department' AS "History Job Title"
  
FROM employees e
-- to get title of current job_id
  JOIN jobs j 
    ON e.job_id = j.job_id
-- to get name of current manager_id
  LEFT JOIN employees m 
    ON e.manager_id = m.employee_id
-- to get name of current department_id
  LEFT JOIN departments d 
    ON d.department_id = e.department_id
-- to get name of manager of current department
-- (not equal to current manager and can be equal to the employee itself)
  LEFT JOIN employees dm 
    ON d.manager_id = dm.employee_id
-- to get name of location
  LEFT JOIN locations l
    ON d.location_id = l.location_id
  LEFT JOIN countries c
    ON l.country_id = c.country_id
  LEFT JOIN regions r
    ON c.region_id = r.region_id
-- to get job history of employee
  LEFT JOIN job_history jh
    ON e.employee_id = jh.employee_id
-- to get title of job history job_id
  LEFT JOIN jobs jj
    ON jj.job_id = jh.job_id
-- to get namee of department from job history
  LEFT JOIN departments dd
    ON dd.department_id = jh.department_id

ORDER BY e.employee_id;

Let's go through it step by step:

Before creating our final Select statement, let's see which tables require LEFT joins and which do not. We need this check because LEFT JOINS are slower than INNER JOINS, so for efficiency we will be using (INNER) JOINS were possible.

To check that every employee has a job_id. If there are results (id's of employees), so we need LEFT JOIN, because there are employees who do not have (IS NULL == true) job_id. Otherwise (IS NULL == false) we use only JOIN:

SELECT employee_id FROM employees WHERE job_id IS NULL;

Result: empty. Conclusion: we use JOIN for 'jobs' table.

The same check as above, this time checking department_id:

SELECT employee_id FROM employees WHERE department_id IS NULL;

Result: 178. Conclusion: we use LEFT JOIN for 'departments' table.

Because there is one employee without department_id, the whole chain of future joins (departments-loations-countries-regions) should be also LEFT Joined, despite the fact that these tables can be INNER joined between each other in a separate query.

The same check as above, this time checking manager_id:

SELECT employee_id FROM employees WHERE manager_id IS NULL;

Result: 100. Conclusion: we use LEFT JOIN (self join) for 'employees' table. It means, that one employee doesn't have a manager.

And some general comments on this final query:

AS
  • we use it to define alias, column names in the result table look more human-readable.
    ||
  • we use this to concatenate strings
    'text'
  • we enclose any text into single quotes
    TO_CHAR(e.hire_date, 'MM/DD/YYYY')
  • in this way we can set format of the date
    TO_CHAR(e.salary, 'L99G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,'' NLS_CURRENCY = ''$''')
  • in this way we can output currency with separators. My detailed post about it can be found here.
    JOIN jobs j ...
  • this 'j' means alias to a table name. The same table can be joined to itself (self join) and for this reason this table alias for the same table can and should be different.

The query returned 110 results, though there are only 107 employees. Because out of 107 employees, 3 have two job history entries. The first couple of rows of the result look like this:

Complex query result

Complex query result continue

We could definitely go further and get all information about departments, where according to job history entries a couple of employees were working, but I decided that it would be too much.

Originally posted on my website.

I hope this example will be useful for newbies. And to those of you, who are experienced, I'd like to address some questions:

  1. Is it really important which JOIN to apply: LEFT or INNER? I mean is there any real gain in productivity?
  2. Is it a good practice to pack all possible information into one query? Or to split it into small queries is a better approach?
  3. What mistakes were made in this query, if any?

Take care!

Top comments (7)

Collapse
 
dmfay profile image
Dian Fay • Edited

First: The distinction between outer (left or right) and inner joins is critical, but it's not about efficiency. Inner joins add to the result set if the row in table A has a corresponding row in table B. Rows without complements are skipped. Meanwhile, outer joins add all rows from table A (or B) to the result set and apply B (or A) data when found; if the corresponding rows don't exist, the origin rows are still returned with NULLs in place of the missing data.

Outer joins may seem slower because they're returning more data or because they're poorly indexed but performance is not a reason to prefer inner joins. It's about the shape of the data: you shouldn't just be checking to see if any employee currently has a null job_id but whether it is possible for an employee ever to have a null job_id. If that field is nullable and you are potentially interested in employees without job_ids you must use an outer join or risk incomplete results.

Second: Do what makes sense. You're trying to strike a balance between a few conflicting goals: simple queries, few roundtrips to the database, manageable resultset size. Where the balance is depends on what you're using the data for. Sometimes you really do need a huge, complicated query that returns everything imaginable because you really are using all that data all at once. Usually if you have huge amounts of data people have difficulty processing it unless you split it up for them, which in turn leads to smaller, more targeted queries. Here, you have duplicate employee rows thanks to multiple job histories, so whatever you're doing with the data will likely entail some post-processing to transform the data in a way that gives you one entry per employee. Having to process data further before it's useful is a sign that you're trying to do too much.

Third: Does it return the data you expect in a reasonable amount of time? Then it's correct enough.

Collapse
 
tyzia profile image
Andrei Navumau

Thank you, Dian.
You're right. When I was wondering about LEFT JOIN vs INNER JOIN, I was concerned not to end up with LEFT JOINS everywhere, in every query without proper understanding what I'm doing. I don't want to use LEFT JOIN just to be on the safe side, that some column in 'employees' table has NULL value in job_id field, for instance, and thus this employee is not included into overall result.

But you are right, that I need to take a look at the constraint of this field. If it is NOT NULL, so I can use INNER JOIN safely.

This, by the way, is an example when one should think about redesigning table constraints and, probably, adding NOT NULL constraints to important fields.

Thank you, once again for your time and comment. Honestly, I didn't think that anyone will read this post to the end and will grasp the idea of my questions.

Talking about the amount of data this query returns, this was a test task to see, if I'm able to join tables correctly. In real world, I'd prefer short queries which return specific data, not merging everything into one query. To my mind, it's more efficient.

Take care!

Collapse
 
gregorgonzalez profile image
Gregor Gonzalez

I love this! I work with this kind of query every day.

I think It's really good, the first important thing is to reach your goal. I do the same process as you and only use inner-left joins as a standard to keep it simple (not right, not outer, etc). Then check the performance, try with thousands of records (I do this for huge reports), use "explain" and others approach to test.

I work mostly with MySQL, I can only tell you a few things: Make this query a view. In MySQL try to avoid a lot of subquerys because performance can be lost. Don't make "view of views" because the performance and maintaining this can be difficult.

I still avoid complex functions and stored procedures to keep it simple for new teammates and jr developers.

Collapse
 
tyzia profile image
Andrei Navumau

Hi, Gregor.

Thanks for commenting. Yes, I'll consider using view in this case. And I do agree, that stored procedures can be tricky for new teammates.

Collapse
 
ericschillerdev profile image
Unfrozen Caveman Dev

As a fairly experienced dev, I will note that this is VERY IMPORTANT for anyone who may ever touch a database, even if it's just via an ORM (this is what it translates to on the back end, and the concepts are the same). I'm surprised how little understanding of SQL there is when I interview folks. My main note would be to mention which database system you're using, because the SQL changes slightly by system (sometimes concatenation is +, inner join needs to be explicit, etc.).

As for your questions:

  1. It's important more for the scale of WHAT you want back (do you want an exact match, or do you want nulls if one table doesn't have a match?). Note that any sort of outer join will return more records.

The efficiency will depend on the system you're using and how it parses, what is indexed and what isn't, etc. Figure out what you need first, and optimize from there.

  1. Again, this depends on what you're doing. A lot of joins do slow database queries down, especially if some of the keys aren't indexed. It may also make it harder to read if you have a long set of joins at once.

That said, it can also get hard to read (and much slower) if you declare a bunch of temp tables and keep putting partial results in there.

If you have a bunch of related data that is separated by the normalization, and you're just trying to traverse the tree (which is really all a relational database is), then you are generally better off bundling into one query. Retrieving it that way makes sense because you want a combined dataset, and generally things are tied via foreign keys. If you have to bundle disparate data sources (i.e. people vs. other entities), format them to match in a more generic format, etc. then consider breaking up the query into smaller parts.

  1. I don't see any obvious mistakes.
Collapse
 
tyzia profile image
Andrei Navumau

Hi, Unfrozen Caveman Dev.
Thank you for your comment.

To your questions:

  1. What I wanted back? - As much information as possible about all employees. This, basically, made me think of LEFT JOIN to employees table, so that I don't miss anyone in this list.
  2. I was using Oracle SQL developer to access my database.

I appreciate your feedback.

Collapse
 
ericschillerdev profile image
Unfrozen Caveman Dev

Makes sense. Then yes, you used the correct join. And it's worth noting that JOIN syntax is slightly different in different systems, so it's worth specifying a little more clearly which join you're using.

INNER JOIN instead of JOIN, LEFT OUTER JOIN, etc. I can't remember, but I think those work in Oracle just fine, and will work across a few other systems.

If I'm wrong, no worries, just a readability thing.