DEV Community

Todd Birchard for Hackers And Slackers

Posted on • Originally published at hackersandslackers.com on

Welcome to SQL 3: Building Relationships and Combining Data Sets

Welcome to SQL 3: Building Relationships and Combining Data Sets

If you've felt distant or estranged from SQL so far in the series, never fear: we're about to discover the magic of what makes relational databases so... relational. Turn down the lights and put on your favorite Marvin Gaye track; we're about to make connections on a whole other level.

I find that existing attempts to explain Database relations (JOINs in particular) have been an utter failure in illustrating these concepts. The Venn Diagrams we're all accustomed to seeing mean nothing to somebody who has never seen a JOIN occur, and even then, do they really describe what's happening? I'd love to toss together some quick animations as an alternative, but chances are I'll settle for something mediocre like the rest of us.

Relational Databases in Action

As much as we've covered SQL so far, we still haven't had "the talk." Oh God no, not that talk; I meant the obligatory example-of-how-two-tables-might-relate-to-one-another talk. This talk is a bit less awkward, but it definitely won't prepare you for the finer things in life. Just kidding, data is the finer part of life. Or at least it is in mine. Let's not linger on that too long.

Let's look at the most common scenario used to illustrate data relationships: the customers vs. orders predicament. Let's say we decided to open up an Organic Vegan Paleo Keto Kale Voltron 5000 health-food marketplace to cater to a high-end clientele: pretentious rich assholes. It just so happens that the "rich asshole" market is very receptive to best practices in customer relations, so we start a CRM to track our best customers. This record-keeping helps us pretend to remember the names and personalities of our clientele:

Customers Table

id first_name last_name email gender state phone
653466635-9 Timothea Crat tcrat0@bandcamp.com Female Washington 206-220-3752
418540868-4 Kettie Fuggle kfuggle1@cafepress.com Female California 661-793-1372
857532654-6 Boonie Sommerland bsommerland2@soundcloud.com Male North Carolina 919-299-0715
563295938-4 Red Seldon rseldon3@addthis.com Male Indiana 765-880-7420
024844147-7 Marika Gallatly mgallatly4@loc.gov Female New York 718-126-1462
900992907-8 Sharlene McMaster smcmaster5@gmpg.org Female Nevada 775-376-0931
329211747-X Grover Okey gokey6@weather.com Male Texas 915-913-0625
656608031-7 Farly Pluck fpluck7@buzzfeed.com Male Texas 432-670-8809
906380018-5 Sumner Pickerell spickerellb@bloglovin.com Male Colorado 719-239-5042

On the other hand, we need to keep track of inventory and items sold. Since we're already swiping credit cards and getting all this personal customer data, why not associate purchases with loyal customers? Thus, we have a list of transactions that looks something as such:

Orders Table

| item_id | customer_id | item_purchased | first_name | last_name | amount | date_purchased |
| 82565290-530d-4272-9c8b-38dc0bc7426a | 653466635-9 | Creme De Menthe Green | Timothea | Crat | $8.57 | 5/13/18 |
| 9cfa5f5c-6a9c-4400-8f0f-f8262a787cd0 | 653466635-9 | Veal Inside - Provimi | Timothea | Crat | $5.77 | 3/3/18 |
| 5dea0cce-c6be-4f35-91f6-0c6a1a8b8f11 | 656608031-7 | Arizona - Plum Green Tea | Grover | Okey | $1.72 | 9/6/18 |
| b4813421-12e8-479b-a3b6-3d1c4c539625 | 656608031-7 | Beer - Fruli | Grover | Okey | $4.05 | 10/1/18 |
| 4e7c8548-340f-4e89-a7f1-95173dcc6e53 | 656608031-7 | Boogies | Grover | Okey | $1.97 | 12/17/18 |
| 65261e94-494d-48cc-8d5a-642ae6921600 | 656608031-7 | Cup - 3.5oz; Foam | Grover | Okey | $1.84 | 11/28/18 |
| 1bfdca0f-d54a-4845-bbf5-982813ab4a65 | 656608031-7 | Arizona - Green Tea | Grover | Gauford | $0.22 | 5/23/18 |
| d20d7add-bad4-4559-8896-d4f6d05aa3dd | 906380018-5 | Lemonade - Strawberry; 591 Ml | Sumner | Tortoishell | $7.98 | 10/11/18 |
| 12134510-bc6c-4bd7-b733-b549a61edaa3 | 906380018-5 | Pasta - Cappellini; Dry | Sumner | Wash | $0.31 | 11/13/18 |
| 80f1957c-df4d-40dc-b9c4-2c3939dd0865 | 906380018-5 | Remy Red Berry Infusion | Sumner | Pisculli | $1.25 | 12/31/18 |
| a75f7593-3312-43e4-a604-43405f02efdd | 906380018-5 | Veal - Slab Bacon | Sumner | Janaszewski | $9.80 | 3/9/18 |
| c6ef1f55-f35d-4618-8de7-36f59ea6653a | 906380018-5 | Beans - Black Bean; Dry | Sumner | Piegrome | $1.36 | 12/11/18 |
| c5b87ee3-da94-41b1-973a-ef544a3ffb6f | 906380018-5 | Calypso - Strawberry Lemonade | Sumner | Piegrome | $7.71 | 2/21/19 |
| e383c58b-d8da-40ac-afd6-7ee629dc95c6 | 656608031-7 | Basil - Primerba; Paste | Mohammed | Reed | $2.77 | 10/21/18 |
| d88ccd5b-0acb-4144-aceb-c4b4b46d3b17 | 656608031-7 | Cheese - Fontina | Mohammed | Reed | $4.24 | 7/14/18 |
| 659df773-719c-447e-a1a9-4577dc9c6885 | 656608031-7 | Cotton Wet Mop 16 Oz | Jock | Skittles | $8.44 | 1/24/19 |
| ff52e91e-4a49-4a52-b9a5-ddc0b9316429 | 656608031-7 | Pastry - Trippleberry Muffin - Mini | Jock | Skittles | $9.77 | 11/17/18 |
| 86f8ad6a-c04c-4714-8f39-01c28dcbb3cb | 656608031-7 | Bread - Olive | Jock | Skittles | $4.51 | 1/10/19 |
| e7a66b71-86ff-4700-ac57-71291e6997b0 | 656608031-7 | Wine - White; Riesling; Semi - Dry | Farly | Pluck | $4.23 | 4/15/18 |
| c448db87-1246-494a-bae4-dceb8ee8a7ae | 656608031-7 | Melon - Honey Dew | Farly | Pluck | $1.00 | 9/10/18 |
| 725c171a-452d-45ef-9f23-73ef20109b90 | 656608031-7 | Sugar - Invert | Farly | Pluck | $9.04 | 3/24/18 |
| 849f9140-1469-4e23-a1de-83533af5fb88 | 656608031-7 | Yokaline | Farly | Pluck | $3.21 | 12/31/18 |
| 2ea79a6b-bfec-4a08-9457-04128f3b37a9 | 656608031-7 | Cake - Bande Of Fruit | Farly | Pluck | $1.57 | 5/20/18 |

Naturally, customers buy more than one item; they buy a lot. Especially that Farly Pluck guy at the bottom- quite the unfortunate auto-generated name.

As standalone tables, the customers and orders tables each serve at least one straightforward purpose on their own. The Customers table helps us with consumer demographic analysis, whereas the Orders table makes sure we’re making money and aren't getting robbed. While important, neither of the functions are particularly revolutionary: this basic level of record keeping has been at the core of nearly every business since the 70s.

The ability to combine data enables us to gain far more significant insights. We can reward loyal customers, cater to the needs of individuals based on their preferences, and perhaps even sell the personal data of where and when Mr. Pluck has been every Tuesday and Thursday for the past 4 months to the highest bidding data broker (hint: he's at our store).

Thanks to relational databases, we are neither limited to single monolithic tables nor are we shackled by the constraints of the tables we set up front. Associating data is trivial, as long as we have a means by which to associate it by. Below is a visualization of matching a foreign key in our orders table to a primary key in our Customers table:

Welcome to SQL 3: Building Relationships and Combining Data Sets
An Order's Foreign Key References a customer's ID

The above illustrates what we've already brushed on a bit: Foreign Key association. Primary and foreign keys are essential to describing relations between the tables, and in performing SQL joins. Without further adieu, let's join some data.

Joining Sets of Data

To “join” multiple sets of data is to consolidate multiple tables into one.

The manner of this consolidation is determined by which of the four methods of joining tables we use: inner joins , right joins , left joins , and outer joins (left and right joins are kind of the same, but whatever). Regardless of the type of join, all joins have the following in common:

  • Row comparison: we look for rows where the values of a column in Table A match the values of a column in Table B.
  • Consolidation of columns: The purpose of any join is to come away with a table containing columns from both tables.

LEFT & RIGHT JOINs

LEFT and RIGHT joins cover a myriad of use cases. With a bit of creativity, left/right joins can help solve problems we may not have expected. The terms " left" and " right" refer to the table we'd like to join on when reading from left-to-right. When joining tables via LEFT JOIN, the first table in our query will be the "left" table. Alternatively, a RIGHT JOIN refers to the last table.

When we say "table to join on," we're specifying which table's key values will be the "authority" for our merge. In a LEFT MERGE, all of the records in Table A will survive the merge:

  • For rows which have a match in Table B, these rows will be 'extended' to include the data in Table B. This means the new columns being added to Table A from Table B will contain data for all rows where an association has been made.
  • For rows which exist in Table A but do NOT have a match in Table B , these rows are unaffected: they will contain the same data as before the join, with values in the new columns left blank.
  • Keys which exist in Table B but do NOT exist in Table A will be discarded. The purpose of these joins is to enrich the data of the primary table.

Below is an example of an actual left join I use to power the Kanban board module on our "Projects" page. The left table is a table of JIRA issues, and the right table is a collection of issue-based customizations, such as custom icons and colors for issue types. Take a look at how this data is associated, and what makes it into the final table:

Welcome to SQL 3: Building Relationships and Combining Data Sets
Keys on the left table determine which rows stay or go.

The structure of a LEFT JOIN query looks as such:

SELECT 
  table_1.*, table_2.*
FROM
  t1
    LEFT JOIN
  t2 ON t1.column_name = t2.column_name;
Enter fullscreen mode Exit fullscreen mode

Here's an example with actual values:

SELECT first_name, last_name, order_date, order_amount
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id;
Enter fullscreen mode Exit fullscreen mode

Compare this to a RIGHT JOIN:

SELECT first_name, last_name, order_date, order_amount
FROM customers c
RIGHT JOIN orders o
ON c.customer_id = o.customer_id
Enter fullscreen mode Exit fullscreen mode

INNER JOIN (or CROSS JOIN)

Inner joins are the most conservative method for joining sets of data. Unlike LEFT or RIGHT joins, there is no authoritative table in an inner join: only rows which contain a match in all tables will survive the join. All other rows will be ignored:

SELECT table_1.column_name(s), table_2.column_name(s), 
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;
Enter fullscreen mode Exit fullscreen mode

Because inner joins will only act on rows which match in all affected tables, an inner join will typically contain the most "complete" data set (highest number of columns satisfied with values), but will contain the fewest number of rows.

OUTER JOINs

Outer joins actually come in a few different flavors. Generally speaking, outer joins maximize the amount of data which will survive after the join is performed.

LEFT (OR RIGHT) OUTER JOIN

At first glance, you might look at the results of a left/right outer join and mistake them to exactly the same as their pure left/right join counterparts. Well, you actually wouldn't be mistaken at all! That's right, I was lying: there's essentially no difference between types of joins (thus our time mentioning them has been worthless).

FULL OUTER JOIN

In a full outer join, all columns and rows will be joined into the resulting output, regardless of whether or not the rows matched on our specified key. Why do we specify a key at all, you ask? Matching rows on a key still combines rows which are similar to all involved tables (if there are truly no rows with common ground during a merge, you should ask yourself why you're merging two unrelated sets of data in the first place).

The result is kind of a mess. I'm going to borrow an illustration from the Pandas documentation here:

Welcome to SQL 3: Building Relationships and Combining Data Sets
Source: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

While Column B appears to be left somewhat intact, take a look at what's happening around it: columns labeled A_x and A_y have been generated as a result of the join. The outer join has created a table where every possible combination of values for the keys in column B exists. Thus, the number of rows in our new table is effectively length of Table A * length of Table B.

I personally rarely use outer joins , but that's just me.

SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;
Enter fullscreen mode Exit fullscreen mode

Scenario: Create a New Table from Multiple JOINs

So far we've only looked at examples of two tables being joined at once. In fact, we can merge as many tables as we want, all at once! Going back to the JIRA example, here is the actual query I use to create the final table which powers a custom Kanban board:

CREATE TABLE jira
AS
SELECT
    jira_issues.*,
    jira_issuetypes.issuetype_url,
    jira_issuetypes.issuetype_color,
    jira_epiccolors.epic_color
FROM
    jira_issues
    LEFT JOIN jira_issuetypes ON jira_issues.issuetype = jira_issuetypes.issuetype
    LEFT JOIN jira_epiccolors ON jira_issues.epic_name = jira_epiccolors.epic_name;
Enter fullscreen mode Exit fullscreen mode

If you're using PostgreSQL, views are a great way to save the results of a join without adding additional tables. Instead of using CREATE TABLE, try using CREATE VIEW:

CREATE TABLE jira
AS SELECT
  jira_issues.*,
  jira_issuetypes.issuetype_url,
  jira_issuetypes.issuetype_color,
  jira_epiccolors.epic_color
FROM
  jira_issues
  LEFT JOIN jira_issuetypes ON jira_issues.issuetype = jira_issuetypes.issuetype
  LEFT JOIN jira_epiccolors ON jira_issues.epic_name = jira_epiccolors.epic_name;
Enter fullscreen mode Exit fullscreen mode

Unions & Union All

A good way to think about JOINs is extending our dataset horizontally. A UNION, then, is a way of combining data vertically. Unions combine data sets with the same structure: they simply create a table with rows from both tables. UNION operators can combine the result-set of two or more SELECT statements, as long as:

  • Each SELECT statement within UNION must have the same number of columns.
  • The columns must also have similar data types.
  • The columns in each SELECT statement must also be in the same order.

UNION

SELECT column_name(s) FROM table1
UNION
SELECT column_name(s) FROM table2;
Enter fullscreen mode Exit fullscreen mode

UNION (with WHERE)

We can also add logic to unions via where statements:

SELECT City, Country FROM Customers
WHERE Country='Germany'
UNION
SELECT City, Country FROM Suppliers
WHERE Country='Germany'
ORDER BY City;
Enter fullscreen mode Exit fullscreen mode

UNION ALL

An interesting distinction is the presence of UNION versus UNION ALL. Of the two, UNION is the more "intelligent" operation: if identical rows exist in both SELECT queries, a UNION will know to only give us one row to avoid duplicates. On the other hand, UNION ALL does return duplicates: this results in a faster query and could be useful for those who want to know what is in both SELECT statements:

SELECT column_name(s) FROM table1
UNION ALL
SELECT column_name(s) FROM table2;
Enter fullscreen mode Exit fullscreen mode

UNION ALL (with WHERE)

Just like UNION, we can add logic to union all via where statements:

SELECT City, Country FROM Customers
WHERE Country='Germany'
UNION ALL
SELECT City, Country FROM Suppliers
WHERE Country='Germany'
ORDER BY City;
Enter fullscreen mode Exit fullscreen mode

More SQL Ahead

I hope that visualizing the way which JOINs and UNIONs work can help to reduce friction for SQL new-comers. I find it difficult to believe that human beings can fully grasp these concepts without seeing them happen first-hand, which begs the question: why would anybody explore something so poorly explained, without knowing the benefits?

If you find these guides useful, feel welcome to holler at me to keep them coming. We still have more SQL ahead in our series: stay tuned for when we explore aggregate values and more!

Top comments (1)

Collapse
 
dominicbisset profile image
Dominic Bisset

That's a solid introduction to joins and unions. I particularly appreciated the animation - it's a great way of representing the filtering effect of joins, and certainly clearer than a Venn diagram.

I think you've missed some of the nuance of the full outer join though. The way you describe it as producing the Cartesian product of the two incoming tables - the rows in one table multiplied by the rows in the other - is more true of the CROSS JOIN (which you suggested is an alias of the INNER JOIN).

As it happens your Pandas example will produce the same output if you join the two tables on column B regardless of whether you do an FULL OUTER, LEFT or even INNER JOIN. The problem lies not in the type of join but that you're trying to join two tables with tangentially related data. There's not enough commonality between the sets to make sensible decisions about what should match to what. If you wanted to reduce the number of rows you'd have to use data sets with more columns so you have composite keys that overlap more.

Rather than showing rows that have multiple matches, a better example for FULL OUTER joins considers what happens when rows don't have a match. You end up with some complete rows, some rows with NULLs at the end and data at the start (assuming a select *), and others with NULLs at the start with data towards the end.

They are niche. There aren't many cases where you'd use one over a left join because largely you have got a primary driving data source that you're trying to add supplementary info to. A use case I came across was when I had two conceptually-similar tables from different sources - reels of paper tracked by different factory data collection systems. Both tables had some common columns, and some extra info that wasn't available in the other system. Both sources knew about some of the reels of paper but not all of them, and a LEFT (or RIGHT) JOIN would filter out the rows from the "other" system that the driving system didn't know about. By FULL OUTER JOINing them together and COALESCING the resultant NULLs to sensible defaults I was able to get the most detailed picture of the system as a whole.