SQL 101 (6 Part Series)
It's time for some jargon busting.
We hire for people skills, attitude and the motivation to learn new things. Having said that, sometimes getting to grips with the jargon and database terminology is a steep learning curve.
Here are some of those terms and concepts that pop up during training if you need a refresher or have a junior analyst in your life that needs something to refer back to.
CTEs v Subqueries
Massive Parallel Processing
OLTP v OLAP
Truncate V Drop
When joining tables, we need to state which column from which table we want to match up, and which columns we want to return in the results.
If we don't, an error will be thrown as the database doesn't know what we have in mind.
select orders.order_item, inventory.inventory_item from orders inner join inventory on orders.order_item = inventory.inventory_item
To make it quicker to type we can alias the two tables with something shorter.
select o.order_item, i.inventory_item from orders o inner join inventory i on o.order_item = i.inventory_item
Instead of having to type out the whole table name each time we want a new column added, we can alias them with the letter 'o' for orders and 'i' for inventory.
Read more about JOINs and aliasing in this beginner friendly post:
SQL Transactions are used to trap errors when making changes to tables. During an UPDATE or DELETE statement, the change is auto-committed.
By wrapping the statement in a transaction we have the opportunity to 'roll back' or 'commit' when we are sure that it should be executed, or if a condition has been met.
The following transaction will run in a block and commit if successful.
begin transaction update orders set status = 'sent' where order_id = '12345' update orders set status = 'sent' where order_id = '54321' commit transaction
Read more about how these work:
CTEs (Common Table Expressions) are used to filter datasets and name them to come back to in the query later.
I use them when dealing with large tables to, for example, get all the columns I need from the 'orders' table, then get all the columns I need from the 'inventory' table. Then in a final step join them together.
with orderset as (select order_id, order_item from orders where order_id between '12345' and '54321' ), with inventoryset as (select inventory_item inventory_balance from inventory ) select orderset.order_id, inventoryset.inventory_balance from inventoryset join orderset on orderset.order_item = inventoryset.inventory_item
They are a good way to simplify queries instead of doing complicated JOINs or creating tables that you may not have the permissions to create.
Read more about CTEs with examples and comparison to subqueries:
Datamarts tables are organised in one of two forms. A ‘Star’ schema and a ‘Snowflake’ schema made of two types of tables.
- Facts - that count how many times something has happened.
- Dimensions - (or Dims) that describe an attribute.
In the Star model, we can have a Sales table as our Fact in the centre with Dim tables for Store, Product and Location surrounding the Fact.
The Snowflake is similar but takes the Dims one step further. Instead of just a Location table, we may have a City, Country and even a Postcode table. All the Dims become the points on the snowflake.
Read more about the advantages and disadvantages of both:
ETL stands for Extract, Transform, Load and describes the process of getting data from one database to another or from its raw form into tables that can be queried.
This task is taken care of by the Data Engineers or Database Developers on the team. Read more about what they do:
In PostgreSQL, we can execute blocks of code, called Functions, on a schedule. They can be written like the statements we run ad hoc on the database or can be parsed variables to make them dynamic.
Read more about how to write and execute functions:
Aggregate functions allow us to perform calculations on fields. The most common ones are SUM, COUNT, MIN, MAX, AVERAGE.
When used in conjunction with GROUP BY we can group identical fields and perform the calculation as we do with Pivot Tables in Excel.
For example, to see the total amount due for each item in the orders table we can use the SUM of the amount_due column and GROUP BY
select order_item, sum(amount_due) from orders group by order_item;
Read more about Aggregate functions:
Heaped Storage is a term for tables that live on the database with no indexes. The data is in no particular order and new data simply gets added as it comes in.
Queries that are executed on these tables, especially if the tables are large, can be optimised by adding indexes.
Indexes are a way of telling the database to order the data or where to look to find the data you query often.
Clustered Indexes are like the contents page of a book. Applying this kind of index is telling the data how it should be ordered, like the pages in a book.
Non-clustered Indexes are like the index of a book, the pages haven't been arranged that way physically, but you now have a lookup to get to what you need faster.
Read more in this beginner friendly post:
This refers to data quality and rules ensuring data is traceable, searchable and recoverable.
- Entity Integrity - each table must have a unique primary key
- Referential Integrity - foreign keys on each table refers to a primary key on another or is NULL
- Domain Integrity - each column has a specific data type and length
Read more about Integrity and Database design:
You won't find everything you need in one table, so you will need to learn how to join them together to get what you need.
There are different types of JOINs depending on your needs. Read more about JOINs in this beginner friendly post:
A primary key is a column that best identifies one unique row, and identifies each record as unique, like an ID
- It ensures that there are no duplicates
- It cannot be unknown (NULL)
- There can only be one primary key per table
- A foreign key is a column that matches a primary key in another table so we can join the data in each together.
Read more about Keys:
When two users are trying to query or update the same table at the same time it may result in a lock. In the same way that two people with ATM cards for the same bank account are trying to withdraw the same $100 from the same bank account, one will be locked out while the first transaction is completed.
Rhymes does a great job of explaining how it works on the database:
"...database locks serve the purpose of protecting access to shared resources (tables, rows, data).
In a system where tens if not hundreds of connections operate on the same dataset, there has to be a system to avoid that two connections invalidate each other's operation (or in other cases causing a deadlock) ...
Locks are a way to do that. An operation comes to the DB, declares they need a resource, finishes its own modification, then releases such resource, so that the next operation can do the same. If they didn't lock their resource two operations might overwrite each other's data causing disasters.
In Massively Parallel Processing databases, like Redshift, the data is partitioned across multiple compute nodes with each node having memory to process data locally.
Redshift distributes the rows of a table to the nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, the workload can be balanced.
Read more about Redshift:
Database normalisation increases data integrity and allows new data to be added without changing the underlying structure.
The process of normalising a database takes multiple steps:
- 1st Normal Form - eliminates duplicate columns across all tables and adding a Primary Key.
- 2nd Normal Form - create relationships through Foreign Keys.
- 3rd Normal Form - fields should not be derived from other fields. ie. removing a Total column that multiplies the Quantity and Price column. This should instead be calculated by running a query, not storing it in the table.
Read more about Normalisation:
OLTP and OLAP refer to different types of databases and tools that perform different functions.
- OLTP - Online Transaction Processing - used for fast data processing and responds immediately to queries.
- OLAP - Online Analytics Processing - used for storing historical data and data mining.
Read more about OLTP:
If you intend on sharing a table with your colleagues who have access to your schema, you need to explicitly grant access to them. This keeps data locked down to just those who need to see it.
GRANT ALL ON <schemaname.tablename> TO <username> -- if you would like them to SELECT, UPDATE and DELETE GRANT SELECT ON <schemaname.tablename> TO <username> -- if you would like them to be able to only SELECT
Read more about Permissions and everything else you need to know about databases:
When we run a query there are many things that the SQL engine considers - the joins, the indexes, whether it will scan through the whole table or be faced with table locking.
Stay tuned for a post dedicated to a deep dive on this topic.
In PostgreSQL we can check the query plan using:
- EXPLAIN -- show the execution plan of a statement
- EXPLAIN ANALYZE -- causes the query to be executed as well
explain select orders.order_item, inventory.inventory_item from orders inner join inventory on orders.order_item = inventory.inventory_item
Read more about what each query looks like and how to interpret the results:
Disaster Recovery in the database world relates to the backups, logs and replication instances that are maintained while everything is working fine. These can then be switched on, switched over and analysed when something does go wrong, like a hardware failure, natural disaster or even human error.
Depending on the organisation, these solutions can extend to:
Failover - multiple clusters are set up so if one fails the other can take over.
Mirroring - maintaining two copies of the same database at different locations. One in offline mode so we know where things are at when we need to use it.
Replication - the secondary database is online and can be queried.
Read more about a database disaster and why you should always check your UPDATE statements:
The system tables contain information about all the objects in the database. Sometimes this is called the information schema or system catalogue.
From here we can write queries that show a list of all tables, columns and their data types, search the database for a column name we need, or return the size of each table.
If updating a table with new data use the TRUNCATE command. It deletes all of the rows from the table without deleting the format and headers.
Read more SQL tips in this beginner friendly post:
While a JOIN combines rows of columns horizontally a UNION combines the results vertically. Using a UNION combines the result of two queries into one column and removes duplicates. If your query has multiple columns, they need to be in the same order to complete the UNION.
- UNION - stacks the two tables or data sets together horizontally and removes duplicates
- UNION ALL - does the same but does not remove the duplicates
Read more about UNION and UNION ALL:
Views are not tables, they are queries that are executed on the fly and are used as a way to create a level of abstraction from the base table.
Joe sums it up perfectly in his post
A view is a stored query. When you create a database view, the database stores the SQL you gave it. Then, when you come along and query that view, the database takes the stored view query, adds in the extras from the query against the view, and executes it. That's it!
A window function gets its name because, unlike an aggregate function, it keeps each row intact and adds a row number or running total.
Here is an example using the orders table that returns a rank using the order_value.
select order_id, order_name order_date, rank() over(order amount_due desc) [rank] from dbo.orders
We can import files into tables using the import/export wizard. But they don't have to just be csv or txt files. By using a few lines of code we can import xml and json as well.
Read more with instructions:
Here are some tips for dealing with dates:
- It’s harder to work with dates that are stored as strings so make sure these never represent dates.
- Don’t split out the year, month, and day in separate columns. This makes queries much harder to write and filter.
- Always use UTC for your timezone. If you have a mix of non-UTC and UTC it makes understanding the data much more difficult.
- DATEDIFF and DATEADD functions from SQL Server don't exist in PostgreSQL, learn more about the differences between the two systems:
NULL means that the value is unknown, not zero and not blank. This makes it difficult to compare values if you are comparing NULLs with NULLs.
Depending on what you are asking your code to do, influences the strategy you need to take. Read more about NULLs and how to tackle the problem.
There we have it, a quick introduction to the key terms, concepts and jargon for those new to the world of SQL and databases.
Next up, I'll be writing more posts on some of these topics that need a deep dive - window functions, aggregate functions, using the query plan. Let me know if there's anything else you'd like to see.
This post first appeared on helenanderson.co.nz
Photo by Pixabay on Pexels