SQL30 (16 Part Series)
Welcome to the SQL showdown series!
I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll describe my solution to the previous day's challenge, and then follow up with a full description of the challenge for the next day.
Write your own solution in the comments! Let's see who can come up with the most creative solutions.
I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!
I'm changing things up a bit for this challenge and focusing more on database design instead of a specific query. Instead of posing a single challenge, I'll present a series of questions about a certain database design topic.
I hope you'll chime in! I have an answer to these questions, but I don't pretend to have the best answers.
This design challenge centers around the idea of soft deletes. Normally, when data is deleted in a database, it's irrevocably gone. It's not uncommon, however, that you might want to preserve a record while hiding it in your user interface.
A simple way to go about this is to add a deleted boolean field to your table. In some cases, it might really be just that simple, and there's not much else to worry about.
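In its simplest form, that might look something like the sketch below (the `account` table and column names are illustrative, not from the challenge database):

```sql
-- A hypothetical "account" table gains a soft-delete flag
ALTER TABLE account ADD COLUMN deleted boolean NOT NULL DEFAULT false;

-- "Deleting" a record just flips the flag instead of removing the row
UPDATE account SET deleted = true WHERE id = 42;

-- Every read now has to remember to filter
SELECT * FROM account WHERE NOT deleted;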
But here are some questions about this design:
What happens to related child data when a parent is soft deleted? If a user deletes an "account" record, what happens to all of the related "order" records?
The entire application is now responsible for making sure that queries filter out all "deleted" records. Write enough queries, and sooner or later someone will forget to filter out the deleted records.
What effect will deleted records have on constraints that might be in place? If I have a unique constraint on a field called name, should users be able to create new records that re-use a name from a soft-deleted record?
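For that last question, PostgreSQL offers one possible answer: a partial unique index, which enforces uniqueness only among live rows. A sketch, again assuming a hypothetical `account` table with a `deleted` flag:

```sql
-- Uniqueness is only enforced among non-deleted rows, so a
-- soft-deleted "alice" doesn't block creating a new "alice".
CREATE UNIQUE INDEX account_name_unique
    ON account (name)
    WHERE NOT deleted;
```

Whether that's the behavior you *want* is exactly the kind of design decision this challenge is about.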
This is the question we were trying to answer with yesterday's SQL challenge:
Can you write a query to efficiently return a random sample of records from a table?
Before going any further, let's first look at the simplest possible way to achieve this.
```sql
SELECT * FROM day10.measurement ORDER BY random() LIMIT 10;
```
(The above works in PostgreSQL and SQLite. In MySQL you'd use ORDER BY RAND(), and in SQL Server, ORDER BY NEWID().)
If your table is of a reasonable size, you probably don't need to overthink this any further. But if you're dealing with millions of rows, this approach can become quite slow.
Why is that? This query will need to scan through every row in the table and generate a random number. Once every row has a random number, the entire table will need to be sorted on that value. If the table is big enough, that sort will most likely spill over into a temporary table stored on disk.
A more efficient solution would be to generate N random numbers first, and then select rows out of your table by ID. The random numbers could be generated outside of the database, and then put into a select like so:
```sql
SELECT * FROM day10.measurement WHERE idx IN (?)
```
This of course has some pitfalls, including the fact that your primary key could have gaps if you've deleted any rows.
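You can also sketch the whole idea inside PostgreSQL itself. Assuming idx is a roughly contiguous integer key, one approach is to oversample random IDs to compensate for gaps:

```sql
-- Generate more candidate ids than we need, since some may fall
-- into gaps left by deleted rows. Duplicate candidates are also
-- possible; add DISTINCT if exact uniqueness matters.
SELECT m.*
FROM (
    SELECT trunc(random() * (SELECT max(idx) FROM day10.measurement))::int + 1 AS idx
    FROM generate_series(1, 20)
) AS candidates
JOIN day10.measurement m USING (idx)
LIMIT 10;
```

This avoids sorting the whole table, at the cost of a slight bias if the key distribution is uneven.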
There is a lot of discussion online about selecting random rows. Specifically for PostgreSQL, there is a fascinating answer on StackOverflow that delves into some very advanced ways to fetch random records.
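One technique that comes up in those discussions is PostgreSQL's built-in TABLESAMPLE clause (available since 9.5), which sidesteps the full sort entirely in exchange for a less uniform sample:

```sql
-- SYSTEM sampling picks random disk pages, so it's very fast but
-- clustered; BERNOULLI scans every row and samples each one
-- independently. The argument is the approximate percentage of
-- rows to return.
SELECT * FROM day10.measurement TABLESAMPLE SYSTEM (1) LIMIT 10;
```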
Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.