- Environment preparation and data download
- Querying the data
- Conclusion and final thoughts
This article is a basic introduction to database table indexes. If you are a seasoned database user or you already have some experience with indexes, this article might be too basic for you, but it will give you a quick way to build a nice sandbox with real-world data (large and varied enough to make it valid) that you can use to test more advanced concepts.
In an age of Pandas and ORMs, database structures are sometimes abstracted into a black box. However, regardless of how you access your database, the way you store, structure and index your data can have a massive impact on performance.
As always, this article presents a real-life example, with a sample of data large enough to be very slow to query from a badly structured database table. Here I present some ideas on how to improve it. It's worth mentioning that this is a massive topic and being a DBA is a challenging and extremely technical role that takes a lifetime to master. That doesn't mean we (developers, data scientists, etc.) shouldn't at least understand the basics of database structures and configuration.
There are lots of tutorials and technical articles on how to install PostgreSQL, so I won't go through the details. If you're following this tutorial on Ubuntu (my development environment), one of the best tutorials I've found is this one on DigitalOcean https://www.digitalocean.com/community/tutorials/how-to-install-and-use-postgresql-on-ubuntu-18-04.
If you're following this article on Windows, this one is pretty good https://www.postgresqltutorial.com/install-postgresql/.
The whole idea is to install PostgreSQL (which includes psql). We will use psql to set up a superuser password and create our database.
To launch psql, look for the proper executable within the installation folder. For more details, see https://stackoverflow.com/questions/47606648/how-to-start-psql-exe.
In Ubuntu, you can use the following commands:
sudo -i -u postgres
psql
The first command starts a shell as the postgres user; the second launches psql.
Now, we will create a new database. In a production-level setup you should create a role with limited privileges to own the database, but that is outside the scope of this tutorial. For now, we will continue with the default postgres superuser.
To create the database just run the following query (all these queries need to be run from within psql).
CREATE DATABASE landregistry WITH OWNER = postgres ENCODING = 'UTF8' CONNECTION LIMIT = -1;
Now, let's test our DB. Quit psql with the command
\q. Then go back into psql, but connecting to our newly created database, with the command
psql -d landregistry. We should be back in psql but inside our database. The prompt should now show the database name, like landregistry=#.
Now, let's create a table to store our data. Tables are stored within schemas. In this case, we will simply use the
public schema. This simulates not having schemas at all. If you want to read more about this, check this stackoverflow question/answer https://stackoverflow.com/questions/2134574/why-do-all-databases-have-a-public-schema-in-postgresql
The columns on the data we will download in the next section are
["transaction", "price", "date", "postcode", "prop_type", "old_new", "duration", "paon", "saon", "street", "locality", "city", "district", "county", "ppd_cat", "rec_status"] so we will create a table called pricedata with these fields:
CREATE TABLE public.pricedata (
    transaction character(50),
    price integer,
    date date,
    postcode character(10),
    prop_type character(1),
    old_new character(1),
    duration character(1),
    paon character(100),
    saon character(100),
    street character(100),
    locality character(100),
    city character(100),
    district character(100),
    county character(100),
    ppd_cat character(1),
    rec_status character(1)
);

ALTER TABLE public.pricedata OWNER TO postgres;
As you can see, we left the table fairly "default": we didn't specify any indexes or other definitions.
This step is done back in your command line / shell.
We will use the UK gov housing price paid data. It is a large file so we should see some meaningful difference in performance (between a poorly and a better structured schema).
Make sure you read the usage guidelines here https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#using-or-publishing-our-price-paid-data
The copyright disclaimer:
Contains HM Land Registry data © Crown copyright and database right 2020. This data is licensed under the Open Government Licence v3.0.
We will download all the data ever recorded. You can download it with wget (as shown below) or just download it with your browser. I will save it into a folder called data.
mkdir data
cd data
wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
To finish this first part, go back to the landregistry database using psql and run (replacing path_to_data_folder with your own path):
COPY pricedata FROM '/path_to_data_folder/data/pp-complete.csv' DELIMITER ',' CSV;
When it finishes, you will see how many rows were copied, but if you missed it or you want to check again, run the following (it will take some time, so you might want to grab a cup of something while you wait):
SELECT COUNT(*) FROM pricedata;
Then you can compare that number with the number of lines in your file. On Linux you can run the following command:
wc -l pp-complete.csv
That results in
25402030 records for me (my file is older, so you will get more records now). You should have the same number in both the original CSV file and your table.
Let's do a couple of things. First, let's query the first 10 rows
select * from pricedata limit 10;
That probably took some time, but if you run the same query again, it should be fairly instantaneous. That's because databases in general (not only PostgreSQL) are really good at caching the data they read, so repeated reads of the same rows are served from memory. (Note: press
q to leave the query results pager.)
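As a rough analogy (PostgreSQL actually caches the table pages a query reads rather than the query results themselves, but the effect on repeated reads is similar), here is a memoization sketch in Python:

```python
# Rough analogy only: the first call pays the full cost of "reading from
# disk"; subsequent identical calls are served from memory.
from functools import lru_cache
import time

@lru_cache(maxsize=None)
def slow_query(key):
    time.sleep(0.1)  # stand-in for reading rows from disk
    return f"rows for {key}"

t0 = time.perf_counter()
slow_query("limit 10")          # first run: pays the full cost
first = time.perf_counter() - t0

t0 = time.perf_counter()
slow_query("limit 10")          # second run: served from the cache
second = time.perf_counter() - t0

print(second < first)  # → True
```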
Now let's do something more interesting that will take a while every time.
select min(date) from pricedata;
Because the database has no structure telling it what the minimum is, i.e. it cannot know whether you have added a new minimum, this query is going to take a very long time every time you run it: it has to look at every value.
What is cached or not is itself enough material to write a book, but this article is pretty good if you want to dig deeper https://madusudanan.com/blog/understanding-postgres-caching-in-depth/#Contents
Now, intuitively, if you had a structure that keeps track of the minimum, then when you add a new value you would know very quickly whether you added a new minimum or not.
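That intuition can be sketched in a few lines of Python. This is a toy illustration, not how PostgreSQL works internally: by paying one comparison per insert, min() becomes a constant-time lookup instead of a full scan.

```python
# Toy illustration: track the minimum as values arrive, so reading it
# never requires scanning the whole collection.
class MinTrackingStore:
    """Stores values and maintains the current minimum on every insert."""

    def __init__(self):
        self.values = []
        self.minimum = None

    def add(self, value):
        self.values.append(value)
        # One comparison per insert keeps the minimum up to date.
        if self.minimum is None or value < self.minimum:
            self.minimum = value

store = MinTrackingStore()
for v in [1995, 2003, 1987, 2014, 1999]:
    store.add(v)

print(store.minimum)      # → 1987, constant-time: no scan needed
print(min(store.values))  # → 1987, what an unindexed table must do: scan everything
```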
Let's look at another example but this time, we will record execution time. Type
\timing in the psql terminal to activate timing recording.
SELECT * FROM pricedata WHERE date >= '2014-02-01' AND date < '2014-03-01';
Again, the database doesn't know our dates are organised in any particular way, so it has to scan the whole table to answer this query.
In my little refurbished Lenovo, this query took
Time: 127879.697 ms (02:07.880) the first time and
Time: 128829.645 ms (02:08.830) the second time I ran it.
Let's help the database understand the structure of our table by adding an index on date. What we want to do here is show the database that dates can be ordered, that they have a minimum and a maximum, and so on.
CREATE INDEX date_idx ON pricedata (date);
This will take some time as it will construct a btree (the default index type). You can read more about this data structure here https://en.wikipedia.org/wiki/B-tree. Also, this is a non-unique index sorted ascending (the defaults when you don't specify otherwise). For more details you can review the docs here https://www.postgresql.org/docs/current/sql-createindex.html.
The result should be something like this:
landregistry=# CREATE INDEX date_idx ON pricedata (date);
CREATE INDEX
Time: 138361.978 ms (02:18.362)
Now if you re-run the min
select min(date) from pricedata; you will see that it's almost instantaneous. For our previous (date range) query we get the following results for a first and a second run:
landregistry=# SELECT * FROM pricedata WHERE date >= '2014-02-01' AND date < '2014-03-01';
Time: 9664.184 ms (00:09.664)
landregistry=# SELECT * FROM pricedata WHERE date >= '2014-02-01' AND date < '2014-03-01';
Time: 383.211 ms
So the first time we see a massive improvement over the almost 2 minutes we saw originally, and then the database keeps the relevant index and table pages cached (because it knows the structure), so from that point forward it is much faster.
The best way to understand indexes is to think about them as a way to teach your database about your data. It is a different representation of your data. You can think about them as a different way to access the same data but one where there is a structure to it.
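That "different representation with structure" can be sketched with a sorted list and binary search. A B-tree is more sophisticated, but like the sorted list below it lets the database jump to the first matching date and read matches contiguously instead of scanning every row. All the data here is made up for illustration.

```python
# Rough sketch of why a sorted index speeds up range queries:
# binary-search to the range start, then read matches contiguously,
# O(log n + k) instead of a full O(n) scan.
import bisect
from datetime import date

# Unordered "table": (transaction_date, row_id)
rows = [
    (date(2014, 2, 15), 0),
    (date(1999, 6, 1), 1),
    (date(2014, 2, 3), 2),
    (date(2020, 11, 30), 3),
    (date(2014, 3, 1), 4),
]

# The "index": the same keys, kept sorted.
index = sorted((d, rid) for d, rid in rows)

def range_query(lo, hi):
    """Return row ids with lo <= date < hi using binary search."""
    start = bisect.bisect_left(index, (lo,))
    end = bisect.bisect_left(index, (hi,))
    return [rid for _, rid in index[start:end]]

print(range_query(date(2014, 2, 1), date(2014, 3, 1)))  # → [2, 0]
```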
For the same reason, every time you add data you need to update your index, so an index might actually be a bad idea if you do mostly writes and little reading (as in an archive table, for example).
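The write cost is easy to see in the same toy model: every insert now has to update two structures instead of one. (In a real B-tree the index update is O(log n), not O(n) as in this sorted list, but it is still extra work on every write.)

```python
# Toy illustration of index write amplification: each insert writes to
# the table AND keeps the index sorted.
import bisect

table = []        # the "heap file": appends are cheap
date_index = []   # the "index": must stay sorted on every insert

def insert(row_date, row):
    table.append((row_date, row))        # write 1: the table itself
    bisect.insort(date_index, row_date)  # write 2: keep the index sorted

for d in [2014, 1999, 2020, 1987]:
    insert(d, {"price": 100000})

print(date_index)  # → [1987, 1999, 2014, 2020]
```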
Having this in mind, you should be able to think about cases where an index will improve querying and when it won't and to ask yourself questions that will allow you to gain more knowledge around this topic.
Indexes can also be created as a way to establish unique fields (and unique-together combinations of fields).
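For example, assuming transaction identifiers really are unique in your copy of the data (they should be in this dataset, but verify first, because the statement fails if duplicates already exist):

```sql
-- Hypothetical example: enforce uniqueness on the transaction identifier.
CREATE UNIQUE INDEX transaction_idx ON pricedata (transaction);
```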
Indexes, and table structure in general, are a large topic, but now you have an awesome sandbox (a large table with different types of columns) where you can experiment by searching through strings, dates, numbers, booleans, etc.
I hope you enjoyed this article and have fun experimenting with indexes. Give me a shout if you have any questions, spot errors, or simply want to get in touch!