Nuno Martins

Choosing the right MySQL key-pair indexing

Recently, I had the task of creating a MySQL table whose main purpose was looking up records exclusively by a pair of attributes. A pretty simple table, one might say, but I was unsure of the best way to introduce indexes in order to prioritise performance. I'm aware this could probably be faster on a NoSQL engine, but that is not what this article is about.

I knew I would have to associate the two attributes but was unsure which was better: using SQL's key-pair indexes (a single index spanning both columns) or a hash algorithm (md5)? Did MySQL use a similar hashing algorithm, or maybe even a faster one than md5? I googled around and was surprised not to find an obvious answer (or maybe I just suck at googling).

I needed a benchmark that would give me answers, so I scripted one. The objective of this article is to show some performance results and draw some conclusions. For comparison purposes, I also wanted to check the performance of a table with indexes assigned individually to each attribute. In total there are 3 tables to benchmark:

Table individual_indexes

CREATE TABLE `individual_indexes` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `attribute_a` varchar(255) NOT NULL,
  `attribute_b` varchar(255) NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `attribute_a` (`attribute_a`),
  KEY `attribute_b` (`attribute_b`)
) ENGINE=InnoDB AUTO_INCREMENT=50001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Table keypair_index

CREATE TABLE `keypair_index` (
  `attribute_a` varchar(255) NOT NULL,
  `attribute_b` varchar(255) NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  KEY `keypair_index_index` (`attribute_a`,`attribute_b`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Table hash_index

CREATE TABLE `hash_index` (
  `hash` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `attribute_a` varchar(255) NOT NULL,
  `attribute_b` varchar(255) NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  KEY `hash_index_hash_index` (`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

The script for the benchmark was written in PHP 7.2 (running on Linux with MySQL 5.7.33). The first test creates the same number of records on each table, populating the fields with randomly generated values: attribute_a had four possible values, while attribute_b values were all distinct.
Please note that for the hash_index table, every record created requires calculating a hash (with the md5 algorithm) of what I decided would be a string concatenation of attribute_a and attribute_b. You can generate this hash in either PHP or MySQL.
The tests were run multiple times to make sure the results were consistent.
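
As a rough illustration of the MySQL-side option, the insert below computes the md5 hash inline with the built-in MD5() and CONCAT() functions. The sample values ('type_1', 'abc123') are made up for this sketch and are not taken from the benchmark script.

```sql
-- Sketch only: the attribute values are hypothetical placeholders.
-- MD5() returns a 32-character hexadecimal string, which fits the
-- varchar(255) `hash` column defined above.
INSERT INTO `hash_index` (`hash`, `attribute_a`, `attribute_b`, `created_at`)
VALUES (
  MD5(CONCAT('type_1', 'abc123')),  -- hash of attribute_a followed by attribute_b
  'type_1',
  'abc123',
  NOW()
);
```

One thing to keep in mind with plain concatenation: the pairs ('ab', 'c') and ('a', 'bc') both concatenate to 'abc' and therefore produce the same hash. If the data could contain such pairs, a variation (not what was benchmarked here) is to join the values with a separator that never occurs in them, for example CONCAT_WS('|', attribute_a, attribute_b).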

Benchmark results of time to run multiple insertions (time in milliseconds):
[image: table1]

At a very quick first look, it would seem that the hash_index table has the worst performance when inserting records. In fact, it offers the best performance if you plan to have a lot more than 50k records in your table. This is because it has the lowest scaling factor (lower is better), meaning it scales better than the other tables: once the table reaches a certain number of records, it will outperform the individual_indexes table. The keypair_index table started to look okay when inserting 10k records, but it's evident it will become a lot slower than the other tables once more than 50k records exist. The winner here, if you plan to have a lot of records and plan on scaling, is definitely the hash_index table.

Next, I wanted to test the read speed of searches, which for me was the most important performance measure. The script searches in the same way it created the records, with random values of attribute_a and attribute_b, and with the MySQL cache disabled (this is to mimic a production table that is constantly being updated). Please note that for the hash_index table it is also necessary to calculate the md5 hash for every read (SELECT) operation, because the hash is what identifies a record in the table.
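
To make the comparison concrete, the lookups against the three tables would look roughly like the queries below. The exact statements used by the benchmark script are not shown in this article, so these are a sketch with made-up sample values.

```sql
-- Sketch only: 'type_1' and 'abc123' are hypothetical sample values.

-- individual_indexes: two single-column indexes are available; the
-- optimizer may pick one of them or combine both.
SELECT * FROM `individual_indexes`
WHERE `attribute_a` = 'type_1' AND `attribute_b` = 'abc123';

-- keypair_index: one index covers both columns of the WHERE clause.
SELECT * FROM `keypair_index`
WHERE `attribute_a` = 'type_1' AND `attribute_b` = 'abc123';

-- hash_index: the md5 hash is recomputed for every search and
-- compared against the single indexed `hash` column.
SELECT * FROM `hash_index`
WHERE `hash` = MD5(CONCAT('type_1', 'abc123'));
```

Prefixing any of these statements with EXPLAIN shows which index MySQL actually chooses for it.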

Benchmark results of time to do 10k searches on each table (time in milliseconds):
[image: table2]

Now it is obvious that the hash_index table offers the best performance and scaling at every level. Again, the lowest scaling factor (lower is better) of 1.05(!) indicates how well this will perform with huge tables, although it is worth noting that this factor will increase slightly with the number of records due to the nature of the B-tree structure. I don't think I even need to bring the other tables into this discussion.

Conclusion: A hash index (here, a regular index on an md5 hash column) will work much faster, and scale much better, than a key-pair MySQL index if you want a table that performs a lot of lookups, specifically with two or more attributes combined.
There may be value in using a key-pair index if you plan to do something like a log table that performs very few search operations and is expected to hold a small number of records, but in the long run I would always go with a hash index.

Top comments (1)

Vincent Milum Jr

If you want your hash index to perform better, don't use character data for it. Especially don't use Unicode and case-insensitive data! Both of these add computational overhead, especially for inserts, and they also increase the size on disk.

Convert your hash data into BINARY and use BINARY data type with a fixed length based on your hashing algorithm.
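
A minimal sketch of what this suggestion could look like for the article's hash table: an md5 digest is 128 bits, so it fits in a BINARY(16) column, and UNHEX() turns the 32-character hex string returned by MD5() into those 16 raw bytes. The table name `hash_index_binary` and the sample values are hypothetical, and this variant was not part of the benchmark above.

```sql
-- Sketch only: hypothetical table name and sample values.
CREATE TABLE `hash_index_binary` (
  `hash` binary(16) NOT NULL,               -- raw 16-byte md5 digest
  `attribute_a` varchar(255) NOT NULL,
  `attribute_b` varchar(255) NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  KEY `hash_index_binary_hash_index` (`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

-- Insert: convert the hex digest to raw bytes before storing it.
INSERT INTO `hash_index_binary` (`hash`, `attribute_a`, `attribute_b`, `created_at`)
VALUES (UNHEX(MD5(CONCAT('type_1', 'abc123'))), 'type_1', 'abc123', NOW());

-- Lookup: apply the same conversion to the search key.
SELECT * FROM `hash_index_binary`
WHERE `hash` = UNHEX(MD5(CONCAT('type_1', 'abc123')));
```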

But also note, it looks like your test case is still relatively small. Once you start adding size and concurrency, the hash index will start to slow down due to range locking in MySQL/MariaDB. This also happens for indexes on the "key" of your "key/value" pair too. However, this is not the case with an auto_increment primary key.

There is a lot more that can be explored in this space! :)