Our QA team are absolute monsters, they're monsters I tell you!
No matter how much I try to bribe them with drinks at our next company gathering to stop finding bugs in my code, they keep coming up with unreasonable requests and "expected behaviours"...ewwwww, gross!
Don't believe me?
Recently they asked for a list to be ordered alphanumerically.
Can you believe that? Monsters I tell you!
I mean, it isn't like an end user would expect a list to be sorted:
test 1
test 2
test 12
...why can't it just be like MySQL thinks it should be sorted:
test 1
test 12
test 2
Makes perfect sense right? Right?
Anyway lets not get too deep into who is right and who is wrong here (they're right FYI), I thought I better try and fix it.
I mean...how hard can it be? (foreshadowing)
The short answer
I know you might be busy, so the short answer is sort it in JS / PHP / whatever. Don't try and do it with MySQL.
Like seriously, just return your data set and then use Intl.Collator
if you are using Node, or natsort()
in PHP.
And you know what, that would have been fine. It would have solved the problem and I could have moved on with my day. QA would have been happy, list would be sorted, all good!
However, if you are like me then when your CTO reviews your PR and says:
I take that to heart!
I mean, he is so right!
Why should we have to sort things in code, MySQL should be able to do that for us!
What ensued was over 4 hours of fiddling and swearing to try and get MySQL to bend to my will.
Did I manage it? No...but kinda.
What I ended up with works well enough for user input in our specific scenario, and is much better than other solutions I found, but has it's limitations.
But we are getting ahead of ourselves, I want to show you what is it like trying to solve this in MySQL!
The Data for testing
First thing is first, let's start with the test data I used, just so you can see if I missed anything:
The Test Data
Test Data
INSERT INTO test_data (data_value) VALUES
('2 test'),
('12 test'),
('1 test'),
('20 test'),
('10 test'),
('1test'),
('2test'),
('test 1'),
('my data 2020-01-01'),
('my data 2020-02-01'),
('01-02-24 data'),
('12 test'),
('4a test'),
('my 2020-01-01 data 2020-01-01'),
('my 2020-02-01 data 2020-01-01'),
('my 2020-02-01 data 202-01-01'),
('my 2020-02-01 data 20-01-01'),
('my 2020-02-01 data 1-01-01'),
('my 2020-02-01 data 2-01-01'),
('my 2020-02-01 data 12-01-01'),
('my 2020-02-01 data 01-01-01'),
('my 2020-01-01 data 2020-02-01'),
('my 2020-01-01 data 2021-01-01'),
('my 2020-01-01 data 2120-01-01'),
('my 2120-01-01 data 2020-01-01'),
('4b test'),
('my test'),
('my 12 magic test'),
('my magic 12 test'),
('cheese and test 12'),
('42-a-1'),
('40-a-1'),
('40a'),
('FoClSy4727'),
('Pthw068bf'),
('6bfS'),
('HOFAp_Yx7920'),
('25hWTX'),
('dnjLlW1'),
('RHrIt72402eaLr'),
('cIhb42WFNQ'),
('9244uVCpGa'),
('yDKrkCp7960'),
('GeGIrPM-H86'),
('wrOae537LGCT'),
('WffSPaBA318'),
('kQ33596c'),
('3uEKHmHePf'),
('796h-eYWy'),
('833HufIZAS'),
('utjtV03Xns'),
('dlCSh87811'),
('13IUkOxEVl'),
('VHCok55901XYVk'),
('2RnSVwq'),
('AwtwQdn09'),
('gvSV6z'),
('uxWLO039hb'),
('vTg946');
Some nonsense with numbers in various places throughout the strings, seemed reasonable enough to me.
It is probably not a perfect data set, but good enough for what I was after!
So once I had some data I did what every developer does, asked the Google God for an answer:
Googling for natural sort in MySQL
Seems like the logical first step right?
MySQL has been around for 30 years, and alphanumeric sorting is something you would think people need to do often, so it must have some answers out there already right?
Short answer...it was exhausting. Every single article, Stack Overflow post etc. spitting out the same nonsense that doesn't work or is hyper specific to a particular data format.
I wanted something generic, so I could just sort alphanumerically everywhere!
So let me walk you through the pain I endured so you don't have to, and so you can decide if my answer is any better...or if I have just added to the pain!
Let's try some of the generic sort methods people suggested:
Test 1: the "+0" method
SELECT
*
FROM
test_data
ORDER BY
`data_value`+0 ASC
The idea being that it tries to force MySQL to do numeric ordering over lexicographical ordering.
This would work fine if we had numbers or words in each row (the numbers would indeed sort correctly and then the words).
However because we have mixed data, it just spat out a load of nonsense:
Test 1 results
id | data_value |
---|---|
44 | GeGIrPM-H86 |
25 | my 2120-01-01 data 2020-01-01 |
27 | my test |
28 | my 12 magic test |
29 | my magic 12 test |
34 | FoClSy4727 |
Full Results for test 1
id
data_value
44
GeGIrPM-H86
25
my 2120-01-01 data 2020-01-01
27
my test
28
my 12 magic test
29
my magic 12 test
34
FoClSy4727
35
Pthw068bf
37
HOFAp_Yx7920
39
dnjLlW1
40
RHrIt72402eaLr
41
cIhb42WFNQ
43
yDKrkCp7960
30
cheese and test 12
45
wrOae537LGCT
46
WffSPaBA318
47
kQ33596c
51
utjtV03Xns
52
dlCSh87811
54
VHCok55901XYVk
56
AwtwQdn09
57
gvSV6z
58
uxWLO039hb
59
vTg946
17
my 2020-02-01 data 20-01-01
8
test 1
9
my data 2020-01-01
10
my data 2020-02-01
14
my 2020-01-01 data 2020-01-01
15
my 2020-02-01 data 2020-01-01
16
my 2020-02-01 data 202-01-01
18
my 2020-02-01 data 1-01-01
19
my 2020-02-01 data 2-01-01
21
my 2020-02-01 data 01-01-01
23
my 2020-01-01 data 2021-01-01
22
my 2020-01-01 data 2020-02-01
20
my 2020-02-01 data 12-01-01
24
my 2020-01-01 data 2120-01-01
6
1test
11
01-02-24 data
3
1 test
55
2RnSVwq
1
2 test
7
2test
48
3uEKHmHePf
13
4a test
26
4b test
36
6bfS
5
10 test
2
12 test
12
12 test
53
13IUkOxEVl
4
20 test
38
25hWTX
33
40a
32
40-a-1
31
42-a-1
49
796h-eYWy
50
833HufIZAS
42
9244uVCpGa
Ok so that didn't work (and I didn't really expect it to).
Let's try something else:
Test 2: The "length" hack
SELECT
alphanumeric, integer
FROM
sorting_test
ORDER BY
LENGTH(alphanumeric), alphanumeric
This is another interesting approach.
obviously "12" is longer than "1" so it would work there.
It also works if you have a standard format for data, like the "test 1, test 12, test 2" I mentioned at the beginning.
However, for our data it returns more nonsense:
Test 2 results
id | data_value |
---|---|
33 | 40a |
36 | 6bfS |
6 | 1test |
7 | 2test |
3 | 1 test |
1 | 2 test |
38 | 25hWTX |
32 | 40-a-1 |
31 | 42-a-1 |
57 | gvSV6z |
8 | test 1 |
59 | vTg946 |
5 | 10 test |
Full Results for test 2
id
data_value
33
40a
36
6bfS
6
1test
7
2test
3
1 test
1
2 test
38
25hWTX
32
40-a-1
31
42-a-1
57
gvSV6z
8
test 1
59
vTg946
5
10 test
2
12 test
12
12 test
4
20 test
55
2RnSVwq
13
4a test
26
4b test
39
dnjLlW1
27
my test
47
kQ33596c
49
796h-eYWy
56
AwtwQdn09
35
Pthw068bf
53
13IUkOxEVl
48
3uEKHmHePf
50
833HufIZAS
42
9244uVCpGa
41
cIhb42WFNQ
52
dlCSh87811
34
FoClSy4727
51
utjtV03Xns
58
uxWLO039hb
44
GeGIrPM-H86
46
WffSPaBA318
43
yDKrkCp7960
37
HOFAp_Yx7920
45
wrOae537LGCT
11
01-02-24 data
40
RHrIt72402eaLr
54
VHCok55901XYVk
28
my 12 magic test
29
my magic 12 test
30
cheese and test 12
9
my data 2020-01-01
10
my data 2020-02-01
18
my 2020-02-01 data 1-01-01
19
my 2020-02-01 data 2-01-01
21
my 2020-02-01 data 01-01-01
20
my 2020-02-01 data 12-01-01
17
my 2020-02-01 data 20-01-01
16
my 2020-02-01 data 202-01-01
14
my 2020-01-01 data 2020-01-01
22
my 2020-01-01 data 2020-02-01
23
my 2020-01-01 data 2021-01-01
24
my 2020-01-01 data 2120-01-01
15
my 2020-02-01 data 2020-01-01
25
my 2120-01-01 data 2020-01-01
And that also makes sense, all it is doing is ordering on length first so:
"a1" would come before "1potato" by length.
Test 3: The Hardouken!
Just take a look at this beauty:
SELECT
*
FROM
test_data
ORDER BY
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
data_value,
'0', 'Δ'),
'1', 'Δ'),
'2', 'Δ'),
'3', 'Δ'),
'4', 'Δ'),
'5', 'Δ
'),
'6', 'Δ'),
'7', 'Δ'),
'8', 'Δ'),
'9', 'Δ')
COLLATE utf8_bin;
The meme is real!
But despite that, this one pretty much works (albeit slightly differently to how I would have done it).
This one is smart in that it replaces each number with a character that will be sorted based on their binary representation in Unicode.
Test 3 results
id | data_value |
---|---|
56 | AwtwQdn09 |
34 | FoClSy4727 |
44 | GeGIrPM-H86 |
37 | HOFAp_Yx7920 |
35 | Pthw068bf |
40 | RHrIt72402eaLr |
54 | VHCok55901XYVk |
46 | WffSPaBA318 |
41 | cIhb42WFNQ |
30 | cheese and test 12 |
Full Results for test 3
id
data_value
56
AwtwQdn09
34
FoClSy4727
44
GeGIrPM-H86
37
HOFAp_Yx7920
35
Pthw068bf
40
RHrIt72402eaLr
54
VHCok55901XYVk
46
WffSPaBA318
41
cIhb42WFNQ
30
cheese and test 12
52
dlCSh87811
39
dnjLlW1
57
gvSV6z
47
kQ33596c
9
my data 2020-01-01
10
my data 2020-02-01
29
my magic 12 test
27
my test
28
my 12 magic test
14
my 2020-01-01 data 2020-01-01
22
my 2020-01-01 data 2020-02-01
23
my 2020-01-01 data 2021-01-01
24
my 2020-01-01 data 2120-01-01
21
my 2020-02-01 data 01-01-01
18
my 2020-02-01 data 1-01-01
20
my 2020-02-01 data 12-01-01
19
my 2020-02-01 data 2-01-01
17
my 2020-02-01 data 20-01-01
16
my 2020-02-01 data 202-01-01
15
my 2020-02-01 data 2020-01-01
25
my 2120-01-01 data 2020-01-01
8
test 1
51
utjtV03Xns
58
uxWLO039hb
59
vTg946
45
wrOae537LGCT
43
yDKrkCp7960
11
01-02-24 data
3
1 test
6
1test
5
10 test
2
12 test
12
12 test
53
13IUkOxEVl
1
2 test
55
2RnSVwq
7
2test
4
20 test
38
25hWTX
48
3uEKHmHePf
13
4a test
26
4b test
32
40-a-1
33
40a
31
42-a-1
36
6bfS
49
796h-eYWy
50
833HufIZAS
42
9244uVCpGa
You know what, I would use this if it wasn't for one thing. It would mean that we could no longer accept 'Δ' etc as input.
Unfortunately our software will be used internationally, so we can't make that call.
Additionally if you do decide to use this, it is worth noting your database column must use COLLATE 'utf8_bin'
on it as well or you may get errors.
Chat GPT Time
When googling fails, chat GPT can always help right?
I am not going to show all the results, but let's just say the following are just a few of the "did not work" ones it produced!
Test 4: GPT fever dream
SELECT
data_value,
-- Replace numbers with 'a' and letters with 'b' dynamically
(
SELECT GROUP_CONCAT(
CASE
WHEN c REGEXP '[0-9]' THEN 'a' -- Replace digits with 'a'
WHEN c REGEXP '[a-zA-Z]' THEN 'b' -- Replace letters with 'b'
ELSE c -- Leave other characters as-is
END
ORDER BY seq
) SEPARATOR ''
FROM (
SELECT SUBSTRING(data_value, seq, 1) AS c, seq
FROM (
SELECT data_value, seq
FROM test_data
JOIN (
SELECT n AS seq
FROM (SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10) numbers
WHERE n <= 255 -- Handle up to 255 characters in a string
) seq_table
ON seq <= CHAR_LENGTH(data_value)
) char_table
) char_map
) AS sort_key
FROM test_data
ORDER BY sort_key;
It didn't run, and I couldn't quite work out why, but isn't it impressive looking!
What it seemed to be trying to do was replace all numbers with "a" and all letters with "b" to form strings like "aabaaa, aabba, abbaa" for comparison.
I would have probably spent more time trying to fix this if it wasn't flawed.
"11ab1" would produce "aabba"
"111ab1" would produce "aaaba".
So with the way MySQL sorts, 111ab1 would have been above 11ab1 anyway as "aaa" comes before "aab".
Nice idea, but sadly not quite it.
Test 5: GPT nearly nailed it!
This one surprised me.
I had seen it elsewhere already (as we know GPT is great at just spitting things out that it has already seen and isn't actually creative), but the explanation it gave made more sense!
SELECT
*
FROM
test_data
ORDER BY
CAST(data_value AS UNSIGNED), data_value ASC;
Ahhhh nearly. I really thought this was the answer.
But sadly it fails on "Test 1, Test 12, Test 2". It is good for sorting if the number comes first though!
My Turn
I still couldn't get what I wanted after several attempts. But I was now much closer to an answer after seeing a few techniques.
After a bit of noodle scratching and crying I thought I had solved it!
It works...kinda!
Here is what I came up with:
CREATE TABLE test_data (
id INT AUTO_INCREMENT PRIMARY KEY,
data_value VARCHAR(255) NOT NULL,
transformed_column VARCHAR(255) AS (
CASE
WHEN REGEXP_SUBSTR(data_value, '[0-9]+') IS NOT NULL THEN
REGEXP_REPLACE(
data_value,
'[0-9]+',
LPAD(REGEXP_SUBSTR(data_value, '[0-9]+'), 8, '0')
)
ELSE data_value -- Or another default value, e.g., 'No Numbers'
END
) STORED
);
Then we can query it like so:
SELECT
*
FROM
test_data
ORDER BY
transformed_column, data_value;
And this gives us nearly the results we would expect.
Test 6 results
data_value | transformed_column |
---|---|
1 test | 00000001 test |
01-02-24 data | 00000001-00000001-00000001 data |
1test | 00000001test |
2 test | 00000002 test |
2RnSVwq | 00000002RnSVwq |
2test | 00000002test |
3uEKHmHePf | 00000003uEKHmHePf |
4a test | 00000004a test |
Full Results for test 6
data_value
transformed_column
1 test
00000001 test
01-02-24 data
00000001-00000001-00000001 data
1test
00000001test
2 test
00000002 test
2RnSVwq
00000002RnSVwq
2test
00000002test
3uEKHmHePf
00000003uEKHmHePf
4a test
00000004a test
4b test
00000004b test
6bfS
00000006bfS
10 test
00000010 test
12 test
00000012 test
12 test
00000012 test
13IUkOxEVl
00000013IUkOxEVl
20 test
00000020 test
25hWTX
00000025hWTX
40-a-1
00000040-a-00000040
40a
00000040a
42-a-1
00000042-a-00000042
796h-eYWy
00000796h-eYWy
833HufIZAS
00000833HufIZAS
9244uVCpGa
00009244uVCpGa
AB-34Y67846
AB-00000034Y00000034
AwtwQdn09
AwtwQdn00000009
cheese and test 12
cheese and test 00000012
cIhb42WFNQ
cIhb00000042WFNQ
dlCSh87811
dlCSh00087811
dnjLlW1
dnjLlW00000001
E5-RMT893Y9
E00000005-RMT00000005Y00000005
EV-489RY3DA
EV-00000489RY00000489DA
FoClSy4727
FoClSy00004727
GeGIrPM-H86
GeGIrPM-H00000086
gvSV6z
gvSV00000006z
HOFAp_Yx7920
HOFAp_Yx00007920
kQ33596c
kQ00033596c
my 12 magic test
my 00000012 magic test
my 2020-01-01 data 2020-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-01-01 data 2020-02-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-01-01 data 2021-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-01-01 data 2120-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 01-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 1-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 12-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 2-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 20-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 202-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2020-02-01 data 2020-01-01
my 00002020-00002020-00002020 data 00002020-00002020-00002020
my 2120-01-01 data 2020-01-01
my 00002120-00002120-00002120 data 00002120-00002120-00002120
my data 2020-01-01
my data 00002020-00002020-00002020
my data 2020-02-01
my data 00002020-00002020-00002020
my magic 12 test
my magic 00000012 test
my test
my test
Pthw068bf
Pthw00000068bf
RHrIt72402eaLr
RHrIt00072402eaLr
test 1
test 00000001
utjtV03Xns
utjtV00000003Xns
uxWLO039hb
uxWLO00000039hb
VHCok55901XYVk
VHCok00055901XYVk
vTg946
vTg00000946
WffSPaBA318
WffSPaBA00000318
wrOae537LGCT
wrOae00000537LGCT
yDKrkCp7960
yDKrkCp00007960
It breaks on the rows:
data_value | transformed_column |
---|---|
my 2020-02-01 data 1-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 12-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 2-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
As you can see, but overall I think it will suit our needs for now!
This was based on a LPAD
example I saw. But it had the same flaw that it only worked on Strings with numbers at the beginning.
So I took that principle and applied it to all number series.
Also this now uses a generated column so that it is fast on reads (although we could soon make it part of the query if it didn't hurt read performance too much or we were more bothered about write performance).
The big issue is the fact that this works through "brute force" by working around the fact that MySQL sees "11" as smaller than "2" due to checking one character at a time (the dumbed down version of what happens) by making all numbers 8 digits long.
It also replaces all numbers in the transformed_column
with the first number it finds and as far as I can tell, other than nesting about 20 statements and using coalesce
or something, there is no way to avoid this behaviour I could find.
Conclusion
I am not really sure? MySQL should join the 21st century and provide an alphanumeric sort option like other DBs?
While we wish for things to happen I think the short answer in the scope of this article is that I still haven't solved this.
But with that being said, I feel like test 6 is much closer to the answer than anything I found out in the wild and may be useful to someone someday?
Obviously if you are expecting numbers longer than 8 digits, you may want to adjust the LPAD
value as that is one limitation.
Also it isn't perfect so you can't rely on it for scenarios other than sorting user input so they can find things easily.
I think at this point my best bet is to throw it out into the world and see if some smarter people than me can solve this thing properly! (I am probably needing some much deeper knowledge of sorting in MySQL to really solve it!)
If someone can solve the problem I have where I can't change every number to 8 digits long, then this would work even better! (so "my 2020-02-01 data 2-01-01" would become "my 00002020-00000002-00000001 data 00000002-00000001-00000001").
Or maybe there is a literal one liner that I missed in all my research that solves this? (I will only cry a little bit if there is, I promise!).
Anyway, that is the best I have, but would love to see something better!
Finally here is a DB Fiddle to play with
Here is a DB fiddle of the generated column version to start from if you think you can spot a quick win / improvement:
https://www.db-fiddle.com/f/o2ohcGVAgHZQg4teg1s9jW/1034
Thanks for reading, catch you all soon!
Top comments (1)
Guarantee someone shows me how to do this in one line and I end up deleting the post or putting "here is the answer, ignore the rest of the article" at the top! haha