Recently I had the task to find all the Diacritics in a database table. I had to find an easy way to select every row that has a diacritic in the name. The solution was surprisingly simple, but a bit unintuitive. In this article, I will provide you with that solution as well as additional information about how Java and Postgres treat diacritics and other characters.
What is a diacritic?
A Diacritic is a mark or an accent on a letter that indicates that the letter should be pronounced differently. Diacritics are often referred to as accents. The marked letter and the un-marked letter are logically the same letters in the language. Examples of Diacritics are
áàâüñçå
It is important to understand that some characters that we might consider diacritics are not actually diacritics. This is because, In the character's respective alphabet, they are individual letters. Examples are
Æ æ Ø ø Ł ł
When dealing with diacritics its important to understand how the system will treat certain characters.
How does Java deal with Diacritics?
Java handles diacritics in a strict fashion. For these examples, we will use the Normalizer.normalize utility in the Apache Commons-text library. We will see that "Æ æ Ø ø Ł ł", are not replaced with a similar English analog.
We will use the following method for testing
public static String stripDiacritics(String input){
if ( !Normalizer.isNormalized(input, Normalizer.Form.NFD) ) {
return Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
}
return input;
}
Testing this method will result in the following
@test
public void testStripDiacritic(){
assertEquals("aabbsser3ssa35ds", DiacriticUtility.stripDiacritics("aabbsser3ssa35ds"));
assertEquals("This is a funky String", DiacriticUtility.stripDiacritics("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ"));
assertEquals("Ø or Ł", DiacriticUtility.stripDiacritics("Ø or Ł"));
assertEquals("the German umlauts: a, o, u", DiacriticUtility.stripDiacritics("the German umlauts: ä, ö, ü"));
assertEquals("manana", DiacriticUtility.stripDiacritics("mañana"));
assertEquals("Æ æ Æ æ or Æ æ", DiacriticUtility.stripDiacritics("Ǣ ǣ Ǽ ǽ or Æ æ"));
}
Notice diacritics such as "ůňķŷ" are replaced with "unky", but "Æ æ" are not replaced with "AE ae"
How does Postgres handle diacritics?
Postgres treats diacritics the same as Java but has a different effect on the other characters. In Postgres, we use the unaccent extension.
To install unccent run the following.
CREATE EXTENSION unaccent;
Now let's run a simple query to see how characters are treated.
select unaccent('Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ | łŁØø | ä, ö, ü | Ǣ ǣ Ǽ ǽ æ or Æ æ')
FROM blog.diacritic_blog LIMIT1;
Output:
This is a funky String | lLOo | a, o, u | Ǣ ǣ Ǽ ǽ ae or AE ae
Notice that the expected diacritics are resolved, but so is "Æ æ ł Ł Ø ø". Interestingly enough is the Postgres's unaccent does not resolve "Ǣ ǣ Ǽ ǽ" to "Æ æ Æ æ" or to "AE ae AE as".
Finding diacritics in a Postgres Table?
As we saw above there are a lot of special cases regarding how Postgres will handle diacritics and special characters. This solution to finding the rows with diacritics will not take those edge cases into account but will give you a basic solution to querying the table for diacritics. You can add regex and additional where clauses to the query as needed.
This example will use the following
CREATE TABLE blog.diacritic_blog(
id bigserial NOT NULL,
name varchar(255) NOT NULL,
CONSTRAINT pk_diacritic_id PRIMARY KEY (id)
);
INSERT INTO blog.diacritic_blog(name) values ('aabbsser3ssa35ds');
INSERT INTO blog.diacritic_blog(name) values ('Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ');
INSERT INTO blog.diacritic_blog(name) values ('Ø or Ł or łŁØø');
INSERT INTO blog.diacritic_blog(name) values ('the German umlauts: ä, ö, ü');
INSERT INTO blog.diacritic_blog(name) values ('mañana');
INSERT INTO blog.diacritic_blog(name) values ('Ǣ ǣ Ǽ ǽ æ or Æ æ');
The following query will return 5 rows:
SELECT name, unaccent(name) FROM blog.diacritic_blog where name != unaccent(name);
OUTPUT:
Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ | This is a funky String |
Ø or Ł or łŁØø | O or L or lLOo |
the German umlauts: ä, ö, ü | the German umlauts: a, o, u |
mañana | manana |
Ǣ ǣ Ǽ ǽ æ or Æ æ | Ǣ ǣ Ǽ ǽ ae or AE ae |
Summary
Finding all the Diacritics in a Postgres table is simple if not a bit unintuitive. However, handling diacritics and non-English alphabet characters is a minefield. Hopefully, now you have a basic understanding of diacritics and how Java and Postgres treat these characters.
If you'd like to see a complete treatment of how other languages (JavaScript or Python) treats these character please reach out to me via Twitter @ryboflavin42 and let me know what other languages you'd like to see.
The examples from this article can be found on GitHub at https://github.com/djchi82/DiacriticBlogProject.
Top comments (1)
This article is republished from tenmilesquare.com/finding-diacriti....
During the original publishing, it was brought to my attention a few poor choices in wording.
I mention this is how "Postgres handles" diacritics. Postgres actually handles diacritics well and stores them with no problem. Its the Postgres unaccent extension that handles the diacritics in this scenario.
Please excuse the liberty in the language.