Unicode Normalization for NLP in Python

#python #machinelearning #deeplearning #tutorial

ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖.

Text like this is also incredibly common, particularly on social media.

Another pain point comes from diacritics (the small marks on letters such as Ç, é, Å), which you'll find in almost every European language.

These characters have a hidden property that can trip up any NLP model. Take a look at the Unicode code points for two versions of Ç:

Latin capital letter C with cedilla: \u00C7

Latin capital letter C + combining cedilla: \u0043\u0327

The two are completely different code point sequences, despite rendering as the same character, so a plain string comparison treats them as different text.
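A quick way to see the mismatch is to compare the two strings directly in Python. This is a minimal sketch; the variable names are only for illustration.

```python
# Two ways of writing the same visible character, Ç
precomposed = "\u00C7"        # Latin capital letter C with cedilla (one code point)
decomposed = "\u0043\u0327"   # Latin capital letter C + combining cedilla (two code points)

same_string = precomposed == decomposed          # compares code points, not rendered glyphs
lengths = (len(precomposed), len(decomposed))    # number of code points in each string

print(same_string)   # False
print(lengths)       # (1, 2)
```

Any tokenizer or vocabulary lookup that works on raw code points will likewise treat these as two distinct inputs.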

To deal with all of these text variants we need to use Unicode normalization, which we will cover in this video.
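As a rough sketch of what that looks like in practice, Python's standard library exposes normalization through `unicodedata.normalize`. The helper name `normalize_text` and the choice of NFKC as the default are assumptions made for this example, not something prescribed by the article.

```python
import unicodedata

def normalize_text(text: str, form: str = "NFKC") -> str:
    """Return `text` converted to the given Unicode normalization form."""
    return unicodedata.normalize(form, text)

fancy = "ℕ𝕃ℙ"                 # double-struck letters, like the opening paragraph
decomposed = "\u0043\u0327"   # C + combining cedilla

plain = normalize_text(fancy)                  # compatibility mapping to plain letters
composed = normalize_text(decomposed, "NFC")   # canonical composition into one code point

print(plain)                   # NLP
print(composed == "\u00C7")    # True
```

NFKC is a common choice for cleaning model input because it folds font variants into plain letters and composes diacritics. It is lossy, though, so it is worth checking that the distinctions it removes do not matter for your task.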

Top comments (1)

Arvind Padmanabhan • Mar 18 '21

This topic is briefly covered in this article: devopedia.org/text-normalization
In particular, check out the 4 forms: NFD, NFC, NFKD and NFKC
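
To make the four forms concrete, the sketch below runs each one over a two-character string and prints the resulting code points. The sample string and the `code_points` helper are illustrative assumptions.

```python
import unicodedata

text = "\u00C7\u2115"  # Ç (precomposed) followed by ℕ (double-struck capital N)

def code_points(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

results = {
    form: code_points(unicodedata.normalize(form, text))
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, cps in results.items():
    print(f"{form:5} {cps}")
# NFD   U+0043 U+0327 U+2115
# NFC   U+00C7 U+2115
# NFKD  U+0043 U+0327 U+004E
# NFKC  U+00C7 U+004E
```

The D forms decompose Ç into a base letter plus a combining mark, while the C forms recompose it. Only the K (compatibility) forms replace ℕ with a plain N.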

DEV Community
