DEV Community

James Briggs

Unicode Normalization for NLP in Python

ℕ𝕠-𝕠𝕟𝕖 𝕚𝕟 𝕥𝕙𝕖𝕚𝕣 𝕣𝕚𝕘𝕙𝕥 𝕞𝕚𝕟𝕕 𝕨𝕠𝕦𝕝𝕕 𝕖𝕧𝕖𝕣 𝕦𝕤𝕖 𝕥𝕙𝕖𝕤𝕖 𝕒𝕟𝕟𝕠𝕪𝕚𝕟𝕘 𝕗𝕠𝕟𝕥 𝕧𝕒𝕣𝕚𝕒𝕟𝕥𝕤. 𝕋𝕙𝕖 𝕨𝕠𝕣𝕤𝕥 𝕥𝕙𝕚𝕟𝕘, 𝕚𝕤 𝕚𝕗 𝕪𝕠𝕦 𝕕𝕠 𝕒𝕟𝕪 𝕗𝕠𝕣𝕞 𝕠𝕗 ℕ𝕃ℙ 𝕒𝕟𝕕 𝕪𝕠𝕦 𝕙𝕒𝕧𝕖 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 𝕝𝕚𝕜𝕖 𝕥𝕙𝕚𝕤 𝕚𝕟 𝕪𝕠𝕦𝕣 𝕚𝕟𝕡𝕦𝕥, 𝕪𝕠𝕦𝕣 𝕥𝕖𝕩𝕥 𝕓𝕖𝕔𝕠𝕞𝕖𝕤 𝕔𝕠𝕞𝕡𝕝𝕖𝕥𝕖𝕝𝕪 𝕦𝕟𝕣𝕖𝕒𝕕𝕒𝕓𝕝𝕖.

We also find that text like this is incredibly common, particularly on social media.

Another pain point comes from diacritics (the small marks on Ç, é, Å), which you'll find in almost every European language.

These characters have a hidden property that can trip up any NLP model. Take a look at the Unicode code points for two versions of Ç:

Latin capital letter C with cedilla: \u00C7

Latin capital letter C + combining cedilla: \u0043\u0327

The two are completely different code point sequences, despite rendering as the same character, so a plain string comparison treats them as unequal.
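To make the difference concrete, here is a minimal Python sketch that builds both versions of Ç from the code points listed above and compares them. It only uses the standard-library `unicodedata` module; the variable names `composed` and `decomposed` are just illustrative.

```python
import unicodedata

# One code point: Latin capital letter C with cedilla
composed = "\u00C7"
# Two code points: Latin capital letter C + combining cedilla
decomposed = "\u0043\u0327"

# They render alike but are not equal as Python strings
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# Inspect the code points hiding behind each version
for label, text in (("composed", composed), ("decomposed", decomposed)):
    for ch in text:
        print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Any tokenizer or vocabulary lookup that works on raw code points would see these as two different inputs.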

To deal with all of these text variants we need to use Unicode normalization, which we will cover in this video.
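As a rough preview, the sketch below shows how normalization could be applied with Python's built-in `unicodedata.normalize`. The helper `normalize_text` and its NFKC default are assumptions made for illustration, not necessarily the approach taken in the video; which form is appropriate depends on your task.

```python
import unicodedata

def normalize_text(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: apply one Unicode normalization form to text."""
    return unicodedata.normalize(form, text)

composed = "\u00C7"               # C with cedilla as one code point
decomposed = "\u0043\u0327"       # C followed by a combining cedilla
fancy = "\u2115\U0001D543\u2119"  # "NLP" written in double-struck letters

# NFC (canonical composition) merges C + combining cedilla into one code point
print(normalize_text(decomposed, "NFC") == composed)  # True

# NFC leaves the double-struck font variants untouched
print(normalize_text(fancy, "NFC") == fancy)          # True

# NFKC (compatibility composition) also maps font variants to plain letters
print(normalize_text(fancy))                          # NLP
print(normalize_text(decomposed) == composed)         # True
```

Compatibility forms discard the styling information, which is usually what you want before tokenization but is not reversible.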

Discussion (1)

Arvind Padmanabhan

This topic is briefly covered in this article: devopedia.org/text-normalization
In particular, check out the four forms: NFD, NFC, NFKD, and NFKC.