Diacritic-insensitive string comparison in JavaScript

#webdev #javascript #beginners #frontend

Managing and manipulating text is a core task in web development. Strings, the primary data type for text in languages like JavaScript and TypeScript, come with many built-in methods for processing it. However, handling Unicode characters outside the ASCII range can make this task significantly more challenging.

Unicode is a universal character encoding standard. It covers virtually every character from the writing systems in use today. While this breadth is one of Unicode's strengths, it also introduces challenges for developers. For instance, different sequences of Unicode code points can represent the same visible character. This variation can lead to inconsistencies in text processing within web applications.

One particular task where these challenges become evident is filtering an array of objects based on user input. In this article, we will delve into this specific scenario and explore how to perform diacritic-insensitive string comparisons in JavaScript.

While filtering an array of objects based on a string value might seem like a straightforward task at first glance, it can get quite complex, especially when dealing with Unicode characters and ensuring consistency across various writing systems. By the end of this article, you will have a better understanding of how to tackle these complexities in your JavaScript or TypeScript projects.

Filtering an array of objects in JavaScript

Let's consider a practical example where we have a list of airport options for a Select component in a web application. Our goal is to filter these options based on user input before populating the Select component.

const filter = "Cancun";
const airports = [
  {
    name: "Húsavík Airport",
    code: "HZK",
  },
  {
    name: "Cancún International Airport",
    code: "CUN",
  },
  {
    name: "Zürich Airport",
    code: "ZRH",
  },
  {
    name: "Kraków John Paul II International Airport",
    code: "KRK",
  },
  {
    name: "Málaga-Costa del Sol Airport",
    code: "AGP",
  },
  {
    name: "Côte d'Azur Airport",
    code: "NCE",
  },
  {
    name: "Fès-Saïs Airport",
    code: "FEZ",
  },
];

The initial approach to filtering the list of airports based on the filter string could be as follows:

const filteredOptions = airports.filter((option) => {
  return option.name.toLowerCase().includes(filter.toLowerCase());
});

However, when we run this code, filteredOptions is an empty array. The strings "Cancun" and "Cancún" are not equivalent: 'u' (U+0075) and 'ú' (U+00FA) are different characters. String methods such as includes compare strings code unit by code unit (UTF-16), so a letter with a diacritic never matches its unaccented counterpart.
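The mismatch can be seen directly by inspecting the code points involved. The snippet below is a minimal sketch; the variable names airportName, query, and naiveMatch are illustrative and not part of the original example, and the accented letter is written as an escape so the exact code point is unambiguous.

```javascript
// "Cancún" with a precomposed ú (U+00FA), written as an escape sequence.
const airportName = "Canc\u00fan International Airport";
const query = "Cancun";

// includes() compares UTF-16 code units; U+00FA never equals U+0075.
const naiveMatch = airportName.toLowerCase().includes(query.toLowerCase());
console.log(naiveMatch); // false

// The letters at index 4 have different code points.
console.log(airportName.codePointAt(4).toString(16)); // "fa"
console.log(query.codePointAt(4).toString(16)); // "75"
```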

To address this issue, we need to use the String.prototype.normalize method.

Understanding the String.prototype.normalize method in JavaScript

The String.prototype.normalize method is a powerful tool for handling Unicode strings in JavaScript. It was introduced in ECMAScript 2015 (ES6) as part of the core language specification (ECMA-262), so it is available without any extra library. It is separate from the ECMAScript Internationalization API (ECMA-402), which offers language-sensitive string comparison, number formatting, and date and time formatting.

The normalize method converts a string into a specific Unicode Normalization Form. The syntax for this method is str.normalize([form]), where the optional form argument can take one of four values: "NFC", "NFD", "NFKC", or "NFKD". When form is omitted, it defaults to "NFC".

But what do these forms signify? Unicode Normalization Forms define when two sequences of code points count as equivalent. "NFC" stands for Normalization Form C: canonical decomposition followed by canonical composition. "NFD" stands for Normalization Form D: canonical decomposition only. "NFKC" and "NFKD" are the compatibility counterparts; they additionally replace compatibility characters, such as ligatures, with their plain equivalents.

In simpler terms, these forms either combine or break apart characters that can be depicted in multiple manners in Unicode.
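To make the difference between the four forms concrete, the following sketch normalizes one short string in each form and compares the resulting lengths. The sample string (a "ﬁ" ligature followed by a precomposed "é") and the names sample and lengths are assumptions chosen for illustration.

```javascript
// Illustrative sample: the "fi" ligature (U+FB01) followed by a precomposed é (U+00E9).
const sample = "\ufb01\u00e9";

const lengths = {
  NFC: sample.normalize("NFC").length, // ligature kept, é stays composed
  NFD: sample.normalize("NFD").length, // ligature kept, é split into e + U+0301
  NFKC: sample.normalize("NFKC").length, // ligature expanded to "fi", é composed
  NFKD: sample.normalize("NFKD").length, // ligature expanded, é split
};

console.log(lengths); // { NFC: 2, NFD: 3, NFKC: 3, NFKD: 4 }
console.log(sample.normalize("NFKC").startsWith("fi")); // true
```

Only the compatibility forms touch the ligature, while only the decomposed forms separate the accent from its base letter.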

Let's illustrate this with an example. Take the character 'é'. Unicode can represent it as either a single character (U+00E9, or "é"), or as a combination of 'e' and an acute accent (U+0065 U+0301, or "é"). Although they are visually identical, these representations may create inconsistencies in string comparison and manipulation. The normalize method can convert these varied representations into a uniform form, ensuring more dependable string operations.
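The two representations of 'é' can be compared in a few lines. This is a small sketch with illustrative variable names (composed, decomposed); escapes are used so that each string contains exactly the code points described.

```javascript
const composed = "\u00e9"; // é as a single code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

// They render the same but are different strings.
console.log(composed === decomposed); // false
console.log(composed.length, decomposed.length); // 1 2

// Normalizing both to the same form makes them equal.
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true
```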

Implementing string normalization

Armed with an understanding of the String.prototype.normalize method, let's see how we can integrate it into our filter. The helper below removes diacritics and case differences from a string so that comparisons ignore both.

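const normalizeString = (str) => {
  return str
    // Split precomposed characters into base character + combining marks.
    .normalize("NFD")
    // Remove the combining marks (U+0300 to U+036F).
    .replace(/[\u0300-\u036f]/g, "")
    // Make the comparison case-insensitive.
    .toLowerCase();
};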

const filteredOptions = airports.filter((option) => {
  return normalizeString(option.name).includes(normalizeString(filter));
});

The function starts by calling str.normalize("NFD"), which transforms the input string into its canonical decomposition form. In this form, a precomposed character like 'é' is split into its base character 'e' and a separate combining acute accent (U+0301).
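The effect of the decomposition step can be observed by listing the code points of a decomposed string. The sketch below uses "Cancún" from the airport list; the names decomposedName and codePoints are illustrative.

```javascript
// "Cancún" with a precomposed ú (U+00FA) has 6 code units before decomposition.
const decomposedName = "Canc\u00fan".normalize("NFD");

// After NFD the ú becomes two code points: u (U+0075) and U+0301.
console.log(decomposedName.length); // 7

// Array.from iterates by code point; format each one as four hex digits.
const codePoints = Array.from(decomposedName, (ch) =>
  ch.codePointAt(0).toString(16).padStart(4, "0")
);
console.log(codePoints.join(" ")); // 0043 0061 006e 0063 0075 0301 006e
```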

Next, the function uses .replace(/[\u0300-\u036f]/g, "") to remove the diacritic marks that the decomposition exposed. The regular expression matches every code point in the Combining Diacritical Marks block (U+0300 to U+036F), which holds the marks used by most Latin-script languages, such as the acute accent in 'é'. Combining marks defined in other Unicode blocks fall outside this range and are left untouched. Replacing the matched marks with an empty string leaves only the base characters.
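If marks outside the U+0300 to U+036F range matter for your data, one possible variant is a Unicode property escape, available in regular expressions with the u flag since ES2018. The sketch below is an alternative to the article's helper, not a replacement for it; the names stripAllMarks, stripBasicMarks, and vectorA are hypothetical.

```javascript
// Variant: remove every code point in the Unicode "Mark" general category.
const stripAllMarks = (str) =>
  str.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();

// The range-based approach from the article, for comparison.
const stripBasicMarks = (str) =>
  str.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();

// U+20D7 (combining right arrow above) lies outside U+0300 to U+036F.
const vectorA = "a\u20d7";
console.log(stripBasicMarks(vectorA).length); // 2 (mark is kept)
console.log(stripAllMarks(vectorA).length); // 1 (mark is removed)

// For ordinary Latin accents both versions agree.
console.log(stripAllMarks("Z\u00fcrich")); // "zurich"
console.log(stripBasicMarks("Z\u00fcrich")); // "zurich"
```

Note that \p{M} is broad: it also matches marks that carry meaning in other scripts, such as vowel signs in Devanagari, so the narrower range may be the safer default for Latin-script search boxes.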

Finally, the function calls .toLowerCase(), which transforms all uppercase characters in the string to their lowercase equivalents. This makes the function case-insensitive, meaning it will treat 'A' and 'a' as the same character.

Transforming strings into a consistent, case-insensitive, diacritic-free form enables more reliable string comparison and manipulation.

Now filteredOptions correctly contains the Cancún option. Another common use case for this technique is handling user input in form fields. Normalizing the text a user enters before processing it helps ensure consistency, especially in fields that accept international input, where users might type text with a variety of diacritic marks. A similar approach can also be applied to sorting an array of objects.
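For sorting or whole-string equality, the built-in Intl.Collator with the sensitivity option set to "base" is another option worth knowing, since it ignores both accents and case without any manual stripping. The sketch below assumes an "en" locale and a runtime with Intl support; the names collator, names, and sorted are illustrative.

```javascript
// A collator that treats accent and case differences as equal.
const collator = new Intl.Collator("en", { sensitivity: "base" });

// compare() returns 0 when the strings are considered equal.
console.log(collator.compare("Canc\u00fan", "cancun")); // 0

const names = [
  "Z\u00fcrich Airport",
  "F\u00e8s-Sa\u00efs Airport",
  "Canc\u00fan International Airport",
];

// collator.compare is a bound function, so it can be passed to sort directly.
const sorted = [...names].sort(collator.compare);
console.log(sorted.map((name) => name[0]).join("")); // "CFZ"
```

A collator compares whole strings, so it does not replace the includes-based substring filter shown earlier; it complements it for ordering and exact matches.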

Considerations and Limitations

While the String.prototype.normalize method and the normalizeString function are powerful tools, it's important to understand their limitations and potential performance impacts.

First, normalization is not free. The normalize method processes the entire string, so its cost grows with the length of the input, and the regular-expression replacement adds another pass. This is rarely noticeable for a handful of short labels, but it can add up when a long list is re-normalized on every keystroke. It is worth normalizing the data once and reusing the result rather than repeating the work on each comparison.
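One way to limit the normalization cost is to compute the normalized form of each option once and reuse it for every filter call. The following is a minimal sketch under that assumption; indexedAirports, searchKey, and filterAirports are hypothetical names, and the airport list is shortened for brevity.

```javascript
const normalizeString = (str) =>
  str.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();

const airports = [
  { name: "Canc\u00fan International Airport", code: "CUN" },
  { name: "Z\u00fcrich Airport", code: "ZRH" },
  { name: "M\u00e1laga-Costa del Sol Airport", code: "AGP" },
];

// Normalize each name once, up front, and store it next to the original data.
const indexedAirports = airports.map((airport) => ({
  ...airport,
  searchKey: normalizeString(airport.name),
}));

// Each call now normalizes only the user's input, not the whole list.
const filterAirports = (filter) => {
  const needle = normalizeString(filter);
  return indexedAirports.filter((airport) => airport.searchKey.includes(needle));
};

console.log(filterAirports("Cancun").map((airport) => airport.code)); // ["CUN"]
console.log(filterAirports("AIRPORT").length); // 3
```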

Additionally, stripping combining marks does not cover every character that looks accented. Some letters, such as 'ø', 'ł', and 'ß', have no canonical decomposition, so normalize("NFD") leaves them unchanged and a search for "Lodz" will not match "Łódź". The Unicode standard also continues to evolve, and the results depend on the Unicode version that the JavaScript engine implements. It's crucial to test the normalization behavior with the specific characters used in your application.
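A concrete case of characters that do not normalize as one might expect is the group of letters that have no canonical decomposition. The sketch below shows the helper leaving them in place, followed by one possible workaround: a small explicit mapping. The extraFolds table and foldExtra function are hypothetical and deliberately incomplete; which letters to map, and to what, depends on the languages your application serves.

```javascript
const normalizeString = (str) =>
  str.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();

// ó decomposes into o + U+0301, so it is folded as expected.
console.log(normalizeString("Krak\u00f3w")); // "krakow"

// Ł (U+0141), ø (U+00F8) and ß (U+00DF) have no decomposition and survive.
console.log(normalizeString("\u0141\u00f3d\u017a") === "\u0142odz"); // true
console.log(normalizeString("K\u00f8benhavn") === "k\u00f8benhavn"); // true

// Hypothetical fallback: map the remaining letters by hand after normalizing.
const extraFolds = { "\u00f8": "o", "\u0142": "l", "\u00df": "ss" };
const foldExtra = (str) =>
  normalizeString(str).replace(/[\u00f8\u0142\u00df]/g, (ch) => extraFolds[ch]);

console.log(foldExtra("\u0141\u00f3d\u017a")); // "lodz"
console.log(foldExtra("K\u00f8benhavn")); // "kobenhavn"
```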

Conclusion

Handling Unicode characters in JavaScript and TypeScript presents unique challenges, particularly when it comes to diacritic-insensitive string comparisons. The String.prototype.normalize method, combined with the normalizeString function presented in this article, provides an effective approach to these challenges. Together they transform strings into a consistent, case-insensitive, and diacritic-free form, enabling more dependable string comparison and manipulation. This is especially useful in tasks such as filtering elements in an array, a common requirement in many applications. However, developers must be mindful of the cost of repeated normalization and of characters that the approach does not cover. The removal of diacritic marks, while useful for comparisons, may not always be linguistically accurate or appropriate. As always, understanding the tools at your disposal and their limitations is key to writing robust and efficient code.