DEV Community

Hussein Al Hammad

Internationalised Domain Names

I have recently launched a new website (wajad.art) with an internationalised domain name (IDN). The domain name and all the page paths are in Arabic, which makes things more fun given Arabic is written right to left (RTL).

The domain name is وجد.موقع, where موقع is the top-level domain (TLD), the Arabic equivalent of the site TLD.

An example of a full page URL:

وجد.موقع/م/أساطير-خليجية

How do IDNs work?

Because the Domain Name System (DNS) is limited to ASCII characters, IDNs are stored as ASCII strings using Punycode, which is:

a representation of Unicode with the limited ASCII character subset used for Internet hostnames

So while my newly launched website's domain name is وجد.موقع, in the DNS it is stored in the Punycode equivalent:

xn--rgbg7e.xn--4gbrim
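The same conversion can be done in code. A minimal sketch, assuming Node.js and its built-in url module (the variable names are only illustrative):

```javascript
// Node.js ships IDN helpers in its built-in "url" module.
const { domainToASCII, domainToUnicode } = require("node:url");

const unicodeDomain = "وجد.موقع";

// Unicode -> Punycode (the form stored in the DNS).
const asciiDomain = domainToASCII(unicodeDomain);
console.log(asciiDomain); // xn--rgbg7e.xn--4gbrim

// Punycode -> Unicode (the human-friendly form).
const roundTripped = domainToUnicode(asciiDomain);
console.log(roundTripped === unicodeDomain); // true
```

Each label is converted separately, which is why both the name and the TLD get their own xn-- prefix.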

Fortunately, there are online Punycode converters that do this for you.

Domain Name vs Page Path

IDNs use Punycode to work around the DNS limitation of only supporting ASCII characters. This does not apply to the rest of the URL. You can use Unicode characters in the page path.

MDN's What is a URL? is a good resource to learn the different parts of a URL.
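To see the two parts treated differently, here is a small sketch using the standard URL class (a global in modern browsers and in Node.js). The variable names are illustrative:

```javascript
// URL is a global in modern browsers and in Node.js.
const pageUrl = new URL("https://وجد.موقع/م/أساطير-خليجية");

// The host goes through IDN processing and comes out as Punycode.
console.log(pageUrl.hostname); // xn--rgbg7e.xn--4gbrim

// The path is percent-encoded as UTF-8 bytes; it starts with /%D9%85/.
console.log(pageUrl.pathname);

// Punycode is never applied to the path.
const pathUsesPunycode = pageUrl.pathname.includes("xn--");
console.log(pathUsesPunycode); // false
```

So the path can hold Unicode characters, but when serialised it is carried as percent-encoded bytes rather than Punycode.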

Browsers

Unicode vs Punycode (ASCII)

When navigating to a website with an IDN via the browser address bar, both the Unicode (e.g. وجد.موقع) and the Punycode (e.g. xn--rgbg7e.xn--4gbrim) work.

Even if you type the Punycode into the address bar, web browsers may automatically convert the URL to the (human-friendly) Unicode equivalent if the URL meets the browser's IDN policy.

The goal of these policies is to protect users from IDN homograph attacks. There are also browser extensions that alert users if they are on a site that uses Punycode in its domain name.
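The simplest version of such an alert only needs to spot the xn-- prefix. The sketch below is an assumption about how a check like that could look, not how any particular extension or browser IDN policy works (real policies weigh much more than the prefix). hasPunycodeLabel is a hypothetical helper name:

```javascript
// Hypothetical helper: true when any label of the hostname is Punycode-encoded.
function hasPunycodeLabel(hostname) {
  return hostname
    .toLowerCase()
    .split(".")
    .some((label) => label.startsWith("xn--"));
}

// Parsing with URL first normalises a Unicode host to its ASCII form.
const idnHost = new URL("https://وجد.موقع").hostname;

console.log(hasPunycodeLabel(idnHost)); // true
console.log(hasPunycodeLabel("example.com")); // false
```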

Earlier this year, HTTP 203, a show on the Google Chrome Developers YouTube channel, released an episode titled "Humans can't read URLs. How can we fix it?". In it, Jake and Surma briefly discuss how Chrome analyses the URL and when it may choose to display the Punycode instead of the Unicode.

RTL vs LTR

If you have ever mixed RTL and LTR languages when typing on a digital device, you have almost certainly struggled to get the words to flow correctly. The browser address bar doesn't handle this too well either.

In the case of وجد.موقع, it is read RTL. However, adding the https protocol at the start means the address starts LTR. So you end up with:

https://وجد.موقع

This happens even if you set the browser language to Arabic, which switches the UI to RTL:

Microsoft Edge address bar

This is not a huge pain point, but it does look odd. As a developer I know the start is https://. However, an Arabic speaker who is not familiar with the protocol and its uses may read this as a URL that ends in https://.

This may be slightly off-topic, but it is worth noting that things become even more confusing if you use an ASCII domain name (LTR) with an RTL page path and vice versa:

وجد.موقع/path/to/page

xn--rgbg7e.xn--4gbrim/الصفحة-1/الصفحة-2

Copying the URL

When you are on a website with an IDN and copy the URL directly from the address bar, what gets copied into your clipboard varies across browsers.

Firefox (83.0) copies the Unicode:

https://وجد.موقع

Chrome's (87.0.4280.66) behaviour is more sophisticated. If you include the https protocol when you copy the URL from the address bar, it copies the Punycode into your clipboard:

https://xn--rgbg7e.xn--4gbrim

If you exclude the https protocol, it copies the Unicode:

وجد.موقع

The above behaviour only applies to the domain name. When it comes to the page path, the behaviour is also inconsistent across browsers.

Firefox (83.0) percent-encodes the UTF-8 bytes of the page path when the URL is copied (think JavaScript's encodeURI() or PHP's urlencode()), which is a huge UX pain for me in general, not only with IDNs. Receiving a URL in a chat app that fills half my phone's screen with % signs and a meaningless mix of digits and Latin letters is pointless to me as a user.

https://وجد.موقع/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9
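If you end up with a URL like this and want the readable form back before sharing it, the built-in decodeURI() reverses the percent-encoding. A minimal sketch (the variable names are illustrative):

```javascript
// The kind of string Firefox 83 placed on the clipboard in the example above.
const copied =
  "https://وجد.موقع/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9";

// decodeURI reverses the percent-encoding but keeps reserved characters
// (such as an encoded "/" or "?") encoded, so the URL structure is preserved.
const readable = decodeURI(copied);
console.log(readable); // https://وجد.موقع/م/أساطير-خليجية

// encodeURI on the path alone goes the other way.
const encodedPath = encodeURI("/م/أساطير-خليجية");
console.log(encodedPath.startsWith("/%D9%85/")); // true
```

Note that encodeURI() knows nothing about IDNs: run on a full URL, it would percent-encode the Unicode domain name instead of converting it to Punycode, so it is only applied to the path here.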

On Chrome (87.0.4280.66), if the https protocol is included, it copies the Punycode domain name and the encoded page path:

https://xn--rgbg7e.xn--4gbrim/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9

If the https protocol is excluded, it copies the whole URL in Unicode:

وجد.موقع/م/أساطير-خليجية

Sharing the URL

Web browsers on smartphones and tablets offer a built-in sharing option, which gives you the choice to copy the URL or share directly to native apps. The behaviour across browsers here is also inconsistent.

The same browser may not behave consistently when copying the URL from the address bar vs when copying/sharing the URL using the built-in share option. Samsung Internet (13.0.1.64), for instance, copies the Unicode (domain and page path) if you copy the URL directly from the address bar:

https://وجد.موقع/م/أساطير-خليجية

However, it copies the Punycode and the encoded page path when using the built-in share option:

https://xn--rgbg7e.xn--4gbrim/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9

JavaScript

The Location API (window.location) returns the domain name in Punycode and percent-encodes page paths:

{
  "ancestorOrigins": {},
  "href": "https://xn--rgbg7e.xn--4gbrim/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9",
  "origin": "https://xn--rgbg7e.xn--4gbrim",
  "protocol": "https:",
  "host": "xn--rgbg7e.xn--4gbrim",
  "hostname": "xn--rgbg7e.xn--4gbrim",
  "port": "",
  "pathname": "/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9",
  "search": "",
  "hash": ""
}
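To show such a value to users in a readable form, both parts need converting back. The sketch below assumes Node.js, whose built-in url module provides domainToUnicode(); in a browser, a Punycode library would be needed for the host part. toDisplayUrl is a hypothetical helper name, and the result is meant for display only:

```javascript
const nodeUrl = require("node:url");

// Hypothetical helper: turn an ASCII href (Punycode host, percent-encoded path)
// into a human-readable string. Intended for display, not for making requests.
function toDisplayUrl(href) {
  const url = new URL(href);
  const host = nodeUrl.domainToUnicode(url.hostname);
  const port = url.port ? `:${url.port}` : "";
  // decodeURI leaves reserved characters such as %2F and %3F encoded.
  const rest = decodeURI(url.pathname + url.search + url.hash);
  return `${url.protocol}//${host}${port}${rest}`;
}

const href =
  "https://xn--rgbg7e.xn--4gbrim/%D9%85/%D8%A3%D8%B3%D8%A7%D8%B7%D9%8A%D8%B1-%D8%AE%D9%84%D9%8A%D8%AC%D9%8A%D8%A9";
const displayUrl = toDisplayUrl(href);
console.log(displayUrl); // https://وجد.موقع/م/أساطير-خليجية
```

decodeURI() throws a URIError on malformed escape sequences, so real-world input would need a try/catch around it.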

The wilderness

I have used a number of services in which I had to enter the IDN for Wajad or on which the domain name is displayed.

Domain registration

Registering an IDN under a common TLD like .com is not an obstacle. Some domain registrars allow you to use Unicode when searching for domains, e.g. وجد.com.

But I was looking for the internationalised TLD موقع. It was not easy finding a domain registrar that sold موقع domains. I ended up on multiple scammy-looking sites during my search. Eventually I bought the domain via maracaria.com.

Cloudflare

I had no issues adding IDNs with Unicode when adding the site to Cloudflare. They are also displayed in Unicode in the dashboard:

Cloudflare dashboard

However, Cloudflare has used the Punycode in the email notifications they have sent me so far:

Cloudflare email notification

Netlify

Before launching the site, I set up a "coming soon" landing page on Netlify. Unlike Cloudflare, Netlify did not allow me to add the domain name with Unicode, and I had to enter the Punycode equivalent. Netlify's dashboard displays the domain in Punycode:

Netlify dashboard

Their email notifications also display the domain in Punycode:

Netlify email notification

Cloudways

Wajad's current PHP-based site is hosted on DigitalOcean via Cloudways. The experience on Cloudways is similar to Netlify and I had to enter the Punycode:

Cloudways dashboard

Google Search Console

I was able to add the site to Google Search Console with the Unicode version of the domain. Oddly, some subsequent forms did not accept Unicode:

Google Search Console sitemap form

So I had to enter the Punycode equivalent, but Google Search Console displayed the URL in Unicode after submitting the form:

Google Search Console sitemap form

Fortunately, email notifications use Unicode:

Google Search Console sitemap form

Google Search results

Google Search results display the domain name in Unicode. I already knew it displayed Arabic correctly for breadcrumbs, but it is really nice to see the domain name displayed in a human-friendly manner:

Google search result - IDN

Both Unicode and Punycode are supported when using search operators like site::

site:وجد.موقع

site:xn--rgbg7e.xn--4gbrim

Bing Webmaster Tools

Bing Webmaster Tools allows you to import verified sites from Google Search Console. When I attempted the import, it displayed an error message saying the site addition was unsuccessful:

Bing Webmaster Tools - import error message

I attempted to enter the URL manually as suggested, but the Unicode was not accepted:

Bing Webmaster Tools - import error message

Then when I went to check the list of sites under my account, Wajad was actually listed! I'm not entirely sure which of the above attempts was the successful one.

Bing Webmaster Tools lists the domain in Unicode, but when you open the dashboard for the site it lists the Punycode:

Bing Webmaster Tools - sites list

I had the opposite experience to Google Search Console when submitting the sitemap. The form accepted the Unicode, but the sitemap list displays the Punycode:

Bing Webmaster Tools - sitemap form
Bing Webmaster Tools - sitemap list

Bing search results

I have only recently submitted the sitemap via Bing Webmaster Tools, so I still do not know the full picture. From what I can tell so far Bing search results also display the domain in Unicode.

However, it seems only Punycode is supported when using search operators like site::

site:xn--rgbg7e.xn--4gbrim

Bing search result - IDN

Fathom Analytics

I had no issue using the Unicode version of the domain when adding the site to Fathom Analytics. The domain is always displayed in Unicode (dashboard and email notifications).

Their recently-launched tool Phantom Analyzer also allowed me to enter the URL in Unicode, but the results page displayed the domain in Punycode.

Phantom Analyzer results

Zoho Mail

Neither Unicode nor Punycode is supported when signing up to Zoho Mail.

Zoho Mail - Unicode sign up
Zoho Mail - Punycode sign up

Emails

G Suite (now Google Workspace) allowed me to sign up with my IDN. I sent test emails to Gmail, Yahoo and Outlook. Gmail was the only one to display the domain name in Unicode.

Gmail - received email from IDN with Punycode
Outlook - received email from IDN with Punycode
Yahoo - received email from IDN with Punycode

I have also sent HTML email tests with images. Yahoo Mail and Windows Mail did not load images whose src had the domain in Unicode, but Gmail did:

<img src="https://وجد.موقع/path/to/image.jpg" alt="">
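One workaround worth trying is to write the src in its ASCII form. Whether Yahoo Mail and Windows Mail then load the image is an assumption, not something verified in the tests above. toAsciiUrl is a hypothetical helper name, and the sketch relies only on the standard URL class:

```javascript
// Hypothetical helper: serialise a URL with a Punycode host and a
// percent-encoded path, using the standard URL parser.
function toAsciiUrl(rawUrl) {
  return new URL(rawUrl).href;
}

const src = toAsciiUrl("https://وجد.موقع/path/to/image.jpg");
console.log(src); // https://xn--rgbg7e.xn--4gbrim/path/to/image.jpg

// The ASCII form can then be placed in the email template.
const imgTag = `<img src="${src}" alt="">`;
console.log(imgTag);
```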

Auto-linking

When sending messages via chat apps, adding the https protocol to the URL (with the domain name in Unicode) seems to be enough for most apps to auto-link it.

Although email clients are known for linking text in HTML emails when you don't want them to, I found Gmail and Windows Mail don't auto-link:

https://وجد.موقع

In-app browsers

The behaviour of in-app browsers is also inconsistent. On iOS, Instagram's in-app browser displays the domain in Punycode, while Twitter's in-app browser displays the domain in Unicode.

I understand, but...

I understand why I'm seeing very different behaviour across browsers and apps, but as a developer and a user I just would love to see a better user experience overall.

Wajad is still a young side project, but it is clear to me that I'll run into more interesting IDN-related scenarios as it grows and I'll try my best to document them.


This article was first published on hussein-alhammad.com

Latest comments (1)

Luke Chinworth

Thanks for documenting all these scenarios.