URLs were originally designed for ASCII characters only. To support international characters like é, 中文, or emoji, different parts of the URL use different encoding schemes.
Key Takeaways
- Domain names use Punycode (IDN) for international characters
- Paths and query strings use UTF-8 percent-encoding
- Browsers display Unicode but send encoded versions
- Use encodeURIComponent() for UTF-8 encoding in JavaScript
- IDN domains start with xn-- in their ASCII form
"The IDNA specification provides a mechanism for encoding internationalized domain names using only ASCII characters while allowing the full range of Unicode characters."
International Domain Names (IDN)
When you register a domain with non-ASCII characters, the domain registrar converts it to Punycode behind the scenes. This conversion allows international domains to work with the existing DNS infrastructure, which was built for ASCII only. Over 10 million IDN domains are registered worldwide, with Chinese, Arabic, and German domains being the most common.
The table below shows how various international domain names map to their Punycode equivalents. Notice that each encoded label starts with xn--, the ASCII-compatible encoding (ACE) prefix that tells IDN-aware software such as browsers that the label encodes Unicode characters. DNS servers themselves treat these names as ordinary ASCII.
| Display | Punycode (ASCII) | Language |
|---|---|---|
| münchen.de | xn--mnchen-3ya.de | German |
| 北京.中国 | xn--1lq90i.xn--fiqs8s | Chinese |
| пример.рф | xn--e1afmkfd.xn--p1ai | Russian |
// Convert between Unicode and Punycode
const url = new URL('https://münchen.de/path');
console.log(url.hostname); // "xn--mnchen-3ya.de" (Punycode)
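console.log(url.href); // "https://xn--mnchen-3ya.de/path" (full URL with Punycode host)
// Using the punycode library for conversion (the npm package; Node's built-in module of the same name is deprecated)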
import { toASCII, toUnicode } from 'punycode';
toASCII('münchen.de'); // "xn--mnchen-3ya.de"
toUnicode('xn--mnchen-3ya.de'); // "münchen.de"

The code above demonstrates how JavaScript's URL API automatically converts Unicode hostnames to Punycode. When you create a URL object with an international domain, the hostname property returns the ASCII-safe Punycode version. For explicit conversions, the punycode library provides toASCII() and toUnicode() functions.
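If you only need the ASCII form of a hostname, or want to check whether a hostname is an IDN, the URL parser alone may be enough. The following is a minimal sketch under that assumption; `toAsciiHostname` and `hasIdnLabel` are hypothetical helper names, not part of any library, and the code assumes a runtime with the standard URL API (a modern browser or Node.js).

```javascript
// Hypothetical helper: let the URL parser produce the ASCII (Punycode) hostname.
function toAsciiHostname(hostname) {
  return new URL(`https://${hostname}`).hostname;
}

// Hypothetical helper: true if any label of the hostname carries the xn-- prefix.
function hasIdnLabel(hostname) {
  return toAsciiHostname(hostname)
    .split('.')
    .some((label) => label.startsWith('xn--'));
}

console.log(toAsciiHostname('münchen.de')); // "xn--mnchen-3ya.de"
console.log(hasIdnLabel('пример.рф'));      // true
console.log(hasIdnLabel('example.com'));    // false
```

Checking each label separately matters because the prefix is applied per label: in пример.рф both the name and the top-level domain are encoded.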
While domain names require Punycode, the path and query string parts of URLs use a different encoding scheme. Let's look at how UTF-8 percent-encoding handles international characters in these components.
Unicode in Paths
Unlike domain names, path segments use UTF-8 percent-encoding to handle international characters. Each Unicode character is converted to its UTF-8 byte sequence, and each byte becomes a %XX escape sequence. This approach means a single character like "北" (3 UTF-8 bytes) becomes three percent-encoded sequences.
https://example.com/café/北京

// Encoding paths
const path = '/café/北京';
const encoded = encodeURI('https://example.com' + path);
// "https://example.com/caf%C3%A9/%E5%8C%97%E4%BA%AC"
// é encodes to %C3%A9 (UTF-8 bytes: C3 A9)
// 北 encodes to %E5%8C%97 (UTF-8 bytes: E5 8C 97)

This example shows how encodeURI() transforms international characters in paths. The accented letter "é" becomes %C3%A9 (two bytes), while the Chinese character "北" expands to %E5%8C%97 (three bytes). Browsers handle this encoding automatically when you type international URLs in the address bar.
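To make the byte-to-%XX step concrete, here is a rough sketch that performs it by hand using the standard TextEncoder API. The function name `percentEncodeUtf8` is hypothetical, and to stay short it escapes every byte, including ASCII, whereas real encoders such as encodeURI() leave unreserved ASCII characters alone.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of the input.
// Unlike encodeURI(), this also escapes plain ASCII, so use it only on
// non-ASCII text when comparing against the built-in functions.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  return Array.from(
    bytes,
    (byte) => '%' + byte.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

console.log(percentEncodeUtf8('é'));  // "%C3%A9"
console.log(percentEncodeUtf8('北')); // "%E5%8C%97"

// For non-ASCII input, the result matches the built-in encoder.
console.log(percentEncodeUtf8('北京') === encodeURIComponent('北京')); // true
```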
Query parameters follow the same UTF-8 percent-encoding rules, but with one important addition: the URL API provides convenient methods that handle encoding and decoding automatically.
Unicode in Query Parameters
When you need to pass international text as query parameter values, the URL API is your best friend. It handles all the encoding complexity for you, converting Unicode characters to their percent-encoded form when building the URL and decoding them back when you read parameter values.
const url = new URL('https://example.com/search');
url.searchParams.set('q', '日本語');
url.searchParams.set('city', 'São Paulo');
console.log(url.href);
// "https://example.com/search?q=%E6%97%A5%E6%9C%AC%E8%AA%9E&city=S%C3%A3o+Paulo"
// The URL API handles encoding automatically
console.log(url.searchParams.get('q')); // "日本語"

The code demonstrates the URL API's automatic encoding. When you set a parameter with Japanese text, it gets encoded in the URL, but when you retrieve it with get(), you receive the original Unicode string. This round-trip encoding happens seamlessly, eliminating manual encoding errors.
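One detail worth seeing side by side is how spaces are handled: URLSearchParams writes a space as +, while encodeURIComponent() writes %20. The sketch below, which uses only built-in APIs and the São Paulo value from the example above, suggests that URLSearchParams decodes both forms, whereas decodeURIComponent() leaves a + untouched.

```javascript
// Two ways to encode the same parameter value.
const params = new URLSearchParams();
params.set('city', 'São Paulo');
const formEncoded = params.toString();
const componentEncoded = 'city=' + encodeURIComponent('São Paulo');

console.log(formEncoded);      // "city=S%C3%A3o+Paulo"
console.log(componentEncoded); // "city=S%C3%A3o%20Paulo"

// URLSearchParams decodes both forms back to the original string.
console.log(new URLSearchParams(formEncoded).get('city'));      // "São Paulo"
console.log(new URLSearchParams(componentEncoded).get('city')); // "São Paulo"

// decodeURIComponent() does not turn + back into a space.
console.log(decodeURIComponent('S%C3%A3o+Paulo')); // "São+Paulo"
```

For that reason it is usually safer to read query values through searchParams than to split the query string and call decodeURIComponent() on the pieces.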
Beyond traditional international characters, modern URLs also need to handle emoji. These present unique challenges because most emoji use 4-byte UTF-8 sequences, making their encoded forms quite long.
Emoji in URLs
Emoji have become increasingly common in URLs, especially for social media sharing and marketing campaigns. However, because most emoji live in the Unicode supplementary planes (code points above U+FFFF), they require 4 bytes in UTF-8. This means a single emoji like 😀 becomes 12 characters when percent-encoded.
// Emoji in query parameters
const url = new URL('https://example.com/search');
url.searchParams.set('mood', '😀');
console.log(url.href);
// "https://example.com/search?mood=%F0%9F%98%80"
// 😀 is U+1F600, encodes to 4 UTF-8 bytes: F0 9F 98 80

This example shows how a single emoji character expands to %F0%9F%98%80 in the URL. While this works correctly, be mindful of URL length limits when using multiple emoji. Some systems impose a 2,048-character limit on URLs, and heavy emoji usage can quickly consume that budget.
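The length arithmetic can be checked directly. The sketch below is illustrative only: `utf8ByteLength`, `encodedLength`, and `exceedsUrlLimit` are hypothetical helper names, and the 2,048 default simply mirrors the limit mentioned above rather than any particular system's rule.

```javascript
const grin = '\u{1F600}'; // 😀, a supplementary-plane emoji
const smiley = '\u263A';  // ☺, an older symbol inside the Basic Multilingual Plane

// Hypothetical helpers for measuring text before and after encoding.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

function encodedLength(text) {
  return encodeURIComponent(text).length;
}

// Assumed limit of 2,048 characters; adjust for the system you target.
function exceedsUrlLimit(urlString, limit = 2048) {
  return new URL(urlString).href.length > limit;
}

console.log(grin.length);          // 2 (UTF-16 code units)
console.log([...grin].length);     // 1 (code points)
console.log(utf8ByteLength(grin)); // 4
console.log(encodedLength(grin));  // 12
console.log(encodedLength(smiley)); // 9 (3 UTF-8 bytes)

// 200 emoji add 2,400 encoded characters, which alone passes the assumed limit.
const longUrl = 'https://example.com/search?mood=' + grin.repeat(200);
console.log(exceedsUrlLimit(longUrl)); // true
```

Measuring the serialized href rather than the raw string matters here, because the raw string counts each emoji as only two UTF-16 code units.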