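Why do browsers show Unicode but send encoded URLs?

Browsers display the human-readable Unicode form for usability but send the ASCII-encoded form to servers: Punycode for the hostname and percent-encoded UTF-8 for the path and query string. This keeps URLs compatible with systems that only handle ASCII.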

Are IDN domains secure?

IDN domains can be abused for phishing through homograph attacks, e.g., using Cyrillic "а" (U+0430) instead of Latin "a" (U+0061). Modern browsers mitigate this by showing the Punycode form for suspicious mixed-script domains.
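
As a rough illustration of what a mixed-script check looks like, the sketch below flags hostname labels that combine more than one script. The helper names `scriptsInLabel` and `looksMixedScript` are hypothetical, only three scripts are tested, and real browsers apply considerably more elaborate rules, so treat this as a heuristic rather than a description of any browser's algorithm.

```javascript
// Hypothetical helper: report which of a few scripts appear in a label.
function scriptsInLabel(label) {
  const scripts = ['Latin', 'Cyrillic', 'Greek'];
  return scripts.filter((name) =>
    new RegExp(`\\p{Script=${name}}`, 'u').test(label)
  );
}

// Hypothetical helper: true if any dot-separated label mixes scripts.
function looksMixedScript(hostname) {
  return hostname.split('.').some((label) => scriptsInLabel(label).length > 1);
}

console.log(looksMixedScript('apple.com'));       // false
console.log(looksMixedScript('\u0430pple.com'));  // true (first letter is Cyrillic U+0430)
```

A label written entirely in one script, such as an all-Cyrillic lookalike, passes this check, which is one reason the heuristic alone is not sufficient.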

Can I use emoji in domain names?

Technically yes, since emoji can be Punycode-encoded like any other Unicode character, but most registrars and top-level domains don't support emoji domains. A few TLDs, such as .ws, allow them.

Unicode in URLs - International Domain Names and Characters

URLs were originally designed for ASCII characters only. To support international characters like é, 中文, or emoji, different parts of the URL use different encoding schemes.

Key Takeaways

1. Domain names use Punycode (IDN) for international characters
2. Paths and query strings use UTF-8 percent-encoding
3. Browsers display Unicode but send encoded versions
4. Use encodeURIComponent() for UTF-8 encoding in JavaScript
5. IDN domains start with xn-- in their ASCII form

Definition

Internationalized Domain Name (IDN)

A domain name that contains characters outside the basic ASCII character set, such as accented letters, Chinese characters, or Arabic script. IDNs are encoded using Punycode, which converts Unicode characters to an ASCII-compatible encoding (ACE) starting with 'xn--'.

Source: RFC 5890

"The IDNA specification provides a mechanism for encoding internationalized domain names using only ASCII characters while allowing the full range of Unicode characters."
— RFC 5891 - IDNA Protocol

International Domain Names (IDN)

When you register a domain with non-ASCII characters, the domain registrar converts it to Punycode behind the scenes. This conversion allows international domains to work with the existing DNS infrastructure, which was built for ASCII only. Over 10 million IDN domains are registered worldwide, with Chinese, Arabic, and German domains being the most common.

The table below shows how various international domain names map to their Punycode equivalents. Notice that each encoded label starts with xn--, the ASCII-compatible encoding (ACE) prefix that tells IDN-aware software the label holds Punycode-encoded characters. DNS servers themselves treat these names as ordinary ASCII.

Display	Punycode (ASCII)	Language
münchen.de	xn--mnchen-3ya.de	German
北京.中国	xn--1lq90i.xn--fiqs8s	Chinese
пример.рф	xn--e1afmkfd.xn--p1ai	Russian
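
Because the xn-- prefix applies to each dot-separated label rather than to the domain as a whole, a check for encoded names needs to look at every label. A minimal sketch, assuming a hypothetical helper named `hasPunycodeLabel`:

```javascript
// Hypothetical helper: true if any label of the hostname is Punycode-encoded.
function hasPunycodeLabel(hostname) {
  return hostname
    .split('.')
    .some((label) => label.toLowerCase().startsWith('xn--'));
}

console.log(hasPunycodeLabel('xn--mnchen-3ya.de'));      // true
console.log(hasPunycodeLabel('xn--1lq90i.xn--fiqs8s'));  // true
console.log(hasPunycodeLabel('example.com'));            // false
```

The prefix is compared case-insensitively because DNS names are not case-sensitive.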

javascript

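// Punycode helpers for explicit conversion.
// Note: Node's built-in 'punycode' module is deprecated; the userland
// punycode package provides the same functions.
import { toASCII, toUnicode } from 'punycode';

// The URL API converts Unicode hostnames to Punycode automatically
const url = new URL('https://münchen.de/path');
console.log(url.hostname);  // "xn--mnchen-3ya.de" (Punycode)
console.log(url.href);      // "https://xn--mnchen-3ya.de/path"

// Explicit conversion in both directions
console.log(toASCII('münchen.de'));           // "xn--mnchen-3ya.de"
console.log(toUnicode('xn--mnchen-3ya.de'));  // "münchen.de"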

The code above demonstrates how JavaScript's URL API automatically converts Unicode hostnames to Punycode. When you create a URL object with an international domain, the hostname property returns the ASCII-safe Punycode version. For explicit conversions, the punycode library provides toASCII() and toUnicode() functions.
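
When only the Unicode-to-ASCII direction is needed, the URL API alone can do the job without an extra dependency. The sketch below wraps it in a hypothetical helper, `hostnameToAscii`, and assumes a runtime whose URL parser supports IDN (current browsers and standard Node.js builds); it returns null when the parser rejects the hostname.

```javascript
// Hypothetical helper: Unicode hostname -> ASCII form via the URL parser.
// Returns null when the parser rejects the hostname.
function hostnameToAscii(hostname) {
  try {
    return new URL(`https://${hostname}`).hostname;
  } catch {
    return null;
  }
}

console.log(hostnameToAscii('münchen.de'));    // "xn--mnchen-3ya.de"
console.log(hostnameToAscii('北京.中国'));       // "xn--1lq90i.xn--fiqs8s"
console.log(hostnameToAscii('example.com'));   // "example.com"
console.log(hostnameToAscii('exa mple.com'));  // null (space is not allowed in a host)
```

The URL API offers no built-in way to go back from Punycode to Unicode, so that direction still needs a library such as punycode.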

While domain names require Punycode, the path and query string parts of URLs use a different encoding scheme. Let's look at how UTF-8 percent-encoding handles international characters in these components.

Unicode in Paths

Unlike domain names, path segments use UTF-8 percent-encoding to handle international characters. Each Unicode character is converted to its UTF-8 byte sequence, and each byte becomes a %XX escape sequence. This approach means a single character like "北" (3 UTF-8 bytes) becomes three percent-encoded sequences.

https://example.com/café/北京

Path with international characters

javascript

// Encoding paths
const path = '/café/北京';
const encoded = encodeURI('https://example.com' + path);
// "https://example.com/caf%C3%A9/%E5%8C%97%E4%BA%AC"

// é encodes to %C3%A9 (UTF-8 bytes: C3 A9)
// 北 encodes to %E5%8C%97 (UTF-8 bytes: E5 8C 97)

This example shows how encodeURI() transforms international characters in paths. The accented letter "é" becomes %C3%A9 (two bytes), while the Chinese character "北" expands to %E5%8C%97 (three bytes). Browsers handle this encoding automatically when you type international URLs in the address bar.
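
To make the byte-by-byte step visible, the sketch below performs the conversion by hand with TextEncoder. The helper `percentEncodeAllBytes` is hypothetical and exists only for illustration: unlike encodeURI(), it escapes every byte, including plain ASCII, so it is not a replacement for the built-in functions.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string.
function percentEncodeAllBytes(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 byte sequence
  return Array.from(bytes, (byte) =>
    '%' + byte.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

console.log(percentEncodeAllBytes('\u00E9'));  // "%C3%A9"     (é, two bytes)
console.log(percentEncodeAllBytes('北'));       // "%E5%8C%97" (three bytes)

// decodeURIComponent() reverses the process
console.log(decodeURIComponent('%E5%8C%97%E4%BA%AC'));  // "北京"
```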

Query parameters follow the same UTF-8 percent-encoding rules, but with one important addition: the URL API provides convenient methods that handle encoding and decoding automatically.

Unicode in Query Parameters

When you need to pass international text as query parameter values, the URL API is your best friend. It handles all the encoding complexity for you, converting Unicode characters to their percent-encoded form when building the URL and decoding them back when you read parameter values.

javascript

const url = new URL('https://example.com/search');
url.searchParams.set('q', '日本語');
url.searchParams.set('city', 'São Paulo');

console.log(url.href);
// "https://example.com/search?q=%E6%97%A5%E6%9C%AC%E8%AA%9E&city=S%C3%A3o+Paulo"

// The URL API handles encoding automatically
console.log(url.searchParams.get('q'));  // "日本語"

The code demonstrates the URL API's automatic encoding. When you set a parameter with Japanese text, it gets encoded in the URL, but when you retrieve it with get(), you receive the original Unicode string. This round-trip encoding happens seamlessly, eliminating manual encoding errors.
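
One detail worth seeing side by side is how spaces are written. encodeURIComponent(), mentioned in the key takeaways, emits %20, while URLSearchParams emits +. The sketch below compares the two; the variable names are illustrative only.

```javascript
const city = 'São Paulo';

// encodeURIComponent() writes the space as %20
const manual = encodeURIComponent(city);
console.log(manual);  // "S%C3%A3o%20Paulo"

// URLSearchParams writes the space as +
const params = new URLSearchParams({ city });
console.log(params.toString());  // "city=S%C3%A3o+Paulo"

// Each form decodes back to the original when read with the matching API
console.log(decodeURIComponent(manual));                              // "São Paulo"
console.log(new URLSearchParams('city=S%C3%A3o+Paulo').get('city'));  // "São Paulo"
```

Mixing the two can cause surprises: decodeURIComponent() leaves a + untouched, so a value produced by URLSearchParams is best read back through URLSearchParams.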

Beyond traditional international characters, modern URLs also need to handle emoji. These present unique challenges because most emoji use 4-byte UTF-8 sequences, making their encoded forms quite long.

Emoji in URLs

Emoji have become increasingly common in URLs, especially for social media sharing and marketing campaigns. However, because most emoji sit in the Unicode supplementary planes (code points above U+FFFF), they require 4 bytes in UTF-8. This means a single emoji like 😀 becomes 12 characters when percent-encoded.

javascript

// Emoji in query parameters
const url = new URL('https://example.com/search');
url.searchParams.set('mood', '😀');
console.log(url.href);
// "https://example.com/search?mood=%F0%9F%98%80"

// 😀 is U+1F600, encodes to 4 UTF-8 bytes: F0 9F 98 80

This example shows how a single emoji character expands to %F0%9F%98%80 in the URL. While this works correctly, be mindful of URL length limits when using multiple emoji. Some systems impose a 2,048-character limit on URLs, and heavy emoji usage can quickly consume that budget.
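
To get a feel for how quickly emoji consume that budget, the sketch below measures encoded lengths. The helper `encodedLength` and the constant `MAX_URL_LENGTH` are hypothetical; the constant simply reuses the 2,048-character figure as an assumed limit. Emoji built from several code points joined by zero-width joiners (U+200D) grow even faster than single ones.

```javascript
// Hypothetical helper: length of a string after UTF-8 percent-encoding.
function encodedLength(text) {
  return encodeURIComponent(text).length;
}

const grinning = '\u{1F600}';                              // 😀
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';  // three emoji joined by ZWJ

console.log(encodedLength(grinning));  // 12 (4 bytes x 3 characters each)
console.log(encodedLength(family));    // 54 (18 bytes x 3 characters each)

// Assumed budget, taken from the 2,048-character figure
const MAX_URL_LENGTH = 2048;
const moodUrl = new URL('https://example.com/search');
moodUrl.searchParams.set('mood', grinning.repeat(200));
console.log(moodUrl.href.length > MAX_URL_LENGTH);  // true
```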
