URL validation is the first line of defense against injection attacks. Proper validation ensures URLs are well-formed, point to expected destinations, and don't contain malicious payloads.
Key Takeaways
- Always parse URLs before validating them; don't rely on regex alone
- Validate protocol, host, and path separately
- Use allowlists for redirect URLs and external links
- Block javascript:, data:, and file: schemes
- Encode output even after validation
"Input validation is performed to ensure only properly formed data is entering the workflow in an information system, preventing malformed data from persisting in the database and triggering malfunction of various downstream components."
Safe URL Parsing
The foundation of secure URL validation is proper parsing. Many developers reach for regular expressions, but URLs have complex syntax that regex can't reliably handle. The browser's built-in URL constructor provides standards-compliant parsing that handles edge cases correctly.
// Always use URL constructor for parsing
function parseUrl(input) {
try {
return new URL(input);
} catch {
return null; // Invalid URL
}
}
// Validate URL format
function isValidUrl(input) {
try {
new URL(input);
return true;
} catch {
return false;
}
}
// DON'T use regex for URL validation
// This regex can be bypassed:
const badRegex = /^https?:\/\/[\w.-]+/;
// Matches: https://trusted.com@evil.com (wrong! the real host is evil.com)
The code above demonstrates safe parsing using the URL constructor, which throws an exception for malformed URLs. The regex example shows a common mistake: the pattern sees trusted.com and accepts the URL, but everything before the @ is userinfo, so the browser actually connects to evil.com. Attackers use this to redirect users to malicious sites behind a legitimate-looking URL.
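To make the difference concrete, the sketch below runs one lookalike URL through both approaches. The trusted.com and evil.com names are placeholders, and naiveHostFromRegex is a hypothetical helper written only to mimic what regex-based code tends to extract.

```javascript
// Hypothetical helper: returns what a regex-based check would treat as the host.
function naiveHostFromRegex(input) {
  const match = /^https?:\/\/([\w.-]+)/.exec(input);
  return match ? match[1] : null;
}

// Everything before "@" is userinfo; the real host comes after it.
const input = 'https://trusted.com@evil.com/login';
const parsed = new URL(input);

const regexHost = naiveHostFromRegex(input); // what the regex sees
const realHost = parsed.hostname;            // where the request actually goes
const userinfo = parsed.username;            // the decoy part

console.log({ regexHost, realHost, userinfo });
```

Comparing regexHost with realHost shows why any decision about the destination should be made on the parsed hostname rather than on a pattern match over the raw string.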
With reliable parsing in place, you can now validate the individual URL components. Protocol validation is the first critical check.
Protocol Validation
Attackers frequently exploit dangerous URL schemes like javascript: and data: to execute arbitrary code in the user's browser. These attacks are particularly effective in contexts where URLs are rendered as links or used in redirects. Strict protocol allowlisting prevents these injection attacks.
const ALLOWED_PROTOCOLS = ['https:', 'http:'];
function isSafeProtocol(url) {
try {
const parsed = new URL(url);
return ALLOWED_PROTOCOLS.includes(parsed.protocol);
} catch {
return false;
}
}
// Block dangerous protocols
const BLOCKED_PROTOCOLS = ['javascript:', 'data:', 'file:', 'vbscript:'];
function hasBlockedProtocol(url) {
const normalized = url.toLowerCase().trim();
return BLOCKED_PROTOCOLS.some(p => normalized.startsWith(p));
}
// Example attacks these prevent:
// javascript:alert(document.cookie)
// data:text/html,<script>alert(1)</script>
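// file:///etc/passwd
These functions implement both allowlist and blocklist approaches. The allowlist (isSafeProtocol) is more secure since it only permits explicitly approved protocols. The blocklist checks the raw input before parsing and catches only the obvious cases: the URL parser ignores embedded tabs and newlines, so a scheme written with a tab in the middle of "javascript" slips past a startsWith check yet still parses as javascript:. Treat the blocklist as a secondary safety net, never as a replacement for the allowlist.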
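The following sketch compares a raw-string prefix check with a check on the parsed protocol for a few obfuscated inputs. It assumes a WHATWG-compliant URL implementation (current browsers and Node.js); the sample strings and the hasAllowedScheme name are illustrative, not part of the functions above.

```javascript
const ALLOWED = new Set(['https:', 'http:']);

// Returns true only when the parsed scheme is on the allowlist.
function hasAllowedScheme(input) {
  try {
    return ALLOWED.has(new URL(input).protocol);
  } catch {
    return false;
  }
}

// The parser lowercases the scheme and drops leading spaces as well as
// embedded tabs and newlines before it reads the scheme.
const samples = [
  'JavaScript:alert(1)',
  '  javascript:alert(1)',
  'java\tscript:alert(1)',
  'HTTPS://example.com/',
];

const results = samples.map((s) => ({
  input: JSON.stringify(s),
  rawPrefixHit: s.toLowerCase().trim().startsWith('javascript:'),
  protocol: new URL(s).protocol,
  allowed: hasAllowedScheme(s),
}));

console.log(results);
```

The third sample is the interesting one: the raw prefix check misses it, while the parsed protocol still comes out as javascript: and the allowlist rejects it.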
After verifying the protocol is safe, you need to ensure the URL points to an expected destination. Host validation prevents attackers from redirecting users to malicious domains.
Host Validation
Host validation is critical for preventing open redirect vulnerabilities and ensuring users aren't sent to phishing sites. The safest approach is maintaining an explicit allowlist of permitted domains. When you need to support subdomains, careful pattern matching prevents common bypass techniques.
// Allowlist approach (recommended for redirects)
const ALLOWED_HOSTS = ['example.com', 'api.example.com', 'cdn.example.com'];
function isAllowedHost(url) {
try {
const parsed = new URL(url);
return ALLOWED_HOSTS.includes(parsed.hostname);
} catch {
return false;
}
}
// Domain pattern matching
function isSubdomainOf(url, domain) {
try {
const parsed = new URL(url);
const host = parsed.hostname.toLowerCase();
return host === domain || host.endsWith('.' + domain);
} catch {
return false;
}
}
// Watch out for these bypasses:
// evil.com.example.com (reads like evil.com, but is a subdomain of example.com)
// example.com.evil.com (subdomain of evil.com!)
// example.com@evil.com (userinfo, host is evil.com!)
The isAllowedHost function performs exact hostname matching against an allowlist. The isSubdomainOf function allows any subdomain of a given domain; the leading dot in the endsWith check is what stops lookalikes such as evilexample.com. Pay close attention to the bypass examples: attackers exploit string matching assumptions to craft URLs that look legitimate but resolve to malicious hosts.
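As a rough check, the sketch below feeds several lookalike URLs to the subdomain logic and to a deliberately naive substring test. The naiveCheck function and the sample URLs are hypothetical and exist only for comparison.

```javascript
// Same logic as the subdomain check above, restated so this block runs on its own.
function isSubdomainOf(url, domain) {
  try {
    const host = new URL(url).hostname.toLowerCase();
    return host === domain || host.endsWith('.' + domain);
  } catch {
    return false;
  }
}

// Hypothetical naive check that only looks at the raw string.
function naiveCheck(url, domain) {
  return url.includes(domain);
}

const cases = [
  'https://app.example.com/',      // genuine subdomain
  'https://example.com.evil.com/', // trusted name used as a prefix
  'https://example.com@evil.com/', // trusted name hidden in userinfo
  'https://evilexample.com/',      // trusted name used as a suffix, no dot
];

const report = cases.map((url) => ({
  url,
  hostname: new URL(url).hostname,
  naive: naiveCheck(url, 'example.com'),
  parsed: isSubdomainOf(url, 'example.com'),
}));

console.log(report);
```

The naive check accepts all four URLs, whereas the parsed check accepts only the first.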
Individual validation checks are useful, but production applications need a comprehensive validation function that combines all checks in the correct order.
Complete Validation
A complete URL validation function combines parsing, protocol checking, host validation, and additional security checks into a single reusable utility. With this defense-in-depth approach, an input that slips past one check can still be rejected by another.
function validateUrl(input, options = {}) {
const {
allowedProtocols = ['https:'],
allowedHosts = null, // null = allow any
requirePath = false,
} = options;
// 1. Basic format check
let url;
try {
url = new URL(input);
} catch {
return { valid: false, error: 'Invalid URL format' };
}
// 2. Protocol check
if (!allowedProtocols.includes(url.protocol)) {
return { valid: false, error: 'Protocol not allowed' };
}
// 3. Host check (if specified)
if (allowedHosts && !allowedHosts.includes(url.hostname)) {
return { valid: false, error: 'Host not allowed' };
}
// 4. No userinfo (potential confusion)
if (url.username || url.password) {
return { valid: false, error: 'Credentials in URL not allowed' };
}
// 5. Path requirement
if (requirePath && (!url.pathname || url.pathname === '/')) {
return { valid: false, error: 'Path required' };
}
return { valid: true, url };
}
This validation function returns a structured result with either the parsed URL or a specific error message. The five-step validation covers format, protocol, host, userinfo credentials, and optional path requirements. Customize the options based on your security requirements: redirect endpoints need strict host allowlists, while link previews may allow any external domain.
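For the redirect case mentioned above, one possible shape is a small wrapper that falls back to a known-safe page whenever a check fails. This is a sketch under assumptions: resolveRedirect, REDIRECT_HOSTS, and FALLBACK are hypothetical names, and the checks are inlined so the block runs by itself.

```javascript
// Hypothetical allowlist and fallback for a post-login redirect.
const REDIRECT_HOSTS = ['example.com', 'app.example.com'];
const FALLBACK = 'https://example.com/';

// Returns a safe absolute URL string, or the fallback when any check fails.
function resolveRedirect(target) {
  let url;
  try {
    url = new URL(target);
  } catch {
    return FALLBACK; // not an absolute, well-formed URL
  }
  const ok =
    url.protocol === 'https:' &&
    REDIRECT_HOSTS.includes(url.hostname) &&
    !url.username &&
    !url.password;
  // Return the parser's serialization rather than the raw input.
  return ok ? url.href : FALLBACK;
}

console.log(resolveRedirect('https://app.example.com/dashboard'));
console.log(resolveRedirect('https://example.com@evil.com/'));
console.log(resolveRedirect('javascript:alert(1)'));
```

Falling back instead of returning an error keeps the user on a working page; an API that accepts URLs as data would more likely surface the specific error message.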
Even with solid validation in place, it helps to understand the attack techniques that malicious actors use. Recognizing these patterns helps you identify potential weaknesses in your implementation.
Common Attack Patterns
Attackers have developed numerous techniques to bypass URL validation over the years. The table below summarizes the most common attack patterns you should test against. Each attack exploits a specific assumption that naive validation code might make.
| Attack | Example | Defense |
|---|---|---|
| Protocol injection | javascript:alert(1) | Allowlist protocols |
| Host confusion | https://trusted.com@evil.com | Parse with URL API, check hostname |
| Unicode tricks | https://exаmple.com (Cyrillic а) | Compare the Punycode (ASCII) hostname against the allowlist |
| Path traversal | https://api.com/../../../etc/passwd | Normalize path, check result |
Protocol injection and host confusion are the most common attack vectors. Unicode tricks (also called homograph attacks) use visually similar characters from different alphabets to create convincing phishing URLs. Path traversal attempts to access files outside the intended directory.
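The last two rows of the table can be explored with the URL API itself. The sketch below assumes a WHATWG-compliant implementation, which serializes internationalized hostnames in Punycode and collapses dot segments while parsing; isUnder is a hypothetical helper, and the /files/ prefix is only an example.

```javascript
// Homograph: \u0430 is Cyrillic "а", not Latin "a".
const lookalike = new URL('https://ex\u0430mple.com/');
const genuine = new URL('https://example.com/');

// hostname is returned in ASCII form, so the two hosts no longer look alike.
const isPunycode = lookalike.hostname.startsWith('xn--');
const sameHost = lookalike.hostname === genuine.hostname;

// Path traversal: ".." segments are resolved during parsing.
const traversal = new URL('https://api.com/files/../../etc/passwd');
const normalizedPath = traversal.pathname;

// Hypothetical prefix check, applied to the normalized path only.
function isUnder(url, prefix) {
  return url.pathname.startsWith(prefix);
}

console.log({ isPunycode, sameHost, normalizedPath });
console.log(isUnder(traversal, '/files/'));
```

Because the comparison happens on the parsed hostname and the normalized pathname, the lookalike host fails an exact allowlist match and the traversal attempt fails the prefix check.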