Is regex sufficient for URL validation?

No. URLs are complex and regex can't catch all edge cases (like userinfo confusion). Always parse with the URL API first, then validate the parsed components.

Should I validate before or after decoding?

Both. Check for obvious attacks in the raw input (blocked protocols), then parse and validate the decoded components. Watch for double-encoding attacks.
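
One way to watch for double encoding is to decode a component once and see whether percent-escapes remain. The sketch below uses a hypothetical helper name, isDoubleEncoded, and treats malformed escapes as suspicious. A legitimate value that contains a literal percent sequence would also be flagged, so treat it as a heuristic rather than a definitive check.

```javascript
// Hypothetical helper (not part of any library): returns true when a value
// still contains percent-escapes after one round of decoding.
function isDoubleEncoded(value) {
  let once;
  try {
    once = decodeURIComponent(value);
  } catch {
    return true; // Malformed percent-encoding: treat as suspicious
  }
  // Any remaining %XX sequence means a second decode would change the value
  return /%[0-9a-f]{2}/i.test(once);
}

const encodedOnce = isDoubleEncoded('%2e%2e%2f');        // decodes to "../"
const encodedTwice = isDoubleEncoded('%252e%252e%252f'); // decodes to "%2e%2e%2f"
```

Here encodedOnce is false and encodedTwice is true, because the second input only reveals its "../" payload after a second decode.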

What about relative URLs?

Relative URLs should be resolved against a known base URL first, then validated. new URL(relative, baseUrl) handles resolution.
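
Resolution alone is not validation: an absolute or protocol-relative input ignores the base entirely, so the resolved result still needs checking. A minimal sketch, assuming a placeholder base URL (BASE_URL) and a hypothetical helper name (resolveSameOrigin) that only accepts results on the base's origin:

```javascript
// BASE_URL is a placeholder; substitute your application's real base URL.
const BASE_URL = 'https://app.example.com/account/';

// Hypothetical helper: resolves input against the base and returns the
// resolved URL only if it stays on the same origin, otherwise null.
function resolveSameOrigin(relative, baseUrl = BASE_URL) {
  let base;
  let resolved;
  try {
    base = new URL(baseUrl);
    resolved = new URL(relative, base);
  } catch {
    return null; // Input could not be resolved
  }
  return resolved.origin === base.origin ? resolved : null;
}

const ok = resolveSameOrigin('settings');          // stays on app.example.com
const escaped = resolveSameOrigin('//evil.com/x'); // protocol-relative: null
```

Comparing origins covers scheme, host, and port in one check. If several origins are acceptable, the same idea extends to an allowlist of origins.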

URL Validation - Preventing Injection and Malformed URLs

URL validation is the first line of defense against injection attacks. Proper validation ensures URLs are well-formed, point to expected destinations, and don't contain malicious payloads.

Key Takeaways

1. Always parse URLs before validating them; don't use regex alone
2. Validate protocol, host, and path separately
3. Use allowlists for redirect URLs and external links
4. Block javascript:, data:, and file: schemes
5. Encode output even after validation

Definition

URL Validation

The process of verifying that a URL is well-formed, uses allowed protocols, points to expected destinations, and does not contain malicious payloads before processing or redirecting.
Source: OWASP Input Validation Cheat Sheet

"Input validation is performed to ensure only properly formed data is entering the workflow in an information system, preventing malformed data from persisting in the database and triggering malfunction of various downstream components."
— OWASP Input Validation Cheat Sheet

Safe URL Parsing

The foundation of secure URL validation is proper parsing. Many developers reach for regular expressions, but URLs have complex syntax that regex can't reliably handle. The built-in URL constructor, available in browsers and in Node.js, provides standards-compliant parsing that handles these edge cases consistently.

javascript

// Always use URL constructor for parsing
function parseUrl(input) {
  try {
    return new URL(input);
  } catch {
    return null;  // Invalid URL
  }
}

// Validate URL format
function isValidUrl(input) {
  try {
    new URL(input);
    return true;
  } catch {
    return false;
  }
}

// DON'T use regex for URL validation
// This regex can be bypassed:
const badRegex = /^https?:\/\/[\w.-]+/;
// Matches: https://trusted.com@evil.com  (wrong! the real host is evil.com)

The code above demonstrates safe parsing using the URL constructor, which throws an exception for malformed URLs. The regex example shows a common mistake: it treats the userinfo part (trusted.com@) as the hostname, so a URL that actually points to evil.com passes the check while appearing legitimate.
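
To make the difference concrete, the following sketch runs a naive regex and the URL constructor over the same input. The variable names and the sample URL are illustrative only.

```javascript
// The same kind of naive pattern that is easy to bypass
const naiveRegex = /^https?:\/\/[\w.-]+/;
const input = 'https://trusted.com@evil.com/login';

// What the regex believes the host is: it stops at "@"
const regexHost = input.match(naiveRegex)[0].replace(/^https?:\/\//, '');

// What the URL parser reports
const parsed = new URL(input);

// regexHost       → 'trusted.com'
// parsed.hostname → 'evil.com'
// parsed.username → 'trusted.com'
```

The regex accepts the input because the text before the @ looks like a hostname, while the parser correctly reports that text as userinfo.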

With reliable parsing in place, you can now validate the individual URL components. Protocol validation is the first critical check.

Protocol Validation

Attackers frequently exploit dangerous URL schemes like javascript: and data: to execute arbitrary code in the user's browser. These attacks are particularly effective in contexts where URLs are rendered as links or used in redirects. Strict protocol allowlisting prevents these injection attacks.

javascript

const ALLOWED_PROTOCOLS = ['https:', 'http:'];

function isSafeProtocol(url) {
  try {
    const parsed = new URL(url);
    return ALLOWED_PROTOCOLS.includes(parsed.protocol);
  } catch {
    return false;
  }
}

// Block dangerous protocols
const BLOCKED_PROTOCOLS = ['javascript:', 'data:', 'file:', 'vbscript:'];

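function hasBlockedProtocol(url) {
  // URL parsers ignore tabs and newlines anywhere in the input and strip
  // leading control characters and spaces, so remove them before comparing.
  const normalized = url
    .replace(/[\t\n\r]/g, '')
    .replace(/^[\u0000-\u0020]+/, '')
    .toLowerCase();
  return BLOCKED_PROTOCOLS.some(p => normalized.startsWith(p));
}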

// Example attacks these prevent:
// javascript:alert(document.cookie)
// data:text/html,<script>alert(1)</script>
// file:///etc/passwd

These functions implement both allowlist and blocklist approaches. The allowlist (isSafeProtocol) is more secure since it only permits explicitly approved protocols. The blocklist is an additional safety net that checks the raw input before parsing. Because string matching is easy to evade with unusual whitespace or encoding, never rely on the blocklist alone.
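
The reason the allowlist holds up better is that the URL parser normalizes the scheme before you compare it. A short sketch, with illustrative names (ALLOWED, schemeOf) and sample inputs chosen to show common obfuscations:

```javascript
const ALLOWED = new Set(['https:', 'http:']);

// Returns the normalized scheme as reported by the parser, or null if unparseable
function schemeOf(input) {
  try {
    return new URL(input).protocol;
  } catch {
    return null;
  }
}

const samples = [
  'JavaScript:alert(1)',    // mixed case
  '  javascript:alert(1)',  // leading spaces
  'java\tscript:alert(1)',  // embedded tab
  'https://example.com/',   // legitimate
];

const results = samples.map(input => {
  const protocol = schemeOf(input);
  return { input, protocol, allowed: ALLOWED.has(protocol) };
});
// The first three entries all report protocol 'javascript:' and are rejected;
// only the last entry is allowed.
```

Because the comparison runs on the parser's output, each obfuscated variant collapses to the same javascript: scheme and fails the allowlist.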

After verifying the protocol is safe, you need to ensure the URL points to an expected destination. Host validation prevents attackers from redirecting users to malicious domains.

Host Validation

Host validation is critical for preventing open redirect vulnerabilities and ensuring users aren't sent to phishing sites. The safest approach is maintaining an explicit allowlist of permitted domains. When you need to support subdomains, careful pattern matching prevents common bypass techniques.

javascript

// Allowlist approach (recommended for redirects)
const ALLOWED_HOSTS = ['example.com', 'api.example.com', 'cdn.example.com'];

function isAllowedHost(url) {
  try {
    const parsed = new URL(url);
    return ALLOWED_HOSTS.includes(parsed.hostname);
  } catch {
    return false;
  }
}

// Domain pattern matching
function isSubdomainOf(url, domain) {
  try {
    const parsed = new URL(url);
    const host = parsed.hostname.toLowerCase();
    return host === domain || host.endsWith('.' + domain);
  } catch {
    return false;
  }
}

// Watch out for these bypasses:
// evil.com.example.com (a real subdomain of example.com; a risk only if untrusted parties can create subdomains)
// example.com.evil.com (subdomain of evil.com!)
// example.com@evil.com (userinfo, host is evil.com!)

The isAllowedHost function performs exact hostname matching against an allowlist. The isSubdomainOf function allows any subdomain of a given domain. Pay close attention to the bypass examples: attackers exploit string matching assumptions to craft URLs that look legitimate but resolve to malicious hosts.
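
The dot prefix in the suffix check is what separates a real subdomain from a lookalike domain. The sketch below contrasts a naive suffix match with a dot-boundary match; naiveMatch, boundaryMatch, and the sample hosts are illustrative names, not part of any library.

```javascript
function hostnameOf(url) {
  try {
    return new URL(url).hostname;
  } catch {
    return null;
  }
}

// Naive: plain suffix match with no dot boundary (do not use)
function naiveMatch(url, domain) {
  const host = hostnameOf(url);
  return host !== null && host.endsWith(domain);
}

// Dot-boundary match: exact host, or a suffix that begins at a label boundary
function boundaryMatch(url, domain) {
  const host = hostnameOf(url);
  return host !== null && (host === domain || host.endsWith('.' + domain));
}

const cases = [
  'https://shop.example.com/',     // genuine subdomain
  'https://evilexample.com/',      // lookalike registered by an attacker
  'https://example.com.evil.com/', // subdomain of evil.com
  'https://example.com@evil.com/', // userinfo trick, host is evil.com
].map(url => ({
  url,
  naive: naiveMatch(url, 'example.com'),
  boundary: boundaryMatch(url, 'example.com'),
}));
```

Only the first case passes boundaryMatch, while naiveMatch also accepts evilexample.com.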

Individual validation checks are useful, but production applications need a comprehensive validation function that combines all checks in the correct order.

Complete Validation

A complete URL validation function combines parsing, protocol checking, host validation, and additional security checks into a single reusable utility. This defense-in-depth approach means that bypassing any single check is less likely to defeat the validation as a whole.

javascript

function validateUrl(input, options = {}) {
  const {
    allowedProtocols = ['https:'],
    allowedHosts = null,  // null = allow any
    requirePath = false,
  } = options;

  // 1. Basic format check
  let url;
  try {
    url = new URL(input);
  } catch {
    return { valid: false, error: 'Invalid URL format' };
  }

  // 2. Protocol check
  if (!allowedProtocols.includes(url.protocol)) {
    return { valid: false, error: 'Protocol not allowed' };
  }

  // 3. Host check (if specified)
  if (allowedHosts && !allowedHosts.includes(url.hostname)) {
    return { valid: false, error: 'Host not allowed' };
  }

  // 4. No userinfo (potential confusion)
  if (url.username || url.password) {
    return { valid: false, error: 'Credentials in URL not allowed' };
  }

  // 5. Path requirement
  if (requirePath && (!url.pathname || url.pathname === '/')) {
    return { valid: false, error: 'Path required' };
  }

  return { valid: true, url };
}

This validation function returns a structured result with either the parsed URL or a specific error message. The five-step validation covers format, protocol, host, userinfo credentials, and optional path requirements. Customize the options based on your security requirements: redirect endpoints need strict host allowlists, while link previews may allow any external domain.
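
As a usage sketch, a redirect handler might fall back to a safe default whenever validation fails. The helper name chooseRedirect, the fallback path, and the host list are assumptions for illustration. The validator is passed in as a parameter and is expected to return the same { valid, url } or { valid, error } shape described above.

```javascript
// Hypothetical helper: picks a redirect destination, falling back to a safe
// default when the supplied validator rejects the target.
function chooseRedirect(target, validate, fallback = '/') {
  const result = validate(target, {
    allowedProtocols: ['https:'],
    allowedHosts: ['example.com', 'app.example.com'], // placeholder hosts
  });
  // Redirect to the parser's normalized href, never to the raw input string
  return result.valid ? result.url.href : fallback;
}
```

Returning url.href rather than the original string means the value you redirect to is exactly the value that was validated.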

Even with solid validation in place, it helps to understand the attack techniques that malicious actors use. Recognizing these patterns helps you identify potential weaknesses in your implementation.

Common Attack Patterns

Attackers have developed numerous techniques to bypass URL validation over the years. The table below summarizes the most common attack patterns you should test against. Each attack exploits a specific assumption that naive validation code might make.

Attack	Example	Defense
Protocol injection	javascript:alert(1)	Allowlist protocols
Host confusion	https://trusted.com@evil.com	Parse with URL API, check hostname
Unicode tricks	https://exаmple.com (Cyrillic а)	Normalize to ASCII
Path traversal	https://api.com/../../../etc/passwd	Normalize path, check result

Protocol injection and host confusion are the most common attack vectors. Unicode tricks (also called homograph attacks) use visually similar characters from different alphabets to create convincing phishing URLs. Path traversal attempts to access files outside the intended directory.
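
The last two rows can be checked against the parser's output. In this sketch, hasPunycodeLabel and isUnderPath are hypothetical helper names. The first relies on the URL parser converting non-ASCII hostnames to Punycode (labels starting with xn--), and the second checks the path only after the parser has resolved dot segments.

```javascript
// Flags hostnames that contain internationalized (Punycode) labels
function hasPunycodeLabel(url) {
  try {
    return new URL(url).hostname
      .split('.')
      .some(label => label.startsWith('xn--'));
  } catch {
    return false;
  }
}

// Checks that the normalized path stays under the expected prefix
function isUnderPath(url, prefix) {
  try {
    const { pathname } = new URL(url);
    const dir = prefix.endsWith('/') ? prefix : prefix + '/';
    return pathname === prefix || pathname.startsWith(dir);
  } catch {
    return false;
  }
}

const homograph = hasPunycodeLabel('https://ex\u0430mple.com/'); // Cyrillic "а"
const traversal = isUnderPath('https://api.com/v1/../../etc/passwd', '/v1');
```

Here homograph is true and traversal is false. Flagging every Punycode label would also reject legitimate internationalized domains, so whether to block or merely review such hosts is a policy decision. The path check covers only what the URL parser normalizes; a server that decodes the path again later needs its own check.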
