Python's urllib.parse module provides everything you need to work with URLs: parsing them into components, building them from parts, encoding special characters, and resolving relative URLs. This guide covers all the essential functions with practical examples.
Key Takeaways
1. Use urlparse() to break URLs into components (scheme, netloc, path, query, fragment)
2. Use parse_qs() for query strings—it returns lists because keys can repeat
3. Use urlencode() to build query strings from dicts safely
4. Use quote() for path segments and quote_plus() for query values
5. Use urljoin() to resolve relative URLs against a base
Parsing URLs with urlparse()
The core of URL handling in Python is the urlparse() function. Unlike JavaScript's URL constructor, which returns an object with methods, Python gives you a named tuple with direct attribute access. This functional approach fits well with Python's design philosophy.
The urlparse() function breaks a URL into its six components:
```python
from urllib.parse import urlparse

url = "https://api.example.com:8080/v1/users?status=active&limit=10#results"
parsed = urlparse(url)

print(parsed.scheme)    # "https"
print(parsed.netloc)    # "api.example.com:8080"
print(parsed.hostname)  # "api.example.com"
print(parsed.port)      # 8080
print(parsed.path)      # "/v1/users"
print(parsed.query)     # "status=active&limit=10"
print(parsed.fragment)  # "results"

# The result is a named tuple
print(parsed)
# ParseResult(scheme='https', netloc='api.example.com:8080',
#             path='/v1/users', params='', query='status=active&limit=10',
#             fragment='results')
```

The code above parses a URL string into a ParseResult named tuple. You can access each component using dot notation like parsed.scheme or parsed.hostname. Notice that netloc includes both the hostname and port, while hostname and port give you these values separately.
The table below shows all available attributes on the ParseResult object. Pay attention to the difference between netloc (the full network location) and the individual hostname and port properties.
| Attribute | Description | Example Value |
|---|---|---|
| scheme | Protocol (http, https, ftp) | "https" |
| netloc | Network location (host + port) | "api.example.com:8080" |
| hostname | Host without port | "api.example.com" |
| port | Port number (int or None) | 8080 |
| path | Path component | "/v1/users" |
| query | Query string (unparsed) | "status=active&limit=10" |
| fragment | Fragment/anchor | "results" |
| params | Path parameters (rarely used) | "" |
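Since ParseResult is an immutable named tuple, you modify it by deriving a copy with _replace() and serialize it back to a string with geturl(). A quick sketch:

```python
from urllib.parse import urlparse

parsed = urlparse("https://api.example.com:8080/v1/users?status=active#results")

# _replace() returns a new ParseResult with the given fields swapped out
downgraded = parsed._replace(scheme="http", fragment="")

# geturl() reassembles the components into a URL string
print(downgraded.geturl())
# "http://api.example.com:8080/v1/users?status=active"
```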
urlparse vs urlsplit
Python actually offers two parsing functions that differ in how they handle an obscure URL feature called path parameters. Understanding when to use each can save you from subtle bugs.
urlsplit() is similar to urlparse() but doesn't separate the rarely-used params component:
```python
from urllib.parse import urlparse, urlsplit

# urlparse separates params (text after ; in the last path segment)
url = "https://example.com/path;params?query"
parsed = urlparse(url)
print(parsed.path)    # "/path"
print(parsed.params)  # "params"

# urlsplit keeps them together (faster, usually what you want)
split = urlsplit(url)
print(split.path)     # "/path;params"
# No params attribute

# For most modern URLs, urlsplit is sufficient and slightly faster
```

The key difference is that urlparse() separates path parameters (the part after ; in the last path segment), while urlsplit() keeps them as part of the path. Since path parameters are rarely used in modern web applications, urlsplit() is usually the better choice for performance-sensitive code.
With the URL parsed into components, you'll often need to work with query parameters. Let's look at how Python handles query string parsing.
Working with Query Strings
Parsing Query Strings with parse_qs()
Query strings are where most of the action happens in URL manipulation. Python's parse_qs() function is your primary tool for extracting parameter values. One important design choice to understand: it returns lists for all values because the same key can appear multiple times in a query string (like ?tag=python&tag=web).
```python
from urllib.parse import parse_qs, urlparse

url = "https://example.com/search?q=python&tags=web&tags=api&page=1"
parsed = urlparse(url)
params = parse_qs(parsed.query)
print(params)
# {'q': ['python'], 'tags': ['web', 'api'], 'page': ['1']}

# Get a single value (first in list)
query = params.get('q', [''])[0]  # "python"

# Get all values for a key
tags = params.get('tags', [])  # ['web', 'api']

# Check whether a parameter exists
if 'page' in params:
    page = int(params['page'][0])  # 1
```

The code demonstrates parsing a query string into a dictionary where each value is a list. Notice that even single values like q are wrapped in a list. This consistency means you always use [0] to get the first value, which prevents errors when parameters might have multiple values.
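If the repeated [0] indexing feels clunky, you can centralize it in a small helper. The get_param function below is a hypothetical convenience wrapper, not part of urllib.parse:

```python
from urllib.parse import parse_qs

def get_param(query: str, key: str, default: str = "") -> str:
    """Return the first value for key, or default if absent (hypothetical helper)."""
    return parse_qs(query).get(key, [default])[0]

print(get_param("q=python&page=2", "page"))   # "2"
print(get_param("q=python", "missing", "-"))  # "-"
```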
Getting Ordered Pairs with parse_qsl()
Sometimes you need to preserve the exact order of parameters as they appeared in the URL, or handle the same key appearing multiple times differently. The parse_qsl() function returns a list of tuples instead of a dictionary.
parse_qsl() returns a list of (key, value) tuples, preserving order and duplicates:
```python
from urllib.parse import parse_qsl

query = "color=red&size=large&color=blue"
params = parse_qsl(query)
print(params)
# [('color', 'red'), ('size', 'large'), ('color', 'blue')]

# Convert to dict (loses duplicates - last value wins)
params_dict = dict(params)
# {'color': 'blue', 'size': 'large'}

# Filter by key
colors = [v for k, v in params if k == 'color']
# ['red', 'blue']
```

The tuple format preserves ordering and allows you to handle duplicate keys any way you need. Converting to a dict loses duplicates (the last value wins), so use list comprehensions when you need all values for a repeated key.
Handling Blank Values
A subtle gotcha in query string parsing is how blank values are handled. Parameters like ?flag (no value) or ?empty= (empty string) are excluded by default. This matters when processing form submissions where empty fields still carry meaning.
```python
from urllib.parse import parse_qs

# By default, blank values are excluded
query = "name=John&empty=&flag"
params = parse_qs(query)
print(params)  # {'name': ['John']}

# Include blank values with keep_blank_values=True
params = parse_qs(query, keep_blank_values=True)
print(params)  # {'name': ['John'], 'empty': [''], 'flag': ['']}

# This is important for forms where empty fields are meaningful
```

Setting keep_blank_values=True ensures that parameters without values are still included in the parsed result. This is essential when you need to distinguish between a missing parameter and one explicitly set to empty.
Now that you know how to parse query strings, let's look at the reverse operation: building them safely.
Building Query Strings with urlencode()
Building query strings by hand with string concatenation is error-prone. Special characters need encoding, and it's easy to forget an ampersand or equals sign. Python's urlencode() handles all of this safely.
```python
from urllib.parse import urlencode

# From a dictionary
params = {
    'q': 'python tutorials',
    'category': 'programming',
    'page': 1
}
query_string = urlencode(params)
print(query_string)
# "q=python+tutorials&category=programming&page=1"

# Special characters are encoded automatically
params = {'query': 'Tom & Jerry', 'filter': 'price>100'}
query_string = urlencode(params)
print(query_string)
# "query=Tom+%26+Jerry&filter=price%3E100"
```

The code shows urlencode() converting a dictionary to a properly formatted query string. Notice how the ampersand in "Tom & Jerry" is encoded as %26 and the greater-than sign becomes %3E. You don't need to think about encoding at all.
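One knob worth knowing: urlencode() uses quote_plus() internally, so spaces become +. If a server expects %20 in query values instead, the quote_via parameter swaps in a different encoder:

```python
from urllib.parse import urlencode, quote

params = {'q': 'python tutorials'}

# Default: quote_plus encoding, spaces become +
print(urlencode(params))
# "q=python+tutorials"

# quote_via=quote encodes spaces as %20 instead
print(urlencode(params, quote_via=quote))
# "q=python%20tutorials"
```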
Handling Multiple Values for Same Key
When a parameter needs multiple values (like selecting multiple checkboxes), you need to tell urlencode() to expand sequences with the doseq parameter.
```python
from urllib.parse import urlencode

# Method 1: Use doseq=True with list values
params = {
    'color': ['red', 'blue', 'green'],
    'size': 'large'
}
query_string = urlencode(params, doseq=True)
print(query_string)
# "color=red&color=blue&color=green&size=large"

# Method 2: Use a list of tuples
params = [
    ('color', 'red'),
    ('color', 'blue'),
    ('size', 'large')
]
query_string = urlencode(params)
print(query_string)
# "color=red&color=blue&size=large"
```

Both approaches produce the same result: separate parameters for each value. The doseq=True option is more convenient when working with dictionaries, while the list of tuples approach gives you explicit control over ordering.
Building Complete URLs
Often you need to construct a complete URL from scratch or modify an existing one. Python offers several approaches depending on how much control you need.
```python
from urllib.parse import urlencode, urlunparse, urlparse, ParseResult

# Method 1: String formatting (simple cases)
base = "https://api.example.com/search"
params = {'q': 'python', 'limit': 10}
url = f"{base}?{urlencode(params)}"
print(url)
# "https://api.example.com/search?q=python&limit=10"

# Method 2: Using urlunparse (more control)
components = ParseResult(
    scheme='https',
    netloc='api.example.com',
    path='/v1/users',
    params='',
    query=urlencode({'status': 'active'}),
    fragment=''
)
url = urlunparse(components)
print(url)
# "https://api.example.com/v1/users?status=active"

# Method 3: Modify a parsed URL
parsed = urlparse("https://api.example.com/users")
new_url = parsed._replace(
    query=urlencode({'page': 2, 'limit': 20})
)
print(urlunparse(new_url))
# "https://api.example.com/users?page=2&limit=20"
```

Method 1 using f-strings is simple and works for most cases. Method 2 with urlunparse() is more powerful when you need to construct URLs from all their components. Method 3 shows how to modify a parsed URL using _replace() on the named tuple.
Understanding URL encoding is crucial for building URLs correctly. Let's explore how Python handles this.
URL Encoding with quote() and quote_plus()
URL encoding (also called percent-encoding) is how you safely include special characters in URLs. Python provides two functions that differ in one important way: how they encode spaces. The table below shows which to use where.
| Function | Encodes Space As | Best For |
|---|---|---|
| quote() | %20 | Path segments, general encoding |
| quote_plus() | + | Query string values (form data) |
The key insight is that spaces can be encoded two ways: %20 (for paths) or + (for query strings). Using the wrong encoding can break your URLs.
```python
from urllib.parse import quote, quote_plus

text = "Hello World & Friends"

# quote() - spaces become %20
print(quote(text))
# "Hello%20World%20%26%20Friends"

# quote_plus() - spaces become +
print(quote_plus(text))
# "Hello+World+%26+Friends"

# For path segments, use quote with safe=""
path_segment = "my file/name.pdf"
print(quote(path_segment, safe=''))
# "my%20file%2Fname.pdf"

# By default, quote() leaves / unencoded
print(quote(path_segment))
# "my%20file/name.pdf" (/ not encoded)
```

The code demonstrates both encoding functions and their different treatment of spaces. For path segments, use quote() with safe='' to encode everything including slashes. For query string values, quote_plus() or urlencode() handles it automatically.
The safe Parameter
The safe parameter controls which characters are left unencoded. By default, quote() leaves slashes alone (useful for paths), but you can customize this behavior.
```python
from urllib.parse import quote

# safe='' encodes everything except alphanumerics and _.-~
print(quote("a/b?c=d", safe=''))
# "a%2Fb%3Fc%3Dd"

# safe='/' leaves slashes unencoded (the default)
print(quote("a/b?c=d"))
# "a/b%3Fc%3Dd"

# safe='/?' leaves both unencoded
print(quote("a/b?c=d", safe='/?'))
# "a/b?c%3Dd"

# For query values, urlencode() handles this automatically
# For path segments, explicitly use safe=''
```

Setting safe='' encodes all special characters, which is what you need for individual path segments. Leaving characters in safe is useful when you're encoding a full path and want to preserve the structure.
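Putting the safe parameter to work: to build a path from untrusted segments, encode each segment with safe='' and join them with literal slashes. The build_path function below is a hypothetical helper for illustration:

```python
from urllib.parse import quote

def build_path(*segments: str) -> str:
    """Percent-encode each segment, then join with slashes (hypothetical helper)."""
    return "/" + "/".join(quote(s, safe="") for s in segments)

print(build_path("v1", "files", "my report.pdf"))
# "/v1/files/my%20report.pdf"

# A slash inside a segment is data, not structure, so it gets encoded
print(build_path("v1", "a/b"))
# "/v1/a%2Fb"
```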
Decoding with unquote() and unquote_plus()
Decoding is the reverse operation, and the same space-encoding distinction applies. Choose the right function based on where the encoded string came from.
```python
from urllib.parse import unquote, unquote_plus

# unquote() decodes %XX sequences
encoded = "Hello%20World%20%26%20Friends"
print(unquote(encoded))
# "Hello World & Friends"

# unquote_plus() also decodes + as space
form_data = "Hello+World+%26+Friends"
print(unquote_plus(form_data))
# "Hello World & Friends"

# unquote() doesn't decode +
print(unquote(form_data))
# "Hello+World+&+Friends" (+ stays as +)

# For query strings from forms, use unquote_plus()
# For path segments and general URLs, use unquote()
```

The key difference: unquote_plus() converts + to spaces, while unquote() leaves them as plus signs. When parsing query strings from form submissions, use unquote_plus() or let parse_qs() handle it automatically.
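Both decoders also accept encoding and errors parameters, which matter when the percent-encoded bytes aren't UTF-8 (the default assumption):

```python
from urllib.parse import unquote

# 'café' percent-encoded as Latin-1 rather than UTF-8
latin = "caf%E9"

# Default UTF-8 decoding can't interpret the byte, so errors='replace'
# substitutes the Unicode replacement character
print(unquote(latin))
# "caf\ufffd"

# Passing the correct encoding recovers the original text
print(unquote(latin, encoding="latin-1"))
# "café"
```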
Resolving Relative URLs with urljoin()
When scraping websites or processing HTML, you'll encounter relative URLs like ../images/logo.png. Python's urljoin() resolves these against a base URL, following the same rules browsers use.
```python
from urllib.parse import urljoin

base = "https://example.com/blog/posts/article.html"

# Relative paths
print(urljoin(base, "other.html"))
# "https://example.com/blog/posts/other.html"

print(urljoin(base, "./images/photo.jpg"))
# "https://example.com/blog/posts/images/photo.jpg"

print(urljoin(base, "../about.html"))
# "https://example.com/blog/about.html"

# Absolute paths (from root)
print(urljoin(base, "/contact"))
# "https://example.com/contact"

# Protocol-relative
print(urljoin(base, "//cdn.example.com/script.js"))
# "https://cdn.example.com/script.js"

# Full URLs (base is ignored)
print(urljoin(base, "https://other.com/page"))
# "https://other.com/page"
```

The function handles all the edge cases: .. for parent directories, / for absolute paths from the root, and protocol-relative URLs. When the second argument is a complete URL, the base is ignored entirely.
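One gotcha worth internalizing: whether the base ends with a slash changes the result, because the last path segment of the base is treated as a file and replaced unless the path ends in /:

```python
from urllib.parse import urljoin

# No trailing slash: 'v1' is treated as a file and replaced
print(urljoin("https://api.example.com/v1", "users"))
# "https://api.example.com/users"

# Trailing slash: 'v1/' is treated as a directory and kept
print(urljoin("https://api.example.com/v1/", "users"))
# "https://api.example.com/v1/users"
```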
While urllib.parse handles URL manipulation, you'll often use the requests library for actual HTTP operations. Let's see how they work together.
Using the requests Library
The requests library is the de facto standard for HTTP in Python. It integrates seamlessly with URL handling, automatically encoding query parameters so you can pass raw values.
```python
import requests
from urllib.parse import urlparse, parse_qs

# Query parameters are encoded automatically
response = requests.get(
    "https://api.example.com/search",
    params={
        'q': 'Tom & Jerry',
        'page': 1,
        'tags': ['animation', 'classic']
    }
)
print(response.url)
# "https://api.example.com/search?q=Tom+%26+Jerry&page=1&tags=animation&tags=classic"

# Parse the URL from a response
parsed = urlparse(response.url)
params = parse_qs(parsed.query)

# Build URLs with sessions
session = requests.Session()
session.params = {'api_key': 'secret123'}  # Added to all requests
response = session.get(
    "https://api.example.com/users",
    params={'limit': 10}
)
# Combines session params with request params
```

Notice how requests handles encoding automatically when you pass the params dictionary. It also handles list values by repeating the parameter. You can still use urllib.parse to inspect or modify the resulting URL.
Let's put all these concepts together with some real-world examples you can adapt for your own projects.
Practical Examples
Building an API Client
This example shows a reusable API client class that handles URL construction, API key injection, and proper encoding. It's a pattern you'll use frequently when integrating with REST APIs.
```python
from urllib.parse import urljoin, urlencode
import requests

class APIClient:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.api_key = api_key

    def _build_url(self, endpoint: str, params: dict = None) -> str:
        """Build a full URL with query parameters."""
        # Use a relative endpoint ('search', not '/search') so urljoin
        # keeps the /v1/ prefix of the base URL
        url = urljoin(self.base_url, endpoint)
        # Always include the API key
        params = {**(params or {}), 'api_key': self.api_key}
        return f"{url}?{urlencode(params, doseq=True)}"

    def search(self, query: str, filters: dict = None) -> dict:
        """Search with optional filters."""
        params = {'q': query}
        if filters:
            params.update(filters)
        url = self._build_url('search', params)
        response = requests.get(url)
        return response.json()

# Usage
client = APIClient(
    base_url='https://api.example.com/v1/',
    api_key='your-api-key'
)
results = client.search(
    query='python tutorials',
    filters={'category': 'programming', 'sort': 'date'}
)
```

The API client encapsulates URL building logic, making it easy to add new endpoints. The _build_url method handles the details of combining base URL, path, and parameters while always including the API key. Note that endpoints are passed as relative paths ('search' rather than '/search'): a leading slash would make urljoin() discard the /v1/ prefix of the base URL.
URL Validation
Security-conscious applications need to validate URLs before following redirects or displaying content. This validator checks for common issues like missing protocols, invalid schemes, and potentially dangerous characters.
```python
from urllib.parse import urlparse
from typing import Optional

def validate_url(url: str) -> tuple[bool, Optional[str]]:
    """Validate a URL and return (is_valid, error_message)."""
    try:
        parsed = urlparse(url)
        # Must have a scheme
        if not parsed.scheme:
            return False, "Missing protocol (http:// or https://)"
        # Must be http or https
        if parsed.scheme not in ('http', 'https'):
            return False, f"Invalid protocol: {parsed.scheme}"
        # Must have a host
        if not parsed.netloc:
            return False, "Missing domain name"
        # Check for suspicious characters
        if '<' in url or '>' in url or 'javascript:' in url.lower():
            return False, "URL contains potentially dangerous characters"
        return True, None
    except Exception as e:
        return False, str(e)

# Usage
is_valid, error = validate_url("https://example.com/page")
if not is_valid:
    print(f"Invalid URL: {error}")

# Test cases
print(validate_url("https://example.com"))   # (True, None)
print(validate_url("example.com"))           # (False, "Missing protocol...")
print(validate_url("javascript:alert(1)"))   # (False, "Invalid protocol...")
```

The validator returns a tuple with a boolean and an optional error message, making it easy to provide user-friendly feedback. Note that urlparse() is permissive and won't catch all malformed URLs, so additional validation logic is necessary.
URL Normalizer
When comparing or deduplicating URLs, you need to normalize them first. Different URLs can point to the same resource: uppercase vs. lowercase hosts, default ports, trailing slashes, and parameter ordering all create variations. This normalizer addresses them all.
```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """
    Normalize a URL for comparison and deduplication.
    - Lowercase scheme and host
    - Remove default ports
    - Sort query parameters
    - Remove trailing slash from path (optional)
    - Remove fragment
    """
    parsed = urlparse(url)

    # Lowercase scheme and host
    scheme = parsed.scheme.lower()
    netloc = parsed.hostname.lower() if parsed.hostname else ''

    # Include the port only if non-default
    if parsed.port:
        default_ports = {'http': 80, 'https': 443}
        if parsed.port != default_ports.get(scheme):
            netloc = f"{netloc}:{parsed.port}"

    # Normalize the path
    path = parsed.path or '/'
    if path != '/' and path.endswith('/'):
        path = path.rstrip('/')

    # Sort query parameters
    params = parse_qsl(parsed.query)
    params.sort(key=lambda x: (x[0], x[1]))
    query = urlencode(params)

    # Rebuild without the fragment
    return urlunparse((scheme, netloc, path, '', query, ''))

# Usage
urls = [
    "HTTPS://Example.COM/path/?b=2&a=1#section",
    "https://example.com:443/path?a=1&b=2",
    "https://example.com/path/?a=1&b=2"
]
normalized = [normalize_url(url) for url in urls]
# All become: "https://example.com/path?a=1&b=2"
print(len(set(normalized)))  # 1 (all duplicates)
```

The normalizer lowercases the scheme and host, removes default ports, normalizes the path, sorts query parameters, and strips fragments. After normalization, all three example URLs become identical, making deduplication straightforward.
Common Patterns
Here are some utility functions you'll find yourself writing repeatedly. Feel free to copy these into your projects.
Extract Domain from URL
Extracting just the domain or root domain is useful for grouping URLs, security checks, and analytics.
```python
from urllib.parse import urlparse

def get_domain(url: str) -> str:
    """Extract the domain from a URL."""
    parsed = urlparse(url)
    return parsed.hostname or ''

def get_root_domain(url: str) -> str:
    """Extract the root domain (without subdomains)."""
    hostname = get_domain(url)
    parts = hostname.split('.')
    # Naive: keeps the last two labels, which is wrong for
    # multi-part suffixes like 'example.co.uk'
    if len(parts) >= 2:
        return '.'.join(parts[-2:])
    return hostname

# Usage
print(get_domain("https://docs.api.example.com/page"))
# "docs.api.example.com"
print(get_root_domain("https://docs.api.example.com/page"))
# "example.com"
```

The get_root_domain function is simplified and may not handle all cases correctly (like example.co.uk, where it would return co.uk). For production use, consider the publicsuffix2 library, which knows about all valid public suffixes.
Add Parameters to Existing URL
Adding parameters to an existing URL while preserving existing ones is a common operation. This function handles merging parameters correctly.
```python
from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

def add_params(url: str, new_params: dict) -> str:
    """Add query parameters to an existing URL."""
    parsed = urlparse(url)

    # Parse existing params
    params = parse_qs(parsed.query)

    # Add new params (convert to list format)
    for key, value in new_params.items():
        if isinstance(value, list):
            params[key] = value
        else:
            params[key] = [value]

    # Rebuild the URL
    new_query = urlencode(params, doseq=True)
    return urlunparse(parsed._replace(query=new_query))

# Usage
url = "https://example.com/search?q=python"
new_url = add_params(url, {'page': 2, 'sort': 'date'})
print(new_url)
# "https://example.com/search?q=python&page=2&sort=date"
```

The function parses the existing URL, extracts current parameters, merges in the new ones, rebuilds the query string, and reconstructs the complete URL. This pattern ensures existing parameters aren't lost when adding new ones.
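The inverse operation, stripping parameters such as tracking codes, follows the same parse, filter, rebuild pattern. The remove_params function below is a hypothetical companion helper, not a standard library function:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def remove_params(url: str, keys: set) -> str:
    """Drop the named query parameters from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Keep only the pairs whose key is not in the removal set
    kept = [(k, v) for k, v in parse_qsl(parsed.query) if k not in keys]
    return urlunparse(parsed._replace(query=urlencode(kept)))

# Usage
url = "https://example.com/search?q=python&utm_source=mail"
print(remove_params(url, {"utm_source"}))
# "https://example.com/search?q=python"
```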