Python's urllib.parse module provides everything you need to work with URLs: parsing them into components, building them from parts, encoding special characters, and resolving relative URLs. This guide covers all the essential functions with practical examples.
Key Takeaways
1. Use urlparse() to break URLs into components (scheme, netloc, path, query, fragment)
2. Use parse_qs() for query strings—it returns lists because keys can repeat
3. Use urlencode() to build query strings from dicts safely
4. Use quote() for path segments and quote_plus() for query values
5. Use urljoin() to resolve relative URLs against a base
Parsing URLs with urlparse()
The core of URL handling in Python is the urlparse() function. Unlike JavaScript's URL constructor, which returns an object with methods, Python gives you a named tuple with direct attribute access. This functional approach fits well with Python's design philosophy.
The urlparse() function breaks a URL into its six components:
```python
from urllib.parse import urlparse

url = "https://api.example.com:8080/v1/users?status=active&limit=10#results"
parsed = urlparse(url)

print(parsed.scheme)    # "https"
print(parsed.netloc)    # "api.example.com:8080"
print(parsed.hostname)  # "api.example.com"
print(parsed.port)      # 8080
print(parsed.path)      # "/v1/users"
print(parsed.query)     # "status=active&limit=10"
print(parsed.fragment)  # "results"

# The result is a named tuple
print(parsed)
# ParseResult(scheme='https', netloc='api.example.com:8080',
#             path='/v1/users', params='', query='status=active&limit=10',
#             fragment='results')
```

The code above parses a URL string into a ParseResult named tuple. You can access each component using dot notation like parsed.scheme or parsed.hostname. Notice that netloc includes both the hostname and port, while hostname and port give you these values separately.
The table below shows all available attributes on the ParseResult object. Pay attention to the difference between netloc (the full network location) and the individual hostname and port properties.
| Attribute | Description | Example Value |
|---|---|---|
| scheme | Protocol (http, https, ftp) | "https" |
| netloc | Network location (host + port) | "api.example.com:8080" |
| hostname | Host without port | "api.example.com" |
| port | Port number (int or None) | 8080 |
| path | Path component | "/v1/users" |
| query | Query string (unparsed) | "status=active&limit=10" |
| fragment | Fragment/anchor | "results" |
| params | Path parameters (rarely used) | "" |
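Since ParseResult is an immutable named tuple, you modify it by deriving a copy with _replace() and serialize it back to a string with geturl(). A quick sketch:

```python
from urllib.parse import urlparse

parsed = urlparse("https://api.example.com:8080/v1/users?status=active#results")

# _replace() returns a new ParseResult with the given fields swapped out
downgraded = parsed._replace(scheme="http", fragment="")

# geturl() reassembles the components into a URL string
print(downgraded.geturl())
# "http://api.example.com:8080/v1/users?status=active"
```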
urlparse vs urlsplit
Python actually offers two parsing functions that differ in how they handle an obscure URL feature called path parameters. Understanding when to use each can save you from subtle bugs.
urlsplit() is similar to urlparse() but doesn't separate the rarely-used params component:
```python
from urllib.parse import urlparse, urlsplit

# urlparse separates params (text after ; in the last path segment)
url = "https://example.com/path;params?query"
parsed = urlparse(url)
print(parsed.path)    # "/path"
print(parsed.params)  # "params"

# urlsplit keeps them together (faster, usually what you want)
split = urlsplit(url)
print(split.path)     # "/path;params"
# No params attribute

# For most modern URLs, urlsplit is sufficient and slightly faster
```

The key difference is that urlparse() separates path parameters (the part after ; in the last path segment), while urlsplit() keeps them as part of the path. Since path parameters are rarely used in modern web applications, urlsplit() is usually the better choice for performance-sensitive code.
With the URL parsed into components, you'll often need to work with query parameters. Let's look at how Python handles query string parsing.
Working with Query Strings
Parsing Query Strings with parse_qs()
Query strings are where most of the action happens in URL manipulation. Python's parse_qs() function is your primary tool for extracting parameter values. One important design choice to understand: it returns lists for all values because the same key can appear multiple times in a query string (like ?tag=python&tag=web).
```python
from urllib.parse import parse_qs, urlparse

url = "https://example.com/search?q=python&tags=web&tags=api&page=1"
parsed = urlparse(url)
params = parse_qs(parsed.query)
print(params)
# {'q': ['python'], 'tags': ['web', 'api'], 'page': ['1']}

# Get a single value (first in list)
query = params.get('q', [''])[0]  # "python"

# Get all values for a key
tags = params.get('tags', [])  # ['web', 'api']

# Check whether a parameter exists
if 'page' in params:
    page = int(params['page'][0])  # 1
```

The code demonstrates parsing a query string into a dictionary where each value is a list. Notice that even single values like q are wrapped in a list. This consistency means you always use [0] to get the first value, which prevents errors when parameters might have multiple values.
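If the repeated [0] indexing feels clunky, you can centralize it in a small helper. The get_param function below is a hypothetical convenience wrapper, not part of urllib.parse:

```python
from urllib.parse import parse_qs

def get_param(query: str, key: str, default: str = "") -> str:
    """Return the first value for key, or default if absent (hypothetical helper)."""
    return parse_qs(query).get(key, [default])[0]

print(get_param("q=python&page=2", "page"))   # "2"
print(get_param("q=python", "missing", "-"))  # "-"
```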
Getting Ordered Pairs with parse_qsl()
Sometimes you need to preserve the exact order of parameters as they appeared in the URL, or handle the same key appearing multiple times differently. The parse_qsl() function returns a list of tuples instead of a dictionary.
parse_qsl() returns a list of (key, value) tuples, preserving order and duplicates:
```python
from urllib.parse import parse_qsl

query = "color=red&size=large&color=blue"
params = parse_qsl(query)
print(params)
# [('color', 'red'), ('size', 'large'), ('color', 'blue')]

# Convert to dict (loses duplicates - last value wins)
params_dict = dict(params)
# {'color': 'blue', 'size': 'large'}

# Filter by key
colors = [v for k, v in params if k == 'color']
# ['red', 'blue']
```

The tuple format preserves ordering and allows you to handle duplicate keys any way you need. Converting to a dict loses duplicates (the last value wins), so use list comprehensions when you need all values for a repeated key.
Handling Blank Values
A subtle gotcha in query string parsing is how blank values are handled. Parameters like ?flag (no value) or ?empty= (empty string) are excluded by default. This matters when processing form submissions where empty fields still carry meaning.
```python
from urllib.parse import parse_qs

# By default, blank values are excluded
query = "name=John&empty=&flag"
params = parse_qs(query)
print(params)  # {'name': ['John']}

# Include blank values with keep_blank_values=True
params = parse_qs(query, keep_blank_values=True)
print(params)  # {'name': ['John'], 'empty': [''], 'flag': ['']}

# This is important for forms where empty fields are meaningful
```

Setting keep_blank_values=True ensures that parameters without values are still included in the parsed result. This is essential when you need to distinguish between a missing parameter and one explicitly set to empty.
Now that you know how to parse query strings, let's look at the reverse operation: building them safely.
Building Query Strings with urlencode()
Building query strings by hand with string concatenation is error-prone. Special characters need encoding, and it's easy to forget an ampersand or equals sign. Python's urlencode() handles all of this safely.
```python
from urllib.parse import urlencode

# From a dictionary
params = {
    'q': 'python tutorials',
    'category': 'programming',
    'page': 1
}
query_string = urlencode(params)
print(query_string)
# "q=python+tutorials&category=programming&page=1"

# Special characters are encoded automatically
params = {'query': 'Tom & Jerry', 'filter': 'price>100'}
query_string = urlencode(params)
print(query_string)
# "query=Tom+%26+Jerry&filter=price%3E100"
```

The code shows urlencode() converting a dictionary to a properly formatted query string. Notice how the ampersand in "Tom & Jerry" is encoded as %26 and the greater-than sign becomes %3E. You don't need to think about encoding at all.
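One knob worth knowing: urlencode() uses quote_plus() internally, so spaces become +. If a server expects %20 in query values instead, the quote_via parameter swaps in a different encoder:

```python
from urllib.parse import urlencode, quote

params = {'q': 'python tutorials'}

# Default: quote_plus encoding, spaces become +
print(urlencode(params))
# "q=python+tutorials"

# quote_via=quote encodes spaces as %20 instead
print(urlencode(params, quote_via=quote))
# "q=python%20tutorials"
```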
Handling Multiple Values for Same Key
When a parameter needs multiple values (like selecting multiple checkboxes), you need to tell urlencode() to expand sequences with the doseq parameter.
```python
from urllib.parse import urlencode

# Method 1: Use doseq=True with list values
params = {
    'color': ['red', 'blue', 'green'],
    'size': 'large'
}
query_string = urlencode(params, doseq=True)
print(query_string)
# "color=red&color=blue&color=green&size=large"

# Method 2: Use a list of tuples
params = [
    ('color', 'red'),
    ('color', 'blue'),
    ('size', 'large')
]
query_string = urlencode(params)
print(query_string)
# "color=red&color=blue&size=large"
```

Both approaches produce the same result: separate parameters for each value. The doseq=True option is more convenient when working with dictionaries, while the list of tuples approach gives you explicit control over ordering.
Building Complete URLs
Often you need to construct a complete URL from scratch or modify an existing one. Python offers several approaches depending on how much control you need.
```python
from urllib.parse import urlencode, urlunparse, urlparse, ParseResult

# Method 1: String formatting (simple cases)
base = "https://api.example.com/search"
params = {'q': 'python', 'limit': 10}
url = f"{base}?{urlencode(params)}"
print(url)
# "https://api.example.com/search?q=python&limit=10"

# Method 2: Using urlunparse (more control)
components = ParseResult(
    scheme='https',
    netloc='api.example.com',
    path='/v1/users',
    params='',
    query=urlencode({'status': 'active'}),
    fragment=''
)
url = urlunparse(components)
print(url)
# "https://api.example.com/v1/users?status=active"

# Method 3: Modify a parsed URL
parsed = urlparse("https://api.example.com/users")
new_url = parsed._replace(
    query=urlencode({'page': 2, 'limit': 20})
)
print(urlunparse(new_url))
# "https://api.example.com/users?page=2&limit=20"
```

Method 1 using f-strings is simple and works for most cases. Method 2 with urlunparse() is more powerful when you need to construct URLs from all their components. Method 3 shows how to modify a parsed URL using _replace() on the named tuple.
Understanding URL encoding is crucial for building URLs correctly. Let's explore how Python handles this.
URL Encoding with quote() and quote_plus()
URL encoding (also called percent-encoding) is how you safely include special characters in URLs. Python provides two functions that differ in one important way: how they encode spaces. The table below shows which to use where.
| Function | Encodes Space As | Best For |
|---|---|---|
| quote() | %20 | Path segments, general encoding |
| quote_plus() | + | Query string values (form data) |
The key insight is that spaces can be encoded two ways: %20 (for paths) or + (for query strings). Using the wrong encoding can break your URLs.
```python
from urllib.parse import quote, quote_plus

text = "Hello World & Friends"

# quote() - spaces become %20
print(quote(text))
# "Hello%20World%20%26%20Friends"

# quote_plus() - spaces become +
print(quote_plus(text))
# "Hello+World+%26+Friends"

# For path segments, use quote with safe=""
path_segment = "my file/name.pdf"
print(quote(path_segment, safe=''))
# "my%20file%2Fname.pdf"

# By default, quote() leaves / unencoded
print(quote(path_segment))
# "my%20file/name.pdf" (/ not encoded)
```

The code demonstrates both encoding functions and their different treatment of spaces. For path segments, use quote() with safe='' to encode everything including slashes. For query string values, quote_plus() or urlencode() handles it automatically.
The safe Parameter
The safe parameter controls which characters are left unencoded. By default, quote() leaves slashes alone (useful for paths), but you can customize this behavior.
```python
from urllib.parse import quote

# safe='' encodes everything except alphanumerics and _.-~
print(quote("a/b?c=d", safe=''))
# "a%2Fb%3Fc%3Dd"

# safe='/' leaves slashes unencoded (the default)
print(quote("a/b?c=d"))
# "a/b%3Fc%3Dd"

# safe='/?' leaves both unencoded
print(quote("a/b?c=d", safe='/?'))
# "a/b?c%3Dd"

# For query values, urlencode() handles this automatically
# For path segments, explicitly use safe=''
```

Setting safe='' encodes all special characters, which is what you need for individual path segments. Leaving characters in safe is useful when you're encoding a full path and want to preserve the structure.
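Putting the safe parameter to work: to build a path from untrusted segments, encode each segment with safe='' and join them with literal slashes. The build_path function below is a hypothetical helper for illustration:

```python
from urllib.parse import quote

def build_path(*segments: str) -> str:
    """Percent-encode each segment, then join with slashes (hypothetical helper)."""
    return "/" + "/".join(quote(s, safe="") for s in segments)

print(build_path("v1", "files", "my report.pdf"))
# "/v1/files/my%20report.pdf"

# A slash inside a segment is data, not structure, so it gets encoded
print(build_path("v1", "a/b"))
# "/v1/a%2Fb"
```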
Decoding with unquote() and unquote_plus()
Decoding is the reverse operation, and the same space-encoding distinction applies. Choose the right function based on where the encoded string came from.
```python
from urllib.parse import unquote, unquote_plus

# unquote() decodes %XX sequences
encoded = "Hello%20World%20%26%20Friends"
print(unquote(encoded))
# "Hello World & Friends"

# unquote_plus() also decodes + as space
form_data = "Hello+World+%26+Friends"
print(unquote_plus(form_data))
# "Hello World & Friends"

# unquote() doesn't decode +
print(unquote(form_data))
# "Hello+World+&+Friends" (+ stays as +)

# For query strings from forms, use unquote_plus()
# For path segments and general URLs, use unquote()
```

The key difference: unquote_plus() converts + to spaces, while unquote() leaves them as plus signs. When parsing query strings from form submissions, use unquote_plus() or let parse_qs() handle it automatically.
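Both decoders also accept encoding and errors parameters, which matter when the percent-encoded bytes aren't UTF-8 (the default assumption):

```python
from urllib.parse import unquote

# 'café' percent-encoded as Latin-1 rather than UTF-8
latin = "caf%E9"

# Default UTF-8 decoding can't interpret the byte, so errors='replace'
# substitutes the Unicode replacement character
print(unquote(latin))
# "caf\ufffd"

# Passing the correct encoding recovers the original text
print(unquote(latin, encoding="latin-1"))
# "café"
```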
Resolving Relative URLs with urljoin()
When scraping websites or processing HTML, you'll encounter relative URLs like ../images/logo.png. Python's urljoin() resolves these against a base URL, following the same rules browsers use.
```python
from urllib.parse import urljoin

base = "https://example.com/blog/posts/article.html"

# Relative paths
print(urljoin(base, "other.html"))
# "https://example.com/blog/posts/other.html"

print(urljoin(base, "./images/photo.jpg"))
# "https://example.com/blog/posts/images/photo.jpg"

print(urljoin(base, "../about.html"))
# "https://example.com/blog/about.html"

# Absolute paths (from root)
print(urljoin(base, "/contact"))
# "https://example.com/contact"

# Protocol-relative
print(urljoin(base, "//cdn.example.com/script.js"))
# "https://cdn.example.com/script.js"

# Full URLs (base is ignored)
print(urljoin(base, "https://other.com/page"))
# "https://other.com/page"
```

The function handles all the edge cases: .. for parent directories, / for absolute paths from the root, and protocol-relative URLs. When the second argument is a complete URL, the base is ignored entirely.
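One gotcha worth internalizing: whether the base ends with a slash changes the result, because the last path segment of the base is treated as a file and replaced unless the path ends in /:

```python
from urllib.parse import urljoin

# No trailing slash: 'v1' is treated as a file and replaced
print(urljoin("https://api.example.com/v1", "users"))
# "https://api.example.com/users"

# Trailing slash: 'v1/' is treated as a directory and kept
print(urljoin("https://api.example.com/v1/", "users"))
# "https://api.example.com/v1/users"
```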
While urllib.parse handles URL manipulation, you'll often use the requests library for actual HTTP operations. Let's see how they work together.
Using the requests Library
The requests library is the de facto standard for HTTP in Python. It integrates seamlessly with URL handling, automatically encoding query parameters so you can pass raw values.
```python
import requests
from urllib.parse import urlparse, parse_qs

# Query parameters are encoded automatically
response = requests.get(
    "https://api.example.com/search",
    params={
        'q': 'Tom & Jerry',
        'page': 1,
        'tags': ['animation', 'classic']
    }
)
print(response.url)
# "https://api.example.com/search?q=Tom+%26+Jerry&page=1&tags=animation&tags=classic"

# Parse the URL from a response
parsed = urlparse(response.url)
params = parse_qs(parsed.query)

# Build URLs with sessions
session = requests.Session()
session.params = {'api_key': 'secret123'}  # Added to all requests
response = session.get(
    "https://api.example.com/users",
    params={'limit': 10}
)
# Combines session params with request params
```

Notice how requests handles encoding automatically when you pass the params dictionary. It also handles list values by repeating the parameter. You can still use urllib.parse to inspect or modify the resulting URL.
Let's put all these concepts together with some real-world examples you can adapt for your own projects.
Practical Examples
Building an API Client
This example shows a reusable API client class that handles URL construction, API key injection, and proper encoding. It's a pattern you'll use frequently when integrating with REST APIs.
```python
from urllib.parse import urljoin, urlencode
import requests

class APIClient:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.api_key = api_key

    def _build_url(self, endpoint: str, params: dict = None) -> str:
        """Build a full URL with query parameters."""
        # Use a relative endpoint ('search', not '/search') so urljoin
        # keeps the /v1/ prefix of the base URL
        url = urljoin(self.base_url, endpoint)
        # Always include the API key
        params = {**(params or {}), 'api_key': self.api_key}
        return f"{url}?{urlencode(params, doseq=True)}"

    def search(self, query: str, filters: dict = None) -> dict:
        """Search with optional filters."""
        params = {'q': query}
        if filters:
            params.update(filters)
        url = self._build_url('search', params)
        response = requests.get(url)
        return response.json()

# Usage
client = APIClient(
    base_url='https://api.example.com/v1/',
    api_key='your-api-key'
)
results = client.search(
    query='python tutorials',
    filters={'category': 'programming', 'sort': 'date'}
)
```

The API client encapsulates URL building logic, making it easy to add new endpoints. The _build_url method handles the details of combining base URL, path, and parameters while always including the API key. Note that endpoints are passed as relative paths ('search' rather than '/search'): a leading slash would make urljoin() discard the /v1/ prefix of the base URL.
URL Validation
Security-conscious applications need to validate URLs before following redirects or displaying content. This validator checks for common issues like missing protocols, invalid schemes, and potentially dangerous characters.
```python
from urllib.parse import urlparse
from typing import Optional

def validate_url(url: str) -> tuple[bool, Optional[str]]:
    """Validate a URL and return (is_valid, error_message)."""
    try:
        parsed = urlparse(url)
        # Must have a scheme
        if not parsed.scheme:
            return False, "Missing protocol (http:// or https://)"
        # Must be http or https
        if parsed.scheme not in ('http', 'https'):
            return False, f"Invalid protocol: {parsed.scheme}"
        # Must have a host
        if not parsed.netloc:
            return False, "Missing domain name"
        # Check for suspicious characters
        if '<' in url or '>' in url or 'javascript:' in url.lower():
            return False, "URL contains potentially dangerous characters"
        return True, None
    except Exception as e:
        return False, str(e)

# Usage
is_valid, error = validate_url("https://example.com/page")
if not is_valid:
    print(f"Invalid URL: {error}")

# Test cases
print(validate_url("https://example.com"))   # (True, None)
print(validate_url("example.com"))           # (False, "Missing protocol...")
print(validate_url("javascript:alert(1)"))   # (False, "Invalid protocol...")
```

The validator returns a tuple with a boolean and an optional error message, making it easy to provide user-friendly feedback. Note that urlparse() is permissive and won't catch all malformed URLs, so additional validation logic is necessary.
URL Normalizer
When comparing or deduplicating URLs, you need to normalize them first. Different URLs can point to the same resource: uppercase vs. lowercase hosts, default ports, trailing slashes, and parameter ordering all create variations. This normalizer addresses them all.
```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """
    Normalize a URL for comparison and deduplication.
    - Lowercase scheme and host
    - Remove default ports
    - Sort query parameters
    - Remove trailing slash from path (optional)
    - Remove fragment
    """
    parsed = urlparse(url)

    # Lowercase scheme and host
    scheme = parsed.scheme.lower()
    netloc = parsed.hostname.lower() if parsed.hostname else ''

    # Include the port only if non-default
    if parsed.port:
        default_ports = {'http': 80, 'https': 443}
        if parsed.port != default_ports.get(scheme):
            netloc = f"{netloc}:{parsed.port}"

    # Normalize the path
    path = parsed.path or '/'
    if path != '/' and path.endswith('/'):
        path = path.rstrip('/')

    # Sort query parameters
    params = parse_qsl(parsed.query)
    params.sort(key=lambda x: (x[0], x[1]))
    query = urlencode(params)

    # Rebuild without the fragment
    return urlunparse((scheme, netloc, path, '', query, ''))

# Usage
urls = [
    "HTTPS://Example.COM/path/?b=2&a=1#section",
    "https://example.com:443/path?a=1&b=2",
    "https://example.com/path/?a=1&b=2"
]
normalized = [normalize_url(url) for url in urls]
# All become: "https://example.com/path?a=1&b=2"
print(len(set(normalized)))  # 1 (all duplicates)
```

The normalizer lowercases the scheme and host, removes default ports, normalizes the path, sorts query parameters, and strips fragments. After normalization, all three example URLs become identical, making deduplication straightforward.
Common Patterns
Here are some utility functions you'll find yourself writing repeatedly. Feel free to copy these into your projects.
Extract Domain from URL
Extracting just the domain or root domain is useful for grouping URLs, security checks, and analytics.
```python
from urllib.parse import urlparse

def get_domain(url: str) -> str:
    """Extract the domain from a URL."""
    parsed = urlparse(url)
    return parsed.hostname or ''

def get_root_domain(url: str) -> str:
    """Extract the root domain (without subdomains)."""
    hostname = get_domain(url)
    parts = hostname.split('.')
    # Naive: keeps the last two labels, which is wrong for
    # multi-part suffixes like 'example.co.uk'
    if len(parts) >= 2:
        return '.'.join(parts[-2:])
    return hostname

# Usage
print(get_domain("https://docs.api.example.com/page"))
# "docs.api.example.com"
print(get_root_domain("https://docs.api.example.com/page"))
# "example.com"
```

The get_root_domain function is simplified and may not handle all cases correctly (like example.co.uk, where it would return co.uk). For production use, consider the publicsuffix2 library, which knows about all valid public suffixes.
Add Parameters to Existing URL
Adding parameters to an existing URL while preserving existing ones is a common operation. This function handles merging parameters correctly.
```python
from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

def add_params(url: str, new_params: dict) -> str:
    """Add query parameters to an existing URL."""
    parsed = urlparse(url)

    # Parse existing params
    params = parse_qs(parsed.query)

    # Add new params (convert to list format)
    for key, value in new_params.items():
        if isinstance(value, list):
            params[key] = value
        else:
            params[key] = [value]

    # Rebuild the URL
    new_query = urlencode(params, doseq=True)
    return urlunparse(parsed._replace(query=new_query))

# Usage
url = "https://example.com/search?q=python"
new_url = add_params(url, {'page': 2, 'sort': 'date'})
print(new_url)
# "https://example.com/search?q=python&page=2&sort=date"
```

The function parses the existing URL, extracts current parameters, merges in the new ones, rebuilds the query string, and reconstructs the complete URL. This pattern ensures existing parameters aren't lost when adding new ones.
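The inverse operation, stripping parameters such as tracking codes, follows the same parse, filter, rebuild pattern. The remove_params function below is a hypothetical companion helper, not a standard library function:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def remove_params(url: str, keys: set) -> str:
    """Drop the named query parameters from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Keep only the pairs whose key is not in the removal set
    kept = [(k, v) for k, v in parse_qsl(parsed.query) if k not in keys]
    return urlunparse(parsed._replace(query=urlencode(kept)))

# Usage
url = "https://example.com/search?q=python&utm_source=mail"
print(remove_params(url, {"utm_source"}))
# "https://example.com/search?q=python"
```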