How path traversal in open() happens in Python and how to fix it
Summary
A high-severity path traversal vulnerability was discovered in tool/update-doc.py, where user-controlled input was passed directly to Python's open() function without sanitization. This flaw could allow an attacker to read arbitrary files on the server by manipulating the file path. The fix ensures that file paths are validated and restricted to an intended directory before being opened.
Introduction
The tool/update-doc.py script handles documentation updates — a utility role that might seem low-risk at first glance. But a flaw in how it processes file paths created a serious security hole: user-controlled input flowed directly into a call to open() with no sanitization, no boundary checking, and no path normalization.
This is a textbook path traversal vulnerability. The script trusts the caller to provide a safe file path, but an attacker who controls that input can supply something like ../../../../etc/passwd and cause the application to open files it was never meant to touch.
For developers building documentation tools, CLI utilities, or any script that accepts filenames from external sources, this pattern is surprisingly easy to introduce and surprisingly dangerous to leave unaddressed.
The Vulnerability Explained
What went wrong in update-doc.py
The vulnerable pattern looks like this:
# Vulnerable code — user-controlled `doc_path` passed directly to open()
doc_path = request.args.get("doc") # or sys.argv, or any external source
with open(doc_path, "r") as f:
    content = f.read()
The problem is straightforward: the value of doc_path comes from user-controlled input and is used directly as the argument to open(). Python's open() function will happily accept any valid filesystem path — including ones that traverse parent directories using ../ sequences.
There is no call to os.path.realpath(), no check that the resolved path starts with a trusted base directory, and no filtering of dangerous path components. The script simply trusts the input.
How an attacker exploits this
Imagine the script is intended to open documentation files from a directory like /app/docs/. An attacker who can influence the doc_path value could supply:
../../etc/passwd
Which resolves to:
/app/docs/../../etc/passwd → /etc/passwd
Python's open() resolves this path transparently and returns the contents of /etc/passwd. On a real system, this technique can be used to read:
- /etc/passwd: system user accounts
- ~/.ssh/id_rsa: private SSH keys
- Application config files: database credentials, API keys, secrets
- Other source files: internal business logic, hardcoded tokens
The attacker doesn't need authentication, special privileges, or a complex exploit chain. They just need to control the string passed to open().
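The resolution step described above can be reproduced without touching the filesystem. This is a minimal sketch that uses posixpath (the POSIX flavour of os.path) so the result is the same on any platform; /app/docs is the illustrative base directory from this article, and the variable names are hypothetical.

```python
import posixpath

BASE_DIR = "/app/docs"  # illustrative base directory from the example above

# A relative payload climbs out of the base directory once ".." is collapsed.
joined = posixpath.join(BASE_DIR, "../../etc/passwd")
collapsed = posixpath.normpath(joined)
print(joined)     # /app/docs/../../etc/passwd
print(collapsed)  # /etc/passwd

# An absolute payload needs no ".." at all: join() discards BASE_DIR entirely.
absolute = posixpath.join(BASE_DIR, "/etc/passwd")
print(absolute)   # /etc/passwd
```

The second case shows why simply prefixing the input with a base directory is not a defense on its own.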
Why update-doc.py is a realistic target
Documentation update scripts often run with elevated privileges (they need to write to the filesystem), and they frequently accept filenames from command-line arguments, web requests, or configuration files — all of which can be attacker-influenced. The combination of elevated access and unsanitized input makes this a high-value target.
The Fix
What needs to change
The fix requires two things:
1. Normalize the path — resolve .. sequences and symlinks to get the true absolute path.
2. Enforce a boundary — confirm the resolved path is inside the intended base directory before opening it.
Before (vulnerable)
# No sanitization — attacker controls doc_path entirely
doc_path = get_user_input()
with open(doc_path, "r") as f:
    content = f.read()
After (fixed)
import os
BASE_DIR = os.path.realpath("/app/docs")
def safe_open_doc(user_input):
    # Resolve the full, normalized absolute path
    requested_path = os.path.realpath(os.path.join(BASE_DIR, user_input))
    # Enforce that the resolved path is within the allowed base directory
    if not requested_path.startswith(BASE_DIR + os.sep):
        raise ValueError(f"Access denied: path traversal detected in '{user_input}'")
    with open(requested_path, "r") as f:
        return f.read()
Why this fix works
- os.path.realpath() resolves all ../ sequences, symlinks, and redundant separators, giving you the true absolute path on the filesystem. Double-dot tricks do not survive this step. Note that realpath() does not decode URL-encoded sequences such as %2e%2e; it treats them as literal file names, so any decoding must happen before this check, never after it.
- startswith(BASE_DIR + os.sep) ensures the resolved path is genuinely inside the base directory, not just a string that happens to start with the same prefix. The + os.sep prevents a bypass where /app/docs_evil would incorrectly match /app/docs.
- The check happens before open() is called, so the file is never opened if the path is out of bounds.
pathlib offers an equally clean and Pythonic approach (the version below needs Python 3.9+):
from pathlib import Path
BASE_DIR = Path("/app/docs").resolve()
def safe_open_doc(user_input):
    requested_path = (BASE_DIR / user_input).resolve()
    if not requested_path.is_relative_to(BASE_DIR):
        raise ValueError("Path traversal detected")
    return requested_path.read_text()
Path.is_relative_to() (Python 3.9+) makes the boundary check explicit and readable.
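On interpreters older than 3.9, where is_relative_to() is unavailable, the same boundary check can be written with Path.relative_to(), which raises ValueError when the path lies outside the base. A minimal sketch; the helper name is_within and the temporary directory layout are assumptions made for illustration:

```python
import tempfile
from pathlib import Path


def is_within(base: Path, candidate: Path) -> bool:
    """Return True if candidate resolves to base or to a path below it."""
    try:
        candidate.resolve().relative_to(base.resolve())
    except ValueError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "docs"
    base.mkdir()
    (base / "intro.md").write_text("hello")

    ok = is_within(base, base / "intro.md")
    escaped = is_within(base, base / ".." / "secret.txt")
    print(ok, escaped)  # True False
```

Both sides are resolved before comparison, so a base directory that is itself reached through a symlink does not cause false rejections.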
Prevention & Best Practices
1. Always normalize before validating
Raw string comparisons on file paths are unreliable. /app/docs/../../etc/passwd and /etc/passwd refer to the same file, but a string comparison won't catch that. Always call os.path.realpath() or Path.resolve() first.
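The difference between checking before and after normalization can be seen in a few lines. This sketch uses posixpath.normpath() so it runs without touching the filesystem; real code should prefer os.path.realpath() or Path.resolve(), because normpath() does not resolve symlinks. The payload is a hypothetical attacker input.

```python
import posixpath

BASE_DIR = "/app/docs"
payload = "/app/docs/../../etc/passwd"  # hypothetical attacker-supplied path

# Checking the raw string: the payload literally starts with the base directory.
raw_check = payload.startswith(BASE_DIR + "/")

# Checking after collapsing "..": the true target is outside the base directory.
normalized = posixpath.normpath(payload)
normalized_check = normalized.startswith(BASE_DIR + "/")

print(raw_check, normalized_check)  # True False
```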
2. Define an explicit allow-list of base directories
Don't check what a path doesn't contain (e.g., filtering out ../). Instead, check what it does resolve to. Blocklist approaches are fragile — there are many encoding tricks (%2e%2e, null bytes on older systems, Unicode normalization) that can bypass them.
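A short sketch of why filtering is fragile: the hypothetical filter below removes every "../" in a single pass, and a payload built from nested dots reassembles the traversal sequence once the filter has run.

```python
def strip_dot_dot(path: str) -> str:
    """Naive blocklist filter (hypothetical): delete each "../" in one pass."""
    return path.replace("../", "")


payload = "....//....//etc/passwd"
filtered = strip_dot_dot(payload)
print(filtered)  # ../../etc/passwd
```

Checking where the path resolves, rather than what it contains, sidesteps this whole class of bypass.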
3. Use the principle of least privilege
If update-doc.py only needs to read files from /app/docs/, run it with a filesystem user that only has read access to that directory. Defense in depth means even a bypassed path check doesn't expose the entire filesystem.
4. Validate at the entry point
Sanitize file path inputs as close to the source as possible — when reading from sys.argv, an HTTP request, or a config file. Don't rely on downstream code to catch bad input.
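For a command-line tool, one way to validate at the entry point is an argparse type function, so that an out-of-bounds path is rejected during argument parsing. This is a sketch under assumptions: the make_parser helper, the doc argument name, and the temporary base directory are hypothetical and not taken from update-doc.py.

```python
import argparse
import os
import tempfile


def make_parser(base_dir: str) -> argparse.ArgumentParser:
    """Build a parser whose 'doc' argument must resolve inside base_dir."""
    base = os.path.realpath(base_dir)

    def doc_path(value: str) -> str:
        resolved = os.path.realpath(os.path.join(base, value))
        if not resolved.startswith(base + os.sep):
            raise argparse.ArgumentTypeError(f"{value!r} is outside {base}")
        return resolved

    parser = argparse.ArgumentParser(prog="update-doc")
    parser.add_argument("doc", type=doc_path, help="path relative to the docs directory")
    return parser


with tempfile.TemporaryDirectory() as tmp:
    parser = make_parser(tmp)
    accepted = parser.parse_args(["guide.md"]).doc
    try:
        parser.parse_args(["../../etc/passwd"])
        rejected = False
    except SystemExit:
        rejected = True  # argparse prints the error and exits with status 2

print(accepted.endswith("guide.md"), rejected)  # True True
```

Because the validated, resolved path is what the parser returns, downstream code never sees the raw input.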
5. Use static analysis to catch taint flows
Tools that perform taint analysis — tracking user-controlled data from source to sink — are especially effective at catching path traversal. Semgrep, Bandit, and Orbis AppSec can all identify when unsanitized input reaches open().
Relevant standards
- CWE-22: Improper Limitation of a Pathname to a Restricted Directory ("Path Traversal")
- OWASP Top 10 A01:2021 — Broken Access Control (path traversal is a key subcategory)
- OWASP Path Traversal Cheat Sheet: comprehensive guidance on prevention
Key Takeaways
- open(user_input) in update-doc.py is a direct path traversal sink: any user-controlled string reaching this call without normalization and boundary checking is exploitable.
- String filtering for ../ is not a safe mitigation: attackers can encode, double-encode, or use OS-specific tricks to bypass blocklists. Use realpath() + prefix check instead.
- Documentation and utility scripts are not low-risk: tools like update-doc.py often run with elevated permissions and accept filenames as input, making them attractive targets.
- The startswith(BASE_DIR + os.sep) pattern is critical: omitting + os.sep creates a bypass where sibling directories with similar names pass the check incorrectly.
- pathlib.Path.is_relative_to() is the modern, readable way to enforce directory boundaries in Python 3.9+ and should be preferred in new code.
How Orbis AppSec Detected This
- Source: User-controlled input (e.g., command-line argument, request parameter, or external config value) providing the file path in tool/update-doc.py
- Sink: open(doc_path, ...) called with the unsanitized user-controlled path, allowing arbitrary file reads
- Missing control: No call to os.path.realpath() or equivalent normalization; no check that the resolved path falls within an intended base directory before the file is opened
- CWE: CWE-22, Improper Limitation of a Pathname to a Restricted Directory (Path Traversal)
- Fix: Normalize the input path with os.path.realpath() and verify it resolves within the allowed base directory before passing it to open()
Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix. Try Orbis AppSec on your repositories to find and fix issues like this automatically.
Conclusion
Path traversal in Python's open() is one of those vulnerabilities that looks simple but carries serious consequences. In tool/update-doc.py, the absence of a single normalization-and-boundary-check pattern was enough to expose every file the process could read to anyone who could influence the file path argument.
The fix is not complex — os.path.realpath() combined with a startswith() check, or pathlib's is_relative_to(), closes the door entirely. The lesson for developers is to treat every file path that originates from outside your code as untrusted, normalize it unconditionally, and verify it resolves where you expect before acting on it.
Security in file handling isn't about trusting your users. It's about building code that stays safe regardless of what input it receives.
References
- CWE-22: Improper Limitation of a Pathname to a Restricted Directory
- OWASP Path Traversal Cheat Sheet
- Python os.path.realpath() documentation
- Python pathlib.Path.resolve() documentation
- Semgrep rules for path traversal