What is path traversal?

Path traversal is a vulnerability where attackers manipulate file paths to access files outside intended directories, often using sequences like `../` or absolute paths like `/etc/passwd`.

How do you prevent path traversal in Python?

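Resolve the path with `os.path.realpath()`, which makes it absolute and follows symlinks, then verify that the result either equals your allowed base directory or starts with that directory plus the path separator. `os.path.commonpath()` offers an equivalent check that compares path components instead of raw strings.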

What CWE is path traversal?

Path traversal is classified as CWE-22 (Improper Limitation of a Pathname to a Restricted Directory).

Is checking for `../` enough to prevent path traversal?

No. Attackers can bypass simple string checks using absolute paths (`/etc/passwd`), URL encoding, or symlinks. Always resolve to the real path and validate against an allowed directory.

Can static analysis detect path traversal?

Yes. Static analysis tools can trace data flow from user inputs to filesystem functions and flag cases where validation is missing before calls like `os.path.exists()` or `open()`.
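
To make the source-to-sink idea concrete, here is a deliberately tiny sketch built on Python's standard `ast` module. It only flags sink calls that receive a `request.*` expression directly as an argument. The sink list and the function names are assumptions made for illustration; this is not how any particular commercial tool works.

```python
import ast

# Assumed sinks for this toy example; real tools model many more.
SINKS = {"open", "os.path.exists", "load_dataset"}


def dotted_name(node: ast.AST) -> str:
    """Return 'a.b.c' for Name/Attribute chains, or '' for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def find_tainted_calls(source: str, tainted_root: str = "request"):
    """Return (line, sink) pairs for sink calls fed a `request.*` value directly."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        sink = dotted_name(node.func)
        if sink not in SINKS:
            continue
        # Check positional and keyword arguments alike.
        args = list(node.args) + [kw.value for kw in node.keywords]
        for arg in args:
            if dotted_name(arg).split(".")[0] == tainted_root:
                findings.append((node.lineno, sink))
                break
    return findings
```

This sketch misses indirect flows (for example, a request field assigned to a local variable first) and has no notion of a validation step that clears the taint. Production analyzers track data flow across assignments and function calls to cover both.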

Path Traversal in Python os.path — Fix

Introduction

The TRL (Transformer Reinforcement Learning) backend in LocalAI handles fine-tuning requests via gRPC, but a critical flaw in backend/python/trl/backend.py at line 310 created a dangerous attack vector. The _do_training method accepted a dataset_source parameter from incoming StartFineTune gRPC requests and passed it directly to os.path.exists() and load_dataset() without any validation.

This meant an attacker with network access to the gRPC backend on port 50051 could specify paths like /etc/passwd or /proc/self/environ as their "dataset source" and trick the server into reading—and potentially leaking—sensitive system files. For a machine learning backend that's designed to load training data, this represents a complete breakdown of the security boundary between user-supplied input and the underlying filesystem.

The Vulnerability Explained

The Dangerous Code Pattern

The vulnerable code in _do_training looked like this:

dataset_split = request.dataset_split or "train"
if os.path.exists(request.dataset_source):
    if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
        dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)
    elif request.dataset_source.endswith('.csv'):
        # ... load CSV dataset

The problem is immediately apparent: request.dataset_source comes directly from the gRPC request with zero validation. The code trusts that the client will only send legitimate dataset paths, but a malicious actor can send anything.

Attack Scenario

Here's how an attacker could exploit this:

1. Reconnaissance: The attacker discovers that the TRL gRPC backend is exposed on port 50051, either directly or through a misconfigured network.
2. Crafting the Request: They send a StartFineTune gRPC request with:
   dataset_source: "/proc/self/environ"
   dataset_split: "train"
   model_name: "sshleifer/tiny-gpt2"
   output_dir: "/tmp/output"
3. File Access: The backend calls os.path.exists("/proc/self/environ"), which returns True. The code then attempts to load this "dataset," potentially exposing environment variables containing API keys, database credentials, or other secrets.
4. Escalation: With access to /etc/passwd, the attacker learns system usernames. With /proc/self/environ, they might find AWS_SECRET_ACCESS_KEY, DATABASE_URL, or other sensitive environment variables that the ML backend uses.

Why This Is Critical

This isn't just a theoretical risk. The TRL backend is a Python service that likely runs with access to:
- Model weights and training data (potentially proprietary)
- API tokens for HuggingFace and other services
- Cloud credentials for distributed training
- Database connections for experiment tracking

A path traversal here gives attackers a window into all of these.

The Fix

The fix implements a directory containment check—a security pattern that ensures all file operations stay within a designated safe directory.

Before (Vulnerable)

dataset_split = request.dataset_split or "train"
if os.path.exists(request.dataset_source):
    if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
        dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)

After (Fixed)

dataset_split = request.dataset_split or "train"
if os.path.exists(request.dataset_source):
    _allowed_dir = os.path.realpath(os.path.abspath(os.environ.get("LOCALAI_DATASET_DIR", os.getcwd())))
    _real_path = os.path.realpath(os.path.abspath(request.dataset_source))
    if not (_real_path == _allowed_dir or _real_path.startswith(_allowed_dir + os.sep)):
        raise ValueError("Dataset source path is outside the allowed directory")
    if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
        dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)

How the Fix Works

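- os.path.abspath(): Converts a relative path to an absolute path based on the current working directory. Because os.path.realpath() also returns an absolute path, this call is redundant but harmless.
- os.path.realpath(): Resolves all symbolic links to get the true canonical path, which prevents symlink-based bypasses.
- Boundary Check: The code verifies that the resolved path either equals the allowed directory or starts with the allowed directory followed by a path separator (os.sep). The separator is crucial: without it, /allowed/dir_malicious would pass a check for /allowed/dir.
- Configurable Allowed Directory: The LOCALAI_DATASET_DIR environment variable lets operators define where datasets should live, defaulting to the current working directory.
- Remaining gap: In the code shown, the containment check sits inside the os.path.exists() branch, and load_dataset() still receives the original request.dataset_source. Running the check before os.path.exists() and passing _real_path to load_dataset() would remove the existence probe and ensure that the file loaded is the file that was checked.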
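
The two bypasses that the boundary check guards against can be reproduced in a throwaway directory. The sketch below is illustrative only: is_contained mirrors the pattern from the fix, while naive_prefix_check, demo, and all file names are hypothetical. It assumes a POSIX system where os.symlink needs no special privileges.

```python
import os
import tempfile


def is_contained(base_dir: str, candidate: str) -> bool:
    """True if candidate resolves to base_dir or to a path beneath it."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(candidate)
    return target == base or target.startswith(base + os.sep)


def naive_prefix_check(base_dir: str, candidate: str) -> bool:
    """Flawed variant: no symlink resolution and no separator in the prefix."""
    return os.path.abspath(candidate).startswith(os.path.abspath(base_dir))


def demo() -> dict:
    """Build a sibling directory and a symlink, then run both checks on them."""
    with tempfile.TemporaryDirectory() as root:
        allowed = os.path.join(root, "dir")
        sibling = os.path.join(root, "dir_malicious")
        os.makedirs(allowed)
        os.makedirs(sibling)

        secret = os.path.join(sibling, "secret.txt")
        with open(secret, "w") as fh:
            fh.write("not a dataset")

        # A symlink inside the allowed directory that points outside it.
        link = os.path.join(allowed, "link.json")
        os.symlink(secret, link)

        return {
            "sibling_naive": naive_prefix_check(allowed, secret),
            "sibling_contained": is_contained(allowed, secret),
            "symlink_naive": naive_prefix_check(allowed, link),
            "symlink_contained": is_contained(allowed, link),
        }


if __name__ == "__main__":
    print(demo())
```

Under those assumptions, both naive results are True (the bypass succeeds) and both contained results are False (the path is rejected).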

The Output Path Fix

The same vulnerability existed in the ExportModel method for output paths. The fix applies identical logic:

_allowed_output_dir = os.path.realpath(os.path.abspath(os.environ.get("LOCALAI_OUTPUT_DIR", os.getcwd())))
_real_output_path = os.path.realpath(os.path.abspath(output_path))
if not (_real_output_path == _allowed_output_dir or _real_output_path.startswith(_allowed_output_dir + os.sep)):
    raise ValueError("Output path is outside the allowed directory")
output_path = _real_output_path

This prevents attackers from writing model exports to arbitrary locations like /etc/cron.d/ or overwriting system files.

Prevention & Best Practices

1. Always Validate File Paths from User Input

Never trust paths from external sources. Always:
- Resolve to canonical paths with os.path.realpath()
- Check against an allowed directory boundary
- Use the path separator in your prefix check

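import os

def is_safe_path(base_dir: str, user_path: str) -> bool:
    """Check if user_path is safely within base_dir."""
    # realpath() resolves symlinks, so both sides are compared in canonical form
    base = os.path.realpath(os.path.abspath(base_dir))
    target = os.path.realpath(os.path.abspath(user_path))
    # The trailing separator stops /allowed/dir_malicious matching /allowed/dir
    return target == base or target.startswith(base + os.sep)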
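
If you would rather not hand-roll the string prefix logic, the same containment test can be written with os.path.commonpath(), which compares whole path components. This is an alternative sketch, not the code from the fix, and the function name is hypothetical.

```python
import os


def is_safe_path_commonpath(base_dir: str, user_path: str) -> bool:
    """Containment check that compares path components, not string prefixes."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(user_path)
    try:
        # The longest common sub-path equals base only if target is inside it.
        return os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised, for example, when the paths are on different Windows drives.
        return False
```

As with is_safe_path, a relative user_path is resolved against the current working directory, not against base_dir, so join it onto the base first if that is the behaviour you want.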

2. Use Allowlists for Dataset Sources

For ML backends, consider maintaining an allowlist of valid dataset identifiers:

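ALLOWED_DATASETS = {"imdb", "squad", "glue", "custom_dataset_v1"}

if request.dataset_source in ALLOWED_DATASETS:
    dataset = load_dataset(request.dataset_source, split=dataset_split)
elif is_safe_path(DATASET_DIR, request.dataset_source):
    # Load from a local file inside DATASET_DIR
    dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)
else:
    raise ValueError("Dataset source is not allowed")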
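
A stricter variant of the allowlist idea is to let clients send only an identifier and keep the identifier-to-path mapping on the server, so no client-supplied string ever reaches the filesystem as a path. The registry contents, the function name, and the directory layout below are all hypothetical.

```python
import os

# Hypothetical registry: identifiers and relative paths are illustrative only.
DATASET_REGISTRY = {
    "custom_dataset_v1": os.path.join("custom", "v1", "train.jsonl"),
}


def resolve_dataset(dataset_id: str, base_dir: str) -> str:
    """Map a known identifier to a server-chosen path under base_dir."""
    try:
        relative = DATASET_REGISTRY[dataset_id]
    except KeyError:
        raise ValueError(f"Unknown dataset identifier: {dataset_id!r}") from None
    return os.path.join(os.path.realpath(base_dir), relative)
```

Because the path comes from the registry and not from the request, a value such as /etc/passwd is simply an unknown identifier and is rejected.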

3. Principle of Least Privilege

Run ML backends with minimal filesystem permissions. Use containerization or chroot to limit what the process can access even if path validation fails.

4. Input Validation at the API Boundary

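Validate inputs as early as possible, ideally in the gRPC service definition or a middleware layer. A syntactic pre-filter like the one below rejects obviously hostile input cheaply, but it does not resolve symlinks, so it complements the containment check and does not replace it:

import os

def validate_dataset_source(source: str) -> bool:
    """Cheap syntactic pre-filter; still run the containment check afterwards."""
    # Reject absolute paths
    if os.path.isabs(source):
        return False
    # Reject ".." path components (names that merely contain ".." are fine)
    if ".." in source.replace("\\", "/").split("/"):
        return False
    return True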
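
One way to combine early validation with the containment check is a single helper that either returns the canonical path for downstream code to use or raises an error the handler can turn into a rejected request. The sketch below is an assumption-laden illustration, not part of the LocalAI fix: the function name is hypothetical, and unlike the fix it rejects absolute paths outright.

```python
import os


def resolve_dataset_source(source: str, base_dir: str) -> str:
    """Validate a client-supplied relative path and return its canonical form.

    Raises ValueError on any rejected input so the caller can translate it
    into an error response.
    """
    if not source or "\x00" in source:
        raise ValueError("Dataset source is empty or contains a NUL byte")
    if os.path.isabs(source):
        raise ValueError("Dataset source must be a relative path")

    base = os.path.realpath(base_dir)
    # Join first so the path is interpreted relative to base, not the CWD.
    target = os.path.realpath(os.path.join(base, source))
    # os.path.join(base, "") appends exactly one trailing separator.
    if target != base and not target.startswith(os.path.join(base, "")):
        raise ValueError("Dataset source path is outside the allowed directory")
    return target
```

Returning the resolved path and using only that value afterwards keeps the checked path and the opened path identical.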

Key Takeaways

- The _do_training method's direct use of request.dataset_source in os.path.exists() created a critical file read vulnerability. Always validate paths before filesystem operations.
- gRPC services are network-exposed attack surfaces. Treat all request parameters as potentially malicious, even in internal ML pipelines.
- os.path.realpath() plus boundary checking is the correct fix pattern. Simple string filtering for ../ is insufficient and can be bypassed.
- Both input paths (dataset_source) and output paths (output_path) need validation. The fix addressed both vectors in _do_training and ExportModel.
- Environment-configurable allowed directories (LOCALAI_DATASET_DIR, LOCALAI_OUTPUT_DIR) enable secure deployment flexibility without hardcoding paths.

How Orbis AppSec Detected This

Source: The dataset_source field from incoming StartFineTune gRPC requests in backend/python/trl/backend.py
Sink: os.path.exists(request.dataset_source) at line 310 and subsequent load_dataset() calls
Missing control: No path validation or directory containment check before filesystem access
CWE: CWE-22 (Improper Limitation of a Pathname to a Restricted Directory)
Fix: Added os.path.realpath() resolution and boundary validation against LOCALAI_DATASET_DIR before load_dataset() reads the file

Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix.

Conclusion

Path traversal vulnerabilities remain among the most common and dangerous security issues in web services and APIs. This TRL backend vulnerability demonstrates how easily they can slip into ML infrastructure code, where the focus is often on model performance and security hardening comes second.

The fix is straightforward but must be applied correctly: resolve paths to their canonical form, check against an allowed directory boundary, and include the path separator in your prefix check. These three steps—implemented consistently across all file operations—form a robust defense against path traversal attacks.

For teams building ML pipelines and fine-tuning services, remember that your training infrastructure is a high-value target. Attackers who compromise these systems gain access to models, training data, and the credentials that connect to your broader infrastructure.

CWE: CWE-22
Fix: Implement directory containment check using `os.path.realpath()` against allowed directory boundary
Risk: Attackers can read sensitive system files (/etc/passwd, /proc/self/environ) via gRPC requests
Language: Python
Root cause: User-controlled `dataset_source` passed directly to `os.path.exists()` without path validation
Vulnerability: Path Traversal / Arbitrary File Read
