Severity: critical | 6 min read

How path traversal happens in Python os.path and how to fix it

A critical path traversal vulnerability in the TRL backend allowed attackers to read arbitrary system files like `/etc/passwd` and `/proc/self/environ` through the gRPC fine-tuning API. The `_do_training` method passed user-controlled `dataset_source` directly to `os.path.exists()` and `load_dataset()` without validation. The fix implements strict directory containment checks using `os.path.realpath()` to ensure all file operations stay within allowed directories.

By Orbis AppSec
Published June 22, 2026 | Reviewed June 22, 2026

Answer Summary

Path traversal (CWE-22) in Python occurs when user input is passed directly to filesystem functions like `os.path.exists()` without validation. In this TRL backend vulnerability, the `_do_training` method accepted arbitrary paths via gRPC requests, enabling attackers to read sensitive files. The fix validates paths using `os.path.realpath()` and `os.path.abspath()` to ensure they remain within a configurable allowed directory (`LOCALAI_DATASET_DIR`), rejecting any path that escapes the boundary.

Vulnerability at a Glance

CWE: CWE-22
Fix: Implement directory containment check using `os.path.realpath()` against allowed directory boundary
Risk: Attackers can read sensitive system files (/etc/passwd, /proc/self/environ) via gRPC requests
Language: Python
Root cause: User-controlled `dataset_source` passed directly to `os.path.exists()` without path validation
Vulnerability: Path Traversal / Arbitrary File Read

Introduction

The TRL (Transformer Reinforcement Learning) backend in LocalAI handles fine-tuning requests via gRPC, but a critical flaw in backend/python/trl/backend.py at line 310 created a dangerous attack vector. The _do_training method accepted a dataset_source parameter from incoming StartFineTune gRPC requests and passed it directly to os.path.exists() and load_dataset() without any validation.

This meant an attacker with network access to the gRPC backend on port 50051 could specify paths like /etc/passwd or /proc/self/environ as their "dataset source" and trick the server into reading—and potentially leaking—sensitive system files. For a machine learning backend that's designed to load training data, this represents a complete breakdown of the security boundary between user-supplied input and the underlying filesystem.

The Vulnerability Explained

The Dangerous Code Pattern

The vulnerable code in _do_training looked like this:

dataset_split = request.dataset_split or "train"
if os.path.exists(request.dataset_source):
    if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
        dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)
    elif request.dataset_source.endswith('.csv'):
        # ... load CSV dataset

The problem is immediately apparent: request.dataset_source comes directly from the gRPC request with zero validation. The code trusts that the client will only send legitimate dataset paths, but a malicious actor can send anything.
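
To make the gap concrete, here is a minimal sketch that reproduces the pattern against a throwaway directory tree. It is not the backend's code: `naive_lookup`, the directory layout, and the file names are all hypothetical stand-ins chosen for illustration.

```python
import os
import tempfile

def naive_lookup(dataset_source: str) -> bool:
    """Mimics the vulnerable check: trusts the caller-supplied path as-is."""
    return os.path.exists(dataset_source)

# Throwaway layout: <root>/datasets/train.json and <root>/secret.env
root = tempfile.mkdtemp()
datasets_dir = os.path.join(root, "datasets")
os.makedirs(datasets_dir)
with open(os.path.join(datasets_dir, "train.json"), "w") as f:
    f.write('{"text": "hello"}\n')
secret_path = os.path.join(root, "secret.env")
with open(secret_path, "w") as f:
    f.write("API_TOKEN=not-a-real-token\n")

legit = os.path.join(datasets_dir, "train.json")
traversal = os.path.join(datasets_dir, "..", "secret.env")  # relative escape
absolute = secret_path                                       # absolute path

# The legitimate path and both hostile paths are treated identically.
for candidate in (legit, traversal, absolute):
    print(candidate, "->", naive_lookup(candidate))
```

All three candidates pass the check, which is why an existence test on its own says nothing about whether a path is one the service should be touching.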

Attack Scenario

Here's how an attacker could exploit this:

  1. Reconnaissance: The attacker discovers the TRL gRPC backend is exposed on port 50051 (either directly or through a misconfigured network)

  2. Crafting the Request: They send a StartFineTune gRPC request with:
    dataset_source: "/proc/self/environ"
    dataset_split: "train"
    model_name: "sshleifer/tiny-gpt2"
    output_dir: "/tmp/output"

  3. File Access: The backend calls os.path.exists("/proc/self/environ"), which returns True. The code then attempts to load this "dataset," potentially exposing environment variables containing API keys, database credentials, or other secrets.

  4. Escalation: With access to /etc/passwd, the attacker learns system usernames. With /proc/self/environ, they might find AWS_SECRET_ACCESS_KEY, DATABASE_URL, or other sensitive environment variables that the ML backend uses.

Why This Is Critical

This isn't just a theoretical risk. The TRL backend is a Python service that likely runs with access to:
- Model weights and training data (potentially proprietary)
- API tokens for HuggingFace and other services
- Cloud credentials for distributed training
- Database connections for experiment tracking

A path traversal here gives attackers a window into all of these.

The Fix

The fix implements a directory containment check—a security pattern that ensures all file operations stay within a designated safe directory.

Before (Vulnerable)

dataset_split = request.dataset_split or "train"
if os.path.exists(request.dataset_source):
    if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
        dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)

After (Fixed)

dataset_split = request.dataset_split or "train"
if os.path.exists(request.dataset_source):
    _allowed_dir = os.path.realpath(os.path.abspath(os.environ.get("LOCALAI_DATASET_DIR", os.getcwd())))
    _real_path = os.path.realpath(os.path.abspath(request.dataset_source))
    if not (_real_path == _allowed_dir or _real_path.startswith(_allowed_dir + os.sep)):
        raise ValueError("Dataset source path is outside the allowed directory")
    if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
        dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)

How the Fix Works

  1. os.path.abspath(): Converts any relative path to an absolute path based on the current working directory, and collapses . and .. segments lexically

  2. os.path.realpath(): Resolves all symbolic links to get the true canonical path—this prevents symlink-based bypasses

  3. Boundary Check: The code verifies that the resolved path either equals the allowed directory or starts with the allowed directory followed by a path separator (os.sep). The separator check is crucial—without it, /allowed/dir_malicious would pass a check for /allowed/dir

  4. Configurable Allowed Directory: The LOCALAI_DATASET_DIR environment variable lets operators define where datasets should live, defaulting to the current working directory
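
The steps above can be exercised with a small experiment. This is a sketch under assumptions: `is_contained` is a hypothetical helper that repeats the comparison from the fix, the directory names are made up, and the symlink case assumes a POSIX filesystem.

```python
import os
import tempfile

def is_contained(allowed_dir: str, candidate: str) -> bool:
    """Canonicalize both sides, then compare with the separator-aware prefix check."""
    allowed = os.path.realpath(os.path.abspath(allowed_dir))
    real = os.path.realpath(os.path.abspath(candidate))
    return real == allowed or real.startswith(allowed + os.sep)

root = os.path.realpath(tempfile.mkdtemp())
allowed = os.path.join(root, "dir")
sibling = os.path.join(root, "dir_malicious")  # shares the "dir" prefix
os.makedirs(allowed)
os.makedirs(sibling)

inside = os.path.join(allowed, "train.json")
outside = os.path.join(sibling, "evil.json")
for path in (inside, outside):
    with open(path, "w") as f:
        f.write("{}\n")

# A symlink stored inside the allowed directory that points outside it.
link = os.path.join(allowed, "link.json")
os.symlink(outside, link)

dotdot = os.path.join(allowed, "..", "dir_malicious", "evil.json")

print(is_contained(allowed, inside))   # True: file inside the allowed directory
print(is_contained(allowed, dotdot))   # False: ".." escape is normalized away
print(is_contained(allowed, link))     # False: symlink is resolved to its target
print(outside.startswith(allowed))     # True: prefix check without os.sep is fooled
```

The last line is the case step 3 warns about: dropping `os.sep` from the comparison lets a sibling directory with a shared name prefix slip through.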

The Output Path Fix

The same vulnerability existed in the ExportModel method for output paths. The fix applies identical logic:

_allowed_output_dir = os.path.realpath(os.path.abspath(os.environ.get("LOCALAI_OUTPUT_DIR", os.getcwd())))
_real_output_path = os.path.realpath(os.path.abspath(output_path))
if not (_real_output_path == _allowed_output_dir or _real_output_path.startswith(_allowed_output_dir + os.sep)):
    raise ValueError("Output path is outside the allowed directory")
output_path = _real_output_path

This prevents attackers from writing model exports to arbitrary locations like /etc/cron.d/ or overwriting system files.
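
Because the dataset and output checks are the same logic with different base directories, one way to avoid drift between them is a shared helper. The sketch below is an assumption about how that could look, not code from the pull request: `resolve_within` is a hypothetical name, and the temporary directory stands in for `LOCALAI_OUTPUT_DIR`.

```python
import os
import tempfile

def resolve_within(base_dir: str, user_path: str) -> str:
    """Return the canonical form of user_path, or raise if it falls outside base_dir."""
    base = os.path.realpath(os.path.abspath(base_dir))
    real = os.path.realpath(os.path.abspath(user_path))
    if not (real == base or real.startswith(base + os.sep)):
        raise ValueError("Path is outside the allowed directory")
    return real

# Stand-in for the directory an operator would set via LOCALAI_OUTPUT_DIR.
output_root = os.path.realpath(tempfile.mkdtemp())

# The export target does not exist yet; realpath() still canonicalizes it.
ok = resolve_within(output_root, os.path.join(output_root, "exports", "model-v1"))
print(ok)

rejected = None
try:
    resolve_within(output_root, "/etc/cron.d/job")
except ValueError as exc:
    rejected = str(exc)
print("rejected:", rejected)
```

Returning the canonical path, as the fix does with `output_path = _real_output_path`, means later file operations use the same path that was validated.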

Prevention & Best Practices

1. Always Validate File Paths from User Input

Never trust paths from external sources. Always:
- Resolve to canonical paths with os.path.realpath()
- Check against an allowed directory boundary
- Use the path separator in your prefix check

import os

def is_safe_path(base_dir: str, user_path: str) -> bool:
    """Check if user_path is safely within base_dir."""
    base = os.path.realpath(os.path.abspath(base_dir))
    target = os.path.realpath(os.path.abspath(user_path))
    return target == base or target.startswith(base + os.sep)
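
For codebases that already use pathlib, a roughly equivalent check is possible. This is a sketch, assuming Python 3.9 or newer (where `PurePath.is_relative_to()` is available); `is_safe_path_pathlib` and the directory names are hypothetical.

```python
import tempfile
from pathlib import Path

def is_safe_path_pathlib(base_dir: str, user_path: str) -> bool:
    """pathlib variant of is_safe_path; needs Python 3.9+ for is_relative_to()."""
    base = Path(base_dir).resolve()      # absolute, with symlinks resolved
    target = Path(user_path).resolve()
    # Compares whole path components, so ".../data_backup" is not inside ".../data".
    return target.is_relative_to(base)

root = Path(tempfile.mkdtemp())
base = root / "data"
base.mkdir()
(root / "data_backup").mkdir()

print(is_safe_path_pathlib(str(base), str(base / "train.jsonl")))
print(is_safe_path_pathlib(str(base), str(root / "data_backup" / "x.json")))
print(is_safe_path_pathlib(str(base), "/etc/passwd"))
```

Because `is_relative_to()` works on path components rather than raw strings, the separator handling is built in and cannot be forgotten.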

2. Use Allowlists for Dataset Sources

For ML backends, consider maintaining an allowlist of valid dataset identifiers:

ALLOWED_DATASETS = {"imdb", "squad", "glue", "custom_dataset_v1"}

if request.dataset_source in ALLOWED_DATASETS:
    dataset = load_dataset(request.dataset_source, split=dataset_split)
elif is_safe_path(DATASET_DIR, request.dataset_source):
    # Load from local file
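
The snippet above leaves out what happens when neither branch matches. A fuller sketch might reject everything else explicitly. Here `classify_dataset_source` is a hypothetical helper that only decides how a source would be loaded (it does not call `load_dataset`), and the temporary directory stands in for the configured dataset directory.

```python
import os
import tempfile

ALLOWED_DATASETS = {"imdb", "squad", "glue", "custom_dataset_v1"}

def is_safe_path(base_dir: str, user_path: str) -> bool:
    """Check if user_path is safely within base_dir."""
    base = os.path.realpath(os.path.abspath(base_dir))
    target = os.path.realpath(os.path.abspath(user_path))
    return target == base or target.startswith(base + os.sep)

def classify_dataset_source(source: str, dataset_dir: str) -> tuple:
    """Return ("hub", name) or ("file", canonical_path); raise ValueError otherwise."""
    if source in ALLOWED_DATASETS:
        return ("hub", source)
    if is_safe_path(dataset_dir, source) and os.path.isfile(source):
        return ("file", os.path.realpath(source))
    # Default deny: anything not allowlisted and not a contained file is refused.
    raise ValueError("Dataset source is neither allowlisted nor inside the dataset directory")

dataset_dir = os.path.realpath(tempfile.mkdtemp())
local_file = os.path.join(dataset_dir, "train.csv")
with open(local_file, "w") as f:
    f.write("text\nhello\n")

print(classify_dataset_source("imdb", dataset_dir))
print(classify_dataset_source(local_file, dataset_dir))
for hostile in ("/etc/passwd", "unknown_dataset"):
    try:
        classify_dataset_source(hostile, dataset_dir)
    except ValueError as exc:
        print(hostile, "->", exc)
```

The explicit `raise` at the end is the design choice worth copying: an unrecognized source fails closed instead of falling through to some later loading path.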

3. Principle of Least Privilege

Run ML backends with minimal filesystem permissions. Use containerization or chroot to limit what the process can access even if path validation fails.

4. Input Validation at the API Boundary

Validate inputs as early as possible—ideally in the gRPC service definition or a middleware layer:

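import os

def validate_dataset_source(source: str) -> bool:
    # Reject empty input and absolute paths
    if not source or os.path.isabs(source):
        return False
    # Reject ".." path components; a plain substring test would also
    # reject harmless file names such as "data..json"
    if '..' in source.replace('\\', '/').split('/'):
        return False
    return True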
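
An early lexical check like this is a first filter, not a replacement for the containment check. The sketch below illustrates why, under assumptions: `resolve_dataset` is a hypothetical helper, the lexical check is a simplified version in the spirit of the one above, and the symlink setup assumes a POSIX filesystem.

```python
import os
import tempfile

def validate_dataset_source(source: str) -> bool:
    """Early lexical check: no absolute paths, no '..' sequences."""
    if os.path.isabs(source):
        return False
    if '..' in source:
        return False
    return True

def resolve_dataset(base_dir: str, source: str) -> str:
    """Lexical check first, then canonicalize relative to base_dir and verify containment."""
    if not validate_dataset_source(source):
        raise ValueError("Rejected by early validation")
    base = os.path.realpath(base_dir)
    real = os.path.realpath(os.path.join(base, source))
    if not (real == base or real.startswith(base + os.sep)):
        raise ValueError("Dataset source path is outside the allowed directory")
    return real

root = os.path.realpath(tempfile.mkdtemp())
base = os.path.join(root, "datasets")
os.makedirs(base)
secret = os.path.join(root, "secret.env")
with open(secret, "w") as f:
    f.write("API_TOKEN=not-a-real-token\n")

# A harmless-looking name inside the dataset directory that points outside it.
os.symlink(secret, os.path.join(base, "innocent.json"))

passes_lexical = validate_dataset_source("innocent.json")
print("lexical check passes:", passes_lexical)

outcome = None
try:
    resolve_dataset(base, "innocent.json")
except ValueError as exc:
    outcome = str(exc)
print("containment check:", outcome)
```

The name `innocent.json` contains nothing a string filter would flag, yet it resolves to a file outside the dataset directory. Only the canonicalized comparison catches it.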

Key Takeaways

  • The _do_training method's direct use of request.dataset_source in os.path.exists() created a critical file read vulnerability—always validate paths before filesystem operations

  • gRPC services are network-exposed attack surfaces—treat all request parameters as potentially malicious, even in internal ML pipelines

  • os.path.realpath() + boundary checking is the correct fix pattern—simple string filtering for ../ is insufficient and can be bypassed

  • Both input paths (dataset_source) and output paths (output_path) need validation—the fix addressed both vectors in _do_training and ExportModel

  • Environment-configurable allowed directories (LOCALAI_DATASET_DIR, LOCALAI_OUTPUT_DIR) enable secure deployment flexibility without hardcoding paths

How Orbis AppSec Detected This

  • Source: The dataset_source field from incoming StartFineTune gRPC requests in backend/python/trl/backend.py
  • Sink: os.path.exists(request.dataset_source) at line 310 and subsequent load_dataset() calls
  • Missing control: No path validation or directory containment check before filesystem access
  • CWE: CWE-22 (Improper Limitation of a Pathname to a Restricted Directory)
  • Fix: Added os.path.realpath() resolution and boundary validation against LOCALAI_DATASET_DIR before the dataset is loaded

Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix. Try Orbis AppSec on your repositories to find and fix issues like this automatically.

Conclusion

Path traversal vulnerabilities remain one of the most common and dangerous security issues in web services and APIs. This TRL backend vulnerability demonstrates how easily they can slip into ML infrastructure code, where the focus is often on model performance rather than security hardening.

The fix is straightforward but must be applied correctly: resolve paths to their canonical form, check against an allowed directory boundary, and include the path separator in your prefix check. These three steps—implemented consistently across all file operations—form a robust defense against path traversal attacks.

For teams building ML pipelines and fine-tuning services, remember that your training infrastructure is a high-value target. Attackers who compromise these systems gain access to models, training data, and the credentials that connect to your broader infrastructure.

Frequently Asked Questions

What is path traversal?

Path traversal is a vulnerability where attackers manipulate file paths to access files outside intended directories, often using sequences like `../` or absolute paths like `/etc/passwd`.

How do you prevent path traversal in Python?

Use `os.path.realpath()` and `os.path.abspath()` to resolve the canonical path, then verify it starts with your allowed base directory using string prefix checking with the path separator.

What CWE is path traversal?

Path traversal is classified as CWE-22 (Improper Limitation of a Pathname to a Restricted Directory).

Is checking for `../` enough to prevent path traversal?

No. Attackers can bypass simple string checks using absolute paths (`/etc/passwd`), URL encoding, or symlinks. Always resolve to the real path and validate against an allowed directory.

Can static analysis detect path traversal?

Yes. Static analysis tools can trace data flow from user inputs to filesystem functions and flag cases where validation is missing before calls like `os.path.exists()` or `open()`.

View the Security Fix

Check out the pull request that fixed this vulnerability

View PR #10422
