Introduction
The TRL (Transformer Reinforcement Learning) backend in LocalAI handles fine-tuning requests via gRPC, but a critical flaw in backend/python/trl/backend.py at line 310 created a dangerous attack vector. The _do_training method accepted a dataset_source parameter from incoming StartFineTune gRPC requests and passed it directly to os.path.exists() and load_dataset() without any validation.
This meant an attacker with network access to the gRPC backend on port 50051 could specify paths like /etc/passwd or /proc/self/environ as their "dataset source" and trick the server into reading—and potentially leaking—sensitive system files. For a machine learning backend that's designed to load training data, this represents a complete breakdown of the security boundary between user-supplied input and the underlying filesystem.
The Vulnerability Explained
The Dangerous Code Pattern
The vulnerable code in _do_training looked like this:
    dataset_split = request.dataset_split or "train"
    if os.path.exists(request.dataset_source):
        if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
            dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)
        elif request.dataset_source.endswith('.csv'):
            # ... load CSV dataset
The problem is immediately apparent: request.dataset_source comes directly from the gRPC request with zero validation. The code trusts that the client will only send legitimate dataset paths, but a malicious actor can send anything.
Attack Scenario
Here's how an attacker could exploit this:
- Reconnaissance: The attacker discovers the TRL gRPC backend is exposed on port 50051 (either directly or through a misconfigured network).
- Crafting the Request: They send a StartFineTune gRPC request with:
      dataset_source: "/proc/self/environ"
      dataset_split: "train"
      model_name: "sshleifer/tiny-gpt2"
      output_dir: "/tmp/output"
- File Access: The backend calls os.path.exists("/proc/self/environ"), which returns True. The code then attempts to load this "dataset," potentially exposing environment variables containing API keys, database credentials, or other secrets.
- Escalation: With access to /etc/passwd, the attacker learns system usernames. With /proc/self/environ, they might find AWS_SECRET_ACCESS_KEY, DATABASE_URL, or other sensitive environment variables that the ML backend uses.
Why This Is Critical
This isn't just a theoretical risk. The TRL backend is a Python service that likely runs with access to:
- Model weights and training data (potentially proprietary)
- API tokens for HuggingFace and other services
- Cloud credentials for distributed training
- Database connections for experiment tracking
A path traversal here gives attackers a window into all of these.
The Fix
The fix implements a directory containment check—a security pattern that ensures all file operations stay within a designated safe directory.
Before (Vulnerable)
    dataset_split = request.dataset_split or "train"
    if os.path.exists(request.dataset_source):
        if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
            dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)
After (Fixed)
    dataset_split = request.dataset_split or "train"
    if os.path.exists(request.dataset_source):
        _allowed_dir = os.path.realpath(os.path.abspath(os.environ.get("LOCALAI_DATASET_DIR", os.getcwd())))
        _real_path = os.path.realpath(os.path.abspath(request.dataset_source))
        if not (_real_path == _allowed_dir or _real_path.startswith(_allowed_dir + os.sep)):
            raise ValueError("Dataset source path is outside the allowed directory")
        if request.dataset_source.endswith('.json') or request.dataset_source.endswith('.jsonl'):
            dataset = load_dataset("json", data_files=request.dataset_source, split=dataset_split)
How the Fix Works
- os.path.abspath(): Converts any relative path to an absolute path based on the current working directory.
- os.path.realpath(): Resolves all symbolic links to get the true canonical path, which prevents symlink-based bypasses.
- Boundary Check: The code verifies that the resolved path either equals the allowed directory or starts with the allowed directory followed by a path separator (os.sep). The separator check is crucial: without it, /allowed/dir_malicious would pass a check for /allowed/dir.
- Configurable Allowed Directory: The LOCALAI_DATASET_DIR environment variable lets operators define where datasets should live, defaulting to the current working directory.
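The effect of the separator and of os.path.realpath() can be seen in a small self-contained sketch. The directory names (allowed, allowed_evil) and the helpers is_contained and naive_prefix_check are made up for illustration, and the symlink part assumes a POSIX-like system where the process is permitted to create symlinks.

```python
import os
import tempfile


def is_contained(allowed_dir: str, candidate: str) -> bool:
    """Same boundary check as the fix: canonicalise, then compare with a separator."""
    base = os.path.realpath(os.path.abspath(allowed_dir))
    target = os.path.realpath(os.path.abspath(candidate))
    return target == base or target.startswith(base + os.sep)


def naive_prefix_check(allowed_dir: str, candidate: str) -> bool:
    """Deliberately flawed variant: plain prefix match, no separator, no symlink resolution."""
    return os.path.abspath(candidate).startswith(os.path.abspath(allowed_dir))


def demo() -> dict:
    results = {}
    with tempfile.TemporaryDirectory() as root:
        allowed = os.path.join(root, "allowed")
        sibling = os.path.join(root, "allowed_evil")  # shares the "allowed" prefix
        os.makedirs(allowed)
        os.makedirs(sibling)

        secret = os.path.join(sibling, "secret.json")
        with open(secret, "w") as fh:
            fh.write("{}")

        # A symlink that lives inside the allowed directory but points outside it.
        link = os.path.join(allowed, "link.json")
        os.symlink(secret, link)

        results["naive_sibling"] = naive_prefix_check(allowed, secret)
        results["naive_symlink"] = naive_prefix_check(allowed, link)
        results["safe_sibling"] = is_contained(allowed, secret)
        results["safe_symlink"] = is_contained(allowed, link)
        results["safe_inside"] = is_contained(allowed, os.path.join(allowed, "train.json"))
    return results


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value}")
```

The naive check accepts both the sibling directory and the symlink, while the canonicalise-then-compare check rejects both and still accepts an ordinary file path inside the allowed directory.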
The Output Path Fix
The same vulnerability existed in the ExportModel method for output paths. The fix applies identical logic:
    _allowed_output_dir = os.path.realpath(os.path.abspath(os.environ.get("LOCALAI_OUTPUT_DIR", os.getcwd())))
    _real_output_path = os.path.realpath(os.path.abspath(output_path))
    if not (_real_output_path == _allowed_output_dir or _real_output_path.startswith(_allowed_output_dir + os.sep)):
        raise ValueError("Output path is outside the allowed directory")
    output_path = _real_output_path
This prevents attackers from writing model exports to arbitrary locations like /etc/cron.d/ or overwriting system files.
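One way to keep the input and output checks from drifting apart is to route both through a single helper that returns the canonical path, so the caller opens the validated path and not the original string. The sketch below is an illustration, not the code from the patch: resolve_within is a hypothetical name, it uses os.path.commonpath in place of the startswith comparison, and it interprets relative paths against the allowed directory instead of the current working directory.

```python
import os


def resolve_within(base_dir: str, user_path: str) -> str:
    """Return the canonical form of user_path, or raise if it escapes base_dir.

    Assumption of this sketch: relative paths are resolved against base_dir.
    An absolute user_path replaces base_dir in os.path.join and is then
    checked like any other path.
    """
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_path))
    # commonpath compares whole path components, so "/data" never matches "/data_evil".
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"Path is outside the allowed directory: {user_path!r}")
    return target
```

A caller would then write to or read from the returned value, for example output_path = resolve_within(os.environ.get("LOCALAI_OUTPUT_DIR", os.getcwd()), output_path), which mirrors the reassignment the ExportModel fix performs.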
Prevention & Best Practices
1. Always Validate File Paths from User Input
Never trust paths from external sources. Always:
- Resolve to canonical paths with os.path.realpath()
- Check against an allowed directory boundary
- Use the path separator in your prefix check
    import os

    def is_safe_path(base_dir: str, user_path: str) -> bool:
        """Check if user_path is safely within base_dir."""
        base = os.path.realpath(os.path.abspath(base_dir))
        target = os.path.realpath(os.path.abspath(user_path))
        return target == base or target.startswith(base + os.sep)
2. Use Allowlists for Dataset Sources
For ML backends, consider maintaining an allowlist of valid dataset identifiers:
    ALLOWED_DATASETS = {"imdb", "squad", "glue", "custom_dataset_v1"}
    if request.dataset_source in ALLOWED_DATASETS:
        dataset = load_dataset(request.dataset_source, split=dataset_split)
    elif is_safe_path(DATASET_DIR, request.dataset_source):
        # Load from local file
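Put together, the two branches might look like the following sketch. It only decides where the data should come from and leaves the actual load_dataset call to the caller, so it runs without the datasets library installed. DATASET_DIR, ALLOWED_DATASETS, and classify_dataset_source are illustrative names, not part of LocalAI, and rejecting everything else with a ValueError is an assumption about the desired behavior.

```python
import os

# Hypothetical configuration: the environment variable name matches the fix,
# the fallback to the current working directory mirrors its default.
DATASET_DIR = os.environ.get("LOCALAI_DATASET_DIR", os.getcwd())
ALLOWED_DATASETS = {"imdb", "squad", "glue", "custom_dataset_v1"}


def is_safe_path(base_dir: str, user_path: str) -> bool:
    """Check if user_path is safely within base_dir."""
    base = os.path.realpath(os.path.abspath(base_dir))
    target = os.path.realpath(os.path.abspath(user_path))
    return target == base or target.startswith(base + os.sep)


def classify_dataset_source(source: str, dataset_dir: str = DATASET_DIR) -> tuple:
    """Return ("hub", name) or ("local", canonical_path); raise ValueError otherwise."""
    if source in ALLOWED_DATASETS:
        # Known identifier: the caller passes it to load_dataset() by name.
        return ("hub", source)
    if is_safe_path(dataset_dir, source):
        # Local file: hand back the canonical path so the caller loads what was checked.
        return ("local", os.path.realpath(os.path.abspath(source)))
    raise ValueError("Dataset source is neither allowlisted nor inside the dataset directory")
```

Checking the allowlist first means a known identifier never reaches the filesystem at all, and anything that matches neither branch is rejected instead of silently falling through.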
3. Principle of Least Privilege
Run ML backends with minimal filesystem permissions. Use containerization or chroot to limit what the process can access even if path validation fails.
4. Input Validation at the API Boundary
Validate inputs as early as possible—ideally in the gRPC service definition or a middleware layer:
    import os

    def validate_dataset_source(source: str) -> bool:
        # Reject absolute paths
        if os.path.isabs(source):
            return False
        # Reject path traversal sequences
        if '..' in source:
            return False
        return True
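A substring test for '..' also rejects harmless names such as notes..v2.json. A variation that inspects whole path components is sketched below. validate_dataset_source_strict is a hypothetical name, the sketch assumes POSIX-style paths with forward slashes, and the check is meant as an early filter in front of the realpath boundary check, not a replacement for it.

```python
from pathlib import PurePosixPath


def validate_dataset_source_strict(source: str) -> bool:
    """Early, purely lexical filter for a relative POSIX-style dataset path."""
    if not source or "\x00" in source:
        return False  # empty strings and NUL bytes are never valid paths
    path = PurePosixPath(source)
    if path.is_absolute():
        return False  # absolute paths are rejected, as in the version above
    # Compare whole components so "notes..v2.json" is allowed but "a/../b" is not.
    return ".." not in path.parts
```

Because this filter never touches the filesystem, it cannot see symlinks, which is why the canonical-path check still has to run before the file is opened.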
Key Takeaways
- The _do_training method's direct use of request.dataset_source in os.path.exists() created a critical file read vulnerability; always validate paths before filesystem operations.
- gRPC services are network-exposed attack surfaces; treat all request parameters as potentially malicious, even in internal ML pipelines.
- os.path.realpath() + boundary checking is the correct fix pattern; simple string filtering for ../ is insufficient and can be bypassed.
- Both input paths (dataset_source) and output paths (output_path) need validation; the fix addressed both vectors in _do_training and ExportModel.
- Environment-configurable allowed directories (LOCALAI_DATASET_DIR, LOCALAI_OUTPUT_DIR) enable secure deployment flexibility without hardcoding paths.
How Orbis AppSec Detected This
- Source: The dataset_source field from incoming StartFineTune gRPC requests in backend/python/trl/backend.py
- Sink: os.path.exists(request.dataset_source) at line 310 and subsequent load_dataset() calls
- Missing control: No path validation or directory containment check before filesystem access
- CWE: CWE-22 (Improper Limitation of a Pathname to a Restricted Directory)
- Fix: Added os.path.realpath() resolution and boundary validation against LOCALAI_DATASET_DIR before the dataset is loaded
Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix. Try Orbis AppSec on your repositories to find and fix issues like this automatically.
Conclusion
Path traversal vulnerabilities remain one of the most common and dangerous security issues in web services and APIs. This TRL backend vulnerability demonstrates how easily they can slip into ML infrastructure code, where the focus is often on model performance rather than security hardening.
The fix is straightforward but must be applied correctly: resolve paths to their canonical form, check against an allowed directory boundary, and include the path separator in your prefix check. These three steps—implemented consistently across all file operations—form a robust defense against path traversal attacks.
For teams building ML pipelines and fine-tuning services, remember that your training infrastructure is a high-value target. Attackers who compromise these systems gain access to models, training data, and the credentials that connect to your broader infrastructure.