Locking Down Docker: Preventing Privilege Escalation in Container Services
Introduction
Container security is often treated as an afterthought — developers focus on getting services running correctly, and security configuration gets deferred until "later." Unfortunately, "later" sometimes arrives as an incident report. This post examines a real-world high-severity vulnerability found in a Docker Compose file: a combination of missing privilege escalation controls and a writable root filesystem on an nginx reverse proxy service.
If you run containerized workloads in production (and who doesn't these days?), understanding these misconfigurations could be the difference between a contained incident and a full compromise.
The Vulnerability Explained
What Is Privilege Escalation via setuid/setgid Binaries?
Linux systems use two special file permission bits — setuid (Set User ID) and setgid (Set Group ID) — to allow executables to run with the permissions of their owner rather than the user who launched them. Classic examples include /usr/bin/passwd, which needs root privileges to modify /etc/shadow even when run by a regular user.
Inside a container, setuid and setgid binaries present a serious risk. If a container image contains such binaries (and many base images do), a process running inside the container could execute them to escalate its privileges beyond what was originally granted. Even if your application starts as a non-root user, a vulnerability in your app (like a Remote Code Execution flaw) could be chained with a setuid binary to gain root inside the container.
By default, Docker does not prevent this. Without explicit hardening, container processes are free to exploit any setuid/setgid binary present in the image.
What Is a Writable Root Filesystem Risk?
When a container's root filesystem is writable (the default), a compromised process can:
- Download additional malware or attacker tooling directly to the container's filesystem
- Modify application files — injecting backdoors into served content or configuration
- Persist changes across container restarts (the writable layer survives until the container is removed, and any volumes outlive it)
- Stage lateral movement by writing scripts that interact with mounted volumes or network resources
For a reverse proxy like nginx, a writable filesystem is especially dangerous. If an attacker achieves code execution (perhaps through a vulnerability in a proxied application), they could modify nginx configuration files, inject content into served responses, or use the container as a foothold for further attacks.
The Real-World Attack Scenario
Imagine this chain of events:
- A vulnerability in a proxied backend application allows Remote Code Execution (RCE)
- The attacker's payload runs inside the nginx container (perhaps via a misconfigured proxy or shared process namespace)
- Because the root filesystem is writable, the attacker downloads a privilege escalation tool to /tmp
- A setuid binary in the nginx:alpine image is exploited to gain root inside the container
- With root and a writable filesystem, the attacker modifies nginx.conf or injects malicious content, pivoting to other services on the internal rustfs-network
Each of these steps is made possible by the absence of two simple configuration lines.
The Fix
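The automated fix addresses both attack vectors simultaneously with three targeted changes to docker-compose.yml: a security_opt entry, a read-only root filesystem, and tmpfs mounts for the paths nginx must write to.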
Before
  # NGINX reverse proxy (optional)
  nginx:
    image: nginx:alpine
    container_name: nginx-proxy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./.docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./.docker/nginx/ssl:/etc/nginx/ssl:ro
    networks:
      - rustfs-network
    restart: unless-stopped
    profiles:
      - proxy
    depends_on:
      ...
After
  # NGINX reverse proxy (optional)
  nginx:
    security_opt:
      - "no-new-privileges:true"  # ← NEW: Blocks privilege escalation
    image: nginx:alpine
    container_name: nginx-proxy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./.docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./.docker/nginx/ssl:/etc/nginx/ssl:ro
    tmpfs:
      - /var/run          # ← NEW: In-memory writable scratch space
      - /var/cache/nginx  # ← NEW: In-memory cache
      - /var/log/nginx    # ← NEW: In-memory logs
    networks:
      - rustfs-network
    restart: unless-stopped
    read_only: true       # ← NEW: Immutable root filesystem
    profiles:
      - proxy
    depends_on:
      ...
How Each Change Helps
1. security_opt: no-new-privileges:true
This is a Linux kernel-level control that sets the PR_SET_NO_NEW_PRIVS flag on the container's process tree. Once set, no child process can gain more privileges than its parent — even by executing a setuid or setgid binary. The bit is inherited across fork(), clone(), and execve() calls, making it sticky and reliable.
This directly neutralizes the setuid/setgid escalation vector. The vulnerability report describes the same missing control on the opensearch service; this fix applies it to nginx.
security_opt:
- "no-new-privileges:true"
Key insight: This does not remove setuid binaries from the image; it prevents them from being effective. Even if an attacker finds a setuid binary, executing it won't grant elevated privileges.
2. read_only: true
This mounts the container's root filesystem as read-only at the kernel level. Any attempt to write to the filesystem (outside of explicitly designated writable mounts) fails with a "Read-only file system" (EROFS) error. This directly prevents:
- Downloading and executing additional payloads
- Modifying container files
- Persisting attacker tooling
read_only: true
3. tmpfs Mounts for Legitimate Writable Paths
nginx legitimately needs to write to certain directories at runtime — PID files, cache, and logs. Making the root filesystem read-only would break nginx if these paths weren't handled. The fix uses tmpfs (in-memory, non-persistent) mounts for exactly these directories:
tmpfs:
- /var/run # PID files and Unix sockets
- /var/cache/nginx # Proxy cache and temp files
- /var/log/nginx # Access and error logs
tmpfs is the right tool here for several reasons:
- Ephemeral: Contents are lost when the container stops — no persistence for attacker tooling
- Memory-backed: Fast I/O, no disk writes
- Scoped: Only these specific paths are writable; everything else remains read-only
- Isolated: Each container gets its own tmpfs instance
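Because tmpfs is memory-backed, whatever a process writes there consumes RAM, so it can be worth capping each mount. The fragment below is a minimal sketch using the Compose long volume syntax. The 64 MiB figure is an illustrative assumption, not a value from the original fix, and in a real file the entry would be appended to the service's existing volumes list.

```yaml
services:
  nginx:
    read_only: true
    volumes:
      # Long-syntax tmpfs mount with an explicit size cap (value in bytes).
      - type: tmpfs
        target: /var/cache/nginx
        tmpfs:
          size: 67108864  # 64 MiB, an assumed limit for illustration
```

A cap like this bounds how much memory a runaway cache, or an attacker filling the scratch space, can consume.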
Note on logs: If you need to retain nginx logs for auditing or analysis, consider shipping them to stdout/stderr (which Docker captures) or using a dedicated log driver rather than writing to the container filesystem.
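Once logs go to stdout/stderr, retention is governed by Docker's logging driver rather than by files inside the container. The fragment below sketches rotation limits for the json-file driver. The specific values are assumptions chosen for illustration and should be tuned to your traffic.

```yaml
services:
  nginx:
    logging:
      driver: json-file
      options:
        max-size: "10m"  # rotate each log file at roughly 10 MB (assumed value)
        max-file: "3"    # keep at most three rotated files (assumed value)
```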
Prevention & Best Practices
Harden Every Service by Default
Don't wait for a vulnerability scan to add these controls. Make them part of your standard Docker Compose template:
# Security hardening template for Docker Compose services
services:
  my-service:
    security_opt:
      - no-new-privileges:true
    read_only: true
    tmpfs:
      - /tmp
      - /var/run
    # Drop all capabilities and add back only what's needed
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE  # Only if binding to ports < 1024
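One way to keep such defaults from drifting between services is a Compose extension field combined with a YAML anchor and merge key. This is a sketch under assumptions: the name x-hardening is hypothetical, and only no-new-privileges is shared here because read_only and cap_drop need per-service testing.

```yaml
# Shared defaults under an extension field (top-level keys starting with x- are ignored by Compose).
x-hardening: &hardening
  security_opt:
    - no-new-privileges:true

services:
  nginx:
    <<: *hardening
    image: nginx:alpine
    read_only: true
  opensearch:
    <<: *hardening
    # ...rest of the existing opensearch definition unchanged
```

The merge is shallow: a service that declares its own security_opt list replaces the shared one rather than extending it, so any override must repeat the no-new-privileges entry.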
Principle of Least Privilege for Containers
Apply these practices across your container fleet:
| Control | Docker Compose Setting | Purpose |
|---|---|---|
| Block privilege escalation | security_opt: no-new-privileges:true | Prevents setuid/setgid abuse |
| Read-only filesystem | read_only: true | Blocks payload downloads |
| Drop capabilities | cap_drop: [ALL] | Removes unnecessary kernel powers |
| Non-root user | user: "1000:1000" | Reduces blast radius |
| No host network | (avoid network_mode: host) | Isolates network namespace |
| Resource limits | mem_limit, cpus | Prevents resource exhaustion |
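The rows not covered by the hardening template (non-root user and resource limits) can be sketched as follows. The UID/GID comes from the table; the memory and CPU values are assumptions for illustration, and the image must be able to run as that user.

```yaml
services:
  my-service:
    user: "1000:1000"  # non-root UID:GID; the image must support running as this user
    mem_limit: 256m    # assumed memory ceiling
    cpus: 0.5          # assumed CPU share (half of one core)
```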
Use Automated Scanning Tools
Don't rely solely on manual review. Integrate these tools into your CI/CD pipeline:
- Trivy — Scans container images and IaC files including Docker Compose for misconfigurations
- Checkov — Static analysis for Docker, Kubernetes, and Terraform
- Hadolint — Dockerfile linter with security rules
- Docker Scout — Built-in Docker vulnerability scanning
- Semgrep — Custom rule-based scanning (the tool that caught this vulnerability)
Apply Defense in Depth
No single control is sufficient. Layer your defenses:
┌─────────────────────────────────────┐
│ Host OS / Kernel                    │ ← Seccomp, AppArmor, SELinux
├─────────────────────────────────────┤
│ Container Runtime                   │ ← Docker security defaults
├─────────────────────────────────────┤
│ Container Configuration             │ ← no-new-privileges, read_only, cap_drop
├─────────────────────────────────────┤
│ Application                         │ ← Non-root user, minimal image
└─────────────────────────────────────┘
Relevant Security Standards
This fix aligns with established security frameworks:
- CWE-269: Improper Privilege Management
- CWE-732: Incorrect Permission Assignment for Critical Resource
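- OWASP Docker Security Cheat Sheet: Recommends no-new-privileges and read-only filesystems
- CIS Docker Benchmark: The container runtime controls on restricting containers from acquiring additional privileges (no-new-privileges) and on mounting the root filesystem read-only (5.25 and 5.12 in several benchmark versions; numbering varies, so check the version you audit against)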
- NIST SP 800-190: Application Container Security Guide — recommends immutable container images
The opensearch Service — Don't Forget It
The vulnerability report specifically calls out the opensearch service as also lacking no-new-privileges. This fix addressed nginx, but a complete remediation should audit every service in your docker-compose.yml. Run a quick check:
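# Heuristic: list 2-space-indented entries (services, plus any networks/volumes entries) whose block never mentions no-new-privileges
awk '/^  [A-Za-z0-9_-]+:/ { if (s != "" && !ok) print s; s = $1; ok = 0 }
     /no-new-privileges/  { ok = 1 }
     END { if (s != "" && !ok) print s }' docker-compose.yml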
Or better yet, add Trivy or Checkov to your pre-commit hooks so misconfigurations are caught before they ever reach a pull request.
Conclusion
Two lines of configuration (no-new-privileges:true and read_only: true), combined with thoughtful tmpfs mounts, dramatically reduce the attack surface of a containerized nginx service. These aren't exotic hardening techniques; they're well-documented, widely supported, and, once the application's writable paths are accounted for, have little impact on legitimate application behavior.
The key takeaways from this fix:
- no-new-privileges:true is a free security win: add it to every service that doesn't explicitly require privilege escalation (which should be all of them)
- Read-only root filesystems contain compromises: even if an attacker achieves RCE, they can't download tools or modify files
- tmpfs makes read-only practical: identify the paths your application legitimately writes to and carve out in-memory mounts for them
- Automate detection: human review of Docker Compose files will miss things; static analysis tools won't
- Audit all services, not just the flagged one: misconfigurations tend to be systemic, not isolated
Container security doesn't require a security team or expensive tooling to get right. It requires building good defaults into your templates, automating checks in CI/CD, and treating each service configuration as a security-relevant artifact — because it is.
Stay curious, stay secure. 🔒
This vulnerability was automatically detected and fixed as part of an ongoing security hardening initiative. Automated security tooling identified the misconfiguration in docker-compose.yml and generated the remediation pull request.