Heap Overflow in LZMA Decompression: When Attacker-Controlled Data Meets memcpy
Introduction
Compression libraries are everywhere. They decompress archives, stream data, and sit quietly inside applications handling files users upload, download, or open daily. Because of this ubiquity, they make an attractive and high-value target for attackers. A vulnerability in a compression routine doesn't just affect one feature — it can affect every code path that processes compressed data.
This post covers a critical heap buffer overflow discovered and fixed in deps/lzma/src/LzmaDec.c, the LZMA decompression engine. The root cause is a classic but dangerous pattern: a length value derived from attacker-controlled input is used in a memcpy call without first verifying it fits within the destination buffer. One malformed compressed stream is all it takes to corrupt heap memory.
If you write C or C++, work with compression libraries, or ship software that processes user-supplied binary data, this one is worth understanding deeply.
The Vulnerability Explained
What Is LZMA?
LZMA (Lempel–Ziv–Markov chain Algorithm) is a lossless data compression algorithm used in formats like .7z, .xz, and .lzma. It's embedded in many applications and system utilities. The decompression logic is complex by nature — it must faithfully reconstruct arbitrary byte sequences from compact encoded representations — and that complexity creates opportunities for subtle bugs.
The Vulnerable Code Path
The vulnerability lives in LzmaDec_DecodeToBuf, the core function responsible for decompressing LZMA data into a caller-supplied output buffer. Here's the vulnerable section (before the fix):
// LzmaDec.c:1226 — BEFORE FIX
outSizeCur = p->dicPos - dicPos;
// ⚠️ outSizeCur is derived from decompressor internal state,
// influenced by the compressed input stream
memcpy(dest, p->dic + dicPos, outSizeCur);
dest += outSizeCur;
outSize -= outSizeCur;
The variable outSizeCur represents how many bytes were produced by the current decompression step. It's calculated from p->dicPos and dicPos — values that are directly influenced by the contents of the compressed input stream. The problem: this value is never checked against outSize, the actual remaining capacity of the destination buffer dest.
If an attacker crafts a malicious compressed stream that causes outSizeCur to exceed outSize, the memcpy will write beyond the end of the dest buffer, overwriting whatever happens to live adjacent to it on the heap.
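One consequence is worth spelling out. Assuming outSize is an unsigned size type such as size_t (an assumption; the excerpt does not show its declaration), the later outSize -= outSizeCur does not go negative once outSizeCur exceeds outSize. It wraps around to a very large value, so any later "remaining space" check would also be fooled. The helper names below are hypothetical and exist only to illustrate the arithmetic:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring `outSize -= outSizeCur` for an unsigned
 * size type (assumed; the excerpt does not show the declaration). */
static size_t remaining_after_copy(size_t out_size, size_t out_size_cur)
{
    /* Unsigned subtraction wraps modulo SIZE_MAX + 1; it never goes negative. */
    return out_size - out_size_cur;
}

/* Usage: 4 bytes of space, 10 bytes "produced" by a malformed stream. */
static void wraparound_demo(void)
{
    assert(remaining_after_copy(16, 4) == 12);          /* normal case */
    assert(remaining_after_copy(4, 10) == SIZE_MAX - 5); /* wrapped */
}
```

After the wrap, the bookkeeping claims nearly the whole address space is still available, which is why the check has to happen before the copy rather than after it.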
The Broader Attack Surface
This isn't an isolated issue. The PR description identifies related patterns in the encoding side as well:
- LzmaEnc.c:2930: User-supplied data of size size is copied into p->data without bounds validation.
- LzmaEnc.c:515: A left-shift operation on lclp computes the litProbs array copy size. If lclp is attacker-influenced, the shift can produce an oversized value, again passed to a copy operation without a safety check.
All three paths share the same root cause: lengths derived from external input are trusted implicitly.
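For the shift-based size, one way to avoid trusting the operand is to range-check it before shifting. The sketch below is not the encoder's code: the function name is hypothetical, and the limits (lc at most 8, lp at most 4) and the 0x300 table size are assumptions taken from the LZMA properties encoding rather than from the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limits from the LZMA properties encoding: lc <= 8, lp <= 4. */
#define LC_MAX 8u
#define LP_MAX 4u
/* Assumed number of probability entries per literal context. */
#define LIT_SIZE 0x300u

/* Hypothetical helper: computes the literal-probability count only after
 * the shift operand has been range-checked.
 * Returns 0 on success, -1 if lc or lp is out of range. */
static int lit_probs_count(unsigned lc, unsigned lp, size_t *count)
{
    assert(count != NULL);
    if (lc > LC_MAX || lp > LP_MAX)
        return -1;                          /* reject before shifting */
    *count = (size_t)LIT_SIZE << (lc + lp); /* at most 0x300 << 12 */
    return 0;
}
```

With the operand bounded to 12, the result is at most 0x300000 entries, so the shift cannot overflow and the copy size has a known ceiling.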
How Could This Be Exploited?
The attack scenario is straightforward for anyone who can supply compressed data to the application:
- Craft a malicious .lzma or .7z archive: The attacker constructs a compressed stream where the decompressed output length for a single chunk exceeds the destination buffer size.
- Trigger decompression: The application opens the file, passes it to the LZMA decoder, and calls LzmaDec_DecodeToBuf.
- Heap corruption occurs: memcpy writes beyond the buffer boundary, overwriting heap metadata, adjacent objects, or function pointers.
- Exploitation: Depending on the heap layout and platform, the attacker may achieve:
  - Arbitrary code execution by overwriting a function pointer or vtable entry
  - Denial of service by corrupting heap metadata and triggering a crash
  - Information disclosure if the corruption causes sensitive data to be returned to the caller
This is classified as CWE-122: Heap-based Buffer Overflow, and it maps to CVSS critical severity — for good reason. Compressed file parsers are frequently exposed to untrusted input, and heap overflows in native code are among the most reliably exploitable vulnerability classes.
The Fix
The fix is elegant in its simplicity. A single bounds check was added immediately before the memcpy call:
// LzmaDec.c:1226 — AFTER FIX
outSizeCur = p->dicPos - dicPos;
// ✅ Validate before copying
if (outSizeCur > outSize)
    return SZ_ERROR_DATA;
memcpy(dest, p->dic + dicPos, outSizeCur);
dest += outSizeCur;
outSize -= outSizeCur;
Why This Works
The check if (outSizeCur > outSize) ensures that the number of bytes about to be copied never exceeds the remaining space in the destination buffer. If a malformed stream would cause an overflow, the function returns SZ_ERROR_DATA — a defined error code indicating the input data is invalid — before any memory corruption can occur.
This approach is correct for several reasons:
- Fail-fast behavior: The function returns an error immediately rather than attempting partial recovery, preventing any ambiguous state.
- Semantically accurate: If outSizeCur > outSize, the compressed data is genuinely malformed or malicious. SZ_ERROR_DATA is the appropriate signal.
- Negligible performance impact: A single integer comparison adds almost no overhead to what is already a CPU-intensive decompression loop.
- Minimal diff: The fix is surgical — it doesn't restructure logic, introduce new dependencies, or risk regressions in the happy path.
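The checked copy step can be modeled in isolation. This is a minimal sketch, not the SDK's function: flush_step, STEP_OK, and STEP_ERROR_DATA are hypothetical stand-ins for the copy inside LzmaDec_DecodeToBuf and for SZ_OK / SZ_ERROR_DATA, and the extra new_pos < old_pos guard is an addition of this sketch, not part of the fix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the decoder's status codes. */
enum { STEP_OK = 0, STEP_ERROR_DATA = 1 };

/* Copies the bytes the decoder produced in dic[old_pos, new_pos) to *dest,
 * then advances *dest and shrinks *out_size. Nothing is written and no
 * state changes if the bytes do not fit. */
static int flush_step(unsigned char **dest, size_t *out_size,
                      const unsigned char *dic,
                      size_t old_pos, size_t new_pos)
{
    assert(dest != NULL && *dest != NULL && out_size != NULL && dic != NULL);
    if (new_pos < old_pos)
        return STEP_ERROR_DATA;      /* the subtraction would wrap */
    size_t produced = new_pos - old_pos;
    if (produced > *out_size)
        return STEP_ERROR_DATA;      /* the bounds check from the fix */
    memcpy(*dest, dic + old_pos, produced);
    *dest += produced;
    *out_size -= produced;
    return STEP_OK;
}
```

Because the check runs before memcpy and before the cursor updates, a rejected step leaves the destination pointer and the remaining size exactly as they were.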
Before and After at a Glance
| | Before | After |
|---|---|---|
| Bounds check | ❌ None | ✅ outSizeCur > outSize |
| On malformed input | Heap corruption | Returns SZ_ERROR_DATA |
| Attack surface | Open | Closed |
| Performance impact | — | Negligible |
Prevention & Best Practices
This vulnerability is a textbook example of a class of bugs that has plagued C and C++ codebases for decades. Here's how to prevent it systematically.
1. Never Trust Lengths Derived from External Input
Any value that originates from a file, network packet, or user input — even indirectly through computation — must be treated as untrusted. Before using such a value as a copy length, validate it explicitly:
// ❌ Dangerous: length from external source, no validation
memcpy(dest, src, external_length);
// ✅ Safe: validate before use
if (external_length > dest_capacity) {
    return ERROR_INVALID_INPUT;
}
memcpy(dest, src, external_length);
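In a real parser the length usually comes out of the input itself, so it has to be checked against two limits: what is left of the input and what fits in the destination. The sketch below assumes a made-up record layout (2-byte little-endian length followed by the payload); read_record and the layout are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout: 2-byte little-endian length, then payload.
 * Returns 0 and sets *payload_len on success, -1 on any invalid length. */
static int read_record(const uint8_t *in, size_t in_len,
                       uint8_t *dest, size_t dest_capacity,
                       size_t *payload_len)
{
    assert(in != NULL && dest != NULL && payload_len != NULL);
    if (in_len < 2)
        return -1;                  /* the header itself is truncated */
    size_t len = (size_t)in[0] | ((size_t)in[1] << 8);
    if (len > in_len - 2)
        return -1;                  /* would read past the input */
    if (len > dest_capacity)
        return -1;                  /* would write past the output */
    memcpy(dest, in + 2, len);
    *payload_len = len;
    return 0;
}
```

Comparing against in_len - 2 (after confirming in_len >= 2) instead of computing len + 2 keeps the check itself free of integer overflow.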
2. Use Safer Memory Functions
Where possible, prefer bounds-checking variants:
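// Prefer memcpy_s (C11 Annex K) where available; Annex K is optional
// and many C libraries do not ship it
memcpy_s(dest, dest_size, src, count);
// Or track sizes explicitly. assert() compiles out under NDEBUG,
// so release builds still need a real check before the copy.
assert(count <= dest_remaining);
memcpy(dest, src, count);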
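Because Annex K is an optional part of C11, portable code generally needs a fallback. The following is a sketch of one possible wrapper, with bounded_copy as a hypothetical name. It performs its own check first in every configuration, so the result does not depend on how a platform's Annex K constraint handler behaves, and it uses memcpy_s only when the implementation advertises it through __STDC_LIB_EXT1__.

```c
#define __STDC_WANT_LIB_EXT1__ 1   /* request Annex K, if the library has it */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copies count bytes only if they fit in dest_size.
 * Returns 0 on success, -1 on a null pointer or an oversized count. */
static int bounded_copy(void *dest, size_t dest_size,
                        const void *src, size_t count)
{
    if (dest == NULL || src == NULL || count > dest_size)
        return -1;                 /* explicit check, active in all builds */
#ifdef __STDC_LIB_EXT1__
    return memcpy_s(dest, dest_size, src, count) == 0 ? 0 : -1;
#else
    memcpy(dest, src, count);      /* fallback: the check above guards it */
    return 0;
#endif
}
```

Unlike the assert-based variant, this check stays in place when NDEBUG is defined.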
3. Enable Compiler and Runtime Mitigations
Modern toolchains offer multiple layers of protection that can detect or limit the impact of buffer overflows:
- AddressSanitizer (ASan): Detects out-of-bounds memory accesses at runtime. Run your test suite with -fsanitize=address.
- Stack canaries: Enable with -fstack-protector-all (GCC/Clang). These guard stack frames only; they do not detect heap overflows like this one.
- FORTIFY_SOURCE: Compile with -D_FORTIFY_SOURCE=2 (with optimization enabled) to add compile-time and runtime checks on memcpy, strcpy, and similar functions.
- Control Flow Integrity (CFI): Limits the impact of heap corruption on control flow.
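# Example: hardened release build (FORTIFY_SOURCE requires optimization)
gcc -O2 -D_FORTIFY_SOURCE=2 -fstack-protector-all \
    -o myapp myapp.c
# Separate sanitizer build for testing; FORTIFY_SOURCE is left off here
# because it can hide some overflows from ASan
gcc -O1 -g -fsanitize=address,undefined \
    -o myapp_asan myapp.c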
4. Fuzz Your Parsers
Compression libraries and file format parsers are prime candidates for fuzzing. Tools like libFuzzer and AFL++ are highly effective at finding exactly this class of bug by generating malformed inputs automatically:
# Example: fuzz a decompression function with libFuzzer
clang -fsanitize=fuzzer,address -o fuzz_lzma fuzz_lzma.c LzmaDec.c
./fuzz_lzma corpus/
Fuzzing with a sanitizer enabled would very likely have found this vulnerability: a malformed stream that produces an oversized outSizeCur is exactly the kind of input a fuzzer tends to generate.
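A harness for this kind of target can be very small. In the sketch below, LLVMFuzzerTestOneInput is libFuzzer's standard entry point, but toy_rle_decode is a hypothetical stand-in (a trivial run-length decoder, not LZMA) so that the example stays self-contained. A real harness would call the LZMA decoder at that point instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in decoder: input is (run length, byte) pairs.
 * Returns 0 and sets *out_len on success, -1 if a run does not fit. */
static int toy_rle_decode(const uint8_t *in, size_t in_len,
                          uint8_t *out, size_t out_cap, size_t *out_len)
{
    size_t produced = 0;
    for (size_t i = 0; i + 1 < in_len; i += 2) {
        size_t run = in[i];
        if (run > out_cap - produced)
            return -1;             /* length from input: check, then copy */
        memset(out + produced, in[i + 1], run);
        produced += run;
    }
    *out_len = produced;
    return 0;
}

/* libFuzzer entry point: called once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    uint8_t out[64];
    size_t produced = 0;
    if (toy_rle_decode(data, size, out, sizeof out, &produced) == 0) {
        /* Invariant the fuzzer should never be able to break. */
        if (produced > sizeof out)
            abort();
    }
    return 0;
}
```

Keeping the output buffer small and fixed makes an overflowing length easy for the fuzzer to reach, and building with -fsanitize=fuzzer,address turns any out-of-bounds write into an immediate, reproducible crash.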
5. Apply the Principle of Least Privilege
Even if exploitation occurs, limiting the process's privileges reduces the blast radius. Run decompression in a sandboxed process or with reduced permissions where your architecture allows it.
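Process isolation is one building block for this. The sketch below, which assumes a POSIX system, runs a job in a forked child so that a crash in the decoder cannot take down or corrupt the parent. It does not drop any privileges by itself; a real sandbox would additionally restrict what the child is allowed to do. All function names are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs work() in a child process. Returns the child's exit code (0..255),
 * or -1 if the child could not be started or was killed by a signal. */
static int run_isolated(int (*work)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(work() & 0xff);      /* child: never returns to the caller */

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;             /* retry only on interrupted waits */
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Hypothetical stand-ins for a decompression job. */
static int well_behaved_job(void) { return 0; }
static int crashing_job(void) { abort(); }
```

Here run_isolated(well_behaved_job) reports the child's exit code, while run_isolated(crashing_job) reports -1 and the parent keeps running. Passing decoded output back to the parent (for example over a pipe) is left out of this sketch.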
6. Reference Security Standards
- CWE-122: Heap-based Buffer Overflow — https://cwe.mitre.org/data/definitions/122.html
- CWE-20: Improper Input Validation — https://cwe.mitre.org/data/definitions/20.html
- OWASP: Buffer Overflow — https://owasp.org/www-community/vulnerabilities/Buffer_Overflow
- SEI CERT C Coding Standard: MEM35-C — Allocate sufficient memory for an object
Conclusion
A two-line fix — one comparison, one early return — closed a critical attack path that could have allowed heap corruption via malformed compressed input. That's the nature of memory safety bugs in C: the vulnerability is often small and subtle, but the consequences can be severe.
The key lessons from this vulnerability:
- Lengths from compressed or encoded data are attacker-controlled. Always validate them before use.
- memcpy has no safety net. It will copy exactly what you tell it to, even if that means writing off the end of your buffer.
- Fail fast and loudly. Returning an error on invalid input is always better than proceeding with corrupted state.
- Fuzz your parsers. Automated input generation is one of the most effective ways to find this class of bug before attackers do.
Memory corruption vulnerabilities in compression libraries have a long history of high-severity CVEs — from zlib to libpng to the LZMA SDK itself. The pattern is consistent, and so is the fix: validate inputs, check bounds, and never assume that data derived from an external source is well-formed.
Secure coding isn't about being paranoid — it's about being precise.
This vulnerability was identified and fixed as part of an automated security scanning workflow. The fix was verified by build pipeline, automated re-scan, and LLM-assisted code review.