What is a heap buffer overflow in LZMA decompression?

It is a vulnerability where decompressing attacker-crafted LZMA data causes a memcpy operation to write beyond the allocated output buffer on the heap, corrupting adjacent memory and potentially enabling code execution.

How do you prevent heap buffer overflows in C decompression code?

Always validate that the copy length does not exceed the remaining space in the destination buffer before calling memcpy. Use explicit bounds checks tied to the actual allocated buffer size, not values derived from compressed input.

What CWE is a heap buffer overflow?

CWE-122 (Heap-based Buffer Overflow), a subtype of CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer).

Is input validation alone enough to prevent this heap overflow?

Not if validation only occurs at a high level. The fix must be placed immediately before the dangerous memcpy call, using the actual remaining buffer capacity as the upper bound, because decompression state can diverge from initial input checks.

Can static analysis detect this type of heap buffer overflow?

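Often. Static analysis tools such as Semgrep, CodeQL, and Coverity can flag memcpy calls whose length argument is derived from untrusted input without an intervening bounds check, which is the pattern exploited here. Detection depends on the rules in use and on how well the tool tracks data flow through decoder state, so treat it as a complement to fuzzing and code review rather than a guarantee.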

Heap Overflow in LZMA Decompression: When Attacker-Controlled Data Meets memcpy

Introduction

Compression libraries are everywhere. They decompress archives, stream data, and sit quietly inside applications handling files users upload, download, or open daily. Because of this ubiquity, they make an attractive and high-value target for attackers. A vulnerability in a compression routine doesn't just affect one feature — it can affect every code path that processes compressed data.

This post covers a critical heap buffer overflow discovered and fixed in deps/lzma/src/LzmaDec.c, the LZMA decompression engine. The root cause is a classic but dangerous pattern: a length value derived from attacker-controlled input is used in a memcpy call without first verifying it fits within the destination buffer. One malformed compressed stream is all it takes to corrupt heap memory.

If you write C or C++, work with compression libraries, or ship software that processes user-supplied binary data, this one is worth understanding deeply.

The Vulnerability Explained

What Is LZMA?

LZMA (Lempel–Ziv–Markov chain Algorithm) is a lossless data compression algorithm used in formats like .7z, .xz, and .lzma. It's embedded in many applications and system utilities. The decompression logic is complex by nature — it must faithfully reconstruct arbitrary byte sequences from compact encoded representations — and that complexity creates opportunities for subtle bugs.

The Vulnerable Code Path

The vulnerability lives in LzmaDec_DecodeToBuf, the core function responsible for decompressing LZMA data into a caller-supplied output buffer. Here's the vulnerable section (before the fix):

// LzmaDec.c:1226 — BEFORE FIX
outSizeCur = p->dicPos - dicPos;
// ⚠️ outSizeCur is derived from decompressor internal state,
// influenced by the compressed input stream
memcpy(dest, p->dic + dicPos, outSizeCur);
dest += outSizeCur;
outSize -= outSizeCur;

The variable outSizeCur represents how many bytes the current decompression step produced. It is calculated from p->dicPos and dicPos, values that depend on decoder state and therefore on the contents of the compressed input stream. The problem: at the point of the copy, this value is never checked against outSize, the actual remaining capacity of the destination buffer dest.

If an attacker crafts a malicious compressed stream that causes outSizeCur to exceed outSize, the memcpy will write beyond the end of the dest buffer, overwriting whatever happens to live adjacent to it on the heap.
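To make the failure condition concrete, here is a minimal sketch that models only the arithmetic, not the SDK's decoder. The struct and function names are hypothetical; the fields stand in for dicPos, p->dicPos, and outSize from the excerpt above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one decode step: only the positions matter here. */
typedef struct {
    size_t dic_pos_before; /* stands in for the local dicPos */
    size_t dic_pos_after;  /* stands in for p->dicPos after decoding */
    size_t out_remaining;  /* stands in for outSize: space left in dest */
} step_model;

/* Returns true when copying this step's output would run past dest. */
bool step_would_overflow(const step_model *s)
{
    /* A position that moved backwards would wrap the unsigned subtraction. */
    if (s->dic_pos_after < s->dic_pos_before)
        return true;

    size_t produced = s->dic_pos_after - s->dic_pos_before; /* outSizeCur */
    return produced > s->out_remaining;
}
```

With 64 bytes left in dest, a step that produces 64 bytes is fine, while a step that produces 65 is exactly the case the unchecked memcpy would turn into a one-byte heap overwrite.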

The Broader Attack Surface

This isn't an isolated issue. The PR description identifies related patterns in the encoding side as well:

- LzmaEnc.c:2930: User-supplied data of length size is copied into p->data without bounds validation.
- LzmaEnc.c:515: A left-shift operation on lclp computes the litProbs array copy size. If lclp is attacker-influenced, the shift can produce an oversized value, which is again passed to a copy operation without a safety check.

All three paths share the same root cause: lengths derived from external input are trusted implicitly.
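The shift-based size computation can be guarded the same way. The following is a sketch under assumptions, not the SDK's code: it assumes the usual LZMA format limits (lc at most 8, lp at most 4, and 0x300 probabilities per literal context), and the helper name lit_probs_count is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LIT_SIZE 0x300u /* probabilities per literal context (assumed LZMA constant) */
#define LCLP_MAX 12u    /* assumed limit: lc <= 8 plus lp <= 4 */

/* Hypothetical helper: computes LIT_SIZE << lclp into *count.
 * Returns false, leaving *count untouched, if lclp is out of range. */
bool lit_probs_count(uint32_t lclp, size_t *count)
{
    if (lclp > LCLP_MAX)
        return false; /* reject before shifting: the shift amount is untrusted */

    *count = (size_t)LIT_SIZE << lclp;
    return true;
}
```

Validating the shift amount first keeps the result within a known maximum (0x300 << 12 entries), so the later copy size can be compared against the allocation with no risk of a wrapped or oversized value.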

How Could This Be Exploited?

The attack scenario is straightforward for anyone who can supply compressed data to the application:

1. Craft a malicious .lzma or .7z archive: The attacker constructs a compressed stream where the decompressed output length for a single chunk exceeds the destination buffer size.
2. Trigger decompression: The application opens the file, passes it to the LZMA decoder, and calls LzmaDec_DecodeToBuf.
3. Heap corruption occurs: memcpy writes beyond the buffer boundary, overwriting heap metadata, adjacent objects, or function pointers.
4. Exploitation: Depending on the heap layout and platform, the attacker may achieve:
   - Arbitrary code execution by overwriting a function pointer or vtable entry
   - Denial of service by corrupting heap metadata and triggering a crash
   - Information disclosure if the corruption causes sensitive data to be returned to the caller

This is classified as CWE-122: Heap-based Buffer Overflow, and it is rated critical severity for good reason. Compressed file parsers are frequently exposed to untrusted input, and heap overflows in native code remain a well-understood route to code execution, even though modern mitigations can make exploitation harder.

The Fix

The fix is elegant in its simplicity. A single bounds check was added immediately before the memcpy call:

// LzmaDec.c:1226 — AFTER FIX
outSizeCur = p->dicPos - dicPos;

// ✅ Validate before copying
if (outSizeCur > outSize)
  return SZ_ERROR_DATA;

memcpy(dest, p->dic + dicPos, outSizeCur);
dest += outSizeCur;
outSize -= outSizeCur;

Why This Works

The check if (outSizeCur > outSize) ensures that the number of bytes about to be copied never exceeds the remaining space in the destination buffer. If a malformed stream would cause an overflow, the function returns SZ_ERROR_DATA — a defined error code indicating the input data is invalid — before any memory corruption can occur.

This approach is correct for several reasons:

- Fail-fast behavior: The function returns an error immediately rather than attempting partial recovery, so the oversized copy never happens.
- Semantically accurate: If outSizeCur > outSize, the compressed data is genuinely malformed or malicious. SZ_ERROR_DATA is the appropriate signal.
- Negligible performance impact: A single integer comparison adds almost no overhead to what is already a CPU-intensive decompression loop.
- Minimal diff: The fix is surgical. It doesn't restructure logic, introduce new dependencies, or risk regressions in the happy path.
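The pattern in the fix can be sketched as a standalone copy step. This is an illustration, not the SDK function: copy_decoded_chunk and the COPY_* codes are hypothetical names that mirror the roles of the excerpt's variables and of SZ_ERROR_DATA.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes; COPY_ERROR_DATA plays the role of SZ_ERROR_DATA. */
enum { COPY_OK = 0, COPY_ERROR_DATA = 1 };

/* Copies the bytes produced between dic_pos and new_dic_pos into *dest.
 * On error nothing is written and the output cursor is left unchanged. */
int copy_decoded_chunk(unsigned char **dest, size_t *out_size,
                       const unsigned char *dic,
                       size_t dic_pos, size_t new_dic_pos)
{
    if (new_dic_pos < dic_pos)
        return COPY_ERROR_DATA; /* subtraction below would wrap */

    size_t out_size_cur = new_dic_pos - dic_pos;
    if (out_size_cur > *out_size)
        return COPY_ERROR_DATA; /* the bounds check added by the fix */

    memcpy(*dest, dic + dic_pos, out_size_cur);
    *dest += out_size_cur;
    *out_size -= out_size_cur;
    return COPY_OK;
}
```

Because the check sits directly in front of memcpy and uses the live remaining capacity, it holds no matter how the decoder arrived at its positions.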

Before and After at a Glance

Aspect	Before	After
Bounds check	❌ None	✅ `outSizeCur > outSize`
On malformed input	Heap corruption	Returns `SZ_ERROR_DATA`
This overflow path	Open	Closed
Performance impact	n/a	Negligible

Prevention & Best Practices

This vulnerability is a textbook example of a class of bugs that has plagued C and C++ codebases for decades. Here's how to prevent it systematically.

1. Never Trust Lengths Derived from External Input

Any value that originates from a file, network packet, or user input — even indirectly through computation — must be treated as untrusted. Before using such a value as a copy length, validate it explicitly:

// ❌ Dangerous: length from external source, no validation
memcpy(dest, src, external_length);

// ✅ Safe: validate before use
if (external_length > dest_capacity) {
    return ERROR_INVALID_INPUT;
}
memcpy(dest, src, external_length);
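When the copy starts at an offset into the buffer, the check itself must not overflow. A minimal sketch of one way to write it follows; range_fits is a hypothetical helper name, not part of any library.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if [offset, offset + len) lies inside a buffer
 * of `capacity` bytes. */
bool range_fits(size_t offset, size_t len, size_t capacity)
{
    /* Written without computing `offset + len`, which can wrap around to a
     * small number when an attacker supplies a huge length. */
    return offset <= capacity && len <= capacity - offset;
}
```

The naive form `offset + len <= capacity` would accept an offset of 8 with a length of SIZE_MAX, because the sum wraps to 7. The subtraction form rejects it.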

2. Use Safer Memory Functions

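Where your platform provides them, bounds-checking variants can help, but keep an explicit check as the portable baseline:

// memcpy_s (C11 Annex K) is optional; glibc, for example, does not provide it
memcpy_s(dest, dest_size, src, count);

// Portable alternative: explicit size tracking with a check that survives release builds.
// Do not rely on assert here: it is compiled out when NDEBUG is defined.
if (count > dest_remaining) {
    return ERROR_INVALID_INPUT;
}
memcpy(dest, src, count);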
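Where memcpy_s is unavailable, a small wrapper can give call sites the same shape. The sketch below is a hypothetical fallback, not a reimplementation of Annex K semantics: bounded_memcpy simply refuses the copy and reports failure.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fallback for platforms without memcpy_s.
 * Returns 0 on success, -1 if the copy was refused (nothing is written). */
int bounded_memcpy(void *dest, size_t dest_size, const void *src, size_t count)
{
    if (dest == NULL || src == NULL)
        return -1;
    if (count > dest_size)
        return -1; /* would overflow dest */

    memcpy(dest, src, count);
    return 0;
}
```

Forcing every caller to pass the destination size makes a missing bounds check visible in review, and the check stays active in optimized release builds.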

3. Enable Compiler and Runtime Mitigations

Modern toolchains offer multiple layers of protection that can detect or limit the impact of buffer overflows:

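- AddressSanitizer (ASan): Detects out-of-bounds memory accesses at runtime, including heap overflows. Run your test suite with -fsanitize=address.
- Stack canaries: Enable with -fstack-protector-all (GCC/Clang). These protect stack frames only; they do not detect a heap overflow like this one, but they are inexpensive defense in depth.
- FORTIFY_SOURCE: Compile with -D_FORTIFY_SOURCE=2 and optimization enabled to add compile-time and runtime checks on memcpy, strcpy, and similar functions where the compiler can determine the destination size.
- Control Flow Integrity (CFI): Limits the impact of heap corruption on control flow.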

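# Example: hardened release build
gcc -O2 -D_FORTIFY_SOURCE=2 -fstack-protector-all -o myapp myapp.c

# Example: test build with sanitizers (not for production; built without
# _FORTIFY_SOURCE, which can hide bugs from ASan)
gcc -O1 -g -fsanitize=address,undefined -o myapp_asan myapp.c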

4. Fuzz Your Parsers

Compression libraries and file format parsers are prime candidates for fuzzing. Tools like libFuzzer and AFL++ are highly effective at finding exactly this class of bug by generating malformed inputs automatically:

# Example: fuzz a decompression function with libFuzzer
clang -fsanitize=fuzzer,address -o fuzz_lzma fuzz_lzma.c LzmaDec.c
./fuzz_lzma corpus/

Fuzzing is well suited to finding this kind of bug: a malformed stream that produces an oversized outSizeCur is the sort of input a coverage-guided fuzzer tends to reach, and ASan turns the resulting overflow into an immediate, reportable crash.
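A fuzz target for the build command above could look roughly like the sketch below. To keep it self-contained, it fuzzes a toy run-length decoder (toy_rle_decode, a hypothetical stand-in). A real harness would call the library's decode entry point at that spot instead. LLVMFuzzerTestOneInput is the entry point libFuzzer expects.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in decoder (hypothetical): input is a sequence of (count, value)
 * byte pairs. Returns 0 on success, -1 on malformed or oversized input. */
static int toy_rle_decode(uint8_t *dest, size_t dest_cap, size_t *written,
                          const uint8_t *src, size_t src_len)
{
    size_t out = 0;

    if (src_len % 2 != 0)
        return -1; /* truncated pair */

    for (size_t i = 0; i < src_len; i += 2) {
        size_t run = src[i];
        if (run > dest_cap - out)
            return -1; /* bounds check before writing */
        memset(dest + out, src[i + 1], run);
        out += run;
    }
    *written = out;
    return 0;
}

/* libFuzzer calls this once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    uint8_t out[64];
    size_t written = 0;

    if (toy_rle_decode(out, sizeof out, &written, data, size) == 0) {
        /* Invariant checked on every accepted input. */
        if (written > sizeof out)
            abort();
    }
    return 0; /* rejected inputs are fine; crashes are the findings */
}
```

Keeping the output buffer small makes overflows easy to reach, and running under ASan means any write past out is reported at the faulting instruction rather than surfacing later as unrelated corruption.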

5. Apply the Principle of Least Privilege

Even if exploitation occurs, limiting the process's privileges reduces the blast radius. Run decompression in a sandboxed process or with reduced permissions where your architecture allows it.
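As a rough illustration of the process-isolation half of this advice, the POSIX sketch below runs a worker in a child process so that a crash cannot take down the parent. All names are hypothetical. It does not reduce privileges by itself; a real sandbox would drop them in the child at the marked spot and would return the decoded output to the parent over a pipe or shared memory.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker signature: returns 0 on success. */
typedef int (*worker_fn)(const unsigned char *data, size_t len);

/* Runs `worker` in a child process.
 * Returns 0 if the child exited normally with status 0, -1 otherwise. */
int run_isolated(worker_fn worker, const unsigned char *data, size_t len)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: a real sandbox would drop privileges here,
         * before touching the untrusted input. */
        _exit(worker(data, len) == 0 ? 0 : 1);
    }

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Toy workers for demonstration only. */
int worker_ok(const unsigned char *data, size_t len)
{
    (void)data; (void)len;
    return 0;
}

int worker_crash(const unsigned char *data, size_t len)
{
    (void)data; (void)len;
    abort(); /* simulates a decoder crashing on malformed input */
}
```

Calling run_isolated(worker_crash, ...) returns -1 to the parent instead of terminating it, which is the containment property the sandboxing advice is after.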

6. Reference Security Standards

CWE-122: Heap-based Buffer Overflow — https://cwe.mitre.org/data/definitions/122.html
CWE-20: Improper Input Validation — https://cwe.mitre.org/data/definitions/20.html
OWASP: Buffer Overflow — https://owasp.org/www-community/vulnerabilities/Buffer_Overflow
SEI CERT C Coding Standard: MEM35-C — Allocate sufficient memory for an object

Conclusion

A two-line fix — one comparison, one early return — closed a critical attack path that could have allowed heap corruption via malformed compressed input. That's the nature of memory safety bugs in C: the vulnerability is often small and subtle, but the consequences can be severe.

The key lessons from this vulnerability:

Lengths from compressed or encoded data are attacker-controlled. Always validate them before use.
memcpy has no safety net. It will copy exactly what you tell it to, even if that means writing off the end of your buffer.
Fail fast and loudly. Returning an error on invalid input is always better than proceeding with corrupted state.
Fuzz your parsers. Automated input generation is one of the most effective ways to find this class of bug before attackers do.

Memory corruption vulnerabilities in compression libraries have a long history of high-severity CVEs — from zlib to libpng to the LZMA SDK itself. The pattern is consistent, and so is the fix: validate inputs, check bounds, and never assume that data derived from an external source is well-formed.

Secure coding isn't about being paranoid — it's about being precise.

This vulnerability was identified and fixed as part of an automated security scanning workflow. The fix was verified by build pipeline, automated re-scan, and LLM-assisted code review.

Vulnerability at a Glance


cwe	CWE-122
fix	Added output size bounds check before the memcpy call in LzmaDec.c
risk	Arbitrary code execution or process crash via attacker-crafted compressed input
language	C
root cause	Copy length derived from attacker-controlled compressed data passed to memcpy without bounds validation
vulnerability	Heap Buffer Overflow via unchecked memcpy in LZMA decompression