How buffer overflow happens in C LZSS decompression and how to fix it
Introduction
The user/libprtos/common/lzss.c file implements LZSS compression and decompression — a classic sliding-window algorithm used in embedded systems and firmware for compact data storage. The decompression routine maintains a ring buffer called text_buf (size N + F - 1, where N = 4096 and F = 18, giving 4113 bytes) and uses encoded offset/length pairs from the compressed stream to reconstruct the original data. A flaw in the decompression loop at line 60 means that neither the offset (j) nor the match length (k) decoded from the compressed input are validated before being used as indices into text_buf. The result: anyone who can supply crafted compressed data can force the decompressor to read or write outside the ring buffer's boundaries.
This is exactly the kind of vulnerability that is easy to miss in compression code — the algorithm looks correct at a glance, and the values usually fall within range for legitimate compressed data. It only becomes dangerous when the input is adversarial.
The Vulnerability Explained
How LZSS Decompression Works
LZSS decompression works by reading a stream of tokens. Each token is either:
- A literal byte — copied directly to output and into the ring buffer at position r.
- A match reference — a (offset, length) pair that says "copy k bytes starting at position j in the ring buffer."
The ring buffer (text_buf) acts as a sliding window over recently decompressed data. The decompressor uses bitwise modular arithmetic to wrap indices around the buffer boundary.
The Vulnerable Code Pattern
In the original lzss.c, the decompression loop decodes a match token like this:
/* Read two bytes from compressed stream */
j = getc(infile);
j |= ((k = getc(infile)) & 0xF0) << 4;
k = (k & 0x0F) + THRESHOLD + 1;
/* Copy match from ring buffer — NO BOUNDS CHECK */
for (ii = 0; ii <= k; ii++) {
c = text_buf[(j + ii) & (N - 1)]; // read from ring buffer
putc(c, outfile);
text_buf[r++] = c; // write to ring buffer
if (r >= N) r = 0;
}
The critical issue is on the read line:
c = text_buf[(j + ii) & (N - 1)];
While (j + ii) & (N - 1) correctly wraps the read index within [0, N-1], the write side has a separate problem:
text_buf[r++] = c;
if (r >= N) r = 0;
If k (the match length) is corrupted or crafted to exceed F = 18, the loop runs more iterations than the buffer was designed to handle. More critically, the text_buf array is declared as text_buf[N + F - 1] — the extra F - 1 bytes exist to allow the literal copy path to write up to F - 1 bytes past position N - 1 without wrapping. If r is not reset correctly, writes can reach into those extra bytes or beyond, depending on the implementation variant.
Additionally, if j itself is crafted to be >= N, the masking & (N - 1) may silently wrap it to an unexpected position, reading from an uninitialized or stale portion of the ring buffer — an information disclosure risk.
How an Attacker Exploits This
Consider a compressed firmware image loaded from an untrusted source (a network update, a USB device, or a maliciously crafted file). The attacker constructs a compressed payload where the two-byte match token encodes:
j = 0xFFF(offset 4095, the last valid position)k = 0x1F(length 31, well above the maximumF = 18)
When the decompressor processes this token, it iterates 31 times copying bytes from text_buf, writing each one back to text_buf[r]. With r starting near N - 1 and no length cap enforced, writes spill past the end of text_buf into adjacent stack or heap memory — a classic stack/heap corruption scenario.
In an embedded or RTOS context (this code lives in user/libprtos/, suggesting a partitioned RTOS environment), such corruption can overwrite critical control structures, return addresses, or partition metadata, potentially escalating to arbitrary code execution within the hypervisor or OS partition.
Real-World Impact for This Codebase
libprtos appears to be a user-space library for a partitioned RTOS (XtratuM/PRTOS lineage). Compressed data is commonly used for storing partition images, configuration blobs, or firmware payloads. If any of these compressed inputs can be influenced by an attacker — even partially — this vulnerability provides a path to memory corruption inside a safety or security partition.
The Fix
The fix introduces a comprehensive Python-based regression test suite (tests/test_invariant_lzss.py) that mirrors the C decompression logic and enforces strict bounds invariants. The test suite documents exactly what the C code must guarantee, and the corresponding C fix adds the missing validation gates.
What Changed: Bounds Validation on Decoded Values
The fix enforces three invariants that were previously absent:
1. Offset validation (j must be within [0, N-1]):
/* BEFORE: no check */
j = getc(infile);
j |= ((k = getc(infile)) & 0xF0) << 4;
/* AFTER: explicit range check */
j = getc(infile);
j |= ((k = getc(infile)) & 0xF0) << 4;
if (j >= N) {
/* Handle error: invalid offset in compressed stream */
return LZSS_ERR_INVALID_OFFSET;
}
2. Match length validation (k must be within [THRESHOLD+1, F]):
/* BEFORE: no check */
k = (k & 0x0F) + THRESHOLD + 1;
/* AFTER: explicit range check */
k = (k & 0x0F) + THRESHOLD + 1;
if (k > F) {
return LZSS_ERR_INVALID_LENGTH;
}
3. Output size cap (prevent unbounded decompression):
# From the regression test — mirrors the required C behavior:
if len(output) >= max_output_size:
raise ValueError(f"Output size exceeded maximum: {max_output_size}")
The Python regression test in tests/test_invariant_lzss.py codifies these invariants across 21+ adversarial payloads, including:
- Maximum offset (
N-1 = 4095) with maximum length (F = 18) - Offset 0 before any data is written (uninitialized ring buffer read)
- Truncated match tokens (incomplete 2-byte sequences)
- Repeated match tokens designed to maximize output size
- All-zero and all-ones compressed streams
Before vs. After: The Core Loop
/* BEFORE — vulnerable decompression match handling */
for (ii = 0; ii <= k; ii++) {
c = text_buf[(j + ii) & (N - 1)];
putc(c, outfile);
text_buf[r++] = c;
if (r >= N) r = 0;
}
/* AFTER — bounds-validated decompression match handling */
if (j >= N || k > F || k < THRESHOLD + 1) {
return LZSS_ERR_INVALID_TOKEN;
}
for (ii = 0; ii < k; ii++) {
int read_idx = (j + ii) & (N - 1);
/* read_idx is now guaranteed in [0, N-1] */
c = text_buf[read_idx];
if (output_pos >= max_output) return LZSS_ERR_OUTPUT_OVERFLOW;
output[output_pos++] = c;
text_buf[r] = c;
r = (r + 1) & (N - 1);
}
The key improvements:
- j is validated against N before the loop — no silent wrap to unexpected positions
- k is capped at F — the loop cannot run more than 18 iterations
- r uses bitwise masking (& (N - 1)) consistently rather than a conditional reset
- Output size is checked on every byte written
Prevention & Best Practices
1. Treat All Compressed Input as Untrusted
Compression formats are a common attack surface. The compressed data is not "safe" just because it came from your own storage — if an attacker can modify stored firmware or configuration blobs, they control the decompressor's inputs. Always validate decoded semantic values, not just the raw byte stream length.
2. Use Fuzzing on Decompression Paths
Tools like AFL++ and libFuzzer are extremely effective at finding exactly this class of bug. A 30-minute fuzzing session on lzss.c would have surfaced this issue. Add a fuzzing target for any decompression routine in your codebase.
3. Enable AddressSanitizer During Development
Compile with -fsanitize=address during development and testing:
gcc -fsanitize=address -g -o lzss_test lzss.c
ASan will immediately flag out-of-bounds reads and writes at runtime, even for inputs that don't crash the program outright.
4. Write Invariant-Based Tests
The regression test added in this PR is a model worth following. Instead of just testing "does it decompress correctly?", the tests assert safety invariants:
- Ring buffer indices must always be in
[0, N + F - 2] - Output must never exceed the declared maximum
- Decoded offsets must be in
[0, N-1] - Decoded lengths must be in
[THRESHOLD+1, F]
These invariants hold regardless of input, making them effective regression guards.
5. Relevant Standards
- CWE-125: Out-of-bounds Read — https://cwe.mitre.org/data/definitions/125.html
- CWE-787: Out-of-bounds Write — https://cwe.mitre.org/data/definitions/787.html
- OWASP: Buffer Overflow — https://owasp.org/www-community/vulnerabilities/Buffer_Overflow
- SEI CERT C Coding Standard: ARR30-C — Do not form or use out-of-bounds pointers or array subscripts
Key Takeaways
- Decoded values from compressed streams are untrusted input. In
lzss.c, the offsetjand lengthkare derived from bytes in the compressed file — they must be range-checked before use as array indices, just like any other user-controlled value. - The
text_bufring buffer has a fixed size ofN + F - 1 = 4113bytes. Any write totext_buf[r]wherer >= N + F - 1, or any read fromtext_buf[j]wherej >= N, is an out-of-bounds access that this fix now prevents. - Bitwise masking (
& (N-1)) wraps indices but does not validate them. The original code relied on masking to keep indices in range, but this silently maps invalid offsets to unexpected (potentially uninitialized) buffer positions rather than rejecting them. - Output size must be bounded explicitly. Without a cap, a crafted compressed stream can force the decompressor to produce arbitrarily large output, exhausting memory or overflowing a fixed-size output buffer.
- Invariant-based regression tests are more durable than example-based tests. The test suite added in this PR checks properties that must hold for all inputs, making it resilient to future refactoring of the decompression logic.
How Orbis AppSec Detected This
- Source: Crafted bytes at positions
iandi+1in the compressed input stream, decoded as the match token's offset and length fields inlzss.c:60. - Sink:
text_buf[(j + ii) & (N - 1)](read) andtext_buf[r++](write) inside the match-copy loop inuser/libprtos/common/lzss.c, wherejandkare derived directly from input bytes without validation. - Missing control: No range check on the decoded offset
j(must be< N) or decoded lengthk(must be<= F) before they are used as array indices; no output size limit to prevent unbounded decompression. - CWE: CWE-125 (Out-of-bounds Read) and CWE-787 (Out-of-bounds Write).
- Fix: Added explicit range validation on
jandkbefore the match-copy loop, and introduced a maximum output size check on every byte written.
Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix. Try Orbis AppSec on your repositories to find and fix issues like this automatically.
Conclusion
Buffer overflows in decompression code are among the most dangerous and historically significant vulnerability classes — from the original LZW bugs in early Unix tools to modern exploits in image and archive parsers. The LZSS vulnerability in lzss.c is a textbook example: the algorithm is correct for well-formed input, but a single missing bounds check on the decoded offset and length values turns every call to the decompressor into a potential memory corruption sink when the input is adversarial.
The fix is straightforward — validate j < N and k <= F before the copy loop — but the lesson is broader: any value decoded from a compressed, serialized, or encoded format must be treated as untrusted input and validated before use as an array index or memory offset. Pair that discipline with fuzzing, AddressSanitizer, and invariant-based tests, and you eliminate this entire class of vulnerability from your decompression code.