How heap buffer overflow happens in C JMA archive extraction and how to fix it
Introduction
The jma/jma.cpp file is responsible for parsing and extracting files from JMA-format ROM archives — a compressed archive format used in emulation tooling. Inside jma_open::extract_file(), the code decompresses chunk data and copies it into an output buffer. It's the kind of low-level file parsing code that handles attacker-controlled input by design, which makes it a high-value target.
A flaw at line 446 of jma.cpp meant that every call to extract_file() with a malicious archive was a potential heap corruption event. The memcpy call trusted two values — first_chunk_offset and copy_amount — that were derived from the archive's own header metadata, without ever checking whether those values actually fit inside the buffers they were indexing into.
This is a textbook example of CWE-120: Buffer Copy without Checking Size of Input, and it's particularly dangerous because JMA archives are user-supplied files: any user who can get the application to open an archive controls the values that feed this memcpy.
The Vulnerability Explained
Here is the vulnerable code as it existed before the fix, centered on line 446:
// jma/jma.cpp — BEFORE FIX (vulnerable)
size_t copy_amount = our_file_size - i > chunk_size - first_chunk_offset
? chunk_size - first_chunk_offset
: our_file_size - i;
// ⚠️ No bounds check — first_chunk_offset and copy_amount come from archive header
memcpy(buffer + i, decomp_buffer + first_chunk_offset, copy_amount);
first_chunk_offset = 0;
i += copy_amount;
At first glance, the copy_amount calculation looks like it's doing something safe — it takes the minimum of two values. But the problem is what those values represent and where they come from.
chunk_sizeis the declared size of the decompressed chunk, read from the archive header.first_chunk_offsetis a byte offset intodecomp_buffer, also from the archive header.our_file_sizeis the declared output file size, again from the archive.iis the running write position intobuffer.
None of these values are validated against the actual allocated sizes of decomp_buffer or buffer. The copy_amount ternary only ensures the value is consistent with the archive's own declared sizes — but a crafted archive can declare whatever sizes it wants.
The Attack Scenario
An attacker creates a malicious JMA archive with the following crafted metadata:
first_chunk_offsetis set to a value near the end ofdecomp_buffer— say,chunk_size - 4.copy_amountis computed aschunk_size - first_chunk_offset, which evaluates to4.- But
our_file_sizeis declared as much larger than the actualbufferallocation.
Because i accumulates across iterations and the loop doesn't stop when i approaches the real buffer boundary, the write target buffer + i can march past the end of the heap allocation. Meanwhile, decomp_buffer + first_chunk_offset can point past the end of the source allocation.
The result: attacker-controlled bytes written to arbitrary heap memory. Depending on what lives adjacent on the heap (vtable pointers, allocator metadata, other objects), this can lead to:
- Heap metadata corruption → crash or allocator exploitation
- Object pointer overwrite → potential code execution
- Silent data corruption → logic vulnerabilities downstream
This is not theoretical. The scanner confirmed this as exploitable with a concrete exploitation scenario, and the vulnerability affects production code that processes real user-supplied archive files.
The Fix
The fix is surgical and correct. A bounds check was inserted immediately before the memcpy call, validating both the source and destination sides of the copy:
// jma/jma.cpp — AFTER FIX (safe)
size_t copy_amount = our_file_size - i > chunk_size - first_chunk_offset
? chunk_size - first_chunk_offset
: our_file_size - i;
// ✅ Bounds check added — validates BOTH source and destination
if (first_chunk_offset + copy_amount > chunk_size || i + copy_amount > our_file_size) {
delete[] comp_buffer;
throw(JMA_BAD_FILE);
}
memcpy(buffer + i, decomp_buffer + first_chunk_offset, copy_amount);
first_chunk_offset = 0;
i += copy_amount;
Why Both Checks Are Necessary
The check has two independent conditions, and both matter:
first_chunk_offset + copy_amount > chunk_size
This guards the read side (decomp_buffer). It ensures the source read doesn't go past the end of the decompressed chunk buffer. Without this, a crafted offset could cause reads from unallocated heap memory.
i + copy_amount > our_file_size
This guards the write side (buffer). It ensures the destination write doesn't go past the end of the output buffer. Without this, accumulated writes across loop iterations could overflow the output allocation.
Cleanup Before Throw
Notice that the fix also calls delete[] comp_buffer before throwing. This is important — comp_buffer was heap-allocated earlier in the function, and throwing without freeing it would cause a memory leak. The fix correctly handles this cleanup, maintaining both safety and resource hygiene.
Before/After Comparison
| Aspect | Before | After |
|---|---|---|
| Source bounds checked | ❌ No | ✅ first_chunk_offset + copy_amount <= chunk_size |
| Destination bounds checked | ❌ No | ✅ i + copy_amount <= our_file_size |
| Malformed archive handling | Silent overflow | JMA_BAD_FILE exception |
| Memory cleanup on error | N/A | delete[] comp_buffer before throw |
Related Lines to Review
The PR notes that similar memcpy patterns exist at lines 451 and 467 in the same file. These should be audited with the same lens — any memcpy whose parameters flow from archive-controlled data needs the same treatment.
Prevention & Best Practices
1. Treat All File-Derived Values as Untrusted
Any value read from a file, network packet, or serialized format is attacker-controlled. This is especially true for archive parsers, image decoders, and protocol implementations. Validate every offset and length before use.
2. The Golden Rule for memcpy Safety
Before every memcpy(dst, src, n) call, verify:
// Always check both sides
assert(src_offset + n <= src_buffer_size); // Source won't overread
assert(dst_offset + n <= dst_buffer_size); // Destination won't overwrite
3. Use AddressSanitizer During Development
# Compile with AddressSanitizer to catch buffer overflows at runtime
g++ -fsanitize=address -fno-omit-frame-pointer -g jma.cpp -o jma_asan
AddressSanitizer would have caught this exact overflow the first time a malformed archive was processed in a test environment.
4. Fuzz Your Archive Parsers
Archive parsing code is an ideal target for fuzzing. Tools like libFuzzer or AFL++ can generate thousands of malformed archives per second and will quickly find cases where header values cause crashes:
# Example: fuzz the JMA extraction path
clang++ -fsanitize=fuzzer,address jma_fuzz_target.cpp jma.cpp -o jma_fuzzer
./jma_fuzzer corpus/
5. Consider Bounds-Checked Abstractions
In modern C++, prefer std::span for buffer views, which carries size information and enables bounds checking:
#include <span>
// Instead of raw pointer + separate size variable,
// use std::span which keeps size and pointer together
void extract_chunk(std::span<uint8_t> dest, std::span<const uint8_t> src, size_t offset) {
if (offset >= src.size()) throw std::out_of_range("bad offset");
auto src_view = src.subspan(offset); // Bounds-checked subspan
// ...
}
6. Reference Standards
- CWE-120: Buffer Copy without Checking Size of Input
- CWE-122: Heap-based Buffer Overflow
- OWASP: Buffer Overflow Prevention Cheat Sheet
Key Takeaways
- Archive header values are attacker-controlled input.
first_chunk_offset,chunk_size, andour_file_sizeall come from the JMA file itself — never assume they are safe to use directly asmemcpyparameters. - The ternary minimum trick is not a bounds check. The
copy_amountcalculation inextract_file()looked protective but only enforced internal consistency of the archive's own declared values, not their validity against actual buffer allocations. - Both source and destination must be validated independently. Checking only the write side (
i + copy_amount) would have left the read side (decomp_buffer + first_chunk_offset) vulnerable to out-of-bounds reads, and vice versa. - Lines 451 and 467 in the same file use similar patterns and should be reviewed with the same scrutiny — a single-function audit is not enough when the same pattern recurs.
- AddressSanitizer + fuzzing would have caught this before production. For any code that parses binary formats, these tools should be part of the standard development workflow.
How Orbis AppSec Detected This
- Source: JMA archive file header metadata — specifically the
first_chunk_offset,chunk_size, andour_file_sizefields parsed from the archive byjma_open::extract_file() - Sink:
memcpy(buffer + i, decomp_buffer + first_chunk_offset, copy_amount)atjma/jma.cpp:446 - Missing control: No validation that
first_chunk_offset + copy_amount <= chunk_size(source bounds) ori + copy_amount <= our_file_size(destination bounds) before the copy - CWE: CWE-120 — Buffer Copy without Checking Size of Input
- Fix: Added a pre-copy bounds check at line 446 that throws
JMA_BAD_FILEand freescomp_bufferif either the source or destination bounds would be exceeded
Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix. Try Orbis AppSec on your repositories to find and fix issues like this automatically.
Conclusion
The heap buffer overflow in jma_open::extract_file() is a clear example of why archive parsers require a fundamentally adversarial mindset: every byte in the input file is a potential attack vector. The vulnerability wasn't caused by a complex logic error — it was a straightforward missing validation before a memcpy. The fix is equally straightforward: five lines of bounds checking that transform a silent heap corruption into a clean, recoverable error.
For developers working on any binary format parser in C or C++, this case reinforces a non-negotiable rule: validate every offset and length derived from external data before using it in a memory operation, on both the source and destination sides, every time.