How out-of-bounds reads happen in C gettext .mo file parsers and how to fix them
Summary
A missing bounds check in the gettext .mo file parser inside compose/asc-utils-l10n.c allowed a malformed or truncated file to trigger out-of-bounds reads from heap memory. The vulnerability affected two distinct read sites — a memcpy of the full AscLocaleGettextHeader struct at line 131 and a 4-byte offset read at line 224 — neither of which validated that the source buffer was large enough. The fix adds explicit size checks before both reads, rejecting invalid files with a descriptive error instead of reading past the end of allocated memory.
Introduction
The compose/asc-utils-l10n.c file is responsible for parsing localization data — specifically gettext .mo binary files and Qt .qm files — as part of the AppStream Compose toolchain. It is a natural target for malformed-input attacks because it processes files that can be supplied by third parties, packagers, or build pipelines. A flaw in asc_l10n_parse_file_gettext() created a situation where a single truncated .mo file could cause the parser to read heap memory it was never supposed to touch.
The problematic pattern is subtle and common in C: g_bytes_get_data() returns both a pointer and a length, but the length parameter is optional — you can pass NULL if you don't care about it. The original code did exactly that:
data = g_bytes_get_data (bytes, NULL); // length silently discarded
From that point forward, every read from data was unbounded. The very next substantive operation was a memcpy that assumed the buffer was at least sizeof(AscLocaleGettextHeader) bytes long — an assumption that held for well-formed files but silently broke for truncated or crafted ones.
The Vulnerability Explained
What the parser does
When asc_l10n_parse_file_gettext() is called, it reads a .mo file into a GBytes buffer, then immediately copies the first sizeof(AscLocaleGettextHeader) bytes into a local struct h to inspect the magic number and determine byte order:
/* VULNERABLE — before the fix */
data = g_bytes_get_data (bytes, NULL); // (1) length discarded
/* we only strictly need the header */
memcpy (&h, data, sizeof (AscLocaleGettextHeader)); // (2) no size check
If the file on disk is shorter than sizeof(AscLocaleGettextHeader) bytes, whether because it was truncated in transit, deliberately crafted by an attacker, or simply corrupt, the memcpy at step (2) reads past the end of the heap allocation backing bytes. GBytes makes no guarantee of padding after the data, so whatever the allocator placed next (heap metadata or unrelated allocations) can sit immediately after the buffer boundary.
A second vulnerable site existed in the Qt .qm section parser (asc_l10n_parse_file_qt):
/* VULNERABLE — before the fix */
AscLocaleQmSection section = _read_uint8 (data, &m);
guint32 section_len = _read_uint32 (data, &m); // no check that m+4 <= len
Here, _read_uint32 reads 4 bytes at the cursor m and advances it. If fewer than 4 bytes remained between m and len, the read extended past the buffer boundary.
How an attacker exploits this
An attacker who can supply a .mo or .qm file to the AppStream Compose pipeline — for example, by contributing a malicious package to a distribution repository, or by intercepting a build artifact — can craft a file that is exactly 1 byte long. When asc_l10n_parse_file_gettext() processes it:
- g_bytes_get_data() returns a 1-byte heap allocation.
- memcpy(&h, data, sizeof(AscLocaleGettextHeader)) copies sizeof(AscLocaleGettextHeader) bytes (typically 28 bytes on a 64-bit system) starting from that 1-byte buffer.
- The 27 bytes past the buffer boundary are read from adjacent heap memory.
Depending on heap layout, those 27 bytes could contain fragments of other allocations — file paths, authentication tokens, or other sensitive strings processed earlier in the same compose run. In a crash scenario, the contents may appear in a core dump or error log. In a more targeted exploit, the attacker can use the magic-number branch (h.magic == 0x950412de) as an oracle to infer heap contents.
Real-world impact: The AppStream Compose tool runs as part of distribution metadata generation pipelines. A successful exploit could leak heap contents from the compose process, crash the pipeline, or — in combination with a write primitive — escalate to arbitrary code execution.
The Fix
The fix addresses both vulnerable sites with minimal, surgical changes.
Fix 1 — Capture the buffer length and guard the header memcpy
Before:
data = g_bytes_get_data (bytes, NULL);
/* we only strictly need the header */
memcpy (&h, data, sizeof (AscLocaleGettextHeader));
After:
gsize data_len = 0;
// ...
data = g_bytes_get_data (bytes, &data_len); // length is now captured
/* we only strictly need the header */
if (data_len < sizeof (AscLocaleGettextHeader)) {
    g_set_error_literal (error,
                         ASC_COMPOSE_ERROR,
                         ASC_COMPOSE_ERROR_FAILED,
                         "Gettext file is too small to be valid");
    return FALSE;
}
memcpy (&h, data, sizeof (AscLocaleGettextHeader));
The key change is passing &data_len instead of NULL to g_bytes_get_data(). This small change is what makes every subsequent bounds check possible. The guard immediately before memcpy ensures that if the file is shorter than the header struct, the function returns a descriptive GError rather than reading invalid memory. The memcpy itself is unchanged: it was never the problem; the missing precondition was.
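The same check-then-copy pattern can be sketched without GLib. The block below is an illustration under stated assumptions, not the AppStream code: MoHeader, MoStatus, and mo_read_header are hypothetical names, and the seven-field layout follows the header described in the GNU gettext manual rather than the actual AscLocaleGettextHeader definition, which may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for AscLocaleGettextHeader: the seven 32-bit
 * fields of the .mo header as documented in the GNU gettext manual. */
typedef struct {
    uint32_t magic;
    uint32_t revision;
    uint32_t nstrings;
    uint32_t orig_tab_offset;
    uint32_t trans_tab_offset;
    uint32_t hash_tab_size;
    uint32_t hash_tab_offset;
} MoHeader;

#define MO_MAGIC         0x950412deu  /* file written in host byte order */
#define MO_MAGIC_SWAPPED 0xde120495u  /* file written in the other order */

typedef enum { MO_OK = 0, MO_TOO_SMALL, MO_BAD_MAGIC } MoStatus;

/* Copy the header out of `data` only after confirming that `len` covers it.
 * `*swapped` is set to 1 when the file's byte order differs from the host's. */
static MoStatus mo_read_header(const uint8_t *data, size_t len,
                               MoHeader *out, int *swapped)
{
    assert(data != NULL && out != NULL && swapped != NULL);

    if (len < sizeof(MoHeader))
        return MO_TOO_SMALL;             /* reject before touching memory */

    memcpy(out, data, sizeof(MoHeader)); /* safe: len >= sizeof(MoHeader) */

    if (out->magic == MO_MAGIC)
        *swapped = 0;
    else if (out->magic == MO_MAGIC_SWAPPED)
        *swapped = 1;
    else
        return MO_BAD_MAGIC;

    return MO_OK;
}
```

With this shape, a 1-byte input returns MO_TOO_SMALL before any byte of the buffer is read, which mirrors the early return FALSE in the patched function.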
Fix 2 — Guard the Qt section length read
Before:
AscLocaleQmSection section = _read_uint8 (data, &m);
guint32 section_len = _read_uint32 (data, &m);
After:
guint32 section_len;
AscLocaleQmSection section = _read_uint8 (data, &m);
if (m + 4 > len)
    break;
section_len = _read_uint32 (data, &m);
Note also that section_len was moved from a declaration-with-initializer inside the loop body to a plain declaration before the guard. This is necessary in C89/C90-compatible code where declarations must precede statements, but it also makes the control flow clearer: we declare the variable, check whether reading it is safe, and only then perform the read.
The break exits the section-parsing loop cleanly, allowing the caller to handle whatever translations were successfully parsed up to that point rather than crashing or returning garbage.
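One way to make this kind of guard hard to forget is to move it into the read helper itself, so that the cursor can only advance when enough bytes remain. The following is a minimal sketch in plain C, not the project's _read_uint32: the function names, the one-byte tag plus 32-bit length layout, and the big-endian byte order are all assumptions made for illustration. It also checks the section payload against the remaining bytes, which goes beyond the two guards described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit value at *pos, advancing the cursor only
 * when at least four bytes remain. Returns 1 on success, 0 otherwise. */
static int read_u32_be_checked(const uint8_t *data, size_t len,
                               size_t *pos, uint32_t *out)
{
    assert(data != NULL && pos != NULL && out != NULL);

    if (*pos > len || len - *pos < 4)   /* subtraction cannot wrap here */
        return 0;

    *out = ((uint32_t)data[*pos]     << 24) |
           ((uint32_t)data[*pos + 1] << 16) |
           ((uint32_t)data[*pos + 2] <<  8) |
            (uint32_t)data[*pos + 3];
    *pos += 4;
    return 1;
}

/* Walk a hypothetical tag/length/payload stream and count the sections
 * that fit entirely inside the buffer; stop at the first that does not. */
static size_t count_complete_sections(const uint8_t *data, size_t len)
{
    size_t pos = 0, count = 0;

    while (pos < len) {
        uint32_t section_len;

        pos += 1;                               /* one-byte section tag */
        if (!read_u32_be_checked(data, len, &pos, &section_len))
            break;                              /* truncated length field */
        if (section_len > len - pos)
            break;                              /* payload overruns buffer */
        pos += section_len;
        count++;
    }
    return count;
}
```

As in the patched loop, a truncated section ends parsing with a break, and whatever was parsed before it is still returned to the caller.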
Why these two fixes are sufficient
Both vulnerable reads shared the same root cause: the buffer length was available (via g_bytes_get_data and the len variable respectively) but was not consulted before the read. The fix does not change the parsing logic — it simply inserts the missing precondition checks that should have been there from the start.
Prevention & Best Practices
1. Never discard the length from g_bytes_get_data()
The GLib API provides the length out-parameter for exactly this reason. Treat g_bytes_get_data(bytes, NULL) as a code smell in any code that subsequently indexes into the returned pointer.
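A related habit is to keep the pointer and its length together in one value, so that neither can be passed around alone. The sketch below is one possible shape for that idea in plain C; ByteSpan and span_take are hypothetical names and are not part of GLib or AppStream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "span": the pointer never travels without its length. */
typedef struct {
    const uint8_t *ptr;
    size_t         len;
} ByteSpan;

/* Split off the first n bytes of *s into *head and shrink *s to the rest.
 * Returns 0 and leaves both spans untouched when fewer than n bytes remain. */
static int span_take(ByteSpan *s, size_t n, ByteSpan *head)
{
    assert(s != NULL && head != NULL);

    if (n > s->len)
        return 0;

    head->ptr = s->ptr;
    head->len = n;
    s->ptr += n;
    s->len -= n;
    return 1;
}
```

In GLib code, such a span would be filled once from the pointer and size returned by g_bytes_get_data(), and every later read would go through span_take or a similar helper instead of raw pointer arithmetic.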
2. Validate every fixed-size read against the remaining buffer
A useful pattern is a helper macro or inline function:
#define CHECK_READ(offset, size, total_len) \
    do { if ((offset) + (size) > (total_len)) goto parse_error; } while (0)
Apply it before every memcpy, pointer cast, or multi-byte integer read from an external data source.
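One caveat with an addition-based check: when the offset or size comes from the file itself, the sum (offset) + (size) can wrap around and pass the comparison. The sketch below shows a subtraction-based variant that avoids the wrap; range_in_bounds, range_in_bounds_naive, and range_check_demo are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 when the range [offset, offset + size) lies inside a buffer of
 * total_len bytes. The comparison is arranged so that no addition can wrap. */
static int range_in_bounds(size_t offset, size_t size, size_t total_len)
{
    return size <= total_len && offset <= total_len - size;
}

/* Same contract written with the naive sum, for comparison only. */
static int range_in_bounds_naive(size_t offset, size_t size, size_t total_len)
{
    return offset + size <= total_len;  /* wraps when the sum exceeds SIZE_MAX */
}

/* Small self-check illustrating the difference between the two. */
static void range_check_demo(void)
{
    assert(range_in_bounds(4, 4, 8));
    assert(!range_in_bounds(5, 4, 8));

    /* Offset near SIZE_MAX: the naive sum wraps to 3 and is accepted. */
    assert(range_in_bounds_naive(SIZE_MAX, 4, 8));
    assert(!range_in_bounds(SIZE_MAX, 4, 8));
}
```

For a small constant size and a cursor that is already known to be at most total_len, the simpler form is adequate; the subtraction form matters once either operand is attacker-controlled.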
3. Use AddressSanitizer (ASan) in CI
Compile with -fsanitize=address during testing. ASan would have caught both of these reads immediately on a fuzzing run with a truncated input file. Pair it with a fuzzer (libFuzzer or AFL++) that generates truncated and bit-flipped .mo files.
4. Apply fuzzing to binary file parsers
Binary format parsers are high-value fuzzing targets. A corpus of valid .mo files plus a mutation fuzzer would almost certainly have generated a 1-byte input and triggered this bug before it reached production.
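A libFuzzer harness for such a parser can be very small. The sketch below uses the standard LLVMFuzzerTestOneInput entry point; parse_mo_prefix is a hypothetical stand-in for the parser under test, and the 28-byte header size is an assumption taken from the seven-field .mo header layout. A real harness would call the project's own parsing function instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MO_MAGIC       0x950412deu
#define MO_HEADER_SIZE 28u   /* assumed: seven 32-bit header fields */

/* Hypothetical parser under test. Returns 1 if the input starts with a
 * complete .mo header in host byte order, 0 otherwise. */
static int parse_mo_prefix(const uint8_t *data, size_t size)
{
    uint32_t magic;

    assert(data != NULL || size == 0);

    if (size < MO_HEADER_SIZE)
        return 0;                        /* too short: reject, read nothing */

    memcpy(&magic, data, sizeof magic);  /* safe: size >= MO_HEADER_SIZE */
    return magic == MO_MAGIC;
}

/* libFuzzer entry point: called once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    (void)parse_mo_prefix(data, size);
    return 0;
}
```

Built with clang using -g -fsanitize=fuzzer,address, the fuzzer supplies the main function and ASan reports any read past the end of the input buffer. Seeding the corpus with a few valid .mo files lets the mutator reach truncated variants quickly.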
5. Follow the CERT C Secure Coding Standard
CERT C rule ARR38-C states: "Guarantee that library functions do not form invalid pointers." The memcpy call violated this rule by not guaranteeing the source pointer was valid for the requested size.
Relevant standards:
- CWE-125: Out-of-bounds Read
- CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
- OWASP: Input Validation Cheat Sheet
Key Takeaways
- Passing NULL as the length parameter to g_bytes_get_data() is safe only if you never index into the returned pointer. In asc_l10n_parse_file_gettext(), the pointer was immediately used in a memcpy, making the discarded length a direct vulnerability.
- Both the gettext .mo parser and the Qt .qm parser in asc-utils-l10n.c shared the same class of bug: a read that assumed the buffer was large enough without checking. Always audit all read sites in a parser when you find one vulnerability.
- The fix required zero changes to the parsing logic, only the addition of &data_len in one function call and two if guards. Bounds checks are cheap; heap over-reads are not.
- Binary locale files are attacker-controlled input in any system that processes third-party packages. Treat them with the same skepticism as network input.
- The _read_uint32 helper advancing m without a bounds check is a pattern that can hide in many cursor-based parsers. Audit every call site of such helpers when the cursor approaches the buffer end.
How Orbis AppSec Detected This
- Source: A .mo or .qm locale file read from disk via asc_unit_read_data() in asc_l10n_parse_file_gettext(), an externally supplied binary file with no prior size validation.
- Sink: memcpy (&h, data, sizeof (AscLocaleGettextHeader)) at compose/asc-utils-l10n.c:131, and _read_uint32 (data, &m) in the Qt section loop, both operating on a pointer whose backing buffer length was unknown at the call site.
- Missing control: The length out-parameter of g_bytes_get_data() was passed as NULL, so no upper bound was available to check before either read. No guard existed between the data pointer acquisition and the first memory copy.
- CWE: CWE-125 (Out-of-bounds Read)
- Fix: Passed &data_len to g_bytes_get_data() and inserted if (data_len < sizeof(AscLocaleGettextHeader)) and if (m + 4 > len) guards before the respective unsafe reads.
Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix.
Conclusion
Out-of-bounds reads in binary file parsers are among the most common and consequential memory-safety bugs in C codebases. The vulnerability in asc_l10n_parse_file_gettext() is a textbook example: a single optional parameter (NULL instead of &data_len) silently removed the only mechanism by which the code could have known it was about to read past the end of a heap buffer. The fix is one captured length and two short guards, but those few lines are the difference between a parser that crashes or leaks memory on malformed input and one that fails safely with a clear error message.
For developers writing binary format parsers in C, the lesson is clear: always capture buffer lengths, always check them before every read, and fuzz your parsers with truncated inputs before shipping. The cost of adding a bounds check is microseconds; the cost of shipping without one can be a heap disclosure in a production build pipeline.
References
- CWE-125: Out-of-bounds Read
- CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
- OWASP Input Validation Cheat Sheet
- CERT C ARR38-C: Guarantee that library functions do not form invalid pointers
- GLib g_bytes_get_data() documentation
- Semgrep rules for memcpy bounds issues
- fix: add bounds check before memcpy in asc-utils-l10n.c