What is an out-of-bounds read?

An out-of-bounds read occurs when a program reads memory beyond the end of an allocated buffer, potentially exposing sensitive heap contents or causing a crash.

How do you prevent out-of-bounds reads in C when parsing binary files?

Always capture the buffer length returned by functions like `g_bytes_get_data()`, then validate that every read offset plus the read size is less than or equal to that length before performing `memcpy` or pointer arithmetic.

Is address sanitizer (ASan) enough to prevent out-of-bounds reads in production?

ASan is an excellent detection tool during development and CI, but it is not a prevention mechanism for production code. Explicit bounds checks in the source code are required to prevent the vulnerability from being reachable at all.

Can static analysis detect missing bounds checks before memcpy?

Yes. Tools like Semgrep, Coverity, and CodeChecker can flag `memcpy` calls where the size argument is not validated against the source buffer length. Orbis AppSec's multi-agent AI scanner detected this exact pattern in `asc-utils-l10n.c`.

What CWE is out-of-bounds read?

CWE-125 — Out-of-bounds Read. Related weaknesses include CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer) and CWE-126 (Buffer Over-read).

How out-of-bounds reads happen in C gettext .mo file parsers and how to fix it

Summary

A missing bounds check in the gettext .mo file parser inside compose/asc-utils-l10n.c allowed a malformed or truncated file to trigger out-of-bounds reads from heap memory. The vulnerability affected two distinct read sites — a memcpy of the full AscLocaleGettextHeader struct at line 131 and a 4-byte offset read at line 224 — neither of which validated that the source buffer was large enough. The fix adds explicit size checks before both reads, rejecting invalid files with a descriptive error instead of reading past the end of allocated memory.

Introduction

The compose/asc-utils-l10n.c file is responsible for parsing localization data — specifically gettext .mo binary files and Qt .qm files — as part of the AppStream Compose toolchain. It is a natural target for malformed-input attacks because it processes files that can be supplied by third parties, packagers, or build pipelines. A flaw in asc_l10n_parse_file_gettext() created a situation where a single truncated .mo file could cause the parser to read heap memory it was never supposed to touch.

The problematic pattern is subtle and common in C: g_bytes_get_data() returns both a pointer and a length, but the length parameter is optional — you can pass NULL if you don't care about it. The original code did exactly that:

data = g_bytes_get_data (bytes, NULL);  // length silently discarded

From that point forward, every read from data was unbounded. The very next substantive operation was a memcpy that assumed the buffer was at least sizeof(AscLocaleGettextHeader) bytes long — an assumption that held for well-formed files but silently broke for truncated or crafted ones.

The Vulnerability Explained

What the parser does

When asc_l10n_parse_file_gettext() is called, it reads a .mo file into a GBytes buffer, then immediately copies the first sizeof(AscLocaleGettextHeader) bytes into a local struct h to inspect the magic number and determine byte order:

/* VULNERABLE — before the fix */
data = g_bytes_get_data (bytes, NULL);   // (1) length discarded

/* we only strictly need the header */
memcpy (&h, data, sizeof (AscLocaleGettextHeader));  // (2) no size check

If the file on disk is shorter than sizeof(AscLocaleGettextHeader) bytes (because it was truncated in transit, deliberately crafted by an attacker, or simply corrupt), the memcpy at step (2) reads past the end of the heap allocation backing bytes. GBytes adds no guard padding around the data, so the extra bytes come from whatever the allocator placed next: allocator slack, heap metadata, or unrelated allocations.

A second vulnerable site existed in the Qt .qm section parser (asc_l10n_parse_file_qt):

/* VULNERABLE — before the fix */
AscLocaleQmSection section = _read_uint8 (data, &m);
guint32 section_len = _read_uint32 (data, &m);  // no check that m+4 <= len

Here, _read_uint32 advances the cursor m by 4 bytes. If m was already within 4 bytes of len, the read extended past the buffer boundary.

How an attacker exploits this

An attacker who can supply a .mo or .qm file to the AppStream Compose pipeline — for example, by contributing a malicious package to a distribution repository, or by intercepting a build artifact — can craft a file that is exactly 1 byte long. When asc_l10n_parse_file_gettext() processes it:

1. g_bytes_get_data() returns a 1-byte heap allocation.
2. memcpy(&h, data, sizeof(AscLocaleGettextHeader)) copies sizeof(AscLocaleGettextHeader) bytes (typically 28 bytes on a 64-bit system) starting from that 1-byte buffer.
3. The 27 bytes past the buffer boundary are read from adjacent heap memory.

Depending on heap layout, those 27 bytes could contain fragments of other allocations — file paths, authentication tokens, or other sensitive strings processed earlier in the same compose run. In a crash scenario, the contents may appear in a core dump or error log. In a more targeted exploit, the attacker can use the magic-number branch (h.magic == 0x950412de) as an oracle to infer heap contents.

Real-world impact: The AppStream Compose tool runs as part of distribution metadata generation pipelines. A successful exploit could leak heap contents from the compose process, crash the pipeline, or — in combination with a write primitive — escalate to arbitrary code execution.

The Fix

The fix addresses both vulnerable sites with minimal, surgical changes.

Fix 1 — Capture the buffer length and guard the header `memcpy`

Before:

data = g_bytes_get_data (bytes, NULL);

/* we only strictly need the header */
memcpy (&h, data, sizeof (AscLocaleGettextHeader));

After:

gsize data_len = 0;
// ...
data = g_bytes_get_data (bytes, &data_len);  // length is now captured

/* we only strictly need the header */
if (data_len < sizeof (AscLocaleGettextHeader)) {
    g_set_error_literal (error,
                         ASC_COMPOSE_ERROR,
                         ASC_COMPOSE_ERROR_FAILED,
                         "Gettext file is too small to be valid");
    return FALSE;
}
memcpy (&h, data, sizeof (AscLocaleGettextHeader));

The key change is passing &data_len instead of NULL to g_bytes_get_data(). This one-argument change unlocks all subsequent bounds checking. The guard immediately before memcpy ensures that if the file is shorter than the header struct, the function returns a descriptive GError rather than reading invalid memory. The memcpy itself is unchanged: it was never the problem; the missing precondition was.
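The same guard can be exercised without GLib. The following is a minimal sketch, not the project's code: MoHeaderPrefix and mo_read_header are hypothetical names, and the struct models only the first two fields of a .mo file rather than the full AscLocaleGettextHeader. It also checks the gettext magic number in both byte orders after the copy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for AscLocaleGettextHeader: only the two leading
 * fields of a .mo file are modelled here. */
typedef struct {
    uint32_t magic;
    uint32_t revision;
} MoHeaderPrefix;

#define MO_MAGIC         0x950412deu  /* file written in host byte order */
#define MO_MAGIC_SWAPPED 0xde120495u  /* file written in the other byte order */

/* Copies the header prefix out of `data` only if `data_len` covers it.
 * Returns false for short buffers or an unknown magic number; on success
 * `*swapped` tells the caller whether later fields need byte-swapping. */
static bool mo_read_header(const uint8_t *data, size_t data_len,
                           MoHeaderPrefix *out, bool *swapped)
{
    if (data == NULL || data_len < sizeof(MoHeaderPrefix))
        return false;               /* too small: reject before any read */

    memcpy(out, data, sizeof(MoHeaderPrefix));

    if (out->magic == MO_MAGIC) {
        *swapped = false;
    } else if (out->magic == MO_MAGIC_SWAPPED) {
        *swapped = true;
    } else {
        return false;               /* not a gettext catalogue */
    }
    return true;
}
```

A 1-byte buffer is rejected by the first test, so the memcpy is never reached with a source shorter than the struct.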

Fix 2 — Guard the Qt section length read

Before:

AscLocaleQmSection section = _read_uint8 (data, &m);
guint32 section_len = _read_uint32 (data, &m);

After:

guint32 section_len;
AscLocaleQmSection section = _read_uint8 (data, &m);
if (m + 4 > len)
    break;
section_len = _read_uint32 (data, &m);

Note also that section_len was moved from a declaration-with-initializer inside the loop body to a plain declaration before the guard. This is necessary in C89/C90-compatible code where declarations must precede statements, but it also makes the control flow clearer: we declare the variable, check whether reading it is safe, and only then perform the read.

The break exits the section-parsing loop cleanly, allowing the caller to handle whatever translations were successfully parsed up to that point rather than crashing or returning garbage.
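An alternative to guarding each call site is to move the check into the reader itself, so that no caller can forget it. The sketch below is an assumption-laden illustration rather than the patch that was merged: read_u32_be_checked is a hypothetical helper, and it assumes the 4-byte fields are stored big-endian.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checked variant of a cursor-based reader. It refuses the
 * read, and leaves the cursor untouched, when fewer than 4 bytes remain. */
static bool read_u32_be_checked(const uint8_t *data, size_t len,
                                size_t *pos, uint32_t *out)
{
    /* Written as a subtraction so that `*pos + 4` can never wrap. */
    if (*pos > len || len - *pos < 4)
        return false;

    const uint8_t *p = data + *pos;
    *out = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    *pos += 4;
    return true;
}
```

A loop built on this helper would test the return value and break on false, which gives the same behaviour as the explicit guard while keeping the bounds logic in one place.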

Why these two fixes are sufficient

Both vulnerable reads shared the same root cause: the buffer length was available (via g_bytes_get_data and the len variable respectively) but was not consulted before the read. The fix does not change the parsing logic — it simply inserts the missing precondition checks that should have been there from the start.

Prevention & Best Practices

1. Never discard the length from `g_bytes_get_data()`

GLib provides the length out-parameter for exactly this reason. Treat g_bytes_get_data(bytes, NULL) as a code smell in any code that subsequently indexes into the returned pointer.
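One way to make the length hard to lose is to stop passing bare pointers around. The sketch below uses a hypothetical ByteSpan type and span_slice helper (neither is part of GLib or AppStream); in GLib code the span would be filled once from g_bytes_get_data (bytes, &len) and every later read would go through it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view type: the pointer never travels without its length. */
typedef struct {
    const uint8_t *data;
    size_t         len;
} ByteSpan;

/* Produces the sub-range [offset, offset + size) of `src` in `*out`.
 * Returns false, without touching `*out`, if the range does not fit. */
static bool span_slice(ByteSpan src, size_t offset, size_t size, ByteSpan *out)
{
    if (offset > src.len || size > src.len - offset)
        return false;
    out->data = src.data + offset;
    out->len  = size;
    return true;
}
```

Because every sub-buffer is derived through span_slice, a function that receives a ByteSpan cannot be handed a pointer whose extent is unknown.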

2. Validate every fixed-size read against the remaining buffer

A useful pattern is a helper macro or inline function:

#define CHECK_READ(offset, size, total_len) \
    do { if ((offset) + (size) > (total_len)) goto parse_error; } while (0)

Apply it before every memcpy, pointer cast, or multi-byte integer read from an external data source.
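When the offset itself comes from the file, the sum offset + size can wrap around and slip past an addition-based test. The following sketch shows a subtraction-based variant together with the goto parse_error structure that such a macro presupposes. CHECK_READ_SAFE, read_indirect_u32, and the record layout are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same idea as CHECK_READ, written with a subtraction so the test stays
 * correct even when `offset` is a huge value taken from the file itself. */
#define CHECK_READ_SAFE(offset, size, total_len) \
    do { \
        if ((size) > (total_len) || (offset) > (total_len) - (size)) \
            goto parse_error; \
    } while (0)

/* Hypothetical record layout: a 4-byte offset at the start of the buffer
 * points at a 4-byte value elsewhere in the same buffer (host byte order).
 * Returns 0 and stores the value on success, -1 on malformed input. */
static int read_indirect_u32(const uint8_t *buf, size_t len, uint32_t *value)
{
    uint32_t where;

    CHECK_READ_SAFE((size_t)0, sizeof where, len);
    memcpy(&where, buf, sizeof where);

    /* `where` is attacker-controlled, so it is checked like any other read. */
    CHECK_READ_SAFE((size_t)where, sizeof *value, len);
    memcpy(value, buf + where, sizeof *value);
    return 0;

parse_error:
    return -1;
}
```

Keeping a single parse_error label per function means every failed check takes the same exit path, which is where cleanup and error reporting would live in a real parser.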

3. Use AddressSanitizer (ASan) in CI

Compile with -fsanitize=address during testing. ASan would have caught both of these reads immediately on a fuzzing run with a truncated input file. Pair it with a fuzzer (libFuzzer or AFL++) that generates truncated and bit-flipped .mo files.

4. Apply fuzzing to binary file parsers

Binary format parsers are high-value fuzzing targets. A corpus of valid .mo files plus a mutation fuzzer would almost certainly have generated a 1-byte input and triggered this bug before it reached production.
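A libFuzzer harness for a parser of this kind can be very small. The sketch below is self-contained for illustration: parse_mo_buffer is a hypothetical stand-in, and a real harness would call the project's own parsing entry point instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parser under test: accepts a buffer and reports whether it
 * starts with a complete 4-byte gettext magic number. */
static int parse_mo_buffer(const uint8_t *data, size_t len)
{
    uint32_t magic;

    if (len < sizeof magic)
        return -1;                      /* truncated input is rejected */
    memcpy(&magic, data, sizeof magic);
    return (magic == 0x950412deu || magic == 0xde120495u) ? 0 : -1;
}

/* libFuzzer entry point: called once per generated input. When built with
 * -fsanitize=fuzzer,address, an out-of-bounds read inside the parser
 * aborts the run and the offending input is saved for replay. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    (void)parse_mo_buffer(data, size);  /* the result is ignored; crashes are the signal */
    return 0;                           /* libFuzzer expects 0 */
}
```

Assuming the file is saved as harness.c (a placeholder name), it can be built with clang -g -O1 -fsanitize=fuzzer,address harness.c -o fuzz_mo and run against a directory of valid .mo files as the seed corpus.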

5. Follow the CERT C Secure Coding Standard

CERT C rule ARR38-C states: "Guarantee that library functions do not form invalid pointers." The memcpy call violated this rule by not guaranteeing the source pointer was valid for the requested size.

Relevant standards:
- CWE-125: Out-of-bounds Read
- CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer
- OWASP: Input Validation Cheat Sheet

Key Takeaways

- Passing NULL as the length parameter to g_bytes_get_data() is safe only if you never index into the returned pointer. In asc_l10n_parse_file_gettext(), the pointer was immediately used in a memcpy, making the discarded length a direct vulnerability.
- Both the gettext .mo parser and the Qt .qm parser in asc-utils-l10n.c shared the same class of bug: a read that assumed the buffer was large enough without checking. Always audit all read sites in a parser when you find one vulnerability.
- The fix required zero changes to the parsing logic, only the addition of &data_len in one function call and two if guards. Bounds checks are cheap; heap over-reads are not.
- Binary locale files are attacker-controlled input in any system that processes third-party packages. Treat them with the same skepticism as network input.
- The _read_uint32 helper advancing m without a bounds check is a pattern that can hide in many cursor-based parsers. Audit every call site of such helpers when the cursor approaches the buffer end.

How Orbis AppSec Detected This

Source: A .mo or .qm locale file read from disk via asc_unit_read_data() in asc_l10n_parse_file_gettext() — an externally supplied binary file with no prior size validation.
Sink: memcpy (&h, data, sizeof (AscLocaleGettextHeader)) at compose/asc-utils-l10n.c:131, and _read_uint32 (data, &m) in the Qt section loop. Neither read was checked against the length of its backing buffer at the call site.
Missing control: In the gettext parser, the length out-parameter of g_bytes_get_data() was passed as NULL, so no upper bound was available between acquiring the data pointer and the first memory copy. In the Qt parser, len was available but was not consulted before the 4-byte read.
CWE: CWE-125 — Out-of-bounds Read
Fix: Passed &data_len to g_bytes_get_data() and inserted if (data_len < sizeof(AscLocaleGettextHeader)) and if (m + 4 > len) guards before the respective unsafe reads.

Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix. Try Orbis AppSec on your repositories to find and fix issues like this automatically.

Conclusion

Out-of-bounds reads in binary file parsers are among the most common and consequential memory-safety bugs in C codebases. The vulnerability in asc_l10n_parse_file_gettext() is a textbook example: a single optional parameter (NULL instead of &data_len) silently removed the only mechanism by which the code could have known it was about to read past the end of a heap buffer. The fix is a handful of lines of guard code, but those lines are the difference between a parser that crashes or leaks memory on malformed input and one that fails safely with a clear error message.

For developers writing binary format parsers in C, the lesson is clear: always capture buffer lengths, always check them before every read, and fuzz your parsers with truncated inputs before shipping. The cost of adding a bounds check is microseconds; the cost of shipping without one can be a heap disclosure in a production build pipeline.

cwe	CWE-125
fix	Capture `data_len` from `g_bytes_get_data()` and add `data_len < sizeof(AscLocaleGettextHeader)` and `m + 4 > len` guards before each unsafe read
risk	Heap memory disclosure or crash when parsing a malformed gettext .mo or Qt .qm locale file
language	C
root cause	`g_bytes_get_data()` was called without capturing the buffer length, so subsequent `memcpy` and offset reads had no upper bound to check against
vulnerability	Out-of-bounds read via unchecked memcpy on attacker-controlled file data
