How Integer Overflow in regexJIT.c Heap Allocation Happens in C and How to Fix It
The Short Answer
This is a heap buffer overflow (CWE-120) in regex_src/regexJIT.c caused by an integer overflow in sizeof(struct stack_item) * dfa_size. On 32-bit platforms, a large dfa_size wraps the multiplication to a small value, so SLJIT_MALLOC returns an undersized buffer. Subsequent writes overflow that buffer. The fix adds a division-based pre-check that returns REGEX_MEMORY_ERROR before any allocation is attempted.
Introduction
The regex_src/regexJIT.c file is the heart of a JIT-compiled regex engine: it takes a compiled DFA (Deterministic Finite Automaton) and generates native machine code for fast pattern matching. Inside generate_transitions(), the compiler allocates a contiguous array of struct stack_item objects sized to hold every DFA state. A flaw at line 983 of that file means a sufficiently complex regex pattern can cause the size calculation to silently overflow, handing SLJIT_MALLOC a number far smaller than needed.
This matters for any developer working with C-level memory allocation: the bug pattern — multiplying a user-influenced count by a struct size without overflow checking — is one of the most common sources of heap corruption in systems software.
The Vulnerability Explained
What Goes Wrong at Line 983
Inside generate_transitions(), the original code allocates the DFA transition table like this:
// regex_src/regexJIT.c — BEFORE fix (line 983)
compiler_common->dfa_transitions = (struct stack_item *)SLJIT_MALLOC(
    sizeof(struct stack_item) * compiler_common->dfa_size,
    NULL
);
if (!compiler_common->dfa_transitions)
    return REGEX_MEMORY_ERROR;
The problem is the expression sizeof(struct stack_item) * compiler_common->dfa_size. The sizeof operator yields a size_t and dfa_size is an sljit_uw (an unsigned word-sized integer), so the product is computed in unsigned word-sized arithmetic. On a 32-bit platform, both types are 32 bits wide, with a maximum value of 0xFFFFFFFF (4,294,967,295).
If dfa_size is large enough — say, derived from a deeply nested alternation pattern — the multiplication wraps around modulo 2³², producing a small number. For example:
sizeof(struct stack_item) = 16 bytes
dfa_size = 0x10000001 (268,435,457)
16 * 268,435,457 = 4,294,967,312 → overflows to 16 (on 32-bit)
SLJIT_MALLOC(16, NULL) succeeds and returns a 16-byte buffer. The code then tries to write dfa_size entries into it — hundreds of millions of writes into a 16-byte allocation. The heap is immediately corrupted.
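The wraparound can be reproduced on any host by doing the arithmetic in a fixed 32-bit type. The following is a minimal sketch, assuming the 16-byte element size from the example above. The helper alloc_size_32 is hypothetical (it is not a function from regexJIT.c), and uint32_t stands in for a 32-bit sljit_uw.

```c
#include <stdint.h>

/* Hypothetical helper: computes the allocation size the way a 32-bit
 * build would, with uint32_t standing in for a 32-bit sljit_uw. */
static uint32_t alloc_size_32(uint32_t elem_size, uint32_t count)
{
    /* Unsigned arithmetic wraps modulo 2^32; no error is signalled. */
    return (uint32_t)(elem_size * count);
}
```

Calling alloc_size_32(16, 0x10000001) yields 16, which is the size that reaches the allocator in the example, while a small count such as 4 yields the expected 64.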
What Makes dfa_size Attacker-Controlled?
dfa_size is derived from parsing the regex pattern itself. The PR's regression test illustrates the attack surface directly:
// Deeply nested alternation — maximizes dfa_size
"((((((((((a|b|c|d|e|f|g|h|i|j){100}){100}){100}){100}){10}){10}){10}){5}){5}){5}"
// Large repeated group — stresses dfa_size calculation
"(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z){65535}"
Any caller of regex_compile() that accepts a regex string from an external source — a configuration file, a network protocol, a user-supplied search field — is a potential attack vector.
The Attack Scenario
- An attacker submits a crafted regex pattern (e.g., the deeply nested alternation above) to any API that calls regex_compile().
- The DFA construction phase computes a large dfa_size.
- generate_transitions() multiplies sizeof(struct stack_item) * dfa_size, which overflows to a small value on a 32-bit build.
- SLJIT_MALLOC allocates a tiny buffer and returns a valid pointer.
- The subsequent loop writes DFA transition data far beyond the end of the buffer, corrupting adjacent heap objects.
- Depending on heap layout and what gets corrupted, this can cause: a crash (denial of service), data corruption, or, with careful heap grooming, arbitrary code execution.
The Secondary Bug: Uninitialized Pointer
The diff also reveals a secondary issue in regex_compile():
// BEFORE fix — compiler_common.dfa_transitions is uninitialized
error_code = generate_transitions(&compiler_common);
stack_destroy(&compiler_common.stack);
stack_destroy(&compiler_common.depth);
If generate_transitions() returns early (e.g., due to the overflow check added by the fix), any cleanup code that conditionally frees dfa_transitions could pass an uninitialized pointer to the deallocator, which is undefined behavior. The fix initializes it to NULL first, so the cleanup path calls free(NULL), which is always safe.
The Fix
Change 1: Pre-Allocation Overflow Guard in generate_transitions()
// regex_src/regexJIT.c — AFTER fix
stack_init(depth);
if (compiler_common->dfa_size > (~(sljit_uw)0) / sizeof(struct stack_item))
    return REGEX_MEMORY_ERROR;
compiler_common->dfa_transitions = (struct stack_item *)SLJIT_MALLOC(
    sizeof(struct stack_item) * compiler_common->dfa_size,
    NULL
);
if (!compiler_common->dfa_transitions)
    return REGEX_MEMORY_ERROR;
The guard condition compiler_common->dfa_size > (~(sljit_uw)0) / sizeof(struct stack_item) deserves a close look:
- (~(sljit_uw)0) is the maximum value of sljit_uw on any platform: all bits set, regardless of whether it's 32 or 64 bits. This avoids hardcoding UINT32_MAX or SIZE_MAX.
- Dividing by sizeof(struct stack_item) gives the maximum safe element count before multiplication would overflow.
- If dfa_size exceeds that threshold, the function returns REGEX_MEMORY_ERROR immediately: no allocation, no corruption.
This is the canonical safe pattern for multiplication overflow checks in C. It works because division is the inverse of multiplication: if a > MAX / b, then a * b > MAX.
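To make the threshold concrete, the guard can be modelled in 32 bits. This is a sketch under stated assumptions: count_would_overflow_32 is a hypothetical name, uint32_t stands in for a 32-bit sljit_uw, and elem_size plays the role of sizeof(struct stack_item).

```c
#include <stdint.h>

/* Hypothetical 32-bit model of the guard. elem_size must be non-zero,
 * which always holds for a sizeof result. */
static int count_would_overflow_32(uint32_t count, uint32_t elem_size)
{
    uint32_t max_word = ~(uint32_t)0;      /* all bits set: 0xFFFFFFFF */
    return count > max_word / elem_size;   /* same shape as the fix */
}
```

With a 16-byte element the threshold is 0xFFFFFFFF / 16 = 0x0FFFFFFF. A count of 0x0FFFFFFF is accepted (the product 0xFFFFFFF0 still fits), while 0x10000000 and the 0x10000001 from the earlier example are both rejected.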
Change 2: Initialize dfa_transitions to NULL Before Calling generate_transitions()
// regex_src/regexJIT.c — AFTER fix (in regex_compile)
compiler_common.dfa_transitions = NULL; // ← added
error_code = generate_transitions(&compiler_common);
This one-line change ensures that if generate_transitions() returns early (including via the new overflow guard), any subsequent code that frees compiler_common.dfa_transitions will safely call free(NULL) rather than freeing a garbage pointer.
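The interaction between the early return and the cleanup path can be sketched in isolation. Everything below is a hypothetical stand-in rather than the real code: demo_common, demo_generate, demo_compile, and DEMO_MEMORY_ERROR only mirror the shape of compiler_common, generate_transitions(), regex_compile(), and REGEX_MEMORY_ERROR.

```c
#include <stdlib.h>

#define DEMO_MEMORY_ERROR 1  /* hypothetical error code */

/* Hypothetical stand-in for the compiler state. */
struct demo_common {
    size_t count;
    int *table;
};

/* Returns early, before assigning table, when the size would overflow. */
static int demo_generate(struct demo_common *c)
{
    if (c->count > (~(size_t)0) / sizeof(int))
        return DEMO_MEMORY_ERROR;
    c->table = malloc(sizeof(int) * c->count);
    return c->table ? 0 : DEMO_MEMORY_ERROR;
}

static int demo_compile(size_t count)
{
    struct demo_common c;
    int err;

    c.count = count;
    c.table = NULL;   /* without this, an early return leaves table indeterminate */
    err = demo_generate(&c);
    free(c.table);    /* free(NULL) is a no-op, so this is safe on every path */
    return err;
}
```

If the NULL assignment were removed, demo_compile((size_t)-1) would pass an indeterminate pointer to free(), which is the failure mode the one-line change prevents.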
Change 3: Regression Test in regexMain.c
// regex_src/regexMain.c — added regression test
{
    struct regex_machine *machine;
    int err = REGEX_NO_ERROR;
    machine = regex_compile(
        "(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z){1000}",
        64, 0, &err);
    if (machine)
        regex_free_machine(machine);
    else if (err != REGEX_MEMORY_ERROR) {
        printf("FAIL: overflow regression returned unexpected error %d\n", err);
        return 1;
    }
}
This test is embedded directly in the main test runner. It verifies that the overflow-inducing pattern either compiles successfully (acceptable on 64-bit where no overflow occurs) or fails with exactly REGEX_MEMORY_ERROR — not a crash, not a silent heap corruption, not any other error code.
Before vs. After Summary
| Aspect | Before | After |
|---|---|---|
| Overflow check | None | Division-based guard before SLJIT_MALLOC |
| dfa_transitions init | Uninitialized | Set to NULL before generate_transitions() |
| Oversized pattern | Heap corruption | Returns REGEX_MEMORY_ERROR |
| Regression coverage | None | Inline test + standalone check test |
Prevention & Best Practices
1. Always Use the Division Pattern for Allocation Size Checks
// ✅ Safe: check before multiplying
if (count > SIZE_MAX / sizeof(T)) {
    return ERROR_OVERFLOW;
}
T *buf = malloc(count * sizeof(T));

// ❌ Unsafe: multiply first, check after
size_t total = count * sizeof(T);
if (total < count) { /* unreliable: unsigned math wraps silently, and a wrapped total can still be >= count */ }
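The division check is portable and works for any unsigned integer width. Because sizeof(T) is a compile-time constant, the compiler can fold SIZE_MAX / sizeof(T) into a constant, leaving a single comparison at run time.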
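One reason the multiply-first variant is unsafe is that a comparison such as total < count does not detect every wraparound. The sketch below models both checks in 32 bits so the difference shows up on any host; the function names are hypothetical and uint32_t stands in for a 32-bit size type.

```c
#include <stdint.h>

/* Post-hoc check from the unsafe variant, modelled in 32 bits.
 * Returns 1 if it believes the multiplication overflowed. */
static int post_check_detects_32(uint32_t count, uint32_t elem_size)
{
    uint32_t total = (uint32_t)(count * elem_size);
    return total < count;
}

/* Division-based pre-check; elem_size must be non-zero. */
static int pre_check_detects_32(uint32_t count, uint32_t elem_size)
{
    return count > UINT32_MAX / elem_size;
}
```

For count = 3 and elem_size = 0x80000000, the true product is 0x180000000, which wraps to 0x80000000. That value is not smaller than 3, so the post-hoc check misses the overflow, while the pre-check rejects it because 3 > 0xFFFFFFFF / 0x80000000 = 1. On GCC and Clang, __builtin_mul_overflow is another way to get a reliable answer.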
2. Use reallocarray() or calloc() Where Available
On modern Linux/BSD systems, reallocarray(ptr, nmemb, size) performs the overflow check internally:
// ✅ Overflow-safe on glibc 2.26+, OpenBSD, FreeBSD
compiler_common->dfa_transitions = reallocarray(
    NULL,
    compiler_common->dfa_size,
    sizeof(struct stack_item)
);
calloc(nmemb, size) also performs an internal overflow check on most implementations. However, since this code targets portability across platforms (it's part of an SLJIT abstraction layer), the explicit guard is the right choice here.
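Where reallocarray() is not available, a small wrapper can provide the same behavior. This is a sketch, not part of the fix: portable_reallocarray is a hypothetical name, and it assumes that reporting overflow as NULL with errno set to ENOMEM (as reallocarray() does) is the desired contract.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fallback: returns NULL and sets errno to ENOMEM when
 * nmemb * size would overflow, otherwise defers to realloc(). */
static void *portable_reallocarray(void *ptr, size_t nmemb, size_t size)
{
    if (size != 0 && nmemb > SIZE_MAX / size) {
        errno = ENOMEM;
        return NULL;
    }
    return realloc(ptr, nmemb * size);
}
```

The size != 0 test avoids dividing by zero; with a zero size the product cannot overflow, so the request is passed straight to realloc().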
3. Initialize All Pointers at Declaration
// ✅ Always initialize struct members before use
struct compiler_common compiler_common;
memset(&compiler_common, 0, sizeof(compiler_common));
// or explicitly:
compiler_common.dfa_transitions = NULL;
Uninitialized pointers in structs that are conditionally freed are a persistent source of undefined behavior in C.
4. Fuzz Regex Input
Regex engines are classic fuzzing targets. Tools like AFL++ and libFuzzer can generate the exact kind of deeply nested alternation patterns that triggered this bug:
# Example: fuzz regex_compile() with AFL++
afl-fuzz -i seeds/ -o findings/ -- ./regex_fuzz_harness @@
5. Enable Compiler Sanitizers During Development
# AddressSanitizer catches heap overflows immediately
gcc -fsanitize=address,undefined -g regex_src/regexJIT.c
# Unsigned wraparound is well-defined in C, so plain UBSan does not report it; Clang's opt-in check does
clang -fsanitize=unsigned-integer-overflow -g ...
6. Reference Standards
- CWE-120: Buffer Copy without Checking Size of Input
- CWE-190: Integer Overflow or Wraparound
- OWASP: Memory Management Cheat Sheet
Key Takeaways
- The multiplication sizeof(struct stack_item) * dfa_size in generate_transitions() was the exact dangerous pattern: not a generic "unchecked input" issue, but a specific arithmetic operation on a compiler-internal size variable derived from regex complexity.
- The fix uses (~(sljit_uw)0) / sizeof(struct stack_item) as the safe threshold, which is portable to both 32-bit and 64-bit platforms without hardcoding any platform-specific constants.
- Checking malloc's return value is not sufficient: the overflow produces a valid but undersized allocation, so only a pre-allocation size check can prevent the corruption.
- Initializing dfa_transitions = NULL before calling generate_transitions() prevents a separate use-of-uninitialized-pointer bug in cleanup paths triggered by the new early return.
- Deeply nested alternation patterns like ((a|b|...|j){100}){100}... are the concrete exploit payload: any API that passes user-supplied regex strings to regex_compile() on a 32-bit build was vulnerable.
How Orbis AppSec Detected This
- Source: The regex_string parameter passed to regex_compile(), a caller-controlled regex pattern that influences the computed dfa_size value during DFA construction.
- Sink: SLJIT_MALLOC(sizeof(struct stack_item) * compiler_common->dfa_size, NULL) at regex_src/regexJIT.c:983, an allocation whose size argument is the unchecked product of a user-influenced value.
- Missing control: No overflow check on the multiplication sizeof(struct stack_item) * dfa_size before passing the result to the allocator; no upper bound validation on dfa_size during regex compilation.
- CWE: CWE-120, Buffer Copy without Checking Size of Input (heap buffer overflow via undersized allocation).
- Fix: Added if (compiler_common->dfa_size > (~(sljit_uw)0) / sizeof(struct stack_item)) return REGEX_MEMORY_ERROR; immediately before the SLJIT_MALLOC call in generate_transitions().
Orbis AppSec automatically detected this vulnerability and opened a pull request with the fix.
Conclusion
Integer overflow in allocation size calculations is one of those bugs that looks harmless in isolation — a multiplication of two small-seeming numbers — but becomes critical when the inputs are attacker-controlled. In regexJIT.c, the dfa_size field is computed from regex pattern complexity, meaning any caller that accepts external regex input on a 32-bit platform was exposed to heap corruption.
The fix is a textbook two-line guard: check that dfa_size doesn't exceed MAX / sizeof(T) before multiplying, and return a clean error if it does. Paired with the NULL initialization of dfa_transitions, the fix closes both the overflow and the uninitialized-pointer cleanup risk.
For C developers working with JIT compilers, pattern matchers, or any code that sizes allocations based on parsed input: make the division-based overflow check a reflex, enable AddressSanitizer in your CI pipeline, and consider fuzzing any parser that feeds into a size calculation. The patterns that trigger these bugs are exactly the kind of edge cases that automated fuzzers find in minutes.