SQL Injection in OceanBase Connector: How f-Strings Can Sink Your RAG Platform

Severity: Critical | CVE Class: SQL Injection (CWE-89) | File: rag/utils/ob_conn.py

Introduction

There's a reason SQL injection has sat at the top of the OWASP Top 10 for over two decades. It's deceptively simple to introduce, devastatingly effective to exploit, and surprisingly easy to miss during code review — especially when it's hiding inside a modern Python f-string.

This post walks through a critical SQL injection vulnerability that was recently discovered and patched in the OceanBase database connector (ob_conn.py) of a Retrieval-Augmented Generation (RAG) platform. If you're building AI-powered applications backed by a knowledge base, this one should get your full attention.

What Is This Vulnerability?

At its core, this is a SQL injection vulnerability — specifically, one where user-controlled filter expressions were embedded directly into SQL WHERE clauses using Python f-strings. No parameterization. No allowlist validation. No escaping. Just raw user input flowing straight into a live database query.

The vulnerability was identified at multiple locations across ob_conn.py (lines 726, 777, 781, 787, 793, 821, and 827), meaning it wasn't an isolated mistake — it was a systemic pattern repeated throughout the file.

Why Should Developers Care?

If you're building a RAG application, your vector database or document store is your product. It contains the curated knowledge that powers your AI's responses. A SQL injection vulnerability here doesn't just leak a user table — it exposes your entire knowledge base to:

Unauthorized data exfiltration (an attacker reads everything)
Data manipulation (an attacker poisons your AI's knowledge)
Data destruction (an attacker deletes your knowledge base)
Privilege escalation (depending on database configuration)

This is why the security scanner flagged it as the most critical vulnerability in the codebase — it directly targets the platform's core business asset.

The Vulnerability Explained

Technical Details

The vulnerable pattern looks something like this:

# VULNERABLE CODE — Do not use this pattern
def search_documents(self, filter_expression: str, limit: int = 10):
    # filter_expression comes from user input or API response
    query = f"""
        SELECT * FROM documents
        WHERE {filter_expression}
        LIMIT {limit}
    """
    cursor.execute(query)
    return cursor.fetchall()

On the surface, this looks clean and Pythonic. The f-string is concise, readable, and feels natural to write. But here's the problem: filter_expression is a string that an attacker controls.

When you embed user input directly into a SQL statement, you're trusting the user to behave like a developer. They won't.

How Could It Be Exploited?

Consider what happens when an attacker passes this as a filter_expression:

1=1 UNION SELECT username, password, NULL FROM admin_users --

The resulting query becomes:

SELECT * FROM documents
WHERE 1=1 UNION SELECT username, password, NULL FROM admin_users --
LIMIT 10

The -- comments out the rest of the query. The UNION SELECT appends results from the admin_users table. The attacker now has credentials.

Or consider a destructive payload:

1=1; DROP TABLE documents; --

Which becomes:

SELECT * FROM documents
WHERE 1=1; DROP TABLE documents; --
LIMIT 10

Your entire knowledge base: gone.

Real-World Attack Scenario

Here's how an attack might unfold in practice:

Reconnaissance: An attacker discovers the RAG platform accepts filter parameters through its API (e.g., a document search endpoint).
Injection Testing: They submit a simple test payload like ' OR '1'='1 and observe whether the response changes — it does.
Data Exfiltration: Using a tool like sqlmap or manual UNION-based injection, they enumerate table names, then systematically dump the entire knowledge base.
Persistence or Destruction: Depending on their goal, they either exfiltrate the data quietly over time, or — if the database user has write permissions — they modify or delete records to corrupt the AI's outputs.
The AI Becomes a Weapon: If the attacker can write to the knowledge base, they can inject malicious content that the RAG system will confidently serve to end users. This is a prompt injection via database poisoning — a particularly nasty second-order attack.

The Scope of the Problem

What makes this especially concerning is that the pattern appeared seven times across the same file. This suggests the vulnerability was introduced as a design pattern rather than a one-off oversight, meaning every query builder function that accepted filter expressions was affected.

The Fix

What Changed?

The fix replaces unsafe f-string interpolation with parameterized queries (also called prepared statements). This is the gold-standard defense against SQL injection.

Here's the before/after comparison:

Before (Vulnerable)

# BEFORE: Direct f-string interpolation — UNSAFE
def get_chunks_by_filter(self, filter_expr: str):
    query = f"SELECT * FROM chunks WHERE {filter_expr}"
    self.cursor.execute(query)
    return self.cursor.fetchall()

def delete_by_filter(self, filter_expr: str):
    query = f"DELETE FROM documents WHERE {filter_expr}"
    self.cursor.execute(query)
    self.conn.commit()

After (Fixed)

# AFTER: Parameterized queries with allowlist validation — SAFE
ALLOWED_FILTER_FIELDS = {"doc_id", "kb_id", "status", "created_at"}

def get_chunks_by_filter(self, doc_id: str, kb_id: str):
    # Use placeholders — the database driver handles escaping
    query = "SELECT * FROM chunks WHERE doc_id = %s AND kb_id = %s"
    self.cursor.execute(query, (doc_id, kb_id))
    return self.cursor.fetchall()

def delete_by_filter(self, doc_id: str):
    # Parameters are passed separately, never interpolated into the string
    query = "DELETE FROM documents WHERE doc_id = %s"
    self.cursor.execute(query, (doc_id,))
    self.conn.commit()

How Does This Solve the Problem?

The key insight is separation of code and data.

With f-strings, the user's input becomes part of the SQL statement itself. The database parser sees it as SQL code and executes it accordingly.

With parameterized queries, the SQL statement is sent to the database first as a template with placeholders (%s or ?). The user's values are sent separately as data. The database engine knows these are values, not code — no matter what characters they contain.

An attacker passing ' OR '1'='1 as a parameter value would cause the database to literally search for a doc_id equal to the string ' OR '1'='1 — which won't match anything. The injection attempt fails harmlessly.

Additional Defense: Allowlist Validation

For cases where dynamic field names are genuinely needed (e.g., sorting by a user-selected column), the fix also applies allowlist validation:

# Allowlist approach for dynamic field names (column names can't be parameterized)
SORTABLE_FIELDS = {"created_at", "updated_at", "doc_name", "relevance_score"}

def get_documents_sorted(self, sort_field: str, limit: int):
    if sort_field not in SORTABLE_FIELDS:
        raise ValueError(f"Invalid sort field: {sort_field}")

    # Only safe, pre-approved field names reach the query
    query = f"SELECT * FROM documents ORDER BY {sort_field} LIMIT %s"
    self.cursor.execute(query, (limit,))
    return self.cursor.fetchall()

Note that column/table names cannot be parameterized in most database drivers — this is where allowlists become essential.

Prevention & Best Practices

1. Always Use Parameterized Queries

This is non-negotiable. Every major database driver supports them:

# MySQL / OceanBase (mysql-connector-python)
cursor.execute("SELECT * FROM t WHERE id = %s", (user_id,))

# SQLite
cursor.execute("SELECT * FROM t WHERE id = ?", (user_id,))

# PostgreSQL (psycopg2)
cursor.execute("SELECT * FROM t WHERE id = %s", (user_id,))

# SQLAlchemy ORM (parameterization is automatic)
session.query(Document).filter(Document.id == user_id).all()

2. Use an ORM

Object-Relational Mappers like SQLAlchemy, Django ORM, or Tortoise ORM handle parameterization automatically. They're not a silver bullet, but they dramatically reduce the surface area for injection:

# SQLAlchemy — safe by default
documents = session.query(Document).filter(
    Document.kb_id == kb_id,
    Document.status == "active"
).all()

3. Apply the Principle of Least Privilege

Your database user should only have the permissions it needs:

A read-only search function → use a read-only DB user
Never use root or admin DB credentials in application code
Separate DB users for read vs. read/write operations

4. Input Validation (Defense in Depth)

Validate inputs before they reach the database layer:

import re

def validate_kb_id(kb_id: str) -> bool:
    # Only allow alphanumeric IDs and hyphens
    return bool(re.match(r'^[a-zA-Z0-9\-]{1,64}$', kb_id))

5. Enable SQL Query Logging in Development

Seeing the actual queries your application generates is invaluable:

# SQLAlchemy query logging
import logging
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

6. Use Static Analysis Tools

Integrate security scanning into your CI/CD pipeline:

Bandit — Python-specific security linter, flags SQL injection patterns
Semgrep — Highly customizable static analysis with SQL injection rules
SonarQube — Enterprise-grade SAST with injection detection
CodeQL — GitHub's semantic code analysis engine

# Run Bandit on your codebase
pip install bandit
bandit -r rag/ -t B608  # B608 = SQL injection

7. Security Code Review Checklist

When reviewing database-related code, ask:

[ ] Are all user inputs passed as parameters, not interpolated?
[ ] Are dynamic identifiers (column/table names) validated against an allowlist?
[ ] Does the DB user have only the minimum required permissions?
[ ] Are error messages suppressed in production (to avoid leaking schema info)?
[ ] Is there logging/alerting for unusual query patterns?

Relevant Standards & References

Standard	Reference
OWASP Top 10	A03:2021 – Injection
CWE	CWE-89: SQL Injection
OWASP Cheat Sheet	SQL Injection Prevention
NIST	SP 800-53: SI-10 Information Input Validation

Conclusion

This vulnerability is a textbook reminder that modern syntax doesn't mean safe syntax. Python f-strings are elegant and powerful — but when they carry user input into a SQL statement, they're just as dangerous as the string concatenation patterns we've been warning against for 20 years.

The fix here isn't complicated. Parameterized queries have been the right answer since the 1990s, and every modern database driver supports them. The challenge is building team habits and code review processes that catch these patterns before they reach production.

Key Takeaways

🚨 Never interpolate user input into SQL strings — not with f-strings, not with .format(), not with + concatenation
✅ Always use parameterized queries — they're the only reliable defense against SQL injection
🔒 Apply least privilege — limit what damage a successful injection can do
🔍 Scan your codebase — tools like Bandit and Semgrep can find these patterns automatically
📋 Validate allowlists for any case where dynamic identifiers are unavoidable

In RAG applications especially, your knowledge base is your crown jewel. Protect the queries that touch it with the same rigor you'd apply to any other critical security boundary.

This vulnerability was identified and patched as part of an automated security review. The fix was verified by re-scan and LLM-assisted code review before merging.

Found a security issue in your codebase? Consider integrating automated security scanning into your CI/CD pipeline to catch vulnerabilities before they reach production.

SQL Injection in OceanBase Connector: How f-Strings Can Sink Your RAG Platform

SQL Injection in OceanBase Connector: How f-Strings Can Sink Your RAG Platform

Introduction

What Is This Vulnerability?

Why Should Developers Care?

The Vulnerability Explained

Technical Details

How Could It Be Exploited?

Real-World Attack Scenario

The Scope of the Problem

The Fix

What Changed?

Before (Vulnerable)

After (Fixed)

How Does This Solve the Problem?

Additional Defense: Allowlist Validation

Prevention & Best Practices

1. Always Use Parameterized Queries

2. Use an ORM

3. Apply the Principle of Least Privilege

4. Input Validation (Defense in Depth)

5. Enable SQL Query Logging in Development

6. Use Static Analysis Tools

7. Security Code Review Checklist

Relevant Standards & References

Conclusion

Key Takeaways

View the Security Fix

Related Articles

How Missing Checksum Validation Opens the Door to Supply Chain Attacks

Unauthenticated Debug Endpoints Expose Firmware Internals: A High-Severity Fix

Heap Buffer Overflow in SSL/TLS: When Proto Length Goes Wrong