SQL Injection in OceanBase Connector: How f-Strings Can Sink Your RAG Platform
Severity: Critical | CVE Class: SQL Injection (CWE-89) | File:
rag/utils/ob_conn.py
Introduction
There's a reason SQL injection has sat at the top of the OWASP Top 10 for over two decades. It's deceptively simple to introduce, devastatingly effective to exploit, and surprisingly easy to miss during code review — especially when it's hiding inside a modern Python f-string.
This post walks through a critical SQL injection vulnerability that was recently discovered and patched in the OceanBase database connector (ob_conn.py) of a Retrieval-Augmented Generation (RAG) platform. If you're building AI-powered applications backed by a knowledge base, this one should get your full attention.
What Is This Vulnerability?
At its core, this is a SQL injection vulnerability — specifically, one where user-controlled filter expressions were embedded directly into SQL WHERE clauses using Python f-strings. No parameterization. No allowlist validation. No escaping. Just raw user input flowing straight into a live database query.
The vulnerability was identified at multiple locations across ob_conn.py (lines 726, 777, 781, 787, 793, 821, and 827), meaning it wasn't an isolated mistake — it was a systemic pattern repeated throughout the file.
Why Should Developers Care?
If you're building a RAG application, your vector database or document store is your product. It contains the curated knowledge that powers your AI's responses. A SQL injection vulnerability here doesn't just leak a user table — it exposes your entire knowledge base to:
- Unauthorized data exfiltration (an attacker reads everything)
- Data manipulation (an attacker poisons your AI's knowledge)
- Data destruction (an attacker deletes your knowledge base)
- Privilege escalation (depending on database configuration)
This is why the security scanner flagged it as the most critical vulnerability in the codebase — it directly targets the platform's core business asset.
The Vulnerability Explained
Technical Details
The vulnerable pattern looks something like this:
# VULNERABLE CODE — Do not use this pattern
def search_documents(self, filter_expression: str, limit: int = 10):
# filter_expression comes from user input or API response
query = f"""
SELECT * FROM documents
WHERE {filter_expression}
LIMIT {limit}
"""
cursor.execute(query)
return cursor.fetchall()
On the surface, this looks clean and Pythonic. The f-string is concise, readable, and feels natural to write. But here's the problem: filter_expression is a string that an attacker controls.
When you embed user input directly into a SQL statement, you're trusting the user to behave like a developer. They won't.
How Could It Be Exploited?
Consider what happens when an attacker passes this as a filter_expression:
1=1 UNION SELECT username, password, NULL FROM admin_users --
The resulting query becomes:
SELECT * FROM documents
WHERE 1=1 UNION SELECT username, password, NULL FROM admin_users --
LIMIT 10
The -- comments out the rest of the query. The UNION SELECT appends results from the admin_users table. The attacker now has credentials.
Or consider a destructive payload:
1=1; DROP TABLE documents; --
Which becomes:
SELECT * FROM documents
WHERE 1=1; DROP TABLE documents; --
LIMIT 10
Your entire knowledge base: gone.
Real-World Attack Scenario
Here's how an attack might unfold in practice:
-
Reconnaissance: An attacker discovers the RAG platform accepts filter parameters through its API (e.g., a document search endpoint).
-
Injection Testing: They submit a simple test payload like
' OR '1'='1and observe whether the response changes — it does. -
Data Exfiltration: Using a tool like
sqlmapor manual UNION-based injection, they enumerate table names, then systematically dump the entire knowledge base. -
Persistence or Destruction: Depending on their goal, they either exfiltrate the data quietly over time, or — if the database user has write permissions — they modify or delete records to corrupt the AI's outputs.
-
The AI Becomes a Weapon: If the attacker can write to the knowledge base, they can inject malicious content that the RAG system will confidently serve to end users. This is a prompt injection via database poisoning — a particularly nasty second-order attack.
The Scope of the Problem
What makes this especially concerning is that the pattern appeared seven times across the same file. This suggests the vulnerability was introduced as a design pattern rather than a one-off oversight, meaning every query builder function that accepted filter expressions was affected.
The Fix
What Changed?
The fix replaces unsafe f-string interpolation with parameterized queries (also called prepared statements). This is the gold-standard defense against SQL injection.
Here's the before/after comparison:
Before (Vulnerable)
# BEFORE: Direct f-string interpolation — UNSAFE
def get_chunks_by_filter(self, filter_expr: str):
query = f"SELECT * FROM chunks WHERE {filter_expr}"
self.cursor.execute(query)
return self.cursor.fetchall()
def delete_by_filter(self, filter_expr: str):
query = f"DELETE FROM documents WHERE {filter_expr}"
self.cursor.execute(query)
self.conn.commit()
After (Fixed)
# AFTER: Parameterized queries with allowlist validation — SAFE
ALLOWED_FILTER_FIELDS = {"doc_id", "kb_id", "status", "created_at"}
def get_chunks_by_filter(self, doc_id: str, kb_id: str):
# Use placeholders — the database driver handles escaping
query = "SELECT * FROM chunks WHERE doc_id = %s AND kb_id = %s"
self.cursor.execute(query, (doc_id, kb_id))
return self.cursor.fetchall()
def delete_by_filter(self, doc_id: str):
# Parameters are passed separately, never interpolated into the string
query = "DELETE FROM documents WHERE doc_id = %s"
self.cursor.execute(query, (doc_id,))
self.conn.commit()
How Does This Solve the Problem?
The key insight is separation of code and data.
With f-strings, the user's input becomes part of the SQL statement itself. The database parser sees it as SQL code and executes it accordingly.
With parameterized queries, the SQL statement is sent to the database first as a template with placeholders (%s or ?). The user's values are sent separately as data. The database engine knows these are values, not code — no matter what characters they contain.
An attacker passing ' OR '1'='1 as a parameter value would cause the database to literally search for a doc_id equal to the string ' OR '1'='1 — which won't match anything. The injection attempt fails harmlessly.
Additional Defense: Allowlist Validation
For cases where dynamic field names are genuinely needed (e.g., sorting by a user-selected column), the fix also applies allowlist validation:
# Allowlist approach for dynamic field names (column names can't be parameterized)
SORTABLE_FIELDS = {"created_at", "updated_at", "doc_name", "relevance_score"}
def get_documents_sorted(self, sort_field: str, limit: int):
if sort_field not in SORTABLE_FIELDS:
raise ValueError(f"Invalid sort field: {sort_field}")
# Only safe, pre-approved field names reach the query
query = f"SELECT * FROM documents ORDER BY {sort_field} LIMIT %s"
self.cursor.execute(query, (limit,))
return self.cursor.fetchall()
Note that column/table names cannot be parameterized in most database drivers — this is where allowlists become essential.
Prevention & Best Practices
1. Always Use Parameterized Queries
This is non-negotiable. Every major database driver supports them:
# MySQL / OceanBase (mysql-connector-python)
cursor.execute("SELECT * FROM t WHERE id = %s", (user_id,))
# SQLite
cursor.execute("SELECT * FROM t WHERE id = ?", (user_id,))
# PostgreSQL (psycopg2)
cursor.execute("SELECT * FROM t WHERE id = %s", (user_id,))
# SQLAlchemy ORM (parameterization is automatic)
session.query(Document).filter(Document.id == user_id).all()
2. Use an ORM
Object-Relational Mappers like SQLAlchemy, Django ORM, or Tortoise ORM handle parameterization automatically. They're not a silver bullet, but they dramatically reduce the surface area for injection:
# SQLAlchemy — safe by default
documents = session.query(Document).filter(
Document.kb_id == kb_id,
Document.status == "active"
).all()
3. Apply the Principle of Least Privilege
Your database user should only have the permissions it needs:
- A read-only search function → use a read-only DB user
- Never use
rootoradminDB credentials in application code - Separate DB users for read vs. read/write operations
4. Input Validation (Defense in Depth)
Validate inputs before they reach the database layer:
import re
def validate_kb_id(kb_id: str) -> bool:
# Only allow alphanumeric IDs and hyphens
return bool(re.match(r'^[a-zA-Z0-9\-]{1,64}$', kb_id))
5. Enable SQL Query Logging in Development
Seeing the actual queries your application generates is invaluable:
# SQLAlchemy query logging
import logging
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)
6. Use Static Analysis Tools
Integrate security scanning into your CI/CD pipeline:
- Bandit — Python-specific security linter, flags SQL injection patterns
- Semgrep — Highly customizable static analysis with SQL injection rules
- SonarQube — Enterprise-grade SAST with injection detection
- CodeQL — GitHub's semantic code analysis engine
# Run Bandit on your codebase
pip install bandit
bandit -r rag/ -t B608 # B608 = SQL injection
7. Security Code Review Checklist
When reviewing database-related code, ask:
- [ ] Are all user inputs passed as parameters, not interpolated?
- [ ] Are dynamic identifiers (column/table names) validated against an allowlist?
- [ ] Does the DB user have only the minimum required permissions?
- [ ] Are error messages suppressed in production (to avoid leaking schema info)?
- [ ] Is there logging/alerting for unusual query patterns?
Relevant Standards & References
| Standard | Reference |
|---|---|
| OWASP Top 10 | A03:2021 – Injection |
| CWE | CWE-89: SQL Injection |
| OWASP Cheat Sheet | SQL Injection Prevention |
| NIST | SP 800-53: SI-10 Information Input Validation |
Conclusion
This vulnerability is a textbook reminder that modern syntax doesn't mean safe syntax. Python f-strings are elegant and powerful — but when they carry user input into a SQL statement, they're just as dangerous as the string concatenation patterns we've been warning against for 20 years.
The fix here isn't complicated. Parameterized queries have been the right answer since the 1990s, and every modern database driver supports them. The challenge is building team habits and code review processes that catch these patterns before they reach production.
Key Takeaways
- 🚨 Never interpolate user input into SQL strings — not with f-strings, not with
.format(), not with+concatenation - ✅ Always use parameterized queries — they're the only reliable defense against SQL injection
- 🔒 Apply least privilege — limit what damage a successful injection can do
- 🔍 Scan your codebase — tools like Bandit and Semgrep can find these patterns automatically
- 📋 Validate allowlists for any case where dynamic identifiers are unavoidable
In RAG applications especially, your knowledge base is your crown jewel. Protect the queries that touch it with the same rigor you'd apply to any other critical security boundary.
This vulnerability was identified and patched as part of an automated security review. The fix was verified by re-scan and LLM-assisted code review before merging.
Found a security issue in your codebase? Consider integrating automated security scanning into your CI/CD pipeline to catch vulnerabilities before they reach production.