fix(vectorstore/faiss): HMAC-verify index.pkl before pickle.load (CWE-502) #2462

Open
sebastiondev wants to merge 1 commit into arc53:main from sebastiondev:fix/cwe502-faiss-pickle-3b70-submit

@sebastiondev

Summary

This PR adds HMAC-SHA256 integrity verification to the FAISS index.pkl files before they are passed to pickle.load, mitigating an arbitrary-code-execution risk (CWE-502: Deserialization of Untrusted Data) in application/vectorstore/faiss.py.

Vulnerability

  • CWE: CWE-502 (Deserialization of Untrusted Data)
  • Severity: High — arbitrary code execution as the application process
  • File: application/vectorstore/faiss.py, FaissStore.__init__
  • Sink: FAISS.load_local(temp_dir, self.embeddings, allow_dangerous_deserialization=True)

FAISS.load_local is invoked with allow_dangerous_deserialization=True, which causes pickle.load to be called on the contents of index.pkl. The bytes for index.pkl are read from the configured storage backend (local filesystem or S3, via StorageCreator) and written to a temp directory before being deserialized. There is no integrity check on the blob. An attacker who can write to the storage location for an index — for example, write access to an S3 bucket configured for vector storage — can replace index.pkl with a malicious pickle payload. The next time the index is loaded, pickle.load will execute the embedded __reduce__ payload as the application user.

This is a textbook unsafe-deserialization pattern. The exploitation primitive is well known and reliable; a one-line __reduce__ returning (os.system, ("...",)) is enough.
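The primitive can be illustrated with a harmless sketch. The class name `Payload` is hypothetical, and the built-in `int` stands in for `os.system` so that nothing dangerous runs. The sketch shows only that `pickle.loads` calls whatever callable `__reduce__` names, with the arguments it supplies.

```python
import pickle


class Payload:
    """Hypothetical class standing in for an attacker-controlled object."""

    def __reduce__(self):
        # pickle records (callable, args) and calls callable(*args) on load.
        # A real attack would name os.system here; int is a benign stand-in.
        return (int, ("42",))


blob = pickle.dumps(Payload())

# The loader does not get a Payload back: it gets the return value of the call.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

Because the call happens inside `pickle.loads` itself, no check performed after loading can help; the bytes have to be authenticated before they reach the deserializer.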

Fix

The fix derives a per-deployment key from ENCRYPTION_SECRET_KEY (with a domain separator so it cannot be confused with other uses of that secret) and uses it to compute an HMAC-SHA256 over the index.pkl bytes. The signature is stored alongside the pickle as index.pkl.sig.
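A minimal sketch of this signing step follows. The domain separator is the one quoted in the security analysis of this PR; the derivation itself (SHA-256 over separator plus secret) and the function names `derive_integrity_key` and `sign_pickle_bytes` are assumptions made for illustration and may not match the code in the diff.

```python
import hashlib
import hmac

# Domain separator quoted in this PR's security analysis.
DOMAIN_SEPARATOR = b"docsgpt|faiss-pickle-integrity|v1|"


def derive_integrity_key(secret: str) -> bytes:
    """Assumed derivation: SHA-256 over separator + secret (the PR may differ)."""
    return hashlib.sha256(DOMAIN_SEPARATOR + secret.encode("utf-8")).digest()


def sign_pickle_bytes(pickle_bytes: bytes, secret: str) -> str:
    """Return a hex HMAC-SHA256 tag, suitable for storing as index.pkl.sig."""
    key = derive_integrity_key(secret)
    return hmac.new(key, pickle_bytes, hashlib.sha256).hexdigest()


# Placeholder inputs; a deployment would pass ENCRYPTION_SECRET_KEY instead.
signature = sign_pickle_bytes(b"example pickle bytes", "example-secret")
print(signature)
```

Deriving a separate key, rather than using the secret directly, means a tag produced here is not valid under any other use of the same secret.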

  • On save (the save_local path that writes to storage), the signature is computed from the in-memory pickle bytes and written as index.pkl.sig next to index.pkl.
  • On load, before the bytes are written to the temp directory and handed to FAISS.load_local, _verify_pickle_integrity reads index.pkl.sig and compares it to the freshly computed HMAC using hmac.compare_digest. A mismatch raises ValueError and aborts the load before pickle.load is reached.
  • For legacy indexes that were written before this protection existed and therefore have no .sig file, the loader emits a warning and writes a TOFU (trust-on-first-use) signature for the current bytes, so any subsequent tampering is caught.

The TOFU branch is intentional and called out in a logger.warning so operators know to rebuild any index that may have been tampered with prior to upgrading. It is the only way to avoid breaking existing deployments on first upgrade.
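The load-side behaviour described above can be sketched as follows. This is not the PR's `_verify_pickle_integrity`: the helper name `verify_or_tofu`, the use of a plain `dict` as a stand-in for the storage backend, and the hex encoding of the signature file are all assumptions made for illustration.

```python
import hashlib
import hmac
import logging

logger = logging.getLogger(__name__)


def verify_or_tofu(storage: dict, pkl_path: str, key: bytes) -> bytes:
    """Return the pickle bytes only if their signature checks out (hypothetical)."""
    pickle_bytes = storage[pkl_path]
    sig_path = pkl_path + ".sig"
    expected = hmac.new(key, pickle_bytes, hashlib.sha256).hexdigest().encode("ascii")

    stored = storage.get(sig_path)
    if stored is None:
        # Legacy index: accept once, sign the current bytes, and warn loudly.
        logger.warning("No signature for %s; signing current bytes (TOFU)", pkl_path)
        storage[sig_path] = expected
        return pickle_bytes

    # Constant-time comparison of two bytes objects.
    if not hmac.compare_digest(stored.strip(), expected):
        raise ValueError(f"Integrity check failed for {pkl_path}")
    return pickle_bytes


demo_storage = {"index.pkl": b"legacy bytes"}
demo_key = b"k" * 32
verify_or_tofu(demo_storage, "index.pkl", demo_key)  # TOFU: writes index.pkl.sig
verify_or_tofu(demo_storage, "index.pkl", demo_key)  # verifies against stored tag
```

Only the bytes returned by such a helper would be written to the temp directory and passed on to `FAISS.load_local`.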

Tests

A new test file tests/vectorstore/test_faiss_pickle_integrity.py covers:

  • HMAC of known bytes matches the documented derivation (regression-locks the key derivation so signatures stay valid across upgrades).
  • A correctly signed index.pkl loads without error.
  • A tampered index.pkl (correct signature, mutated bytes) raises ValueError and pickle.load / FAISS.load_local are never reached.
  • A tampered index.pkl.sig (correct bytes, mutated signature) raises ValueError.
  • A legacy index with no .sig triggers the TOFU path: a warning is logged, a signature is written, and a follow-up load with unchanged bytes succeeds while a follow-up load with mutated bytes fails.
  • save_local to storage writes the .sig file alongside index.pkl.
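The shape of the "never reached" assertion in the tampered-bytes case can be sketched as below. This is not the content of the PR's test file: `load_verified`, the fixed `KEY`, and the mock loader are hypothetical stand-ins for the real loader and for `FAISS.load_local`.

```python
import hashlib
import hmac
from unittest import mock

KEY = b"test-key"  # hypothetical fixed key for the test


def load_verified(pickle_bytes: bytes, signature: str, loader):
    """Hypothetical loader: verify the tag first, then hand bytes to `loader`."""
    expected = hmac.new(KEY, pickle_bytes, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("index.pkl signature mismatch")
    return loader(pickle_bytes)


def test_tampered_bytes_never_reach_loader():
    good = b"original index bytes"
    signature = hmac.new(KEY, good, hashlib.sha256).hexdigest()
    loader = mock.Mock(return_value="store")

    # Correctly signed bytes load normally.
    assert load_verified(good, signature, loader) == "store"

    # Mutated bytes with the old signature must fail before the loader runs.
    loader.reset_mock()
    try:
        load_verified(good + b"!", signature, loader)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for tampered bytes")
    loader.assert_not_called()


test_tampered_bytes_never_reach_loader()
```

Asserting on the mock, rather than only on the exception, is what confirms that verification runs before deserialization.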

Existing FAISS-related tests still pass.

Security analysis

The fix raises the bar for exploitation from "write any bytes to index.pkl" to "write index.pkl and forge a valid HMAC", which requires possession of ENCRYPTION_SECRET_KEY. For S3-backed deployments — where bucket-write permissions are commonly delegated and do not imply code execution on the application host — this is a meaningful reduction in blast radius. The signature is compared with hmac.compare_digest so the comparison is constant-time. The key derivation includes a domain separator (docsgpt|faiss-pickle-integrity|v1|) so the same secret used for at-rest encryption cannot accidentally produce an interchangeable key.

There is one residual gap worth flagging for follow-up: an attacker who can race writes to a brand-new index.pkl before its first load will have their bytes signed by the TOFU branch. Closing this fully would require having the writer (the ingest worker / upload endpoint) compute and persist the .sig at write time in every code path, not just add_texts. This PR fixes the add_texts save path; tightening other write paths can be a small follow-up.
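A sketch of what signing at write time could look like in another write path follows. The helper `write_signed` and the `dict` standing in for the storage backend are hypothetical and are not part of this PR.

```python
import hashlib
import hmac


def write_signed(storage: dict, pkl_path: str, pickle_bytes: bytes, key: bytes) -> None:
    """Hypothetical writer: persist the pickle and its HMAC-SHA256 tag together."""
    tag = hmac.new(key, pickle_bytes, hashlib.sha256).hexdigest()
    storage[pkl_path] = pickle_bytes
    storage[pkl_path + ".sig"] = tag.encode("ascii")


demo_storage = {}
write_signed(demo_storage, "indexes/example/index.pkl", b"fresh index bytes", b"k" * 32)
print(sorted(demo_storage))
```

If every writer went through a helper like this, a new index would never reach the loader without a signature, and the TOFU branch would be exercised only by genuinely legacy indexes.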

Adversarial review

Before submitting, we tried to disprove the finding. The argument against it is that for the local-filesystem storage backend, an attacker with write access to indexes/ already has filesystem write on the application host and can pursue many other RCE paths (crontab, source modification, etc.); for that backend the fix is defense-in-depth rather than a true privilege boundary. However, for S3 or other cloud-storage backends, bucket write does not equal host code execution, and allow_dangerous_deserialization=True with no integrity check is a real and exploitable RCE primitive. We also checked whether any framework-level mitigation in langchain_community would block tampered pickles. It does not; that is precisely what the allow_dangerous_deserialization flag opts out of. We are submitting the fix on the strength of the cloud-storage case.

cc @lewiswigmore

…-502)

FAISS.load_local is invoked with allow_dangerous_deserialization=True,
so a tampered index.pkl in storage results in arbitrary code execution
the moment the index is opened. Sign index.pkl on save with HMAC-SHA256
keyed by ENCRYPTION_SECRET_KEY (with a domain separator) and verify the
signature before deserializing. Legacy indexes without a signature are
accepted once and signed trust-on-first-use, with a warning, so the
upgrade is non-breaking.

@vercel

vercel Bot commented May 9, 2026

@sebastiondev is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions github-actions Bot added application Application tests Tests labels May 9, 2026