fix: catch panics from external light-* crates in V2 parser #340

Closed
i-am-bert wants to merge 4 commits into main from fix/catch-external-panic-in-parser

i-am-bert commented Mar 18, 2026

Problem

Both the pitt and fra Photon ingestors are stuck at slot 407265370. light-event v0.21.0 panics in create_batched_transaction_event → create_nullifier_queue_indices with index out of bounds: the len is 3 but the index is 3.

Root Cause

create_nullifier_queue_indices() in light-event has a size mismatch bug:

  1. batch_input_accounts is built by filtering nullifiers to only those matching input_sequence_numbers
  2. create_nullifier_queue_indices(len) receives len = batch_input_accounts.len() (the filtered count)
  3. Inside, it creates nullifier_queue_indices = vec![u64::MAX; len] (sized to filtered count)
  4. But then iterates over all input_compressed_accounts by index i, accessing nullifier_queue_indices[i]
  5. When input_compressed_accounts.len() > batch_input_accounts.len() (i.e. some nullifiers were filtered out), i reaches len → panic

This bug exists in both v0.21.0 and v0.23.0 of light-event.
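
The size mismatch can be reproduced in isolation. The function below is a hypothetical, simplified model, not light-event's actual code or signature: num_inputs stands in for input_compressed_accounts.len(), filtered_len for batch_input_accounts.len(), and the value written into each slot is only a placeholder.

```rust
/// Hypothetical, simplified model of the sizing bug (not light-event's code).
/// num_inputs   ~ input_compressed_accounts.len()
/// filtered_len ~ batch_input_accounts.len()
fn queue_indices_buggy(num_inputs: usize, filtered_len: usize) -> Vec<u64> {
    // Steps 2-3: the vec is sized to the *filtered* count...
    let mut indices = vec![u64::MAX; filtered_len];
    // Step 4: ...but indexed by position in the *unfiltered* input list.
    for i in 0..num_inputs {
        // Placeholder value; the real function stores a queue index here.
        indices[i] = i as u64; // panics as soon as i == filtered_len
    }
    indices
}
```

With 4 inputs of which 3 survive the filter, queue_indices_buggy(4, 3) panics with "index out of bounds: the len is 3 but the index is 3", the same message seen at slot 407265370. When nothing is filtered out, as in queue_indices_buggy(3, 3), it returns normally.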

```mermaid
graph TD
    A[nullifiers array - 4 items] -->|filter by input_sequence_numbers| B[batch_input_accounts - 3 items]
    B -->|.len| C["len = 3"]
    C --> D["nullifier_queue_indices = vec![MAX; 3]"]
    E[input_compressed_accounts - 4 items] -->|iterate by index i| F["i goes 0,1,2,3"]
    F -->|"access [3]"| D
    D -->|"len is 3, index is 3"| G["💥 PANIC"]
```

Fix (two layers)

1. Root cause fix: Patch light-event (light-event-patched/)

Vendored light-event v0.21.0 and overrode it with a [patch.crates-io] entry. Fixed create_nullifier_queue_indices to size the vec to input_merkle_tree_pubkeys.len() (the array actually being indexed) instead of the filtered batch_input_accounts count. The caller zips the result with batch_input_accounts, so the extra entries are harmlessly ignored.
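
A sketch of the patched sizing under the same simplifying assumptions: queue_indices_fixed and pair_with_batch are hypothetical names, num_inputs stands in for input_merkle_tree_pubkeys.len(), and plain string labels stand in for batch_input_accounts. It shows only the two properties the fix relies on: every position of the iterated array is in bounds, and zip stops at the shorter side, so trailing entries are never read. The positional pairing is illustrative and does not claim to match which queue index the real crate assigns to each account.

```rust
/// Hypothetical model of the patched sizing (not the vendored crate's code).
/// num_inputs ~ input_merkle_tree_pubkeys.len(), the array being indexed.
fn queue_indices_fixed(num_inputs: usize) -> Vec<u64> {
    // Sized to the array that is iterated, so indices[i] is always in bounds.
    let mut indices = vec![u64::MAX; num_inputs];
    for i in 0..num_inputs {
        indices[i] = i as u64; // placeholder for the real queue-index value
    }
    indices
}

/// Hypothetical caller: zip yields min(batch.len(), indices.len()) pairs,
/// so entries past batch_input_accounts.len() are never read.
fn pair_with_batch<'a>(batch: &[&'a str], indices: &[u64]) -> Vec<(&'a str, u64)> {
    batch
        .iter()
        .zip(indices.iter())
        .map(|(acct, idx)| (*acct, *idx))
        .collect()
}
```

For 4 inputs and 3 batch accounts, queue_indices_fixed(4) has length 4 and pair_with_batch yields exactly 3 pairs.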

2. Safety net: catch_unwind in V2 parser (src/ingester/parser/mod.rs)

Wraps parse_public_transaction_event_v2() in catch_unwind(AssertUnwindSafe(...)) to prevent any future panic from external light-* crates from halting the ingestor. On a caught panic, it logs the transaction signature, slot, and panic message, emits a parser_panic_caught StatsD metric, and returns None.
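
A minimal sketch of this guard, assuming a hypothetical parse_guarded helper that takes a closure in place of the real parse_public_transaction_event_v2 call. eprintln! stands in for the ingestor's logger and a static AtomicU64 for the parser_panic_caught StatsD counter; neither reflects the actual code. The panic payload is a &'static str for panic!("literal") and a String for formatted messages such as the index-out-of-bounds one, so both are checked.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the parser_panic_caught StatsD counter.
static PARSER_PANIC_CAUGHT: AtomicU64 = AtomicU64::new(0);

/// Extracts a readable message from a caught panic payload.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "<non-string panic payload>".to_string()
    }
}

/// Hypothetical guard: `parse` stands in for parse_public_transaction_event_v2.
/// Returns the parser's result, or None if the parser panicked.
fn parse_guarded<T, F>(tx_sig: &str, slot: u64, parse: F) -> Option<T>
where
    F: FnOnce() -> Option<T>,
{
    match panic::catch_unwind(AssertUnwindSafe(parse)) {
        Ok(parsed) => parsed,
        Err(payload) => {
            // Log enough context to locate the offending transaction later.
            eprintln!(
                "parser panic caught: tx={tx_sig} slot={slot} msg={}",
                panic_message(&*payload)
            );
            // Record the monitoring signal (an in-process counter here).
            PARSER_PANIC_CAUGHT.fetch_add(1, Ordering::Relaxed);
            None
        }
    }
}
```

Two caveats apply to this design: catch_unwind only intercepts unwinding panics, so it does nothing if the build sets panic = "abort", and the default panic hook still prints the message to stderr before the guard logs it. AssertUnwindSafe is the caller's assertion that no state left half-updated by the panic will be observed afterwards.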

Testing

  • cargo check passes cleanly
  • The root cause fix ensures the panicking transaction at slot 407265370 will parse correctly — no data loss
  • The safety net ensures future unknown panics degrade gracefully instead of halting indexing

Upstream

Bug exists in light-event v0.21.0 and v0.23.0. Will file an upstream issue on lightprotocol/light-protocol.

Wraps the parse_public_transaction_event_v2 call in catch_unwind to prevent
panics in external Light Protocol dependencies (e.g. light-event) from
killing the ingestor task and halting indexing.

On caught panic:
- Logs tx signature, slot, and panic message at error level
- Emits parser_panic_caught StatsD metric for monitoring
- Returns None, skipping the problematic transaction
- Indexing continues with the next transaction

This fixes the active incident where light-event v0.21.0 panics with
'index out of bounds: the len is 3 but the index is 3' at slot 407265370,
causing both pitt and fra ingestors to be permanently stuck.
…indices

Root cause: create_nullifier_queue_indices() sizes nullifier_queue_indices
to batch_input_accounts.len() (filtered nullifiers), but iterates over
ALL input_compressed_accounts by index. When some nullifiers are filtered
out, input_compressed_accounts.len() > batch_input_accounts.len(),
causing an index-out-of-bounds panic.

Fix: size the vec to input_merkle_tree_pubkeys.len() (the actual array
being iterated). The caller zips with batch_input_accounts, so extra
entries are harmlessly ignored.

Uses [patch.crates-io] to override light-event with a patched local copy
until upstream fixes this in a new release.

Per Het: Solana is the source of truth. Skipping a transaction risks serving
stale or incorrect data; halting is the lesser evil. The vendored light-event
fix resolves the root cause, so no transactions are skipped.
@i-am-bert

Superseded by #341 (clean version bump to light-event 0.24). The vendored patch approach is no longer needed.

i-am-bert closed this Mar 18, 2026