STR-3414: admin API for prover tasks via strata-dbtool#1842
Conversation
Admin tooling (strata-dbtool) needs to drop and bulk-scan prover task records — the runtime path only ever advances status, so the trait previously had no removal or full-iteration entry points. Extends both the OL ProofDBSled and the EE EeProverDbSled with the new methods and covers delete_task in the shared proof-db test suite.
|
Commit: d1de28c SP1 Execution Results
|
Codecov Report❌ Patch coverage is @@ Coverage Diff @@
## main #1842 +/- ##
==========================================
- Coverage 79.77% 79.68% -0.10%
==========================================
Files 669 683 +14
Lines 74269 75944 +1675
==========================================
+ Hits 59249 60516 +1267
- Misses 15020 15428 +408
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 30 files with indirect coverage changes 🚀 New features to boost your workflow:
|
a776b1c to
91e03f5
Compare
The integrated checkpoint prover's task-key wrapper was previously binary-internal to bin/strata. Moves it into strata-checkpoint-types so admin tooling (strata-dbtool) can construct the same borsh-encoded key without duplicating the layout — single source of truth for the on-disk byte format.
Surfaces the OL prover-task store via two read-only subcommands: get-prover-task fetches a single record by hex key, and get-prover-tasks-summary aggregates counts by status with a bounded slice of matching entries (filterable by status). Establishes the porcelain/JSON Output structs that the upcoming mutate commands will reuse.
Three single-task mutate verbs plus a bulk variant, each gated behind
--confirm:
- abandon-prover-task flips status to PermanentFailure { error:
"abandoned via dbtool" } so the row stays in the DB for audit but
is skipped by the startup recovery scanner.
- abandon-prover-tasks --all-unfinished is the bulk form (supports
--dry-run).
- reset-prover-task flips status back to Pending and clears
retry_after, forcing a fresh prove attempt.
- delete-prover-task hard-removes the row.
Directly addresses the "abandon/nuke in-progress proof tasks upon
restart" use case from the ticket.
Two commands operating on the OL CheckpointProofDatabase tree: get-checkpoint-proof fetches the stored receipt for an epoch (zkvm, proof type, program id, length), and delete-checkpoint-proof drops it (gated behind --confirm). Both resolve the canonical commitment at the supplied epoch so callers can work in epoch numbers rather than raw EpochCommitment bytes. Pairs with reset-prover-task for forcing a re-prove after a guest upgrade: delete the stale receipt, then reset the task.
Two ways to inject a fresh Pending task into the OL prover-task store
"from outside":
- backfill-checkpoint-proof-task <epoch> resolves the canonical
EpochCommitment for the supplied epoch and constructs the task key
via the shared CheckpointProofTask wrapper — so the running node
picks the task up on next startup recovery.
- backfill-prover-task-raw <key_hex> is the escape hatch for proof
kinds that don't have a typed helper yet; caller is responsible
for the key format.
Directly addresses the "backfill proof requests from outside" use
case from the ticket.
Adds a "Prover Task Admin" section to the README covering the new inspection, mutate, backfill, and checkpoint-proof verbs, with a semantics table contrasting abandon / reset / delete / backfill and a loud "node must be offline" warning. Includes a "Not yet supported" note flagging the EE prover store as a follow-up gap.
Spins up a Strata node, stops it, then drives the new dbtool surface through the full lifecycle against the offline datadir: confirm-gating on every mutate, raw-key backfill, get/summary round-trip, abandon (check the documented PermanentFailure reason lands), reset (back to Pending with retry_after cleared), delete, and a bulk dry-run + for-real abandon-prover-tasks. Locks in the DB-level admin contract without depending on the prover producing records.
91e03f5 to
13938c1
Compare
|
@prajwolrg
Should we have similar instead of |
Admin tooling needs to drop stale chunk receipts and acct proofs after guest-program upgrades or broken runs; without these the only way to clear a row was to wipe the whole sled directory. delete_acct_proof must also clear the ProofId -> BatchId secondary index in the same call, otherwise get_acct_proof_by_id would dangle on a missing primary row.
The forthcoming EE prover-task admin commands accept the same status filters, hex-key encoding, --confirm guard, and PermanentFailure reason string as the OL surface; lifting these to a shared module keeps the two surfaces in lockstep so operator workflows behave identically against either store.
The EE prover store lives in a separate sled instance under the alpen-client datadir, so the existing OL-only dbtool couldn't reach chunk or acct prover tasks at all — recovering from a stuck EE prover required wiping the whole sled directory. Add a top-level --ee-datadir flag and seven ee-* subcommands that mirror the OL surface but operate on EeProverDbSled directly: get/summary/abandon/abandon-bulk/reset/ delete/backfill-raw. Chunk and acct tasks share one tree, kind-tagged by the first byte of every key (b'c' / b'a'), so summary and bulk commands carry a --kind filter and inspection output surfaces a kind field for grepping. The EE DB is opened lazily so OL-only invocations keep working without --ee-datadir set.
Pairs with the EE prover-task admin surface. Chunk receipts and acct proofs live in distinct sled trees, so they get their own verbs: chunk receipts are keyed by the opaque chunk-task key, acct proofs by BatchId (accepted as <prev_block_hex>:<last_block_hex>, matching the Display format so operators can paste straight from alpen-client logs). Both delete verbs operate via the new EeProverDbSled methods — the acct path clears the ProofId secondary index so get_acct_proof_by_id can't dangle.
Replace the previous "Not yet supported" stub with the actual EE surface. The decision table up front lets an operator coming in cold pick the right verb against OL vs EE chunk vs EE acct without having to read the rest of the section.
Drives the alpen-client sequencer's offline datadir through the full ee-* lifecycle: confirm-gating, raw backfill of one chunk-tagged and one acct-tagged task, --kind filter assertions on both, abandon/ reset/delete on the single-key paths, and a bulk dry-run + for-real flow scoped by --kind. Records deltas against the baseline summary so the test stays correct if the alpen-client already wrote real chunk or acct tasks before we stopped it. Receipt admin coverage stays at the Rust unit-test level; driving it from a functional test would require the chunk/acct prover to actually produce a receipt.
The original surface used --confirm with a hard error when missing, while the older revert-ol-state command on the same tool uses -f/--force with a dry run as the default. Consistency wins: operators shouldn't have to remember which flavor of "I really mean it" each verb wants, and dry-run-by-default also gives a free preview path without needing a separate --dry-run flag on the bulk variants. Every mutating verb (OL + EE, prover-task + receipts) now: - accepts -f/--force; without it, prints "would <verb>: <target>" plus the standard "Use --force to execute these changes." hint - pre-reads the target row for delete-style verbs so the dry run can surface a clear "no row for key" error rather than silently previewing a no-op delete Drops the now-redundant --dry-run flag on the bulk-abandon variants.
We should definitely make things consistent. Handled in 3f29617. |
| ### Global Options | ||
|
|
||
| - `-d, --datadir <path>` - Node data directory (default: `data`) | ||
| - `--ee-datadir <path>` - Alpen-client data directory. Required for any `ee-*` subcommand; points at the alpen-client's `--datadir`, not the strata node's. |
There was a problem hiding this comment.
that's becoming messy
There was a problem hiding this comment.
Yes why can't we specify a different path with --datadir to run the command? Each command is standalone.
|
the dbtool becomes much messier and would need its own refactor imo. for the time being this seems ~alrightish, but ideally we'd want to have it (1) more structured (2) more declarative and more friendly with storage (3) with much less boilerplate (so when we add a new proof type, it's automatically/almost automatically brought into dbtool with minimal efforts and little-to-no ctrl-c-ctrl-v). P.S. in ideal world, our cli tools should resemble well known patterns for cli tools. examples: Similarly, we can define the "resource" can be checkpoint (for OL) / proof (for both) / state parts (specific to any) / etc. In contract, rn we are flat ass with resources, commands, etc. |
|
overall, pre-ACK on the effort (it does move in a much needed direction). |
|
Marking this for review. I'd defer the refactor out of this PR into a separate ticket instead, but the API you suggested look much better. CC: @krsnapaudel who I believe originally worked on the dbtool first. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c63442f1c5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }); | ||
| let db = db.as_ref(); |
There was a problem hiding this comment.
Open OL DB lazily for ee- commands*
main opens the strata node DB before command dispatch, so ee-* operations still attempt to open and lock the OL sled even though they only need --ee-datadir. In practice this blocks EE admin flows when the strata node is running (or when -d points to an inaccessible path), because execution exits before reaching with_ee_db. This is a regression introduced with the new EE-only commands and makes their documented operational mode unreliable.
Useful? React with 👍 / 👎.
| let (prev, last) = s.split_once(':').ok_or_else(|| { | ||
| DisplayedError::UserError( | ||
| "Expected batch id as <prev_block_hex>:<last_block_hex>".to_string(), | ||
| Box::new(s.to_string()), |
There was a problem hiding this comment.
Parse logged BatchId format for acct proof commands
The acct-proof CLI parser only accepts "<prev_hex>:<last_hex>", but the codebase’s BatchId Display output is BatchId(prev_block=..., last_block=...) (used by %batch_id logging). That means operators cannot pass the value they are told to copy from logs, and ee-get-acct-proof/ee-delete-acct-proof will reject it immediately. This should either parse the displayed form too or align all user-facing guidance/output to the accepted input shape.
Useful? React with 👍 / 👎.
The test stops the alpen sequencer to poke at the sled DB directly, but was attached to the shared `alpen_ee` env via a string name. flexitest reuses services across tests in the same env, so leaving the sequencer stopped broke every later `alpen_ee` test (10 EVM tests failed with `Service is not running`). Mirror what `prover_task_admin.py` already does on the OL side and pass an `AlpenClientEnv` instance so the env is private to this test.
| @@ -38,6 +54,13 @@ fn main() { | |||
| }); | |||
| let db = db.as_ref(); | |||
There was a problem hiding this comment.
This still opens the OL database before command dispatch, so every ee-* invocation attempts to open/lock cli.datadir even though those commands only need --ee-datadir. That breaks the documented EE-only operational path when the strata node is running, or when the default -d data path is missing/inaccessible, because execution exits before reaching with_ee_db.
Permalink:
alpen/bin/strata-dbtool/src/main.rs
Lines 51 to 55 in 1756e76
I think the fix is to make the OL DB lazy too: dispatch ee-* arms before calling open_database, or wrap OL-only arms in a helper that opens the OL DB only for commands that actually need it.
| let (prev, last) = s.split_once(':').ok_or_else(|| { | ||
| DisplayedError::UserError( | ||
| "Expected batch id as <prev_block_hex>:<last_block_hex>".to_string(), | ||
| Box::new(s.to_string()), |
There was a problem hiding this comment.
The acct-proof parser only accepts <prev_block_hex>:<last_block_hex>, but the codebase logs/display path for this value is BatchId(prev_block=..., last_block=...). In particular, BatchId::Display emits that wrapper form and the acct proof hook logs it with %batch_id, so operators cannot copy the value they see in logs into ee-get-acct-proof / ee-delete-acct-proof as the docs/comments imply.
Permalinks:
- Parser:
alpen/bin/strata-dbtool/src/cmd/ee_receipts.rs
Lines 93 to 108 in 1756e76
- Display format:
alpen/crates/alpen-ee/common/src/types/batch.rs
Lines 17 to 24 in 1756e76
- Logged with
%batch_id:alpen/bin/alpen-client/src/prover/hooks.rs
Lines 84 to 90 in 1756e76
Please either parse the displayed BatchId(...) form too, or change the user-facing guidance/output so it consistently surfaces the accepted colon form.
delbonis
left a comment
There was a problem hiding this comment.
Implementation-wise this looks good, aside from the other comments.
But I'm wondering if it actually makes sense to put the EE-specific dbtool subcommands in a separate binary so that we aren't merging unrelated areas of concern. Like doesn't this now depend on both Strata and Alpen database concepts now?
There was a problem hiding this comment.
Comments before attributes!
|
Functionally this looks good enough to me - I looked into the subcommands and they seem to support the operations that are needed. I agree with comments on structuring the interface and making it less boilerplated. Thinking out loud, not a concrete direction: one thing worth exploring as a separate task is a database <-> commandline abstraction layer which would primarily support the basic CRUD operations which account for majority of the dbtool access right now - most of the subcommands are plain read/writes of db tables. The open question is whether our database trees are uniform enough to support this. Supporting custom operations besides basic CRUD would be a cherry-on-top. An example that comes to mind is how Django the web framework makes it reallyy easy to define CRUD interface for admin operations(declarative model registration -> free CRUD UI), at the same time being extremely customizable and flexible. |
|
agree with @bewakes on the idea! perhaps we can have a thin decl layer (through macro or otherwise) that's based off storage schema that delivers basic read/write logic (crud style). |
Description
Delivers an admin API for all three prover task stores in the Alpen stack via
strata-dbtool(per STR-3414): the OL integrated checkpoint prover, the EE chunk prover, and the EE acct/batch prover. None of these had an out-of-band path before — operators couldn't abandon stuck in-progress tasks across a restart, force a fresh re-prove, drop stale receipts after a guest-program upgrade, or queue a proof request "from outside." This PR adds offline admin verbs covering all of those for every store.Surface coverage
ee-*(--kind chunk)ee-*(--kind acct)The two EE stores share one sled tree, kind-tagged by the first byte of every key (
b'c'/b'a'), soee-*task commands operate on opaque keys with a--kindfilter on inspection and bulk-abandon. Receipt admin is kind-specific because the receipts live in distinct trees: chunk receipts keyed by task key, acct proofs typed byBatchId(<prev_block_hex>:<last_block_hex>, matching theDisplayformat).EE commands need
--ee-datadir(the alpen-client's--datadir). It's opened lazily, so OL-only invocations keep working without the flag set.Verb semantics
abandonPermanentFailure { error: "abandoned via dbtool" }resetPending(retry-after cleared)deleteabandonunless you really want no trace left.backfillPending(newly inserted)delete-*-receipt/delete-acct-proofProofId → BatchIdindex)Every mutating subcommand is a dry run unless
-f, --forceis passed — matching the existingrevert-ol-stateUX. Without--force, the command prints what would happen and aUse --force to execute these changes.hint; with--force, the mutation lands. See bin/strata-dbtool/README.md for the full command list and usage.Type of Change
CheckpointProofTaskand shared prover-task admin helpers to shared modules; addeddelete_chunk_receipt/delete_acct_prooftoEeProverDbSled)Notes to Reviewers
Checklist
AI Assistance Disclosure
This PR was written by Claude Code (Opus 4.7), with me reviewing and steering the design.
Related Issues
STR-3414