Skip to content

STR-3414: admin API for prover tasks via strata-dbtool#1842

Open
prajwolrg wants to merge 17 commits into
mainfrom
STR-3414-prover-dbtool
Open

STR-3414: admin API for prover tasks via strata-dbtool#1842
prajwolrg wants to merge 17 commits into
mainfrom
STR-3414-prover-dbtool

Conversation

@prajwolrg
Copy link
Copy Markdown
Contributor

@prajwolrg prajwolrg commented May 20, 2026

Description

Delivers an admin API for all three prover task stores in the Alpen stack via strata-dbtool (per STR-3414): the OL integrated checkpoint prover, the EE chunk prover, and the EE acct/batch prover. None of these had an out-of-band path before — operators couldn't abandon stuck in-progress tasks across a restart, force a fresh re-prove, drop stale receipts after a guest-program upgrade, or queue a proof request "from outside." This PR adds offline admin verbs covering all of those for every store.

Surface coverage

Concern Lives in Subcommand prefix
OL checkpoint proofs strata node datadir (no prefix)
EE chunk proofs alpen-client datadir ee-* (--kind chunk)
EE acct/batch proofs alpen-client datadir ee-* (--kind acct)

The two EE stores share one sled tree, kind-tagged by the first byte of every key (b'c' / b'a'), so ee-* task commands operate on opaque keys with a --kind filter on inspection and bulk-abandon. Receipt admin is kind-specific because the receipts live in distinct trees: chunk receipts keyed by task key, acct proofs typed by BatchId (<prev_block_hex>:<last_block_hex>, matching the Display format).

EE commands need --ee-datadir (the alpen-client's --datadir). It's opened lazily, so OL-only invocations keep working without the flag set.

Verb semantics

Verb Final status When to use
abandon PermanentFailure { error: "abandoned via dbtool" } Stop the recovery scanner from respawning a stuck task while keeping an audit trail.
reset Pending (retry-after cleared) Force a fresh prove attempt — drops accumulated retry count.
delete row removed Prefer abandon unless you really want no trace left.
backfill Pending (newly inserted) Queue a proof request "from outside" — e.g. for an epoch the node never picked up.
delete-*-receipt / delete-acct-proof row removed (acct also clears ProofId → BatchId index) Drop a stale receipt after a guest-program upgrade so the next run re-proves it.

Every mutating subcommand is a dry run unless -f, --force is passed — matching the existing revert-ol-state UX. Without --force, the command prints what would happen and a Use --force to execute these changes. hint; with --force, the mutation lands. See bin/strata-dbtool/README.md for the full command list and usage.

Type of Change

  • New feature/Enhancement (non-breaking change which adds functionality or enhances an existing one)
  • Refactor (extracted CheckpointProofTask and shared prover-task admin helpers to shared modules; added delete_chunk_receipt / delete_acct_proof to EeProverDbSled)
  • New or updated tests

Notes to Reviewers

Checklist

  • I have performed a self-review of my code.
  • I have commented my code where necessary.
  • I have updated the documentation if needed.
  • My changes do not introduce new warnings.
  • I have added (where necessary) tests that prove my changes are effective or that my feature works.
  • New and existing tests pass with my changes.
  • I have disclosed my use of AI in the body of this PR.

AI Assistance Disclosure

This PR was written by Claude Code (Opus 4.7), with me reviewing and steering the design.

Related Issues

STR-3414

Admin tooling (strata-dbtool) needs to drop and bulk-scan prover task
records — the runtime path only ever advances status, so the trait
previously had no removal or full-iteration entry points. Extends both
the OL ProofDBSled and the EE EeProverDbSled with the new methods and
covers delete_task in the shared proof-db test suite.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

Commit: d1de28c

SP1 Execution Results

program cycles gas
EVM EE Chunk 565,596 771,682
EVM EE Account 421,125 526,509
Checkpoint 2,241,398 2,582,981

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 74.79871% with 313 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.68%. Comparing base (1d298c7) to head (1756e76).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
bin/strata-dbtool/src/cmd/ee_receipts.rs 38.09% 91 Missing ⚠️
bin/strata-dbtool/src/cmd/checkpoint_proof.rs 0.00% 59 Missing ⚠️
bin/strata-dbtool/src/cmd/prover_task.rs 75.54% 56 Missing ⚠️
bin/strata-dbtool/src/cmd/ee_prover_task.rs 84.64% 35 Missing ⚠️
bin/strata-dbtool/src/output/prover_task.rs 80.45% 34 Missing ⚠️
bin/strata-dbtool/src/main.rs 65.11% 15 Missing ⚠️
bin/strata-dbtool/src/output/checkpoint_proof.rs 79.71% 14 Missing ⚠️
bin/strata-dbtool/src/db.rs 68.75% 5 Missing ⚠️
bin/strata-dbtool/src/cmd/prover_task_common.rs 95.83% 4 Missing ⚠️
@@            Coverage Diff             @@
##             main    #1842      +/-   ##
==========================================
- Coverage   79.77%   79.68%   -0.10%     
==========================================
  Files         669      683      +14     
  Lines       74269    75944    +1675     
==========================================
+ Hits        59249    60516    +1267     
- Misses      15020    15428     +408     
Flag Coverage Δ
functional 59.98% <49.47%> (-0.10%) ⬇️
unit 65.37% <42.27%> (-0.54%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
bin/strata-dbtool/src/cli.rs 33.33% <ø> (ø)
bin/strata-dbtool/src/output/ee_receipts.rs 100.00% <100.00%> (ø)
bin/strata/src/prover/spec.rs 77.08% <ø> (+0.89%) ⬆️
crates/alpen-ee/database/src/sleddb/prover_db.rs 99.37% <100.00%> (+5.82%) ⬆️
crates/checkpoint-types/src/prover_task.rs 100.00% <100.00%> (ø)
crates/db/store-sled/src/prover/db.rs 94.59% <100.00%> (+2.65%) ⬆️
bin/strata-dbtool/src/cmd/prover_task_common.rs 95.83% <95.83%> (ø)
bin/strata-dbtool/src/db.rs 80.00% <68.75%> (-20.00%) ⬇️
bin/strata-dbtool/src/output/checkpoint_proof.rs 79.71% <79.71%> (ø)
bin/strata-dbtool/src/main.rs 67.60% <65.11%> (-3.83%) ⬇️
... and 5 more

... and 30 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@prajwolrg prajwolrg force-pushed the STR-3414-prover-dbtool branch from a776b1c to 91e03f5 Compare May 20, 2026 14:17
prajwolrg added 7 commits May 20, 2026 20:08
The integrated checkpoint prover's task-key wrapper was previously
binary-internal to bin/strata. Moves it into strata-checkpoint-types
so admin tooling (strata-dbtool) can construct the same borsh-encoded
key without duplicating the layout — single source of truth for the
on-disk byte format.
Surfaces the OL prover-task store via two read-only subcommands:
get-prover-task fetches a single record by hex key, and
get-prover-tasks-summary aggregates counts by status with a bounded
slice of matching entries (filterable by status). Establishes the
porcelain/JSON Output structs that the upcoming mutate commands
will reuse.
Three single-task mutate verbs plus a bulk variant, each gated behind
--confirm:
  - abandon-prover-task flips status to PermanentFailure { error:
    "abandoned via dbtool" } so the row stays in the DB for audit but
    is skipped by the startup recovery scanner.
  - abandon-prover-tasks --all-unfinished is the bulk form (supports
    --dry-run).
  - reset-prover-task flips status back to Pending and clears
    retry_after, forcing a fresh prove attempt.
  - delete-prover-task hard-removes the row.

Directly addresses the "abandon/nuke in-progress proof tasks upon
restart" use case from the ticket.
Two commands operating on the OL CheckpointProofDatabase tree:
get-checkpoint-proof fetches the stored receipt for an epoch (zkvm,
proof type, program id, length), and delete-checkpoint-proof drops
it (gated behind --confirm). Both resolve the canonical commitment
at the supplied epoch so callers can work in epoch numbers rather
than raw EpochCommitment bytes.

Pairs with reset-prover-task for forcing a re-prove after a guest
upgrade: delete the stale receipt, then reset the task.
Two ways to inject a fresh Pending task into the OL prover-task store
"from outside":
  - backfill-checkpoint-proof-task <epoch> resolves the canonical
    EpochCommitment for the supplied epoch and constructs the task key
    via the shared CheckpointProofTask wrapper — so the running node
    picks the task up on next startup recovery.
  - backfill-prover-task-raw <key_hex> is the escape hatch for proof
    kinds that don't have a typed helper yet; caller is responsible
    for the key format.

Directly addresses the "backfill proof requests from outside" use
case from the ticket.
Adds a "Prover Task Admin" section to the README covering the new
inspection, mutate, backfill, and checkpoint-proof verbs, with a
semantics table contrasting abandon / reset / delete / backfill and
a loud "node must be offline" warning. Includes a "Not yet supported"
note flagging the EE prover store as a follow-up gap.
Spins up a Strata node, stops it, then drives the new dbtool surface
through the full lifecycle against the offline datadir: confirm-gating
on every mutate, raw-key backfill, get/summary round-trip, abandon
(check the documented PermanentFailure reason lands), reset (back to
Pending with retry_after cleared), delete, and a bulk dry-run +
for-real abandon-prover-tasks. Locks in the DB-level admin contract
without depending on the prover producing records.
@prajwolrg prajwolrg force-pushed the STR-3414-prover-dbtool branch from 91e03f5 to 13938c1 Compare May 20, 2026 14:24
@krsnapaudel
Copy link
Copy Markdown
Contributor

krsnapaudel commented May 20, 2026

@prajwolrg strata-dbtool has this option for revert-ol-state.

-f, --force - Force execution (without this flag, only a dry run is performed)

Should we have similar instead of --confirm flag? For consistency.

prajwolrg added 8 commits May 21, 2026 07:49
Admin tooling needs to drop stale chunk receipts and acct proofs after
guest-program upgrades or broken runs; without these the only way to
clear a row was to wipe the whole sled directory. delete_acct_proof
must also clear the ProofId -> BatchId secondary index in the same
call, otherwise get_acct_proof_by_id would dangle on a missing primary
row.
The forthcoming EE prover-task admin commands accept the same status
filters, hex-key encoding, --confirm guard, and PermanentFailure
reason string as the OL surface; lifting these to a shared module
keeps the two surfaces in lockstep so operator workflows behave
identically against either store.
The EE prover store lives in a separate sled instance under the
alpen-client datadir, so the existing OL-only dbtool couldn't reach
chunk or acct prover tasks at all — recovering from a stuck EE prover
required wiping the whole sled directory. Add a top-level --ee-datadir
flag and seven ee-* subcommands that mirror the OL surface but operate
on EeProverDbSled directly: get/summary/abandon/abandon-bulk/reset/
delete/backfill-raw. Chunk and acct tasks share one tree, kind-tagged
by the first byte of every key (b'c' / b'a'), so summary and bulk
commands carry a --kind filter and inspection output surfaces a kind
field for grepping. The EE DB is opened lazily so OL-only invocations
keep working without --ee-datadir set.
Pairs with the EE prover-task admin surface. Chunk receipts and acct
proofs live in distinct sled trees, so they get their own verbs:
chunk receipts are keyed by the opaque chunk-task key, acct proofs by
BatchId (accepted as <prev_block_hex>:<last_block_hex>, matching the
Display format so operators can paste straight from alpen-client
logs). Both delete verbs operate via the new EeProverDbSled
methods — the acct path clears the ProofId secondary index so
get_acct_proof_by_id can't dangle.
Replace the previous "Not yet supported" stub with the actual EE
surface. The decision table up front lets an operator coming in cold
pick the right verb against OL vs EE chunk vs EE acct without having
to read the rest of the section.
Drives the alpen-client sequencer's offline datadir through the full
ee-* lifecycle: confirm-gating, raw backfill of one chunk-tagged and
one acct-tagged task, --kind filter assertions on both, abandon/
reset/delete on the single-key paths, and a bulk dry-run + for-real
flow scoped by --kind. Records deltas against the baseline summary
so the test stays correct if the alpen-client already wrote real
chunk or acct tasks before we stopped it. Receipt admin coverage
stays at the Rust unit-test level; driving it from a functional test
would require the chunk/acct prover to actually produce a receipt.
The original surface used --confirm with a hard error when missing,
while the older revert-ol-state command on the same tool uses
-f/--force with a dry run as the default. Consistency wins: operators
shouldn't have to remember which flavor of "I really mean it" each
verb wants, and dry-run-by-default also gives a free preview path
without needing a separate --dry-run flag on the bulk variants.

Every mutating verb (OL + EE, prover-task + receipts) now:
- accepts -f/--force; without it, prints "would <verb>: <target>"
  plus the standard "Use --force to execute these changes." hint
- pre-reads the target row for delete-style verbs so the dry run can
  surface a clear "no row for key" error rather than silently
  previewing a no-op delete

Drops the now-redundant --dry-run flag on the bulk-abandon variants.
@prajwolrg
Copy link
Copy Markdown
Contributor Author

@prajwolrg strata-dbtool has this option for revert-ol-state.

-f, --force - Force execution (without this flag, only a dry run is performed)

Should we have similar instead of --confirm flag? For consistency.

We should definitely make things consistent. Handled in 3f29617.

### Global Options

- `-d, --datadir <path>` - Node data directory (default: `data`)
- `--ee-datadir <path>` - Alpen-client data directory. Required for any `ee-*` subcommand; points at the alpen-client's `--datadir`, not the strata node's.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's becoming messy

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes why can't we specify a different path with --datadir to run the command? Each command is standalone.

@evgenyzdanovich
Copy link
Copy Markdown
Contributor

evgenyzdanovich commented May 21, 2026

the dbtool becomes much messier and would need its own refactor imo.

for the time being this seems ~alrightish, but ideally we'd want to have it (1) more structured (2) more declarative and more friendly with storage (3) with much less boilerplate (so when we add a new proof type, it's automatically/almost automatically brought into dbtool with minimal efforts and little-to-no ctrl-c-ctrl-v).

P.S. in ideal world, our cli tools should resemble well known patterns for cli tools.
For example, when I interact with google cloud cli, it is very intuitive (because it is "layered" properly). it looks like this:

gcloud <api type> <resource type> <command type> <subcommand type>

examples:

gcloud compute vms <list|get|any other verbs>
gcloud compute disks <list|get|any other verb>
gcloud networking interfaces <list|get|any other verb>

Similarly, we can define

dbtool ol <ol resouce> <verb>
dbtool ee <resource> <verb>
dbtool proofs ...
etc

the "resource" can be checkpoint (for OL) / proof (for both) / state parts (specific to any) / etc.


In contract, rn we are flat ass with resources, commands, etc.

@evgenyzdanovich
Copy link
Copy Markdown
Contributor

overall, pre-ACK on the effort (it does move in a much needed direction).

@prajwolrg prajwolrg marked this pull request as ready for review May 21, 2026 13:40
@prajwolrg
Copy link
Copy Markdown
Contributor Author

Marking this for review. I'd defer the refactor out of this PR into a separate ticket instead, but the API you suggested look much better. CC: @krsnapaudel who I believe originally worked on the dbtool first.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c63442f1c5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 54 to 55
});
let db = db.as_ref();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Open OL DB lazily for ee- commands*

main opens the strata node DB before command dispatch, so ee-* operations still attempt to open and lock the OL sled even though they only need --ee-datadir. In practice this blocks EE admin flows when the strata node is running (or when -d points to an inaccessible path), because execution exits before reaching with_ee_db. This is a regression introduced with the new EE-only commands and makes their documented operational mode unreliable.

Useful? React with 👍 / 👎.

Comment on lines +99 to +102
let (prev, last) = s.split_once(':').ok_or_else(|| {
DisplayedError::UserError(
"Expected batch id as <prev_block_hex>:<last_block_hex>".to_string(),
Box::new(s.to_string()),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse logged BatchId format for acct proof commands

The acct-proof CLI parser only accepts "<prev_hex>:<last_hex>", but the codebase’s BatchId Display output is BatchId(prev_block=..., last_block=...) (used by %batch_id logging). That means operators cannot pass the value they are told to copy from logs, and ee-get-acct-proof/ee-delete-acct-proof will reject it immediately. This should either parse the displayed form too or align all user-facing guidance/output to the accepted input shape.

Useful? React with 👍 / 👎.

The test stops the alpen sequencer to poke at the sled DB directly,
but was attached to the shared `alpen_ee` env via a string name.
flexitest reuses services across tests in the same env, so leaving
the sequencer stopped broke every later `alpen_ee` test (10 EVM
tests failed with `Service is not running`).

Mirror what `prover_task_admin.py` already does on the OL side and
pass an `AlpenClientEnv` instance so the env is private to this
test.
@@ -38,6 +54,13 @@ fn main() {
});
let db = db.as_ref();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still opens the OL database before command dispatch, so every ee-* invocation attempts to open/lock cli.datadir even though those commands only need --ee-datadir. That breaks the documented EE-only operational path when the strata node is running, or when the default -d data path is missing/inaccessible, because execution exits before reaching with_ee_db.

Permalink:

let db = open_database(&cli.datadir).unwrap_or_else(|e| {
eprintln!("{e}");
exit(1);
});
let db = db.as_ref();

I think the fix is to make the OL DB lazy too: dispatch ee-* arms before calling open_database, or wrap OL-only arms in a helper that opens the OL DB only for commands that actually need it.

let (prev, last) = s.split_once(':').ok_or_else(|| {
DisplayedError::UserError(
"Expected batch id as <prev_block_hex>:<last_block_hex>".to_string(),
Box::new(s.to_string()),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The acct-proof parser only accepts <prev_block_hex>:<last_block_hex>, but the codebase logs/display path for this value is BatchId(prev_block=..., last_block=...). In particular, BatchId::Display emits that wrapper form and the acct proof hook logs it with %batch_id, so operators cannot copy the value they see in logs into ee-get-acct-proof / ee-delete-acct-proof as the docs/comments imply.

Permalinks:

  • Parser:
    /// Parses `"<prev_block_hex>:<last_block_hex>"` into a [`BatchId`].
    ///
    /// Each hex half is exactly 32 bytes (64 hex chars). Both parts are
    /// required — there is no fallback to a single-hash form because the
    /// underlying type literally is a pair of hashes.
    pub(crate) fn parse_batch_id(s: &str) -> Result<BatchId, DisplayedError> {
    let (prev, last) = s.split_once(':').ok_or_else(|| {
    DisplayedError::UserError(
    "Expected batch id as <prev_block_hex>:<last_block_hex>".to_string(),
    Box::new(s.to_string()),
    )
    })?;
    let prev_hash = parse_hash(prev)?;
    let last_hash = parse_hash(last)?;
    Ok(BatchId::from_parts(prev_hash, last_hash))
    }
  • Display format:
    impl fmt::Display for BatchId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
    write!(
    f,
    "BatchId(prev_block={}, last_block={})",
    self.prev_block(),
    self.last_block()
    )
  • Logged with %batch_id:
    let batch_id = task.0;
    let proof_id = EeBatchProofDbManager::proof_id_for(batch_id);
    info!(%batch_id, %proof_id, "persisting batch acct proof");
    self.batch_proofs
    .put_proof(batch_id, receipt.clone())
    .map_err(|e| ProverError::Storage(format!("put_proof({batch_id}): {e}")))?;

Please either parse the displayed BatchId(...) form too, or change the user-facing guidance/output so it consistently surfaces the accepted colon form.

Copy link
Copy Markdown
Contributor

@delbonis delbonis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation-wise this looks good, aside from the other comments.

But I'm wondering if it actually makes sense to put the EE-specific dbtool subcommands in a separate binary so that we aren't merging unrelated areas of concern. Like doesn't this now depend on both Strata and Alpen database concepts now?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments before attributes!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too.

@bewakes
Copy link
Copy Markdown
Contributor

bewakes commented May 22, 2026

Functionally this looks good enough to me - I looked into the subcommands and they seem to support the operations that are needed.

I agree with comments on structuring the interface and making it less boilerplated.

Thinking out loud, not a concrete direction: one thing worth exploring as a separate task is a database <-> commandline abstraction layer which would primarily support the basic CRUD operations which account for majority of the dbtool access right now - most of the subcommands are plain read/writes of db tables. The open question is whether our database trees are uniform enough to support this. Supporting custom operations besides basic CRUD would be a cherry-on-top.

An example that comes to mind is how Django the web framework makes it reallyy easy to define CRUD interface for admin operations(declarative model registration -> free CRUD UI), at the same time being extremely customizable and flexible.

@evgenyzdanovich
Copy link
Copy Markdown
Contributor

agree with @bewakes on the idea!

perhaps we can have a thin decl layer (through macro or otherwise) that's based off storage schema that delivers basic read/write logic (crud style).
this should unify quite a lot of boilerplate across both ee&ol and allow for easier support in the future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants