The workflow engine is Sentinel's differentiator. It maps a structured extraction —
plus its per-field confidence and any guardrail signals — to one of three statuses
(auto_approved, needs_review, rejected) deterministically, persists that
status idempotently, and lets you replay the decision at any time from
stored inputs.
The implementation lives in backend/app/workflow.py.
Routing is deterministic for three reasons, in order of importance:
- Audit defensibility. Given a workflow_items.id, the M7 audit replay test reconstructs current state from audit_events; the routing rules are part of that reconstruction. If the rules were nondeterministic, the audit log would be ambiguous.
- Re-runs are safe. Re-routing the same extraction must never duplicate work or accidentally revisit a rejected record. Determinism + idempotency keys are the mechanism.
- Eval reproducibility. The M9 evaluation harness routes a labelled corpus and compares against ground truth. Deterministic routing means the eval result depends on the data and the rule version, not the wall clock.
The rule layer is a single function route(inputs: RoutingInputs) -> RoutingDecision
with no I/O. Decisions follow a top-down precedence:
| # | Condition | Status | Reason |
|---|---|---|---|
| 1 | Any guardrail flag in {invalid_citation} | rejected | guardrail_rejected |
| 2 | Any field's confidence < confidence_review_threshold | needs_review | low_confidence |
| 3 | Any other (non-rejecting) guardrail flag | needs_review | guardrail_review |
| 4 | Otherwise | auto_approved | ok |
Rejection is terminal at routing time: rule #1 wins over rule #2 even if both fire. This matches the M3 RAG citation-or-refuse posture (a fabricated citation is a hard failure, not a degradation to review).
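
The precedence table can be read as a chain of early returns. The following is a minimal sketch, not the code in backend/app/workflow.py: the function name `decide_status_sketch`, the `REJECTING_FLAGS` constant, and the `pii_detected` flag used in the usage note are hypothetical names chosen for illustration.

```python
from typing import Mapping, Sequence, Tuple

# Flags that force rejection. The table lists only invalid_citation.
REJECTING_FLAGS = frozenset({"invalid_citation"})


def decide_status_sketch(
    field_confidence: Mapping[str, float],
    guardrail_flags: Sequence[str],
    threshold: float = 0.75,
) -> Tuple[str, str]:
    """Return (status, reason); rules are checked top-down and the first match wins."""
    # Rule 1: a rejecting guardrail flag beats every other rule.
    if any(flag in REJECTING_FLAGS for flag in guardrail_flags):
        return "rejected", "guardrail_rejected"
    # Rule 2: any field strictly below the threshold goes to review.
    if any(conf < threshold for conf in field_confidence.values()):
        return "needs_review", "low_confidence"
    # Rule 3: any remaining (non-rejecting) flag also goes to review.
    if guardrail_flags:
        return "needs_review", "guardrail_review"
    # Rule 4: nothing fired.
    return "auto_approved", "ok"
```

With these assumptions, `decide_status_sketch({"total": 0.5}, ["invalid_citation"])` returns `("rejected", "guardrail_rejected")` even though the low-confidence rule would also fire, and a made-up non-rejecting flag such as `pii_detected` on a high-confidence extraction yields `("needs_review", "guardrail_review")`.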
RoutingInputs carries everything the rule layer needs:
- extraction_id: int
- schema_name: str
- field_confidence: Mapping[str, float]
- guardrail_flags: Sequence[str]
- confidence_review_threshold: float (defaults to Settings.confidence_review_threshold, i.e. 0.75)
RoutingDecision is a frozen dataclass with status, reason, idempotency_key,
and routing_version.
The idempotency key is the SHA-256 hex digest of a canonical join of three values:
sha256(f"{extraction_id}|{schema_name}|{routing_version}")
What is deliberately not in the key:
- Field confidence values. They legitimately change when a re-extraction runs; including them would split routing into a new row every time, defeating idempotency.
- Guardrail flags. Same reason.
What is in the key:
- extraction_id and schema_name: stable identity of the work item.
- ROUTING_VERSION: a constant in backend/app/workflow.py. Bumping it changes every key, which is what you want when the rule set itself changes (the M6 PR's decision log records the current value as v1). Bumping is auditable and triggers a fresh routing pass.
A test in test_workflow.py pins this contract: changing
extraction_id, schema_name, or routing_version changes the key; changing
confidence or flags does not.
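
The key recipe is small enough to sketch directly with the standard library. This is an illustration under assumptions: the helper name `idempotency_key_sketch` is hypothetical, and UTF-8 encoding of the joined string is assumed because the text does not state the encoding.

```python
import hashlib

# Value recorded in the text; the real constant lives in backend/app/workflow.py.
ROUTING_VERSION = "v1"


def idempotency_key_sketch(
    extraction_id: int,
    schema_name: str,
    routing_version: str = ROUTING_VERSION,
) -> str:
    """SHA-256 hex digest of the pipe-joined identity values."""
    canonical = f"{extraction_id}|{schema_name}|{routing_version}"
    # Assumption: the canonical string is encoded as UTF-8 before hashing.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because confidence values and guardrail flags are not arguments, two routing passes over the same extraction and schema under the same rule version necessarily produce the same key.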
apply_routing(session, *, extraction_id, decision) -> WorkflowItem looks up the existing workflow_items row by decision.idempotency_key and:
- Not found: inserts a new row.
- Found, same status: no-op; returns the existing row.
- Found, different status: updates the row in place (one row per key, ever), except for the gated transition below.
- Found, status is REJECTED and new status is AUTO_APPROVED: refuses with IllegalTransition. Promotion off rejection requires a human event, which is M7's audit-driven POST /review/{id}/approve.
Caller owns the transaction. The engine flushes but never commits.
REJECTED → NEEDS_REVIEW is allowed (e.g., a re-run that finds an extraction
was misjudged as a rejection). Only the auto_approved promotion is gated, because
that is the one transition with material impact (the row would silently flow to
production-as-truth).
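
The transition policy can be illustrated without a database. In the sketch below a plain dict stands in for the workflow_items table, and `WorkflowItemSketch` and `apply_routing_sketch` are hypothetical names. The real function takes a session and a decision object, flushes, and leaves the commit to the caller; none of that is modelled here.

```python
from dataclasses import dataclass
from typing import Dict


class IllegalTransition(Exception):
    """Raised when routing tries to promote a rejected item automatically."""


@dataclass
class WorkflowItemSketch:
    idempotency_key: str
    status: str


def apply_routing_sketch(
    store: Dict[str, WorkflowItemSketch], key: str, new_status: str
) -> WorkflowItemSketch:
    """Upsert by idempotency key; the dict stands in for the workflow_items table."""
    existing = store.get(key)
    if existing is None:
        # Not found: insert a new row.
        item = WorkflowItemSketch(idempotency_key=key, status=new_status)
        store[key] = item
        return item
    if existing.status == new_status:
        # Found, same status: no-op.
        return existing
    if existing.status == "rejected" and new_status == "auto_approved":
        # Gated transition: only a human event may promote a rejected item.
        raise IllegalTransition("rejected -> auto_approved requires human review")
    # Found, different status: update in place, still one row per key.
    existing.status = new_status
    return existing
```

Calling the sketch twice with the same key and status leaves a single entry in the store, and moving a rejected item to needs_review succeeds while moving it to auto_approved raises.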
replay(session, *, extraction_id, settings=None) -> RoutingDecision reads the persisted extractions row (payload, per-field confidence, schema name)
and runs route() against it with guardrail_flags=(). Returns the recomputed
decision and writes nothing.
Two uses:
- Audit assertion (M7). Given a workflow_items row, replay should produce the same status, modulo the audit trail of human decisions, which the M7 replay test layers on top.
- Rule-version migration. Bumping ROUTING_VERSION triggers a re-routing pass: call replay for every existing extraction; pass the new decision to apply_routing; observe per-row updates.
Replay does not carry guardrail flags, because they are a per-call signal, not
stored state. If you need flags during replay (e.g., to reconstruct an
invalid_citation rejection), record them on the audit event — that is M7.
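
The consequence of replaying without flags is easiest to see in a small example. This is a sketch under assumptions: a dict of stored per-field confidences stands in for the extractions table, `route_status_sketch` and `replay_sketch` are hypothetical names, and `LookupError` is a placeholder because the text does not say which exception the real replay raises for unknown ids.

```python
from typing import Dict, Mapping, Sequence

REJECTING_FLAGS = frozenset({"invalid_citation"})


def route_status_sketch(
    field_confidence: Mapping[str, float],
    guardrail_flags: Sequence[str],
    threshold: float = 0.75,
) -> str:
    """Condensed version of the precedence rules; returns only the status."""
    if any(flag in REJECTING_FLAGS for flag in guardrail_flags):
        return "rejected"
    if any(conf < threshold for conf in field_confidence.values()):
        return "needs_review"
    if guardrail_flags:
        return "needs_review"
    return "auto_approved"


def replay_sketch(
    stored_confidence: Dict[int, Mapping[str, float]],
    extraction_id: int,
    threshold: float = 0.75,
) -> str:
    """Recompute the status from stored state only; nothing is written."""
    if extraction_id not in stored_confidence:
        # Placeholder exception type; the real one is not named in the text.
        raise LookupError(f"unknown extraction_id: {extraction_id}")
    # Flags are a per-call signal and are not stored, so replay passes none.
    return route_status_sketch(stored_confidence[extraction_id], (), threshold)
```

An extraction with high confidence that was originally rejected for `invalid_citation` replays as auto_approved in this sketch, which is why the flags need to be recorded on the audit event if the original rejection must be reconstructed.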
route_extraction(session, *, extraction_id, guardrail_flags=()) is a small
convenience that ties replay-style routing and idempotent persistence together for
callers (e.g., a worker that consumes new extractions, or the M9 eval harness).
The engine refuses to ship a decision that violates either of two invariants
(_check_invariants, called inside route()):
- AUTO_APPROVED requires every field at or above the threshold. Returning auto_approved while at least one field is below the threshold is a rule bug and raises AssertionError.
- A rejecting guardrail flag MUST yield REJECTED. Returning anything else while invalid_citation is present is a rule bug and raises AssertionError.
These run at decision time so a regression in _decide_status is caught
immediately rather than silently shipping a bad routing.
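
A post-decision check of this kind might look like the sketch below. The name `check_invariants_sketch` and its signature are assumptions for illustration; the real _check_invariants is called inside route() and may take different arguments.

```python
from typing import Mapping, Sequence

REJECTING_FLAGS = frozenset({"invalid_citation"})


def check_invariants_sketch(
    status: str,
    field_confidence: Mapping[str, float],
    guardrail_flags: Sequence[str],
    threshold: float,
) -> None:
    """Raise AssertionError if the decided status contradicts its inputs."""
    # Invariant 1: auto_approved requires every field at or above the threshold.
    if status == "auto_approved" and any(
        conf < threshold for conf in field_confidence.values()
    ):
        raise AssertionError("auto_approved with a field below the threshold")
    # Invariant 2: a rejecting guardrail flag must yield rejected.
    if status != "rejected" and any(
        flag in REJECTING_FLAGS for flag in guardrail_flags
    ):
        raise AssertionError("rejecting guardrail flag did not yield rejected")
```

The sketch raises explicitly rather than using the `assert` statement, so the checks would still run if Python were started with `-O`; whether the real code makes the same choice is not stated in the text.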
| Env var | Default | Description |
|---|---|---|
| CONFIDENCE_REVIEW_THRESHOLD | 0.75 | Per-field cutoff used by the low-confidence rule. |
ROUTING_VERSION is a code-level constant (currently "v1"); bumping it is an
intentional, reviewable change, not a runtime knob.
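
For readers who want to see the runtime knob in isolation, the sketch below reads the threshold from the environment with the documented default. It is not how Settings is implemented: the helper name `load_confidence_review_threshold` is hypothetical, and the range check to [0, 1] is an added assumption.

```python
import os
from typing import Mapping, Optional


def load_confidence_review_threshold(
    environ: Optional[Mapping[str, str]] = None,
) -> float:
    """Read CONFIDENCE_REVIEW_THRESHOLD, falling back to the documented default."""
    env = os.environ if environ is None else environ
    value = float(env.get("CONFIDENCE_REVIEW_THRESHOLD", "0.75"))
    # Assumption: confidences are probabilities, so the cutoff must lie in [0, 1].
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold out of range: {value}")
    return value
```

Passing a mapping explicitly, for example `load_confidence_review_threshold({"CONFIDENCE_REVIEW_THRESHOLD": "0.9"})`, keeps the helper testable without mutating the process environment.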
backend/tests/test_workflow.py covers all four M6 DoD categories explicitly:
- Determinism: route() is a pure function of inputs; key recipe pinned.
- Idempotency: re-running apply_routing produces one DB row, verified at the SQL level.
- Replay: replay() reproduces decisions from storage, including low-confidence paths, and raises on unknown ids.
- Invariants: _check_invariants raises on the two illegal pairings; route() never returns auto_approved when any field is low.
Plus rule-precedence coverage (5 tests), transition-policy coverage (2 tests), and
route_extraction integration (2 tests).