SettleProof Reliability-First Fast Plan (Highest Impact)

Goal

Turn this hackathon demo into a credible reliability-engineering AI+Crypto project in minimal time.

Scope (20% work, 80% impact)

Focus only on failure modes that directly affect payment correctness and uptime.


1) Remove Single Point of Failure in RPC Reads (Day 1)

Why

fetch_on_chain_logs currently uses one hardcoded RPC endpoint. One outage or bad node breaks settlement detection.

Implement

  • Add rpc_endpoints list in config (skill.yaml + env).
  • Query 3 providers in parallel (or sequential fallback if time is tight).
  • Add quorum rule: accept logs only if at least 2 providers agree on (tx_hash, block_number, log_index).
  • Add timeout + retry + circuit breaker per provider.
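A minimal sketch of the quorum rule, assuming each provider is wrapped in a callable that returns (tx_hash, block_number, log_index) tuples. The names collect_results and quorum_logs and the stand-in providers are hypothetical and not part of the existing code; timeouts, retries, and circuit breaking would live inside each provider callable.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Set, Tuple

# Identity of a log under the quorum rule: (tx_hash, block_number, log_index).
LogKey = Tuple[str, int, int]


def collect_results(
    fetchers: Iterable[Callable[[], Iterable[LogKey]]],
) -> List[Optional[List[LogKey]]]:
    """Call each provider in turn (sequential variant); a failure is recorded as None."""
    results: List[Optional[List[LogKey]]] = []
    for fetch in fetchers:
        try:
            results.append(list(fetch()))
        except Exception:
            # Outage, timeout, or malformed response: this provider casts no votes.
            results.append(None)
    return results


def quorum_logs(results: Iterable[Optional[Iterable[LogKey]]], quorum: int = 2) -> Set[LogKey]:
    """Return the log keys reported by at least `quorum` providers."""
    votes = Counter()
    for result in results:
        if result is None:
            continue
        # De-duplicate per provider so one provider cannot vote twice for a log.
        votes.update(set(result))
    return {key for key, count in votes.items() if count >= quorum}


# Usage with three stand-in providers (assumed for illustration), one of which is down.
def provider_a():
    return [("0xabc", 100, 0), ("0xdef", 101, 2)]


def provider_b():
    return [("0xabc", 100, 0)]


def provider_down():
    raise TimeoutError("provider unreachable")


accepted = quorum_logs(collect_results([provider_a, provider_b, provider_down]))
```

Only the log seen by two providers is accepted; the log reported by a single provider is held back until a second provider confirms it.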

Success criteria

  • If one provider is down, settlement still works.
  • Metric/log: rpc_quorum_failure_rate, rpc_provider_latency_ms.

2) Add Idempotency + Checkpointing (Day 1-2)

Why

Current flow can re-process old logs (fromBlock is fixed) and risks duplicate settlement/webhook side effects.

Implement

  • Persist last_processed_block in synchronized/shared state.
  • On each cycle, query from last_processed_block + 1 to latest.
  • Create idempotency key: invoice_uuid + tx_hash.
  • Keep a processed set (or durable store/file/db) and skip duplicates.
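The checkpoint and idempotency steps above could look roughly like this. It is a sketch under assumptions: the SettlementState class, the JSON file layout, and the ":" separator in the key are illustrative, and a local file stands in for whatever synchronized/shared state the agent actually uses.

```python
import json
import os
import tempfile
from pathlib import Path


class SettlementState:
    """File-backed checkpoint plus processed-key set (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        self.last_processed_block = 0
        self.processed = set()
        if self.path.exists():
            data = json.loads(self.path.read_text())
            self.last_processed_block = data["last_processed_block"]
            self.processed = set(data["processed"])

    def next_range(self, latest_block):
        """Return (from_block, to_block), or None when there is nothing new."""
        start = self.last_processed_block + 1
        return (start, latest_block) if start <= latest_block else None

    def mark(self, invoice_uuid, tx_hash):
        """Record an event; return False if it was already processed."""
        key = f"{invoice_uuid}:{tx_hash}"
        if key in self.processed:
            return False
        self.processed.add(key)
        return True

    def commit(self, block):
        """Advance the checkpoint and persist state via an atomic rename."""
        self.last_processed_block = max(self.last_processed_block, block)
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({
            "last_processed_block": self.last_processed_block,
            "processed": sorted(self.processed),
        }))
        os.replace(tmp, self.path)


# Usage: a second SettlementState on the same file simulates an agent restart.
with tempfile.TemporaryDirectory() as tmp_dir:
    state_file = Path(tmp_dir) / "state.json"
    state = SettlementState(state_file)
    first_seen = state.mark("inv-1", "0xabc")
    state.commit(120)

    restarted = SettlementState(state_file)
    replay_accepted = restarted.mark("inv-1", "0xabc")
    resume_range = restarted.next_range(125)
    idle_range = restarted.next_range(120)
```

Committing after the side effects of a cycle have completed keeps the behavior at-least-once: a crash before commit replays the range, and the processed set drops the repeats.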

Success criteria

  • Restarting agent does not duplicate processing.
  • Metric/log: duplicate_events_dropped_total.

3) Real Webhook Delivery Guarantees (Day 2)

Why

The webhook is currently mocked, so the business-critical handoff to the merchant has no real delivery path and no delivery guarantees.

Implement

  • Replace _call_webhook with real HTTP POST.
  • Add retries with exponential backoff + max attempts.
  • Treat webhook as at-least-once delivery; include idempotency header.
  • Add dead-letter persistence for failed payloads after final retry.
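One way to sketch the retry, backoff, and dead-letter steps above, using only the standard library. The function names post_json and deliver, the "Idempotency-Key" header name, and the in-memory dead_letters list are assumptions for illustration; a durable file or table would replace the list, and the real _call_webhook signature may differ.

```python
import json
import time
import urllib.request


def post_json(url, payload, idempotency_key, timeout=5.0):
    """POST the payload as JSON; returns the HTTP status (raises on 4xx/5xx or network error)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name; lets the merchant de-duplicate repeated deliveries.
            "Idempotency-Key": idempotency_key,
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def deliver(payload, idempotency_key, send, dead_letters,
            max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Try `send` up to max_attempts times with exponential backoff.

    Returns True on a 2xx status. After the final failure the payload is
    appended to `dead_letters` and False is returned.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status = send(payload, idempotency_key)
            if 200 <= status < 300:
                return True
        except Exception:
            # Network error or non-2xx raised by the sender: fall through to retry.
            pass
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    dead_letters.append({"idempotency_key": idempotency_key, "payload": payload})
    return False
```

A real sender would be wired in as `send=lambda p, k: post_json(merchant_url, p, k)`, where merchant_url is a placeholder for the configured endpoint. Injecting `send` and `sleep` keeps the retry logic testable without network access or real delays.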

Success criteria

  • Temporary merchant API outage no longer loses events.
  • Metric/log: webhook_delivery_success_rate, webhook_dlq_total.

4) Deterministic Keeper Selection (Day 2)

Why

The keeper is hardcoded, which creates operational fragility and makes the consensus behavior non-credible.

Implement

  • Replace hardcoded keeper with deterministic choice from participants:
    • e.g., keeper = participants[round_number % len(participants)]
  • Add fallback if selected keeper is unhealthy/unresponsive.
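A sketch of the round-robin rule with failover, assuming participants are string identifiers and that every agent evaluates health the same way (otherwise agents could disagree on the keeper). The select_keeper name and the is_healthy callback are hypothetical.

```python
def select_keeper(participants, round_number, is_healthy=lambda participant: True):
    """Pick participants[round_number % n], skipping forward past unhealthy ones."""
    # Sort so every agent derives the same ordering regardless of input order.
    ordered = sorted(participants)
    if not ordered:
        raise ValueError("no participants")
    start = round_number % len(ordered)
    for offset in range(len(ordered)):
        candidate = ordered[(start + offset) % len(ordered)]
        if is_healthy(candidate):
            return candidate
    raise RuntimeError("no healthy keeper available")


agents = ["agent_c", "agent_a", "agent_b"]
primary = select_keeper(agents, round_number=4)
fallback = select_keeper(agents, round_number=4, is_healthy=lambda p: p != "agent_b")
```

With three participants and round 4, index 4 % 3 = 1 selects agent_b from the sorted list; if agent_b is unhealthy, the next participant in order (agent_c) takes over.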

Success criteria

  • No hardcoded operator dependency.
  • Metric/log: keeper_selection_rounds_total, keeper_failover_total.

5) Minimal Reliability Observability (Day 3, half-day)

Why

Without measurable reliability, improvements are hard to prove in interviews.

Implement

  • Add structured logs for each stage with invoice_uuid, tx_hash, block, attempt.
  • Publish/track 4 core indicators:
    • invoice_verification_latency_p95
    • rpc_quorum_failure_rate
    • webhook_delivery_success_rate
    • consensus_round_retries
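A small sketch of the logging and p95 pieces above, assuming JSON lines through the standard logging module and a nearest-rank percentile over in-memory samples. The log_stage and p95 helpers and the sample values are illustrative only.

```python
import json
import logging


def log_stage(logger, stage, **fields):
    """Emit one JSON log line for a pipeline stage and return the line."""
    line = json.dumps({"stage": stage, **fields}, sort_keys=True)
    logger.info(line)
    return line


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


logger = logging.getLogger("settleproof.reliability")
line = log_stage(
    logger, "rpc_fetch",
    invoice_uuid="inv-1", tx_hash="0xabc", block=100, attempt=1,
)
latencies_ms = list(range(1, 101))  # made-up samples: 1 ms .. 100 ms
invoice_verification_latency_p95 = p95(latencies_ms)
```

Carrying the same four fields on every stage makes it possible to trace one invoice across RPC fetch, consensus, and webhook delivery with a simple filter.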

Success criteria

  • You can show before/after reliability posture in a short demo.

Out of Scope (for now)

  • Full production database migration
  • Advanced chaos engineering platform
  • Multi-chain expansion

Keep these for later; they are lower ROI under time pressure.


Fast Execution Order (if very short on time)

  1. RPC quorum + fallback
  2. Idempotency + checkpoint
  3. Webhook retries + DLQ
  4. Deterministic keeper
  5. Core metrics/logs

This order maximizes reliability impact per hour.


Resume Framing (Reliability Niche)

  • “Hardened a decentralized AI-agent payment verifier against RPC outages via multi-provider quorum and circuit breaking.”
  • “Implemented idempotent settlement processing with checkpointing to prevent duplicate financial side effects.”
  • “Designed at-least-once webhook delivery with retry/backoff and dead-letter handling for payment event reliability.”
  • “Added reliability SLO signals (p95 latency, quorum failure rate, delivery success) for operational visibility.”