Turn this hackathon demo into a credible reliability-engineering AI+Crypto project in minimal time.
Focus only on failure modes that directly affect payment correctness and uptime.
fetch_on_chain_logs currently uses one hardcoded RPC endpoint. One outage or bad node breaks settlement detection.
- Add `rpc_endpoints` list in config (`skill.yaml` + env).
- Query 3 providers in parallel (or sequential fallback if time is tight).
- Add quorum rule: accept logs only if at least 2 providers agree on `(tx_hash, block_number, log_index)`.
- Add timeout + retry + circuit breaker per provider.
- If one provider is down, settlement still works.
- Metric/log: `rpc_quorum_failure_rate`, `rpc_provider_latency_ms`.
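
The quorum rule could be sketched roughly as follows. This is a minimal illustration using the sequential fallback path, not the project's actual code: `quorum_logs`, the `Provider` callable type, and the `min_agree` parameter are hypothetical names, and per-provider timeouts, retries, and circuit breaking are left out for brevity.

```python
from collections import Counter
from typing import Callable, Iterable

# A log is identified by (tx_hash, block_number, log_index), as in the quorum rule.
LogKey = tuple[str, int, int]
# Hypothetical provider interface: a callable returning the log keys it sees.
Provider = Callable[[], Iterable[LogKey]]


def quorum_logs(providers: dict[str, Provider], min_agree: int = 2) -> set[LogKey]:
    """Return the log keys reported by at least `min_agree` providers.

    Providers that raise are skipped, so one outage does not block settlement.
    Raises RuntimeError if fewer than `min_agree` providers respond at all.
    """
    votes: Counter = Counter()
    responded = 0
    for name, fetch in providers.items():
        try:
            keys = set(fetch())  # de-duplicate within one provider's response
        except Exception:
            continue  # a fuller version would log `name` and count the failure
        responded += 1
        votes.update(keys)
    if responded < min_agree:
        # This is the event that rpc_quorum_failure_rate would count.
        raise RuntimeError("rpc quorum failure: too few providers responded")
    return {key for key, count in votes.items() if count >= min_agree}
```

A log reported by only one provider is dropped rather than settled, which trades a little detection latency for protection against a single bad node.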
Current flow can re-process old logs (fromBlock is fixed) and risks duplicate settlement/webhook side effects.
- Persist `last_processed_block` in synchronized/shared state.
- On each cycle, query from `last_processed_block + 1` to `latest`.
- Create idempotency key: `invoice_uuid + tx_hash`.
- Keep a processed set (or durable store/file/db) and skip duplicates.
- Restarting agent does not duplicate processing.
- Metric/log: `duplicate_events_dropped_total`.
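
One way to combine the checkpoint and the idempotency key is a small file-backed state object. The sketch below assumes a single writer and a local JSON file; the `SettlementState` class, its method names, and the `:` separator in the key are illustrative choices, not part of the existing code.

```python
import json
from pathlib import Path


class SettlementState:
    """File-backed checkpoint plus processed-key set (hypothetical helper)."""

    def __init__(self, path: Path):
        self.path = path
        self.last_processed_block = 0
        self.processed: set[str] = set()
        self.duplicate_events_dropped_total = 0  # in-memory metric counter
        if path.exists():
            data = json.loads(path.read_text())
            self.last_processed_block = data["last_processed_block"]
            self.processed = set(data["processed"])

    def next_range(self, latest: int) -> tuple[int, int]:
        """Block range to query this cycle: last_processed_block + 1 to latest."""
        return self.last_processed_block + 1, latest

    def is_duplicate(self, invoice_uuid: str, tx_hash: str) -> bool:
        if f"{invoice_uuid}:{tx_hash}" in self.processed:
            self.duplicate_events_dropped_total += 1
            return True
        return False

    def mark_processed(self, invoice_uuid: str, tx_hash: str, block: int) -> None:
        """Record the event and advance the checkpoint, then persist both."""
        self.processed.add(f"{invoice_uuid}:{tx_hash}")
        self.last_processed_block = max(self.last_processed_block, block)
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({
            "last_processed_block": self.last_processed_block,
            "processed": sorted(self.processed),
        }))
        tmp.replace(self.path)  # rename over the old file to avoid partial writes
```

Calling `mark_processed` only after the settlement side effect succeeds means a crash in between leads to a retry, so the downstream consumer should also tolerate the same idempotency key arriving twice.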
Webhook is currently mocked, so business-critical handoff is unreliable.
- Replace `_call_webhook` with real HTTP POST.
- Add retries with exponential backoff + max attempts.
- Treat webhook as at-least-once delivery; include idempotency header.
- Add dead-letter persistence for failed payloads after final retry.
- Temporary merchant API outage no longer loses events.
- Metric/log: `webhook_delivery_success_rate`, `webhook_dlq_total`.
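
A possible shape for the delivery path, using only the standard library, is sketched below. The function names (`post_json`, `deliver`), the `Idempotency-Key` header name, the default attempt count and delays, and the JSON-lines dead-letter file are all assumptions made for illustration; the merchant API may expect something different.

```python
import json
import time
import urllib.request
from pathlib import Path
from typing import Callable


def post_json(url: str, payload: dict, idempotency_key: str, timeout: float = 5.0) -> None:
    """POST the payload as JSON; raises on network errors and HTTP error statuses."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        # Header name is an assumption; use whatever the receiver deduplicates on.
        headers={"Content-Type": "application/json", "Idempotency-Key": idempotency_key},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout):
        pass  # urlopen raises HTTPError for 4xx/5xx, so reaching here means success


def deliver(
    payload: dict,
    idempotency_key: str,
    send: Callable[[dict, str], None],
    dlq_path: Path,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try `send` up to `max_attempts` times with exponential backoff.

    Returns True on success. After the final failure the payload is appended
    to the dead-letter file and False is returned.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(payload, idempotency_key)
            return True
        except Exception:
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    with dlq_path.open("a") as f:
        f.write(json.dumps({"idempotency_key": idempotency_key, "payload": payload}) + "\n")
    return False
```

In use, `send` would wrap the real endpoint, for example `lambda p, k: post_json(webhook_url, p, k)` with `webhook_url` taken from config. Injecting `send` and `sleep` keeps the retry logic testable without a network.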
Keeper is hardcoded, creating operational fragility and non-credible consensus behavior.
- Replace hardcoded keeper with deterministic choice from participants:
  - e.g., `keeper = participants[round_number % len(participants)]`
- Add fallback if selected keeper is unhealthy/unresponsive.
- No hardcoded operator dependency.
- Metric/log: `keeper_selection_rounds_total`, `keeper_failover_total`.
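
The round-robin rule with a health fallback might look like the sketch below. `select_keeper` and the `is_healthy` callback are hypothetical names, and the sketch assumes every agent holds `participants` in the same order (for example, sorted by address), since otherwise the choice is not deterministic across agents.

```python
from typing import Callable, Sequence


def select_keeper(
    participants: Sequence[str],
    round_number: int,
    is_healthy: Callable[[str], bool],
) -> tuple[str, int]:
    """Return (keeper, failovers) for this round.

    Starts at participants[round_number % len(participants)] and walks forward
    through the list until a healthy participant is found. `failovers` is the
    number of unhealthy candidates skipped, suitable for keeper_failover_total.
    """
    if not participants:
        raise ValueError("no participants")
    n = len(participants)
    for offset in range(n):
        candidate = participants[(round_number + offset) % n]
        if is_healthy(candidate):
            return candidate, offset
    raise RuntimeError("no healthy keeper available")
```

Walking forward from the nominal keeper keeps the fallback itself deterministic, provided all agents observe the same health status for the skipped candidates.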
Without measurable reliability, improvements are hard to prove in interviews.
- Add structured logs for each stage with `invoice_uuid`, `tx_hash`, `block`, `attempt`.
- Publish/track 4 core indicators:
  - `invoice_verification_latency_p95`
  - `rpc_quorum_failure_rate`
  - `webhook_delivery_success_rate`
  - `consensus_round_retries`
- You can show before/after reliability posture in a short demo.
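
For a demo, structured logs and the p95 indicator do not need a metrics stack. The sketch below emits one JSON line per stage and computes a nearest-rank 95th percentile over collected latencies; the `log_stage` and `p95` helpers and the `settlement` logger name are illustrative assumptions.

```python
import json
import logging

logger = logging.getLogger("settlement")  # logger name is an assumption


def log_stage(stage: str, invoice_uuid: str, tx_hash: str, block: int, attempt: int, **extra) -> str:
    """Emit one JSON log line carrying the four correlation fields; return it."""
    line = json.dumps(
        {"stage": stage, "invoice_uuid": invoice_uuid, "tx_hash": tx_hash,
         "block": block, "attempt": attempt, **extra},
        sort_keys=True,
    )
    logger.info(line)
    return line


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile, e.g. for invoice_verification_latency_p95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]
```

Recording the same indicators before and after each hardening step gives the before/after comparison directly from the logs.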
- Full production database migration
- Advanced chaos engineering platform
- Multi-chain expansion
Keep these for later; they are lower ROI under time pressure.
- RPC quorum + fallback
- Idempotency + checkpoint
- Webhook retries + DLQ
- Deterministic keeper
- Core metrics/logs
This order maximizes reliability impact per hour.
- “Hardened a decentralized AI-agent payment verifier against RPC outages via multi-provider quorum and circuit breaking.”
- “Implemented idempotent settlement processing with checkpointing to prevent duplicate financial side effects.”
- “Designed at-least-once webhook delivery with retry/backoff and dead-letter handling for payment event reliability.”
- “Added reliability SLO signals (p95 latency, quorum failure rate, delivery success) for operational visibility.”