112 changes: 106 additions & 6 deletions RUNBOOK.md
@@ -48,7 +48,107 @@ Confirm `paused: true` and `pause_reason` matches the reason supplied.

---

## 2. Unpause After Incident Resolution
## 2. Circuit Breaker: Multi-Admin Vote-to-Unpause

The circuit breaker is a quorum-gated safety mechanism that prevents a single compromised admin from unilaterally resuming contract operations after an emergency pause. Complete Phases 1 and 2 in order. Phase 3 is a last-resort path, used only when quorum cannot be reached.

### Phase 1 — Trigger the emergency pause

Any single admin can pause immediately (see **Section 1** for the full procedure):

```bash
soroban contract invoke \
--id $CONTRACT_ID \
--source $ADMIN_IDENTITY \
--network $NETWORK \
-- \
emergency_pause \
--caller $ADMIN_ADDRESS \
--reason SecurityIncident
```

Notify all other admins in `#incidents` immediately so they can participate in the quorum vote.

### Phase 2 — Coordinate the vote-to-unpause quorum

**Check current quorum state** (run this before and after each vote):

```bash
soroban contract invoke \
--id $CONTRACT_ID \
--source $ADMIN_IDENTITY \
--network $NETWORK \
-- \
health
```

Inspect the response for:
- `paused: true` — confirms the pause is active
- `pause_votes` — number of admins who have already voted to unpause
- `required_votes` — quorum threshold that must be reached
- `timelock_remaining` — seconds remaining before the timelock expires (must reach 0 before unpause is accepted)
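
The checks above can be sketched as a small helper. This is a minimal sketch, not part of the contract or the CLI: the `HealthSnapshot` shape and the `unpauseStatus` function are hypothetical names, and it assumes the `health` output has already been parsed into an object carrying the four fields listed above.

```typescript
// Hypothetical shape: assumes the `health` response has been parsed into
// an object with the four fields described in the runbook.
interface HealthSnapshot {
  paused: boolean;
  pause_votes: number;
  required_votes: number;
  timelock_remaining: number; // seconds
}

type UnpauseStatus =
  | { state: 'running' }
  | { state: 'waiting'; votesNeeded: number; secondsLeft: number }
  | { state: 'ready' };

// Summarize what is still blocking an unpause.
function unpauseStatus(h: HealthSnapshot): UnpauseStatus {
  if (!h.paused) return { state: 'running' };
  const votesNeeded = Math.max(0, h.required_votes - h.pause_votes);
  const secondsLeft = Math.max(0, h.timelock_remaining);
  // Quorum met and timelock elapsed: the contract is expected to unpause.
  if (votesNeeded === 0 && secondsLeft === 0) return { state: 'ready' };
  return { state: 'waiting', votesNeeded, secondsLeft };
}
```

A `waiting` result tells the incident commander how many more votes to collect and how long the timelock still has to run.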

**Each admin must cast a vote independently:**

```bash
soroban contract invoke \
--id $CONTRACT_ID \
--source $ADMIN_IDENTITY \
--network $NETWORK \
-- \
vote_unpause \
--caller $ADMIN_ADDRESS
```

Each admin runs this command with their own `$ADMIN_IDENTITY` and `$ADMIN_ADDRESS` until `pause_votes >= required_votes`. Once quorum is reached **and** the timelock has elapsed, the contract unpauses automatically.

**Tracking votes during a live incident:**

1. Designate one admin as incident commander to collect confirmation messages in `#incidents`.
2. Each admin posts their `ADMIN_ADDRESS` and transaction hash after voting.
3. The incident commander re-runs the `health` query after each vote to confirm `pause_votes` increments.
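4. Once `health` shows `pause_votes >= required_votes` and `timelock_remaining` is 0, the contract should have unpaused automatically; skip Phase 3 and go to the verification step at the end of this section. Consider Phase 3 only if quorum cannot be reached.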
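
The bookkeeping in the steps above could be sketched as follows. This is an illustrative helper only: `VoteReport` and `reconcileVotes` are hypothetical names, and the sketch assumes the incident commander copies each posted address and transaction hash into a list by hand.

```typescript
// Hypothetical record of what each admin posts in the incident channel.
interface VoteReport {
  adminAddress: string;
  txHash: string;
}

// Count distinct admins who reported a vote and compare the result with
// the on-chain `pause_votes` value read from `health`.
function reconcileVotes(reports: VoteReport[], onChainVotes: number) {
  const admins = new Set(reports.map((r) => r.adminAddress));
  return {
    reportedVotes: admins.size,
    duplicateReports: reports.length - admins.size,
    matchesChain: admins.size === onChainVotes,
  };
}
```

A mismatch between `reportedVotes` and the on-chain count is a cue to re-check the posted transaction hashes before continuing.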

### Phase 3 — Emergency unpause (last resort only)

Use `emergency_unpause` **only** when:
- Quorum cannot be reached (e.g., admins are unreachable), **and**
- The situation requires immediate contract resumption to prevent greater harm, **and**
- A post-incident review will be conducted to address the quorum failure.

```bash
soroban contract invoke \
--id $CONTRACT_ID \
--source $ADMIN_IDENTITY \
--network $NETWORK \
-- \
emergency_unpause \
--caller $ADMIN_ADDRESS
```

> **Warning:** `emergency_unpause` bypasses quorum. It should be treated as a break-glass action. Document the justification in the incident GitHub issue before executing.

Verify the contract is running normally after either path:

```bash
soroban contract invoke \
--id $CONTRACT_ID \
--source $ADMIN_IDENTITY \
--network $NETWORK \
-- \
health
```

Confirm `paused: false` and `pause_votes: 0` (votes are cleared on unpause).

**After completing the circuit breaker procedure:**
- Close the incident GitHub issue with a summary of which path was taken (quorum or last-resort).
- Post a resolution notice in `#incidents` including the ledger sequence of the unpause.
- If `emergency_unpause` was used, open a follow-up issue tagged `security-review` to evaluate whether the admin quorum configuration needs adjustment.

---

## 3. Unpause After Incident Resolution

Unpausing requires a quorum of admin votes (the default threshold is 1). If a timelock is configured, the elapsed time since the pause must exceed `timelock_seconds` before the unpause is accepted.
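
The timelock arithmetic can be illustrated with a short sketch. The function and parameter names are assumptions made for illustration, timestamps are taken to be Unix seconds, and the contract's exact boundary behavior (strictly greater than versus greater than or equal to `timelock_seconds`) is not confirmed here.

```typescript
// Seconds still to wait before an unpause would be accepted.
// `pausedAt` and `now` are Unix timestamps in seconds (assumed);
// a `timelockSeconds` of 0 models "no timelock configured".
function timelockRemaining(pausedAt: number, timelockSeconds: number, now: number): number {
  const elapsed = now - pausedAt;
  return Math.max(0, timelockSeconds - elapsed);
}
```

For example, with a one-hour timelock and a pause that began 600 seconds ago, 3000 seconds would remain.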

@@ -97,7 +197,7 @@ Confirm `paused: false`.

---

## 3. Rotate Admin Keys via Governance Proposal
## 4. Rotate Admin Keys via Governance Proposal

Admin key rotation uses the on-chain governance module. The process is: propose → vote → execute (after timelock).

@@ -169,7 +269,7 @@ soroban contract invoke \

---

## 4. Handle a Stuck Migration
## 5. Handle a Stuck Migration

A migration can become stuck if a batch import fails mid-flight or the contract is paused during migration.

@@ -229,7 +329,7 @@ soroban contract invoke \

---

## 5. Replay Failed Webhook Deliveries
## 6. Replay Failed Webhook Deliveries

The webhook dispatcher persists delivery attempts in the `webhook_deliveries` table. Failed deliveries can be replayed via the backend admin API.

@@ -283,7 +383,7 @@ psql $DATABASE_URL -c "

---

## 6. Extend Contract Storage TTL
## 7. Extend Contract Storage TTL

Soroban persistent storage entries expire after a set number of ledgers. Extend TTL before entries expire to avoid data loss.
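
To judge how urgent an extension is, the remaining ledgers can be converted to an approximate wall-clock duration. The sketch below is a rough estimate only: the function and parameter names are hypothetical, and the 5-second ledger close time is an assumed average that should be replaced with the value observed on the target network.

```typescript
// Assumed average ledger close time in seconds; actual close times vary.
const ASSUMED_SECONDS_PER_LEDGER = 5;

// Estimate how long an entry has left before it expires.
// `liveUntilLedger` is the last ledger at which the entry is still live.
function ttlHeadroom(
  currentLedger: number,
  liveUntilLedger: number,
  secondsPerLedger: number = ASSUMED_SECONDS_PER_LEDGER
) {
  const ledgersLeft = Math.max(0, liveUntilLedger - currentLedger);
  const approxHoursLeft = (ledgersLeft * secondsPerLedger) / 3600;
  return { ledgersLeft, approxHoursLeft };
}
```

An entry with 720 ledgers left would, under the 5-second assumption, have roughly one hour of headroom.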

@@ -326,7 +426,7 @@ Recommended: run a scheduled job (weekly) to bump TTL on all active remittances

---

## 7. Escalation Contacts and SLA Targets
## 8. Escalation Contacts and SLA Targets

| Severity | Definition | Response SLA | Resolution SLA | Escalation Path |
|----------|-----------|-------------|----------------|-----------------|
142 changes: 140 additions & 2 deletions backend/src/__tests__/sep24-service.test.ts
@@ -56,13 +56,33 @@ class MockSep24AnchorServer {
  private server: http.Server | null = null;
  private port: number = 0;
  private transactions: Map<string, { status: string; amount_in?: string; amount_out?: string }> = new Map();
  private timeoutEnabled: boolean = false;
  private timeoutPaths: Set<string> = new Set();

  enableTimeoutSimulation(paths: string[] = ['/sep24/deposit', '/sep24/withdraw', '/sep24/transaction']): void {
    this.timeoutEnabled = true;
    paths.forEach((p) => this.timeoutPaths.add(p));
  }

  disableTimeoutSimulation(): void {
    this.timeoutEnabled = false;
    this.timeoutPaths.clear();
  }

  private shouldTimeout(path: string): boolean {
    return this.timeoutEnabled && this.timeoutPaths.has(path);
  }

  async start(): Promise<string> {
    this.app = express();
    this.app.use(express.json());

    // Mock /deposit endpoint (SEP-24)
    this.app.post('/sep24/deposit', (req: Request, res: Response) => {
      if (this.shouldTimeout('/sep24/deposit')) {
        // Never respond; this simulates a network timeout
        return;
      }
      const { transaction_id, asset_code, amount } = req.body;

      if (!transaction_id || !asset_code || !amount) {
@@ -86,6 +106,9 @@ class MockSep24AnchorServer {

    // Mock /withdraw endpoint (SEP-24)
    this.app.post('/sep24/withdraw', (req: Request, res: Response) => {
      if (this.shouldTimeout('/sep24/withdraw')) {
        return;
      }
      const { transaction_id, asset_code, amount } = req.body;

      if (!transaction_id || !asset_code || !amount) {
@@ -106,6 +129,9 @@ class MockSep24AnchorServer {

    // Mock /transaction endpoint (SEP-24 status query)
    this.app.get('/sep24/transaction', (req: Request, res: Response) => {
      if (this.shouldTimeout('/sep24/transaction')) {
        return;
      }
      const { id } = req.query;

      if (!id) {
@@ -395,10 +421,10 @@ describe('Error Handling', () => {

  it('should handle anchor connection error', async () => {
    process.env.SEP24_SERVER_ANCHOR_TEST = 'http://localhost:9999/nonexistent';

    const service = new Sep24Service(pool);
    await service.initialize();

    const request: Sep24InitiateRequest = {
      user_id: 'test-user-123',
      anchor_id: 'anchor_test',
@@ -409,4 +435,116 @@ describe('Error Handling', () => {

    await expect(service.initiateFlow(request)).rejects.toThrow();
  });
});

describe('Timeout Handling', () => {
  let mockServer: MockSep24AnchorServer;
  let serverUrl: string;
  let pool: Pool;

  beforeEach(async () => {
    mockServer = new MockSep24AnchorServer();
    serverUrl = await mockServer.start();
    pool = createMockPool();

    process.env.SEP24_ENABLED_ANCHOR_TEST = 'true';
    process.env.SEP24_SERVER_ANCHOR_TEST = serverUrl + '/sep24';
    process.env.SEP24_POLL_INTERVAL_ANCHOR_TEST = '1';
    // Very short timeout (1 ms) so the hanging server triggers a timeout error
    process.env.SEP24_TIMEOUT_ANCHOR_TEST = '1';
    resetSep24Rows();
  });

  afterEach(async () => {
    mockServer.disableTimeoutSimulation();
    await mockServer.stop();
    resetSep24Rows();
    vi.clearAllMocks();
  });

  it('should reject deposit initiation with a timeout error when the anchor does not respond', async () => {
    mockServer.enableTimeoutSimulation(['/sep24/deposit']);

    const service = new Sep24Service(pool);
    await service.initialize();

    const request: Sep24InitiateRequest = {
      user_id: 'timeout-user',
      anchor_id: 'anchor_test',
      direction: 'deposit',
      asset_code: 'USDC',
      amount: '100.00',
    };

    await expect(service.initiateFlow(request)).rejects.toThrow();
  });

  it('should reject withdrawal initiation with a timeout error when the anchor does not respond', async () => {
    mockServer.enableTimeoutSimulation(['/sep24/withdraw']);

    const service = new Sep24Service(pool);
    await service.initialize();

    const request: Sep24InitiateRequest = {
      user_id: 'timeout-user',
      anchor_id: 'anchor_test',
      direction: 'withdrawal',
      asset_code: 'USDC',
      amount: '50.00',
      user_address: 'GAXXX',
    };

    await expect(service.initiateFlow(request)).rejects.toThrow();
  });

  it('should handle timeout during transaction status polling gracefully', async () => {
    const service = new Sep24Service(pool);
    await service.initialize();

    // Initiate successfully before enabling timeout
    const request: Sep24InitiateRequest = {
      user_id: 'poll-timeout-user',
      anchor_id: 'anchor_test',
      direction: 'deposit',
      asset_code: 'USDC',
      amount: '75.00',
    };
    await service.initiateFlow(request);

    // Now make the transaction status endpoint hang
    mockServer.enableTimeoutSimulation(['/sep24/transaction']);

    // pollAllTransactions should not throw; it must handle the timeout internally
    await expect(service.pollAllTransactions()).resolves.not.toThrow();
  });

  it('should report an appropriate error message on timeout, not a generic network error', async () => {
    mockServer.enableTimeoutSimulation(['/sep24/deposit']);

    const service = new Sep24Service(pool);
    await service.initialize();

    const request: Sep24InitiateRequest = {
      user_id: 'error-msg-user',
      anchor_id: 'anchor_test',
      direction: 'deposit',
      asset_code: 'USDC',
      amount: '200.00',
    };

    let caughtError: unknown;
    try {
      await service.initiateFlow(request);
    } catch (err) {
      caughtError = err;
    }

    expect(caughtError).toBeDefined();
    // The error should be an Error instance with a meaningful message
    expect(caughtError).toBeInstanceOf(Error);
    const message = (caughtError as Error).message.toLowerCase();
    // Accept only timeout-style wording: a generic "network" message should fail,
    // and a bare "time" substring check would make the other checks redundant.
    expect(
      message.includes('timeout') || message.includes('timed out') || message.includes('abort')
    ).toBe(true);
  });
});
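
As an alternative to matching substrings of the message, a timeout can be told apart from a generic network failure by inspecting the error's `name` and `code`. The sketch below is a heuristic under stated assumptions: `isTimeoutError` is a hypothetical helper, the names and codes it checks are those commonly produced by `AbortController`, `AbortSignal.timeout`, Node sockets, and some HTTP clients, and whether `Sep24Service` surfaces any of them is not confirmed by this diff.

```typescript
// Heuristic check for timeout-style errors (hypothetical helper).
function isTimeoutError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  // Abort-based timeouts usually carry one of these names.
  if (err.name === 'TimeoutError' || err.name === 'AbortError') return true;
  // Socket-level and client-level timeouts usually carry one of these codes.
  const code = (err as Error & { code?: unknown }).code;
  if (code === 'ETIMEDOUT' || code === 'ECONNABORTED') return true;
  // Fall back to the message: "timeout", "timed out", "time-out".
  return /\btime(d)?[ -]?out\b/i.test(err.message);
}
```

With a helper like this, the last test could assert `isTimeoutError(caughtError)` directly, so that a plain connection failure no longer satisfies it.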
2 changes: 2 additions & 0 deletions frontend/package.json
@@ -24,6 +24,8 @@
"@testing-library/jest-dom": "^6.9.1",
"@testing-library/react": "^14.1.2",
"@testing-library/user-event": "^14.5.1",
"jest-axe": "^9.0.0",
"@types/jest-axe": "^3.5.9",
"@typescript-eslint/parser": "^7.18.0",
"@types/react": "^18.3.12",
"@types/react-dom": "^18.3.1",