Purpose: Switch to backup Stellar RPC node
RTO Target: 5 minutes
RPO Target: N/A (blockchain data is distributed)
Last Updated: April 29, 2026
Last Tested:
Use this runbook when:
- Primary Stellar RPC node is unresponsive
- RPC node returns errors or timeouts
- RPC node performance degradation
- Planned RPC node maintenance
- Network connectivity issues to primary RPC
- Rate limiting from RPC provider
- SSH access to backend servers
- Environment variable access
- Backup RPC node credentials (if needed)
- Monitoring dashboard access
-
curlfor testing -
jqfor JSON parsing - Text editor (
nano,vim) - Process manager access (
systemctl,pm2)
- Primary RPC URL
- Backup RPC URLs
- Network passphrase
- Incident ticket number
- Verify primary RPC is down (not a temporary glitch)
- Notify team via Slack/PagerDuty
- Create incident ticket
- Document current RPC URL
- Verify backup RPC is available
- Check backup RPC health
# Get current RPC URL from environment
PRIMARY_RPC=$(grep STELLAR_RPC_URL /app/yieldvault-backend/.env | cut -d '=' -f2)
echo "Primary RPC: $PRIMARY_RPC"
# Test RPC health
curl -X POST "$PRIMARY_RPC" \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "getHealth"
}' | jq .
# Test with timeout
timeout 5 curl -X POST "$PRIMARY_RPC" \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "getLatestLedger"
}' | jq .Expected Output (Healthy):
{
"jsonrpc": "2.0",
"id": 1,
"result": {
"status": "healthy"
}
}Failure Indicators:
- Connection timeout
- Connection refused
- HTTP 5xx errors
- Rate limit errors (429)
- Slow response (>5 seconds)
# Check for RPC errors in application logs
tail -100 /var/log/yieldvault/backend.log | grep -i "stellar\|rpc\|soroban"
# Look for patterns like:
# - "RPC connection failed"
# - "Timeout waiting for RPC"
# - "Rate limit exceeded"
# - "Network error"# Test network connectivity
ping -c 5 soroban-testnet.stellar.org
# Test DNS resolution
nslookup soroban-testnet.stellar.org
# Test HTTPS connectivity
curl -I https://soroban-testnet.stellar.orgTestnet Options:
-
Stellar Foundation Public RPC (Primary)
- URL:
https://soroban-testnet.stellar.org - Rate Limit: Moderate
- Reliability: High
- URL:
-
Custom Hosted Node (Backup 1)
- URL:
https://rpc.yieldvault.finance - Rate Limit: None (self-hosted)
- Reliability: High (if maintained)
- URL:
-
Third-Party Provider (Backup 2)
- URL:
https://stellar-testnet.ankr.com(example) - Rate Limit: Varies by plan
- Reliability: High
- URL:
Mainnet Options:
-
Stellar Foundation Public RPC (Primary)
- URL:
https://soroban-mainnet.stellar.org - Rate Limit: Moderate
- Reliability: High
- URL:
-
QuickNode (Backup 1)
- URL:
https://your-endpoint.stellar-mainnet.quiknode.pro - Rate Limit: Based on plan
- Reliability: Very High
- URL:
-
Custom Hosted Node (Backup 2)
- URL:
https://rpc-mainnet.yieldvault.finance - Rate Limit: None
- Reliability: High (if maintained)
- URL:
# Test backup RPC health
BACKUP_RPC="https://rpc.yieldvault.finance"
curl -X POST "$BACKUP_RPC" \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "getHealth"
}' | jq .
# Test ledger query
curl -X POST "$BACKUP_RPC" \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "getLatestLedger"
}' | jq .
# Measure response time
time curl -X POST "$BACKUP_RPC" \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "getLatestLedger"
}' > /dev/null 2>&1Expected Output:
real 0m0.234s # Should be < 1 second
user 0m0.012s
sys 0m0.008s
If backup RPC is also down:
- Try next backup in list
- Consider temporary degraded service
- Escalate to senior engineer
# Backup current .env file
cd /app/yieldvault-backend
cp .env .env.backup.$(date +%Y%m%d_%H%M%S)
# Document current RPC
grep STELLAR_RPC_URL .env >> /var/log/yieldvault/rpc_changes.log
echo "Changed at: $(date -Iseconds)" >> /var/log/yieldvault/rpc_changes.log# Update .env file
BACKUP_RPC="https://rpc.yieldvault.finance"
# Using sed (Linux)
sed -i "s|STELLAR_RPC_URL=.*|STELLAR_RPC_URL=$BACKUP_RPC|" .env
# Or using sed (macOS)
sed -i '' "s|STELLAR_RPC_URL=.*|STELLAR_RPC_URL=$BACKUP_RPC|" .env
# Or manually edit
nano .env
# Change: STELLAR_RPC_URL=https://soroban-testnet.stellar.org
# To: STELLAR_RPC_URL=https://rpc.yieldvault.finance
# Verify change
grep STELLAR_RPC_URL .envExpected Output:
STELLAR_RPC_URL=https://rpc.yieldvault.finance
# Ensure network passphrase matches the RPC network
grep STELLAR_NETWORK_PASSPHRASE .env
# For testnet, should be:
# STELLAR_NETWORK_PASSPHRASE=Test SDF Network ; September 2015
# For mainnet, should be:
# STELLAR_NETWORK_PASSPHRASE=Public Global Stellar Network ; September 2015# If using systemd
sudo systemctl restart yieldvault-backend
# If using PM2
pm2 restart yieldvault-backend
# If using Docker
docker restart yieldvault-backend
# Wait for startup
sleep 5# Check service status
systemctl status yieldvault-backend
# Or PM2
pm2 status yieldvault-backend
# Or Docker
docker ps | grep yieldvault-backend# Check health endpoint
curl -s http://localhost:3000/health | jq .
# Expected response
{
"status": "healthy",
"checks": {
"api": "up",
"database": "up",
"stellarRpc": "up" # Should be "up"
}
}# Test vault contract query
curl -s http://localhost:3000/api/v1/vault/total-assets | jq .
# Test vault summary
curl -s http://localhost:3000/api/v1/vault/summary | jq .
# Should return data without errors# Check for successful RPC connection
tail -50 /var/log/yieldvault/backend.log | grep -i "stellar\|rpc"
# Look for:
# ✓ "Connected to Stellar RPC"
# ✓ "RPC health check passed"
# ✗ No error messages# Test a read-only contract call
curl -X POST http://localhost:3000/api/v1/vault/simulate-deposit \
-H "Content-Type: application/json" \
-d '{
"user": "GXXX...",
"amount": "100"
}' | jq .
# Should return simulation result without errors# Measure RPC response time
for i in {1..10}; do
time curl -s http://localhost:3000/api/v1/vault/summary > /dev/null
sleep 1
done
# Average should be < 1 second# Watch logs for errors
tail -f /var/log/yieldvault/backend.log | grep -i error
# Should see no RPC-related errors# If using monitoring (Grafana, etc.)
# - Check RPC latency metrics
# - Check error rate
# - Check success rate
# - Compare with baselineIf frontend connects directly to RPC:
# Update frontend environment
cd /app/yieldvault-frontend
# Backup current config
cp .env .env.backup.$(date +%Y%m%d_%H%M%S)
# Update RPC URL
sed -i "s|VITE_SOROBAN_RPC_URL=.*|VITE_SOROBAN_RPC_URL=$BACKUP_RPC|" .env
# Rebuild and redeploy frontend
npm run build
# Deploy to CDN/hosting# Log the failover
cat >> /var/log/yieldvault/rpc_changes.log <<EOF
=== RPC Failover ===
Date: $(date -Iseconds)
From: $PRIMARY_RPC
To: $BACKUP_RPC
Reason: Primary RPC unresponsive
Performed By: $(whoami)
Incident: INCIDENT-123
Status: Success
EOF# Send Slack notification
curl -X POST $SLACK_WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d '{
"text": "⚠️ RPC Failover Completed",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*RPC Failover*\n• From: Primary RPC\n• To: Backup RPC\n• Downtime: ~2 minutes\n• Status: Operational"
}
}
]
}'- Investigate primary RPC failure
- Contact primary RPC provider
- Monitor backup RPC performance
- Plan return to primary (if applicable)
- Update documentation
To switch back to primary RPC:
- Primary RPC is restored and healthy
- Backup RPC has issues
- Performance is better on primary
- Cost considerations
# 1. Verify primary RPC is healthy
curl -X POST "$PRIMARY_RPC" \
-H "Content-Type: application/json" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "getHealth"}' | jq .
# 2. Update configuration
cd /app/yieldvault-backend
sed -i "s|STELLAR_RPC_URL=.*|STELLAR_RPC_URL=$PRIMARY_RPC|" .env
# 3. Restart service
sudo systemctl restart yieldvault-backend
# 4. Verify
curl -s http://localhost:3000/health | jq .
# 5. Monitor for 10 minutes
tail -f /var/log/yieldvault/backend.log | grep -i "stellar\|rpc"Symptoms: Backup RPC returns errors or timeouts
Solutions:
- Try next backup in list
- Check network connectivity
- Verify RPC provider status page
- Consider temporary degraded service
- Escalate to senior engineer
Symptoms: "Invalid network passphrase" errors
Solutions:
- Verify RPC network (testnet vs mainnet)
- Update STELLAR_NETWORK_PASSPHRASE to match
- Restart service
- Verify contract calls work
Symptoms: 429 Too Many Requests errors
Solutions:
- Implement request throttling in application
- Upgrade RPC provider plan
- Switch to different backup RPC
- Implement request caching
Symptoms: Increased response times
Solutions:
- Check RPC provider status
- Verify network latency:
ping rpc-host - Consider geographic proximity
- Switch to faster backup RPC
- Implement timeout adjustments
// backend/src/rpcHealthCheck.ts
import { Server } from '@stellar/stellar-sdk/rpc';
const PRIMARY_RPC = process.env.STELLAR_RPC_URL;
const BACKUP_RPC = process.env.STELLAR_RPC_URL_BACKUP;
let currentRpc = PRIMARY_RPC;
let failoverCount = 0;
async function checkRpcHealth(url: string): Promise<boolean> {
try {
const server = new Server(url);
const health = await server.getHealth();
return health.status === 'healthy';
} catch (error) {
return false;
}
}
async function autoFailover() {
const primaryHealthy = await checkRpcHealth(PRIMARY_RPC);
if (!primaryHealthy && currentRpc === PRIMARY_RPC) {
console.warn('Primary RPC unhealthy, failing over to backup');
currentRpc = BACKUP_RPC;
failoverCount++;
// Send alert
await sendAlert('RPC failover to backup');
} else if (primaryHealthy && currentRpc === BACKUP_RPC) {
console.info('Primary RPC restored, failing back');
currentRpc = PRIMARY_RPC;
// Send alert
await sendAlert('RPC failback to primary');
}
return currentRpc;
}
// Check every 30 seconds
setInterval(autoFailover, 30000);| Provider | Testnet | Mainnet | Rate Limit | Latency | Cost | Reliability |
|---|---|---|---|---|---|---|
| Stellar Foundation | ✓ | ✓ | Moderate | Low | Free | High |
| QuickNode | ✓ | ✓ | High | Very Low | $$$ | Very High |
| Ankr | ✓ | ✓ | Moderate | Low | $$ | High |
| Self-Hosted | ✓ | ✓ | None | Varies | $ | Varies |
After failover, verify:
- Backup RPC is healthy
- Service restarted successfully
- Health checks pass
- RPC connectivity works
- Contract queries succeed
- No errors in logs
- Response times acceptable
- Frontend works (if applicable)
- Monitoring shows healthy state
- Team notified
- Change documented
Document these metrics for each failover:
| Metric | Target | Actual | Status |
|---|---|---|---|
| Detection Time | 1 min | ___ min | ___ |
| Decision Time | 1 min | ___ min | ___ |
| Configuration Update | 1 min | ___ min | ___ |
| Service Restart | 1 min | ___ min | ___ |
| Verification | 1 min | ___ min | ___ |
| Total RTO | 5 min | ___ min | ___ |
| Errors During Failover | 0 | ___ | ___ |
| Role | Name | Phone | |
|---|---|---|---|
| DevOps Lead | TBD | TBD | TBD |
| Infrastructure Lead | TBD | TBD | TBD |
| On-Call Engineer | TBD | TBD | TBD |
| RPC Provider Support | TBD | TBD | TBD |
PagerDuty: [Escalation Policy Link]
Slack Channel: #yieldvault-incidents
Last Updated: April 29, 2026
Next Review: July 29, 2026
Tested: