This directory contains operational runbooks for disaster recovery and incident response for the YieldVault platform.
| Runbook | RTO | RPO | Use When |
|---|---|---|---|
| RTO/RPO Targets | N/A | N/A | Understanding recovery objectives |
| Database Restore | 1 hour | 15 min | Database corruption or failure |
| Backend Redeploy | 30 min | N/A | Backend service issues |
| RPC Failover | 5 min | N/A | Stellar RPC node failure |
| Full DR Procedure | 4 hours | 15 min | Complete infrastructure failure |
Runbooks are step-by-step operational guides that enable any engineer to execute complex procedures consistently and reliably. They are designed to be followed during high-stress situations when quick, accurate action is critical.
- During incidents: Follow the appropriate runbook for the failure type
- During testing: Use runbooks to practice disaster recovery
- During training: Familiarize new team members with procedures
- During planning: Reference RTO/RPO targets for capacity planning
File: RTO_RPO_TARGETS.md
Purpose: Defines Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets for all system components.
Key Information:
- Component-specific RTO/RPO targets
- Disaster scenario analysis
- Backup schedules
- Cost analysis
- Testing requirements
When to Read:
- Before any disaster recovery activity
- During capacity planning
- When evaluating infrastructure changes
- During compliance audits
File: DATABASE_RESTORE.md
Purpose: Restore the YieldVault database from backup.
RTO: 1 hour
RPO: 15 minutes
Use Cases:
- Database corruption
- Accidental data deletion
- Database server failure
- Rollback after failed migration
- Data integrity issues
Prerequisites:
- Database admin credentials
- Backup storage access
- SSH access to database server
Key Steps:
- Assess situation (5 min)
- Stop backend services (5 min)
- Backup current state (10 min)
- Download backup (10 min)
- Restore database (20 min)
- Verify restore (10 min)
- Restart services (5 min)
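The 15-minute RPO can be read as a bound on how old the restored backup may be at the moment of failure. A minimal sketch of that check follows; the function names and the example timestamps are illustrative assumptions, not part of DATABASE_RESTORE.md.

```python
from datetime import datetime, timedelta, timezone

# RPO for Database Restore as stated in this README.
RPO = timedelta(minutes=15)


def data_loss_window(last_backup_at: datetime, failure_at: datetime) -> timedelta:
    """Return how much data would be lost by restoring the given backup."""
    if last_backup_at > failure_at:
        raise ValueError("backup is newer than the failure time")
    return failure_at - last_backup_at


def meets_rpo(last_backup_at: datetime, failure_at: datetime,
              rpo: timedelta = RPO) -> bool:
    """True when restoring this backup keeps data loss within the RPO."""
    return data_loss_window(last_backup_at, failure_at) <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps: a backup taken 10 minutes before the failure.
    failure = datetime(2026, 4, 29, 12, 0, tzinfo=timezone.utc)
    backup = datetime(2026, 4, 29, 11, 50, tzinfo=timezone.utc)
    print(data_loss_window(backup, failure), meets_rpo(backup, failure))
```

Running this kind of check during the "Assess situation" step would show early whether the available backup can satisfy the RPO.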
File: BACKEND_REDEPLOY.md
Purpose: Redeploy the backend API service.
RTO: 30 minutes
RPO: N/A (stateless)
Use Cases:
- Backend service unresponsive
- Application crashes
- Security patch deployment
- Configuration changes
- Performance issues
Prerequisites:
- Git repository access
- SSH access to servers
- Environment variables
- Docker/PM2 access
Key Steps:
- Assess current state (5 min)
- Prepare for deployment (5 min)
- Stop service (2 min)
- Deploy new version (10 min)
- Start service (2 min)
- Verify deployment (5 min)
- Smoke tests (5 min)
File: RPC_FAILOVER.md
Purpose: Switch to backup Stellar RPC node.
RTO: 5 minutes
RPO: N/A (blockchain data)
Use Cases:
- Primary RPC node unresponsive
- RPC errors or timeouts
- Performance degradation
- Rate limiting issues
- Planned maintenance
Prerequisites:
- Backup RPC URLs
- Environment variable access
- SSH access to servers
Key Steps:
- Verify RPC failure (2 min)
- Select backup RPC (1 min)
- Update configuration (1 min)
- Restart service (1 min)
- Verify failover (2 min)
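The "Verify RPC failure" and "Select backup RPC" steps could be sketched as below. This assumes the nodes expose the JSON-RPC `getHealth` method documented for Stellar RPC; the function names and the example URLs are placeholders, and RPC_FAILOVER.md remains the authoritative procedure.

```python
import json
import urllib.request
from typing import Callable, Iterable, Optional


def rpc_is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the node answers getHealth with status 'healthy'."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": "getHealth"}
    ).encode()
    try:
        request = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.load(response)
    except (OSError, ValueError):
        # Network errors, timeouts, bad URLs, and malformed JSON
        # are all treated as "unhealthy".
        return False
    result = body.get("result") if isinstance(body, dict) else None
    return isinstance(result, dict) and result.get("status") == "healthy"


def select_rpc(candidates: Iterable[str],
               probe: Callable[[str], bool] = rpc_is_healthy) -> Optional[str]:
    """Return the first candidate URL that passes the probe, else None."""
    for url in candidates:
        if probe(url):
            return url
    return None


if __name__ == "__main__":
    # Offline demonstration with a stubbed probe; the URLs are placeholders.
    fake_status = {
        "https://rpc-backup-1.example": False,
        "https://rpc-backup-2.example": True,
    }
    print(select_rpc(fake_status, probe=fake_status.get))
```

Keeping the probe injectable lets the selection logic be rehearsed in a tabletop exercise without touching live nodes.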
File: FULL_DR_PROCEDURE.md
Purpose: Complete system recovery from catastrophic failure.
RTO: 4 hours (1.5 hours with standby infrastructure)
RPO: 15 minutes
Use Cases:
- Data center outage
- Multiple component failures
- Complete infrastructure loss
- Natural disaster
- Cyber attack requiring rebuild
Prerequisites:
- Cloud provider admin access
- All backup access
- Full team availability
- Emergency budget approval
Key Phases:
- Assessment & Planning (30 min)
- Infrastructure Provisioning (60 min)
- Database Recovery (60 min)
- Backend Deployment (30 min)
- Frontend Deployment (15 min)
- External Services (15 min)
- Verification & Testing (30 min)
- Cutover (15 min)
- Post-Recovery (30 min)
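The step estimates listed for each runbook above can be compared with the stated RTOs by simple addition. The sketch below assumes every step runs strictly in sequence, which the individual runbooks may not require; the dictionary layout and function name are illustrative.

```python
# Stated RTO (minutes) and per-step estimates (minutes), copied from this README.
RUNBOOKS = {
    "Database Restore": (60, [5, 5, 10, 10, 20, 10, 5]),
    "Backend Redeploy": (30, [5, 5, 2, 10, 2, 5, 5]),
    "RPC Failover": (5, [2, 1, 1, 1, 2]),
    "Full DR Procedure": (240, [30, 60, 60, 30, 15, 15, 30, 15, 30]),
}


def budget_report(runbooks):
    """Map each runbook to (sequential total, minutes over the RTO)."""
    report = {}
    for name, (rto_minutes, steps) in runbooks.items():
        total = sum(steps)
        report[name] = (total, total - rto_minutes)
    return report


if __name__ == "__main__":
    for name, (total, over) in budget_report(RUNBOOKS).items():
        print(f"{name}: {total} min sequential, {over:+d} min vs RTO")
```

Under the sequential assumption, each total comes out above its RTO (65 vs 60, 34 vs 30, 7 vs 5, and 285 vs 240 minutes). That may mean some steps are expected to overlap, or that trailing steps such as Post-Recovery fall outside the RTO window; measuring actual times during tests would settle which.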
Use this decision tree to select the appropriate runbook:
Is the entire infrastructure down?
├─ YES → Use Full DR Procedure
└─ NO → Continue
Is the database corrupted or inaccessible?
├─ YES → Use Database Restore
└─ NO → Continue
Is the backend service down or malfunctioning?
├─ YES → Use Backend Redeploy
└─ NO → Continue
Is the Stellar RPC node failing?
├─ YES → Use RPC Failover
└─ NO → Check component-specific documentation
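The decision tree above can also be written as a small function, which makes the order of the checks explicit. This is a sketch: the `Symptoms` fields are hypothetical names for the four questions, and the return values are the runbook file names listed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptoms:
    """Answers to the four questions in the decision tree."""
    infrastructure_down: bool = False
    database_unavailable: bool = False
    backend_failing: bool = False
    rpc_failing: bool = False


def select_runbook(symptoms: Symptoms) -> str:
    """Walk the tree top to bottom and return the first matching runbook."""
    if symptoms.infrastructure_down:
        return "FULL_DR_PROCEDURE.md"
    if symptoms.database_unavailable:
        return "DATABASE_RESTORE.md"
    if symptoms.backend_failing:
        return "BACKEND_REDEPLOY.md"
    if symptoms.rpc_failing:
        return "RPC_FAILOVER.md"
    return "component-specific documentation"


if __name__ == "__main__":
    print(select_runbook(Symptoms(backend_failing=True, rpc_failing=True)))
```

Because the checks run in order, a broader failure always takes precedence over a narrower one when several symptoms are present.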
All runbooks must be tested according to this schedule:
| Runbook | Test Frequency | Last Tested | Next Test |
|---|---|---|---|
| Database Restore | Monthly | TBD | |
| Backend Redeploy | Weekly | TBD | |
| RPC Failover | Monthly | TBD | |
| Full DR Procedure | Annually | TBD | |
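Once a "Last Tested" date is recorded, the "Next Test" column can be derived from the test frequency. A minimal sketch, assuming "Monthly" and "Annually" mean the same calendar day one or twelve months later (clamped to the end of shorter months); the helper names are illustrative.

```python
import calendar
from datetime import date, timedelta


def add_months(day: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    index = day.month - 1 + months
    year = day.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def next_test(last_tested: date, frequency: str) -> date:
    """Return the next due date for a frequency used in the schedule table."""
    if frequency == "Weekly":
        return last_tested + timedelta(days=7)
    if frequency == "Monthly":
        return add_months(last_tested, 1)
    if frequency == "Annually":
        return add_months(last_tested, 12)
    raise ValueError(f"unknown frequency: {frequency}")


if __name__ == "__main__":
    # Hypothetical "Last Tested" date, used only for illustration.
    print(next_test(date(2026, 4, 29), "Monthly"))
```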
- Tabletop Exercise (Quarterly)
  - Walk through runbook as a team
  - Identify gaps and issues
  - Update documentation
  - Duration: 2 hours
- Partial Test (Monthly)
  - Execute runbook in non-production
  - Verify all steps work
  - Measure actual RTO/RPO
  - Duration: 1-4 hours
- Full DR Test (Annually)
  - Execute complete DR in production-like environment
  - Involve entire team
  - Simulate real disaster
  - Duration: 8 hours
Update runbooks when:
- Infrastructure changes
- New tools or processes adopted
- Testing reveals issues
- Actual incident occurs
- Team feedback received
- Quarterly review cycle
- Monthly: Quick review of recent changes
- Quarterly: Full review and testing
- Annually: Complete rewrite if needed
All runbooks are version controlled in git:
- Track changes over time
- Review history of updates
- Collaborate on improvements
- Maintain audit trail
Detection:
- Monitoring alerts
- User reports
- Health check failures
- Manual discovery
Assessment:
- Determine severity
- Identify affected components
- Estimate impact
- Select appropriate runbook
Response:
- Assemble team
- Create incident channel
- Follow runbook
- Document actions
Recovery:
- Execute recovery steps
- Verify restoration
- Monitor closely
- Notify stakeholders
Post-Incident:
- Post-incident review
- Update runbooks
- Implement improvements
- Share learnings
Incident Commander:
- Declares disaster
- Assembles team
- Makes final decisions
- Communicates with stakeholders
Database Admin:
- Executes database restore
- Verifies data integrity
- Manages database configuration
DevOps Lead:
- Provisions infrastructure
- Deploys applications
- Configures networking
- Manages monitoring
Backend Lead:
- Deploys backend code
- Verifies functionality
- Troubleshoots issues
Frontend Lead:
- Deploys frontend code
- Verifies user experience
- Updates configuration
Security Lead:
- Assesses security implications
- Verifies security controls
- Manages secrets and keys
Slack Channels:
- #yieldvault-incidents: General incident updates
- #yieldvault-war-room: Active incident coordination
- #yieldvault-ops: Operational updates
PagerDuty:
- Escalation policies defined
- On-call rotation maintained
- Alert routing configured
Status Page:
- Update during incidents
- Provide ETAs
- Post-incident reports
Customer Communication:
- Email notifications
- In-app messages
- Social media updates
- SSH Client: Access to servers
- psql: PostgreSQL client
- curl: API testing
- jq: JSON parsing
- git: Version control
- aws/gcloud/az: Cloud CLI tools
- Health Dashboard: [Link]
- Metrics Dashboard: [Link]
- Logs Dashboard: [Link]
- Alerts Dashboard: [Link]
| Metric | Target | Current |
|---|---|---|
| Mean Time To Detect (MTTD) | < 5 min | TBD |
| Mean Time To Respond (MTTR) | < 30 min | TBD |
| Recovery Success Rate | > 95% | TBD |
| RTO Achievement | > 90% | TBD |
| RPO Achievement | > 95% | TBD |
For each incident, track:
- Detection time
- Response time
- Recovery time
- Data loss
- Root cause
- Lessons learned
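The fields tracked per incident are enough to fill in the "Current" column of the metrics table. The sketch below is one possible calculation: the `Incident` record, the choice to measure recovery time from the start of the outage, and the reading of MTTR as time from detection to response are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started_at: datetime
    detected_at: datetime
    responded_at: datetime
    recovered_at: datetime
    data_loss: timedelta
    rto: timedelta  # target from the runbook that was used
    rpo: timedelta


def summarize(incidents):
    """Return MTTD/MTTR in minutes and RTO/RPO achievement as fractions."""
    if not incidents:
        raise ValueError("at least one incident is required")
    count = len(incidents)
    return {
        "mttd_min": mean(
            (i.detected_at - i.started_at).total_seconds() for i in incidents
        ) / 60,
        "mttr_min": mean(
            (i.responded_at - i.detected_at).total_seconds() for i in incidents
        ) / 60,
        # Share of incidents recovered within their RTO.
        "rto_achievement": sum(
            (i.recovered_at - i.started_at) <= i.rto for i in incidents
        ) / count,
        # Share of incidents whose data loss stayed within their RPO.
        "rpo_achievement": sum(i.data_loss <= i.rpo for i in incidents) / count,
    }
```

Comparing these values with the targets in the table shows which objectives are being met; with only a handful of incidents the percentages are coarse and should be read with that in mind.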
- Read all runbooks
- Attend tabletop exercise
- Shadow experienced engineer
- Execute runbook in test environment
- Participate in on-call rotation
- Quarterly tabletop exercises
- Monthly runbook reviews
- Annual full DR test
- Post-incident reviews
- Collect Feedback
  - After each incident
  - During testing
  - From team members
- Analyze
  - What worked well?
  - What could be improved?
  - What was missing?
- Update
  - Revise runbooks
  - Update procedures
  - Improve tools
- Test
  - Verify improvements
  - Measure impact
  - Iterate
- Runbooks documented
- RTO/RPO defined and approved
- Testing schedule established
- Tests performed and documented
- Incidents documented
- Improvements implemented
- SOC 2 Type II
- ISO 27001
- GDPR (if applicable)
- Industry best practices
- During Incident:
  - Use PagerDuty escalation
  - Post in #yieldvault-war-room
  - Call emergency contacts
- For Runbook Questions:
  - Post in #yieldvault-ops
  - Contact DevOps team
  - Review documentation
- For Updates:
  - Submit PR to update runbook
  - Discuss in team meeting
  - Document in incident review
| Role | Name | Phone | Email |
|---|---|---|---|
| Incident Commander | TBD | TBD | TBD |
| Database Admin | TBD | TBD | TBD |
| DevOps Lead | TBD | TBD | TBD |
| Backend Lead | TBD | TBD | TBD |
| Frontend Lead | TBD | TBD | TBD |
| Security Lead | TBD | TBD | TBD |
| Team Lead | TBD | TBD | TBD |
| CEO/CTO | TBD | TBD | TBD |
PagerDuty: [Escalation Policy Link]
Slack: #yieldvault-war-room
Zoom: [Emergency Meeting Link]
- RTO: Recovery Time Objective - Maximum acceptable downtime
- RPO: Recovery Point Objective - Maximum acceptable data loss
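- MTTD: Mean Time To Detect - Average time to detect an incident
- MTTR: Mean Time To Respond - Average time to respond to an incident, as measured in the metrics table (the acronym is also commonly read as Mean Time To Repair)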
- MTBF: Mean Time Between Failures - Average time between incidents
- DR: Disaster Recovery
- HA: High Availability
See individual runbooks for detailed checklists.
Last Updated: April 29, 2026
Maintained By: DevOps Team
Next Review: July 29, 2026