A hands-on reference for relational database fundamentals, performance engineering, and operational patterns across PostgreSQL, MySQL, Oracle, CockroachDB, and Redshift.
- ACID Guarantees
- Distributed Transactions & Consensus
- Relational Algebra & JOINs
- Indexing
- Partitioning
- Data Archiving & WAL
- Connection Pooling
- Storage Internals — Database Pages
- Transaction Isolation & Locking
- High Availability & Replication
- Performance Engineering
- Schema Design
- Database Engines
- Managed Database Services
ACID is the foundational contract that a database makes to application code. Violating any property shifts correctness responsibility to the application layer.
| Property | Guarantee | Failure Scenario |
|---|---|---|
| Atomicity | A transaction is all-or-nothing. On crash or error, the engine rolls back to the pre-transaction state. | Partial writes on power loss |
| Consistency | Every committed transaction moves the database from one valid state to another, enforcing all constraints and rules. | Referential integrity violations |
| Isolation | Concurrent transactions produce the same result as serial execution. JDBC default: TRANSACTION_READ_COMMITTED. | Dirty reads, phantom reads |
| Durability | Committed data survives crashes, power loss, and restarts. Achieved via write-ahead logging (WAL). | Silent data loss after commit ACK |
Key refs:
- ACID and database transactions (Stack Overflow)
- Isolation levels (Wikipedia)
- JDBC isolation defaults (Oracle docs)
2PC is the standard protocol for achieving strong consistency across distributed nodes. It involves a coordinator and one or more cohort nodes.
Coordinator (Master)                   Cohort (Secondary)

        QUERY TO COMMIT
  ─────────────────────────────────>
                                       Execute transaction
                                       (writes to UNDO_LOG & REDO_LOG)
        VOTE YES / NO
  <─────────────────────────────────
  Decide commit / abort
        COMMIT / ROLLBACK
  ─────────────────────────────────>
                                       Apply COMMIT / ROLLBACK
        ACKNOWLEDGMENT
  <─────────────────────────────────
Phase 1 — Voting: Coordinator sends QUERY TO COMMIT. Each cohort executes the transaction locally, logs it, then votes YES (prepared) or NO (abort).
Phase 2 — Completion: If all cohorts vote YES, coordinator sends COMMIT. Otherwise, sends ROLLBACK. Cohorts apply and acknowledge.
| Weakness | Detail |
|---|---|
| Blocking | If the coordinator crashes after Phase 1, cohorts that voted YES are indefinitely blocked — they cannot unilaterally commit or abort. |
| Single point of failure | The coordinator's persistent failure leaves cohorts in an uncertain state. Requires manual intervention or a timeout-based heuristic. |
| Latency | Requires 2 full round-trips plus disk flushes on each node. |
Mitigations: 3PC (Three-Phase Commit) adds a pre-commit phase to reduce blocking, but at higher latency cost. Modern systems (Spanner, CockroachDB) use Paxos/Raft-based consensus instead.
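PostgreSQL exposes the cohort side of this protocol directly via its two-phase commit SQL commands (requires `max_prepared_transactions > 0`). A minimal sketch, assuming a hypothetical `accounts` table:

```sql
-- Phase 1 (voting), executed on a cohort:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'xfer-42';  -- durable "VOTE YES": survives a crash, keeps locks held

-- Phase 2 (completion), driven by the coordinator's decision:
COMMIT PREPARED 'xfer-42';
-- or, if any cohort voted NO:
-- ROLLBACK PREPARED 'xfer-42';
```

A prepared transaction that is never resolved exhibits exactly the blocking weakness above: its locks persist until an operator issues `COMMIT PREPARED` or `ROLLBACK PREPARED`.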
The base operation that underpins all JOINs. Produces every combination of rows between two sets.
weights = { [pkgA1, 100g], [pkgA2, 200g] }
shippingDates = { [pkgB1, Jul], [pkgB2, Aug], [pkgB3, Sep] }
weights × shippingDates = {
[pkgA1, 100g, pkgB1, Jul], [pkgA1, 100g, pkgB2, Aug], [pkgA1, 100g, pkgB3, Sep],
[pkgA2, 200g, pkgB1, Jul], [pkgA2, 200g, pkgB2, Aug], [pkgA2, 200g, pkgB3, Sep]
}
| JOIN | Set Operation | Behaviour |
|---|---|---|
| INNER JOIN | A ∩ B | Only rows with matching keys in both tables |
| LEFT OUTER JOIN | (A ∩ B) ∪ (A − B) | All rows from A; unmatched B rows filled with NULL |
| RIGHT OUTER JOIN | (A ∩ B) ∪ (B − A) | All rows from B; unmatched A rows filled with NULL |
| FULL OUTER JOIN | A ∪ B | All rows from both; NULL where no match |
| CROSS JOIN | A × B | Full Cartesian product — use only when intentional |
-- Set difference (A − B): rows in A with no match in B
SELECT DISTINCT a.*
FROM a
LEFT OUTER JOIN b ON a.id = b.id
WHERE b.id IS NULL;

Performance note: JOINs are expensive because they are O(M × N) in the worst case (nested loop). Prefer hash joins or merge joins on large datasets, and always index JOIN key columns.
Refs:
- Relational algebra (Wikipedia)
- The Join Operation — Use The Index, Luke
- When and why are relational database joins expensive?
- Performance considerations for JOIN queries (Cloudera)
- LEFT vs LEFT OUTER JOIN (Stack Overflow)
Indexes trade write amplification and storage for dramatically faster reads. Choose index types deliberately.
| Index Type | Best For |
|---|---|
| B-Tree (default) | Equality and range queries on high-cardinality columns |
| Hash | Equality-only lookups (avoid for ranges) |
| GIN / GiST | Full-text search, JSONB, array columns |
| BRIN | Very large append-only tables with natural sort order (e.g., time-series) |
| Partial index | Indexes on a subset of rows (e.g., WHERE is_deleted = false) |
| Composite index | Multi-column predicates — column order matters |
Operational rules:
- Index all foreign keys to avoid full-table scans on JOINs.
- Avoid indexing low-cardinality columns (e.g., boolean flags) — the planner will ignore them.
- Use `EXPLAIN ANALYZE` to confirm index usage before and after schema changes.
- `DEFERRABLE` constraints allow constraint checking at commit time, not row-insert time.
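A sketch of the less common index forms in PostgreSQL syntax; the `orders`/`customers` tables and column names are illustrative assumptions:

```sql
-- Partial index: covers only the rows the hot query path actually reads
CREATE INDEX idx_orders_active ON orders (created_at)
    WHERE is_deleted = false;

-- Composite index: equality column first, range column second
CREATE INDEX idx_orders_cust_date ON orders (customer_id, created_at);

-- DEFERRABLE foreign key: checked at COMMIT instead of per-row
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
    REFERENCES customers (id) DEFERRABLE INITIALLY DEFERRED;
```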
Refs:
- PostgreSQL indexes (Heroku Dev Center)
- SET CONSTRAINTS (PostgreSQL docs)
- DEFERRABLE vs NOT DEFERRABLE (Stack Overflow)
Partitioning splits a logical table into physical sub-tables (partitions), pruning I/O at query time.
| Strategy | Use Case |
|---|---|
| Range | Time-series data — one partition per month/year |
| List | Discrete category splits — e.g., region, tenant |
| Hash | Even distribution when no natural range or list key exists |
Operational considerations:
- Partition pruning only fires when the partition key appears in the `WHERE` clause with a literal or bind parameter the planner can evaluate.
- Indexes must be created on each partition individually (PostgreSQL 11+ propagates an index created on the partitioned parent to every partition automatically).
- Cross-partition queries (e.g., `ORDER BY` spanning partitions) have higher overhead — plan schemas accordingly.
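PostgreSQL declarative range partitioning, sketched for a hypothetical `events` table with one partition per month:

```sql
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_07 PARTITION OF events
    FOR VALUES FROM ('2024-07-01') TO ('2024-08-01');

-- Prunes to events_2024_07 only: the planner can evaluate the literals
SELECT count(*) FROM events
WHERE created_at >= '2024-07-01' AND created_at < '2024-08-01';
```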
Every mutation is first written to the WAL before being applied to the data pages. This guarantees Durability and enables:
- Point-in-time recovery (PITR)
- Streaming replication (ship WAL segments to standby)
- Logical replication (decode WAL into row-level change events)
Active DB → WAL segments → Archive storage (S3 / GCS)
↓
Standby DB (streaming replication)
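Archiving can be enabled from SQL via `ALTER SYSTEM`; the `wal-g wal-push` command matches the wal-g tooling cited in the refs, and `archive_mode` only takes effect after a server restart:

```sql
ALTER SYSTEM SET archive_mode = 'on';                    -- restart required
ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';  -- %p = path of the WAL segment
SELECT pg_reload_conf();                                 -- picks up archive_command changes
```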
Vacuum & space reclamation:
Dead rows (from UPDATE/DELETE) are not immediately reclaimed — PostgreSQL uses MVCC and requires VACUUM to reclaim space. AUTOVACUUM handles this automatically but must be tuned for high-write tables.
-- Manual vacuum with analyze
VACUUM ANALYZE my_table;
-- Check bloat
SELECT schemaname, tablename, n_dead_tup, n_live_tup,
round(n_dead_tup::numeric / nullif(n_live_tup,0) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Refs:
- WAL archiving with wal-g and S3 (Fusionbox)
- WAL retention and pg_archivecleanup (Percona)
- Continuous archiving (PostgreSQL docs)
- Routine vacuuming (PostgreSQL docs)
- Archiving RDS PostgreSQL to S3/Glacier (AWS blog)
- AWS DMS
Every database connection is a long-lived OS thread/process plus a TCP socket. Opening a new connection per request costs 5–15ms of latency and significant memory.
Application threads → Connection Pool (HikariCP / PgBouncer) → DB backend processes
HikariCP pool sizing formula:
pool_size = (cpu_core_count × 2) + effective_spindle_count
Example: 4-core server, 1 SSD spindle:
pool_size = (4 × 2) + 1 = 9 connections
Key insight: More connections ≠ better throughput. Beyond the saturation point, additional connections increase context switching and lock contention. PostgreSQL's practical limit is typically max_connections = 100–200; use PgBouncer in transaction-pooling mode to multiplex thousands of app connections onto a small backend pool.
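A quick way to see how close the backend is to its ceiling, using the real `pg_stat_activity` view and `current_setting` function:

```sql
-- Backends currently open vs. the configured hard cap
SELECT count(*)                                AS backends_in_use,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```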
Transport options:
- TCP — standard, works across hosts, supports TLS.
- Unix domain socket — lower latency for local connections (no TCP stack overhead).
Refs:
- Database connection pooling (Stack Overflow)
- HikariCP pool sizing rationale
- Unix socket vs TCP (Unix SE)
The database engine reads and writes data in fixed-size units called pages (PostgreSQL default: 8 KB; MySQL InnoDB: 16 KB). Understanding pages is essential for tuning I/O-bound workloads.
| Concept | Detail |
|---|---|
| Page | Fixed-size disk unit storing a set of rows (tuples). All reads/writes are page-granular. |
| Heap | Unordered collection of pages that make up a table. |
| FSM | Free Space Map — tracks available space per page for INSERT placement. |
| Visibility Map | Tracks which pages have no dead tuples (used to skip VACUUM work). |
| Fill Factor | Reserve space in pages for in-place UPDATEs (default 100; set 70–80 for hot-update tables). |
Implications for schema design:
- Wide rows (many columns or `TEXT`/`JSONB`) reduce rows-per-page, increasing I/O for table scans.
- `TOAST` (PostgreSQL) spills large values to a side table, keeping the main page dense.
| Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| `READ UNCOMMITTED` | ✅ possible | ✅ possible | ✅ possible |
| `READ COMMITTED` (JDBC default) | ❌ prevented | ✅ possible | ✅ possible |
| `REPEATABLE READ` | ❌ prevented | ❌ prevented | ✅ possible |
| `SERIALIZABLE` | ❌ prevented | ❌ prevented | ❌ prevented |
PostgreSQL nuance: PostgreSQL's REPEATABLE READ actually prevents phantom reads as well (due to MVCC snapshot semantics), which is stronger than the SQL standard requires.
| Lock | Acquired By | Blocks |
|---|---|---|
| `ACCESS SHARE` | `SELECT` | Only `ACCESS EXCLUSIVE` |
| `ROW EXCLUSIVE` | `INSERT` / `UPDATE` / `DELETE` | `SHARE` and above |
| `SHARE UPDATE EXCLUSIVE` | `VACUUM`, `ANALYZE`, `CREATE INDEX CONCURRENTLY` | Schema changes |
| `ACCESS EXCLUSIVE` | `ALTER TABLE`, `DROP TABLE`, `VACUUM FULL` | Everything |
Deadlock avoidance: Always acquire locks in a consistent order across transactions. Use SELECT ... FOR UPDATE SKIP LOCKED for queue-style workloads to avoid contention.
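The queue pattern above, sketched against a hypothetical `jobs` table:

```sql
-- Each worker claims one pending job; rows locked by other workers are
-- skipped, so concurrent workers never block each other on the same row.
BEGIN;
UPDATE jobs
SET    status = 'running'
WHERE  id = (
    SELECT id FROM jobs
    WHERE  status = 'pending'
    ORDER  BY id
    LIMIT  1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
-- ...do the work, then mark the job done before COMMIT.
COMMIT;
```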
| Mode | Data Loss Risk | Read Scalability | Use Case |
|---|---|---|---|
| Synchronous streaming | Zero (at the cost of write latency) | Yes | Financial systems, strong consistency |
| Asynchronous streaming | Small window (WAL lag) | Yes | Most production systems |
| Logical replication | Configurable | Partial | Cross-version migrations, selective table replication |
┌─────────────────┐
│ Primary (R/W) │
└────────┬────────┘
WAL streaming│
┌──────────────────┴──────────────────┐
▼ ▼
┌─────────┐ ┌─────────┐
│Standby 1│ ← auto-failover via │Standby 2│
│ (R/O) │ Patroni / pgpool-II │ (R/O) │
└─────────┘ └─────────┘
Operational checklist:
- Monitor replication lag (`pg_stat_replication.write_lag`).
- Set `synchronous_standby_names` for zero-RPO requirements.
- Test failover procedure regularly — not just in theory.
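Per-standby lag can be read from the real `pg_stat_replication` view (lag columns available since PostgreSQL 10):

```sql
-- One row per connected standby; lag values are intervals, NULL when idle
SELECT application_name, state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```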
- Run `EXPLAIN (ANALYZE, BUFFERS)` — look at actual vs. estimated rows; large divergence signals stale statistics. Run `ANALYZE` to refresh them.
- Avoid `SELECT *` in application code — it fetches unnecessary columns and bloats network I/O.
- Use bind parameters (prepared statements) for plan caching and SQL injection prevention.
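The first checklist item in practice; the `orders` table and predicate are illustrative assumptions:

```sql
-- Shows the actual plan, row counts, and buffer hits/reads for the query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- If "rows=" estimates diverge badly from actuals, refresh statistics:
ANALYZE orders;
```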
| Parameter | Purpose | Recommended Starting Point |
|---|---|---|
| `shared_buffers` | Buffer pool size | 25% of RAM |
| `work_mem` | Per-sort / per-hash memory | 4–16 MB (lower for high concurrency) |
| `effective_cache_size` | Planner hint for OS page cache | 50–75% of RAM |
| `random_page_cost` | Planner cost for random I/O | 1.1 for SSD; 4.0 for HDD |
| `max_connections` | Hard cap on connections | 100–200; use PgBouncer beyond that |
See database_practices.md for standards.
| Engine | Type | Docs |
|---|---|---|
| PostgreSQL | RDBMS | db/pg/README.md |
| MySQL | RDBMS | db/mysql/README.md |
| Oracle | RDBMS | db/oracle/README.md |
| CockroachDB | Distributed SQL | db/cdb/README.md |
| Redshift | Columnar / OLAP | db/pg/README.md (Redshift section) |
| Provider | Service | Notes |
|---|---|---|
| AWS | RDS / Aurora | Aurora claims up to 3× the throughput of standard PostgreSQL (5× standard MySQL) via its custom storage layer |
| AWS | Redshift | Columnar MPP; optimized for OLAP, not OLTP |
| AWS | DMS | Schema/data migration across heterogeneous engines |
| Azure | Azure SQL | Elastic pools for multi-tenant cost sharing |
| Azure | Synapse Analytics | Unified analytics platform (formerly SQL Data Warehouse) |
| Azure | Time Series Insights | Purpose-built for IoT time-series data |