PostgreSQL engine internals reconstructed in Go - storage, WAL, MVCC, and execution, built subsystem by subsystem from the source.
Heap page layout (faithful to PostgreSQL's 8 KB PageHeaderData format),
B-tree index pages, buffer manager, and access methods.
- Page layout:
PageHeaderData(24 bytes),ItemIdData(4-byte bit-packed line pointers withLP_UNUSED,LP_NORMAL,LP_REDIRECT,LP_DEADstates),pd_flags(PD_HAS_FREE_LINES,PD_PAGE_FULL,PD_ALL_VISIBLE),pd_prune_xid,pd_lsn - HOT chains:
FollowRedirect()resolves oneLP_REDIRECThop;HeapHotSearchBuffer()walks the fullt_ctidchain applying MVCC visibility - Heap tuples:
HeapTupleHeader(24 bytes),HEAP_HOT_UPDATED,HEAP_ONLY_TUPLE,HEAP_XMIN_COMMITTED/HEAP_XMAX_INVALIDhint bits;HeapTupleSatisfiesMVCC,HeapTupleSatisfiesDirty,HeapTupleSatisfiesNow,HeapTupleSatisfiesAny - Buffer manager: clock-sweep eviction, packed atomic state word (bits 0-17 pin
count, 18-22 usage count, 23-27 flags), content
RWMutex,BM_IO_IN_PROGRESScoordination, WAL write-back hook - B-tree: leaf/internal page ops, binary search, multi-level inserts with page
splits, persistent
BTreeIndexbacked bysmgr; high key at slot 0 on non-rightmost pages (P_HIKEY/P_FIRSTDATAKEY), right-link traversal (key > highKey → follow BtpoNext),BTP_INCOMPLETE_SPLITlifecycle (set before parent downlink insertion, cleared on success),BTP_HALF_DEAD/BTP_DELETEDflags defined;FindLeaf/SearchFromBlockfor stale-pointer recovery - TOAST: out-of-line tuple storage -
Toastify/Detoast, chunked storage relation, inline/external threshold - Vacuum:
VacuumPage(dead tuple pruning, HOT chain compaction, freezing),CompactPage(fragmentation reclaim),VacuumFullwith trailing-page truncation - Free Space Map (
FreeSpaceMap), Visibility Map (VisibilityMap) - smgr:
FileStorageManagerfor block-granular fork-aware file I/O
WAL record layout, redo engine, and resource manager dispatch - faithful to
PostgreSQL's XLogRecord wire format and crash recovery pipeline.
- Record encoding:
XLogRecord24-byte header,BlockRef/BlockImage,RelFileLocator, CRC-32C two-phase checksum;Encode/Decode - Page framing:
XLogPageHeaderData(24 B) /XLogLongPageHeaderData(40 B);SegmentReaderwith transparent page-boundary reassembly - Redo engine:
RedoEngine(start/end LSN,SegmentProvider, progress callback,SetStore,SetLogical,SetTracer);InMemoryProvider;SegmentBuilder - Resource managers:
RmgrHeap(INSERT, DELETE, UPDATE + HOT, FPW),RmgrHeap2(MULTI_INSERT, PRUNE, VACUUM, VISIBLE; FREEZE_PAGE/LOCK_UPDATED/NEW_CID/REWRITE acknowledged),RmgrBtree(leaf/upper/meta insert, split),RmgrXact(commit/abort),RmgrXlog(checkpoint, WAL-switch) - Logical decoding:
ReorderBufferbuffers per-XID changes and emits on commit;ChangeHandlerinterface;LogicalDecoderwired intoRedoEngine - WAL–storage bridge:
WalPageStoreimplementswal.PageWriter, applying FPW, heap insert/delete/update, multi-insert, prune, vacuum dead-item removal, all-visible marking, and B-tree insert/split to the buffer pool - File-backed replay:
FileSegmentProviderreads realpg_waldirectories;ParseLSN,ListSegments,DetectSegmentSize - CLI tool (
cmd/walreplay):--wal-dir / --timeline / --start-lsn / --end-lsn, auto-detects segment size, prints per-rmgr applied/skipped statistics
Snapshot manager and transaction lifecycle.
- CLOG (
Clog): in-memorypg_xactequivalent;Status/SetStatus; special XID handling (InvalidTransactionId→ aborted,FrozenTransactionId→ committed) - Transaction manager (
TransactionManager):Begin/Commit/Abort/AdvanceCommand;SnapshotmirrorsGetSnapshotData(xmin, xmax, Xip, CurCid);GlobalXminfor VACUUM horizon; implementsTransactionOracle
Query execution operators wired to the buffer pool and MVCC layer.
- Node model:
Nodeinterface (Next() (*ScanTuple, error)+Close()); Volcano iterator protocol;Collectdrains any node to a slice - Filter: wraps any
Nodeand discards tuples for which a predicate returns false; mirrors PostgreSQL'sExecQualqual evaluation - Limit: stops the tuple stream after at most N tuples; mirrors
nodeLimit.c - TracedNode: wraps any
Nodeand fires aTracercallback on Open, Tuple, and Close lifecycle events; emitsTraceEvent{NodeName, Phase, Tuple, Seq}; mirrors PostgreSQL'sInstrStartNode / InstrStopNode - Project: applies a
ProjectFnto each tuple;Schema(list of namedColumn{Offset, Len}) extracts fixed-width fields fromDatabytes;Schema.Getlooks up a column by name in a projected tuple; mirrors PostgreSQL'sExecProject/TupleTableSlot/TupleDesc - SeqScan: forward heap scan with per-tuple
HeapTupleSatisfiesMVCCfiltering; VM all-visible fast path; hint-bit writeback - IndexScan: B-tree equality lookup via
SearchAll, thenHeapFetchper TID - HeapFetch:
LP_REDIRECTresolution +t_ctidHOT chain walk with MVCC check - HeapInsert: FSM-guided page selection, relation extension fallback, TOAST threshold check, VM / FSM update
- HeapDelete: stamps
xmaxon the target tuple, clearsHEAP_XMAX_INVALID, clears the page's all-visible VM bit; mirrorsheap_delete()inheapam.c - HeapUpdate: HOT update (same-page new version) with fallback to cross-page;
old-tuple xmax /
HEAP_HOT_UPDATED/t_ctidstamping - Vacuum / VacuumFull: dead-tuple reclaim, HOT chain pruning, tuple freezing, FSM/VM refresh, trailing-page truncation
- Sort: in-memory stable sort of any upstream node; lazy materialization on first
Nextcall; mirrors PostgreSQL'snodeSort.c - HashAgg: hash-based
GROUP BY; each distinct key gets its ownAggFnset; outputs synthetic tuples withData = key || agg₀ || agg₁ …; mirrorsnodeAgg.cwithAGG_HASHEDstrategy - AggFn interface:
Update(*ScanTuple)+Result() []byte; built-insCountAgg,SumAgg,MinAgg,MaxAgg(all 8-byte big-endian int64); decode helpersDecodeCount,DecodeSum,DecodeMin,DecodeMax - NestedLoopJoin: materializes inner side once; for each outer tuple re-scans
the buffer applying a join predicate; user-supplied
combineassembles output rows; mirrors PostgreSQL'sExecNestLoopinnodeNestloop.c - HashJoin: two-phase build/probe - inner side hashed by
buildKeyFninto an in-memory table; outer side probed viaprobeKeyFn; O(inner) build + O(outer) probe; mirrorsExecHashJoininnodeHashjoin.c+nodeHash.c - MergeJoin: sort-merge join; both inputs must be pre-sorted on the join key;
inner group buffered once per key and replayed for each matching outer row;
O(N+M) with no hash table; produces sorted output for free; mirrors
ExecMergeJoininnodeMergejoin.c
- Buffer benchmark harness:
AccessPatterninterface;SequentialPattern,RandomPattern,HotSetPattern;BenchmarkPolicymeasures hit ratio and eviction counts for anyEvictionPolicy - BufferStats: hit/miss/eviction/dirty-eviction counters via
BufferTracer
Isolated benchmarks and correctness proofs.
| Experiment | What it shows |
|---|---|
pin-contention/ |
Atomic CAS vs mutex for pin/unpin under 1000-goroutine contention; CAS is 4.3× faster at 4 cores |
mvcc-isolation/ |
Eight MVCC snapshot isolation scenarios: dirty-read prevention, snapshot consistency, HOT chain visibility, command-counter fence |
buffer-eviction/ |
ClockSweep vs LRU vs ARC under sequential, random, and hot-set workloads; ARC gains +7.6 pp on skewed workloads at a 13–38 % CPU cost |
crash-recovery/ |
End-to-end crash recovery: WAL replay into a fresh pool, commit tracking via logical decoder, MVCC scan - uncommitted tx invisible, committed visible |
concurrent-split/ |
Concurrent B-tree correctness: 4 goroutines × 250 keys with no losses; right-link traversal from a stale leaf reference (pre-split block) finds a key that migrated to the right sibling |
tree-viz/ |
B-tree structural visualizer: renders root / internal / leaf pages, sibling links, and high keys; verifies 5 invariants (sorted-order, high-key-bound, sibling-links, high-key-match, no-incomplete-split) across 5 insert scenarios |
executor-pipeline/ |
Integrated sandbox: heap storage + MVCC + SeqScan → Filter → Project → Limit → TracedNode wired end-to-end; includes EXPLAIN ANALYZE-style trace output and MVCC snapshot isolation verification |
vacuum-autovacuum/ |
Vacuum simulation: heap bloat (LP_NORMAL with committed xmax), regular Vacuum (LP_UNUSED + FSM update), VACUUM FULL (CompactPage), XID freezing, FSM slot reuse, and a threshold-based autovacuum control loop |
group-by/ |
Full GROUP BY pipeline: SeqScan → Filter(region=west) → HashAgg(category, COUNT/SUM/MIN/MAX) → Sort(ORDER BY sum DESC); mirrors nodeAgg.c + nodeSort.c; verifies aggregate correctness and sort ordering |
join/ |
NestedLoopJoin and HashJoin strategies on orders JOIN customers; both produce identical results; also shows JOIN + GROUP BY tier + ORDER BY total_amount DESC; mirrors nodeNestloop.c + nodeHashjoin.c |
merge-join/ |
Completes the join trilogy with MergeJoin on employees JOIN departments; pre-sorted inputs, sorted output for free; TestAllThreeStrategiesAgree runs NL, Hash, and Merge on identical data and verifies they match; mirrors nodeMergejoin.c |
Prerequisites: Go (see go.mod), golangci-lint (optional).
make test # run all tests
make test-race # with -race detector
make bench # run benchmarks
make lint # golangci-lint
make build # compile binariesRun experiments directly:
go test -v ./experiments/mvcc-isolation/
go test -bench=. -benchtime=5s -cpu=1,4,8 ./experiments/pin-contention/Replay a real pg_wal directory:
go run ./cmd/walreplay \
--wal-dir /var/lib/postgresql/16/main/pg_wal \
--timeline 1 \
--start-lsn 0/1000000docker compose build
docker compose run --rm devOr via Makefile:
make docker-test
make docker-benchApache-2.0 (see LICENSE).