TinyLSM is a small Python key-value store built to explore LSM-tree storage ideas in code. It started as a learning project while reading Designing Data-Intensive Applications, and the repo now includes both a local storage engine and a simple replicated HTTP cluster built on top of it.
It is not meant to be a production database. It is meant to be readable, hackable, and useful for learning how the pieces fit together.
- Write-ahead logging
- Mutable and immutable memtables
- Background flush to SSTables
- Bloom filter sidecars
- Sparse index sidecars
- Leveled compaction
- CRC32 checksums on SSTable records
- Atomic manifest writes with os.replace
- Snapshot reads with sequence numbers
- Concurrent reads with a read-write lock
- FastAPI-based multi-node replication with leader election, majority write acknowledgement, follower catch-up, and log snapshots
- Python 3.11 or newer
- Docker and Docker Compose (for cluster mode)
python -m venv .venv
.venv\Scripts\activate
python -m pip install -r requirements.txt
pytest tests
Run the local REPL:
python -m src.main
Run the benchmark script:
python -m src.benchmark
The standalone store writes data files into the current working directory. For a clean run, use an empty folder or clear old sst_*, manifest.json, and WAL files before starting again.
Once python -m src.main is running, these commands are available:
SET key value
GET key
DELETE key
SCAN start_key end_key
STATS
EXIT
Keys and values are currently space-delimited, so the REPL works best with single-token keys and values.
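To illustrate why multi-word values are a problem, here is a minimal sketch of whitespace-based command parsing. The `parse_command` and `validate` helpers and the `ARITY` table are hypothetical names used for illustration, not the actual code in src.main.

```python
# Expected argument counts for the REPL commands listed above.
ARITY = {"SET": 2, "GET": 1, "DELETE": 1, "SCAN": 2, "STATS": 0, "EXIT": 0}


def parse_command(line: str) -> tuple[str, list[str]]:
    """Split a REPL line on whitespace into (COMMAND, args)."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty command")
    return tokens[0].upper(), tokens[1:]


def validate(line: str) -> tuple[str, list[str]]:
    """Reject unknown commands and wrong argument counts."""
    cmd, args = parse_command(line)
    if cmd not in ARITY:
        raise ValueError(f"unknown command: {cmd}")
    if len(args) != ARITY[cmd]:
        raise ValueError(f"{cmd} expects {ARITY[cmd]} argument(s), got {len(args)}")
    return cmd, args
```

With this kind of parser, `SET foo hello world` is seen as three arguments rather than a key and a two-word value, which is why single-token keys and values work best.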
The cluster runs as three containers using Docker Compose. Each container gets its own named volume so data survives restarts.
docker compose up --build
This starts nodes on ports 8000, 8001, and 8002. Nodes discover each other by service name inside the Docker network. Once up, any of the HTTP endpoints below work against any port.
To stop and remove containers (volumes are kept):
docker compose down
Each node runs a FastAPI server from src.cluster.node. To run without Docker, start a 3-node cluster in three terminals:
python -m src.cluster.node 8000 node_data_8000 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002
python -m src.cluster.node 8001 node_data_8001 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002
python -m src.cluster.node 8002 node_data_8002 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002
Arguments:
<port> <data_dir> <leader_url> <comma_separated_node_urls>
Notes:
- The third argument is the node to treat as leader on startup.
- If that leader goes away, the remaining nodes elect a new leader.
- Writes can be sent to any node. Followers forward them to the leader.
- A write succeeds only after a majority of nodes acknowledge it.
- A consistent read is forwarded to the leader.
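The majority rule in the notes above comes down to a small calculation. This is a sketch only: the function names are hypothetical, and whether the leader counts its own append as one acknowledgement is an assumption here.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def write_committed(acks: int, cluster_size: int) -> bool:
    """True once enough nodes have acknowledged the write.

    Assumption: `acks` includes the leader's own acknowledgement.
    """
    return acks >= majority(cluster_size)
```

For the default 3-node cluster the majority is 2, so a write can still succeed with one node down.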
- POST /set with {"key": "foo", "value": "bar"}
- POST /delete with {"key": "foo"}
- GET /get?key=foo
- GET /get?key=foo&consistent=true
- GET /status
- POST /add_node with {"node_url": "http://localhost:8003"}
- POST /remove_node with {"node_url": "http://localhost:8003"}
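A small standard-library client for the endpoints above might look like the following sketch. The helper names are hypothetical, and the assumption that responses are JSON is not confirmed here; check the /status output of a running node for the actual response shape.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # any node works; followers forward writes


def set_request(key: str, value: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /set request."""
    body = json.dumps({"key": key, "value": value}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/set",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_url(key: str, consistent: bool = False, base_url: str = BASE_URL) -> str:
    """Build the GET /get URL, optionally asking for a leader read."""
    params = {"key": key}
    if consistent:
        params["consistent"] = "true"
    return f"{base_url}/get?{urllib.parse.urlencode(params)}"


def send(request_or_url):
    """Send a request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(request_or_url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a cluster running, `send(set_request("foo", "bar"))` followed by `send(get_url("foo", consistent=True))` exercises the write and consistent-read paths.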
Cluster nodes persist their own files inside the data_dir you pass at startup.
Manifests are in k8s/. The cluster runs as a StatefulSet so each pod gets a stable DNS name and its own PersistentVolume.
With minikube:
minikube start
minikube image load tinylsm:latest
kubectl apply -f k8s/
kubectl get pods -w
Pods start in order (tinylsm-0 first, then tinylsm-1, then tinylsm-2) so the initial leader is ready before followers try to sync from it.
Once all pods are running:
curl http://$(minikube ip):30000/status
curl -X POST http://$(minikube ip):30000/set \
-H "Content-Type: application/json" \
-d '{"key":"foo","value":"bar"}'
GitHub Actions runs on every push and pull request to main. The test job runs the full pytest suite. The docker job builds the image and runs a smoke test against a live 3-node cluster.
Configuration is loaded from a .env file in the project root.
| Variable | Default | Purpose |
|---|---|---|
| `LOG_FILE_NAME` | `log_file.txt` | WAL file for the standalone store |
| `MAX_MEMTABLE_SIZE` | `4096` | Flush threshold in bytes |
| `TOMBSTONE_VALUE` | `__TOMBSTONE__` | Delete marker |
| `BLOOM_FALSE_POSITIVE_RATE` | `0.01` | Target false positive rate for bloom filters |
| `SPARSE_INDEX_N` | `4` | Record every Nth key in the sparse index |
| `MAX_L0_FILES` | `2` | Number of L0 files before compaction kicks in |
| `BENCHMARK_N` | `100000` | Number of benchmark operations |
| `WAL_BUFFER_SIZE` | `100` | WAL flush interval in operations |
| `LOG_COMPACTION_THRESHOLD` | `10000` | Cluster log length before snapshotting |
The checked-in defaults are intentionally small so flushes, compactions, and tests happen quickly. For real experiments, you will probably want larger level thresholds.
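As an illustration of how settings with typed defaults can be resolved, here is a minimal sketch. It reads from a mapping such as `os.environ` rather than parsing the .env file itself, and the `DEFAULTS` table and `load_setting` helper are hypothetical, not the project's actual loader.

```python
import os

# A few of the documented defaults; the value type drives the conversion.
DEFAULTS = {
    "MAX_MEMTABLE_SIZE": 4096,
    "SPARSE_INDEX_N": 4,
    "MAX_L0_FILES": 2,
    "BLOOM_FALSE_POSITIVE_RATE": 0.01,
}


def load_setting(name: str, environ=None):
    """Return the override from `environ` if present, else the default.

    The raw string is converted to the same type as the default value.
    """
    environ = os.environ if environ is None else environ
    default = DEFAULTS[name]
    raw = environ.get(name)
    if raw is None:
        return default
    return type(default)(raw)
```

For example, `load_setting("MAX_MEMTABLE_SIZE", {"MAX_MEMTABLE_SIZE": "1048576"})` yields the integer 1048576.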
Every write is appended to the WAL and applied to the active memtable. Once the active memtable grows past MAX_MEMTABLE_SIZE, it is rotated into an immutable memtable, a fresh memtable becomes active immediately, and a background thread flushes the immutable one to a new SSTable.
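The rotation step can be sketched in a few lines. This is a simplified model, not the real engine: the `MemtableSketch` class is hypothetical, the WAL is a plain list, size is approximated as key length plus value length, and the background flush thread is left out.

```python
class MemtableSketch:
    """Toy model of the write path: WAL append, memtable insert, rotation."""

    def __init__(self, max_bytes: int = 4096):
        self.max_bytes = max_bytes
        self.wal = []            # stands in for the on-disk WAL
        self.active = {}         # mutable memtable
        self.active_bytes = 0    # assumed size accounting: len(key) + len(value)
        self.immutable = None    # memtable waiting to be flushed

    def set(self, key: str, value: str) -> None:
        self.wal.append((key, value))      # log first
        self.active[key] = value           # then apply to the memtable
        self.active_bytes += len(key) + len(value)
        if self.active_bytes > self.max_bytes:
            self._rotate()

    def _rotate(self) -> None:
        # The full memtable becomes read-only and a fresh one takes over.
        # A real engine would hand `self.immutable` to a flush thread here.
        self.immutable = self.active
        self.active = {}
        self.active_bytes = 0

    def get(self, key: str):
        # Newest data first: active memtable, then the immutable one.
        if key in self.active:
            return self.active[key]
        if self.immutable is not None and key in self.immutable:
            return self.immutable[key]
        return None
```

Because the fresh memtable is installed in the same step that freezes the old one, writes never wait for the flush to finish.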
Reads check the active memtable first, then the immutable memtable, then SSTables. SSTable lookups use:
- Manifest key ranges to skip unrelated files
- Bloom filters to skip SSTables that cannot contain the key. Each filter is sized at creation time using the number of keys in the SSTable and the configured false positive rate (BLOOM_FALSE_POSITIVE_RATE). Bit count and hash function count are both derived from those two inputs using standard formulas, and both are stored in the .bloom file so the filter can be correctly reconstructed on reload.
- Sparse indexes to seek close to the target key before scanning
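The usual Bloom filter sizing formulas are m = -n ln(p) / (ln 2)^2 for the bit count and k = (m / n) ln 2 for the number of hash functions, where n is the key count and p the target false positive rate. The sketch below applies them; the function name and the rounding choices are assumptions and may differ from what the project does.

```python
import math


def bloom_parameters(num_keys: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bit_count, hash_count) for a Bloom filter.

    Uses the textbook formulas; rounding up the bit count and rounding
    the hash count to the nearest integer are choices made for this sketch.
    """
    if num_keys <= 0:
        raise ValueError("num_keys must be positive")
    if not 0 < false_positive_rate < 1:
        raise ValueError("false_positive_rate must be between 0 and 1")
    bits = math.ceil(-num_keys * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_keys * math.log(2)))
    return bits, hashes
```

With 1000 keys and the default rate of 0.01 this gives roughly 9.6 bits per key and 7 hash functions.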
Each SSTable record stores key seq value checksum. The checksum is verified on read. L0 files may overlap. When enough L0 files build up, they are compacted into the next level along with overlapping files there. During compaction, overwritten versions are dropped and tombstones are removed once older data can no longer resurface from a lower level.
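The per-record checksum can be illustrated with `zlib.crc32`. This is a sketch under assumptions: the space-separated text encoding and the choice to checksum the `key seq value` prefix are guesses for illustration, and the helper names are hypothetical.

```python
import zlib


def encode_record(key: str, seq: int, value: str) -> str:
    """Serialize a record as 'key seq value checksum'."""
    payload = f"{key} {seq} {value}"
    checksum = zlib.crc32(payload.encode("utf-8"))
    return f"{payload} {checksum}"


def decode_record(line: str) -> tuple[str, int, str]:
    """Parse a record and verify its CRC32 before returning it."""
    payload, _, stored = line.rpartition(" ")
    if zlib.crc32(payload.encode("utf-8")) != int(stored):
        raise ValueError("checksum mismatch: record is corrupted")
    key, seq, value = payload.split(" ", 2)
    return key, int(seq), value
```

A record whose bytes were altered on disk then fails loudly on read instead of returning bad data.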
Each write gets a monotonically increasing sequence number. The Python API supports snapshot reads:
store.get("foo", at=seq)
store.scan("a", "z", at=seq)
That lets you read the latest value at or before a specific sequence number.
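One way to picture the `at=seq` semantics is a per-key version history searched by sequence number. The `VersionedValues` class below is a hypothetical in-memory model, not how the engine stores versions across memtables and SSTables.

```python
import bisect


class VersionedValues:
    """Keeps every version of each key, tagged with a global sequence number."""

    def __init__(self):
        self.seq = 0
        self.versions = {}  # key -> list of (seq, value), ascending by seq

    def set(self, key: str, value: str) -> int:
        self.seq += 1
        self.versions.setdefault(key, []).append((self.seq, value))
        return self.seq

    def get(self, key: str, at=None):
        history = self.versions.get(key, [])
        if not history:
            return None
        if at is None:
            return history[-1][1]          # latest version
        seqs = [s for s, _ in history]
        i = bisect.bisect_right(seqs, at)  # versions with seq <= at
        return history[i - 1][1] if i else None
```

A read at a sequence number taken before an overwrite still sees the old value, which is what makes the read a consistent snapshot.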
On startup, the standalone store loads the manifest, then loads bloom filters and sparse indexes for each SSTable (SSTable data itself stays on disk). It restores the sequence counter from the seq file if present. It then replays log_file.txt.flushing if it exists (data from a flush that was in progress when the process last stopped), followed by the regular WAL. The manifest is stored in manifest.json and written atomically through a temporary file plus os.replace.
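The temporary-file-plus-`os.replace` pattern for the manifest can be sketched as follows. The function name and the manifest contents are hypothetical, and the `os.fsync` call is an extra durability step assumed for this sketch rather than something confirmed in the project.

```python
import json
import os


def write_manifest_atomically(manifest: dict,
                              path: str = "manifest.json",
                              tmp_path: str = "manifest.tmp") -> None:
    """Write the manifest so readers see either the old or the new file."""
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())  # assumption: force bytes to disk before the swap
    # os.replace overwrites the destination in one step, so a crash
    # leaves either the previous manifest or the complete new one.
    os.replace(tmp_path, path)
```

A crash before `os.replace` leaves only a stale manifest.tmp behind, and the previous manifest.json stays intact.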
When a flush begins, the current WAL is renamed to log_file.txt.flushing and a fresh WAL is opened immediately so new writes are never blocked. The .flushing file is deleted only after the SSTable is written and the manifest is updated, making the manifest write the commit point for the flush.
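The WAL hand-off around a flush, and the replay order on restart, can be sketched like this. The helper names are hypothetical and the SSTable and manifest writes are left out; only the file renames and deletes described above are modeled.

```python
import os


def begin_flush(wal_path: str) -> str:
    """Rename the current WAL to <wal>.flushing and open a fresh WAL."""
    flushing_path = wal_path + ".flushing"
    os.replace(wal_path, flushing_path)
    open(wal_path, "a", encoding="utf-8").close()  # new writes go here
    return flushing_path


def finish_flush(flushing_path: str) -> None:
    """Call only after the SSTable is written and the manifest is updated."""
    os.remove(flushing_path)


def replay_order(wal_path: str) -> list[str]:
    """Files to replay on startup: the interrupted flush first, then the WAL."""
    candidates = (wal_path + ".flushing", wal_path)
    return [p for p in candidates if os.path.exists(p)]
```

If the process dies between `begin_flush` and `finish_flush`, the .flushing file is still on disk and gets replayed before the regular WAL on the next start.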
The cluster layer wraps the storage engine with a small HTTP service. The protocol is Raft-inspired rather than a full Raft implementation.
- One node acts as leader at a time.
- Followers receive heartbeats from the leader.
- If heartbeats stop, followers start an election after a randomized timeout.
- The leader appends a write to replication.log, sends it to followers in parallel, and applies it to the store only after a majority of nodes acknowledge it.
- Followers that miss entries can catch up through heartbeats or through /sync on startup.
- Once the replication log grows past LOG_COMPACTION_THRESHOLD, the node snapshots state to snapshot.json and truncates the log.
Each node also persists election state in state.json.
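The randomized election timeout mentioned above is what keeps followers from all starting an election at the same moment. A minimal sketch, with hypothetical function names and made-up timing values (the project's actual intervals are not stated here):

```python
import random


def election_timeout(base_seconds: float = 1.5,
                     jitter_seconds: float = 1.5,
                     rng=random) -> float:
    """Pick a timeout in [base, base + jitter]; the values are illustrative."""
    return base_seconds + rng.uniform(0, jitter_seconds)


def should_start_election(now: float, last_heartbeat: float, timeout: float) -> bool:
    """A follower becomes a candidate once heartbeats have been silent too long."""
    return now - last_heartbeat >= timeout
```

Because each follower draws its own timeout, one of them usually times out first and can win the vote before the others become candidates.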
Run:
python -m src.benchmark
The benchmark script creates a temporary store, runs writes, single-threaded reads, 4-thread reads, and misses, then prints ops/sec for your machine.
Example output with certain configurations on my personal computer:
Doing the benchmarks with N=100000, MAX_MEMTABLE_SIZE=1048576, MAX_L0_FILES=8, WAL_BUFFER_SIZE=1000...
Writes: 100000 ops in 0.64s -> 155594 ops/sec
Reads (1 thread): 100000 ops in 0.18s -> 558577 ops/sec
Reads (4 threads): 100000 ops in 0.19s -> 519888 ops/sec
Misses: 100000 ops in 0.19s -> 514599 ops/sec
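The ops/sec figures are operation counts divided by elapsed wall-clock time. A sketch of that kind of measurement, using a plain dict in place of the store (the `measure_ops_per_sec` helper is hypothetical, not the code in src.benchmark):

```python
import time


def measure_ops_per_sec(operation, n: int) -> float:
    """Run operation(i) for i in range(n) and return operations per second."""
    start = time.perf_counter()
    for i in range(n):
        operation(i)
    elapsed = time.perf_counter() - start
    return n / elapsed if elapsed > 0 else float("inf")


# Example: time writes into a dict standing in for the store.
fake_store = {}
rate = measure_ops_per_sec(lambda i: fake_store.__setitem__(f"key{i}", "value"), 1000)
print(f"Writes: 1000 ops -> {rate:.0f} ops/sec")
```

Numbers from a run like this depend heavily on the machine and on the configured thresholds, so compare results only across runs with the same settings.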
Run the full test suite with:
pytest tests
The tests cover:
- Basic set, get, delete, scan, and iteration
- WAL replay and restart behavior
- Compaction and tombstone handling
- Snapshot reads
- Checksum and manifest durability cases
- Concurrent reads
- Cluster replication, forwarding, elections, restart recovery, snapshots, and membership changes
Standalone store files:
- log_file.txt
- log_file.txt.flushing
- seq
- seq.tmp
- manifest.json
- manifest.tmp
- sst_<n>
- sst_<n>.index
- sst_<n>.bloom
Cluster node files:
- replication.log
- snapshot.json
- snapshot.tmp
- state.json
- state.tmp
- Keys and values are treated as plain strings.
- The REPL and WAL format do not safely encode values with spaces.
- The cluster protocol is intentionally small and simplified.
- There is no authentication, encryption, or production hardening.