TinyLSM

TinyLSM is a small Python key-value store built to explore LSM-tree storage ideas in code. It started as a learning project while reading Designing Data-Intensive Applications, and the repo now includes both a local storage engine and a simple replicated HTTP cluster built on top of it.

It is not meant to be a production database. It is meant to be readable, hackable, and useful for learning how the pieces fit together.

What is implemented

Write-ahead logging
Mutable and immutable memtables
Background flush to SSTables
Bloom filter sidecars
Sparse index sidecars
Leveled compaction
CRC32 checksums on SSTable records
Atomic manifest writes with os.replace
Snapshot reads with sequence numbers
Concurrent reads with a read-write lock
FastAPI-based multi-node replication with leader election, majority write acknowledgement, follower catch-up, and log snapshots

Requirements

Python 3.11 or newer
Docker and Docker Compose (for cluster mode)

Quick Start

python -m venv .venv
.venv\Scripts\activate
python -m pip install -r requirements.txt
pytest tests

Run the local REPL:

python -m src.main

Run the benchmark script:

python -m src.benchmark

The standalone store writes data files into the current working directory. For a clean run, use an empty folder or clear old sst_*, manifest.json, and WAL files before starting again.

REPL Commands

Once python -m src.main is running, these commands are available:

SET key value
GET key
DELETE key
SCAN start_key end_key
STATS
EXIT

Keys and values are currently space-delimited, so the REPL works best with single-token keys and values.

Docker

The cluster runs as three containers using Docker Compose. Each container gets its own named volume so data survives restarts.

docker compose up --build

This starts nodes on ports 8000, 8001, and 8002. Nodes discover each other by service name inside the Docker network. Once up, any of the HTTP endpoints below work against any port.

To stop and remove containers (volumes are kept):

docker compose down

Cluster Mode

Each node runs a FastAPI server from src.cluster.node. To run without Docker, start a 3-node cluster in three terminals:

python -m src.cluster.node 8000 node_data_8000 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002
python -m src.cluster.node 8001 node_data_8001 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002
python -m src.cluster.node 8002 node_data_8002 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002

Arguments:

<port> <data_dir> <leader_url> <comma_separated_node_urls>

Notes:

The third argument is the node to treat as leader on startup.
If that leader goes away, the remaining nodes elect a new leader.
Writes can be sent to any node. Followers forward them to the leader.
A write succeeds only after a majority of nodes acknowledge it.
A consistent read is forwarded to the leader.

HTTP Endpoints

POST /set with {"key": "foo", "value": "bar"}
POST /delete with {"key": "foo"}
GET /get?key=foo
GET /get?key=foo&consistent=true
GET /status
POST /add_node with {"node_url": "http://localhost:8003"}
POST /remove_node with {"node_url": "http://localhost:8003"}

Cluster nodes persist their own files inside the data_dir you pass at startup.

Kubernetes

Manifests are in k8s/. The cluster runs as a StatefulSet so each pod gets a stable DNS name and its own PersistentVolume.

With minikube:

minikube start
minikube image load tinylsm:latest
kubectl apply -f k8s/
kubectl get pods -w

Pods start in order (tinylsm-0 first, then tinylsm-1, then tinylsm-2) so the initial leader is ready before followers try to sync from it.

Once all pods are running:

curl http://$(minikube ip):30000/status
curl -X POST http://$(minikube ip):30000/set \
  -H "Content-Type: application/json" \
  -d '{"key":"foo","value":"bar"}'

CI

GitHub Actions runs on every push and pull request to main. The test job runs the full pytest suite. The docker job builds the image and runs a smoke test against a live 3-node cluster.

Configuration

Configuration is loaded from a .env file in the project root.

Variable	Default	Purpose
`LOG_FILE_NAME`	`log_file.txt`	WAL file for the standalone store
`MAX_MEMTABLE_SIZE`	`4096`	Flush threshold in bytes
`TOMBSTONE_VALUE`	`__TOMBSTONE__`	Delete marker
`BLOOM_FALSE_POSITIVE_RATE`	`0.01`	Target false positive rate for bloom filters
`SPARSE_INDEX_N`	`4`	Record every Nth key in the sparse index
`MAX_L0_FILES`	`2`	Number of L0 files before compaction kicks in
`BENCHMARK_N`	`100000`	Number of benchmark operations
`WAL_BUFFER_SIZE`	`100`	WAL flush interval in operations
`LOG_COMPACTION_THRESHOLD`	`10000`	Cluster log length before snapshotting

The checked-in defaults are intentionally small so flushes, compactions, and tests happen quickly. For real experiments, you will probably want larger level thresholds.

How the Store Works

Write Path

Every write is appended to the WAL and applied to the active memtable. Once the active memtable grows past MAX_MEMTABLE_SIZE, it is rotated into an immutable memtable, a fresh memtable becomes active immediately, and a background thread flushes the immutable one to a new SSTable.

Read Path

Reads check the active memtable first, then the immutable memtable, then SSTables. SSTable lookups use:

Manifest key ranges to skip unrelated files
Bloom filters to skip SSTables that cannot contain the key. Each filter is sized at creation time using the number of keys in the SSTable and the configured false positive rate (BLOOM_FALSE_POSITIVE_RATE). Bit count and hash function count are both derived from those two inputs using standard formulas, and both are stored in the .bloom file so the filter can be correctly reconstructed on reload.
Sparse indexes to seek close to the target key before scanning

SSTables and Compaction

Each SSTable record stores key seq value checksum. The checksum is verified on read. L0 files may overlap. When enough L0 files build up, they are compacted into the next level along with overlapping files there. During compaction, overwritten versions are dropped and tombstones are removed once older data can no longer resurface from a lower level.

Snapshot Reads

Each write gets a monotonically increasing sequence number. The Python API supports snapshot reads:

store.get("foo", at=seq)
store.scan("a", "z", at=seq)

That lets you read the latest value at or before a specific sequence number.

Startup and Recovery

On startup, the standalone store loads the manifest, then loads bloom filters and sparse indexes for each SSTable (SSTable data itself stays on disk). It restores the sequence counter from the seq file if present. It then replays log_file.txt.flushing if it exists (data from a flush that was in progress when the process last stopped), followed by the regular WAL. The manifest is stored in manifest.json and written atomically through a temporary file plus os.replace.

When a flush begins, the current WAL is renamed to log_file.txt.flushing and a fresh WAL is opened immediately so new writes are never blocked. The .flushing file is deleted only after the SSTable is written and the manifest is updated, making the manifest write the commit point for the flush.

How the Cluster Works

The cluster layer wraps the storage engine with a small HTTP service. The protocol is Raft-inspired rather than a full Raft implementation.

One node acts as leader at a time.
Followers receive heartbeats from the leader.
If heartbeats stop, followers start an election after a randomized timeout.
The leader appends a write to replication.log, sends it to followers in parallel, and applies it to the store only after a majority of nodes acknowledge it.
Followers that miss entries can catch up through heartbeats or through /sync on startup.
Once the replication log grows past LOG_COMPACTION_THRESHOLD, the node snapshots state to snapshot.json and truncates the log.

Each node also persists election state in state.json.

Benchmarks

Run:

python -m src.benchmark

The benchmark script creates a temporary store, runs writes, single-threaded reads, 4-thread reads, and misses, then prints ops/sec for your machine.

Example output with certain configurations on my personal computer:

Doing the benchmarks with N=100000, MAX_MEMTABLE_SIZE=1048576, MAX_L0_FILES=8, WAL_BUFFER_SIZE=1000...
Writes: 100000 ops in 0.64s -> 155594 ops/sec
Reads (1 thread):  100000 ops in 0.18s -> 558577 ops/sec
Reads (4 threads): 100000 ops in 0.19s -> 519888 ops/sec
Misses: 100000 ops in 0.19s -> 514599 ops/sec

Tests

Run the full test suite with:

pytest tests

The tests cover:

Basic set, get, delete, scan, and iteration
WAL replay and restart behavior
Compaction and tombstone handling
Snapshot reads
Checksum and manifest durability cases
Concurrent reads
Cluster replication, forwarding, elections, restart recovery, snapshots, and membership changes

Files You Will See

Standalone store files:

log_file.txt
log_file.txt.flushing
seq
seq.tmp
manifest.json
manifest.tmp
sst_<n>
sst_<n>.index
sst_<n>.bloom

Cluster node files:

replication.log
snapshot.json
snapshot.tmp
state.json
state.tmp

Limitations

Keys and values are treated as plain strings.
The REPL and WAL format do not safely encode values with spaces.
The cluster protocol is intentionally small and simplified.
There is no authentication, encryption, or production hardening.

Name		Name	Last commit message	Last commit date
Latest commit History 396 Commits
.github/workflows		.github/workflows
k8s		k8s
src		src
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TinyLSM

What is implemented

Requirements

Quick Start

REPL Commands

Docker

Cluster Mode

HTTP Endpoints

Kubernetes

CI

Configuration

How the Store Works

Write Path

Read Path

SSTables and Compaction

Snapshot Reads

Startup and Recovery

How the Cluster Works

Benchmarks

Tests

Files You Will See

Limitations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TinyLSM

What is implemented

Requirements

Quick Start

REPL Commands

Docker

Cluster Mode

HTTP Endpoints

Kubernetes

CI

Configuration

How the Store Works

Write Path

Read Path

SSTables and Compaction

Snapshot Reads

Startup and Recovery

How the Cluster Works

Benchmarks

Tests

Files You Will See

Limitations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages