Xander-Morris/TinyLSM

TinyLSM

TinyLSM is a small Python key-value store built to explore LSM-tree storage ideas in code. It started as a learning project while reading Designing Data-Intensive Applications, and the repo now includes both a local storage engine and a simple replicated HTTP cluster built on top of it.

It is not meant to be a production database. It is meant to be readable, hackable, and useful for learning how the pieces fit together.

What is implemented

  • Write-ahead logging
  • Mutable and immutable memtables
  • Background flush to SSTables
  • Bloom filter sidecars
  • Sparse index sidecars
  • Leveled compaction
  • CRC32 checksums on SSTable records
  • Atomic manifest writes with os.replace
  • Snapshot reads with sequence numbers
  • Concurrent reads with a read-write lock
  • FastAPI-based multi-node replication with leader election, majority write acknowledgement, follower catch-up, and log snapshots

Requirements

  • Python 3.11 or newer
  • Docker and Docker Compose (for cluster mode)

Quick Start

python -m venv .venv
.venv\Scripts\activate
python -m pip install -r requirements.txt
pytest tests

The activation command above is for Windows. On macOS or Linux, use source .venv/bin/activate instead.

Run the local REPL:

python -m src.main

Run the benchmark script:

python -m src.benchmark

The standalone store writes data files into the current working directory. For a clean run, use an empty folder or clear old sst_*, manifest.json, and WAL files before starting again.

REPL Commands

Once python -m src.main is running, these commands are available:

SET key value
GET key
DELETE key
SCAN start_key end_key
STATS
EXIT

Keys and values are currently space-delimited, so the REPL works best with single-token keys and values.
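
To see why multi-token values are a problem, here is a minimal sketch of whitespace-based command parsing. parse_command is a hypothetical helper written for illustration; it is not the function used in src.main.

```python
def parse_command(line: str) -> tuple[str, list[str]]:
    """Split a REPL line into an upper-cased command name and its arguments."""
    parts = line.split()  # splits on any run of whitespace
    if not parts:
        return "", []
    return parts[0].upper(), parts[1:]


cmd, args = parse_command("SET greeting hello world")
# args holds three tokens here, so the parser cannot tell that
# "hello world" was meant to be a single value.
print(cmd, args)
```

A format that quotes or length-prefixes values would avoid the ambiguity, at the cost of a slightly more involved parser.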

Docker

The cluster runs as three containers using Docker Compose. Each container gets its own named volume so data survives restarts.

docker compose up --build

This starts nodes on ports 8000, 8001, and 8002. Nodes discover each other by service name inside the Docker network. Once the cluster is up, every HTTP endpoint listed below works against any of the three ports.

To stop and remove containers (volumes are kept):

docker compose down

Cluster Mode

Each node runs a FastAPI server from src.cluster.node. To run without Docker, start a 3-node cluster in three terminals:

python -m src.cluster.node 8000 node_data_8000 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002
python -m src.cluster.node 8001 node_data_8001 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002
python -m src.cluster.node 8002 node_data_8002 http://localhost:8000 http://localhost:8000,http://localhost:8001,http://localhost:8002

Arguments:

<port> <data_dir> <leader_url> <comma_separated_node_urls>

Notes:

  • The third argument is the node to treat as leader on startup.
  • If that leader goes away, the remaining nodes elect a new leader.
  • Writes can be sent to any node. Followers forward them to the leader.
  • A write succeeds only after a majority of nodes acknowledge it.
  • A consistent read is forwarded to the leader.

HTTP Endpoints

  • POST /set with {"key": "foo", "value": "bar"}
  • POST /delete with {"key": "foo"}
  • GET /get?key=foo
  • GET /get?key=foo&consistent=true
  • GET /status
  • POST /add_node with {"node_url": "http://localhost:8003"}
  • POST /remove_node with {"node_url": "http://localhost:8003"}

Cluster nodes persist their own files inside the data_dir you pass at startup.
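
The endpoints above can also be called from Python using only the standard library. This is a rough sketch: the base URL assumes a node is listening on port 8000, the helper names are made up for this example, and the assumption that responses are JSON is not confirmed by this README.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: a node is running here


def build_set_request(key: str, value: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build POST /set with a JSON body of the form {"key": ..., "value": ...}."""
    body = json.dumps({"key": key, "value": value}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/set",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_get_request(key: str, consistent: bool = False, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build GET /get?key=..., optionally asking for a consistent read."""
    params = {"key": key}
    if consistent:
        params["consistent"] = "true"
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(f"{base_url}/get?{query}", method="GET")


def send(request: urllib.request.Request, timeout: float = 5.0):
    """Send a request and decode the body, assuming the node replies with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a cluster running, send(build_set_request("foo", "bar")) followed by send(build_get_request("foo", consistent=True)) would exercise a write and a leader-forwarded read.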

Kubernetes

Manifests are in k8s/. The cluster runs as a StatefulSet so each pod gets a stable DNS name and its own PersistentVolume.

With minikube:

minikube start
minikube image load tinylsm:latest
kubectl apply -f k8s/
kubectl get pods -w

Pods start in order (tinylsm-0 first, then tinylsm-1, then tinylsm-2) so the initial leader is ready before followers try to sync from it.

Once all pods are running:

curl http://$(minikube ip):30000/status
curl -X POST http://$(minikube ip):30000/set \
  -H "Content-Type: application/json" \
  -d '{"key":"foo","value":"bar"}'

CI

GitHub Actions runs on every push and pull request to main. The test job runs the full pytest suite. The docker job builds the image and runs a smoke test against a live 3-node cluster.

Configuration

Configuration is loaded from a .env file in the project root.

Variable                    Default         Purpose
LOG_FILE_NAME               log_file.txt    WAL file for the standalone store
MAX_MEMTABLE_SIZE           4096            Flush threshold in bytes
TOMBSTONE_VALUE             __TOMBSTONE__   Delete marker
BLOOM_FALSE_POSITIVE_RATE   0.01            Target false positive rate for bloom filters
SPARSE_INDEX_N              4               Record every Nth key in the sparse index
MAX_L0_FILES                2               Number of L0 files before compaction kicks in
BENCHMARK_N                 100000          Number of benchmark operations
WAL_BUFFER_SIZE             100             WAL flush interval in operations
LOG_COMPACTION_THRESHOLD    10000           Cluster log length before snapshotting

The checked-in defaults are intentionally small so flushes, compactions, and tests happen quickly. For real experiments, you will probably want larger values, such as the MAX_MEMTABLE_SIZE=1048576 and MAX_L0_FILES=8 used in the benchmark example.
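
As an illustration of how KEY=VALUE settings can override defaults, here is a small stdlib-only sketch. The project may well use a library for this; parse_env and the trimmed DEFAULTS dictionary are assumptions made for the example.

```python
# A subset of the documented defaults, kept as strings like .env values.
DEFAULTS = {
    "MAX_MEMTABLE_SIZE": "4096",
    "MAX_L0_FILES": "2",
    "WAL_BUFFER_SIZE": "100",
}


def parse_env(text: str, defaults: dict[str, str] = DEFAULTS) -> dict[str, str]:
    """Overlay KEY=VALUE lines from .env-style text on top of the defaults."""
    settings = dict(defaults)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


example = "MAX_MEMTABLE_SIZE=1048576\n# larger L0 budget\nMAX_L0_FILES=8\n"
settings = parse_env(example)
print(int(settings["MAX_MEMTABLE_SIZE"]), int(settings["MAX_L0_FILES"]))
```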

How the Store Works

Write Path

Every write is appended to the WAL and applied to the active memtable. Once the active memtable grows past MAX_MEMTABLE_SIZE, it is rotated into an immutable memtable, a fresh memtable becomes active immediately, and a background thread flushes the immutable one to a new SSTable.
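
The rotation step can be sketched in a few lines. This is a toy model, not the code in this repo: TinyWritePath is a hypothetical class, the WAL and SSTables are in-memory lists, a single writer thread is assumed, and size accounting simply adds key and value lengths.

```python
import threading


class TinyWritePath:
    """Toy write path: WAL append, memtable insert, rotate, background flush."""

    def __init__(self, max_memtable_size: int = 4096):
        self.max_memtable_size = max_memtable_size
        self.wal = []          # stand-in for the on-disk WAL
        self.active = {}       # active (mutable) memtable
        self.active_bytes = 0
        self.immutable = None  # memtable currently being flushed, if any
        self.sstables = []     # each flush appends one sorted list of pairs
        self._flush_thread = None

    def set(self, key: str, value: str) -> None:
        self.wal.append((key, value))  # log first, then apply
        self.active[key] = value
        self.active_bytes += len(key) + len(value)
        if self.active_bytes > self.max_memtable_size:
            self._rotate()

    def _rotate(self) -> None:
        # Only one immutable memtable exists at a time, so wait for any
        # flush that is still running before rotating again.
        self.wait_for_flush()
        self.immutable = self.active
        self.active, self.active_bytes = {}, 0  # fresh memtable, writes continue
        self._flush_thread = threading.Thread(target=self._flush, args=(self.immutable,))
        self._flush_thread.start()

    def _flush(self, memtable: dict) -> None:
        self.sstables.append(sorted(memtable.items()))  # SSTables are key-sorted
        self.immutable = None

    def wait_for_flush(self) -> None:
        if self._flush_thread is not None:
            self._flush_thread.join()
```

The point of the immutable memtable is that readers can still see its contents while the flush is running, and writers never wait for disk I/O unless a second rotation arrives before the first flush finishes.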

Read Path

Reads check the active memtable first, then the immutable memtable, then SSTables. SSTable lookups use:

  • Manifest key ranges to skip unrelated files
  • Bloom filters to skip SSTables that cannot contain the key. Each filter is sized at creation time using the number of keys in the SSTable and the configured false positive rate (BLOOM_FALSE_POSITIVE_RATE). Bit count and hash function count are both derived from those two inputs using standard formulas, and both are stored in the .bloom file so the filter can be correctly reconstructed on reload.
  • Sparse indexes to seek close to the target key before scanning
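
The "standard formulas" for bloom filter sizing mentioned above are usually written as m = -n ln(p) / (ln 2)^2 for the bit count and k = (m / n) ln 2 for the number of hash functions. The sketch below applies them; bloom_parameters is a hypothetical name, and the rounding choices are assumptions rather than a description of this repo's code.

```python
import math


def bloom_parameters(num_keys: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bit_count, hash_count) for a target false positive rate."""
    if num_keys <= 0:
        raise ValueError("num_keys must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    # m = -n * ln(p) / (ln 2)^2, rounded up so the target rate is not exceeded.
    bit_count = math.ceil(-num_keys * math.log(false_positive_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, with at least one hash function.
    hash_count = max(1, round(bit_count / num_keys * math.log(2)))
    return bit_count, hash_count


bits, hashes = bloom_parameters(num_keys=1000, false_positive_rate=0.01)
print(f"{bits} bits ({bits / 1000:.1f} per key), {hashes} hash functions")
```

At a 1% target rate this works out to a little under 10 bits per key, which is why storing both numbers in the sidecar file is cheap compared to the SSTable itself.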

SSTables and Compaction

Each SSTable record stores four fields: key, seq, value, and a CRC32 checksum, which is verified on read. L0 files may overlap in key range. When enough L0 files build up (MAX_L0_FILES), they are compacted into the next level along with any overlapping files there. During compaction, overwritten versions are dropped, and tombstones are removed once older data can no longer resurface from a lower level.
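
To make the checksum step concrete, here is one way a record could be encoded and verified with zlib.crc32. The tab-separated text layout is an assumption chosen for readability; the actual on-disk format in this repo may differ.

```python
import zlib


def encode_record(key: str, seq: int, value: str) -> str:
    """Serialize a record as key, seq, value, checksum (tab-separated; assumed layout)."""
    payload = f"{key}\t{seq}\t{value}"
    checksum = zlib.crc32(payload.encode("utf-8"))
    return f"{payload}\t{checksum}"


def decode_record(line: str) -> tuple[str, int, str]:
    """Parse a record and raise ValueError if the stored checksum does not match."""
    payload, _, checksum_text = line.rpartition("\t")
    if zlib.crc32(payload.encode("utf-8")) != int(checksum_text):
        raise ValueError("checksum mismatch: record is corrupt")
    key, seq_text, value = payload.split("\t", 2)
    return key, int(seq_text), value


record = encode_record("foo", 7, "bar")
print(decode_record(record))
```

A CRC catches accidental corruption such as torn writes or bit flips. It is not a defense against deliberate tampering.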

Snapshot Reads

Each write gets a monotonically increasing sequence number. The Python API supports snapshot reads:

store.get("foo", at=seq)
store.scan("a", "z", at=seq)

That lets you read the latest value at or before a specific sequence number.

Startup and Recovery

On startup, the standalone store loads the manifest, then loads bloom filters and sparse indexes for each SSTable (SSTable data itself stays on disk). It restores the sequence counter from the seq file if present. It then replays log_file.txt.flushing if it exists (data from a flush that was in progress when the process last stopped), followed by the regular WAL. The manifest is stored in manifest.json and written atomically through a temporary file plus os.replace.

When a flush begins, the current WAL is renamed to log_file.txt.flushing and a fresh WAL is opened immediately so new writes are never blocked. The .flushing file is deleted only after the SSTable is written and the manifest is updated, making the manifest write the commit point for the flush.
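
The temporary-file-plus-os.replace pattern used for the manifest looks roughly like this. The function name and the fsync call are assumptions for the sketch; the README only confirms that a temporary file and os.replace are involved.

```python
import json
import os


def write_manifest_atomically(path: str, manifest: dict) -> None:
    """Write JSON to a sibling .tmp file, then swap it into place with os.replace."""
    tmp_path = os.path.splitext(path)[0] + ".tmp"  # manifest.json -> manifest.tmp
    with open(tmp_path, "w", encoding="utf-8") as tmp_file:
        json.dump(manifest, tmp_file)
        tmp_file.flush()
        os.fsync(tmp_file.fileno())  # push the bytes to disk before the rename
    # os.replace overwrites the destination in one step, so a reader sees
    # either the old manifest or the new one, never a half-written file.
    os.replace(tmp_path, path)
```

If the process dies before os.replace runs, the old manifest.json is untouched and the leftover manifest.tmp can simply be ignored or overwritten on the next write.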

How the Cluster Works

The cluster layer wraps the storage engine with a small HTTP service. The protocol is Raft-inspired rather than a full Raft implementation.

  • One node acts as leader at a time.
  • Followers receive heartbeats from the leader.
  • If heartbeats stop, followers start an election after a randomized timeout.
  • The leader appends a write to replication.log, sends it to followers in parallel, and applies it to the store only after a majority of nodes acknowledge it.
  • Followers that miss entries can catch up through heartbeats or through /sync on startup.
  • Once the replication log grows past LOG_COMPACTION_THRESHOLD, the node snapshots state to snapshot.json and truncates the log.

Each node also persists election state in state.json.
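
Two small pieces of the protocol above are easy to show in isolation: the majority rule and the randomized election timeout. The function names, the timeout values, and the choice to count the leader's own log append as one acknowledgement are all assumptions for this sketch.

```python
import random


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def is_committed(ack_count: int, cluster_size: int) -> bool:
    """True once enough nodes (leader included, by assumption) have acknowledged."""
    return ack_count >= majority(cluster_size)


def election_timeout(base_seconds: float = 1.5, jitter_seconds: float = 1.5) -> float:
    """Pick a timeout in [base, base + jitter] so followers rarely time out together."""
    return base_seconds + random.uniform(0.0, jitter_seconds)


print(majority(3), is_committed(1, 3), is_committed(2, 3))
```

In a 3-node cluster the majority is 2, so a write can still succeed with one node down. The random jitter makes it unlikely that two followers start competing elections at the same moment.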

Benchmarks

Run:

python -m src.benchmark

The benchmark script creates a temporary store, runs writes, single-threaded reads, 4-thread reads, and misses, then prints ops/sec for your machine.

Example output from the author's personal computer, using the configuration printed on the first line:

Doing the benchmarks with N=100000, MAX_MEMTABLE_SIZE=1048576, MAX_L0_FILES=8, WAL_BUFFER_SIZE=1000...
Writes: 100000 ops in 0.64s -> 155594 ops/sec
Reads (1 thread):  100000 ops in 0.18s -> 558577 ops/sec
Reads (4 threads): 100000 ops in 0.19s -> 519888 ops/sec
Misses: 100000 ops in 0.19s -> 514599 ops/sec

Tests

Run the full test suite with:

pytest tests

The tests cover:

  • Basic set, get, delete, scan, and iteration
  • WAL replay and restart behavior
  • Compaction and tombstone handling
  • Snapshot reads
  • Checksum and manifest durability cases
  • Concurrent reads
  • Cluster replication, forwarding, elections, restart recovery, snapshots, and membership changes

Files You Will See

Standalone store files:

  • log_file.txt
  • log_file.txt.flushing
  • seq
  • seq.tmp
  • manifest.json
  • manifest.tmp
  • sst_<n>
  • sst_<n>.index
  • sst_<n>.bloom

Cluster node files:

  • replication.log
  • snapshot.json
  • snapshot.tmp
  • state.json
  • state.tmp

Limitations

  • Keys and values are treated as plain strings.
  • The REPL and WAL format do not safely encode values with spaces.
  • The cluster protocol is intentionally small and simplified.
  • There is no authentication, encryption, or production hardening.
