WAL and Compaction

Anup Ghatage edited this page Feb 12, 2026 · 1 revision

Write-Ahead Log (WAL)

Every write (upsert or delete) goes through the WAL before being indexed. WAL fragments are immutable JSON files stored on S3.

Fragment Format

Each fragment is a JSON file with an xxHash checksum:

{
  "id": "01HXYZ...",
  "namespace": "my_namespace",
  "vectors": [
    {
      "id": "vec-1",
      "values": [0.1, 0.2, 0.3],
      "attributes": {"color": "red"}
    }
  ],
  "deletes": ["vec-old-1", "vec-old-2"],
  "checksum": 12345678901234,
  "created_at": "2026-01-15T10:00:00Z"
}
  • S3 key: <namespace>/wal/<ulid>.fragment.json
  • ID: ULID (time-sortable, unique)
  • Checksum: xxHash-64 over the serialized vectors + deletes
  • Immutable: Never modified after write
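
As a sketch of this write path, a fragment can be assembled and checksummed before the PUT roughly as below. `build_fragment` and `s3_key` are illustrative names (not the project's API), and `zlib.crc32` stands in for xxHash-64, which is not in the Python standard library; per the format above, the real checksum is xxHash-64 over the serialized vectors + deletes.

```python
import json
import zlib


def build_fragment(fragment_id, namespace, vectors, deletes, created_at):
    """Assemble an immutable WAL fragment ready for a PUT to S3.

    The checksum covers the serialized vectors + deletes, as the
    fragment format describes. zlib.crc32 is a stand-in for xxHash-64.
    """
    payload = json.dumps({"vectors": vectors, "deletes": deletes},
                         sort_keys=True).encode()
    return {
        "id": fragment_id,
        "namespace": namespace,
        "vectors": vectors,
        "deletes": deletes,
        "checksum": zlib.crc32(payload),
        "created_at": created_at,
    }


def s3_key(namespace, fragment_id):
    # Key layout from above: <namespace>/wal/<ulid>.fragment.json
    return f"{namespace}/wal/{fragment_id}.fragment.json"
```

Because fragment IDs are ULIDs, sorting keys lexicographically also sorts fragments by creation time, which is what later merge steps rely on.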

Fragment Lifecycle

  1. Write: Client upsert/delete → create fragment → PUT to S3
  2. Read: Strong-consistency queries scan all uncompacted fragments
  3. Compact: Compaction merges fragments into an indexed segment
  4. Delete: Compacted fragments are removed after successful CAS

Manifest

The manifest (manifest.json) is the authoritative record of what data exists in a namespace. It tracks WAL fragments, segments, pending deletes, and the fencing token.

Structure

{
  "fragments": [
    {
      "id": "01HXYZ...",
      "key": "my_ns/wal/01HXYZ.fragment.json",
      "vector_count": 100,
      "delete_count": 2,
      "created_at": "2026-01-15T10:00:00Z"
    }
  ],
  "segments": [
    {
      "id": "seg-abc",
      "key_prefix": "my_ns/segments/seg-abc/",
      "vector_count": 5000,
      "centroid_count": 32,
      "created_at": "2026-01-15T12:00:00Z",
      "quantization": null,
      "bitmap_fields": ["color", "price"],
      "fts_fields": ["content"]
    }
  ],
  "pending_deletes": ["vec-old-1"],
  "fencing_token": 42,
  "updated_at": "2026-01-15T12:00:00Z"
}

Key Fields

Field            Description
fragments        List of FragmentRef — uncompacted WAL fragments
segments         List of SegmentRef — indexed segments
pending_deletes  Vector IDs to exclude from query results
fencing_token    Monotonic counter for the multi-writer lease protocol
updated_at       Last modification timestamp

SegmentRef Fields

Field            Description
id               Segment identifier
key_prefix       S3 key prefix for all segment artifacts
vector_count     Total vectors in this segment
centroid_count   Number of IVF centroids
quantization     Quantization type (null, "scalar", or "product")
bitmap_fields    Fields with bitmap indexes
fts_fields       Fields with inverted indexes
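
The manifest shape can be sketched as plain data classes. Field names mirror the JSON above, but these are illustrative Python stand-ins; the codebase's actual FragmentRef and SegmentRef types may differ in detail.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FragmentRef:
    """An uncompacted WAL fragment tracked by the manifest."""
    id: str
    key: str
    vector_count: int
    delete_count: int
    created_at: str


@dataclass
class SegmentRef:
    """An indexed segment; quantization is None, "scalar", or "product"."""
    id: str
    key_prefix: str
    vector_count: int
    centroid_count: int
    created_at: str
    quantization: Optional[str] = None
    bitmap_fields: list = field(default_factory=list)
    fts_fields: list = field(default_factory=list)


@dataclass
class Manifest:
    """Authoritative record of a namespace's data."""
    fragments: list = field(default_factory=list)
    segments: list = field(default_factory=list)
    pending_deletes: list = field(default_factory=list)
    fencing_token: int = 0
    updated_at: str = ""
```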

CAS (Compare-and-Swap)

The manifest is updated atomically using ETag-based conditional PUTs. This prevents concurrent writers from corrupting the manifest.

1. GET manifest.json → (data, etag)
2. Modify data
3. PUT manifest.json with If-Match: etag
   → Success: commit
   → 412 Precondition Failed: retry from step 1

This requires S3ConditionalPut::ETagMatch to be enabled in the object_store builder.
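
The retry loop can be sketched against an in-memory stand-in for S3. `FakeObjectStore`, `PreconditionFailed`, and `cas_update` are illustrative names invented for this sketch; the real implementation goes through the Rust object_store crate's conditional PUT support.

```python
import json


class PreconditionFailed(Exception):
    """Stands in for S3's 412 response to a failed If-Match."""


class FakeObjectStore:
    """In-memory stand-in for a bucket with ETag-conditional PUTs."""

    def __init__(self):
        self._data, self._etags, self._version = {}, {}, 0

    def get(self, key):
        return self._data[key], self._etags[key]

    def put(self, key, body):
        self._version += 1
        self._data[key], self._etags[key] = body, f"v{self._version}"

    def put_if_match(self, key, body, etag):
        if self._etags.get(key) != etag:
            raise PreconditionFailed(key)
        self.put(key, body)


def cas_update(store, key, mutate, max_retries=10):
    """GET -> modify -> conditional PUT, retrying from the GET on a 412."""
    for _ in range(max_retries):
        body, etag = store.get(key)
        manifest = json.loads(body)
        mutate(manifest)
        try:
            store.put_if_match(key, json.dumps(manifest), etag)
            return manifest
        except PreconditionFailed:
            continue  # another writer committed first; reload and retry
    raise RuntimeError("CAS retries exhausted")
```

The key property is that a lost race restarts from the GET, so the losing writer re-applies its change on top of the winner's committed manifest rather than overwriting it.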

Compaction Pipeline

Compaction merges WAL fragments into indexed segments. It runs on a configurable interval (default: 30 seconds).

Steps

  1. Read manifest — Load current fragments and segments
  2. Acquire lease — Obtain fencing token for exclusive write access
  3. Load fragments — Download and deserialize all WAL fragments
  4. Merge data — Combine fragment vectors with existing segment data
  5. Apply deletes — Remove vectors in the delete set
  6. Train centroids — Run k-means to find cluster centers
  7. Assign vectors — Assign each vector to its nearest centroid
  8. Write artifacts — Upload cluster data, bitmaps, inverted indexes to S3
  9. CAS manifest — Atomically update manifest: add new SegmentRef, clear processed FragmentRefs
  10. Deferred deletion — Delete old segment artifacts and compacted fragment files
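
Steps 3-5 (load, merge, apply deletes) can be sketched as a pure function over the fragment JSON shown earlier. This is a simplification invented for illustration; it assumes fragments arrive sorted by ULID so that later writes win.

```python
def merge_fragments(fragments, existing_vectors, pending_deletes):
    """Merge fragment data over existing segment vectors, then drop
    everything in the delete set.

    `fragments` must be sorted by ULID (i.e. by creation time); each is
    a dict shaped like the fragment JSON above. `existing_vectors` maps
    vector id -> record.
    """
    merged = dict(existing_vectors)
    deletes = set(pending_deletes)
    for frag in fragments:
        for vec in frag["vectors"]:
            merged[vec["id"]] = vec   # later fragment wins
            deletes.discard(vec["id"])  # a re-upsert resurrects the id
        deletes.update(frag["deletes"])
    for vec_id in deletes:
        merged.pop(vec_id, None)
    return merged
```

The surviving vectors then feed steps 6-8: k-means training, centroid assignment, and the artifact writes.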

Deferred Deletion

Old artifacts are deleted after the new manifest is committed. This ensures that concurrent readers using the old manifest can still read old data. The sequence is:

Write new segment → CAS manifest → Delete old segment → Delete old fragments

If deletion fails, correctness is unaffected: the new manifest no longer references the old artifacts, so they are merely orphaned storage. A cleanup job can reclaim them later.
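
The ordering and failure tolerance can be sketched as follows; `finish_compaction` and its parameters are illustrative names, not the project's API.

```python
def finish_compaction(store, commit_manifest, old_segment_keys,
                      old_fragment_keys, log=print):
    """Commit the new manifest first; only then delete old artifacts.

    Deletion failures are logged and ignored: once the CAS succeeds,
    the unreferenced objects are merely orphaned storage.
    """
    commit_manifest()  # CAS on manifest.json; raises on conflict
    for key in old_segment_keys + old_fragment_keys:
        try:
            store.delete(key)
        except Exception as exc:
            log(f"orphaned {key}: {exc}")
```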

Retrain Threshold

If cluster sizes become highly imbalanced (largest cluster / smallest cluster > retrain_imbalance_threshold), the compactor retrains centroids from scratch instead of incrementally updating.
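
The check itself is a one-liner over the cluster sizes; the 8.0 default below is an illustrative placeholder, not the project's actual retrain_imbalance_threshold value.

```python
def needs_retrain(cluster_sizes, retrain_imbalance_threshold=8.0):
    """True when largest/smallest cluster exceeds the threshold."""
    nonempty = [s for s in cluster_sizes if s > 0]
    if len(nonempty) < 2:
        return False
    return max(nonempty) / min(nonempty) > retrain_imbalance_threshold
```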

Multi-Writer Lease Protocol

Zeppelin supports a lease-based protocol for multi-writer safety.

Lease Object

Stored at <namespace>/lease.json:

{
  "holder": "node-abc",
  "fencing_token": 42,
  "acquired_at": "2026-01-15T10:00:00Z",
  "expires_at": "2026-01-15T10:05:00Z"
}
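
Acquisition can be sketched as below: take the lease only if it is absent, expired, or already ours, and bump the fencing token on every handover. `try_acquire_lease` is an illustrative name, and in practice the new lease object would itself be committed with a conditional PUT on lease.json to serialize racing acquirers.

```python
from datetime import datetime, timedelta, timezone


def try_acquire_lease(current_lease, holder, now,
                      ttl=timedelta(minutes=5)):
    """Return a new lease dict, or None if another writer still holds it.

    The fencing token increases monotonically across handovers, which
    is what the manifest's fencing check relies on.
    """
    if current_lease is not None:
        expires = datetime.fromisoformat(current_lease["expires_at"])
        if expires > now and current_lease["holder"] != holder:
            return None  # unexpired and held by someone else
    token = current_lease["fencing_token"] + 1 if current_lease else 1
    return {
        "holder": holder,
        "fencing_token": token,
        "acquired_at": now.isoformat(),
        "expires_at": (now + ttl).isoformat(),
    }
```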

Two-Layer Defense

Neither fencing alone nor CAS alone is sufficient to prevent zombie writes:

  1. Fencing check: Before writing, verify your fencing token matches the manifest's token. This catches most stale writers.
  2. CAS on manifest: Use ETag conditional PUT to atomically commit. This catches the TOCTOU gap between fencing check and write.

Both layers are required:

  • Fencing without CAS: TOCTOU race between check and write
  • CAS without fencing: A stale writer that reads the manifest can still win the CAS race
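
The two layers compose into a single guarded commit, sketched below. `MemStore` and `guarded_commit` are illustrative stand-ins: the fencing check rejects a writer whose token is older than the manifest's, and the conditional PUT closes the remaining TOCTOU window.

```python
import json


class PreconditionFailed(Exception):
    """Stands in for S3's 412 on a failed If-Match."""


class MemStore:
    """Minimal in-memory manifest store with an ETag-conditional PUT."""

    def __init__(self, initial):
        self.body, self.etag, self._n = json.dumps(initial), "v0", 0

    def get(self):
        return self.body, self.etag

    def put_if_match(self, body, etag):
        if etag != self.etag:
            raise PreconditionFailed()
        self._n += 1
        self.body, self.etag = body, f"v{self._n}"


def guarded_commit(store, my_token, mutate):
    """Layer 1: fencing check. Layer 2: CAS.

    A stale writer fails the fencing check; a writer racing a
    concurrent commit fails the conditional PUT and must reload.
    """
    body, etag = store.get()
    manifest = json.loads(body)
    if manifest["fencing_token"] > my_token:
        raise RuntimeError("fenced: a newer lease holder exists")
    mutate(manifest)
    manifest["fencing_token"] = my_token
    store.put_if_match(json.dumps(manifest), etag)
```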

Lease Rules

  • Leases have a TTL (default: 5 minutes)
  • A lease is never deleted — it is released by marking it expired
  • If a lease expires and is acquired by another writer, the old writer's subsequent writes fail at the fencing check
  • Release is best-effort: if it fails, the lease will expire naturally
