WAL and Compaction

Anup Ghatage edited this page Feb 12, 2026 · 1 revision

Write-Ahead Log (WAL)

Every write (upsert or delete) goes through the WAL before being indexed. WAL fragments are immutable JSON files stored on S3.

Fragment Format

Each fragment is a JSON file with an xxHash checksum:

{
  "id": "01HXYZ...",
  "namespace": "my_namespace",
  "vectors": [
    {
      "id": "vec-1",
      "values": [0.1, 0.2, 0.3],
      "attributes": {"color": "red"}
    }
  ],
  "deletes": ["vec-old-1", "vec-old-2"],
  "checksum": 12345678901234,
  "created_at": "2026-01-15T10:00:00Z"
}
  • S3 key: <namespace>/wal/<ulid>.fragment.json
  • ID: ULID (time-sortable, unique)
  • Checksum: xxHash-64 over the serialized vectors + deletes
  • Immutable: Never modified after write
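
As a sketch of this write path, a fragment can be assembled and checksummed before the PUT roughly as below. `build_fragment` and `s3_key` are illustrative names (not the project's API), and `zlib.crc32` stands in for xxHash-64, which is not in the Python standard library; per the format above, the real checksum is xxHash-64 over the serialized vectors + deletes.

```python
import json
import zlib


def build_fragment(fragment_id, namespace, vectors, deletes, created_at):
    """Assemble an immutable WAL fragment ready for a PUT to S3.

    The checksum covers the serialized vectors + deletes, as the
    fragment format describes. zlib.crc32 is a stand-in for xxHash-64.
    """
    payload = json.dumps({"vectors": vectors, "deletes": deletes},
                         sort_keys=True).encode()
    return {
        "id": fragment_id,
        "namespace": namespace,
        "vectors": vectors,
        "deletes": deletes,
        "checksum": zlib.crc32(payload),
        "created_at": created_at,
    }


def s3_key(namespace, fragment_id):
    # Key layout from above: <namespace>/wal/<ulid>.fragment.json
    return f"{namespace}/wal/{fragment_id}.fragment.json"
```

Because fragment IDs are ULIDs, sorting keys lexicographically also sorts fragments by creation time, which is what later merge steps rely on.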

Fragment Lifecycle

  1. Write: Client upsert/delete → create fragment → PUT to S3
  2. Read: Strong-consistency queries scan all uncompacted fragments
  3. Compact: Compaction merges fragments into an indexed segment
  4. Delete: Compacted fragments are removed after successful CAS

Manifest

The manifest (manifest.json) is the authoritative record of what data exists in a namespace. It tracks WAL fragments, segments, pending deletes, and the fencing token.

Structure

{
  "fragments": [
    {
      "id": "01HXYZ...",
      "key": "my_ns/wal/01HXYZ.fragment.json",
      "vector_count": 100,
      "delete_count": 2,
      "created_at": "2026-01-15T10:00:00Z"
    }
  ],
  "segments": [
    {
      "id": "seg-abc",
      "key_prefix": "my_ns/segments/seg-abc/",
      "vector_count": 5000,
      "centroid_count": 32,
      "created_at": "2026-01-15T12:00:00Z",
      "quantization": null,
      "bitmap_fields": ["color", "price"],
      "fts_fields": ["content"]
    }
  ],
  "pending_deletes": ["vec-old-1"],
  "fencing_token": 42,
  "updated_at": "2026-01-15T12:00:00Z"
}

Key Fields

Field            Description
fragments        List of FragmentRef — uncompacted WAL fragments
segments         List of SegmentRef — indexed segments
pending_deletes  Vector IDs to exclude from query results
fencing_token    Monotonic counter for the multi-writer lease protocol
updated_at       Last modification timestamp

SegmentRef Fields

Field            Description
id               Segment identifier
key_prefix       S3 key prefix for all segment artifacts
vector_count     Total vectors in this segment
centroid_count   Number of IVF centroids
quantization     Quantization type (null, "scalar", or "product")
bitmap_fields    Fields with bitmap indexes
fts_fields       Fields with inverted indexes
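
The manifest shape can be sketched as plain data classes. Field names mirror the JSON above, but these are illustrative Python stand-ins; the codebase's actual FragmentRef and SegmentRef types may differ in detail.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FragmentRef:
    """An uncompacted WAL fragment tracked by the manifest."""
    id: str
    key: str
    vector_count: int
    delete_count: int
    created_at: str


@dataclass
class SegmentRef:
    """An indexed segment; quantization is None, "scalar", or "product"."""
    id: str
    key_prefix: str
    vector_count: int
    centroid_count: int
    created_at: str
    quantization: Optional[str] = None
    bitmap_fields: list = field(default_factory=list)
    fts_fields: list = field(default_factory=list)


@dataclass
class Manifest:
    """Authoritative record of a namespace's data."""
    fragments: list = field(default_factory=list)
    segments: list = field(default_factory=list)
    pending_deletes: list = field(default_factory=list)
    fencing_token: int = 0
    updated_at: str = ""
```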

CAS (Compare-and-Swap)

The manifest is updated atomically using ETag-based conditional PUTs. This prevents concurrent writers from corrupting the manifest.

1. GET manifest.json → (data, etag)
2. Modify data
3. PUT manifest.json with If-Match: etag
   → Success: commit
   → 412 Precondition Failed: retry from step 1

This requires S3ConditionalPut::ETagMatch to be enabled in the object_store builder.
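
The retry loop can be sketched against an in-memory stand-in for S3. `FakeObjectStore`, `PreconditionFailed`, and `cas_update` are illustrative names invented for this sketch; the real implementation goes through the Rust object_store crate's conditional PUT support.

```python
import json


class PreconditionFailed(Exception):
    """Stands in for S3's 412 response to a failed If-Match."""


class FakeObjectStore:
    """In-memory stand-in for a bucket with ETag-conditional PUTs."""

    def __init__(self):
        self._data, self._etags, self._version = {}, {}, 0

    def get(self, key):
        return self._data[key], self._etags[key]

    def put(self, key, body):
        self._version += 1
        self._data[key], self._etags[key] = body, f"v{self._version}"

    def put_if_match(self, key, body, etag):
        if self._etags.get(key) != etag:
            raise PreconditionFailed(key)
        self.put(key, body)


def cas_update(store, key, mutate, max_retries=10):
    """GET -> modify -> conditional PUT, retrying from the GET on a 412."""
    for _ in range(max_retries):
        body, etag = store.get(key)
        manifest = json.loads(body)
        mutate(manifest)
        try:
            store.put_if_match(key, json.dumps(manifest), etag)
            return manifest
        except PreconditionFailed:
            continue  # another writer committed first; reload and retry
    raise RuntimeError("CAS retries exhausted")
```

The key property is that a lost race restarts from the GET, so the losing writer re-applies its change on top of the winner's committed manifest rather than overwriting it.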

Compaction Pipeline

Compaction merges WAL fragments into indexed segments. It runs on a configurable interval (default: 30 seconds).

Steps

  1. Read manifest — Load current fragments and segments
  2. Acquire lease — Obtain fencing token for exclusive write access
  3. Load fragments — Download and deserialize all WAL fragments
  4. Merge data — Combine fragment vectors with existing segment data
  5. Apply deletes — Remove vectors in the delete set
  6. Train centroids — Run k-means to find cluster centers
  7. Assign vectors — Assign each vector to its nearest centroid
  8. Write artifacts — Upload cluster data, bitmaps, inverted indexes to S3
  9. CAS manifest — Atomically update manifest: add new SegmentRef, clear processed FragmentRefs
  10. Deferred deletion — Delete old segment artifacts and compacted fragment files
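
Steps 3-5 (load, merge, apply deletes) can be sketched as a pure function over the fragment JSON shown earlier. This is a simplification invented for illustration; it assumes fragments arrive sorted by ULID so that later writes win.

```python
def merge_fragments(fragments, existing_vectors, pending_deletes):
    """Merge fragment data over existing segment vectors, then drop
    everything in the delete set.

    `fragments` must be sorted by ULID (i.e. by creation time); each is
    a dict shaped like the fragment JSON above. `existing_vectors` maps
    vector id -> record.
    """
    merged = dict(existing_vectors)
    deletes = set(pending_deletes)
    for frag in fragments:
        for vec in frag["vectors"]:
            merged[vec["id"]] = vec   # later fragment wins
            deletes.discard(vec["id"])  # a re-upsert resurrects the id
        deletes.update(frag["deletes"])
    for vec_id in deletes:
        merged.pop(vec_id, None)
    return merged
```

The surviving vectors then feed steps 6-8: k-means training, centroid assignment, and the artifact writes.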

Deferred Deletion

Old artifacts are deleted after the new manifest is committed. This ensures that concurrent readers using the old manifest can still read old data. The sequence is:

Write new segment → CAS manifest → Delete old segment → Delete old fragments

If deletion fails, correctness is unaffected: the new manifest no longer references the old artifacts, so they are merely orphaned storage. A cleanup job can reclaim them later.
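
The ordering and failure tolerance can be sketched as follows; `finish_compaction` and its parameters are illustrative names, not the project's API.

```python
def finish_compaction(store, commit_manifest, old_segment_keys,
                      old_fragment_keys, log=print):
    """Commit the new manifest first; only then delete old artifacts.

    Deletion failures are logged and ignored: once the CAS succeeds,
    the unreferenced objects are merely orphaned storage.
    """
    commit_manifest()  # CAS on manifest.json; raises on conflict
    for key in old_segment_keys + old_fragment_keys:
        try:
            store.delete(key)
        except Exception as exc:
            log(f"orphaned {key}: {exc}")
```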

Retrain Threshold

If cluster sizes become highly imbalanced (largest cluster / smallest cluster > retrain_imbalance_threshold), the compactor retrains centroids from scratch instead of incrementally updating.
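
The check itself is a one-liner over the cluster sizes; the 8.0 default below is an illustrative placeholder, not the project's actual retrain_imbalance_threshold value.

```python
def needs_retrain(cluster_sizes, retrain_imbalance_threshold=8.0):
    """True when largest/smallest cluster exceeds the threshold."""
    nonempty = [s for s in cluster_sizes if s > 0]
    if len(nonempty) < 2:
        return False
    return max(nonempty) / min(nonempty) > retrain_imbalance_threshold
```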

Multi-Writer Lease Protocol

Zeppelin supports a lease-based protocol for multi-writer safety.

Lease Object

Stored at <namespace>/lease.json:

{
  "holder": "node-abc",
  "fencing_token": 42,
  "acquired_at": "2026-01-15T10:00:00Z",
  "expires_at": "2026-01-15T10:05:00Z"
}
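
Acquisition can be sketched as below: take the lease only if it is absent, expired, or already ours, and bump the fencing token on every handover. `try_acquire_lease` is an illustrative name, and in practice the new lease object would itself be committed with a conditional PUT on lease.json to serialize racing acquirers.

```python
from datetime import datetime, timedelta, timezone


def try_acquire_lease(current_lease, holder, now,
                      ttl=timedelta(minutes=5)):
    """Return a new lease dict, or None if another writer still holds it.

    The fencing token increases monotonically across handovers, which
    is what the manifest's fencing check relies on.
    """
    if current_lease is not None:
        expires = datetime.fromisoformat(current_lease["expires_at"])
        if expires > now and current_lease["holder"] != holder:
            return None  # unexpired and held by someone else
    token = current_lease["fencing_token"] + 1 if current_lease else 1
    return {
        "holder": holder,
        "fencing_token": token,
        "acquired_at": now.isoformat(),
        "expires_at": (now + ttl).isoformat(),
    }
```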

Two-Layer Defense

Neither fencing alone nor CAS alone is sufficient to prevent zombie writes:

  1. Fencing check: Before writing, verify your fencing token matches the manifest's token. This catches most stale writers.
  2. CAS on manifest: Use ETag conditional PUT to atomically commit. This catches the TOCTOU gap between fencing check and write.

Both layers are required:

  • Fencing without CAS: TOCTOU race between check and write
  • CAS without fencing: A stale writer that reads the manifest can still win the CAS race
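
The two layers compose into a single guarded commit, sketched below. `MemStore` and `guarded_commit` are illustrative stand-ins: the fencing check rejects a writer whose token is older than the manifest's, and the conditional PUT closes the remaining TOCTOU window.

```python
import json


class PreconditionFailed(Exception):
    """Stands in for S3's 412 on a failed If-Match."""


class MemStore:
    """Minimal in-memory manifest store with an ETag-conditional PUT."""

    def __init__(self, initial):
        self.body, self.etag, self._n = json.dumps(initial), "v0", 0

    def get(self):
        return self.body, self.etag

    def put_if_match(self, body, etag):
        if etag != self.etag:
            raise PreconditionFailed()
        self._n += 1
        self.body, self.etag = body, f"v{self._n}"


def guarded_commit(store, my_token, mutate):
    """Layer 1: fencing check. Layer 2: CAS.

    A stale writer fails the fencing check; a writer racing a
    concurrent commit fails the conditional PUT and must reload.
    """
    body, etag = store.get()
    manifest = json.loads(body)
    if manifest["fencing_token"] > my_token:
        raise RuntimeError("fenced: a newer lease holder exists")
    mutate(manifest)
    manifest["fencing_token"] = my_token
    store.put_if_match(json.dumps(manifest), etag)
```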

Lease Rules

  • Leases have a TTL (default: 5 minutes)
  • A lease is never deleted — it is released by marking it expired
  • If a lease expires and is acquired by another writer, the old writer's subsequent writes fail at the fencing check
  • Release is best-effort: if it fails, the lease will expire naturally
