WShard Deep Dive

What WShard Is
The Binary Format — Byte by Byte
The Problem It Solves — And Why Nobody Else Has
Cross-Language Interop — One Format, Three Runtimes
DeepData Bridge — Similarity Search Over Episodes

For market positioning, competitive landscape, and adoption context see MARKETING_BRIEF.md.

1. What WShard Is

WShard (World-Model Episode Shard) is a binary file format and cross-language library for storing, reading, and managing trajectory data from robotics, reinforcement learning, and world model training pipelines.

A single .wshard file is a self-contained episode: a time-indexed bundle of observations, actions, rewards, and termination signals recorded from an agent interacting with an environment. The file is a flat binary with O(1) block lookup, per-block compression, and data alignment for zero-copy reads.

Not a database. Not a framework. A format.

WShard does not run a server. It does not require GPUs. It does not impose a training loop. It is a file format — like Parquet is to tabular data or EXR is to HDR images — purpose-built for one domain: sequential decision-making episodes.

The Episode as First-Class Citizen

┌─────────────────────────────────────────────────┐
│  episode_abc.wshard                             │
│                                                 │
│  meta/wshard      → {"version": 1, "timebase":…}│
│  meta/episode     → {"episode_id": "abc", …}   │
│  meta/channels    → {"channels": [{…}, {…}]}   │
│                                                 │
│  signal/rgb_cam   → [T, 84, 84, 3] uint8       │
│  signal/joint_pos → [T, 7] float32             │
│  signal/force     → [T, 6] float32             │
│  action/ctrl      → [T, 7] float32             │
│  reward           → [T] float32                │
│  done             → [T] bool                   │
│                                                 │
│  omen/joint_pos/dreamer → [T, 7] float32       │ ← model predictions
│  residual/joint_pos/sign2nddiff → packed bits  │ ← compressed deltas
│  uncert/joint_pos/dreamer/std → [T, 7] float32 │ ← uncertainty
└─────────────────────────────────────────────────┘

Each named block is independently addressable. A training loop that only needs signal/joint_pos and action/ctrl reads exactly those two blocks — no deserialization of video frames or metadata.

What Ships Today

Component	Language	Lines	Status
`wshard` Python package	Python	~4,000	Production
`@wshard/core` npm package	TypeScript	~3,500	Production
`shard` Go package	Go	~2,000	Production
Golden test fixtures	Go generator	553	Verified
DeepData trajectory bridge	Python	~500	Production

Python: 103 tests, TypeScript: 15 tests, all passing. Go wshard is covered by the core shard package's test suite.

2. The Binary Format — Byte by Byte

WShard rides on Shard, a general-purpose binary container format (like a simplified ZIP with aligned data sections). WShard is Shard with role = 0x05.

File Layout

Offset    Size    Content
────────────────────────────────────────────────
0x00      4       Magic: "SHRD" (0x53 0x48 0x52 0x44)
0x04      1       Version: 0x02
0x05      1       Role: 0x05 (WShard)
0x06      2       Flags (LE uint16)
0x08      1       Alignment (0, 16, 32, 64 bytes)
0x09      1       Compression default (0=none, 1=zstd, 2=lz4)
0x0A      2       Index entry size: 48 (LE uint16)
0x0C      4       Entry count (LE uint32)
0x10      8       String table offset (LE uint64)
0x18      8       Data section offset (LE uint64)
0x20      8       Schema offset (LE uint64, 0 if absent)
0x28      8       Total file size (LE uint64)
0x30      16      Reserved (zeroed)
────────────────────────────────────────────────
0x40      N×48    Index entries
          var     String table
          pad     Alignment padding (0x00)
          var     Data blocks (each aligned)

Total header: 64 bytes. Fixed. Parseable with a single read(64) call.
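The fixed layout above maps onto a single struct format string. The following is a minimal sketch of that idea, not the library's actual parser; ShardHeader and parse_header are hypothetical names introduced for illustration, and the example header is synthetic.

```python
import struct
from typing import NamedTuple

# Little-endian, no padding: mirrors the 64-byte layout table field by field.
HEADER_FORMAT = "<4sBBHBBHIQQQQ16s"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 64


class ShardHeader(NamedTuple):  # hypothetical name, for illustration only
    magic: bytes
    version: int
    role: int
    flags: int
    alignment: int
    compression_default: int
    index_entry_size: int
    entry_count: int
    string_table_offset: int
    data_section_offset: int
    schema_offset: int
    total_file_size: int
    reserved: bytes


def parse_header(buf: bytes) -> ShardHeader:
    """Decode the first 64 bytes of a Shard file and check the magic."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("need at least 64 bytes")
    header = ShardHeader(*struct.unpack(HEADER_FORMAT, buf[:HEADER_SIZE]))
    if header.magic != b"SHRD":
        raise ValueError(f"bad magic: {header.magic!r}")
    return header


# Synthetic WShard header (role 0x05) with two index entries after the header.
example = struct.pack(
    HEADER_FORMAT, b"SHRD", 0x02, 0x05, 0, 32, 1, 48, 2,
    0x40 + 2 * 48, 0x100, 0, 0x200, bytes(16),
)
header = parse_header(example)
print(header.role, header.entry_count, header.string_table_offset)
```

Because every field sits at a fixed offset, a reader can validate the magic, version, and role before touching the index or any data block.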

Index Entry (48 bytes)

Each block in the file has one index entry:

Offset    Size    Field
────────────────────────────────────
0x00      8       Name hash (xxHash64 of UTF-8 name)
0x08      4       Name offset into string table (LE)
0x0C      2       Name length (LE)
0x0E      2       Flags (LE) — bit 0: compressed,
                               bit 1: zstd, bit 2: lz4
0x10      8       Data offset in file (absolute, LE)
0x18      8       Disk size (compressed, LE)
0x20      8       Original size (uncompressed, LE)
0x28      4       CRC32C checksum (of uncompressed data)
0x2C      2       Content type (0=raw, 2=JSON)
0x2E      2       Reserved
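As an illustration of the entry layout, the sketch below unpacks packed 48-byte entries and keys them by name hash. It is an assumption-laden example rather than the shipped reader: parse_index and the dictionary field names are hypothetical, and the two entries are synthetic (they reuse the reference hashes for "signal/obs" and "meta/manifest" quoted later in this document).

```python
import struct

ENTRY_FORMAT = "<QIHHQQQIHH"  # 48 bytes, little-endian, no padding
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)

FLAG_COMPRESSED = 1 << 0
FLAG_ZSTD = 1 << 1
FLAG_LZ4 = 1 << 2


def parse_index(buf: bytes, entry_count: int) -> dict:
    """Return {name_hash: entry} for entry_count packed index entries."""
    index = {}
    for i in range(entry_count):
        (name_hash, name_off, name_len, flags, data_off, disk_size,
         orig_size, crc, content_type, _reserved) = struct.unpack_from(
            ENTRY_FORMAT, buf, i * ENTRY_SIZE)
        if not flags & FLAG_COMPRESSED:
            codec = "none"
        elif flags & FLAG_ZSTD:
            codec = "zstd"
        elif flags & FLAG_LZ4:
            codec = "lz4"
        else:
            codec = "unknown"
        index[name_hash] = {
            "name_offset": name_off, "name_length": name_len,
            "codec": codec, "data_offset": data_off,
            "disk_size": disk_size, "original_size": orig_size,
            "crc32c": crc, "content_type": content_type,
        }
    return index


# Two synthetic entries: a zstd-compressed raw block and an uncompressed JSON block.
raw_index = struct.pack(ENTRY_FORMAT, 0x86F8C8413116A0AE, 0, 10,
                        FLAG_COMPRESSED | FLAG_ZSTD, 0x100, 120, 400,
                        0xDEADBEEF, 0, 0)
raw_index += struct.pack(ENTRY_FORMAT, 0x9A191DCD325813D3, 10, 13,
                         0, 0x180, 64, 64, 0x12345678, 2, 0)
index = parse_index(raw_index, 2)
obs_entry = index[0x86F8C8413116A0AE]
print(obs_entry["codec"], hex(obs_entry["data_offset"]))
```

Keying the dictionary by hash means a block lookup never needs the string table; whether a reader should also compare the stored name to guard against hash collisions is a design choice this sketch leaves out.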

Why This Matters

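O(1) lookup. The name hash in each index entry allows binary search or hash-table lookup without touching the string table. For a file with 50 blocks, finding signal/joint_pos requires reading the 64-byte header plus the 50 index entries (50 × 48 = 2,400 bytes) and comparing one 8-byte hash per entry; a reader that loads those hashes into a hash table once gets constant-time lookups afterwards.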

Per-block compression. Each block carries its own compression flag. Video blocks can use zstd at high ratios. Small scalar blocks (reward, done) stay uncompressed. The reader detects compression from the entry flags — no header-level assumption needed.

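Alignment. When the header's alignment field is 32 or 64, data blocks start on 32-byte or 64-byte boundaries. For an uncompressed block, mmap() plus a pointer cast then gives you an AVX-aligned float32 array with zero copies; compressed blocks must be decompressed into a new buffer first. The Go reader uses this for memory-mapped reads; Python can use it with np.frombuffer() on mmap'd files.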
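To make the alignment arithmetic concrete, here is a small standard-library sketch. It writes a toy file (not a valid Shard file), pads to a 32-byte boundary, and reinterprets the mapped bytes as float32 without copying. align_up is a hypothetical helper name, and memoryview.cast("f") uses native byte order, so the example assumes a little-endian host.

```python
import mmap
import os
import struct
import tempfile


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (0 means unaligned)."""
    if alignment == 0:
        return offset
    return offset + (-offset) % alignment


# Toy file: 70 bytes standing in for header + index, zero padding, then 4 floats.
values = (0.5, 1.5, 2.5, 3.5)
data_offset = align_up(70, 32)  # next 32-byte boundary after byte 70
payload = struct.pack("<4f", *values)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(70))
    f.write(bytes(data_offset - 70))  # 0x00 alignment padding
    f.write(payload)
    path = f.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    view = memoryview(mm)[data_offset:data_offset + len(payload)]
    floats = view.cast("f")  # reinterpret the mapped bytes in place, no copy
    result = list(floats)
    floats.release()  # views must be released before the mmap closes
    view.release()

os.remove(path)
print(data_offset, result)
```

The same offset could be passed to np.frombuffer() on the mapped file; the standard library is used here only to keep the sketch dependency-free.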

Checksums. CRC32C (Castagnoli polynomial 0x82F63B78) on uncompressed data. This is the hardware-accelerated CRC on x86 (_mm_crc32_*) and ARM (__crc32c*). Go's crc32.Castagnoli, Python's crc32c package, and the TypeScript implementation all use this polynomial.

Block Naming Convention

Names are hierarchical paths separated by /:

Prefix	Purpose	Examples
`meta/`	JSON metadata	`meta/wshard`, `meta/episode`, `meta/channels`
`signal/`	Ground truth observations	`signal/rgb`, `signal/joint_pos`
`action/`	Agent actions	`action/ctrl`, `action/gripper`
`omen/`	Model predictions	`omen/joint_pos/dreamer`
`uncert/`	Uncertainty estimates	`uncert/joint_pos/dreamer/std`
`residual/`	Compressed residuals	`residual/joint_pos/sign2nddiff`
`time/`	Timestamps	`time/ticks`, `time/timestamps_ns`
`reward`	Reward signal (no prefix)	`reward`
`done`	Termination flags	`done`

This naming convention is semantic, not syntactic. The reader uses it to route blocks to the correct Episode fields. Action blocks go to ep.actions, signal blocks to ep.observations, omen blocks to ep.omens.
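The routing described above can be sketched as a prefix table. This is an illustrative guess at the shape of the logic, not the reader's real code: route_block is a hypothetical name, and only the observations, actions, and omens fields are named in this document; the other field names are placeholders.

```python
# Prefix -> Episode field. "observations", "actions" and "omens" come from the
# text above; the remaining field names are assumptions made for this sketch.
PREFIX_TO_FIELD = {
    "signal": "observations",
    "action": "actions",
    "omen": "omens",
    "uncert": "uncertainties",
    "residual": "residuals",
    "meta": "metadata",
    "time": "time",
}


def route_block(name: str) -> tuple:
    """Map a block name to (episode_field, key) using the prefix convention."""
    if name in ("reward", "done"):  # the two un-prefixed blocks
        return (name, None)
    prefix, _, rest = name.partition("/")
    if prefix in PREFIX_TO_FIELD and rest:
        return (PREFIX_TO_FIELD[prefix], rest)
    return ("unrouted", name)  # unknown names are kept, not rejected


for block in ("signal/joint_pos", "action/ctrl", "omen/joint_pos/dreamer", "reward"):
    print(block, "->", route_block(block))
```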

Timebase

The meta/wshard block contains a timebase object describing the episode's time axis:

{
  "timebase": {
    "type": "ticks",
    "tick_hz": 30.0
  }
}

Field	Type	Description
`type`	`string`	`"ticks"` (fixed-rate) or `"timestamps_ns"` (variable-rate wall clock)
`tick_hz`	`float64`	Ticks per second. Only meaningful when `type == "ticks"`.

When type == "ticks", each timestep t corresponds to real time t / tick_hz seconds. When type == "timestamps_ns", the time/timestamps_ns block contains per-timestep nanosecond timestamps (int64 LE).
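A short sketch of how a consumer might turn a timestep index into seconds under both timebase types. timestep_seconds is a hypothetical helper, and reporting variable-rate time relative to the first timestamp is a choice made for this example, not something the format prescribes.

```python
import json


def timestep_seconds(meta_wshard: dict, t: int, timestamps_ns=None) -> float:
    """Return the time of timestep t in seconds for either timebase type."""
    timebase = meta_wshard["timebase"]
    if timebase["type"] == "ticks":
        return t / timebase["tick_hz"]
    if timebase["type"] == "timestamps_ns":
        if timestamps_ns is None:
            raise ValueError("timestamps_ns timebase needs the time/timestamps_ns block")
        # Assumption: report seconds elapsed since the first recorded sample.
        return (timestamps_ns[t] - timestamps_ns[0]) / 1e9
    raise ValueError(f"unknown timebase type: {timebase['type']!r}")


fixed = json.loads('{"timebase": {"type": "ticks", "tick_hz": 30.0}}')
variable = {"timebase": {"type": "timestamps_ns"}}
stamps = [1_000_000_000, 1_033_000_000, 1_070_000_000]

fixed_time = timestep_seconds(fixed, 45)               # 45 ticks at 30 Hz
variable_time = timestep_seconds(variable, 2, stamps)  # 70 ms after the first sample
print(fixed_time, variable_time)
```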

Multi-Modal Signal Naming

Multi-modal observations use a two-level signal path: signal/{group}/{modality}:

Modality	Constant	Example block name
RGB camera	`rgb`	`signal/cam0/rgb`
Depth sensor	`depth`	`signal/cam0/depth`
Language	`language`	`signal/cmd/language`
Proprioception	`proprioception`	`signal/arm/proprioception`
Audio	`audio`	`signal/mic/audio`
Video	`video`	`signal/cam0/video`
Point cloud	`pointcloud`	`signal/lidar/pointcloud`
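The two-level path can be split mechanically. The sketch below is a hypothetical helper (parse_signal_name is not a documented API) that accepts both the flat form used earlier (signal/joint_pos) and the grouped form from the table, and rejects grouped names whose modality is not one of the listed constants.

```python
MODALITIES = {
    "rgb", "depth", "language", "proprioception",
    "audio", "video", "pointcloud",
}


def parse_signal_name(name: str) -> tuple:
    """Split a signal block name into (group, leaf).

    Flat names such as "signal/joint_pos" return (None, "joint_pos");
    grouped names such as "signal/cam0/rgb" return ("cam0", "rgb").
    """
    parts = name.split("/")
    if parts[0] != "signal" or len(parts) not in (2, 3):
        raise ValueError(f"not a signal block name: {name!r}")
    if len(parts) == 2:
        return (None, parts[1])
    group, modality = parts[1], parts[2]
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality {modality!r} in {name!r}")
    return (group, modality)


names = ["signal/cam0/rgb", "signal/cam0/depth", "signal/arm/proprioception"]
by_group = {}
for n in names:
    group, modality = parse_signal_name(n)
    by_group.setdefault(group, []).append(modality)
print(by_group)
```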

Latent Action Naming

Latent action embeddings and codebook indices use the omen/ namespace:

Block name pattern	Description
`omen/latent_action/{model}`	Latent action embeddings from model
`omen/latent_action_codebook/{model}`	Discrete codebook indices for latent actions

Supported Data Types

13 types, matching the union of numpy, PyTorch, and Go primitive types:

WShard	Size	numpy	Notes
`f32`	4	float32	Default for most signals
`f64`	8	float64	High-precision physics
`f16`	2	float16	Inference outputs
`bf16`	2	bfloat16*	Training-native (requires ml_dtypes)
`i64`	8	int64	Timestamps, large indices
`i32`	4	int32	Action indices, labels
`i16`	2	int16	Quantized signals
`i8`	1	int8	Quantized weights
`u64`	8	uint64	Hashes, addresses
`u32`	4	uint32	Indices, masks
`u16`	2	uint16	Depth images
`u8`	1	uint8	RGB pixels, raw bytes
`bool`	1	bool_	Done/termination flags

*bf16 uses ml_dtypes.bfloat16 when available, falls back to uint16 to preserve byte layout.
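The size column is all a reader needs to predict a block's uncompressed byte count from its dtype and shape. A minimal sketch, with block_nbytes as a hypothetical helper name:

```python
from math import prod

# Element sizes in bytes, copied from the table above.
DTYPE_SIZES = {
    "f32": 4, "f64": 8, "f16": 2, "bf16": 2,
    "i64": 8, "i32": 4, "i16": 2, "i8": 1,
    "u64": 8, "u32": 4, "u16": 2, "u8": 1,
    "bool": 1,
}


def block_nbytes(dtype: str, shape) -> int:
    """Uncompressed size of a tensor block: element count times element size."""
    return prod(shape) * DTYPE_SIZES[dtype]


T = 10
rgb_bytes = block_nbytes("u8", [T, 84, 84, 3])  # camera frames
joint_bytes = block_nbytes("f32", [T, 7])       # joint positions
reward_bytes = block_nbytes("f32", [T])         # scalar reward per step
print(rgb_bytes, joint_bytes, reward_bytes)
```

Comparing this figure with the index entry's original size field is a cheap consistency check before reinterpreting the bytes as a tensor.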

3. The Problem It Solves — And Why Nobody Else Has

The Trajectory Data Problem

Every world model, every RL agent, every robot learning system consumes the same thing: episodes. Sequences of (observation, action, reward, done) tuples collected from environments. The datasets are large (DROID: 76K demonstrations, 2TB+), multi-modal (cameras + joint states + force/torque + audio), and heterogeneous (different robots, different environments, different sampling rates).

There is no standard format for this data. Teams use:

Format	Limitations
HDF5	Single-writer lock. No streaming append. Poor compression control. Python-centric.
RLDS/TFDS	TensorFlow dependency. Rigid schema. Google ecosystem lock-in.
Parquet + MP4 (LeRobot v3)	Great for publishing. Poor for training-time random access. Two files per episode.
NPZ (DreamerV3)	No metadata. No compression choice. No cross-language support.
MCAP (Foxglove)	ROS-oriented. Message-based, not tensor-based.
Custom	Every lab rolls their own. No interop.

What's Actually Different About WShard

1. Per-block compression in a flat binary.

An episode with 5 camera streams and 20 scalar channels shouldn't compress everything the same way. WShard lets you zstd the video at level 19 and leave the 40-byte reward vector uncompressed. Each block carries its own compression flag — the reader auto-detects from the index entry bits.

# Write with per-block compression control
writer = WShardStreamWriter(path, "ep_001", channels=[
    {"id": "rgb", "dtype": "u8", "shape": [84, 84, 3], "compression": "zstd"},
    {"id": "joint", "dtype": "f32", "shape": [7]},  # no compression needed
])

2. Streaming append with crash safety.

Robot data collection runs for hours. If the process crashes at minute 47 while an HDF5 file is open for writing, the file can be left corrupted and the whole recording is lost. WShard's streaming writer uses a reserve-write-finalize pattern:

Write to episode.wshard.partial
Reserve space for the header (rewritten at finalization)
Append timesteps incrementally
On success: atomic rename() to episode.wshard
On crash: .partial file is deleted or identifiable as incomplete

with WShardStreamWriter(path, "ep_001", channels) as w:
    w.begin_episode()
    for t in range(T):
        w.write_timestep({"state": obs, "ctrl": act}, reward=r, done=d)
    w.end_episode()  # atomic finalize
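The reserve-write-finalize steps can be illustrated with nothing but the standard library. This is a toy sketch of the pattern, not WShardStreamWriter's implementation: write_episode is a hypothetical function, and the 64-byte "header" it writes is a stand-in, not a real Shard header.

```python
import os
import struct
import tempfile

HEADER_SIZE = 64


def write_episode(path: str, timesteps) -> None:
    """Write to path + '.partial', then atomically rename on success."""
    partial = path + ".partial"
    with open(partial, "wb") as f:
        f.write(bytes(HEADER_SIZE))  # reserve space for the header
        count = 0
        for values in timesteps:     # append timesteps incrementally
            f.write(struct.pack(f"<{len(values)}f", *values))
            count += 1
        f.seek(0)                    # finalize: rewrite the reserved header
        f.write(struct.pack("<4sI", b"TOY0", count).ljust(HEADER_SIZE, b"\0"))
        f.flush()
        os.fsync(f.fileno())
    os.replace(partial, path)        # atomic when both paths share a filesystem


def crashing_source():
    yield (0.0, 1.0)
    raise RuntimeError("simulated crash mid-episode")


workdir = tempfile.mkdtemp()
ok_path = os.path.join(workdir, "ep_ok.wshard")
bad_path = os.path.join(workdir, "ep_bad.wshard")

write_episode(ok_path, [(0.0, 1.0), (2.0, 3.0)])
try:
    write_episode(bad_path, crashing_source())
except RuntimeError:
    pass  # the final name was never created; only the .partial file remains

print(sorted(os.listdir(workdir)))
```

A reader that only ever opens files without the .partial suffix never sees a half-written episode, which is the property the pattern is meant to provide.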

3. Cross-language without serialization overhead.

The same binary file is readable from Go, Python, and TypeScript. No protobuf compilation step. No schema registry. No code generation.

Go robots write .wshard files during data collection
Python training scripts read them directly with load_wshard()
TypeScript dashboards visualize episodes in the browser

All three implementations agree on CRC32C checksums (0x9a71bb4c for "hello"), xxHash64 name hashes, dtype sizes, and block layout. This is verified by golden file tests: Go generates .wshard files, Python and TypeScript read them and assert byte-level correctness.

4. Chunked episodes for distributed training.

A 10-minute manipulation episode at 30 Hz is 18,000 timesteps; with 3 raw camera streams it can reach roughly 2 GB. You don't want that as a single file on a networked filesystem. WShard splits episodes into chunks with a manifest shard that tracks continuity:

writer = ChunkedEpisodeWriter("data/ep_001", "ep_001", chunk_size_t=1000)
for chunk in chunks:
    writer.write_chunk(chunk)
manifest = writer.finalize_manifest()

# Validation catches gaps, duplicates, and discontinuities
validate_chunk_continuity(manifest)

Each chunk is a standalone .wshard file with chunk_index, total_chunks, and timestep_range in its metadata. The manifest shard (role=0x04) ties them together.
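To show what a continuity check involves, here is a hedged sketch over per-chunk metadata. It is not the library's validate_chunk_continuity: find_continuity_errors is a hypothetical function, the metadata is passed as plain dictionaries, and timestep_range is assumed to be a half-open [start, end) pair.

```python
def find_continuity_errors(chunks) -> list:
    """Report duplicate or missing chunk indices and gaps between ranges."""
    errors = []
    ordered = sorted(chunks, key=lambda c: c["chunk_index"])

    seen = set()
    for chunk in ordered:
        if chunk["chunk_index"] in seen:
            errors.append(f"duplicate chunk_index {chunk['chunk_index']}")
        seen.add(chunk["chunk_index"])

    if ordered:
        expected = set(range(ordered[0]["total_chunks"]))
        for missing in sorted(expected - seen):
            errors.append(f"missing chunk_index {missing}")

    for prev, cur in zip(ordered, ordered[1:]):
        if cur["chunk_index"] == prev["chunk_index"]:
            continue  # already reported as a duplicate
        prev_end, cur_start = prev["timestep_range"][1], cur["timestep_range"][0]
        if prev_end != cur_start:
            errors.append(
                f"timestep discontinuity between chunk {prev['chunk_index']} "
                f"and chunk {cur['chunk_index']}: {prev_end} != {cur_start}"
            )
    return errors


def make_chunk(index, total, start, end):
    return {"chunk_index": index, "total_chunks": total, "timestep_range": [start, end]}


complete = [make_chunk(0, 3, 0, 1000), make_chunk(1, 3, 1000, 2000),
            make_chunk(2, 3, 2000, 2500)]
with_gap = [complete[0], complete[2]]  # chunk 1 was lost

good_errors = find_continuity_errors(complete)
gap_errors = find_continuity_errors(with_gap)
print(good_errors, gap_errors)
```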

5. Semantic lanes for model training.

WShard doesn't just store raw data. It has dedicated namespaces for the artifacts of model training:

omen/ — Model predictions stored alongside ground truth for comparison
uncert/ — Uncertainty estimates (ensemble variance, dropout entropy)
residual/ — Sign2ndDiff compressed deltas between prediction and truth

This means a single .wshard file can contain the ground truth episode AND the model's predictions about that episode, enabling offline evaluation without joins across separate files.

What WShard Deliberately Does Not Do

No server. It's a file format. Put the files on S3, NFS, local SSD — doesn't matter.
No training loop. Use PyTorch DataLoader, JAX data pipeline, whatever you want.
No schema enforcement. Blocks are named byte arrays. The convention layer is advisory.
No video codec. Camera data is stored as raw tensors. If you want H.264, encode before writing.
No query language. For search-by-similarity, use the DeepData bridge (see below).

4. Cross-Language Interop — One Format, Three Runtimes

The Parity Problem

Cross-language file formats sound simple. They are not. The wshard codebase had three critical interop bugs that would have caused silent data corruption in production:

Bug	What Happened	Impact
CRC32 IEEE vs Castagnoli	Python/TS used polynomial `0xEDB88320`. Go used `0x82F63B78`.	Every checksum validation fails silently or rejects valid files
FNV-1a vs xxHash64	Python/TS used inline FNV-1a hash. Go used `xxhash.Sum64String()`.	O(1) name lookup returns wrong block or misses entirely
bf16 → float32 reinterpret	Python mapped bf16 (2 bytes) to numpy float32 (4 bytes).	Every bf16 tensor has wrong shape and corrupted values

These are exactly the bugs that unit tests don't catch. Each implementation passes its own tests. The bug only appears when Go writes a file and Python reads it. This is Class 6 (Cross-Language Parity Drift) from the CoGS testing philosophy.
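Two of these bugs are easy to reproduce with the standard library. The sketch below is illustrative only (the helper names are hypothetical): it decodes bf16 correctly by treating each 2-byte value as the upper half of a float32, contrasts that with the buggy 4-byte reinterpretation, and shows that the IEEE CRC-32 check value differs from the CRC-32C one.

```python
import struct
import zlib


def bf16_bits_to_float(bits: int) -> float:
    """bf16 is the top 16 bits of an IEEE-754 float32: shift left and reinterpret."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def decode_bf16(raw: bytes) -> list:
    """Decode little-endian bf16 values, two bytes per element."""
    count = len(raw) // 2
    return [bf16_bits_to_float(b) for b in struct.unpack(f"<{count}H", raw)]


# Four bf16 values (8 bytes): 1.0, -2.0, 3.140625, 0.0
raw = struct.pack("<4H", 0x3F80, 0xC000, 0x4049, 0x0000)

correct = decode_bf16(raw)          # four elements, as written
buggy = struct.unpack("<2f", raw)   # the bug: bytes read as float32, half as many elements

# Standard check values for b"123456789" under each polynomial.
ieee_check = zlib.crc32(b"123456789")  # CRC-32 (IEEE, 0xEDB88320)
CRC32C_CHECK = 0xE3069283              # CRC-32C (Castagnoli, 0x82F63B78)

print(correct, len(buggy), hex(ieee_check), hex(CRC32C_CHECK))
```

Each half of the comparison passes its own unit tests in isolation, which is why only a shared fixture exposes the mismatch.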

How We Fixed It

Golden file testing. A standalone Go program (golden/generate.go) writes six .wshard files using the authoritative Go shard implementation:

File	Purpose
`simple_episode.wshard`	Basic episode: obs [T,4], actions [T,2], reward, done
`dtype_zoo.wshard`	All 13 dtypes exercised
`per_block_compressed.wshard`	Zstd-compressed blocks with 100 timesteps
`omen_uncert.wshard`	Omen predictions, uncertainty estimates, sign2nddiff residuals
`multimodal.wshard`	Multi-modal observations (RGB + proprioception groups)
`latent_action.wshard`	Latent action embeddings and codebook indices

Python and TypeScript tests read these files and assert:

CRC32C checksums match golden_hashes.json
xxHash64 of known strings match
Dtype sizes match
Episode metadata (id, env_id, length) parses correctly
Tensor shapes and values are correct
Compressed blocks decompress correctly

# From test_interop.py: golden file parity test (excerpt; load_wshard and
# GOLDEN_DIR are defined elsewhere in the test module)
import numpy as np

def test_golden_simple_episode_loads():
    ep = load_wshard(GOLDEN_DIR / "simple_episode.wshard")
    assert ep.id == "golden_simple"
    assert ep.env_id == "TestEnv-v1"
    assert ep.length == 10
    assert ep.observations["state"].data.shape == (10, 4)
    assert ep.actions["ctrl"].data.shape == (10, 2)
    np.testing.assert_allclose(ep.observations["state"].data[0], [0, 1, 2, 3])

Hash and Checksum Alignment

All three implementations now produce identical outputs for reference inputs:

CRC32C("hello")       = 0x9a71bb4c  (Go, Python, TypeScript)
xxHash64("signal/obs") = 0x86f8c8413116a0ae
xxHash64("meta/manifest") = 0x9a191dcd325813d3

These values are committed in golden/golden_hashes.json and verified by CI.

Implementation Details by Language

Python uses crc32c (C extension, hardware-accelerated on x86/ARM) and xxhash (C extension wrapping xxHash):

import crc32c
import xxhash

def compute_crc32(data: bytes) -> int:
    return crc32c.crc32c(data)

def name_hash(name: str) -> int:
    return xxhash.xxh64(name.encode("utf-8")).intdigest()

TypeScript uses a pure-JS CRC32C table and xxhash-wasm (WebAssembly):

// CRC32C with Castagnoli polynomial
const CRC32C_TABLE = makeCrc32Table(0x82f63b78);

export function crc32C(data: Uint8Array): number { ... }

// xxHash64 via WASM (async init)
import xxhashWasm from 'xxhash-wasm';
let _xxh64: ((s: string) => bigint) | null = null;

export async function initXxHash(): Promise<void> {
    const hasher = await xxhashWasm();
    _xxh64 = (s) => hasher.h64(s);
}

Go uses the standard library and cespare/xxhash/v2:

var crc32cTable = crc32.MakeTable(crc32.Castagnoli)

func computeChecksum(data []byte) uint32 {
    return crc32.Checksum(data, crc32cTable)
}

func nameHash(name string) uint64 {
    return xxhash.Sum64String(name)
}

5. DeepData Bridge — Similarity Search Over Episodes

WShard files stay on disk as the authoritative store. The deepdata_bridge module indexes episode metadata and observation embeddings into DeepData (a vector database) so callers can retrieve episodes by behavioural similarity:

from wshard.deepdata_bridge import TrajectoryIngestor, TrajectoryRetriever

# Index episodes
ingestor = TrajectoryIngestor("http://deepdata:8080", embedder=my_embedder)
ingestor.ingest_episode("episodes/ep_001.wshard")

# Search by behavioral similarity
retriever = TrajectoryRetriever("http://deepdata:8080", embedder=my_embedder)
results = retriever.search_similar_episodes(
    query_obs=current_observation,
    top_k=10,
    env_id="ManipulationEnv-v2",
    min_length=100,
    reward_range=(0.8, 1.0),
)
# Returns EpisodeRef(episode_id, file_path, score)
# Caller loads wshard file directly for data access

Hits return episode references; the caller reads the .wshard file directly for bulk data.

Appendix: File Locations

Item	Path
Python package	`cogs/shard/wshard/py/wshard/`
TypeScript package	`cogs/shard/wshard/js/src/`
Go shard package	`cogs/shard/go/shard/`
Golden fixtures	`cogs/shard/wshard/golden/`
Python tests	`cogs/shard/wshard/py/tests/`
TypeScript tests	`cogs/shard/wshard/js/tests/`
Go tests	`cogs/shard/go/shard/*_test.go`
DeepData bridge	`cogs/shard/wshard/py/wshard/deepdata_bridge.py`

Appendix: Dependency Map

Python (pyproject.toml):

numpy>=1.20 — Array operations
crc32c>=2.3 — Hardware-accelerated CRC32C
xxhash>=3.0 — xxHash64 name hashing
zstandard>=0.21.0 — Zstd compression
lz4>=4.0.0 — LZ4 compression
Optional: ml-dtypes>=0.3 (bf16), h5py>=3.0 (HDF5 import), torch>=2.0 (PyTorch tensors)

TypeScript (package.json):

@bokuweb/zstd-wasm — Zstd compression via WebAssembly
fflate — Deflate/LZ4 compression
xxhash-wasm — xxHash64 via WebAssembly

Go (go.mod):

github.com/cespare/xxhash/v2 — xxHash64
github.com/klauspost/compress — Zstd and LZ4
Standard library hash/crc32 — CRC32C (Castagnoli)
