- What WShard Is
- The Binary Format — Byte by Byte
- The Problem It Solves — And Why Nobody Else Has
- Cross-Language Interop — One Format, Three Runtimes
- DeepData Bridge — Similarity Search Over Episodes
For market positioning, competitive landscape, and adoption context see MARKETING_BRIEF.md.
WShard (World-Model Episode Shard) is a binary file format and cross-language library for storing, reading, and managing trajectory data from robotics, reinforcement learning, and world model training pipelines.
A single .wshard file is a self-contained episode: a time-indexed bundle of observations, actions, rewards, and termination signals recorded from an agent interacting with an environment. The file is a flat binary with O(1) block lookup, per-block compression, and data alignment for zero-copy reads.
WShard does not run a server. It does not require GPUs. It does not impose a training loop. It is a file format — like Parquet is to tabular data or EXR is to HDR images — purpose-built for one domain: sequential decision-making episodes.
┌─────────────────────────────────────────────────┐
│ episode_abc.wshard │
│ │
│ meta/wshard → {"version": 1, "timebase":…}│
│ meta/episode → {"episode_id": "abc", …} │
│ meta/channels → {"channels": [{…}, {…}]} │
│ │
│ signal/rgb_cam → [T, 84, 84, 3] uint8 │
│ signal/joint_pos → [T, 7] float32 │
│ signal/force → [T, 6] float32 │
│ action/ctrl → [T, 7] float32 │
│ reward → [T] float32 │
│ done → [T] bool │
│ │
│ omen/joint_pos/dreamer → [T, 7] float32 │ ← model predictions
│ residual/joint_pos/sign2nddiff → packed bits │ ← compressed deltas
│ uncert/joint_pos/dreamer/std → [T, 7] float32 │ ← uncertainty
└─────────────────────────────────────────────────┘
Each named block is independently addressable. A training loop that only needs signal/joint_pos and action/ctrl reads exactly those two blocks — no deserialization of video frames or metadata.
| Component | Language | Lines | Status |
|---|---|---|---|
wshard Python package |
Python | ~4,000 | Production |
@wshard/core npm package |
TypeScript | ~3,500 | Production |
shard Go package |
Go | ~2,000 | Production |
| Golden test fixtures | Go generator | 553 | Verified |
| DeepData trajectory bridge | Python | ~500 | Production |
Python: 103 tests, TypeScript: 15 tests, all passing. Go wshard is covered by the core shard package's test suite.
WShard rides on Shard, a general-purpose binary container format (like a simplified ZIP with aligned data sections). WShard is Shard with role = 0x05.
Offset Size Content
────────────────────────────────────────────────
0x00 4 Magic: "SHRD" (0x53 0x48 0x52 0x44)
0x04 1 Version: 0x02
0x05 1 Role: 0x05 (WShard)
0x06 2 Flags (LE uint16)
0x08 1 Alignment (0, 16, 32, 64 bytes)
0x09 1 Compression default (0=none, 1=zstd, 2=lz4)
0x0A 2 Index entry size: 48 (LE uint16)
0x0C 4 Entry count (LE uint32)
0x10 8 String table offset (LE uint64)
0x18 8 Data section offset (LE uint64)
0x20 8 Schema offset (LE uint64, 0 if absent)
0x28 8 Total file size (LE uint64)
0x30 16 Reserved (zeroed)
────────────────────────────────────────────────
0x40 N×48 Index entries
var String table
pad Alignment padding (0x00)
var Data blocks (each aligned)
Total header: 64 bytes. Fixed. Parseable with a single read(64) call.
Each block in the file has one index entry:
Offset Size Field
────────────────────────────────────
0x00 8 Name hash (xxHash64 of UTF-8 name)
0x08 4 Name offset into string table (LE)
0x0C 2 Name length (LE)
0x0E 2 Flags (LE) — bit 0: compressed,
bit 1: zstd, bit 2: lz4
0x10 8 Data offset in file (absolute, LE)
0x18 8 Disk size (compressed, LE)
0x20 8 Original size (uncompressed, LE)
0x28 4 CRC32C checksum (of uncompressed data)
0x2C 2 Content type (0=raw, 2=JSON)
0x2E 2 Reserved
O(1) lookup. The name hash in each index entry allows binary search or hash-table lookup without touching the string table. For a file with 50 blocks, finding signal/joint_pos requires reading 64 bytes (header) + scanning 50 × 8-byte hashes.
Per-block compression. Each block carries its own compression flag. Video blocks can use zstd at high ratios. Small scalar blocks (reward, done) stay uncompressed. The reader detects compression from the entry flags — no header-level assumption needed.
Alignment. Data blocks start at 32-byte or 64-byte boundaries. This means mmap() + pointer cast gives you an AVX-aligned float32 array with zero copies. The Go reader uses this for memory-mapped reads; Python can use it with np.frombuffer() on mmap'd files.
Checksums. CRC32C (Castagnoli polynomial 0x82F63B78) on uncompressed data. This is the hardware-accelerated CRC on x86 (_mm_crc32_*) and ARM (__crc32c*). Go's crc32.Castagnoli, Python's crc32c package, and the TypeScript implementation all use this polynomial.
Names are hierarchical paths separated by /:
| Prefix | Purpose | Examples |
|---|---|---|
meta/ |
JSON metadata | meta/wshard, meta/episode, meta/channels |
signal/ |
Ground truth observations | signal/rgb, signal/joint_pos |
action/ |
Agent actions | action/ctrl, action/gripper |
omen/ |
Model predictions | omen/joint_pos/dreamer |
uncert/ |
Uncertainty estimates | uncert/joint_pos/dreamer/std |
residual/ |
Compressed residuals | residual/joint_pos/sign2nddiff |
time/ |
Timestamps | time/ticks, time/timestamps_ns |
reward |
Reward signal (no prefix) | reward |
done |
Termination flags | done |
This naming convention is semantic, not syntactic. The reader uses it to route blocks to the correct Episode fields. Action blocks go to ep.actions, signal blocks to ep.observations, omen blocks to ep.omens.
The meta/wshard block contains a timebase object describing the episode's time axis:
{
"timebase": {
"type": "ticks",
"tick_hz": 30.0
}
}| Field | Type | Description |
|---|---|---|
type |
string |
"ticks" (fixed-rate) or "timestamps_ns" (variable-rate wall clock) |
tick_hz |
float64 |
Ticks per second. Only meaningful when type == "ticks". |
When type == "ticks", each timestep t corresponds to real time t / tick_hz seconds. When type == "timestamps_ns", the time/timestamps_ns block contains per-timestep nanosecond timestamps (int64 LE).
Multi-modal observations use a two-level signal path: signal/{group}/{modality}:
| Modality | Constant | Example block name |
|---|---|---|
| RGB camera | rgb |
signal/cam0/rgb |
| Depth sensor | depth |
signal/cam0/depth |
| Language | language |
signal/cmd/language |
| Proprioception | proprioception |
signal/arm/proprioception |
| Audio | audio |
signal/mic/audio |
| Video | video |
signal/cam0/video |
| Point cloud | pointcloud |
signal/lidar/pointcloud |
Latent action embeddings and codebook indices use the omen/ namespace:
| Block name pattern | Description |
|---|---|
omen/latent_action/{model} |
Latent action embeddings from model |
omen/latent_action_codebook/{model} |
Discrete codebook indices for latent actions |
13 types, matching the union of numpy, PyTorch, and Go primitive types:
| WShard | Size | numpy | Notes |
|---|---|---|---|
f32 |
4 | float32 | Default for most signals |
f64 |
8 | float64 | High-precision physics |
f16 |
2 | float16 | Inference outputs |
bf16 |
2 | bfloat16* | Training-native (requires ml_dtypes) |
i64 |
8 | int64 | Timestamps, large indices |
i32 |
4 | int32 | Action indices, labels |
i16 |
2 | int16 | Quantized signals |
i8 |
1 | int8 | Quantized weights |
u64 |
8 | uint64 | Hashes, addresses |
u32 |
4 | uint32 | Indices, masks |
u16 |
2 | uint16 | Depth images |
u8 |
1 | uint8 | RGB pixels, raw bytes |
bool |
1 | bool_ | Done/termination flags |
*bf16 uses ml_dtypes.bfloat16 when available, falls back to uint16 to preserve byte layout.
Every world model, every RL agent, every robot learning system consumes the same thing: episodes. Sequences of (observation, action, reward, done) tuples collected from environments. The datasets are large (DROID: 76K demonstrations, 2TB+), multi-modal (cameras + joint states + force/torque + audio), and heterogeneous (different robots, different environments, different sampling rates).
There is no standard format for this data. Teams use:
| Format | Limitations |
|---|---|
| HDF5 | Single-writer lock. No streaming append. Poor compression control. Python-centric. |
| RLDS/TFDS | TensorFlow dependency. Rigid schema. Google ecosystem lock-in. |
| Parquet + MP4 (LeRobot v3) | Great for publishing. Poor for training-time random access. Two files per episode. |
| NPZ (DreamerV3) | No metadata. No compression choice. No cross-language support. |
| MCAP (Foxglove) | ROS-oriented. Message-based, not tensor-based. |
| Custom | Every lab rolls their own. No interop. |
1. Per-block compression in a flat binary.
An episode with 5 camera streams and 20 scalar channels shouldn't compress everything the same way. WShard lets you zstd the video at level 19 and leave the 40-byte reward vector uncompressed. Each block carries its own compression flag — the reader auto-detects from the index entry bits.
# Write with per-block compression control
writer = WShardStreamWriter(path, "ep_001", channels=[
{"id": "rgb", "dtype": "u8", "shape": [84, 84, 3], "compression": "zstd"},
{"id": "joint", "dtype": "f32", "shape": [7]}, # no compression needed
])2. Streaming append with crash safety.
Robot data collection runs for hours. If the process crashes at minute 47, you lose everything with HDF5 (corrupted file). WShard's streaming writer uses a reserve-write-finalize pattern:
- Write to
episode.wshard.partial - Reserve space for the header (rewritten at finalization)
- Append timesteps incrementally
- On success: atomic
rename()toepisode.wshard - On crash:
.partialfile is deleted or identifiable as incomplete
with WShardStreamWriter(path, "ep_001", channels) as w:
w.begin_episode()
for t in range(T):
w.write_timestep({"state": obs, "ctrl": act}, reward=r, done=d)
w.end_episode() # atomic finalize3. Cross-language without serialization overhead.
The same binary file is readable from Go, Python, and TypeScript. No protobuf compilation step. No schema registry. No code generation.
- Go robots write
.wshardfiles during data collection - Python training scripts read them directly with
load_wshard() - TypeScript dashboards visualize episodes in the browser
All three implementations agree on CRC32C checksums (0x9a71bb4c for "hello"), xxHash64 name hashes, dtype sizes, and block layout. This is verified by golden file tests: Go generates .wshard files, Python and TypeScript read them and assert byte-level correctness.
4. Chunked episodes for distributed training.
A 10-minute manipulation episode at 30Hz with 3 cameras is ~2GB. You don't want that as a single file on a networked filesystem. WShard splits episodes into chunks with a manifest shard that tracks continuity:
writer = ChunkedEpisodeWriter("data/ep_001", "ep_001", chunk_size_t=1000)
for chunk in chunks:
writer.write_chunk(chunk)
manifest = writer.finalize_manifest()
# Validation catches gaps, duplicates, and discontinuities
validate_chunk_continuity(manifest)Each chunk is a standalone .wshard file with chunk_index, total_chunks, and timestep_range in its metadata. The manifest shard (role=0x04) ties them together.
5. Semantic lanes for model training.
WShard doesn't just store raw data. It has dedicated namespaces for the artifacts of model training:
- omen/ — Model predictions stored alongside ground truth for comparison
- uncert/ — Uncertainty estimates (ensemble variance, dropout entropy)
- residual/ — Sign2ndDiff compressed deltas between prediction and truth
This means a single .wshard file can contain the ground truth episode AND the model's predictions about that episode, enabling offline evaluation without joins across separate files.
- No server. It's a file format. Put the files on S3, NFS, local SSD — doesn't matter.
- No training loop. Use PyTorch DataLoader, JAX data pipeline, whatever you want.
- No schema enforcement. Blocks are named byte arrays. The convention layer is advisory.
- No video codec. Camera data is stored as raw tensors. If you want H.264, encode before writing.
- No query language. For search-by-similarity, use the DeepData bridge (see below).
Cross-language file formats sound simple. They are not. The wshard codebase had three critical interop bugs that would have caused silent data corruption in production:
| Bug | What Happened | Impact |
|---|---|---|
| CRC32 IEEE vs Castagnoli | Python/TS used polynomial 0xEDB88320. Go used 0x82F63B78. |
Every checksum validation fails silently or rejects valid files |
| FNV-1a vs xxHash64 | Python/TS used inline FNV-1a hash. Go used xxhash.Sum64String(). |
O(1) name lookup returns wrong block or misses entirely |
| bf16 → float32 reinterpret | Python mapped bf16 (2 bytes) to numpy float32 (4 bytes). | Every bf16 tensor has wrong shape and corrupted values |
These are exactly the bugs that unit tests don't catch. Each implementation passes its own tests. The bug only appears when Go writes a file and Python reads it. This is Class 6 (Cross-Language Parity Drift) from the CoGS testing philosophy.
Golden file testing. A standalone Go program (golden/generate.go) writes three .wshard files using the authoritative Go shard implementation:
| File | Purpose |
|---|---|
simple_episode.wshard |
Basic episode: obs [T,4], actions [T,2], reward, done |
dtype_zoo.wshard |
All 13 dtypes exercised |
per_block_compressed.wshard |
Zstd-compressed blocks with 100 timesteps |
omen_uncert.wshard |
Omen predictions, uncertainty estimates, sign2nddiff residuals |
multimodal.wshard |
Multi-modal observations (RGB + proprioception groups) |
latent_action.wshard |
Latent action embeddings and codebook indices |
Python and TypeScript tests read these files and assert:
- CRC32C checksums match
golden_hashes.json - xxHash64 of known strings match
- Dtype sizes match
- Episode metadata (id, env_id, length) parses correctly
- Tensor shapes and values are correct
- Compressed blocks decompress correctly
# From test_interop.py — golden file parity test
def test_golden_simple_episode_loads():
ep = load_wshard(GOLDEN_DIR / "simple_episode.wshard")
assert ep.id == "golden_simple"
assert ep.env_id == "TestEnv-v1"
assert ep.length == 10
assert ep.observations["state"].data.shape == (10, 4)
assert ep.actions["ctrl"].data.shape == (10, 2)
np.testing.assert_allclose(ep.observations["state"].data[0], [0, 1, 2, 3])All three implementations now produce identical outputs for reference inputs:
CRC32C("hello") = 0x9a71bb4c (Go, Python, TypeScript)
xxHash64("signal/obs") = 0x86f8c8413116a0ae
xxHash64("meta/manifest") = 0x9a191dcd325813d3
These values are committed in golden/golden_hashes.json and verified by CI.
Python uses crc32c (C extension, hardware-accelerated on x86/ARM) and xxhash (C extension wrapping xxHash):
import crc32c
import xxhash
def compute_crc32(data: bytes) -> int:
return crc32c.crc32c(data)
def name_hash(name: str) -> int:
return xxhash.xxh64(name.encode("utf-8")).intdigest()TypeScript uses a pure-JS CRC32C table and xxhash-wasm (WebAssembly):
// CRC32C with Castagnoli polynomial
const CRC32C_TABLE = makeCrc32Table(0x82f63b78);
export function crc32C(data: Uint8Array): number { ... }
// xxHash64 via WASM (async init)
import xxhashWasm from 'xxhash-wasm';
let _xxh64: ((s: string) => bigint) | null = null;
export async function initXxHash(): Promise<void> {
const hasher = await xxhashWasm();
_xxh64 = (s) => hasher.h64(s);
}Go uses the standard library and cespare/xxhash/v2:
var crc32cTable = crc32.MakeTable(crc32.Castagnoli)
func computeChecksum(data []byte) uint32 {
return crc32.Checksum(data, crc32cTable)
}
func nameHash(name string) uint64 {
return xxhash.Sum64String(name)
}WShard files stay on disk as the authoritative store. The deepdata_bridge module indexes episode metadata and observation embeddings into DeepData (a vector database) so callers can retrieve episodes by behavioural similarity:
from wshard.deepdata_bridge import TrajectoryIngestor, TrajectoryRetriever
# Index episodes
ingestor = TrajectoryIngestor("http://deepdata:8080", embedder=my_embedder)
ingestor.ingest_episode("episodes/ep_001.wshard")
# Search by behavioral similarity
retriever = TrajectoryRetriever("http://deepdata:8080", embedder=my_embedder)
results = retriever.search_similar_episodes(
query_obs=current_observation,
top_k=10,
env_id="ManipulationEnv-v2",
min_length=100,
reward_range=(0.8, 1.0),
)
# Returns EpisodeRef(episode_id, file_path, score)
# Caller loads wshard file directly for data accessHits return episode references; the caller reads the .wshard file directly for bulk data.
| Item | Path |
|---|---|
| Python package | cogs/shard/wshard/py/wshard/ |
| TypeScript package | cogs/shard/wshard/js/src/ |
| Go shard package | cogs/shard/go/shard/ |
| Golden fixtures | cogs/shard/wshard/golden/ |
| Python tests | cogs/shard/wshard/py/tests/ |
| TypeScript tests | cogs/shard/wshard/js/tests/ |
| Go tests | cogs/shard/go/shard/*_test.go |
| DeepData bridge | cogs/shard/wshard/py/wshard/deepdata_bridge.py |
Python (pyproject.toml):
numpy>=1.20— Array operationscrc32c>=2.3— Hardware-accelerated CRC32Cxxhash>=3.0— xxHash64 name hashingzstandard>=0.21.0— Zstd compressionlz4>=4.0.0— LZ4 compression- Optional:
ml-dtypes>=0.3(bf16),h5py>=3.0(HDF5 import),torch>=2.0(PyTorch tensors)
TypeScript (package.json):
@bokuweb/zstd-wasm— Zstd compression via WebAssemblyfflate— Deflate/LZ4 compressionxxhash-wasm— xxHash64 via WebAssembly
Go (go.mod):
github.com/cespare/xxhash/v2— xxHash64github.com/klauspost/compress— Zstd and LZ4- Standard library
hash/crc32— CRC32C (Castagnoli)