Decentralized, pipeline-parallel training with automatic expert discovery, DiLoCo-style optimization, and resilient failover.
New to DistQat? We have two Quick Start guides: one for manual setup with a console-based workflow, and one for a simpler, more visual workflow. We recommend starting with the GUI Quick Start Guide; if you run into errors that are unclear or not well explained, fall back to the Manual Quick Start Guide.
For more information, continue reading below or refer to the full documentation:
- Overview
- Key Capabilities
- Architecture
- Run Modes
- Configuration
- Observability
- Resilience & Failover
- Troubleshooting & FAQ
- Roadmap & Inspirations
DistQat is a collaborative training framework for large models. Each participant contributes GPU/CPU capacity by hosting model shards that are chained into full pipelines on demand. Trainers stream batches from a shared-memory data server, dispatch forward/backward calls across the swarm, and use DiLoCo’s inner/outer optimization to synchronize replicas with less communication overhead.
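To make the inner/outer split concrete, here is a minimal sketch of the DiLoCo pattern in plain PyTorch with `torch.distributed` (illustrative only; DistQat's actual implementation lives in `src/distqat/distributed/optim/diloco.py` and averages over the swarm rather than a fixed process group):

```python
# Minimal sketch of the DiLoCo inner/outer pattern in plain PyTorch.
# Assumes torch.distributed is already initialised; DistQat's real optimizer
# (src/distqat/distributed/optim/diloco.py) averages over the swarm instead.
import torch
import torch.distributed as dist

def diloco_round(model, inner_opt, outer_opt, data_iter, loss_fn, inner_steps=50):
    # Remember where this round started.
    start = [p.detach().clone() for p in model.parameters()]

    # Inner loop: many cheap local steps, no communication at all.
    for _ in range(inner_steps):
        x, y = next(data_iter)
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()

    # Outer step: average each replica's parameter delta ("pseudo-gradient"),
    # rewind to the round's starting point, and let the outer optimizer
    # (typically SGD with Nesterov momentum) apply the averaged update.
    for p, p0 in zip(model.parameters(), start):
        delta = p0 - p.detach()
        dist.all_reduce(delta, op=dist.ReduceOp.SUM)
        delta /= dist.get_world_size()
        p.data.copy_(p0)
        p.grad = delta
    outer_opt.step()
    outer_opt.zero_grad()
```

The key property is that communication happens once per `inner_steps` batches instead of once per batch, which is what makes the scheme tolerant of slow or unreliable links.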
The design builds on lessons from projects such as OpenDiLoCo and SWARM parallelism, while focusing on:
- pipeline-parallel experts that can churn in and out,
- dynamic discovery through a libp2p-backed Distributed Hash Table (DHT),
- optional quantization,
- proactive failover.
- Pipeline-aware experts – `src/distqat/distributed/server.py` hosts stages advertised in the DHT, letting clients assemble full end-to-end executors.
- Adaptive trainers – `src/distqat/distributed/client.py` spawns one trainer per complete pipeline (`trainer.py`), supervises health, and restarts crashed peers.
- DiLoCo optimizer – `src/distqat/distributed/optim/diloco.py` implements local inner updates with periodic global averaging, giving robustness on unreliable links.
- Shared-memory data plane – `src/distqat/distributed/data_server.py` shares batches via `multiprocessing.shared_memory`, so trainers avoid redundant preprocessing (see the sketch after this list).
- Quantization hooks – `src/distqat/quant` exposes per-stage overrides to keep traffic manageable on commodity uplinks.
- Param mirroring – `src/distqat/distributed/param_mirror.py` keeps a CPU copy of expert weights and checkpoints for quick warm starts.
- W&B-first monitoring – `src/distqat/distributed/monitor.py` pushes fleet-wide metrics and pipeline health to a single run dashboard.
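The shared-memory hand-off can be pictured with a small, self-contained sketch. This is not the actual `data_server.py` protocol; the block name, batch shape, and single-batch hand-off are illustrative. One process publishes a preprocessed batch into a named `multiprocessing.shared_memory` block, and trainers on the same host attach to it without re-running preprocessing:

```python
# Simplified sketch of sharing a preprocessed batch across processes via
# multiprocessing.shared_memory. Names and shapes are illustrative.
import numpy as np
from multiprocessing import shared_memory

BATCH_SHAPE, BATCH_DTYPE = (64, 3, 224, 224), np.float32

def publish_batch(batch: np.ndarray, name: str = "distqat_batch_0"):
    shm = shared_memory.SharedMemory(create=True, size=batch.nbytes, name=name)
    view = np.ndarray(batch.shape, dtype=batch.dtype, buffer=shm.buf)
    view[:] = batch                      # single copy into shared memory
    return shm                           # keep alive; call shm.close()/unlink() when done

def read_batch(name: str = "distqat_batch_0") -> np.ndarray:
    shm = shared_memory.SharedMemory(name=name)  # attach, no data copy
    view = np.ndarray(BATCH_SHAPE, dtype=BATCH_DTYPE, buffer=shm.buf)
    batch = view.copy()                  # copy out before closing the mapping
    shm.close()
    return batch
```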
| Role | Binary | Description |
|---|---|---|
| Monitor | `monitor.py` | Seeds the DHT, records active multiaddresses, streams DiLoCo progress/metrics to Weights & Biases, and writes `initial_peers.txt` for bootstrap. |
| Client | `client.py` | Watches the DHT for complete pipelines, spawns/stops trainers, controls the shared data server, and triggers expert reassignment when gaps appear. |
| Data Server | `data_server.py` | Preloads datasets defined in `Config.data`, shuffles with `BufferedShuffleIterable`, and serves batches via shared memory pools. |
| Trainer | `trainer.py` | Runs DiLoCo inner loops, performs forward/backward via `SwarmModel`, and logs metrics back to the DHT (`LocalMetrics`). |
| Server / Expert | `server.py` | Hosts a model shard (e.g., `resnet.tail`) with optional quantization, advertises itself to the DHT, and reacts to reassignment signals. |
| Param Mirror (optional) | `param_mirror.py` | Periodically pulls weights/optimizer state from peers for checkpointing and crash recovery. |
- Monitor boots and starts a DHT node, recording its visible multiaddresses to `logs/<experiment>/initial_peers.txt`.
- Client connects to the same DHT, launches the shared-memory data server, and waits for complete pipelines.
- Experts register themselves (`SwarmServer.create`) with stage/expert UIDs. Auto-discovery back-fills gaps if a stage is missing.
- Client assembles pipelines from the DHT view, spawns one trainer per complete pipeline, and restarts trainers that exit unexpectedly.
- Trainer loops execute `inner_steps` local batches, participate in DiLoCo averaging rounds (`optimizers.get_diloco_optimizer_cls_kwargs`), and publish metrics.
- Failover loop detects incomplete pipelines: `ReassignmentMonitorThread` on each server listens for reassignment signals and gracefully restarts into the swarm (a minimal watchdog sketch follows below).
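The failover step can be pictured as a small watchdog loop. This is a hypothetical sketch only: the real `ReassignmentMonitorThread` in `server.py` reacts to DHT reassignment signals, whereas the callbacks here (`pipeline_is_complete`, `restart_into_open_slot`) are placeholders.

```python
# Hypothetical sketch of a reassignment watchdog. The two callables stand in
# for the real DHT queries and graceful server restart.
import threading

class ReassignmentWatchdog(threading.Thread):
    def __init__(self, pipeline_is_complete, restart_into_open_slot, poll_s=10.0):
        super().__init__(daemon=True)
        self.pipeline_is_complete = pipeline_is_complete
        self.restart_into_open_slot = restart_into_open_slot
        self.poll_s = poll_s
        self._stop = threading.Event()

    def run(self):
        # Poll until stopped; if the pipeline this server belongs to has a gap,
        # shut down gracefully and rejoin wherever an expert is needed.
        while not self._stop.wait(self.poll_s):
            if not self.pipeline_is_complete():
                self.restart_into_open_slot()

    def stop(self):
        self._stop.set()
```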
Best for iterating on configs or operations demos.
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python=3.10
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
source .venv/bin/activate
uv pip install --editable .
wandb login
python run_local.py  # defaults to configs/resnet18.yaml
```

`run_local.py` orchestrates the full stack (monitor → client → data server → baseline trainer → experts) and streams logs to `logs/<experiment_prefix>` (see the orchestration sketch below).
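Roughly speaking, that orchestration boils down to launching each role as a subprocess and redirecting its output to a per-role log file. The sketch below is illustrative only; the role commands and any flags they would need are assumptions, not `run_local.py`'s actual code.

```python
# Illustrative-only orchestration sketch: launch each role as a subprocess and
# redirect its output to a per-role log file. run_local.py's real logic also
# handles config plumbing, peer bootstrapping, and restarts.
import pathlib
import subprocess

log_dir = pathlib.Path("logs/resnet18")          # <experiment_prefix> assumed
log_dir.mkdir(parents=True, exist_ok=True)

roles = {                                        # commands are placeholders
    "monitor": ["python", "monitor.py"],
    "client": ["python", "client.py"],
    "server_0": ["python", "server.py"],
}

procs, logs = [], []
for name, cmd in roles.items():
    log = open(log_dir / f"{name}.log", "w")
    logs.append(log)
    procs.append(subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT))

try:
    for p in procs:
        p.wait()
finally:
    for log in logs:
        log.close()
```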
- Trainer + monitor host

  ```bash
  wandb login
  python start_trainer_client.py
  ```

  Copy the peer addresses shown in the standard output or written to `logs/<experiment_prefix>/initial_peers.txt`.

- Server hosts

  ```bash
  # initial peers obtained from the step above
  export INITIAL_PEERS='/ip4/<trainer_public_ip>/tcp/50000/p2p/<peer_id>'
  python start_servers.py \
    --initial-peers "${INITIAL_PEERS}"
  ```

  Repeat on as many machines as you like; each process will auto-assign to open pipeline slots or create new model replicas (experts).
Here is an example of how heterogeneous training looks with adaptive batch sizing. Currently, the batch size still has to be set manually.
- Trainer + monitor host

  ```bash
  wandb login
  python start_trainer_client.py
  ```

  Copy the peer addresses written to `logs/<experiment_prefix>/initial_peers.txt`.

- Server host with a GPU

  ```bash
  export INITIAL_PEERS='/ip4/<trainer_public_ip>/tcp/50000/p2p/<peer_id>'
  python start_servers.py \
    --initial-peers "${INITIAL_PEERS}" \
    --config-path "configs/resnet50.yaml"
  ```

- Server host with a CPU

  ```bash
  export INITIAL_PEERS='/ip4/<trainer_public_ip>/tcp/50000/p2p/<peer_id>'
  python start_servers.py \
    --initial-peers "${INITIAL_PEERS}" \
    --config-path "configs/resnet50.yaml" \
    --diloco-batch-size-per-step 1 \
    --device cpu
  ```

In the logs you should see the servers training at a similar speed: because each peer's batch size is matched to its performance, DiLoCo can average without being bottlenecked by waiting on slower peers.
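The batch-size alignment can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the throughput figures are invented, and as noted above the resulting `--diloco-batch-size-per-step` values still have to be set by hand.

```python
# Back-of-the-envelope batch-size alignment: pick each peer's batch size so
# that all peers take roughly the same wall-clock time per DiLoCo inner step.
# Throughput numbers below are invented for illustration.
throughput = {          # samples/second each peer can sustain
    "gpu_server": 512.0,
    "cpu_server": 16.0,
}
reference_peer, reference_batch = "gpu_server", 32

time_per_step = reference_batch / throughput[reference_peer]   # target seconds/step

for peer, sps in throughput.items():
    batch = max(1, round(sps * time_per_step))
    print(f"{peer}: batch_size_per_step ≈ {batch}")
# gpu_server ≈ 32, cpu_server ≈ 1 -> both finish an inner step in ~0.0625 s,
# so DiLoCo averaging never waits long for the slow peer.
```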
- Configs live in `configs/*.yaml`. The default `experiment_prefix`, DiLoCo parameters, and model pipeline (e.g. `resnet`, `gpt-neo`, `wav2vec`) live here.
- Pydantic models in `src/distqat/config.py` enforce structure and expose every field via CLI flags.
- Quantization – adjust `quant` with a `QuantConfiguration`, or disable it with `--disable-quant`.
- Datasets – update `DataConfig` to select CV/LLM/Speech tasks; the data server will automatically pick the right `collate_fn`.
- Param mirroring – enable/disable checkpoints via `param_mirror.enable` and tune `refresh_every`/`checkpoint_dir`.
To inspect the resolved config at runtime, log `cfg.model_dump()` from custom scripts.
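A minimal inspection script might look like the sketch below, assuming the pydantic `Config` model from `src/distqat/config.py` and a plain YAML load; the exact loading helper the project uses may differ.

```python
# Minimal sketch: load a YAML config into the pydantic model and dump the
# resolved values. The YAML-loading path is illustrative, not necessarily the
# project's own loader.
import yaml
from distqat.config import Config  # assumed import path

with open("configs/resnet18.yaml") as f:
    raw = yaml.safe_load(f)

cfg = Config.model_validate(raw)   # pydantic v2 validation
print(cfg.model_dump())            # resolved config as a plain dict
```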
- All components log to `logs/<experiment_prefix>/`. When started with `run_local.py` there is one file per role (e.g. `server_back_0.log`); when started with `start_servers.py`, servers log to files indexed `0` to `num_servers` (e.g. `server_0.log`).
- `monitor.py` streams fleet metrics (loss, SPS, alive peers) to your configured W&B project.
- Trainers publish per-step telemetry to the DHT (`experiment_prefix_metrics`), consumable by automation or custom dashboards.
- `ReassignmentMonitorThread` (within each server) watches for incomplete pipelines and triggers a controlled shutdown so the server can grab a new slot.
- Run `python test_failover.py` to simulate staged shutdowns and validate that trainers keep progressing.
- Parameter mirroring ensures fresh experts can `load_state_from_peers()` instead of falling back to random initialization when no other server peers are available (see the warm-start sketch below).
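The warm-start decision for a fresh expert can be sketched as follows; the `load_from_peers` callable and the checkpoint layout are placeholders for the real `server.py` / `param_mirror.py` machinery.

```python
# Hypothetical warm-start flow for a freshly launched expert. The callable
# load_from_peers and the checkpoint layout stand in for the real machinery.
import pathlib
import torch

def warm_start(model: torch.nn.Module, load_from_peers, checkpoint_dir: str = "checkpoints") -> str:
    try:
        load_from_peers(model)          # e.g. the swarm's load_state_from_peers()
        return "state loaded from peers"
    except Exception:
        ckpts = sorted(pathlib.Path(checkpoint_dir).glob("*.pt"))
        if ckpts:
            # Fall back to the newest checkpoint written by the param mirror.
            model.load_state_dict(torch.load(ckpts[-1], map_location="cpu"))
            return f"state loaded from {ckpts[-1].name}"
        return "random initialization"  # last resort
```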
- “Address already in use” when restarting locally – stale processes may still be running. Try `pkill -f distqat` to clean up orphaned processes, then relaunch.
- Trainers stall at “Auto-discovering pipeline gaps” – ensure at least one expert per stage is reachable and that the monitor host’s port (default 50000) is open to inbound traffic.
- Timeouts during averaging – increase `diloco.averaging_timeout` or reduce the batch size (`diloco.batch_size_per_step`) for slower links.
- No metrics in W&B – confirm `wandb login` succeeded everywhere and that the monitor is launched with `--wandb-run-id` when orchestrating multiple processes.
- How do I run a baseline without the swarm? – launch `start_trainer_client.py` to start the data server, then run `trainer.py --run-locally --trainer-id "-1" --config-path "configs/resnet18.yaml" --network-initial-peers "[\"/ip4/127.0.0.1/tcp/50000/p2p/<peer-id>\"]"` for a fully local model.