Add socket and gRPC server modes for shared model deployment #10

Merged
jsilvanus merged 18 commits into main from claude/socket-grpc-architecture-u4NRI
Apr 14, 2026

Conversation

@jsilvanus
Owner

Summary

This PR adds two new optional server modes (socket and grpc) to the embedder, enabling shared model deployment across multiple OS processes or over the network. These modes complement the existing process (isolated workers) and thread (in-process workers) modes.

Key Changes

New Server Modes

  • Socket mode (mode: 'socket'): Runs a persistent Unix socket daemon that loads the model once and serves embedding requests from any number of connected clients. Ideal for multiple independent OS processes sharing a single model copy.

  • gRPC mode (mode: 'grpc'): Exposes the embedding pipeline as a typed HTTP/2 gRPC service with Protocol Buffer serialization. Supports both unary and server-streaming RPCs, works locally or remotely, and enables cross-language client integration.

New Files

  • src/socket-model-server.js — Long-lived socket daemon that loads the model once and processes requests from multiple clients via an internal FIFO queue (since transformers.js is not concurrency-safe)
  • src/grpc-model-server.js — gRPC server implementing EmbedderService with Embed (unary) and EmbedStream (server-streaming) RPCs
  • src/proto/embedder.proto — Protocol Buffer service definition for gRPC
  • src/socket-worker.js — Drop-in replacement for ChildProcessWorker that connects to a socket server
  • src/grpc-worker.js — Drop-in replacement for ChildProcessWorker that connects to a gRPC server
  • bench/server-bench.js — Benchmark comparing socket and gRPC server throughput against baseline process/thread modes
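
The FIFO queue mentioned for src/socket-model-server.js can be pictured as a promise chain in which each task starts only after the previous one settles. The following is a minimal sketch of that idea, not the code from this PR; `SerialQueue`, `enqueue`, and `pending` are hypothetical names.

```javascript
// Hypothetical sketch: run async tasks one at a time, in arrival order.
class SerialQueue {
  constructor() {
    this._tail = Promise.resolve(); // settles when the last queued task finishes
    this.pending = 0;               // tasks queued or currently running
  }

  // Queue `task` (a function returning a value or a promise) and return its result.
  enqueue(task) {
    this.pending += 1;
    const result = this._tail.then(() => task());
    // Keep the chain alive even if a task rejects, so later tasks still run.
    this._tail = result.then(
      () => { this.pending -= 1; },
      () => { this.pending -= 1; },
    );
    return result;
  }
}

// Usage: the second task starts only after the first has resolved.
const queue = new SerialQueue();
const order = [];
queue.enqueue(async () => { order.push('a:start'); await null; order.push('a:end'); });
queue.enqueue(async () => { order.push('b:start'); });
```

A server built this way would wrap each inference call in `enqueue`, so requests from different clients never overlap inside the model.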

Modified Files

  • src/worker-pool.js — Added constructor branches for socket and grpc modes; new _startServers() method to spawn and manage server subprocesses; _createWorker() distributes workers across multiple servers
  • src/embeder.js — Added JSDoc for new options (grpcAddress, servers, grpcLoadBalancing, socketPath, autoStartServer)
  • README.md — Updated feature list and architecture diagram to show all four modes
  • package.json — Added @grpc/grpc-js and @grpc/proto-loader dependencies; added server-bench script

Notable Implementation Details

  • Shared idle-worker queue: Both socket and gRPC modes support multi-server load balancing via a single idleWorkers queue. Faster servers (GPU) naturally receive more tasks as their workers return to the queue sooner.

  • Model idle offload: Both server modes accept --idle-timeout. After that period of inactivity the server releases GPU/CPU memory via Pipeline.dispose(), which calls session.release() on each ONNX session, and transparently reloads the model on the next request.

  • No code generation: gRPC uses @grpc/proto-loader to load .proto files dynamically at runtime, avoiding the need for a protoc build step.

  • Worker interface compatibility: SocketWorker and GrpcWorker implement the same postMessage, terminate, and event interface as existing workers, so WorkerPool._dispatch() and _removeWorker() require no changes.

  • Graceful shutdown: Both servers handle SIGTERM/SIGINT, drain in-flight requests, dispose the model, and exit cleanly.
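
To illustrate the shared idle-worker queue from the list above: when every worker returns to one common queue after finishing a task, a faster worker comes back sooner and is therefore picked more often. The sketch below simulates that with timers. It is an assumption-laden illustration, not the WorkerPool code; `IdleQueuePool`, `makeWorker`, and the delays are all hypothetical.

```javascript
// Hypothetical sketch: one shared idle queue feeding workers of different speeds.
class IdleQueuePool {
  constructor(workers) {
    this.idleWorkers = [...workers]; // every worker starts idle
    this.tasks = [];                 // tasks waiting for a free worker
  }

  submit(task) {
    return new Promise((resolve, reject) => {
      this.tasks.push({ task, resolve, reject });
      this._dispatch();
    });
  }

  _dispatch() {
    while (this.idleWorkers.length > 0 && this.tasks.length > 0) {
      const worker = this.idleWorkers.shift();
      const { task, resolve, reject } = this.tasks.shift();
      worker.run(task).then(resolve, reject).finally(() => {
        this.idleWorkers.push(worker); // back of the queue once free
        this._dispatch();
      });
    }
  }
}

// Stand-in worker that takes `delayMs` per task and counts what it handled.
function makeWorker(name, delayMs) {
  return {
    name,
    handled: 0,
    run(task) {
      this.handled += 1;
      return new Promise((resolve) => setTimeout(() => resolve(`${name}:${task}`), delayMs));
    },
  };
}

const fast = makeWorker('gpu', 1);
const slow = makeWorker('cpu', 200);
const pool = new IdleQueuePool([fast, slow]);
const done = Promise.all([1, 2, 3, 4, 5, 6].map((n) => pool.submit(n)));
```

No weights or scheduler are involved here: the skew toward the fast worker falls out of the queue discipline alone.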

This is purely additive — no existing public APIs change, and all four modes coexist without modification to the core embedding pipeline.

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV

claude and others added 18 commits April 14, 2026 14:56
Plan files for two new optional worker modes:
- socket_plan.md: Unix socket model server (one model copy, N client workers)
- gRpc_plan.md: gRPC model server with protobuf, streaming, and remote support

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
Both plan files updated with:
- Multi-server architecture (servers[] array with per-server workers as weight)
- WorkerPool-level load balancing via idle-worker queue (no dedicated LB process)
- gRPC built-in round_robin policy for homogeneous clusters
- Usage examples for all modes: process, thread, socket, grpc, and their
  single-server, pre-running, remote, multi-server, and round-robin variants
- Options reference tables and updated file/implementation lists

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
dtype (quantization), pooling, and normalize must be identical across all
servers in a multi-server setup — different values would produce incompatible
embedding vectors. Moved these out of per-server entries and into the
top-level options table in both plan files. Per-server entries now only
contain routing/hardware config: address/socketPath, workers, device, provider.

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
The primary value of socket mode is sharing one loaded model across multiple
independent OS processes — not single-process memory savings (which process
mode with concurrency:1 already covers). Updated overview and usage section
to lead with the shared daemon use case, add server-outlives-client pattern,
and remove the stale mention of cross-process sharing from the LB section.

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
Pipeline.dispose() confirmed available in @huggingface/transformers:
  Pipeline.dispose() → model.dispose() → session.release() per ONNX session
Cleanly frees GPU VRAM and CPU RAM; reload uses the same pipeline() call.

Changes to both plan files:
- Add --idle-timeout CLI arg to server scripts (ms, optional)
- Add ensureLoaded() / idle setInterval pattern with extractor.dispose()
- Update graceful shutdown to call extractor?.dispose() before exit
- Add idleTimeout to Options Reference tables
- gRpc plan: handlers updated to async, calling ensureLoaded() before inference
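
The ensureLoaded() and idle-offload pattern listed above might look roughly like the following. This is a sketch under stated assumptions: `createModelHolder`, `offloadIfIdle`, and `isLoaded` are hypothetical names, and `loadPipeline` is an injected stand-in for the real pipeline() call so the example runs without the model library.

```javascript
// Hypothetical sketch of lazy load plus idle offload. `loadPipeline` stands in
// for the real pipeline() call and must resolve to an object with dispose().
function createModelHolder(loadPipeline, idleTimeoutMs) {
  let extractor = null;      // loaded pipeline, or null when offloaded
  let loadingPromise = null; // shared by concurrent callers during a (re)load
  let lastUsed = Date.now();

  async function ensureLoaded() {
    lastUsed = Date.now();
    if (extractor) return extractor;
    if (!loadingPromise) {
      loadingPromise = Promise.resolve(loadPipeline())
        .then((p) => { extractor = p; return p; })
        .finally(() => { loadingPromise = null; });
    }
    return loadingPromise;
  }

  // Dispose the model if it has been unused for at least idleTimeoutMs.
  async function offloadIfIdle(now = Date.now()) {
    if (!extractor || now - lastUsed < idleTimeoutMs) return false;
    const toDispose = extractor;
    extractor = null;
    await toDispose.dispose();
    return true;
  }

  return { ensureLoaded, offloadIfIdle, isLoaded: () => extractor !== null };
}

// Usage: two concurrent callers trigger a single load.
let loads = 0;
const holder = createModelHolder(
  async () => { loads += 1; return { dispose: async () => {} }; },
  1000,
);
const demo = Promise.all([holder.ensureLoaded(), holder.ensureLoaded()]);
```

A server would typically call `offloadIfIdle()` from a setInterval timer and call `ensureLoaded()` at the top of each request handler.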

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
src/proto/embedder.proto:
  - EmbedderService with Embed (unary) and EmbedStream (server-streaming) RPCs
  - EmbedRequest, EmbedResponse, EmbedChunk, FloatVec message types

src/socket-model-server.js:
  - NDJSON protocol over Unix socket (named pipe on Windows)
  - Serial task queue (transformers.js is not concurrency-safe)
  - Stale socket file cleanup on startup
  - Idle timeout: extractor.dispose() → model.dispose() → session.release()
  - Signals {"type":"ready"} to stdout when bound and model is loaded
  - Graceful shutdown via SIGTERM/SIGINT
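
For readers unfamiliar with NDJSON framing: each message is one JSON object terminated by a newline, and a reader has to buffer partial lines because socket chunks can split a message anywhere. The sketch below shows that buffering. `createNdjsonParser` is a hypothetical helper, and the `{"id":1,"texts":[...]}` request shape is invented for illustration; only the `{"type":"ready"}` message appears in this PR.

```javascript
// Hypothetical sketch: split a stream of string chunks into NDJSON messages.
// Assumes chunks are strings, e.g. after socket.setEncoding('utf8').
function createNdjsonParser(onMessage) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    let newline;
    while ((newline = buffer.indexOf('\n')) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) onMessage(JSON.parse(line)); // skip blank lines
    }
  };
}

// Usage: the second message arrives split across two chunks.
const received = [];
const feed = createNdjsonParser((msg) => received.push(msg));
feed('{"type":"ready"}\n{"id":1,"te');
feed('xts":["hello"]}\n');
```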

src/grpc-model-server.js:
  - Dynamic proto loading via @grpc/proto-loader (no protoc required)
  - Embed (unary) and EmbedStream (server-streaming) handlers
  - loadingPromise lock prevents duplicate pipeline() calls on concurrent reload
  - Idle timeout with same dispose chain as socket server
  - Signals {"type":"ready"} to stdout when bound and model is loaded
  - Graceful shutdown via server.tryShutdown()

package.json: @grpc/grpc-js and @grpc/proto-loader added to dependencies

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
…ates

package.json:
  - npm run daemon  → socket-model-server.js (Xenova/all-MiniLM-L6-v2)
  - npm run server  → grpc-model-server.js   (Xenova/all-MiniLM-L6-v2)
  - npm run server-bench → bench/server-bench.js

bench/server-bench.js:
  - Spawns socket and gRPC servers as subprocesses, waits for ready signal
  - Raw socket client: sends all batches up front, collects NDJSON results
  - Raw gRPC client: concurrent batch calls via Promise.all over HTTP/2
  - Process/thread baseline runners for comparison
  - Reports startup time (spawn→ready) and embedding throughput separately
  - Options: --model, --batch-size, --dtype, --sample-size, --skip-*

README.md:
  - Updated description and features list with socket/gRPC/multi-server
  - Updated "How it works" diagram showing all four modes
  - Added Socket daemon mode section with daemon usage and code examples
  - Added gRPC server mode section with local and remote examples
  - Added Multi-server load balancing section
  - Added server-bench documentation under Performance Optimizations

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
Move default model (Xenova/all-MiniLM-L6-v2) into parseArgs defaults
in both server scripts instead of hardcoding it in the npm scripts.
This allows all arguments to be forwarded cleanly via npm's -- separator
without duplicate-flag errors from util.parseArgs.

  npm run daemon                                   # default model, CPU
  npm run daemon -- --model my/model               # custom model
  npm run daemon -- --idle-timeout 600000          # with idle offload
  npm run server -- --address 0.0.0.0:50051 --dtype fp16 --device gpu
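
As a rough illustration of keeping the default in parseArgs rather than in the npm script, the sketch below declares the model default inside the options table. It is written with require() so it runs as a standalone script; `parseServerArgs` is a hypothetical wrapper, and treating --idle-timeout as a string that is converted to a number is an assumption.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical sketch: defaults live in the option table, not in package.json.
function parseServerArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      model: { type: 'string', default: 'Xenova/all-MiniLM-L6-v2' },
      'idle-timeout': { type: 'string' }, // milliseconds, optional
    },
  });
  const idle = values['idle-timeout'];
  return {
    model: values.model,
    idleTimeoutMs: idle === undefined ? null : Number(idle),
  };
}

// Usage: no flags falls back to the default model.
const defaults = parseServerArgs([]);
const custom = parseServerArgs(['--model', 'my/model', '--idle-timeout', '600000']);
```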

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
Added argument tables and usage examples for both server commands,
covering --model, --socket/--address, --pooling, --normalize,
--dtype, --device, --provider, --token, --cache-dir, --idle-timeout.
Shows the npm -- passthrough pattern with practical examples.

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
worker-pool.js:
  - 'socket' and 'grpc' branches in the constructor set _WorkerClass=null
    instead of importing the worker classes at module level
  - initialize() resolves _WorkerClass via dynamic import() just before
    spinning up workers — @grpc/grpc-js is never loaded unless mode:'grpc'
    is actually used; same for socket-worker.js

socket_plan.md / gRpc_plan.md:
  - Document the lazy-load pattern and the reason for it
  - Update constructor and initialize() snippets to reflect the null
    placeholder and the dynamic import in initialize()

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
package.json:
  - Move @grpc/grpc-js and @grpc/proto-loader from dependencies to
    optionalDependencies — installed by default but skippable with
    npm install --omit=optional

worker-pool.js:
  - Wrap the grpc dynamic import in try/catch; if ERR_MODULE_NOT_FOUND
    mentions 'grpc', throw a clear actionable message telling the user
    to run: npm install @grpc/grpc-js @grpc/proto-loader
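
A sketch of the try/catch described above, assuming the check is done on the error's `code` and `message`. `loadGrpcModule` and the injectable `importModule` parameter are hypothetical (the parameter exists only so the failure path can be exercised without uninstalling anything), and the exact error wording is illustrative.

```javascript
// Hypothetical sketch: load an optional dependency lazily and fail with guidance.
async function loadGrpcModule(importModule = (specifier) => import(specifier)) {
  try {
    return await importModule('@grpc/grpc-js');
  } catch (err) {
    if (err && err.code === 'ERR_MODULE_NOT_FOUND' && /grpc/.test(String(err.message))) {
      throw new Error(
        "mode 'grpc' requires optional dependencies. " +
          'Run: npm install @grpc/grpc-js @grpc/proto-loader',
        { cause: err },
      );
    }
    throw err; // unrelated failure: surface it unchanged
  }
}

// Usage: simulate a missing package with a fake importer.
const missing = Object.assign(new Error("Cannot find package '@grpc/grpc-js'"), {
  code: 'ERR_MODULE_NOT_FOUND',
});
const attempt = loadGrpcModule(() => Promise.reject(missing)).catch((err) => err.message);
```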

README.md:
  - Add gRPC dependencies section under Installation explaining
    optionalDependencies, --omit=optional, and the error message

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
…r.start()

- socket-model-server.js: default SOCKET_PATH now uses \\.\pipe\... on win32
  instead of a .sock file (which Windows doesn't support)
- bench/server-bench.js: same fix for the benchmark's ephemeral socket path;
  also replaced dynamic import('os') with a static top-level import
- grpc-model-server.js: removed server.start() call after bindAsync — it is
  no longer necessary in @grpc/grpc-js and emits a DeprecationWarning
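
The platform-dependent default path from the first two items could be chosen along these lines. This is a sketch only: `defaultSocketPath`, the `embedder` name, and the use of the OS temp directory on non-Windows platforms are assumptions, not details taken from the PR.

```javascript
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: pick an IPC endpoint that net.Server.listen() accepts.
function defaultSocketPath(name, platform = process.platform) {
  if (platform === 'win32') {
    // Node's net module uses named pipes for IPC on Windows.
    return `\\\\.\\pipe\\${name}`;
  }
  // Elsewhere, a Unix domain socket file in the temp directory (assumed location).
  return path.join(os.tmpdir(), `${name}.sock`);
}

// Usage: the platform argument can be overridden to inspect either branch.
const windowsPath = defaultSocketPath('embedder', 'win32');
const unixPath = defaultSocketPath('embedder', 'linux');
```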

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
@jsilvanus jsilvanus merged commit c8cd33f into main Apr 14, 2026
3 checks passed
@jsilvanus jsilvanus deleted the claude/socket-grpc-architecture-u4NRI branch April 14, 2026 19:51

2 participants