Add socket and gRPC server modes for shared model deployment #10
Merged
Plan files for two new optional worker modes:
- socket_plan.md: Unix socket model server (one model copy, N client workers)
- gRpc_plan.md: gRPC model server with protobuf, streaming, and remote support
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
Both plan files updated with:
- Multi-server architecture (servers[] array with per-server workers as weight)
- WorkerPool-level load balancing via idle-worker queue (no dedicated LB process)
- gRPC built-in round_robin policy for homogeneous clusters
- Usage examples for all modes: process, thread, socket, grpc, and their single-server, pre-running, remote, multi-server, and round-robin variants
- Options reference tables and updated file/implementation lists
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
dtype (quantization), pooling, and normalize must be identical across all servers in a multi-server setup — different values would produce incompatible embedding vectors. Moved these out of per-server entries and into the top-level options table in both plan files. Per-server entries now only contain routing/hardware config: address/socketPath, workers, device, provider. https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
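The split between shared and per-server options can be illustrated with a small sketch. The helper name `expandServers`, the shape of the options object, and the sample values (`'mean'`, the host names) are assumptions made for illustration; only the option names themselves come from the commit message above.

```javascript
// Hypothetical helper, not part of the PR. It illustrates the rule that
// dtype, pooling, and normalize are shared across all servers, while each
// servers[] entry carries only routing and hardware settings.
const SHARED_KEYS = ['dtype', 'pooling', 'normalize'];

function expandServers(options) {
  const slots = [];
  for (const server of options.servers) {
    for (const key of SHARED_KEYS) {
      if (key in server) {
        throw new Error(`"${key}" must be set at the top level, not per server`);
      }
    }
    const workers = server.workers ?? 1; // workers acts as the server's weight
    for (let i = 0; i < workers; i++) {
      // Every worker slot inherits the same embedding-affecting options.
      slots.push({
        ...server,
        dtype: options.dtype,
        pooling: options.pooling,
        normalize: options.normalize,
      });
    }
  }
  return slots;
}

const slots = expandServers({
  dtype: 'fp16', pooling: 'mean', normalize: true,
  servers: [
    { address: 'gpu-host:50051', workers: 3, device: 'gpu' },
    { address: 'localhost:50051', workers: 1, device: 'cpu' },
  ],
});
console.log(slots.length); // 4
```

Rejecting a per-server `dtype` outright, rather than silently overriding it, is one possible design; it surfaces a misconfiguration before any incompatible vectors are produced.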
The primary value of socket mode is sharing one loaded model across multiple independent OS processes — not single-process memory savings (which process mode with concurrency:1 already covers). Updated overview and usage section to lead with the shared daemon use case, add server-outlives-client pattern, and remove the stale mention of cross-process sharing from the LB section. https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
Pipeline.dispose() confirmed available in @huggingface/transformers:
Pipeline.dispose() → model.dispose() → session.release() per ONNX session
Cleanly frees GPU VRAM and CPU RAM; reload uses the same pipeline() call.
Changes to both plan files:
- Add --idle-timeout CLI arg to server scripts (ms, optional)
- Add ensureLoaded() / idle setInterval pattern with extractor.dispose()
- Update graceful shutdown to call extractor?.dispose() before exit
- Add idleTimeout to Options Reference tables
- gRpc plan: handlers updated to async, calling ensureLoaded() before inference
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
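A minimal sketch of the ensureLoaded() / idle-check pattern described in this commit, under stated assumptions: `createIdleManager` and `checkIdle` are hypothetical names, and `loadModel` stands in for the real pipeline() call so the example runs without a model. It also assumes callers are serialized, so it does not guard against two overlapping loads.

```javascript
// Sketch only: loadModel is a stand-in for pipeline(), and the object it
// returns only needs an async dispose() method.
function createIdleManager(loadModel, idleTimeoutMs) {
  let extractor = null;
  let lastUsed = 0;

  // Load the model on first use (or after an idle offload) and record use.
  async function ensureLoaded(now = Date.now()) {
    if (!extractor) extractor = await loadModel();
    lastUsed = now;
    return extractor;
  }

  // Intended to be called periodically, for example from setInterval.
  // Disposes the model once it has been idle for at least idleTimeoutMs.
  async function checkIdle(now = Date.now()) {
    if (extractor && now - lastUsed >= idleTimeoutMs) {
      const stale = extractor;
      extractor = null; // the next ensureLoaded() call reloads
      await stale.dispose();
      return true;
    }
    return false;
  }

  return { ensureLoaded, checkIdle, isLoaded: () => extractor !== null };
}
```

In a server script this would typically be wired up as `setInterval(() => manager.checkIdle(), intervalMs)`, with each request handler awaiting `manager.ensureLoaded()` before inference.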
src/proto/embedder.proto:
- EmbedderService with Embed (unary) and EmbedStream (server-streaming) RPCs
- EmbedRequest, EmbedResponse, EmbedChunk, FloatVec message types
src/socket-model-server.js:
- NDJSON protocol over Unix socket (named pipe on Windows)
- Serial task queue (transformers.js is not concurrency-safe)
- Stale socket file cleanup on startup
- Idle timeout: extractor.dispose() → model.dispose() → session.release()
- Signals {"type":"ready"} to stdout when bound and model is loaded
- Graceful shutdown via SIGTERM/SIGINT
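The NDJSON framing and the serial task queue listed above can be sketched as two small helpers. This is an illustration, not the server's actual code: the function names and the `id` field in the usage note are assumptions, and chunks are assumed to arrive as strings (for example after `socket.setEncoding('utf8')`).

```javascript
// Splits an incoming stream into complete JSON lines. A partial line is
// buffered until its terminating "\n" arrives in a later chunk.
function createNdjsonParser(onMessage) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    let newline;
    while ((newline = buffer.indexOf('\n')) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) onMessage(JSON.parse(line));
    }
  };
}

// Runs async tasks strictly one at a time, in arrival order.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    const result = tail.then(task);
    tail = result.catch(() => {}); // a failed task must not block later ones
    return result;
  };
}
```

A connection handler could then call `feed(chunk)` on each `data` event and pass every parsed request to `enqueue(() => embed(request))`, so that requests from all clients reach the model one at a time.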
src/grpc-model-server.js:
- Dynamic proto loading via @grpc/proto-loader (no protoc required)
- Embed (unary) and EmbedStream (server-streaming) handlers
- loadingPromise lock prevents duplicate pipeline() calls on concurrent reload
- Idle timeout with same dispose chain as socket server
- Signals {"type":"ready"} to stdout when bound and model is loaded
- Graceful shutdown via server.tryShutdown()
package.json: @grpc/grpc-js and @grpc/proto-loader added to dependencies
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
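The loadingPromise lock mentioned for grpc-model-server.js can be sketched as follows. `createLoader` and `unload` are hypothetical names, and `loadModel` stands in for the real pipeline() call; the point is only that concurrent callers share one in-flight load.

```javascript
// Sketch of a load lock: concurrent ensureLoaded() calls await the same
// promise instead of each starting their own load.
function createLoader(loadModel) {
  let extractor = null;
  let loadingPromise = null;

  async function ensureLoaded() {
    if (extractor) return extractor;
    if (!loadingPromise) {
      loadingPromise = loadModel()
        .then((model) => { extractor = model; return model; })
        .finally(() => { loadingPromise = null; }); // allow a retry after failure
    }
    return loadingPromise;
  }

  // Stand-in for the idle offload path; real code would also dispose().
  function unload() { extractor = null; }

  return { ensureLoaded, unload };
}
```

Unlike the socket server, which serializes work through its queue, gRPC handlers can run concurrently, which is why a lock of this kind is useful there.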
…ates
package.json:
- npm run daemon → socket-model-server.js (Xenova/all-MiniLM-L6-v2)
- npm run server → grpc-model-server.js (Xenova/all-MiniLM-L6-v2)
- npm run server-bench → bench/server-bench.js
bench/server-bench.js:
- Spawns socket and gRPC servers as subprocesses, waits for ready signal
- Raw socket client: sends all batches up front, collects NDJSON results
- Raw gRPC client: concurrent batch calls via Promise.all over HTTP/2
- Process/thread baseline runners for comparison
- Reports startup time (spawn→ready) and embedding throughput separately
- Options: --model, --batch-size, --dtype, --sample-size, --skip-*
README.md:
- Updated description and features list with socket/gRPC/multi-server
- Updated "How it works" diagram showing all four modes
- Added Socket daemon mode section with daemon usage and code examples
- Added gRPC server mode section with local and remote examples
- Added Multi-server load balancing section
- Added server-bench documentation under Performance Optimizations
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
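The spawn→ready measurement can be sketched roughly like this. `waitForReady` is a hypothetical helper, not the benchmark's actual code; it assumes only what the commit messages state, namely that a server prints a `{"type":"ready"}` line to stdout once it is bound and the model is loaded.

```javascript
// Spawns a command and resolves with the child and its startup time once a
// {"type":"ready"} line appears on stdout. Rejects on timeout or early exit.
async function waitForReady(command, args, timeoutMs = 30000) {
  const { spawn } = await import('node:child_process');
  const started = Date.now();
  const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'inherit'] });

  return new Promise((resolve, reject) => {
    let buffer = '';
    const timer = setTimeout(() => {
      child.kill();
      reject(new Error('server did not become ready in time'));
    }, timeoutMs);

    child.stdout.setEncoding('utf8');
    child.stdout.on('data', (chunk) => {
      buffer += chunk;
      const lines = buffer.split('\n');
      buffer = lines.pop(); // keep the trailing partial line
      for (const line of lines) {
        let message;
        try { message = JSON.parse(line); } catch { continue; } // skip log noise
        if (message && message.type === 'ready') {
          clearTimeout(timer);
          resolve({ child, startupMs: Date.now() - started });
        }
      }
    });
    child.once('error', (err) => { clearTimeout(timer); reject(err); });
    // 'close' fires after stdout has been fully read, so a ready line printed
    // just before exit is still seen first.
    child.once('close', (code) => {
      clearTimeout(timer);
      reject(new Error(`server exited before ready (code ${code})`));
    });
  });
}
```

Measuring startup separately from throughput, as the benchmark does, keeps the one-time model load from distorting the per-batch numbers.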
Move default model (Xenova/all-MiniLM-L6-v2) into parseArgs defaults in both server scripts instead of hardcoding it in the npm scripts. This allows all arguments to be forwarded cleanly via npm's -- separator without duplicate-flag errors from util.parseArgs.
npm run daemon # default model, CPU
npm run daemon -- --model my/model # custom model
npm run daemon -- --idle-timeout 600000 # with idle offload
npm run server -- --address 0.0.0.0:50051 --dtype fp16 --device gpu
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
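As a rough illustration of keeping the default model inside the parser rather than in the npm script, the sketch below uses the `default` field of `util.parseArgs` options. `parseServerArgs` is a hypothetical wrapper, and only two of the server flags are shown.

```javascript
// Sketch: the default model lives in the parseArgs option table, so the npm
// script can forward user arguments without supplying --model itself.
async function parseServerArgs(argv) {
  const { parseArgs } = await import('node:util');
  const { values } = parseArgs({
    args: argv,
    options: {
      model: { type: 'string', default: 'Xenova/all-MiniLM-L6-v2' },
      'idle-timeout': { type: 'string' }, // milliseconds; optional
    },
  });
  return {
    model: values.model,
    idleTimeout: values['idle-timeout'] ? Number(values['idle-timeout']) : null,
  };
}
```

A server script would call this as `parseServerArgs(process.argv.slice(2))`; anything placed after npm's `--` separator ends up in that array.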
Added argument tables and usage examples for both server commands, covering --model, --socket/--address, --pooling, --normalize, --dtype, --device, --provider, --token, --cache-dir, --idle-timeout. Shows the npm -- passthrough pattern with practical examples. https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
… automatic tuning instructions
worker-pool.js:
- 'socket' and 'grpc' branches in the constructor set _WorkerClass=null
instead of importing the worker classes at module level
- initialize() resolves _WorkerClass via dynamic import() just before
spinning up workers — @grpc/grpc-js is never loaded unless mode:'grpc'
is actually used; same for socket-worker.js
socket_plan.md / gRpc_plan.md:
- Document the lazy-load pattern and the reason for it
- Update constructor and initialize() snippets to reflect the null
placeholder and the dynamic import in initialize()
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
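The lazy-load pattern from this commit, reduced to a sketch. `LazyPool` and the `workerModules` map are hypothetical; the real constructor and initialize() in worker-pool.js do more. The idea shown is only that the constructor stores a null placeholder and the module is imported just before workers are created.

```javascript
// Hypothetical, trimmed-down pool: only the lazy-loading part is shown.
class LazyPool {
  constructor(mode, workerModules) {
    this.mode = mode;
    this._workerModules = workerModules; // mode -> module specifier
    this._WorkerClass = null;            // resolved later, in initialize()
  }

  async initialize() {
    if (this._WorkerClass === null) {
      const specifier = this._workerModules[this.mode];
      if (!specifier) throw new Error(`unknown mode: ${this.mode}`);
      // The module is loaded only when this mode is actually used.
      const mod = await import(specifier);
      this._WorkerClass = mod.default ?? mod;
    }
    return this._WorkerClass;
  }
}
```

With this shape, a user who never selects mode 'grpc' never pays the cost of loading the gRPC client library.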
package.json:
- Move @grpc/grpc-js and @grpc/proto-loader from dependencies to
optionalDependencies — installed by default but skippable with
npm install --omit=optional
worker-pool.js:
- Wrap the grpc dynamic import in try/catch; if ERR_MODULE_NOT_FOUND
mentions 'grpc', throw a clear actionable message telling the user
to run: npm install @grpc/grpc-js @grpc/proto-loader
README.md:
- Add gRPC dependencies section under Installation explaining
optionalDependencies, --omit=optional, and the error message
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
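One way the try/catch around the optional import might look. `importOptional` is a hypothetical helper and the error wording is illustrative; the commit only states that an ERR_MODULE_NOT_FOUND mentioning 'grpc' is turned into an actionable message.

```javascript
// Sketch: translate a missing optional dependency into an install hint,
// and let every other import failure propagate unchanged.
async function importOptional(specifier, installHint) {
  try {
    return await import(specifier);
  } catch (err) {
    const missing = err && err.code === 'ERR_MODULE_NOT_FOUND' &&
      String(err.message).includes(specifier);
    if (missing) {
      throw new Error(
        `Optional dependency "${specifier}" is not installed. Run: ${installHint}`,
        { cause: err },
      );
    }
    throw err;
  }
}
```

For the gRPC case this would be called as `importOptional('@grpc/grpc-js', 'npm install @grpc/grpc-js @grpc/proto-loader')`. Checking that the message names the requested package avoids mislabeling a failure that originates inside some other module.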
…r.start()
- socket-model-server.js: default SOCKET_PATH now uses \\.\pipe\... on win32
instead of a .sock file (which Windows doesn't support)
- bench/server-bench.js: same fix for the benchmark's ephemeral socket path;
also replaced dynamic import('os') with a static top-level import
- grpc-model-server.js: removed server.start() call after bindAsync — it is
no longer necessary in @grpc/grpc-js and emits a DeprecationWarning
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
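The platform-dependent default path can be sketched as a pure function. The name `defaultSocketPath`, the pipe name `embedder`, and the explicit `platform`/`tmpDir` parameters are assumptions made so the example is easy to test; real code would more likely read `process.platform` and `os.tmpdir()` directly.

```javascript
// Sketch: Windows uses a named pipe under \\.\pipe\, other platforms use a
// .sock file in a temporary directory.
function defaultSocketPath(name, platform, tmpDir) {
  if (platform === 'win32') {
    return `\\\\.\\pipe\\${name}`;
  }
  return `${tmpDir.replace(/\/+$/, '')}/${name}.sock`;
}

console.log(defaultSocketPath('embedder', 'win32', ''));     // \\.\pipe\embedder
console.log(defaultSocketPath('embedder', 'linux', '/tmp')); // /tmp/embedder.sock
```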
Summary
This PR adds two new optional server modes (socket and grpc) to the embedder, enabling shared model deployment across multiple OS processes or over the network. These modes complement the existing process (isolated workers) and thread (in-process workers) modes.
Key Changes
New Server Modes
- Socket mode (mode: 'socket'): Runs a persistent Unix socket daemon that loads the model once and serves embedding requests from any number of connected clients. Ideal for multiple independent OS processes sharing a single model copy.
- gRPC mode (mode: 'grpc'): Exposes the embedding pipeline as a typed HTTP/2 gRPC service with Protocol Buffer serialization. Supports both unary and server-streaming RPCs, works locally or remotely, and enables cross-language client integration.
New Files
- src/socket-model-server.js: Long-lived socket daemon that loads the model once and processes requests from multiple clients via an internal FIFO queue (since transformers.js is not concurrency-safe)
- src/grpc-model-server.js: gRPC server implementing EmbedderService with Embed (unary) and EmbedStream (server-streaming) RPCs
- src/proto/embedder.proto: Protocol Buffer service definition for gRPC
- src/socket-worker.js: Drop-in replacement for ChildProcessWorker that connects to a socket server
- src/grpc-worker.js: Drop-in replacement for ChildProcessWorker that connects to a gRPC server
- bench/server-bench.js: Benchmark comparing socket and gRPC server throughput against baseline process/thread modes
Modified Files
- src/worker-pool.js: Added constructor branches for socket and grpc modes; new _startServers() method to spawn and manage server subprocesses; _createWorker() distributes workers across multiple servers
- src/embeder.js: Added JSDoc for new options (grpcAddress, servers, grpcLoadBalancing, socketPath, autoStartServer)
- README.md: Updated feature list and architecture diagram to show all four modes
- package.json: Added @grpc/grpc-js and @grpc/proto-loader dependencies; added server-bench script
Notable Implementation Details
- Shared idle-worker queue: Both socket and gRPC modes support multi-server load balancing via a single idleWorkers queue. Faster servers (GPU) naturally receive more tasks as their workers return to the queue sooner.
- Model idle offload: Both server modes support --idle-timeout, which releases GPU/CPU memory after inactivity (via Pipeline.dispose() and session.release()) and transparently reloads the model on the next request.
- No code generation: gRPC uses @grpc/proto-loader to load .proto files dynamically at runtime, avoiding the need for a protoc build step.
- Worker interface compatibility: SocketWorker and GrpcWorker implement the same postMessage, terminate, and event interface as existing workers, so WorkerPool._dispatch() and _removeWorker() require no changes.
- Graceful shutdown: Both servers handle SIGTERM/SIGINT, drain in-flight requests, dispose the model, and exit cleanly.
This is purely additive: no existing public APIs change, and all four modes coexist without modification to the core embedding pipeline.
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
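The shared idle-worker queue described under Notable Implementation Details can be illustrated with a toy scheduler. `IdleWorkerQueue`, `submit`, and the `run(task)` worker method are hypothetical names chosen for this sketch and do not mirror the WorkerPool API; the sketch only shows why faster servers end up with more tasks.

```javascript
// Toy scheduler: a task always goes to whichever worker is idle first, so a
// worker that finishes sooner re-enters the queue sooner and gets more work.
class IdleWorkerQueue {
  constructor(workers) {
    this.idleWorkers = [...workers];
    this.pending = [];
  }

  submit(task) {
    return new Promise((resolve, reject) => {
      this.pending.push({ task, resolve, reject });
      this._dispatch();
    });
  }

  _dispatch() {
    while (this.idleWorkers.length > 0 && this.pending.length > 0) {
      const worker = this.idleWorkers.shift();
      const { task, resolve, reject } = this.pending.shift();
      worker.run(task)
        .then(resolve, reject)
        .finally(() => {
          this.idleWorkers.push(worker); // back to the queue when done
          this._dispatch();
        });
    }
  }
}
```

No per-server weights or latency measurements are needed in this scheme: the balancing falls out of completion order alone, which matches the "no dedicated LB process" design noted in the plan files.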