Add socket and gRPC server modes for shared model deployment by jsilvanus · Pull Request #10 · jsilvanus/embedeer

jsilvanus · 2026-04-14T18:27:08Z

Summary

This PR adds two new optional server modes (socket and grpc) to the embedder, enabling shared model deployment across multiple OS processes or over the network. These modes complement the existing process (isolated workers) and thread (in-process workers) modes.

Key Changes

New Server Modes

Socket mode (mode: 'socket'): Runs a persistent Unix socket daemon that loads the model once and serves embedding requests from any number of connected clients. Ideal for multiple independent OS processes sharing a single model copy.
gRPC mode (mode: 'grpc'): Exposes the embedding pipeline as a typed HTTP/2 gRPC service with Protocol Buffer serialization. Supports both unary and server-streaming RPCs, works locally or remotely, and enables cross-language client integration.

New Files

src/socket-model-server.js — Long-lived socket daemon that loads the model once and processes requests from multiple clients via an internal FIFO queue (since transformers.js is not concurrency-safe)
src/grpc-model-server.js — gRPC server implementing EmbedderService with Embed (unary) and EmbedStream (server-streaming) RPCs
src/proto/embedder.proto — Protocol Buffer service definition for gRPC
src/socket-worker.js — Drop-in replacement for ChildProcessWorker that connects to a socket server
src/grpc-worker.js — Drop-in replacement for ChildProcessWorker that connects to a gRPC server
bench/server-bench.js — Benchmark comparing socket and gRPC server throughput against baseline process/thread modes
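
The FIFO queue mentioned for src/socket-model-server.js is not shown in this description. As a rough sketch of one way to serialize async work, assuming a promise-chain approach (`createSerialQueue` and `enqueue` are hypothetical names, not the server's actual code):

```javascript
// Hypothetical sketch: run async tasks strictly one at a time, in arrival order.
function createSerialQueue() {
  let tail = Promise.resolve(); // settles when the last queued task finishes

  return function enqueue(task) {
    // Start this task only after everything queued before it has settled.
    const result = tail.then(() => task());
    // Keep the chain alive even if this task rejects.
    tail = result.catch(() => {});
    return result; // the caller still sees this task's own result or error
  };
}
```

Each client request would be wrapped in a function and passed to `enqueue`, so only one inference call touches the pipeline at a time, regardless of how many clients are connected.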

Modified Files

src/worker-pool.js — Added constructor branches for socket and grpc modes; new _startServers() method to spawn and manage server subprocesses; _createWorker() distributes workers across multiple servers
src/embeder.js — Added JSDoc for new options (grpcAddress, servers, grpcLoadBalancing, socketPath, autoStartServer)
README.md — Updated feature list and architecture diagram to show all four modes
package.json — Added @grpc/grpc-js and @grpc/proto-loader dependencies; added server-bench script

Notable Implementation Details

Shared idle-worker queue: Both socket and gRPC modes support multi-server load balancing via a single idleWorkers queue. Faster servers (GPU) naturally receive more tasks as their workers return to the queue sooner.
Model idle offload: Both server modes support --idle-timeout to release GPU/CPU memory after inactivity and transparently reload the model on the next request. The release goes through Pipeline.dispose(), which ends in session.release() for each ONNX session.
No code generation: gRPC uses @grpc/proto-loader to load .proto files dynamically at runtime, avoiding the need for a protoc build step.
Worker interface compatibility: SocketWorker and GrpcWorker implement the same postMessage, terminate, and event interface as existing workers, so WorkerPool._dispatch() and _removeWorker() require no changes.
Graceful shutdown: Both servers handle SIGTERM/SIGINT, drain in-flight requests, dispose the model, and exit cleanly.
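
To illustrate why a single idle-worker queue favors faster servers, here is a small simulation. It is only a sketch: `simulate`, the worker names, and the per-task timings are made up for illustration and do not come from WorkerPool.

```javascript
// Hypothetical simulation: each worker stands in for a connection to one server.
function simulate(workers, taskCount) {
  const idle = [...workers]; // shared FIFO queue of idle workers
  const busy = []; // entries of the form { worker, doneAt }
  const handled = Object.fromEntries(workers.map((w) => [w.name, 0]));
  let now = 0;
  let remaining = taskCount;

  while (remaining > 0 || busy.length > 0) {
    // Hand pending tasks to whichever workers are idle right now.
    while (remaining > 0 && idle.length > 0) {
      const worker = idle.shift();
      busy.push({ worker, doneAt: now + worker.msPerTask });
      handled[worker.name] += 1;
      remaining -= 1;
    }
    // Advance to the next completion and return that worker to the queue.
    busy.sort((a, b) => a.doneAt - b.doneAt);
    const next = busy.shift();
    now = next.doneAt;
    idle.push(next.worker);
  }
  return handled;
}

const handled = simulate(
  [
    { name: "gpu", msPerTask: 10 }, // illustrative timings only
    { name: "cpu", msPerTask: 40 },
  ],
  10
);
console.log(handled);
```

No weights or scheduler are involved: the faster worker simply re-enters the queue more often, so it ends up with most of the tasks.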

This is purely additive — no existing public APIs change, and all four modes coexist without modification to the core embedding pipeline.

https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV

Plan files for two new optional worker modes:
- socket_plan.md: Unix socket model server (one model copy, N client workers)
- gRpc_plan.md: gRPC model server with protobuf, streaming, and remote support
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV

Both plan files updated with:
- Multi-server architecture (servers[] array with per-server workers as weight)
- WorkerPool-level load balancing via idle-worker queue (no dedicated LB process)
- gRPC built-in round_robin policy for homogeneous clusters
- Usage examples for all modes: process, thread, socket, grpc, and their single-server, pre-running, remote, multi-server, and round-robin variants
- Options reference tables and updated file/implementation lists
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV

dtype (quantization), pooling, and normalize must be identical across all servers in a multi-server setup — different values would produce incompatible embedding vectors. Moved these out of per-server entries and into the top-level options table in both plan files. Per-server entries now only contain routing/hardware config: address/socketPath, workers, device, provider. https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
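
The split between top-level and per-server options could be checked with something like the sketch below. The option names (`servers`, `address`, `workers`, `device`, `dtype`, `pooling`, `normalize`) come from the text above; the helper `findMisplacedOptions` and all the values are hypothetical.

```javascript
// Options that shape the embedding vectors must be shared by every server.
const SHARED_KEYS = ["dtype", "pooling", "normalize"];

// Hypothetical helper: report per-server entries that carry shared options.
function findMisplacedOptions(servers) {
  const problems = [];
  servers.forEach((server, index) => {
    for (const key of SHARED_KEYS) {
      if (key in server) problems.push(`servers[${index}].${key}`);
    }
  });
  return problems;
}

// Illustrative configuration; the values are assumptions, not defaults.
const options = {
  mode: "grpc",
  dtype: "fp16",
  pooling: "mean",
  normalize: true,
  servers: [
    { address: "localhost:50051", workers: 4, device: "gpu" },
    { address: "localhost:50052", workers: 1, device: "cpu" },
  ],
};

console.log(findMisplacedOptions(options.servers));
```

Here `workers` acts as the per-server weight, while everything that affects vector compatibility lives once at the top level.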

The primary value of socket mode is sharing one loaded model across multiple independent OS processes — not single-process memory savings (which process mode with concurrency:1 already covers). Updated overview and usage section to lead with the shared daemon use case, add server-outlives-client pattern, and remove the stale mention of cross-process sharing from the LB section. https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV

Pipeline.dispose() confirmed available in @huggingface/transformers:
Pipeline.dispose() → model.dispose() → session.release() per ONNX session
Cleanly frees GPU VRAM and CPU RAM; reload uses the same pipeline() call.

Changes to both plan files:
- Add --idle-timeout CLI arg to server scripts (ms, optional)
- Add ensureLoaded() / idle setInterval pattern with extractor.dispose()
- Update graceful shutdown to call extractor?.dispose() before exit
- Add idleTimeout to Options Reference tables
- gRpc plan: handlers updated to async, calling ensureLoaded() before inference
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
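
The ensureLoaded() / idle-timer pattern described above might look roughly like the following. This is a sketch under assumptions: `createLazyModel`, `offloadIfIdle`, and `startIdleTimer` are hypothetical names, and `loadFn` stands in for the real pipeline() call.

```javascript
// Hypothetical helper; loadFn must resolve to an object with an async dispose().
function createLazyModel(loadFn, idleTimeoutMs) {
  let model = null;
  let loadingPromise = null; // shared by concurrent callers during a (re)load
  let lastUsed = Date.now();

  async function ensureLoaded() {
    lastUsed = Date.now();
    if (model) return model;
    if (!loadingPromise) {
      loadingPromise = Promise.resolve()
        .then(loadFn)
        .then((loaded) => {
          model = loaded;
          return loaded;
        })
        .finally(() => {
          loadingPromise = null;
        });
    }
    return loadingPromise;
  }

  // Dispose the model if it has been unused for at least idleTimeoutMs.
  async function offloadIfIdle(now = Date.now()) {
    if (!model || now - lastUsed < idleTimeoutMs) return false;
    const stale = model;
    model = null; // the next ensureLoaded() triggers a reload
    await stale.dispose();
    return true;
  }

  function startIdleTimer() {
    const timer = setInterval(() => offloadIfIdle().catch(console.error), idleTimeoutMs);
    timer.unref(); // do not keep the process alive just for this check
    return timer;
  }

  return { ensureLoaded, offloadIfIdle, startIdleTimer };
}
```

A real server would also need to avoid disposing while a request is still running, for example by only checking idleness between queued tasks. The sketch leaves that out.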


src/proto/embedder.proto:
- EmbedderService with Embed (unary) and EmbedStream (server-streaming) RPCs
- EmbedRequest, EmbedResponse, EmbedChunk, FloatVec message types

src/socket-model-server.js:
- NDJSON protocol over Unix socket (named pipe on Windows)
- Serial task queue (transformers.js is not concurrency-safe)
- Stale socket file cleanup on startup
- Idle timeout: extractor.dispose() → model.dispose() → session.release()
- Signals {"type":"ready"} to stdout when bound and model is loaded
- Graceful shutdown via SIGTERM/SIGINT

src/grpc-model-server.js:
- Dynamic proto loading via @grpc/proto-loader (no protoc required)
- Embed (unary) and EmbedStream (server-streaming) handlers
- loadingPromise lock prevents duplicate pipeline() calls on concurrent reload
- Idle timeout with same dispose chain as socket server
- Signals {"type":"ready"} to stdout when bound and model is loaded
- Graceful shutdown via server.tryShutdown()

package.json: @grpc/grpc-js and @grpc/proto-loader added to dependencies
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
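
For readers unfamiliar with NDJSON, the framing amounts to one JSON object per newline-terminated line. A minimal sketch of an encoder and a chunk-tolerant decoder follows; only the `{"type":"ready"}` message comes from the commit above, while `createNdjsonDecoder`, `encode`, and any other message shapes are hypothetical.

```javascript
// Encode one message as a single NDJSON line.
const encode = (message) => JSON.stringify(message) + "\n";

// Hypothetical decoder: socket data arrives in arbitrary chunks, so buffer
// text until a full line is available, then parse it.
function createNdjsonDecoder(onMessage) {
  let buffer = "";
  return function push(chunk) {
    buffer += chunk;
    let newline;
    while ((newline = buffer.indexOf("\n")) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) onMessage(JSON.parse(line)); // skip blank lines
    }
  };
}
```

The decoder expects string chunks, so a socket feeding it would typically call `setEncoding("utf8")` first to avoid splitting multi-byte characters across chunks.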

…ates

package.json:
- npm run daemon → socket-model-server.js (Xenova/all-MiniLM-L6-v2)
- npm run server → grpc-model-server.js (Xenova/all-MiniLM-L6-v2)
- npm run server-bench → bench/server-bench.js

bench/server-bench.js:
- Spawns socket and gRPC servers as subprocesses, waits for ready signal
- Raw socket client: sends all batches up front, collects NDJSON results
- Raw gRPC client: concurrent batch calls via Promise.all over HTTP/2
- Process/thread baseline runners for comparison
- Reports startup time (spawn→ready) and embedding throughput separately
- Options: --model, --batch-size, --dtype, --sample-size, --skip-*

README.md:
- Updated description and features list with socket/gRPC/multi-server
- Updated "How it works" diagram showing all four modes
- Added Socket daemon mode section with daemon usage and code examples
- Added gRPC server mode section with local and remote examples
- Added Multi-server load balancing section
- Added server-bench documentation under Performance Optimizations
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV

Move default model (Xenova/all-MiniLM-L6-v2) into parseArgs defaults in both server scripts instead of hardcoding it in the npm scripts. This allows all arguments to be forwarded cleanly via npm's -- separator without duplicate-flag errors from util.parseArgs.

    npm run daemon                                  # default model, CPU
    npm run daemon -- --model my/model              # custom model
    npm run daemon -- --idle-timeout 600000         # with idle offload
    npm run server -- --address 0.0.0.0:50051 --dtype fp16 --device gpu
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
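
Putting the default into the parser rather than the npm script could look like the sketch below. It assumes Node's built-in util.parseArgs with its `default` field; the function name `parseServerArgs` and the reduced option set are hypothetical, and the module is loaded with a dynamic import only to keep the snippet independent of the module system.

```javascript
// Hypothetical sketch: the default model lives in the parser, so
// `npm run daemon -- --model my/model` passes exactly one --model flag.
async function parseServerArgs(argv) {
  const { parseArgs } = await import("node:util");
  const { values } = parseArgs({
    args: argv,
    options: {
      model: { type: "string", default: "Xenova/all-MiniLM-L6-v2" },
      "idle-timeout": { type: "string" }, // milliseconds, optional
    },
  });
  return values;
}

// Typical call site: parseServerArgs(process.argv.slice(2))
```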

Added argument tables and usage examples for both server commands, covering --model, --socket/--address, --pooling, --normalize, --dtype, --device, --provider, --token, --cache-dir, --idle-timeout. Shows the npm -- passthrough pattern with practical examples. https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV

… automatic tuning instructions

worker-pool.js:
- 'socket' and 'grpc' branches in the constructor set _WorkerClass=null instead of importing the worker classes at module level
- initialize() resolves _WorkerClass via dynamic import() just before spinning up workers, so @grpc/grpc-js is never loaded unless mode:'grpc' is actually used; same for socket-worker.js

socket_plan.md / gRpc_plan.md:
- Document the lazy-load pattern and the reason for it
- Update constructor and initialize() snippets to reflect the null placeholder and the dynamic import in initialize()
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV

package.json:
- Move @grpc/grpc-js and @grpc/proto-loader from dependencies to optionalDependencies (installed by default but skippable with npm install --omit=optional)

worker-pool.js:
- Wrap the grpc dynamic import in try/catch; if ERR_MODULE_NOT_FOUND mentions 'grpc', throw a clear actionable message telling the user to run: npm install @grpc/grpc-js @grpc/proto-loader

README.md:
- Add gRPC dependencies section under Installation explaining optionalDependencies, --omit=optional, and the error message
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
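
The lazy import with an actionable error might be sketched as follows. `importOptional` and `loadGrpc` are hypothetical names and the error wording is invented; only the dynamic import(), the ERR_MODULE_NOT_FOUND check, and the install command come from the commits above.

```javascript
// Hypothetical helper: import a package only when it is needed, and turn a
// "package not found" failure into a message that says how to fix it.
async function importOptional(specifier, installHint) {
  try {
    return await import(specifier);
  } catch (err) {
    if (err.code === "ERR_MODULE_NOT_FOUND" && err.message.includes(specifier)) {
      throw new Error(
        `Optional dependency "${specifier}" is not installed. Run: ${installHint}`,
        { cause: err }
      );
    }
    throw err; // anything else is a genuine failure
  }
}

// Called only on the mode:'grpc' path, so other modes never load gRPC.
function loadGrpc() {
  return importOptional("@grpc/grpc-js", "npm install @grpc/grpc-js @grpc/proto-loader");
}
```

Because the import happens inside a function rather than at module level, users who installed with --omit=optional pay no cost unless they actually select grpc mode.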

…s in README

…r.start()

- socket-model-server.js: default SOCKET_PATH now uses \\.\pipe\... on win32 instead of a .sock file (which Windows doesn't support)
- bench/server-bench.js: same fix for the benchmark's ephemeral socket path; also replaced dynamic import('os') with a static top-level import
- grpc-model-server.js: removed server.start() call after bindAsync; it is no longer necessary in @grpc/grpc-js and emits a DeprecationWarning
https://claude.ai/code/session_014CN6YqoLttXe3MTckDevUV
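
A platform-aware default path could be derived along these lines. This is a sketch: the function `defaultSocketPath`, its `name` parameter, and the `/tmp` fallback are assumptions, since the actual pipe name is elided in the commit message.

```javascript
// Hypothetical helper: Windows needs a named pipe under \\.\pipe\, while
// other platforms can use a socket file on disk.
function defaultSocketPath(name, platform = process.platform, tmpDir = "/tmp") {
  return platform === "win32"
    ? `\\\\.\\pipe\\${name}` // yields \\.\pipe\<name>
    : `${tmpDir}/${name}.sock`;
}

console.log(defaultSocketPath("demo-embedder"));
```

Either form can be passed to Node's net.createServer().listen() and net.connect(), which is why the same client and server code works on both platforms.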

claude and others added 18 commits April 14, 2026 14:56

chore: update package-lock.json after npm install

d5bd4b9


docs: update README for clarity on GPU support and model usage

9cdf7d9

docs: enhance README with programmatic performance tuning details and…

2d9c642

… automatic tuning instructions

fix: add peer dependency flag to relevant packages in package-lock.json

edeac4e

docs: clarify installation instructions and optional gRPC dependencie…

35b37e0

…s in README

docs: reorganize plans and reviews

f343230

jsilvanus merged commit c8cd33f into main Apr 14, 2026
3 checks passed

jsilvanus deleted the claude/socket-grpc-architecture-u4NRI branch April 14, 2026 19:51

jsilvanus merged 18 commits into main from claude/socket-grpc-architecture-u4NRI
