feat: support engine metrics from PROMETHEUS_MULTIPROC_DIR in grpc mode #1038

ConnorLi96 wants to merge 3 commits into main
Conversation
Signed-off-by: Scott Lee <scott@together.ai>
📝 Walkthrough

Added gRPC-only Prometheus multiprocess metric handling: the Python orchestrator creates and cleans up a temporary PROMETHEUS_MULTIPROC_DIR for workers; the Rust WorkerManager aggregates gRPC metrics by invoking a Python subprocess to read the multiprocess metrics and merges the results with HTTP worker metrics.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Orch as Python Orchestrator
    participant FS as Filesystem (PROMETHEUS_MULTIPROC_DIR)
    participant WorkerMgr as Rust WorkerManager
    participant PySub as Python subprocess (prometheus_client)
    participant HTTP as HTTP Workers
    participant GRPC as gRPC Workers
    Orch->>FS: mkdtemp() and set PROMETHEUS_MULTIPROC_DIR
    Orch->>GRPC: launch workers (env points to FS)
    Orch->>HTTP: launch HTTP workers
    WorkerMgr->>HTTP: fan_out GET /metrics (per-worker)
    HTTP-->>WorkerMgr: per-worker metrics -> MetricPacks
    alt gRPC workers exist
        WorkerMgr->>PySub: spawn python3 to run MultiProcessCollector (reads PROMETHEUS_MULTIPROC_DIR)
        PySub->>FS: read metrics files
        FS-->>PySub: metrics contents
        PySub-->>WorkerMgr: aggregated metrics text
        WorkerMgr->>WorkerMgr: add aggregated MetricPack (empty labels)
    end
    Orch->>FS: _cleanup_prometheus_dir() -> rmtree
    Orch-->>FS: PROMETHEUS_MULTIPROC_DIR cleared
```
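The multiprocess pattern the diagram describes can be sketched in a few lines of standalone Python. This is an illustrative example, not code from the PR; the metric name `smg_demo_requests` is hypothetical. The key constraints it demonstrates: the env var must be in the environment before `prometheus_client` is imported, worker-side increments land in per-process `.db` files under that directory, and `MultiProcessCollector` aggregates them into a fresh registry.

```python
import os
import tempfile

# The multiproc dir must be set BEFORE prometheus_client is imported,
# so the library switches to its mmap-backed value store.
mp_dir = tempfile.mkdtemp(prefix="smg_prometheus_")
os.environ["PROMETHEUS_MULTIPROC_DIR"] = mp_dir

from prometheus_client import CollectorRegistry, Counter, generate_latest
from prometheus_client.multiprocess import MultiProcessCollector

# A worker process would do this; increments are written to .db files in mp_dir.
requests = Counter("smg_demo_requests", "hypothetical demo counter")
requests.inc(3)

# The aggregation side: a fresh registry fed by MultiProcessCollector,
# which reads every .db file under PROMETHEUS_MULTIPROC_DIR.
registry = CollectorRegistry()
MultiProcessCollector(registry)
text = generate_latest(registry).decode()
print("smg_demo_requests_total" in text and "3.0" in text)
```

In the PR, the aggregation half of this sketch runs in a `python3` subprocess spawned by the Rust router rather than in the orchestrator itself.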
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request implements gRPC worker metrics collection by utilizing a temporary directory for Prometheus multiprocess data and a Python-based aggregation subprocess. The changes include setting up the environment in the Python orchestrator and updating the Rust worker manager to collect these metrics. Feedback suggests passing the specific Python executable path via an environment variable to ensure the metrics collector runs in the correct environment, rather than relying on a hardcoded 'python3' command.
```python
if getattr(self.args, "connection_mode", "grpc") == "grpc":
    self._prometheus_dir = tempfile.mkdtemp(prefix="smg_prometheus_")
    os.environ["PROMETHEUS_MULTIPROC_DIR"] = self._prometheus_dir
    logger.info(
        "Set PROMETHEUS_MULTIPROC_DIR=%s for gRPC metrics collection",
        self._prometheus_dir,
    )
```
To ensure the metrics collector uses the same Python environment as the orchestrator (which is critical when running in a virtual environment), consider passing sys.executable to the router via an environment variable.
Suggested change:

```diff
 if getattr(self.args, "connection_mode", "grpc") == "grpc":
     self._prometheus_dir = tempfile.mkdtemp(prefix="smg_prometheus_")
     os.environ["PROMETHEUS_MULTIPROC_DIR"] = self._prometheus_dir
+    os.environ["SMG_PYTHON_EXECUTABLE"] = sys.executable
     logger.info(
         "Set PROMETHEUS_MULTIPROC_DIR=%s for gRPC metrics collection",
         self._prometheus_dir,
     )
```
| "PROMETHEUS_MULTIPROC_DIR not set; cannot collect metrics from gRPC workers".to_string() | ||
| })?; | ||
|
|
||
| let output = tokio::process::Command::new("python3") |
Instead of hardcoding python3, use the Python executable path provided by the orchestrator if available. This ensures that the metrics collection subprocess runs in the correct environment (e.g., within a virtualenv).
Suggested change:

```diff
-let output = tokio::process::Command::new("python3")
+let python_exe = std::env::var("SMG_PYTHON_EXECUTABLE").unwrap_or_else(|_| "python3".to_string());
+let output = tokio::process::Command::new(python_exe)
```
```rust
let output = tokio::process::Command::new("python3")
    .args([
        "-c",
        "import sys\n\
from prometheus_client import CollectorRegistry, generate_latest\n\
from prometheus_client.multiprocess import MultiProcessCollector\n\
registry = CollectorRegistry()\n\
MultiProcessCollector(registry)\n\
sys.stdout.buffer.write(generate_latest(registry))\n",
    ])
    .env("PROMETHEUS_MULTIPROC_DIR", &dir)
    .output()
    .await
    .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
```
🔴 Important: This subprocess call has no timeout. The HTTP fan-out path uses REQUEST_TIMEOUT (5s), but the python3 subprocess can hang indefinitely (e.g., corrupted .db file in the multiproc dir, or python3 not on PATH causing a slow lookup). Since /metrics is typically polled by Prometheus every 15–30s, a hung subprocess will accumulate blocked tasks.
Consider wrapping with tokio::time::timeout:
Suggested change:

```diff
-let output = tokio::process::Command::new("python3")
-    .args([
-        "-c",
-        "import sys\n\
-from prometheus_client import CollectorRegistry, generate_latest\n\
-from prometheus_client.multiprocess import MultiProcessCollector\n\
-registry = CollectorRegistry()\n\
-MultiProcessCollector(registry)\n\
-sys.stdout.buffer.write(generate_latest(registry))\n",
-    ])
-    .env("PROMETHEUS_MULTIPROC_DIR", &dir)
-    .output()
-    .await
-    .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
+let output = tokio::time::timeout(
+    REQUEST_TIMEOUT,
+    tokio::process::Command::new("python3")
+        .args([
+            "-c",
+            "import sys\n\
+from prometheus_client import CollectorRegistry, generate_latest\n\
+from prometheus_client.multiprocess import MultiProcessCollector\n\
+registry = CollectorRegistry()\n\
+MultiProcessCollector(registry)\n\
+sys.stdout.buffer.write(generate_latest(registry))\n",
+        ])
+        .env("PROMETHEUS_MULTIPROC_DIR", &dir)
+        .output(),
+)
+.await
+.map_err(|_| "python3 prometheus collector timed out".to_string())?
+.map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
```
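For comparison, the same pattern (bounding a child process with a timeout and killing it on expiry so it cannot linger as a blocked task) can be sketched in stdlib Python asyncio. This is a hypothetical illustration, not code from the PR; the 5-second default mirrors the REQUEST_TIMEOUT mentioned above.

```python
import asyncio
import sys

async def run_collector_with_timeout(timeout_s: float = 5.0) -> bytes:
    # Spawn the child, then bound the wait the way tokio::time::timeout does;
    # on expiry, kill the child so the scrape handler never hangs.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "print('metrics')",
        stdout=asyncio.subprocess.PIPE,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise RuntimeError("prometheus collector timed out")
    return out

result = asyncio.run(run_collector_with_timeout())
print(result.decode().strip())
```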
```python
ports = _find_available_ports(self.args.worker_base_port, self.args.data_parallel_size)
host = self.args.worker_host

if getattr(self.args, "connection_mode", "grpc") == "grpc":
```
🟡 Nit: getattr(self.args, "connection_mode", "grpc") defaults to "grpc" when the attribute is absent. If an older or HTTP-only configuration doesn't set connection_mode at all, this will unnecessarily create a temp directory and set PROMETHEUS_MULTIPROC_DIR in the process environment — which could interfere with any other prometheus_client usage in the same process.
Consider defaulting to "http" (the safer no-op path), or checking for the attribute explicitly:

```python
if getattr(self.args, "connection_mode", None) == "grpc":
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65b14bb137
```python
        except (ProcessLookupError, OSError):
            pass

        self._cleanup_prometheus_dir()
```
Run Prometheus dir cleanup even when worker list is empty
In gRPC mode _launch_workers() creates self._prometheus_dir before launching subprocesses, but _cleanup_workers() returns early when self.workers is empty. If startup fails before the first worker is appended (or data_parallel_size is 0), the new cleanup path is skipped and the temp multiprocess directory is leaked for the rest of the process. Ensure _cleanup_prometheus_dir() also runs on the empty-worker path.
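The unconditional cleanup the comment asks for can be sketched as a standalone helper. This is illustrative only (the module-level function name `cleanup_prometheus_dir` is hypothetical, not the PR's actual method): it tolerates an absent directory and always unsets the env var, so it is safe to call on the empty-worker path too.

```python
import os
import shutil
import tempfile

def cleanup_prometheus_dir(path):
    """Remove the temp multiproc dir and unset the env var, tolerating absence."""
    if path and os.path.isdir(path):
        shutil.rmtree(path, ignore_errors=True)
    os.environ.pop("PROMETHEUS_MULTIPROC_DIR", None)

# Simulate the orchestrator's setup, then clean up unconditionally,
# exactly as the review asks for the case where no worker was appended.
d = tempfile.mkdtemp(prefix="smg_prometheus_")
os.environ["PROMETHEUS_MULTIPROC_DIR"] = d
cleanup_prometheus_dir(d)
print(os.path.isdir(d), "PROMETHEUS_MULTIPROC_DIR" in os.environ)
```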
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
bindings/python/src/smg/serve.py (1)
632-671: ⚠️ Potential issue | 🟡 Minor

Run Prometheus-dir cleanup even when worker launch fails before the first append.

Line 634 returns before the new cleanup path. If `_launch_workers()` creates `self._prometheus_dir` and the first `launcher.launch()` raises, the temp directory is leaked under `/tmp` and `PROMETHEUS_MULTIPROC_DIR` stays stale for the rest of the process.

🛠 Minimal fix

```diff
 def _cleanup_workers(self) -> None:
     """SIGTERM all worker process groups, wait, then SIGKILL stragglers."""
     if not self.workers:
+        self._cleanup_prometheus_dir()
         return
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 170e1229-4cee-46b9-9233-4ed27663bea1
📒 Files selected for processing (2)

- bindings/python/src/smg/serve.py
- model_gateway/src/core/worker_manager.rs
```rust
let output = tokio::process::Command::new("python3")
    .args([
        "-c",
        "import sys\n\
from prometheus_client import CollectorRegistry, generate_latest\n\
from prometheus_client.multiprocess import MultiProcessCollector\n\
registry = CollectorRegistry()\n\
MultiProcessCollector(registry)\n\
sys.stdout.buffer.write(generate_latest(registry))\n",
    ])
    .env("PROMETHEUS_MULTIPROC_DIR", &dir)
    .output()
    .await
    .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
```
🧩 Analysis chain: verification scripts were run against the repository (lightseekorg/smg) to inspect how serve.py spawns the router and whether SMG_PYTHON_EXECUTABLE is exported anywhere.
Don't hardcode python3 for the collector subprocess.
Line 94 can bypass the interpreter that launched smg serve. In venv/conda installs, python3 may resolve to a different environment or be missing entirely, so /engine_metrics fails even though the workers started successfully. The subprocess needs the same Python interpreter that serve.py runs under.
Export the interpreter from serve.py via an environment variable and read it in Rust, falling back to python3 if absent:
🛠 Suggested direction

```diff
-let output = tokio::process::Command::new("python3")
+let python = std::env::var("SMG_PYTHON_EXECUTABLE")
+    .unwrap_or_else(|_| "python3".to_string());
+let output = tokio::process::Command::new(&python)
```

Add this to bindings/python/src/smg/serve.py before launching the router:

```python
os.environ["SMG_PYTHON_EXECUTABLE"] = sys.executable
```

This mirrors the pattern already used for worker launchers, which correctly use `sys.executable` instead of hardcoding `python3`.
```rust
                metric_packs.push(MetricPack {
                    labels: vec![("worker_addr".into(), resp.url)],
                    metrics_text: text,
                });
            }
        }
    }
}

if has_grpc {
    match collect_prometheus_multiproc_metrics().await {
        Ok(text) => {
            metric_packs.push(MetricPack {
                labels: vec![],
                metrics_text: text,
```
Build MetricPack with its actual fields and types.
These literals do not match model_gateway/src/core/metrics_aggregator.rs: MetricPack is { labels: HashMap<String, String>, text: String }. As written, Lines 323-338 will not compile.
🛠 Proposed fix

```diff
-                metric_packs.push(MetricPack {
-                    labels: vec![("worker_addr".into(), resp.url)],
-                    metrics_text: text,
-                });
+                metric_packs.push(MetricPack::new(
+                    HashMap::from([("worker_addr".to_string(), resp.url)]),
+                    text,
+                ));
 ...
-            metric_packs.push(MetricPack {
-                labels: vec![],
-                metrics_text: text,
-            });
+            metric_packs.push(MetricPack::new(
+                HashMap::<String, String>::new(),
+                text,
+            ));
```
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```rust
                metric_packs.push(MetricPack {
                    labels: HashMap::from([("worker_addr".to_string(), resp.url)]),
                    text: text,
                });
            }
        }
    }
}

if has_grpc {
    match collect_prometheus_multiproc_metrics().await {
        Ok(text) => {
            metric_packs.push(MetricPack {
                labels: HashMap::new(),
                text: text,
            });
```
When the gRPC worker hasn't written any .db files yet (startup phase or no requests processed), generate_latest() returns empty bytes. Passing this empty string to parse_prometheus() triggers a parse-error WARN log on every scrape interval. Add guards at both the collector function (return Err for empty output) and the call site (skip empty text silently) to prevent log spam.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
```rust
if has_grpc {
    match collect_prometheus_multiproc_metrics().await {
        Ok(text) if !text.trim().is_empty() => {
```
🟡 Nit: This if !text.trim().is_empty() guard is now dead code. collect_prometheus_multiproc_metrics() (line 116-118 above) already returns Err("no metrics available from gRPC workers yet") when the text is empty, so Ok(text) can never contain an empty string. The Ok(_) arm on line 345 is therefore unreachable.
Consider simplifying to just Ok(text) => { ... } without the guard, or removing the Ok(_) arm entirely.
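Isolated from the surrounding async code, the empty-output guard itself is trivial; a standalone sketch (the function name `guard_metrics` is hypothetical, not from the PR) shows the behavior the commit describes: empty or whitespace-only collector output becomes an `Err` instead of reaching the Prometheus text parser.

```rust
// Treat empty collector output as an error instead of feeding it to the
// Prometheus text parser, which would log a WARN on every scrape.
fn guard_metrics(text: &str) -> Result<&str, String> {
    if text.trim().is_empty() {
        Err("no metrics available from gRPC workers yet".to_string())
    } else {
        Ok(text)
    }
}

fn main() {
    assert!(guard_metrics("").is_err());
    assert!(guard_metrics("   \n").is_err());
    assert!(guard_metrics("up 1\n").is_ok());
    println!("guards ok");
}
```

Because the guard lives inside the collector function, the call site's duplicate check is redundant, which is exactly the nit above.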
♻️ Duplicate comments (2)
model_gateway/src/core/worker_manager.rs (2)
327-342: ⚠️ Potential issue | 🔴 Critical

Build `MetricPack` with its actual fields and types.

This issue was already flagged in a previous review: the `MetricPack` struct uses `{ labels: HashMap<String, String>, text: String }`, but lines 327-330 and 340-343 use `labels: vec![...]` and `metrics_text: ...` (wrong field name). This code will not compile.

The correct construction is:

```rust
MetricPack {
    labels: HashMap::from([("worker_addr".to_string(), resp.url)]),
    text: text,
}
```

and for the gRPC case (no `worker_addr` label):

```rust
MetricPack {
    labels: HashMap::new(),
    text: text,
}
```
94-107: ⚠️ Potential issue | 🔴 Critical

Don't hardcode `python3` for the collector subprocess.

This issue was already flagged in a previous review: line 94 bypasses the interpreter that launched `smg serve`. In venv/conda installs, `python3` may resolve to a different environment or be missing entirely. The subprocess needs the same Python interpreter that `serve.py` runs under. Export the interpreter from `serve.py` via an environment variable (e.g., `SMG_PYTHON_EXECUTABLE = sys.executable`) and read it in Rust with a fallback to `python3`.
📒 Files selected for processing (1)
model_gateway/src/core/worker_manager.rs
Description

Problem

When TensorRT-LLM (or any backend) runs in gRPC mode, the SMG router cannot collect engine metrics via `/engine_metrics`. HTTP-mode workers expose a `/metrics` endpoint that the router scrapes directly, but gRPC workers have no HTTP endpoint. The router returns `500 All backend requests failed`.

Solution

Use the `prometheus_client` multiprocess pattern (the same approach sglang uses):

- Orchestrator (`serve.py`): create a temporary `PROMETHEUS_MULTIPROC_DIR` before launching workers, so worker processes inherit it and write `.db` files there.
- Router (`worker_manager.rs`): for gRPC workers, read `PROMETHEUS_MULTIPROC_DIR` and spawn a `python3` subprocess that aggregates the `.db` files via `MultiProcessCollector` and returns standard Prometheus text format.

Changes

- `bindings/python/src/smg/serve.py`: set the `PROMETHEUS_MULTIPROC_DIR` env var in `_launch_workers()` when `connection_mode == "grpc"`. Clean up the temp directory on shutdown.
- `model_gateway/src/core/worker_manager.rs`: add `collect_prometheus_multiproc_metrics()` for gRPC workers. Split `get_engine_metrics()` into HTTP (fan-out to `/metrics`) and gRPC (read from `PROMETHEUS_MULTIPROC_DIR`) paths.

Test Plan

Tested end-to-end with TensorRT-LLM (Qwen3-0.6B, pytorch backend) in gRPC mode: