Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR restructures the resource statistics API by introducing a hierarchical Changes
Sequence Diagram(s)sequenceDiagram
actor Client
participant GatewayLB as "Gateway LB"
participant Agent1 as "Agent 1"
participant Agent2 as "Agent 2"
participant AgentN as "Agent N"
participant StatsUtil as "Stats Util"
Client->>GatewayLB: ResourceStatsDetail()
GatewayLB->>GatewayLB: start trace span\ninit Details map
par broadcast
GatewayLB->>Agent1: ResourceStats() [sub-span]
Agent1->>StatsUtil: GetResourceStats(ctx)
StatsUtil-->>Agent1: stats (hostname, ip, cgroup)
Agent1-->>GatewayLB: ResourceStats response -> Details[agent1]
GatewayLB->>Agent2: ResourceStats() [sub-span]
Agent2->>StatsUtil: GetResourceStats(ctx)
StatsUtil-->>Agent2: stats
Agent2-->>GatewayLB: Details[agent2]
GatewayLB->>AgentN: ResourceStats() [sub-span]
AgentN->>StatsUtil: GetResourceStats(ctx)
StatsUtil-->>AgentN: stats
AgentN-->>GatewayLB: Details[agentN]
end
GatewayLB->>GatewayLB: aggregate Details\nrecord per-call statuses
GatewayLB-->>Client: ResourceStatsDetail{ details: map[...] }
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~28 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip CodeRabbit can scan for known vulnerabilities in your dependencies using OSV Scanner.OSV Scanner will automatically detect and report security vulnerabilities in your project's dependencies. No additional configuration is required. |
|
[CHATOPS:HELP] ChatOps commands.
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3435 +/- ##
==========================================
+ Coverage 25.02% 25.97% +0.94%
==========================================
Files 544 572 +28
Lines 43800 50475 +6675
==========================================
+ Hits 10963 13109 +2146
- Misses 32108 36395 +4287
- Partials 729 971 +242 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
c3df358 to
2df183b
Compare
Deploying vald with
|
| Latest commit: |
63e5986
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://1d6d77f5.vald.pages.dev |
| Branch Preview URL: | https://feature-gateway-add-resource.vald.pages.dev |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
apis/grpc/v1/vald/vald.go (1)
57-65:⚠️ Potential issue | 🟠 MajorClient and server interfaces in the aggregate Vald helper are asymmetric.
Clientnow includesstats.StatsClientand is initialized properly inNewValdClient, butServerlacksstats.StatsServer, andRegisterValdServerdoes not register the stats service. This creates a runtime contract violation: clients advertise stats RPCs while servers registered viaRegisterValdServerreturnUNIMPLEMENTED. The issue propagates downstream to types that embedvald.Client, includinginternal/client/v1/client/agent/core/client.go.All four callers of
RegisterValdServer(gateways and agents) would need to manually register stats if they implement it—a burden that the aggregate helper should eliminate. Addstats.StatsServerandstats.UnimplementedStatsServerto the interfaces, embedstats.UnimplementedStatsServerinUnimplementedValdServer, and callstats.RegisterStatsServerinRegisterValdServer.Proposed change
type Server interface { + stats.StatsServer FlushServer IndexServer InsertServer ObjectServer RemoveServer SearchServer UpdateServer UpsertServer } type UnimplementedValdServer struct { + stats.UnimplementedStatsServer UnimplementedFlushServer UnimplementedIndexServer UnimplementedInsertServer UnimplementedObjectServer UnimplementedRemoveServer UnimplementedSearchServer UnimplementedUpdateServer UnimplementedUpsertServer } func RegisterValdServer(s *grpc.Server, srv Server) { + stats.RegisterStatsServer(s, srv) RegisterFlushServer(s, srv) RegisterIndexServer(s, srv) RegisterInsertServer(s, srv) RegisterObjectServer(s, srv) RegisterRemoveServer(s, srv) RegisterSearchServer(s, srv) RegisterUpdateServer(s, srv) RegisterUpsertServer(s, srv) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apis/grpc/v1/vald/vald.go` around lines 57 - 65, The aggregate Vald helper is asymmetric: Client includes stats.StatsClient while Server and RegisterValdServer do not expose or register the stats service, causing UNIMPLEMENTED at runtime; to fix, add stats.StatsServer to the Server interface, include stats.UnimplementedStatsServer in the UnimplementedValdServer struct, and update RegisterValdServer to call stats.RegisterStatsServer alongside existing registrations so servers created via NewValdClient/NewValdServer advertise and serve the same RPCs; ensure symbols referenced are stats.StatsServer, stats.UnimplementedStatsServer, UnimplementedValdServer, RegisterValdServer, and NewValdClient.internal/net/grpc/stats/stats_test.go (1)
130-145:⚠️ Potential issue | 🟠 MajorMake this success case independent of host cgroup support.
GetResourceStatsdeliberately returns a non-nil stats payload even when cgroup sampling fails, leavingCgroupStatsnil. RequiringCgroupStatshere makes the test depend on the runner's cgroup mounts and can fail even when the production path is behaving correctly. AssertNameandIpunconditionally, and only validate cgroup fields when they are present or behind a stubbed collector.💡 Suggested fix
- if stats.CgroupStats == nil { - return errors.New("cgroup stats should not be nil") - } + if stats.CgroupStats != nil && stats.CgroupStats.MemoryUsageBytes == 0 { + return errors.New("memory usage should be greater than 0 when cgroup stats are available") + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/net/grpc/stats/stats_test.go` around lines 130 - 145, The test's success path currently requires stats.CgroupStats to be non-nil which makes it flaky on hosts without cgroup support; update the checkFunc used in the test for GetResourceStats to unconditionally assert stats != nil and that stats.Name and stats.Ip are non-empty, but remove the unconditional failure when stats.CgroupStats is nil — instead only validate cgroup fields (e.g. contents of payload.Info_Stats_ResourceStats.CgroupStats) when stats.CgroupStats != nil or when you have explicitly stubbed a cgroup collector for the test.internal/net/grpc/stats/stats.go (1)
84-111:⚠️ Potential issue | 🟠 MajorPropagate canceled or expired contexts instead of silently degrading to partial stats.
measureCgroupStatsreturnsctx.Err()when the caller cancels during the 100ms sampling window. Swallowing that here makesGetResourceStatslook successful to direct callers and causes the gRPC handler to lose the correct cancellation/deadline status.Suggested fix
cgroupStats, err := measureCgroupStats(ctx) if err != nil { + if ctxErr := ctx.Err(); ctxErr != nil { + return nil, ctxErr + } log.Warn("failed to measure cgroup stats", err) }As per coding guidelines, "ensure context is passed through all calls (no missing context.Context propagation) and that proper gRPC error codes are returned when applicable."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/net/grpc/stats/stats.go` around lines 84 - 111, GetResourceStats currently swallows errors from measureCgroupStats causing callers to think the call succeeded; modify the error handling after calling measureCgroupStats so that if err is context.Canceled or context.DeadlineExceeded (use errors.Is(err, context.Canceled) / errors.Is(err, context.DeadlineExceeded) or compare to ctx.Err()) you propagate that error immediately (return nil, err) instead of logging and continuing; for non-context errors keep the existing log.Warn path and continue to populate stats. This change touches GetResourceStats and its call to measureCgroupStats.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apis/proto/v1/payload/payload.proto`:
- Around line 652-677: Restore the original top-level message Info.ResourceStats
(with fields name, ip, cgroup_stats) so the generated type
payload.v1.Info.ResourceStats remains unchanged; keep the new grouped shape for
stats under Info.Stats but rename the nested message (currently
Info.Stats.ResourceStats) to a different identifier (e.g., ResourceStatsItem or
ResourceStatsDetailItem) and/or make the map in Info.Stats.ResourceStatsDetail
use the top-level Info.ResourceStats as its value type (map<string,
payload.v1.Info.ResourceStats>) and keep CgroupStats as-is, ensuring no existing
symbol payload.v1.Info.ResourceStats was removed or renamed.
In `@apis/proto/v1/rpc/stats/stats.proto`:
- Line 33: The change makes rpc ResourceStats return
payload.v1.Info.Stats.ResourceStats which is source-breaking; revert
ResourceStats to its original response type (the previous generated type) and
keep the new nested payload.v1.Info.Stats.ResourceStats only for the
new/additive RPC (e.g., ResourceStatsDetail) or introduce a versioned RPC name;
update the proto so ResourceStats signature is unchanged and the nested-type
change is exposed only via the new ResourceStatsDetail (or a v2 RPC) to preserve
client source-compatibility.
In `@apis/swagger/v1/rpc/stats/stats.swagger.json`:
- Around line 39-58: The REST layer is missing the ResourceStatsDetail endpoint:
add ResourceStatsDetail(ctx http.Context or context.Context, w
http.ResponseWriter, r *http.Request) to the Handler interface in
pkg/gateway/lb/handler/rest/handler.go and implement it in the existing handler
type (mirror the behavior/response shape of other handlers like GetObject), then
register the route GET "/resource/stats/detail" in
pkg/gateway/lb/router/router.go (add a route entry similar to GetObject
registration) so the documented swagger path is served; note the gRPC service is
already registered via stats.RegisterStatsDetailServer(srv, v) in
pkg/gateway/lb/usecase/vald.go, so ensure the REST handler delegates or
translates requests consistently with that implementation.
In `@pkg/gateway/lb/handler/grpc/stats.go`:
- Around line 60-118: The callback passed to s.gateway.BroadCast is mutating the
method-level named return err (via vc.ResourceStats and status.ParseError),
causing races across broadcast workers; make error state local to the closure
(e.g., use a local variable like localErr or ierr) and use that for all error
handling, tracing (sspan.RecordError/SetStatus) and returns from the closure,
avoiding writes to the outer err inside the BroadCast worker; if you must
surface a worker error to the caller, collect it from the BroadCast/errgroup
result (s.eg) or send it through the ech channel and assign to the method-level
err only after the broadcast goroutines have completed.
In `@tests/v2/e2e/crud/stats_test.go`:
- Around line 50-52: The test currently calls single(..., r.resourceStatsDetail,
printCallback[*payload.Info_Stats_ResourceStats_ResourceStatsDetail](passThrough))
but only checks for no error; update the callback used with single (or replace
passThrough) to assert the response's Details are populated by calling
res.GetDetails() and asserting len(res.GetDetails()) > 0, and when the scenario
provides the expected agent count also assert len(res.GetDetails()) ==
expectedAgentCount; locate the call site around resourceStatsDetail and
payload.Info_Stats_ResourceStatsDetail and add the small assertion callback to
fail the test when Details is empty or count mismatches.
---
Outside diff comments:
In `@apis/grpc/v1/vald/vald.go`:
- Around line 57-65: The aggregate Vald helper is asymmetric: Client includes
stats.StatsClient while Server and RegisterValdServer do not expose or register
the stats service, causing UNIMPLEMENTED at runtime; to fix, add
stats.StatsServer to the Server interface, include
stats.UnimplementedStatsServer in the UnimplementedValdServer struct, and update
RegisterValdServer to call stats.RegisterStatsServer alongside existing
registrations so servers created via NewValdClient/NewValdServer advertise and
serve the same RPCs; ensure symbols referenced are stats.StatsServer,
stats.UnimplementedStatsServer, UnimplementedValdServer, RegisterValdServer, and
NewValdClient.
In `@internal/net/grpc/stats/stats_test.go`:
- Around line 130-145: The test's success path currently requires
stats.CgroupStats to be non-nil which makes it flaky on hosts without cgroup
support; update the checkFunc used in the test for GetResourceStats to
unconditionally assert stats != nil and that stats.Name and stats.Ip are
non-empty, but remove the unconditional failure when stats.CgroupStats is nil —
instead only validate cgroup fields (e.g. contents of
payload.Info_Stats_ResourceStats.CgroupStats) when stats.CgroupStats != nil or
when you have explicitly stubbed a cgroup collector for the test.
In `@internal/net/grpc/stats/stats.go`:
- Around line 84-111: GetResourceStats currently swallows errors from
measureCgroupStats causing callers to think the call succeeded; modify the error
handling after calling measureCgroupStats so that if err is context.Canceled or
context.DeadlineExceeded (use errors.Is(err, context.Canceled) / errors.Is(err,
context.DeadlineExceeded) or compare to ctx.Err()) you propagate that error
immediately (return nil, err) instead of logging and continuing; for non-context
errors keep the existing log.Warn path and continue to populate stats. This
change touches GetResourceStats and its call to measureCgroupStats.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c784a6d0-d4ec-4491-92d1-79c1f226c6b6
⛔ Files ignored due to path filters (10)
apis/grpc/v1/payload/payload.pb.gois excluded by!**/*.pb.go,!**/*.pb.goapis/grpc/v1/payload/payload.pb.json.gois excluded by!**/*.pb.json.goapis/grpc/v1/payload/payload_vtproto.pb.gois excluded by!**/*.pb.go,!**/*.pb.go,!**/*_vtproto.pb.goapis/grpc/v1/rpc/stats/stats.pb.gois excluded by!**/*.pb.go,!**/*.pb.goapis/grpc/v1/rpc/stats/stats_vtproto.pb.gois excluded by!**/*.pb.go,!**/*.pb.go,!**/*_vtproto.pb.gorust/libs/proto/src/payload/v1/payload.v1.serde.rsis excluded by!**/*.serde.rsrust/libs/proto/src/rpc/v1/rpc.v1.tonic.rsis excluded by!**/*.tonic.rstests/v2/e2e/assets/multi_crud.yamlis excluded by!**/assets/**tests/v2/e2e/assets/stream_crud.yamlis excluded by!**/assets/**tests/v2/e2e/assets/unary_crud.yamlis excluded by!**/assets/**
📒 Files selected for processing (24)
.gitfilesapis/docs/v1/docs.mdapis/docs/v1/payload.md.tmplapis/grpc/v1/vald/vald.goapis/proto/v1/payload/payload.protoapis/proto/v1/rpc/stats/stats.protoapis/swagger/v1/rpc/stats/stats.swagger.jsoninternal/client/v1/client/vald/vald.gointernal/net/grpc/stats/stats.gointernal/net/grpc/stats/stats_test.gopkg/agent/core/faiss/handler/grpc/handler.gopkg/agent/core/faiss/handler/grpc/stats.gopkg/agent/core/faiss/usecase/agentd.gopkg/agent/core/ngt/handler/grpc/handler.gopkg/agent/core/ngt/handler/grpc/stats.gopkg/agent/core/ngt/usecase/agentd.gopkg/gateway/lb/handler/grpc/handler.gopkg/gateway/lb/handler/grpc/stats.gopkg/gateway/lb/usecase/vald.gorust/libs/proto/src/payload/v1/payload.v1.rstests/v2/e2e/config/config.gotests/v2/e2e/config/enums.gotests/v2/e2e/crud/stats_test.gotests/v2/e2e/crud/strategy_test.go
| // Represents the stats info messages. | ||
| message Stats { | ||
| // Represent the resource stats | ||
| message ResourceStats { | ||
| string name = 1; | ||
| string ip = 2 [(buf.validate.field).string.ipv4 = true]; | ||
| // Container resource usage statistics | ||
| CgroupStats cgroup_stats = 3; | ||
| } | ||
|
|
||
| // Represents the resource stats for each agent | ||
| message ResourceStatsDetail { | ||
| map<string, ResourceStats> details = 1; | ||
| } | ||
|
|
||
| message CgroupStats { | ||
| // CPU cores available | ||
| double cpu_limit_cores = 1; | ||
| // CPU usage in cores (not percentage) | ||
| double cpu_usage_cores = 2; | ||
| // Memory limit in bytes | ||
| uint64 memory_limit_bytes = 3; | ||
| // Memory usage in bytes | ||
| uint64 memory_usage_bytes = 4; | ||
| } | ||
| } |
There was a problem hiding this comment.
Preserve the existing Info.ResourceStats type on the v1 surface.
Rehoming ResourceStats under Info.Stats changes the generated type used by the existing ResourceStats RPC (payload.v1.Info.ResourceStats → payload.v1.Info.Stats.ResourceStats). Existing code built against the current v1 SDKs will stop compiling even though the wire fields are unchanged. Please add the new detail shape without renaming the existing ResourceStats message.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apis/proto/v1/payload/payload.proto` around lines 652 - 677, Restore the
original top-level message Info.ResourceStats (with fields name, ip,
cgroup_stats) so the generated type payload.v1.Info.ResourceStats remains
unchanged; keep the new grouped shape for stats under Info.Stats but rename the
nested message (currently Info.Stats.ResourceStats) to a different identifier
(e.g., ResourceStatsItem or ResourceStatsDetailItem) and/or make the map in
Info.Stats.ResourceStatsDetail use the top-level Info.ResourceStats as its value
type (map<string, payload.v1.Info.ResourceStats>) and keep CgroupStats as-is,
ensuring no existing symbol payload.v1.Info.ResourceStats was removed or
renamed.
| // Overview | ||
| // Represent the RPC to get the resource stats. | ||
| rpc ResourceStats(payload.v1.Empty) returns (payload.v1.Info.ResourceStats) { | ||
| rpc ResourceStats(payload.v1.Empty) returns (payload.v1.Info.Stats.ResourceStats) { |
There was a problem hiding this comment.
Keep ResourceStats source-compatible in rpc.v1.
Line 33 changes the existing ResourceStats RPC to return payload.v1.Info.Stats.ResourceStats, which renames the generated response type for every client. The new ResourceStatsDetail RPC is additive, but this line turns the existing v1 method into a source-breaking change. Preserve the original return type on ResourceStats, or move the nested-type change behind a new/versioned RPC.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apis/proto/v1/rpc/stats/stats.proto` at line 33, The change makes rpc
ResourceStats return payload.v1.Info.Stats.ResourceStats which is
source-breaking; revert ResourceStats to its original response type (the
previous generated type) and keep the new nested
payload.v1.Info.Stats.ResourceStats only for the new/additive RPC (e.g.,
ResourceStatsDetail) or introduce a versioned RPC name; update the proto so
ResourceStats signature is unchanged and the nested-type change is exposed only
via the new ResourceStatsDetail (or a v2 RPC) to preserve client
source-compatibility.
| "/resource/stats/detail": { | ||
| "get": { | ||
| "summary": "Represent the RPC to get the resource stats for each agent.", | ||
| "operationId": "StatsDetail_ResourceStatsDetail", | ||
| "responses": { | ||
| "200": { | ||
| "description": "A successful response.", | ||
| "schema": { | ||
| "$ref": "#/definitions/InfoStatsResourceStatsDetail" | ||
| } | ||
| }, | ||
| "default": { | ||
| "description": "An unexpected error response.", | ||
| "schema": { | ||
| "$ref": "#/definitions/rpcStatus" | ||
| } | ||
| } | ||
| }, | ||
| "tags": ["StatsDetail"] | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== Proto/Swagger declarations =='
rg -n -C2 'ResourceStatsDetail|/resource/stats/detail' \
apis/proto/v1/rpc/stats/stats.proto \
apis/swagger/v1/rpc/stats/stats.swagger.json
echo
echo '== Gateway REST wiring =='
rg -n -C3 'WithVald|ResourceStats|ResourceStatsDetail|StatsDetail|/resource/stats' \
pkg/gateway/lb/usecase/vald.go \
pkg/gateway/lb/handler/rest \
pkg/gateway/lb/router
echo
echo 'Expected result: ResourceStatsDetail should appear in the gateway REST wiring if the documented HTTP endpoint is actually served.'Repository: vdaas/vald
Length of output: 5019
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== REST handler implementation =='
find pkg/gateway/lb/handler/rest -name '*.go' -type f | head -20
echo
echo '== Check for stats-related routes in REST handler =='
rg -n 'resource/stats|StatsDetail|ResourceStats|stats\.' \
pkg/gateway/lb/handler/rest/ --type go
echo
echo '== Main REST handler file =='
if [ -f pkg/gateway/lb/handler/rest/handler.go ]; then
wc -l pkg/gateway/lb/handler/rest/handler.go
head -100 pkg/gateway/lb/handler/rest/handler.go
fi
echo
echo '== Check for route registration patterns =='
rg -n 'router\.|Handle|Route' \
pkg/gateway/lb/handler/rest/ --type go -A2 | head -50Repository: vdaas/vald
Length of output: 340
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== REST handler.go content =='
cat -n pkg/gateway/lb/handler/rest/handler.go
echo
echo '== Check handler struct definition =='
rg -n 'type handler struct' -A10 pkg/gateway/lb/handler/rest/handler.go
echo
echo '== Check if handler implements ServeHTTP or similar =='
rg -n 'func.*handler.*ServeHTTP\|func.*handler.*Handle' -A5 pkg/gateway/lb/handler/rest/handler.goRepository: vdaas/vald
Length of output: 8870
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== Check router implementation =='
cat -n pkg/gateway/lb/router/router.go | head -150
echo
echo '== Check if router explicitly registers Handler methods =='
rg -n 'Index|Search|Insert|Handle' pkg/gateway/lb/router/router.go -B2 -A2 | head -80
echo
echo '== Vald server interface - check if it has ResourceStatsDetail =='
rg -n 'interface.*Server\|ResourceStatsDetail' \
apis/grpc/v1/vald/vald.pb.go -A5 | head -50Repository: vdaas/vald
Length of output: 6137
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '== Full router.go to check for stats routes =='
wc -l pkg/gateway/lb/router/router.go
cat -n pkg/gateway/lb/router/router.go | tail -100
echo
echo '== Search for ResourceStatsDetail or stats in entire REST handler directory =='
rg -r 'ResourceStatsDetail' pkg/gateway/lb/handler/rest/
echo
echo '== Search for stats routes in router files =='
find pkg/gateway/lb/router -name '*.go' -type f -exec grep -l 'stats\|Stats\|ResourceStats' {} \;
echo
echo '== Check if there is a separate stats handler =='
find pkg/gateway/lb/handler -type d | grep -i stats
echo
echo '== Search for ResourceStatsDetail across entire gateway package =='
rg -n 'ResourceStatsDetail' pkg/gateway/lb/Repository: vdaas/vald
Length of output: 2778
Implement ResourceStatsDetail handler and router registration for REST endpoint.
The proto declares rpc ResourceStatsDetail with option (google.api.http).get = "/resource/stats/detail", and Swagger documents the endpoint, but the REST layer is incomplete:
pkg/gateway/lb/handler/rest/handler.go: Handler interface omitsResourceStatsDetailmethod; implementation stops atGetObjectpkg/gateway/lb/router/router.go: Route registration ends atGetObject(line 173); no route for/resource/stats/detail
The gRPC service is correctly registered via stats.RegisterStatsDetailServer(srv, v) at line 129 in pkg/gateway/lb/usecase/vald.go, so clients using gRPC will work. However, HTTP clients will receive 404 for a documented endpoint. Add the missing ResourceStatsDetail method to the Handler interface and register the route in the router to serve this endpoint over REST consistently with the proto contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apis/swagger/v1/rpc/stats/stats.swagger.json` around lines 39 - 58, The REST
layer is missing the ResourceStatsDetail endpoint: add ResourceStatsDetail(ctx
http.Context or context.Context, w http.ResponseWriter, r *http.Request) to the
Handler interface in pkg/gateway/lb/handler/rest/handler.go and implement it in
the existing handler type (mirror the behavior/response shape of other handlers
like GetObject), then register the route GET "/resource/stats/detail" in
pkg/gateway/lb/router/router.go (add a route entry similar to GetObject
registration) so the documented swagger path is served; note the gRPC service is
already registered via stats.RegisterStatsDetailServer(srv, v) in
pkg/gateway/lb/usecase/vald.go, so ensure the REST handler delegates or
translates requests consistently with that implementation.
pkg/gateway/lb/handler/grpc/stats.go
Outdated
| s.eg.Go(safety.RecoverFunc(func() error { | ||
| defer close(ech) | ||
| ech <- s.gateway.BroadCast(ctx, service.READ, func(ctx context.Context, target string, vc vald.Client, copts ...grpc.CallOption) error { | ||
| sctx, sspan := trace.StartSpan(grpc.WrapGRPCMethod(ctx, "BroadCast/"+target), apiName+"/"+resourceStatsDetailRPCName+"/"+target) | ||
| defer func() { | ||
| if sspan != nil { | ||
| sspan.End() | ||
| } | ||
| }() | ||
| var stats *payload.Info_Stats_ResourceStats | ||
| stats, err = vc.ResourceStats(sctx, new(payload.Empty), copts...) | ||
| if err != nil { | ||
| var ( | ||
| attrs trace.Attributes | ||
| st *status.Status | ||
| msg string | ||
| code codes.Code | ||
| ) | ||
| switch { | ||
| case errors.Is(err, context.Canceled), | ||
| errors.Is(err, errors.ErrRPCCallFailed(target, context.Canceled)): | ||
| attrs = trace.StatusCodeCancelled( | ||
| errdetails.ValdGRPCResourceTypePrefix + | ||
| "/" + statsPackageName + "." + resourceStatsDetailRPCName + ".BroadCast/" + | ||
| target + " canceled: " + err.Error()) | ||
| code = codes.Canceled | ||
| case errors.Is(err, context.DeadlineExceeded), | ||
| errors.Is(err, errors.ErrRPCCallFailed(target, context.DeadlineExceeded)): | ||
| attrs = trace.StatusCodeDeadlineExceeded( | ||
| errdetails.ValdGRPCResourceTypePrefix + | ||
| "/" + statsPackageName + "." + resourceStatsDetailRPCName + ".BroadCast/" + | ||
| target + " deadline_exceeded: " + err.Error()) | ||
| code = codes.DeadlineExceeded | ||
| default: | ||
| st, msg, err = status.ParseError(err, codes.NotFound, "error "+resourceStatsDetailRPCName+" API", | ||
| &errdetails.ResourceInfo{ | ||
| ResourceType: errdetails.ValdGRPCResourceTypePrefix + "/" + statsPackageName + "." + resourceStatsDetailRPCName + ".BroadCast/" + target, | ||
| ResourceName: fmt.Sprintf("%s: %s(%s) to %s", apiName, s.name, s.ip, target), | ||
| }) | ||
| if st != nil { | ||
| code = st.Code() | ||
| } else { | ||
| code = codes.NotFound | ||
| } | ||
| attrs = trace.FromGRPCStatus(code, msg) | ||
| } | ||
| if sspan != nil { | ||
| sspan.RecordError(err) | ||
| sspan.SetAttributes(attrs...) | ||
| sspan.SetStatus(trace.StatusError, err.Error()) | ||
| } | ||
| if err != nil && st != nil && | ||
| code != codes.Canceled && | ||
| code != codes.DeadlineExceeded && | ||
| code != codes.InvalidArgument && | ||
| code != codes.NotFound && | ||
| code != codes.OK && | ||
| code != codes.Unimplemented { | ||
| return err |
There was a problem hiding this comment.
Stop sharing the named err across broadcast workers.
The callback writes to the method-level named return err (vc.ResourceStats(...) and status.ParseError(...)). BroadCast fans out per target, so those writes race with each other and with the outer select, which can return the wrong error or mask one entirely. Keep per-target state in locals and only assign to the method result after the broadcast completes.
💡 Suggested fix
- var stats *payload.Info_Stats_ResourceStats
- stats, err = vc.ResourceStats(sctx, new(payload.Empty), copts...)
- if err != nil {
+ res, callErr := vc.ResourceStats(sctx, new(payload.Empty), copts...)
+ if callErr != nil {
var (
attrs trace.Attributes
st *status.Status
msg string
code codes.Code
)
switch {
- case errors.Is(err, context.Canceled),
- errors.Is(err, errors.ErrRPCCallFailed(target, context.Canceled)):
+ case errors.Is(callErr, context.Canceled),
+ errors.Is(callErr, errors.ErrRPCCallFailed(target, context.Canceled)):
attrs = trace.StatusCodeCancelled(
errdetails.ValdGRPCResourceTypePrefix +
"/" + statsPackageName + "." + resourceStatsDetailRPCName + ".BroadCast/" +
- target + " canceled: " + err.Error())
+ target + " canceled: " + callErr.Error())
code = codes.Canceled
- case errors.Is(err, context.DeadlineExceeded),
- errors.Is(err, errors.ErrRPCCallFailed(target, context.DeadlineExceeded)):
+ case errors.Is(callErr, context.DeadlineExceeded),
+ errors.Is(callErr, errors.ErrRPCCallFailed(target, context.DeadlineExceeded)):
attrs = trace.StatusCodeDeadlineExceeded(
errdetails.ValdGRPCResourceTypePrefix +
"/" + statsPackageName + "." + resourceStatsDetailRPCName + ".BroadCast/" +
- target + " deadline_exceeded: " + err.Error())
+ target + " deadline_exceeded: " + callErr.Error())
code = codes.DeadlineExceeded
default:
- st, msg, err = status.ParseError(err, codes.NotFound, "error "+resourceStatsDetailRPCName+" API",
+ st, msg, callErr = status.ParseError(callErr, codes.NotFound, "error "+resourceStatsDetailRPCName+" API",
&errdetails.ResourceInfo{
ResourceType: errdetails.ValdGRPCResourceTypePrefix + "/" + statsPackageName + "." + resourceStatsDetailRPCName + ".BroadCast/" + target,
ResourceName: fmt.Sprintf("%s: %s(%s) to %s", apiName, s.name, s.ip, target),
})
if st != nil {
code = st.Code()
} else {
code = codes.NotFound
}
attrs = trace.FromGRPCStatus(code, msg)
}
if sspan != nil {
- sspan.RecordError(err)
+ sspan.RecordError(callErr)
sspan.SetAttributes(attrs...)
- sspan.SetStatus(trace.StatusError, err.Error())
+ sspan.SetStatus(trace.StatusError, callErr.Error())
}
- if err != nil && st != nil &&
+ if callErr != nil && st != nil &&
code != codes.Canceled &&
code != codes.DeadlineExceeded &&
code != codes.InvalidArgument &&
code != codes.NotFound &&
code != codes.OK &&
code != codes.Unimplemented {
- return err
+ return callErr
}
return nil
}
- if stats != nil {
+ if res != nil {
mu.Lock()
- detail.Details[target] = stats
+ detail.Details[target] = res
mu.Unlock()
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/gateway/lb/handler/grpc/stats.go` around lines 60 - 118, The callback
passed to s.gateway.BroadCast is mutating the method-level named return err (via
vc.ResourceStats and status.ParseError), causing races across broadcast workers;
make error state local to the closure (e.g., use a local variable like localErr
or ierr) and use that for all error handling, tracing
(sspan.RecordError/SetStatus) and returns from the closure, avoiding writes to
the outer err inside the BroadCast worker; if you must surface a worker error to
the caller, collect it from the BroadCast/errgroup result (s.eg) or send it
through the ech channel and assign to the method-level err only after the
broadcast goroutines have completed.
✅ Addressed in commit 5f93be8
There was a problem hiding this comment.
@jumpei527
This comment seems correct. Let's avoid sharing err.
There was a problem hiding this comment.
IndexStatisticsDetail has similar problem too.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
@Matts966
To prevent error race condition, I updated the BroadCast callback to use a local callErr variable. In addition to ResourceStatsDetail, I fixed the same issue in IndexStatisticsDetail and IndexProperty.
| switch plan.Type { | ||
| case config.OpResourceStatsDetail: | ||
| return single(t, ctx, 0, plan, new(payload.Empty), r.resourceStatsDetail, printCallback[*payload.Info_Stats_ResourceStatsDetail](passThrough)) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Assert that Details is populated.
This path currently only verifies that the RPC returns without error. An implementation that always returns an empty Details map would still pass, which misses the main contract of ResourceStatsDetail. Add a callback assertion for len(res.GetDetails()) > 0 and, if the scenario knows the expected agent count, validate that too.
Based on learnings: "GRPC handler packages in pkg components require minimal testing except for bugfixes; detailed business logic tests belong in E2E tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/v2/e2e/crud/stats_test.go` around lines 50 - 52, The test currently
calls single(..., r.resourceStatsDetail,
printCallback[*payload.Info_Stats_ResourceStats_ResourceStatsDetail](passThrough))
but only checks for no error; update the callback used with single (or replace
passThrough) to assert the response's Details are populated by calling
res.GetDetails() and asserting len(res.GetDetails()) > 0, and when the scenario
provides the expected agent count also assert len(res.GetDetails()) ==
expectedAgentCount; locate the call site around resourceStatsDetail and
payload.Info_Stats_ResourceStatsDetail and add the small assertion callback to
fail the test when Details is empty or count mismatches.
|
/format |
|
[FORMAT] Updating license headers and formatting go codes triggered by kpango. |
Signed-off-by: Vdaas CI <vald@vdaas.org>
… IndexProperty to prevent data race
Signed-off-by: Jumpei Tanikawa <jumpei5278@gmail.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
apis/proto/v1/payload/payload.proto (1)
654-678:⚠️ Potential issue | 🟠 MajorStill unresolved: preserve
Info.ResourceStatson the v1 surface.Rehoming this message under
Info.Statsrenames the generated SDK type for the existing stats RPC (Info_ResourceStats→Info_Stats_ResourceStatsin Go), which is a source-breaking change for current v1 clients. Keep the legacy top-level message and hang the new detail shape off it instead.You can verify the generated surface change with:
#!/bin/bash set -euo pipefail rg -n -C2 'type Info_ResourceStats|type Info_Stats_ResourceStats' apis/grpc/v1/payload/payload.pb.go rg -n -C2 '\bResourceStats\b' apis/proto/v1/rpc/stats/stats.proto🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apis/proto/v1/payload/payload.proto` around lines 654 - 678, The change moved the ResourceStats protobuf message under Info.Stats which renames the generated Go type (Info_ResourceStats → Info_Stats_ResourceStats) and breaks v1 clients; restore the original top-level message name and keep the new detailed shape as a nested type. Specifically, reintroduce a top-level message ResourceStats with the same fields (cpu_limit_cores, cpu_usage_cores, memory_limit_bytes, memory_usage_bytes and the name/ip fields) so the generated type Info_ResourceStats remains, and modify Stats (or Info.Stats) to reference that top-level ResourceStats for existing usages while adding a new ResourceStatsDetail (or similar nested message) that can contain the map<string, ResourceStats> for the detailed shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@apis/proto/v1/payload/payload.proto`:
- Around line 654-678: The change moved the ResourceStats protobuf message under
Info.Stats which renames the generated Go type (Info_ResourceStats →
Info_Stats_ResourceStats) and breaks v1 clients; restore the original top-level
message name and keep the new detailed shape as a nested type. Specifically,
reintroduce a top-level message ResourceStats with the same fields
(cpu_limit_cores, cpu_usage_cores, memory_limit_bytes, memory_usage_bytes and
the name/ip fields) so the generated type Info_ResourceStats remains, and modify
Stats (or Info.Stats) to reference that top-level ResourceStats for existing
usages while adding a new ResourceStatsDetail (or similar nested message) that
can contain the map<string, ResourceStats> for the detailed shape.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 61a63792-cabc-46bf-a663-a296fdcccb75
⛔ Files ignored due to path filters (3)
apis/grpc/v1/payload/payload.pb.gois excluded by!**/*.pb.go,!**/*.pb.goapis/grpc/v1/payload/payload_vtproto.pb.gois excluded by!**/*.pb.go,!**/*.pb.go,!**/*_vtproto.pb.gorust/libs/proto/src/payload/v1/payload.v1.serde.rsis excluded by!**/*.serde.rs
📒 Files selected for processing (5)
apis/docs/v1/docs.mdapis/docs/v1/payload.md.tmplapis/proto/v1/payload/payload.protopkg/gateway/lb/handler/grpc/index.gorust/libs/proto/src/payload/v1/payload.v1.rs
This reverts commit 628704b.
There was a problem hiding this comment.
IndexStatistics can panic when all agent responses have Valid=false. mergeInfoIndexStatistics skips invalid entries but still indexes into empty slices.
Description
#3243 implemented the
ResourceStatsAPI for future resource control. This PR addsResourceStatsDetailAPI to easily aggregate ResourceStats data from all agents.What Changed:
rpc.v1.StatsDetailservice toapis/proto/v1/rpc/stats/stats.proto.ResourceStatsDetail(payload.v1.Empty) returns (payload.v1.Info.Stats.ResourceStatsDetail).Info.Statsfor detail response (ResourceStatsDetail).ResourceStatsDetailon gateway LB handler to aggregateResourceStatsfrom agents.ResourceStatsDetailon NGT/Faiss agent handlers to return self-only detail.StatsDetailservice in gateway and agent gRPC registration paths.resource_stats_detailoperation support in E2E v2 config/binding/strategy routing.stats_test.go).ResourceStatsDetailscenarios tounary_crud.yaml,stream_crud.yaml, andmulti_crud.yaml.Related Issue
#3274
Versions
Checklist
Special notes for your reviewer
Summary by CodeRabbit
New Features
Tests
Chores