add a debug port which can dump configuration of servingGroup and role #900
LiZhenCheng9527 wants to merge 3 commits into volcano-sh:main
Conversation
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Code Review
This pull request adds a debug server to the kthena-controller-manager to dump the internal datastore cache, controlled by a new --debug-port flag. Key additions include the HTTP server setup, cache serialization DTOs, and unit tests. Feedback focuses on improving the HTTP server's security and reliability through timeouts and graceful shutdown, as well as optimizing the datastore's locking behavior during cache dumps.
Pull request overview
Adds an optional debug HTTP port to the controller-manager to expose an endpoint that dumps the ModelServing controller’s in-memory datastore cache (ServingGroups/Roles/RunningPods) as JSON for troubleshooting.
Changes:
- Extend the model-serving datastore with a `DumpCache()` method and JSON DTOs for safe export.
- Add a ModelServing controller debug handler at `/debug/modelserving/cache`.
- Add controller-manager `--debug-port` wiring to start a dedicated debug HTTP server when enabled.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/model-serving-controller/datastore/store.go | Extends the Store interface with DumpCache() for cache export. |
| pkg/model-serving-controller/datastore/dump.go | Implements JSON cache snapshot export via DTOs. |
| pkg/model-serving-controller/datastore/dump_test.go | Adds unit coverage for the dumped JSON structure/content. |
| pkg/model-serving-controller/controller/model_serving_controller.go | Adds HTTP handler + registration for cache dump endpoint. |
| pkg/controller/controller.go | Starts a debug HTTP server when DebugPort is set. |
| pkg/controller/config.go | Adds DebugPort to controller-manager config. |
| cmd/kthena-controller-manager/main.go | Adds --debug-port CLI flag to enable the debug server. |
// dumpCacheHandler handles requests to dump the datastore cache.
func (c *ModelServingController) dumpCacheHandler(w http.ResponseWriter, r *http.Request) {
	data := c.store.DumpCache()
	if data == nil {
		http.Error(w, "Failed to dump cache", http.StatusInternalServerError)
		return
dumpCacheHandler responds to any HTTP method. Since this is a read-only debug dump endpoint, it should explicitly allow only GET (and possibly HEAD) and return 405 for others to reduce accidental misuse and align with typical debug endpoint behavior.
maybe revert this binary file?
hzxuzhonghu left a comment
Review summary
The overall direction is good — an in-process debug port backed by a safe read-lock snapshot is exactly the right approach for this kind of cache observability. There are a few issues that need fixing before this can land.
🔴 Critical — binary committed to the repo
The file kthena-controller-manager is a compiled binary added to the repository root. Binaries must never be committed to source control. Please remove it and update .gitignore if needed.
🔴 Bug — wrong HTTP status code in dumpCacheHandler
// pkg/model-serving-controller/controller/model_serving_controller.go
if err != nil {
http.Error(w, "Failed to dump cache", http.StatusMethodNotAllowed) // ← 405 is wrong
return
}

405 Method Not Allowed is only correct when the HTTP method itself is not supported. Here the error originates from a json.Marshal failure inside DumpCache, which is an internal server error. This should be http.StatusInternalServerError (500).
🟡 Missing method guard on the debug endpoint
dumpCacheHandler accepts any HTTP method (POST, DELETE, …). A dump endpoint should only respond to GET:
func (c *ModelServingController) dumpCacheHandler(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
http.Error(w, "Method Not Allowed", http.StatusMethodNotAllowed)
return
}
// ...
}

🟡 Debug server availability vs. leader-election
The debug server is started inside the leader-election block, so only the active leader exposes the debug port. If the intent is to inspect any replica's local cache — which is most useful exactly when a non-leader is behaving unexpectedly — the debug server should be started unconditionally, outside the leader-election guard. The cache is populated and valid on all replicas.
🟡 Security — no authentication / no localhost binding
By default the debug port is bound to all interfaces (0.0.0.0). Anyone who can reach the pod network can retrieve the full internal cache. Since this is a debug-only port, consider either:
- Binding to `127.0.0.1` by default, or
- Documenting clearly that this port must be protected (e.g. with a NetworkPolicy) when enabled in shared clusters.
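A sketch of the localhost-binding option, using the standard library's `net.JoinHostPort`; the helper name is illustrative:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// debugAddr builds the listen address for the debug server. Binding to
// 127.0.0.1 (rather than ":port", i.e. 0.0.0.0) keeps the endpoint
// reachable only from inside the pod.
func debugAddr(port int) string {
	return net.JoinHostPort("127.0.0.1", strconv.Itoa(port))
}

func main() {
	// Would be used as: &http.Server{Addr: debugAddr(port), ...}
	fmt.Println(debugAddr(6061))
}
```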
🟢 Minor — ExportedServingGroup.Roles doc comment
The internal structure is roleName → roleID → *Role. The exported type correctly mirrors this, but there is no comment explaining the two-level nesting. A one-line comment would help readers quickly understand the JSON output shape:
// Roles maps role name → role instance ID → ExportedRole.
Roles map[string]map[string]ExportedRole `json:"roles"`

✅ What looks good
- `DumpCache` correctly releases the read lock before calling `json.Marshal`, so serialisation does not block writers.
- All HTTP server timeouts are set (`ReadHeaderTimeout`, `ReadTimeout`, `WriteTimeout`, `IdleTimeout`) — good defensive practice.
- The graceful shutdown goroutine (`<-ctx.Done()` → `server.Shutdown`) is correctly wired.
- DTOs (`ExportedRole`, `ExportedServingGroup`) properly decouple the wire format from internal structs, avoiding accidental exposure of unexported locks.
- `dump_test.go` covers the happy path with a realistic scenario.
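The snapshot-then-marshal pattern the review praises can be sketched as follows; the struct and field names here are simplified stand-ins, not the datastore's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// store is a simplified stand-in for the model-serving datastore.
type store struct {
	mutex        sync.RWMutex
	servingGroup map[string]string
}

// DumpCache copies the cache under the read lock, then marshals after
// releasing it, so JSON serialisation never blocks writers.
func (s *store) DumpCache() ([]byte, error) {
	s.mutex.RLock()
	snapshot := make(map[string]string, len(s.servingGroup))
	for k, v := range s.servingGroup {
		snapshot[k] = v
	}
	s.mutex.RUnlock()
	// Lock is already released; Marshal can take as long as it needs.
	return json.Marshal(snapshot)
}

func main() {
	s := &store{servingGroup: map[string]string{"ms-a": "group-1"}}
	b, _ := s.DumpCache()
	fmt.Println(string(b))
}
```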
Force-pushed from 801e17b to 1424ca3
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Force-pushed from 1424ca3 to d37632b
/retest
hzxuzhonghu left a comment
LGTM overall. Please ensure the debug port only listens on localhost to avoid unintended exposure.
hzxuzhonghu left a comment
Thanks for this enhancement — observability into the in-memory cache is clearly useful. Overall the approach is clean and well-structured, but there is a critical compilation bug plus a few smaller issues that need to be addressed before merging.
[Critical] DumpCache() is not part of the Store interface — will not compile
ModelServingController.store is typed as datastore.Store (the interface). The new handleModelServingDatastoreCacheDump handler calls c.store.DumpCache(), but DumpCache() ([]byte, error) is not declared in the Store interface in store.go. This will fail to compile.
Fix: add the method signature to the interface:
// in pkg/model-serving-controller/datastore/store.go
type Store interface {
// ... existing methods ...
DumpCache() ([]byte, error)
}

[Important] Debug server only available on the leader when leader-election is enabled
The debug server is launched inside startControllers(). When EnableLeaderElection=true, startControllers is only called once leadership is acquired, so non-leader replicas never serve debug traffic. Debug/observability endpoints are typically independent of leadership; consider starting the server outside startControllers so that every replica is reachable, regardless of whether it currently holds the lease.
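The proposed startup order can be sketched as a tiny decision function; the names and structure here are illustrative, not the PR's actual code:

```go
package main

import "fmt"

// startedComponents sketches the suggested startup: the debug server starts
// on every replica, while controllers still wait for leadership.
func startedComponents(debugPort int, isLeader bool) []string {
	var started []string
	if debugPort > 0 {
		// Outside the leader-election guard: every replica serves /debug.
		started = append(started, "debug-server")
	}
	if isLeader {
		// startControllers still runs only once leadership is acquired.
		started = append(started, "controllers")
	}
	return started
}

func main() {
	// A non-leader replica still exposes the debug port.
	fmt.Println(startedComponents(6061, false))
}
```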
[Minor] Use defer for the RUnlock in DumpCache
// current
s.mutex.RLock()
for msName, groups := range s.servingGroup { ... }
s.mutex.RUnlock()
// preferred — releases the lock even if a panic occurs
s.mutex.RLock()
defer s.mutex.RUnlock()
for msName, groups := range s.servingGroup { ... }

[Minor] Port range should be validated
--debug-port accepts any positive integer. Consider validating that the value is in the range [1, 65535] before starting, e.g.:
if cc.DebugPort < 0 || cc.DebugPort > 65535 {
    klog.Fatalf("--debug-port must be in range 1-65535, got %d", cc.DebugPort)
}

[Minor] Security note — document that the port must not be exposed externally
The endpoint carries no authentication. Internal cache data (pod names, role states, revision hashes) could be sensitive. Please add a note in the flag description and/or docs that the debug port should only be accessible within the cluster network (e.g. bound to localhost or protected by a NetworkPolicy), and must not be exposed as a LoadBalancer/NodePort service.
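One way to carry the caveat in the flag help text, sketched with the standard `flag` package; the flag-set name and wording are illustrative, not the actual code in cmd/kthena-controller-manager/main.go:

```go
package main

import (
	"flag"
	"fmt"
)

// newFlagSet registers --debug-port with a help text that documents the
// security caveat the review asks for (a sketch, not the PR's flag set).
func newFlagSet(debugPort *int) *flag.FlagSet {
	fs := flag.NewFlagSet("kthena-controller-manager", flag.ContinueOnError)
	fs.IntVar(debugPort, "debug-port", 0,
		"Port for the debug HTTP server; 0 disables it. The endpoint is "+
			"unauthenticated: keep it inside the cluster network (localhost "+
			"binding or a NetworkPolicy) and never expose it via a Service.")
	return fs
}

func main() {
	var debugPort int
	fs := newFlagSet(&debugPort)
	_ = fs.Parse([]string{"--debug-port", "6061"})
	fmt.Println(debugPort)
}
```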
[Nit] PR description
The release-note block is empty and the Fixes # issue link is incomplete. Please fill those in (or use NONE in the release-note block if no user-facing change is intended).
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
/cc @hzxuzhonghu PTAL
@LiZhenCheng9527: GitHub didn't allow me to request PR reviews from the following users: PTAL. Note that only volcano-sh members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind enhancement
What this PR does / why we need it:
ModelServing uses a cache to store the configurations of ServingGroups and Roles, reducing the load on the kube-apiserver. However, this approach lacks observability into the status of ServingGroups and Roles.
This PR therefore adds a debug port that can dump the configurations from the dataStore, improving Kthena's observability.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: