Skip to content

add a debug port which can dump configuration of servingGroup and role#900

Open
LiZhenCheng9527 wants to merge 3 commits intovolcano-sh:mainfrom
LiZhenCheng9527:debug-port
Open

add a debug port which can dump configuration of servingGroup and role#900
LiZhenCheng9527 wants to merge 3 commits intovolcano-sh:mainfrom
LiZhenCheng9527:debug-port

Conversation

@LiZhenCheng9527
Copy link
Copy Markdown
Collaborator

What type of PR is this?
/kind enhancement

What this PR does / why we need it:

modelServing uses a cache to store the configurations of ServingGroups and Roles, thereby reducing the load on the kube-apiserver. However, this approach lacks observability regarding the status of ServingGroups and Roles.

Therefore, a debug port has been added in this PR to enable the dumping of configurations from the dataStore. This enhances Kthena’s observability.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Copilot AI review requested due to automatic review settings April 20, 2026 07:00
@volcano-sh-bot volcano-sh-bot added the kind/enhancement New feature or request label Apr 20, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a debug server to the kthena-controller-manager to dump the internal datastore cache, controlled by a new --debug-port flag. Key additions include the HTTP server setup, cache serialization DTOs, and unit tests. Feedback focuses on improving the HTTP server's security and reliability through timeouts and graceful shutdown, as well as optimizing the datastore's locking behavior during cache dumps.

Comment thread pkg/controller/controller.go Outdated
Comment thread pkg/controller/controller.go
Comment thread pkg/model-serving-controller/datastore/dump.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional debug HTTP port to the controller-manager to expose an endpoint that dumps the ModelServing controller’s in-memory datastore cache (ServingGroups/Roles/RunningPods) as JSON for troubleshooting.

Changes:

  • Extend the model-serving datastore with a DumpCache() method and JSON DTOs for safe export.
  • Add a ModelServing controller debug handler at /debug/modelserving/cache.
  • Add controller-manager --debug-port wiring to start a dedicated debug HTTP server when enabled.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pkg/model-serving-controller/datastore/store.go Extends the Store interface with DumpCache() for cache export.
pkg/model-serving-controller/datastore/dump.go Implements JSON cache snapshot export via DTOs.
pkg/model-serving-controller/datastore/dump_test.go Adds unit coverage for the dumped JSON structure/content.
pkg/model-serving-controller/controller/model_serving_controller.go Adds HTTP handler + registration for cache dump endpoint.
pkg/controller/controller.go Starts a debug HTTP server when DebugPort is set.
pkg/controller/config.go Adds DebugPort to controller-manager config.
cmd/kthena-controller-manager/main.go Adds --debug-port CLI flag to enable the debug server.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/model-serving-controller/datastore/dump.go Outdated
Comment thread pkg/model-serving-controller/datastore/store.go Outdated
Comment thread pkg/controller/controller.go Outdated
Comment thread pkg/controller/controller.go
Comment on lines +2419 to +2424
// dumpCacheHandler handles requests to dump the datastore cache.
func (c *ModelServingController) dumpCacheHandler(w http.ResponseWriter, r *http.Request) {
data := c.store.DumpCache()
if data == nil {
http.Error(w, "Failed to dump cache", http.StatusInternalServerError)
return
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumpCacheHandler responds to any HTTP method. Since this is a read-only debug dump endpoint, it should explicitly allow only GET (and possibly HEAD) and return 405 for others to reduce accidental misuse and align with typical debug endpoint behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix it

Comment thread kthena-controller-manager Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe revert this binary file?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add in gitignore

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary

The overall direction is good — an in-process debug port backed by a safe read-lock snapshot is exactly the right approach for this kind of cache observability. There are a few issues that need fixing before this can land.


🔴 Critical — binary committed to the repo

The file kthena-controller-manager is a compiled binary added to the repository root. Binaries must never be committed to source control. Please remove it and update .gitignore if needed.


🔴 Bug — wrong HTTP status code in dumpCacheHandler

// pkg/model-serving-controller/controller/model_serving_controller.go
if err != nil {
    http.Error(w, "Failed to dump cache", http.StatusMethodNotAllowed) // ← 405 is wrong
    return
}

405 Method Not Allowed is only correct when the HTTP method itself is not supported. Here the error originates from a json.Marshal failure inside DumpCache, which is an internal server error. This should be http.StatusInternalServerError (500).


🟡 Missing method guard on the debug endpoint

dumpCacheHandler accepts any HTTP method (POST, DELETE, …). A dump endpoint should only respond to GET:

func (c *ModelServingController) dumpCacheHandler(w http.ResponseWriter, r *http.Request) {
    if r.Method != http.MethodGet {
        http.Error(w, "Method Not Allowed", http.StatusMethodNotAllowed)
        return
    }
    // ...
}

🟡 Debug server availability vs. leader-election

The debug server is started inside the leader-election block, so only the active leader exposes the debug port. If the intent is to inspect any replica's local cache — which is most useful exactly when a non-leader is behaving unexpectedly — the debug server should be started unconditionally, outside the leader-election guard. The cache is populated and valid on all replicas.


🟡 Security — no authentication / no localhost binding

By default the debug port is bound to all interfaces (0.0.0.0). Anyone who can reach the pod network can retrieve the full internal cache. Since this is a debug-only port, consider either:

  • Binding to 127.0.0.1 by default, or
  • Documenting clearly that this port must be protected (e.g. with a NetworkPolicy) when enabled in shared clusters.

🟢 Minor — ExportedServingGroup.Roles doc comment

The internal structure is roleName → roleID → *Role. The exported type correctly mirrors this, but there is no comment explaining the two-level nesting. A one-line comment would help readers quickly understand the JSON output shape:

// Roles maps role name → role instance ID → ExportedRole.
Roles map[string]map[string]ExportedRole `json:"roles"`

✅ What looks good

  • DumpCache correctly releases the read lock before calling json.Marshal, so serialisation does not block writers.
  • All HTTP server timeouts are set (ReadHeaderTimeout, ReadTimeout, WriteTimeout, IdleTimeout) — good defensive practice.
  • The graceful shutdown goroutine (<-ctx.Done()server.Shutdown) is correctly wired.
  • DTOs (ExportedRole, ExportedServingGroup) properly decouple the wire format from internal structs, avoiding accidental exposure of unexported locks.
  • dump_test.go covers the happy path with a realistic scenario.

Copilot AI review requested due to automatic review settings April 20, 2026 08:54
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lizhencheng9527. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go
Comment thread pkg/model-serving-controller/datastore/dump.go
Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
@LiZhenCheng9527
Copy link
Copy Markdown
Collaborator Author

/retest

hzxuzhonghu

This comment was marked as abuse.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Please ensure the debug port only listens on localhost to avoid unintended exposure.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this enhancement — observability into the in-memory cache is clearly useful. Overall the approach is clean and well-structured, but there is a critical compilation bug plus a few smaller issues that need to be addressed before merging.


[Critical] DumpCache() is not part of the Store interface — will not compile

ModelServingController.store is typed as datastore.Store (the interface). The new handleModelServingDatastoreCacheDump handler calls c.store.DumpCache(), but DumpCache() ([]byte, error) is not declared in the Store interface in store.go. This will fail to compile.

Fix: add the method signature to the interface:

// in pkg/model-serving-controller/datastore/store.go
type Store interface {
    // ... existing methods ...
    DumpCache() ([]byte, error)
}

[Important] Debug server only available on the leader when leader-election is enabled

The debug server is launched inside startControllers(). When EnableLeaderElection=true, startControllers is only called once leadership is acquired, so non-leader replicas never serve debug traffic. Debug/observability endpoints are typically independent of leadership; consider starting the server outside startControllers so that every replica is reachable, regardless of whether it currently holds the lease.


[Minor] Use defer for the RUnlock in DumpCache

// current
s.mutex.RLock()
for msName, groups := range s.servingGroup { ... }
s.mutex.RUnlock()

// preferred — releases the lock even if a panic occurs
s.mutex.RLock()
defer s.mutex.RUnlock()
for msName, groups := range s.servingGroup { ... }

[Minor] Port range should be validated

--debug-port accepts any positive integer. Consider validating that the value is in the range [1, 65535] before starting, e.g.:

if cc.DebugPort > 0 && cc.DebugPort > 65535 {
    klog.Fatalf("--debug-port must be in range 1-65535, got %d", cc.DebugPort)
}

[Minor] Security note — document that the port must not be exposed externally

The endpoint carries no authentication. Internal cache data (pod names, role states, revision hashes) could be sensitive. Please add a note in the flag description and/or docs that the debug port should only be accessible within the cluster network (e.g. bound to localhost or protected by a NetworkPolicy), and must not be exposed as a LoadBalancer/NodePort service.


[Nit] PR description

The release-note block is empty and the Fixes # issue link is incomplete. Please fill those in (or use NONE in the release-note block if no user-facing change is intended).

Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Copilot AI review requested due to automatic review settings May 6, 2026 07:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread pkg/model-serving-controller/datastore/dump.go
Comment thread pkg/model-serving-controller/controller/model_serving_controller.go
Comment thread pkg/model-serving-controller/controller/model_serving_controller.go
@LiZhenCheng9527
Copy link
Copy Markdown
Collaborator Author

/cc @hzxuzhonghu PTAL

@volcano-sh-bot volcano-sh-bot requested a review from hzxuzhonghu May 6, 2026 07:14
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

@LiZhenCheng9527: GitHub didn't allow me to request PR reviews from the following users: PTAL.

Note that only volcano-sh members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @hzxuzhonghu PTAL

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/enhancement New feature or request size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants