Feat/mtls cert source abstraction#293
Conversation
…s.Config from file or SPIRE-managed SVIDs using crypto/tls and fsnotify. Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…file, and --mtls-ca-file flags across all components and wired them into their respective Config structs with startup validation. Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…rce-agnostic Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
…ication Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request introduces a certificate-source agnostic mTLS framework across the picod, router, and workload-manager components. It includes new command-line flags, a configuration validation layer, and a pkg/mtls package for dynamic certificate loading and file watching with SPIFFE ID verification. Key feedback points out that the picod server does not yet utilize the new mTLS configuration in its runtime, and the CertWatcher contains a bug where it fails to re-establish watches for the key file after an atomic rename. Additionally, the current implementation performs synchronous disk I/O on every TLS handshake, which should be optimized using the provided CertWatcher, and file watching should be shifted to directory-level monitoring for better reliability in Kubernetes environments.
| config := picod.Config{ | ||
| Port: *port, | ||
| Workspace: *workspace, | ||
| Port: *port, | ||
| Workspace: *workspace, | ||
| MTLSCertFile: *mtlsCertFile, | ||
| MTLSKeyFile: *mtlsKeyFile, | ||
| MTLSCAFile: *mtlsCAFile, | ||
| } |
There was a problem hiding this comment.
While the mTLS flags are correctly parsed and passed to the picod.Config, the picod.Server implementation (specifically in Run()) does not yet utilize these fields to configure its network listener. This results in the mTLS feature being effectively disabled despite the flags being set. The server's Run method needs to be updated to use mtls.LoadServerConfig.
| if event.Has(fsnotify.Rename) || event.Has(fsnotify.Remove) { | ||
| _ = cw.watcher.Remove(cw.certFile) | ||
| if err := cw.watcher.Add(cw.certFile); err != nil { | ||
| klog.Errorf("Failed to re-watch %s: %v", cw.certFile, err) | ||
| } | ||
| if err := cw.reload(); err != nil { | ||
| klog.Errorf("Failed to reload certificate after rename: %v", err) | ||
| } | ||
| } |
There was a problem hiding this comment.
The re-watch logic for atomic renames is currently hardcoded to only handle cw.certFile. If the keyFile is updated via an atomic rename (common in SPIRE/cert-manager rotations), its watch will be lost and never re-established. Use event.Name to ensure the watch is re-added for the specific file that triggered the event.
| if event.Has(fsnotify.Rename) || event.Has(fsnotify.Remove) { | |
| _ = cw.watcher.Remove(cw.certFile) | |
| if err := cw.watcher.Add(cw.certFile); err != nil { | |
| klog.Errorf("Failed to re-watch %s: %v", cw.certFile, err) | |
| } | |
| if err := cw.reload(); err != nil { | |
| klog.Errorf("Failed to reload certificate after rename: %v", err) | |
| } | |
| } | |
| // Re-watch after atomic rename (spiffe-helper does atomic rename) | |
| if event.Has(fsnotify.Rename) || event.Has(fsnotify.Remove) { | |
| _ = cw.watcher.Remove(event.Name) | |
| if err := cw.watcher.Add(event.Name); err != nil { | |
| klog.Errorf("Failed to re-watch %s: %v", event.Name, err) | |
| } | |
| if err := cw.reload(); err != nil { | |
| klog.Errorf("Failed to reload certificate after rename: %v", err) | |
| } | |
| } |
| if err := watcher.Add(certFile); err != nil { | ||
| watcher.Close() | ||
| return nil, fmt.Errorf("watch cert file %s: %w", certFile, err) | ||
| } | ||
| if err := watcher.Add(keyFile); err != nil { | ||
| watcher.Close() | ||
| return nil, fmt.Errorf("watch key file %s: %w", keyFile, err) | ||
| } |
There was a problem hiding this comment.
Watching individual files in Kubernetes (where Secrets/ConfigMaps are symlinked) is often unreliable because the file inodes change when the mount is updated. A more robust pattern for K8s is to watch the parent directory or the ..data symlink. Since this component is intended for use with SPIRE/K8s, consider watching the directory to ensure updates are always detected.
| tlsCfg := &tls.Config{ | ||
| GetCertificate: func(_ *tls.ClientHelloInfo) (*tls.Certificate, error) { | ||
| cert, err := tls.LoadX509KeyPair(cfg.CertFile, cfg.KeyFile) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| return &cert, nil | ||
| }, | ||
| ClientCAs: caPool, | ||
| ClientAuth: tls.RequireAndVerifyClientCert, | ||
| MinVersion: tls.VersionTLS12, | ||
| } |
There was a problem hiding this comment.
The GetCertificate (and GetClientCertificate in LoadClientConfig) callbacks perform synchronous disk I/O on every TLS handshake via tls.LoadX509KeyPair. This is inefficient and can become a bottleneck. Since CertWatcher is provided in this package, it should be integrated here to provide an in-memory cache of the certificate that reloads only on file changes.
There was a problem hiding this comment.
Pull request overview
This PR introduces a new pkg/mtls package intended to provide a certificate-source-agnostic mTLS foundation (PEM file inputs) and adds CLI/config plumbing for Router, WorkloadManager, and PicoD to accept mTLS certificate paths.
Changes:
- Added
pkg/mtls(config + TLS config loader + fsnotify-based certificate watcher) and accompanying unit tests. - Added
--mtls-cert-file,--mtls-key-file,--mtls-ca-fileflags to Router / WorkloadManager / PicoD and stored them in each component’s Config. - Added
github.com/fsnotify/fsnotifyas a direct Go module dependency.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/workloadmanager/server.go | Adds mTLS cert/key/CA fields to WorkloadManager Config. |
| pkg/router/config.go | Adds mTLS cert/key/CA fields to Router Config. |
| pkg/router/config_test.go | Extends Router config JSON round-trip tests for new mTLS fields. |
| pkg/picod/server.go | Adds mTLS cert/key/CA fields (with JSON tags) to PicoD Config. |
| pkg/mtls/config.go | New shared mTLS path config + validation helpers. |
| pkg/mtls/config_test.go | Tests for mtls.Config validation/enabled behavior. |
| pkg/mtls/loader.go | New TLS config “loader” with SPIFFE ID verification hooks. |
| pkg/mtls/loader_test.go | Tests for TLS config loader and SPIFFE verification helpers. |
| pkg/mtls/watcher.go | New fsnotify-driven certificate watcher with hot reload callbacks. |
| pkg/mtls/watcher_test.go | Tests for watcher initial load / update / idempotent stop. |
| cmd/workload-manager/main.go | Adds mTLS flags, validates them, passes values into WorkloadManager Config. |
| cmd/router/main.go | Adds mTLS flags, validates them, passes values into Router Config. |
| cmd/picod/main.go | Adds mTLS flags, validates them, passes values into PicoD Config. |
| go.mod | Promotes fsnotify to a direct dependency. |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #293 +/- ##
==========================================
+ Coverage 46.75% 49.79% +3.04%
==========================================
Files 30 33 +3
Lines 2862 3227 +365
==========================================
+ Hits 1338 1607 +269
- Misses 1381 1449 +68
- Partials 143 171 +28
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…tability - Updated LoadServerConfig and LoadClientConfig to use CertWatcher cache, eliminating repeated disk I/O and ensuring seamless cert rotation. - Improved fsnotify handling to track exact file events, maintaining watch consistency during atomic cert/key replacements. - Enhanced TLS validation by including intermediate CAs in VerifyOptions for proper certificate chain verification. - Added nil guards and enforced strict Config.Enabled() checks to prevent misconfigurations and runtime panics. Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
hzxuzhonghu
left a comment
There was a problem hiding this comment.
The pkg/mtls package is well-structured — clean separation of Config, CertWatcher, and loader helpers, good test coverage, and the sync.Once-guarded Stop() is correct. A few issues to address before merging.
Across all three binaries (pkg/picod/server.go, pkg/workloadmanager/server.go, pkg/router/config.go): the mTLS fields are validated at startup but never consumed by the HTTP listeners. An operator who deploys with --mtls-cert-file etc. will see the flags silently have no effect — no error, no warning. This is the stated intent (follow-up PR will wire it), but please at minimum add a startup klog.Warningf or a // TODO(author): wire mTLS in follow-up PR comment at each ListenAndServe / Start call site so the intent is clear in the code.
| if event.Has(fsnotify.Rename) || event.Has(fsnotify.Remove) { | ||
| targetFile := event.Name | ||
| _ = cw.watcher.Remove(targetFile) | ||
| if err := cw.watcher.Add(targetFile); err != nil { |
There was a problem hiding this comment.
If watcher.Add(targetFile) fails here (for example, spiffe-helper renames cert.pem away before atomically placing the new one), the file is permanently dropped from the watch set and no further events will be received for it — cert hot-reload silently stops working.
The current code only logs the error, so a rotation failure could go unnoticed until the cert expires. Two safer options:
- Watch the parent directory rather than individual files (standard
cert-manager/spiffe-helperapproach). - Implement a short retry loop (e.g. try re-Add up to 5× with 50 ms sleep) so a transient race during atomic rename is recovered.
At minimum, downgrade the log to klog.V(4) for the expected case and add a clear klog.Errorf when the retry budget is exhausted so operators know cert reload is degraded.
There was a problem hiding this comment.
I went with your 2nd suggested approach and implemented a 5-iteration retry loop with a 50ms sleep, plus a CRITICAL log if the budget is exhausted.
| MinVersion: tls.VersionTLS12, | ||
| } | ||
|
|
||
| if expectedServerID != "" { |
There was a problem hiding this comment.
When expectedServerID == "" the else branch (lines 78-82) returns a tls.Config with standard hostname verification enabled (InsecureSkipVerify=false, ServerName=""). SPIFFE-issued certificates carry only URI SANs — no DNS SANs — so Go's hostname check will fail with x509: certificate is valid for ... against any SPIFFE endpoint. The config returned by this branch is essentially unusable with the mTLS setup this package targets.
Options:
- Require
expectedServerIDto be non-empty when mTLS is configured, and return anerrorif it is empty. - Document clearly in the godoc that
expectedServerID=""is only valid when connecting to non-SPIFFE endpoints that have a matching DNS SAN, and consider settingtlsCfg.ServerNameto prevent accidental misuse.
There was a problem hiding this comment.
I've strictly enforced Option 1: LoadClientConfig now returns an explicit error immediately if expectedServerID is empty, completely preventing accidental fallback to standard DNS SAN verification.
| // mTLS configuration (certificate source abstraction) | ||
|
|
||
| // MTLSCertFile is the path to the mTLS certificate (--mtls-cert-file) | ||
| MTLSCertFile string `json:"mtlsCertFile"` |
There was a problem hiding this comment.
These three fields (MTLSCertFile, MTLSKeyFile, MTLSCAFile) are plumbed through and validated in cmd/picod/main.go, but Run() always calls server.ListenAndServe() (plain HTTP) regardless of whether they are set. The same pattern applies in pkg/workloadmanager/server.go (Start()) and pkg/router (server.go/Start()).
Please add either:
- A
klog.Warningf("mTLS cert paths configured but listener not yet wired — ignored until follow-up PR")guard inRun()/Start()whenMTLSCertFile != "", or - A
// TODO(<author>): wire mtls.LoadServerConfig here — follow-up PRcomment at theListenAndServecall site
Otherwise operators silently pass flags that have zero effect.
There was a problem hiding this comment.
I've added a startup klog.Warningf and a TODO guard at all three ListenAndServe call sites to explicitly warn operators until the follow-up wiring PR.
| certFile, keyFile, caFile := generateTestCerts(t) | ||
| cfg := &Config{CertFile: certFile, KeyFile: keyFile, CAFile: caFile} | ||
|
|
||
| tlsCfg, watcher, err := LoadServerConfig(cfg, nil) |
There was a problem hiding this comment.
This test only exercises LoadServerConfig(cfg, nil) (no expectedClientIDs). The code path where VerifyPeerCertificate is injected for SPIFFE enforcement (LoadServerConfig(cfg, []string{spiffeID})) has no test. Since this callback is the central security enforcement for server-side identity, please add a test analogous to TestVerifyPeerSPIFFEID_MatchingID/NonMatchingID but going through LoadServerConfig:
func TestLoadServerConfig_WithExpectedClientIDs(t *testing.T) {
certFile, keyFile, caFile := generateTestCerts(t)
cfg := &Config{CertFile: certFile, KeyFile: keyFile, CAFile: caFile}
tlsCfg, watcher, err := LoadServerConfig(cfg, []string{"spiffe://cluster.local/test"})
...
if tlsCfg.VerifyPeerCertificate == nil {
t.Error("VerifyPeerCertificate should be set when expectedClientIDs provided")
}
// Also exercise the callback directly to confirm reject/accept behaviour.
}There was a problem hiding this comment.
I've added TestLoadServerConfig_WithExpectedClientIDs which directly exercises the injected VerifyPeerCertificate callback to confirm it correctly accepts and rejects matching/non-matching identities.
…and expand integration tests Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
| // LoadClientConfig returns a tls.Config for a client that presents its cert and verifies the server. | ||
| // Certs are loaded dynamically via GetClientCertificate to support rotation. | ||
| // If expectedServerID is non-empty, the server must present a matching SPIFFE ID. | ||
| // This sets InsecureSkipVerify=true to bypass hostname checking (SPIFFE uses URI SANs) | ||
| // and manually verifies the chain + SPIFFE ID instead. | ||
| func LoadClientConfig(cfg *Config, expectedServerID string) (*tls.Config, *CertWatcher, error) { | ||
| if expectedServerID == "" { | ||
| return nil, nil, fmt.Errorf("client TLS config: expectedServerID is required for SPIFFE verification") | ||
| } |
There was a problem hiding this comment.
The doc comment for LoadClientConfig says "If expectedServerID is non-empty" but the implementation returns an error when expectedServerID == "". Please either update the comment to reflect that expectedServerID is required, or adjust the function to support a non-SPIFFE mode (standard verification with RootCAs / ServerName) when it is empty.
| MTLSCertFile string `json:"mtlsCertFile"` | ||
| // MTLSKeyFile is the path to the mTLS private key (--mtls-key-file) | ||
| MTLSKeyFile string `json:"mtlsKeyFile"` | ||
| // MTLSCAFile is the path to the mTLS CA bundle (--mtls-ca-file) | ||
| MTLSCAFile string `json:"mtlsCAFile"` |
There was a problem hiding this comment.
The new PicoD Config JSON tags use lowerCamelCase (mtlsCertFile/mtlsKeyFile/mtlsCAFile), but other PicoD request/response types consistently use snake_case for multi-word JSON fields (e.g., working_dir, exit_code, is_dir). To keep the API consistent, rename these tags to snake_case (for example, mtls_cert_file/mtls_key_file/mtls_ca_file) or align with whichever format PicoD exposes elsewhere.
| MTLSCertFile string `json:"mtlsCertFile"` | |
| // MTLSKeyFile is the path to the mTLS private key (--mtls-key-file) | |
| MTLSKeyFile string `json:"mtlsKeyFile"` | |
| // MTLSCAFile is the path to the mTLS CA bundle (--mtls-ca-file) | |
| MTLSCAFile string `json:"mtlsCAFile"` | |
| MTLSCertFile string `json:"mtls_cert_file"` | |
| // MTLSKeyFile is the path to the mTLS private key (--mtls-key-file) | |
| MTLSKeyFile string `json:"mtls_key_file"` | |
| // MTLSCAFile is the path to the mTLS CA bundle (--mtls-ca-file) | |
| MTLSCAFile string `json:"mtls_ca_file"` |
| intermediates := x509.NewCertPool() | ||
| for i, rawCert := range rawCerts[1:] { | ||
| intermediateCert, err := x509.ParseCertificate(rawCert) | ||
| if err != nil { | ||
| return fmt.Errorf("parse peer intermediate certificate %d: %w", i+1, err) |
There was a problem hiding this comment.
verifyPeerChainAndSPIFFEID has intermediate-certificate handling (building an Intermediates pool from rawCerts[1:]), but loader_test.go doesn’t include a case where the leaf is signed by an intermediate CA. Please add a unit test (root CA → intermediate CA → leaf) to ensure this logic actually works.
| // Re-watch after atomic rename (spiffe-helper does atomic renames for both cert and key) | ||
| if event.Has(fsnotify.Rename) || event.Has(fsnotify.Remove) { | ||
| targetFile := event.Name | ||
| _ = cw.watcher.Remove(targetFile) | ||
| // Retry loop to handle transient races during atomic replacement |
There was a problem hiding this comment.
The Rename/Remove re-watch path in watchLoop (intended to support atomic rotation via rename/symlink swaps) isn’t covered by watcher_test.go (it only overwrites files). Please add a unit test that rotates cert/key using os.Rename into place so this branch is exercised.
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
def4e7a to
6edb3da
Compare
|
|
||
| // Start HTTP or HTTPS server | ||
| if s.config.MTLSCertFile != "" || s.config.MTLSKeyFile != "" || s.config.MTLSCAFile != "" { | ||
| klog.Warningf("TODO(author): wire mTLS in follow-up PR. mTLS flags are currently parsed but not consumed by the Router listener.") |
There was a problem hiding this comment.
Shipping a runtime log message containing a TODO(author) placeholder is brittle and can leak internal process details into production logs. Consider replacing this with a stable message (no TODO/author), and ideally fail fast (return an error) if users provide --mtls-* flags that are not yet honored, or gate the warning behind a debug/feature flag. (Same pattern appears in WorkloadManager and PicoD.)
| klog.Warningf("TODO(author): wire mTLS in follow-up PR. mTLS flags are currently parsed but not consumed by the Router listener.") | |
| klog.Warning("mTLS configuration was provided, but Router does not currently apply these settings to its listener; the mTLS settings will be ignored.") |
| // Validate mTLS configuration early (fail fast on bad flags) | ||
| mTLSCfg := mtls.Config{ | ||
| CertFile: *mtlsCertFile, | ||
| KeyFile: *mtlsKeyFile, | ||
| CAFile: *mtlsCAFile, | ||
| } | ||
| if err := mTLSCfg.Validate(); err != nil { | ||
| klog.Fatalf("Invalid mTLS configuration: %v", err) | ||
| } |
There was a problem hiding this comment.
The PR description/release-note states the components now support mTLS flags to bootstrap mTLS, but the listeners currently do not consume these flags (only warn). This is user-visible and potentially misleading: users can pass valid --mtls-* flags and still run without mTLS. Consider either (a) updating the release note/PR description to explicitly say the flags are accepted but not yet functional, (b) hiding the flags until mTLS is wired, or (c) erroring out when --mtls-* is set to avoid a false sense of security.
| // Reload on write or create | ||
| if event.Has(fsnotify.Write) || event.Has(fsnotify.Create) { | ||
| if err := cw.reload(); err != nil { | ||
| klog.Errorf("Failed to reload certificate: %v", err) | ||
| } | ||
| } | ||
| // Re-watch after atomic rename (spiffe-helper does atomic renames for both cert and key) | ||
| if event.Has(fsnotify.Rename) || event.Has(fsnotify.Remove) { | ||
| cw.handleRenameEvent(event.Name) | ||
| } |
There was a problem hiding this comment.
During atomic rotations where cert and key are replaced in quick succession, reloading immediately on any single file event can frequently attempt to load a mismatched cert/key pair (or a temporarily missing file), causing repeated error logs and delayed rotation. A more robust approach is to debounce/coalesce events (short timer) and/or retry tls.LoadX509KeyPair with backoff until it succeeds, so rotation converges quickly without log spam.
| var added bool | ||
| for i := 0; i < 5; i++ { | ||
| if err := cw.watcher.Add(targetFile); err != nil { | ||
| klog.V(4).Infof("Failed to re-watch %s (attempt %d/5): %v", targetFile, i+1, err) | ||
| time.Sleep(50 * time.Millisecond) | ||
| continue | ||
| } | ||
| added = true | ||
| break | ||
| } | ||
| if !added { | ||
| klog.Errorf("CRITICAL: Exhausted retry budget attempting to re-watch %s. Certificate hot-reload is degraded!", targetFile) | ||
| } |
There was a problem hiding this comment.
The retry budget (5) and sleep interval (50ms) are hard-coded magic numbers and duplicated in the log message (attempt %d/5). Consider extracting them into named constants (or making them configurable) to keep the code and logs consistent and make tuning easier. Also consider avoiding "CRITICAL" in free-form log text in favor of structured severity via log level / consistent prefixing.
|
|
||
| opts := x509.VerifyOptions{ | ||
| Roots: caPool, | ||
| Intermediates: intermediates, |
There was a problem hiding this comment.
For clarity and to prevent accidental behavior changes if Go's defaults evolve, consider explicitly setting VerifyOptions.KeyUsages to []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth} (client-side server verification). This makes the intended usage constraints explicit in the custom verification path.
| Intermediates: intermediates, | |
| Intermediates: intermediates, | |
| KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth}, |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Foundational mTLS library looks solid overall — good test coverage, correct use of InsecureSkipVerify with manual chain verification, and clean SPIFFE ID checking. Flagging a few issues that could bite in production rotation scenarios.
| return | ||
| } | ||
| // Reload on write or create | ||
| if event.Has(fsnotify.Write) || event.Has(fsnotify.Create) { |
There was a problem hiding this comment.
Bug – cert/key rotation race (no debounce).
Both cert and key files are watched independently, and every Write/Create event immediately calls reload() → tls.LoadX509KeyPair(certFile, keyFile). During an atomic rotation (e.g. spiffe-helper renames cert, then renames key), the first event loads the new cert with the old key → "private key does not match public key" error is logged, and the watcher falls back to the stale certificate.
The second file event self-heals, so it's not an outage, but with SPIRE SVIDs (1-hour TTL) you'd get ~24 spurious klog.Errorf lines/day/component — noisy for alerting.
Suggested fix: add a short debounce (e.g. a time.AfterFunc of 200–300 ms that resets on each event) so both files settle before calling reload(). controller-runtime's certwatcher uses exactly this pattern.
| var added bool | ||
| for i := 0; i < 5; i++ { | ||
| if err := cw.watcher.Add(targetFile); err != nil { | ||
| klog.V(4).Infof("Failed to re-watch %s (attempt %d/5): %v", targetFile, i+1, err) | ||
| time.Sleep(50 * time.Millisecond) | ||
| continue | ||
| } | ||
| added = true | ||
| break | ||
| } | ||
| if !added { |
There was a problem hiding this comment.
Bug – permanent watch loss after retry exhaustion; "degraded" is misleading.
If all 5 retries (total budget 250 ms) fail, the log says "Certificate hot-reload is degraded!" — but in reality the file watch for targetFile is permanently lost. No future changes to that file will ever be detected; the component silently serves a stale certificate until it expires, then connections fail with no self-healing path (restart required).
Two issues:
- Retry budget is very tight. 5 × 50 ms = 250 ms. On network-mounted volumes or under I/O pressure, file creation after an atomic swap can take longer. Consider backing off (e.g. 100, 200, 400, 800, 1600 ms) and/or bumping retries.
- No recovery path. Once the watch is lost the only fix is a process restart. Consider a background re-try goroutine, or at minimum change the error message from "degraded" to something like "permanently broken — process restart required to restore certificate rotation."
| return nil, nil, fmt.Errorf("server TLS config: mtls.Config is nil") | ||
| } | ||
|
|
||
| caPool, err := loadCAPool(cfg.CAFile) |
There was a problem hiding this comment.
Design gap – CA bundle is loaded once and never refreshed.
loadCAPool reads the CA file at startup and bakes the resulting *x509.CertPool into tls.Config.ClientCAs / RootCAs. If the CA rotates (standard in SPIFFE trust domains — trust bundle can change when a root or intermediate CA is re-keyed), the stale pool will reject certs signed by the new CA, causing a full connection outage once the old CA's certs expire.
Since this is a foundational PR, this doesn't need to be fixed here, but please document the limitation prominently (e.g. in the LoadServerConfig / LoadClientConfig godocs) so the follow-up PR that wires listeners knows to address it. A CAWatcher analogous to CertWatcher would close the gap.
There was a problem hiding this comment.
please check tls config, tls.Config.GetCertificate
| opts := x509.VerifyOptions{ | ||
| Roots: caPool, | ||
| Intermediates: intermediates, | ||
| } |
There was a problem hiding this comment.
Compatibility – VerifyOptions.KeyUsages defaults to ExtKeyUsageServerAuth only.
When KeyUsages is nil/empty, Go's x509.Certificate.Verify defaults to []ExtKeyUsage{ExtKeyUsageServerAuth}. The SPIFFE X.509-SVID spec says leaf SVIDs SHOULD (not MUST) include id-kp-serverAuth. Certs from non-SPIRE CAs or custom PKIs that omit this EKU will be silently rejected here.
Since the library aims to be "certificate-source agnostic", consider explicitly setting:
opts := x509.VerifyOptions{
Roots: caPool,
Intermediates: intermediates,
KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
}This makes chain verification purely about trust, leaving EKU policy to the caller/SPIFFE layer.
|
|
||
| // Start HTTP or HTTPS server | ||
| if s.config.MTLSCertFile != "" || s.config.MTLSKeyFile != "" || s.config.MTLSCAFile != "" { | ||
| klog.Warningf("TODO(author): wire mTLS in follow-up PR. mTLS flags are currently parsed but not consumed by the Router listener.") |
There was a problem hiding this comment.
Nit – TODO(author) violates K8s convention.
Per K8s coding conventions, TODOs should use a real username: TODO(mahil-2040):. Same applies to the identical TODOs in pkg/picod/server.go:119 and pkg/workloadmanager/server.go:177.
| // mTLS flags (certificate source abstraction) | ||
| mtlsCertFile := flag.String("mtls-cert-file", "", "Path to mTLS certificate file") | ||
| mtlsKeyFile := flag.String("mtls-key-file", "", "Path to mTLS private key file") | ||
| mtlsCAFile := flag.String("mtls-ca-file", "", "Path to mTLS CA bundle file") |
There was a problem hiding this comment.
can change to not use pointer type
|
|
||
| // mTLS flags (certificate source abstraction) | ||
| mtlsCertFile := flag.String("mtls-cert-file", "", "Path to mTLS certificate file") | ||
| mtlsKeyFile := flag.String("mtls-key-file", "", "Path to mTLS private key file") |
There was a problem hiding this comment.
BTW. I forget have you mounted this dir anywhere?
There was a problem hiding this comment.
For the control plane (Router/WorkloadManager), we already mounted the spiffe-helper volumes in the previous SPIRE PR. For PicoD, the sandbox pod is built dynamically in Go code. I haven't wired the volume mounts yet, but I added a specific TODO in workload_builder.go so we don't forget to mount the SPIRE CSI driver (or sidecar) in the follow-up PR where the listeners are wired!
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Overall solid implementation — the abstraction layer is clean, the CertWatcher pattern is correct, and wiring mTLS across all three components consistently is the right approach. A few correctness issues below.
| if debounceTimer != nil { | ||
| debounceTimer.Stop() | ||
| } | ||
| debounceTimer = time.AfterFunc(200*time.Millisecond, func() { |
There was a problem hiding this comment.
Use-after-stop race. time.AfterFunc fires in a new goroutine. If Stop() is called while the 200ms debounce is in flight, the closure will still invoke cw.reload() after the watcher is torn down. Add a done-channel check inside the closure:
debounceTimer = time.AfterFunc(200*time.Millisecond, func() {
select {
case <-cw.done:
return
default:
}
if err := cw.reload(); err != nil { ... }
})| ClientCAs: caPool, | ||
| ClientAuth: tls.RequireAndVerifyClientCert, | ||
| MinVersion: tls.VersionTLS12, | ||
| } |
There was a problem hiding this comment.
Prefer tls.VersionTLS13 for internal service-to-service mTLS. All three components are modern Go, and TLS 1.3 removes weak cipher suites still negotiable under TLS 1.2.
|
|
||
| // SandboxSPIFFEID is the SPIFFE identity for PicoD sandbox pods. | ||
| // Note: PicoD omits the namespace segment because sandboxes can run in any namespace. | ||
| SandboxSPIFFEID = "spiffe://cluster.local/sa/agentcube-sandbox" |
There was a problem hiding this comment.
SandboxSPIFFEID omits the ns/<namespace> segment. The SPIRE Kubernetes PSAT node attestor typically issues SVIDs as spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>. If the actual SVID differs from this constant, every PicoD mTLS handshake will fail at VerifyPeerCertificate. Please confirm this matches the SPIRE registration entry exactly.
There was a problem hiding this comment.
It omit /ns/<namespace> because sandbox pods can run in any namespace. The standard Istio-convention format (/ns/<ns>/sa/<sa>) is used for Router and WorkloadManager since they always run in agentcube-system. I expanded the comment to clearly explain
| // spiffeHelperConfigMapName is the ConfigMap name created by the Helm chart. | ||
| spiffeHelperConfigMapName = "spiffe-helper-config" | ||
| // spiffeHelperImage is the container image for the spiffe-helper sidecar. | ||
| spiffeHelperImage = "ghcr.io/spiffe/spiffe-helper:0.8.0" |
There was a problem hiding this comment.
spiffeHelperImage is hardcoded to ghcr.io/spiffe/spiffe-helper:0.8.0. Operators in air-gapped environments or upgrading SPIFFE helper versions must recompile. Move this into Config.SPIFFEHelperImage with this string as the default.
| ReadOnly: true, | ||
| }) | ||
|
|
||
| // Append mTLS flags ONLY if the container is a known PicoD entrypoint |
There was a problem hiding this comment.
Silent failure: if no container is named "picod" or "code-interpreter", the --enable-mtls args are never injected but there's no warning emitted. AgentRuntime container names are user-defined, so this will silently produce sandboxes with mTLS volumes but no mTLS flags. Add a warning log if the loop finds zero matching containers.
| }, | ||
| } | ||
|
|
||
| if enableMTLS { |
There was a problem hiding this comment.
mTLS not injected on the warm-pool (SandboxClaim) path. When WarmPoolSize > 0, buildSandboxByCodeInterpreter returns early at line ~413 before reaching this injectMTLSVolumes call. Claim-provisioned sandbox pods will not get the spiffe-helper sidecar or cert mounts. If the SandboxTemplate already embeds the sidecar, add a comment here. Otherwise injectMTLSVolumes needs to run on the claim-path pod spec too.
- Fixed CertWatcher shutdown race and resource leaks - Refactored mTLS injection logic and improved operator warnings - Added configurable SPIFFE helper image support - Added graceful shutdown handling for PicoD - Changed the Min TLS version from TLS1.2 to TLS1.3 Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
hzxuzhonghu
left a comment
There was a problem hiding this comment.
I will continue reading the rest parts this afternoon
There was a problem hiding this comment.
can we revert picod side, this belong to users' workload. This is hacky, correct me if i am wrong
There was a problem hiding this comment.
You're right, i have reverted it. The previous approach had WorkloadManager reaching into PicoD's container spec to inject --enable-mtls / --mtls-cert-file flags, which couples infrastructure to the workload's CLI interface.
The fix: PicoD now auto-detects mTLS on startup by checking if the well-known SPIRE cert files exist at /run/spire/certs/ (provisioned by the spiffe-helper sidecar). If all three SVID files are present, mTLS is enabled automatically, no flags needed. WorkloadManager only injects the sidecar container, volumes, and cert mounts (the standard sidecar injection pattern) and never touches the workload container's args.
- Removed workload CLI arg mutation for mTLS configuration - Added cert-volume-only injection using standard sidecar pattern - Preserved user-defined spiffe-helper annotations - Scoped SPIRE sidecar injection to CodeInterpreter sandboxes only - Added automatic mTLS detection in PicoD based on SPIRE cert presence - Restricted Router mTLS transport to CodeInterpreter workloads - Clarified Router mTLS flag behavior in CLI help text Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
| // LoadServerConfig returns a tls.Config for a server that requires mTLS. | ||
| // Certs are loaded dynamically via GetCertificate to support rotation. | ||
| // If expectedClientIDs is non-empty, clients must present a matching SPIFFE ID. | ||
| func LoadServerConfig(cfg *Config, expectedClientIDs []string) (*tls.Config, *CertWatcher, error) { |
There was a problem hiding this comment.
prefer implement all in this pr. so it can be tested
| return nil, nil, fmt.Errorf("server TLS config: mtls.Config is nil") | ||
| } | ||
|
|
||
| caPool, err := loadCAPool(cfg.CAFile) |
There was a problem hiding this comment.
please check tls config, tls.Config.GetCertificate
|
Please also update the e2e test setup, so mtls can be tested |
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
- Separated Router upstream mTLS flags from external HTTPS listener flags - Added certificate file existence validation during startup - Implemented dynamic CA bundle rotation with CertWatcher reloads - Added dynamic peer verification against rotating CA pools - Updated tests for the new dynamic CA loading architecture Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
i am working on e2e tests |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (2)
pkg/workloadmanager/handlers_test.go:382
- The gomonkey patch for buildSandboxByAgentRuntime no longer matches the updated function signature (it now takes an extra spiffeHelperImageOverride string). As written, this test will not compile; update the patched function to accept the full argument list (namespace, name, informers, enableMTLS, spiffeHelperImageOverride).
patches.ApplyFunc(buildSandboxByAgentRuntime, func(_, _ string, _ *Informers, _ bool) (*sandboxv1alpha1.Sandbox, *sandboxEntry, error) {
if tc.kind != types.AgentRuntimeKind {
return nil, nil, errors.New("unexpected kind")
}
if tc.buildErr != nil {
pkg/workloadmanager/handlers_test.go:392
- The gomonkey patch for buildSandboxByCodeInterpreter does not match the new signature (missing spiffeHelperImageOverride string). This will fail to compile; adjust the patch function to accept the extra parameter.
patches.ApplyFunc(buildSandboxByCodeInterpreter, func(_, _ string, _ *Informers, _ bool) (*sandboxv1alpha1.Sandbox, *extensionsv1alpha1.SandboxClaim, *sandboxEntry, error) {
if tc.kind != types.CodeInterpreterKind {
return nil, nil, nil, errors.New("unexpected kind")
}
if tc.buildErr != nil {
| // Validate mTLS configuration early (fail fast on bad flags) | ||
| tlsConfig := mtls.Config{ | ||
| CertFile: *tlsCert, | ||
| KeyFile: *tlsKey, | ||
| CAFile: *tlsCA, | ||
| } | ||
| if err := tlsConfig.Validate(); err != nil { | ||
| klog.Fatalf("Invalid mTLS configuration: %v", err) | ||
| } | ||
|
|
||
| if *enableMTLS && !tlsConfig.Enabled() { | ||
| klog.Fatalf("Invalid mTLS configuration: --enable-mtls requires --tls-cert, --tls-key, and --tls-ca") |
| caPool := watcher.GetCAPool() | ||
| opts := x509.VerifyOptions{ | ||
| Roots: caPool, | ||
| Intermediates: intermediates, | ||
| KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth}, | ||
| } | ||
| if _, err := peerCert.Verify(opts); err != nil { | ||
| return fmt.Errorf("verify client certificate chain: %w", err) | ||
| } |
| caPool := watcher.GetCAPool() | ||
| opts := x509.VerifyOptions{ | ||
| Roots: caPool, | ||
| Intermediates: intermediates, | ||
| KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth}, | ||
| } | ||
| if _, err := peerCert.Verify(opts); err != nil { | ||
| return fmt.Errorf("verify server certificate chain: %w", err) | ||
| } |
| } | ||
|
|
||
| if mtlsCfg == nil || !mtlsCfg.Enabled() { | ||
| return nil, fmt.Errorf("--enable-mtls requires --mtls-cert-file, --mtls-key-file, and --mtls-ca-file to be set") |
| debug = flag.Bool("debug", false, "Enable debug mode") | ||
| maxConcurrentRequests = flag.Int("max-concurrent-requests", 1000, "Maximum number of concurrent requests that a router server can handle (0 = unlimited)") | ||
| enableMTLS = flag.Bool("enable-mtls", false, "Enable mutual TLS on the Router listener and for Router-to-WorkloadManager connections") | ||
|
|
||
| // mTLS client certificates used by the Router to authenticate with upstream WorkloadManager. | ||
| // These are distinct from --tls-cert/--tls-key which serve the Router's own HTTPS listener. | ||
| mtlsCert = flag.String("mtls-cert", "", "Path to mTLS client certificate for upstream WorkloadManager connections") |
| // Start PicoD server in goroutine | ||
| errCh := make(chan error, 1) | ||
| go func() { | ||
| klog.Infof("Starting PicoD server on port %d", *port) | ||
| if err := server.Start(ctx); err != nil { | ||
| errCh <- err | ||
| } | ||
| close(errCh) | ||
| }() | ||
|
|
||
| // Wait for signal or fatal error | ||
| select { | ||
| case <-ctx.Done(): | ||
| klog.Info("Received shutdown signal, shutting down gracefully...") | ||
| cancel() | ||
| <-errCh | ||
| case err := <-errCh: | ||
| klog.Fatalf("Server error: %v", err) | ||
| } |
| func main() { | ||
| port := flag.Int("port", 8080, "Port for the PicoD server to listen on") | ||
| workspace := flag.String("workspace", "", "Root directory for file operations (default: current working directory)") | ||
|
|
||
| // Initialize klog flags | ||
| klog.InitFlags(nil) | ||
| flag.Parse() | ||
|
|
||
| config := picod.Config{ | ||
| Port: *port, | ||
| Workspace: *workspace, | ||
| } |
| VolumeSource: corev1.VolumeSource{ | ||
| HostPath: &corev1.HostPathVolumeSource{ | ||
| Path: spireAgentSocketPath, | ||
| Type: hostPathType(corev1.HostPathDirectoryOrCreate), |
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Description
This PR implements the foundational mTLS certificate abstraction layer (
pkg/mtls) and actively wires the internal HTTP/gRPC listeners across the AgentCube architecture. Following the secure workload identity framework outlined inauth-proposal.md, this PR fully establishes the zero-trust, mutually authenticated TLS connections for the Router, WorkloadManager, and PicoD components.This officially completes Phase 1: Internal Workload Authentication (SPIRE) of the Auth Proposal.
Core Changes
1. Dynamic Zero-Downtime Rotation (CertWatcher)
Introduced a highly robust,
fsnotify-drivenCertWatchercapable of hot-reloading certificates from disk without dropping active connections.2. Strict SPIFFE Identity Verification
Added
pkg/mtls/spiffeid.gowith hardcoded, secure SPIFFE identity constants.RouterSPIFFEID.WorkloadManagerSPIFFEIDorSandboxSPIFFEIDbefore proxying requests.3. Component Wiring (Router & WorkloadManager)
Wired the
--enable-mtlsand certificate path flags directly into the Router and WorkloadManagerConfigstructs. Their respective internal HTTP transports and listeners are now securely wrapped with mTLS using the new abstraction layer.4. Dynamic SPIRE Sidecar Injection (PicoD)
Enhanced the dynamic Sandbox pod builder to natively support SPIRE. When mTLS is enabled, the builder automatically injects the
spiffe-helpersidecar, mounts thespire-agent-socketfrom the host node, provisions the shared emptyDir volumes, and appends the--enable-mtlsflags to the PicoD container arguments.5. PicoD Authentication Mode Toggle
Added the
--picod-auth-modeflag to the Router (mtlsorjwt). This gives operators the flexibility to choose between high-security transport-layer authentication or latency-optimized application-layer authentication for sandboxes. The PicoD server now correctly bypasses the JWT middleware when mTLS is enabled.6. Production-Pattern E2E Testing
Added in-process integration tests to
mtls_e2e_test.gothat simulate the actualRouter→WorkloadManagerandRouter→PicoDhandshake channels using the real SPIFFE constants, ensuring identity enforcement works flawlessly.Verification
pkg/mtlstest suite passes (including rotational tests, intermediate cert validation logic, and production-pattern handshake E2E tests).pkg/router,pkg/workloadmanager, andpkg/picodunit tests updated and passing.Router→WorkloadManagercommunication with dynamically generated certificates and successful health checks.Which issue(s) this PR fixes:
Fixes Part of #243
Special notes for your reviewer:
The
ptrpackage was removed during the pod builder sidecar injection logic as it is natively supported via pointer creation. If multi-cluster or configurable trust domains are needed in the future, the hardcoded constants inspiffeid.goshould be exposed via CLI flags.Does this PR introduce a user-facing change?:
yes