diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md
new file mode 100644
index 0000000..f9b94a6
--- /dev/null
+++ b/cilium/CFP-41953-clustermesh-service-v2.md
@@ -0,0 +1,241 @@
# CFP-41953: ClusterMesh Service v2

**SIG: SIG-clustermesh**

**Begin Design Discussion:** 2025-10-01

**Cilium Release:** 1.20

**Authors:** Arthur Outhenin-Chalandre

**Status:** Implementable

## Summary

This CFP proposes introducing v2 of the clustermesh global service data
format stored in etcd. It transitions from `cilium/state/services/v1/`
to `cilium/state/endpointslices/v1/` and harmonizes how backend data is
ingested between clustermesh services and Kubernetes services.

## Motivation

The current clustermesh global service data is handled with the
[`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52).
This struct is encoded as JSON and stored in etcd. While this format
served well initially, it now has several limitations that prevent
clustermesh from scaling efficiently and from supporting new features. These
limitations fall into three main areas: missing backend conditions, suboptimal
performance for service updates, and an inefficient data format and encoding.

### Missing backend conditions

The current format omits all backend conditions: backends that are not ready
and not serving are simply removed. This means that we cannot properly
perform graceful termination as described in the
[KPR documentation](https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#graceful-termination),
most likely resulting in some traffic loss during rolling updates.

### Performance gap with the loadbalancer k8s reflector

There is a large performance gap between the clustermesh reflector and the
standard loadbalancer k8s reflector: their update and ingestion behavior is
currently not on par. The table below shows how large this gap is:

| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh w/o JSON ÷ k8s) |
|----------|------------------|------------------------------------|-----------------------|------------------------------------|
| 1        | 44               | 35                                 | 4                     | 9x slower                          |
| 100      | 629              | 271                                | 4                     | 68x slower                         |
| 1,000    | 6,349            | 2,626                              | 7                     | 375x slower                        |
| 5,000    | 36,861           | 15,831                             | 30                    | 528x slower                        |
| 10,000   | 78,810           | 44,289                             | 70                    | 633x slower                        |

These benchmarks are not strictly equivalent, but they give a good idea of
the size of the gap, of how much JSON decoding contributes to it, and of how
clustermesh degrades as the number of backends grows.

### Inefficient data format and network traffic

The `ClusterService` struct/format was designed to fit the loadbalancer
internals when it was introduced in 2018. After several iterations and
refactors since then, the two formats have diverged. One recent example:
`ClusterService` encoded similar ports with different names as multiple
entries, while the loadbalancer backends encoded them as one entry with
multiple port names attached. This divergence led to a bug where one port
shadowed the other, which was recently fixed and will be released in
1.19.4 / 1.18.10.

In terms of wire size, the backend map duplicates the port information for
each IP, so the total size of the object grows roughly with the product of
the number of ports and the number of backend IPs. In addition, JSON encoding
adds extra overhead for field names and number formatting compared to a
binary representation.
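
To illustrate the duplication, here is a simplified and abridged sketch of
the v1 wire format for a service with two backends and two ports. Field names
and layout are approximations of the linked `ClusterService` struct, not its
authoritative JSON encoding:

```json
{
  "cluster": "cluster-1",
  "namespace": "default",
  "name": "my-service",
  "backends": {
    "10.0.1.10": {
      "http":    { "Protocol": "TCP", "Port": 8080 },
      "metrics": { "Protocol": "TCP", "Port": 9090 }
    },
    "10.0.1.11": {
      "http":    { "Protocol": "TCP", "Port": 8080 },
      "metrics": { "Protocol": "TCP", "Port": 9090 }
    }
  }
}
```

Every additional backend repeats the full port map, and every additional port
grows every backend entry.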
All of this data must be replicated to every node in the mesh. A mesh often
has many more nodes than a single cluster, which results in a high volume
of control plane network traffic.

Additionally, etcd imposes a hard limit of 1.5 MiB per object by default.
Without a breaking change to a more efficient format, adding the missing
fields to the current `ClusterService` struct would restrict scalability to
fewer than 10,000 backends per service per cluster. While this is a high
limit, having headroom beyond this point is useful for future growth.

Even below that limit, keeping objects small is important to reduce network
traffic when backends change. This situation is similar to the Kubernetes
community's move from `Endpoints` to `EndpointSlice`, but the problem is even
more pronounced in clustermesh because a mesh can contain many more nodes
than a single Kubernetes cluster.

For example, the current `ClusterService` object for a service with 5,000
backends and 2 ports weighs 786.89 KiB. In an 11-cluster mesh where each
cluster has 1,000 nodes, a single endpoint update in this service would
propagate about 7.5 GiB of data across the mesh. From the perspective of a
single cluster's control plane, this corresponds to about 750 MiB per update.
Assuming, for simplicity, 10 updates per second, this would result in
7.5 GiB/s of control plane traffic within that single cluster!
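
For reference, the arithmetic behind these figures, assuming each update fans
out to the nodes of the 10 remote clusters:

```
per-update fan-out:  786.89 KiB × 10 remote clusters × 1,000 nodes ≈ 7.5 GiB
per-cluster share:   786.89 KiB × 1,000 nodes ≈ 768 MiB (~750 MiB)
at 10 updates/s:     ~750 MiB × 10 ≈ 7.5 GiB/s within one cluster
```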
## Goals

* Reduce the network bandwidth needed for control plane operations on large
  services
* Improve clustermesh service ingestion performance in the agent, in
  particular under service churn
* Allow clustermesh to scale to a larger number of backends per service per
  cluster
* Add backend conditions to clustermesh services to allow correct backend
  state handling in the loadbalancer
* Add the `EndpointSlice` name to the clustermesh service data to simplify
  the EndpointSliceSync logic

## Non-Goals

* Changes not specific to clustermesh global services (for example
  MCS-API handling)
* Large changes to non-clustermesh loadbalancer logic

## Proposal

### Overview

This CFP proposes storing `EndpointSlice` objects directly and individually,
rather than aggregating all backends of a service into a single object.

This will allow more code reuse between the Kubernetes reflector in the
loadbalancer packages and the clustermesh package.

The new format will include every field of the `EndpointSlice` struct,
including endpoint conditions. The clustermesh code will preserve these
conditions and let the existing loadbalancer code apply the same backend
state handling as it does for local services, instead of only receiving
backends in an active state.

This will also significantly simplify the EndpointSliceSync codebase, as we
will be able to simply mirror EndpointSlices from remote clusters without the
complex sharding logic. The inclusion of the conditions should also benefit
consumers watching those EndpointSlices, for instance third-party Gateway API
implementations.

### Using the EndpointSlice struct directly

The Kubernetes `EndpointSlice` API must remain backward compatible across
Kubernetes versions. This aligns well with Cilium clustermesh's upgrade
requirement of supporting at least two consecutive minor versions.

We will embed the actual `EndpointSlice` struct in a wrapper type so that we
can extend it in the future if needed. The initial version should look like
this:

```go
type ClusterEndpointSlice struct {
    Cluster       string
    EndpointSlice slim_discovery_v1.EndpointSlice
}
```

The kube-controller-manager (KCM) can update all EndpointSlices at up to 20
updates per second, with bursts of up to 30. Kubernetes scalability tests
also exercise Services, and in those tests, despite significantly boosting
KCM QPS to 100 or 500, EndpointSlice updates remain at around 10 or fewer per
second. At very large scale (5k nodes), they can reach ~45 updates per
second.

This QPS is entirely manageable for clustermesh, despite the likely higher
rate that comes from managing individual EndpointSlices and including
endpoint conditions. The clustermesh-apiserver currently only has a 20 QPS
budget, but we could likely raise it into the 50-100 range. Note that
kvstoremesh has a 100 QPS budget, which would still cap the overall event
rate from all remote clusters; if more QPS is needed across all clusters and
object types, each will compete for that budget.

### Unifying datapath ingestion pipelines

The clustermesh and loadbalancer k8s reflectors currently diverge in both
their data structures and their ingestion logic. This divergence has created
the performance gap described in the Motivation section and makes it harder
to maintain performance parity between the two code paths.

As we will be using `EndpointSlice` objects directly, we should be able to
unify both code paths and allow the Kubernetes reflector to receive events
from remote clusters.

Given the 9-633x ingestion gap shown in the Motivation benchmarks, the added
churn from handling individual EndpointSlices and conditions should remain
manageable while still making this path significantly more performant than
the current clustermesh v1 path.

### EndpointSlice encoding

TODO: open question. The tables below present the size and decoding speed of
a full EndpointSlice (100 endpoints). Our current starting point in Cluster
Mesh would be roughly the JSON row, while Kubernetes uses protobuf.

We could potentially encode EndpointSlices in a less compact format in the
source clustermesh-apiserver data and let kvstoremesh reflect it in a more
optimized format afterwards.

Size:

| Format   | Raw       | zstd  |
| -------- | --------- | ----- |
| JSON     | 10.59 KiB | 435 B |
| CBOR     | 7.42 KiB  | 404 B |
| Protobuf | 2.88 KiB  | 283 B |

Decode speed:

| Format   | Raw    | zstd   |
| -------- | ------ | ------ |
| JSON     | 420 µs | 449 µs |
| CBOR     | 222 µs | 252 µs |
| Protobuf | 62 µs  | 74 µs  |
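
For context, the following is a minimal sketch of how the size comparison
above could be reproduced. It assumes the slim `EndpointSlice` type from the
Cilium tree and its generated protobuf `Marshal` method, plus the third-party
`github.com/fxamacker/cbor/v2` library; the `makeEndpointSlice` helper is
illustrative and only builds a synthetic slice:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/fxamacker/cbor/v2"
	slim_discovery_v1 "github.com/cilium/cilium/pkg/k8s/slim/k8s/api/discovery/v1"
)

// makeEndpointSlice builds a synthetic EndpointSlice with n ready endpoints.
func makeEndpointSlice(n int) *slim_discovery_v1.EndpointSlice {
	eps := &slim_discovery_v1.EndpointSlice{
		AddressType: slim_discovery_v1.AddressTypeIPv4,
	}
	ready := true
	for i := 0; i < n; i++ {
		eps.Endpoints = append(eps.Endpoints, slim_discovery_v1.Endpoint{
			Addresses:  []string{fmt.Sprintf("10.0.%d.%d", i/256, i%256)},
			Conditions: slim_discovery_v1.EndpointConditions{Ready: &ready},
		})
	}
	return eps
}

func main() {
	eps := makeEndpointSlice(100)

	jsonBytes, _ := json.Marshal(eps)
	cborBytes, _ := cbor.Marshal(eps)
	protoBytes, _ := eps.Marshal() // protobuf marshaller generated for the slim types

	fmt.Printf("json=%dB cbor=%dB proto=%dB\n",
		len(jsonBytes), len(cborBytes), len(protoBytes))
}
```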
### Rollout strategy

As the Cluster Mesh code path is currently maintained by a small team, we are
aiming for the most straightforward rollout strategy that minimizes code
changes.

We will introduce a new configuration option, `clustermesh-service-v2`, with
the following possible values; the defaults will evolve over minor releases:
- `prefer-legacy` (default in 1.20): Export `EndpointSlice` while still
  exporting and consuming `ClusterService`.
- `prefer-endpointslice` (default in 1.21): Export and consume
  `EndpointSlice` while still exporting `ClusterService` for backwards
  compatibility.
- `only-endpointslice`: Only export and consume `EndpointSlice` and stop
  exporting `ClusterService`.

Cluster Mesh supports one minor version of skew. The goal of this rollout is
to make sure we do not break that guarantee while transitioning to the new
format, while still allowing advanced users to jump to the new format
earlier. The option, along with all `ClusterService`-related code, is planned
for removal in Cilium 1.22.

## Impacts / Key Questions

### Key Questions: Chunking multiple EndpointSlices vs a single object

The question of whether to group EndpointSlices was heavily debated. The
existing `ClusterService` format groups all backends/EndpointSlices of a
Service in a single object. Ultimately, we favored individual EndpointSlices
because:
- KCM EndpointSlice QPS is manageable in our case
- It matches the Kubernetes model and is simple to reason about
- Kubernetes may provide alternative Service APIs in the future, and not
  relying explicitly on Services may help future-proof the format
- Handling individual EndpointSlice updates should be very natural and simple
  for the loadbalancer and EndpointSliceSync code

### Key Questions: Encoding format

TODO: open question; see the [EndpointSlice encoding](#endpointslice-encoding)
section above for the candidate formats and measurements.