# CFP-41953: ClusterMesh Service v2

**SIG: SIG-clustermesh**

**Begin Design Discussion:** 2025-10-01

**Cilium Release:** 1.20

**Authors:** Arthur Outhenin-Chalandre <git@mrfreezeex.fr>

**Status:** Implementable

## Summary

This CFP proposes introducing v2 of the clustermesh global service data
format stored in etcd. It transitions from `cilium/state/services/v1/`
to `cilium/state/endpointslices/v1/` and harmonizes backend data insertion
techniques between clustermesh services and Kubernetes services.

> **Reviewer comment (on lines +16 to +17):** It is definitely a non-goal for this CFP, but I wonder if it could make sense to mention somewhere that we might want to add service entries back in the future if they are helpful for certain features (maybe some autocreation logic?), but that would be a representation of the service itself, without the backends (similarly to the MCS-API representation).
>
> **Reply (MrFreezeex, author, May 12, 2026):** Hmm yeah sure, I don't particularly mind adding some mention of that! Theoretically we don't have anything in Cilium OSS that relies on Service-level data today, and MCS-API would probably (?) cover features that need that data in the future.

## Motivation

The current clustermesh global service data is handled with the
[`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52).
This struct is encoded in JSON format and stored in etcd. While this format
has served well initially, it now faces several limitations that prevent
clustermesh from scaling efficiently and supporting new features. These
limitations fall into three main areas: missing backend conditions, suboptimal
performance for service updates, and inefficient data format and encoding.

### Missing backend conditions

The current format omits all backend conditions. It directly removes
backends that are not ready and not serving. This means that we cannot
properly perform graceful termination as described in the
[KPR documentation](https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#graceful-termination),
most likely resulting in some traffic loss during rolling updates.
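
For context, the conditions in question are the standard EndpointSlice endpoint conditions. The sketch below shows a draining backend using the upstream `k8s.io/api/discovery/v1` types (the slim Cilium equivalents carry the same fields); it is exactly this not-ready-but-still-serving state that the current format cannot represent:

```go
package example

import (
	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/utils/ptr"
)

// drainingEndpoint illustrates a backend that is draining: no longer ready for
// new connections, but still serving existing ones. The current ClusterService
// format drops such backends entirely; the v2 format would preserve these
// conditions so the loadbalancer can apply graceful termination.
var drainingEndpoint = discoveryv1.Endpoint{
	Addresses: []string{"10.0.0.42"},
	Conditions: discoveryv1.EndpointConditions{
		Ready:       ptr.To(false),
		Serving:     ptr.To(true),
		Terminating: ptr.To(true),
	},
}
```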

### Performance gap with the loadbalancer k8s reflector

There is a large performance gap between the clustermesh and the
standard loadbalancer k8s reflector. Their update and ingestion behavior
is not currently on par. The table below shows how large this gap is:

| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh w/o JSON/k8s) |
|----------|------------------|------------------------------------|-----------------------|----------------------------------|
| 1 | 44 | 35 | 4 | 9x slower |
| 100 | 629 | 271 | 4 | 68x slower |
| 1 000 | 6 349 | 2 626 | 7 | 375x slower |
| 5 000 | 36 861 | 15 831 | 30 | 528x slower |
| 10 000 | 78 810 | 44 289 | 70 | 633x slower |

These benchmarks are not strictly equivalent, but they give a good idea
of the performance gap, how much JSON decoding contributes to it, and how
clustermesh degrades as the number of backends grows.
> **Reviewer comment:** Nit. For future reference, I'd suggest adding a brief remark that the LB k8s column does not include the unmarshaling phase.


### Inefficient data format and network traffic

The `ClusterService` struct/format was designed to fit the loadbalancer internals
when it was introduced in 2018. In 2026, after several iterations and refactors,
the two representations have diverged. One recent example: `ClusterService`
encoded similar ports with different names as multiple entries, while the
loadbalancer backends encoded them as one entry with multiple port names
attached. This divergence led to a bug with one port shadowing the other,
which was recently fixed and will be released in 1.19.4 / 1.18.10.

In terms of wire size, the backend map duplicates port information for each IP,
so the total size of the object tends to grow almost in proportion to the number
of ports and IPs. In addition, JSON encoding adds extra overhead
for field names and number formatting compared to a binary representation.

All of this data must be replicated to every node in the mesh. A mesh often
has many more nodes than a single cluster, which results in a high volume
of control plane network traffic.

Additionally, etcd imposes a hard limit of 1.5 MiB per object. Without a
breaking change to a more efficient format, adding the missing fields to
the current `ClusterService` struct would restrict scalability to fewer
than 10,000 backends per service per cluster. While this is a high limit,
having headroom beyond this point is useful for future growth.

Even below the limit, keeping objects small is important to reduce network
traffic when backends change. This situation is similar to the Kubernetes
community's move from `Endpoints` to `EndpointSlice`, but the problem is
even more pronounced in clustermesh because a mesh can contain many more
nodes than a single Kubernetes cluster.

For example, the current `ClusterService` object for a service with 5 000 backends
and 2 ports weighs 786.89 KiB. In an 11-cluster mesh where each cluster has 1 000
nodes, any update to any endpoint of this single service would result in about
7.5 GiB of data propagated globally across the mesh. From the perspective of a
single cluster control plane, this corresponds to about 750 MiB per update.
Assuming 10 updates per second for convenience, this would result in 7.5 GiB/s
of control plane traffic within that single cluster!
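
Spelled out, the back-of-the-envelope arithmetic behind those figures (assuming roughly 10 receiving clusters of 1 000 nodes each) is:

```math
\begin{aligned}
786.89\ \text{KiB} \times 1\,000\ \text{nodes} &\approx 768\ \text{MiB per cluster per update} \\
768\ \text{MiB} \times 10\ \text{receiving clusters} &\approx 7.5\ \text{GiB mesh-wide per update} \\
768\ \text{MiB} \times 10\ \text{updates/s} &\approx 7.5\ \text{GiB/s within one cluster}
\end{aligned}
```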

## Goals

* Reduce network bandwidth needed for control plane operations on large services
* Improve clustermesh service ingestion performance in the agent, in particular
in service churn scenarios
* Allow clustermesh to scale to a larger number of backends per service per
cluster
* Add backend conditions to clustermesh services to allow correct backend
state handling in the loadbalancer
* Add `EndpointSlice` name to the clustermesh service data to simplify
the EndpointSliceSync logic

## Non-Goals

* Changes not specific to clustermesh global services (for example
MCS-API handling)
* Large changes to non-clustermesh loadbalancer logic

## Proposal

### Overview

This CFP proposes storing `EndpointSlice` objects directly and individually in
the kvstore, rather than grouping all backends of a service into a single object.

This will allow more code reuse between the Kubernetes reflector in the
loadbalancer packages and the clustermesh package.

The new format will include every field from the EndpointSlice struct, including
endpoint conditions. The clustermesh code will preserve these conditions and let
the existing loadbalancer code apply the same backend state handling as it does
for local services, instead of only receiving backends in an active state.

This will also significantly simplify the EndpointSliceSync codebase as we will
be able to simply mirror EndpointSlices from remote clusters without the
complex sharding logic. The inclusion of the conditions should also benefit
consumers watching those EndpointSlices, for instance third-party GW-API
implementations.

### Using EndpointSlice struct directly

The Kubernetes `EndpointSlice` API must remain backward compatible across
Kubernetes versions. This aligns well with Cilium clustermesh's upgrade
requirement to support at least two consecutive minor versions.

We will embed the actual `EndpointSlice` struct to make sure we can extend it in
the future if needed. The initial version should look like this:

```go
type ClusterEndpointSlice struct {
	Cluster       string
	EndpointSlice slim_discovery_v1.EndpointSlice
}
```

> **Reviewer comment (on `Cluster string`):** I would add the ClusterID as well, for consistency with the current ClusterService structure and e.g., the nodes representation.

> **Reviewer comment (on `EndpointSlice slim_discovery_v1.EndpointSlice`):** I think that we have a few alternatives:
>
> 1. Embed the full slim_discovery_v1.EndpointSlice object, as proposed here. This has the main advantage of fully matching the upstream type, and the disadvantage of possibly including fields that likely don't make sense in this context (e.g., metav1.TypeMeta and a bunch of metav1.ObjectMeta fields, such as owner references) and potentially causing confusion and increasing the overhead, if not stripped correctly.
> 2. Embed all root-level fields (AddressType/Endpoints/Ports), and the subset of relevant metav1.ObjectMeta fields (maybe under a custom Meta type, which also includes the ClusterName/ID). This would give us some more control over which fields are actually meaningful, while still preserving the full upstream match for the actual fields (and the trivial conversion between types).
> 3. Embed (a variant of) the k8s.Endpoints type, which is currently used by the LB subsystem. This would give us even more control, as the struct is fully custom, but it is prone to divergences over time, and I wouldn't exclude that this intermediate abstraction will be removed anyway at a certain point, as it mostly existed as a common middle ground between Endpoints and EndpointSlices. It also brings extra overhead due to conversions.
>
> I'm personally somewhat attracted by option 2, but curious to hear your thoughts. It may also not really matter that much if we go for protobuf, given that TypeMeta is not present anyway in that definition.
>
> **Reply (MrFreezeex, author, May 12, 2026):** I agree with you, option 2 looks a bit more appealing at first glance! If we keep the top-level fields it should be relatively easy to get the loadbalancer & EndpointSliceSync working while making sure that we have more control over / less confusion about what is really serialized!
>
> I agree with the disadvantages you stated for option 3; it also includes types.UnserializableObjects right now, which is more of a small detail but still something that we would need to change if we went with option 3...
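
For illustration only, a minimal sketch of what option 2 from the discussion above could look like. The type and field names (`ClusterEndpointSliceMeta`, `ClusterID`, the label subset) are hypothetical and only meant to show the shape, not a committed design; `slim_discovery_v1` refers to `github.com/cilium/cilium/pkg/k8s/slim/k8s/api/discovery/v1`:

```go
// Sketch of option 2: embed only the root-level EndpointSlice fields plus a
// small custom metadata type. All names here are illustrative only.
type ClusterEndpointSliceMeta struct {
	// Name and Namespace of the EndpointSlice in the source cluster.
	Name      string
	Namespace string
	// Labels carries the subset of ObjectMeta labels relevant for service
	// association (e.g. kubernetes.io/service-name).
	Labels map[string]string
	// Source cluster identity, mirroring the existing ClusterService fields.
	Cluster   string
	ClusterID uint32
}

type ClusterEndpointSlice struct {
	Meta        ClusterEndpointSliceMeta
	AddressType slim_discovery_v1.AddressType
	Endpoints   []slim_discovery_v1.Endpoint
	Ports       []slim_discovery_v1.EndpointPort
}
```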

The kube-controller-manager (KCM) can update EndpointSlices at up to 20 times per
second, with bursts of up to 30. Kubernetes scalability tests also cover Services,
and in those tests, despite significantly boosting the KCM QPS to 100 or 500,
EndpointSlice updates remain around 10 or fewer per second. At very large scale
(5k nodes), they can go up to ~45 updates per second.

This QPS is entirely manageable for clustermesh, despite the likely higher update
rate that comes from managing individual EndpointSlices and adding endpoint
conditions. We currently allow only 20 QPS for the clustermesh-apiserver, but we
could likely raise it into the 50-100 range. Note that kvstoremesh has 100 QPS
and would still cap the overall QPS for events from all remote clusters; if more
updates arrive across all clusters and object types, they compete for that QPS
budget.

### Unifying datapath ingestion pipelines

The clustermesh and loadbalancer k8s reflector currently diverge in both
their data structures and ingestion logic. This divergence has created the
performance gap described in the Motivation section and makes it harder to
maintain performance parity between the two code paths.
> **Reviewer comment:** Nit. performance (and feature) parity [...]

As we will be using `EndpointSlice` objects directly, we should be able to unify
both code paths and allow the Kubernetes reflector to receive events from remote
clusters.

Given the 10-600x ingestion gap shown in the Motivation benchmarks, the added
churn from handling individual EndpointSlices and conditions should remain
manageable while still making this path significantly more performant than the
current clustermesh v1 path.
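
As an illustration only (not the actual implementation), the unified path could boil down to remote EndpointSlice events being handed to the same ingestion entry point used for local objects; `ingestEndpointSlice` below is a hypothetical stand-in for whatever shared function the loadbalancer reflector ends up exposing:

```go
// Hypothetical sketch: both local and remote EndpointSlices flow through one
// shared ingestion function, so performance characteristics stay identical.
func ingestEndpointSlice(sourceCluster string, eps *slim_discovery_v1.EndpointSlice) {
	// Shared conversion to loadbalancer backends, preserving endpoint
	// conditions (ready/serving/terminating).
}

// onRemoteEndpointSliceEvent would be invoked for every create/update/delete
// observed in a remote cluster's kvstore.
func onRemoteEndpointSliceEvent(ev ClusterEndpointSlice) {
	// Remote objects use the exact same code path as local ones; only the
	// owning cluster differs.
	ingestEndpointSlice(ev.Cluster, &ev.EndpointSlice)
}
```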

### EndpointSlice encoding

TODO: open question. See the tables below, which present the size and decoding
speed of a full EndpointSlice (100 endpoints). Our current starting point in
Cluster Mesh is roughly the JSON row, while Kubernetes itself uses protobuf.
> **Reviewer comment (on lines +181 to +183):** Thanks for the benchmarks!
>
> I think that adopting zstd for compressing the entries is fairly straightforward, the difference in terms of size is huge (10x or more), and the overhead during decoding seems fairly negligible (I assume that it may be a bit higher on encoding, but it doesn't really matter much there).
>
> I'm personally a bit more on the fence concerning the actual encoding format. On the one hand, CBOR looks like a reasonable middle ground to me, given that it is schema-less, and we could easily adapt the kvstore/list script command to convert the entries to JSON, so that they are human readable. OTOH, protobuf seems still more than 3x better, at the cost of the complexities associated with a fixed schema, which would make troubleshooting more difficult without proper helpers. There's also json/v2 that appears to provide better performance than the standard json package; it is not on par with CBOR, but the difference is smaller, and it may be preferable for consistency (it is still way less performant than protobuf though).
>
> ```
> json_zstd         10000             85115 ns/op           43931 B/op        721 allocs/op
> json_v2_zstd      10000             74633 ns/op           44079 B/op        721 allocs/op
> cbor_zstd         10000             66357 ns/op           27560 B/op        817 allocs/op
> protobuf_zstd     10000             21694 ns/op           34965 B/op        722 allocs/op
> ```
>
> Do you have any idea how difficult it would be to remarshal protobuf to JSON in, say, the k8s/list script command? That's something I'd definitely want to preserve, as it is necessary for both script tests and troubleshooting. Ideally, adding that support should also not pull in the entire k8s client dependency, as it can otherwise unnecessarily bloat the binary sizes (although arguably all consumers may already import it). With that, I guess we can decide on whether to commit to protobuf, or stick to one of the schema-less formats.
>
> **Reply (MrFreezeex, author):** Oh nice, thanks for testing json/v2 too 👀
>
> > Do you have any idea how difficult it would be to remarshal protobuf to JSON in, say, the k8s/list script command? [...]
>
> Hmm, it doesn't sound too complex. I imagine that would rather be the kvstore/list (and kvstore/update) command here: https://github.com/cilium/cilium/blob/main/pkg/kvstore/commands.go? Theoretically we could just unmarshal from zstd/protobuf to a Go struct and remarshal back to JSON there for the list command, for instance (and roughly the opposite for the update command), and we could use the -o <json/plain> flag to change the behavior depending on the tests. We would probably need to add a little abstraction to that command to make it aware of which encoding certain key prefixes are supposed to use, but it doesn't seem that bad 🤔.
>
> **Reply (MrFreezeex, author, May 12, 2026):**
>
> > Ideally, adding that support should also not pull in the entire k8s client dependency,
>
> To do the above we would need to at least pull the type definitions, i.e., github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1 and github.com/cilium/cilium/pkg/k8s/slim/k8s/api/discovery/v1 according to my benchmark code imports, which I believe should be fairly self-contained?
We can potentially encode EndpointSlices in a less compact format in the
source clustermesh-apiserver data and let kvstoremesh reflect it in a
more optimized format afterwards.
> **Reviewer comment (on lines +185 to +187):**
>
> > Personally, I think that exporting in JSON while kvstoremesh rewrites in zstd+protobuf could be quite a nice performance/readability (on the source cluster at least) mix!
>
> I'd personally try to keep the KVStoreMesh logic as simple as possible, to not embed too much business logic there, and to avoid divergences depending on whether it is used or not. I also don't see a lot of benefit in only having one component using protobuf, as we'd need to find a way of dealing with it anyway, and at that point we could simply use it everywhere.
>
> That said, I agree that we may optionally move the compression step there, but I'm not sure there's much benefit compared to performing it directly at the source.

Size:

| Format | Raw | zstd |
| -------- | -------- | ----- |
| JSON | 10.59KiB | 435B |
| CBOR | 7.42KiB | 404B |
| Protobuf | 2.88KiB | 283B |

Decode speed:

| Format | Raw | zstd |
| -------- | -------- | ----- |
| JSON | 420µs | 449µs |
| CBOR | 222µs | 252µs |
| Protobuf | 62µs | 74µs |
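
For reference, a minimal sketch of how a zstd-compressed, protobuf-encoded entry could be remarshalled to JSON for human readability (for example in the kvstore/list script command discussed above). It assumes the generated protobuf `Unmarshal` method on the slim EndpointSlice type and the `github.com/klauspost/compress/zstd` package; this is an illustration, not the final implementation:

```go
package example

import (
	"encoding/json"

	"github.com/klauspost/compress/zstd"

	slim_discovery_v1 "github.com/cilium/cilium/pkg/k8s/slim/k8s/api/discovery/v1"
)

// decodeEndpointSliceToJSON decompresses a zstd payload, decodes the protobuf
// representation of a slim EndpointSlice, and re-marshals it to JSON so that
// troubleshooting commands and script tests keep a human-readable view.
func decodeEndpointSliceToJSON(value []byte) ([]byte, error) {
	dec, err := zstd.NewReader(nil)
	if err != nil {
		return nil, err
	}
	defer dec.Close()

	raw, err := dec.DecodeAll(value, nil)
	if err != nil {
		return nil, err
	}

	var eps slim_discovery_v1.EndpointSlice
	if err := eps.Unmarshal(raw); err != nil {
		return nil, err
	}
	return json.Marshal(&eps)
}
```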

### Rollout strategy

As the Cluster Mesh code path is currently maintained by a small team, we are
aiming for the most straightforward rollout strategy, one that minimizes code
changes.

We will introduce a new configuration option `clustermesh-service-v2`, with the
following possible options, which will evolve over minor releases:
- `prefer-legacy` (default in 1.20): Export `EndpointSlice` while still exporting
and consuming `ClusterService`.
- `prefer-endpointslice` (default in 1.21): Export and consume `EndpointSlice`
while still exporting `ClusterService` for backwards compatibility.
- `only-endpointslice`: Only export and consume `EndpointSlice` and stop
exporting `ClusterService`.

> **Reviewer comment (on lines +211 to +212):** I'm personally not yet fully sold on the need for a feature flag, and on how it may look, as depending on the implementation we may be able to get away with a capability to detect whether the target cluster uses one format or the other. That said, I don't think we need to necessarily commit to a specific solution now, as it is mostly an implementation detail.
>
> I'm totally fine with the current proposal, maybe slightly softening the current wording to say that it may potentially change during the implementation phase, if that works for you.
>
> **Reply (MrFreezeex, author, May 12, 2026):** Sure, I can say that this may evolve during the implementation!
>
> FYI there are mainly two complex situations if we were to change that:
>
> - Handling switching at runtime between one format or the other on the agent/operator. This could potentially be skipped if we do this check only at startup, but that may create situations where, for one remote cluster, certain agents use the old format and some use the new format (depending on when each agent (re-)started)...
> - Handling code in the EndpointSliceSync / loadbalancer that would ingest certain remote clusters with the old format and certain remote clusters with the new format at the same time.
>
> As we were discussing offline, we could potentially get a compat layer in kvstoremesh, but this contradicts a bit your position of not introducing business logic in kvstoremesh from one of your other comments. It is for different reasons, and it would only be a temporary situation. On the plus side, we could probably remove the legacy code a bit faster if we had such a compat layer in kvstoremesh. I am not that sure we could reasonably do the conversion from ClusterService to ClusterEndpointSlice though, since it should theoretically require re-chunking the slice, and if we can't do that it might remove the advantage of something like that entirely :/

Cluster Mesh supports one minor version of skew. The goal of this rollout is to
make sure we do not break this guarantee while transitioning to the new format,
while still letting advanced users opt into the new format earlier. The intent
is to remove this option and all the `ClusterService`-related code in Cilium 1.22.
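
A minimal sketch of how the proposed option values could be represented in the configuration code; the constant names are illustrative only, and the exact mechanism may evolve during implementation as noted in the discussion above:

```go
// Possible values of the proposed clustermesh-service-v2 option.
// Names are illustrative and not a committed API.
const (
	// Export EndpointSlice data, but keep exporting and consuming ClusterService.
	ServiceV2PreferLegacy = "prefer-legacy"
	// Export and consume EndpointSlice data; keep exporting ClusterService
	// for backwards compatibility.
	ServiceV2PreferEndpointSlice = "prefer-endpointslice"
	// Only export and consume EndpointSlice data.
	ServiceV2OnlyEndpointSlice = "only-endpointslice"
)
```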

## Impacts / Key Questions

### Key Questions: Chunking multiple EndpointSlice vs a single object

The question of grouping the EndpointSlices was quite debated. The existing
`ClusterService` format groups all backends/EndpointSlices from a Service
in a single object. Ultimately, we favored individual EndpointSlices because:
- KCM EndpointSlice QPS is manageable in our case
- It matches the Kubernetes model and is simple to reason about
- Kubernetes may provide alternative Service APIs in the future, and not relying
explicitly on Services may help future-proof the format
- Handling individual EndpointSlice updates should be very natural and simple
for the loadbalancer and EndpointSliceSync code

### Key Questions: Encoding format

TODO: open question (see the benchmarks and discussion in the EndpointSlice
encoding section above).