CFP-41953: add ClusterMesh Service v2 #77
# CFP-41953: ClusterMesh Service v2

**SIG: SIG-clustermesh**

**Begin Design Discussion:** 2025-10-01

**Cilium Release:** 1.20

**Authors:** Arthur Outhenin-Chalandre <git@mrfreezeex.fr>

**Status:** Implementable
## Summary

This CFP proposes introducing v2 of the clustermesh global service data
format stored in etcd. It transitions from `cilium/state/services/v1/`
to `cilium/state/endpointslices/v1/` and harmonizes how backend data is
ingested for clustermesh services and Kubernetes services.
## Motivation

The current clustermesh global service data is handled with the
[`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52).
This struct is encoded in JSON format and stored in etcd. While this format
has served well initially, it now faces several limitations that prevent
clustermesh from scaling efficiently and supporting new features. These
limitations fall into three main areas: missing backend conditions, suboptimal
performance for service updates, and an inefficient data format and encoding.
### Missing backend conditions

The current format omits all backend conditions: backends that are not ready
and not serving are simply removed. This means that we cannot properly perform
graceful termination as described in the
[KPR documentation](https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#graceful-termination),
most likely resulting in some traffic loss during rolling updates.
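For reference, these are the per-endpoint conditions that the Kubernetes EndpointSlice API exposes and that the v1 clustermesh format drops. The sketch below uses the upstream `k8s.io/api/discovery/v1` types for self-containedness (Cilium would use its slim variants), and the `backendStateFor` helper and its mapping to backend states are only an illustration of the idea, not the exact loadbalancer logic.

```go
// Illustrative sketch: the EndpointSlice conditions that the v1 clustermesh
// format discards. backendStateFor is hypothetical; the real loadbalancer
// state handling in Cilium is more involved.
package conditions

import discovery "k8s.io/api/discovery/v1"

type backendState string

const (
    backendActive      backendState = "active"
    backendTerminating backendState = "terminating"
)

// backendStateFor maps endpoint conditions to a backend state; the second
// return value reports whether the backend should be kept at all.
func backendStateFor(c discovery.EndpointConditions) (backendState, bool) {
    ready := c.Ready != nil && *c.Ready
    serving := c.Serving != nil && *c.Serving
    terminating := c.Terminating != nil && *c.Terminating

    switch {
    case ready:
        return backendActive, true
    case serving && terminating:
        // Graceful termination: keep the backend, but only as a fallback.
        return backendTerminating, true
    default:
        // The v1 clustermesh format drops these backends entirely.
        return "", false
    }
}
```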
### Performance gap with the loadbalancer k8s reflector

There is a large performance gap between the clustermesh ingestion path and
the standard loadbalancer k8s reflector; their update and ingestion behavior
is not currently on par. The table below shows how large this gap is:

| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh w/o JSON / k8s) |
|----------|------------------|------------------------------------|-----------------------|------------------------------------|
| 1 | 44 | 35 | 4 | 9x slower |
| 100 | 629 | 271 | 4 | 68x slower |
| 1 000 | 6 349 | 2 626 | 7 | 375x slower |
| 5 000 | 36 861 | 15 831 | 30 | 528x slower |
| 10 000 | 78 810 | 44 289 | 70 | 633x slower |

These benchmarks are not strictly equivalent, but they give a good idea of the
performance gap, how much JSON decoding contributes to it, and how clustermesh
degrades as the number of backends grows.

> **Member:** Nit. For future reference, I'd suggest adding a brief remark that the […]
### Inefficient data format and network traffic

The `ClusterService` struct/format was designed to fit the loadbalancer
internals when it was introduced in 2018. In 2026, after several iterations
and refactors, the two formats have diverged. One recent example:
`ClusterService` encoded similar ports with different names as multiple
entries, while the loadbalancer backends encoded them as a single entry with
multiple port names attached. This divergence led to a bug in which one port
shadowed the other, which was recently fixed and will be released in
1.19.4 / 1.18.10.

In terms of wire size, the backend map duplicates port information for each IP,
so the total size of the object tends to grow almost in proportion to the
number of ports and IPs. In addition, JSON encoding adds extra overhead for
field names and number formatting compared to a binary representation.
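As an illustration of both issues, here is a rough, hypothetical sketch of the two shapes: the `clusterServiceLike` layout mimics how the v1 backend map repeats port information per IP (and can hold the same port twice under different names), while `loadbalancerBackendLike` mimics the loadbalancer's single entry with multiple port names attached. All type and field names are invented for the example and do not match the real Cilium structs.

```go
// Hypothetical shapes, not the real Cilium types.
package shapes

type portSpec struct {
    Protocol string
    Port     uint16
}

// v1 ClusterService-like layout: every backend IP carries its own copy of the
// port map, and "similar" ports can appear twice under different names.
type clusterServiceLike struct {
    Backends map[string]map[string]portSpec // IP -> port name -> port
}

// Loadbalancer-like layout: one entry per (protocol, port), with all the
// names that refer to it attached, and the IPs listed separately.
type loadbalancerBackendLike struct {
    IPs   []string
    Ports map[portSpec][]string // (protocol, port) -> port names
}
```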
All of this data must be replicated to every node in the mesh. A mesh often
has many more nodes than a single cluster, which results in a high volume of
control plane network traffic.

Additionally, etcd imposes a hard limit of 1.5 MiB per object. Without a
breaking change to a more efficient format, adding the missing fields to the
current `ClusterService` struct would restrict scalability to fewer than
10,000 backends per service per cluster. While this is a high limit, having
headroom beyond this point is useful for future growth.

Even below the limit, keeping objects small is important to reduce network
traffic when backends change. This situation is similar to the Kubernetes
community's move from `Endpoints` to `EndpointSlice`, but the problem is even
stronger in clustermesh because a mesh can contain many more nodes than a
single Kubernetes cluster.

For example, the current `ClusterService` format for a service with 5,000
backends and 2 ports is 786.89 KiB. In an 11-cluster mesh where each cluster
has 1,000 nodes, any update from any endpoint in this single service would
result in about 7.5 GiB of data propagated globally across the mesh (roughly
786.89 KiB replicated to each of ~10,000 remote nodes). From the perspective
of a single cluster control plane, this corresponds to about 750 MiB per
update. Assuming 10 updates per second for convenience, this would result in
7.5 GiB/s of control plane traffic within that single cluster!
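A quick back-of-the-envelope helper reproducing the arithmetic above; the object size, cluster count, node count, and update rate are the figures from the example, everything else is illustrative.

```go
package main

import "fmt"

func main() {
    const (
        objectKiB       = 786.89 // measured size of one ClusterService object
        clusters        = 11
        nodesPerCluster = 1000
        updatesPerSec   = 10
    )

    // Every update is replicated to every node of the mesh (excluding,
    // roughly, the cluster that originated it).
    remoteNodes := (clusters - 1) * nodesPerCluster
    perUpdateGiB := objectKiB * float64(remoteNodes) / (1024 * 1024)
    perClusterMiB := objectKiB * nodesPerCluster / 1024

    fmt.Printf("global fan-out per update: %.1f GiB\n", perUpdateGiB)       // ~7.5 GiB
    fmt.Printf("per-cluster fan-out per update: %.0f MiB\n", perClusterMiB) // ~770 MiB
    fmt.Printf("per-cluster traffic at %d updates/s: %.1f GiB/s\n",
        updatesPerSec, perClusterMiB*updatesPerSec/1024)
}
```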
## Goals

* Reduce the network bandwidth needed for control plane operations on large services
* Improve clustermesh service ingestion performance in the agent, in particular
  in service churn scenarios
* Allow scaling at the clustermesh level to a larger number of backends per
  service per cluster
* Add backend conditions to clustermesh services to allow correct backend
  state handling in the loadbalancer
* Add the `EndpointSlice` name to the clustermesh service data to simplify
  the EndpointSliceSync logic
## Non-Goals

* Changes not specific to clustermesh global services (for example,
  MCS-API handling)
* Large changes to non-clustermesh loadbalancer logic
## Proposal

### Overview

This CFP proposes transitioning to storing `EndpointSlice` objects directly,
as individual objects.

This will allow more code reuse between the Kubernetes reflector in the
loadbalancer packages and the clustermesh package.

The new format will include every field from the `EndpointSlice` struct,
including endpoint conditions. The clustermesh code will preserve these
conditions and let the existing loadbalancer code apply the same backend
state handling as it does for local services, instead of only receiving
backends in an active state.

This will also significantly simplify the EndpointSliceSync codebase, as we
will be able to simply mirror EndpointSlices from remote clusters without the
complex sharding logic. The inclusion of the conditions should also benefit
consumers watching those EndpointSlices, for instance third-party GW-API
implementations.
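As a rough idea of what "simply mirroring" could look like, the sketch below copies a remote slice into the local cluster under a cluster-prefixed name. The naming scheme, the `example.io/source-cluster` label, and the helper itself are hypothetical, purely to contrast with the current sharding logic; the real EndpointSliceSync implementation will differ.

```go
// Hypothetical mirroring sketch; names and labels are illustrative only.
package mirror

import (
    discovery "k8s.io/api/discovery/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mirrorRemoteSlice builds the local copy of an EndpointSlice received from a
// remote cluster, keeping endpoints and conditions as-is.
func mirrorRemoteSlice(remoteCluster string, in *discovery.EndpointSlice) *discovery.EndpointSlice {
    labels := map[string]string{
        "example.io/source-cluster": remoteCluster, // hypothetical label
    }
    for k, v := range in.Labels {
        labels[k] = v // keep e.g. the service association label
    }

    out := in.DeepCopy()
    out.ObjectMeta = metav1.ObjectMeta{
        Name:      remoteCluster + "-" + in.Name, // hypothetical naming scheme
        Namespace: in.Namespace,
        Labels:    labels,
    }
    return out
}
```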
### Using EndpointSlice struct directly

The Kubernetes `EndpointSlice` API must remain backward compatible across
Kubernetes versions. This aligns well with Cilium clustermesh's upgrade
requirement to support at least two consecutive minor versions.

We will embed the actual `EndpointSlice` struct to make sure we can extend it
in the future if needed. The initial version should look like this:
```go
type ClusterEndpointSlice struct {
    Cluster string

    EndpointSlice slim_discovery_v1.EndpointSlice
}
```

> **Member** (on the `Cluster` field): I would add the […]

> **Member** (on the `EndpointSlice` field): I think that we have a few
> alternatives: […] I'm personally somewhat attracted by option 2, but curious
> to hear your thoughts. It may also not really matter that much if we go for
> protobuf, given that […]

> **Author:** I agree with you, option 2 looks a bit more appealing at first
> glance! If we get the top field it should be relatively easy to get the
> loadbalancer & EndpointSliceSync working while making sure that we have more
> control over / less confusion as to what is really serialized! I agree with
> the disadvantages you stated in option 3, it also includes […]
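For concreteness, here is a minimal sketch of how such an object might be keyed and stored under the new prefix mentioned in the Summary. The key layout, the placeholder `EndpointSlice` type, the helper names, and the use of JSON are all assumptions for illustration; the actual key shape and encoding format are still open questions in this CFP.

```go
// Illustrative only: key layout and encoding are open questions in this CFP.
package store

import (
    "encoding/json"
    "path"
)

const endpointSlicePrefix = "cilium/state/endpointslices/v1" // prefix from the Summary

// EndpointSlice is a placeholder so the sketch is self-contained; the real
// type would be slim_discovery_v1.EndpointSlice.
type EndpointSlice struct {
    Namespace string
    Name      string
}

type ClusterEndpointSlice struct {
    Cluster       string
    EndpointSlice EndpointSlice
}

// key builds a hypothetical per-slice etcd key, one object per EndpointSlice.
func key(s ClusterEndpointSlice) string {
    return path.Join(endpointSlicePrefix, s.Cluster, s.EndpointSlice.Namespace, s.EndpointSlice.Name)
}

// marshal encodes the object; JSON is used here only as a stand-in, the
// actual encoding format is discussed later in this document.
func marshal(s ClusterEndpointSlice) ([]byte, error) {
    return json.Marshal(s)
}
```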
KCM can update all EndpointSlices up to 20 times per second, and up to 30 in
bursts. Kubernetes scalability tests also cover Services and, in those tests,
despite significantly boosting KCM QPS to 100 or 500, EndpointSlice updates
remain around 10 or fewer per second. At very large scale (5k nodes), they can
go up to ~45 updates per second.

This QPS is entirely manageable for clustermesh, despite the probably higher
QPS resulting from managing individual EndpointSlices and adding endpoint
conditions. We currently only have 20 QPS for clustermesh-apiserver, but we
could likely boost it into the 50-100 range. Note that kvstoremesh has 100 QPS
and would still limit the overall QPS for events from all remote clusters; if
there is more QPS across all clusters and object types, each will compete for
the QPS budget.
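To make the budget discussion concrete, here is a small sketch of how a shared QPS/burst budget behaves when several remote clusters compete for it. It uses `golang.org/x/time/rate` purely for illustration; the figures are loosely inspired by the numbers above and the sketch says nothing about how rate limiting is actually implemented in Cilium.

```go
package main

import (
    "context"
    "fmt"

    "golang.org/x/time/rate"
)

func main() {
    // Hypothetical shared budget: 100 events/s with a burst of 100, shared by
    // all remote clusters.
    limiter := rate.NewLimiter(rate.Limit(100), 100)

    clusters := []string{"cluster-1", "cluster-2", "cluster-3"}
    for i := 0; i < 300; i++ {
        src := clusters[i%len(clusters)]
        // Every cluster's events draw from the same budget, so a chatty
        // cluster reduces the effective QPS available to the others.
        if err := limiter.Wait(context.Background()); err != nil {
            return
        }
        _ = src // process the event from src here
    }
    fmt.Println("processed 300 events under a shared 100 QPS budget")
}
```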
### Unifying datapath ingestion pipelines

The clustermesh and loadbalancer k8s reflector currently diverge in both their
data structures and ingestion logic. This divergence has created the
performance gap described in the Motivation section and makes it harder to
maintain performance parity between the two code paths.

> **Member:** Nit. performance (and feature) parity […]

As we will be using `EndpointSlice` objects directly, we should be able to
unify both code paths and allow the Kubernetes reflector to receive events
from remote clusters.
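One hypothetical shape for such a unification: the reflector consumes a single stream of slice events regardless of origin, and remote-cluster watchers feed that same stream. The interface and types below are invented to illustrate the idea and are not a description of the current Cilium loadbalancer internals.

```go
// Hypothetical unified ingestion interface; not actual Cilium code.
package ingest

import discovery "k8s.io/api/discovery/v1"

// SliceEvent is a single EndpointSlice upsert or delete, with the cluster of
// origin attached ("" for the local cluster).
type SliceEvent struct {
    Cluster string
    Deleted bool
    Slice   *discovery.EndpointSlice
}

// Reflector is the single ingestion path: the local Kubernetes informer and
// every remote-cluster watcher would all call ProcessSliceEvent, so the same
// translation into loadbalancer state runs for both code paths.
type Reflector interface {
    ProcessSliceEvent(ev SliceEvent) error
}
```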
Given the 10-600x ingestion gap shown in the Motivation benchmarks, the added
churn from handling individual EndpointSlices and conditions should remain
manageable, while still making this path significantly more performant than
the current clustermesh v1 path.
### EndpointSlice encoding

TODO: open question. See the matrices below, which present the size and
decoding speed of a full EndpointSlice (100 endpoints). Our current starting
point in Cluster Mesh would be roughly JSON, while Kubernetes uses protobuf.

> **Member:** Thanks for the benchmarks! I think that adopting […] I'm
> personally a bit more on the fence concerning the actual encoding format. On
> the one hand, CBOR looks a reasonable middle ground to me, given that it is
> schema-less, and we could easily adapt the […] Do you have any idea on how
> difficult it would be to remarshal protobuf to JSON in, say, the […]

> **Author:** Oh nice, thanks for testing json/v2 too 👀 Hmm, it doesn't sound
> too complex, I imagine that would be more the […]

> **Author:** And to do the above we would need to at least pull the type
> definition […] we would need to pull […]

We can potentially encode EndpointSlices in a less compact format in the
source clustermesh-apiserver data and let kvstoremesh reflect it in a more
optimized format afterwards.

> **Member:** I'd personally try to keep the KVStoreMesh logic as simple as
> possible, to not embed too much business logic there, and to avoid
> divergences depending on whether it is used or not. I also don't see a lot
> of benefit in only having one component using protobuf, as we'd need to find
> a way of dealing with it anyway, and at that point we could simply use it
> everywhere. That said, I agree that we may optionally move the compression
> step there, but I'm not that sure there's much benefit compared to
> performing it directly at the source.
Size:

| Format   | Raw      | zstd |
| -------- | -------- | ---- |
| JSON     | 10.59KiB | 435B |
| CBOR     | 7.42KiB  | 404B |
| Protobuf | 2.88KiB  | 283B |

Decode speed:

| Format   | Raw   | zstd  |
| -------- | ----- | ----- |
| JSON     | 420µs | 449µs |
| CBOR     | 222µs | 252µs |
| Protobuf | 62µs  | 74µs  |
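A rough sketch of how such size measurements can be reproduced for a given slice, using `encoding/json`, `github.com/fxamacker/cbor/v2`, and `github.com/klauspost/compress/zstd`. The libraries are only examples (the CFP does not prescribe them), and the protobuf case is omitted because it relies on the marshalers generated for the Kubernetes types.

```go
package main

import (
    "encoding/json"
    "fmt"

    "github.com/fxamacker/cbor/v2"
    "github.com/klauspost/compress/zstd"
    discovery "k8s.io/api/discovery/v1"
)

// measure reports the raw and zstd-compressed size of one encoding.
func measure(name string, raw []byte, enc *zstd.Encoder) {
    compressed := enc.EncodeAll(raw, nil)
    fmt.Printf("%-8s raw=%dB zstd=%dB\n", name, len(raw), len(compressed))
}

func main() {
    // In a real measurement this would be a slice with ~100 endpoints.
    slice := &discovery.EndpointSlice{}

    enc, err := zstd.NewWriter(nil)
    if err != nil {
        panic(err)
    }
    defer enc.Close()

    if j, err := json.Marshal(slice); err == nil {
        measure("JSON", j, enc)
    }
    if c, err := cbor.Marshal(slice); err == nil {
        measure("CBOR", c, enc)
    }
    // Protobuf would use the marshalers generated for the Kubernetes types.
}
```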
### Rollout strategy

As the Cluster Mesh code path is currently maintained by a small team, we are
aiming for the most straightforward rollout strategy that minimizes the code
changes.

We will introduce a new configuration option, `clustermesh-service-v2`, with
the following possible values, which will evolve over minor releases (this may
change during the implementation phase):

> **Member:** I'm personally not yet fully sold on the need for a feature
> flag, and how it may look, as depending on the implementation we may be able
> to get away with a capability to detect whether the target cluster uses one
> format or the other. That said, I don't think we need to necessarily commit
> to a specific solution now, as this is mostly an implementation detail. I'm
> totally fine with the current proposal, maybe slightly softening the current
> wording to say that it may potentially change during the implementation
> phase, if that works for you.

> **Author:** Sure, I can say that this may evolve during the implementation!
> FYI there are mainly two complex situations if we were to change that: […]
> As we were discussing offline, we could potentially get some compat layer in
> kvstoremesh, but this contradicts a bit your position of not introducing
> business logic in kvstoremesh from one of your other comments. It's for
> different reasons and it would only be a temporary situation. On the plus
> side, we could probably remove the legacy code a bit faster if we had such a
> compat layer in kvstoremesh. I am not that sure we could reasonably do the
> conversion from […]
- `prefer-legacy` (default in 1.20): Export `EndpointSlice` while still
  exporting and consuming `ClusterService`.
- `prefer-endpointslice` (default in 1.21): Export and consume `EndpointSlice`
  while still exporting `ClusterService` for backwards compatibility.
- `only-endpointslice`: Only export and consume `EndpointSlice` and stop
  exporting `ClusterService`.
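A minimal sketch of how an agent-side component might interpret these values. The option values are the ones listed above, but the type, fields, and helper below are hypothetical and may change during implementation.

```go
// Hypothetical handling of the clustermesh-service-v2 option; illustrative only.
package config

type ServiceV2Mode string

const (
    PreferLegacy        ServiceV2Mode = "prefer-legacy"
    PreferEndpointSlice ServiceV2Mode = "prefer-endpointslice"
    OnlyEndpointSlice   ServiceV2Mode = "only-endpointslice"
)

// behavior captures what each mode implies for export and consumption.
type behavior struct {
    ExportClusterService bool
    ExportEndpointSlice  bool
    ConsumeEndpointSlice bool // false means consume the legacy ClusterService
}

func behaviorFor(m ServiceV2Mode) behavior {
    switch m {
    case PreferEndpointSlice:
        return behavior{ExportClusterService: true, ExportEndpointSlice: true, ConsumeEndpointSlice: true}
    case OnlyEndpointSlice:
        return behavior{ExportEndpointSlice: true, ConsumeEndpointSlice: true}
    default: // prefer-legacy
        return behavior{ExportClusterService: true, ExportEndpointSlice: true}
    }
}
```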
Cluster Mesh supports one minor version skew. The goal of this rollout is to
ensure we do not break this guarantee while transitioning to the new format,
while advanced users can potentially jump to the new format earlier. The goal
is to remove this option and all the `ClusterService`-related code in Cilium
1.22.
## Impacts / Key Questions

### Key Questions: Chunking multiple EndpointSlices vs a single object

The question of grouping the EndpointSlices was quite debated. The existing
`ClusterService` format groups all backends/EndpointSlices of a Service in a
single object. Ultimately, we favored individual EndpointSlices because:
- KCM EndpointSlice QPS is manageable in our case
- It matches the Kubernetes model and is simple to reason about
- Kubernetes may provide alternative Service APIs in the future, and not
  relying explicitly on Services may help future-proof the format
- Handling individual EndpointSlice updates should be very natural and simple
  for the loadbalancer and EndpointSliceSync code
### Key Questions: Encoding format

TODO: open question

> **Member:** It is definitely a non-goal for this CFP, but I wonder if it
> could make sense to mention somewhere that we might want to add service
> entries back in the future if they are helpful for certain features (maybe
> some autocreation logic?), but that would be a representation of the service
> itself, without the backends (similarly to the MCS-API representation).

> **Author:** Hmm, yeah, sure, I don't particularly mind adding some mention
> of that! Theoretically we don't have anything in Cilium OSS that relies on
> Service-level data today, and MCS-API would probably (?) cover features that
> need that data in the future.