# CFP-41953: ClusterMesh Service v2

**SIG: SIG-clustermesh**

**Begin Design Discussion:** 2025-10-01

**Cilium Release:** 1.20

**Authors:** Arthur Outhenin-Chalandre <git@mrfreezeex.fr>

**Status:** Implementable

## Summary

This CFP proposes introducing v2 of the clustermesh global service data
format stored in etcd. It transitions from `cilium/state/services/v1/`
to `cilium/state/endpointslices/v1/` and harmonizes backend data insertion
techniques between clustermesh services and Kubernetes services.

> **Reviewer comment (on lines +16 to +17):** It is definitely a non-goal for this CFP, but I wonder if it could make sense to mention somewhere that we might want to add service entries back in the future if they are helpful for certain features (maybe some autocreation logic?), but that would be a representation of the service itself, without the backends (similarly to the MCS-API representation).
>
> **Reply (MrFreezeex, author, May 12, 2026):** Hmm yeah sure, I don't particularly mind adding some mention of that! Theoretically we don't have anything in Cilium OSS that relies on Service-level data today, and MCS-API would probably (?) cover features that need that data in the future.

## Motivation

The current clustermesh global service data is handled with the
[`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52).
This struct is encoded in JSON format and stored in etcd. While this format
has served well initially, it now faces several limitations that prevent
clustermesh from scaling efficiently and supporting new features. These
limitations fall into three main areas: missing backend conditions, suboptimal
performance for service updates, and inefficient data format and encoding.

### Missing backend conditions

The current format omits all backend conditions. It directly removes
backends that are not ready and not serving. This means that we cannot
properly perform graceful termination as described in the
[KPR documentation](https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#graceful-termination),
most likely resulting in some traffic loss during rolling updates.
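
For context, the conditions in question are the standard EndpointSlice endpoint conditions. The sketch below shows a draining backend using the upstream `k8s.io/api/discovery/v1` types (the slim Cilium equivalents carry the same fields); it is exactly this not-ready-but-still-serving state that the current format cannot represent:

```go
package example

import (
	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/utils/ptr"
)

// drainingEndpoint illustrates a backend that is draining: no longer ready for
// new connections, but still serving existing ones. The current ClusterService
// format drops such backends entirely; the v2 format would preserve these
// conditions so the loadbalancer can apply graceful termination.
var drainingEndpoint = discoveryv1.Endpoint{
	Addresses: []string{"10.0.0.42"},
	Conditions: discoveryv1.EndpointConditions{
		Ready:       ptr.To(false),
		Serving:     ptr.To(true),
		Terminating: ptr.To(true),
	},
}
```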

### Performance gap with the loadbalancer k8s reflector

There is a large performance gap between the clustermesh and the
standard loadbalancer k8s reflector. Their update and ingestion behavior
is not currently on par. The table below shows how large this gap is:

| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh w/o JSON/k8s) |
|----------|------------------|------------------------------------|-----------------------|----------------------------------|
| 1 | 44 | 35 | 4 | 9x slower |
| 100 | 629 | 271 | 4 | 68x slower |
| 1 000 | 6 349 | 2 626 | 7 | 375x slower |
| 5 000 | 36 861 | 15 831 | 30 | 528x slower |
| 10 000 | 78 810 | 44 289 | 70 | 633x slower |

These benchmarks are not strictly equivalent, but they give a good idea
of the performance gap, how much JSON decoding contributes to it, and how
clustermesh degrades as the number of backends grows.
> **Reviewer comment:** Nit. For future reference, I'd suggest adding a brief remark that the LB k8s column does not include the unmarshaling phase.


### Inefficient data format and network traffic

The `ClusterService` struct/format was designed to fit the loadbalancer internals
when it was introduced in 2018. In 2026, after several iterations and refactors,
the two representations have diverged. One recent example: `ClusterService`
encoded similar ports with different names as multiple entries, while the
loadbalancer backends encoded them as one entry with multiple port names
attached. This divergence led to a bug with one port shadowing the other,
which was recently fixed and will be released in 1.19.4 / 1.18.10.

In terms of wire size, the backend map duplicates port information for each IP,
so the total size of the object tends to grow almost in proportion to the number
of ports and IPs. In addition, JSON encoding adds extra overhead
for field names and number formatting compared to a binary representation.

All of this data must be replicated to every node in the mesh. A mesh often
has many more nodes than a single cluster, which results in a high volume
of control plane network traffic.

Additionally, etcd imposes a hard limit of 1.5 MiB per object. Without a
breaking change to a more efficient format, adding the missing fields to
the current `ClusterService` struct would restrict scalability to fewer
than 10,000 backends per service per cluster. While this is a high limit,
having headroom beyond this point is useful for future growth.

Even below the limit, keeping objects small is important to reduce network
traffic when backends change. This situation is similar to the Kubernetes
community's move from `Endpoints` to `EndpointSlice`, but the problem is
even more pronounced in clustermesh because a mesh can contain many more
nodes than a single Kubernetes cluster.

For example, the current `ClusterService` object for a service with 5 000 backends
and 2 ports weighs 786.89 KiB. In an 11-cluster mesh where each cluster has 1 000
nodes, any update to any endpoint of this single service would result in about
7.5 GiB of data propagated globally across the mesh. From the perspective of a
single cluster control plane, this corresponds to about 750 MiB per update.
Assuming 10 updates per second for convenience, this would result in 7.5 GiB/s
of control plane traffic within that single cluster!
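
Spelled out, the back-of-the-envelope arithmetic behind those figures (assuming roughly 10 receiving clusters of 1 000 nodes each) is:

```math
\begin{aligned}
786.89\ \text{KiB} \times 1\,000\ \text{nodes} &\approx 768\ \text{MiB per cluster per update} \\
768\ \text{MiB} \times 10\ \text{receiving clusters} &\approx 7.5\ \text{GiB mesh-wide per update} \\
768\ \text{MiB} \times 10\ \text{updates/s} &\approx 7.5\ \text{GiB/s within one cluster}
\end{aligned}
```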

## Goals

* Reduce network bandwidth needed for control plane operations on large services
* Improve clustermesh service ingestion performance in the agent, in particular
in service churn scenarios
* Allow clustermesh to scale to a larger number of backends per service per
cluster
* Add backend conditions to clustermesh services to allow correct backend
state handling in the loadbalancer
* Add `EndpointSlice` name to the clustermesh service data to simplify
the EndpointSliceSync logic

## Non-Goals

* Changes not specific to clustermesh global services (for example
MCS-API handling)
* Large changes to non-clustermesh loadbalancer logic

## Proposal

### Overview

This CFP proposes storing `EndpointSlice` objects directly and individually in
the kvstore, rather than grouping all backends of a service into a single object.

This will allow more code reuse between the Kubernetes reflector in the
loadbalancer packages and the clustermesh package.

The new format will include every field from the EndpointSlice struct, including
endpoint conditions. The clustermesh code will preserve these conditions and let
the existing loadbalancer code apply the same backend state handling as it does
for local services, instead of only receiving backends in an active state.

This will also significantly simplify the EndpointSliceSync codebase as we will
be able to simply mirror EndpointSlices from remote clusters without the
complex sharding logic. The inclusion of the conditions should also benefit
consumers watching those EndpointSlices, for instance third-party GW-API
implementations.

### Using EndpointSlice struct directly

The Kubernetes `EndpointSlice` API must remain backward compatible across
Kubernetes versions. This aligns well with Cilium clustermesh's upgrade
requirement to support at least two consecutive minor versions.

We will embed the actual `EndpointSlice` struct to make sure we can extend it in
the future if needed. The initial version should look like this:

```go
type ClusterEndpointSlice struct {
	Cluster       string
	EndpointSlice slim_discovery_v1.EndpointSlice
}
```

> **Reviewer comment (on `Cluster string`):** I would add the ClusterID as well, for consistency with the current ClusterService structure and e.g., the nodes representation.

> **Reviewer comment (on `EndpointSlice slim_discovery_v1.EndpointSlice`):** I think that we have a few alternatives:
>
> 1. Embed the full slim_discovery_v1.EndpointSlice object, as proposed here. This has the main advantage of fully matching the upstream type, and the disadvantage of possibly including fields that likely don't make sense in this context (e.g., metav1.TypeMeta and a bunch of metav1.ObjectMeta fields, such as owner references) and potentially causing confusion and increasing the overhead, if not stripped correctly.
> 2. Embed all root-level fields (AddressType/Endpoints/Ports), and the subset of relevant metav1.ObjectMeta fields (maybe under a custom Meta type, which also includes the ClusterName/ID). This would give us some more control over which fields are actually meaningful, while still preserving the full upstream match for the actual fields (and the trivial conversion between types).
> 3. Embed (a variant of) the k8s.Endpoints type, which is currently used by the LB subsystem. This would give us even more control, as the struct is fully custom, but it is prone to divergences over time, and I wouldn't exclude that this intermediate abstraction will be removed anyway at a certain point, as it mostly existed as a common middle ground between Endpoints and EndpointSlices. It also brings extra overhead due to conversions.
>
> I'm personally somewhat attracted by option 2, but curious to hear your thoughts. It may also not really matter that much if we go for protobuf, given that TypeMeta is not present anyway in that definition.
>
> **Reply (MrFreezeex, author, May 12, 2026):** I agree with you, option 2 looks a bit more appealing at first glance! If we keep the top-level fields it should be relatively easy to get the loadbalancer & EndpointSliceSync working while making sure that we have more control over / less confusion about what is really serialized!
>
> I agree with the disadvantages you stated for option 3; it also includes types.UnserializableObjects right now, which is more of a small detail but still something that we would need to change if we went with option 3...
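
For illustration only, a minimal sketch of what option 2 from the discussion above could look like. The type and field names (`ClusterEndpointSliceMeta`, `ClusterID`, the label subset) are hypothetical and only meant to show the shape, not a committed design; `slim_discovery_v1` refers to `github.com/cilium/cilium/pkg/k8s/slim/k8s/api/discovery/v1`:

```go
// Sketch of option 2: embed only the root-level EndpointSlice fields plus a
// small custom metadata type. All names here are illustrative only.
type ClusterEndpointSliceMeta struct {
	// Name and Namespace of the EndpointSlice in the source cluster.
	Name      string
	Namespace string
	// Labels carries the subset of ObjectMeta labels relevant for service
	// association (e.g. kubernetes.io/service-name).
	Labels map[string]string
	// Source cluster identity, mirroring the existing ClusterService fields.
	Cluster   string
	ClusterID uint32
}

type ClusterEndpointSlice struct {
	Meta        ClusterEndpointSliceMeta
	AddressType slim_discovery_v1.AddressType
	Endpoints   []slim_discovery_v1.Endpoint
	Ports       []slim_discovery_v1.EndpointPort
}
```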

The kube-controller-manager (KCM) can update EndpointSlices at up to 20 times per
second, with bursts of up to 30. Kubernetes scalability tests also cover Services,
and in those tests, despite significantly boosting the KCM QPS to 100 or 500,
EndpointSlice updates remain around 10 or fewer per second. At very large scale
(5k nodes), they can go up to ~45 updates per second.

This QPS is entirely manageable for clustermesh, despite the likely higher update
rate that comes from managing individual EndpointSlices and adding endpoint
conditions. We currently allow only 20 QPS for the clustermesh-apiserver, but we
could likely raise it into the 50-100 range. Note that kvstoremesh has 100 QPS
and would still cap the overall QPS for events from all remote clusters; if more
updates arrive across all clusters and object types, they compete for that QPS
budget.

### Unifying datapath ingestion pipelines

The clustermesh and loadbalancer k8s reflector currently diverge in both
their data structures and ingestion logic. This divergence has created the
performance gap described in the Motivation section and makes it harder to
maintain performance parity between the two code paths.
> **Reviewer comment:** Nit. performance (and feature) parity [...]

As we will be using `EndpointSlice` objects directly, we should be able to unify
both code paths and allow the Kubernetes reflector to receive events from remote
clusters.

Given the 10-600x ingestion gap shown in the Motivation benchmarks, the added
churn from handling individual EndpointSlices and conditions should remain
manageable while still making this path significantly more performant than the
current clustermesh v1 path.
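
As an illustration only (not the actual implementation), the unified path could boil down to remote EndpointSlice events being handed to the same ingestion entry point used for local objects; `ingestEndpointSlice` below is a hypothetical stand-in for whatever shared function the loadbalancer reflector ends up exposing:

```go
// Hypothetical sketch: both local and remote EndpointSlices flow through one
// shared ingestion function, so performance characteristics stay identical.
func ingestEndpointSlice(sourceCluster string, eps *slim_discovery_v1.EndpointSlice) {
	// Shared conversion to loadbalancer backends, preserving endpoint
	// conditions (ready/serving/terminating).
}

// onRemoteEndpointSliceEvent would be invoked for every create/update/delete
// observed in a remote cluster's kvstore.
func onRemoteEndpointSliceEvent(ev ClusterEndpointSlice) {
	// Remote objects use the exact same code path as local ones; only the
	// owning cluster differs.
	ingestEndpointSlice(ev.Cluster, &ev.EndpointSlice)
}
```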

### EndpointSlice encoding

TODO: open question. See the tables below, which present the size and decoding
speed of a full EndpointSlice (100 endpoints). Our current starting point in
Cluster Mesh is roughly the JSON row, while Kubernetes itself uses protobuf.
> **Reviewer comment (on lines +181 to +183):** Thanks for the benchmarks!
>
> I think that adopting zstd for compressing the entries is fairly straightforward, the difference in terms of size is huge (10x or more), and the overhead during decoding seems fairly negligible (I assume that it may be a bit higher on encoding, but it doesn't really matter much there).
>
> I'm personally a bit more on the fence concerning the actual encoding format. On the one hand, CBOR looks like a reasonable middle ground to me, given that it is schema-less, and we could easily adapt the kvstore/list script command to convert the entries to JSON, so that they are human readable. OTOH, protobuf seems still more than 3x better, at the cost of the complexities associated with a fixed schema, which would make troubleshooting more difficult without proper helpers. There's also json/v2 that appears to provide better performance than the standard json package; it is not on par with CBOR, but the difference is smaller, and it may be preferable for consistency (it is still way less performant than protobuf though).
>
> ```
> json_zstd         10000             85115 ns/op           43931 B/op        721 allocs/op
> json_v2_zstd      10000             74633 ns/op           44079 B/op        721 allocs/op
> cbor_zstd         10000             66357 ns/op           27560 B/op        817 allocs/op
> protobuf_zstd     10000             21694 ns/op           34965 B/op        722 allocs/op
> ```
>
> Do you have any idea how difficult it would be to remarshal protobuf to JSON in, say, the k8s/list script command? That's something I'd definitely want to preserve, as it is necessary for both script tests and troubleshooting. Ideally, adding that support should also not pull in the entire k8s client dependency, as it can otherwise unnecessarily bloat the binary sizes (although arguably all consumers may already import it). With that, I guess we can decide on whether to commit to protobuf, or stick to one of the schema-less formats.
>
> **Reply (MrFreezeex, author):** Oh nice, thanks for testing json/v2 too 👀
>
> > Do you have any idea how difficult it would be to remarshal protobuf to JSON in, say, the k8s/list script command? [...]
>
> Hmm, it doesn't sound too complex. I imagine that would rather be the kvstore/list (and kvstore/update) command here: https://github.com/cilium/cilium/blob/main/pkg/kvstore/commands.go? Theoretically we could just unmarshal from zstd/protobuf to a Go struct and remarshal back to JSON there for the list command, for instance (and roughly the opposite for the update command), and we could use the -o <json/plain> flag to change the behavior depending on the tests. We would probably need to add a little abstraction to that command to make it aware of which encoding certain key prefixes are supposed to use, but it doesn't seem that bad 🤔.
>
> **Reply (MrFreezeex, author, May 12, 2026):**
>
> > Ideally, adding that support should also not pull in the entire k8s client dependency,
>
> To do the above we would need to at least pull the type definitions, i.e., github.com/cilium/cilium/pkg/k8s/slim/k8s/api/core/v1 and github.com/cilium/cilium/pkg/k8s/slim/k8s/api/discovery/v1 according to my benchmark code imports, which I believe should be fairly self-contained?
We can potentially encode EndpointSlices in a less compact format in the
source clustermesh-apiserver data and let kvstoremesh reflect it in a
more optimized format afterwards.
> **Reviewer comment (on lines +185 to +187):**
>
> > Personally, I think that exporting in JSON while kvstoremesh rewrites in zstd+protobuf could be quite a nice performance/readability (on the source cluster at least) mix!
>
> I'd personally try to keep the KVStoreMesh logic as simple as possible, to not embed too much business logic there, and to avoid divergences depending on whether it is used or not. I also don't see a lot of benefit in only having one component using protobuf, as we'd need to find a way of dealing with it anyway, and at that point we could simply use it everywhere.
>
> That said, I agree that we may optionally move the compression step there, but I'm not sure there's much benefit compared to performing it directly at the source.

Size:

| Format | Raw | zstd |
| -------- | -------- | ----- |
| JSON | 10.59KiB | 435B |
| CBOR | 7.42KiB | 404B |
| Protobuf | 2.88KiB | 283B |

Decode speed:

| Format | Raw | zstd |
| -------- | -------- | ----- |
| JSON | 420µs | 449µs |
| CBOR | 222µs | 252µs |
| Protobuf | 62µs | 74µs |
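
For reference, a minimal sketch of how a zstd-compressed, protobuf-encoded entry could be remarshalled to JSON for human readability (for example in the kvstore/list script command discussed above). It assumes the generated protobuf `Unmarshal` method on the slim EndpointSlice type and the `github.com/klauspost/compress/zstd` package; this is an illustration, not the final implementation:

```go
package example

import (
	"encoding/json"

	"github.com/klauspost/compress/zstd"

	slim_discovery_v1 "github.com/cilium/cilium/pkg/k8s/slim/k8s/api/discovery/v1"
)

// decodeEndpointSliceToJSON decompresses a zstd payload, decodes the protobuf
// representation of a slim EndpointSlice, and re-marshals it to JSON so that
// troubleshooting commands and script tests keep a human-readable view.
func decodeEndpointSliceToJSON(value []byte) ([]byte, error) {
	dec, err := zstd.NewReader(nil)
	if err != nil {
		return nil, err
	}
	defer dec.Close()

	raw, err := dec.DecodeAll(value, nil)
	if err != nil {
		return nil, err
	}

	var eps slim_discovery_v1.EndpointSlice
	if err := eps.Unmarshal(raw); err != nil {
		return nil, err
	}
	return json.Marshal(&eps)
}
```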

### Rollout strategy

As the Cluster Mesh code path is currently maintained by a small team, we are
aiming for the most straightforward rollout strategy, one that minimizes code
changes.

We will introduce a new configuration option `clustermesh-service-v2`, with the
following possible options, which will evolve over minor releases:
- `prefer-legacy` (default in 1.20): Export `EndpointSlice` while still exporting
and consuming `ClusterService`.
- `prefer-endpointslice` (default in 1.21): Export and consume `EndpointSlice`
while still exporting `ClusterService` for backwards compatibility.
- `only-endpointslice`: Only export and consume `EndpointSlice` and stop
exporting `ClusterService`.

> **Reviewer comment (on lines +211 to +212):** I'm personally not yet fully sold on the need for a feature flag, and on how it may look, as depending on the implementation we may be able to get away with a capability to detect whether the target cluster uses one format or the other. That said, I don't think we need to necessarily commit to a specific solution now, as it is mostly an implementation detail.
>
> I'm totally fine with the current proposal, maybe slightly softening the current wording to say that it may potentially change during the implementation phase, if that works for you.
>
> **Reply (MrFreezeex, author, May 12, 2026):** Sure, I can say that this may evolve during the implementation!
>
> FYI there are mainly two complex situations if we were to change that:
>
> - Handling switching at runtime between one format or the other on the agent/operator. This could potentially be skipped if we do this check only at startup, but that may create situations where, for one remote cluster, certain agents use the old format and some use the new format (depending on when each agent (re-)started)...
> - Handling code in the EndpointSliceSync / loadbalancer that would ingest certain remote clusters with the old format and certain remote clusters with the new format at the same time.
>
> As we were discussing offline, we could potentially get a compat layer in kvstoremesh, but this contradicts a bit your position of not introducing business logic in kvstoremesh from one of your other comments. It is for different reasons, and it would only be a temporary situation. On the plus side, we could probably remove the legacy code a bit faster if we had such a compat layer in kvstoremesh. I am not that sure we could reasonably do the conversion from ClusterService to ClusterEndpointSlice though, since it should theoretically require re-chunking the slice, and if we can't do that it might remove the advantage of something like that entirely :/

Cluster Mesh supports one minor version of skew. The goal of this rollout is to
make sure we do not break this guarantee while transitioning to the new format,
while still letting advanced users opt into the new format earlier. The intent
is to remove this option and all the `ClusterService`-related code in Cilium 1.22.
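
A minimal sketch of how the proposed option values could be represented in the configuration code; the constant names are illustrative only, and the exact mechanism may evolve during implementation as noted in the discussion above:

```go
// Possible values of the proposed clustermesh-service-v2 option.
// Names are illustrative and not a committed API.
const (
	// Export EndpointSlice data, but keep exporting and consuming ClusterService.
	ServiceV2PreferLegacy = "prefer-legacy"
	// Export and consume EndpointSlice data; keep exporting ClusterService
	// for backwards compatibility.
	ServiceV2PreferEndpointSlice = "prefer-endpointslice"
	// Only export and consume EndpointSlice data.
	ServiceV2OnlyEndpointSlice = "only-endpointslice"
)
```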

## Impacts / Key Questions

### Key Questions: Chunking multiple EndpointSlice vs a single object

The question of grouping the EndpointSlices was quite debated. The existing
`ClusterService` format groups all backends/EndpointSlices from a Service
in a single object. Ultimately, we favored individual EndpointSlices because:
- KCM EndpointSlice QPS is manageable in our case
- It matches the Kubernetes model and is simple to reason about
- Kubernetes may provide alternative Service APIs in the future, and not relying
explicitly on Services may help future-proof the format
- Handling individual EndpointSlice updates should be very natural and simple
for the loadbalancer and EndpointSliceSync code

### Key Questions: Encoding format

TODO: open question (see the benchmarks and discussion in the EndpointSlice
encoding section above).