From 98dced871784c1220fbeb1f675b4e4b6696a5d5a Mon Sep 17 00:00:00 2001 From: Arthur Outhenin-Chalandre Date: Wed, 1 Oct 2025 16:01:13 +0200 Subject: [PATCH 1/8] CFP-41953: add ClusterMesh Service v2 CFP Signed-off-by: Arthur Outhenin-Chalandre --- cilium/CFP-41953-clustermesh-service-v2.md | 284 +++++++++++++++++++++ 1 file changed, 284 insertions(+) create mode 100644 cilium/CFP-41953-clustermesh-service-v2.md diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md new file mode 100644 index 0000000..edfc87d --- /dev/null +++ b/cilium/CFP-41953-clustermesh-service-v2.md @@ -0,0 +1,284 @@ +# CFP-41953: ClusterMesh Service v2 + +**SIG: SIG-clustermesh** + +**Begin Design Discussion:** 2025-10-01 + +**Cilium Release:** 1.19 + +**Authors:** Arthur Outhenin-Chalandre + +**Status:** Implementable + +## Summary + +This CFP proposes introducing v2 of the ClusterMesh global service data format +stored in etcd and a transition from `cilium/state/services/v1/` to `cilium/state/services/v2/`. + +## Motivation + +The current ClusterMesh global service data is handled with the +[`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52). +This struct is then encoded in JSON format and stored in etcd. + +It is unfortunately not on par with the data needed by the Cilium load balancer. +For instance, until very recently there was no zone information in the `ClusterService` struct +(a PR to address that was merged recently but this isn't available in a stable release yet) +and there are still no backend conditions (or state) available, which prevents +the Cilium load balancer from excluding/phasing out some backends similarly to regular backends not +coming from a remote cluster. + +Also, the EndpointSliceSync feature that syncs EndpointSlices +across clusters could be greatly simplified by including the EndpointSlice name in the data, +see [CFP-41533](https://github.com/cilium/cilium/issues/41533) for more details about this. + +While we would like to add those fields to the `ClusterService` struct, we have a hard limit +of 1.5 MiB in etcd, and without a breaking change, adding these new fields to the current +`ClusterService` struct might limit the number of backends to fewer than ~10,000 backends +per service per cluster. + +Independently of the hard 1.5 MiB limit in etcd, we probably want to keep the +size of those objects as small as possible to reduce the amount of data flowing on +the network when a backend is added, removed or modified. This is pretty much the same +problem that the Kubernetes community faced with the Endpoints object and why they +transitioned to EndpointSlice. Except that in our case with ClusterMesh it's even more +problematic because we can have way more nodes than the maximum number of nodes supported +in a single cluster. Fortunately with KVStoreMesh we are mostly talking about in-cluster +and not inter-cluster traffic. + +The current `ClusterService` is not efficient in terms of size, mainly because it +is encoded in JSON but also because the `Backends` field (which is the main +contributor to the size of this object) duplicates data for each new port. 
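+
+As a rough illustration (using simplified, hypothetical types rather than the
+exact `ClusterService` wire format), the sketch below shows why the `Backends`
+map dominates the encoded size: every backend address serializes its own copy
+of the per-port information, so each additional named port inflates every
+backend entry.
+
+```go
+// Illustrative sketch only: simplified stand-ins, not the actual Cilium types.
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+type l4Addr struct {
+	Protocol string `json:"Protocol"`
+	Port     uint16 `json:"Port"`
+}
+
+func main() {
+	// Each backend address carries its own copy of the per-port information,
+	// so the encoded size grows roughly with backends x ports.
+	backends := map[string]map[string]l4Addr{
+		"10.0.0.1": {"http": {Protocol: "TCP", Port: 8080}, "metrics": {Protocol: "TCP", Port: 9090}},
+		"10.0.0.2": {"http": {Protocol: "TCP", Port: 8080}, "metrics": {Protocol: "TCP", Port: 9090}},
+	}
+	encoded, _ := json.Marshal(backends)
+	fmt.Printf("%d bytes for %d backends\n", len(encoded), len(backends))
+}
+```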
+ +## Goals + +* Address the scalability issues of the current JSON format for Services in ClusterMesh + * Reduce data flowing through the network + * Allow more backends to be encoded per service per cluster than currently +* Add backend conditions (or state) to ClusterMesh Services +* Add EndpointSlice name to the ClusterMesh Service data to simplify the + EndpointSliceSync logic + +## Non-Goals + +* Changes not specific to ClusterMesh global Services + +## Proposal + +### Overview + +The main proposal of this CFP is to transition to encoding Services in a +different, more optimized format with new data to accommodate backend conditions +and additional data for EndpointSliceSync. + +### What format to use + +The main format studied was based on protobuf since it is already used in Cilium +and can be easily integrated. As one of the goals is to be size efficient, I also +tried compressing the encoded data with two algorithms: lz4 and zstd. Both are +modern compression algorithms and LZ4 is known for its speed while zstd is known +for its compression ratio and relative speed. + +Here is the proto file that we will be using: +``` +// SPDX-License-Identifier: Apache-2.0 +// Copyright Authors of Cilium + +edition = "2023"; + +package clustermesh; +option go_package = "github.com/cilium/cilium/api/v1/clustermesh"; + +import "google/protobuf/go_features.proto"; +option features.(pb.go).api_level = API_OPAQUE; + +message Backend { + oneof address { + fixed32 v4 = 1; + bytes v6 = 2; + } + + // bitmask for backends conditions + uint32 conditions = 3; + + // Zone info + uint32 zone_index = 4; + repeated uint32 hints_for_zones_indexes = 5; + + string hostname = 6; +} + +enum L4Type { + L4_TYPE_UNSPECIFIED = 0; + L4_TYPE_TCP = 1; + L4_TYPE_UDP = 2; + L4_TYPE_SCTP = 3; +} + +message Port { + L4Type protocol = 1; + uint32 port = 2; + // Port name can be looked up in the `port_names_table` in the ClusterService message + uint32 name_index = 3; +} + +message Endpoint { + repeated Port ports = 1; + repeated Backend backends = 2; + + // Data used by EndpointSliceSync + string endpoints_name = 3; + // Used in EndpointSliceSync to only trigger reconciliation on EndpointSlice that changed + string endpoints_resource_version = 4; +} + +// ClusterService represents a service definition within a cluster +message ClusterService { + string cluster = 1; + uint32 cluster_id = 2; + string namespace = 3; + string name = 4; + + // String Interning Tables to reduce per backend size + repeated string port_names_table = 5; + repeated string zone_names_table = 6; + + repeated Endpoint endpoints = 7; +} +``` + +This protobuf message is inspired by the current Cilium `Backend` and `BackendParams` structs +while being closer to a regular Kubernetes EndpointSlice as we need this data for +EndpointSliceSync. Also the [current code](https://github.com/cilium/cilium/blob/487ace075d5f88e7a48b9fff3d47c989d2b3acad/operator/watchers/service_sync.go#L207) +which creates `ClusterService` objects converts directly from EndpointSlice and Service, so +having a format closer to EndpointSlice is slightly more straightforward to export. + +To decide what algorithm to pick exactly I did some benchmarks testing different scenarios. +The first one is comparing the encoded size of similar Service objects. 
Here is the result: + +| Backend Count | JSON | JSON (zone) | JSON (2 ports + zone) | Protobuf | Protobuf LZ4 | Protobuf zstd | +| ------------- | --------- | ----------- | --------------------- | --------- | ------------ | ------------- | +| 1 | 355B | 431B | 470B | 147B | 151B | 137B | +| 10 | 823B | 1.46KiB | 1.84KiB | 274B | 238B | 182B | +| 100 | 5.46KiB | 12.00KiB | 15.81KiB | 1.50KiB | 1.02KiB | 281B | +| 1000 | 52.60KiB | 118.59KiB | 156.67KiB | 14.29KiB | 8.45KiB | 1.96KiB | +| 5000 | 264.20KiB | 596.48KiB | 786.91KiB | 71.12KiB | 32.88KiB | 9.84KiB | +| 10000 | 530.69KiB | 1.17MiB | 1.54MiB | 142.17KiB | 63.46KiB | 19.81KiB | +| 50000 | 2.62MiB | 5.91MiB | 7.77MiB | 710.53KiB | 307.69KiB | 99.12KiB | +| 100000 | 5.25MiB | 11.83MiB | 15.55MiB | 1.39MiB | 612.92KiB | 211.68KiB | +| 150000 | 7.88MiB | 17.75MiB | 23.33MiB | 2.08MiB | 918.13KiB | 323.10KiB | + +There are different cases for JSON encoding to study the impact of adding multiple ports or adding zone info. +Note that the data encoded in protobuf correspond to the same data as "JSON (2 ports + zone)" but with +the new fields (backend conditions and data for EndpointSliceSync). + +From this benchmark we can see that the current JSON format uses a lot of bytes, +and that the protobuf format compressed with zstd overall seems to be the most efficient. + +However as compression is also adding some CPU overhead, I also did some benchmark for its +decompression speed. I focused on decompression since we need to make sure we don't add too much +overhead every time a Service is updated on every agent in the mesh. + +To evaluate this, I made a benchmark that decompresses similar data as in the previous +benchmark and also converts that to a `BackendParams` slice since that's what we need +to feed the load balancer. This does not account for any network/etcd query overhead; it +exclusively decodes bytes from memory. Here are the results summarized by benchstat: + +``` +goos: linux +goarch: amd64 +pkg: github.com/cilium/cilium/bench +cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz + │ bench_result_json.txt │ bench_result_protobuf.txt │ bench_result_protobuf_lz4.txt │ bench_result_protobuf_zstd.txt │ + │ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │ +Decoding/1_backends-8 18.663µ ± 6% 5.125µ ± 7% -72.54% (p=0.000 n=10) 5.346µ ± 5% -71.36% (p=0.000 n=10) 6.816µ ± 17% -63.48% (p=0.000 n=10) +Decoding/10_backends-8 71.65µ ± 3% 21.83µ ± 6% -69.53% (p=0.000 n=10) 20.73µ ± 4% -71.07% (p=0.000 n=10) 23.41µ ± 22% -67.32% (p=0.000 n=10) +Decoding/100_backends-8 598.4µ ± 5% 172.9µ ± 10% -71.11% (p=0.000 n=10) 156.7µ ± 9% -73.82% (p=0.000 n=10) 161.0µ ± 2% -73.09% (p=0.000 n=10) +Decoding/1000_backends-8 7.270m ± 8% 1.813m ± 8% -75.06% (p=0.000 n=10) 1.760m ± 19% -75.80% (p=0.000 n=10) 1.792m ± 3% -75.35% (p=0.000 n=10) +Decoding/5000_backends-8 55.51m ± 11% 24.64m ± 14% -55.62% (p=0.000 n=10) 22.58m ± 7% -59.32% (p=0.000 n=10) 25.36m ± 11% -54.31% (p=0.000 n=10) +Decoding/10000_backends-8 132.36m ± 13% 59.89m ± 23% -54.75% (p=0.000 n=10) 58.73m ± 18% -55.63% (p=0.000 n=10) 55.69m ± 27% -57.92% (p=0.000 n=10) +Decoding/50000_backends-8 802.4m ± 7% 433.7m ± 26% -45.95% (p=0.000 n=10) 362.9m ± 5% -54.77% (p=0.000 n=10) 379.5m ± 17% -52.70% (p=0.000 n=10) +geomean 4.445m 1.560m -64.92% 1.468m -66.97% 1.581m -64.43% +``` + +We can see in the decompression benchmark that all the proposals based on protobuf are +significantly faster (by at least 50%) than the current JSON format. 
While +we don't have agent profiling data here, we know decoding would be faster. However +we don't know if this is a relevant portion of the total agent CPU usage. + +While zstd is in most cases slightly slower for decompressing in-memory data, it is also +about ~7x smaller than the default protobuf and ~3x smaller than LZ4. This is not +only affecting the total network bandwidth used but should also slightly reduce +the time needed for an agent to receive updates from etcd. + +Based on those two benchmarks, the proposal is to use a protobuf format compressed with zstd. + +### Rollout strategy + +In order to introduce a "v2" of the ClusterMesh Service data format, this CFP is mainly +proposing to have a global switch rather than a per cluster detection. This is +mainly to keep the transition as simple as possible because a per cluster detection +could introduce more complexity as we would need to handle downgrade/upgrade of remote +clusters at runtime. + +With this approach we would add a new option `clustermesh-service-v2-enabled` which will be +disabled by default in Cilium 1.19. This option will control if the operator and agent will use +the v1 or v2 format. This option would be enabled by default in Cilium 1.20 and deprecated +to be then removed in Cilium 1.21. In Cilium 1.19 and 1.20, we would also unconditionally +export both the v1 and v2 format while KVStoreMesh will also mirror both. + +This gives a good balance between keeping the change simple and ensuring that users +can upgrade without traffic disruptions. We will be able to document that when +`clustermesh-service-v2-enabled` is enabled, all remote clusters connected should already be running +Cilium 1.19 or higher. Also the upgrade to Cilium 1.21 with `clustermesh-service-v2-enabled` +disabled will not be officially supported and users doing that should expect disruptions. + +To clarify and facilitate this transition, we could also make Cilium export its own version +in its `CiliumClusterConfig`. And prevent connecting to remote cluster running Cilium 1.18 or lower +when `clustermesh-service-v2-enabled` is enabled. We would also able to add a warning +when we connect to remote clusters running with more than one minor version difference, +as this is not officially supported or tested in our CI. + +## Impacts / Key Questions + +### Impact: Service format breaking change + +This change will introduce a "v2" of service data in etcd and, as proposed, +would introduce an incompatibility between clusters running Cilium 1.18 or lower +and Cilium 1.20 or higher by default. + +### Impact: text format readability for debugging + +If we encode Service objects with protobuf and then compress them with zstd, it would be +harder to inspect the content of a Service object in etcd for debugging than with +the existing JSON format. + +### Option 1: Use a slice approach + +We could also use a slice approach very similar to what Kubernetes has done with +EndpointSlice vs the original Endpoints. + +#### Pros + +* Consistent with the Kubernetes approach + +#### Cons + +* Would introduce more complexity in the clustermesh codebase + (but similar concerns as what is done in the load balancer codebase) +* Needs significantly more objects in etcd for big Services +* If we keep JSON to encode those objects, the advantages in terms of bytes + flowing through the network are not clearly better with a slice approach in all + situations considering the size efficiency of protobuf compressed with zstd. 
+ +### Option 2: Optimize the existing JSON format + +#### Pros + +* Keep a format "readable" for debugging + +#### Cons + +* Some byte optimizations could be achieved by shortening/uglifying the + different field names, which would make the format less readable and probably defeat + the purpose of keeping a JSON encoding. From 33465e20769b80dc9b920e8073572d4c287f76aa Mon Sep 17 00:00:00 2001 From: Arthur Outhenin-Chalandre Date: Sun, 2 Nov 2025 19:21:31 +0100 Subject: [PATCH 2/8] wip: iteration 2 Second iteration of the ClusterMesh Service CFP. This commit is meant to be squashed before the PR is merged. This new iteration change the following things: - adds zstd compression for the existing JSON format - Rename various fields to EndpointSlice instead of Endpoints - Change the address to string. While this is less efficient on uncompressed data it appears to be more efficient while compressing with zstd. It would also be more readable when encoding to JSON - Add some quick note on the plan to address debugging point by using protonjson to allow dumping the data to JSON (and yaml) and testing in hive script with JSON Signed-off-by: Arthur Outhenin-Chalandre --- cilium/CFP-41953-clustermesh-service-v2.md | 93 +++++++++++----------- 1 file changed, 47 insertions(+), 46 deletions(-) diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md index edfc87d..827b8a8 100644 --- a/cilium/CFP-41953-clustermesh-service-v2.md +++ b/cilium/CFP-41953-clustermesh-service-v2.md @@ -93,19 +93,14 @@ import "google/protobuf/go_features.proto"; option features.(pb.go).api_level = API_OPAQUE; message Backend { - oneof address { - fixed32 v4 = 1; - bytes v6 = 2; - } - - // bitmask for backends conditions - uint32 conditions = 3; + string address = 1; + uint32 conditions = 2; // Zone info - uint32 zone_index = 4; - repeated uint32 hints_for_zones_indexes = 5; + uint32 zone_index = 3; + repeated uint32 hints_for_zones_indexes = 4; - string hostname = 6; + string hostname = 5; } enum L4Type { @@ -122,14 +117,10 @@ message Port { uint32 name_index = 3; } -message Endpoint { +message EndpointSlice { repeated Port ports = 1; repeated Backend backends = 2; - - // Data used by EndpointSliceSync - string endpoints_name = 3; - // Used in EndpointSliceSync to only trigger reconciliation on EndpointSlice that changed - string endpoints_resource_version = 4; + string endpoint_slice_name = 3; } // ClusterService represents a service definition within a cluster @@ -143,7 +134,7 @@ message ClusterService { repeated string port_names_table = 5; repeated string zone_names_table = 6; - repeated Endpoint endpoints = 7; + repeated EndpointSlice endpoint_slices = 7; } ``` @@ -156,24 +147,25 @@ having a format closer to EndpointSlice is slightly more straightforward to expo To decide what algorithm to pick exactly I did some benchmarks testing different scenarios. The first one is comparing the encoded size of similar Service objects. 
Here is the result: -| Backend Count | JSON | JSON (zone) | JSON (2 ports + zone) | Protobuf | Protobuf LZ4 | Protobuf zstd | -| ------------- | --------- | ----------- | --------------------- | --------- | ------------ | ------------- | -| 1 | 355B | 431B | 470B | 147B | 151B | 137B | -| 10 | 823B | 1.46KiB | 1.84KiB | 274B | 238B | 182B | -| 100 | 5.46KiB | 12.00KiB | 15.81KiB | 1.50KiB | 1.02KiB | 281B | -| 1000 | 52.60KiB | 118.59KiB | 156.67KiB | 14.29KiB | 8.45KiB | 1.96KiB | -| 5000 | 264.20KiB | 596.48KiB | 786.91KiB | 71.12KiB | 32.88KiB | 9.84KiB | -| 10000 | 530.69KiB | 1.17MiB | 1.54MiB | 142.17KiB | 63.46KiB | 19.81KiB | -| 50000 | 2.62MiB | 5.91MiB | 7.77MiB | 710.53KiB | 307.69KiB | 99.12KiB | -| 100000 | 5.25MiB | 11.83MiB | 15.55MiB | 1.39MiB | 612.92KiB | 211.68KiB | -| 150000 | 7.88MiB | 17.75MiB | 23.33MiB | 2.08MiB | 918.13KiB | 323.10KiB | - -There are different cases for JSON encoding to study the impact of adding multiple ports or adding zone info. -Note that the data encoded in protobuf correspond to the same data as "JSON (2 ports + zone)" but with +| Backend Count | JSON | JSON (2 ports) | JSON (2 ports + zstd) | Protobuf | Protobuf lz4 | Protobuf zstd | +| ------------- | --------- | -------------- | --------------------- | --------- | ------------ | ------------- | +| 1 | 414B | 453B | 239B | 144B | 154B | 131B | +| 10 | 1.44KiB | 1.82KiB | 309B | 316B | 208B | 177B | +| 100 | 11.99KiB | 15.80KiB | 578B | 2.07KiB | 681B | 309B | +| 1000 | 118.57KiB | 156.66KiB | 3.11KiB | 20.61KiB | 5.55KiB | 2.55KiB | +| 5000 | 596.46KiB | 786.89KiB | 14.22KiB | 105.15KiB | 27.01KiB | 10.47KiB | +| 10000 | 1.17MiB | 1.54MiB | 21.72KiB | 212.80KiB | 53.94KiB | 20.36KiB | +| 50000 | 5.91MiB | 7.77MiB | 89.62KiB | 1.07MiB | 271.34KiB | 85.96KiB | +| 100000 | 11.83MiB | 15.55MiB | 170.01KiB | 2.14MiB | 542.75KiB | 158.87KiB | +| 150000 | 17.75MiB | 23.33MiB | 252.83KiB | 3.22MiB | 815.06KiB | 227.04KiB | + +There are different cases for JSON encoding to study the impact of adding multiple ports. +Note that the data encoded in protobuf correspond to the same data as "JSON (2 ports)" but with the new fields (backend conditions and data for EndpointSliceSync). From this benchmark we can see that the current JSON format uses a lot of bytes, -and that the protobuf format compressed with zstd overall seems to be the most efficient. +and that zstd compression is very efficient. It makes the JSON format almost as +efficient as protobuf size wise. However as compression is also adding some CPU overhead, I also did some benchmark for its decompression speed. 
I focused on decompression since we need to make sure we don't add too much @@ -189,16 +181,16 @@ goos: linux goarch: amd64 pkg: github.com/cilium/cilium/bench cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz - │ bench_result_json.txt │ bench_result_protobuf.txt │ bench_result_protobuf_lz4.txt │ bench_result_protobuf_zstd.txt │ - │ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │ -Decoding/1_backends-8 18.663µ ± 6% 5.125µ ± 7% -72.54% (p=0.000 n=10) 5.346µ ± 5% -71.36% (p=0.000 n=10) 6.816µ ± 17% -63.48% (p=0.000 n=10) -Decoding/10_backends-8 71.65µ ± 3% 21.83µ ± 6% -69.53% (p=0.000 n=10) 20.73µ ± 4% -71.07% (p=0.000 n=10) 23.41µ ± 22% -67.32% (p=0.000 n=10) -Decoding/100_backends-8 598.4µ ± 5% 172.9µ ± 10% -71.11% (p=0.000 n=10) 156.7µ ± 9% -73.82% (p=0.000 n=10) 161.0µ ± 2% -73.09% (p=0.000 n=10) -Decoding/1000_backends-8 7.270m ± 8% 1.813m ± 8% -75.06% (p=0.000 n=10) 1.760m ± 19% -75.80% (p=0.000 n=10) 1.792m ± 3% -75.35% (p=0.000 n=10) -Decoding/5000_backends-8 55.51m ± 11% 24.64m ± 14% -55.62% (p=0.000 n=10) 22.58m ± 7% -59.32% (p=0.000 n=10) 25.36m ± 11% -54.31% (p=0.000 n=10) -Decoding/10000_backends-8 132.36m ± 13% 59.89m ± 23% -54.75% (p=0.000 n=10) 58.73m ± 18% -55.63% (p=0.000 n=10) 55.69m ± 27% -57.92% (p=0.000 n=10) -Decoding/50000_backends-8 802.4m ± 7% 433.7m ± 26% -45.95% (p=0.000 n=10) 362.9m ± 5% -54.77% (p=0.000 n=10) 379.5m ± 17% -52.70% (p=0.000 n=10) -geomean 4.445m 1.560m -64.92% 1.468m -66.97% 1.581m -64.43% + │ bench_result_json.txt │ bench_result_json_zstd.txt │ bench_result_protobuf.txt │ bench_result_protobuf_lz4.txt │ bench_result_protobuf_zstd.txt │ + │ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │ sec/op vs base │ +Decoding/1_backends-8 17.843µ ± 6% 24.087µ ± 1% +34.99% (p=0.000 n=10) 4.488µ ± 1% -74.85% (p=0.000 n=10) 4.707µ ± 1% -73.62% (p=0.000 n=10) 6.166µ ± 3% -65.44% (p=0.000 n=10) +Decoding/10_backends-8 70.37µ ± 1% 77.89µ ± 0% +10.68% (p=0.000 n=10) 18.43µ ± 0% -73.81% (p=0.000 n=10) 18.91µ ± 0% -73.13% (p=0.000 n=10) 20.65µ ± 0% -70.66% (p=0.000 n=10) +Decoding/100_backends-8 582.2µ ± 0% 605.9µ ± 0% +4.07% (p=0.000 n=10) 143.4µ ± 1% -75.36% (p=0.000 n=10) 145.3µ ± 1% -75.04% (p=0.000 n=10) 160.1µ ± 2% -72.50% (p=0.000 n=10) +Decoding/1000_backends-8 6.440m ± 2% 6.565m ± 2% +1.95% (p=0.019 n=10) 1.640m ± 2% -74.53% (p=0.000 n=10) 1.743m ± 1% -72.93% (p=0.000 n=10) 1.788m ± 2% -72.24% (p=0.000 n=10) +Decoding/5000_backends-8 52.07m ± 5% 55.31m ± 9% +6.22% (p=0.043 n=10) 22.84m ± 23% -56.14% (p=0.000 n=10) 22.42m ± 15% -56.95% (p=0.000 n=10) 23.49m ± 15% -54.89% (p=0.000 n=10) +Decoding/10000_backends-8 131.42m ± 10% 129.13m ± 18% ~ (p=0.631 n=10) 58.85m ± 20% -55.22% (p=0.000 n=10) 55.82m ± 17% -57.53% (p=0.000 n=10) 56.00m ± 18% -57.39% (p=0.000 n=10) +Decoding/50000_backends-8 742.2m ± 9% 841.4m ± 14% ~ (p=0.052 n=10) 391.0m ± 29% -47.31% (p=0.000 n=10) 384.9m ± 9% -48.14% (p=0.000 n=10) 398.0m ± 17% -46.37% (p=0.000 n=10) +geomean 4.222m 4.619m +9.40% 1.394m -66.98% 1.406m -66.70% 1.524m -63.91% ``` We can see in the decompression benchmark that all the proposals based on protobuf are @@ -207,11 +199,15 @@ we don't have agent profiling data here, we know decoding would be faster. Howev we don't know if this is a relevant portion of the total agent CPU usage. While zstd is in most cases slightly slower for decompressing in-memory data, it is also -about ~7x smaller than the default protobuf and ~3x smaller than LZ4. 
This is not -only affecting the total network bandwidth used but should also slightly reduce -the time needed for an agent to receive updates from etcd. +about ~7x smaller than the default protobuf and ~3x smaller than LZ4. This should also +reduce various IO processing times not accounted in the in-memory benchmark above. -Based on those two benchmarks, the proposal is to use a protobuf format compressed with zstd. +The JSON format compressed with zstd add relatively little overhead vs the current JSON +format and since it has approximately the same size as a protobuf zstd format it makes +a compelling alternative. However adding a conditions fields to backends would amplify +the number of service updates and thus makes the decode CPU overhead even more significant. + +Based on this, the proposal is to use a protobuf format compressed with zstd. ### Rollout strategy @@ -253,6 +249,11 @@ If we encode Service objects with protobuf and then compress them with zstd, it harder to inspect the content of a Service object in etcd for debugging than with the existing JSON format. +In order to mitigate this we will use the protojson encoding/decoding.The `kvstore get` +command with the json and yaml output will still be supported even for protobuf data. +Also some new hive script command will be added to encode/decode the service object +from/to JSON to facilitate testing + ### Option 1: Use a slice approach We could also use a slice approach very similar to what Kubernetes has done with From 941acfd3c33311e9adc42e188b208db276531718 Mon Sep 17 00:00:00 2001 From: Arthur Outhenin-Chalandre Date: Sun, 7 Dec 2025 15:52:15 +0100 Subject: [PATCH 3/8] wip: iteration 3 Rewrote the CFP to remove protobuf and focus on JSON format based on slim endpointslice. It also includes statedb performance consideration after having done end to end benchmarks. Signed-off-by: Arthur Outhenin-Chalandre --- cilium/CFP-41953-clustermesh-service-v2.md | 519 ++++++++++++--------- 1 file changed, 295 insertions(+), 224 deletions(-) diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md index 827b8a8..e782688 100644 --- a/cilium/CFP-41953-clustermesh-service-v2.md +++ b/cilium/CFP-41953-clustermesh-service-v2.md @@ -12,252 +12,319 @@ ## Summary -This CFP proposes introducing v2 of the ClusterMesh global service data format -stored in etcd and a transition from `cilium/state/services/v1/` to `cilium/state/services/v2/`. +This CFP proposes introducing v2 of the clustermesh global service data +format stored in etcd. It transitions from `cilium/state/services/v1/` +to `cilium/state/services/v2/` and harmonizes backend data insertion +techniques between clustermesh services and Kubernetes services. ## Motivation -The current ClusterMesh global service data is handled with the +The current clustermesh global service data is handled with the [`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52). -This struct is then encoded in JSON format and stored in etcd. - -It is unfortunately not on par with the data needed by the Cilium load balancer. 
-For instance, until very recently there was no zone information in the `ClusterService` struct -(a PR to address that was merged recently but this isn't available in a stable release yet) -and there are still no backend conditions (or state) available, which prevents -the Cilium load balancer from excluding/phasing out some backends similarly to regular backends not -coming from a remote cluster. - -Also, the EndpointSliceSync feature that syncs EndpointSlices -across clusters could be greatly simplified by including the EndpointSlice name in the data, -see [CFP-41533](https://github.com/cilium/cilium/issues/41533) for more details about this. - -While we would like to add those fields to the `ClusterService` struct, we have a hard limit -of 1.5 MiB in etcd, and without a breaking change, adding these new fields to the current -`ClusterService` struct might limit the number of backends to fewer than ~10,000 backends -per service per cluster. - -Independently of the hard 1.5 MiB limit in etcd, we probably want to keep the -size of those objects as small as possible to reduce the amount of data flowing on -the network when a backend is added, removed or modified. This is pretty much the same -problem that the Kubernetes community faced with the Endpoints object and why they -transitioned to EndpointSlice. Except that in our case with ClusterMesh it's even more -problematic because we can have way more nodes than the maximum number of nodes supported -in a single cluster. Fortunately with KVStoreMesh we are mostly talking about in-cluster -and not inter-cluster traffic. - -The current `ClusterService` is not efficient in terms of size, mainly because it -is encoded in JSON but also because the `Backends` field (which is the main -contributor to the size of this object) duplicates data for each new port. +This struct is encoded in JSON format and stored in etcd. While this format +has served well initially, it now faces several limitations that prevent +clustermesh from scaling efficiently and supporting new features. These +limitations fall into three main areas: missing critical fields, suboptimal +performance for service updates, and inefficient data encoding. + +### Missing backend conditions + +The current format lacks important fields. It omits backend conditions, so +the system cannot skip non-ready endpoints or properly phase out +terminating ones. It also does not include the `EndpointSlice` name, which +would greatly simplify the EndpointSliceSync feature (as described in +[CFP-41533](https://github.com/cilium/cilium/issues/41533)). + +Adding backend conditions is critical for correct backend state handling in +the loadbalancer. However, this will increase service churn as backends +transition between ready, not ready, and terminating states. The current +clustermesh implementation is not equipped to handle this increased churn +efficiently. + +### Performance gap with the loadbalancer k8s reflector + +There is a large performance gap between the clustermesh and the +standard loadbalancer k8s reflector. The clustermesh pipeline currently +re-inserts every backend and uses a watermark mechanism to orphan older +entries. Even if we change a small number of backends, the cost stays close +to the initial `ClusterService` ingestion. This is especially problematic +for churn scenarios. In contrast, the loadbalancer maintains state outside +statedb. It does not need to re-insert all backends and can update multiple +services in a single transaction. 
The table below shows how large this gap is: + +| Backends | clustermesh (µs) | loadbalancer k8s (µs) | Ratio (clustermesh/k8s) | +|----------|------------------|-----------------------|-------------------------| +| 1 | 95 | 3 | 32x slower | +| 100 | 2141 | 5 | 428x slower | +| 1,000 | 18149 | 9 | 2,017x slower | +| 5,000 | 107435 | 34 | 3,160x slower | +| 10,000 | 232300 | 75 | 3,097x slower | + +These benchmarks are not strictly equivalent, but they give a good idea +of the performance gap and how clustermesh degrades as the number of +backends grows. + +### Inefficient data format and network traffic + +Over time, the current data format has shown its limits. Backend +information is largely the same across ports, so the total size of the +object tends to grow almost in proportion to the number of ports. In +addition, JSON encoding adds extra overhead for field names and number +formatting compared to a binary representation. + +All of this data must be replicated to every node in the mesh. A mesh often +has many more nodes than a single cluster, which results in a high volume +of control plane network traffic. When combined with the increased churn +from backend conditions, this inefficiency becomes a significant scaling +concern. + +Additionally, etcd imposes a hard limit of 1.5 MiB per object. Without a +breaking change to a more efficient format, adding the missing fields to +the current `ClusterService` struct would restrict scalability to fewer +than 10,000 backends per service per cluster. While this is a high limit, +having headroom beyond this point is useful for future growth. + +Even below the limit, keeping objects small is important to reduce network +traffic when backends change. This situation is similar to the Kubernetes +community's move from `Endpoints` to `EndpointSlice`, but the problem is +even stronger in clustermesh because a mesh can contain many more nodes +than a single Kubernetes cluster. ## Goals -* Address the scalability issues of the current JSON format for Services in ClusterMesh - * Reduce data flowing through the network - * Allow more backends to be encoded per service per cluster than currently -* Add backend conditions (or state) to ClusterMesh Services -* Add EndpointSlice name to the ClusterMesh Service data to simplify the - EndpointSliceSync logic +* Reduce network bandwidth needed for control plane operations on large services +* Improve clustermesh service ingestion performance in the agent and in particular + related to service churn scenarios +* Allow scaling on the clustermesh level to a larger number of backends per + service per cluster +* Add backend conditions to clustermesh services to allow correct backend + state handling in the loadbalancer +* Add `EndpointSlice` name to the clustermesh service data to simplify + the EndpointSliceSync logic ## Non-Goals -* Changes not specific to ClusterMesh global Services +* Changes not specific to clustermesh global services (for example + MCS-API handling) +* Large changes to non-clustermesh loadbalancer logic ## Proposal ### Overview -The main proposal of this CFP is to transition to encoding Services in a -different, more optimized format with new data to accommodate backend conditions -and additional data for EndpointSliceSync. - -### What format to use - -The main format studied was based on protobuf since it is already used in Cilium -and can be easily integrated. As one of the goals is to be size efficient, I also -tried compressing the encoded data with two algorithms: lz4 and zstd. 
Both are -modern compression algorithms and LZ4 is known for its speed while zstd is known -for its compression ratio and relative speed. - -Here is the proto file that we will be using: -``` -// SPDX-License-Identifier: Apache-2.0 -// Copyright Authors of Cilium - -edition = "2023"; - -package clustermesh; -option go_package = "github.com/cilium/cilium/api/v1/clustermesh"; - -import "google/protobuf/go_features.proto"; -option features.(pb.go).api_level = API_OPAQUE; - -message Backend { - string address = 1; - uint32 conditions = 2; - - // Zone info - uint32 zone_index = 3; - repeated uint32 hints_for_zones_indexes = 4; - - string hostname = 5; -} - -enum L4Type { - L4_TYPE_UNSPECIFIED = 0; - L4_TYPE_TCP = 1; - L4_TYPE_UDP = 2; - L4_TYPE_SCTP = 3; -} - -message Port { - L4Type protocol = 1; - uint32 port = 2; - // Port name can be looked up in the `port_names_table` in the ClusterService message - uint32 name_index = 3; -} - -message EndpointSlice { - repeated Port ports = 1; - repeated Backend backends = 2; - string endpoint_slice_name = 3; -} - -// ClusterService represents a service definition within a cluster -message ClusterService { - string cluster = 1; - uint32 cluster_id = 2; - string namespace = 3; - string name = 4; - - // String Interning Tables to reduce per backend size - repeated string port_names_table = 5; - repeated string zone_names_table = 6; - - repeated EndpointSlice endpoint_slices = 7; +This CFP proposes transitioning to a different format that directly +embeds Kubernetes structs and uses zstd compression. + +This will allow more code reuse between the Kubernetes reflector in the +loadbalancer packages and the clustermesh package. + +The new format will also include endpoint conditions and will route +backends based on their state (ready, not ready, terminating, and so on). +It will include the `EndpointSlice` name to simplify the +EndpointSliceSync logic. + +While these new backend conditions will increase service churn, we keep it +manageable by adopting statedb incremental updates through code reuse with +the loadbalancer k8s reflector. We expect clustermesh to achieve similar +performance for service churn scenarios (100-3000x improvement over v1). +Additionally, etcd operations are already rate limited (by default 20qps), +which naturally coalesces multiple service updates in the workqueue, +ensuring good latency and throughput without requiring explicit export +throttling. + +Compressing the data with zstd will also dramatically reduce the on-wire size +and etcd object size. We expect around 50x compression ratios or higher for +services with thousands of backends. This significantly reduces control plane +network bandwidth consumption. + +### Unifying ingestion pipelines through shared data structures + +The clustermesh and loadbalancer k8s reflector currently diverge in both +their data structures and ingestion logic. This divergence has created the +performance gap described in the Motivation section and makes it harder to +maintain feature parity between the two code paths. + +We now propose unifying these pipelines by adopting shared Kubernetes data +structures. Specifically, we will align clustermesh and the loadbalancer k8s +reflector directly with Kubernetes slim `EndpointSlice` structs instead of +using Cilium specific intermediate representations. + +The Kubernetes `EndpointSlice` API must remain backward compatible across +Kubernetes versions. 
This aligns well with Cilium clustermesh's upgrade +requirement to support at least two consecutive minor versions. Using the +Kubernetes format directly also allows the loadbalancer code to use the +same struct whether the data comes from a local Kubernetes cluster or from +clustermesh. + +Cilium currently uses an internal `Endpoints` resource with the +following struct: + +```go +type Endpoints struct { + types.UnserializableObject + slim_metav1.ObjectMeta + + EndpointSliceID + + // Backends is a map containing all backend IPs and ports. The key to + // the map is the backend IP in string form. The value defines the list + // of ports for that backend IP, plus an additional optional node name. + // Backends map[cmtypes.AddrCluster]*Backend + Backends map[cmtypes.AddrCluster]*Backend } ``` -This protobuf message is inspired by the current Cilium `Backend` and `BackendParams` structs -while being closer to a regular Kubernetes EndpointSlice as we need this data for -EndpointSliceSync. Also the [current code](https://github.com/cilium/cilium/blob/487ace075d5f88e7a48b9fff3d47c989d2b3acad/operator/watchers/service_sync.go#L207) -which creates `ClusterService` objects converts directly from EndpointSlice and Service, so -having a format closer to EndpointSlice is slightly more straightforward to export. - -To decide what algorithm to pick exactly I did some benchmarks testing different scenarios. -The first one is comparing the encoded size of similar Service objects. Here is the result: - -| Backend Count | JSON | JSON (2 ports) | JSON (2 ports + zstd) | Protobuf | Protobuf lz4 | Protobuf zstd | -| ------------- | --------- | -------------- | --------------------- | --------- | ------------ | ------------- | -| 1 | 414B | 453B | 239B | 144B | 154B | 131B | -| 10 | 1.44KiB | 1.82KiB | 309B | 316B | 208B | 177B | -| 100 | 11.99KiB | 15.80KiB | 578B | 2.07KiB | 681B | 309B | -| 1000 | 118.57KiB | 156.66KiB | 3.11KiB | 20.61KiB | 5.55KiB | 2.55KiB | -| 5000 | 596.46KiB | 786.89KiB | 14.22KiB | 105.15KiB | 27.01KiB | 10.47KiB | -| 10000 | 1.17MiB | 1.54MiB | 21.72KiB | 212.80KiB | 53.94KiB | 20.36KiB | -| 50000 | 5.91MiB | 7.77MiB | 89.62KiB | 1.07MiB | 271.34KiB | 85.96KiB | -| 100000 | 11.83MiB | 15.55MiB | 170.01KiB | 2.14MiB | 542.75KiB | 158.87KiB | -| 150000 | 17.75MiB | 23.33MiB | 252.83KiB | 3.22MiB | 815.06KiB | 227.04KiB | - -There are different cases for JSON encoding to study the impact of adding multiple ports. -Note that the data encoded in protobuf correspond to the same data as "JSON (2 ports)" but with -the new fields (backend conditions and data for EndpointSliceSync). - -From this benchmark we can see that the current JSON format uses a lot of bytes, -and that zstd compression is very efficient. It makes the JSON format almost as -efficient as protobuf size wise. - -However as compression is also adding some CPU overhead, I also did some benchmark for its -decompression speed. I focused on decompression since we need to make sure we don't add too much -overhead every time a Service is updated on every agent in the mesh. - -To evaluate this, I made a benchmark that decompresses similar data as in the previous -benchmark and also converts that to a `BackendParams` slice since that's what we need -to feed the load balancer. This does not account for any network/etcd query overhead; it -exclusively decodes bytes from memory. 
Here are the results summarized by benchstat: - -``` -goos: linux -goarch: amd64 -pkg: github.com/cilium/cilium/bench -cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz - │ bench_result_json.txt │ bench_result_json_zstd.txt │ bench_result_protobuf.txt │ bench_result_protobuf_lz4.txt │ bench_result_protobuf_zstd.txt │ - │ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │ sec/op vs base │ -Decoding/1_backends-8 17.843µ ± 6% 24.087µ ± 1% +34.99% (p=0.000 n=10) 4.488µ ± 1% -74.85% (p=0.000 n=10) 4.707µ ± 1% -73.62% (p=0.000 n=10) 6.166µ ± 3% -65.44% (p=0.000 n=10) -Decoding/10_backends-8 70.37µ ± 1% 77.89µ ± 0% +10.68% (p=0.000 n=10) 18.43µ ± 0% -73.81% (p=0.000 n=10) 18.91µ ± 0% -73.13% (p=0.000 n=10) 20.65µ ± 0% -70.66% (p=0.000 n=10) -Decoding/100_backends-8 582.2µ ± 0% 605.9µ ± 0% +4.07% (p=0.000 n=10) 143.4µ ± 1% -75.36% (p=0.000 n=10) 145.3µ ± 1% -75.04% (p=0.000 n=10) 160.1µ ± 2% -72.50% (p=0.000 n=10) -Decoding/1000_backends-8 6.440m ± 2% 6.565m ± 2% +1.95% (p=0.019 n=10) 1.640m ± 2% -74.53% (p=0.000 n=10) 1.743m ± 1% -72.93% (p=0.000 n=10) 1.788m ± 2% -72.24% (p=0.000 n=10) -Decoding/5000_backends-8 52.07m ± 5% 55.31m ± 9% +6.22% (p=0.043 n=10) 22.84m ± 23% -56.14% (p=0.000 n=10) 22.42m ± 15% -56.95% (p=0.000 n=10) 23.49m ± 15% -54.89% (p=0.000 n=10) -Decoding/10000_backends-8 131.42m ± 10% 129.13m ± 18% ~ (p=0.631 n=10) 58.85m ± 20% -55.22% (p=0.000 n=10) 55.82m ± 17% -57.53% (p=0.000 n=10) 56.00m ± 18% -57.39% (p=0.000 n=10) -Decoding/50000_backends-8 742.2m ± 9% 841.4m ± 14% ~ (p=0.052 n=10) 391.0m ± 29% -47.31% (p=0.000 n=10) 384.9m ± 9% -48.14% (p=0.000 n=10) 398.0m ± 17% -46.37% (p=0.000 n=10) -geomean 4.222m 4.619m +9.40% 1.394m -66.98% 1.406m -66.70% 1.524m -63.91% +This struct is a Cilium specific transformation of Kubernetes `EndpointSlice` +data. It was designed to support both the legacy `Endpoints` API and the newer +`EndpointSlice` API. With the introduction of statedb and most consumers watching +statedb instead of relying on this resource, the only place actually using it +is `operator/watchers/service_sync.go` to export service data for clustermesh. + +Since this resource is now primarily used for clustermesh exports, we propose +changing it to directly expose the slim Kubernetes `EndpointSlice` struct. The +new `ClusterServiceV2` format will embed these `EndpointSlice` objects directly, +avoiding the need for an intermediate Cilium-specific representation. This +approach prevents future divergence that could occur if the internal `Endpoints` +struct is later optimized or changed for other purposes. + +The new `ClusterService` v2 struct would thus look like this: + +```go +type ClusterServiceV2 struct { + Cluster string `json:"cluster"` + ClusterID uint32 `json:"clusterID"` + Namespace string `json:"namespace"` + Name string `json:"name"` + // Note that not every field from the EndpointSlice will + // be populated (for instance, fields from TypeMeta and most + // fields from ObjectMeta) + EndpointSlices []*slim_discovery_v1.EndpointSlice `json:"endpointslices"` +} ``` -We can see in the decompression benchmark that all the proposals based on protobuf are -significantly faster (by at least 50%) than the current JSON format. While -we don't have agent profiling data here, we know decoding would be faster. However -we don't know if this is a relevant portion of the total agent CPU usage. - -While zstd is in most cases slightly slower for decompressing in-memory data, it is also -about ~7x smaller than the default protobuf and ~3x smaller than LZ4. 
This should also -reduce various IO processing times not accounted in the in-memory benchmark above. - -The JSON format compressed with zstd add relatively little overhead vs the current JSON -format and since it has approximately the same size as a protobuf zstd format it makes -a compelling alternative. However adding a conditions fields to backends would amplify -the number of service updates and thus makes the decode CPU overhead even more significant. - -Based on this, the proposal is to use a protobuf format compressed with zstd. +By using the Kubernetes `EndpointSlice` format directly, both the loadbalancer +and clustermesh will consume similar data structures. This enables code reuse +for the agent-side ingestion logic. We can extract most of the code in +`pkg/loadbalancer/reflectors/k8s.go` into a shared package under +`pkg/loadbalancer` that both the loadbalancer and clustermesh can use. This +shared code will handle buffering, coalescing, and statedb updates, ensuring +similar ingestion and optimization for both code paths. This alignment allows +us to introduce an ingestion buffer in the agent for clustermesh services, +similar to the loadbalancer reflector. We might tweak some parameters for +these two different use cases (max buffer size, timeouts, etc.), but the core +logic would be shared. + +We should also find out which is the most efficient between having a single +buffer for all remote clusters or having a per-cluster buffer and its exact +parameters. This could be experimented with during implementation. + +This alignment brings significant performance improvements. The current +v1 implementation re-inserts all backends on every update in statedb. +This is inefficient for service churn. With v2, the plan is to adopt +statedb incremental updates by reusing much of the current loadbalancer +k8s reflector code. Based on the benchmarks in the Motivation section, +clustermesh v1 is 100-3000x slower for updates. We expect v2 to achieve +similar performance to the loadbalancer k8s reflector for service churn +scenarios. The overhead will mainly come from JSON decoding and +decompression. This means handling backend condition changes will remain +efficient even at scale. + +### Compressing the clustermesh service data + +We propose compressing the service data stored in etcd using zstd. For +large services with thousands of backends, we expect compression ratios +around 50x or higher for services beyond that scale. For example, the +current `ClusterService` v1 format for a 5,000 backend service and 2 +ports compresses from 786.89 KiB down to 14.22 KiB with zstd. This +achieves a 55x compression ratio. + +For instance, consider an 11 clusters mesh where each cluster has 1,000 +nodes. With the same `ClusterService` in v1 format with 5,000 backends +and 2 ports, any update from any endpoint in this single service would +result in about 7.5 GiB of uncompressed data propagated across the mesh. +With zstd compression, this reduces to approximately 139 MiB. This is +especially important when service churn happens during service flaps or +even a simple Deployment rollout, which could potentially trigger these +updates hundreds of times. + +The etcd 1.5 MiB per object limit is not a concern for the supported +scale. Kubernetes supports a maximum of 150k endpoints per cluster total +(across all services). Cilium's default configuration limits the number +of backends to 64k (both local cluster and clustermesh combined) via the +`bpf-lb-map-max` setting. These limits are not per service. 
Even if they +can be increased, individual service backends from a particular cluster +should remain well below the etcd object size limit even without +compression. However, compression is still critical for reducing control +plane network bandwidth needed to propagate service changes to all agents +in the mesh. + +All clustermesh v2 service objects will be stored as raw +`zstd(JSON(ClusterServiceV2))` bytes under the `cilium/state/services/v2/` +key prefix. All other objects in etcd (including `ClusterService` v1) will +remain uncompressed. Services are the only objects in etcd that can grow +unbounded with cluster size. This makes them the primary candidate for +compression. + +In our benchmarks, zstd decompression added minimal overhead compared +to JSON decoding (about 5%). Decompression and decoding are done +concurrently per remote cluster in the agent. This further limits the +performance impact. ### Rollout strategy -In order to introduce a "v2" of the ClusterMesh Service data format, this CFP is mainly -proposing to have a global switch rather than a per cluster detection. This is -mainly to keep the transition as simple as possible because a per cluster detection -could introduce more complexity as we would need to handle downgrade/upgrade of remote -clusters at runtime. - -With this approach we would add a new option `clustermesh-service-v2-enabled` which will be -disabled by default in Cilium 1.19. This option will control if the operator and agent will use -the v1 or v2 format. This option would be enabled by default in Cilium 1.20 and deprecated -to be then removed in Cilium 1.21. In Cilium 1.19 and 1.20, we would also unconditionally -export both the v1 and v2 format while KVStoreMesh will also mirror both. - -This gives a good balance between keeping the change simple and ensuring that users -can upgrade without traffic disruptions. We will be able to document that when -`clustermesh-service-v2-enabled` is enabled, all remote clusters connected should already be running -Cilium 1.19 or higher. Also the upgrade to Cilium 1.21 with `clustermesh-service-v2-enabled` -disabled will not be officially supported and users doing that should expect disruptions. - -To clarify and facilitate this transition, we could also make Cilium export its own version -in its `CiliumClusterConfig`. And prevent connecting to remote cluster running Cilium 1.18 or lower -when `clustermesh-service-v2-enabled` is enabled. We would also able to add a warning -when we connect to remote clusters running with more than one minor version difference, -as this is not officially supported or tested in our CI. +To introduce v2 of the clustermesh service data format, this CFP +proposes a global switch rather than per-cluster detection. This keeps +the transition simple. Per-cluster detection would add complexity +because we would need to handle downgrade and upgrade of remote clusters +at runtime. + +With this approach, we would add a new option +`clustermesh-service-v2-enabled`. It will be disabled by default in +Cilium 1.19. This option will control whether the operator and agent +use the v1 or v2 format. The option will be enabled by default in +Cilium 1.20 and deprecated to be removed in Cilium 1.22. In Cilium 1.19 +and 1.20, we would also unconditionally export both the v1 and v2 format +while `KVStoreMesh` mirrors both. This means double etcd storage during +the transition period, but agents will watch only one version (based on +their configuration). Network traffic is not doubled. 
Starting with +Cilium 1.22, only the v2 format will be used and the +`clustermesh-service-v2-enabled` option will be removed. + +This gives a balance between keeping the change simple and ensuring +that users can upgrade without traffic disruptions. We will be able to +document that when `clustermesh-service-v2-enabled` is enabled, all +connected remote clusters should already be running Cilium 1.19 or +higher. The upgrade to Cilium 1.21 with +`clustermesh-service-v2-enabled` disabled will not be officially +supported. Users doing that should expect disruptions. + +To make this transition easier to understand, we could make Cilium +export its own version in its `CiliumClusterConfig`. We could prevent +connecting to remote clusters running Cilium 1.18 or lower when +`clustermesh-service-v2-enabled` is enabled. This would result in a hard +error when attempting to establish a connection to an incompatible +cluster. We could also add a warning when we connect to remote clusters +that run with more than one minor version difference. This is not +officially supported or tested in our CI. ## Impacts / Key Questions ### Impact: Service format breaking change -This change will introduce a "v2" of service data in etcd and, as proposed, -would introduce an incompatibility between clusters running Cilium 1.18 or lower -and Cilium 1.20 or higher by default. - -### Impact: text format readability for debugging +This change will introduce v2 of service data in etcd. As proposed, this +would introduce an incompatibility between clusters running Cilium 1.18 +or lower and Cilium 1.20 or higher by default. -If we encode Service objects with protobuf and then compress them with zstd, it would be -harder to inspect the content of a Service object in etcd for debugging than with -the existing JSON format. +### Impact: Text format readability for debugging -In order to mitigate this we will use the protojson encoding/decoding.The `kvstore get` -command with the json and yaml output will still be supported even for protobuf data. -Also some new hive script command will be added to encode/decode the service object -from/to JSON to facilitate testing +If we compress data with zstd, it will be harder to inspect the content +of a service object in etcd for debugging. We think this is an acceptable +trade-off given the benefits for large services. ### Option 1: Use a slice approach -We could also use a slice approach very similar to what Kubernetes has done with -EndpointSlice vs the original Endpoints. +We could also use a slice approach very similar to what Kubernetes has +done with `EndpointSlice` vs the original `Endpoints`. #### Pros @@ -265,21 +332,25 @@ EndpointSlice vs the original Endpoints. #### Cons -* Would introduce more complexity in the clustermesh codebase - (but similar concerns as what is done in the load balancer codebase) -* Needs significantly more objects in etcd for big Services -* If we keep JSON to encode those objects, the advantages in terms of bytes - flowing through the network are not clearly better with a slice approach in all - situations considering the size efficiency of protobuf compressed with zstd. 
+* Requires more objects encoded in etcd and generates more churn that we + cannot easily coalesce across multiple slice objects +* Compression ratio might be worse than a single-object approach +* As a result of the two previous points, it would most likely be worse in + terms of network usage -### Option 2: Optimize the existing JSON format +### Option 2: Use protobuf encoding #### Pros -* Keep a format "readable" for debugging +* More efficient encoding than JSON in terms of size +* Faster decoding than JSON #### Cons -* Some byte optimizations could be achieved by shortening/uglifying the - different field names, which would make the format less readable and probably defeat - the purpose of keeping a JSON encoding. +* Readability for debugging would be worse +* zstd achieves worse compression ratios with protobuf than with JSON, + which reduces the effective size benefit from switching to protobuf + (although protobuf is still better) +* Work on `encoding/json/v2` in recent Go versions will reduce the JSON + decoding gap in the future. Data decoding is also done concurrently + per cluster, which further limits the impact of decoding speed. From 4db1bd0de58677851f3254a7709bda421be13106 Mon Sep 17 00:00:00 2001 From: Arthur Outhenin-Chalandre Date: Sat, 13 Dec 2025 18:21:34 +0100 Subject: [PATCH 4/8] wip: revise timeline to cilium 1.20 and shorten transition period The shorter transition period will help supporting both codepath for a shorter period. It also won't affect users that rely on default setting (and stay within the supported ground of keeping one minor version skew). Now that we are targeting 1.20 it also leaves us more time to have potentially less rough edges for the first version where this is supported (even if disabled by default). Signed-off-by: Arthur Outhenin-Chalandre --- cilium/CFP-41953-clustermesh-service-v2.md | 33 +++++++++------------- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md index e782688..6b525f1 100644 --- a/cilium/CFP-41953-clustermesh-service-v2.md +++ b/cilium/CFP-41953-clustermesh-service-v2.md @@ -4,7 +4,7 @@ **Begin Design Discussion:** 2025-10-01 -**Cilium Release:** 1.19 +**Cilium Release:** 1.20 **Authors:** Arthur Outhenin-Chalandre @@ -278,29 +278,24 @@ the transition simple. Per-cluster detection would add complexity because we would need to handle downgrade and upgrade of remote clusters at runtime. -With this approach, we would add a new option -`clustermesh-service-v2-enabled`. It will be disabled by default in -Cilium 1.19. This option will control whether the operator and agent -use the v1 or v2 format. The option will be enabled by default in -Cilium 1.20 and deprecated to be removed in Cilium 1.22. In Cilium 1.19 -and 1.20, we would also unconditionally export both the v1 and v2 format -while `KVStoreMesh` mirrors both. This means double etcd storage during -the transition period, but agents will watch only one version (based on -their configuration). Network traffic is not doubled. Starting with -Cilium 1.22, only the v2 format will be used and the -`clustermesh-service-v2-enabled` option will be removed. +With this approach, we would add a new temporary option +`clustermesh-service-v2-enabled`. This option will control whether the operator +and agent use the v1 or v2 format. The option will be disabled by default in +Cilium 1.20 and removed in Cilium 1.21. 
In Cilium 1.20, we would also +unconditionally export both the v1 and v2 format while `KVStoreMesh` mirrors +both. This means "double" etcd storage during the transition period, but agents +will watch only one version (based on their configuration). This means that the +network traffic will not be "doubled". This gives a balance between keeping the change simple and ensuring -that users can upgrade without traffic disruptions. We will be able to -document that when `clustermesh-service-v2-enabled` is enabled, all -connected remote clusters should already be running Cilium 1.19 or -higher. The upgrade to Cilium 1.21 with -`clustermesh-service-v2-enabled` disabled will not be officially -supported. Users doing that should expect disruptions. +that users can upgrade without traffic disruptions. Cilium 1.20 will +essentially serve as a transition release where both formats are supported. +Users will be able to turn on `clustermesh-service-v2-enabled` early in Cilium +1.20, assuming all the clusters in their mesh already run Cilium 1.20 or higher. To make this transition easier to understand, we could make Cilium export its own version in its `CiliumClusterConfig`. We could prevent -connecting to remote clusters running Cilium 1.18 or lower when +connecting to remote clusters running Cilium 1.19 or lower when `clustermesh-service-v2-enabled` is enabled. This would result in a hard error when attempting to establish a connection to an incompatible cluster. We could also add a warning when we connect to remote clusters From 7fc2a651fe16444f59061d7cee39806d1e83c7d4 Mon Sep 17 00:00:00 2001 From: Arthur Outhenin-Chalandre Date: Wed, 28 Jan 2026 14:42:47 +0100 Subject: [PATCH 5/8] wip: revise conditions details after clustermesh backends fix ClusterService should have excluded ready=false / serving=false backends but there was a regression which is now fixed. Also fixing this CFP text accordingly. Signed-off-by: Arthur Outhenin-Chalandre --- cilium/CFP-41953-clustermesh-service-v2.md | 34 +++++++++++----------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md index 6b525f1..4821cbc 100644 --- a/cilium/CFP-41953-clustermesh-service-v2.md +++ b/cilium/CFP-41953-clustermesh-service-v2.md @@ -24,22 +24,16 @@ The current clustermesh global service data is handled with the This struct is encoded in JSON format and stored in etcd. While this format has served well initially, it now faces several limitations that prevent clustermesh from scaling efficiently and supporting new features. These -limitations fall into three main areas: missing critical fields, suboptimal -performance for service updates, and inefficient data encoding. +limitations fall into three main areas: missing backend conditions, suboptimal +performance for service updates, and inefficient data format and encoding. ### Missing backend conditions -The current format lacks important fields. It omits backend conditions, so -the system cannot skip non-ready endpoints or properly phase out -terminating ones. It also does not include the `EndpointSlice` name, which -would greatly simplify the EndpointSliceSync feature (as described in -[CFP-41533](https://github.com/cilium/cilium/issues/41533)). - -Adding backend conditions is critical for correct backend state handling in -the loadbalancer. However, this will increase service churn as backends -transition between ready, not ready, and terminating states. 
The current -clustermesh implementation is not equipped to handle this increased churn -efficiently. +The current format omits all backend conditions. It directly removes +backends that are not ready and not serving. This means that we cannot +properly perform graceful termination as described in the +[KPR documentation](https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#graceful-termination), +most likely resulting in some traffic loss during rolling updates. ### Performance gap with the loadbalancer k8s reflector @@ -118,10 +112,16 @@ embeds Kubernetes structs and uses zstd compression. This will allow more code reuse between the Kubernetes reflector in the loadbalancer packages and the clustermesh package. -The new format will also include endpoint conditions and will route -backends based on their state (ready, not ready, terminating, and so on). -It will include the `EndpointSlice` name to simplify the -EndpointSliceSync logic. +The new format will also include endpoint conditions, enabling the +loadbalancer to retain backends in various states and allowing the +loadbalancer to apply the same logic as for local services instead of only +having backends in an active state. + +This will allow EndpointSliceSync to also include endpoints that were +previously excluded and also include their full conditions allowing more +native integration with non Cilium GW-API implementations. The inclusion +of an `EndpointSlice` name should also significantly simplify the +EndpointSliceSync logic and code base. While these new backend conditions will increase service churn, we keep it manageable by adopting statedb incremental updates through code reuse with From e6f32f728cd84c0add256709835c25bb6aa85b40 Mon Sep 17 00:00:00 2001 From: Arthur Outhenin-Chalandre Date: Mon, 9 Mar 2026 10:23:36 +0100 Subject: [PATCH 6/8] wip: update with latest lb improvements in main Signed-off-by: Arthur Outhenin-Chalandre --- cilium/CFP-41953-clustermesh-service-v2.md | 60 +++++++++------------- 1 file changed, 25 insertions(+), 35 deletions(-) diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md index 4821cbc..6de1484 100644 --- a/cilium/CFP-41953-clustermesh-service-v2.md +++ b/cilium/CFP-41953-clustermesh-service-v2.md @@ -38,25 +38,20 @@ most likely resulting in some traffic loss during rolling updates. ### Performance gap with the loadbalancer k8s reflector There is a large performance gap between the clustermesh and the -standard loadbalancer k8s reflector. The clustermesh pipeline currently -re-inserts every backend and uses a watermark mechanism to orphan older -entries. Even if we change a small number of backends, the cost stays close -to the initial `ClusterService` ingestion. This is especially problematic -for churn scenarios. In contrast, the loadbalancer maintains state outside -statedb. It does not need to re-insert all backends and can update multiple -services in a single transaction. The table below shows how large this gap is: - -| Backends | clustermesh (µs) | loadbalancer k8s (µs) | Ratio (clustermesh/k8s) | -|----------|------------------|-----------------------|-------------------------| -| 1 | 95 | 3 | 32x slower | -| 100 | 2141 | 5 | 428x slower | -| 1,000 | 18149 | 9 | 2,017x slower | -| 5,000 | 107435 | 34 | 3,160x slower | -| 10,000 | 232300 | 75 | 3,097x slower | +standard loadbalancer k8s reflector. Their update and ingestion behavior +is not currently on par. 
The table below shows how large this gap is: + +| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh/k8s) | Ratio (clustermesh w/o JSON/k8s) | +|----------|------------------|------------------------------------|-----------------------|-------------------------|----------------------------------| +| 1 | 44 | 35 | 4 | 11x slower | 9x slower | +| 100 | 629 | 271 | 4 | 157x slower | 68x slower | +| 1 000 | 6 349 | 2 626 | 7 | 907x slower | 375x slower | +| 5 000 | 36 861 | 15 831 | 30 | 1229x slower | 528x slower | +| 10 000 | 78 810 | 44 289 | 70 | 1126x slower | 633x slower | These benchmarks are not strictly equivalent, but they give a good idea -of the performance gap and how clustermesh degrades as the number of -backends grows. +of the performance gap, how much JSON decoding contributes to it, and how +clustermesh degrades as the number of backends grows. ### Inefficient data format and network traffic @@ -123,14 +118,14 @@ native integration with non Cilium GW-API implementations. The inclusion of an `EndpointSlice` name should also significantly simplify the EndpointSliceSync logic and code base. -While these new backend conditions will increase service churn, we keep it -manageable by adopting statedb incremental updates through code reuse with -the loadbalancer k8s reflector. We expect clustermesh to achieve similar -performance for service churn scenarios (100-3000x improvement over v1). -Additionally, etcd operations are already rate limited (by default 20qps), -which naturally coalesces multiple service updates in the workqueue, -ensuring good latency and throughput without requiring explicit export -throttling. +While these new backend conditions will increase service churn, we expect +to keep it manageable by following a very similar code path and adopting +the ingestion techniques already used in the loadbalancer k8s reflector, +with significant improvement over v1 for service churn scenarios (about +10-600x when excluding JSON decoding). Additionally, etcd operations are +already rate limited (by default 20qps), which naturally coalesces multiple +service updates in the workqueue, ensuring good latency and throughput without +requiring explicit export throttling. Compressing the data with zstd will also dramatically reduce the on-wire size and etcd object size. We expect around 50x compression ratios or higher for @@ -218,16 +213,11 @@ We should also find out which is the most efficient between having a single buffer for all remote clusters or having a per-cluster buffer and its exact parameters. This could be experimented with during implementation. -This alignment brings significant performance improvements. The current -v1 implementation re-inserts all backends on every update in statedb. -This is inefficient for service churn. With v2, the plan is to adopt -statedb incremental updates by reusing much of the current loadbalancer -k8s reflector code. Based on the benchmarks in the Motivation section, -clustermesh v1 is 100-3000x slower for updates. We expect v2 to achieve -similar performance to the loadbalancer k8s reflector for service churn -scenarios. The overhead will mainly come from JSON decoding and -decompression. This means handling backend condition changes will remain -efficient even at scale. +We expect performance to be closer to the k8s reflector. 
According to the +benchmarks in the Motivation section, the k8s reflector is currently about +10-600x faster than clustermesh v1 for updates when excluding JSON decoding. +The added churn from condition changes should thus remain manageable even at +scale, with much of the remaining overhead coming from JSON decoding and decompression. ### Compressing the clustermesh service data From de012ce0497c6339fd4d22a2ec0ca27efd1e4fff Mon Sep 17 00:00:00 2001 From: Arthur Outhenin-Chalandre Date: Thu, 19 Mar 2026 15:15:14 +0100 Subject: [PATCH 7/8] wip: remove ratio with json decoding Signed-off-by: Arthur Outhenin-Chalandre --- cilium/CFP-41953-clustermesh-service-v2.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md index 6de1484..ae4a208 100644 --- a/cilium/CFP-41953-clustermesh-service-v2.md +++ b/cilium/CFP-41953-clustermesh-service-v2.md @@ -41,13 +41,13 @@ There is a large performance gap between the clustermesh and the standard loadbalancer k8s reflector. Their update and ingestion behavior is not currently on par. The table below shows how large this gap is: -| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh/k8s) | Ratio (clustermesh w/o JSON/k8s) | -|----------|------------------|------------------------------------|-----------------------|-------------------------|----------------------------------| -| 1 | 44 | 35 | 4 | 11x slower | 9x slower | -| 100 | 629 | 271 | 4 | 157x slower | 68x slower | -| 1 000 | 6 349 | 2 626 | 7 | 907x slower | 375x slower | -| 5 000 | 36 861 | 15 831 | 30 | 1229x slower | 528x slower | -| 10 000 | 78 810 | 44 289 | 70 | 1126x slower | 633x slower | +| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh w/o JSON/k8s) | +|----------|------------------|------------------------------------|-----------------------|----------------------------------| +| 1 | 44 | 35 | 4 | 9x slower | +| 100 | 629 | 271 | 4 | 68x slower | +| 1 000 | 6 349 | 2 626 | 7 | 375x slower | +| 5 000 | 36 861 | 15 831 | 30 | 528x slower | +| 10 000 | 78 810 | 44 289 | 70 | 633x slower | These benchmarks are not strictly equivalent, but they give a good idea of the performance gap, how much JSON decoding contributes to it, and how From 0c339ed923903f82ace6f2974b7970a609a98a1f Mon Sep 17 00:00:00 2001 From: Arthur Outhenin-Chalandre Date: Tue, 5 May 2026 15:54:35 +0200 Subject: [PATCH 8/8] wip: simplify doc and change the approach This commit removes/simplify a couple of things that repeat itself and the doc to focus on the key point (format/encoding). This commit also change the approach to export individual EndpointSlice objects rather than doing a service v2 approach. The encoding point is still an open question (included the data to decide about that). Signed-off-by: Arthur Outhenin-Chalandre --- cilium/CFP-41953-clustermesh-service-v2.md | 322 +++++++-------------- 1 file changed, 111 insertions(+), 211 deletions(-) diff --git a/cilium/CFP-41953-clustermesh-service-v2.md b/cilium/CFP-41953-clustermesh-service-v2.md index ae4a208..f9b94a6 100644 --- a/cilium/CFP-41953-clustermesh-service-v2.md +++ b/cilium/CFP-41953-clustermesh-service-v2.md @@ -14,7 +14,7 @@ This CFP proposes introducing v2 of the clustermesh global service data format stored in etcd. 
It transitions from `cilium/state/services/v1/` -to `cilium/state/services/v2/` and harmonizes backend data insertion +to `cilium/state/endpointslices/v1/` and harmonizes backend data insertion techniques between clustermesh services and Kubernetes services. ## Motivation @@ -55,17 +55,22 @@ clustermesh degrades as the number of backends grows. ### Inefficient data format and network traffic -Over time, the current data format has shown its limits. Backend -information is largely the same across ports, so the total size of the -object tends to grow almost in proportion to the number of ports. In -addition, JSON encoding adds extra overhead for field names and number -formatting compared to a binary representation. +The `ClusterService` struct/format was designed to fit the loadbalancer internals +when introduced in 2018. In 2026, after several iterations and refactors, those +two formats have diverged. For instance, one recent example was that +`ClusterService` encoded similar ports with different names as multiple entries, +while the loadbalancer backends encoded them as one entry with multiple port +names attached. This divergence led to a bug with one port shadowing the other, +which was recently fixed and will be released in 1.19.4 / 1.18.10. + +In terms of wire size, the backend map duplicates port information for each IP, +so the total size of the object tends to grow almost in proportion to the number +of ports and IPs. In addition, JSON encoding adds extra overhead +for field names and number formatting compared to a binary representation. All of this data must be replicated to every node in the mesh. A mesh often has many more nodes than a single cluster, which results in a high volume -of control plane network traffic. When combined with the increased churn -from backend conditions, this inefficiency becomes a significant scaling -concern. +of control plane network traffic. Additionally, etcd imposes a hard limit of 1.5 MiB per object. Without a breaking change to a more efficient format, adding the missing fields to @@ -79,6 +84,14 @@ community's move from `Endpoints` to `EndpointSlice`, but the problem is even stronger in clustermesh because a mesh can contain many more nodes than a single Kubernetes cluster. +For example, the current `ClusterService` format for a 5 000 backends service +and 2 ports is 786.89 KiB. In an 11 clusters mesh where each cluster has 1 000 +nodes, any update from any endpoint in this single service would result in about +7.5 GiB of data propagated globally across the mesh. From the perspective of a +single cluster control plane, this corresponds to about 750 MiB per update. +Assuming 10 updates per second for convenience, this would result in 7.5 GiB/s +of control plane traffic within that single cluster! + ## Goals * Reduce network bandwidth needed for control plane operations on large services @@ -101,241 +114,128 @@ than a single Kubernetes cluster. ### Overview -This CFP proposes transitioning to a different format that directly -embeds Kubernetes structs and uses zstd compression. +This CFP proposes transitioning to directly include `EndpointSlice` objects +individually. This will allow more code reuse between the Kubernetes reflector in the loadbalancer packages and the clustermesh package. -The new format will also include endpoint conditions, enabling the -loadbalancer to retain backends in various states and allowing the -loadbalancer to apply the same logic as for local services instead of only -having backends in an active state. 
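As a sanity check on the bandwidth estimate given in the Motivation section above, the rounded figures can be reproduced roughly as follows. Note that the exact fan-out assumed here (the ~10 000 nodes outside the originating cluster) is an interpretation of the rounded numbers, not something stated explicitly in the proposal:

```
786.89 KiB/update × ~10 000 receiving nodes ≈ 7.5 GiB propagated mesh-wide per update
786.89 KiB/update ×   1 000 nodes           ≈ 750 MiB per cluster per update
 ~750 MiB/update  ×   10 updates/s          ≈ 7.5 GiB/s of control plane traffic in one cluster
```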
- -This will allow EndpointSliceSync to also include endpoints that were -previously excluded and also include their full conditions allowing more -native integration with non Cilium GW-API implementations. The inclusion -of an `EndpointSlice` name should also significantly simplify the -EndpointSliceSync logic and code base. - -While these new backend conditions will increase service churn, we expect -to keep it manageable by following a very similar code path and adopting -the ingestion techniques already used in the loadbalancer k8s reflector, -with significant improvement over v1 for service churn scenarios (about -10-600x when excluding JSON decoding). Additionally, etcd operations are -already rate limited (by default 20qps), which naturally coalesces multiple -service updates in the workqueue, ensuring good latency and throughput without -requiring explicit export throttling. - -Compressing the data with zstd will also dramatically reduce the on-wire size -and etcd object size. We expect around 50x compression ratios or higher for -services with thousands of backends. This significantly reduces control plane -network bandwidth consumption. - -### Unifying ingestion pipelines through shared data structures +The new format will include every field from the EndpointSlice struct, including +endpoint conditions. The clustermesh code will preserve these conditions and let +the existing loadbalancer code apply the same backend state handling as it does +for local services, instead of only receiving backends in an active state. -The clustermesh and loadbalancer k8s reflector currently diverge in both -their data structures and ingestion logic. This divergence has created the -performance gap described in the Motivation section and makes it harder to -maintain feature parity between the two code paths. +This will also significantly simplify the EndpointSliceSync codebase as we will +be able to simply mirror EndpointSlices from remote clusters without the +complex sharding logic. The inclusion of the conditions should also benefit +consumers watching those EndpointSlices, for instance third-party GW-API +implementations. -We now propose unifying these pipelines by adopting shared Kubernetes data -structures. Specifically, we will align clustermesh and the loadbalancer k8s -reflector directly with Kubernetes slim `EndpointSlice` structs instead of -using Cilium specific intermediate representations. +### Using EndpointSlice struct directly The Kubernetes `EndpointSlice` API must remain backward compatible across Kubernetes versions. This aligns well with Cilium clustermesh's upgrade -requirement to support at least two consecutive minor versions. Using the -Kubernetes format directly also allows the loadbalancer code to use the -same struct whether the data comes from a local Kubernetes cluster or from -clustermesh. +requirement to support at least two consecutive minor versions. -Cilium currently uses an internal `Endpoints` resource with the -following struct: +We will embed the actual `EndpointSlice` struct to make sure we can extend it in +the future if needed. The initial version should look like this: ```go -type Endpoints struct { - types.UnserializableObject - slim_metav1.ObjectMeta - - EndpointSliceID - - // Backends is a map containing all backend IPs and ports. The key to - // the map is the backend IP in string form. The value defines the list - // of ports for that backend IP, plus an additional optional node name. 
- // Backends map[cmtypes.AddrCluster]*Backend - Backends map[cmtypes.AddrCluster]*Backend +type ClusterEndpointSlice struct { + Cluster string + EndpointSlice slim_discovery_v1.EndpointSlice } ``` -This struct is a Cilium specific transformation of Kubernetes `EndpointSlice` -data. It was designed to support both the legacy `Endpoints` API and the newer -`EndpointSlice` API. With the introduction of statedb and most consumers watching -statedb instead of relying on this resource, the only place actually using it -is `operator/watchers/service_sync.go` to export service data for clustermesh. +KCM can update all EndpointSlices up to 20 times per second and up to 30 in +burst. Kubernetes scalability tests also test Services, and in those tests, +despite significantly boosting KCM QPS to 100 or 500, EndpointSlice updates +remain around 10 or fewer updates per second. At very large scale (5k nodes), +they can go up to ~45 updates per second. -Since this resource is now primarily used for clustermesh exports, we propose -changing it to directly expose the slim Kubernetes `EndpointSlice` struct. The -new `ClusterServiceV2` format will embed these `EndpointSlice` objects directly, -avoiding the need for an intermediate Cilium-specific representation. This -approach prevents future divergence that could occur if the internal `Endpoints` -struct is later optimized or changed for other purposes. +This QPS is entirely manageable for clustermesh despite the probably higher QPS +from managing individual EndpointSlices and adding endpoint conditions. We +currently only have 20 QPS for clustermesh-apiserver, but we could likely boost +it into the 50-100 range. Note that kvstoremesh has 100 QPS and it would still +limit the overall QPS for events from all remote clusters. If there is more QPS +across all clusters and object types, each will compete for the QPS budget. -The new `ClusterService` v2 struct would thus look like this: - -```go -type ClusterServiceV2 struct { - Cluster string `json:"cluster"` - ClusterID uint32 `json:"clusterID"` - Namespace string `json:"namespace"` - Name string `json:"name"` - // Note that not every field from the EndpointSlice will - // be populated (for instance, fields from TypeMeta and most - // fields from ObjectMeta) - EndpointSlices []*slim_discovery_v1.EndpointSlice `json:"endpointslices"` -} -``` +### Unifying datapath ingestion pipelines -By using the Kubernetes `EndpointSlice` format directly, both the loadbalancer -and clustermesh will consume similar data structures. This enables code reuse -for the agent-side ingestion logic. We can extract most of the code in -`pkg/loadbalancer/reflectors/k8s.go` into a shared package under -`pkg/loadbalancer` that both the loadbalancer and clustermesh can use. This -shared code will handle buffering, coalescing, and statedb updates, ensuring -similar ingestion and optimization for both code paths. This alignment allows -us to introduce an ingestion buffer in the agent for clustermesh services, -similar to the loadbalancer reflector. We might tweak some parameters for -these two different use cases (max buffer size, timeouts, etc.), but the core -logic would be shared. - -We should also find out which is the most efficient between having a single -buffer for all remote clusters or having a per-cluster buffer and its exact -parameters. This could be experimented with during implementation. - -We expect performance to be closer to the k8s reflector. 
According to the -benchmarks in the Motivation section, the k8s reflector is currently about -10-600x faster than clustermesh v1 for updates when excluding JSON decoding. -The added churn from condition changes should thus remain manageable even at -scale, with much of the remaining overhead coming from JSON decoding and decompression. - -### Compressing the clustermesh service data - -We propose compressing the service data stored in etcd using zstd. For -large services with thousands of backends, we expect compression ratios -around 50x or higher for services beyond that scale. For example, the -current `ClusterService` v1 format for a 5,000 backend service and 2 -ports compresses from 786.89 KiB down to 14.22 KiB with zstd. This -achieves a 55x compression ratio. - -For instance, consider an 11 clusters mesh where each cluster has 1,000 -nodes. With the same `ClusterService` in v1 format with 5,000 backends -and 2 ports, any update from any endpoint in this single service would -result in about 7.5 GiB of uncompressed data propagated across the mesh. -With zstd compression, this reduces to approximately 139 MiB. This is -especially important when service churn happens during service flaps or -even a simple Deployment rollout, which could potentially trigger these -updates hundreds of times. - -The etcd 1.5 MiB per object limit is not a concern for the supported -scale. Kubernetes supports a maximum of 150k endpoints per cluster total -(across all services). Cilium's default configuration limits the number -of backends to 64k (both local cluster and clustermesh combined) via the -`bpf-lb-map-max` setting. These limits are not per service. Even if they -can be increased, individual service backends from a particular cluster -should remain well below the etcd object size limit even without -compression. However, compression is still critical for reducing control -plane network bandwidth needed to propagate service changes to all agents -in the mesh. - -All clustermesh v2 service objects will be stored as raw -`zstd(JSON(ClusterServiceV2))` bytes under the `cilium/state/services/v2/` -key prefix. All other objects in etcd (including `ClusterService` v1) will -remain uncompressed. Services are the only objects in etcd that can grow -unbounded with cluster size. This makes them the primary candidate for -compression. - -In our benchmarks, zstd decompression added minimal overhead compared -to JSON decoding (about 5%). Decompression and decoding are done -concurrently per remote cluster in the agent. This further limits the -performance impact. +The clustermesh and loadbalancer k8s reflector currently diverge in both +their data structures and ingestion logic. This divergence has created the +performance gap described in the Motivation section and makes it harder to +maintain performance parity between the two code paths. -### Rollout strategy +As we will be using `EndpointSlice` objects directly, we should be able to unify +both code paths and allow the Kubernetes reflector to receive events from remote +clusters. -To introduce v2 of the clustermesh service data format, this CFP -proposes a global switch rather than per-cluster detection. This keeps -the transition simple. Per-cluster detection would add complexity -because we would need to handle downgrade and upgrade of remote clusters -at runtime. - -With this approach, we would add a new temporary option -`clustermesh-service-v2-enabled`. This option will control whether the operator -and agent use the v1 or v2 format. 
The option will be disabled by default in -Cilium 1.20 and removed in Cilium 1.21. In Cilium 1.20, we would also -unconditionally export both the v1 and v2 format while `KVStoreMesh` mirrors -both. This means "double" etcd storage during the transition period, but agents -will watch only one version (based on their configuration). This means that the -network traffic will not be "doubled". - -This gives a balance between keeping the change simple and ensuring -that users can upgrade without traffic disruptions. Cilium 1.20 will -essentially serve as a transition release where both formats are supported. -Users will be able to turn on `clustermesh-service-v2-enabled` early in Cilium -1.20, assuming all the clusters in their mesh already run Cilium 1.20 or higher. - -To make this transition easier to understand, we could make Cilium -export its own version in its `CiliumClusterConfig`. We could prevent -connecting to remote clusters running Cilium 1.19 or lower when -`clustermesh-service-v2-enabled` is enabled. This would result in a hard -error when attempting to establish a connection to an incompatible -cluster. We could also add a warning when we connect to remote clusters -that run with more than one minor version difference. This is not -officially supported or tested in our CI. +Given the 10-600x ingestion gap shown in the Motivation benchmarks, the added +churn from handling individual EndpointSlices and conditions should remain +manageable while still making this path significantly more performant than the +current clustermesh v1 path. -## Impacts / Key Questions +### EndpointSlice encoding -### Impact: Service format breaking change +TODO: open question see the matrices below that present the size and decoding + speed of a full EndpointSlice (100 endpoints). Our current starting point + in Cluster Mesh would be ~ json while Kubernetes use protobuf. -This change will introduce v2 of service data in etcd. As proposed, this -would introduce an incompatibility between clusters running Cilium 1.18 -or lower and Cilium 1.20 or higher by default. + We can potentially encode EndpointSlices in a less compact format in the + source clustermesh-apiserver data and let kvstoremesh reflect it in a + more optimized format afterwards. -### Impact: Text format readability for debugging +Size: -If we compress data with zstd, it will be harder to inspect the content -of a service object in etcd for debugging. We think this is an acceptable -trade-off given the benefits for large services. +| Format | Raw | zstd | +| -------- | -------- | ----- | +| JSON | 10.59KiB | 435B | +| CBOR | 7.42KiB | 404B | +| Protobuf | 2.88KiB | 283B | -### Option 1: Use a slice approach +Decode speed: -We could also use a slice approach very similar to what Kubernetes has -done with `EndpointSlice` vs the original `Endpoints`. +| Format | Raw | zstd | +| -------- | -------- | ----- | +| JSON | 420µs | 449µs | +| CBOR | 222µs | 252µs | +| Protobuf | 62µs | 74µs | -#### Pros +### Rollout strategy -* Consistent with the Kubernetes approach +As the Cluster Mesh code path is currently maintained by a small team, we are +aiming at the most straightforward rollout strategy that minimizes the code +changes. -#### Cons +We will introduce a new configuration option `clustermesh-service-v2`, with the +following possible options, which will evolve over minor releases: +- `prefer-legacy` (default in 1.20): Export `EndpointSlice` while still exporting + and consuming `ClusterService`. 
+- `prefer-endpointslice` (default in 1.21): Export and consume `EndpointSlice`
+  while still exporting `ClusterService` for backwards compatibility.
+- `only-endpointslice`: Only export and consume `EndpointSlice` and stop
+  exporting `ClusterService`.
 
-* Requires more objects encoded in etcd and generates more churn that we
-  cannot easily coalesce across multiple slice objects
-* Compression ratio might be worse than a single-object approach
-* As a result of the two previous points, it would most likely be worse in
-  terms of network usage
+Cluster Mesh supports one minor version skew. The goal of this rollout is to
+preserve that guarantee during the transition to the new format, while still
+allowing advanced users to opt into the new format earlier. The plan is to
+remove this option and all the `ClusterService`-related code in Cilium 1.22.
 
-### Option 2: Use protobuf encoding
+## Impacts / Key Questions
 
-#### Pros
+### Key Questions: Chunking multiple EndpointSlice vs a single object
 
-* More efficient encoding than JSON in terms of size
-* Faster decoding than JSON
+The question of grouping the EndpointSlices was quite debated. The existing
+`ClusterService` format groups all backends/EndpointSlices from a Service
+in a single object. Ultimately, we favored individual EndpointSlices because:
+- KCM EndpointSlice QPS is manageable in our case
+- It matches the Kubernetes model and is simple to reason about
+- Kubernetes may provide alternative Service APIs in the future, and not relying
+  explicitly on Services may help future-proof the format
+- Handling individual EndpointSlice updates should be very natural and simple
+  for the loadbalancer and EndpointSliceSync code
 
-#### Cons
+### Key Questions: Encoding format
 
-* Readability for debugging would be worse
-* zstd achieves worse compression ratios with protobuf than with JSON,
-  which reduces the effective size benefit from switching to protobuf
-  (although protobuf is still better)
-* Work on `encoding/json/v2` in recent Go versions will reduce the JSON
-  decoding gap in the future. Data decoding is also done concurrently
-  per cluster, which further limits the impact of decoding speed.
+TODO: open question
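To make the encoding question easier to evaluate and reproduce, here is a minimal, illustrative sketch of how one data point (JSON, with and without zstd) could be measured for a 100-endpoint `EndpointSlice`. It uses the upstream `k8s.io/api/discovery/v1` types, `k8s.io/utils/ptr`, and `github.com/klauspost/compress/zstd` purely as an example; the matrices above were presumably produced against Cilium's slim `EndpointSlice` types, and the CBOR and protobuf candidates would be measured analogously with their respective encoders.

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/klauspost/compress/zstd"
	corev1 "k8s.io/api/core/v1"
	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

func main() {
	// Synthetic EndpointSlice with 100 endpoints, loosely matching the
	// object measured in the matrices above.
	eps := discoveryv1.EndpointSlice{
		ObjectMeta:  metav1.ObjectMeta{Name: "demo-4fx7z", Namespace: "default"},
		AddressType: discoveryv1.AddressTypeIPv4,
		Ports: []discoveryv1.EndpointPort{{
			Name:     ptr.To("http"),
			Port:     ptr.To[int32](8080),
			Protocol: ptr.To(corev1.ProtocolTCP),
		}},
	}
	for i := 0; i < 100; i++ {
		eps.Endpoints = append(eps.Endpoints, discoveryv1.Endpoint{
			Addresses:  []string{fmt.Sprintf("10.0.%d.%d", i/255, i%255)},
			NodeName:   ptr.To(fmt.Sprintf("node-%03d", i)),
			Conditions: discoveryv1.EndpointConditions{Ready: ptr.To(true), Serving: ptr.To(true)},
		})
	}

	// Measure the raw JSON size.
	raw, err := json.Marshal(&eps)
	if err != nil {
		panic(err)
	}

	// Measure the same payload after zstd compression (default level).
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		panic(err)
	}
	compressed := enc.EncodeAll(raw, nil)

	fmt.Printf("JSON: %d B, JSON+zstd: %d B\n", len(raw), len(compressed))
}
```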