HLD: Reusable service-level component statistics. #2312
Merged
yutongzhang-microsoft merged 14 commits on Jun 1, 2026
Adds a new HLD describing swss::ComponentStats: a reusable library in sonic-swss-common that produces service-level (control-plane) counters, mirrors them to COUNTERS_DB, and exports them via OTLP to a local OpenTelemetry Collector. The existing SwssStats class in sonic-swss is refactored into a thin facade over this library.

Related PRs:
- sonic-swss-common#1180
- sonic-swss#4516
- sonic-buildimage#26924

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
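The "thin facade over a reusable library" split described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `ComponentStatsSketch`, `SwssStatsSketch`, and the `record*` method names are hypothetical and are not the real `swss::ComponentStats` or `SwssStats` API. A mutex-guarded map stands in for the library's internals, and only the metric names SET, DEL, COMPLETE, and ERROR come from this conversation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the shared library. This is NOT the real
// swss::ComponentStats API; a mutex-guarded map keeps the sketch short.
class ComponentStatsSketch {
public:
    explicit ComponentStatsSketch(std::string component)
        : m_component(std::move(component)) {}

    // Add n to the counter identified by (entity, metric).
    void increment(const std::string& entity, const std::string& metric,
                   uint64_t n = 1) {
        assert(!entity.empty() && !metric.empty());
        std::lock_guard<std::mutex> lock(m_mutex);
        m_counters[std::make_pair(entity, metric)] += n;  // new keys start at 0
    }

    // Read a counter; unknown (entity, metric) pairs read as 0.
    uint64_t get(const std::string& entity, const std::string& metric) const {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_counters.find(std::make_pair(entity, metric));
        return it == m_counters.end() ? 0 : it->second;
    }

    const std::string& component() const { return m_component; }

private:
    std::string m_component;
    mutable std::mutex m_mutex;
    std::map<std::pair<std::string, std::string>, uint64_t> m_counters;
};

// Hypothetical thin facade: it fixes the component name and the metric
// vocabulary so call sites never spell metric strings themselves.
class SwssStatsSketch {
public:
    void recordSet(const std::string& table)      { m_stats.increment(table, "SET"); }
    void recordDel(const std::string& table)      { m_stats.increment(table, "DEL"); }
    void recordComplete(const std::string& table) { m_stats.increment(table, "COMPLETE"); }
    void recordError(const std::string& table)    { m_stats.increment(table, "ERROR"); }

    uint64_t get(const std::string& table, const std::string& metric) const {
        return m_stats.get(table, metric);
    }

private:
    ComponentStatsSketch m_stats{"SWSS"};
};
```

In this shape the facade holds no counter logic of its own, which is why a second component would only need to supply its own vocabulary.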
Collaborator
/azp run
No pipelines are associated with this pull request.
… label

Address review feedback:
- Replace 'Initial draft' with 'Initial revision' in the revision table.
- Treat the SwssStats facade as freshly introduced by this work; remove all references to sonic-swss#4434 in Scope, Overview, Requirements, the facade section, Warmboot, Memory, and Testing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
…mplify section 9

- Reword non-swss vocabulary out-of-scope item as future work.
- Remove the sonic-buildimage submodule row from the repositories table; not needed.
- Section 9: collapse Manifest / CLI / CONFIG_DB subsections into a single 'Not applicable' note.
- Update Phase 1 wording and system-test bullet to reference two companion PRs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Split the previous single component-stats-hld.md into two documents so that responsibilities map cleanly to the teams involved:

* component-stats-framework-hld.md (SONiC team): the swss::ComponentStats library, the SwssStats facade pattern, hot path, threading, memory ordering, warmboot, memory and testing for the producer. The DB sink is the only sink documented; OTLP is moved to future work.
* component-stats-reporting-hld.md (SONiC team, contract with NDM): the COUNTERS_DB schema (key layout, hash fields, idle suppression) and SWSS-specific vocabulary, plus conventions for future components. The reporting transport (telegraf -> mdm -> Geneva) is owned by the NDM HLD and referenced here, not duplicated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
r12f
reviewed
May 21, 2026
Reviewed lines: To add equivalent metrics to e.g. `gnmi`, write a facade analogous to §7.7: [`cpp` code block]
Contributor
The code is not too important. The more important thing is the metrics design (in a table):
| Metric Name | Label List | Description |
|---|---|---|
| GNMI_STATS_XXX | gnmi.action, ... | XXX |
Contributor
Author
Thanks for the feedback - addressed in a02960a:
- Framework HLD: dropped the inline C++ snippets in sections 7.4 (hot path), 7.7 (`SwssStats`), and 7.8 (`GnmiStats` example). 7.7 now shows the call-to-metric mapping as a small table; 7.8 includes an illustrative future GNMI metrics table in the Metric Name / Label List / Description shape you asked for.
- Reporting HLD: section 7.2 is now a proper metric design table for the four SWSS metrics (`SWSS_STATS_SET/DEL/COMPLETE/ERROR` with label `swss.table`). The Redis-side mapping is kept as a footnote so the on-box debug story still works.
- Conventions section (7.3) is tightened so future components (gnmi, bmp, ...) are told to follow the same table shape.

Revisions bumped: Framework 0.3, Reporting 0.2.
Reviewer feedback (r12f) on the framework HLD was that the inline C++ snippets are not the right focus for an HLD and that what matters is the metric design, laid out as a Metric Name / Label List / Description table.

Framework HLD changes:
- Replace the hot-path code listing (was section 7.4) with a short prose summary of the two atomic RMWs.
- Replace the SwssStats code listing (was section 7.7) with a small call-to-metric mapping table and a forward reference to the Reporting HLD for the full metric design.
- Replace the GnmiStats illustrative code (was section 7.8) with a recipe and an illustrative future metrics table in the same shape the reviewer requested.

Reporting HLD changes:
- Reframe section 7.2 from a Redis key/field table into a proper Metric Name / Label List / Description table for the four SWSS metrics (SWSS_STATS_SET / DEL / COMPLETE / ERROR with label swss.table). Keep the Redis-side mapping as a footnote.
- Tighten section 7.3 so future components are told to follow the exact same Metric Name / Label List / Description shape.

Bump revisions: Framework 0.2 to 0.3, Reporting 0.1 to 0.2.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
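For readers who want a concrete picture of a hot path built from two atomic read-modify-write operations, here is a minimal sketch. The commit message does not say which two operations the library performs, so this assumes one RMW bumps the counter and the other sets a dirty flag that a writer later clears. `CounterCellSketch` and `drainIfDirty` are hypothetical names, and the block needs C++17 for `std::optional`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical single-counter cell; the real library's layout may differ.
struct CounterCellSketch {
    std::atomic<uint64_t> value{0};
    std::atomic<bool> dirty{false};

    // Hot path: two atomic read-modify-write operations and no locks.
    void increment(uint64_t n = 1) {
        assert(n > 0);  // a zero increment would only set the dirty flag
        value.fetch_add(n, std::memory_order_relaxed);    // RMW 1: bump the counter
        dirty.exchange(true, std::memory_order_release);  // RMW 2: mark for the writer
    }

    // Writer side: returns the current value only if the cell changed since
    // the previous drain, so idle cells cost no write.
    std::optional<uint64_t> drainIfDirty() {
        if (!dirty.exchange(false, std::memory_order_acquire)) {
            return std::nullopt;  // nothing changed
        }
        // The acquire above pairs with the release in increment(), so the
        // increment that set the flag is visible to this load.
        return value.load(std::memory_order_relaxed);
    }
};
```

An increment that lands after the writer clears the flag sets it again, so the next drain picks the update up; the worst case under this scheme is one redundant write of an unchanged value.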
…c naming

Self-review caught two stale references that still described the old `sonic.<lower(component)>.<field>` metric naming with attribute `entity=<E>`, while section 7.2 had already been reframed around `<COMPONENT>_STATS_<VERB>` with a component-specific label (`swss.table` for SWSS). The inconsistency would have left two different naming conventions in the same HLD.

- Section 7.5 (Telegraf interface) now points at the section 7.2 / 7.3 schema rather than restating a different naming convention.
- Section 13.2 system test step now asserts the four metrics named in section 7.2 with the `swss.table` label.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Self-review caught that the word "Metric" was used with two meanings across the two HLDs:

- Framework section 3 defined Metric as the hash field name on the producer side (e.g. SET, DEL).
- Reporting section 7.2 uses "Metric Name" as the column header for the downstream wire name (e.g. SWSS_STATS_SET).

Update the Framework section 3 Metric entry to spell out both views and point at Reporting section 7.2 for the wire schema. Also add a Label entry so the new "Label List" column in Reporting section 7.2 has a definition to anchor to.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
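The two views of a metric can be related by a small mapping, sketched here under stated assumptions. The `<COMPONENT>_STATS_<FIELD>` wire-name pattern and the `swss.table` label come from this conversation. The rule that a label key is the lowercased component plus a suffix is inferred from the `swss.table` and `gnmi.action` examples only, and `MetricSampleSketch`, `wireName`, and `toSample` are hypothetical names, not part of either HLD.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical downstream sample: one wire name, one label, one value.
struct MetricSampleSketch {
    std::string name;        // e.g. SWSS_STATS_SET
    std::string labelKey;    // e.g. swss.table
    std::string labelValue;  // the entity, e.g. a table name
    uint64_t value;
};

inline std::string toUpper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

inline std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Producer-side hash field (e.g. "SET") -> downstream wire name.
inline std::string wireName(const std::string& component,
                            const std::string& field) {
    assert(!component.empty() && !field.empty());
    return toUpper(component) + "_STATS_" + toUpper(field);
}

// Assumed label-key rule: "<lower(component)>.<suffix>".
inline MetricSampleSketch toSample(const std::string& component,
                                   const std::string& labelSuffix,
                                   const std::string& entity,
                                   const std::string& field,
                                   uint64_t value) {
    return MetricSampleSketch{wireName(component, field),
                              toLower(component) + "." + labelSuffix,
                              entity, value};
}
```

For example, `toSample("SWSS", "table", "PORT_TABLE", "DEL", 7)` yields the name `SWSS_STATS_DEL` with label `swss.table` set to `PORT_TABLE`; the table name is only an example value.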
…Framework
Reporting HLD §3 was still using the pre-rev-0.4 Metric definition
("A named uint64 counter or gauge") and was missing the Label term
entirely. Framework HLD §3 was updated in rev 0.4 to cover the
dual meaning (producer-side hash field + downstream wire name
COMPONENT_STATS_<metric>) and to add the Label entry.
This commit brings Reporting HLD §3 into sync so readers who start
at the Reporting document find Metric and Label defined consistently
with the Framework document. Version bumped to 0.4.
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
§7.7 describes SwssStats as ~130 LoC; §7.8 step 4 said ~30 LoC with no explanation. Added a clarifying note: a minimal facade stays near ~30 LoC; SwssStats is larger because it integrates gSwssStatsRecord and singleton plumbing into orch.cpp. Rev bumped to 0.5. Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
§7.3 says each new component documents its vocabulary in its own component HLD. §14's third bullet contradicted that by saying to add the table to §7.3 of this HLD. Fixed to say: add the vocab table to the component's own HLD (following §7.3 conventions), with an optional cross-reference added here. Rev bumped to 0.5. Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
§7.8 now explains that a minimal facade is ~30 LoC and SwssStats is ~130 LoC. §4 and §6 still said "~100 LoC", creating a third inconsistent number. Replaced both with a pointer to §7.8. Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A prior edit replaced the rev 0.4 row instead of appending 0.5 after it,
causing the revision table to jump from 0.3 to 0.5. Restored the 0.4
entry ("Sync §3 Metric/Label definitions") so the history is complete.
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
r12f
previously approved these changes
May 28, 2026
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
957bb42 to 989cf40
Janetxxx
approved these changes
Jun 1, 2026
a114j0y pushed a commit to a114j0y/SONiC that referenced this pull request on Jun 5, 2026
[component-stats] Add HLD for SONiC component statistics

## What I'm doing

Adding a new High-Level Design document at `doc/component-stats/component-stats-hld.md` that specifies a reusable mechanism for exposing **service-level (control-plane software) counters** from SONiC containers.

The HLD introduces:

1. A new shared library `swss::ComponentStats` in `sonic-swss-common`.
2. A SWSS-specific facade `SwssStats` in `sonic-swss` built on top of that library, as the first consumer.

Counters are published to two sinks driven from a single in-process atomic snapshot:

- **`COUNTERS_DB`**: for parity with the existing Flex-Counter pipeline and for on-box diagnostic tooling (`redis-cli`, `show ... stats`).
- **Local OpenTelemetry (OTLP) Collector sidecar**: so the same counters can be forwarded to off-box telemetry systems (e.g. Geneva mdm) that consume OTLP.

## Why

SONiC already has dataplane counters (Flex-Counter / SAI), but no uniform mechanism for **service-level** counters such as orchagent task throughput, gNMI request rate, or BMP error counts. A naive per-container implementation would duplicate atomic counter management, dirty tracking, the writer thread, the Redis schema, and an OTLP exporter in every container, so concurrency review, bug fixes, and on-the-wire schemas would all drift. This HLD specifies one reusable producer that any container can adopt with a ~100-line facade.

## Companion PRs

- `sonic-net/sonic-swss-common` [sonic-net#1180](sonic-net/sonic-swss-common#1180): `swss::ComponentStats` library + unit tests.
- `sonic-net/sonic-swss` [#4516](sonic-net/sonic-swss#4516): `SwssStats` thin facade over `ComponentStats` in `orchagent/`.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>