HLD: Reusable service-level component statistics. #2312

Merged
yutongzhang-microsoft merged 14 commits into sonic-net:master from yutongzhang-microsoft:sonic-component-stats-hld
Jun 1, 2026

@yutongzhang-microsoft
Contributor

@yutongzhang-microsoft yutongzhang-microsoft commented Apr 28, 2026

What I'm doing

Adding a new High-Level Design document at doc/component-stats/component-stats-hld.md that specifies a reusable mechanism for exposing service-level (control-plane software) counters from SONiC containers.

The HLD introduces:

  1. A new shared library swss::ComponentStats in sonic-swss-common.
  2. A SWSS-specific facade SwssStats in sonic-swss built on top of that library, as the first consumer.

Counters are published to two sinks driven from a single in-process atomic snapshot:

  • COUNTERS_DB — for parity with the existing Flex-Counter pipeline and for on-box diagnostic tooling (redis-cli, show ... stats).
  • Local OpenTelemetry (OTLP) Collector sidecar — so the same counters can be forwarded to off-box telemetry systems (e.g. Geneva mdm) that consume OTLP.

Why

SONiC already has dataplane counters (Flex-Counter / SAI), but no uniform mechanism for service-level counters such as orchagent task throughput, gNMI request rate, or BMP error counts. A naive per-container implementation would duplicate atomic counter management, dirty tracking, the writer thread, the Redis schema, and an OTLP exporter in every container, so concurrency review, bug fixes, and on-the-wire schemas would all drift apart over time. This HLD specifies one reusable producer that any container can adopt with a ~100-line facade.
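
As a rough illustration of the machinery this paragraph says every container would otherwise duplicate, here is a minimal C++ sketch of atomic counters with dirty tracking. It is not the `swss::ComponentStats` API, which this description does not show: `CounterBlock`, `Field`, `increment`, and `snapshotIfDirty` are hypothetical names, and the memory orderings are one reasonable choice rather than the library's documented ones.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical field set for illustration only.
enum Field : std::size_t { kSet = 0, kDel, kComplete, kError, kFieldCount };

// Hypothetical counter block: lock-free increments plus a dirty flag that
// lets a writer thread skip work when nothing has changed.
class CounterBlock {
public:
    // Hot path: one atomic add on the counter, one atomic store on the flag.
    void increment(Field f, std::uint64_t n = 1) {
        assert(f < kFieldCount);
        values_[f].fetch_add(n, std::memory_order_relaxed);
        dirty_.store(true, std::memory_order_release);
    }

    // Writer side: clears the dirty flag and copies the counters into `out`.
    // Returns false (leaving `out` untouched) if nothing changed since the
    // previous call.
    bool snapshotIfDirty(std::array<std::uint64_t, kFieldCount>& out) {
        if (!dirty_.exchange(false, std::memory_order_acquire)) {
            return false;
        }
        for (std::size_t i = 0; i < kFieldCount; ++i) {
            out[i] = values_[i].load(std::memory_order_relaxed);
        }
        return true;
    }

private:
    std::array<std::atomic<std::uint64_t>, kFieldCount> values_{};
    std::atomic<bool> dirty_{false};
};
```

A writer thread could call `snapshotIfDirty` on a timer and push the result to its sink only when the call returns true, which is one way to avoid writes while the counters are idle. The fields are read one at a time, so the copy is consistent per counter rather than across all counters.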

Companion PRs

  • sonic-net/sonic-swss-common #1180: swss::ComponentStats library + unit tests.
  • sonic-net/sonic-swss #4516: SwssStats thin facade over ComponentStats in orchagent/.

Adds a new HLD describing swss::ComponentStats: a reusable library in sonic-swss-common that produces service-level (control-plane) counters, mirrors them to COUNTERS_DB, and exports them via OTLP to a local OpenTelemetry Collector. The existing SwssStats class in sonic-swss is refactored into a thin facade over this library.

Related PRs:

- sonic-swss-common#1180

- sonic-swss#4516

- sonic-buildimage#26924

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

… label

Address review feedback:
- Replace 'Initial draft' with 'Initial revision' in the revision table.
- Treat the SwssStats facade as freshly introduced by this work; remove all
  references to sonic-swss#4434 in Scope, Overview, Requirements, the
  facade section, Warmboot, Memory, and Testing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 28, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

yutongzhang-microsoft and others added 2 commits April 28, 2026 10:47
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
…mplify section 9

- Reword non-swss vocabulary out-of-scope item as future work.
- Remove the sonic-buildimage submodule row from the repositories table; not needed.
- Section 9: collapse Manifest / CLI / CONFIG_DB subsections into a single
  'Not applicable' note.
- Update Phase 1 wording and system-test bullet to reference two companion PRs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
yutongzhang-microsoft changed the title from "Add HLD for SONiC component statistics" to "HLD: Reusable service-level component statistics" on Apr 28, 2026
yutongzhang-microsoft changed the title from "HLD: Reusable service-level component statistics" to "[doc] HLD: Reusable service-level component statistics (swss::ComponentStats + SwssStats facade)" on Apr 28, 2026
yutongzhang-microsoft changed the title from "[doc] HLD: Reusable service-level component statistics (swss::ComponentStats + SwssStats facade)" to "HLD: Reusable service-level component statistics." on Apr 28, 2026
Split the previous single component-stats-hld.md into two documents so
that responsibilities map cleanly to the teams involved:

* component-stats-framework-hld.md (SONiC team): the swss::ComponentStats
  library, the SwssStats facade pattern, hot path, threading, memory
  ordering, warmboot, memory and testing for the producer. The DB sink
  is the only sink documented; OTLP is moved to future work.

* component-stats-reporting-hld.md (SONiC team, contract with NDM): the
  COUNTERS_DB schema (key layout, hash fields, idle suppression) and
  SWSS-specific vocabulary, plus conventions for future components. The
  reporting transport (telegraf -> mdm -> Geneva) is owned by the NDM
  HLD and referenced here, not duplicated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.


To add equivalent metrics to e.g. `gnmi`, write a facade analogous to §7.7:

Contributor

the code is not too important. more important thing is the metrics design (in table):

| Metric Name | Label List | Description |
|---|---|---|
| GNMI_STATS_XXX | gnmi.action, ... | XXX |

Contributor Author

Thanks for the feedback - addressed in a02960a:

  • Framework HLD: dropped the inline C++ snippets in sections 7.4 (hot path), 7.7 (SwssStats), and 7.8 (GnmiStats example). 7.7 now shows the call-to-metric mapping as a small table; 7.8 includes an illustrative future GNMI metrics table in the Metric Name / Label List / Description shape you asked for.
  • Reporting HLD: section 7.2 is now a proper metric design table for the four SWSS metrics (SWSS_STATS_SET / DEL / COMPLETE / ERROR with label swss.table). The Redis-side mapping is kept as a footnote so the on-box debug story still works.
  • Conventions section (7.3) is tightened so future components (gnmi, bmp, ...) are told to follow the same table shape.

Revisions bumped: Framework 0.3, Reporting 0.2.

Reviewer feedback (r12f) on the framework HLD was that the inline C++
snippets are not the right focus for an HLD and that what matters is
the metric design - laid out as a Metric Name / Label List /
Description table.

Framework HLD changes:
- Replace the hot-path code listing (was section 7.4) with a short
  prose summary of the two atomic RMWs.
- Replace the SwssStats code listing (was section 7.7) with a small
  call-to-metric mapping table and a forward reference to the
  Reporting HLD for the full metric design.
- Replace the GnmiStats illustrative code (was section 7.8) with a
  recipe and an illustrative future metrics table in the same shape
  the reviewer requested.

Reporting HLD changes:
- Reframe section 7.2 from a Redis key/field table into a proper
  Metric Name / Label List / Description table for the four SWSS
  metrics (SWSS_STATS_SET / DEL / COMPLETE / ERROR with label
  swss.table). Keep the Redis-side mapping as a footnote.
- Tighten section 7.3 so future components are told to follow the
  exact same Metric Name / Label List / Description shape.

Bump revisions: Framework 0.2 to 0.3, Reporting 0.1 to 0.2.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

yutongzhang-microsoft and others added 2 commits May 27, 2026 06:32
…c naming

Self-review caught two stale references that still described the old
`sonic.<lower(component)>.<field>` metric naming with attribute
`entity=<E>`, while section 7.2 had already been reframed around
`<COMPONENT>_STATS_<VERB>` with a component-specific label
(`swss.table` for SWSS). The inconsistency would have left two
different naming conventions in the same HLD.

- Section 7.5 (Telegraf interface) now points at the section 7.2 / 7.3
  schema rather than restating a different naming convention.
- Section 13.2 system test step now asserts the four metrics named in
  section 7.2 with the `swss.table` label.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
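
To make the `<COMPONENT>_STATS_<VERB>` convention from the commit message above concrete, the sketch below builds a wire name from a component and a producer-side field. The helper `wireMetricName` is hypothetical, and upper-casing the component is an assumption inferred from the `SWSS_STATS_SET` example; the HLD itself remains the authority on the schema.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper, not part of any SONiC library: joins a component name
// and a producer-side field (e.g. "SET") into a <COMPONENT>_STATS_<VERB> name.
std::string wireMetricName(std::string component, const std::string& verb) {
    assert(!component.empty() && !verb.empty());
    // Assumption: the component part is upper-cased, as in SWSS_STATS_SET.
    std::transform(component.begin(), component.end(), component.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::toupper(c));
                   });
    return component + "_STATS_" + verb;
}
```

Under these assumptions, `wireMetricName("swss", "SET")` yields `SWSS_STATS_SET`. The label key (`swss.table` for SWSS) is component-specific and is not derived by this helper.
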
Self-review caught that the word "Metric" was used with two meanings
across the two HLDs:

- Framework section 3 defined Metric as the hash field name on the
  producer side (e.g. SET, DEL).
- Reporting section 7.2 uses "Metric Name" as the column header for
  the downstream wire name (e.g. SWSS_STATS_SET).

Update the Framework section 3 Metric entry to spell out both views and
point at Reporting section 7.2 for the wire schema. Also add a Label
entry so the new "Label List" column in Reporting section 7.2 has a
definition to anchor to.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

yutongzhang-microsoft and others added 3 commits May 27, 2026 14:46
…Framework

Reporting HLD §3 was still using the pre-rev-0.4 Metric definition
("A named uint64 counter or gauge") and was missing the Label term
entirely.  Framework HLD §3 was updated in rev 0.4 to cover the
dual meaning (producer-side hash field + downstream wire name
COMPONENT_STATS_<metric>) and to add the Label entry.

This commit brings Reporting HLD §3 into sync so readers who start
at the Reporting document find Metric and Label defined consistently
with the Framework document.  Version bumped to 0.4.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
§7.7 describes SwssStats as ~130 LoC; §7.8 step 4 said ~30 LoC with
no explanation.  Added a clarifying note: a minimal facade stays near
~30 LoC; SwssStats is larger because it integrates gSwssStatsRecord
and singleton plumbing into orch.cpp.  Rev bumped to 0.5.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
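
The note above about a minimal facade staying near ~30 LoC can be pictured with the sketch below. Everything in it is hypothetical: `StubComponentStats` stands in for the shared library (whose real interface is not shown in this thread, and which would not use a mutex on the hot path), and `SwssStatsSketch` with its `record*` methods is an illustrative shape, not the actual `SwssStats` class. Only the field names SET / DEL / COMPLETE / ERROR and the per-table label come from the discussion.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Stand-in for the shared library. Deliberately simple (mutex + map); it only
// exists so the facade below has something to delegate to.
class StubComponentStats {
public:
    explicit StubComponentStats(std::string component)
        : component_(std::move(component)) {
        assert(!component_.empty());
    }

    void increment(const std::string& entity, const std::string& field,
                   std::uint64_t n = 1) {
        std::lock_guard<std::mutex> lock(mu_);
        counters_[std::make_pair(entity, field)] += n;
    }

    std::uint64_t get(const std::string& entity,
                      const std::string& field) const {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = counters_.find(std::make_pair(entity, field));
        return it == counters_.end() ? 0 : it->second;
    }

    const std::string& component() const { return component_; }

private:
    std::string component_;
    mutable std::mutex mu_;
    std::map<std::pair<std::string, std::string>, std::uint64_t> counters_;
};

// Hypothetical facade: pins the component name and the field vocabulary so
// call sites only pass the table they are working on.
class SwssStatsSketch {
public:
    static SwssStatsSketch& instance() {
        static SwssStatsSketch stats;  // singleton, created on first use
        return stats;
    }

    void recordSet(const std::string& table)      { core_.increment(table, "SET"); }
    void recordDel(const std::string& table)      { core_.increment(table, "DEL"); }
    void recordComplete(const std::string& table) { core_.increment(table, "COMPLETE"); }
    void recordError(const std::string& table)    { core_.increment(table, "ERROR"); }

    std::uint64_t get(const std::string& table, const std::string& field) const {
        return core_.get(table, field);
    }

private:
    SwssStatsSketch() : core_("SWSS") {}
    StubComponentStats core_;
};
```

The facade itself is the second class only; the stub is scaffolding to keep the example self-contained.
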
§7.3 says each new component documents its vocabulary in its own
component HLD.  §14's third bullet contradicted that by saying to add
the table to §7.3 of this HLD.  Fixed to say: add the vocab table to
the component's own HLD (following §7.3 conventions), with an optional
cross-reference added here.  Rev bumped to 0.5.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

yutongzhang-microsoft and others added 2 commits May 27, 2026 14:54
§7.8 now explains that a minimal facade is ~30 LoC and SwssStats is
~130 LoC.  §4 and §6 still said "~100 LoC", creating a third
inconsistent number.  Replaced both with a pointer to §7.8.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A prior edit replaced the rev 0.4 row instead of appending 0.5 after it,
causing the revision table to jump from 0.3 to 0.5.  Restored the 0.4
entry ("Sync §3 Metric/Label definitions") so the history is complete.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

r12f previously approved these changes May 28, 2026
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
@yutongzhang-microsoft yutongzhang-microsoft force-pushed the sonic-component-stats-hld branch from 957bb42 to 989cf40 Compare June 1, 2026 02:50
@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@yutongzhang-microsoft yutongzhang-microsoft merged commit 77ca679 into sonic-net:master Jun 1, 2026
2 checks passed
@yutongzhang-microsoft yutongzhang-microsoft deleted the sonic-component-stats-hld branch June 1, 2026 03:03
a114j0y pushed a commit to a114j0y/SONiC that referenced this pull request Jun 5, 2026
[component-stats] Add HLD for SONiC component statistics

Signed-off-by: Yutong Zhang <yutongzhang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
4 participants