From 675f16067cfbb14841a080762ae0be9ef815ee47 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Tue, 28 Apr 2026 10:36:17 +0800 Subject: [PATCH 01/14] [component-stats] Add HLD for SONiC component statistics Adds a new HLD describing swss::ComponentStats: a reusable library in sonic-swss-common that produces service-level (control-plane) counters, mirrors them to COUNTERS_DB, and exports them via OTLP to a local OpenTelemetry Collector. The existing SwssStats class in sonic-swss is refactored into a thin facade over this library. Related PRs: - sonic-swss-common#1180 - sonic-swss#4516 - sonic-buildimage#26924 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yutong Zhang --- doc/component-stats/component-stats-hld.md | 441 +++++++++++++++++++++ 1 file changed, 441 insertions(+) create mode 100644 doc/component-stats/component-stats-hld.md diff --git a/doc/component-stats/component-stats-hld.md b/doc/component-stats/component-stats-hld.md new file mode 100644 index 00000000000..0bd4f03a935 --- /dev/null +++ b/doc/component-stats/component-stats-hld.md @@ -0,0 +1,441 @@ +# SONiC Component Statistics HLD + +## Table of Content + +- [Revision](#1-revision) +- [Scope](#2-scope) +- [Definitions/Abbreviations](#3-definitionsabbreviations) +- [Overview](#4-overview) +- [Requirements](#5-requirements) +- [Architecture Design](#6-architecture-design) +- [High-Level Design](#7-high-level-design) +- [SAI API](#8-sai-api) +- [Configuration and management](#9-configuration-and-management) +- [Warmboot and Fastboot Design Impact](#10-warmboot-and-fastboot-design-impact) +- [Memory Consumption](#11-memory-consumption) +- [Restrictions/Limitations](#12-restrictionslimitations) +- [Testing Requirements/Design](#13-testing-requirementsdesign) +- [Open/Action items](#14-openaction-items) + +### 1. 
Revision + +| Rev | Date | Author | Change Description | +|-----|------------|---------------|--------------------------| +| 0.1 | 2026-04-28 | Yutong Zhang | Initial draft | + +### 2. Scope + +This HLD specifies a reusable mechanism for exposing **service-level (control-plane software) counters** from SONiC containers. It introduces a new shared library `swss::ComponentStats` in `sonic-swss-common` and refactors the existing `SwssStats` class in `sonic-swss` (introduced by [sonic-swss#4434](https://github.com/sonic-net/sonic-swss/pull/4434)) into a thin façade over the new library. The library publishes counters to: + +1. `COUNTERS_DB`, for parity with the existing Flex-Counter pipeline and for on-box diagnostic tooling (`redis-cli`, `show ... stats`). +2. A local OpenTelemetry (OTLP) Collector sidecar, so the same counters can be forwarded to off-box telemetry systems (e.g. Geneva mdm) that consume OTLP. + +Configuration of the OTel Collector itself, off-box telemetry endpoints, dashboards, and alerts are explicitly **out of scope** for this HLD. + +### 3. Definitions/Abbreviations + +| Term | Definition | +|-----------------|---------------------------------------------------------------------------------------------| +| Component | A SONiC container that produces service-level counters (e.g. `swss`, `gnmi`, `bmp`). | +| Entity | A logical grouping of metrics inside a component (e.g. an orchagent table, a gNMI path). | +| Metric | A named uint64 counter or gauge inside an entity (e.g. `SET`, `DEL`, `COMPLETE`, `ERROR`). | +| ComponentStats | The new shared library in `sonic-swss-common` providing the producer mechanism. | +| SwssStats | A SWSS-specific façade over `ComponentStats` (lives in `sonic-swss`). | +| DB sink | The output path that mirrors counters into `COUNTERS_DB`. | +| OTLP sink | The output path that exports counters via OpenTelemetry Protocol to a local OTel Collector. 
| +| OTel Collector | A locally-running OpenTelemetry Collector sidecar; not delivered by this HLD. | + +### 4. Overview + +SONiC already publishes **dataplane** counters via the Flex-Counter framework (`CONFIG_DB / FLEX_COUNTER_TABLE` → `syncd` → `COUNTERS_DB`). What is missing is **service-level** counters — software-side events such as orchagent task throughput, gNMI request rate, BMP message error counts. Without these we cannot answer questions like *"is orchagent draining tasks?"*, *"is gNMI seeing subscribe failures?"*, *"is one container dropping more events than its peers?"*. + +A first attempt ([sonic-swss#4434](https://github.com/sonic-net/sonic-swss/pull/4434)) added a class `SwssStats` directly inside `orchagent`. The same plumbing — atomic counters, dirty tracking, a 1-second writer thread, a Redis-side schema — will be needed by every other SONiC container, and we additionally want to expose these counters via OTLP for off-box collection. Copy-pasting the implementation into each container is unacceptable: every container needs its own concurrency review, bug-fixes drift, and the on-the-wire schemas diverge. + +This HLD specifies a single, reusable producer that: + +1. accumulates counters in process-local atomic state with negligible hot-path cost, +2. mirrors them to `COUNTERS_DB` so `redis-cli`, `show ... stats` CLIs, and any other on-box tooling continue to work, +3. emits them as OTLP metrics to a local OTel Collector for forwarding to off-box telemetry systems, +4. exposes a stable public API so each container only needs to write a thin (~100 LoC) façade. + +### 5. Requirements + +**Functional** + +- R1. A reusable C++ library shall accumulate per-component, per-entity, per-metric `uint64` counters. +- R2. The library shall publish counters to `COUNTERS_DB` under a uniform key layout `_STATS:` (Redis hash, fields = metric names, values = decimal `uint64`). +- R3. 
The library shall publish the same counters as OpenTelemetry OTLP records to a configurable endpoint (default `localhost:4317`). +- R4. The library shall be usable by any SONiC container by writing a thin façade that owns only the container-specific metric vocabulary. +- R5. The existing `SwssStats` public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, `recordTask/Complete/Error`) shall remain byte-identical to that introduced in #4434. +- R6. The `COUNTERS_DB` schema introduced by #4434 (`SWSS_STATS:` hash with SET/DEL/COMPLETE/ERROR fields) shall remain unchanged. + +**Non-functional** + +- R7. The hot path (`increment` / `setValue`) shall be lock-free and constant-time after the first use of a given (entity, metric) pair. +- R8. Construction of a `ComponentStats` instance shall not crash the host process if Redis or the OTel Collector is not yet reachable; both sinks shall connect lazily and retry independently. +- R9. A failure in one sink (Redis down, OTel Collector restarting) shall not affect the other sink and shall not affect the hot path. After recovery, no monotonic data point shall be lost beyond intermediate samples (the next successful flush carries the latest cumulative value). +- R10. Idle systems shall produce zero outbound traffic on either sink (driven by per-entity dirty tracking). + +**Out of scope** + +- The OTel Collector itself, including its image, configuration, exporter pipeline to off-box telemetry systems, authentication, and operator onboarding. +- Replacing existing FlexCounter / SAI counter pipelines (those measure dataplane state via SAI; this design measures control-plane software events). +- Defining the metric vocabulary for non-swss containers — that is the job of each container's own façade. + +### 6. Architecture Design + +The architecture is unchanged at the SONiC system level. 
A single new library is introduced in `sonic-swss-common`; an existing class in `sonic-swss` is refactored to delegate to it; future containers may add their own façades using the same library. + +``` +┌────────────────────────────── SONiC switch ──────────────────────────────┐ +│ │ +│ orchagent (sonic-swss) gnmi / bmp / telemetry / … │ +│ ┌──────────────────────┐ ┌──────────────────────┐ │ +│ │ orch.cpp + SwssStats │ … │ gnmistats / bmpstats │ │ +│ └──────────┬───────────┘ └──────────┬───────────┘ │ +│ │ instrument │ │ +│ ▼ ▼ │ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ swss::ComponentStats (in libswsscommon) │ │ +│ │ ┌─────────────────────────────────────────────────────┐ │ │ +│ │ │ atomic counters + dirty tracking + writer thread │ │ │ +│ │ └──────────────┬──────────────────────────┬───────────┘ │ │ +│ │ │ │ │ │ +│ │ DB sink OTLP sink │ │ +│ │ (Redis HSET via swss::Table) (OTLP/gRPC, localhost) │ │ +│ └──────────┬──────────────────────────────────┬──────────────┘ │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌──────────────────────────┐ ┌────────────────────────────┐ │ +│ │ COUNTERS_DB │ │ Local OTel Collector │ │ +│ │ SWSS_STATS:PORT_TABLE │ │ (sidecar container) │ │ +│ │ GNMI_STATS:/iface/… │ │ │ │ +│ │ BMP_STATS:… │ │ batches, retries, adds │ │ +│ │ │ │ resource attrs, exports │ │ +│ │ used by: redis-cli, │ │ to off-box telemetry │ │ +│ │ show stats CLI, local │ └─────────────┬──────────────┘ │ +│ │ diagnostic tools │ │ │ +│ └──────────────────────────┘ │ │ +│ │ OTLP │ +└──────────────────────────────────────────────────┼───────────────────────┘ + │ + ▼ + ┌────────────────────┐ + │ Off-box telemetry │ + │ (e.g. Geneva mdm) │ + └────────────────────┘ +``` + +**Layering rule.** `swss-common` knows nothing of orchagent or any specific container; each container knows only its own façade plus `swss::ComponentStats`. New containers get both sinks for free by writing a ~100-line wrapper. 
+ +**Dual-sink design properties.** + +- *One source of truth.* Both sinks consume the same atomic-counter snapshot inside `ComponentStats`. Because they share that snapshot they cannot report conflicting values; if the OTel pipeline is briefly down, `COUNTERS_DB` still reflects current state, and vice versa. +- *No new transport for local debugging.* The `COUNTERS_DB` layout is unchanged, so `redis-cli`, `show ... stats` CLIs, and any existing in-band tooling keep working. +- *No off-box-system-specific code in containers.* Containers know only `ComponentStats`; the OTLP sink talks to a local OTel Collector at `localhost:4317`, and the Collector handles everything beyond that hop. +- *Independent failure domains.* Failures in one sink (DB unreachable, OTel agent restarting) do not affect the other or the hot path. + +### 7. High-Level Design + +#### 7.1 Repositories changed + +| Repository | What changes | +|--------------------------------|-----------------------------------------------------------------------------| +| `sonic-net/sonic-swss-common` | New library `swss::ComponentStats` + unit tests ([PR #1180](https://github.com/sonic-net/sonic-swss-common/pull/1180)). | +| `sonic-net/sonic-swss` | `SwssStats` is reduced to a thin façade over `ComponentStats` ([PR #4516](https://github.com/sonic-net/sonic-swss/pull/4516)). | +| `sonic-net/sonic-buildimage` | Submodule pointer bumps for the two repos above ([PR #26924](https://github.com/sonic-net/sonic-buildimage/pull/26924)). | + +No platform-specific code is added. No SAI changes. No syncd changes. + +#### 7.2 `swss::ComponentStats`: public API + +```cpp +namespace swss { + +class ComponentStats { +public: + using CounterSnapshot = std::map<std::string, uint64_t>; + + // Sink configuration. Both sinks default to "on". 
+ struct SinkConfig { + bool enableDb = true; // mirror to COUNTERS_DB + bool enableOtlp = true; // export to local OTel Collector + std::string otlpEndpoint = "localhost:4317"; // OTLP/gRPC endpoint + std::string serviceName; // OTel resource attr (default: componentName) + std::string serviceInstanceId; // OTel resource attr (default: hostname) + }; + + static std::shared_ptr<ComponentStats> create( + const std::string& componentName, + const std::string& dbName = "COUNTERS_DB", + uint32_t intervalSec = 1, + const SinkConfig& sinks = SinkConfig{}); + + void increment(const std::string& entity, const std::string& metric, uint64_t n = 1); + void setValue (const std::string& entity, const std::string& metric, uint64_t value); + + uint64_t get (const std::string& entity, const std::string& metric); + CounterSnapshot getAll(const std::string& entity); + + void setEnabled(bool on); + bool isEnabled() const; + void stop(); +}; + +} // namespace swss +``` + +`create()` consults a process-wide registry keyed by `componentName`. A second call with the same name returns the existing instance, ensuring containers cannot accidentally start multiple writer threads against the same Redis prefix. + +#### 7.3 Internal state + +Per instance: +- `m_entities : std::map<std::string, EntityStats>`: an ordered, node-based `std::map` rather than `unordered_map`; references returned by `getOrCreateEntity` remain valid after later inserts. +- `EntityStats` holds a map from metric name to a heap-allocated atomic counter (heap-allocated because `std::atomic` is not movable) plus a per-entity atomic `version`. +- `m_mutex` guards only the **structure** of the maps (insert/find). Hot-path reads/writes of counter values use `std::atomic` and skip the mutex after the first use. +- `m_running`, `m_enabled`: atomic flags. +- `m_cv`: wakes the writer thread immediately on `stop()` instead of waiting up to `intervalSec`. +- `m_thread`: owns the writer. + +Process-wide: +- `registry : std::map<std::string, std::weak_ptr<ComponentStats>>` (`weak_ptr` so a fully released instance can be destroyed). 
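The state layout in section 7.3 can be sketched in miniature as below. This is an illustration under stated assumptions, not the library's code: `MiniStats`, `counterFor`, `createMini`, and the `std::unique_ptr<std::atomic<uint64_t>>` slot type are hypothetical choices made for the example. Only the ideas (a node-based `std::map`, heap-allocated atomics, a structural mutex, a `weak_ptr` registry) come from the text. For brevity the sketch takes the mutex on every lookup, whereas the design above skips it after first use.

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical per-entity state: metric name -> heap-allocated atomic slot.
struct EntityStats {
    std::map<std::string, std::unique_ptr<std::atomic<uint64_t>>> counters;
    std::atomic<uint64_t> version{0};  // bumped on every mutation (dirty tracking)
};

// Hypothetical miniature of the per-instance state.
class MiniStats {
public:
    // The returned reference stays valid after later inserts, because
    // std::map nodes and the heap-allocated atomics never move.
    std::atomic<uint64_t>& counterFor(const std::string& entity,
                                      const std::string& metric) {
        std::lock_guard<std::mutex> lock(m_mutex);  // guards map structure only
        auto& slot = m_entities[entity].counters[metric];
        if (!slot) {
            slot = std::make_unique<std::atomic<uint64_t>>(0);
        }
        return *slot;
    }

private:
    std::mutex m_mutex;
    std::map<std::string, EntityStats> m_entities;
};

// Hypothetical process-wide registry: weak_ptr entries let an instance be
// destroyed once every owner has released it.
inline std::shared_ptr<MiniStats> createMini(const std::string& componentName) {
    static std::mutex registryMutex;
    static std::map<std::string, std::weak_ptr<MiniStats>> registry;

    std::lock_guard<std::mutex> lock(registryMutex);
    std::weak_ptr<MiniStats>& entry = registry[componentName];
    if (auto existing = entry.lock()) {
        return existing;  // same name -> same live instance
    }
    auto fresh = std::make_shared<MiniStats>();
    entry = fresh;
    return fresh;
}
```

A caller that holds on to the reference returned by `counterFor` can update the counter with plain atomic operations and never touch the mutex again, which is the property the hot path relies on.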
+ +#### 7.4 Hot path + +```cpp +void ComponentStats::increment(const string& entity, const string& metric, uint64_t n) { + if (!isEnabled() || n == 0) return; + + auto& e = getOrCreateEntity(entity); // mutex on first use only + auto& c = getOrCreateCounter(e, metric); // mutex on first use only + + c.value .fetch_add(n, memory_order_relaxed); // ① counter + e.version.fetch_add(1, memory_order_release); // ② dirty-bump (release) +} +``` + +Cost after warm-up: two atomic RMWs. No mutex acquisition, no allocation, no syscall. + +#### 7.5 Writer thread + +Runs at `intervalSec` (default 1 s) and fans the snapshot out to both sinks: + +``` +┌───────────────────────────────────────────────────────────────┐ +│ Phase A — connect each enabled sink (run once, with retry) │ +│ loop until m_running == false: │ +│ if enableDb and !dbConnected: try connect Redis │ +│ if enableOtlp and !otlpConnected: try open OTLP exporter │ +│ if all enabled sinks connected: break │ +│ else cv.wait_for(intervalSec, predicate=!m_running) │ +└───────────────────────────────────────────────────────────────┘ +┌───────────────────────────────────────────────────────────────┐ +│ Phase B — flush loop │ +│ loop: │ +│ cv.wait_for(intervalSec, predicate=!m_running) │ +│ if !m_running: break │ +│ │ +│ # SNAPSHOT (under lock) — single snapshot, two sinks │ +│ for each entity e in m_entities: │ +│ v = e.version.load(acquire) ← pairs ② │ +│ if lastVersion[e.name] == v: continue (skip clean)│ +│ lastVersion[e.name] = v │ +│ row = [(metric, c.value.load(relaxed)) for c in e] │ +│ enqueue(name, row) │ +│ │ +│ # FAN-OUT (lock released, sinks fail independently) │ +│ if enableDb: │ +│ for (name, row) in queue: │ +│ try: m_table->set(name, stringify(row)) │ +│ catch: log warn, continue │ +│ │ +│ if enableOtlp: │ +│ build OTLP ResourceMetrics{ … } from queue │ +│ try: m_otlp->Export(batch) │ +│ catch: log warn, continue │ +└───────────────────────────────────────────────────────────────┘ +``` + +Three properties: + 
+1. *Lock released before any I/O.* Round-trips under the structural lock would briefly stall every concurrent `increment()`. +2. *Idle systems generate zero outbound traffic on either sink.* When no entity has changed, the queue is empty and neither sink is touched. +3. *Sink isolation.* A failure in one sink is logged and skipped; the other sink still publishes the same cycle's snapshot. + +#### 7.6 Memory ordering correctness + +The release/acquire pair (`②` in 7.4 ↔ acquire-load in 7.5) guarantees: + +> If the writer reads `version == N`, then every counter mutation that contributed to bumping the version up to `N` has already happened-before the reader and is visible. + +Without it, on weakly ordered architectures (ARM, POWER) the writer could see the new version but read an old counter value, recording a stale snapshot. + +#### 7.7 OTLP sink details + +- **Wire format.** OTLP/gRPC over plaintext `localhost:4317`. No TLS or authentication on the local hop — the loopback link is inside the switch, and any off-box credentials live in the OTel Collector. OTLP/HTTP is supported as a build option but not the default. +- **Metric model.** Counters set via `increment()` are exported as OTLP `Sum` with `aggregation_temporality = CUMULATIVE` and `is_monotonic = true`. Counters set via `setValue()` (gauges) are exported as OTLP `Gauge`. +- **Resource attributes** attached to every batch: `service.name=`, `service.instance.id=`, `sonic.component=`. +- **Metric attributes** attached to every data point: `entity` — the table name / gNMI path / etc. The entity is a *label*, not part of the metric name, so dashboards can pivot freely. +- **Metric name** convention: `sonic..` (e.g. `sonic.swss.SET`, `sonic.gnmi.SUBSCRIBE`). +- **Batching / retry.** The producer does not batch beyond one `intervalSec` snapshot and does not retry. Batching, queuing, retrying, and back-pressure are the local OTel Collector's responsibility. 
+- **Container restart.** `start_time_unix_nano` is captured once in the constructor and advances on every container restart. This is the OTel-defined signal for counter reset; consumers handle it natively. + +#### 7.8 `COUNTERS_DB` sink details + +For component name `C` and entity `E`: + +``` +COUNTERS_DB key: "_STATS:" +hash fields: each metric name → uint64_t string +``` + +Example: `redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE"` → + +``` +1) "SET" +2) "1283" +3) "DEL" +4) "17" +5) "COMPLETE" +6) "1300" +7) "ERROR" +8) "0" +``` + +The shape mirrors the existing `COUNTERS:*` keys produced by the Flex-Counter pipeline. + +#### 7.9 `SwssStats` thin façade + +`SwssStats` (in `sonic-swss/orchagent/`) is reduced to a translation layer that owns only the SWSS-specific vocabulary and the global enable flag consumed by `orch.cpp`: + +```cpp +SwssStats::SwssStats() : m_impl(swss::ComponentStats::create("SWSS")) {} + +void SwssStats::recordTask(const std::string& t, const std::string& op) { + if (op == "SET") m_impl->increment(t, "SET"); + else if (op == "DEL") m_impl->increment(t, "DEL"); +} +void SwssStats::recordComplete(const std::string& t, uint64_t n) { m_impl->increment(t, "COMPLETE", n); } +void SwssStats::recordError (const std::string& t, uint64_t n) { m_impl->increment(t, "ERROR", n); } +``` + +The whole file is ~130 lines of straightforward translation. **The public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, `recordTask`/`recordComplete`/`recordError`) and the on-the-wire `SWSS_STATS:
` Redis layout are byte-identical to those introduced in #4434.** Existing consumers keep working without changes. + +#### 7.10 Adopting the library in a new container + +To add equivalent metrics to e.g. `gnmi`, write a façade analogous to §7.9: + +```cpp +class GnmiStats { +public: + static GnmiStats* getInstance(); + void recordSubscribe(const std::string& path) { m_impl->increment(path, "SUBSCRIBE"); } + void recordError (const std::string& path) { m_impl->increment(path, "ERROR"); } +private: + GnmiStats() : m_impl(swss::ComponentStats::create("GNMI")) {} + std::shared_ptr<swss::ComponentStats> m_impl; +}; +``` + +Result: counters land in `COUNTERS_DB` under keys `GNMI_STATS:` **and** are exported as OTLP metrics `sonic.gnmi.SUBSCRIBE` / `sonic.gnmi.ERROR` (with attribute `entity=`). The façade needs no thread management, no Redis or gRPC client code, and no test harness of its own; the library's writer thread and sinks cover all of that. + +### 8. SAI API + +No SAI API changes are required for this feature. This design measures control-plane software events inside SONiC containers; it does not query or modify any SAI state. + +### 9. Configuration and management + +#### 9.1 Manifest + +Not applicable. This is a built-in SONiC library, not an Application Extension. + +#### 9.2 CLI/YANG model Enhancements + +No new CLI commands or YANG models are introduced by this HLD. Existing CLIs that already read `COUNTERS_DB` (e.g. `redis-cli -n 2 HGETALL`, `show ... stats` style commands) continue to work and gain visibility into the new `_STATS:` keys for free. + +#### 9.3 Config DB Enhancements + +A future enhancement may add a `COMPONENT_STATS` table in `CONFIG_DB`, keyed by component name, to allow operators to flip individual sinks on/off and to override the OTLP endpoint without rebuilding: + +``` +CONFIG_DB key: COMPONENT_STATS| +fields: enable_db : "true" | "false" + enable_otlp : "true" | "false" + otlp_endpoint : + interval_sec : +``` + +The library reads the table once at construction time. Runtime re-configuration is not in scope for the first cut. 
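One way the proposed `COMPONENT_STATS` row could be turned into sink settings at construction time is sketched below. Everything here is an assumption made for illustration: the `StatsSettings`, `FieldMap`, `parseBool`, `parseInterval`, and `settingsFromRow` names are hypothetical, the row is modelled as a plain string map rather than read through any real swss-common API, and `interval_sec` is assumed to be a positive decimal number of seconds. Unknown or malformed values fall back to the defaults from section 7.2.

```cpp
#include <cctype>
#include <cstdint>
#include <limits>
#include <map>
#include <stdexcept>
#include <string>

// Local stand-in mirroring the SinkConfig fields from section 7.2.
struct SinkConfig {
    bool enableDb = true;
    bool enableOtlp = true;
    std::string otlpEndpoint = "localhost:4317";
};

// Hypothetical bundle of everything the proposed table can override.
struct StatsSettings {
    SinkConfig sinks;
    uint32_t intervalSec = 1;
};

// Hypothetical representation of one CONFIG_DB hash row.
using FieldMap = std::map<std::string, std::string>;

// Only "true" and "false" are accepted; anything else keeps the default.
inline bool parseBool(const FieldMap& row, const std::string& key, bool fallback) {
    auto it = row.find(key);
    if (it == row.end()) return fallback;
    if (it->second == "true") return true;
    if (it->second == "false") return false;
    return fallback;
}

// Accepts a positive decimal that fits in uint32_t; otherwise keeps the default.
inline uint32_t parseInterval(const FieldMap& row, uint32_t fallback) {
    auto it = row.find("interval_sec");
    if (it == row.end() || it->second.empty()) return fallback;
    for (unsigned char ch : it->second) {
        if (!std::isdigit(ch)) return fallback;  // rejects signs and spaces
    }
    try {
        unsigned long long v = std::stoull(it->second);
        if (v == 0 || v > std::numeric_limits<uint32_t>::max()) return fallback;
        return static_cast<uint32_t>(v);
    } catch (const std::out_of_range&) {
        return fallback;  // more digits than unsigned long long can hold
    }
}

inline StatsSettings settingsFromRow(const FieldMap& row) {
    StatsSettings s;
    s.sinks.enableDb = parseBool(row, "enable_db", s.sinks.enableDb);
    s.sinks.enableOtlp = parseBool(row, "enable_otlp", s.sinks.enableOtlp);
    auto ep = row.find("otlp_endpoint");
    if (ep != row.end() && !ep->second.empty()) {
        s.sinks.otlpEndpoint = ep->second;
    }
    s.intervalSec = parseInterval(row, s.intervalSec);
    return s;
}
```

Falling back to defaults on bad input, rather than throwing, matches the requirement that constructing the library must not crash the host process.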
+ +### 10. Warmboot and Fastboot Design Impact + +Counters are kept in process memory and are reset on container restart, including warmboot and fastboot. This matches the existing behaviour of the `SwssStats` introduced in #4434, and is acceptable because consumers (dashboards, alerts) compute rate-of-change rather than absolute values. The OTLP `start_time_unix_nano` attribute advances on every restart, which is the OTel-standard signal for counter reset and is handled natively by OTel-aware consumers. + +#### Warmboot and Fastboot Performance Impact + +- The library does **not** add any stalls, sleeps, or I/O operations to the boot critical chain. Construction is non-blocking; the writer thread connects to Redis and to the OTel Collector lazily and retries in the background, so a not-yet-ready dependency cannot delay container start. +- No CPU-heavy processing (Jinja templates, etc.) is added in the boot path. +- No third-party dependency is updated by this HLD beyond linking against the OpenTelemetry C++ SDK gRPC exporter, which is loaded only when the OTLP sink is enabled. +- The library does not delay any service or Docker container. + +No measurable boot-time degradation is expected. + +### 11. Memory Consumption + +- Per-instance footprint: O(entities × metrics) `uint64` slots plus their `std::map` keys. Bounded by the number of orchagent tables (≈ tens) for the SWSS façade. +- The OTLP exporter adds a small fixed overhead (one gRPC channel, one per-cycle batch buffer). +- When the feature is disabled at runtime via `setEnabled(false)`, the hot path becomes inert and the writer thread's queue stays empty; memory remains bounded. +- When the feature is disabled at compile time (the OTLP sink can be compiled out via build option), there is no residual memory cost beyond the symbols of `swss::ComponentStats` itself (the DB sink remains unconditional, matching #4434 behaviour). + +### 12. 
Restrictions/Limitations + +- Counters reset to zero on container restart by design. Consumers must compute rate-of-change rather than rely on absolute values across restarts. +- The library does not retain history; it relies on downstream consumers (`COUNTERS_DB` readers, OTel Collector) for retention. +- The OTLP sink depends on a local OTel Collector reachable at the configured endpoint. If absent, the OTLP sink retries silently in the background; the DB sink and the hot path are unaffected. +- The structural mutex (`m_mutex`) is acquired only on the *first* use of a given (entity, metric) pair. Workloads that constantly mint new entity names will see one mutex acquisition per new name; this is not the expected pattern for SONiC containers. + +### 13. Testing Requirements/Design + +#### 13.1 Unit Test cases + +Library unit tests live in `sonic-swss-common/tests/componentstats_ut.cpp`: + +| # | Test | What it proves | +|---|----------------------------|---------------------------------------------------------------------------------------------| +| 1 | BasicIncrement | `increment` + `get` round-trip | +| 2 | MultipleMetrics | metric isolation within an entity | +| 3 | MultipleEntities | entity isolation within a component | +| 4 | SetValueOverwrites | gauge semantics | +| 5 | DisabledIsNoOp | `setEnabled(false)` makes hot path inert | +| 6 | GetAllReturnsSnapshot | bulk read returns the right shape | +| 7 | ConcurrentIncrements | 8 threads × 10 000 increments → exactly 80 000 (no torn writes, no lost updates) | +| 8 | SingletonSameName | `create("X")` returns the same instance | +| 9 | SingletonDifferentNames | `create("X") ≠ create("Y")` | + +The existing `swssstats_ut.cpp` (9 cases) in `sonic-swss` is kept verbatim and continues to pass against the thin façade, proving the public API has not regressed. 
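The idea behind test 7 (ConcurrentIncrements) can be sketched without the real library or a test framework. The snippet below is a stand-alone illustration, not the contents of `componentstats_ut.cpp`: `Slot`, `Totals`, `increment`, and `runConcurrentIncrements` are hypothetical names, and the memory orders simply copy the pattern described for the hot path (relaxed add on the value, release bump on the version).

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for one (entity, metric) counter plus its dirty version.
struct Slot {
    std::atomic<uint64_t> value{0};
    std::atomic<uint64_t> version{0};
};

// Same two-step pattern as the hot path: relaxed add, then release bump.
inline void increment(Slot& s, uint64_t n = 1) {
    s.value.fetch_add(n, std::memory_order_relaxed);
    s.version.fetch_add(1, std::memory_order_release);
}

struct Totals {
    uint64_t value;
    uint64_t version;
};

// Spawns threadCount workers, each performing perThread increments,
// and returns the totals observed after every worker has joined.
inline Totals runConcurrentIncrements(unsigned threadCount, unsigned perThread) {
    Slot slot;
    std::vector<std::thread> workers;
    workers.reserve(threadCount);
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&slot, perThread] {
            for (unsigned i = 0; i < perThread; ++i) {
                increment(slot);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    Totals out;
    out.version = slot.version.load(std::memory_order_acquire);
    out.value = slot.value.load(std::memory_order_relaxed);
    return out;
}
```

Calling `runConcurrentIncrements(8, 10000)` should yield 80 000 for both totals, because atomic read-modify-write operations cannot lose updates regardless of the memory order chosen.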
+ +Run: + +``` +cd sonic-swss-common && ./autogen.sh && ./configure && make check +./tests/tests --gtest_filter='ComponentStats*' +``` + +#### 13.2 System Test cases + +- Boot a `sonic-vs` image built with the three companion PRs. +- Exercise orchagent (e.g. `config vlan add`, `config interface ip add`). +- Verify on-box DB sink: + ``` + redis-cli -n 2 KEYS "SWSS_STATS:*" + redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE" + ``` + Counters increment in proportion to operations; idle dwell shows zero further writes (dirty tracking working). +- Verify OTLP sink (Phase 2): point a local OTel Collector at `localhost:4317` with a debug exporter and confirm `sonic.swss.*` metrics arrive with correct resource and metric attributes. +- Confirm warmboot and fastboot are unaffected (no boot-time regression, no service startup ordering change). + +### 14. Open/Action items + +- Phase 1 (this HLD's three PRs) lands the `ComponentStats` library and `SwssStats` refactor with the DB sink fully active and the OTLP sink stubbed (`enableOtlp=false` by default). +- Phase 2 implements the OTLP sink against the OpenTelemetry C++ SDK and is gated on the local OTel Collector sidecar being available in `sonic-buildimage`. Coordination with whichever team owns the local OTel Collector image is required before Phase 2 can be enabled by default. +- Phase 3 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own façades. Each is a self-contained PR in the relevant repository. From 9e67d7ce92d25329e35561b81d4a960bfc380420 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Tue, 28 Apr 2026 10:43:05 +0800 Subject: [PATCH 02/14] Component Stats HLD: drop sonic-swss#4434 references; reword revision label Address review feedback: - Replace 'Initial draft' with 'Initial revision' in the revision table. 
- Treat the SwssStats facade as freshly introduced by this work; remove all references to sonic-swss#4434 in Scope, Overview, Requirements, the facade section, Warmboot, Memory, and Testing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yutong Zhang --- doc/component-stats/component-stats-hld.md | 27 +++++++++++++--------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/doc/component-stats/component-stats-hld.md b/doc/component-stats/component-stats-hld.md index 0bd4f03a935..5f47f65b3cc 100644 --- a/doc/component-stats/component-stats-hld.md +++ b/doc/component-stats/component-stats-hld.md @@ -21,11 +21,16 @@ | Rev | Date | Author | Change Description | |-----|------------|---------------|--------------------------| -| 0.1 | 2026-04-28 | Yutong Zhang | Initial draft | +| 0.1 | 2026-04-28 | Yutong Zhang | Initial revision | ### 2. Scope -This HLD specifies a reusable mechanism for exposing **service-level (control-plane software) counters** from SONiC containers. It introduces a new shared library `swss::ComponentStats` in `sonic-swss-common` and refactors the existing `SwssStats` class in `sonic-swss` (introduced by [sonic-swss#4434](https://github.com/sonic-net/sonic-swss/pull/4434)) into a thin façade over the new library. The library publishes counters to: +This HLD specifies a reusable mechanism for exposing **service-level (control-plane software) counters** from SONiC containers. It introduces: + +1. A new shared library `swss::ComponentStats` in `sonic-swss-common`. +2. A SWSS-specific façade `SwssStats` in `sonic-swss` built on top of that library, which is the first consumer. + +The library publishes counters to: 1. `COUNTERS_DB`, for parity with the existing Flex-Counter pipeline and for on-box diagnostic tooling (`redis-cli`, `show ... stats`). 2. A local OpenTelemetry (OTLP) Collector sidecar, so the same counters can be forwarded to off-box telemetry systems (e.g. Geneva mdm) that consume OTLP. 
@@ -49,7 +54,7 @@ Configuration of the OTel Collector itself, off-box telemetry endpoints, dashboa SONiC already publishes **dataplane** counters via the Flex-Counter framework (`CONFIG_DB / FLEX_COUNTER_TABLE` → `syncd` → `COUNTERS_DB`). What is missing is **service-level** counters — software-side events such as orchagent task throughput, gNMI request rate, BMP message error counts. Without these we cannot answer questions like *"is orchagent draining tasks?"*, *"is gNMI seeing subscribe failures?"*, *"is one container dropping more events than its peers?"*. -A first attempt ([sonic-swss#4434](https://github.com/sonic-net/sonic-swss/pull/4434)) added a class `SwssStats` directly inside `orchagent`. The same plumbing — atomic counters, dirty tracking, a 1-second writer thread, a Redis-side schema — will be needed by every other SONiC container, and we additionally want to expose these counters via OTLP for off-box collection. Copy-pasting the implementation into each container is unacceptable: every container needs its own concurrency review, bug-fixes drift, and the on-the-wire schemas diverge. +A naïve implementation would put this plumbing — atomic counters, dirty tracking, a 1-second writer thread, a Redis-side schema, an OTLP exporter — directly inside each container. That is unacceptable: every container would need its own concurrency review, bug fixes would drift, and the on-the-wire schemas would diverge. This HLD specifies a single, reusable producer that: @@ -66,8 +71,8 @@ This HLD specifies a single, reusable producer that: - R2. The library shall publish counters to `COUNTERS_DB` under a uniform key layout `_STATS:` (Redis hash, fields = metric names, values = decimal `uint64`). - R3. The library shall publish the same counters as OpenTelemetry OTLP records to a configurable endpoint (default `localhost:4317`). - R4. The library shall be usable by any SONiC container by writing a thin façade that owns only the container-specific metric vocabulary. 
-- R5. The existing `SwssStats` public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, `recordTask/Complete/Error`) shall remain byte-identical to that introduced in #4434. -- R6. The `COUNTERS_DB` schema introduced by #4434 (`SWSS_STATS:
` hash with SET/DEL/COMPLETE/ERROR fields) shall remain unchanged. +- R5. The first consumer of the library is the SWSS-specific façade `SwssStats` (in `sonic-swss/orchagent/`), which exposes a small SWSS-specific public surface: a global `gSwssStatsRecord` enable flag, `SwssStats::getInstance()`, and `recordTask` / `recordComplete` / `recordError` methods. +- R6. The `SwssStats` façade shall write into `COUNTERS_DB` under keys `SWSS_STATS:
` with hash fields `SET` / `DEL` / `COMPLETE` / `ERROR`, following the uniform schema in R2. **Non-functional** @@ -84,7 +89,7 @@ This HLD specifies a single, reusable producer that: ### 6. Architecture Design -The architecture is unchanged at the SONiC system level. A single new library is introduced in `sonic-swss-common`; an existing class in `sonic-swss` is refactored to delegate to it; future containers may add their own façades using the same library. +The architecture is unchanged at the SONiC system level. A new library is introduced in `sonic-swss-common`, and a new SWSS-specific façade (its first consumer) is added in `sonic-swss`; future containers may add their own façades using the same library. ``` ┌────────────────────────────── SONiC switch ──────────────────────────────┐ @@ -142,7 +147,7 @@ The architecture is unchanged at the SONiC system level. A single new library is | Repository | What changes | |--------------------------------|-----------------------------------------------------------------------------| | `sonic-net/sonic-swss-common` | New library `swss::ComponentStats` + unit tests ([PR #1180](https://github.com/sonic-net/sonic-swss-common/pull/1180)). | -| `sonic-net/sonic-swss` | `SwssStats` is reduced to a thin façade over `ComponentStats` ([PR #4516](https://github.com/sonic-net/sonic-swss/pull/4516)). | +| `sonic-net/sonic-swss` | New `SwssStats` thin façade over `ComponentStats` in `orchagent/` ([PR #4516](https://github.com/sonic-net/sonic-swss/pull/4516)). | | `sonic-net/sonic-buildimage` | Submodule pointer bumps for the two repos above ([PR #26924](https://github.com/sonic-net/sonic-buildimage/pull/26924)). | No platform-specific code is added. No SAI changes. No syncd changes. 
@@ -319,7 +324,7 @@ void SwssStats::recordComplete(const std::string& t, uint64_t n) { m_impl->incre void SwssStats::recordError (const std::string& t, uint64_t n) { m_impl->increment(t, "ERROR", n); } ``` -The whole file is ~130 lines of straightforward translation. **The public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, `recordTask`/`recordComplete`/`recordError`) and the on-the-wire `SWSS_STATS:
` Redis layout are byte-identical to those introduced in #4434.** Existing consumers keep working without changes. +The whole file is ~130 lines of straightforward delegation. **The public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, `recordTask`/`recordComplete`/`recordError`) and the on-the-wire `SWSS_STATS:
` Redis layout are deliberately kept narrow and stable so that the SWSS-specific vocabulary remains independent of future evolution of the underlying `ComponentStats` library.** #### 7.10 Adopting the library in a new container @@ -369,7 +374,7 @@ The library reads the table once at construction time. Runtime re-configuration ### 10. Warmboot and Fastboot Design Impact -Counters are kept in process memory and are reset on container restart, including warmboot and fastboot. This matches the existing behaviour of the `SwssStats` introduced in #4434, and is acceptable because consumers (dashboards, alerts) compute rate-of-change rather than absolute values. The OTLP `start_time_unix_nano` attribute advances on every restart, which is the OTel-standard signal for counter reset and is handled natively by OTel-aware consumers. +Counters are kept in process memory and are reset on container restart, including warmboot and fastboot. This is acceptable because consumers (dashboards, alerts) compute rate-of-change rather than absolute values. The OTLP `start_time_unix_nano` attribute advances on every restart, which is the OTel-standard signal for counter reset and is handled natively by OTel-aware consumers. #### Warmboot and Fastboot Performance Impact @@ -385,7 +390,7 @@ No measurable boot-time degradation is expected. - Per-instance footprint: O(entities × metrics) `uint64` slots plus their `std::map` keys. Bounded by the number of orchagent tables (≈ tens) for the SWSS façade. - The OTLP exporter adds a small fixed overhead (one gRPC channel, one per-cycle batch buffer). - When the feature is disabled at runtime via `setEnabled(false)`, the hot path becomes inert and the writer thread's queue stays empty; memory remains bounded. 
-- When the feature is disabled at compile time (the OTLP sink can be compiled out via build option), there is no residual memory cost beyond the symbols of `swss::ComponentStats` itself (the DB sink remains unconditional, matching #4434 behaviour). +- When the feature is disabled at compile time (the OTLP sink can be compiled out via build option), there is no residual memory cost beyond the symbols of `swss::ComponentStats` itself; the DB sink remains unconditional. ### 12. Restrictions/Limitations @@ -412,7 +417,7 @@ Library unit tests live in `sonic-swss-common/tests/componentstats_ut.cpp`: | 8 | SingletonSameName | `create("X")` returns the same instance | | 9 | SingletonDifferentNames | `create("X") ≠ create("Y")` | -The existing `swssstats_ut.cpp` (9 cases) in `sonic-swss` is kept verbatim and continues to pass against the thin façade, proving the public API has not regressed. +A façade-level test suite `swssstats_ut.cpp` (9 cases) is added in `sonic-swss` and exercises the SwssStats vocabulary (`recordTask`/`recordComplete`/`recordError`, `gSwssStatsRecord` enable flag, singleton behaviour) end-to-end against the new backend. Run: From fc2dff1694f6ea940689fcc8da78681ddc76d40b Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Tue, 28 Apr 2026 10:47:34 +0800 Subject: [PATCH 03/14] Component Stats HLD: use plain ASCII for facade/naive Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yutong Zhang --- doc/component-stats/component-stats-hld.md | 32 +++++++++++----------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/doc/component-stats/component-stats-hld.md b/doc/component-stats/component-stats-hld.md index 5f47f65b3cc..826eae312a3 100644 --- a/doc/component-stats/component-stats-hld.md +++ b/doc/component-stats/component-stats-hld.md @@ -28,7 +28,7 @@ This HLD specifies a reusable mechanism for exposing **service-level (control-plane software) counters** from SONiC containers. It introduces: 1. 
A new shared library `swss::ComponentStats` in `sonic-swss-common`. -2. A SWSS-specific façade `SwssStats` in `sonic-swss` built on top of that library, which is the first consumer. +2. A SWSS-specific facade `SwssStats` in `sonic-swss` built on top of that library, which is the first consumer. The library publishes counters to: @@ -45,7 +45,7 @@ Configuration of the OTel Collector itself, off-box telemetry endpoints, dashboa | Entity | A logical grouping of metrics inside a component (e.g. an orchagent table, a gNMI path). | | Metric | A named uint64 counter or gauge inside an entity (e.g. `SET`, `DEL`, `COMPLETE`, `ERROR`). | | ComponentStats | The new shared library in `sonic-swss-common` providing the producer mechanism. | -| SwssStats | A SWSS-specific façade over `ComponentStats` (lives in `sonic-swss`). | +| SwssStats | A SWSS-specific facade over `ComponentStats` (lives in `sonic-swss`). | | DB sink | The output path that mirrors counters into `COUNTERS_DB`. | | OTLP sink | The output path that exports counters via OpenTelemetry Protocol to a local OTel Collector. | | OTel Collector | A locally-running OpenTelemetry Collector sidecar; not delivered by this HLD. | @@ -54,14 +54,14 @@ Configuration of the OTel Collector itself, off-box telemetry endpoints, dashboa SONiC already publishes **dataplane** counters via the Flex-Counter framework (`CONFIG_DB / FLEX_COUNTER_TABLE` → `syncd` → `COUNTERS_DB`). What is missing is **service-level** counters — software-side events such as orchagent task throughput, gNMI request rate, BMP message error counts. Without these we cannot answer questions like *"is orchagent draining tasks?"*, *"is gNMI seeing subscribe failures?"*, *"is one container dropping more events than its peers?"*. -A naïve implementation would put this plumbing — atomic counters, dirty tracking, a 1-second writer thread, a Redis-side schema, an OTLP exporter — directly inside each container. 
That is unacceptable: every container would need its own concurrency review, bug fixes would drift, and the on-the-wire schemas would diverge. +A naive implementation would put this plumbing — atomic counters, dirty tracking, a 1-second writer thread, a Redis-side schema, an OTLP exporter — directly inside each container. That is unacceptable: every container would need its own concurrency review, bug fixes would drift, and the on-the-wire schemas would diverge. This HLD specifies a single, reusable producer that: 1. accumulates counters in process-local atomic state with negligible hot-path cost, 2. mirrors them to `COUNTERS_DB` so `redis-cli`, `show ... stats` CLIs, and any other on-box tooling continue to work, 3. emits them as OTLP metrics to a local OTel Collector for forwarding to off-box telemetry systems, -4. exposes a stable public API so each container only needs to write a thin (~100 LoC) façade. +4. exposes a stable public API so each container only needs to write a thin (~100 LoC) facade. ### 5. Requirements @@ -70,9 +70,9 @@ This HLD specifies a single, reusable producer that: - R1. A reusable C++ library shall accumulate per-component, per-entity, per-metric `uint64` counters. - R2. The library shall publish counters to `COUNTERS_DB` under a uniform key layout `_STATS:` (Redis hash, fields = metric names, values = decimal `uint64`). - R3. The library shall publish the same counters as OpenTelemetry OTLP records to a configurable endpoint (default `localhost:4317`). -- R4. The library shall be usable by any SONiC container by writing a thin façade that owns only the container-specific metric vocabulary. -- R5. The first consumer of the library is the SWSS-specific façade `SwssStats` (in `sonic-swss/orchagent/`), which exposes a small SWSS-specific public surface: a global `gSwssStatsRecord` enable flag, `SwssStats::getInstance()`, and `recordTask` / `recordComplete` / `recordError` methods. -- R6. 
The `SwssStats` façade shall write into `COUNTERS_DB` under keys `SWSS_STATS:
` with hash fields `SET` / `DEL` / `COMPLETE` / `ERROR`, following the uniform schema in R2. +- R4. The library shall be usable by any SONiC container by writing a thin facade that owns only the container-specific metric vocabulary. +- R5. The first consumer of the library is the SWSS-specific facade `SwssStats` (in `sonic-swss/orchagent/`), which exposes a small SWSS-specific public surface: a global `gSwssStatsRecord` enable flag, `SwssStats::getInstance()`, and `recordTask` / `recordComplete` / `recordError` methods. +- R6. The `SwssStats` facade shall write into `COUNTERS_DB` under keys `SWSS_STATS:
` with hash fields `SET` / `DEL` / `COMPLETE` / `ERROR`, following the uniform schema in R2. **Non-functional** @@ -85,11 +85,11 @@ This HLD specifies a single, reusable producer that: - The OTel Collector itself, including its image, configuration, exporter pipeline to off-box telemetry systems, authentication, and operator onboarding. - Replacing existing FlexCounter / SAI counter pipelines (those measure dataplane state via SAI; this design measures control-plane software events). -- Defining the metric vocabulary for non-swss containers — that is the job of each container's own façade. +- Defining the metric vocabulary for non-swss containers — that is the job of each container's own facade. ### 6. Architecture Design -The architecture is unchanged at the SONiC system level. A new library is introduced in `sonic-swss-common`, and a new SWSS-specific façade (its first consumer) is added in `sonic-swss`; future containers may add their own façades using the same library. +The architecture is unchanged at the SONiC system level. A new library is introduced in `sonic-swss-common`, and a new SWSS-specific facade (its first consumer) is added in `sonic-swss`; future containers may add their own facades using the same library. ``` ┌────────────────────────────── SONiC switch ──────────────────────────────┐ @@ -131,7 +131,7 @@ The architecture is unchanged at the SONiC system level. A new library is introd └────────────────────┘ ``` -**Layering rule.** `swss-common` knows nothing of orchagent or any specific container; each container knows only its own façade plus `swss::ComponentStats`. New containers get both sinks for free by writing a ~100-line wrapper. +**Layering rule.** `swss-common` knows nothing of orchagent or any specific container; each container knows only its own facade plus `swss::ComponentStats`. New containers get both sinks for free by writing a ~100-line wrapper. 
**Dual-sink design properties.** @@ -147,7 +147,7 @@ The architecture is unchanged at the SONiC system level. A new library is introd | Repository | What changes | |--------------------------------|-----------------------------------------------------------------------------| | `sonic-net/sonic-swss-common` | New library `swss::ComponentStats` + unit tests ([PR #1180](https://github.com/sonic-net/sonic-swss-common/pull/1180)). | -| `sonic-net/sonic-swss` | New `SwssStats` thin façade over `ComponentStats` in `orchagent/` ([PR #4516](https://github.com/sonic-net/sonic-swss/pull/4516)). | +| `sonic-net/sonic-swss` | New `SwssStats` thin facade over `ComponentStats` in `orchagent/` ([PR #4516](https://github.com/sonic-net/sonic-swss/pull/4516)). | | `sonic-net/sonic-buildimage` | Submodule pointer bumps for the two repos above ([PR #26924](https://github.com/sonic-net/sonic-buildimage/pull/26924)). | No platform-specific code is added. No SAI changes. No syncd changes. @@ -309,7 +309,7 @@ Example: `redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE"` → The shape mirrors the existing `COUNTERS:*` keys produced by the Flex-Counter pipeline. -#### 7.9 `SwssStats` thin façade +#### 7.9 `SwssStats` thin facade `SwssStats` (in `sonic-swss/orchagent/`) is reduced to a translation layer that owns only the SWSS-specific vocabulary and the global enable flag consumed by `orch.cpp`: @@ -328,7 +328,7 @@ The whole file is ~130 lines of straightforward delegation. **The public surface #### 7.10 Adopting the library in a new container -To add equivalent metrics to e.g. `gnmi`, write a façade analogous to §7.9: +To add equivalent metrics to e.g. `gnmi`, write a facade analogous to §7.9: ```cpp class GnmiStats { @@ -387,7 +387,7 @@ No measurable boot-time degradation is expected. ### 11. Memory Consumption -- Per-instance footprint: O(entities × metrics) `uint64` slots plus their `std::map` keys. Bounded by the number of orchagent tables (≈ tens) for the SWSS façade. 
+- Per-instance footprint: O(entities × metrics) `uint64` slots plus their `std::map` keys. Bounded by the number of orchagent tables (≈ tens) for the SWSS facade. - The OTLP exporter adds a small fixed overhead (one gRPC channel, one per-cycle batch buffer). - When the feature is disabled at runtime via `setEnabled(false)`, the hot path becomes inert and the writer thread's queue stays empty; memory remains bounded. - When the feature is disabled at compile time (the OTLP sink can be compiled out via build option), there is no residual memory cost beyond the symbols of `swss::ComponentStats` itself; the DB sink remains unconditional. @@ -417,7 +417,7 @@ Library unit tests live in `sonic-swss-common/tests/componentstats_ut.cpp`: | 8 | SingletonSameName | `create("X")` returns the same instance | | 9 | SingletonDifferentNames | `create("X") ≠ create("Y")` | -A façade-level test suite `swssstats_ut.cpp` (9 cases) is added in `sonic-swss` and exercises the SwssStats vocabulary (`recordTask`/`recordComplete`/`recordError`, `gSwssStatsRecord` enable flag, singleton behaviour) end-to-end against the new backend. +A facade-level test suite `swssstats_ut.cpp` (9 cases) is added in `sonic-swss` and exercises the SwssStats vocabulary (`recordTask`/`recordComplete`/`recordError`, `gSwssStatsRecord` enable flag, singleton behaviour) end-to-end against the new backend. Run: @@ -443,4 +443,4 @@ cd sonic-swss-common && ./autogen.sh && ./configure && make check - Phase 1 (this HLD's three PRs) lands the `ComponentStats` library and `SwssStats` refactor with the DB sink fully active and the OTLP sink stubbed (`enableOtlp=false` by default). - Phase 2 implements the OTLP sink against the OpenTelemetry C++ SDK and is gated on the local OTel Collector sidecar being available in `sonic-buildimage`. Coordination with whichever team owns the local OTel Collector image is required before Phase 2 can be enabled by default. 
-- Phase 3 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own façades. Each is a self-contained PR in the relevant repository. +- Phase 3 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own facades. Each is a self-contained PR in the relevant repository. From 6f9f85d4e51d00754ac2e5c724613a5760ac83f5 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Tue, 28 Apr 2026 11:03:21 +0800 Subject: [PATCH 04/14] Component Stats HLD: trim out-of-scope items, drop buildimage row, simplify section 9 - Reword non-swss vocabulary out-of-scope item as future work. - Remove the sonic-buildimage submodule row from the repositories table; not needed. - Section 9: collapse Manifest / CLI / CONFIG_DB subsections into a single 'Not applicable' note. - Update Phase 1 wording and system-test bullet to reference two companion PRs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yutong Zhang --- doc/component-stats/component-stats-hld.md | 31 ++++------------------ 1 file changed, 5 insertions(+), 26 deletions(-) diff --git a/doc/component-stats/component-stats-hld.md b/doc/component-stats/component-stats-hld.md index 826eae312a3..4d5559a6e96 100644 --- a/doc/component-stats/component-stats-hld.md +++ b/doc/component-stats/component-stats-hld.md @@ -85,7 +85,7 @@ This HLD specifies a single, reusable producer that: - The OTel Collector itself, including its image, configuration, exporter pipeline to off-box telemetry systems, authentication, and operator onboarding. - Replacing existing FlexCounter / SAI counter pipelines (those measure dataplane state via SAI; this design measures control-plane software events). -- Defining the metric vocabulary for non-swss containers — that is the job of each container's own facade. +- Defining the metric vocabulary for non-swss containers (`gnmi`, `bmp`, `telemetry`, …); this is left as future work. ### 6. 
Architecture Design @@ -148,7 +148,6 @@ The architecture is unchanged at the SONiC system level. A new library is introd |--------------------------------|-----------------------------------------------------------------------------| | `sonic-net/sonic-swss-common` | New library `swss::ComponentStats` + unit tests ([PR #1180](https://github.com/sonic-net/sonic-swss-common/pull/1180)). | | `sonic-net/sonic-swss` | New `SwssStats` thin facade over `ComponentStats` in `orchagent/` ([PR #4516](https://github.com/sonic-net/sonic-swss/pull/4516)). | -| `sonic-net/sonic-buildimage` | Submodule pointer bumps for the two repos above ([PR #26924](https://github.com/sonic-net/sonic-buildimage/pull/26924)). | No platform-specific code is added. No SAI changes. No syncd changes. @@ -350,27 +349,7 @@ No SAI API changes are required for this feature. This design measures control-p ### 9. Configuration and management -#### 9.1 Manifest - -Not applicable. This is a built-in SONiC library, not an Application Extension. - -#### 9.2 CLI/YANG model Enhancements - -No new CLI commands or YANG models are introduced by this HLD. Existing CLIs that already read `COUNTERS_DB` (e.g. `redis-cli -n 2 HGETALL`, `show ... stats` style commands) continue to work and gain visibility into the new `_STATS:` keys for free. - -#### 9.3 Config DB Enhancements - -A future enhancement may add a `COMPONENT_STATS` table in `CONFIG_DB`, keyed by component name, to allow operators to flip individual sinks on/off and to override the OTLP endpoint without rebuilding: - -``` -CONFIG_DB key: COMPONENT_STATS| -fields: enable_db : "true" | "false" - enable_otlp : "true" | "false" - otlp_endpoint : - interval_sec : -``` - -The library reads the table once at construction time. Runtime re-configuration is not in scope for the first cut. +Not applicable. This HLD introduces no new CLI commands, YANG models, manifests, or `CONFIG_DB` schema. Existing CLIs that already read `COUNTERS_DB` (e.g. 
`redis-cli -n 2 HGETALL`, `show ... stats` style commands) continue to work and gain visibility into the new `_STATS:` keys for free. ### 10. Warmboot and Fastboot Design Impact @@ -428,7 +407,7 @@ cd sonic-swss-common && ./autogen.sh && ./configure && make check #### 13.2 System Test cases -- Boot a `sonic-vs` image built with the three companion PRs. +- Boot a `sonic-vs` image built with the two companion PRs. - Exercise orchagent (e.g. `config vlan add`, `config interface ip add`). - Verify on-box DB sink: ``` @@ -441,6 +420,6 @@ cd sonic-swss-common && ./autogen.sh && ./configure && make check ### 14. Open/Action items -- Phase 1 (this HLD's three PRs) lands the `ComponentStats` library and `SwssStats` refactor with the DB sink fully active and the OTLP sink stubbed (`enableOtlp=false` by default). -- Phase 2 implements the OTLP sink against the OpenTelemetry C++ SDK and is gated on the local OTel Collector sidecar being available in `sonic-buildimage`. Coordination with whichever team owns the local OTel Collector image is required before Phase 2 can be enabled by default. +- Phase 1 (this HLD's two PRs) lands the `ComponentStats` library and the `SwssStats` facade with the DB sink fully active and the OTLP sink stubbed (`enableOtlp=false` by default). +- Phase 2 implements the OTLP sink against the OpenTelemetry C++ SDK and is gated on the local OTel Collector sidecar being available on the switch. Coordination with whichever team owns the local OTel Collector image is required before Phase 2 can be enabled by default. - Phase 3 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own facades. Each is a self-contained PR in the relevant repository. 
From 882b33741d552a288cd7188e08baaf83962be80f Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Tue, 12 May 2026 11:06:46 +0800 Subject: [PATCH 05/14] Component Stats HLD: split into Framework + Reporting HLDs Split the previous single component-stats-hld.md into two documents so that responsibilities map cleanly to the teams involved: * component-stats-framework-hld.md (SONiC team): the swss::ComponentStats library, the SwssStats facade pattern, hot path, threading, memory ordering, warmboot, memory and testing for the producer. The DB sink is the only sink documented; OTLP is moved to future work. * component-stats-reporting-hld.md (SONiC team, contract with NDM): the COUNTERS_DB schema (key layout, hash fields, idle suppression) and SWSS-specific vocabulary, plus conventions for future components. The reporting transport (telegraf -> mdm -> Geneva) is owned by the NDM HLD and referenced here, not duplicated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yutong Zhang --- .../component-stats-framework-hld.md | 376 ++++++++++++++++ doc/component-stats/component-stats-hld.md | 425 ------------------ .../component-stats-reporting-hld.md | 258 +++++++++++ 3 files changed, 634 insertions(+), 425 deletions(-) create mode 100644 doc/component-stats/component-stats-framework-hld.md delete mode 100644 doc/component-stats/component-stats-hld.md create mode 100644 doc/component-stats/component-stats-reporting-hld.md diff --git a/doc/component-stats/component-stats-framework-hld.md b/doc/component-stats/component-stats-framework-hld.md new file mode 100644 index 00000000000..3ab58e55cf4 --- /dev/null +++ b/doc/component-stats/component-stats-framework-hld.md @@ -0,0 +1,376 @@ +# SONiC Component Statistics — Framework HLD + +## Table of Content + +- [Revision](#1-revision) +- [Scope](#2-scope) +- [Definitions/Abbreviations](#3-definitionsabbreviations) +- [Overview](#4-overview) +- [Requirements](#5-requirements) +- [Architecture 
Design](#6-architecture-design) +- [High-Level Design](#7-high-level-design) +- [SAI API](#8-sai-api) +- [Configuration and management](#9-configuration-and-management) +- [Warmboot and Fastboot Design Impact](#10-warmboot-and-fastboot-design-impact) +- [Memory Consumption](#11-memory-consumption) +- [Restrictions/Limitations](#12-restrictionslimitations) +- [Testing Requirements/Design](#13-testing-requirementsdesign) +- [Open/Action items](#14-openaction-items) + +### 1. Revision + +| Rev | Date | Author | Change Description | +|-----|------------|---------------|------------------------------------------------------| +| 0.1 | 2026-04-28 | Yutong Zhang | Initial revision | +| 0.2 | 2026-05-12 | Yutong Zhang | Split out the reporting pipeline into a separate HLD | + +### 2. Scope + +This HLD specifies a reusable producer-side mechanism for **service-level (control-plane software) counters** in SONiC containers. It introduces: + +1. A new shared library `swss::ComponentStats` in `sonic-swss-common`. +2. A SWSS-specific facade `SwssStats` in `sonic-swss` built on top of that library, which is the first consumer. + +The library publishes counters into `COUNTERS_DB` so that: + +- on-box diagnostic tooling (`redis-cli`, `show ... stats`) keeps working with no new transport, and +- off-box telemetry consumers can pick the counters up via the reporting pipeline described in the companion HLD. + +**This HLD owns the producer side only**: the library, the facade pattern, the hot-path / threading / memory-ordering design, and warmboot / memory / testing concerns for the library itself. The reporting pipeline (how counters travel from `COUNTERS_DB` to Geneva or other off-box telemetry systems) is specified in the companion HLD: + +- [Component Statistics — Reporting HLD](./component-stats-reporting-hld.md) + +### 3. 
Definitions/Abbreviations + +| Term | Definition | +|-----------------|---------------------------------------------------------------------------------------------| +| Component | A SONiC container that produces service-level counters (e.g. `swss`, `gnmi`, `bmp`). | +| Entity | A logical grouping of metrics inside a component (e.g. an orchagent table, a gNMI path). | +| Metric | A named uint64 counter or gauge inside an entity (e.g. `SET`, `DEL`, `COMPLETE`, `ERROR`). | +| ComponentStats | The new shared library in `sonic-swss-common` providing the producer mechanism. | +| SwssStats | A SWSS-specific facade over `ComponentStats` (lives in `sonic-swss`). | +| DB sink | The output path that mirrors counters into `COUNTERS_DB`. | + +### 4. Overview + +SONiC already publishes **dataplane** counters via the Flex-Counter framework (`CONFIG_DB / FLEX_COUNTER_TABLE` -> `syncd` -> `COUNTERS_DB`). What is missing is **service-level** counters — software-side events such as orchagent task throughput, gNMI request rate, BMP message error counts. Without these we cannot answer questions like *"is orchagent draining tasks?"*, *"is gNMI seeing subscribe failures?"*, *"is one container dropping more events than its peers?"*. + +A naive implementation would put this plumbing — atomic counters, dirty tracking, a 1-second writer thread, and a Redis-side schema — directly inside each container. That is unacceptable: every container would need its own concurrency review, bug fixes would drift, and the on-the-wire schemas would diverge. + +This HLD specifies a single, reusable producer that: + +1. accumulates counters in process-local atomic state with negligible hot-path cost, +2. mirrors them to `COUNTERS_DB` so `redis-cli`, `show ... stats` CLIs, and any other on-box tooling continue to work, +3. exposes a stable public API so each container only needs to write a thin (~100 LoC) facade. 
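The three properties above can be sketched in a few lines of C++. This is a minimal illustration, not the `swss::ComponentStats` implementation: the names `TinyStats` and `collectDirty` are hypothetical, the Redis write is left out, and for brevity the lookup takes the mutex on every call, whereas the design described in this HLD takes it only on first use of an entity or metric.

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical sketch of "atomic counters + per-entity dirty tracking".
class TinyStats {
public:
    using Row = std::map<std::string, uint64_t>;

    // Producer side: bump one counter, then mark its entity dirty.
    void increment(const std::string& entity, const std::string& metric, uint64_t n = 1) {
        if (n == 0) return;
        Entity* e = nullptr;
        std::atomic<uint64_t>* c = nullptr;
        {
            // Simplification: the structural lock is taken on every lookup.
            std::lock_guard<std::mutex> lock(m_mutex);
            e = &m_entities[entity];
            auto& slot = e->counters[metric];
            if (!slot) slot.reset(new std::atomic<uint64_t>(0));
            c = slot.get();
        }
        c->fetch_add(n, std::memory_order_relaxed);          // counter value
        e->version.fetch_add(1, std::memory_order_release);  // dirty bump
    }

    // Writer side: return only the entities that changed since the last call.
    std::map<std::string, Row> collectDirty() {
        std::map<std::string, Row> dirty;
        std::lock_guard<std::mutex> lock(m_mutex);
        for (auto& kv : m_entities) {
            const uint64_t v = kv.second.version.load(std::memory_order_acquire);
            uint64_t& last = m_lastVersion[kv.first];
            if (v == last) continue;  // clean entity: nothing to publish
            last = v;
            Row& row = dirty[kv.first];
            for (auto& c : kv.second.counters)
                row[c.first] = c.second->load(std::memory_order_relaxed);
        }
        return dirty;
    }

private:
    struct Entity {
        std::map<std::string, std::unique_ptr<std::atomic<uint64_t>>> counters;
        std::atomic<uint64_t> version{0};
    };
    std::mutex m_mutex;
    std::map<std::string, Entity> m_entities;
    std::map<std::string, uint64_t> m_lastVersion;
};
```

A periodic writer would call `collectDirty()` once per interval and write each returned row to its `COUNTERS_DB` hash. An empty result means the cycle produces no outbound traffic.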
+ +How the `COUNTERS_DB` rows then reach Geneva or any other off-box system is the responsibility of the [Reporting HLD](./component-stats-reporting-hld.md). + +### 5. Requirements + +**Functional** + +- R1. A reusable C++ library shall accumulate per-component, per-entity, per-metric `uint64` counters. +- R2. The library shall publish counters to `COUNTERS_DB` under a uniform key layout `_STATS:` (Redis hash, fields = metric names, values = decimal `uint64`). The exact key/field contract is normatively defined in the Reporting HLD. +- R3. The library shall be usable by any SONiC container by writing a thin facade that owns only the container-specific metric vocabulary. +- R4. The first consumer of the library is the SWSS-specific facade `SwssStats` (in `sonic-swss/orchagent/`), which exposes a small SWSS-specific public surface: a global `gSwssStatsRecord` enable flag, `SwssStats::getInstance()`, and `recordTask` / `recordComplete` / `recordError` methods. +- R5. The `SwssStats` facade shall write into `COUNTERS_DB` under keys `SWSS_STATS:
` with hash fields `SET` / `DEL` / `COMPLETE` / `ERROR`, following the uniform schema in R2. + +**Non-functional** + +- R6. The hot path (`increment` / `setValue`) shall be lock-free and constant-time after the first use of a given (entity, metric) pair. +- R7. Construction of a `ComponentStats` instance shall not crash the host process if Redis is not yet reachable; the sink shall connect lazily and retry independently. +- R8. A failure in the sink (Redis down) shall not affect the hot path. After recovery, no monotonic data point shall be lost beyond intermediate samples (the next successful flush carries the latest cumulative value). +- R9. Idle systems shall produce zero outbound traffic on the sink (driven by per-entity dirty tracking). + +**Out of scope** + +- The reporting pipeline that consumes the `COUNTERS_DB` rows (telegraf, mdm, Geneva, etc.) — see the [Reporting HLD](./component-stats-reporting-hld.md). +- Replacing existing FlexCounter / SAI counter pipelines (those measure dataplane state via SAI; this design measures control-plane software events). +- Defining the metric vocabulary for non-swss containers (`gnmi`, `bmp`, `telemetry`, …); this is left as future work. + +### 6. Architecture Design + +The architecture is unchanged at the SONiC system level. A new library is introduced in `sonic-swss-common`, and a new SWSS-specific facade (its first consumer) is added in `sonic-swss`; future containers may add their own facades using the same library. + +``` ++---------------------------- SONiC switch ------------------------------+ +| | +| orchagent (sonic-swss) gnmi / bmp / telemetry / ... | +| +----------------------+ +----------------------+ | +| | orch.cpp + SwssStats | ... 
| gnmistats / bmpstats | | +| +----------+-----------+ +----------+-----------+ | +| | instrument | | +| v v | +| +----------------------------------------------------------+ | +| | swss::ComponentStats (in libswsscommon) | | +| | +---------------------------------------------------+ | | +| | | atomic counters + dirty tracking + writer thread | | | +| | +-------------------------+-------------------------+ | | +| | | | | +| | DB sink | | +| | (Redis HSET via swss::Table) | | +| +-----------------------------+----------------------------+ | +| | | +| v | +| +-------------------------+ | +| | COUNTERS_DB | | +| | SWSS_STATS:PORT_TABLE | | +| | GNMI_STATS:/iface/... | | +| | BMP_STATS:... | | +| | | | +| | used by: | | +| | - redis-cli | | +| | - show stats CLI | | +| | - reporting pipeline | --> see Reporting HLD | +| +-------------------------+ | ++------------------------------------------------------------------------+ +``` + +**Layering rule.** `swss-common` knows nothing of orchagent or any specific container; each container knows only its own facade plus `swss::ComponentStats`. New containers get the sink for free by writing a ~100-line wrapper. + +**Sink design properties.** + +- *One source of truth.* The sink consumes the atomic-counter snapshot inside `ComponentStats`. +- *No new transport for local debugging.* The `COUNTERS_DB` layout follows the existing convention so `redis-cli`, `show ... stats` CLIs, and any in-band tooling keep working. +- *Sink isolation from hot path.* Failures in the sink (Redis unreachable) do not affect the hot path; they are logged and retried. + +### 7. High-Level Design + +#### 7.1 Repositories changed + +| Repository | What changes | +|--------------------------------|-----------------------------------------------------------------------------| +| `sonic-net/sonic-swss-common` | New library `swss::ComponentStats` + unit tests ([PR #1180](https://github.com/sonic-net/sonic-swss-common/pull/1180)). 
|
+| `sonic-net/sonic-swss` | New `SwssStats` thin facade over `ComponentStats` in `orchagent/` ([PR #4516](https://github.com/sonic-net/sonic-swss/pull/4516)). |
+
+No platform-specific code is added. No SAI changes. No syncd changes.
+
+#### 7.2 `swss::ComponentStats` — public API
+
+```cpp
+#include <cstdint>
+#include <map>
+#include <memory>
+#include <string>
+
+namespace swss {
+
+class ComponentStats {
+public:
+    // Metric name -> current value for one entity.
+    using CounterSnapshot = std::map<std::string, uint64_t>;
+
+    // Sink configuration. The DB sink is on by default; additional
+    // sinks (e.g. OTLP) may be added by future revisions and are kept
+    // off by default.
+    struct SinkConfig {
+        bool enableDb = true;   // mirror to COUNTERS_DB
+    };
+
+    // Returns the existing instance when called again with the same name.
+    static std::shared_ptr<ComponentStats> create(
+        const std::string& componentName,
+        const std::string& dbName = "COUNTERS_DB",
+        uint32_t intervalSec = 1,
+        const SinkConfig& sinks = SinkConfig{});
+
+    // Hot path (see R6).
+    void increment(const std::string& entity, const std::string& metric, uint64_t n = 1);
+    void setValue (const std::string& entity, const std::string& metric, uint64_t value);
+
+    // Read back current values.
+    uint64_t        get   (const std::string& entity, const std::string& metric);
+    CounterSnapshot getAll(const std::string& entity);
+
+    // Runtime enable flag, checked on the hot path.
+    void setEnabled(bool on);
+    bool isEnabled() const;
+
+    // Wakes and stops the writer thread.
+    void stop();
+};
+
+} // namespace swss
+```
+
+`create()` consults a process-wide registry keyed by `componentName`. A second call with the same name returns the existing instance, ensuring containers cannot accidentally start multiple writer threads against the same Redis prefix.
+
+#### 7.3 Internal state
+
+Per instance:
+- `m_entities`: a `std::map` keyed by entity name. `std::map` is node-based, so references returned by `getOrCreateEntity` remain valid after later inserts.
+- `EntityStats` holds a map from metric name to a heap-allocated counter (heap-allocated because `std::atomic` is not movable) plus a per-entity atomic `version`.
+- `m_mutex` guards only the **structure** of the maps (insert/find). Hot-path reads/writes of counter values use `std::atomic` and skip the mutex after the first use.
+- `m_running`, `m_enabled`: atomic flags.
+- `m_cv`: wakes the writer thread immediately on `stop()` instead of waiting up to `intervalSec`. +- `m_thread`: owns the writer. + +Process-wide: +- `registry`: a `std::map` from `componentName` to a `weak_ptr` to the instance (`weak_ptr` so a fully released instance can be destroyed). + +#### 7.4 Hot path + +```cpp +void ComponentStats::increment(const std::string& entity, const std::string& metric, uint64_t n) { + if (!isEnabled() || n == 0) return; + + auto& e = getOrCreateEntity(entity); // mutex on first use only + auto& c = getOrCreateCounter(e, metric); // mutex on first use only + + c.value .fetch_add(n, std::memory_order_relaxed); // (1) counter + e.version.fetch_add(1, std::memory_order_release); // (2) dirty-bump (release) +} +``` + +Cost after warm-up: two atomic RMWs. No mutex acquisition, no allocation, no syscall. + +#### 7.5 Writer thread + +Runs at `intervalSec` (default 1 s) and flushes the snapshot to the DB sink: + +``` ++---------------------------------------------------------------+ +| Phase A - connect the DB sink (run once, with retry) | +| loop until m_running == false: | +| if !dbConnected: try connect Redis | +| if connected: break | +| else cv.wait_for(intervalSec, predicate=!m_running) | ++---------------------------------------------------------------+ ++---------------------------------------------------------------+ +| Phase B - flush loop | +| loop: | +| cv.wait_for(intervalSec, predicate=!m_running) | +| if !m_running: break | +| | +| # SNAPSHOT (under lock) | +| for each entity e in m_entities: | +| v = e.version.load(acquire) <- pairs (2) | +| if lastVersion[e.name] == v: continue (skip clean) | +| lastVersion[e.name] = v | +| row = [(metric, c.value.load(relaxed)) for c in e] | +| enqueue(e.name, row) | +| | +| # FAN-OUT (lock released) | +| for (name, row) in queue: | +| try: m_table->set(name, stringify(row)) | +| catch: log warn, continue | ++---------------------------------------------------------------+ +``` + +Three properties: + +1. 
*Lock released before any I/O.* Round-trips under the structural lock would briefly stall every concurrent `increment()`. +2. *Idle systems generate zero outbound traffic.* When no entity has changed, the queue is empty and the sink is not touched. +3. *Hot-path isolation.* A sink failure is logged and skipped; the hot path is never blocked. + +#### 7.6 Memory ordering correctness + +The release/acquire pair ((2) in 7.4 ↔ acquire-load in 7.5) guarantees: + +> If the writer reads `version == N`, then every counter mutation that contributed to bumping the version up to `N` has already happened-before the reader and is visible. + +Without it, on weakly ordered architectures (ARM, POWER) the writer could see the new version but read an old counter value, recording a stale snapshot. + +#### 7.7 `SwssStats` thin facade + +`SwssStats` (in `sonic-swss/orchagent/`) is reduced to a translation layer that owns only the SWSS-specific vocabulary and the global enable flag consumed by `orch.cpp`: + +```cpp +SwssStats::SwssStats() : m_impl(swss::ComponentStats::create("SWSS")) {} + +void SwssStats::recordTask(const std::string& t, const std::string& op) { + if (op == "SET") m_impl->increment(t, "SET"); + else if (op == "DEL") m_impl->increment(t, "DEL"); +} +void SwssStats::recordComplete(const std::string& t, uint64_t n) { m_impl->increment(t, "COMPLETE", n); } +void SwssStats::recordError (const std::string& t, uint64_t n) { m_impl->increment(t, "ERROR", n); } +``` + +The whole file is ~130 lines of straightforward delegation. **The public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, `recordTask`/`recordComplete`/`recordError`) and the on-the-wire `SWSS_STATS:
` Redis layout are deliberately kept narrow and stable so that the SWSS-specific vocabulary remains independent of future evolution of the underlying `ComponentStats` library.** + +The exact `SWSS_STATS:
` schema (key layout, field names, types) is documented in the [Reporting HLD](./component-stats-reporting-hld.md), which owns the contract with downstream consumers. + +#### 7.8 Adopting the library in a new container + +To add equivalent metrics to another container, e.g. `gnmi`, write a facade analogous to §7.7: + +```cpp +class GnmiStats { +public: + static GnmiStats* getInstance(); + void recordSubscribe(const std::string& path) { m_impl->increment(path, "SUBSCRIBE"); } + void recordError (const std::string& path) { m_impl->increment(path, "ERROR"); } +private: + GnmiStats() : m_impl(swss::ComponentStats::create("GNMI")) {} + std::shared_ptr<swss::ComponentStats> m_impl; +}; +``` + +Result: counters land in `COUNTERS_DB` under keys `GNMI_STATS:`. The facade needs no hand-written threads, Redis client management, or test harness. Reporting then picks them up automatically via the pipeline described in the Reporting HLD. + +### 8. SAI API + +No SAI API changes are required for this feature. This design measures control-plane software events inside SONiC containers; it does not query or modify any SAI state. + +### 9. Configuration and management + +Not applicable. This HLD introduces no new CLI commands, YANG models, manifests, or `CONFIG_DB` schema. Existing CLIs that already read `COUNTERS_DB` (e.g. `redis-cli -n 2 HGETALL`, `show ... stats` style commands) continue to work and gain visibility into the new `_STATS:` keys for free. + +### 10. Warmboot and Fastboot Design Impact + +Counters are kept in process memory and are reset on container restart, including warmboot and fastboot. This is acceptable because consumers (dashboards, alerts) compute rate-of-change rather than absolute values. + +#### Warmboot and Fastboot Performance Impact + +- The library does **not** add any stalls, sleeps, or I/O operations to the boot critical chain. Construction is non-blocking; the writer thread connects to Redis lazily and retries in the background, so a not-yet-ready dependency cannot delay container start. 
+- No CPU-heavy processing (Jinja templates, etc.) is added in the boot path. +- No third-party dependency is updated by this HLD. +- The library does not delay any service or Docker container. + +No measurable boot-time degradation is expected. + +### 11. Memory Consumption + +- Per-instance footprint: O(entities × metrics) `uint64` slots plus their `std::map` keys. Bounded by the number of orchagent tables (≈ tens) for the SWSS facade. +- When the feature is disabled at runtime via `setEnabled(false)`, the hot path becomes inert and the writer thread's queue stays empty; memory remains bounded. + +### 12. Restrictions/Limitations + +- Counters reset to zero on container restart by design. Consumers must compute rate-of-change rather than rely on absolute values across restarts. +- The library does not retain history; it relies on downstream consumers (`COUNTERS_DB` readers, the reporting pipeline) for retention. +- The structural mutex (`m_mutex`) is acquired only on the *first* use of a given (entity, metric) pair. Workloads that constantly mint new entity names will see one mutex acquisition per new name; this is not the expected pattern for SONiC containers. + +### 13. 
Testing Requirements/Design + +#### 13.1 Unit Test cases + +Library unit tests live in `sonic-swss-common/tests/componentstats_ut.cpp`: + +| # | Test | What it proves | +|---|----------------------------|---------------------------------------------------------------------------------------------| +| 1 | BasicIncrement | `increment` + `get` round-trip | +| 2 | MultipleMetrics | metric isolation within an entity | +| 3 | MultipleEntities | entity isolation within a component | +| 4 | SetValueOverwrites | gauge semantics | +| 5 | DisabledIsNoOp | `setEnabled(false)` makes hot path inert | +| 6 | GetAllReturnsSnapshot | bulk read returns the right shape | +| 7 | ConcurrentIncrements | 8 threads × 10 000 increments → exactly 80 000 (no torn writes, no lost updates) | +| 8 | SingletonSameName | `create("X")` returns the same instance | +| 9 | SingletonDifferentNames | `create("X") ≠ create("Y")` | + +A facade-level test suite `swssstats_ut.cpp` (9 cases) is added in `sonic-swss` and exercises the SwssStats vocabulary (`recordTask`/`recordComplete`/`recordError`, `gSwssStatsRecord` enable flag, singleton behaviour) end-to-end against the new backend. + +Run: + +``` +cd sonic-swss-common && ./autogen.sh && ./configure && make check +./tests/tests --gtest_filter='ComponentStats*' +``` + +#### 13.2 System Test cases + +- Boot a `sonic-vs` image built with the two companion PRs. +- Exercise orchagent (e.g. `config vlan add`, `config interface ip add`). +- Verify on-box DB sink: + ``` + redis-cli -n 2 KEYS "SWSS_STATS:*" + redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE" + ``` + Counters increment in proportion to operations; idle dwell shows zero further writes (dirty tracking working). +- Confirm warmboot and fastboot are unaffected (no boot-time regression, no service startup ordering change). + +End-to-end validation of the reporting path (telegraf → mdm → Geneva) is covered in the [Reporting HLD](./component-stats-reporting-hld.md). + +### 14. 
Open/Action items + +- Phase 1 (this HLD's two PRs) lands the `ComponentStats` library and the `SwssStats` facade with the DB sink fully active. +- Phase 2 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own facades. Each is a self-contained PR in the relevant repository. +- Phase 3 (future) may add direct OTLP export from the library to a local agent for components that need lower reporting latency than the DB → telegraf path provides. Out of scope for this HLD. diff --git a/doc/component-stats/component-stats-hld.md b/doc/component-stats/component-stats-hld.md deleted file mode 100644 index 4d5559a6e96..00000000000 --- a/doc/component-stats/component-stats-hld.md +++ /dev/null @@ -1,425 +0,0 @@ -# SONiC Component Statistics HLD - -## Table of Content - -- [Revision](#1-revision) -- [Scope](#2-scope) -- [Definitions/Abbreviations](#3-definitionsabbreviations) -- [Overview](#4-overview) -- [Requirements](#5-requirements) -- [Architecture Design](#6-architecture-design) -- [High-Level Design](#7-high-level-design) -- [SAI API](#8-sai-api) -- [Configuration and management](#9-configuration-and-management) -- [Warmboot and Fastboot Design Impact](#10-warmboot-and-fastboot-design-impact) -- [Memory Consumption](#11-memory-consumption) -- [Restrictions/Limitations](#12-restrictionslimitations) -- [Testing Requirements/Design](#13-testing-requirementsdesign) -- [Open/Action items](#14-openaction-items) - -### 1. Revision - -| Rev | Date | Author | Change Description | -|-----|------------|---------------|--------------------------| -| 0.1 | 2026-04-28 | Yutong Zhang | Initial revision | - -### 2. Scope - -This HLD specifies a reusable mechanism for exposing **service-level (control-plane software) counters** from SONiC containers. It introduces: - -1. A new shared library `swss::ComponentStats` in `sonic-swss-common`. -2. A SWSS-specific facade `SwssStats` in `sonic-swss` built on top of that library, which is the first consumer. 
- -The library publishes counters to: - -1. `COUNTERS_DB`, for parity with the existing Flex-Counter pipeline and for on-box diagnostic tooling (`redis-cli`, `show ... stats`). -2. A local OpenTelemetry (OTLP) Collector sidecar, so the same counters can be forwarded to off-box telemetry systems (e.g. Geneva mdm) that consume OTLP. - -Configuration of the OTel Collector itself, off-box telemetry endpoints, dashboards, and alerts are explicitly **out of scope** for this HLD. - -### 3. Definitions/Abbreviations - -| Term | Definition | -|-----------------|---------------------------------------------------------------------------------------------| -| Component | A SONiC container that produces service-level counters (e.g. `swss`, `gnmi`, `bmp`). | -| Entity | A logical grouping of metrics inside a component (e.g. an orchagent table, a gNMI path). | -| Metric | A named uint64 counter or gauge inside an entity (e.g. `SET`, `DEL`, `COMPLETE`, `ERROR`). | -| ComponentStats | The new shared library in `sonic-swss-common` providing the producer mechanism. | -| SwssStats | A SWSS-specific facade over `ComponentStats` (lives in `sonic-swss`). | -| DB sink | The output path that mirrors counters into `COUNTERS_DB`. | -| OTLP sink | The output path that exports counters via OpenTelemetry Protocol to a local OTel Collector. | -| OTel Collector | A locally-running OpenTelemetry Collector sidecar; not delivered by this HLD. | - -### 4. Overview - -SONiC already publishes **dataplane** counters via the Flex-Counter framework (`CONFIG_DB / FLEX_COUNTER_TABLE` → `syncd` → `COUNTERS_DB`). What is missing is **service-level** counters — software-side events such as orchagent task throughput, gNMI request rate, BMP message error counts. Without these we cannot answer questions like *"is orchagent draining tasks?"*, *"is gNMI seeing subscribe failures?"*, *"is one container dropping more events than its peers?"*. 
- -A naive implementation would put this plumbing — atomic counters, dirty tracking, a 1-second writer thread, a Redis-side schema, an OTLP exporter — directly inside each container. That is unacceptable: every container would need its own concurrency review, bug fixes would drift, and the on-the-wire schemas would diverge. - -This HLD specifies a single, reusable producer that: - -1. accumulates counters in process-local atomic state with negligible hot-path cost, -2. mirrors them to `COUNTERS_DB` so `redis-cli`, `show ... stats` CLIs, and any other on-box tooling continue to work, -3. emits them as OTLP metrics to a local OTel Collector for forwarding to off-box telemetry systems, -4. exposes a stable public API so each container only needs to write a thin (~100 LoC) facade. - -### 5. Requirements - -**Functional** - -- R1. A reusable C++ library shall accumulate per-component, per-entity, per-metric `uint64` counters. -- R2. The library shall publish counters to `COUNTERS_DB` under a uniform key layout `_STATS:` (Redis hash, fields = metric names, values = decimal `uint64`). -- R3. The library shall publish the same counters as OpenTelemetry OTLP records to a configurable endpoint (default `localhost:4317`). -- R4. The library shall be usable by any SONiC container by writing a thin facade that owns only the container-specific metric vocabulary. -- R5. The first consumer of the library is the SWSS-specific facade `SwssStats` (in `sonic-swss/orchagent/`), which exposes a small SWSS-specific public surface: a global `gSwssStatsRecord` enable flag, `SwssStats::getInstance()`, and `recordTask` / `recordComplete` / `recordError` methods. -- R6. The `SwssStats` facade shall write into `COUNTERS_DB` under keys `SWSS_STATS:
` with hash fields `SET` / `DEL` / `COMPLETE` / `ERROR`, following the uniform schema in R2. - -**Non-functional** - -- R7. The hot path (`increment` / `setValue`) shall be lock-free and constant-time after the first use of a given (entity, metric) pair. -- R8. Construction of a `ComponentStats` instance shall not crash the host process if Redis or the OTel Collector is not yet reachable; both sinks shall connect lazily and retry independently. -- R9. A failure in one sink (Redis down, OTel Collector restarting) shall not affect the other sink and shall not affect the hot path. After recovery, no monotonic data point shall be lost beyond intermediate samples (the next successful flush carries the latest cumulative value). -- R10. Idle systems shall produce zero outbound traffic on either sink (driven by per-entity dirty tracking). - -**Out of scope** - -- The OTel Collector itself, including its image, configuration, exporter pipeline to off-box telemetry systems, authentication, and operator onboarding. -- Replacing existing FlexCounter / SAI counter pipelines (those measure dataplane state via SAI; this design measures control-plane software events). -- Defining the metric vocabulary for non-swss containers (`gnmi`, `bmp`, `telemetry`, …); this is left as future work. - -### 6. Architecture Design - -The architecture is unchanged at the SONiC system level. A new library is introduced in `sonic-swss-common`, and a new SWSS-specific facade (its first consumer) is added in `sonic-swss`; future containers may add their own facades using the same library. 
- -``` -┌────────────────────────────── SONiC switch ──────────────────────────────┐ -│ │ -│ orchagent (sonic-swss) gnmi / bmp / telemetry / … │ -│ ┌──────────────────────┐ ┌──────────────────────┐ │ -│ │ orch.cpp + SwssStats │ … │ gnmistats / bmpstats │ │ -│ └──────────┬───────────┘ └──────────┬───────────┘ │ -│ │ instrument │ │ -│ ▼ ▼ │ -│ ┌────────────────────────────────────────────────────────────┐ │ -│ │ swss::ComponentStats (in libswsscommon) │ │ -│ │ ┌─────────────────────────────────────────────────────┐ │ │ -│ │ │ atomic counters + dirty tracking + writer thread │ │ │ -│ │ └──────────────┬──────────────────────────┬───────────┘ │ │ -│ │ │ │ │ │ -│ │ DB sink OTLP sink │ │ -│ │ (Redis HSET via swss::Table) (OTLP/gRPC, localhost) │ │ -│ └──────────┬──────────────────────────────────┬──────────────┘ │ -│ │ │ │ -│ ▼ ▼ │ -│ ┌──────────────────────────┐ ┌────────────────────────────┐ │ -│ │ COUNTERS_DB │ │ Local OTel Collector │ │ -│ │ SWSS_STATS:PORT_TABLE │ │ (sidecar container) │ │ -│ │ GNMI_STATS:/iface/… │ │ │ │ -│ │ BMP_STATS:… │ │ batches, retries, adds │ │ -│ │ │ │ resource attrs, exports │ │ -│ │ used by: redis-cli, │ │ to off-box telemetry │ │ -│ │ show stats CLI, local │ └─────────────┬──────────────┘ │ -│ │ diagnostic tools │ │ │ -│ └──────────────────────────┘ │ │ -│ │ OTLP │ -└──────────────────────────────────────────────────┼───────────────────────┘ - │ - ▼ - ┌────────────────────┐ - │ Off-box telemetry │ - │ (e.g. Geneva mdm) │ - └────────────────────┘ -``` - -**Layering rule.** `swss-common` knows nothing of orchagent or any specific container; each container knows only its own facade plus `swss::ComponentStats`. New containers get both sinks for free by writing a ~100-line wrapper. - -**Dual-sink design properties.** - -- *One source of truth.* Both sinks consume the same atomic-counter snapshot inside `ComponentStats`. They cannot diverge: if the OTel pipeline is briefly down, `COUNTERS_DB` still reflects current state, and vice versa. 
-- *No new transport for local debugging.* The `COUNTERS_DB` layout is unchanged, so `redis-cli`, `show ... stats` CLIs, and any existing in-band tooling keep working. -- *No off-box-system-specific code in containers.* Containers know only `ComponentStats`; the OTLP sink talks to a local OTel Collector at `localhost:4317`, and the Collector handles everything beyond that hop. -- *Independent failure domains.* Failures in one sink (DB unreachable, OTel agent restarting) do not affect the other or the hot path. - -### 7. High-Level Design - -#### 7.1 Repositories changed - -| Repository | What changes | -|--------------------------------|-----------------------------------------------------------------------------| -| `sonic-net/sonic-swss-common` | New library `swss::ComponentStats` + unit tests ([PR #1180](https://github.com/sonic-net/sonic-swss-common/pull/1180)). | -| `sonic-net/sonic-swss` | New `SwssStats` thin facade over `ComponentStats` in `orchagent/` ([PR #4516](https://github.com/sonic-net/sonic-swss/pull/4516)). | - -No platform-specific code is added. No SAI changes. No syncd changes. - -#### 7.2 `swss::ComponentStats` — public API - -```cpp -namespace swss { - -class ComponentStats { -public: - using CounterSnapshot = std::map; - - // Sink configuration. Both sinks default to "on". 
- struct SinkConfig { - bool enableDb = true; // mirror to COUNTERS_DB - bool enableOtlp = true; // export to local OTel Collector - std::string otlpEndpoint = "localhost:4317"; // OTLP/gRPC endpoint - std::string serviceName; // OTel resource attr (default: componentName) - std::string serviceInstanceId; // OTel resource attr (default: hostname) - }; - - static std::shared_ptr create( - const std::string& componentName, - const std::string& dbName = "COUNTERS_DB", - uint32_t intervalSec = 1, - const SinkConfig& sinks = SinkConfig{}); - - void increment(const std::string& entity, const std::string& metric, uint64_t n = 1); - void setValue (const std::string& entity, const std::string& metric, uint64_t value); - - uint64_t get (const std::string& entity, const std::string& metric); - CounterSnapshot getAll(const std::string& entity); - - void setEnabled(bool on); - bool isEnabled() const; - void stop(); -}; - -} // namespace swss -``` - -`create()` consults a process-wide registry keyed by `componentName`. A second call with the same name returns the existing instance, ensuring containers cannot accidentally start multiple writer threads against the same Redis prefix. - -#### 7.3 Internal state - -Per instance: -- `m_entities : std::map` — `std::map` (not `unordered_map`) so references returned by `getOrCreateEntity` remain valid after later inserts. -- `EntityStats` holds `map>` (heap-allocated because `std::atomic` is not movable) plus a per-entity `atomic version`. -- `m_mutex` guards only the **structure** of the maps (insert/find). Hot-path reads/writes of counter values use `std::atomic` and skip the mutex after the first use. -- `m_running`, `m_enabled` — atomic flags. -- `m_cv` — wakes the writer thread immediately on `stop()` instead of waiting up to `intervalSec`. -- `m_thread` — owns the writer. - -Process-wide: -- `registry : std::map>` (`weak_ptr` so a fully released instance can be destroyed). 
- -#### 7.4 Hot path - -```cpp -void ComponentStats::increment(const string& entity, const string& metric, uint64_t n) { - if (!isEnabled() || n == 0) return; - - auto& e = getOrCreateEntity(entity); // mutex on first use only - auto& c = getOrCreateCounter(e, metric); // mutex on first use only - - c.value .fetch_add(n, memory_order_relaxed); // ① counter - e.version.fetch_add(1, memory_order_release); // ② dirty-bump (release) -} -``` - -Cost after warm-up: two atomic RMWs. No mutex acquisition, no allocation, no syscall. - -#### 7.5 Writer thread - -Runs at `intervalSec` (default 1 s) and fans the snapshot out to both sinks: - -``` -┌───────────────────────────────────────────────────────────────┐ -│ Phase A — connect each enabled sink (run once, with retry) │ -│ loop until m_running == false: │ -│ if enableDb and !dbConnected: try connect Redis │ -│ if enableOtlp and !otlpConnected: try open OTLP exporter │ -│ if all enabled sinks connected: break │ -│ else cv.wait_for(intervalSec, predicate=!m_running) │ -└───────────────────────────────────────────────────────────────┘ -┌───────────────────────────────────────────────────────────────┐ -│ Phase B — flush loop │ -│ loop: │ -│ cv.wait_for(intervalSec, predicate=!m_running) │ -│ if !m_running: break │ -│ │ -│ # SNAPSHOT (under lock) — single snapshot, two sinks │ -│ for each entity e in m_entities: │ -│ v = e.version.load(acquire) ← pairs ② │ -│ if lastVersion[e.name] == v: continue (skip clean)│ -│ lastVersion[e.name] = v │ -│ row = [(metric, c.value.load(relaxed)) for c in e] │ -│ enqueue(name, row) │ -│ │ -│ # FAN-OUT (lock released, sinks fail independently) │ -│ if enableDb: │ -│ for (name, row) in queue: │ -│ try: m_table->set(name, stringify(row)) │ -│ catch: log warn, continue │ -│ │ -│ if enableOtlp: │ -│ build OTLP ResourceMetrics{ … } from queue │ -│ try: m_otlp->Export(batch) │ -│ catch: log warn, continue │ -└───────────────────────────────────────────────────────────────┘ -``` - -Three properties: - 
-1. *Lock released before any I/O.* Round-trips under the structural lock would briefly stall every concurrent `increment()`. -2. *Idle systems generate zero outbound traffic on either sink.* When no entity has changed, the queue is empty and neither sink is touched. -3. *Sink isolation.* A failure in one sink is logged and skipped; the other sink still publishes the same cycle's snapshot. - -#### 7.6 Memory ordering correctness - -The release/acquire pair (`②` in 7.4 ↔ acquire-load in 7.5) guarantees: - -> If the writer reads `version == N`, then every counter mutation that contributed to bumping the version up to `N` has already happened-before the reader and is visible. - -Without it, on weakly ordered architectures (ARM, POWER) the writer could see the new version but read an old counter value, recording a stale snapshot. - -#### 7.7 OTLP sink details - -- **Wire format.** OTLP/gRPC over plaintext `localhost:4317`. No TLS or authentication on the local hop — the loopback link is inside the switch, and any off-box credentials live in the OTel Collector. OTLP/HTTP is supported as a build option but not the default. -- **Metric model.** Counters set via `increment()` are exported as OTLP `Sum` with `aggregation_temporality = CUMULATIVE` and `is_monotonic = true`. Counters set via `setValue()` (gauges) are exported as OTLP `Gauge`. -- **Resource attributes** attached to every batch: `service.name=`, `service.instance.id=`, `sonic.component=`. -- **Metric attributes** attached to every data point: `entity` — the table name / gNMI path / etc. The entity is a *label*, not part of the metric name, so dashboards can pivot freely. -- **Metric name** convention: `sonic..` (e.g. `sonic.swss.SET`, `sonic.gnmi.SUBSCRIBE`). -- **Batching / retry.** The producer does not batch beyond one `intervalSec` snapshot and does not retry. Batching, queuing, retrying, and back-pressure are the local OTel Collector's responsibility. 
-- **Container restart.** `start_time_unix_nano` is captured once in the constructor and advances on every container restart. This is the OTel-defined signal for counter reset; consumers handle it natively. - -#### 7.8 `COUNTERS_DB` sink details - -For component name `C` and entity `E`: - -``` -COUNTERS_DB key: "_STATS:" -hash fields: each metric name → uint64_t string -``` - -Example: `redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE"` → - -``` -1) "SET" -2) "1283" -3) "DEL" -4) "17" -5) "COMPLETE" -6) "1300" -7) "ERROR" -8) "0" -``` - -The shape mirrors the existing `COUNTERS:*` keys produced by the Flex-Counter pipeline. - -#### 7.9 `SwssStats` thin facade - -`SwssStats` (in `sonic-swss/orchagent/`) is reduced to a translation layer that owns only the SWSS-specific vocabulary and the global enable flag consumed by `orch.cpp`: - -```cpp -SwssStats::SwssStats() : m_impl(swss::ComponentStats::create("SWSS")) {} - -void SwssStats::recordTask(const std::string& t, const std::string& op) { - if (op == "SET") m_impl->increment(t, "SET"); - else if (op == "DEL") m_impl->increment(t, "DEL"); -} -void SwssStats::recordComplete(const std::string& t, uint64_t n) { m_impl->increment(t, "COMPLETE", n); } -void SwssStats::recordError (const std::string& t, uint64_t n) { m_impl->increment(t, "ERROR", n); } -``` - -The whole file is ~130 lines of straightforward delegation. **The public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, `recordTask`/`recordComplete`/`recordError`) and the on-the-wire `SWSS_STATS:
` Redis layout are deliberately kept narrow and stable so that the SWSS-specific vocabulary remains independent of future evolution of the underlying `ComponentStats` library.** - -#### 7.10 Adopting the library in a new container - -To add equivalent metrics to e.g. `gnmi`, write a facade analogous to §7.9: - -```cpp -class GnmiStats { -public: - static GnmiStats* getInstance(); - void recordSubscribe(const std::string& path) { m_impl->increment(path, "SUBSCRIBE"); } - void recordError (const std::string& path) { m_impl->increment(path, "ERROR"); } -private: - GnmiStats() : m_impl(swss::ComponentStats::create("GNMI")) {} - std::shared_ptr m_impl; -}; -``` - -Result: counters land in `COUNTERS_DB` under keys `GNMI_STATS:` **and** are exported as OTLP metrics `sonic.gnmi.SUBSCRIBE` / `sonic.gnmi.ERROR` (with attribute `entity=`). No new threads, no new Redis or gRPC client management, no new test harness needed. - -### 8. SAI API - -No SAI API changes are required for this feature. This design measures control-plane software events inside SONiC containers; it does not query or modify any SAI state. - -### 9. Configuration and management - -Not applicable. This HLD introduces no new CLI commands, YANG models, manifests, or `CONFIG_DB` schema. Existing CLIs that already read `COUNTERS_DB` (e.g. `redis-cli -n 2 HGETALL`, `show ... stats` style commands) continue to work and gain visibility into the new `_STATS:` keys for free. - -### 10. Warmboot and Fastboot Design Impact - -Counters are kept in process memory and are reset on container restart, including warmboot and fastboot. This is acceptable because consumers (dashboards, alerts) compute rate-of-change rather than absolute values. The OTLP `start_time_unix_nano` attribute advances on every restart, which is the OTel-standard signal for counter reset and is handled natively by OTel-aware consumers. 
- -#### Warmboot and Fastboot Performance Impact - -- The library does **not** add any stalls, sleeps, or I/O operations to the boot critical chain. Construction is non-blocking; the writer thread connects to Redis and to the OTel Collector lazily and retries in the background, so a not-yet-ready dependency cannot delay container start. -- No CPU-heavy processing (Jinja templates, etc.) is added in the boot path. -- No third-party dependency is updated by this HLD beyond linking against the OpenTelemetry C++ SDK gRPC exporter, which is loaded only when the OTLP sink is enabled. -- The library does not delay any service or Docker container. - -No measurable boot-time degradation is expected. - -### 11. Memory Consumption - -- Per-instance footprint: O(entities × metrics) `uint64` slots plus their `std::map` keys. Bounded by the number of orchagent tables (≈ tens) for the SWSS facade. -- The OTLP exporter adds a small fixed overhead (one gRPC channel, one per-cycle batch buffer). -- When the feature is disabled at runtime via `setEnabled(false)`, the hot path becomes inert and the writer thread's queue stays empty; memory remains bounded. -- When the feature is disabled at compile time (the OTLP sink can be compiled out via build option), there is no residual memory cost beyond the symbols of `swss::ComponentStats` itself; the DB sink remains unconditional. - -### 12. Restrictions/Limitations - -- Counters reset to zero on container restart by design. Consumers must compute rate-of-change rather than rely on absolute values across restarts. -- The library does not retain history; it relies on downstream consumers (`COUNTERS_DB` readers, OTel Collector) for retention. -- The OTLP sink depends on a local OTel Collector reachable at the configured endpoint. If absent, the OTLP sink retries silently in the background; the DB sink and the hot path are unaffected. -- The structural mutex (`m_mutex`) is acquired only on the *first* use of a given (entity, metric) pair. 
Workloads that constantly mint new entity names will see one mutex acquisition per new name; this is not the expected pattern for SONiC containers. - -### 13. Testing Requirements/Design - -#### 13.1 Unit Test cases - -Library unit tests live in `sonic-swss-common/tests/componentstats_ut.cpp`: - -| # | Test | What it proves | -|---|----------------------------|---------------------------------------------------------------------------------------------| -| 1 | BasicIncrement | `increment` + `get` round-trip | -| 2 | MultipleMetrics | metric isolation within an entity | -| 3 | MultipleEntities | entity isolation within a component | -| 4 | SetValueOverwrites | gauge semantics | -| 5 | DisabledIsNoOp | `setEnabled(false)` makes hot path inert | -| 6 | GetAllReturnsSnapshot | bulk read returns the right shape | -| 7 | ConcurrentIncrements | 8 threads × 10 000 increments → exactly 80 000 (no torn writes, no lost updates) | -| 8 | SingletonSameName | `create("X")` returns the same instance | -| 9 | SingletonDifferentNames | `create("X") ≠ create("Y")` | - -A facade-level test suite `swssstats_ut.cpp` (9 cases) is added in `sonic-swss` and exercises the SwssStats vocabulary (`recordTask`/`recordComplete`/`recordError`, `gSwssStatsRecord` enable flag, singleton behaviour) end-to-end against the new backend. - -Run: - -``` -cd sonic-swss-common && ./autogen.sh && ./configure && make check -./tests/tests --gtest_filter='ComponentStats*' -``` - -#### 13.2 System Test cases - -- Boot a `sonic-vs` image built with the two companion PRs. -- Exercise orchagent (e.g. `config vlan add`, `config interface ip add`). -- Verify on-box DB sink: - ``` - redis-cli -n 2 KEYS "SWSS_STATS:*" - redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE" - ``` - Counters increment in proportion to operations; idle dwell shows zero further writes (dirty tracking working). 
-- Verify OTLP sink (Phase 2): point a local OTel Collector at `localhost:4317` with a debug exporter and confirm `sonic.swss.*` metrics arrive with correct resource and metric attributes. -- Confirm warmboot and fastboot are unaffected (no boot-time regression, no service startup ordering change). - -### 14. Open/Action items - -- Phase 1 (this HLD's two PRs) lands the `ComponentStats` library and the `SwssStats` facade with the DB sink fully active and the OTLP sink stubbed (`enableOtlp=false` by default). -- Phase 2 implements the OTLP sink against the OpenTelemetry C++ SDK and is gated on the local OTel Collector sidecar being available on the switch. Coordination with whichever team owns the local OTel Collector image is required before Phase 2 can be enabled by default. -- Phase 3 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own facades. Each is a self-contained PR in the relevant repository. diff --git a/doc/component-stats/component-stats-reporting-hld.md b/doc/component-stats/component-stats-reporting-hld.md new file mode 100644 index 00000000000..7e63a281eab --- /dev/null +++ b/doc/component-stats/component-stats-reporting-hld.md @@ -0,0 +1,258 @@ +# SONiC Component Statistics — Reporting HLD + +## Table of Content + +- [Revision](#1-revision) +- [Scope](#2-scope) +- [Definitions/Abbreviations](#3-definitionsabbreviations) +- [Overview](#4-overview) +- [Requirements](#5-requirements) +- [Architecture Design](#6-architecture-design) +- [High-Level Design](#7-high-level-design) +- [SAI API](#8-sai-api) +- [Configuration and management](#9-configuration-and-management) +- [Warmboot and Fastboot Design Impact](#10-warmboot-and-fastboot-design-impact) +- [Memory Consumption](#11-memory-consumption) +- [Restrictions/Limitations](#12-restrictionslimitations) +- [Testing Requirements/Design](#13-testing-requirementsdesign) +- [Open/Action items](#14-openaction-items) + +### 1. 
Revision + +| Rev | Date | Author | Change Description | +|-----|------------|---------------|----------------------------------------------------------| +| 0.1 | 2026-05-12 | Yutong Zhang | Initial revision (split from component-stats Framework HLD) | + +### 2. Scope + +This HLD specifies how the service-level component counters produced by `swss::ComponentStats` (see the [Framework HLD](./component-stats-framework-hld.md)) are **reported** from a SONiC switch to off-box telemetry systems. + +For the initial revision the reporting path is exactly one: + +``` +component (swss/gnmi/...) + -> ComponentStats library + -> COUNTERS_DB (Redis) + -> telegraf (Geneva mdm pipeline) + -> Geneva +``` + +This HLD owns the **schema contract** between the producer (`ComponentStats`) and the consumer (telegraf). The deployment, configuration, and operation of the telegraf and mdm containers themselves are owned by the NDM "Geneva integration with SONiC" HLD; this document references them but does not duplicate them. + +Direct application-side OTLP export (e.g. the `OpenTelemetry SDK -> mdm` path described in the NDM HLD §4) is **not** part of this revision; it is listed as future work in §14. + +### 3. Definitions/Abbreviations + +| Term | Definition | +|-----------------|---------------------------------------------------------------------------------------------| +| Component | A SONiC container that produces service-level counters (e.g. `swss`, `gnmi`, `bmp`). | +| Entity | A logical grouping of metrics inside a component (e.g. an orchagent table, a gNMI path). | +| Metric | A named `uint64` counter or gauge inside an entity. | +| ComponentStats | The reusable producer library specified in the Framework HLD. | +| `COUNTERS_DB` | The existing SONiC Redis database (logical DB 2) holding counter rows. | +| telegraf | The off-box-friendly metric agent running on the switch; configured and operated by NDM. 
| +| mdm | Geneva metric agent that consumes telegraf output and forwards it to Geneva. | +| NDM HLD | "Geneva integration with SONiC" HLD, owned by the NDM team. | + +### 4. Overview + +The Framework HLD specifies a producer that writes each component's service-level counters into `COUNTERS_DB` under a uniform key layout. To make those counters useful off-box, we need a stable contract between that producer and whatever agent harvests Redis on the switch and forwards data to Geneva. + +NDM has already designed and is rolling out a telegraf-based pipeline for harvesting `COUNTERS_DB` and forwarding to Geneva (see NDM HLD §5 "Existing stats collecting from Database via mdm"). This HLD therefore does **not** introduce a new transport. Instead it: + +1. **Defines the Redis schema** that the producer writes and that telegraf consumes (key layout, hash fields, types, dirty-tracking semantics). +2. **Specifies the SWSS-specific vocabulary** (`SWSS_STATS:<table>` with `SET` / `DEL` / `COMPLETE` / `ERROR`). +3. **States the conventions** that future components must follow so that telegraf can pick them up by pattern match without a per-component configuration change. + +The result is a thin, declarative contract between two teams: SONiC owns what is written; NDM owns how it is harvested and forwarded. + +### 5. Requirements + +**Functional** + +- R1. Every SONiC container that integrates `ComponentStats` shall expose its counters in `COUNTERS_DB` under the uniform key layout defined in §7.1. +- R2. The schema shall be discoverable by pattern match (`*_STATS:*`) so that a single telegraf input definition can pick up all current and future components without code or configuration changes. +- R3. The SWSS facade (`SwssStats`) shall publish counters under `SWSS_STATS:
` with hash fields `SET`, `DEL`, `COMPLETE`, `ERROR` (decimal `uint64`). +- R4. The schema shall include a per-entity *update marker* (the version-bump in the producer; observable to telegraf as the row's hash value changing) so that idle rows are not re-emitted to Geneva every cycle. + +**Non-functional** + +- R5. The reporting path shall not require changes to the SONiC dataplane, syncd, SAI, or the existing Flex-Counter pipeline. +- R6. The reporting path shall not impose any on-the-wire dependency between SONiC and a specific off-box telemetry system. SONiC writes Redis; whatever consumes Redis is replaceable. +- R7. A failure of telegraf, mdm, or Geneva shall not affect the producer or any other SONiC service. + +**Out of scope** + +- Telegraf container packaging, lifecycle, and configuration. See NDM HLD §5.2 ("telegraf design"). +- mdm container deployment, KubeSonic rollout. See NDM HLD §3 and §6. +- Geneva endpoint, authentication, dashboards, alerting. +- Direct OTLP export from the application (see future work, §14). + +### 6. Architecture Design + +``` ++-------------------------- SONiC switch ---------------------------+ +| | +| +-- container (e.g. swss) -----------------------------------+ | +| | application -> ComponentStats library | | +| +------------------------+-----------------------------------+ | +| | HSET | +| v | +| +-------------------------+ | +| | COUNTERS_DB (Redis DB 2)| | +| | SWSS_STATS:PORT_TABLE | | +| | GNMI_STATS:/iface/... | | +| | BMP_STATS:... | | +| +-----------+-------------+ | +| | HSCAN / HGETALL | +| v | +| +-------------------------+ | +| | telegraf | (owned by NDM HLD §5.2) | +| +-----------+-------------+ | +| | | +| v | +| +-------------------------+ | +| | mdm | (owned by NDM HLD §4) | +| +-----------+-------------+ | +| | | ++--------------------------|----------------------------------------+ + v + +--------+ + | Geneva | + +--------+ +``` + +The boundary owned by this HLD is the box labelled `COUNTERS_DB`. 
Everything above it (the producer) is specified in the Framework HLD; everything below it (telegraf, mdm, Geneva) is specified in the NDM HLD. This HLD owns the **interface between the two**. + +### 7. High-Level Design + +#### 7.1 `COUNTERS_DB` key layout (the contract) + +For a component named `C` (case-insensitive at the API; rendered uppercase on the wire) and an entity `E`: + +``` +db: COUNTERS_DB (logical DB 2) +key: "<C>_STATS:<E>" +type: Redis hash +fields: each metric name -> decimal uint64 string +``` + +Properties guaranteed by the producer: + +- **Stable suffix `_STATS`.** Every component writes under `<C>_STATS:*` and only there, so telegraf can match `*_STATS:*` (or a per-component pattern such as `SWSS_STATS:*`) to discover all rows for that component without an allow-list. +- **Hash, never string.** Field names are metric names; values are decimal `uint64`. Telegraf can call `HGETALL` and produce one measurement per (key, field) pair. +- **Idle suppression.** A row is `HSET` only when at least one of its metrics changed during the producer's 1 s cycle. Rows that did not change are not rewritten. Therefore an idle SONiC produces zero extra Redis traffic and telegraf, when configured to detect "no change since last poll", produces no upstream traffic either. +- **No TTL.** Keys are not expired; their lifetime is the producer process. On container restart they are recreated by the next 1 s flush. +- **No deletion in v1.** Entities that disappear at the application layer leave their last `HSET` in Redis until the container restarts. Garbage collection is left to the application; the framework does not delete keys (this keeps the contract simple).
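
As an illustration of the layout above, the following C++ sketch shows how a key and its hash fields could be rendered. The helper names `makeStatsKey` and `renderFields` are hypothetical and are not part of the `swss::ComponentStats` API. The only facts taken from this section are the uppercase component name, the `_STATS:` separator, and decimal `uint64` field values.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not the library API): builds the COUNTERS_DB key
// for a component/entity pair. The component is uppercased for the wire;
// the entity is used verbatim.
std::string makeStatsKey(std::string component, const std::string& entity) {
    std::transform(component.begin(), component.end(), component.begin(),
                   [](unsigned char ch) {
                       return static_cast<char>(std::toupper(ch));
                   });
    return component + "_STATS:" + entity;
}

// Hypothetical helper: renders each metric as a (field, decimal string)
// pair, i.e. the shape a single multi-field HSET would carry.
std::vector<std::pair<std::string, std::string>>
renderFields(const std::map<std::string, std::uint64_t>& metrics) {
    std::vector<std::pair<std::string, std::string>> out;
    out.reserve(metrics.size());
    for (const auto& kv : metrics) {
        out.emplace_back(kv.first, std::to_string(kv.second));
    }
    return out;
}
```

Under these assumptions, `makeStatsKey("swss", "PORT_TABLE")` yields the key `SWSS_STATS:PORT_TABLE`, and the rendered fields come out in the iteration order of `std::map`, that is, sorted by metric name.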
+ +Example for `componentName="SWSS"`, entity `PORT_TABLE`: + +``` +redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE" +1) "SET" +2) "1283" +3) "DEL" +4) "17" +5) "COMPLETE" +6) "1300" +7) "ERROR" +8) "0" +``` + +The shape mirrors the existing `COUNTERS:*` keys produced by the Flex-Counter pipeline so that on-box tooling (`redis-cli`, `show ... stats`) needs no changes. + +#### 7.2 SWSS-specific vocabulary + +The SWSS facade (`SwssStats`) writes to: + +| Key | Field | Meaning | +|--------------------------------------|------------|-------------------------------------------------| +| `SWSS_STATS:<table>` | `SET` | Number of `SET` operations seen on the table. | +| `SWSS_STATS:<table>` | `DEL` | Number of `DEL` operations seen on the table. | +| `SWSS_STATS:<table>` | `COMPLETE` | Number of operations that finished successfully.| +| `SWSS_STATS:<table>` | `ERROR` | Number of operations that finished with error. | + +`<table>` is the same identifier used by orchagent (e.g. `PORT_TABLE`, `VLAN_TABLE`, `ROUTE_TABLE`); no transformation is applied. + +#### 7.3 Conventions for future components + +When onboarding a new component (`gnmi`, `bmp`, `telemetry`, …) using the framework: + +1. Pick a stable, uppercase component name `C`. Counters land under `C_STATS:*` automatically. +2. Define a short, finite set of metric names (verbs/states) that describe the events the component cares about. Avoid putting cardinality-heavy values (interface name, neighbour IP) inside the metric name; put them in the entity (`E`) instead. Telegraf reads the entity from the Redis key and the metric from the hash field, so dashboards can pivot freely. +3. Document the vocabulary in the component's own HLD (one row per field, the same shape as §7.2). + +No telegraf configuration change is required to onboard a new component, provided telegraf is configured to scan `*_STATS:*` patterns (NDM HLD §5.2.1).
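
To make the consumer side of these conventions concrete, here is a minimal C++ sketch of how a harvested key could be split into its component and entity parts. It is not the telegraf implementation (that is owned by the NDM HLD); `StatsKey` and `parseStatsKey` are hypothetical names. The sketch assumes that component names never contain `:` while entities may, so it splits at the first colon.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: the two parts of a "<component>_STATS:<entity>" key.
struct StatsKey {
    std::string component;
    std::string entity;
};

// Hypothetical parser: returns true and fills `out` only when `key` has the
// form "<component>_STATS:<entity>" with a non-empty component and entity.
// Splitting at the FIRST ':' keeps entities that themselves contain ':' intact.
bool parseStatsKey(const std::string& key, StatsKey& out) {
    static const std::string suffix = "_STATS";
    const std::size_t colon = key.find(':');
    if (colon == std::string::npos || colon <= suffix.size()) {
        return false;  // no separator, or nothing in front of "_STATS"
    }
    if (key.compare(colon - suffix.size(), suffix.size(), suffix) != 0) {
        return false;  // prefix does not end in "_STATS" (e.g. "COUNTERS:...")
    }
    std::string entity = key.substr(colon + 1);
    if (entity.empty()) {
        return false;
    }
    out.component = key.substr(0, colon - suffix.size());
    out.entity = std::move(entity);
    return true;
}
```

With this sketch, `SWSS_STATS:PORT_TABLE` parses to component `SWSS` and entity `PORT_TABLE`, while an existing Flex-Counter key such as `COUNTERS:Ethernet0` is rejected, so a pattern-based consumer would not confuse the two families of rows.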
+ +#### 7.4 Interaction with the producer + +The producer (specified in the Framework HLD) maintains a per-entity *version* counter that is bumped on every `increment()` / `setValue()`. The 1 s writer thread snapshots only entities whose version changed since the last cycle and issues one `HSET` per dirty entity. As a result: + +- A row that has not changed since the previous cycle is **not** rewritten — telegraf and Redis monitoring both see this as no activity. +- A row that has changed even once is rewritten with the latest cumulative values, so the next `HGETALL` always returns the latest snapshot. +- There is no risk of telegraf reading a half-written row: each `HSET` is atomic on the Redis side, and a single `HSET` writes all fields of the entity together. + +#### 7.5 Telegraf interface (consumed, not specified here) + +Telegraf is expected to: + +- Run on the switch alongside the SONiC containers (NDM HLD §5.2.2 "telegraf container"). +- Scan `COUNTERS_DB` for keys matching `*_STATS:*`. +- Convert each `(key, field)` pair into a metric named `sonic..` with attributes `entity=`, `host=`. +- Forward to mdm. + +The exact telegraf configuration (input plugin, polling interval, output to mdm) is owned by the NDM HLD §5.2.1. This HLD only commits to the schema described in §7.1 / §7.2 / §7.3. + +### 8. SAI API + +No SAI API changes are required. This HLD covers a Redis schema and an interface to a consumer agent; SAI is not involved. + +### 9. Configuration and management + +Not applicable. This HLD introduces no new CLI commands, YANG models, manifests, or `CONFIG_DB` schema. Operator-facing configuration of telegraf / mdm is documented in the NDM HLD. + +### 10. Warmboot and Fastboot Design Impact + +The Redis schema is process-local: keys live in `COUNTERS_DB` for the duration of the producer container. 
On warmboot / fastboot the producer container restarts, the keys are recreated at the next 1 s flush, and counters start again from zero (see Framework HLD §10). Telegraf treats the appearance of fresh keys as new measurements; consumers compute rate-of-change and tolerate the reset. + +No boot-critical-chain dependency is added. + +### 11. Memory Consumption + +The reporting path adds no new in-container state beyond what the Framework HLD already describes for the DB sink (one Redis client per producer instance). Redis-side memory is bounded by the number of `(component, entity)` rows × the number of fields × the size of a `uint64` ASCII string; for the SWSS facade this is on the order of tens of rows × four fields. + +Telegraf and mdm memory are owned by the NDM HLD. + +### 12. Restrictions/Limitations + +- The schema is hash-only. Field values are decimal `uint64` strings; non-numeric fields are not supported. Components that need richer types must use a different reporting path (out of scope). +- The schema does not encode metric units. Units are implicit in the metric name (events) for v1; if a future component needs to report bytes / seconds / etc. it should put the unit in the metric name (e.g. `BYTES_RX`) until a more elaborate schema is introduced. +- Entity names are opaque strings. They must be safe for use as a Redis key suffix and for use as an attribute value downstream; in practice all SONiC table names already satisfy this. +- No deletion in v1 (see §7.1). Stale rows accumulate until container restart. + +### 13. Testing Requirements/Design + +#### 13.1 Unit / library tests + +The library-level invariants (`HSET` on dirty entities, idle suppression, field naming) are covered by the Framework HLD unit-test suite (`componentstats_ut.cpp`). No additional unit tests are introduced by this HLD. + +#### 13.2 System tests + +- Boot a `sonic-vs` image that includes the Framework HLD's two companion PRs. 
+- Exercise orchagent so that the SWSS facade increments counters (e.g. `config vlan add`, `config interface ip add`). +- Verify the schema directly in Redis: + ``` + redis-cli -n 2 KEYS "SWSS_STATS:*" + redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE" + ``` + Confirm that: + - The key shape matches §7.1. + - All four SWSS fields (`SET`, `DEL`, `COMPLETE`, `ERROR`) are present and are decimal integers. + - After a quiescent dwell, no `HSET` traffic is observed (idle suppression). +- End-to-end with telegraf (on a testbed configured per the NDM HLD): exercise orchagent and confirm metrics named `sonic.swss.SET` (etc.) arrive in Geneva with attribute `entity=<table>`. + +### 14. Open/Action items + +- The single reporting path in this revision is `COUNTERS_DB -> telegraf -> mdm -> Geneva`. Direct OTLP export from the application (the `OpenTelemetry SDK -> mdm` path described in NDM HLD §4) is a possible future addition; it would be specified in a future revision of this document if and when SONiC components need lower reporting latency than 1 s polling can provide. +- Garbage collection of stale `*_STATS:` keys on long-lived containers is left for a future revision. The current behaviour (cleared on container restart) is sufficient for the planned consumers. +- When additional components (`gnmi`, `bmp`, `telemetry`, …) adopt the framework, each one should add its vocabulary table to §7.3 by a small follow-up PR on this HLD. From a02960a497d9655cb4c63fe52c23addff327ebfc Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Wed, 27 May 2026 06:17:10 +0000 Subject: [PATCH 06/14] Trim inline code, add metric design tables per review feedback Reviewer feedback (r12f) on the framework HLD was that the inline C++ snippets are not the right focus for an HLD and that what matters is the metric design - laid out as a Metric Name / Label List / Description table. Framework HLD changes: - Replace the hot-path code listing (was section 7.4) with a short prose summary of the two atomic RMWs. - Replace the SwssStats code listing (was section 7.7) with a small call-to-metric mapping table and a forward reference to the Reporting HLD for the full metric design. - Replace the GnmiStats illustrative code (was section 7.8) with a recipe and an illustrative future metrics table in the same shape the reviewer requested. Reporting HLD changes: - Reframe section 7.2 from a Redis key/field table into a proper Metric Name / Label List / Description table for the four SWSS metrics (SWSS_STATS_SET / DEL / COMPLETE / ERROR with label swss.table). Keep the Redis-side mapping as a footnote.
- Tighten section 7.3 so future components are told to follow the exact same Metric Name / Label List / Description shape. Bump revisions: Framework 0.2 to 0.3, Reporting 0.1 to 0.2. Signed-off-by: Yutong Zhang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../component-stats-framework-hld.md | 107 +++++++++++------- .../component-stats-reporting-hld.md | 72 ++++++++---- 2 files changed, 118 insertions(+), 61 deletions(-) diff --git a/doc/component-stats/component-stats-framework-hld.md b/doc/component-stats/component-stats-framework-hld.md index 3ab58e55cf4..022824c21ca 100644 --- a/doc/component-stats/component-stats-framework-hld.md +++ b/doc/component-stats/component-stats-framework-hld.md @@ -23,6 +23,7 @@ |-----|------------|---------------|------------------------------------------------------| | 0.1 | 2026-04-28 | Yutong Zhang | Initial revision | | 0.2 | 2026-05-12 | Yutong Zhang | Split out the reporting pipeline into a separate HLD | +| 0.3 | 2026-05-27 | Yutong Zhang | Trim inline code, focus on metric design tables | ### 2. Scope @@ -198,19 +199,17 @@ Process-wide: #### 7.4 Hot path -```cpp -void ComponentStats::increment(const string& entity, const string& metric, uint64_t n) { - if (!isEnabled() || n == 0) return; - - auto& e = getOrCreateEntity(entity); // mutex on first use only - auto& c = getOrCreateCounter(e, metric); // mutex on first use only +After the first use of a given `(entity, metric)` pair, `increment()` does +exactly two atomic RMWs and nothing else: - c.value .fetch_add(n, memory_order_relaxed); // (1) counter - e.version.fetch_add(1, memory_order_release); // (2) dirty-bump (release) -} -``` +1. **Relaxed `fetch_add`** on the counter value — accumulates the event. +2. **Release `fetch_add`** on the per-entity *version* — marks the entity + dirty and publishes the new counter value to the writer thread. + Pairs with the writer's acquire-load (see §7.6). -Cost after warm-up: two atomic RMWs. 
No mutex acquisition, no allocation, no syscall. +No mutex acquisition, no allocation, no syscall on the hot path. The +structural mutex is taken only the first time a given `(entity, metric)` +pair is seen, to insert it into the per-entity map. #### 7.5 Writer thread @@ -261,40 +260,63 @@ Without it, on weakly ordered architectures (ARM, POWER) the writer could see th #### 7.7 `SwssStats` thin facade -`SwssStats` (in `sonic-swss/orchagent/`) is reduced to a translation layer that owns only the SWSS-specific vocabulary and the global enable flag consumed by `orch.cpp`: - -```cpp -SwssStats::SwssStats() : m_impl(swss::ComponentStats::create("SWSS")) {} - -void SwssStats::recordTask(const std::string& t, const std::string& op) { - if (op == "SET") m_impl->increment(t, "SET"); - else if (op == "DEL") m_impl->increment(t, "DEL"); -} -void SwssStats::recordComplete(const std::string& t, uint64_t n) { m_impl->increment(t, "COMPLETE", n); } -void SwssStats::recordError (const std::string& t, uint64_t n) { m_impl->increment(t, "ERROR", n); } -``` - -The whole file is ~130 lines of straightforward delegation. **The public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, `recordTask`/`recordComplete`/`recordError`) and the on-the-wire `SWSS_STATS:
` Redis layout are deliberately kept narrow and stable so that the SWSS-specific vocabulary remains independent of future evolution of the underlying `ComponentStats` library.** - -The exact `SWSS_STATS:<table>` schema (key layout, field names, types) is documented in the [Reporting HLD](./component-stats-reporting-hld.md), which owns the contract with downstream consumers. +`SwssStats` (in `sonic-swss/orchagent/`) is a ~130-line translation layer +that owns only the SWSS-specific vocabulary and the global +`gSwssStatsRecord` enable flag consumed by `orch.cpp`. Every call +delegates directly to `swss::ComponentStats::increment()`: + +| `SwssStats` call | Delegates to | Reports as (see Reporting HLD §7.2) | +|-----------------------------|-----------------------------------|--------------------------------------| +| `recordTask(t, "SET")` | `increment(t, "SET")` | `SWSS_STATS_SET{swss.table=t}` | +| `recordTask(t, "DEL")` | `increment(t, "DEL")` | `SWSS_STATS_DEL{swss.table=t}` | +| `recordComplete(t, n)` | `increment(t, "COMPLETE", n)` | `SWSS_STATS_COMPLETE{swss.table=t}` | +| `recordError(t, n)` | `increment(t, "ERROR", n)` | `SWSS_STATS_ERROR{swss.table=t}` | + +The public surface (`gSwssStatsRecord`, `SwssStats::getInstance()`, +`recordTask` / `recordComplete` / `recordError`) and the on-the-wire +`SWSS_STATS:<table>` Redis layout are deliberately kept narrow and +stable so the SWSS vocabulary remains independent of future evolution +of the underlying `ComponentStats` library. + +The full SWSS metric design (metric names, labels, descriptions) and +the exact `SWSS_STATS:
` Redis schema are owned by the +[Reporting HLD §7.2](./component-stats-reporting-hld.md#72-swss-metric-design), +which is the contract with downstream consumers. #### 7.8 Adopting the library in a new container -To add equivalent metrics to e.g. `gnmi`, write a facade analogous to §7.7: - -```cpp -class GnmiStats { -public: - static GnmiStats* getInstance(); - void recordSubscribe(const std::string& path) { m_impl->increment(path, "SUBSCRIBE"); } - void recordError (const std::string& path) { m_impl->increment(path, "ERROR"); } -private: - GnmiStats() : m_impl(swss::ComponentStats::create("GNMI")) {} - std::shared_ptr m_impl; -}; -``` - -Result: counters land in `COUNTERS_DB` under keys `GNMI_STATS:`. No new threads, no new Redis client management, no new test harness needed. Reporting then picks them up automatically via the pipeline described in the Reporting HLD. +A new component `C` adopts the framework by: + +1. Picking an uppercase component name `C`. Counters automatically land in + `COUNTERS_DB` under `C_STATS:*` and surface downstream as metrics + named `C_STATS_` with one label per entity. +2. **Designing a finite vocabulary** of verb-style metric names for the + events the component cares about. Anything high-cardinality + (interface name, neighbour IP, gNMI path, BMP peer) **must** go into + the entity (the part after the `:` in the Redis key) rather than the + metric name, so that dashboards can pivot on the label without + explosion in metric count. See + [Reporting HLD §7.3](./component-stats-reporting-hld.md#73-conventions-for-future-components) + for the rationale. +3. Documenting that vocabulary as a Metric Name | Label List | + Description table in the component's own HLD, identical in shape to + the SWSS table in Reporting HLD §7.2. +4. Writing a thin facade (~30 LoC) that calls + `swss::ComponentStats::increment()` for each event. + +No new threads, no new Redis client management, no new test harness +needed. 
Reporting picks the metrics up automatically via the +`*_STATS:*` pattern match. + +Illustrative future vocabulary for `gnmi` (to be finalised when the +gNMI facade lands): + +| Metric Name | Label List | Description | +|-------------------------|--------------|--------------------------------------------------------------| +| `GNMI_STATS_SUBSCRIBE` | `gnmi.path` | Number of `Subscribe` requests received on the path. | +| `GNMI_STATS_GET` | `gnmi.path` | Number of `Get` RPCs handled on the path. | +| `GNMI_STATS_SET` | `gnmi.path` | Number of `Set` RPCs handled on the path. | +| `GNMI_STATS_ERROR` | `gnmi.path` | Number of RPCs that returned an error on the path. | ### 8. SAI API @@ -374,3 +396,4 @@ End-to-end validation of the reporting path (telegraf → mdm → Geneva) is cov - Phase 1 (this HLD's two PRs) lands the `ComponentStats` library and the `SwssStats` facade with the DB sink fully active. - Phase 2 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own facades. Each is a self-contained PR in the relevant repository. - Phase 3 (future) may add direct OTLP export from the library to a local agent for components that need lower reporting latency than the DB → telegraf path provides. Out of scope for this HLD. + diff --git a/doc/component-stats/component-stats-reporting-hld.md b/doc/component-stats/component-stats-reporting-hld.md index 7e63a281eab..028830fff47 100644 --- a/doc/component-stats/component-stats-reporting-hld.md +++ b/doc/component-stats/component-stats-reporting-hld.md @@ -22,6 +22,7 @@ | Rev | Date | Author | Change Description | |-----|------------|---------------|----------------------------------------------------------| | 0.1 | 2026-05-12 | Yutong Zhang | Initial revision (split from component-stats Framework HLD) | +| 0.2 | 2026-05-27 | Yutong Zhang | Reframe §7.2 as a Metric Name / Label List / Description table | ### 2. 
Scope @@ -161,28 +162,60 @@ redis-cli -n 2 HGETALL "SWSS_STATS:PORT_TABLE" The shape mirrors the existing `COUNTERS:*` keys produced by the Flex-Counter pipeline so that on-box tooling (`redis-cli`, `show ... stats`) needs no changes. -#### 7.2 SWSS-specific vocabulary - -The SWSS facade (`SwssStats`) writes to: - -| Key | Field | Meaning | -|--------------------------------------|------------|-------------------------------------------------| -| `SWSS_STATS:` | `SET` | Number of `SET` operations seen on the table. | -| `SWSS_STATS:` | `DEL` | Number of `DEL` operations seen on the table. | -| `SWSS_STATS:` | `COMPLETE` | Number of operations that finished successfully.| -| `SWSS_STATS:` | `ERROR` | Number of operations that finished with error. | - -`` is the same identifier used by orchagent (e.g. `PORT_TABLE`, `VLAN_TABLE`, `ROUTE_TABLE`); no transformation is applied. +#### 7.2 SWSS metric design + +When telegraf reads a `SWSS_STATS:
` hash from `COUNTERS_DB` and +forwards it via mdm, each `(key, field)` pair surfaces downstream as a +single metric with one label carrying the orchagent table name. The +SWSS facade emits the following four metrics: + +| Metric Name | Label List | Description | +|-----------------------|---------------|-----------------------------------------------------------------------------------| +| `SWSS_STATS_SET` | `swss.table` | Count of `SET` operations enqueued on the orchagent table named by the label. | +| `SWSS_STATS_DEL` | `swss.table` | Count of `DEL` operations enqueued on the orchagent table named by the label. | +| `SWSS_STATS_COMPLETE` | `swss.table` | Count of operations that finished successfully on the table. | +| `SWSS_STATS_ERROR` | `swss.table` | Count of operations that finished with error on the table. | + +Notes: + +- All values are monotonically increasing `uint64` counters. Consumers + compute rate-of-change; absolute values reset on container restart + (see §10). +- The label value (`swss.table`) is the orchagent table identifier + verbatim — e.g. `PORT_TABLE`, `VLAN_TABLE`, `ROUTE_TABLE` — so + dashboards can filter on a specific table without parsing the Redis + key. +- Mapping back to `COUNTERS_DB`: the metric `SWSS_STATS_` + corresponds to Redis key `SWSS_STATS:` with hash field + ``; the label value is the `` part of the key. See §7.1 + for the key layout and §7.4 for the dirty-tracking semantics that + guarantee idle entities do not produce reporting traffic. #### 7.3 Conventions for future components -When onboarding a new component (`gnmi`, `bmp`, `telemetry`, …) using the framework: - -1. Pick a stable, uppercase component name `C`. Counters land under `C_STATS:*` automatically. -2. Define a short, finite set of metric names (verbs/states) that describe the events the component cares about. Avoid putting cardinality-heavy values (interface name, neighbour IP) inside the metric name; put them in the entity (`E`) instead. 
Telegraf reads the entity from the Redis key and the metric from the hash field, so dashboards can pivot freely. -3. Document the vocabulary in the component's own HLD (one row per field, the same shape as §7.2). - -No telegraf configuration change is required to onboard a new component, provided telegraf is configured to scan `*_STATS:*` patterns (NDM HLD §5.2.1). +When onboarding a new component (`gnmi`, `bmp`, `telemetry`, …) using +the framework: + +1. Pick a stable, uppercase component name `C`. Counters land under + `C_STATS:*` automatically and surface downstream as metrics named + `C_STATS_`. +2. Define a short, finite vocabulary of `` names that describe + the event classes the component cares about (e.g. `SUBSCRIBE`, + `GET`, `SET`, `ERROR`). Avoid putting cardinality-heavy values + (interface name, neighbour IP, gNMI path) inside the metric name; + put them in the entity (`E`) so they become the label value + downstream. Telegraf reads the entity from the Redis key and the + metric from the hash field, so dashboards can pivot freely without + metric-name explosion. +3. Document the vocabulary in the component's own HLD as a Metric + Name | Label List | Description table, identical in shape to §7.2. + A typical label name is `c.entity` for a generic component, or a + domain-specific synonym such as `gnmi.path` / `bmp.peer` / + `swss.table` when that reads better on dashboards. + +No telegraf configuration change is required to onboard a new +component, provided telegraf is configured to scan `*_STATS:*` patterns +(NDM HLD §5.2.1). #### 7.4 Interaction with the producer @@ -256,3 +289,4 @@ The library-level invariants (`HSET` on dirty entities, idle suppression, field - The single reporting path in this revision is `COUNTERS_DB -> telegraf -> mdm -> Geneva`. 
Direct OTLP export from the application (the `OpenTelemetry SDK -> mdm` path described in NDM HLD §4) is a possible future addition; it would be specified in a future revision of this document if and when SONiC components need lower reporting latency than 1 s polling can provide. - Garbage collection of stale `*_STATS:` keys on long-lived containers is left for a future revision. The current behaviour (cleared on container restart) is sufficient for the planned consumers. - When additional components (`gnmi`, `bmp`, `telemetry`, …) adopt the framework, each one should add its vocabulary table to §7.3 by a small follow-up PR on this HLD. + From 31233382c87a8eafd684a1347f126b197a132b0e Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Wed, 27 May 2026 06:32:44 +0000 Subject: [PATCH 07/14] Align Reporting HLD section 7.5 / 13.2 with the new section 7.2 metric naming Self-review caught two stale references that still described the old `sonic..` metric naming with attribute `entity=`, while section 7.2 had already been reframed around `_STATS_` with a component-specific label (`swss.table` for SWSS). The inconsistency would have left two different naming conventions in the same HLD. - Section 7.5 (Telegraf interface) now points at the section 7.2 / 7.3 schema rather than restating a different naming convention. - Section 13.2 system test step now asserts the four metrics named in section 7.2 with the `swss.table` label. 
Signed-off-by: Yutong Zhang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../component-stats-reporting-hld.md | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/doc/component-stats/component-stats-reporting-hld.md b/doc/component-stats/component-stats-reporting-hld.md index 028830fff47..ec867b4ae95 100644 --- a/doc/component-stats/component-stats-reporting-hld.md +++ b/doc/component-stats/component-stats-reporting-hld.md @@ -23,6 +23,7 @@ |-----|------------|---------------|----------------------------------------------------------| | 0.1 | 2026-05-12 | Yutong Zhang | Initial revision (split from component-stats Framework HLD) | | 0.2 | 2026-05-27 | Yutong Zhang | Reframe §7.2 as a Metric Name / Label List / Description table | +| 0.3 | 2026-05-27 | Yutong Zhang | Align §7.5 and §13.2 with the §7.2 metric naming | ### 2. Scope @@ -231,7 +232,12 @@ Telegraf is expected to: - Run on the switch alongside the SONiC containers (NDM HLD §5.2.2 "telegraf container"). - Scan `COUNTERS_DB` for keys matching `*_STATS:*`. -- Convert each `(key, field)` pair into a metric named `sonic..` with attributes `entity=`, `host=`. +- Convert each `(key, field)` pair into a metric in the schema defined + by §7.2 / §7.3 of this HLD: the metric name is + `_STATS_`, the entity part of the Redis key + becomes the label value, and the label name is component-specific + (e.g. `swss.table` for SWSS — see §7.2). The hostname is attached as + an additional label by telegraf itself. - Forward to mdm. The exact telegraf configuration (input plugin, polling interval, output to mdm) is owned by the NDM HLD §5.2.1. This HLD only commits to the schema described in §7.1 / §7.2 / §7.3. @@ -282,7 +288,11 @@ The library-level invariants (`HSET` on dirty entities, idle suppression, field - The key shape matches §7.1. - All four SWSS fields (`SET`, `DEL`, `COMPLETE`, `ERROR`) are present and are decimal integers. 
- After a quiescent dwell, no `HSET` traffic is observed (idle suppression). -- End-to-end with telegraf (on a testbed configured per the NDM HLD): exercise orchagent and confirm metrics named `sonic.swss.SET` (etc.) arrive in Geneva with attribute `entity=
`. +- End-to-end with telegraf (on a testbed configured per the NDM HLD): + exercise orchagent and confirm the four metrics defined in §7.2 + (`SWSS_STATS_SET` / `SWSS_STATS_DEL` / `SWSS_STATS_COMPLETE` / + `SWSS_STATS_ERROR`) arrive in Geneva carrying the `swss.table` label + for the exercised orchagent tables. ### 14. Open/Action items From fe61b53c14f569ce1c648b972065c07732c80378 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Wed, 27 May 2026 06:35:38 +0000 Subject: [PATCH 08/14] Clarify Metric / Label terminology in Framework HLD section 3 Self-review caught that the word ""Metric"" was used with two meanings across the two HLDs: - Framework section 3 defined Metric as the hash field name on the producer side (e.g. SET, DEL). - Reporting section 7.2 uses ""Metric Name"" as the column header for the downstream wire name (e.g. SWSS_STATS_SET). Update the Framework section 3 Metric entry to spell out both views and point at Reporting section 7.2 for the wire schema. Also add a Label entry so the new ""Label List"" column in Reporting section 7.2 has a definition to anchor to. Signed-off-by: Yutong Zhang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- doc/component-stats/component-stats-framework-hld.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/doc/component-stats/component-stats-framework-hld.md b/doc/component-stats/component-stats-framework-hld.md index 022824c21ca..002683cbddd 100644 --- a/doc/component-stats/component-stats-framework-hld.md +++ b/doc/component-stats/component-stats-framework-hld.md @@ -24,6 +24,7 @@ | 0.1 | 2026-04-28 | Yutong Zhang | Initial revision | | 0.2 | 2026-05-12 | Yutong Zhang | Split out the reporting pipeline into a separate HLD | | 0.3 | 2026-05-27 | Yutong Zhang | Trim inline code, focus on metric design tables | +| 0.4 | 2026-05-27 | Yutong Zhang | Clarify Metric / Label terminology in §3 | ### 2. 
Scope @@ -47,7 +48,8 @@ The library publishes counters into `COUNTERS_DB` so that: |-----------------|---------------------------------------------------------------------------------------------| | Component | A SONiC container that produces service-level counters (e.g. `swss`, `gnmi`, `bmp`). | | Entity | A logical grouping of metrics inside a component (e.g. an orchagent table, a gNMI path). | -| Metric | A named uint64 counter or gauge inside an entity (e.g. `SET`, `DEL`, `COMPLETE`, `ERROR`). | +| Metric | A named uint64 counter or gauge inside an entity (e.g. `SET`, `DEL`, `COMPLETE`, `ERROR`). Stored as a Redis hash field on the producer side; surfaces downstream as a wire metric named `_STATS_` — see the [Reporting HLD §7.2](./component-stats-reporting-hld.md#72-swss-metric-design) for the wire schema. | +| Label | A key/value attribute attached to a wire metric by telegraf. The entity name (the part after the `:` in the Redis key) is surfaced as a component-specific label such as `swss.table`. | | ComponentStats | The new shared library in `sonic-swss-common` providing the producer mechanism. | | SwssStats | A SWSS-specific facade over `ComponentStats` (lives in `sonic-swss`). | | DB sink | The output path that mirrors counters into `COUNTERS_DB`. | @@ -397,3 +399,4 @@ End-to-end validation of the reporting path (telegraf → mdm → Geneva) is cov - Phase 2 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own facades. Each is a self-contained PR in the relevant repository. - Phase 3 (future) may add direct OTLP export from the library to a local agent for components that need lower reporting latency than the DB → telegraf path provides. Out of scope for this HLD. 
+ From c3e65e084aae0d82b487e8d710acbe329f78432a Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Wed, 27 May 2026 14:46:18 +0800 Subject: [PATCH 09/14] =?UTF-8?q?docs(reporting-hld):=20sync=20Reporting?= =?UTF-8?q?=20=C2=A73=20Metric/Label=20definitions=20with=20Framework?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reporting HLD §3 was still using the pre-rev-0.4 Metric definition ("A named uint64 counter or gauge") and was missing the Label term entirely. Framework HLD §3 was updated in rev 0.4 to cover the dual meaning (producer-side hash field + downstream wire name COMPONENT_STATS_) and to add the Label entry. This commit brings Reporting HLD §3 into sync so readers who start at the Reporting document find Metric and Label defined consistently with the Framework document. Version bumped to 0.4. Signed-off-by: Yutong Zhang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- doc/component-stats/component-stats-reporting-hld.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/doc/component-stats/component-stats-reporting-hld.md b/doc/component-stats/component-stats-reporting-hld.md index ec867b4ae95..1f2ec475b85 100644 --- a/doc/component-stats/component-stats-reporting-hld.md +++ b/doc/component-stats/component-stats-reporting-hld.md @@ -24,6 +24,7 @@ | 0.1 | 2026-05-12 | Yutong Zhang | Initial revision (split from component-stats Framework HLD) | | 0.2 | 2026-05-27 | Yutong Zhang | Reframe §7.2 as a Metric Name / Label List / Description table | | 0.3 | 2026-05-27 | Yutong Zhang | Align §7.5 and §13.2 with the §7.2 metric naming | +| 0.4 | 2026-05-27 | Yutong Zhang | Sync §3 Metric / Label definitions with Framework HLD | ### 2. Scope @@ -49,7 +50,8 @@ Direct application-side OTLP export (e.g. 
the `OpenTelemetry SDK -> mdm` path de |-----------------|---------------------------------------------------------------------------------------------| | Component | A SONiC container that produces service-level counters (e.g. `swss`, `gnmi`, `bmp`). | | Entity | A logical grouping of metrics inside a component (e.g. an orchagent table, a gNMI path). | -| Metric | A named `uint64` counter or gauge inside an entity. | +| Metric | A named `uint64` counter or gauge inside an entity (e.g. `SET`, `DEL`, `COMPLETE`, `ERROR`). Stored as a Redis hash field on the producer side; surfaces downstream as a wire metric named `_STATS_` (see §7.2 for the SWSS instance). | +| Label | A key/value attribute attached to a wire metric. The entity name (the part after the `:` in the Redis key) becomes the label value; the label name is component-specific (e.g. `swss.table` for SWSS — see §7.2). | | ComponentStats | The reusable producer library specified in the Framework HLD. | | `COUNTERS_DB` | The existing SONiC Redis database (logical DB 2) holding counter rows. | | telegraf | The off-box-friendly metric agent running on the switch; configured and operated by NDM. | @@ -300,3 +302,4 @@ The library-level invariants (`HSET` on dirty entities, idle suppression, field - Garbage collection of stale `*_STATS:` keys on long-lived containers is left for a future revision. The current behaviour (cleared on container restart) is sufficient for the planned consumers. - When additional components (`gnmi`, `bmp`, `telemetry`, …) adopt the framework, each one should add its vocabulary table to §7.3 by a small follow-up PR on this HLD. 
+ From 157dc619285f5104db74015c3bf322aef0dcb2d1 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Wed, 27 May 2026 14:49:08 +0800 Subject: [PATCH 10/14] =?UTF-8?q?docs(framework-hld):=20clarify=20facade?= =?UTF-8?q?=20LoC=20discrepancy=20in=20=C2=A77.8?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit §7.7 describes SwssStats as ~130 LoC; §7.8 step 4 said ~30 LoC with no explanation. Added a clarifying note: a minimal facade stays near ~30 LoC; SwssStats is larger because it integrates gSwssStatsRecord and singleton plumbing into orch.cpp. Rev bumped to 0.5. Signed-off-by: Yutong Zhang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- doc/component-stats/component-stats-framework-hld.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/doc/component-stats/component-stats-framework-hld.md b/doc/component-stats/component-stats-framework-hld.md index 002683cbddd..4f47ae0cbfc 100644 --- a/doc/component-stats/component-stats-framework-hld.md +++ b/doc/component-stats/component-stats-framework-hld.md @@ -25,6 +25,7 @@ | 0.2 | 2026-05-12 | Yutong Zhang | Split out the reporting pipeline into a separate HLD | | 0.3 | 2026-05-27 | Yutong Zhang | Trim inline code, focus on metric design tables | | 0.4 | 2026-05-27 | Yutong Zhang | Clarify Metric / Label terminology in §3 | +| 0.5 | 2026-05-27 | Yutong Zhang | Clarify §7.8 facade LoC vs SwssStats LoC discrepancy | ### 2. Scope @@ -303,8 +304,12 @@ A new component `C` adopts the framework by: 3. Documenting that vocabulary as a Metric Name | Label List | Description table in the component's own HLD, identical in shape to the SWSS table in Reporting HLD §7.2. -4. Writing a thin facade (~30 LoC) that calls +4. Writing a thin facade that calls `swss::ComponentStats::increment()` for each event. + A minimal facade needs only ~30 LoC. 
The SwssStats facade is larger + (~130 LoC) because it also integrates a `gSwssStatsRecord` enable flag + and a singleton into orchagent's existing `orch.cpp` infrastructure; + new containers that do not need that extra plumbing stay near ~30 LoC. No new threads, no new Redis client management, no new test harness needed. Reporting picks the metrics up automatically via the @@ -400,3 +405,4 @@ End-to-end validation of the reporting path (telegraf → mdm → Geneva) is cov - Phase 3 (future) may add direct OTLP export from the library to a local agent for components that need lower reporting latency than the DB → telegraf path provides. Out of scope for this HLD. + From 66100210c6ca31e4712d8908d7e4ed096c607a50 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Wed, 27 May 2026 14:49:11 +0800 Subject: [PATCH 11/14] =?UTF-8?q?docs(reporting-hld):=20fix=20=C2=A714=20p?= =?UTF-8?q?ointer=20for=20new-component=20vocabulary=20tables?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit §7.3 says each new component documents its vocabulary in its own component HLD. §14's third bullet contradicted that by saying to add the table to §7.3 of this HLD. Fixed to say: add the vocab table to the component's own HLD (following §7.3 conventions), with an optional cross-reference added here. Rev bumped to 0.5. 
Signed-off-by: Yutong Zhang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- doc/component-stats/component-stats-reporting-hld.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/component-stats/component-stats-reporting-hld.md b/doc/component-stats/component-stats-reporting-hld.md index 1f2ec475b85..098514ef83c 100644 --- a/doc/component-stats/component-stats-reporting-hld.md +++ b/doc/component-stats/component-stats-reporting-hld.md @@ -24,7 +24,7 @@ | 0.1 | 2026-05-12 | Yutong Zhang | Initial revision (split from component-stats Framework HLD) | | 0.2 | 2026-05-27 | Yutong Zhang | Reframe §7.2 as a Metric Name / Label List / Description table | | 0.3 | 2026-05-27 | Yutong Zhang | Align §7.5 and §13.2 with the §7.2 metric naming | -| 0.4 | 2026-05-27 | Yutong Zhang | Sync §3 Metric / Label definitions with Framework HLD | +| 0.5 | 2026-05-27 | Yutong Zhang | Fix §14 to direct component vocab tables to their own HLD | ### 2. Scope @@ -300,6 +300,6 @@ The library-level invariants (`HSET` on dirty entities, idle suppression, field - The single reporting path in this revision is `COUNTERS_DB -> telegraf -> mdm -> Geneva`. Direct OTLP export from the application (the `OpenTelemetry SDK -> mdm` path described in NDM HLD §4) is a possible future addition; it would be specified in a future revision of this document if and when SONiC components need lower reporting latency than 1 s polling can provide. - Garbage collection of stale `*_STATS:` keys on long-lived containers is left for a future revision. The current behaviour (cleared on container restart) is sufficient for the planned consumers. -- When additional components (`gnmi`, `bmp`, `telemetry`, …) adopt the framework, each one should add its vocabulary table to §7.3 by a small follow-up PR on this HLD. 
+- When additional components (`gnmi`, `bmp`, `telemetry`, …) adopt the framework, each one should add its vocabulary table (in the `Metric Name | Label List | Description` shape of §7.2) to **its own component's HLD**, following the conventions in §7.3. A cross-reference to that table may optionally be added to §7.3 of this HLD so that all known vocabularies are discoverable from one place. From bc52bfcda2c9694b3fae563fcd23c1a762e5047f Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Wed, 27 May 2026 14:54:28 +0800 Subject: [PATCH 12/14] =?UTF-8?q?docs(framework-hld):=20remove=20stale=20~?= =?UTF-8?q?100=20LoC=20facade=20claims=20in=20=C2=A74=20and=20=C2=A76?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit §7.8 now explains that a minimal facade is ~30 LoC and SwssStats is ~130 LoC. §4 and §6 still said "~100 LoC", creating a third inconsistent number. Replaced both with a pointer to §7.8. Signed-off-by: Yutong Zhang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- doc/component-stats/component-stats-framework-hld.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/component-stats/component-stats-framework-hld.md b/doc/component-stats/component-stats-framework-hld.md index 4f47ae0cbfc..6fce5cac4b9 100644 --- a/doc/component-stats/component-stats-framework-hld.md +++ b/doc/component-stats/component-stats-framework-hld.md @@ -25,7 +25,7 @@ | 0.2 | 2026-05-12 | Yutong Zhang | Split out the reporting pipeline into a separate HLD | | 0.3 | 2026-05-27 | Yutong Zhang | Trim inline code, focus on metric design tables | | 0.4 | 2026-05-27 | Yutong Zhang | Clarify Metric / Label terminology in §3 | -| 0.5 | 2026-05-27 | Yutong Zhang | Clarify §7.8 facade LoC vs SwssStats LoC discrepancy | +| 0.5 | 2026-05-27 | Yutong Zhang | Fix §4/§6 LoC claims to be consistent with §7.8 | ### 2. Scope @@ -65,7 +65,7 @@ This HLD specifies a single, reusable producer that: 1. 
accumulates counters in process-local atomic state with negligible hot-path cost, 2. mirrors them to `COUNTERS_DB` so `redis-cli`, `show ... stats` CLIs, and any other on-box tooling continue to work, -3. exposes a stable public API so each container only needs to write a thin (~100 LoC) facade. +3. exposes a stable public API so each container only needs to write a thin facade (see §7.8 for sizing guidance). How the `COUNTERS_DB` rows then reach Geneva or any other off-box system is the responsibility of the [Reporting HLD](./component-stats-reporting-hld.md). @@ -130,7 +130,7 @@ The architecture is unchanged at the SONiC system level. A new library is introd +------------------------------------------------------------------------+ ``` -**Layering rule.** `swss-common` knows nothing of orchagent or any specific container; each container knows only its own facade plus `swss::ComponentStats`. New containers get the sink for free by writing a ~100-line wrapper. +**Layering rule.** `swss-common` knows nothing of orchagent or any specific container; each container knows only its own facade plus `swss::ComponentStats`. New containers get the sink for free by writing a thin wrapper (see §7.8 for sizing guidance). **Sink design properties.** From 1d61c972e4191c1956093bb4846ad1242cc3bdb6 Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Wed, 27 May 2026 14:54:30 +0800 Subject: [PATCH 13/14] =?UTF-8?q?docs(reporting-hld):=20restore=20missing?= =?UTF-8?q?=20rev=200.4=20entry=20in=20=C2=A71=20revision=20table?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A prior edit replaced the rev 0.4 row instead of appending 0.5 after it, causing the revision table to jump from 0.3 to 0.5. Restored the 0.4 entry ("Sync §3 Metric/Label definitions") so the history is complete. 
Signed-off-by: Yutong Zhang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- doc/component-stats/component-stats-reporting-hld.md | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/component-stats/component-stats-reporting-hld.md b/doc/component-stats/component-stats-reporting-hld.md index 098514ef83c..e9e3ec11159 100644 --- a/doc/component-stats/component-stats-reporting-hld.md +++ b/doc/component-stats/component-stats-reporting-hld.md @@ -24,6 +24,7 @@ | 0.1 | 2026-05-12 | Yutong Zhang | Initial revision (split from component-stats Framework HLD) | | 0.2 | 2026-05-27 | Yutong Zhang | Reframe §7.2 as a Metric Name / Label List / Description table | | 0.3 | 2026-05-27 | Yutong Zhang | Align §7.5 and §13.2 with the §7.2 metric naming | +| 0.4 | 2026-05-27 | Yutong Zhang | Sync §3 Metric / Label definitions with Framework HLD | | 0.5 | 2026-05-27 | Yutong Zhang | Fix §14 to direct component vocab tables to their own HLD | ### 2. Scope From 989cf403a43c3b8f0312aef6b4b1f3bfd5cd3b0b Mon Sep 17 00:00:00 2001 From: Yutong Zhang Date: Mon, 1 Jun 2026 10:47:44 +0800 Subject: [PATCH 14/14] Change revision Signed-off-by: Yutong Zhang --- .../component-stats-framework-hld.md | 10 +++------- .../component-stats-reporting-hld.md | 14 +++++--------- 2 files changed, 8 insertions(+), 16 deletions(-) diff --git a/doc/component-stats/component-stats-framework-hld.md b/doc/component-stats/component-stats-framework-hld.md index 6fce5cac4b9..6031a200816 100644 --- a/doc/component-stats/component-stats-framework-hld.md +++ b/doc/component-stats/component-stats-framework-hld.md @@ -22,10 +22,6 @@ | Rev | Date | Author | Change Description | |-----|------------|---------------|------------------------------------------------------| | 0.1 | 2026-04-28 | Yutong Zhang | Initial revision | -| 0.2 | 2026-05-12 | Yutong Zhang | Split out the reporting pipeline into a separate HLD | -| 0.3 | 2026-05-27 | Yutong Zhang | Trim inline code, focus on metric design tables | 
-| 0.4 | 2026-05-27 | Yutong Zhang | Clarify Metric / Label terminology in §3 | -| 0.5 | 2026-05-27 | Yutong Zhang | Fix §4/§6 LoC claims to be consistent with §7.8 | ### 2. Scope @@ -403,6 +399,6 @@ End-to-end validation of the reporting path (telegraf → mdm → Geneva) is cov - Phase 1 (this HLD's two PRs) lands the `ComponentStats` library and the `SwssStats` facade with the DB sink fully active. - Phase 2 onboards additional SONiC containers (`gnmi`, `bmp`, `telemetry`, …) by adding their own facades. Each is a self-contained PR in the relevant repository. - Phase 3 (future) may add direct OTLP export from the library to a local agent for components that need lower reporting latency than the DB → telegraf path provides. Out of scope for this HLD. - - - + + + diff --git a/doc/component-stats/component-stats-reporting-hld.md b/doc/component-stats/component-stats-reporting-hld.md index e9e3ec11159..90a5ef46216 100644 --- a/doc/component-stats/component-stats-reporting-hld.md +++ b/doc/component-stats/component-stats-reporting-hld.md @@ -19,13 +19,9 @@ ### 1. Revision -| Rev | Date | Author | Change Description | -|-----|------------|---------------|----------------------------------------------------------| -| 0.1 | 2026-05-12 | Yutong Zhang | Initial revision (split from component-stats Framework HLD) | -| 0.2 | 2026-05-27 | Yutong Zhang | Reframe §7.2 as a Metric Name / Label List / Description table | -| 0.3 | 2026-05-27 | Yutong Zhang | Align §7.5 and §13.2 with the §7.2 metric naming | -| 0.4 | 2026-05-27 | Yutong Zhang | Sync §3 Metric / Label definitions with Framework HLD | -| 0.5 | 2026-05-27 | Yutong Zhang | Fix §14 to direct component vocab tables to their own HLD | +| Rev | Date | Author | Change Description | +|-----|------------|---------------|-----------------------| +| 0.1 | 2026-05-12 | Yutong Zhang | Initial revision | ### 2. 
Scope @@ -302,5 +298,5 @@ The library-level invariants (`HSET` on dirty entities, idle suppression, field - The single reporting path in this revision is `COUNTERS_DB -> telegraf -> mdm -> Geneva`. Direct OTLP export from the application (the `OpenTelemetry SDK -> mdm` path described in NDM HLD §4) is a possible future addition; it would be specified in a future revision of this document if and when SONiC components need lower reporting latency than 1 s polling can provide. - Garbage collection of stale `*_STATS:` keys on long-lived containers is left for a future revision. The current behaviour (cleared on container restart) is sufficient for the planned consumers. - When additional components (`gnmi`, `bmp`, `telemetry`, …) adopt the framework, each one should add its vocabulary table (in the `Metric Name | Label List | Description` shape of §7.2) to **its own component's HLD**, following the conventions in §7.3. A cross-reference to that table may optionally be added to §7.3 of this HLD so that all known vocabularies are discoverable from one place. - - + +
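
The key-to-metric mapping that Reporting HLD §7.3 and §7.5 commit to can be sketched in a few lines of Python. This is an illustration only, not telegraf's configuration or plugin code (that remains owned by the NDM HLD). The function name `to_wire_metrics`, the `WireMetric` type, the `host` label name, the fallback label `<component>.entity` (one reading of the generic `c.entity` label), and the sample key `SWSS_STATS:PORT_TABLE` are assumptions made for the example.

```python
from typing import Dict, Iterator, Mapping, NamedTuple

# Label name per component. "swss.table" comes from the HLD; any other
# component falls back to "<component>.entity", which is an assumption.
LABEL_NAMES: Dict[str, str] = {"SWSS": "swss.table"}


class WireMetric(NamedTuple):
    name: str                 # e.g. SWSS_STATS_SET
    labels: Dict[str, str]    # entity label plus a host label
    value: int


def to_wire_metrics(
    key: str, fields: Mapping[str, str], host: str
) -> Iterator[WireMetric]:
    """Map one COUNTERS_DB hash (Redis key + hash fields) to wire metrics."""
    # The part before the first ':' is "<COMPONENT>_STATS"; the rest is the entity.
    prefix, sep, entity = key.partition(":")
    component = prefix[: -len("_STATS")] if prefix.endswith("_STATS") else ""
    if not sep or not component or not entity:
        return  # key does not match the *_STATS:* pattern; emit nothing
    label_name = LABEL_NAMES.get(component, f"{component.lower()}.entity")
    for field, raw in fields.items():
        yield WireMetric(
            name=f"{prefix}_{field}",            # hash field becomes the metric suffix
            labels={label_name: entity, "host": host},
            value=int(raw),                      # fields are stored as decimal integers
        )


# Hypothetical row, shaped like the four SWSS fields named in the HLD.
row = {"SET": "12", "DEL": "3", "COMPLETE": "15", "ERROR": "0"}
for metric in to_wire_metrics("SWSS_STATS:PORT_TABLE", row, host="switch-1"):
    print(metric.name, metric.labels, metric.value)
```

Keeping the entity in a label rather than in the metric name is what bounds the number of metric names to the size of the vocabulary, regardless of how many tables, paths, or peers a component tracks.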