122 changes: 122 additions & 0 deletions tracing-tests/README.md
@@ -0,0 +1,122 @@
# Tracing Validation

Documentation and approaches for tracing validation in microservices to identify services that "lose" trace context.

**Backend:** Grafana Tempo

---

## Quick Start

### 1. Understand the problem

Tracing gets lost when:
- A service is not instrumented
- A service does not propagate trace context
- Services use incompatible propagation formats
- Tracing is misconfigured

The result is fragmented traces that cannot be followed end to end during debugging.

### 2. Choose validation tests

| # | Test | Severity |
|---|------|----------|
| 1 | Client span → Server span matching | HIGH |
| 2 | Requests without trace_id | HIGH |
| 3 | Span without parent_span_id | MEDIUM |
| 4 | Propagator format consistency | HIGH |
| 5 | Required span attributes | MEDIUM |

Full list: [docs/01-approach.md](docs/01-approach.md)

### 3. Configure exclusions

Not all spans should be validated:
- Cron jobs (root span is allowed)
- Kafka consumers (new trace chain)
- Health checks (not traced)

Details: [docs/04-exclusions.md](docs/04-exclusions.md)

### 4. Run TraceQL queries

```traceql
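# Find orphan server spans (server spans that have no parent in the trace)
{ kind = server && nestedSetParent < 0 && resource.service.name != "api-gateway" }

# Find spans with an empty required attribute
# (an absent attribute needs a nil comparison, if the Tempo version supports it)
{ kind = server && span.http.method = "" }

# Find error spans
{ status = error }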
```

All queries: [docs/03-tempo-queries.md](docs/03-tempo-queries.md)
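
These queries can be sent to Tempo's search endpoint (`GET /api/search`, with the TraceQL expression in the `q` parameter). A minimal sketch of building such a request URL is shown here; the base URL `http://tempo:3200`, the helper name `build_search_url`, and the time window are assumptions for illustration, not part of this repository.

```python
from urllib.parse import urlencode

def build_search_url(base_url, traceql, start, end, limit=20):
    """Build a Tempo search URL for a TraceQL query (start/end are Unix seconds)."""
    params = {"q": traceql, "start": start, "end": end, "limit": limit}
    # urlencode percent-encodes braces, quotes, and operators in the query
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"

# Hypothetical Tempo address and a one-hour window
url = build_search_url("http://tempo:3200", "{ status = error }", 1700000000, 1700003600)
print(url)
```

Keeping URL construction separate from the HTTP call makes the query encoding easy to check without a running Tempo instance.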

---

## Documentation

| Document | Description |
|----------|-------------|
| [01-approach.md](docs/01-approach.md) | Catalog of tracing validation tests (overview + links to detailed files) |
| [02-architecture.md](docs/02-architecture.md) | Prototype architecture (offline/real-time) |
| [03-tempo-queries.md](docs/03-tempo-queries.md) | TraceQL queries for each test |
| [04-exclusions.md](docs/04-exclusions.md) | Exclusion rules |
| [05-implementation.md](docs/05-implementation.md) | Implementation: Go CLI, rollout phases, build-vs-buy |
| [06-test-framework-integration.md](docs/06-test-framework-integration.md) | Integration with the autotest framework (per-step assertions + aggregate via CLI) |
| [tests/01-client-server-matching.md](docs/tests/01-client-server-matching.md) | Detailed spec of Test 1 (runnable on an existing cluster) |
| [tests/11-inventory-coverage.md](docs/tests/11-inventory-coverage.md) | Tests 11–13: reconciliation with K8s API (pod↔trace, `ENABLE_TRACING` contract) |

---

## Expected outcome

After applying the approaches, teams receive a report:

```
Service Tracing Health Report
=============================
service-a: OK (all spans have parents, server spans match client spans)
service-b: WARN (missing server span for 3/10 client calls)
service-c: ERROR (no spans at all, likely not instrumented)
service-d: WARN (spans without required attributes: http.method)
```

---

## Operating modes

### Offline Analysis

Analysis of existing traces over a period:
- Daily/weekly health check
- Incident investigation
- Post-release quality assessment

### Real-time Validation

Validation during tests:
- Integration tests
- Smoke tests after deployment
- Validation of new services

---

## Industry sources

- [OpenTelemetry Distributed Tracing Best Practices](https://www.withcoherence.com/articles/opentelemetry-distributed-tracing-tutorial-and-best-practices)
- [Trace-based Testing OpenTelemetry Demo](https://opentelemetry.io/blog/2023/testing-otel-demo/)
- [Netflix Distributed Tracing Infrastructure](https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304)
- [Grafana Tempo TraceQL](https://grafana.com/docs/tempo/latest/traceql/)

---

## Existing tools

| Tool | Description |
|------|-------------|
| [Tracetest](https://tracetest.io/) | Trace-based testing with assertions |
| Jaeger Trace Quality Engine | Orphan span analysis (Jaeger only) |
| Tempo metrics-generator | Span metrics for monitoring |
249 changes: 249 additions & 0 deletions tracing-tests/docs/01-approach.md
@@ -0,0 +1,249 @@
# Approaches to tracing validation in microservices

## The "lost traces" problem

In distributed systems, tracing can "get lost" for several reasons:
- A service is not instrumented
- A service is instrumented but does not propagate trace context
- Incompatible propagation formats between services
- Configuration errors (wrong endpoint, sampling)

The result is fragmented traces where part of the call chain is invisible, which makes debugging and root cause analysis unreliable or impossible.

---

## Validation tests

### Core tests (detecting tracing loss)

| # | Test | What it detects | Severity |
|---|-------------------------------------|-----------------------------------------------------------------|-------------|
| 1 | Client span → Server span matching | Services that receive requests but do not emit spans | HIGH |
| 2 | Reject requests without trace_id | Services that send requests without trace context | HIGH |
| 3 | Span without parent_span_id = error | Services that tried to extract context but did not find it     | MEDIUM      |
| 4 | B3 propagation check | Services that use the legacy B3 propagation protocol | MEDIUM |


### Additional tests (instrumentation quality)

| # | Test | What it detects | Severity |
|---|------|--------------|-------------|
| 4 | Propagator format consistency | Format inconsistency (W3C vs B3 vs Jaeger) | HIGH |
| 5 | Required span attributes validation | Missing required attributes | MEDIUM |
| 6 | Trace depth anomaly detection | Suspiciously shallow traces | LOW |
| 7 | Hanging spans detection | Spans without end_time | MEDIUM |
| 8 | Duplicate span_id detection | Identifier collisions | LOW |
| 9 | Sampling decision propagation | Loss of the sampled=true flag | HIGH |
| 10 | Trace completeness ratio | % of complete vs fragmented traces | MEDIUM |

### Inventory-driven tests (reconciliation with the K8s API)

These tests do not inspect trace structure. Instead, they reconcile the set of services/pods seen in Tempo against the K8s inventory. They catch "silent" services, typos in `service.name`, zombie sources, and internal contract violations.

| # | Test | What it detects | Severity |
|---|------|--------------|-------------|
| 11 | K8s pod → trace presence | Pods that are running but do not emit a single span during the window | MEDIUM |
| 12 | Trace → K8s pod existence | Traces from sources that do not exist in the cluster (typo in `service.name`, zombies) | LOW |
| 13 | `ENABLE_TRACING=true` contract | Pods that declare tracing support but emit nothing (strict form of Test 11) | HIGH |

Details: [tests/11-inventory-coverage.md](tests/11-inventory-coverage.md). These tests require read-only access to the K8s API (`pods`, `deployments`) in addition to the Tempo API.
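
The reconciliation behind Tests 11 and 12 is essentially a two-way set difference. A minimal sketch, assuming the service names have already been collected from the K8s API and from Tempo (the function name `reconcile` and the sample names are hypothetical):

```python
def reconcile(pod_services, traced_services):
    """Compare the K8s inventory with the services seen in traces."""
    return {
        # Test 11: running in the cluster but emitting no spans
        "silent": sorted(pod_services - traced_services),
        # Test 12: present in traces but unknown to the cluster
        "unknown": sorted(traced_services - pod_services),
    }

pods = {"orders", "billing", "inventory"}
traced = {"orders", "billing", "invntory"}  # typo in service.name
result = reconcile(pods, traced)
print(result)
```

A typo in `service.name` shows up on both sides at once: the real service looks silent and the misspelled one looks unknown.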

---

## Detailed test descriptions

> **Organizing principle:** each test, documented as a runnable specification (run modes, algorithm, edge cases, report format), lives in a separate file `tests/NN-*.md`. Here we only provide short descriptions and links.

### Test 1: Client span → Server span matching

**Problem:** Service A calls service B, creates a client span, but service B does not create the corresponding server span.

**Check summary:** for every `kind=CLIENT` span in a trace there must exist a `kind=SERVER` span, emitted by a different service, with `parent_span_id = client.span_id`. A missing server span means the callee is not instrumented, does not extract trace context, or loses it at an async boundary.

**Running without modifying microservices:** read-only access to the Tempo HTTP API; for smoke mode — access to Ingress. Detailed algorithm, false-positive filtering, the `missing_ratio` hypothesis heuristic, and report format are in [tests/01-client-server-matching.md](tests/01-client-server-matching.md).
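
The core of the check can be sketched as follows. This is an illustration under assumptions, not the algorithm from the detailed spec: spans are plain dicts with hypothetical keys `span_id`, `parent_span_id`, `kind`, and `service`, and false-positive filtering (for example, client spans for database calls) is left out.

```python
from collections import defaultdict

def find_unmatched_client_spans(spans):
    """Return CLIENT spans with no SERVER child span from a different service."""
    children = defaultdict(list)
    for s in spans:
        if s.get("parent_span_id"):
            children[s["parent_span_id"]].append(s)

    unmatched = []
    for s in spans:
        if s["kind"] != "CLIENT":
            continue
        has_server = any(
            c["kind"] == "SERVER" and c["service"] != s["service"]
            for c in children.get(s["span_id"], [])
        )
        if not has_server:
            unmatched.append(s)
    return unmatched

trace = [
    {"span_id": "a1", "parent_span_id": None, "kind": "SERVER", "service": "gateway"},
    {"span_id": "a2", "parent_span_id": "a1", "kind": "CLIENT", "service": "gateway"},
    {"span_id": "b1", "parent_span_id": "a2", "kind": "SERVER", "service": "orders"},
    {"span_id": "a3", "parent_span_id": "a1", "kind": "CLIENT", "service": "gateway"},
]
missing = find_unmatched_client_spans(trace)
print([s["span_id"] for s in missing])
```

Here the client span `a3` has no server counterpart, so the service it called is a candidate for missing instrumentation.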

---

### Test 2: Requests without trace_id

**Problem:** A service sends HTTP/gRPC requests without `traceparent`/`tracestate` headers.

**How to detect:**
- **Active method:** Middleware/sidecar checks incoming requests for the presence of trace headers
- **Passive method:** Server span without parent_span_id, even though the calling service is known

**Check configuration:**
```yaml
# Sidecar/Envoy configuration
validate_trace_headers:
  enabled: true
  action: log  # or reject
  exclude:
    - path: /health
    - path: /ready
    - source: external-gateway
```
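
The active method could look roughly like the sketch below. It assumes W3C `traceparent` version `00` only and mirrors the exclusions from the configuration above; the function name `check_trace_headers` and the return values are hypothetical.

```python
import re

# traceparent (version 00): 00-<32 hex trace_id>-<16 hex parent_id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$")

EXCLUDED_PATHS = {"/health", "/ready"}
EXCLUDED_SOURCES = {"external-gateway"}

def check_trace_headers(path, source, headers):
    """Return 'skip', 'ok', or 'missing' for an incoming request."""
    if path in EXCLUDED_PATHS or source in EXCLUDED_SOURCES:
        return "skip"
    # HTTP header names are case-insensitive
    lowered = {k.lower(): v for k, v in headers.items()}
    match = TRACEPARENT_RE.match(lowered.get("traceparent", "").strip())
    if match is None:
        return "missing"
    trace_id, parent_id = match.groups()
    # All-zero IDs are invalid in W3C Trace Context
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return "missing"
    return "ok"

valid = {"Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(check_trace_headers("/orders", "billing", valid))
```

With `action: log`, a `missing` result would only be recorded; with `reject`, the request would be refused.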

---

### Test 3: Span without parent_span_id

**Problem:** A service creates a span but cannot find the parent context.

**When it is an error:**
- The request came from an internal service (not from the edge gateway)
- The request is not a scheduled job or async consumer

**When it is acceptable:**
- Root span from an edge gateway
- Cron job / scheduled task
- Kafka consumer (first span in a new chain)

**TraceQL query:**
```traceql
{ kind = server && nestedSetParent < 0 && resource.service.name != "edge-gateway" }
```
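
The acceptable and erroneous cases above can be expressed as a small classifier. This is a sketch under assumptions: spans are dicts with hypothetical keys, and scheduled jobs are marked with a hypothetical `job_type` field that real spans may not carry.

```python
ALLOWED_ROOT_SERVICES = {"edge-gateway"}  # root spans are expected at the edge
ALLOWED_ROOT_KINDS = {"CONSUMER"}         # e.g. a Kafka consumer starting a new chain
ALLOWED_JOB_TYPES = {"cron"}              # hypothetical marker for scheduled tasks

def is_unexpected_root(span):
    """True if a span has no parent and none of the exclusions apply."""
    if span.get("parent_span_id"):
        return False
    if span["service"] in ALLOWED_ROOT_SERVICES:
        return False
    if span["kind"] in ALLOWED_ROOT_KINDS:
        return False
    if span.get("job_type") in ALLOWED_JOB_TYPES:
        return False
    return True

suspect = {"span_id": "s1", "parent_span_id": None, "kind": "SERVER", "service": "orders"}
print(is_unexpected_root(suspect))
```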

---

### Test 4: Propagator format consistency

**Problem:** Service A uses W3C Trace Context, service B uses B3 headers.

**Formats:**
| Format | Headers | Example |
|--------|---------|--------|
| W3C Trace Context | `traceparent`, `tracestate` | `traceparent: 00-trace_id-span_id-01` |
| B3 Single | `b3` | `b3: trace_id-span_id-1-parent_span_id` |
| B3 Multi | `X-B3-TraceId`, `X-B3-SpanId`, etc. | Multiple headers |
| Jaeger | `uber-trace-id` | `trace_id:span_id:parent_id:flags` |

**Recommendation:** Standardize on W3C Trace Context with a fallback to B3 for legacy.
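
One way to observe which formats a service actually sends is to inspect the header names on its outgoing requests. A minimal sketch based on the table above; the function name and the format labels are hypothetical:

```python
def detect_propagation_formats(headers):
    """Return the set of propagation formats present in a request's headers."""
    names = {k.lower() for k in headers}  # header names are case-insensitive
    formats = set()
    if "traceparent" in names:
        formats.add("w3c")
    if "b3" in names:
        formats.add("b3-single")
    if {"x-b3-traceid", "x-b3-spanid"} <= names:
        formats.add("b3-multi")
    if "uber-trace-id" in names:
        formats.add("jaeger")
    return formats

sample = {"Traceparent": "00-...", "X-B3-TraceId": "1", "X-B3-SpanId": "2"}
print(sorted(detect_propagation_formats(sample)))
```

A caller that sends only `b3-multi` to a callee that extracts only `w3c` is the mismatch this test is meant to surface.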

---

### Test 5: Required span attributes validation

**Required attributes (semantic conventions):**

| Attribute | Span type | Description |
|---------|------------|----------|
| `service.name` | All | Service name |
| `span.kind` | All | CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL |
| `http.method` | HTTP | GET, POST, etc. |
| `http.url` or `http.route` | HTTP | URL or route template |
| `http.status_code` | HTTP | 200, 404, 500, etc. |
| `rpc.system` | gRPC | "grpc" |
| `rpc.service` | gRPC | Service name |
| `rpc.method` | gRPC | Method name |
| `db.system` | Database | "postgresql", "mysql", etc. |
| `db.statement` | Database | SQL query (sanitized) |

**TraceQL for finding spans without attributes:**
```traceql
{ kind = server && span.http.method = "" }
```
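
Outside of TraceQL, the same check can be run over exported span attributes. The sketch below takes the required attributes from the table above and treats `http.url` and `http.route` as alternatives; the grouping by span type and the function name are assumptions.

```python
# Each tuple lists alternatives; at least one of them must be present
REQUIRED = {
    "http": [("http.method",), ("http.url", "http.route"), ("http.status_code",)],
    "grpc": [("rpc.system",), ("rpc.service",), ("rpc.method",)],
    "db": [("db.system",), ("db.statement",)],
}

def missing_attributes(span_type, attributes):
    """Return the required attributes that are absent or empty on a span."""
    missing = []
    for alternatives in REQUIRED.get(span_type, []):
        present = any(attributes.get(name) not in (None, "") for name in alternatives)
        if not present:
            missing.append(" or ".join(alternatives))
    return missing

attrs = {"http.method": "GET", "http.route": "/orders/{id}"}
print(missing_attributes("http", attrs))
```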

---

### Test 6: Trace depth anomaly detection

**Problem:** A trace contains 1-2 spans where a chain of 5-10 is expected.

**How to detect:**
1. Establish a baseline depth for typical user flows
2. Compare the depth of a specific trace against the baseline
3. Abnormally shallow traces → possible loss of spans

**Metric:**
```
trace_depth_ratio = actual_depth / expected_depth
if trace_depth_ratio < 0.5 → WARNING
```
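
Depth can be computed by walking parent links from every span up to its root. A minimal sketch of the metric, assuming spans are dicts with hypothetical `span_id` and `parent_span_id` keys and that the expected depth comes from a previously established baseline:

```python
def trace_depth(spans):
    """Length of the longest parent chain in a trace (a lone root has depth 1)."""
    parent = {s["span_id"]: s.get("parent_span_id") for s in spans}
    deepest = 0
    for span_id in parent:
        depth, seen, current = 0, set(), span_id
        # Stop at the root, at a parent missing from the trace, or on a cycle
        while current in parent and current not in seen:
            seen.add(current)
            depth += 1
            current = parent[current]
        deepest = max(deepest, depth)
    return deepest

def depth_status(actual_depth, expected_depth, threshold=0.5):
    """Apply the trace_depth_ratio rule."""
    ratio = actual_depth / expected_depth
    return "WARNING" if ratio < threshold else "OK"

shallow = [
    {"span_id": "a", "parent_span_id": None},
    {"span_id": "b", "parent_span_id": "a"},
]
print(depth_status(trace_depth(shallow), expected_depth=5))
```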

---

### Test 7: Hanging spans detection

**Problem:** A span started but did not finish (no `end_time`).

**Causes:**
- Application crash
- Timeout without proper cleanup
- Bug in instrumentation

**TraceQL:**
```
{ status = unset && duration > 60s }
```

---

### Test 8: Duplicate span_id detection

**Problem:** Two spans with the same `span_id` in the same trace.

**Causes:**
- Bug in ID generation
- Replay attack
- Incorrect SDK usage

**How to detect:**
```sql
SELECT trace_id, span_id, COUNT(*)
FROM spans
GROUP BY trace_id, span_id
HAVING COUNT(*) > 1
```
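
When spans are available as exported records rather than in a SQL store, the same grouping can be done in memory. A sketch assuming dicts with `trace_id` and `span_id` keys:

```python
from collections import Counter

def duplicate_span_ids(spans):
    """Return (trace_id, span_id) pairs that occur more than once."""
    counts = Counter((s["trace_id"], s["span_id"]) for s in spans)
    return sorted(key for key, n in counts.items() if n > 1)

spans = [
    {"trace_id": "t1", "span_id": "s1"},
    {"trace_id": "t1", "span_id": "s1"},  # collision within the same trace
    {"trace_id": "t2", "span_id": "s1"},  # same span_id in another trace is fine
]
print(duplicate_span_ids(spans))
```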

---

### Test 9: Sampling decision propagation

**Problem:** Parent span has `sampled=true`, but the child span is not recorded.

**W3C Trace Context:**
```
traceparent: 00-{trace_id}-{span_id}-{flags}
                                     ^^^^^^^ 01 = sampled
```

**How to detect:**
- Compare the number of client spans with the number of server spans
- If client spans > server spans when sampled=true → propagation problem
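
To restrict the comparison to sampled requests, the sampled flag has to be read from the header: it is the lowest bit of the trace-flags field. A minimal sketch, assuming a well-formed `traceparent` value (no validation is done here):

```python
def is_sampled(traceparent):
    """Return True if the sampled bit is set in a W3C traceparent header."""
    flags = traceparent.strip().split("-")[3]
    return (int(flags, 16) & 0x01) == 1

header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(is_sampled(header))
```

Testing the bit rather than comparing the field to `01` keeps the check correct if other flag bits are set.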

---

### Test 10: Trace completeness ratio

**Tracing quality metric:**
```
completeness_ratio = traces_with_full_chain / total_traces
```

**How to define a "complete" trace:**
1. Has a root span (entry point)
2. Every client span has a corresponding server span
3. No orphan spans (spans without a parent, except the root)

**Target:** completeness_ratio > 0.95 (95%)
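
The three criteria can be combined into a per-trace predicate and averaged. This is a sketch under assumptions: spans are dicts with hypothetical keys, "has a root span" is read as "exactly one span without a parent", and an orphan is a span whose parent is not present in the trace.

```python
def is_complete(spans):
    """Apply the three completeness criteria to one trace."""
    ids = {s["span_id"] for s in spans}
    roots = [s for s in spans if not s.get("parent_span_id")]
    orphans = [
        s for s in spans
        if s.get("parent_span_id") and s["parent_span_id"] not in ids
    ]
    server_parents = {s.get("parent_span_id") for s in spans if s["kind"] == "SERVER"}
    unmatched_clients = [
        s for s in spans
        if s["kind"] == "CLIENT" and s["span_id"] not in server_parents
    ]
    return len(roots) == 1 and not orphans and not unmatched_clients

def completeness_ratio(traces):
    """Share of complete traces; 0.0 for an empty input."""
    if not traces:
        return 0.0
    return sum(is_complete(t) for t in traces) / len(traces)

full = [
    {"span_id": "r", "parent_span_id": None, "kind": "SERVER"},
    {"span_id": "c", "parent_span_id": "r", "kind": "CLIENT"},
    {"span_id": "s", "parent_span_id": "c", "kind": "SERVER"},
]
fragmented = full[:2]  # the callee's server span is missing
print(completeness_ratio([full, fragmented]))
```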

---

## Test applicability matrix

| Service type | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 |
|-------------|----|----|----|----|----|----|----|----|----|----|----|----|----|
| REST API | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| gRPC Service | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kafka Consumer | ✓ | - | - | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Cron Job | - | - | - | ✓ | ✓ | - | ✓ | ✓ | - | - | - | ✓ | ✓ |
| Edge Gateway | ✓ | - | - | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Database | ✓ | N/A | N/A | N/A | ✓ | N/A | ✓ | ✓ | N/A | N/A | N/A | N/A | N/A |

**Legend:**
- ✓ — test applicable
- `-` — test not applicable (exclusion)
- N/A — not relevant for this type

*Test 11 is not applicable to Cron Job — the absence of spans between runs is legitimate. Test 13 is applicable to Cron Job if `ENABLE_TRACING=true` is declared: the contract holds for all runs within the window.*