fix(tracing): direct OTel SDK setup for chain-coherent sampling #2756
ci-operator wants to merge 1 commit into
Code Review
This pull request transitions Pipelines-as-Code tracing configuration from a custom ConfigMap to standard OpenTelemetry environment variables (such as OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER). It introduces a new tracing package to initialize and manage the tracer provider lifecycle, updating both the controller and adapter to initialize and shut down the provider correctly. A critical issue was identified in pkg/tracing/provider.go where passing the application's cancellation context to the OTLP exporter prevents final spans from being flushed during shutdown; using context.Background() is recommended instead.
		return noopProvider()
	}

	exporter, err := newExporter(ctx, logger)
Passing the application's cancellation context (ctx) to the OTLP exporter creation means that when the application context is cancelled to initiate shutdown, the exporter's connection/client is also cancelled immediately. This prevents the final flush of spans during tp.Shutdown(shutdownCtx) from succeeding, leading to lost spans. Using context.Background() ensures the exporter remains functional during the shutdown flush.
Suggested change:
- exporter, err := newExporter(ctx, logger)
+ exporter, err := newExporter(context.Background(), logger)
newExporter uses context.Background() so the exporter's connection stays open through tp.Shutdown(shutdownCtx) and queued spans flush. The ctx parameter on tracing.New is dropped since nothing else used it.
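The concern in this thread can be illustrated with a toy model. The sketch below uses only the standard library: `fakeExporter`, `newFakeExporter`, and `flushedOnShutdown` are hypothetical names invented for illustration, and the model assumes (as the review comment does) that the exporter's connection is tied to the context it was created with. It is not the OTel exporter's actual implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeExporter is a hypothetical stand-in for an exporter whose connection
// is assumed to live only as long as the context it was created with.
type fakeExporter struct {
	connCtx context.Context
	queued  []string
}

func newFakeExporter(ctx context.Context) *fakeExporter {
	return &fakeExporter{connCtx: ctx}
}

// Flush sends all queued spans. It fails if the connection context or the
// flush context has already been cancelled.
func (e *fakeExporter) Flush(ctx context.Context) (int, error) {
	if err := e.connCtx.Err(); err != nil {
		return 0, fmt.Errorf("connection closed: %w", err)
	}
	if err := ctx.Err(); err != nil {
		return 0, err
	}
	n := len(e.queued)
	e.queued = nil
	return n, nil
}

// flushedOnShutdown simulates a shutdown: the application context is
// cancelled first, then a flush runs under a fresh timeout context.
// It returns how many spans were flushed.
func flushedOnShutdown(useAppCtx bool) (int, error) {
	appCtx, cancel := context.WithCancel(context.Background())
	exporterCtx := context.Background()
	if useAppCtx {
		exporterCtx = appCtx
	}
	exp := newFakeExporter(exporterCtx)
	exp.queued = []string{"span-a", "span-b"}

	cancel() // the application begins shutting down

	shutdownCtx, done := context.WithTimeout(context.Background(), 5*time.Second)
	defer done()
	return exp.Flush(shutdownCtx)
}

func main() {
	n, err := flushedOnShutdown(true)
	fmt.Println("exporter bound to app context:", n, err)
	n, err = flushedOnShutdown(false)
	fmt.Println("exporter bound to context.Background():", n, err)
}
```

Under the model's assumption, only the exporter created with `context.Background()` can still flush after the application context is cancelled, which is the behavior the fix aims for.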
Force-pushed from ff868b1 to 50ad9c1.
# tracing-protocol specifies the trace export protocol.
# Supported values: "grpc", "http/protobuf", "none".
# Default is "none" (tracing disabled).
# tracing-protocol: "none"
# tracing-endpoint specifies the OTLP collector endpoint.
# Required when tracing-protocol is "grpc" or "http/protobuf".
# The OTEL_EXPORTER_OTLP_ENDPOINT env var takes precedence if set.
# tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
# tracing-sampling-rate controls the fraction of traces sampled.
# 0.0 = none, 1.0 = all. Default is 0 (none).
# tracing-sampling-rate: "1.0"
While the ConfigMap keys are removed, I see no change in the controller. Does that mean ConfigMap-based tracing continues to work even with these changes?
The Knative tracing wrapper that consumed these _example keys is replaced; the new tracer in pkg/tracing/provider.go reads the OTel-standard env vars directly. The _example block was just documentation for keys nothing reads anymore.
This is not the same ConfigMap as pipelines-as-code (the main one), which holds the tracing-label-action|application|component operator label-name mappings.
On the "no change in controller" observation - we verified end-to-end that with neither OTEL_EXPORTER_OTLP_ENDPOINT nor OTEL_TRACES_SAMPLER set, PaC falls back to a noop tracer and emits no spans. If you're seeing tracing behavior unchanged from the pre-PR state, could you share how to reproduce so we can dig in?
By changes in the controller, I meant to ask whether the Knative base (eventing/adapter) that PaC uses reads these observability ConfigMap keys for any other configuration or operations. Ref.
Yes - the Knative eventing-adapter still consumes config-observability for non-tracing observability (metrics/profiling/etc. via evadapter.NewObservabilityConfiguratorFromConfigMap() at the line you linked, logging is read from config-logging separately). What this PR changes is specifically the tracing portion: PaC's old Knative tracing wrapper (added in bd9f468) read tracing-protocol / tracing-endpoint / tracing-sampling-rate from this ConfigMap; the new pkg/tracing/provider.go reads the OTel-standard env vars directly and that wrapper is gone, so those three keys went with the _example block. The Knative-base reads for non-tracing observability are unchanged.
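The opt-in rule described in this thread (both env vars set, otherwise a noop tracer) can be sketched as a small predicate. This is a minimal illustration, not the code in pkg/tracing/provider.go: `tracingEnabled` and the constant names are hypothetical, and treating an empty value the same as an unset one is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

const (
	envEndpoint = "OTEL_EXPORTER_OTLP_ENDPOINT"
	envSampler  = "OTEL_TRACES_SAMPLER"
)

// tracingEnabled reports whether both opt-in variables are non-empty.
// Assumption: an empty value counts as unset.
// The getenv parameter makes the check testable without touching the
// process environment.
func tracingEnabled(getenv func(string) string) bool {
	return getenv(envEndpoint) != "" && getenv(envSampler) != ""
}

func main() {
	if tracingEnabled(os.Getenv) {
		fmt.Println("tracing enabled: would build a real tracer provider")
		return
	}
	fmt.Println("tracing disabled: would fall back to a noop provider")
}
```

Injecting the lookup function keeps the decision in one place, so the controller and watcher could share it and tests could exercise every combination of the two variables.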
Both `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_TRACES_SAMPLER` must be set to opt in to tracing. If either is unset PaC falls back to a noop tracer that emits no spans and incurs no exporter cost. Changes to any of these env vars take effect on the next pod restart.

### Example
PaC honors inbound `traceparent` headers on incoming webhook requests via the W3C TraceContext propagator. OTel Baggage is intentionally not honored; per [ADR 0061](https://github.com/konflux-ci/architecture/blob/main/ADR/0061-distributed-tracing.md), this delivery-tracing chain does not use Baggage for cross-service attribute propagation because all required attributes are locally available at every emission point.
We should not link a downstream project's design doc in PaC's documentation.
Is github.com/konflux-ci considered downstream? I don't think so; it's another open source repository.
Downstream would be anything that builds on or utilizes PaC and if a change here is affecting konflux-ci, then it does qualify as a downstream project.
Even so, point still stands. We should not link a design doc from another project.
Yes, please remove the Konflux references; we do not depend on it.
Removed the konflux link entirely. Rewrote the Baggage paragraph to stand on its own (every emission point already has the attributes it needs from the local PipelineRun and webhook event, so no cross-service propagation channel is needed). Also corrected a stale tracing-endpoint reference further down the file - the OTel SDK reads OTEL_EXPORTER_OTLP_ENDPOINT automatically, so the docs just needed to point at the standard env var name.
Both `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_TRACES_SAMPLER` must be set to opt in to tracing. If either is unset PaC falls back to a noop tracer that emits no spans and incurs no exporter cost. Changes to any of these env vars take effect on the next pod restart.

### Example
PaC honors inbound `traceparent` headers on incoming webhook requests via the W3C TraceContext propagator. OTel Baggage is intentionally not honored; per [ADR 0061](https://github.com/konflux-ci/architecture/blob/main/ADR/0061-distributed-tracing.md), this delivery-tracing chain does not use Baggage for cross-service attribute propagation because all required attributes are locally available at every emission point.
Actually, it's 0062-distributed-tracing.md.
	case protocolGRPC:
		return otlptracegrpc.New(ctx, otlptracegrpc.WithEndpointURL(endpoint))
	default:
		logger.Errorw("unsupported OTLP protocol; falling back to grpc", "protocol", protocolFromEnv())
[nit] redundant protocolFromEnv call
Fixed. Hoisted to a local.
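A minimal sketch of what hoisting the protocol lookup into a local might look like, using only the standard library. `resolveProtocol` and `protocolHTTP` are hypothetical names; `protocolGRPC` and the fallback-to-grpc behavior come from the diff above. Treating an empty value as grpc without a warning is an assumption, not something the PR states.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const (
	protocolGRPC = "grpc"
	protocolHTTP = "http/protobuf" // hypothetical constant name
)

// resolveProtocol normalizes the raw value and returns the protocol to use,
// plus whether an unsupported value forced a fallback to grpc.
// Assumption: an empty value selects grpc without counting as a fallback.
func resolveProtocol(raw string) (string, bool) {
	proto := strings.ToLower(strings.TrimSpace(raw))
	switch proto {
	case "", protocolGRPC:
		return protocolGRPC, false
	case protocolHTTP:
		return protocolHTTP, false
	default:
		return protocolGRPC, true
	}
}

func main() {
	// Read the environment once and reuse the local in both the decision
	// and the log message, instead of calling the lookup twice.
	raw := os.Getenv("OTEL_EXPORTER_OTLP_PROTOCOL")
	proto, fellBack := resolveProtocol(raw)
	if fellBack {
		fmt.Printf("unsupported OTLP protocol %q; falling back to %s\n", raw, proto)
	}
	fmt.Println("using protocol:", proto)
}
```

Reading the value once also guarantees that the logged protocol is the same value the switch actually evaluated.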
Force-pushed from 50ad9c1 to 5b41561.
/ok-to-test
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2756      +/-   ##
==========================================
+ Coverage   59.67%   59.72%   +0.04%
==========================================
  Files         210      211       +1
  Lines       21007    21107     +100
==========================================
+ Hits        12536    12606      +70
- Misses       7685     7711      +26
- Partials      786      790       +4
☔ View full report in Codecov by Harness.
This feels vibe coded. Have you tested this properly? Have you properly reviewed it?
Knative's config-observability ConfigMap only exposes a flat tracing-sampling-rate, so at fractional rates each service in the chain rolls independently: PaC can drop a trace while Tekton keeps it, leaving execution spans whose parent_spanID points at nothing.

Switching to the OTel SDK opens up OTEL_TRACES_SAMPLER's parentbased_* family, which honors the root span's sample decision in the W3C traceparent flag so the whole chain is kept or dropped together.

Controller and watcher call tracing.New() at startup. Tracing is opt-in: both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set, otherwise PaC falls back to noop (matching the prior tracing-sampling-rate "0" default).

Resource service.name is pipelines-as-code. Propagator is W3C TraceContext only; Baggage is intentionally not honored per Konflux-CI ADR 0061.

otlptracegrpc and otlptracehttp promoted from indirect to direct dependencies.

Assisted-by: Claude Code
Signed-off-by: Josiah England <jengland@redhat.com>
Force-pushed from 5b41561 to 311babd.
Sorry for the churn here. A number of findings surfaced while working through the integration-service and release-service side of this work, and the OTel SDK swap in this PR turned out to be necessary to avoid a fractional-sampling fragmentation bug (Knative's tracing wrapper made per-service sampling decisions independently of the root span). The unrelated cleanups landed in the same commit, but the overall behavior is functionally identical to what shipped in #2605: same spans, same attribute schema, same opt-in env vars, just plumbed through the OTel SDK directly. The PR was validated end-to-end across more than a dozen scenarios covering each fix and each error class it mitigates.
📝 Description of the Change
PR #2605 ran PaC's tracing through Knative's config-observability flat tracing-sampling-rate. At fractional rates each service in the chain rolls independently: PaC can drop a trace while Tekton keeps it, leaving execution spans whose parent_spanID points at nothing.

This PR switches to the OTel SDK directly. OTEL_TRACES_SAMPLER opens the parentbased_* family, which honors the root span's sample decision in the W3C traceparent flag bit so the whole chain is kept or dropped together.

Controller and watcher call tracing.New() at startup; tracing is opt-in (both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set).

The tracing-* keys are dropped from the _example ConfigMap block since they no longer apply. Resource service.name is pipelines-as-code.

Propagator is W3C TraceContext only; Baggage is intentionally not honored per Konflux-CI ADR 0061.
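To make the "kept or dropped together" idea concrete, here is a simplified, stdlib-only sketch of a parent-based decision driven by the sampled bit of a W3C traceparent header. `parentSampled` and `shouldSample` are hypothetical helpers, not the OTel SDK's sampler, and the parsing handles only the basic four-field layout without validating the hex content of the IDs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parentSampled parses a traceparent header of the form
// version-traceid-parentid-flags and reports whether the sampled flag
// (lowest bit of the flags byte) is set. ok is false if the header is
// missing or does not have the expected field lengths.
func parentSampled(traceparent string) (sampled bool, ok bool) {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return false, false
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return false, false
	}
	return flags&0x01 == 0x01, true
}

// shouldSample follows the parent's decision when a valid traceparent is
// present, and otherwise uses rootDecision (for example, the outcome of a
// ratio-based roll made by the service that starts the trace).
func shouldSample(traceparent string, rootDecision bool) bool {
	if sampled, ok := parentSampled(traceparent); ok {
		return sampled
	}
	return rootDecision
}

func main() {
	kept := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	dropped := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"
	fmt.Println("parent kept, local roll says drop:", shouldSample(kept, false))
	fmt.Println("parent dropped, local roll says keep:", shouldSample(dropped, true))
	fmt.Println("no parent, local roll says keep:", shouldSample("", true))
}
```

Because every downstream service defers to the flag set by the root, a local roll can no longer contradict the upstream decision, which is the fragmentation this PR sets out to avoid.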
🔗 Linked GitHub Issue
Fixes #
🧪 Testing Strategy
🤖 AI Assistance
AI assistance can be used for various tasks, such as code generation,
documentation, or testing.
Please indicate whether you have used AI assistance
for this PR and provide details if applicable.
Important
Slop will be simply rejected, if you are using AI assistance you need to make sure you
understand the code generated and that it meets the project's standards. you
need at least know how to run the code and deploy it (if needed). See
startpaac to make it easy
to deploy and test your code changes.
If the majority of the code in this PR was generated by an AI, please add a
Co-authored-by trailer to your commit message. For example:
Co-authored-by: Claude noreply@anthropic.com
✅ Submitter Checklist
fix:, feat:) matches the "Type of Change" I selected above.
make test and make lint locally to check for and fix any issues. For an efficient workflow, I have considered installing pre-commit and running pre-commit install to automate these checks.