fix(tracing): direct OTel SDK setup for chain-coherent sampling#2756

Open
ci-operator wants to merge 1 commit into
tektoncd:mainfrom
ci-operator:distributed-tracing

Conversation

@ci-operator
Contributor

📝 Description of the Change

PR #2605 ran PaC's tracing through Knative's config-observability flat
tracing-sampling-rate. At fractional rates each service in the chain rolls
independently — PaC can drop a trace while Tekton keeps it, leaving execution
spans whose parent_spanID points at nothing.

This PR switches to the OTel SDK directly. OTEL_TRACES_SAMPLER opens the
parentbased_* family, which honors the root span's sample decision in the
W3C traceparent flag bit so the whole chain is kept or dropped together.
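To make the parent-based behaviour concrete, here is a minimal stdlib-only sketch of the decision a `parentbased_*` sampler makes. It is not the OTel SDK's implementation and not code from this PR; `parentSampled` and `shouldSample` are hypothetical names, and the header parsing is simplified to the four-field traceparent layout.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parentSampled reports whether a four-field W3C traceparent header carries
// the "sampled" flag (lowest bit of the trace-flags byte). ok is false when
// the header is absent or malformed. Hypothetical helper, for illustration.
func parentSampled(traceparent string) (sampled, ok bool) {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return false, false
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return false, false
	}
	return flags&0x01 == 1, true
}

// shouldSample mimics a parent-based sampler: follow the parent's decision
// when a valid parent exists, otherwise defer to the root sampler.
func shouldSample(traceparent string, rootDecision func() bool) bool {
	if sampled, ok := parentSampled(traceparent); ok {
		return sampled
	}
	return rootDecision()
}

func main() {
	kept := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	dropped := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"
	root := func() bool { return true } // stand-in for the root sampler

	fmt.Println("parent kept:   ", shouldSample(kept, root))
	fmt.Println("parent dropped:", shouldSample(dropped, root))
	fmt.Println("no parent:     ", shouldSample("", root))
}
```

Because every service downstream of the root reads the same flag bit instead of rolling its own ratio, the chain is kept or dropped as a unit.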

Controller and watcher call tracing.New() at startup; tracing is opt-in
(both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set).
The tracing-* keys are dropped from the _example ConfigMap block since
they no longer apply. Resource service.name is pipelines-as-code.
Propagator is W3C TraceContext only; Baggage is intentionally not honored
per Konflux-CI ADR 0061.
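The dual opt-in described above can be sketched as follows. This is an illustrative guess at the shape of the check, not the code in pkg/tracing/provider.go; `tracingEnabled` and the injected `lookup` parameter are assumptions made so the check can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

const (
	envEndpoint = "OTEL_EXPORTER_OTLP_ENDPOINT"
	envSampler  = "OTEL_TRACES_SAMPLER"
)

// tracingEnabled reports whether both opt-in variables are non-empty.
// lookup is injected for testability; pass os.Getenv in real use.
// Hypothetical helper, not taken from the PR.
func tracingEnabled(lookup func(string) string) bool {
	return lookup(envEndpoint) != "" && lookup(envSampler) != ""
}

func main() {
	if tracingEnabled(os.Getenv) {
		fmt.Println("tracing: exporter and sampler configured")
	} else {
		// Either variable missing means no exporter is built at all.
		fmt.Println("tracing: noop provider")
	}
}
```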

🔗 Linked GitHub Issue

Fixes #

🧪 Testing Strategy

  • Unit tests
  • Integration tests
  • End-to-end tests
  • Manual testing
  • Not Applicable

🤖 AI Assistance

AI assistance can be used for various tasks, such as code generation,
documentation, or testing.

Please indicate whether you have used AI assistance
for this PR and provide details if applicable.

  • I have not used any AI assistance for this PR.
  • I have used AI assistance for this PR.

Important

Slop will simply be rejected. If you are using AI assistance, you need to make sure you
understand the generated code and that it meets the project's standards. You
need to at least know how to run the code and deploy it (if needed). See
startpaac to make it easy
to deploy and test your code changes.

If the majority of the code in this PR was generated by an AI, please add a Co-authored-by trailer to your commit message.
For example:

Co-authored-by: Claude <noreply@anthropic.com>

✅ Submitter Checklist

  • 📝 My commit messages are clear, informative, and follow the project's How to write a git commit message guide. The Gitlint linter validates this in CI.
  • ✨ I have ensured my commit message prefix (e.g., fix:, feat:) matches the "Type of Change" I selected above.
  • ♽ I have run make test and make lint locally to check for and fix any
    issues. For an efficient workflow, I have considered installing
    pre-commit and running pre-commit install to
    automate these checks.
  • 📖 I have added or updated documentation for any user-facing changes.
  • 🧪 I have added sufficient unit tests for my code changes.
  • 🎁 I have added end-to-end tests where feasible. See README for more details.
  • 🔎 I have addressed any CI test flakiness or provided a clear reason to bypass it.
  • If adding a provider feature, I have filled in the following and updated the provider documentation:
    • GitHub App
    • GitHub Webhook
    • Gitea/Forgejo
    • GitLab
    • Bitbucket Cloud
    • Bitbucket Data Center


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request transitions Pipelines-as-Code tracing configuration from a custom ConfigMap to standard OpenTelemetry environment variables (such as OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER). It introduces a new tracing package to initialize and manage the tracer provider lifecycle, updating both the controller and adapter to initialize and shut down the provider correctly. A critical issue was identified in pkg/tracing/provider.go where passing the application's cancellation context to the OTLP exporter prevents final spans from being flushed during shutdown; using context.Background() is recommended instead.


Comment thread pkg/tracing/provider.go Outdated
return noopProvider()
}

exporter, err := newExporter(ctx, logger)

high

Passing the application's cancellation context (ctx) to the OTLP exporter creation means that when the application context is cancelled to initiate shutdown, the exporter's connection/client is also cancelled immediately. This prevents the final flush of spans during tp.Shutdown(shutdownCtx) from succeeding, leading to lost spans. Using context.Background() ensures the exporter remains functional during the shutdown flush.

Suggested change
exporter, err := newExporter(ctx, logger)
exporter, err := newExporter(context.Background(), logger)

Contributor Author

newExporter uses context.Background() so the exporter's connection stays open through tp.Shutdown(shutdownCtx) and queued spans flush. The ctx parameter on tracing.New is dropped since nothing else used it.
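The failure mode discussed in this thread can be modelled with a toy exporter. This sketch deliberately assumes the exporter's connection is tied to the context it was created with; whether the real OTLP exporters retain that context is not confirmed here. `fakeExporter`, `newFakeExporter`, and `flushAfterCancel` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeExporter is a hypothetical stand-in for an exporter whose connection
// lives only as long as the context it was created with.
type fakeExporter struct{ connCtx context.Context }

func newFakeExporter(ctx context.Context) *fakeExporter {
	return &fakeExporter{connCtx: ctx}
}

// Flush fails if the connection context or the caller's context is done.
func (e *fakeExporter) Flush(ctx context.Context, spans int) error {
	if err := e.connCtx.Err(); err != nil {
		return fmt.Errorf("connection closed, %d spans lost: %w", spans, err)
	}
	return ctx.Err()
}

// flushAfterCancel cancels the application context, then flushes both
// exporters under a fresh shutdown context with its own deadline.
func flushAfterCancel() (tiedErr, detachedErr error) {
	appCtx, cancel := context.WithCancel(context.Background())
	tied := newFakeExporter(appCtx)                   // created with app ctx
	detached := newFakeExporter(context.Background()) // created detached

	cancel() // shutdown begins

	shutdownCtx, done := context.WithTimeout(context.Background(), 5*time.Second)
	defer done()
	return tied.Flush(shutdownCtx, 3), detached.Flush(shutdownCtx, 3)
}

func main() {
	tiedErr, detachedErr := flushAfterCancel()
	fmt.Println("tied to app ctx:", tiedErr)
	fmt.Println("detached:       ", detachedErr)
}
```

Under that assumption, only the detached exporter can still flush once the application context has been cancelled, which is the argument for creating it with context.Background() and bounding the flush with a separate shutdown context.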

@ci-operator ci-operator force-pushed the distributed-tracing branch from ff868b1 to 50ad9c1 Compare June 3, 2026 19:12
Comment on lines -55 to -68
# tracing-protocol specifies the trace export protocol.
# Supported values: "grpc", "http/protobuf", "none".
# Default is "none" (tracing disabled).
# tracing-protocol: "none"
# tracing-endpoint specifies the OTLP collector endpoint.
# Required when tracing-protocol is "grpc" or "http/protobuf".
# The OTEL_EXPORTER_OTLP_ENDPOINT env var takes precedence if set.
# tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
# tracing-sampling-rate controls the fraction of traces sampled.
# 0.0 = none, 1.0 = all. Default is 0 (none).
# tracing-sampling-rate: "1.0"
Member

While the ConfigMap keys are removed, I see no change in the controller. Does that mean ConfigMap-based tracing continues to work even with these changes?

Contributor Author

The Knative tracing wrapper that consumed these _example keys is replaced; the new tracer in pkg/tracing/provider.go reads the OTel-standard env vars directly. The _example block was just documentation for keys nothing reads anymore.

This is not the same ConfigMap as pipelines-as-code (the main one), which holds the tracing-label-action|application|component operator label-name mappings.

On the "no change in controller" observation - we verified end-to-end that with neither OTEL_EXPORTER_OTLP_ENDPOINT nor OTEL_TRACES_SAMPLER set, PaC falls back to a noop tracer and emits no spans. If you're seeing tracing behavior unchanged from the pre-PR state, could you share how to reproduce so we can dig in?

Member

By "changes in controller", I meant to ask whether the Knative base (eventing/adapter) that PaC uses reads these observability ConfigMap keys for any other configuration or operations. Ref.

Contributor Author

Yes - the Knative eventing-adapter still consumes config-observability for non-tracing observability (metrics/profiling/etc. via evadapter.NewObservabilityConfiguratorFromConfigMap() at the line you linked, logging is read from config-logging separately). What this PR changes is specifically the tracing portion: PaC's old Knative tracing wrapper (added in bd9f468) read tracing-protocol / tracing-endpoint / tracing-sampling-rate from this ConfigMap; the new pkg/tracing/provider.go reads the OTel-standard env vars directly and that wrapper is gone, so those three keys went with the _example block. The Knative-base reads for non-tracing observability are unchanged.

Comment thread docs/content/docs/operations/tracing.md Outdated
Both `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_TRACES_SAMPLER` must be set to opt in to tracing. If either is unset PaC falls back to a noop tracer that emits no spans and incurs no exporter cost. Changes to any of these env vars take effect on the next pod restart.

### Example
PaC honors inbound `traceparent` headers on incoming webhook requests via the W3C TraceContext propagator. OTel Baggage is intentionally not honored; per [ADR 0061](https://github.com/konflux-ci/architecture/blob/main/ADR/0061-distributed-tracing.md), this delivery-tracing chain does not use Baggage for cross-service attribute propagation because all required attributes are locally available at every emission point.
Member

@theakshaypant theakshaypant Jun 4, 2026

We should not link a downstream project's design doc in PaC's documentation.

Member

Is github.com/konflux-ci considered downstream? I think not; it's another open source repository.

Member

@theakshaypant theakshaypant Jun 4, 2026

Downstream would be anything that builds on or uses PaC, and if a change here affects konflux-ci, then it does qualify as a downstream project.
Even so, the point still stands. We should not link a design doc from another project.

Member

Yes, please remove the Konflux references; we do not depend on it.

Contributor Author

Removed the konflux link entirely. Rewrote the Baggage paragraph to stand on its own (every emission point already has the attributes it needs from the local PipelineRun and webhook event, so no cross-service propagation channel is needed). Also corrected a stale tracing-endpoint reference further down the file - the OTel SDK reads OTEL_EXPORTER_OTLP_ENDPOINT automatically, so the docs just needed to point at the standard env var name.

Comment thread docs/content/docs/operations/tracing.md Outdated
Both `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_TRACES_SAMPLER` must be set to opt in to tracing. If either is unset PaC falls back to a noop tracer that emits no spans and incurs no exporter cost. Changes to any of these env vars take effect on the next pod restart.

### Example
PaC honors inbound `traceparent` headers on incoming webhook requests via the W3C TraceContext propagator. OTel Baggage is intentionally not honored; per [ADR 0061](https://github.com/konflux-ci/architecture/blob/main/ADR/0061-distributed-tracing.md), this delivery-tracing chain does not use Baggage for cross-service attribute propagation because all required attributes are locally available at every emission point.
Member

Broken link; I get a 404.

Member

actually it's 0062-distributed-tracing.md

Comment thread pkg/tracing/provider.go Outdated
case protocolGRPC:
return otlptracegrpc.New(ctx, otlptracegrpc.WithEndpointURL(endpoint))
default:
logger.Errorw("unsupported OTLP protocol; falling back to grpc", "protocol", protocolFromEnv())
Member

[nit] redundant protocolFromEnv call

Contributor Author

Fixed. Hoisted to a local.
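For readers following the nit, the hoisting looks roughly like the sketch below. It is a stdlib-only illustration, not the code in the PR: `selectTransport`, the `lookup` parameter, the use of OTEL_EXPORTER_OTLP_PROTOCOL, and the gRPC default when the variable is unset are all assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const (
	protocolGRPC = "grpc"
	protocolHTTP = "http/protobuf"
)

// protocolFromEnv reads the protocol variable, defaulting to gRPC when it is
// unset. The variable name and the default are assumptions in this sketch.
func protocolFromEnv(lookup func(string) string) string {
	p := strings.TrimSpace(lookup("OTEL_EXPORTER_OTLP_PROTOCOL"))
	if p == "" {
		return protocolGRPC
	}
	return p
}

// selectTransport returns the transport to use plus a warning that is
// non-empty only when an unsupported protocol was requested.
func selectTransport(lookup func(string) string) (transport, warning string) {
	protocol := protocolFromEnv(lookup) // read once, reused in the warning
	switch protocol {
	case protocolGRPC, protocolHTTP:
		return protocol, ""
	default:
		return protocolGRPC, fmt.Sprintf(
			"unsupported OTLP protocol %q; falling back to grpc", protocol)
	}
}

func main() {
	transport, warning := selectTransport(os.Getenv)
	if warning != "" {
		fmt.Println(warning)
	}
	fmt.Println("transport:", transport)
}
```

Reading the value once also guarantees that the protocol named in the log message is the same one the switch evaluated.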

@zakisk zakisk force-pushed the distributed-tracing branch from 50ad9c1 to 5b41561 Compare June 4, 2026 05:34
@zakisk
Member

zakisk commented Jun 4, 2026

/ok-to-test

@codecov-commenter

Codecov Report

❌ Patch coverage is 70.00000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.72%. Comparing base (2c03760) to head (5b41561).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/tracing/provider.go | 80.45% | 13 Missing and 4 partials ⚠️ |
| pkg/reconciler/controller.go | 0.00% | 7 Missing ⚠️ |
| pkg/adapter/adapter.go | 0.00% | 6 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2756      +/-   ##
==========================================
+ Coverage   59.67%   59.72%   +0.04%     
==========================================
  Files         210      211       +1     
  Lines       21007    21107     +100     
==========================================
+ Hits        12536    12606      +70     
- Misses       7685     7711      +26     
- Partials      786      790       +4     


@chmouel
Member

chmouel commented Jun 4, 2026

This feels vibe coded. Have you tested this properly? Have you properly reviewed it?

Knative's config-observability ConfigMap only exposes a flat
tracing-sampling-rate, so at fractional rates each service in the chain
rolls independently — PaC can drop a trace while Tekton keeps it, leaving
execution spans whose parent_spanID points at nothing. Switching to the
OTel SDK opens up OTEL_TRACES_SAMPLER's parentbased_* family, which honors
the root span's sample decision in the W3C traceparent flag so the whole
chain is kept or dropped together.

Controller and watcher call tracing.New() at startup. Tracing is opt-in:
both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set,
otherwise PaC falls back to noop (matching the prior tracing-sampling-rate
"0" default). Resource service.name is pipelines-as-code. Propagator is
W3C TraceContext only; Baggage is intentionally not honored per Konflux-CI
ADR 0061. otlptracegrpc and otlptracehttp promoted from indirect to direct
dependencies.

Assisted-by: Claude Code
Signed-off-by: Josiah England <jengland@redhat.com>
@ci-operator ci-operator force-pushed the distributed-tracing branch from 5b41561 to 311babd Compare June 4, 2026 17:23
@ci-operator
Contributor Author

Sorry for the churn here. A number of findings surfaced while working through the integration-service and release-service side of this work, and the OTel SDK swap in this PR turned out to be necessary to avoid a fractional-sampling fragmentation bug (Knative's tracing wrapper made per-service sampling decisions independently of the root span). The unrelated cleanups landed in the same commit, but the overall behavior is functionally identical to what shipped in #2605: same spans, same attribute schema, same opt-in env vars, just plumbed through the OTel SDK directly so parentbased_* samplers behave coherently across the delivery chain. Let me know if this needs greater alignment with your code conventions.

The PR was validated end-to-end across more than a dozen scenarios covering each fix and each error class it mitigates:

  • Chain-coherency under fractional sampling (per-service-independent sample decisions fragmenting chains - the failure mode the PR exists to fix).
  • Dual opt-in: noop fallback when either ENDPOINT or SAMPLER is unset.
  • All sampler families plus the ratio-arg defaulting path.
  • Transport selection between gRPC and HTTP/protobuf.
  • W3C TraceContext only; Baggage explicitly suppressed.
  • Resource attribute resolution (no unknown_service fallbacks).
  • Concurrent webhook handling.
  • BSP flush behavior under graceful and abrupt controller termination.
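As an illustration of the "ratio-arg defaulting path" in the list above, here is a minimal sketch of parsing an OTEL_TRACES_SAMPLER_ARG-style value. It assumes a default ratio of 1.0 for a missing, malformed, or out-of-range argument; PaC's actual handling is not shown in this thread, and `ratioFromArg` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
)

// ratioFromArg parses a sampler ratio argument. A missing, malformed, or
// out-of-range value falls back to defaultRatio (an assumption here).
func ratioFromArg(arg string) float64 {
	const defaultRatio = 1.0
	if arg == "" {
		return defaultRatio
	}
	r, err := strconv.ParseFloat(arg, 64)
	// The negated range check also rejects NaN, which compares false
	// against every bound.
	if err != nil || !(r >= 0 && r <= 1) {
		return defaultRatio
	}
	return r
}

func main() {
	for _, arg := range []string{"", "0.25", "abc", "1.5"} {
		fmt.Printf("arg=%q ratio=%v\n", arg, ratioFromArg(arg))
	}
}
```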


6 participants