feat(roadmap-planner): DORA Lead Time 3-stage attribution (Phase 2 backend) #197

Open
yhuan123 wants to merge 12 commits into AlaudaDevops:main from yhuan123:feat/dora-lead-time-phase2

Plan: docs/plans/2026-05-21-dora-optimization-plan.md — 2026-05-21. Backend half of DORA Phase 2; the lean Lead Time card + Trend Panel frontend lands in a follow-up PR.

Summary

Upgrades Lead Time for Changes from a single number to a three-stage Dev → Review → Release attribution. The card identifies which stage is the bottleneck, and a separate panel shows the 9-month trend with MoM/QoQ deltas, so the team can locate improvement targets instead of reading only a headline number. DORA-strict: the clock starts at first commit and ends at release.

Diff

| Layer | Change |
|---|---|
| Schema | `0009_pr_first_commit_at.sql` adds `pull_requests.first_commit_at TIMESTAMP NULL`. |
| Sync | GitHub + GitLab clients gain `ListPRCommits` / `ListMRCommits`. Sync populates `first_commit_at` from the platform commits API; failures degrade silently (COALESCE-protected UPSERT). |
| Storage | `UpsertPullRequests` SQL extended; new `ListPullRequestsSince` read path for the metrics collector. |
| Collector | Holds a (nullable) `storage.Store`, fetches PRs alongside Jira data, exposes them in `CalculationContext.PullRequests`. `main.go` opens the store once and shares it with team-analytics. |
| Calculator | `lead_time.go` rewritten: 3-stage durations (T0..T3), fallback matrix C1–C6, bottleneck dual-condition rule (D15), `worst_issues` one-per-stage (P2-4), 9-month trend with MoM/QoQ direction, transparency coverage stats, `consistency_warning`. |
| API | `/api/metrics/lead_time_to_release` accepts `?include_bots=` and `?with_trend=`. Other calculators unchanged. |

Cycle Time / Patch Ratio / Time to Patch / Deploy Freq calculators are not touched (plan D2 / D4 / D5).
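
The "COALESCE-protected UPSERT" in the Sync row can be pictured with a small sketch. It assumes the intent is that a failed commits-API call (a nil timestamp) never overwrites a previously stored `first_commit_at`. The SQL fragment, the `id` conflict key, the PostgreSQL-style placeholders, and the `coalesce` helper are all illustrative and are not taken from the PR.

```go
package main

import "fmt"

// upsertSketch is a hypothetical fragment, not the PR's actual statement.
// It assumes PostgreSQL-style ON CONFLICT and an `id` key; only
// pull_requests.first_commit_at comes from the PR description.
const upsertSketch = `
INSERT INTO pull_requests (id, first_commit_at)
VALUES ($1, $2)
ON CONFLICT (id) DO UPDATE
SET first_commit_at = COALESCE(EXCLUDED.first_commit_at, pull_requests.first_commit_at)`

// coalesce mirrors two-argument SQL COALESCE: it returns incoming unless
// incoming is nil, in which case the stored value is kept.
func coalesce[T any](incoming, stored *T) *T {
	if incoming != nil {
		return incoming
	}
	return stored
}

func main() {
	stored := "2026-05-01T09:00:00Z"
	// The commits API failed, so the sync has no new value to write.
	kept := coalesce[string](nil, &stored)
	fmt.Println(*kept)
}
```

With this shape a later successful sync can still fill in or replace the value, while a failed one leaves the row as it was.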

Test plan

  • `go test ./internal/...` + `go vet ./...` — green
  • 11 new unit tests in `lead_time_test.go` covering percentile math, fallback matrix C1/C3/C4/E9, component regex, bottleneck dual-condition, worst-per-stage selection, MoM/QoQ direction, end-to-end Calculate
  • Dev deploy: `curl -s '/api/metrics/lead_time_to_release?include_bots=false&component=tektoncd-operator&with_trend=true' | jq .` returns `value > 0`, `metadata.stages` length 3, `metadata.coverage.coverage_pct >= 70`, `metadata.trend.points` length 9
  • Dev deploy: `SELECT COUNT(*) FROM pull_requests WHERE merged_at > NOW() - INTERVAL '7 days' AND first_commit_at IS NULL` stays small after the first sync cycle

Sequencing

Backend lands first; frontend (`MetricCard` lean redesign + new `LeadTimeTrendPanel.jsx`) follows in a separate PR per plan §Phase 1.

🤖 Generated with Claude Code

Plan: docs/plans/2026-05-21-dora-optimization-plan.md. Upgrades Lead
Time for Changes from a single number to Dev → Review → Release 3-stage
attribution, with bottleneck highlighting and a separate 9-month trend
panel (sparkline + MoM/QoQ chips). DORA-strict: first commit → release,
no backlog.

Phase 2 backend half (schema + sync + calculator + API + tests).
Frontend follows in a separate PR. The other 4 DORA metrics — Deployment
Frequency, Change Failure Rate, Mean Time to Recovery, Cycle Time —
calculator + UI are untouched.
@yhuan123 force-pushed the feat/dora-lead-time-phase2 branch from 56ccb4e to 4d92204 on May 21, 2026 15:12
huanyang@alauda.io added 11 commits May 21, 2026 23:23
…P2-2]

Review found that the new componentVersionRE only accepted
`{component}-v{semver}`, so versions named `argo-cd-2.9.0` (no v
prefix) were labelled with the full version string and a dashboard
filter for `argo-cd` returned no Lead Time data while other metrics
still grouped by EnrichedRelease.Component.

matchReleasedVersion now returns the matched EnrichedRelease so the
caller reuses the collector-parsed Component field — same source as
release_frequency and friends. The componentVersionRE +
componentFromVersionName helper are removed.
…alize [P1-2]

Review found that Jira version releaseDate is only a calendar date,
parsed at midnight, so any PR merged later on the same release day
was dropped by filterPreReleasePRs as a hotfix — losing the last PR
on release day for many shipped issues and skewing Lead Time low.

matchReleasedVersion now snaps the returned T3 to end-of-day. A PR
merged at 14:00 on release day stays inside the window; the Release
stage end-time is still that calendar day.
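
A minimal sketch of the end-of-day snapping described above, assuming the release date is parsed at midnight in a single known location (UTC here). The helper names `endOfDay`, `mergedByRelease`, and `mustParse` are hypothetical; the PR's own change lives inside `matchReleasedVersion`.

```go
package main

import (
	"fmt"
	"time"
)

// endOfDay returns the last representable instant of t's calendar day,
// in t's own location.
func endOfDay(t time.Time) time.Time {
	y, m, d := t.Date()
	return time.Date(y, m, d, 23, 59, 59, 999_999_999, t.Location())
}

// mergedByRelease reports whether a PR merged at mergedAt still falls inside
// the release window once the date-only releaseDate is snapped to end-of-day.
func mergedByRelease(mergedAt, releaseDate time.Time) bool {
	return !mergedAt.After(endOfDay(releaseDate))
}

// mustParse is a small helper for the example; it panics on bad input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	release := mustParse("2026-05-20T00:00:00Z") // date-only value parsed at midnight
	merged := mustParse("2026-05-20T14:00:00Z")  // PR merged at 14:00 on release day

	fmt.Println(merged.After(release))            // true: a midnight cutoff would drop this PR
	fmt.Println(mergedByRelease(merged, release)) // true: end-of-day snapping keeps it
}
```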
…t [P1-1]

Review found that switching MetricResult.Value/Unit to hours broke
existing consumers: MetricCard / MetricBreakdown hard-code days and
DORA day thresholds, and PrometheusExporter publishes lead_time_days
without conversion. A 30-day Lead Time (720h) would render as
"720 days" and trip Low-band classification at 60+ days.

Value is now p50_hours / 24 (days), Unit is "days". The
hour-precision values stay in Metadata (total.p50_hours,
stages[].p50_hours, trend.points[].p50_hours) for the Phase 1
frontend that needs them. Plan §8 documents the dual encoding.
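
The dual encoding can be illustrated with a short sketch: the hour-precision p50 stays available while the headline value is divided by 24 for consumers that expect days. The median rule for even-length inputs and the function names are assumptions of this example, not the calculator's confirmed percentile math.

```go
package main

import (
	"fmt"
	"sort"
)

// medianHours returns the p50 of samples. Even-length inputs average the two
// middle values; this interpolation choice is an assumption of the sketch.
func medianHours(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is not reordered
	sort.Float64s(sorted)
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid]
	}
	return (sorted[mid-1] + sorted[mid]) / 2
}

// hoursToDays converts an hour value into the legacy "days" unit.
func hoursToDays(h float64) float64 {
	return h / 24
}

func main() {
	leadTimesHours := []float64{1440, 24, 720}
	p50 := medianHours(leadTimesHours)
	// Value/Unit keep the legacy contract; p50_hours carries the precise figure.
	fmt.Printf("value=%.1f unit=days p50_hours=%.0f\n", hoursToDays(p50), p50)
}
```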
…-window PRs [P2-1]

Review found that the PR fetch window matched HistoricalDays exactly,
but the Lead Time calculator filters issues by release date — so a
change released early in the window whose PR merged just before
window start (and is not old enough to hit the 180d long-dev exclusion)
gets classified as C4/no PR and undercounted.

Collector now uses since = now - (HistoricalDays + 180d). The 180d
buffer aligns with excludedLongDevThreshold, so PRs older than that
would never be referenced by a qualified issue anyway.
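
The widened fetch window is simple arithmetic, sketched below. The constant name `longDevBufferDays`, both function names, and the sample `HistoricalDays` value of 90 are hypothetical; only the `HistoricalDays + 180d` formula comes from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// longDevBufferDays mirrors the 180-day figure from the commit message.
// The name is hypothetical; the PR refers to excludedLongDevThreshold.
const longDevBufferDays = 180

// fetchWindowDays widens the issue window so PRs merged shortly before the
// window start are still fetched.
func fetchWindowDays(historicalDays int) int {
	return historicalDays + longDevBufferDays
}

// prFetchSince returns the earliest timestamp the collector should request.
func prFetchSince(now time.Time, historicalDays int) time.Time {
	return now.Add(-time.Duration(fetchWindowDays(historicalDays)) * 24 * time.Hour)
}

func main() {
	now := time.Date(2026, time.May, 21, 0, 0, 0, 0, time.UTC)
	fmt.Println(fetchWindowDays(90))
	fmt.Println(prFetchSince(now, 90).Format("2006-01-02"))
}
```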
…jsx [P2]

Review found that MetricBreakdown.jsx still reads metadata.min/max/count
on the lead_time_to_release branch, so the expanded breakdown rendered
Min=0 / Max=0 / Epics=0 after Phase 2 even though samples existed.

Lead Time calculator now re-emits the legacy days fields
(min/max/average/count/sample_size/percentile) alongside the new
hour-precision payload (total / stages / worst_issues / trend etc),
mirroring the same backward-compat strategy used for Value/Unit in
P1-1. The frontend stays unchanged on the backend PR.
…le [P2]

Review found that storage.enabled defaults to false in the sample
config, but lead_time_to_release stays enabled. After Phase 2 the
calculator unconditionally required PR data, so every issue was
classified as C4/no PR and the whole metric disappeared from the
dashboard and Prometheus — strictly worse than the pre-Phase-2
behaviour.

CalculationContext now carries PRStoreAvailable, set by the collector
based on whether the storage handle is non-nil. When false, Calculate
falls back to the pre-Phase-2 calendar path (issue.created → release,
days) and emits metadata.degraded = "no_pr_store" + a reason string
so the UI can show a configuration warning. The fallback does not
emit stages / worst_issues / trend / coverage — those require PR
data — but Value, min, max, average, and count are populated for
MetricBreakdown.jsx.

Trade-off documented in plan §10.
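
The degraded path can be sketched as a simple branch on `PRStoreAvailable`. The struct shapes below are minimal stand-ins, the `degraded_reason` key and its text are hypothetical, and using the average as `Value` is an assumption made only for this example; the commit message confirms only that `Value`, `min`, `max`, `average`, and `count` are populated and that `metadata.degraded` is `"no_pr_store"`.

```go
package main

import "fmt"

// Minimal stand-ins for the PR's types; fields beyond those named in the
// description are omitted.
type CalculationContext struct {
	PRStoreAvailable bool
}

type MetricResult struct {
	Value    float64
	Unit     string
	Metadata map[string]any
}

// calendarFallback summarises issue.created -> release durations in days.
// Using the average as Value is an assumption of this sketch.
func calendarFallback(days []float64) MetricResult {
	res := MetricResult{Unit: "days", Metadata: map[string]any{
		"degraded":        "no_pr_store",
		"degraded_reason": "storage disabled; PR data unavailable", // hypothetical key and text
		"count":           len(days),
	}}
	if len(days) == 0 {
		return res
	}
	minV, maxV, sum := days[0], days[0], 0.0
	for _, d := range days {
		if d < minV {
			minV = d
		}
		if d > maxV {
			maxV = d
		}
		sum += d
	}
	avg := sum / float64(len(days))
	res.Value = avg
	res.Metadata["min"] = minV
	res.Metadata["max"] = maxV
	res.Metadata["average"] = avg
	return res
}

// calculate picks the path based on whether the collector had a PR store.
func calculate(ctx CalculationContext, days []float64) MetricResult {
	if !ctx.PRStoreAvailable {
		return calendarFallback(days)
	}
	// The 3-stage path (stages, worst_issues, trend, coverage) is out of scope here.
	return MetricResult{Unit: "days", Metadata: map[string]any{}}
}

func main() {
	res := calculate(CalculationContext{PRStoreAvailable: false}, []float64{10, 20, 30})
	fmt.Println(res.Value, res.Unit, res.Metadata["degraded"])
}
```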
- parseVersionName no longer falls back to the whole version name —
  legacy names ("0.3", "v2.1") stop polluting component buckets
- new metrics.exclude_plugins config (D6): v3-era plugins (katanomi,
  knative, jenkins, tekton-operator) dropped from all component
  dimensions at collection time
- calculators skip component=="" instead of bucketing to "unknown"
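
A rough sketch of the three rules above, under stated assumptions: the regular expression is a hypothetical stand-in for `parseVersionName` (the real parser may accept more name shapes), `componentOf` and `keepComponent` are illustrative names, and the sample version strings other than `argo-cd-2.9.0`, `0.3`, and `v2.1` are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionNameRE is a hypothetical pattern: "<component>-<version>" with an
// optional "v" before a dotted numeric version.
var versionNameRE = regexp.MustCompile(`^(.+?)-v?\d+(?:\.\d+)*$`)

// componentOf returns the component prefix, or "" when the name has none.
// It never falls back to the whole version name.
func componentOf(versionName string) string {
	m := versionNameRE.FindStringSubmatch(versionName)
	if m == nil {
		return ""
	}
	return m[1]
}

// keepComponent drops empty components and anything on the exclude list.
func keepComponent(component string, excluded map[string]bool) bool {
	return component != "" && !excluded[component]
}

func main() {
	// Mirrors the plugins listed for metrics.exclude_plugins.
	excluded := map[string]bool{
		"katanomi": true, "knative": true, "jenkins": true, "tekton-operator": true,
	}
	names := []string{"argo-cd-2.9.0", "tektoncd-operator-v0.70.1", "0.3", "v2.1", "jenkins-v1.0.0"}
	for _, name := range names {
		c := componentOf(name)
		fmt.Printf("%q component=%q keep=%v\n", name, c, keepComponent(c, excluded))
	}
}
```

In this sketch the legacy names yield an empty component and are skipped, rather than landing in an "unknown" bucket.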