feat(synthetic): add rds_upstream suite with scenario 001 (part 1/6 for #1437) #2523

Open
Devesh36 wants to merge 1 commit into Tracer-Cloud:main from Devesh36:feat/rds-upstream-synthetic-001-1437

@Devesh36
Collaborator

Summary

This PR is part 1 of 6 for #1437 (epic #1433: DB symptom → upstream cause without trace IDs).

It introduces a new synthetic test suite at tests/synthetic/rds_upstream/ for cross-system incident fixtures: RDS/CloudWatch symptoms correlated with EC2 app tier + ELB evidence, without relying on trace or request IDs in logs.

What’s included

| Area | Change |
| --- | --- |
| New suite | tests/synthetic/rds_upstream/, separate from DB-only tests/synthetic/rds_postgres/ |
| Scenario 001 | 001-request-burst-ec2-app-tier: RDS CPU + DatabaseConnections spike driven by sustained load on the web EC2 tier (worker tier stays flat; red-herring tier) |
| Loader | scenario_loader.py: thin wrapper around rds_postgres.scenario_loader with suite-local SUITE_DIR |
| Tests | test_suite.py: fixture load, metadata/evidence validation, answer-key checks |
| Correlation | correlation/test_001_request_burst.py: deterministic ranking; web ASG ranks above worker ASG using time-window + topology scoring (reuses rds_postgres.correlation helpers) |

Scenario 001 (maps to #1437 test case 1)

  • Symptom: RDS MySQL primary (orders-prod) — climbing connections + CPU.
  • True cause: Request burst on orders-web-asg behind ALB target group orders-web-tg.
  • Evidence sources: aws_cloudwatch_metrics, aws_rds_events, ec2_instances_by_tag, elb_target_health.
  • Topology: ALB → web/worker EC2 tiers → RDS (declared in scenario.yml).
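The ranking idea behind this scenario can be sketched in a few lines. This is a minimal illustration only: it uses a plain Pearson correlation over one shared time window, it does not call the real rds_postgres.correlation helpers, and the function names (`pearson`, `rank_candidates`) and all metric values are hypothetical stand-ins rather than the fixture data. Topology scoring is left out.

```python
from statistics import mean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def rank_candidates(
    symptom: list[float], candidates: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Score each upstream candidate against the symptom series, best first."""
    scores = {name: pearson(symptom, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative CPU percentages sampled at the same timestamps (NOT fixture data).
rds_cpu = [22, 24, 41, 63, 78, 85]
web_cpu = [18, 21, 47, 66, 80, 88]      # climbs together with the RDS symptom
worker_cpu = [12, 11, 12, 13, 12, 11]   # stays flat: the red-herring tier

ranking = rank_candidates(
    rds_cpu,
    {"orders-web-asg": web_cpu, "orders-worker-asg": worker_cpu},
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because the inputs are static, the ordering is deterministic, which is the property the correlation test relies on.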

What’s intentionally not in this PR

  • Scenarios 002–006 from #1437 (periodic workflow, no-trace-only, red herrings, metrics-before-logs ordering, time-correlation-only) — follow-up PRs.
  • #1438 correlation pathway report model/rendering.
  • run_suite CLI wrapper / Makefile target — PR keeps scope to pytest-only validation (no live LLM required for CI/review).
  • User-facing docs/ changes — test-suite-only; no product behavior change.

Design notes

  • Fixtures follow the same layout as rds_postgres/015-mysql-ec2-load-attribution (split CloudWatch metric JSON files + envelope) for consistency with existing mock AWS/Grafana backends.
  • failure_mode remains application_load_spike (allowed by tests/synthetic/schemas.py); scenario identity is via scenario_id and directory name.
  • Default repo CI (make test-cov) ignores tests/synthetic/ — reviewers should run the commands below locally (same as other synthetic suites).

Motivation

tests/synthetic/rds_postgres/ is largely DB-centric. #1437 requires scored, deterministic scenarios that prove the agent can attribute RDS symptoms to upstream EC2/ALB/app layers using correlation and topology—not only in-database root causes.

Related issues

Test plan

Required (no API keys)

make format-check
make lint
make typecheck
uv run pytest tests/synthetic/rds_upstream/ -v -m synthetic

@github-actions
Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@greptile-apps
Contributor

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR introduces the tests/synthetic/rds_upstream/ suite as part 1 of 6 for issue #1437, adding scenario 001 (001-request-burst-ec2-app-tier) that exercises cross-system RCA: RDS CPU + connection spike attributed to a web-tier EC2 ASG load burst (not the flat worker tier), without relying on trace IDs. Previous review feedback on missing __init__.py files, pytestmark placement, and the CloudWatch envelope namespace has been addressed in this revision.

  • New suite structure: rds_upstream/ with __init__.py, correlation/__init__.py, scenario_loader.py (thin wrapper re-exporting rds_postgres loader with suite-local SUITE_DIR), and test_suite.py covering load, metadata/evidence, and answer-key checks.
  • Scenario 001 fixtures: Split CloudWatch metric files (RDS CPU, DB connections, EC2 web/worker CPU) plus ec2_instances_by_tag.json, elb_target_health.json, and aws_rds_events.json — all validated by the existing rds_postgres schema validators.
  • Correlation test: test_001_request_burst.py reuses rds_postgres correlation helpers to deterministically rank orders-web-asg above orders-worker-asg using time-window correlation on the static fixture data.

Confidence Score: 5/5

Safe to merge — test-only additions with no production behavior change and no regressions to existing suites.

All changes are confined to a new synthetic test suite directory. The fixture data is static JSON/YAML, the Python code re-uses existing tested rds_postgres loader and correlation helpers, and the __init__.py files, pytestmark placement, and envelope namespace issues raised in the previous review round have all been corrected. No production code paths are touched.

No files require special attention. The aws_cloudwatch_metrics_envelope.json namespace field is worth monitoring if future agent code branches on that value per-metric, but it is a known design constraint of the split-file approach.
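One way a consumer could cope with the combined namespace value is sketched below. This is an assumption-heavy illustration: the dictionary keys (`namespace`, `id`) and both function names are hypothetical and are not taken from the fixture schema or from any agent code in the repository. The sketch prefers a per-metric namespace when one is present and refuses to guess when the envelope value is ambiguous.

```python
def split_namespaces(value: str) -> list[str]:
    """Split a comma-separated namespace string into clean parts."""
    return [part.strip() for part in value.split(",") if part.strip()]


def namespace_for(result: dict, envelope: dict) -> str:
    """Resolve the namespace for one metric result.

    Hypothetical rule: use the result's own 'namespace' key if it exists,
    otherwise fall back to the envelope value only when it is unambiguous.
    """
    if "namespace" in result:
        return result["namespace"]
    candidates = split_namespaces(envelope.get("namespace", ""))
    if len(candidates) == 1:
        return candidates[0]
    raise ValueError(
        f"ambiguous namespace for metric {result.get('id')!r}: {candidates}"
    )


# Hypothetical envelope shape, for illustration only.
envelope = {"namespace": "AWS/RDS,AWS/EC2"}
tagged = {"id": "web_cpu", "namespace": "AWS/EC2"}
untagged = {"id": "rds_cpu"}

print(split_namespaces(envelope["namespace"]))
print(namespace_for(tagged, envelope))
try:
    namespace_for(untagged, envelope)
except ValueError as exc:
    print(exc)
```

Failing loudly on the untagged case keeps a branch on the namespace value from silently treating an EC2 metric as an RDS one.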

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/synthetic/rds_upstream/correlation/test_001_request_burst.py | Correlation ranking test for scenario 001; imports rds_postgres helpers, constructs topology nodes and time-series manually, asserts web tier ranks above flat worker tier; deterministic given the fixture data; pytestmark is correctly placed after all imports. |
| tests/synthetic/rds_upstream/scenario_loader.py | Thin wrapper re-exporting rds_postgres loader with suite-local SUITE_DIR; load_all_scenarios correctly passes SUITE_DIR (rds_upstream/) so only directories whose names start with digits are picked up as scenarios. |
| tests/synthetic/rds_upstream/test_suite.py | Three fixture and metadata tests covering load, evidence validation, and cross-system answer key checks; pytestmark placement and imports are correct. |
| tests/synthetic/rds_upstream/001-request-burst-ec2-app-tier/aws_cloudwatch_metrics_envelope.json | Envelope now declares namespace "AWS/RDS,AWS/EC2" covering both metric types; individual MetricDataResult entries still lack a per-metric namespace field, so agent code reading the top-level namespace field receives a non-standard comma-separated value. |
| tests/synthetic/rds_upstream/001-request-burst-ec2-app-tier/scenario.yml | Metadata is complete and valid; topology block (vpc, ALB, tiers) is well-formed, but note that _validated_metadata and ScenarioMetadata do not expose the topology field, so it is decorative in the loader. |
| tests/synthetic/rds_upstream/001-request-burst-ec2-app-tier/answer.yml | Answer key is complete with required_keywords, required_evidence_sources, optimal_trajectory, and ruling_out_keywords; all values align with the fixture data and validate cleanly. |
| tests/synthetic/rds_upstream/__init__.py | Package marker added correctly, enabling absolute imports and pytest's prepend import mode to resolve the suite. |
| tests/synthetic/rds_upstream/correlation/__init__.py | Package marker for correlation subdirectory added correctly. |
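The discovery rule mentioned for scenario_loader.py (only directories whose names start with a digit count as scenarios) could look roughly like the sketch below. The function name `discover_scenario_dirs` is hypothetical and this is not the rds_postgres loader's actual code; the temporary directory only mimics the suite layout described in this PR.

```python
import tempfile
from pathlib import Path


def discover_scenario_dirs(suite_dir: Path) -> list[Path]:
    """Return sub-directories whose names start with a digit, sorted by name."""
    return sorted(
        path
        for path in suite_dir.iterdir()
        if path.is_dir() and path.name[:1].isdigit()
    )


# Build a throwaway layout resembling the new suite (illustration only).
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp)
    (suite / "001-request-burst-ec2-app-tier").mkdir()
    (suite / "correlation").mkdir()          # helper package, not a scenario
    (suite / "__init__.py").touch()          # file, not a directory
    found = [path.name for path in discover_scenario_dirs(suite)]

print(found)
```

Under this rule the correlation/ package and the __init__.py marker sit next to the scenario directories without being loaded as scenarios.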

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[pytest collects rds_upstream suite] --> B[test_suite.py]
    A --> C[correlation/test_001_request_burst.py]
    B --> B1[test_load_all_upstream_scenarios]
    B --> B2[test_upstream_scenario_metadata_and_evidence]
    B --> B3[test_request_burst_answer_key_expects_cross_system_evidence]
    C --> D[Load 3 time-series from fixture JSON]
    D --> E[Construct TopologyNodes for rds, web, worker]
    E --> F1[score_time_window_correlation RDS vs Web CPU - high correlation]
    E --> F2[score_time_window_correlation RDS vs Worker CPU - near zero]
    F1 --> G[score_candidate_correlation - web_score]
    F2 --> H[score_candidate_correlation - worker_score]
    G --> I[rank_upstream_candidates]
    H --> I
    I --> J{Assert web-asg ranks first with higher confidence}
    style J fill:#d4edda,stroke:#28a745

Reviews (2): Last reviewed commit: "feat(synthetic): add rds_upstream suite ..."

Comment thread tests/synthetic/rds_upstream/test_suite.py
Comment thread tests/synthetic/rds_upstream/correlation/test_001_request_burst.py Outdated
Devesh36 added a commit to Devesh36/opensre that referenced this pull request May 26, 2026
…#2523)

Greptile P1: add __init__.py under tests/synthetic/rds_upstream/ and
tests/synthetic/rds_upstream/correlation/ so pytest resolves absolute
imports when the suite is run in isolation.

Greptile P2: move pytestmark below all imports in
correlation/test_001_request_burst.py (fixes E402 / isort ordering).

Greptile P2: set CloudWatch envelope namespace to AWS/RDS,AWS/EC2 so RDS
and EC2 Auto Scaling metrics are not all labeled under AWS/RDS alone.
@Devesh36 Devesh36 force-pushed the feat/rds-upstream-synthetic-001-1437 branch from 1a51074 to 3a83a99 Compare May 26, 2026 04:51
…Cloud#1437

Scaffolds a new synthetic RCA suite (tests/synthetic/rds_upstream/) that
exercises cross-system "RDS symptom -> EC2/app upstream cause" attribution
without live infrastructure, per EPIC Tracer-Cloud#1433.

Scenario 001 (request-burst-ec2-app-tier):
- RDS shows elevated CPU and connection count.
- Real upstream cause is a web-tier EC2 fleet request burst; worker tier
  remains idle (decoy).
- Fixtures include CloudWatch envelopes (RDS + EC2), RDS events, EC2
  instances-by-tag, and ELB target health.

Tests:
- test_suite.py: loader integrity, metadata, evidence, and answer-key
  expectations.
- correlation/test_001_request_burst.py: asserts web tier ranks above
  worker tier in cross-system correlation.

Follow-up PRs will add scenarios 002-006 and the Makefile target.

Refs: Tracer-Cloud#1437, Tracer-Cloud#1433
Co-authored-by: Cursor <cursoragent@cursor.com>
@Devesh36 Devesh36 force-pushed the feat/rds-upstream-synthetic-001-1437 branch from 3a83a99 to 483ae69 Compare May 26, 2026 04:55
@Devesh36
Collaborator Author

@greptile-apps review again

@cerencamkiran
Collaborator

Scope is pretty big, so I think this is a good start. Maybe your API limit got cooked too like mine. Would still be nice later to test this with real investigations / "run_suite" though. We usually end up changing quite a bit after trying things with real LLM providers.

Also maybe adding correlation pathway/report assertions later could help for the full #1437 acceptance.

One thing that felt a bit hacky to me was the "AWS/RDS,AWS/EC2" namespace value, but probably okay for now.

@muddlebee
Collaborator

@Devesh36 address the issues from Ceren.

Development

Successfully merging this pull request may close these issues.

Synthetic coverage: cross-system RDS CPU spike → EC2/app causes (new suite)

3 participants