feat(dagster): add workflow orchestration integration by itsautomata · Pull Request #2652 · Tracer-Cloud/opensre

itsautomata · 2026-05-28T22:05:56Z

Summary

Adds Dagster integration covering:

Five read-only tools (list_dagster_runs, get_dagster_run_logs, list_dagster_assets, list_dagster_sensor_ticks, list_dagster_schedule_ticks) backed by the
Dagster GraphQL API
Works against OSS Dagster (dagster dev, self-hosted webserver) and Dagster+ Cloud
Wires config, verification, env-var fallback, interactive setup handler, alert-source routing, docs, and tests

Wiring

flowchart TD
    A([Dagster-source alert]) --> R[/"_ALERT_SOURCE_TO_TOOL_SOURCES<br/>(agent/investigation.py + agent/prompt.py)"/]
    R --> T1["list_dagster_runs"]
    R --> T2["get_dagster_run_logs"]
    R --> T3["list_dagster_assets"]
    R --> T4["list_dagster_sensor_ticks"]
    R --> T5["list_dagster_schedule_ticks"]

    subgraph helpers ["app/integrations/dagster.py"]
        H1["list_runs(status, job_name)<br/>+ _compute_run_durations"]
        H2["get_run_logs<br/>paginated; non-failures bounded, failures preserved<br/>+ _extract_step_failures (summary block)"]
        H3["list_assets_with_materialization"]
        H4["list_sensor_ticks<br/>(repo, location, sensor)"]
        H5["list_schedule_ticks<br/>(repo, location, schedule)"]
    end

    T1 --> H1
    T2 --> H2
    T3 --> H3
    T4 --> H4
    T5 --> H5

    H1 --> C
    H2 --> C
    H3 --> C
    H4 --> C
    H5 --> C
    C["DagsterClient<br/>app/services/dagster/client.py"]
    C -->|"POST /graphql"| API[("Dagster GraphQL<br/>OSS or Cloud")]

Notes

get_dagster_run_logs enriches the response with a summary block from _extract_step_failures that pre-counts step failures. The need surfaced from testing multi-step parallel runs where the agent fixated on the first failure and missed concurrent ones.

Demo

A supply_chain_pipeline Dagster job with 4 ops in a partial-success shape: fetch_inventory, fetch_sales_history, and
calculate_reorder_points all succeed, then generate_purchase_orders fails with a vendor-portal HTTP 503 deep in the DAG, blocking the
final notification step.

Setup

dagster_setup.mov

Investigation (model: deepseek-v4-pro):

dagster_investigation.mov

github-actions · 2026-05-28T22:06:05Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

itsautomata · 2026-05-28T22:24:09Z

@greptile review

greptile-apps · 2026-05-28T22:30:20Z

Greptile Summary

This PR adds a complete Dagster workflow orchestration integration with five read-only tools (list_dagster_runs, get_dagster_run_logs, list_dagster_assets, list_dagster_sensor_ticks, list_dagster_schedule_ticks) backed by the Dagster GraphQL API, supporting both OSS and Dagster+ Cloud.

Adds DagsterClient with paginated run-log fetching, a sliding window for non-failure events, and a _extract_step_failures summary to help the agent surface all step failures in parallel-execution runs.
Wires config, env-var fallback (DAGSTER_ENDPOINT/DAGSTER_API_TOKEN), interactive CLI setup, verification adapter, alert-source routing, and 1500+ lines of unit tests mirroring the existing test_rabbitmq.py pattern.

Confidence Score: 5/5

All five tools are read-only, properly scoped, and the integration handles auth, timeouts, and pagination errors gracefully without risk to the Dagster instance.

The integration is well-structured: the paginated run-log fetcher correctly handles mid-pagination errors without discarding accumulated failures, all required selector coordinates are enforced without defaults, the httpx client lifecycle is managed via context managers throughout, and the 1500-line test suite covers all edge cases including multi-page pagination, concurrent step failures, and RunFailureEvent vs ExecutionStepFailureEvent semantics. No functional gaps were found beyond those already addressed in prior review threads.

No files require special attention.

Important Files Changed

Filename	Overview
app/services/dagster/client.py	New GraphQL client wrapping httpx; correctly normalises endpoint URL, supports context-manager lifecycle, returns {"data": ...} or {"error": ...} envelope for all callers without leaking exceptions.
app/integrations/dagster.py	Core integration helpers including paginated get_run_logs with sliding non-failure-event window, failure aggregation, and mid-pagination error recovery — all addressed concerns from prior review threads are resolved.
app/services/dagster/queries.py	GraphQL query definitions for all five tools; inline fragments correctly handle Dagster's union types and interface inheritance for event fields.
app/tools/DagsterRunLogsTool/init.py	run_id is now a required keyword-only argument (no default), consistent with the tool description's guidance on always providing it.
app/tools/DagsterSensorsTool/init.py	All three SensorSelector coordinates are required keyword-only arguments with no defaults, preventing silent empty-result calls to Dagster.
app/integrations/_catalog_impl.py	Adds Dagster to both _classify_service_instance and load_env_integrations; guards empty endpoint with falsy check before returning a source dict.
tests/integrations/test_dagster.py	Comprehensive 1592-line test suite covering config validation, URL normalisation, all five query helpers, pagination edge cases, and the step-failure extraction logic with explicit coverage for RunFailureEvent vs ExecutionStepFailureEvent distinction.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant Tool as Dagster Tool Layer
    participant Integration as app/integrations/dagster.py
    participant Client as DagsterClient
    participant API as Dagster GraphQL API

    Agent->>Tool: get_dagster_run_logs(run_id)
    Tool->>Integration: get_run_logs(config, run_id)
    loop "Pagination up to MAX_RUN_LOG_PAGES=100"
        Integration->>Client: "get_run_logs(run_id, limit=250, cursor)"
        Client->>API: POST /graphql (GetRunLogs)
        API-->>Client: "EventConnection { events, cursor, hasMore }"
        Client-->>Integration: "{data: {logsForRun: ...}}"
        Integration->>Integration: partition into failure_events / non_failure_events deque
        alt "hasMore=false or error"
            Integration-->>Integration: break loop
        else "hasMore=true and cursor present"
            Integration->>Integration: advance cursor
        end
    end
    Integration->>Integration: _extract_step_failures(failure_events) → summary
    Integration-->>Tool: "{data: {logsForRun: aggregated}, summary: {failure_count, failures, truncated}}"
    Tool-->>Agent: enriched run log response

    Agent->>Tool: list_dagster_runs(job_name, status)
    Tool->>Integration: list_runs(config, status, job_name)
    Integration->>Client: list_runs(limit, statuses, pipelineName)
    Client->>API: POST /graphql (ListRuns)
    API-->>Client: "runsOrError { Runs | InvalidPipelineRunsFilterError | PythonError }"
    Client-->>Integration: "{data: {runsOrError: ...}}"
    Integration->>Integration: _compute_run_durations adds duration_seconds per run
    Integration-->>Tool: enriched runs response
    Tool-->>Agent: run list with durations

_{Reviews (10): Last reviewed commit: "fix(dagster): signal partial fetch when ..." | Re-trigger Greptile}

itsautomata · 2026-05-28T23:18:14Z

@greptile review

itsautomata · 2026-05-28T23:58:42Z

@greptile review

Davidson3556 · 2026-05-29T12:08:54Z

looks good to me
well done @itsautomata

itsautomata · 2026-05-29T12:15:34Z

@Davidson3556 thank you

muddlebee · 2026-05-29T21:38:21Z

need to see the entire flow from opensre onboard the integration should work here -> all the env needs to be pasted here, then all of it here https://www.opensre.com/docs/integrations-overview#local-integrations
basically the flow should work from CLI, including env setups etc

itsautomata · 2026-05-29T21:59:12Z

@muddlebee thanks for the review, just want to be sure I don't miss anything. since the CLI setup already works end-to-end, so what's missing is just the docs surfaces, right?

muddlebee · 2026-05-29T22:13:39Z

Docs and demo both.

itsautomata · 2026-05-29T22:16:42Z

alright will do. thanks.

Adds Dagster integration with four tools (list_runs, run_logs, assets, sensor_ticks) backed by the GraphQL API. Works against OSS Dagster and Dagster+ Cloud. Wires config, verification, env-var fallback, setup handler, docs, and tests. Fixes Tracer-Cloud#2571

Closes two integration gaps: - `list_dagster_schedule_ticks` mirrors the sensor tool shape - `list_dagster_runs` rows now include `duration_seconds` (null for in-flight runs); so the agent no longer derives it.

…page

itsautomata · 2026-05-30T17:36:20Z

@greptile review

- get_run_logs now paginates until hasMore=false, always retaining ExecutionStepFailureEvent and RunFailureEvent across pages. - Non-failure events are held in a sliding window of the most recent MAX_NON_FAILURE_RUN_LOG_EVENTS=1500 (older events evicted) so the kept context stays adjacent to the typically later-in-stream failures while bounding LLM context. - A MAX_RUN_LOG_PAGES=100 safety net bounds HTTP latency for outsized runs. - summary.truncated signals when the window overflowed or the page cap fired so the agent can qualify the finding as partial.

itsautomata · 2026-05-30T19:35:10Z

@greptile review

Sort the aggregated list by timestamp before returning. As concatenating non-failure and failure event partitions is not enough when events from each interleave chronologically: a downstream skip event (kept in non_failure_events) would appear BEFORE the upstream failure that caused it (kept in failure_events), inverting the causal chain for the LLM.

itsautomata · 2026-05-30T20:25:04Z

@greptile review

mid-pagination errors preserve collected failures and set summary.fetch_error so callers know the fetch was partial.

itsautomata · 2026-05-30T21:40:23Z

@greptile review

… cursor

itsautomata · 2026-05-30T22:12:16Z

@greptile review

greptile-apps · 2026-05-30T22:19:49Z

Want your agent to iterate on Greptile's feedback? Try greploops.

itsautomata · 2026-05-30T23:43:57Z

Hello @muddlebee here is what i added since last review:

setup demo + docs modif

get_run_logs pagination and tests:
get_run_logs wasn't considering long logs. Now it paginates until hasMore=false but doesn't fetch everything every time and this is the designed limits:

failure events always kept; non-failure events capped at MAX_NON_FAILURE_RUN_LOG_EVENTS=1500 (older non-failures are evicted first to keep context sharper)
MAX_RUN_LOG_PAGES=100 safety net to bound HTTP latency on outsized runs
summary.truncated / summary.fetch_error signals so the agent can qualify findings as partial

I considered exposing MAX_NON_FAILURE_RUN_LOG_EVENTS (currently 1500) as an env var, but as past a threshold larger windows are actively counterproductive and degrade RCA quality. I opted for an internal constant.

Let me know if you'd rather see it exposed, or bigger than 1500 or have another idea. same for MAX_RUN_LOG_PAGES (but bigger wouldn't hurt context quality, just trades latency for more pages scanned)

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread app/tools/DagsterRunLogsTool/__init__.py

Comment thread app/tools/DagsterSensorsTool/__init__.py

Comment thread app/services/dagster/client.py

itsautomata force-pushed the issue/2571-dagster-integration branch 3 times, most recently from 7ba0b74 to f9c032d Compare May 28, 2026 23:16

itsautomata marked this pull request as ready for review May 28, 2026 23:27

itsautomata marked this pull request as draft May 28, 2026 23:30

itsautomata force-pushed the issue/2571-dagster-integration branch from 874850c to d627982 Compare May 28, 2026 23:49

itsautomata marked this pull request as ready for review May 29, 2026 01:01

muddlebee added requirement-not-met in-review labels May 29, 2026

itsautomata added 4 commits May 30, 2026 16:52

fix(dagster): tighten tool signatures and close client lifecycle

5f5ed25

feat(dagster): cover schedule ticks and pre-compute run duration

cb902ab

Closes two integration gaps: - `list_dagster_schedule_ticks` mirrors the sensor tool shape - `list_dagster_runs` rows now include `duration_seconds` (null for in-flight runs); so the agent no longer derives it.

docs(dagster): add env-vars + integrations-overview entries; tighten …

965894e

…page

itsautomata force-pushed the issue/2571-dagster-integration branch from d627982 to 965894e Compare May 30, 2026 15:15

docs(dagster): clarify API token nuance and sort Orchestration row

7fd3c86

itsautomata force-pushed the issue/2571-dagster-integration branch from 6852640 to ae6e0cf Compare May 30, 2026 20:21

greptile-apps Bot reviewed May 30, 2026

View reviewed changes

Comment thread app/integrations/dagster.py Outdated

fix(dagster): preserve collected failures on mid-pagination error

81e7bc9

mid-pagination errors preserve collected failures and set summary.fetch_error so callers know the fetch was partial.

itsautomata force-pushed the issue/2571-dagster-integration branch from 40a782d to 81e7bc9 Compare May 30, 2026 21:27

greptile-apps Bot reviewed May 30, 2026

View reviewed changes

Comment thread app/integrations/dagster.py

fix(dagster): signal partial fetch when server returns hasMore but no…

0d48289

… cursor

Conversation

itsautomata commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Wiring

Notes

Demo

Uh oh!

github-actions Bot commented May 28, 2026

Greptile code review

Uh oh!

itsautomata commented May 28, 2026

Uh oh!

greptile-apps Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

itsautomata commented May 28, 2026

Uh oh!

itsautomata commented May 28, 2026

Uh oh!

Davidson3556 commented May 29, 2026

Uh oh!

itsautomata commented May 29, 2026

Uh oh!

muddlebee commented May 29, 2026

Uh oh!

itsautomata commented May 29, 2026

Uh oh!

muddlebee commented May 29, 2026

Uh oh!

itsautomata commented May 29, 2026

Uh oh!

itsautomata commented May 30, 2026

Uh oh!

itsautomata commented May 30, 2026

Uh oh!

itsautomata commented May 30, 2026

Uh oh!

Uh oh!

itsautomata commented May 30, 2026

Uh oh!

Uh oh!

itsautomata commented May 30, 2026

Uh oh!

greptile-apps Bot commented May 30, 2026

Uh oh!

itsautomata commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

itsautomata commented May 28, 2026 •

edited

Loading

greptile-apps Bot commented May 28, 2026 •

edited

Loading

itsautomata commented May 30, 2026 •

edited

Loading