Skip to content

feat(dagster): add workflow orchestration integration#2652

Open
itsautomata wants to merge 9 commits into
Tracer-Cloud:mainfrom
itsautomata:issue/2571-dagster-integration
Open

feat(dagster): add workflow orchestration integration#2652
itsautomata wants to merge 9 commits into
Tracer-Cloud:mainfrom
itsautomata:issue/2571-dagster-integration

Conversation

@itsautomata
Copy link
Copy Markdown
Contributor

@itsautomata itsautomata commented May 28, 2026

Fixes #2571

Summary

Adds Dagster integration covering:

  • Five read-only tools (list_dagster_runs, get_dagster_run_logs, list_dagster_assets, list_dagster_sensor_ticks, list_dagster_schedule_ticks) backed by the
    Dagster GraphQL API
  • Works against OSS Dagster (dagster dev, self-hosted webserver) and Dagster+ Cloud
  • Wires config, verification, env-var fallback, interactive setup handler, alert-source routing, docs, and tests

Wiring

flowchart TD
    A([Dagster-source alert]) --> R[/"_ALERT_SOURCE_TO_TOOL_SOURCES<br/>(agent/investigation.py + agent/prompt.py)"/]
    R --> T1["list_dagster_runs"]
    R --> T2["get_dagster_run_logs"]
    R --> T3["list_dagster_assets"]
    R --> T4["list_dagster_sensor_ticks"]
    R --> T5["list_dagster_schedule_ticks"]

    subgraph helpers ["app/integrations/dagster.py"]
        H1["list_runs(status, job_name)<br/>+ _compute_run_durations"]
        H2["get_run_logs<br/>paginated; non-failures bounded, failures preserved<br/>+ _extract_step_failures (summary block)"]
        H3["list_assets_with_materialization"]
        H4["list_sensor_ticks<br/>(repo, location, sensor)"]
        H5["list_schedule_ticks<br/>(repo, location, schedule)"]
    end

    T1 --> H1
    T2 --> H2
    T3 --> H3
    T4 --> H4
    T5 --> H5

    H1 --> C
    H2 --> C
    H3 --> C
    H4 --> C
    H5 --> C
    C["DagsterClient<br/>app/services/dagster/client.py"]
    C -->|"POST /graphql"| API[("Dagster GraphQL<br/>OSS or Cloud")]
Loading

Notes

  • get_dagster_run_logs enriches the response with a summary block from _extract_step_failures that pre-counts step failures. The need surfaced from testing multi-step parallel runs where the agent fixated on the first failure and missed concurrent ones.

Demo

A supply_chain_pipeline Dagster job with 4 ops in a partial-success shape: fetch_inventory, fetch_sales_history, and
calculate_reorder_points all succeed, then generate_purchase_orders fails with a vendor-portal HTTP 503 deep in the DAG, blocking the
final notification step.

Setup

dagster_setup.mov

Investigation (model: deepseek-v4-pro):

dagster_investigation.mov

@github-actions
Copy link
Copy Markdown
Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@itsautomata
Copy link
Copy Markdown
Contributor Author

@greptile review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR adds a complete Dagster workflow orchestration integration with five read-only tools (list_dagster_runs, get_dagster_run_logs, list_dagster_assets, list_dagster_sensor_ticks, list_dagster_schedule_ticks) backed by the Dagster GraphQL API, supporting both OSS and Dagster+ Cloud.

  • Adds DagsterClient with paginated run-log fetching, a sliding window for non-failure events, and a _extract_step_failures summary to help the agent surface all step failures in parallel-execution runs.
  • Wires config, env-var fallback (DAGSTER_ENDPOINT/DAGSTER_API_TOKEN), interactive CLI setup, verification adapter, alert-source routing, and 1500+ lines of unit tests mirroring the existing test_rabbitmq.py pattern.

Confidence Score: 5/5

All five tools are read-only, properly scoped, and the integration handles auth, timeouts, and pagination errors gracefully without risk to the Dagster instance.

The integration is well-structured: the paginated run-log fetcher correctly handles mid-pagination errors without discarding accumulated failures, all required selector coordinates are enforced without defaults, the httpx client lifecycle is managed via context managers throughout, and the 1500-line test suite covers all edge cases including multi-page pagination, concurrent step failures, and RunFailureEvent vs ExecutionStepFailureEvent semantics. No functional gaps were found beyond those already addressed in prior review threads.

No files require special attention.

Important Files Changed

Filename Overview
app/services/dagster/client.py New GraphQL client wrapping httpx; correctly normalises endpoint URL, supports context-manager lifecycle, returns {"data": ...} or {"error": ...} envelope for all callers without leaking exceptions.
app/integrations/dagster.py Core integration helpers including paginated get_run_logs with sliding non-failure-event window, failure aggregation, and mid-pagination error recovery — all addressed concerns from prior review threads are resolved.
app/services/dagster/queries.py GraphQL query definitions for all five tools; inline fragments correctly handle Dagster's union types and interface inheritance for event fields.
app/tools/DagsterRunLogsTool/init.py run_id is now a required keyword-only argument (no default), consistent with the tool description's guidance on always providing it.
app/tools/DagsterSensorsTool/init.py All three SensorSelector coordinates are required keyword-only arguments with no defaults, preventing silent empty-result calls to Dagster.
app/integrations/_catalog_impl.py Adds Dagster to both _classify_service_instance and load_env_integrations; guards empty endpoint with falsy check before returning a source dict.
tests/integrations/test_dagster.py Comprehensive 1592-line test suite covering config validation, URL normalisation, all five query helpers, pagination edge cases, and the step-failure extraction logic with explicit coverage for RunFailureEvent vs ExecutionStepFailureEvent distinction.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant Tool as Dagster Tool Layer
    participant Integration as app/integrations/dagster.py
    participant Client as DagsterClient
    participant API as Dagster GraphQL API

    Agent->>Tool: get_dagster_run_logs(run_id)
    Tool->>Integration: get_run_logs(config, run_id)
    loop "Pagination up to MAX_RUN_LOG_PAGES=100"
        Integration->>Client: "get_run_logs(run_id, limit=250, cursor)"
        Client->>API: POST /graphql (GetRunLogs)
        API-->>Client: "EventConnection { events, cursor, hasMore }"
        Client-->>Integration: "{data: {logsForRun: ...}}"
        Integration->>Integration: partition into failure_events / non_failure_events deque
        alt "hasMore=false or error"
            Integration-->>Integration: break loop
        else "hasMore=true and cursor present"
            Integration->>Integration: advance cursor
        end
    end
    Integration->>Integration: _extract_step_failures(failure_events) → summary
    Integration-->>Tool: "{data: {logsForRun: aggregated}, summary: {failure_count, failures, truncated}}"
    Tool-->>Agent: enriched run log response

    Agent->>Tool: list_dagster_runs(job_name, status)
    Tool->>Integration: list_runs(config, status, job_name)
    Integration->>Client: list_runs(limit, statuses, pipelineName)
    Client->>API: POST /graphql (ListRuns)
    API-->>Client: "runsOrError { Runs | InvalidPipelineRunsFilterError | PythonError }"
    Client-->>Integration: "{data: {runsOrError: ...}}"
    Integration->>Integration: _compute_run_durations adds duration_seconds per run
    Integration-->>Tool: enriched runs response
    Tool-->>Agent: run list with durations
Loading

Reviews (10): Last reviewed commit: "fix(dagster): signal partial fetch when ..." | Re-trigger Greptile

Comment thread app/tools/DagsterRunLogsTool/__init__.py
Comment thread app/tools/DagsterSensorsTool/__init__.py
Comment thread app/services/dagster/client.py
@itsautomata itsautomata force-pushed the issue/2571-dagster-integration branch 3 times, most recently from 7ba0b74 to f9c032d Compare May 28, 2026 23:16
@itsautomata
Copy link
Copy Markdown
Contributor Author

@greptile review

@itsautomata itsautomata marked this pull request as ready for review May 28, 2026 23:27
@itsautomata itsautomata marked this pull request as draft May 28, 2026 23:30
@itsautomata itsautomata force-pushed the issue/2571-dagster-integration branch from 874850c to d627982 Compare May 28, 2026 23:49
@itsautomata
Copy link
Copy Markdown
Contributor Author

@greptile review

@itsautomata itsautomata marked this pull request as ready for review May 29, 2026 01:01
@Davidson3556
Copy link
Copy Markdown
Contributor

looks good to me
well done @itsautomata

@itsautomata
Copy link
Copy Markdown
Contributor Author

@Davidson3556 thank you

@muddlebee
Copy link
Copy Markdown
Collaborator

need to see the entire flow from opensre onboard the integration should work here -> all the env needs to be pasted here, then all of it here https://www.opensre.com/docs/integrations-overview#local-integrations
basically the flow should work from CLI, including env setups etc

@itsautomata
Copy link
Copy Markdown
Contributor Author

@muddlebee thanks for the review, just want to be sure I don't miss anything. since the CLI setup already works end-to-end, so what's missing is just the docs surfaces, right?

@muddlebee
Copy link
Copy Markdown
Collaborator

Docs and demo both.

@itsautomata
Copy link
Copy Markdown
Contributor Author

alright will do. thanks.

Adds Dagster integration with four tools (list_runs, run_logs, assets,
sensor_ticks) backed by the GraphQL API. Works against OSS Dagster and
Dagster+ Cloud. Wires config, verification, env-var fallback, setup
handler, docs, and tests.

Fixes Tracer-Cloud#2571
Closes two integration gaps:
- `list_dagster_schedule_ticks` mirrors the sensor tool shape
- `list_dagster_runs` rows now include `duration_seconds` (null for
  in-flight runs); so the agent no longer derives it.
@itsautomata itsautomata force-pushed the issue/2571-dagster-integration branch from d627982 to 965894e Compare May 30, 2026 15:15
@itsautomata
Copy link
Copy Markdown
Contributor Author

@greptile review

- get_run_logs now paginates until hasMore=false, always retaining
  ExecutionStepFailureEvent and RunFailureEvent across pages.
- Non-failure events are held in a sliding window of the most recent
  MAX_NON_FAILURE_RUN_LOG_EVENTS=1500 (older events evicted) so the kept
  context stays adjacent to the typically later-in-stream failures while
  bounding LLM context.
- A MAX_RUN_LOG_PAGES=100 safety net bounds HTTP latency for outsized
  runs.
- summary.truncated signals when the window overflowed or the page cap
  fired so the agent can qualify the finding as partial.
@itsautomata
Copy link
Copy Markdown
Contributor Author

@greptile review

Sort the aggregated list by timestamp before returning.

As concatenating non-failure and failure event partitions is not enough
when events from each interleave chronologically: a downstream skip
event (kept in non_failure_events) would appear BEFORE the upstream
failure that caused it (kept in failure_events), inverting the causal
chain for the LLM.
@itsautomata itsautomata force-pushed the issue/2571-dagster-integration branch from 6852640 to ae6e0cf Compare May 30, 2026 20:21
@itsautomata
Copy link
Copy Markdown
Contributor Author

@greptile review

Comment thread app/integrations/dagster.py Outdated
mid-pagination errors preserve collected failures and set
summary.fetch_error so callers know the fetch was partial.
@itsautomata itsautomata force-pushed the issue/2571-dagster-integration branch from 40a782d to 81e7bc9 Compare May 30, 2026 21:27
@itsautomata
Copy link
Copy Markdown
Contributor Author

@greptile review

Comment thread app/integrations/dagster.py
@itsautomata
Copy link
Copy Markdown
Contributor Author

@greptile review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 30, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

@itsautomata
Copy link
Copy Markdown
Contributor Author

itsautomata commented May 30, 2026

Hello @muddlebee here is what i added since last review:

setup demo + docs modif

get_run_logs pagination and tests:
get_run_logs wasn't considering long logs. Now it paginates until hasMore=false but doesn't fetch everything every time and this is the designed limits:

  • failure events always kept; non-failure events capped at MAX_NON_FAILURE_RUN_LOG_EVENTS=1500 (older non-failures are evicted first to keep context sharper)
  • MAX_RUN_LOG_PAGES=100 safety net to bound HTTP latency on outsized runs
  • summary.truncated / summary.fetch_error signals so the agent can qualify findings as partial

I considered exposing MAX_NON_FAILURE_RUN_LOG_EVENTS (currently 1500) as an env var, but as past a threshold larger windows are actively counterproductive and degrade RCA quality. I opted for an internal constant.

Let me know if you'd rather see it exposed, or bigger than 1500 or have another idea. same for MAX_RUN_LOG_PAGES (but bigger wouldn't hurt context quality, just trades latency for more pages scanned)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add Dagster workflow orchestration integration

3 participants