[Feature Request]: SDK support for programmatic eval-to-feedback loop — close the trace-to-action gap without manual UI steps #2907

@Ruthwik-Data

Description

Problem

LangSmith's tracing is strong. But the jump from trace → eval score → actionable feedback loop is still largely manual today.

Current workflow:

  1. Run evaluate() or custom evaluator on a dataset
  2. Get scores in the LangSmith UI
  3. Manually inspect traces to find the failure pattern
  4. Manually update prompts, re-run
  5. Repeat

Step 3 → 4 has no SDK automation. Teams have to visually inspect trace spans, identify the failure class, and act on it by hand. This is fine for small evaluations but breaks down at production scale (hundreds of traces/day).

Proposed Feature

An SDK-level create_feedback_loop() interface that connects evaluator outputs back to an automated remediation or alerting action:

from langsmith import Client

client = Client()

# Proposed API: create_feedback_loop() does not exist in the SDK today.
# my_faithfulness_evaluator is a user-supplied evaluator (not defined here).
# Example: auto-tag low-scoring runs for human review with context
client.create_feedback_loop(
    project_name="my-rag-pipeline",
    evaluator=my_faithfulness_evaluator,
    threshold=0.7,  # flag runs scoring below this
    on_failure=lambda run, score: {
        "action": "add_to_annotation_queue",
        "queue_name": "low-faithfulness-review",
        "metadata": {
            "failure_score": score,
            "trace_id": run.id,
            # child_runs may be None when child runs were not loaded
            "span_ids": [s.id for s in (run.child_runs or []) if s.name == "retrieval"]
        }
    }
)
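The snippet above references my_faithfulness_evaluator without defining it. Purely as an illustration, here is a toy stand-in that scores an answer by token overlap with the retrieved contexts. The signature (answer string plus list of context strings, returning a float in [0, 1]) is an assumption, not something the proposal specifies, and a real faithfulness evaluator would likely use an LLM judge or an entailment model rather than token overlap.

```python
def my_faithfulness_evaluator(answer: str, contexts: list[str]) -> float:
    """Toy score: fraction of answer tokens that also appear in the contexts.

    Hypothetical signature, chosen only to make the proposal concrete.
    """
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        # An empty answer is treated as unsupported.
        return 0.0
    context_tokens = set(" ".join(contexts).lower().split())
    supported = sum(1 for tok in answer_tokens if tok in context_tokens)
    return supported / len(answer_tokens)
```

With threshold=0.7, an answer where fewer than 70% of tokens are found in the contexts would be flagged.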

The key behaviors:

  1. Threshold-based routing: runs scoring below threshold automatically go to annotation queues or trigger webhooks
  2. Span-level context propagation: the feedback includes the specific span that caused the failure, not just the root trace
  3. Aggregated failure pattern detection: across a time window, surface the most common failure modes (e.g., "37% of failures have retrieval span score < 0.5")
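To make the three behaviors concrete, here is a minimal in-memory sketch. It does not call LangSmith at all: Span, ScoredRun, route_failures, and failure_breakdown are hypothetical names, and it assumes each span already carries its own score, which the SDK does not provide out of the box.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    id: str
    name: str
    score: float  # assumed per-span eval score (hypothetical)


@dataclass
class ScoredRun:
    id: str
    score: float
    spans: list[Span] = field(default_factory=list)


def route_failures(runs, threshold):
    """Behaviors 1 and 2: one review payload per run scoring below threshold,
    naming the lowest-scoring span in that run."""
    payloads = []
    for run in runs:
        if run.score >= threshold:
            continue
        worst = min(run.spans, key=lambda s: s.score, default=None)
        payloads.append({
            "trace_id": run.id,
            "failure_score": run.score,
            "span_id": worst.id if worst else None,
            "span_name": worst.name if worst else None,
        })
    return payloads


def failure_breakdown(payloads):
    """Behavior 3: share of failures attributed to each span name."""
    counts = Counter(p["span_name"] for p in payloads)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}
```

In a real integration the payloads would be sent to an annotation queue or webhook, and the breakdown would be computed over a time window rather than over a single list.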

Why This Matters

LangSmith's tracing is best-in-class for debugging individual traces. But the value compounds when eval results can automatically drive the next iteration cycle, without requiring a human to click through the UI for each failure.

This is the gap between LangSmith as a debugging tool and LangSmith as a production eval-feedback infrastructure layer.

Current Workaround

Users currently build custom scripts that call client.list_runs() + client.create_feedback() in a loop, filtering by eval score. This works but:

  • Has no built-in span-level context propagation
  • Requires re-inventing aggregation logic each time
  • Doesn't integrate natively with annotation queues
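For reference, the workaround usually looks something like the sketch below. The SDK calls are injected as plain callables so the loop can run without a network connection. flag_low_scoring_runs and its parameters are hypothetical names, and the keyword arguments passed to create_feedback (key, score, comment) are an assumption about the installed SDK version, so check them before use.

```python
from typing import Any, Callable, Iterable


def flag_low_scoring_runs(
    list_runs: Callable[[], Iterable[Any]],
    create_feedback: Callable[..., Any],
    evaluator: Callable[[Any], float],
    threshold: float = 0.7,
    key: str = "needs_review",
) -> list[Any]:
    """Score every run and attach feedback to those below threshold.

    Returns the ids of the flagged runs.
    """
    flagged = []
    for run in list_runs():
        score = evaluator(run)
        if score < threshold:
            create_feedback(
                run.id,
                key=key,
                score=score,
                comment=f"score {score:.2f} below threshold {threshold}",
            )
            flagged.append(run.id)
    return flagged
```

With a real client this might be wired up as list_runs=lambda: client.list_runs(project_name="my-rag-pipeline") and create_feedback=client.create_feedback. Note that nothing here carries span-level context or aggregates failures, which is the gap the proposal targets.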

Happy to prototype this if there's interest from the team.