[Feature Request]: SDK support for programmatic eval-to-feedback loop (close the trace-to-action gap without manual UI steps) #2907

## Problem

LangSmith's tracing is strong, but the jump from **trace → eval score → actionable feedback loop** is still largely manual.

Current workflow:
1. Run `evaluate()` or a custom evaluator on a dataset
2. View the scores in the LangSmith UI
3. Manually inspect traces to find the failure pattern
4. Manually update prompts and re-run
5. Repeat

The transition from step 3 to step 4 has no SDK automation. Teams have to visually inspect trace spans, identify the failure class, and act on it by hand. This is fine for small-scale evaluations but breaks down in production (hundreds of traces/day).

## Proposed Feature

An SDK-level `create_feedback_loop()` method on `Client` that connects evaluator outputs to an automated remediation or alerting action:

```python
from langsmith import Client

client = Client()

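# Proposed API: `create_feedback_loop` does not exist in the SDK today.
# `my_faithfulness_evaluator` stands for a user-defined evaluator.
# Example: auto-tag low-scoring runs for human review with context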
client.create_feedback_loop(
    project_name="my-rag-pipeline",
    evaluator=my_faithfulness_evaluator,
    threshold=0.7,  # flag runs scoring below this
    on_failure=lambda run, score: {
        "action": "add_to_annotation_queue",
        "queue_name": "low-faithfulness-review",
        "metadata": {
            "failure_score": score,
            "trace_id": run.id,
            # child_runs can be None when child runs were not loaded
            "span_ids": [s.id for s in (run.child_runs or []) if s.name == "retrieval"]
        }
    }
)
```

The key behaviors:
1. **Threshold-based routing**: runs scoring below threshold automatically go to annotation queues or trigger webhooks
2. **Span-level context propagation**: the feedback includes the specific span that caused the failure, not just the root trace
3. **Aggregated failure pattern detection**: across a time window, surface the most common failure modes (e.g., "37% of failures have retrieval span score < 0.5")
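To make the three behaviors concrete, here is a rough, SDK-independent sketch of what the routing and aggregation could compute. Everything in it is hypothetical: `SpanScore`, `ScoredRun`, `route_failures`, and `failure_mode_shares` are illustrative names, not LangSmith types or functions, and it assumes per-span scores are already available.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpanScore:
    """Hypothetical per-span evaluator result (not a LangSmith type)."""
    span_id: str
    name: str
    score: float


@dataclass
class ScoredRun:
    """Hypothetical root run with an overall score and per-span scores."""
    run_id: str
    score: float
    spans: List[SpanScore] = field(default_factory=list)


def route_failures(runs, threshold):
    """Threshold-based routing: keep runs scoring below the threshold and
    attach the lowest-scoring span as context for reviewers."""
    flagged = []
    for run in runs:
        if run.score >= threshold:
            continue
        worst = min(run.spans, key=lambda s: s.score, default=None)
        flagged.append({
            "run_id": run.run_id,
            "failure_score": run.score,
            "worst_span_id": worst.span_id if worst else None,
            "worst_span_name": worst.name if worst else None,
        })
    return flagged


def failure_mode_shares(flagged):
    """Group flagged runs by worst span name, as a share of all failures."""
    total = len(flagged)
    if total == 0:
        return {}
    counts = Counter(item["worst_span_name"] for item in flagged)
    return {name: count / total for name, count in counts.items()}


# Made-up sample data for illustration only.
runs = [
    ScoredRun("r1", 0.9, [SpanScore("s1", "retrieval", 0.9), SpanScore("s2", "generation", 0.9)]),
    ScoredRun("r2", 0.4, [SpanScore("s3", "retrieval", 0.3), SpanScore("s4", "generation", 0.8)]),
    ScoredRun("r3", 0.5, [SpanScore("s5", "retrieval", 0.2), SpanScore("s6", "generation", 0.7)]),
    ScoredRun("r4", 0.6, [SpanScore("s7", "retrieval", 0.9), SpanScore("s8", "generation", 0.4)]),
]

flagged = route_failures(runs, threshold=0.7)
shares = failure_mode_shares(flagged)
```

In a real implementation the flagged entries would presumably be pushed to an annotation queue or webhook rather than returned as a list, and the shares would be computed over a time window.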

## Why This Matters

LangSmith's tracing works well for debugging individual traces. But the value compounds when eval results can automatically drive the next iteration cycle, without requiring a human to click through the UI for each failure.

This is the gap between LangSmith as a debugging tool and LangSmith as a production eval-feedback infrastructure layer.

## Current Workaround

Users currently build custom scripts that call `client.list_runs()` and `client.create_feedback()` in a loop, filtering by eval score. This works, but it:
- Has no built-in span-level context propagation
- Requires re-inventing aggregation logic each time
- Doesn't integrate natively with annotation queues
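For reference, a typical workaround script looks roughly like the sketch below. It takes the client as a parameter so it can run here against an in-memory stand-in (`FakeClient`, hypothetical); with a real `langsmith.Client` the same two calls would hit the API. The shape of `run.feedback_stats` (a dict keyed by feedback key with an `"avg"` entry) and the `needs_review` key name are assumptions on my side.

```python
from types import SimpleNamespace


def flag_low_scoring_runs(client, project_name, feedback_key, threshold):
    """Attach a review-flag feedback entry to root runs whose average score
    for `feedback_key` is below `threshold`. Returns the flagged run IDs."""
    flagged = []
    for run in client.list_runs(project_name=project_name, is_root=True):
        # Assumption: feedback_stats maps each feedback key to a dict with "avg".
        stats = (run.feedback_stats or {}).get(feedback_key) or {}
        avg = stats.get("avg")
        if avg is None or avg >= threshold:
            continue
        client.create_feedback(
            run.id,
            key="needs_review",  # hypothetical key name
            score=0,
            comment=f"{feedback_key} avg {avg:.2f} is below threshold {threshold}",
        )
        flagged.append(run.id)
    return flagged


class FakeClient:
    """In-memory stand-in so the sketch runs without network access."""

    def __init__(self, runs):
        self._runs = runs
        self.feedback = []

    def list_runs(self, project_name=None, is_root=None):
        return iter(self._runs)

    def create_feedback(self, run_id, key, score=None, comment=None):
        self.feedback.append(
            {"run_id": run_id, "key": key, "score": score, "comment": comment}
        )


# Made-up runs for illustration only.
fake = FakeClient([
    SimpleNamespace(id="run-a", feedback_stats={"faithfulness": {"n": 1, "avg": 0.92}}),
    SimpleNamespace(id="run-b", feedback_stats={"faithfulness": {"n": 1, "avg": 0.41}}),
    SimpleNamespace(id="run-c", feedback_stats=None),
])
flagged_ids = flag_low_scoring_runs(fake, "my-rag-pipeline", "faithfulness", 0.7)
```

Note that the script only sees root-run scores: identifying which span caused the failure, aggregating across runs, and pushing to an annotation queue all have to be bolted on separately.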

Happy to prototype this if there's interest from the team.
