Problem
LangSmith's tracing is strong. But the jump from trace → eval score → actionable feedback loop is still largely manual today.
Current workflow:
1. Run evaluate() or a custom evaluator on a dataset
2. Get scores in the LangSmith UI
3. Manually inspect traces to find the failure pattern
4. Manually update prompts, re-run
5. Repeat
Step 3 → 4 has no SDK automation. Teams have to visually inspect trace spans, identify the failure class, and manually act on it. This is fine for small-scale evaluations but breaks down in production at scale (hundreds of traces/day).
Proposed Feature
An SDK-level feedback-loop interface, sketched here as a client.create_feedback_loop() method, that connects evaluator outputs back to an automated remediation or alerting action:
from langsmith import Client

client = Client()

# Proposed API: create_feedback_loop() does not exist in the SDK today.
# Example: auto-tag low-scoring runs for human review with context
client.create_feedback_loop(
    project_name="my-rag-pipeline",
    evaluator=my_faithfulness_evaluator,
    threshold=0.7,  # flag runs scoring below this
    on_failure=lambda run, score: {
        "action": "add_to_annotation_queue",
        "queue_name": "low-faithfulness-review",
        "metadata": {
            "failure_score": score,
            "trace_id": run.id,
            # child_runs can be None when child runs were not loaded
            "span_ids": [s.id for s in (run.child_runs or []) if s.name == "retrieval"],
        },
    },
)
The key behaviors:
- Threshold-based routing: runs scoring below threshold automatically go to annotation queues or trigger webhooks
- Span-level context propagation: the feedback includes the specific span that caused the failure, not just the root trace
- Aggregated failure pattern detection: across a time window, surface the most common failure modes (e.g., "37% of failures have retrieval span score < 0.5")
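To make the aggregation behavior concrete, here is a minimal sketch in plain Python, independent of the SDK. The FailedRun record, the summarize_failures() helper, and the idea that each failed run carries a per-span score are all assumptions made for illustration; none of them exist in LangSmith today.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FailedRun:
    """Hypothetical record of a run that scored below the threshold."""
    trace_id: str
    # Maps a span name (e.g. "retrieval") to that span's evaluator score.
    span_scores: dict[str, float] = field(default_factory=dict)


def summarize_failures(failures: list[FailedRun], span_cutoff: float = 0.5) -> dict[str, float]:
    """Per span name, the fraction of failed runs whose span scored below span_cutoff."""
    if not failures:
        return {}
    low_counts: Counter = Counter()
    for failure in failures:
        for name, score in failure.span_scores.items():
            if score < span_cutoff:
                low_counts[name] += 1
    return {name: count / len(failures) for name, count in low_counts.items()}


# Illustrative data only, not real evaluation results.
failures = [
    FailedRun("t1", {"retrieval": 0.3, "generation": 0.8}),
    FailedRun("t2", {"retrieval": 0.4, "generation": 0.4}),
    FailedRun("t3", {"retrieval": 0.9, "generation": 0.2}),
    FailedRun("t4", {"retrieval": 0.1, "generation": 0.9}),
]
summary = summarize_failures(failures)
for name, share in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{share:.0%} of failures have {name} span score < {0.5}")
```

A built-in version would presumably run this over a time window of flagged runs; the sketch leaves windowing out to keep the core counting logic visible.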
Why This Matters
LangSmith's tracing is best-in-class for debugging individual traces. But the value compounds when eval results can automatically drive the next iteration cycle — without requiring a human to click through the UI for each failure.
This is the gap between LangSmith as a debugging tool and LangSmith as a production eval-feedback infrastructure layer.
Current Workaround
Users currently build custom scripts that call client.list_runs() + client.create_feedback() in a loop, filtering by eval score. This works but:
- Has no built-in span-level context propagation
- Requires re-inventing aggregation logic each time
- Doesn't integrate natively with annotation queues
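For reference, the workaround typically looks something like the sketch below. It relies only on client.list_runs() and client.create_feedback(); the flag_low_scoring_runs() helper, the "needs_review" feedback key, and the evaluator signature (a callable that takes a run and returns a float) are assumptions for illustration. The client is passed in as a parameter so the function can be exercised with a stub.

```python
from typing import Any, Callable


def flag_low_scoring_runs(
    client: Any,
    project_name: str,
    evaluator: Callable[[Any], float],
    threshold: float = 0.7,
    feedback_key: str = "needs_review",  # hypothetical feedback key
) -> list[str]:
    """Attach feedback to every root run scoring below threshold; return flagged run IDs."""
    flagged = []
    for run in client.list_runs(project_name=project_name, is_root=True):
        score = evaluator(run)
        if score < threshold:
            client.create_feedback(
                run.id,
                key=feedback_key,
                score=score,
                comment=f"Scored {score:.2f}, below threshold {threshold}",
            )
            flagged.append(str(run.id))
    return flagged
```

With the real SDK this would be called as flag_low_scoring_runs(Client(), "my-rag-pipeline", my_faithfulness_evaluator). Note that it only tags the root run: span-level context, aggregation, and annotation-queue routing all have to be added by hand.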
Happy to prototype this if there's interest from the team.