@rajdeepmahal24

Summary

  • add evidence to the judge tool schema so each criterion includes supporting quotes/values
  • return evidence mapping in both Python ScenarioResult and JS JudgeResult
  • update judge prompt to request evidence per criterion
  • add Python unit test for evidence parsing
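As an illustration of the first bullet, the sketch below shows one way a tool schema could pair each criterion verdict with a supporting evidence string. It is not the schema from this PR: the helper name `build_finish_test_schema`, the `"true"`/`"false"` enum, and the exact property layout are assumptions made for the example.

```python
from typing import Any, Dict, List


def build_finish_test_schema(criteria_names: List[str]) -> Dict[str, Any]:
    """Hypothetical JSON-schema parameters for a finish_test tool.

    Each criterion gets a verdict under "criteria" and a matching
    free-text quote or value under "evidence".
    """
    verdicts = {
        name: {"type": "string", "enum": ["true", "false"]}
        for name in criteria_names
    }
    evidence = {
        name: {
            "type": "string",
            "description": "Quote or value from the conversation supporting the verdict",
        }
        for name in criteria_names
    }
    return {
        "type": "object",
        "properties": {
            "criteria": {
                "type": "object",
                "properties": verdicts,
                "required": list(criteria_names),
            },
            "evidence": {
                "type": "object",
                "properties": evidence,
                "required": list(criteria_names),
            },
        },
        # Both mappings are required, mirroring the note that evidence is mandatory.
        "required": ["criteria", "evidence"],
    }


schema = build_finish_test_schema(["criterion_1", "criterion_2"])
```

Keying both mappings by the same criterion names lets a caller look up the evidence for any failed criterion directly.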

Why

When criteria fail, users need concrete evidence to decide whether a failure is legitimate or the criteria should be adjusted. This change makes the evidence first-class in the judge results.

Testing

  • python -m pytest python/tests/test_judge_agent.py

Notes

  • `evidence` is a required field in the finish_test tool schema. This is non‑breaking for callers who use the JudgeAgent directly, because it only affects the output the LLM judge produces.

Fixes langwatch#161

## Problem
JudgeAgent intermittently fails with `AttributeError: 'str' object has no
attribute 'values'` when the LLM returns the `criteria` field as a JSON
string instead of a dictionary object.

This occurs at lines 439 and 444 when the code calls `criteria.values()`
without verifying that `criteria` is actually a dict.
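The failure can be reproduced in isolation. The snippet below is a minimal sketch, not code from JudgeAgent: the `args` payload is a hypothetical tool-call argument object in which the model has double-encoded `criteria`.

```python
import json

# Hypothetical tool-call arguments: `criteria` arrives as a JSON string
# instead of a nested object.
args = json.loads('{"criteria": "{\\"criterion_1\\": \\"true\\"}"}')
criteria = args["criteria"]

try:
    # Works for a dict, fails for a str.
    criteria.values()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)  # 'str' object has no attribute 'values'
```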

## Root Cause
When the LLM is uncertain about the schema format (particularly with
complex dynamic schemas using sanitized criterion text as property names),
it sometimes serializes the nested `criteria` object as a JSON string
rather than a proper dict.

## Solution
Add defensive parsing after extracting criteria from tool call arguments:

1. Check whether `criteria` is a string
2. If it is, attempt to parse it with `json.loads()`
3. If parsing fails, log a warning and fall back to an empty dict
4. Additionally, verify that `criteria` is a dict before calling `.values()`

This ensures the code gracefully handles both formats:
- Direct dict: `{"criterion_1": "true", "criterion_2": "false"}`
- JSON string: `"{\"criterion_1\": \"true\", \"criterion_2\": \"false\"}"`
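The steps above can be sketched as a small helper. This is an illustrative version under stated assumptions, not the exact code in the PR: the function name `coerce_criteria` and the log messages are hypothetical.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def coerce_criteria(raw: Any) -> Dict[str, Any]:
    """Return `raw` as a dict, parsing JSON strings and falling back to {}."""
    criteria = raw

    # Steps 1-3: a string may be a double-encoded JSON object.
    if isinstance(criteria, str):
        try:
            criteria = json.loads(criteria)
        except json.JSONDecodeError:
            logger.warning("criteria is not valid JSON, using empty dict: %r", raw)
            criteria = {}

    # Step 4: anything that is still not a dict (list, None, number) is rejected.
    if not isinstance(criteria, dict):
        logger.warning("criteria is not a dict, using empty dict: %r", raw)
        criteria = {}

    return criteria


as_dict = coerce_criteria({"criterion_1": "true", "criterion_2": "false"})
as_string = coerce_criteria('{"criterion_1": "true", "criterion_2": "false"}')
```

Both calls return the same dict, so downstream code can call `.values()` without checking the type again. Note that the second `isinstance` check also covers a string that parses to valid JSON but is not an object, such as `"[1, 2]"`.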

## Testing
- Verified Python syntax with `python -m py_compile`
- The fix logs a warning whenever it has to fall back, to aid debugging
- The graceful fallback keeps a malformed `criteria` value from raising `AttributeError` and failing the test run

## Impact
- Low risk: Only adds defensive parsing with fallback
- Fixes intermittent failures reported in issue langwatch#161
- No changes to normal execution path when criteria is already a dict