Verify Evaluation Execution Workflow #794

@maxtechera

Objective

Test the end-to-end evaluation execution workflow, including running evaluations, viewing results, versioning, and re-running.

Tasks

Basic Execution

  • Run evaluation on chatflow with dataset
  • Run evaluation on agentflow with dataset
  • Run evaluation with simple evaluators
  • Run evaluation with LLM evaluators
  • Run evaluation with mixed evaluators

Results & Metrics

  • Verify evaluation results display correctly
  • Verify metrics calculation (accuracy, latency, pass/fail counts)
  • Verify individual run details show input/output/metrics
  • Test result filtering and sorting
  • Verify error handling for failed evaluations
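
When checking the metrics above, it can help to recompute them independently from the individual run details and compare against what the UI shows. The following is a minimal sketch, not the service's actual logic: the `RunResult` shape, the `summarizeRuns` name, and the definition of accuracy as "passing runs divided by total runs" are all assumptions made for illustration.

```typescript
// Hypothetical shape of one evaluated dataset row.
// The real EvaluationRun entity may store these values differently.
interface RunResult {
  passed: boolean;
  latencyMs: number;
}

interface MetricsSummary {
  total: number;
  passCount: number;
  failCount: number;
  accuracy: number; // fraction of passing runs; 0 when there are no runs
  averageLatencyMs: number; // mean latency; 0 when there are no runs
}

// Recompute pass/fail counts, accuracy, and mean latency from raw run rows.
function summarizeRuns(runs: RunResult[]): MetricsSummary {
  const total = runs.length;
  const passCount = runs.filter((r) => r.passed).length;
  const totalLatency = runs.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    total,
    passCount,
    failCount: total - passCount,
    accuracy: total === 0 ? 0 : passCount / total,
    averageLatencyMs: total === 0 ? 0 : totalLatency / total,
  };
}
```

Any mismatch between this recomputation and the displayed values is worth recording as a bug with the dataset rows that produced it.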

Versioning

  • Test evaluation versioning (multiple runs with same name)
  • Verify version history display
  • Test comparing different versions
  • Verify version numbering is correct
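
One way to state what "correct" version numbering means for the checks above is sketched below. It assumes that evaluations sharing a name form one version series, that versions start at 1, and that each new run takes the highest existing version plus one. The `EvaluationRecord` shape and both function names are hypothetical and are not taken from the Evaluation entity.

```typescript
// Hypothetical minimal view of a stored evaluation.
interface EvaluationRecord {
  name: string;
  version: number;
}

// Expected version for the next run with the given name,
// assuming versions start at 1 and increase by 1.
function nextVersion(existing: EvaluationRecord[], name: string): number {
  const versions = existing.filter((e) => e.name === name).map((e) => e.version);
  return versions.length === 0 ? 1 : Math.max(...versions) + 1;
}

// True when the versions recorded for a name are exactly 1..n with no gaps or duplicates.
function hasContiguousVersions(existing: EvaluationRecord[], name: string): boolean {
  const versions = existing
    .filter((e) => e.name === name)
    .map((e) => e.version)
    .sort((a, b) => a - b);
  return versions.every((v, i) => v === i + 1);
}
```

If the real implementation keeps gaps after deletion, `hasContiguousVersions` would report false for valid data, so confirm the intended behavior before treating a gap as a bug.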

Advanced Features

  • Test "outdated evaluation" detection
  • Test "run again" functionality
  • Verify evaluation status updates (pending → completed, or pending → error on failure)
  • Test evaluation with "dataset as one conversation" mode
  • Test evaluation deletion and cleanup
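
For the "outdated evaluation" test, a simple reference rule to compare against is sketched here. It assumes an evaluation counts as outdated when any chatflow or dataset it used was modified after the evaluation ran. The `Dependency` shape, the `updatedAt` field, and the `isOutdated` name are illustrative assumptions; the actual service may use different fields or a different rule.

```typescript
// Hypothetical view of something an evaluation depends on (a chatflow or a dataset).
interface Dependency {
  id: string;
  updatedAt: Date;
}

// An evaluation is treated as outdated when at least one dependency
// was modified strictly after the evaluation ran.
function isOutdated(evaluationRanAt: Date, dependencies: Dependency[]): boolean {
  return dependencies.some(
    (d) => d.updatedAt.getTime() > evaluationRanAt.getTime()
  );
}
```

A manual test can follow the same rule: run an evaluation, edit the chatflow or add a dataset row, reload the evaluations view, and check that the outdated marker appears only after the edit.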

Technical Context

Files to verify:

  • UI: packages/ui/src/views/evaluations/index.jsx
  • UI: packages/ui/src/views/evaluations/CreateEvaluationDialog.jsx
  • UI: packages/ui/src/views/evaluations/EvaluationResult.jsx
  • UI: packages/ui/src/views/evaluations/EvaluationResultSideDrawer.jsx
  • Service: packages/server/src/services/evaluations/index.ts
  • Entity: packages/server/src/database/entities/Evaluation.ts
  • Entity: packages/server/src/database/entities/EvaluationRun.ts

Acceptance Criteria

  • Evaluations run successfully on both chatflows and agentflows
  • All metrics calculate correctly
  • Versioning works as expected
  • Outdated detection identifies changed chatflows/datasets
  • "Run again" creates a new version successfully
  • Any bugs documented with reproduction steps
