
evaluation pipeline implemented #33

Merged
FacuSentena merged 2 commits into main from feature/judges on Jan 5, 2026
Conversation

@FacuSentena
Collaborator

PR: SQL Translation Evaluation Pipeline

🚀 Summary

Implemented an automated framework to benchmark Snowflake → Databricks SQL translation models using MLflow and LLM-as-a-judge.
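The deduction-based LLM-as-a-judge scoring this summary describes could look roughly like the following minimal sketch, assuming the judge returns a list of named issues, each carrying a point deduction (all names here — `JudgeIssue`, `score_from_issues`, the example descriptions — are hypothetical, not from the PR):

```python
from dataclasses import dataclass

@dataclass
class JudgeIssue:
    # One problem the judge flagged in a translated query (hypothetical shape).
    category: str     # e.g. "compliance" or "best_practices"
    description: str
    deduction: int    # points to subtract from a perfect score

def score_from_issues(issues: list[JudgeIssue], max_score: int = 100) -> int:
    """Deduction-based scoring: start from max_score, subtract each
    issue's deduction, and floor the result at 0."""
    return max(0, max_score - sum(i.deduction for i in issues))

issues = [
    JudgeIssue("compliance", "DATEADD argument order not adjusted", 20),
    JudgeIssue("best_practices", "missing column comments", 5),
]
print(score_from_issues(issues))  # 75
```

Starting from a perfect score and subtracting per-issue penalties (rather than asking the judge for a single holistic number) tends to make LLM grading more consistent, since each deduction is tied to a concrete, reviewable finding.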

✨ Key Features

  • Strict Scoring: Deduction-based scoring for Compliance (functional correctness) and Best Practices (optimization and documentation).
  • A/B Comparison: Uses Nested MLflow Runs to enable side-by-side grouped bar charts for model comparison.
  • Diagnostics: Exports top_issues_summary.txt and issues_table.json to pinpoint model weaknesses.
  • Interfaces:
    • CLI: run_local_benchmark.py for batch runs.
    • Notebook: benchmark_interactive.ipynb for visual research.
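The diagnostics exports named above could be produced by aggregating judge findings across the benchmark; a rough standard-library sketch (only the two output filenames come from the PR — the issue fields and helper name are assumptions):

```python
import json
from collections import Counter

def export_diagnostics(all_issues: list[dict], out_dir: str = ".") -> Counter:
    """Write issues_table.json (raw per-query issues) and
    top_issues_summary.txt (issue descriptions ranked by frequency)."""
    with open(f"{out_dir}/issues_table.json", "w") as f:
        json.dump(all_issues, f, indent=2)

    counts = Counter(i["description"] for i in all_issues)
    lines = [f"{n:4d}  {desc}" for desc, n in counts.most_common(10)]
    with open(f"{out_dir}/top_issues_summary.txt", "w") as f:
        f.write("\n".join(lines))
    return counts

# Hypothetical issues collected from several judged translations.
issues = [
    {"category": "compliance", "description": "DATEADD argument order not adjusted", "deduction": 20},
    {"category": "compliance", "description": "DATEADD argument order not adjusted", "deduction": 20},
    {"category": "best_practices", "description": "missing table comment", "deduction": 5},
]
counts = export_diagnostics(issues)
print(counts.most_common(1)[0])
```

Ranking recurring descriptions, as the `top_issues_summary.txt` export suggests, turns per-query judge output into an actionable view of a model's systematic weaknesses.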

📁 Changes

  • Added src/artifact_translation_package/evaluation/ module.
  • Consolidated documentation into the evaluation module's README.md.
  • Updated requirements.txt with MLflow and Databricks integrations.

@FacuSentena FacuSentena merged commit ec9c580 into main Jan 5, 2026
1 check passed
