Purpose: Step-by-step guide to activate the newly implemented evaluation automation system.
# Navigate to project root
cd d:\APPS\code-chef
# Install required packages
pip install langsmith langchain-openai httpx loguru
# Optional: Install langchain for prebuilt evaluators
pip install langchain
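If you want a quick sanity check that the packages resolved, a minimal (optional) smoke test:

```python
# Optional smoke test: confirm the freshly installed packages import cleanly.
import httpx
import langsmith
import loguru

print("langsmith", langsmith.__version__)
print("httpx", httpx.__version__)
print("loguru", loguru.__version__)
```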

# Check required variables
echo $LANGCHAIN_API_KEY # Should start with lsv2_sk_
echo $OPENAI_API_KEY # Should start with sk-
echo $LINEAR_API_KEY # For issue creation
# If missing, add to environment:
# Windows PowerShell:
$env:LANGCHAIN_API_KEY = "lsv2_sk_***"
$env:OPENAI_API_KEY = "sk-***"
$env:LINEAR_API_KEY = "lin_api_***"
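As an optional cross-platform check, a small script like this can confirm the variables are set with the expected formats (a sketch; the prefixes are taken from the comments above):

```python
# Optional sanity check for the three keys; expected prefixes come from this guide.
import os

expected_prefixes = {
    "LANGCHAIN_API_KEY": "lsv2_sk_",
    "OPENAI_API_KEY": "sk-",
    "LINEAR_API_KEY": "lin_api_",
}

for name, prefix in expected_prefixes.items():
    value = os.environ.get(name, "")
    if not value:
        print(f"{name}: MISSING")
    elif not value.startswith(prefix):
        print(f"{name}: set, but does not start with {prefix!r}")
    else:
        print(f"{name}: ok")
```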
Navigate to: https://github.com/Appsmithery/code-chef/settings/secrets/actions

Add these secrets:

- `LANGCHAIN_API_KEY`
  - Value: Your LangSmith API key from https://smith.langchain.com
  - Format: `lsv2_sk_***`
- `OPENAI_API_KEY`
  - Value: Your OpenAI API key
  - Format: `sk-***`
- `LINEAR_API_KEY`
  - Value: Your Linear API key from https://linear.app/dev-ops/settings/api
  - Format: `lin_api_***`
Verification:
# Run this in GitHub Actions to test
- name: Test secrets
  env:
    LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    LINEAR_API_KEY: ${{ secrets.LINEAR_API_KEY }}
  run: |
    echo "LangSmith: ${LANGCHAIN_API_KEY:0:10}..."
    echo "OpenAI: ${OPENAI_API_KEY:0:10}..."
    echo "Linear: ${LINEAR_API_KEY:0:10}..."

# Preview the dataset sync first (dry run):
python support/scripts/evaluation/sync_dataset_from_annotations.py \
--dataset code-chef-gold-standard-v1 \
--days 30 \
--dry-run
# Review output, then run for real:
python support/scripts/evaluation/sync_dataset_from_annotations.py \
--dataset code-chef-gold-standard-v1 \
--days 30

If you already have a dataset in LangSmith:
- Go to https://smith.langchain.com → Datasets
- Find your dataset name
- Update `DEFAULT_DATASET` in `run_langsmith_evaluation.py` if needed
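For orientation, the sketch below shows roughly the kind of logic an annotation-sync script performs. It is hypothetical, not the actual implementation in `sync_dataset_from_annotations.py`; the project name, feedback key, and 0.7 threshold are assumptions taken from this guide.

```python
# Hypothetical sketch: copy well-scored production runs into the gold-standard dataset.
# Assumptions: project "code-chef-production", feedback key "correctness", threshold 0.7.
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()
dataset = client.read_dataset(dataset_name="code-chef-gold-standard-v1")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for run in client.list_runs(project_name="code-chef-production", start_time=cutoff):
    scores = [
        fb.score
        for fb in client.list_feedback(run_ids=[run.id], feedback_key=["correctness"])
        if fb.score is not None
    ]
    if scores and max(scores) >= 0.7:
        client.create_example(
            inputs=run.inputs or {},
            outputs=run.outputs or {},
            dataset_id=dataset.id,
        )
```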
Goal: Add 10-20 high-quality annotated examples
- Go to LangSmith: https://smith.langchain.com → Projects → code-chef-production
- Filter for good traces:
  - Status: Success
  - Recent (last 7 days)
  - Representative examples
- Add feedback (see the SDK sketch after this list):
  - Click trace → Add feedback
  - Key: `correctness`
  - Score: 0.0 - 1.0 (use ≥ 0.7 for dataset inclusion)
  - Comment: Why this is a good/bad example
- Add tags (for categorization):
  - `agent_routing` - Agent selection
  - `token_efficiency` - Token usage
  - `latency` - Response time
  - `workflow_completeness` - Task completion
  - `mcp_integration` - Tool usage
- Sync to dataset:

  python support/scripts/evaluation/sync_dataset_from_annotations.py \
    --dataset code-chef-gold-standard-v1 \
    --days 7
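Feedback can also be attached programmatically instead of through the UI. A minimal sketch using the LangSmith SDK (the run ID is a placeholder; copy a real one from the trace view):

```python
# Minimal sketch: attach a correctness score and comment to an existing trace.
from langsmith import Client

client = Client()
run_id = "00000000-0000-0000-0000-000000000000"  # placeholder: use a real run ID from LangSmith

client.create_feedback(
    run_id,
    key="correctness",
    score=0.9,  # 0.0 - 1.0; >= 0.7 qualifies for dataset inclusion
    comment="Correct agent routing and complete workflow output.",
)
```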
# Set environment
export LANGCHAIN_API_KEY=lsv2_sk_***
export OPENAI_API_KEY=sk-***
export ORCHESTRATOR_URL=https://codechef.appsmithery.co
# Run evaluation
python support/tests/evaluation/run_langsmith_evaluation.py \
--dataset code-chef-gold-standard-v1 \
--experiment-prefix test-run \
--output test_results.json
# Check results
cat test_results.json | jq '.code_chef.results'

# Run the baseline comparison:
python support/tests/evaluation/run_langsmith_evaluation.py \
--dataset code-chef-gold-standard-v1 \
--compare-baseline \
--output baseline_comparison.json
# View comparison
cat baseline_comparison.json | jq '.comparison'

# Review workflow file
cat .github/workflows/continuous-evaluation.yml
# Commit and push (will trigger first run)
git add .github/workflows/continuous-evaluation.yml
git commit -m "Enable continuous evaluation workflow"
git push origin main
# Monitor first run
# https://github.com/Appsmithery/code-chef/actions

- Check Actions tab: https://github.com/Appsmithery/code-chef/actions
- Look for: "Continuous Evaluation" workflow
- Expected duration: 5-10 minutes
- Artifacts: evaluation-results-{sha} should be created
# Create mock regression results
cat > mock_regression.json << 'EOF'
{
"comparison": {
"overall_improvement_pct": -12.5,
"per_metric": {
"accuracy": {
"baseline": 0.85,
"codechef": 0.72,
"improvement_pct": -15.3,
"winner": "baseline"
}
}
}
}
EOF
# Test detection
python support/scripts/evaluation/detect_regression.py \
--results mock_regression.json \
--threshold 0.05 \
--create-linear-issue
# Check Linear for new issue
# https://linear.app/dev-ops/project/codechef-78b3b839d36b
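The detection step boils down to comparing per-metric improvement percentages against the threshold. A simplified sketch of that logic (not the actual `detect_regression.py` code; field names come from the mock JSON above):

```python
# Simplified sketch of threshold-based regression detection over the comparison JSON.
import json

THRESHOLD = 0.05  # 5% allowed drop, mirroring --threshold 0.05

with open("mock_regression.json") as f:
    comparison = json.load(f)["comparison"]

regressions = {
    metric: stats
    for metric, stats in comparison["per_metric"].items()
    if stats["improvement_pct"] < -THRESHOLD * 100
}

if regressions:
    print(f"Regression detected in: {', '.join(regressions)}")
    # The real script would go on to create a Linear issue here.
else:
    print("No regressions above threshold.")
```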
- Review evaluation results
  - LangSmith: https://smith.langchain.com → code-chef-evaluation
  - Check for trends
- Update dataset

  python support/scripts/evaluation/sync_dataset_from_annotations.py \
    --dataset code-chef-gold-standard-v1 \
    --days 7

- Check for regressions
  - Linear: Review automated issues
  - Investigate any critical regressions
- Review cost tracking

  # Check evaluation costs
  curl https://codechef.appsmithery.co/metrics/tokens | jq '.evaluation'

- Tune thresholds if needed
  - Edit `detect_regression.py` thresholds
  - Update based on false positive rate
- Add new evaluators (see the sketch after this list)
  - Identify new patterns to test
  - Add to `evaluators.py`
  - Update `run_langsmith_evaluation.py`
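New evaluators follow the standard LangSmith custom-evaluator shape: a callable that receives a run and its reference example and returns a keyed score. A minimal sketch (the token fields and budget below are assumptions, not fields guaranteed to exist in your traces):

```python
# Minimal sketch of a custom evaluator in the LangSmith style:
# take the run and its reference example, return a keyed score dict.
def token_efficiency_evaluator(run, example) -> dict:
    # Assumed field: total token usage recorded in the run outputs.
    used = (run.outputs or {}).get("total_tokens", 0)
    # Assumed reference field: per-example token budget; 4000 is an arbitrary default.
    budget = (example.outputs or {}).get("token_budget", 4000)
    score = 1.0 if used <= budget else max(0.0, 1.0 - (used - budget) / budget)
    return {"key": "token_efficiency", "score": score}
```

Once it lives in `evaluators.py`, wire it into the evaluator list that `run_langsmith_evaluation.py` passes to the evaluation run.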
Symptom: GitHub Actions fails with "Connection refused"
Solution:
# Check orchestrator health
curl https://codechef.appsmithery.co/health
# If down, SSH to droplet and restart
ssh root@45.55.173.72 "cd /opt/code-chef && docker compose restart orchestrator"

Symptom: "No examples found in dataset"
Solution:
# Check dataset exists
python -c "from langsmith import Client; print(Client().list_datasets())"
# If not found, create it
python support/scripts/evaluation/sync_dataset_from_annotations.py \
--dataset code-chef-gold-standard-v1 \
--days 30
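If you prefer to bootstrap the dataset directly from the SDK rather than via the sync script, a minimal sketch (the example input/output fields are placeholders, not the project's real schema):

```python
# Minimal sketch: create the dataset and seed it with one hand-written example.
from langsmith import Client

client = Client()
dataset = client.create_dataset(
    dataset_name="code-chef-gold-standard-v1",
    description="Gold-standard examples for code-chef evaluation",
)
client.create_example(
    inputs={"task": "placeholder task description"},        # placeholder fields
    outputs={"response": "placeholder expected response"},  # placeholder fields
    dataset_id=dataset.id,
)
```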
Symptom: Regression detected but no Linear issue

Solution:
# Verify Linear API key
echo $LINEAR_API_KEY | cut -c1-10
# Test Linear client
python -c "from shared.lib.linear_client import linear_client; print(linear_client)"
# Check logs
cat evaluation_results.json | jq '.linear_issue_id'

Symptom: "ModuleNotFoundError: No module named 'langchain.evaluation'"
Solution:
# Install langchain
pip install langchain
# Or skip prebuilt evaluators
python support/tests/evaluation/run_langsmith_evaluation.py \
--evaluators-only custom

After activation, you should see:
- ✅ Weekly evaluation runs in GitHub Actions
- ✅ Results artifacts uploaded to Actions
- ✅ Traces in LangSmith project
- ✅ Dataset growing with new examples
- ✅ Linear issues for regressions (hopefully none!)
- ✅ Email notifications on failures
- ✅ Cost tracking in metrics
| Task | Duration | Status |
|---|---|---|
| Install dependencies | 5 min | ⏳ Pending |
| Configure secrets | 10 min | ⏳ Pending |
| Create dataset | 30 min | ⏳ Pending |
| Annotate traces | 60 min | ⏳ Pending |
| Run first evaluation | 15 min | ⏳ Pending |
| Enable workflow | 5 min | ⏳ Pending |
| Test regression | 10 min | ⏳ Pending |
Total: ~2.5 hours to full activation
- ✅ Complete Steps 1-7 above
- ✅ Monitor first week of automated runs
- ✅ Tune thresholds based on experience
- ✅ Add domain-specific evaluators as needed
- ✅ Document learnings in Linear
Questions? File an issue in Linear with label evaluation.
Documentation: