Purpose: Step-by-step workflow for collecting production feedback to improve DSPy training
Cadence: Weekly (recommended)
Goal: Convert production @prompter logs into high-quality DSPy training examples
Time Required: ~15-30 minutes per week (depending on volume)
Success Metrics: 80%+ feedback completion rate, 30%+ deployment rate
1. Check log statistics
cd /home/michael/soulfield
# View statistics for last 7 days
python3 workspace/training-examples/manage-logs.py --stats

Expected Output:
============================================================
LOG STATISTICS
============================================================
Total Files: 7
Total Logs: 24
Logs with Feedback: 0
Feedback Completion: 0.0%
Deployed Count: 0
Action:
- Note total logs needing review
- Estimate time needed (2-3 min per log)
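The statistics above can be derived directly from the daily JSONL log files. Below is a minimal sketch of that computation, assuming a hypothetical entry schema where each JSON line may carry a `feedback` object with a `deployed` flag (the actual schema used by `manage-logs.py` may differ):

```python
import json
from pathlib import Path

def log_stats(log_dir):
    """Compute the step-1 statistics from daily JSONL log files.

    Assumes each line is a JSON object with an optional 'feedback'
    dict containing a 'deployed' flag (illustrative schema).
    """
    files = sorted(Path(log_dir).glob("*.jsonl"))
    total = with_feedback = deployed = 0
    for f in files:
        for line in f.read_text().splitlines():
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            total += 1
            fb = entry.get("feedback")
            if fb:
                with_feedback += 1
                if fb.get("deployed"):
                    deployed += 1
    rate = (with_feedback / total * 100) if total else 0.0
    return {
        "total_files": len(files),
        "total_logs": total,
        "logs_with_feedback": with_feedback,
        "feedback_completion": round(rate, 1),
        "deployed_count": deployed,
    }
```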
2. List logs needing feedback
python3 workspace/training-examples/collect-feedback.py --list

Expected Output:
[INFO] Found 24 logs needing feedback:
2025-10-23 (line 1)
2025-10-23 (line 2)
2025-10-24 (line 1)
...
Action:
- Identify dates with most logs
- Plan review session
3. Review logs by date
# Review specific date
python3 workspace/training-examples/collect-feedback.py --review 2025-10-23

Interactive Prompt:
================================================================================
Date: 2025-10-23 | Line: 1
================================================================================
Timestamp: 2025-10-23T14:32:15Z
INPUT:
Agent Domain: marketing
Deliverable Count: 35
Categories: Planning, Growth, Analytics, Content, Distribution
User Request (first 200 chars):
Create optimized prompt for @marketing with 35 deliverables across 5 categories
OUTPUT:
Prompt Length: 12543 chars
Sections: 9
Generation Time: 2847ms
Truncated: False
================================================================================
[FEEDBACK COLLECTION]
Deployed to production? (y/n/skip):
Decision Tree:
IF deployed to production (y):
Validation score (0-100): 94
User rating (1-5): 5
Notes (optional): Excellent coverage, all 35 deliverables included
IF not deployed (n):
Notes (optional): Output too long, needs condensing
IF skip:
- Move to next log
- Come back later if uncertain
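The decision tree above maps cleanly onto a small feedback record appended to each log entry. The sketch below illustrates the branching and input validation; field names are illustrative, not the tool's actual schema:

```python
def build_feedback(deployed, validation_score=None, user_rating=None, notes=""):
    """Assemble a feedback record from the step-3 prompts.

    Mirrors the decision tree: deployed entries require a score
    and rating; rejected entries carry only notes.
    """
    record = {"deployed": deployed, "notes": notes}
    if deployed:
        if validation_score is None or not (0 <= validation_score <= 100):
            raise ValueError("validation score must be 0-100")
        if user_rating is None or not (1 <= user_rating <= 5):
            raise ValueError("user rating must be 1-5")
        record["validation_score"] = float(validation_score)
        record["user_rating"] = user_rating
    return record
```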
Tips for Validation Scoring:
| Score Range | Criteria | Example |
|---|---|---|
| 95-100 | Perfect - production-ready, no changes needed | All sections present, accurate examples, clear structure |
| 90-94 | Excellent - minor tweaks only | 1-2 small corrections, otherwise perfect |
| 80-89 | Good - usable with moderate edits | Missing 1-2 sections, needs restructuring |
| 70-79 | Fair - significant edits required | Incomplete coverage, several errors |
| <70 | Poor - not usable | Missing critical sections, hallucinations |
4. Export high-scoring logs to DSPy format
# Export logs with validation score ≥90
python3 workspace/training-examples/export-production-data.py --verbose

Expected Output:
[INFO] Exporting production logs with threshold: 90.0
[PROCESSING] 2025-10-23.jsonl
[EXPORT] Line 1 → marketing/production-2025-10-23T14-32-15Z.json
[EXPORT] Line 3 → finance/production-2025-10-23T16-45-22Z.json
[PROCESSING] 2025-10-24.jsonl
[EXPORT] Line 1 → legal/production-2025-10-24T09-15-33Z.json
================================================================================
EXPORT SUMMARY
================================================================================
Total Processed: 24
Exported: 8
Skipped (no deploy): 12
Skipped (low score): 3
Skipped (duplicate): 0
Skipped (invalid): 1
By Domain:
marketing: 3
finance: 2
legal: 2
seo: 1
================================================================================
Action:
- Verify exported files exist
- Review file structure (spot-check 1-2 examples)
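The export summary's skip categories (no deploy, low score, duplicate, invalid) suggest a simple per-entry filter. Here is a minimal sketch of that classification, under an assumed entry schema; the real `export-production-data.py` logic may differ:

```python
def export_decision(entry, threshold=90.0, seen_ids=None):
    """Classify one log entry the way the step-4 export summary does.

    Returns one of: 'export', 'no_deploy', 'low_score',
    'duplicate', 'invalid'. Entry schema is assumed.
    """
    seen_ids = seen_ids if seen_ids is not None else set()
    fb = entry.get("feedback")
    if not fb or "deployed" not in fb:
        return "invalid"           # no usable feedback record
    if not fb["deployed"]:
        return "no_deploy"
    if fb.get("validation_score", 0) < threshold:
        return "low_score"
    key = entry.get("timestamp")   # dedupe on timestamp
    if key in seen_ids:
        return "duplicate"
    seen_ids.add(key)
    return "export"
```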
5. Verify exported training examples
# List exported files
ls -lh workspace/training-examples/marketing/
ls -lh workspace/training-examples/finance/
# View sample export
cat workspace/training-examples/marketing/production-2025-10-23T14-32-15Z.json | jq .

Expected Format:
{
"agent_domain": "marketing",
"output_requirements": "35 deliverables across 5 categories: Planning, Growth, Analytics, Content, Distribution",
"prompt": "Create optimized prompt for @marketing with 35 deliverables",
"metadata": {
"source": "production",
"log_date": "2025-10-23",
"validation_score": 94.0,
"generation_time_ms": 2847,
"user_rating": 5,
"prompt_length": 12543,
"sections": 9,
"truncated": false
}
}

Action:
- Confirm all required fields present
- Check validation_score ≥90
- Verify domain categorization correct
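The three checks above can be scripted for spot-checking. This sketch validates one exported example against the sample format shown; the required-field lists mirror that sample and are not an official schema:

```python
REQUIRED_FIELDS = {"agent_domain", "output_requirements", "prompt", "metadata"}
REQUIRED_METADATA = {"source", "log_date", "validation_score"}

def check_export(example, min_score=90.0):
    """Return a list of problems found in one exported training
    example (step-5 checks; field lists mirror the sample above)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - example.keys()]
    meta = example.get("metadata", {})
    problems += [f"missing metadata: {f}" for f in REQUIRED_METADATA - meta.keys()]
    if meta.get("validation_score", 0) < min_score:
        problems.append("validation_score below threshold")
    return problems
```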
6. Compress old logs (>30 days)
# Dry-run to see what would be compressed
python3 workspace/training-examples/manage-logs.py --compress --dry-run
# Actually compress
python3 workspace/training-examples/manage-logs.py --compress

Expected Output:
[INFO] Compressing logs older than 30 days (before 2025-09-29)
[COMPRESSED] 2025-09-01.jsonl → 2025-09-01.jsonl.gz
[COMPRESSED] 2025-09-02.jsonl → 2025-09-02.jsonl.gz
...
[INFO] Compressed 28 files
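The compression pass is straightforward to reproduce with the standard library. A minimal sketch, assuming daily log files named `YYYY-MM-DD.jsonl` (as in the output above); the real `manage-logs.py` may handle more cases:

```python
import gzip
import shutil
from datetime import date, timedelta
from pathlib import Path

def compress_old_logs(log_dir, days=30, today=None, dry_run=False):
    """Gzip daily JSONL files older than `days`, mirroring step 6.

    Assumes filenames follow the YYYY-MM-DD.jsonl convention.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    compressed = []
    for f in sorted(Path(log_dir).glob("*.jsonl")):
        file_date = date.fromisoformat(f.stem)
        if file_date < cutoff:
            if not dry_run:
                gz_path = f.parent / (f.name + ".gz")
                with open(f, "rb") as src, gzip.open(gz_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                f.unlink()  # remove the uncompressed original
            compressed.append(f.name)
    return compressed
```

The `dry_run` flag mirrors the `--dry-run` workflow above: it reports what would be compressed without touching any files.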
7. Archive logs (>90 days)
# Dry-run first
python3 workspace/training-examples/manage-logs.py --archive --dry-run --days 90
# Actually archive
python3 workspace/training-examples/manage-logs.py --archive --days 90

Expected Output:
[INFO] Archiving logs older than 90 days (before 2025-07-31)
[ARCHIVED] 2025-07-01.jsonl.gz → archive/
[ARCHIVED] 2025-07-02.jsonl.gz → archive/
...
[INFO] Archived 90 files
8. Health check for anomalies
python3 workspace/training-examples/manage-logs.py --health-check

Expected Output:
================================================================================
HEALTH CHECK
================================================================================
Anomalies Detected:
⚠️ [low_feedback_rate] Low feedback completion rate: 18.5%
ℹ️ [unusually_long_prompts] 3 unusually long prompts (>3x average: 8234 chars)
================================================================================
Action:
- Investigate warnings
- Low feedback rate → Schedule more review time
- Long prompts → Check whether @prompter output is being truncated correctly
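The two anomaly types shown above are easy to reproduce from the raw entries. A minimal sketch, with assumed thresholds (80% feedback target, 3x-average length) and an assumed `prompt_length` field; the actual checker may use different heuristics:

```python
def health_check(entries, feedback_target=0.8, length_factor=3.0):
    """Flag the step-8 anomalies: low feedback completion and
    unusually long prompts (> length_factor x average length)."""
    anomalies = []
    total = len(entries)
    with_fb = sum(1 for e in entries if e.get("feedback"))
    rate = with_fb / total if total else 0.0
    if total and rate < feedback_target:
        anomalies.append(("low_feedback_rate",
                          f"Low feedback completion rate: {rate:.1%}"))
    lengths = [e.get("prompt_length", 0) for e in entries]
    if lengths:
        avg = sum(lengths) / len(lengths)
        long_count = sum(1 for n in lengths if n > length_factor * avg)
        if long_count:
            anomalies.append(("unusually_long_prompts",
                              f"{long_count} unusually long prompts "
                              f"(>{length_factor:.0f}x average: {avg:.0f} chars)"))
    return anomalies
```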
Validation Scoring Rubric
Required (90-94):
- ✅ All major sections present (Role, Workflow, Templates, Quality, Lenses, Output, Constraints, Integration, Metrics)
- ✅ Accurate deliverable count matches request
- ✅ Domain-specific examples (not generic placeholders)
- ✅ IF/THEN/BECAUSE causality chains
- ✅ All 6 lenses documented
- ✅ Workflow-first examples with time savings
- ✅ Proper markdown formatting
Bonus (95-100):
- ✅ Zero placeholder values (no "TBD", "[INSERT]", etc.)
- ✅ Quantified metrics throughout (not "significant" but "87.5%")
- ✅ Template library fully specified with section counts
- ✅ Integration points with 4+ other agents
- ✅ Real training data references (Cialdini, Rosenbaum, etc.)
Automatic Disqualifiers (<90):
- ❌ Missing major sections (e.g., no Template Library)
- ❌ Generic examples (no domain specificity)
- ❌ Placeholder values throughout
- ❌ Incorrect deliverable count
- ❌ Missing lens framework
- ❌ No workflow-first examples
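Some automatic disqualifiers (missing sections, placeholder values) can be pre-screened mechanically before manual review. This sketch uses simple substring and regex heuristics; it is illustrative, not the official checker:

```python
import re

PLACEHOLDER_PATTERNS = [r"\bTBD\b", r"\[INSERT[^\]]*\]", r"\bTODO\b"]
MAJOR_SECTIONS = ["Role", "Workflow", "Templates", "Quality", "Lenses",
                  "Output", "Constraints", "Integration", "Metrics"]

def disqualifiers(prompt_text):
    """Flag automatic disqualifiers from the rubric above.

    Heuristic only: a section counts as present if its name
    appears anywhere in the text (case-insensitive).
    """
    problems = []
    lowered = prompt_text.lower()
    missing = [s for s in MAJOR_SECTIONS if s.lower() not in lowered]
    if missing:
        problems.append(f"missing sections: {', '.join(missing)}")
    for pat in PLACEHOLDER_PATTERNS:
        if re.search(pat, prompt_text, re.IGNORECASE):
            problems.append(f"placeholder found: {pat}")
    return problems
```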
Best Practices
Do:
- ✅ Review logs within 1 week of generation (while fresh in memory)
- ✅ Be honest with validation scores (don't inflate)
- ✅ Add notes for borderline scores (helps future optimization)
- ✅ Skip if uncertain (better to skip than guess wrong)
- ✅ Export weekly (don't let logs pile up)
Don't:
- ❌ Mark as deployed if not actually used in production
- ❌ Give high scores to placeholders ("TBD", "[INSERT]", etc.)
- ❌ Rush through reviews (quality over speed)
- ❌ Ignore anomalies in health check
- ❌ Delete logs without archiving first
Troubleshooting
Problem: Too many logs to review in one session.
Solution:
# Review by domain (smaller batches)
python3 workspace/training-examples/export-production-data.py --domain marketing --verbose
# Review only marketing logs
# Or review by date range
python3 workspace/training-examples/collect-feedback.py --review 2025-10-23
# Single day only

Problem: Unsure how to score a borderline prompt.
Solution:
# Compare against golden examples
cat workspace/training-examples/marketing/golden-marketing-prompt.json | jq .
cat workspace/training-examples/finance/golden-finance-prompt.json | jq .
# Use these as 95+ reference points

Problem: Can't tell whether a prompt was actually deployed.
Solution:
- Check workspace/agent-workspace/agents/{domain}/ for saved prompts
- Search codebase for prompt content:
  grep -r "You are @marketing" backend/data/agents.json
- When unsure, mark as "not deployed" (safer)
Success Metrics
| Metric | Target | Action if Below |
|---|---|---|
| Feedback Completion Rate | >80% | Schedule dedicated review time |
| Deployment Rate | >30% | Review rejection reasons, improve prompts |
| Avg Validation Score (deployed) | >92 | Analyze low-scoring logs for patterns |
| Export Count (weekly) | >5 | Increase @prompter usage |
| Time per Review | <3 min | Simplify feedback form, add presets |
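The first three target metrics can be computed from the week's log entries. A minimal sketch under the same assumed entry schema used elsewhere in this guide (`feedback`, `deployed`, `validation_score`):

```python
def weekly_metrics(entries):
    """Compute the target-table metrics from one week of log entries.

    Returns rates as fractions (0.0-1.0) and the mean validation
    score of deployed entries, or None if none were deployed.
    """
    total = len(entries)
    fb = [e["feedback"] for e in entries if e.get("feedback")]
    deployed = [f for f in fb if f.get("deployed")]
    scores = [f["validation_score"] for f in deployed if "validation_score" in f]
    return {
        "feedback_completion_rate": len(fb) / total if total else 0.0,
        "deployment_rate": len(deployed) / total if total else 0.0,
        "avg_validation_score": sum(scores) / len(scores) if scores else None,
    }
```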
python3 workspace/training-examples/manage-logs.py --stats

Model Retraining
After collecting 20+ new training examples:
1. Backup current model:
   cp workspace/training-examples/prompter-optimizer.pkl \
      workspace/training-examples/backups/prompter-optimizer-$(date +%Y-%m-%d).pkl
2. Retrain with new examples:
   python3 workspace/training-examples/train_prompter.py \
      --include-production \
      --min-score 90
3. Validate improved performance:
   python3 workspace/training-examples/evaluate_prompter.py
4. Deploy if validation score improved:
   - Current: 94.9% → Target: 95.5%+
- PRODUCTION-LOGGING-GUIDE.md - Complete logging architecture
- LOGGING-QUICK-REFERENCE.md - Common commands cheat sheet
- TRAINING-DATA-INVENTORY.md - All training data sources
- DSPY-ENVIRONMENT-SETUP.md - Environment setup and verification
Last Updated: 2025-10-29 Workflow Version: 1.0 Next Review: Weekly (every Monday)