
Feedback Collection Workflow - @prompter Training Data

Purpose: Step-by-step workflow for collecting production feedback to improve DSPy training

Cadence: Weekly (recommended)


Overview

Goal: Convert production @prompter logs into high-quality DSPy training examples

Time Required: ~15-30 minutes per week (depending on volume)

Success Metrics: 80%+ feedback completion rate, 30%+ deployment rate


Weekly Workflow (Step-by-Step)

Monday: Review Last Week's Logs

1. Check log statistics

cd /home/michael/soulfield

# View statistics for last 7 days
python3 workspace/training-examples/manage-logs.py --stats

Expected Output:

============================================================
LOG STATISTICS
============================================================
Total Files:              7
Total Logs:               24
Logs with Feedback:       0
Feedback Completion:      0.0%
Deployed Count:           0

Action:

  • Note total logs needing review
  • Estimate time needed (2-3 min per log)
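Conceptually, the `--stats` pass is a scan over JSONL log entries. The sketch below is a hypothetical reimplementation for reference only; the `feedback` and `deployed` field names are assumptions about the log schema, not confirmed details of `manage-logs.py`:

```python
import json

def log_stats(jsonl_lines):
    """Summarize @prompter logs: totals, feedback completion, deploy count.

    Assumes each non-empty line is a JSON object with an optional
    "feedback" object that may carry a "deployed" flag (hypothetical schema).
    """
    logs = [json.loads(line) for line in jsonl_lines if line.strip()]
    with_feedback = [e for e in logs if e.get("feedback")]
    deployed = [e for e in with_feedback if e["feedback"].get("deployed")]
    completion = 100.0 * len(with_feedback) / len(logs) if logs else 0.0
    return {
        "total_logs": len(logs),
        "with_feedback": len(with_feedback),
        "completion_pct": round(completion, 1),
        "deployed": len(deployed),
    }
```

A completion rate of 0.0% (as in the expected output above) simply means no entry has a feedback object yet.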

2. List logs needing feedback

python3 workspace/training-examples/collect-feedback.py --list

Expected Output:

[INFO] Found 24 logs needing feedback:
  2025-10-23 (line 1)
  2025-10-23 (line 2)
  2025-10-24 (line 1)
  ...

Action:

  • Identify dates with most logs
  • Plan review session
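The `--list` output pairs each pending entry with its date and line number. A minimal sketch of that selection, under the same assumed log schema (entries lacking a truthy `feedback` field are pending):

```python
import json

def logs_needing_feedback(date, jsonl_lines):
    """Return (date, line_number) pairs for entries without feedback.

    Line numbers are 1-based and count blank lines, matching how JSONL
    files are usually addressed. Schema is an assumption, not confirmed.
    """
    pending = []
    for line_no, raw in enumerate(jsonl_lines, start=1):
        if not raw.strip():
            continue
        entry = json.loads(raw)
        if not entry.get("feedback"):
            pending.append((date, line_no))
    return pending
```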

Tuesday-Thursday: Interactive Review

3. Review logs by date

# Review specific date
python3 workspace/training-examples/collect-feedback.py --review 2025-10-23

Interactive Prompt:

================================================================================
Date: 2025-10-23 | Line: 1
================================================================================

Timestamp: 2025-10-23T14:32:15Z

INPUT:
  Agent Domain: marketing
  Deliverable Count: 35
  Categories: Planning, Growth, Analytics, Content, Distribution

  User Request (first 200 chars):
  Create optimized prompt for @marketing with 35 deliverables across 5 categories

OUTPUT:
  Prompt Length: 12543 chars
  Sections: 9
  Generation Time: 2847ms
  Truncated: False
================================================================================

[FEEDBACK COLLECTION]
Deployed to production? (y/n/skip):

Decision Tree:

IF deployed to production (y):

Validation score (0-100): 94
User rating (1-5): 5
Notes (optional): Excellent coverage, all 35 deliverables included

IF not deployed (n):

Notes (optional): Output too long, needs condensing

IF skip:

  • Move to next log
  • Come back later if uncertain

Tips for Validation Scoring:

| Score Range | Criteria | Example |
|-------------|----------|---------|
| 95-100 | Perfect - production-ready, no changes needed | All sections present, accurate examples, clear structure |
| 90-94 | Excellent - minor tweaks only | 1-2 small corrections, otherwise perfect |
| 80-89 | Good - usable with moderate edits | Missing 1-2 sections, needs restructuring |
| 70-79 | Fair - significant edits required | Incomplete coverage, several errors |
| <70 | Poor - not usable | Missing critical sections, hallucinations |

Friday: Export Training Data

4. Export high-scoring logs to DSPy format

# Export logs with validation score ≥90
python3 workspace/training-examples/export-production-data.py --verbose

Expected Output:

[INFO] Exporting production logs with threshold: 90.0

[PROCESSING] 2025-10-23.jsonl
  [EXPORT] Line 1 → marketing/production-2025-10-23T14-32-15Z.json
  [EXPORT] Line 3 → finance/production-2025-10-23T16-45-22Z.json

[PROCESSING] 2025-10-24.jsonl
  [EXPORT] Line 1 → legal/production-2025-10-24T09-15-33Z.json

================================================================================
EXPORT SUMMARY
================================================================================
Total Processed:      24
Exported:             8
Skipped (no deploy):  12
Skipped (low score):  3
Skipped (duplicate):  0
Skipped (invalid):    1

By Domain:
  marketing: 3
  finance: 2
  legal: 2
  seo: 1
================================================================================

Action:

  • Verify exported files exist
  • Review file structure (spot-check 1-2 examples)
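The skip counters in the export summary imply a per-entry gate: deployed, above threshold, not a duplicate, structurally valid. A hedged sketch of that gating logic (entry shape, field names, and the timestamp-based dedup key are all assumptions about how `export-production-data.py` might work):

```python
def export_decision(entry, threshold=90.0, seen_keys=None):
    """Classify a log entry as 'export' or a skip reason.

    Mirrors the summary buckets: invalid, no_deploy, low_score, duplicate.
    """
    feedback = entry.get("feedback") or {}
    if "input" not in entry or "output" not in entry:
        return "invalid"
    if not feedback.get("deployed"):
        return "no_deploy"
    if feedback.get("validation_score", 0) < threshold:
        return "low_score"
    key = entry.get("timestamp")
    if seen_keys is not None:
        if key in seen_keys:
            return "duplicate"
        seen_keys.add(key)
    return "export"
```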

5. Verify exported training examples

# List exported files
ls -lh workspace/training-examples/marketing/
ls -lh workspace/training-examples/finance/

# View sample export
cat workspace/training-examples/marketing/production-2025-10-23T14-32-15Z.json | jq .

Expected Format:

{
  "agent_domain": "marketing",
  "output_requirements": "35 deliverables across 5 categories: Planning, Growth, Analytics, Content, Distribution",
  "prompt": "Create optimized prompt for @marketing with 35 deliverables",
  "metadata": {
    "source": "production",
    "log_date": "2025-10-23",
    "validation_score": 94.0,
    "generation_time_ms": 2847,
    "user_rating": 5,
    "prompt_length": 12543,
    "sections": 9,
    "truncated": false
  }
}

Action:

  • Confirm all required fields present
  • Check validation_score ≥90
  • Verify domain categorization correct
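The three verification steps above can be automated with a small validator. The required-field list mirrors the sample export shown earlier; treat it as an illustrative assumption about the export schema, not the exporter's canonical contract:

```python
REQUIRED_FIELDS = ("agent_domain", "output_requirements", "prompt", "metadata")

def check_export(example, min_score=90.0):
    """Return a list of problems with an exported training example (empty = OK)."""
    problems = [field for field in REQUIRED_FIELDS if field not in example]
    metadata = example.get("metadata", {})
    if metadata.get("validation_score", 0) < min_score:
        problems.append("validation_score below threshold")
    return problems
```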

End of Month: Log Management

6. Compress old logs (>30 days)

# Dry-run to see what would be compressed
python3 workspace/training-examples/manage-logs.py --compress --dry-run

# Actually compress
python3 workspace/training-examples/manage-logs.py --compress

Expected Output:

[INFO] Compressing logs older than 30 days (before 2025-09-29)
[COMPRESSED] 2025-09-01.jsonl → 2025-09-01.jsonl.gz
[COMPRESSED] 2025-09-02.jsonl → 2025-09-02.jsonl.gz
...
[INFO] Compressed 28 files
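The compression step amounts to gzipping `.jsonl` files past the age cutoff. A self-contained sketch, assuming file modification time is the age criterion (the real `manage-logs.py` may key off the date in the filename instead):

```python
import gzip
import shutil
import time
from pathlib import Path

def compress_old_logs(log_dir, max_age_days=30, dry_run=False):
    """Gzip .jsonl files older than max_age_days; return compressed names."""
    cutoff = time.time() - max_age_days * 86400
    compressed = []
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        if path.stat().st_mtime < cutoff:
            if not dry_run:
                with path.open("rb") as src, gzip.open(f"{path}.gz", "wb") as dst:
                    shutil.copyfileobj(src, dst)
                path.unlink()  # remove the original after compressing
            compressed.append(path.name)
    return compressed
```

With `dry_run=True` the function only reports what would be compressed, matching the `--dry-run` flow above.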

7. Archive logs (>90 days)

# Dry-run first
python3 workspace/training-examples/manage-logs.py --archive --dry-run --days 90

# Actually archive
python3 workspace/training-examples/manage-logs.py --archive --days 90

Expected Output:

[INFO] Archiving logs older than 90 days (before 2025-07-31)
[ARCHIVED] 2025-07-01.jsonl.gz → archive/
[ARCHIVED] 2025-07-02.jsonl.gz → archive/
...
[INFO] Archived 90 files

8. Health check for anomalies

python3 workspace/training-examples/manage-logs.py --health-check

Expected Output:

================================================================================
HEALTH CHECK
================================================================================

Anomalies Detected:
  ⚠️ [low_feedback_rate] Low feedback completion rate: 18.5%
  ℹ️ [unusually_long_prompts] 3 unusually long prompts (>3x average: 8234 chars)

================================================================================

Action:

  • Investigate warnings
  • Low feedback rate → Schedule more review time
  • Long prompts → Check whether @prompter output is being truncated correctly
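The two anomaly types shown in the health-check output can be detected with simple heuristics. This sketch assumes a 50% feedback-rate floor and the 3x-average length rule implied by the output; both thresholds and the log schema are assumptions:

```python
def detect_anomalies(logs, min_feedback_rate=50.0, length_factor=3.0):
    """Return (kind, message) pairs for anomalous log populations."""
    anomalies = []
    if not logs:
        return anomalies
    with_feedback = sum(1 for e in logs if e.get("feedback"))
    rate = 100.0 * with_feedback / len(logs)
    if rate < min_feedback_rate:
        anomalies.append(("low_feedback_rate",
                          f"Low feedback completion rate: {rate:.1f}%"))
    lengths = [e["output"]["prompt_length"] for e in logs if "output" in e]
    if lengths:
        avg = sum(lengths) / len(lengths)
        long_count = sum(1 for n in lengths if n > length_factor * avg)
        if long_count:
            anomalies.append(("unusually_long_prompts",
                              f"{long_count} unusually long prompts "
                              f"(>{length_factor:g}x average: {avg:.0f} chars)"))
    return anomalies
```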

Validation Score Guidelines

What Makes a 90+ Score?

Required (90-94):

  • ✅ All major sections present (Role, Workflow, Templates, Quality, Lenses, Output, Constraints, Integration, Metrics)
  • ✅ Accurate deliverable count matches request
  • ✅ Domain-specific examples (not generic placeholders)
  • ✅ IF/THEN/BECAUSE causality chains
  • ✅ All 6 lenses documented
  • ✅ Workflow-first examples with time savings
  • ✅ Proper markdown formatting

Bonus (95-100):

  • ✅ Zero placeholder values (no "TBD", "[INSERT]", etc.)
  • ✅ Quantified metrics throughout (not "significant" but "87.5%")
  • ✅ Template library fully specified with section counts
  • ✅ Integration points with 4+ other agents
  • ✅ Real training data references (Cialdini, Rosenbaum, etc.)

Automatic Disqualifiers (<90):

  • ❌ Missing major sections (e.g., no Template Library)
  • ❌ Generic examples (no domain specificity)
  • ❌ Placeholder values throughout
  • ❌ Incorrect deliverable count
  • ❌ Missing lens framework
  • ❌ No workflow-first examples

Best Practices

DO:

  • ✅ Review logs within 1 week of generation (while fresh in memory)
  • ✅ Be honest with validation scores (don't inflate)
  • ✅ Add notes for borderline scores (helps future optimization)
  • ✅ Skip if uncertain (better to skip than guess wrong)
  • ✅ Export weekly (don't let logs pile up)

DON'T:

  • ❌ Mark as deployed if not actually used in production
  • ❌ Give high scores to placeholders ("TBD", "[INSERT]", etc.)
  • ❌ Rush through reviews (quality over speed)
  • ❌ Ignore anomalies in health check
  • ❌ Delete logs without archiving first

Troubleshooting

Problem: Too many logs to review in one session

Solution:

# Export by domain (smaller batches)
python3 workspace/training-examples/export-production-data.py --domain marketing --verbose
# Processes only marketing logs

# Or review by date range
python3 workspace/training-examples/collect-feedback.py --review 2025-10-23
# Single day only

Problem: Validation score criteria unclear

Solution:

# Compare against golden examples
cat workspace/training-examples/marketing/golden-marketing-prompt.json | jq .
cat workspace/training-examples/finance/golden-finance-prompt.json | jq .

# Use these as 95+ reference points

Problem: Can't remember if prompt was deployed

Solution:

  • Check workspace/agent-workspace/agents/{domain}/ for saved prompts
  • Search codebase for prompt content: grep -r "You are @marketing" backend/data/agents.json
  • When unsure, mark as "not deployed" (safer)

Success Metrics (Weekly Check)

Targets

| Metric | Target | Action if Below |
|--------|--------|-----------------|
| Feedback Completion Rate | >80% | Schedule dedicated review time |
| Deployment Rate | >30% | Review rejection reasons, improve prompts |
| Avg Validation Score (deployed) | >92 | Analyze low-scoring logs for patterns |
| Export Count (weekly) | >5 | Increase @prompter usage |
| Time per Review | <3 min | Simplify feedback form, add presets |
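The weekly check reduces to comparing measured metrics against these targets. A minimal sketch (metric key names are invented for illustration; map them to whatever `--stats` actually reports):

```python
# Hypothetical metric keys -> targets from the table above.
TARGETS = {
    "feedback_completion_pct": 80.0,
    "deployment_rate_pct": 30.0,
    "avg_validation_score": 92.0,
    "weekly_export_count": 5,
}

def below_target(metrics):
    """Return the names of metrics that fall below their weekly target."""
    return [name for name, target in TARGETS.items()
            if metrics.get(name, 0) < target]
```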

Calculate Metrics

python3 workspace/training-examples/manage-logs.py --stats

Integration with Retraining

After collecting 20+ new training examples:

  1. Backup current model:

    cp workspace/training-examples/prompter-optimizer.pkl \
       workspace/training-examples/backups/prompter-optimizer-$(date +%Y-%m-%d).pkl
  2. Retrain with new examples:

    python3 workspace/training-examples/train_prompter.py \
      --include-production \
      --min-score 90
  3. Validate improved performance:

    python3 workspace/training-examples/evaluate_prompter.py
  4. Deploy if validation score improved:

    • Current: 94.9% → Target: 95.5%+

Related Documentation

  • PRODUCTION-LOGGING-GUIDE.md - Complete logging architecture
  • LOGGING-QUICK-REFERENCE.md - Common commands cheat sheet
  • TRAINING-DATA-INVENTORY.md - All training data sources
  • DSPY-ENVIRONMENT-SETUP.md - Environment setup and verification

Last Updated: 2025-10-29
Workflow Version: 1.0
Next Review: Weekly (every Monday)