Purpose: Step-by-step workflow for collecting production feedback to improve DSPy training
Cadence: Weekly (recommended)
Goal: Convert production @prompter logs into high-quality DSPy training examples
Time Required: ~15-30 minutes per week (depending on volume)
Success Metrics: 80%+ feedback completion rate, 30%+ deployment rate
1. Check log statistics
cd /home/michael/soulfield
# View statistics for last 7 days
python3 workspace/training-examples/manage-logs.py --stats

Expected Output:
============================================================
LOG STATISTICS
============================================================
Total Files: 7
Total Logs: 24
Logs with Feedback: 0
Feedback Completion: 0.0%
Deployed Count: 0
Action:
- Note total logs needing review
- Estimate time needed (2-3 min per log)
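The statistics above can be derived directly from the daily JSONL log files. Below is a minimal sketch of that computation, assuming a hypothetical entry schema where each JSON line may carry a `feedback` object with a `deployed` flag (the actual schema used by `manage-logs.py` may differ):

```python
import json
from pathlib import Path

def log_stats(log_dir):
    """Compute the step-1 statistics from daily JSONL log files.

    Assumes each line is a JSON object with an optional 'feedback'
    dict containing a 'deployed' flag (illustrative schema).
    """
    files = sorted(Path(log_dir).glob("*.jsonl"))
    total = with_feedback = deployed = 0
    for f in files:
        for line in f.read_text().splitlines():
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            total += 1
            fb = entry.get("feedback")
            if fb:
                with_feedback += 1
                if fb.get("deployed"):
                    deployed += 1
    rate = (with_feedback / total * 100) if total else 0.0
    return {
        "total_files": len(files),
        "total_logs": total,
        "logs_with_feedback": with_feedback,
        "feedback_completion": round(rate, 1),
        "deployed_count": deployed,
    }
```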
2. List logs needing feedback
python3 workspace/training-examples/collect-feedback.py --list

Expected Output:
[INFO] Found 24 logs needing feedback:
2025-10-23 (line 1)
2025-10-23 (line 2)
2025-10-24 (line 1)
...
Action:
- Identify dates with most logs
- Plan review session
3. Review logs by date
# Review specific date
python3 workspace/training-examples/collect-feedback.py --review 2025-10-23

Interactive Prompt:
================================================================================
Date: 2025-10-23 | Line: 1
================================================================================
Timestamp: 2025-10-23T14:32:15Z
INPUT:
Agent Domain: marketing
Deliverable Count: 35
Categories: Planning, Growth, Analytics, Content, Distribution
User Request (first 200 chars):
Create optimized prompt for @marketing with 35 deliverables across 5 categories
OUTPUT:
Prompt Length: 12543 chars
Sections: 9
Generation Time: 2847ms
Truncated: False
================================================================================
[FEEDBACK COLLECTION]
Deployed to production? (y/n/skip):
Decision Tree:
IF deployed to production (y):
Validation score (0-100): 94
User rating (1-5): 5
Notes (optional): Excellent coverage, all 35 deliverables included
IF not deployed (n):
Notes (optional): Output too long, needs condensing
IF skip:
- Move to next log
- Come back later if uncertain
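The decision tree above maps cleanly onto a small feedback record appended to each log entry. The sketch below illustrates the branching and input validation; field names are illustrative, not the tool's actual schema:

```python
def build_feedback(deployed, validation_score=None, user_rating=None, notes=""):
    """Assemble a feedback record from the step-3 prompts.

    Mirrors the decision tree: deployed entries require a score
    and rating; rejected entries carry only notes.
    """
    record = {"deployed": deployed, "notes": notes}
    if deployed:
        if validation_score is None or not (0 <= validation_score <= 100):
            raise ValueError("validation score must be 0-100")
        if user_rating is None or not (1 <= user_rating <= 5):
            raise ValueError("user rating must be 1-5")
        record["validation_score"] = float(validation_score)
        record["user_rating"] = user_rating
    return record
```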
Tips for Validation Scoring:
| Score Range | Criteria | Example |
|---|---|---|
| 95-100 | Perfect - production-ready, no changes needed | All sections present, accurate examples, clear structure |
| 90-94 | Excellent - minor tweaks only | 1-2 small corrections, otherwise perfect |
| 80-89 | Good - usable with moderate edits | Missing 1-2 sections, needs restructuring |
| 70-79 | Fair - significant edits required | Incomplete coverage, several errors |
| <70 | Poor - not usable | Missing critical sections, hallucinations |
4. Export high-scoring logs to DSPy format
# Export logs with validation score ≥90
python3 workspace/training-examples/export-production-data.py --verbose

Expected Output:
[INFO] Exporting production logs with threshold: 90.0
[PROCESSING] 2025-10-23.jsonl
[EXPORT] Line 1 → marketing/production-2025-10-23T14-32-15Z.json
[EXPORT] Line 3 → finance/production-2025-10-23T16-45-22Z.json
[PROCESSING] 2025-10-24.jsonl
[EXPORT] Line 1 → legal/production-2025-10-24T09-15-33Z.json
================================================================================
EXPORT SUMMARY
================================================================================
Total Processed: 24
Exported: 8
Skipped (no deploy): 12
Skipped (low score): 3
Skipped (duplicate): 0
Skipped (invalid): 1
By Domain:
marketing: 3
finance: 2
legal: 2
seo: 1
================================================================================
Action:
- Verify exported files exist
- Review file structure (spot-check 1-2 examples)
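The export summary's skip categories (no deploy, low score, duplicate, invalid) suggest a simple per-entry filter. Here is a minimal sketch of that classification, under an assumed entry schema; the real `export-production-data.py` logic may differ:

```python
def export_decision(entry, threshold=90.0, seen_ids=None):
    """Classify one log entry the way the step-4 export summary does.

    Returns one of: 'export', 'no_deploy', 'low_score',
    'duplicate', 'invalid'. Entry schema is assumed.
    """
    seen_ids = seen_ids if seen_ids is not None else set()
    fb = entry.get("feedback")
    if not fb or "deployed" not in fb:
        return "invalid"           # no usable feedback record
    if not fb["deployed"]:
        return "no_deploy"
    if fb.get("validation_score", 0) < threshold:
        return "low_score"
    key = entry.get("timestamp")   # dedupe on timestamp
    if key in seen_ids:
        return "duplicate"
    seen_ids.add(key)
    return "export"
```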
5. Verify exported training examples
# List exported files
ls -lh workspace/training-examples/marketing/
ls -lh workspace/training-examples/finance/
# View sample export
cat workspace/training-examples/marketing/production-2025-10-23T14-32-15Z.json | jq .

Expected Format:
{
"agent_domain": "marketing",
"output_requirements": "35 deliverables across 5 categories: Planning, Growth, Analytics, Content, Distribution",
"prompt": "Create optimized prompt for @marketing with 35 deliverables",
"metadata": {
"source": "production",
"log_date": "2025-10-23",
"validation_score": 94.0,
"generation_time_ms": 2847,
"user_rating": 5,
"prompt_length": 12543,
"sections": 9,
"truncated": false
}
}

Action:
- Confirm all required fields present
- Check validation_score ≥90
- Verify domain categorization correct
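The three checks above can be scripted for spot-checking. This sketch validates one exported example against the sample format shown; the required-field lists mirror that sample and are not an official schema:

```python
REQUIRED_FIELDS = {"agent_domain", "output_requirements", "prompt", "metadata"}
REQUIRED_METADATA = {"source", "log_date", "validation_score"}

def check_export(example, min_score=90.0):
    """Return a list of problems found in one exported training
    example (step-5 checks; field lists mirror the sample above)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - example.keys()]
    meta = example.get("metadata", {})
    problems += [f"missing metadata: {f}" for f in REQUIRED_METADATA - meta.keys()]
    if meta.get("validation_score", 0) < min_score:
        problems.append("validation_score below threshold")
    return problems
```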
6. Compress old logs (>30 days)
# Dry-run to see what would be compressed
python3 workspace/training-examples/manage-logs.py --compress --dry-run
# Actually compress
python3 workspace/training-examples/manage-logs.py --compress

Expected Output:
[INFO] Compressing logs older than 30 days (before 2025-09-29)
[COMPRESSED] 2025-09-01.jsonl → 2025-09-01.jsonl.gz
[COMPRESSED] 2025-09-02.jsonl → 2025-09-02.jsonl.gz
...
[INFO] Compressed 28 files
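The compression pass is straightforward to reproduce with the standard library. A minimal sketch, assuming daily log files named `YYYY-MM-DD.jsonl` (as in the output above); the real `manage-logs.py` may handle more cases:

```python
import gzip
import shutil
from datetime import date, timedelta
from pathlib import Path

def compress_old_logs(log_dir, days=30, today=None, dry_run=False):
    """Gzip daily JSONL files older than `days`, mirroring step 6.

    Assumes filenames follow the YYYY-MM-DD.jsonl convention.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    compressed = []
    for f in sorted(Path(log_dir).glob("*.jsonl")):
        file_date = date.fromisoformat(f.stem)
        if file_date < cutoff:
            if not dry_run:
                gz_path = f.parent / (f.name + ".gz")
                with open(f, "rb") as src, gzip.open(gz_path, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                f.unlink()  # remove the uncompressed original
            compressed.append(f.name)
    return compressed
```

The `dry_run` flag mirrors the `--dry-run` workflow above: it reports what would be compressed without touching any files.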
7. Archive logs (>90 days)
# Dry-run first
python3 workspace/training-examples/manage-logs.py --archive --dry-run --days 90
# Actually archive
python3 workspace/training-examples/manage-logs.py --archive --days 90

Expected Output:
[INFO] Archiving logs older than 90 days (before 2025-07-31)
[ARCHIVED] 2025-07-01.jsonl.gz → archive/
[ARCHIVED] 2025-07-02.jsonl.gz → archive/
...
[INFO] Archived 90 files
8. Health check for anomalies
python3 workspace/training-examples/manage-logs.py --health-check

Expected Output:
================================================================================
HEALTH CHECK
================================================================================
Anomalies Detected:
⚠️ [low_feedback_rate] Low feedback completion rate: 18.5%
ℹ️ [unusually_long_prompts] 3 unusually long prompts (>3x average: 8234 chars)
================================================================================
Action:
- Investigate warnings
- Low feedback rate → Schedule more review time
- Long prompts → Check whether @prompter output is being truncated correctly
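The two anomaly types shown above are easy to reproduce from the raw entries. A minimal sketch, with assumed thresholds (80% feedback target, 3x-average length) and an assumed `prompt_length` field; the actual checker may use different heuristics:

```python
def health_check(entries, feedback_target=0.8, length_factor=3.0):
    """Flag the step-8 anomalies: low feedback completion and
    unusually long prompts (> length_factor x average length)."""
    anomalies = []
    total = len(entries)
    with_fb = sum(1 for e in entries if e.get("feedback"))
    rate = with_fb / total if total else 0.0
    if total and rate < feedback_target:
        anomalies.append(("low_feedback_rate",
                          f"Low feedback completion rate: {rate:.1%}"))
    lengths = [e.get("prompt_length", 0) for e in entries]
    if lengths:
        avg = sum(lengths) / len(lengths)
        long_count = sum(1 for n in lengths if n > length_factor * avg)
        if long_count:
            anomalies.append(("unusually_long_prompts",
                              f"{long_count} unusually long prompts "
                              f"(>{length_factor:.0f}x average: {avg:.0f} chars)"))
    return anomalies
```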
Validation Scoring Rubric
Required (90-94):
- ✅ All major sections present (Role, Workflow, Templates, Quality, Lenses, Output, Constraints, Integration, Metrics)
- ✅ Accurate deliverable count matches request
- ✅ Domain-specific examples (not generic placeholders)
- ✅ IF/THEN/BECAUSE causality chains
- ✅ All 6 lenses documented
- ✅ Workflow-first examples with time savings
- ✅ Proper markdown formatting
Bonus (95-100):
- ✅ Zero placeholder values (no "TBD", "[INSERT]", etc.)
- ✅ Quantified metrics throughout (not "significant" but "87.5%")
- ✅ Template library fully specified with section counts
- ✅ Integration points with 4+ other agents
- ✅ Real training data references (Cialdini, Rosenbaum, etc.)
Automatic Disqualifiers (<90):
- ❌ Missing major sections (e.g., no Template Library)
- ❌ Generic examples (no domain specificity)
- ❌ Placeholder values throughout
- ❌ Incorrect deliverable count
- ❌ Missing lens framework
- ❌ No workflow-first examples
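Some automatic disqualifiers (missing sections, placeholder values) can be pre-screened mechanically before manual review. This sketch uses simple substring and regex heuristics; it is illustrative, not the official checker:

```python
import re

PLACEHOLDER_PATTERNS = [r"\bTBD\b", r"\[INSERT[^\]]*\]", r"\bTODO\b"]
MAJOR_SECTIONS = ["Role", "Workflow", "Templates", "Quality", "Lenses",
                  "Output", "Constraints", "Integration", "Metrics"]

def disqualifiers(prompt_text):
    """Flag automatic disqualifiers from the rubric above.

    Heuristic only: a section counts as present if its name
    appears anywhere in the text (case-insensitive).
    """
    problems = []
    lowered = prompt_text.lower()
    missing = [s for s in MAJOR_SECTIONS if s.lower() not in lowered]
    if missing:
        problems.append(f"missing sections: {', '.join(missing)}")
    for pat in PLACEHOLDER_PATTERNS:
        if re.search(pat, prompt_text, re.IGNORECASE):
            problems.append(f"placeholder found: {pat}")
    return problems
```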
Best Practices
Do:
- ✅ Review logs within 1 week of generation (while fresh in memory)
- ✅ Be honest with validation scores (don't inflate)
- ✅ Add notes for borderline scores (helps future optimization)
- ✅ Skip if uncertain (better to skip than guess wrong)
- ✅ Export weekly (don't let logs pile up)
Don't:
- ❌ Mark as deployed if not actually used in production
- ❌ Give high scores to placeholders ("TBD", "[INSERT]", etc.)
- ❌ Rush through reviews (quality over speed)
- ❌ Ignore anomalies in health check
- ❌ Delete logs without archiving first
Troubleshooting
Problem: Too many logs to review in one session.
Solution:
# Review by domain (smaller batches)
python3 workspace/training-examples/export-production-data.py --domain marketing --verbose
# Review only marketing logs
# Or review by date range
python3 workspace/training-examples/collect-feedback.py --review 2025-10-23
# Single day only

Problem: Unsure how to score a borderline prompt.
Solution:
# Compare against golden examples
cat workspace/training-examples/marketing/golden-marketing-prompt.json | jq .
cat workspace/training-examples/finance/golden-finance-prompt.json | jq .
# Use these as 95+ reference points

Problem: Can't tell whether a prompt was actually deployed.
Solution:
- Check workspace/agent-workspace/agents/{domain}/ for saved prompts
- Search codebase for prompt content:
  grep -r "You are @marketing" backend/data/agents.json
- When unsure, mark as "not deployed" (safer)
Success Metrics
| Metric | Target | Action if Below |
|---|---|---|
| Feedback Completion Rate | >80% | Schedule dedicated review time |
| Deployment Rate | >30% | Review rejection reasons, improve prompts |
| Avg Validation Score (deployed) | >92 | Analyze low-scoring logs for patterns |
| Export Count (weekly) | >5 | Increase @prompter usage |
| Time per Review | <3 min | Simplify feedback form, add presets |
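The first three target metrics can be computed from the week's log entries. A minimal sketch under the same assumed entry schema used elsewhere in this guide (`feedback`, `deployed`, `validation_score`):

```python
def weekly_metrics(entries):
    """Compute the target-table metrics from one week of log entries.

    Returns rates as fractions (0.0-1.0) and the mean validation
    score of deployed entries, or None if none were deployed.
    """
    total = len(entries)
    fb = [e["feedback"] for e in entries if e.get("feedback")]
    deployed = [f for f in fb if f.get("deployed")]
    scores = [f["validation_score"] for f in deployed if "validation_score" in f]
    return {
        "feedback_completion_rate": len(fb) / total if total else 0.0,
        "deployment_rate": len(deployed) / total if total else 0.0,
        "avg_validation_score": sum(scores) / len(scores) if scores else None,
    }
```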
python3 workspace/training-examples/manage-logs.py --stats

Model Retraining
After collecting 20+ new training examples:
1. Backup current model:
   cp workspace/training-examples/prompter-optimizer.pkl \
      workspace/training-examples/backups/prompter-optimizer-$(date +%Y-%m-%d).pkl
2. Retrain with new examples:
   python3 workspace/training-examples/train_prompter.py \
      --include-production \
      --min-score 90
3. Validate improved performance:
   python3 workspace/training-examples/evaluate_prompter.py
4. Deploy if validation score improved:
   - Current: 94.9% → Target: 95.5%+
- PRODUCTION-LOGGING-GUIDE.md - Complete logging architecture
- LOGGING-QUICK-REFERENCE.md - Common commands cheat sheet
- TRAINING-DATA-INVENTORY.md - All training data sources
- DSPY-ENVIRONMENT-SETUP.md - Environment setup and verification
Last Updated: 2025-10-29 Workflow Version: 1.0 Next Review: Weekly (every Monday)