This guide provides detailed instructions for reproducing all experiments from our paper: "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity".
Each experiment is mapped to its corresponding paper section with LaTeX labels for easy cross-referencing.
- Main Experiments (§5-8)
Paper Section: Section 5 (Creative Writing) Key Results: Figure 2 (a-i), Table 2 Appendix Details: Appendix §A.1 (Creative Writing)
We evaluate Verbalized Sampling on three creative writing tasks to measure diversity and quality improvements.
Script: tasks/run_poem.py
Dataset: PoemHunter.com (100 poems sampled)
Metrics: Semantic diversity, lexical diversity (ROUGE-L), quality scores
# Run with VS-Standard method
python scripts/tasks/run_poem.py --method vs_standard --model "gpt-4.1"
# Compare multiple methods
python scripts/tasks/run_poem.py \
--methods "direct vs_standard vs_cot vs_multi" \
--model "gpt-4.1"Expected Output:
- Results:
results/poem/ - Metrics: Diversity scores, quality evaluations
- Paper Results: VS achieves 30.4 diversity vs. 10.8 for direct prompting (Figure 2a)
Script: tasks/run_story.py
Dataset: BookMIA dataset (100 stories sampled)
Metrics: Semantic diversity, lexical diversity, creativity scores
# Run story generation experiments
python scripts/tasks/run_story.py --method vs_standard --model "gpt-4.1"Expected Output:
- Results:
results/story/ - Paper Results: 2-3x diversity improvement (Figure 2b)
Script: tasks/run_jokes.py
Dataset: Reddit r/DadJokes (100 thematic prompts)
Metrics: Semantic diversity, humor quality
# Run joke writing experiments
python scripts/tasks/run_jokes.py --method vs_standard --model "gpt-4.1"Expected Output:
- Results:
results/jokes/ - Paper Results: Significant diversity gains while maintaining humor quality (Figure 2c)
Human Evaluation: See Appendix §A.1.2 (Human Study on Creative Writing) for human study details showing 25.7% improvement in human evaluation scores.
Paper Section: Section 6 (Dialogue Simulation) Key Results: Figure 3 (a-b), Table 3 Appendix Details: Appendix §A.2 (Dialogue Simulation)
Script: tasks/run_dialogue_simulation.py
Dataset: PersuasionForGood (939 dialogues, 739 train / 200 test)
Task: Simulate multi-turn persuasive dialogues where one agent persuades another to donate to charity
Metrics: Donation distribution alignment (KL divergence), linguistic alignment (Distinct-N, semantic diversity)
# Run dialogue simulation with VS
python scripts/tasks/run_dialogue_simulation.py \
--persuader-model "gpt-4.1" \
--persuadee-model "gpt-4.1" \
--method vs_standard \
--num-conversations 50 \
--num-samplings 5 \
--max-turns 10 \
--evaluate \
--output-file results/dialogue/persuasion_vs_standard.jsonl
# Compare with baseline methods
python scripts/tasks/run_dialogue_simulation.py \
--persuader-model "gpt-4.1" \
--persuadee-model "gpt-4.1" \
--method direct \
--num-conversations 50 \
--output-file results/dialogue/persuasion_direct.jsonlExpected Output:
- Results:
results/dialogue/ - Dialogue transcripts with donation amounts
- Evaluation metrics: KL divergence from human distribution, linguistic features
- Paper Results: VS achieves donation distributions closer to human behavior (Figure 3a), with improved linguistic alignment (Figure 3b)
Note: The persuadee is simulated with different methods, while the persuader uses GPT-4.1 with direct prompting. See Appendix §A.2 for fine-tuned baseline details.
Paper Section: Section 7 (Open-Ended QA) Key Results: Figure 4, Table 4 Appendix Details: Appendix §A.3 (Open-Ended QA)
Scripts:
tasks/run_simple_qa.py- Main CoverageQA experimentstasks/run_bias_task_coverageqa.py- CoverageQA specifictasks/run_bias_task_general.py- General bias tasks
Dataset: CoverageQA (40 questions: 10 original + 30 newly created) Task: Enumerative open-ended questions with multiple valid answers (e.g., "Name a US state") Metrics: KL divergence from pretraining distribution, Coverage-N, Precision
# Run CoverageQA experiments
python scripts/tasks/run_simple_qa.py \
--task coverageqa \
--method vs_standard \
--model "gpt-4.1" \
--num-responses 100
# Run with different probability tuning
python scripts/tasks/run_bias_task_general.py \
--method vs_standard \
--probability-tuning 0.1 \
--model "gpt-4.1"Expected Output:
- Results:
results/bias/orresults/openqa/ - Answer distributions compared to RedPajama pretraining corpus
- Coverage and precision metrics
- Paper Results: VS achieves KL=0.12 vs human pretraining distribution, while direct prompting shows severe mode collapse (Figure 4, qualitative in Figure 1)
Probing Pretraining Data: See Appendix §A.8 (Comparing Pre-trained and VS-Elicited Distributions) for methodology on comparing VS-elicited distributions with pretraining data.
Paper Section: Section 8 (Synthetic Data Generation) Key Results: Table 5 Appendix Details: Appendix §A.4 (Synthetic Data Generation)
We evaluate VS on generating diverse synthetic training data for math reasoning tasks.
Generate diverse, correct synthetic math problems for training.
Script: tasks/run_positive_gsm8k.py
Task: Generate 1K diverse GSM8K-style grade school math problems
Evaluation: Fine-tune models on synthetic data, test on MATH500/OlympiadBench/Minerva Math
# Generate synthetic GSM8K questions
python scripts/tasks/run_positive_gsm8k.py \
--method vs_standard \
--model "gpt-4.1" \
--num-samples 1000 \
--output-file results/synthetic/gsm8k_vs_standard.jsonl
# For comparison with Gemini
python scripts/tasks/run_positive_gsm8k.py \
--method vs_standard \
--model "gemini/gemini-2.5-flash" \
--num-samples 1000Expected Output:
- Synthetic questions:
results/synthetic/gsm8k_*.jsonl - Diversity metrics computed automatically
- Paper Results: VS-generated data improves downstream accuracy by 13.8 points (Table 5, Appendix Table A4.1)
Script: tasks/run_positive_amc_aime.py
Task: Generate AMC/AIME competition-level math problems
# Generate AMC/AIME synthetic data
python scripts/tasks/run_positive_amc_aime.py \
--method vs_standard \
--model "gpt-4.1" \
--num-samples 1000Script: tasks/run_positive_lcb.py
Task: Generate coding problems for LiveCodeBench
# Generate LiveCodeBench synthetic data
python scripts/tasks/run_positive_lcb.py \
--method vs_standard \
--model "gpt-4.1" \
--num-samples 1000Script: tasks/run_negative.py
Task: Generate diverse, incorrect but plausible solutions for offline RL training
Appendix: Appendix §A.4.2, paragraph "Negative Synthetic Data Generation"
# Generate negative examples
python scripts/tasks/run_negative.py \
--method vs_standard \
--model "gpt-4.1" \
--dataset gsm8k \
--num-samples 50Expected Output:
- Negative examples:
results/synthetic/negative_*.jsonl - Paper Results: Training with VS-generated negative data improves offline RL by 2.69 points (Appendix Table A4.2)
Paper Section: Appendix §A.5 (Random Number Generation) Key Results: Table 5 (Appendix), Figure 5 (Appendix)
Script: tasks/run_rng.py
Task: Simulate rolling a fair 6-sided die to test uniform sampling ability
Metrics: KL divergence from uniform distribution
# Test random number generation
python scripts/tasks/run_rng.py \
--method vs_standard \
--model "gemini/gemini-2.5-pro" \
--num-samples 600Expected Output:
- Results:
results/rng/ - Distribution of dice rolls (1-6)
- Paper Results: VS achieves KL=0.027, vs. 0.926 for direct prompting (Appendix Table 5)
Paper Section: Appendix (Bias Mitigation experiments) Related to: Open-Ended QA section
Script: tasks/run_state_name.py
Task: Test for geographic bias in naming US states
Metrics: Distribution uniformity, coverage of all 50 states
# Test state naming bias
python scripts/tasks/run_state_name.py \
--method vs_standard \
--model "gpt-4.1" \
--num-samples 100Expected Output:
- Results:
results/bias/state_name/ - State frequency distributions
- Paper Results: VS reduces mode collapse on frequent states (CA, TX, NY)
Paper Section: Appendix §A.6 (Commonsense Reasoning) Key Results: Appendix Table 6
Script: tasks/run_simple_qa.py (with SimpleQA dataset)
Dataset: SimpleQA (4,326 questions, 300 sampled across 10 domains)
Task: Verify that diversity gains don't sacrifice factual accuracy
Metrics: Top@1 accuracy, Pass@N accuracy
# Test factual accuracy with SimpleQA
python scripts/tasks/run_simple_qa.py \
--task simpleqa \
--method vs_standard \
--model "gpt-4.1" \
--num-samples 5 \
--num-prompts 300Expected Output:
- Results:
results/commonsense/ - Accuracy metrics per method
- Paper Results: VS maintains comparable accuracy while increasing diversity (Appendix Table 6)
Paper Section: Appendix §A.7 (Safety Evaluation) Key Results: Appendix Table 7
Script: tasks/run_safety.py
Dataset: StrongReject benchmark (353 harmful prompts) Task: Verify VS doesn't compromise safety alignment Metrics: Refusal rate on harmful prompts
# Run safety evaluation
python scripts/tasks/run_safety.py \
--method vs_standard \
--model "gpt-4.1" \
--dataset strongrejectExpected Output:
- Results:
results/safety/ - Refusal rates per method
- Paper Results: VS maintains >97% refusal rate, only 0.3-0.8 points lower than baselines (Appendix Table 7)
Paper Reference: Appendix §B (Experiment Settings)
All experiments use consistent decoding parameters unless otherwise specified:
- Temperature: 0.7
- Top-p: 1.0
- Max tokens: 8,192
- Random seed: 42
Exceptions:
- Synthetic data generation (§8): temperature=0.6, top-p=0.95 (to match related work)
- Dialogue simulation (§6): temperature=0.7, top-p=0.9, max-tokens=500
- Creative Writing: N=30 responses, k=5 candidates per call, 100 prompts, ~200 words
- Dialogue Simulation: 50 conversations, max 10 turns, k=5 candidates
- Open-Ended QA: N=100 responses, k=20 candidates per call, 40 questions
- Synthetic Data: N=1000 samples, k=5 candidates per call
- Random Number: N=600 samples, k=5 candidates per call
Paper Reference: Appendix §C (Full Prompts)
All prompts used in our experiments are documented in the paper appendix. Key prompt templates:
Generate {num_samplings} responses to the input prompt. Each response should be approximately {target_words} words.
Return the responses in JSON format with the key: "responses" (list of dicts). Each dictionary must include:
- text: the response string only (no explanation or extra text).
- probability: the estimated probability from 0.0 to 1.0 of this response given the input prompt (relative to the full distribution).
Give ONLY the JSON object, with no explanations or extra text.
Adds chain-of-thought reasoning before generating responses (see Appendix §C for full prompt).
Generates responses across multiple turns for increased diversity (see Appendix §C for full prompt).
For diversity tuning experiments (Figure 2g-i), add:
Randomly sample the responses from the distribution, with the probability of each response must be below {probability_threshold}.
Thresholds tested: 1.0 (no tuning), 0.9, 0.5, 0.1, 0.05, 0.01
To reproduce all main paper results:
# Main experiments (Sections 5-8)
python scripts/tasks/run_poem.py --method vs_standard --model "gpt-4.1"
python scripts/tasks/run_story.py --method vs_standard --model "gpt-4.1"
python scripts/tasks/run_jokes.py --method vs_standard --model "gpt-4.1"
python scripts/tasks/run_dialogue_simulation.py --method vs_standard
python scripts/tasks/run_simple_qa.py --method vs_standard --model "gpt-4.1"
python scripts/tasks/run_positive_gsm8k.py --method vs_standard --model "gpt-4.1"
python scripts/tasks/run_positive_amc_aime.py --method vs_standard --model "gpt-4.1"
# Appendix experiments
python scripts/tasks/run_rng.py --method vs_standard --model "gpt-4.1"
python scripts/tasks/run_state_name.py --method vs_standard --model "gpt-4.1"
python scripts/tasks/run_safety.py --method vs_standard --model "gpt-4.1"Note: These commands provide a starting point. See individual experiment sections above for complete parameter options.
If you use these experiments in your research, please cite our paper:
@misc{zhang2025verbalizedsamplingmitigatemode,
title={Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity},
author={Jiayi Zhang and Simon Yu and Derek Chong and Anthony Sicilia and Michael R. Tomz and Christopher D. Manning and Weiyan Shi},
year={2025},
eprint={2510.01171},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.01171}
}- For issues or questions, please open an issue on our GitHub repository
- See the main README.md for installation and quick start instructions
- Refer to the paper and appendix for detailed methodology and results