1. Introduction
This repository accompanies the paper "Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals" [https://arxiv.org/abs/2310.16810], which investigates the ability of prompt-driven LLMs (e.g., ChatGPT, GPT-4) to adhere to human guidelines for dialogue summarization. Experiments were conducted on:
- DialogSum: English social conversations
- DECODA-FR: French call center interactions
Key findings:
- GPT models (ChatGPT, GPT-4, GPT-4o) outperform task-specific models and even reference summaries in human evaluation, likely due to longer, more comprehensive outputs.
- Despite lower automatic metric scores (ROUGE/BERTScore), GPT summaries are preferred by humans, highlighting the need for better-aligned metrics and continued human evaluation.
- Guideline adherence: GPT-4 better follows word limits, while the HGR→WL approach yields superior results over simple WordLimit.
- Subjectivity in evaluation: Human judges often favor GPT summaries over references due to stylistic differences.
- Shortcomings: GPT models occasionally miss rules (e.g., named entities in DialogSum or balanced perspectives in DECODA), though HGR intermediate steps improve adherence.
2. Code Structure
├── decoda/ # DECODA-FR experiments
│ ├── output/ # One-step prompt outputs
│ ├── twoSteps/ # Two-step prompt outputs
│ └── sota_barthez_predictions.txt # BARThez fine-tuned predictions
├── dialogsum/ # DialogSum experiments
│ ├── output/ # One-step prompt outputs
│ ├── twoSteps/ # Two-step prompt outputs
│ └── bart_large_summaries.txt # BART-large fine-tuned predictions
├── scripts/
│ ├── build_dataset.py # Preprocess DECODA raw data
│ ├── compute_metrics.py # ROUGE/BERTScore evaluation
│ ├── example_analysis.py # Data points of generated summaries (low ROUGE but high BERTScore)
│ ├── human_eval_save_annotated_json.py # Process human evaluation annotations and save evaluation samples with annotation scores
│ ├── human_eval_scores.py # Aggregate and display human evaluation scores
│ ├── openapi_summarization.py # GPT summarization experiments
│ ├── step2.py # Intermediate step for two-step prompting
│ ├── summ_length_analysis.py # Length analysis & box plots
│ └── variance.py # Human-vs-model variance analysis
├── Guideline-Eval/ # LLM-based evaluation scripts and prompts
└── results/ # Evaluation outputs, figures, and human annotation data
conda create --name <yourEnv> python=3.8
conda activate <yourEnv>
pip install -r requirements.txt
3. Experiments
- DECODA: Download from MultiLing 2015 -- CCCS data download (test set requires author approval).
- DialogSum: Download from GitHub.
- DECODA: Speaker turns are marked with <Spk A>, <Spk B>, etc. Noise labels (e.g., <noise b/>) are filtered.
python ./scripts/build_dataset.py
You can save the downloaded (and preprocessed) data in the ./data/ directory.
- DECODA: Test file path ./data/decoda/test.json
- DialogSum: Test data path ./data/dialogsum/dialogsum.test.jsonl
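The DECODA preprocessing described above (speaker tags kept, noise labels filtered) can be pictured with the minimal sketch below. It is not the code of build_dataset.py: the function name `clean_turns`, the regular expressions, and the assumption that every noise label is a self-closing tag starting with `noise` are illustrative guesses based only on the `<Spk A>` and `<noise b/>` examples.

```python
import re
from typing import List, Tuple

# Assumption: noise labels are self-closing tags whose name starts with "noise",
# as in the <noise b/> example; the real script may handle more cases.
NOISE_TAG = re.compile(r"<noise\b[^>]*/>")
# Assumption: speakers are identified by a single capital letter.
SPEAKER_TAG = re.compile(r"<Spk ([A-Z])>")


def clean_turns(raw: str) -> List[Tuple[str, str]]:
    """Split a raw transcript into (speaker, utterance) pairs without noise labels."""
    text = NOISE_TAG.sub(" ", raw)
    # With one capture group, re.split returns [prefix, spk1, text1, spk2, text2, ...].
    parts = SPEAKER_TAG.split(text)
    turns = []
    for speaker, utterance in zip(parts[1::2], parts[2::2]):
        utterance = " ".join(utterance.split())  # collapse leftover whitespace
        if utterance:  # drop turns that contained only noise
            turns.append((speaker, utterance))
    return turns
```

Calling `clean_turns("<Spk A> bonjour <noise b/> madame <Spk B> oui")` yields one pair per speaker turn, with the noise label removed from the first utterance.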
Prompts include:
- Baseline: Word-length constraints.
- Guideline_Original: Human summarization guidelines.
- Guideline_Original_Annotator: Guidelines begin with "you are an annotator ...".
Refer to OpenAI’s prompt guide for design principles.
# DialogSum (English)
python ./scripts/openapi_summarization.py \
--dataset dialogsum \
--model_name gpt-4o \
--input_dir ./data/ \
--prompt_type Baseline \
--api_key YOUR_KEY
# DECODA-FR (French)
python ./scripts/openapi_summarization.py \
--dataset decoda \
--model_name gpt-4o \
--input_dir ./data/ \
--prompt_type Baseline \
--api_key YOUR_KEY
# dialogsum or decoda
python ./scripts/step2.py \
--dataset decoda \
--model_name gpt-4o \
--input_file_prompt Guideline_Original_Annotator \
--prompt_type Baseline \
--api_key YOUR_KEY
For further details, refer to the previously cited articles on fine-tuning BARThez on the DECODA dataset and BART-Large on the DialogSum dataset.
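Judging from its arguments, the step2.py command above appears to feed the outputs of a guideline prompt into a word-limit (Baseline) prompt. The sketch below illustrates that two-step idea only. The prompt wording, the function `two_step_summary`, and the `generate` callable (a stand-in for whatever model client is used) are all hypothetical, and it assumes the second step sees only the first-step summary.

```python
from typing import Callable

# Hypothetical prompt wording; the repository's actual prompts may differ.
GUIDELINE_PROMPT = (
    "You are an annotator. Summarize the dialogue following these guidelines:\n"
    "{guidelines}\n\nDialogue:\n{dialogue}"
)
WORD_LIMIT_PROMPT = (
    "Rewrite the summary below in at most {limit} words.\n\nSummary:\n{summary}"
)


def two_step_summary(
    dialogue: str,
    guidelines: str,
    limit: int,
    generate: Callable[[str], str],
) -> str:
    """Step 1: guideline-driven draft. Step 2: compress the draft to a word limit."""
    draft = generate(GUIDELINE_PROMPT.format(guidelines=guidelines, dialogue=dialogue))
    return generate(WORD_LIMIT_PROMPT.format(limit=limit, summary=draft))
```

Passing the model call in as `generate` keeps the sketch independent of any particular API client and makes the two steps easy to test with a stub.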
4. Results
# Compute ROUGE/BERTScore for DECODA (GPT-3.5)
python ./scripts/compute_metrics.py \
--dataset decoda \
--prompt_type Baseline \
--model gpt-3.5 \
--test_file /path/to/test.json \
--pred_file /path/to/pred.csv
# Compute ROUGE/BERTScore for DECODA (BARThez)
python ./scripts/compute_metrics.py --dataset decoda --prompt_type None --model bart-based &>barthez_results.txt
# Batch evaluation for all experiments
python ./results/run_results.sh > ../results/results_rouge_bertscore.txt
For DialogSum (3 references per dialogue), we compare model outputs against human variance:
# Example: GPT-generated summaries (4-WL) vs. variance in the reference summaries
python ./scripts/variance.py --dataset dialogsum --prompt_type Baseline --model gpt-4 &>../results/variance_dialogsum/variance_Baseline_4.txt
LLM-based evaluation uses deepseek-reasoner as the backbone and covers four aspects (Faithfulness, Main Issues, Sub-Issues, Resolution):
# Default model: deepseek-reasoner
# Example: Faithfulness evaluation
python llm_eval.py --prompt_fp ./prompts/decoda/faithfulness.txt --save_fp ./results/r1_faithfulness.json --input_fp ./data/decoda_eval_samples.json --key YOUR_KEY --model deepseek-reasoner --base_url https://api.deepseek.com
For more details, see ./Guideline-Eval/README.md.
Identify summaries with low ROUGE but high BERTScore:
python ./scripts/example_analysis.py --dataset dialogsum --prompt_type Baseline --model gpt-4
Outputs:
- DialogSum: results/examples_analysis_rouge_bertscore/sorted_examples_multireferences.csv
- DECODA: results/examples_analysis_rouge_bertscore/sorted_examples.csv
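One way to surface summaries with low ROUGE but high BERTScore is to compare each example's rank under the two metrics, since their raw scales differ. The sketch below does that. It is not the logic of example_analysis.py, and the function names and the keys `rouge_l` and `bertscore_f1` are hypothetical.

```python
from typing import Dict, List, Sequence


def rank_positions(values: Sequence[float]) -> List[int]:
    """Return each value's rank (0 = lowest) within the sequence."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def rank_metric_disagreements(
    rows: List[Dict],
    rouge_key: str = "rouge_l",        # hypothetical column name
    bert_key: str = "bertscore_f1",    # hypothetical column name
) -> List[Dict]:
    """Order examples so that low-ROUGE / high-BERTScore cases come first."""
    rouge_ranks = rank_positions([row[rouge_key] for row in rows])
    bert_ranks = rank_positions([row[bert_key] for row in rows])
    # A large positive gap means a high BERTScore rank and a low ROUGE rank.
    gaps = [b - r for r, b in zip(rouge_ranks, bert_ranks)]
    order = sorted(range(len(rows)), key=lambda i: gaps[i], reverse=True)
    return [rows[i] for i in order]
```

Ranking instead of subtracting raw scores avoids letting ROUGE, which typically varies over a wider range than BERTScore, dominate the ordering.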
Dataset: 20 DECODA dialogues (10 shortest + 10 longest)
Metrics (5-point Likert scale):
- Faithfulness
- Main Issues
- Sub-Issues
- Resolution
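Averaging 5-point Likert ratings per system over these four aspects can be sketched as follows. The record layout (`system`, `aspect`, `score` keys) and the function `mean_likert_scores` are assumptions made for illustration. They do not describe the repository's annotation files or human_eval_scores.py.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

ASPECTS = ("Faithfulness", "Main Issues", "Sub-Issues", "Resolution")


def mean_likert_scores(annotations: Iterable[Dict]) -> Dict[Tuple[str, str], float]:
    """Average Likert ratings for every (system, aspect) pair."""
    scores = defaultdict(list)
    for record in annotations:
        aspect, score = record["aspect"], record["score"]
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect!r}")
        if not 1 <= score <= 5:
            raise ValueError(f"score outside the 1-5 scale: {score!r}")
        scores[(record["system"], aspect)].append(score)
    return {key: mean(values) for key, values in scores.items()}
```

Validating the aspect name and score range up front makes malformed annotation rows fail loudly instead of silently skewing the averages.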
Resources:
- Evaluation Guidelines: HumanEval_Guideline_DECODA.pdf
- Raw Annotation Data: annotation_output
Results:
Run the following to aggregate and display human evaluation scores:
python ./scripts/human_eval_scores.py
Processing Annotated Summaries:
To process human evaluation annotations and save enhanced evaluation samples with annotation scores:
python ./scripts/human_eval_save_annotated_json.py
Processed annotated evaluation samples are saved in: ./results/human_annotations_decoda/decoda_eval_samples_annotated.json
python ./scripts/summ_length_analysis.py --dataset dialogsum --model_names gpt-4o gpt-4 gpt-3.5-turbo --prompt_types Baseline Guideline_Original_Annotator Guideline_Original_Annotator_ToBaseline --plot
5. Citation
If you find this work useful, please cite our paper using the following BibTeX:
@inproceedings{zhou-etal-2025-gpt,
title = "Can {GPT} models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals",
author = "Zhou, Yongxin and
Ringeval, Fabien and
Portet, Fran{\c{c}}ois",
editor = "Flek, Lucie and
Narayan, Shashi and
Phương, L{\^e} Hồng and
Pei, Jiahuan",
booktitle = "Proceedings of the 18th International Natural Language Generation Conference",
month = oct,
year = "2025",
address = "Hanoi, Vietnam",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.inlg-main.17/",
pages = "249--273",
abstract = "This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics."
}
For questions or issues, please open an issue on GitHub or contact yongxin.zhou@univ-grenoble-alpes.fr.