yongxin2020/LLM-Sum-Guidelines

Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals


Contents:

  1. Introduction
  2. Code Structure
  3. Experiments
  4. Results
  5. Citation

1. Introduction [Back to Top]

This repository accompanies the paper "Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals" [https://arxiv.org/abs/2310.16810], which investigates the ability of prompt-driven LLMs (e.g., ChatGPT, GPT-4) to adhere to human guidelines for dialogue summarization. Experiments were conducted on:

  • DialogSum: English social conversations
  • DECODA-FR: French call center interactions

Key findings:

  • GPT models (ChatGPT, GPT-4, GPT-4o) outperform task-specific models and even reference summaries in human evaluation, likely due to longer, more comprehensive outputs.
  • Despite lower automatic metric scores (ROUGE/BERTScore), GPT summaries are preferred by humans, highlighting the need for better-aligned metrics and continued human evaluation.
  • Guideline adherence: GPT-4 better follows word limits, while the HGR→WL approach yields superior results over simple WordLimit.
  • Subjectivity in evaluation: Human judges often favor GPT summaries over references due to stylistic differences.
  • Shortcomings: GPT models occasionally miss rules (e.g., named entities in DialogSum or balanced perspectives in DECODA), though HGR intermediate steps improve adherence.

2. Code Structure [Back to Top]

├── decoda/ # DECODA-FR experiments
│ ├── output/ # One-step prompt outputs
│ ├── twoSteps/ # Two-step prompt outputs
│ └── sota_barthez_predictions.txt # BARThez fine-tuned predictions
├── dialogsum/ # DialogSum experiments
│ ├── output/ # One-step prompt outputs
│ ├── twoSteps/ # Two-step prompt outputs
│ └── bart_large_summaries.txt # BART-large fine-tuned predictions
├── scripts/
│ ├── build_dataset.py # Preprocess DECODA raw data
│ ├── compute_metrics.py # ROUGE/BERTScore evaluation
│ ├── example_analysis.py # Find generated summaries with low ROUGE but high BERTScore
│ ├── human_eval_save_annotated_json.py # Process human evaluation annotations and save evaluation samples with annotation scores
│ ├── human_eval_scores.py # Aggregate and display human evaluation scores
│ ├── openapi_summarization.py # GPT summarization experiments
│ ├── step2.py # Intermediate step for two-step prompting
│ ├── summ_length_analysis.py # Length analysis & box plots
│ └── variance.py # Human-vs-model variance analysis
├── Guideline-Eval/ # LLM-based evaluation scripts and prompts
└── results/ # Evaluation outputs, figures, and human annotation data

Setup

conda create --name <yourEnv> python=3.8
conda activate <yourEnv>
pip install -r requirements.txt

3. Experiments [Back to Top]

3.1 Data Preparation

Datasets

Preprocessing

  • DECODA: Speaker turns are marked with <Spk A>, <Spk B>, etc. Noise labels (e.g., <noise b/>) are filtered out by the preprocessing script:

python ./scripts/build_dataset.py
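
The noise filtering described above can be illustrated with a minimal sketch. It assumes that noise labels are always self-closing tags such as <noise b/> and that speaker markers such as <Spk A> are not; the pattern and the function name `strip_noise_labels` are hypothetical and may differ from what build_dataset.py actually does.

```python
import re

# Assumption: noise labels are self-closing tags like <noise b/>.
# Speaker markers like <Spk A> are not self-closing, so they are kept.
NOISE_TAG = re.compile(r"<[^<>]*/>")


def strip_noise_labels(turn: str) -> str:
    """Remove self-closing noise tags and collapse leftover whitespace."""
    cleaned = NOISE_TAG.sub(" ", turn)
    return re.sub(r"\s+", " ", cleaned).strip()
```

Applied to "<Spk A> bonjour <noise b/> madame", the function keeps the speaker marker and drops only the noise tag.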

Save the downloaded (and preprocessed) data in the ./data/ directory, using the following paths:

  • DECODA: Test file path ./data/decoda/test.json
  • DialogSum: Test data path ./data/dialogsum/dialogsum.test.jsonl
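
Since the DialogSum test file uses the .jsonl extension, a small loader along the following lines can be used to inspect it. This is a sketch that assumes one JSON object per line; the helper name `load_jsonl` is hypothetical, and no particular field names are assumed.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `load_jsonl("./data/dialogsum/dialogsum.test.jsonl")` would return a list of dictionaries, one per dialogue.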

3.2 Prompt Design

Prompts include:

  1. Baseline: Word-length constraints.
  2. Guideline_Original: Human summarization guidelines.
  3. Guideline_Original_Annotator: Human summarization guidelines that begin with "you are an annotator ...".

Refer to OpenAI’s prompt guide for design principles.
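
To make the three prompt types concrete, the sketch below assembles a prompt string for each one. Only the prompt type names come from this repository; the template wording, the default word limit, and the function name `build_prompt` are assumptions for illustration and are not the prompts used in the paper.

```python
# Hypothetical templates: the repository's actual prompt wording may differ.
ANNOTATOR_PREFIX = "You are an annotator. "


def build_prompt(prompt_type, dialogue, guidelines="", word_limit=20):
    """Return a prompt string for one of the three prompt types."""
    if prompt_type == "Baseline":
        instruction = f"Summarize the dialogue in at most {word_limit} words."
    elif prompt_type == "Guideline_Original":
        instruction = guidelines
    elif prompt_type == "Guideline_Original_Annotator":
        instruction = ANNOTATOR_PREFIX + guidelines
    else:
        raise ValueError(f"Unknown prompt type: {prompt_type}")
    return f"{instruction}\n\nDialogue:\n{dialogue}\n\nSummary:"
```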

3.3 Summarization Experiments (DialogSum & DECODA-FR)

Direct Summarization

# DialogSum (English)
python ./scripts/openapi_summarization.py \
    --dataset dialogsum \
    --model_name gpt-4o \
    --input_dir ./data/ \
    --prompt_type Baseline \
    --api_key YOUR_KEY
# DECODA-FR (French)
python ./scripts/openapi_summarization.py \
    --dataset decoda \
    --model_name gpt-4o \
    --input_dir ./data/ \
    --prompt_type Baseline \
    --api_key YOUR_KEY

Two-Step Prompting (Guideline → Length)

# dialogsum or decoda
python ./scripts/step2.py \
    --dataset decoda \
    --model_name gpt-4o \
    --input_file_prompt Guideline_Original_Annotator \
    --prompt_type Baseline \
    --api_key YOUR_KEY
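
The two-step idea (a guideline-driven summary first, then a length-constrained rewrite) can be sketched as follows. The model call is abstracted as a plain callable so the example runs without an API key; the function name `two_step_summary` and the wording of both prompts are hypothetical and do not reproduce step2.py.

```python
from typing import Callable, Tuple


def two_step_summary(
    dialogue: str,
    guideline_prompt: str,
    word_limit: int,
    llm: Callable[[str], str],
) -> Tuple[str, str]:
    """Run guideline-based summarization, then shorten the result."""
    # Step 1: summarize the dialogue following the human guidelines.
    draft = llm(f"{guideline_prompt}\n\nDialogue:\n{dialogue}")
    # Step 2: rewrite the step-1 summary under a word limit.
    final = llm(
        f"Shorten the following summary to at most {word_limit} words.\n\n"
        f"Summary:\n{draft}"
    )
    return draft, final
```

Keeping `llm` as a parameter makes it easy to swap in a real client or a stub for testing.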

Experiments with BART-based models

For details on fine-tuning BARThez on the DECODA dataset and BART-Large on the DialogSum dataset, please refer to the corresponding prior work.

4. Results [Back to Top]

4.1 Quantitative Evaluation

ROUGE & BERTScore

# Compute ROUGE/BERTScore for DECODA (GPT-3.5)
python ./scripts/compute_metrics.py \
    --dataset decoda \
    --prompt_type Baseline \
    --model gpt-3.5 \
    --test_file /path/to/test.json \
    --pred_file /path/to/pred.csv

# Compute ROUGE/BERTScore for DECODA (BARThez)
python ./scripts/compute_metrics.py --dataset decoda --prompt_type None --model bart-based &>barthez_results.txt

# Batch evaluation for all experiments
bash ./results/run_results.sh > ../results/results_rouge_bertscore.txt
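
For intuition about what the ROUGE numbers measure, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is an illustration only: it uses lowercasing and whitespace tokenization with no stemming, so its values will not match those produced by compute_metrics.py or by standard ROUGE packages.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A longer summary that paraphrases the reference tends to lose precision under this kind of measure, which is one reason lexical overlap and human preference can diverge.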

Model Variance Analysis

For DialogSum (3 references per dialogue), we compare model outputs against human variance:

# Example: GPT-generated summaries (4-WL) vs. variance in the reference summaries
python ./scripts/variance.py --dataset dialogsum --prompt_type Baseline --model gpt-4 &>../results/variance_dialogsum/variance_Baseline_4.txt
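
One plausible way to set model scores against the variation among human references is a leave-one-out comparison: score each reference against the remaining references, then score the model summary against all references. The sketch below assumes this design, which may not be what variance.py implements; `human_vs_model` and the `score` callable are hypothetical names.

```python
from statistics import mean
from typing import Callable, List, Tuple


def human_vs_model(
    references: List[str],
    model_summary: str,
    score: Callable[[str, str], float],
) -> Tuple[float, float]:
    """Return (mean human-vs-human score, mean model-vs-human score)."""
    human_scores = []
    for i, ref in enumerate(references):
        # Leave-one-out: compare each reference with the other references.
        others = references[:i] + references[i + 1:]
        human_scores.append(mean(score(ref, other) for other in others))
    model_score = mean(score(model_summary, ref) for ref in references)
    return mean(human_scores), model_score
```

Any pairwise metric (for example a ROUGE or BERTScore wrapper) can be passed as `score`; at least two references are required.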

LLM-as-a-Judge Evaluation

We use deepseek-reasoner as the backbone model to evaluate summaries on four aspects (Faithfulness, Main Issues, Sub-Issues, Resolution):

# Default model: deepseek-reasoner 
# Example: Faithfulness evaluation
python llm_eval.py --prompt_fp ./prompts/decoda/faithfulness.txt --save_fp ./results/r1_faithfulness.json --input_fp ./data/decoda_eval_samples.json --key YOUR_KEY --model deepseek-reasoner --base_url https://api.deepseek.com

For more details, see ./Guideline-Eval/README.md.

4.2 Example Analysis

Identify summaries with low ROUGE but high BERTScore:

python ./scripts/example_analysis.py --dataset dialogsum --prompt_type Baseline --model gpt-4

Outputs:

  • DialogSum: results/examples_analysis_rouge_bertscore/sorted_examples_multireferences.csv
  • DECODA: results/examples_analysis_rouge_bertscore/sorted_examples.csv
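
The selection logic behind this analysis can be sketched as a filter followed by a sort on the gap between the two metrics. The thresholds, the dictionary keys `rouge` and `bertscore`, and the function name are assumptions for illustration; the criteria used by example_analysis.py may differ.

```python
def rank_metric_disagreements(rows, rouge_max=0.2, bert_min=0.9):
    """Keep rows with low ROUGE and high BERTScore, largest gap first."""
    picked = [
        row for row in rows
        if row["rouge"] <= rouge_max and row["bertscore"] >= bert_min
    ]
    # Sort by how strongly the two metrics disagree.
    return sorted(
        picked, key=lambda row: row["bertscore"] - row["rouge"], reverse=True
    )
```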

4.3 Human Evaluation

Dataset: 20 DECODA dialogues (10 shortest + 10 longest)
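
A sample of this shape can be drawn by sorting dialogues by length and taking both ends. The sketch assumes length is measured in whitespace-separated tokens, which is not stated in this README; `select_extremes` is a hypothetical helper.

```python
def select_extremes(dialogues, k=10):
    """Return the k shortest and k longest dialogues by token count."""
    # Assumption: length is the number of whitespace-separated tokens.
    ordered = sorted(dialogues, key=lambda text: len(text.split()))
    if len(ordered) <= 2 * k:
        return ordered  # too few items to split without overlap
    return ordered[:k] + ordered[-k:]
```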

Metrics (5-point Likert scale):

  • Faithfulness
  • Main Issues
  • Sub-Issues
  • Resolution

Resources:

Results:
Run the following to aggregate and display human evaluation scores:

python ./scripts/human_eval_scores.py
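
Aggregating 5-point Likert ratings typically amounts to averaging per system and per aspect. The sketch below assumes each annotation is a dictionary with `system`, `aspect`, and `score` keys; this layout and the function name are hypothetical and may not match the repository's annotation files.

```python
from collections import defaultdict
from statistics import mean


def aggregate_likert(annotations):
    """Average 1-5 scores for each (system, aspect) pair."""
    buckets = defaultdict(list)
    for item in annotations:
        if not 1 <= item["score"] <= 5:
            raise ValueError(f"Score out of range: {item['score']}")
        buckets[(item["system"], item["aspect"])].append(item["score"])
    return {key: mean(values) for key, values in buckets.items()}
```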

Processing Annotated Summaries:
To process human evaluation annotations and save enhanced evaluation samples with annotation scores:

python ./scripts/human_eval_save_annotated_json.py

Processed annotated evaluation samples are saved in: ./results/human_annotations_decoda/decoda_eval_samples_annotated.json

4.4 Summary Length Analysis

python ./scripts/summ_length_analysis.py --dataset dialogsum --model_names gpt-4o gpt-4 gpt-3.5-turbo --prompt_types Baseline Guideline_Original_Annotator Guideline_Original_Annotator_ToBaseline --plot
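
The kind of statistics behind a length analysis can be sketched without plotting. The example assumes summaries are non-empty lists of strings and that length is counted in whitespace-separated words; `length_stats` is a hypothetical helper and does not reproduce summ_length_analysis.py.

```python
from statistics import mean, median


def length_stats(summaries, word_limit=None):
    """Compute word-count statistics for a non-empty list of summaries."""
    lengths = [len(text.split()) for text in summaries]
    stats = {
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }
    if word_limit is not None:
        # Fraction of summaries that exceed the requested word limit.
        stats["over_limit"] = sum(n > word_limit for n in lengths) / len(lengths)
    return stats
```

The `over_limit` ratio gives a quick view of how often a model ignores a word-limit instruction.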

5. Citation [Back to Top]

If you find this work useful, please cite our paper using the following BibTeX:

@inproceedings{zhou-etal-2025-gpt,
    title = "Can {GPT} models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals",
    author = "Zhou, Yongxin  and
      Ringeval, Fabien  and
      Portet, Fran{\c{c}}ois",
    editor = "Flek, Lucie  and
      Narayan, Shashi  and
      Phương, L{\^e} Hồng  and
      Pei, Jiahuan",
    booktitle = "Proceedings of the 18th International Natural Language Generation Conference",
    month = oct,
    year = "2025",
    address = "Hanoi, Vietnam",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.inlg-main.17/",
    pages = "249--273",
    abstract = "This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics."
}

Contact

For questions or issues, please open an issue on GitHub or contact yongxin.zhou@univ-grenoble-alpes.fr.

About

[INLG 2025] Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals
