KongLongGeFDU/2024-LLM-Eval-Papers

Large Language Models Evaluation

2022

  • [2022/12] Constitutional AI: Harmlessness from AI Feedback. Bai Y et al. arXiv. [paper]

2023

  • [2023/04] Human-like Summarization Evaluation with ChatGPT. Gao M et al. arXiv. [paper]
  • [2023/03] ChatGPT as a Factual Inconsistency Evaluator for Text Summarization. Luo Z et al. arXiv. [paper]
  • [2023/05] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Li J et al. arXiv. [paper] [code]
  • [2023/07] Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering. Adlakha V et al. arXiv. [paper] [code]
  • [2023/05] LM vs LM: Detecting Factual Errors via Cross Examination. Cohen R et al. arXiv. [paper]
  • [2023/03] G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. Liu Y et al. arXiv. [paper] [code]
  • [2023/03] SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Manakul P et al. arXiv. [paper]
  • [2023/10] Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models. Kim S et al. ICLR 2024. [paper]
  • [2023/08] Shepherd: A Critic for Language Model Generation. Wang T et al. arXiv. [paper]
  • [2023/12] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Zheng L et al. NeurIPS. [paper]

2024

  • [2024/03] Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators. Liu Y et al. arXiv. [paper]
  • [2024/06] Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments. Zhou H et al. arXiv. [paper]
  • [2024/06] UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. Upadhyay S et al. arXiv. [paper] [code]
  • [2024/04] Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. Verga P et al. arXiv. [paper]
  • [2024/05] EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria. Kim TS et al. CHI. [paper]
  • [2024/05] "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output. Liu MX et al. CHI. [paper]
  • [2024/04] Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Shankar S et al. arXiv. [paper]
  • [2024/02] Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer. Tan B et al. NeurIPS. [paper] [code]
  • [2024/06] LLM Critics Help Catch LLM Bugs. McAleese N et al. arXiv. [paper]
  • [2024/06] Finding Blind Spots in Evaluator LLMs with Interpretable Checklists. Doddapaneni S et al. arXiv. [paper] [code]
  • [2024/06] LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. Bavaresco A et al. arXiv. [paper]
  • [2024/06] Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. Thakur AS et al. arXiv. [paper]
  • [2024/06] On the Limitations of Fine-Tuned Judge Models for LLM Evaluation. Huang H et al. arXiv. [paper]

About

A repository tracking papers on large language model (LLM) evaluation published in 2024.
