Large Language Models Evaluation
[2022/12] Constitutional AI: Harmlessness from AI Feedback Bai Y et al. arXiv. [paper]
[2023/04] Human-like Summarization Evaluation with ChatGPT Gao M et al. arXiv. [paper]
[2023/03] ChatGPT as a Factual Inconsistency Evaluator for Text Summarization Luo Z et al. arXiv. [paper]
[2023/05] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models Li J et al. arXiv. [paper] [code]
[2023/07] Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering Adlakha V et al. arXiv. [paper] [code]
[2023/05] LM vs LM: Detecting Factual Errors via Cross Examination Cohen R et al. arXiv. [paper]
[2023/03] G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment Liu Y et al. arXiv. [paper] [code]
[2023/03] SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models Manakul P et al. arXiv. [paper]
[2023/10] Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models Kim S et al. ICLR 2024. [paper]
[2023/08] Shepherd: A Critic for Language Model Generation Wang T et al. arXiv. [paper]
[2023/12] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Zheng L et al. NeurIPS 2023. [paper]
[2024/03] Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators Liu Y et al. arXiv. [paper]
[2024/06] Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments Zhou H et al. arXiv. [paper]
[2024/06] UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor Upadhyay S et al. arXiv. [paper] [code]
[2024/04] Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Verga P et al. arXiv. [paper]
[2024/05] EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria Kim TS et al. CHI. [paper]
[2024/05] "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output Liu MX et al. CHI. [paper]
[2024/04] Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences Shankar S et al. arXiv. [paper]
[2024/02] Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer Tan B et al. NeurIPS. [paper] [code]
[2024/06] LLM Critics Help Catch LLM Bugs McAleese N et al. arXiv. [paper]
[2024/06] Finding Blind Spots in Evaluator LLMs with Interpretable Checklists Doddapaneni S et al. arXiv. [paper] [code]
[2024/06] LLMs instead of Human Judges? A Large Scale Empirical Study Across 20 NLP Evaluation Tasks Bavaresco A et al. arXiv. [paper]
[2024/06] Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges Thakur AS et al. arXiv. [paper]
[2024/06] On the Limitations of Fine-Tuned Judge Models for LLM Evaluation Huang H et al. arXiv. [paper]
About
A repo tracking 2024 papers on large language model evaluation.