Large Language Models Evaluation
[2022/12] Constitutional AI: Harmlessness from AI Feedback Bai Y et al. arXiv. [paper]
[2023/04] Human-like Summarization Evaluation with ChatGPT Gao M et al. arXiv. [paper]
[2023/03] ChatGPT as a Factual Inconsistency Evaluator for Text Summarization Luo Z et al. arXiv. [paper]
[2023/05] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models Li J et al. arXiv. [paper] [code]
[2023/07] Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering Adlakha V et al. arXiv. [paper] [code]
[2023/05] LM vs LM: Detecting Factual Errors via Cross Examination Cohen R et al. arXiv. [paper]
[2023/03] G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment Liu Y et al. arXiv. [paper] [code]
[2023/03] SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models Manakul P et al. arXiv. [paper]
[2023/10] Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models Kim S et al. ICLR 2024. [paper]
[2023/08] Shepherd: A Critic for Language Model Generation Wang T et al. arXiv. [paper]
[2023/12] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Zheng L et al. NeurIPS 2023. [paper]
[2024/03] Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators Liu Y et al. arXiv. [paper]
[2024/06] Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments Zhou H et al. arXiv. [paper]
[2024/06] UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor Upadhyay S et al. arXiv. [paper] [code]
[2024/04] Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Verga P et al. arXiv. [paper]
[2024/05] EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria Kim TS et al. CHI. [paper]
[2024/05] "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output Liu MX et al. CHI. [paper]
[2024/04] Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences Shankar S et al. arXiv. [paper]
[2024/02] Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer Tan B et al. NeurIPS. [paper] [code]
[2024/06] LLM Critics Help Catch LLM Bugs McAleese N et al. arXiv. [paper]
[2024/06] Finding Blind Spots in Evaluator LLMs with Interpretable Checklists Doddapaneni S et al. arXiv. [paper] [code]
[2024/06] LLMs instead of Human Judges? A Large Scale Empirical Study Across 20 NLP Evaluation Tasks Bavaresco A et al. arXiv. [paper]
[2024/06] Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges Thakur AS et al. arXiv. [paper]
[2024/06] On the Limitations of Fine-Tuned Judge Models for LLM Evaluation Huang H et al. arXiv. [paper]
About
A repo tracking 2024 papers on large language model evaluation.