Commit 51187ea

doc: update
1 parent 2ee7645 commit 51187ea

1 file changed

Lines changed: 52 additions & 24 deletions

File tree

README.md

@@ -4,7 +4,7 @@
<a href="https://evo-eval.github.io/leaderboard.html"><img src="https://img.shields.io/badge/🏆-LeaderBoard-8e7cc3?style=for-the-badge"></a>
<a href="https://evo-eval.github.io/visualization.html"><img src="https://img.shields.io/badge/🔮-Visualization-3d85c6?style=for-the-badge"></a>
<a href="TODO"><img src="https://img.shields.io/badge/📃-Arxiv-b31b1b?style=for-the-badge"></a>
-<a href="https://huggingface.co/evoeval"><img src="https://img.shields.io/badge/🤗-Huggingface-f59e0b?style=for-the-badge"></a>
+<a href="https://huggingface.co/evoeval/"><img src="https://img.shields.io/badge/🤗-Huggingface-f59e0b?style=for-the-badge"></a>
<a href="https://pypi.org/project/evoeval/"><img src="https://img.shields.io/badge/0.1.0-Pypi-3b719f?style=for-the-badge&logo=pypi"></a>
</p>

@@ -16,18 +16,24 @@
<big><a href="#-acknowledgement">🙏Acknowledgement</a></big>
</p>

-## About
+## <img src="resources/butterfly_dark.png" width="23px" height="auto"> About

-**EvoEval** is a holistic benchmark suite created by _evolving_ **HumanEval** problems:
-- Containing **828** new problems across **5** semantic-altering and **2** semantic-preserving benchmarks
-- Allows evaluation/comparison across different **dimensions** and problem **types** (i.e., _Difficult_, _Creative_ or _Tool Use_ problems)
-- Complete with [**leaderboard**](https://evo-eval.github.io/leaderboard.html), **groundtruth solutions** and **robust testcases** to easily fit into your evaluation pipeline
-- Generated LLM code samples from **>50** different models to save you time in running experiments
+**EvoEval**<sup>1</sup> is a holistic benchmark suite created by _evolving_ **HumanEval** problems:
+- 🔥 Contains **828** new problems across **5** 🌠 semantic-altering and **2** ⭐ semantic-preserving benchmarks
+- 🔮 Allows evaluation/comparison across different **dimensions** and problem **types** (e.g., _Difficult_, _Creative_ or _Tool Use_ problems). See our [**visualization tool**](https://evo-eval.github.io/visualization.html) for ready-to-use comparisons
+- 🏆 Comes complete with a [**leaderboard**](https://evo-eval.github.io/leaderboard.html), **groundtruth solutions**, **robust testcases** and **evaluation scripts** to easily fit into your evaluation pipeline
+- 🤖 Provides generated LLM code samples from **>50** different models to save you time in running experiments
+
+<sup>1</sup> coincidentally pronounced much like 😈 EvilEval

<p align="center">
<img src="./resources/example.gif" style="width:75%; margin-left: auto; margin-right: auto;">
</p>

+Check out our 📃 [paper](TODO) and [webpage](https://evo-eval.github.io) for more details!
+
## ⚡ Quick Start

Directly install the package:
@@ -61,16 +67,26 @@ pip install -r requirements.txt

Now you are ready to download EvoEval benchmarks and perform evaluation!

-### Code generation
+### 🧑‍💻 Code generation
+
+To download our benchmarks, simply use the following code snippet:
+
+```python
+from evoeval.data import get_evo_eval
+
+evoeval_benchmark = "EvoEval_difficult" # you can pick from 7 different benchmarks!
+
+problems = get_evo_eval(evoeval_benchmark)
+```

For code generation and evaluation, we adopt the same style as [HumanEval+](https://github.com/evalplus/evalplus) and [HumanEval](https://github.com/openai/human-eval).

-Implement the `GEN_SOLUTION` function by calling the LLM to produce the complete solution (include the code) and save the samples to `samples.jsonl`:
+Implement the `GEN_SOLUTION` function by calling the LLM to produce the complete solution (including the function header and code) and save the samples to `{benchmark}_samples.jsonl`:

```python
from evoeval.data import get_evo_eval, write_jsonl

-evoeval_benchmark = "EvoEval_difficult" # you can pick from 7 different benchmarks!
+evoeval_benchmark = "EvoEval_difficult"

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
@@ -79,9 +95,17 @@ samples = [
write_jsonl(f"{evoeval_benchmark}_samples.jsonl", samples)
```
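LLMs often wrap their answers in markdown fences, while the `solution` field should hold plain code. A small post-processing step inside `GEN_SOLUTION` could strip such fences; the helper below is an illustrative sketch, not part of the evoeval package:

```python
import re


def extract_code(response: str) -> str:
    """Return the code inside the first markdown fence of an LLM response,
    or the raw response unchanged if no fence is found.

    Illustrative helper only -- not an evoeval API.
    """
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response
```

Applied to a fenced response, this returns just the enclosed source; responses without fences pass through untouched.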

-### Evaluation
+> [!TIP]
+>
+> EvoEval's `samples.jsonl` expects the `solution` field to contain the **complete** code implementation. This is
+> slightly different from the original HumanEval, where the `solution` field contains only the function body.
+>
+> If you want to follow the original HumanEval setup exactly, check out our 🤗 Huggingface [datasets](https://huggingface.co/evoeval), which can be run directly with the
+> HumanEval evaluation [script](https://huggingface.co/evoeval)
+
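Concretely, a pipeline that emits HumanEval-style function bodies can be adapted to the EvoEval format by prepending the prompt. This is a minimal sketch; `body_to_full_solution` is a hypothetical helper, not an evoeval API:

```python
def body_to_full_solution(prompt: str, body: str) -> str:
    """Turn a HumanEval-style function body into the complete implementation
    (header + body) expected in EvoEval's `solution` field.

    Hypothetical helper for illustration -- not part of the evoeval package.
    """
    return prompt.rstrip("\n") + "\n" + body
```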
+### 🕵️ Evaluation

-You are strongly recommended to use a sandbox such as [docker](https://docs.docker.com/get-docker/):
+You can use our provided [docker](https://docs.docker.com/get-docker/) image:

```bash
docker run -v $(pwd):/app evoeval/evoeval:latest --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl
@@ -93,30 +117,33 @@ Or run it locally:
evoeval.evaluate --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl
```

+Or if you are using it as a local repository:
+
```bash
-# run if you are using the local repo
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evoeval/evaluate.py --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl
```

-You should expect to see the following output (when evaluated on GPT4):
+You should expect to see the following output (when evaluated on GPT-4):
```
Computing expected output...
Expected outputs computed in 11.24s
Reading samples...
100it [00:00, 164.16it/s]
100%|████████████████████████████████████████████████████████████████| 100/100 [00:07<00:00, 12.77it/s]
EvoEval_difficult
-pass@1: 0.520
+pass@1: 0.520 # for reference, GPT-4 solves more than 80% of problems in HumanEval
```

-This shows the pass@1 score for the benchmark. You can use `--i-just-wanna-run` to recompute the evaluation result
+This shows the pass@1 score for the EvoEval_difficult benchmark. You can use `--i-just-wanna-run` to recompute the evaluation result.
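For greedy decoding (one sample per problem), pass@1 is simply the fraction of problems solved. With multiple samples per problem, the standard unbiased pass@k estimator from the HumanEval paper applies; the reference sketch below is independent of the evoeval package:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct.

    Estimates the probability that at least one of k randomly drawn
    samples is correct.
    """
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 pass, pass@1 evaluates to 0.5.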

## 🔠 Benchmarks

-**EvoEval** contains **7** different benchmarks, each with a unique set of problems evolved from the original **HumanEval** problems.:
+**EvoEval** contains **7** different benchmarks, each with a unique set of problems
+evolved from the original **HumanEval** problems. 🌠 denotes semantic-altering benchmarks,
+while ⭐ denotes semantic-preserving benchmarks:

-<details><summary><b>EvoEval_difficult:</b></summary>
+<details><summary><b>🌠 EvoEval_difficult:</b></summary>
<div>

> Introduce complexity by adding additional constraints and requirements,
@@ -125,7 +152,7 @@ This shows the pass@1 score for the benchmark. You can use `--i-just-wanna-run`
</div>
</details>

-<details><summary><b>EvoEval_creative:</b></summary>
+<details><summary><b>🌠 EvoEval_creative:</b></summary>
<div>

> Generate a more creative problem compared to the original through the use
@@ -134,7 +161,7 @@ This shows the pass@1 score for the benchmark. You can use `--i-just-wanna-run`
</details>

-<details><summary><b>EvoEval_subtle:</b></summary>
+<details><summary><b>🌠 EvoEval_subtle:</b></summary>
<div>

> Make a subtle and minor change to the original problem such as inverting or
@@ -143,7 +170,7 @@ This shows the pass@1 score for the benchmark. You can use `--i-just-wanna-run`
</details>

-<details><summary><b>EvoEval_combine:</b></summary>
+<details><summary><b>🌠 EvoEval_combine:</b></summary>
<div>

> Combine two different problems by integrating the concepts from both problems. In order to select problems that make sense to combine, we apply a simple heuristic
@@ -152,7 +179,7 @@ This shows the pass@1 score for the benchmark. You can use `--i-just-wanna-run`
</div>
</details>

-<details><summary><b>EvoEval_tool_use:</b></summary>
+<details><summary><b>🌠 EvoEval_tool_use:</b></summary>
<div>

> Produce a new problem containing a main problem and one or more helpers
@@ -212,8 +239,8 @@ Further, we also provide all code samples from LLMs on the **EvoEval** benchmark
* See the attachment of our [v0.1.0 release](https://github.com/evo-eval/evoeval/releases/tag/v0.1.0).

-Each LLM generation is packaged in a zip file named like `${model_name}_temp_0.0.zip`. You can unzip the folder and obtain the
-LLM generation for each of our 7 benchmarks + the original HumanEval problems.
+Each LLM generation is packaged in a zip file named like `{model_name}_temp_0.0.zip`. You can unzip the folder and obtain the
+LLM generations for each of our 7 benchmarks + the original HumanEval problems. Note that we only evaluate the greedy output for each LLM.

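To work with the release attachments programmatically, something like the sketch below could unpack every archive into a per-model folder. `unpack_generations` is a hypothetical helper (not part of evoeval); only the `*_temp_0.0.zip` naming convention comes from the release description above:

```python
import zipfile
from pathlib import Path


def unpack_generations(zip_dir: str, dest: str) -> list[str]:
    """Extract every `*_temp_0.0.zip` archive under `zip_dir` into `dest`,
    one sub-folder per model, and return the created folder paths.

    Hypothetical helper for illustration -- not part of the evoeval package.
    """
    extracted = []
    for archive in sorted(Path(zip_dir).glob("*_temp_0.0.zip")):
        target = Path(dest) / archive.stem  # e.g. <model_name>_temp_0.0
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(str(target))
    return extracted
```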
## 📝 Citation
@@ -236,3 +263,4 @@ LLM generation for each of our 7 benchmarks + the original HumanEval problems.
* We especially thank [EvalPlus](https://github.com/evalplus/evalplus)