## <img src="resources/butterfly_dark.png" width="23px" height="auto"> About
**EvoEval**<sup>1</sup> is a holistic benchmark suite created by _evolving_ **HumanEval** problems:

- 🔥 Containing **828** new problems across **5** 🌠 semantic-altering and **2** ⭐ semantic-preserving benchmarks
- 🔮 Allows evaluation/comparison across different **dimensions** and problem **types** (e.g., _Difficult_, _Creative_ or _Tool Use_ problems). See our [**visualization tool**](https://evo-eval.github.io/visualization.html) for ready-to-use comparisons
- 🏆 Complete with [**leaderboard**](https://evo-eval.github.io/leaderboard.html), **groundtruth solutions**, **robust testcases** and **evaluation scripts** to easily fit into your evaluation pipeline
- 🤖 Generated LLM code samples from **>50** different models to save you time in running experiments

<sup>1</sup> coincidentally similar pronunciation to 😈 EvilEval
Now you are ready to download EvoEval benchmarks and perform evaluation!
### 🧑‍💻 Code generation
To download our benchmarks, simply use the following code snippet:

```python
from evoeval.data import get_evo_eval

evoeval_benchmark = "EvoEval_difficult"  # you can pick from 7 different benchmarks!

problems = get_evo_eval(evoeval_benchmark)
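The returned `problems` maps each task ID to a problem record. As a rough sketch of how you might iterate over it — assuming the HumanEval-style schema (fields such as `task_id`, `prompt`, and `entry_point`); the record below is illustrative, not taken from the real dataset:

```python
# Hypothetical problem record mimicking the HumanEval-style schema;
# field names and the task ID are assumptions for illustration only.
problems = {
    "EvoEval/0": {
        "task_id": "EvoEval/0",
        "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
        "entry_point": "add",
    }
}

# Typical iteration pattern: feed each prompt to your model.
for task_id, problem in problems.items():
    print(task_id, "->", problem["entry_point"])
```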
For code generation and evaluation, we adopt the same style as [HumanEval+](https://github.com/evalplus/evalplus) and [HumanEval](https://github.com/openai/human-eval).
Implement the `GEN_SOLUTION` function by calling the LLM to produce the complete solution (include the function header + code) and save the samples to `{benchmark}_samples.jsonl`:

```python
from evoeval.data import get_evo_eval, write_jsonl

evoeval_benchmark = "EvoEval_difficult"  # you can pick from 7 different benchmarks!

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_evo_eval(evoeval_benchmark).items()
]
write_jsonl(f"{evoeval_benchmark}_samples.jsonl", samples)
```

> EvoEval `samples.jsonl` expects the solution field to contain the **complete** code implementation; this is
> slightly different from the original HumanEval, where the solution field only contains the function body.
>
> If you want to follow exactly the HumanEval setup, check out our 🤗 Huggingface [datasets](https://huggingface.co/evoeval), which can be directly run with
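To make the "complete implementation" requirement concrete, here is a sketch of what one line of `samples.jsonl` could look like, assuming one JSON object per line; the task ID and function are hypothetical examples, not real dataset entries:

```python
import json

# Illustrative sample record: the "solution" field holds the COMPLETE
# function (header + body), not just the body as in original HumanEval.
# The task_id and solution shown here are made up for illustration.
sample = {
    "task_id": "EvoEval/0",
    "solution": 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n',
}

line = json.dumps(sample)  # one JSON object per line of the .jsonl file
print(line)
```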
> Combine two different problems by integrating the concepts from both problems. In order to select problems that make sense to combine, we apply a simple heuristic
This shows the pass@1 score for the benchmark. You can use `--i-just-wanna-run`
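For background, pass@k is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021); the sketch below shows the math itself, not EvoEval's actual evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = number of samples that pass the tests, k = evaluation budget."""
    if n - c < k:
        # Fewer failures than the budget: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single greedy sample per problem, pass@1 is just the pass rate.
print(pass_at_k(1, 1, 1))                  # 1.0
print(round(pass_at_k(10, 3, 1), 6))       # 0.3
```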