openai / evals Public

Fork 3k
Star 18.6k

Pull requests: openai/evals

Labels 10 Milestones 0

80 Open 1,262 Closed

Pull requests list

fix: sanitize user-supplied command before logging to prevent log injection

#1671 opened May 25, 2026 by rajkumar-prog

Fix EVALS_SHOW_EVAL_PROGRESS env var parsing

#1670 opened May 21, 2026 by LeSingh1

fix: prevent stack trace exposure in flask-playwright API responses

#1669 opened May 21, 2026 by rajkumar-prog

fix: guard against empty/filtered LLM responses in make_me_say utils

#1666 opened May 18, 2026 by qizwiz

fix: omit non-numeric usage fields from token usage report

#1664 opened May 16, 2026 by rajkumar-prog

Refresh contribution documentation references

#1662 opened May 15, 2026 by MukundaKatta

feat: add MiniMax provider support

#1661 opened May 13, 2026 by octo-patch

chore: remove obsolete GPT-4 PR guidance

#1659 opened May 11, 2026 by extrasmall0

Add agent pre-action control eval

#1658 opened May 10, 2026 by mindbomber

13 tasks done

Add atr_prompt_injection eval (modelgraded safety, 16 multilingual samples)

#1657 opened May 10, 2026 by eeee2345

13 tasks done

Add agent-tool-abstention eval (13 samples, Match template)

#1656 opened May 8, 2026 by MukundaKatta

5 tasks done

Add agent-tool-routing eval (12 samples, Match template)

#1655 opened May 8, 2026 by MukundaKatta

5 tasks done

eval: add Oracle ERP Cloud workflow and terminology eval

#1654 opened May 4, 2026 by karthikchundi-commits

Fix OpenAI completion args routing

#1653 opened Apr 23, 2026 by kayametehan

Add explain mode to HumanCliSolver

#1652 opened Apr 23, 2026 by kayametehan

Route modern OpenAI models through chat completions

#1651 opened Apr 23, 2026 by kayametehan

Handle nested token usage details in oaieval

#1650 opened Apr 23, 2026 by kayametehan

Add Turkish proverbs eval

#1649 opened Apr 23, 2026 by kayametehan

Update Python version to 3.12 and refresh PR template

#1648 opened Apr 23, 2026 by kayametehan

Add Turkish language evals: logical reasoning and grammar

#1647 opened Apr 23, 2026 by kayametehan

eval: add RAIL Score responsible AI evaluation across 8 dimensions

#1640 opened Apr 2, 2026 by SumitVermakgp

12 tasks done

fix: replace 11 bare except clauses with except Exception

#1626 opened Feb 25, 2026 by haosenwang1018

Add finance-agent routing eval dataset and builder guidance

#1625 opened Feb 24, 2026 by maxpetrusenko

README: fix Evals starter guide link

#1623 opened Feb 19, 2026 by dcol91863

Add Logic Stress Stress-test Suite (v2, v3)

#1622 opened Feb 16, 2026 by 14H034160212 (Contributor)
