Another day, another Awesome List repo. A comprehensive list of Chainforge-related content.
Updated Oct 24, 2025
Official implementation of "GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Models" (more updates to come).
The prompt engineering, prompt management, and prompt evaluation tool for Python
Compare, improve, and verify prompt changes with evidence, not vibes.
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
A project to take a suboptimal prompt from LangSmith, enhance it, resubmit it, and then reevaluate the results. #LangSmith #PromptEngineer
pi extension for fixed-task-set eval runs and prompt/system comparisons with reproducible reports
A simple prompt optimization project that tests three different algorithms.
A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.
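Several of the repos above use the LLM-as-judge pattern: a judging model scores two candidate outputs against named criteria and returns a structured verdict. A minimal sketch of that loop, with the judge left pluggable (a stub stands in for a real Groq / Llama 3 call; `judge_outputs`, `stub_judge`, and the JSON verdict shape are illustrative assumptions, not any particular repo's API):

```python
import json
from typing import Callable

def judge_outputs(prompt: str, output_a: str, output_b: str,
                  judge: Callable[[str], str],
                  criteria: list[str]) -> dict:
    """Ask a judge model to score two outputs and pick a winner.

    `judge` is any callable that takes the full judging prompt and
    returns a JSON string with per-criterion scores and a winner.
    """
    judging_prompt = (
        "You are an impartial judge. Score each output 1-10 on: "
        + ", ".join(criteria)
        + f"\n\nPrompt: {prompt}\nOutput A: {output_a}\nOutput B: {output_b}"
        + '\nReply as JSON: {"scores": {...}, "winner": "A" or "B", '
          '"explanation": "..."}'
    )
    return json.loads(judge(judging_prompt))

# Stub judge standing in for a real LLM API call, so the sketch
# runs offline; a production judge would call the model here.
def stub_judge(_: str) -> str:
    return json.dumps({
        "scores": {"A": {"creativity": 8}, "B": {"creativity": 6}},
        "winner": "A",
        "explanation": "Output A is more vivid.",
    })

verdict = judge_outputs("Write a tagline.", "Spark joy daily.",
                        "Buy our stuff.", stub_judge, ["creativity"])
print(verdict["winner"])  # A
```

Keeping the judge as a plain callable makes the scoring logic testable without network access and lets the same harness swap between providers or presets (e.g. creativity vs. brand tone) by changing only the criteria list.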
Building a framework to run prompt evaluation tasks.
The prompt engineering, prompt management, and prompt evaluation tool for Ruby.
Test prompt variants across LLM providers with LLM-as-judge evaluation
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
A few prompts stored in a repo for running controlled experiments that compare and benchmark different LLMs on defined use cases.
The prompt engineering, prompt management, and prompt evaluation tool for Java.
The prompt engineering, prompt management, and prompt evaluation tool for Kotlin.
Runs two simple test prompts against 5 Anthropic models and visually compares speed, capability, and cost.
The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET
A hybrid machine learning system for scoring LLM prompts. Features a BERT-based gatekeeper for structural validation and an LLM-based classifier to ensure semantic intent, delivering consistent empirical metrics for prompt engineering.
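The two-stage design described above (a cheap structural gatekeeper in front of a more expensive semantic classifier) can be sketched as a small pipeline. Everything here is a toy stand-in under stated assumptions: `score_prompt`, the heuristics, and the score shape are hypothetical, not the repo's actual BERT or LLM components:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptScore:
    structural_ok: bool   # did the gatekeeper accept the prompt?
    intent_score: float   # semantic score, 0.0 when gated out

def score_prompt(prompt: str,
                 gatekeeper: Callable[[str], bool],
                 classifier: Callable[[str], float]) -> PromptScore:
    """Gate first, classify second: prompts failing the fast
    structural check never reach the costly semantic stage."""
    if not gatekeeper(prompt):
        return PromptScore(structural_ok=False, intent_score=0.0)
    return PromptScore(structural_ok=True, intent_score=classifier(prompt))

# Toy stand-ins for the BERT gatekeeper and LLM intent classifier.
def toy_gatekeeper(p: str) -> bool:
    return len(p.split()) >= 3          # reject near-empty prompts

def toy_classifier(p: str) -> float:
    return min(1.0, len(p) / 100)       # placeholder "intent" metric

print(score_prompt("hi", toy_gatekeeper, toy_classifier))
print(score_prompt("summarize this report clearly",
                   toy_gatekeeper, toy_classifier))
```

Separating the stages this way keeps the empirical metric consistent: every prompt gets the same deterministic gate, and only structurally valid prompts consume classifier budget.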