lizhiyao / oh-my-knowledge (Stars: 5)
Evaluation framework for LLM knowledge inputs: prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.
Topics: benchmark, ai, evaluation-framework, claude, knowledge-engineering, skill-evaluation, llm, prompt-engineering, prompt-testing, llm-evaluation, rag-evaluation, llm-judge, claude-code, agent-evaluation, bootstrap-ci, krippendorff-alpha, evaluation-as-code, multi-judge-ensemble
Updated May 30, 2026. Language: TypeScript
WatchTree-19 / llm-judge-calibration (Stars: 0)
Measure how much your LLM judges actually agree. Inter-judge agreement metrics for LLM-as-a-judge evaluations.
Topics: python, evaluation, calibration, eval, agreement, cohens-kappa, inter-rater-agreement, llm, llm-as-judge, krippendorff-alpha
Updated May 14, 2026. Language: Python