GenAI Results Comparator (GAICo) helps you measure the quality of your Generative AI (LLM) outputs. It enables you to compare, analyze, and visualize results across text, images, audio, and structured data, helping you answer the question: "Which model performed better?"
🥳 Papers accepted at AAAI 2026!
We're pleased to announce our acceptance! Check out our materials:
Papers: Demo Paper (PDF) | Main Paper (arXiv)
Try it out: Interactive Demo App
Conference Tracks: IAAI-26 Call | AAAI-26 Demo Call
Note
This README provides a quick overview of GAICo. For detailed documentation, installation guides, examples, and developer resources, please visit our Documentation Site.
📖 Documentation · 📦 PyPI · 📄 Technical Paper · 🎥 Video Demo · 🤔 FAQ · 🗞️ News & Releases
Overview of the workflow supported by the GAICo library
At its core, the library provides a set of metrics for evaluating various types of outputs, from plain text strings to structured data like planning sequences and time-series, and multimedia content such as images and audio. While the Experiment class streamlines evaluation for text-based and structured string outputs, individual metric classes offer direct control for all data types, including binary or array-based multimedia. These metrics produce normalized scores (typically 0 to 1), where 1 indicates a perfect match, enabling robust analysis and visualization of LLM performance.
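To make the "normalized score" idea concrete, here is an illustrative, plain-Python re-implementation of Jaccard similarity over token sets — not GAICo's own code, just the intuition behind a 0-to-1 score:

```python
def jaccard_score(a: str, b: str) -> float:
    """Normalized token-set overlap: 1.0 is a perfect match, 0.0 is no overlap."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# 2 shared tokens ("the", "cat") out of 4 distinct tokens -> 0.5
print(jaccard_score("the cat sat", "the cat ran"))
```

All GAICo metrics follow this convention, so scores from different metrics can be compared and plotted on the same axis.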
Key capabilities:
- Batch processing: Efficiently evaluate entire datasets with one-to-one or one-to-many comparisons
- Flexible inputs: Works with strings, lists, NumPy arrays, and Pandas Series
- Extensible architecture: Easily add custom metrics by inheriting from BaseMetric
- Automated reporting: Generate CSV reports and visualizations (bar charts, radar plots)
Note
The Experiment class evaluates model responses against a single reference at a time. For full dataset evaluation, either iterate with Experiment or use metric classes directly. See our FAQ for details.
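As a sketch of the direct one-reference-to-many-candidates pattern the note describes, the standard library's `difflib.SequenceMatcher` (the same algorithm behind the Sequence Matcher metric) can score several model outputs against one reference. The model names and responses below are made up for illustration:

```python
from difflib import SequenceMatcher

reference = "Sorry, I am unable to answer such a question."
candidates = {
    "model_a": "Sorry, I am unable to answer such a question.",
    "model_b": "Here is a detailed answer to your question...",
}

# One reference compared against many candidate outputs (one-to-many).
scores = {
    name: SequenceMatcher(None, reference, text).ratio()
    for name, text in candidates.items()
}
best = max(scores, key=scores.get)  # the closest match to the reference
```

For whole datasets, the same loop extends naturally over rows of a Pandas DataFrame or a list of (reference, candidates) pairs.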
Important
We recommend using a Python virtual environment to manage dependencies.
GAICo can be installed using pip.
Create and activate a virtual environment:
python3 -m venv gaico-env
source gaico-env/bin/activate # On macOS/Linux
# gaico-env\Scripts\activate # On Windows
Install GAICo:
pip install gaico
This installs the core GAICo library with essential metrics.
Optional dependencies for specialized metrics:
pip install 'gaico[audio]' # Audio metrics
pip install 'gaico[bertscore]' # BERTScore metric
pip install 'gaico[cosine]' # Cosine similarity
pip install 'gaico[jsd]' # JS Divergence
pip install 'gaico[audio,bertscore,cosine,jsd]' # All features
Tip
For detailed installation instructions including Jupyter setup, developer installation, installation size comparisons, and troubleshooting, see our Installation Guide.
Get started with GAICo in under 2 minutes:
from gaico import Experiment
# Sample LLM responses comparing different models
llm_responses = {
"Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning...",
"Mixtral 8x7b": "I'm an AI and I don't have the ability to predict...",
"SafeChat": "Sorry, I am designed not to answer such a question.",
}
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."
# Initialize and run comparison
exp = Experiment(llm_responses=llm_responses, reference_answer=reference_answer)
results = exp.compare(
metrics=['Jaccard', 'ROUGE'],
plot=True,
output_csv_path="experiment_report.csv"
)
print(results)
Explore complete examples:
- quickstart.ipynb - Hands-on introduction to the Experiment class
- example-1.ipynb - Multiple models, single metric
- example-2.ipynb - Single model, all metrics
Tip
More examples available in the examples/ folder. Run them directly in Google Colab! See our Resources page for videos, demos, and version history.
- Comprehensive Metric Library
  - Textual similarity: Jaccard, Cosine, Levenshtein, Sequence Matcher
  - N-gram based: BLEU, ROUGE, JS Divergence
  - Semantic similarity: BERTScore
  - Structured data: Planning sequences and time-series metrics
  - Multimedia: Image similarity (SSIM, hash-based) and audio quality metrics
- Streamlined Evaluation Workflow
  - High-level Experiment class for comparing models, applying thresholds, and generating reports
  - summarize() method for aggregated performance overviews
- Dynamic & Extensible
  - Register custom metrics at runtime
  - Add your own evaluation criteria easily
- Powerful Visualization
  - Generate comparative plots automatically
  - Support for bar charts and radar plots
- Robust & Tested
  - Comprehensive test suite with Pytest
  - Production-ready reliability
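The extensibility point above can be sketched as a small subclass. BaseMetric below is a minimal stand-in so the snippet runs on its own, and the `calculate` method name is an assumption for illustration, not GAICo's documented interface — in real code you would inherit from gaico's BaseMetric as described in the custom metrics guide:

```python
from abc import ABC, abstractmethod

# Stand-in for gaico's BaseMetric so this sketch is self-contained.
# In real code: from gaico import BaseMetric (see the custom metrics guide).
class BaseMetric(ABC):
    @abstractmethod
    def calculate(self, generated: str, reference: str) -> float: ...

class ExactPrefixMetric(BaseMetric):
    """Toy custom metric: fraction of leading characters that match exactly."""

    def calculate(self, generated: str, reference: str) -> float:
        if not generated and not reference:
            return 1.0
        matched = 0
        for g, r in zip(generated, reference):
            if g != r:
                break
            matched += 1
        # Normalize by the longer string so the score stays in [0, 1].
        return matched / max(len(generated), len(reference))
```

Keeping the score in [0, 1], with 1 as a perfect match, lets a custom metric plug into the same reports and plots as the built-in ones.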
Tip
Want to add your own metric? Check our custom metrics guide.
If you find this project useful, please cite our work:
@article{Gupta_Koppisetti_Lakkaraju_Srivastava_2026,
title={GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Gupta, Nitin and Koppisetti, Pallav and Lakkaraju, Kausik and Srivastava, Biplav},
year={2026},
}
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/FeatureName)
- Commit your changes with clear messages
- Ensure tests pass and code follows our style guidelines
- Submit a Pull Request
Tip
For development setup, running tests, code style guidelines, and project structure details, see our Developer Guide.
- The library is developed by Nitin Gupta, Pallav Koppisetti, Kausik Lakkaraju, and Biplav Srivastava. Members of AI4Society contributed to this tool as part of ongoing discussions. Major contributors are credited.
- This library uses several open-source packages including NLTK, scikit-learn, and others. Special thanks to the creators and maintainers of the implemented metrics.
This project is licensed under the MIT License - see the LICENSE file for details.
Questions? Reach out at ai4societyteam@gmail.com