GenAI Results Comparator (GAICo) helps you measure the quality of your Generative AI (LLM) outputs. It enables you to compare, analyze, and visualize results across text, images, audio, and structured data, helping you answer the question: "Which model performed better?"
🥳 Papers accepted at AAAI 2026!
We're pleased to announce our acceptance! Check out our materials:
Papers: Demo Paper (PDF) | Main Paper (arXiv)
Try it out: Interactive Demo App
Conference Tracks: IAAI-26 Call | AAAI-26 Demo Call
Note
This README provides a quick overview of GAICo. For detailed documentation, installation guides, examples, and developer resources, please visit our Documentation Site.
📖 Documentation · 📦 PyPI · 📄 Technical Paper · 🎥 Video Demo · 🤔 FAQ · 🗞️ News & Releases
Overview of the workflow supported by the GAICo library
At its core, the library provides a set of metrics for evaluating various types of outputs, from plain text strings to structured data like planning sequences and time-series, and multimedia content such as images and audio. While the Experiment class streamlines evaluation for text-based and structured string outputs, individual metric classes offer direct control for all data types, including binary or array-based multimedia. These metrics produce normalized scores (typically 0 to 1), where 1 indicates a perfect match, enabling robust analysis and visualization of LLM performance.
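To make the "normalized score" idea concrete, here is an illustrative, plain-Python re-implementation of Jaccard similarity over token sets — not GAICo's own code, just the intuition behind a 0-to-1 score:

```python
def jaccard_score(a: str, b: str) -> float:
    """Normalized token-set overlap: 1.0 is a perfect match, 0.0 is no overlap."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# 2 shared tokens ("the", "cat") out of 4 distinct tokens -> 0.5
print(jaccard_score("the cat sat", "the cat ran"))
```

All GAICo metrics follow this convention, so scores from different metrics can be compared and plotted on the same axis.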
Key capabilities:
- Batch processing: Efficiently evaluate entire datasets with one-to-one or one-to-many comparisons
- Flexible inputs: Works with strings, lists, NumPy arrays, and Pandas Series
- Extensible architecture: Easily add custom metrics by inheriting from BaseMetric
- Automated reporting: Generate CSV reports and visualizations (bar charts, radar plots)
Note
The Experiment class evaluates model responses against a single reference at a time. For full dataset evaluation, either iterate with Experiment or use metric classes directly. See our FAQ for details.
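As a sketch of the direct one-reference-to-many-candidates pattern the note describes, the standard library's `difflib.SequenceMatcher` (the same algorithm behind the Sequence Matcher metric) can score several model outputs against one reference. The model names and responses below are made up for illustration:

```python
from difflib import SequenceMatcher

reference = "Sorry, I am unable to answer such a question."
candidates = {
    "model_a": "Sorry, I am unable to answer such a question.",
    "model_b": "Here is a detailed answer to your question...",
}

# One reference compared against many candidate outputs (one-to-many).
scores = {
    name: SequenceMatcher(None, reference, text).ratio()
    for name, text in candidates.items()
}
best = max(scores, key=scores.get)  # the closest match to the reference
```

For whole datasets, the same loop extends naturally over rows of a Pandas DataFrame or a list of (reference, candidates) pairs.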
Important
We recommend using a Python virtual environment to manage dependencies.
GAICo can be installed using pip.
Create and activate a virtual environment:
python3 -m venv gaico-env
source gaico-env/bin/activate # On macOS/Linux
# gaico-env\Scripts\activate # On Windows
Install GAICo:
pip install gaico
This installs the core GAICo library with essential metrics.
Optional dependencies for specialized metrics:
pip install 'gaico[audio]' # Audio metrics
pip install 'gaico[bertscore]' # BERTScore metric
pip install 'gaico[cosine]' # Cosine similarity
pip install 'gaico[jsd]' # JS Divergence
pip install 'gaico[audio,bertscore,cosine,jsd]' # All features
Tip
For detailed installation instructions including Jupyter setup, developer installation, installation size comparisons, and troubleshooting, see our Installation Guide.
Get started with GAICo in under 2 minutes:
from gaico import Experiment
# Sample LLM responses comparing different models
llm_responses = {
"Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning...",
"Mixtral 8x7b": "I'm an AI and I don't have the ability to predict...",
"SafeChat": "Sorry, I am designed not to answer such a question.",
}
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."
# Initialize and run comparison
exp = Experiment(llm_responses=llm_responses, reference_answer=reference_answer)
results = exp.compare(
metrics=['Jaccard', 'ROUGE'],
plot=True,
output_csv_path="experiment_report.csv"
)
print(results)
Explore complete examples:
- quickstart.ipynb - Hands-on introduction to the Experiment class
- example-1.ipynb - Multiple models, single metric
- example-2.ipynb - Single model, all metrics
Tip
More examples available in the examples/ folder. Run them directly in Google Colab! See our Resources page for videos, demos, and version history.
- Comprehensive Metric Library
  - Textual similarity: Jaccard, Cosine, Levenshtein, Sequence Matcher
  - N-gram based: BLEU, ROUGE, JS Divergence
  - Semantic similarity: BERTScore
  - Structured data: Planning sequences and time-series metrics
  - Multimedia: Image similarity (SSIM, hash-based) and audio quality metrics
- Streamlined Evaluation Workflow
  - High-level Experiment class for comparing models, applying thresholds, and generating reports
  - summarize() method for aggregated performance overviews
- Dynamic & Extensible
  - Register custom metrics at runtime
  - Add your own evaluation criteria easily
- Powerful Visualization
  - Generate comparative plots automatically
  - Support for bar charts and radar plots
- Robust & Tested
  - Comprehensive test suite with Pytest
  - Production-ready reliability
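The extensibility point above can be sketched as a small subclass. BaseMetric below is a minimal stand-in so the snippet runs on its own, and the `calculate` method name is an assumption for illustration, not GAICo's documented interface — in real code you would inherit from gaico's BaseMetric as described in the custom metrics guide:

```python
from abc import ABC, abstractmethod

# Stand-in for gaico's BaseMetric so this sketch is self-contained.
# In real code: from gaico import BaseMetric (see the custom metrics guide).
class BaseMetric(ABC):
    @abstractmethod
    def calculate(self, generated: str, reference: str) -> float: ...

class ExactPrefixMetric(BaseMetric):
    """Toy custom metric: fraction of leading characters that match exactly."""

    def calculate(self, generated: str, reference: str) -> float:
        if not generated and not reference:
            return 1.0
        matched = 0
        for g, r in zip(generated, reference):
            if g != r:
                break
            matched += 1
        # Normalize by the longer string so the score stays in [0, 1].
        return matched / max(len(generated), len(reference))
```

Keeping the score in [0, 1], with 1 as a perfect match, lets a custom metric plug into the same reports and plots as the built-in ones.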
Tip
Want to add your own metric? Check our custom metrics guide.
If you find this project useful, please cite our work:
@article{Gupta_Koppisetti_Lakkaraju_Srivastava_2026,
title={GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Gupta, Nitin and Koppisetti, Pallav and Lakkaraju, Kausik and Srivastava, Biplav},
year={2026},
}
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/FeatureName)
- Commit your changes with clear messages
- Ensure tests pass and code follows our style guidelines
- Submit a Pull Request
Tip
For development setup, running tests, code style guidelines, and project structure details, see our Developer Guide.
- The library is developed by Nitin Gupta, Pallav Koppisetti, Kausik Lakkaraju, and Biplav Srivastava. Members of AI4Society contributed to this tool as part of ongoing discussions. Major contributors are credited.
- This library uses several open-source packages including NLTK, scikit-learn, and others. Special thanks to the creators and maintainers of the implemented metrics.
This project is licensed under the MIT License - see the LICENSE file for details.
Questions? Reach out at ai4societyteam@gmail.com