This project benchmarks and compares the outputs of two large language models: Google Gemini (via the official API) and an Ollama-hosted model (e.g., Gemma3:12b).
It evaluates their responses to a set of prompts using ROUGE and BERTScore metrics, as well as latency and token usage, and saves all results to a CSV file.
## Features

- Batch evaluation of multiple prompts
- Automatic querying of both Gemini and Ollama models
- ROUGE and BERTScore metrics for output similarity
- Latency and token usage tracking
- Results saved in a structured CSV file for further analysis
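Latency tracking amounts to timing each model call. The sketch below shows one minimal way to do it; `query_model` here is a hypothetical stand-in for the real Gemini/Ollama calls, not the project's actual function:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical stand-in for a real model call (Gemini or Ollama).
def query_model(prompt):
    time.sleep(0.01)  # simulate network latency
    return "response to: " + prompt

response, latency = timed_call(query_model, "Explain ROUGE in one sentence.")
```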
## Requirements

- Python 3.8+
- Ollama running locally with your chosen model (default: `gemma3:12b`)
- Google Gemini API access and an API key
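Before running a batch, it can help to verify the local Ollama server is reachable. A minimal stdlib check, assuming Ollama's default port (11434) and its `/api/tags` endpoint, which lists locally available models:

```python
import json
import urllib.request

def ollama_available(base_url="http://localhost:11434"):
    """Return True if an Ollama server responds at base_url.

    11434 is Ollama's default port; adjust if yours differs.
    """
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
            json.load(resp)  # response lists the locally pulled models
            return True
    except (OSError, ValueError):
        return False
```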
## Installation

This project uses `uv` as the package manager. Install all required packages with `uv sync`.
## Setup

1. Clone this repository.
2. Set up your `.env` file: create a file named `.env` in the project root with the following content: `GEMINI_API_KEY=your_actual_gemini_api_key_here`
3. Start your Ollama server and ensure your chosen model is available (default: `gemma3:12b`).
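Loading the key from `.env` is typically done with a library such as python-dotenv; as a dependency-free illustration, here is a minimal stdlib parser (it ignores comments and blank lines and does not handle quoting):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: puts KEY=VALUE lines into os.environ.

    A stdlib stand-in for python-dotenv's load_dotenv(); existing
    environment variables are not overwritten.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())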
## Usage

Run the script with `python compare_models.py`. The script will process a set of prompts, query both models, compute metrics, and save the results to `model_comparison_results.csv`.
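For further analysis, the results CSV can be read back with the stdlib `csv` module (or pandas). A small sketch; the column name in the commented example is an assumption, not the script's confirmed header:

```python
import csv

def load_results(path="model_comparison_results.csv"):
    """Read the results CSV into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))

# Example: average Gemini latency (column name is hypothetical).
# rows = load_results()
# avg = sum(float(r["gemini_latency"]) for r in rows) / len(rows)
```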
## Output

The CSV file contains, for each prompt:
- The prompt text
- Gemini response
- Ollama response
- Latency (seconds) for each model
- Token count for each model (estimated)
- ROUGE-1, ROUGE-2, ROUGE-L (precision, recall, F1)
- BERTScore (precision, recall, F1)
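For intuition about the ROUGE columns: ROUGE-1 measures unigram overlap between two texts. The sketch below computes precision, recall, and F1 by hand; the actual script presumably uses a library such as `rouge-score`, so treat this as illustrative only:

```python
from collections import Counter

def rouge1(reference, candidate):
    """Unigram-overlap ROUGE-1: returns (precision, recall, f1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if not overlap:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```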
## Authors

Developed by Martin Saxa, Lukáš Švihura, and Dominik Keil.