AI agent evaluation framework for energy analytics.
- ReAct Agent: Multi-provider LLM support (OpenAI, Anthropic, Google, DeepInfra)
- Energy Tools: GridStatus, Tariffs, Renewables, Battery optimization, Dockets, Weather, Search
- MCP Integration: External RAG and database tools via Model Context Protocol
- Benchmark Framework: Evaluate agents across questions with metrics and comparison
- Observability: JSON tracing with full execution data
# Install system dependencies (Ipopt solver for battery optimization)
sudo ./install.sh
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Configure API keys
cp .env.example .env
# Edit .env with your keys
# Run a benchmark
python scripts/run_benchmark.pyFor battery optimization tools, install Ipopt solver:
# Debian/Ubuntu
sudo ./install.shThe install script builds Ipopt and required third-party solvers from source, skipping Java test harness to avoid JDK issues.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtFor development (includes testing and linting tools):
pip install -r requirements-dev.txtCreate a .env file with your credentials:
# LLM Providers (at least one required)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
DEEPINFRA_API_KEY=...
# Tools (optional - enables specific functionality)
EXA_API_KEY=... # SearchTool
GRIDSTATUS_API_KEY=... # GridStatusAPITool
OPENWEATHER_API_KEY=... # OpenWeatherTool
OPEN_EI_API_KEY=... # TariffsTool
RENEWABLES_NINJA_API_KEY=... # RenewablesToolCopy .env.example for a template.
MCP servers provide RAG and database access. They connect via remote URLs configured
in your .env file. Set the URL env vars below to enable them; if neither is set,
MCP is effectively disabled even with mcp.enabled: true.
RAG_SERVER_URL=https://energyevals-rag-mcp.tume.ai/sse
DATABASE_SERVER_URL=https://energyevals-db-mcp.tume.ai/sseThe quickest way to use EnergyEvals is the interactive agent script. Type a question, get an answer:
# Start interactive mode (defaults to openai / gpt-4o-mini)
python scripts/run_agent.py
# Choose a provider
python scripts/run_agent.py -p anthropic
python scripts/run_agent.py -p google
# Pick a specific model
python scripts/run_agent.py -p openai -m gpt-4o
# Enable MCP tools (RAG + database)
python scripts/run_agent.py --mcp
# Run without tools (pure LLM)
python scripts/run_agent.py --no-tools
# Ask a single question (no interactive loop)
python scripts/run_agent.py -q "What are current ERCOT energy prices?"Inside the interactive session, type your question at the > prompt. The agent will use its tools to research and answer. Type quit to exit.
For detailed benchmark configuration, custom questions, evaluation, and multi-model comparison, see the Benchmark Guide.
Benchmark runs require at least one explicit models entry in config; there is no provider/model fallback.
Multi-trial seed controls are configured in agent:
agent:
num_trials: 3
shuffle: true
seed_mode: rotate # fixed | rotate | random_per_trial
seed: 12345 # optional base seed
# seeds: [101, 202, 303] # optional explicit per-trial seedsThe agent uses a Reasoning-Acting loop:
- Thought: Analyze the question and plan next action
- Action: Select and execute a tool
- Observation: Process tool output
- Repeat: Continue until answer is complete
Maximum iterations default to 25 (configurable).
A unified interface to run models from any major LLM provider:
- OpenAI — GPT, O1, O3, and more
- Anthropic — Claude models (Sonnet, Opus, Haiku)
- Google — Gemini models (Flash, Pro)
- DeepInfra — Open-source models (Llama, Mistral, and more)
Providers implement a common BaseProvider protocol with tool calling and streaming support.
Tools are registered via the default tool registry in create_default_registry():
- Direct registration: Tools instantiated and registered in code
- MCP servers: External tools via Model Context Protocol
Each tool provides:
- JSON schema for LLM tool calling
- Async execution
- Error handling with structured results
Traces capture full execution:
- All ReAct steps (thought, action, observation)
- Tool inputs/outputs
- Token usage and latency
- Failed calls with errors
Trace output is stored as local JSON or JSONL files.
Run tests:
pytestLint and type check:
ruff check .
mypy energyevals