Retrieval-Augmented Multi-Agent LLMs for Time Series Forecasting
This repository contains the reference implementation for the paper:
Forecasting as Reasoning: A Retrieval-Augmented Multi-Agent LLM Framework for Time Series Forecasting
The paper introduces a retrieval-augmented, agent-based framework that reformulates time series forecasting as a structured, language-driven reasoning process over historical temporal evidence. By decomposing forecasting into modular components (deterministic retrieval, statistical grounding, temporal pattern analysis, data summarization, and synthesis), the proposed system enables interpretable, flexible, and reproducible forecasting in response to natural-language queries.
Our primary contributions are:
- Forecasting as Deterministic Evidence-Grounded Reasoning. We formalize time series forecasting as a modular reasoning process over deterministically retrieved temporal contexts, rather than as end-to-end parametric sequence modeling. This perspective turns forecasting from function approximation into structured, interpretable decision-making conditioned on verifiable historical slices.
- DataFrame-Grounded Retrieval for Numerical Fidelity. We introduce a deterministic, schema-aligned retrieval operator that works directly on structured time series tables and avoids embedding-based similarity search. Unlike vector retrieval, our method guarantees exact timestamp alignment, numerical fidelity, and full reproducibility across executions.
- Modular Multi-Agent Forecasting Architecture. We design a coordinated framework of role-specialized agents for horizon classification, feature extraction, statistical grounding, summarization, pattern detection, and forecast synthesis under centralized orchestration. This decomposition enables controlled ablation, backend-agnostic evaluation, and transparent intermediate reasoning artifacts.
- Reproducibility-Centric Evaluation of LLM Forecasting. We introduce a systematic reproducibility protocol that quantifies run-to-run numerical dispersion, coefficient of variation, and worst-step instability across repeated identical executions. This analysis exposes backend-dependent stochasticity and gives an operational view of deployment stability.
- Comprehensive Backend and Component Analysis. Across four state-of-the-art LLM backends, we show that structured retrieval and statistical grounding materially improve forecasting accuracy and stability relative to prompt-only baselines and component-wise ablations.
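The reproducibility protocol described above can be illustrated with a minimal sketch. The exact metric definitions are not given in this README, so the ones here are assumptions: dispersion is the per-step population standard deviation across repeated runs, averaged over the horizon; the coefficient of variation (CV) is that standard deviation divided by the absolute per-step mean; and the worst step is the horizon step with the largest CV. The function name reproducibility_report is hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def reproducibility_report(runs: Sequence[Sequence[float]]) -> dict:
    """Summarize run-to-run variability of repeated identical forecasts.

    runs[r][h] is the forecast from run r at horizon step h.
    Metric definitions are illustrative assumptions, not the paper's formulas.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure dispersion")
    horizon = len(runs[0])
    if horizon == 0 or any(len(run) != horizon for run in runs):
        raise ValueError("all runs must share the same non-empty horizon")

    # Transpose: one column of R values per horizon step.
    steps = [[run[h] for run in runs] for h in range(horizon)]
    step_std = [pstdev(col) for col in steps]  # population std per step
    step_mean = [mean(col) for col in steps]
    # CV per step; treated as infinite when the mean is exactly zero.
    step_cv = [s / abs(m) if m != 0 else float("inf")
               for s, m in zip(step_std, step_mean)]

    # Worst step = the horizon index with the largest CV.
    worst = max(range(horizon), key=step_cv.__getitem__)
    return {
        "mean_dispersion": sum(step_std) / horizon,
        "mean_cv": sum(step_cv) / horizon,
        "worst_step": worst,
        "worst_step_cv": step_cv[worst],
    }
```

For example, three runs of a three-step forecast, [[100, 102, 104], [100, 101, 110], [100, 103, 98]], agree exactly at step 0 and diverge most at step 2, which the report flags as the worst step.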
The repository mirrors the modular architecture described in the paper, with role-specialized agents, deterministic tools, and experiment scripts organized for reproducibility and ablation analysis.
.
├── experiments/
│   ├── eval/
│   ├── baselines/
│   ├── outputs/
│   ├── queries/
│   ├── stubs/
│   ├── utils/
│   ├── yamls/
│   └── scripts/
│
├── agents/
│   ├── orchestration_agent.py
│   ├── sector_detector.py
│   ├── timeseries_features.py
│   ├── energy_features.py
│   ├── summarization.py
│   ├── pattern_detection.py
│   ├── forecast_narrative.py
│   ├── redirecting_agent.py
│   └── tools/
│       ├── retrieval.py
│       └── statistics_calculation.py
│
├── data/
├── utils/
│
├── configs/
├── interactive.py
├── main.py
│
├── requirements.txt
├── .env
├── .gitignore
└── README.md
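The two modules under agents/tools/ hold the deterministic parts of the pipeline. As a hedged illustration of the idea (exact timestamp slicing instead of embedding search, followed by statistical grounding), the sketch below uses only the standard library: a sorted list of (timestamp, value) rows stands in for the DataFrame the repository operates on. The names retrieve_window and window_statistics are hypothetical and are not the actual contents of retrieval.py or statistics_calculation.py.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime
from statistics import mean, median, pstdev
from typing import List, Tuple

Row = Tuple[datetime, float]  # (timestamp, observed value)


def retrieve_window(rows: List[Row], start: datetime, end: datetime) -> List[Row]:
    """Return rows with start <= timestamp <= end, in timestamp order.

    Pure exact-match slicing: no embeddings and no similarity scores,
    so the same query over the same table always returns the same rows.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    stamps = [t for t, _ in ordered]
    lo = bisect_left(stamps, start)   # first index with t >= start
    hi = bisect_right(stamps, end)    # first index with t > end
    return ordered[lo:hi]


def window_statistics(window: List[Row]) -> dict:
    """Descriptive statistics that could be passed to an LLM as grounding."""
    values = [v for _, v in window]
    if not values:
        raise ValueError("empty window: nothing to ground on")
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
        "last": values[-1],
    }
```

Because retrieval sorts and slices by timestamp, input order does not affect the result, which is one simple way to obtain the run-to-run reproducibility the retrieval contribution describes.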
Create a virtual environment and install the dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Create a .env file at the project root:
- OPENAI_API_KEY="..."
- DEEPSEEK_API_KEY="..."
- GEMINI_API_KEY="..."
- ANTHROPIC_API_KEY="..."
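In the .env file itself, each entry is a plain KEY="value" line (without the list dashes). To check the file before running anything, a small dependency-free sketch like the following can parse it. Many projects load .env files with the python-dotenv package's load_dotenv(); whether this repository does so is not stated here. The helper names parse_env_text and missing_keys are hypothetical, and which keys are actually required presumably depends on the backends you run.

```python
from typing import Dict, List

# Keys listed in this README; not all may be needed for a given backend.
EXPECTED_KEYS = (
    "OPENAI_API_KEY",
    "DEEPSEEK_API_KEY",
    "GEMINI_API_KEY",
    "ANTHROPIC_API_KEY",
)


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY="value" lines, skipping blank lines and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding double or single quotes from the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_keys(env: Dict[str, str]) -> List[str]:
    """Return expected keys that are absent or empty, in README order."""
    return [k for k in EXPECTED_KEYS if not env.get(k)]
```

Typical use would be missing_keys(parse_env_text(Path(".env").read_text())) with pathlib.Path, run from the project root.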
Prepare the data by running the extract_data.ipynb notebook.
Run the forecasting pipeline:
python main.py
Launch the interactive Streamlit interface:
streamlit run interactive.py
Run the ablation experiments:
python ablate_model.py
python ablate_parallel.py
All experiments are configured through YAML files in experiments/yamls/.
Each ablation disables a single agent or tool while keeping all other components fixed.
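A minimal sketch of this ablation pattern, under stated assumptions: a parsed config dict stands in for one of the YAML files in experiments/yamls/ (whose real schema is not documented here), names at most one component to disable, and an orchestrator runs the remaining stages in a fixed order over a shared context. The stage list, the "disable" key, and the functions build_ablation and run_pipeline are all hypothetical; the actual wiring in orchestration_agent.py may differ.

```python
from typing import Callable, Dict, List

# Hypothetical stage order built from agent and tool names in the tree;
# the real orchestrator may order or combine these differently.
PIPELINE: List[str] = [
    "sector_detector",
    "retrieval",
    "statistics_calculation",
    "timeseries_features",
    "summarization",
    "pattern_detection",
    "forecast_narrative",
]

# Each handler reads the shared context and returns new fields to merge in.
Handler = Callable[[dict], dict]


def build_ablation(config: dict) -> List[str]:
    """Return the stage list with at most one component removed."""
    disabled = config.get("disable")  # hypothetical config key
    if disabled is None:
        return list(PIPELINE)  # baseline: full pipeline
    if disabled not in PIPELINE:
        raise ValueError(f"unknown component: {disabled!r}")
    # Every other component stays in place and in the same order.
    return [stage for stage in PIPELINE if stage != disabled]


def run_pipeline(stages: List[str], handlers: Dict[str, Handler], query: str) -> dict:
    """Thread one context dict through the enabled stages in order."""
    context: dict = {"query": query, "trace": []}
    for stage in stages:
        context.update(handlers[stage](context))
        context["trace"].append(stage)  # record which stages actually ran
    return context
```

For instance, build_ablation({"disable": "pattern_detection"}) drops only that stage, so any difference against the baseline run can be attributed to the removed component.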
