tancredelg/scene-investigation-agent

Scene Investigation Agent: LLM-Guided Adaptive Exploration

An agentic framework for autonomous scene understanding. This project leverages Large Language Models (LLMs) to guide an agent through an interactive, simulated environment (ALFWorld), gathering clues to infer the inhabitant's occupation via Bayesian-style belief updates.

Overview

Traditional SLAM focuses on geometry. This project explores semantic exploration: using pre-trained LLMs to reason about an environment, form hypotheses, and dynamically generate commands to gather information.

The agent operates in the ALFWorld benchmark (AI2-THOR based), exploring a kitchen environment to identify one of four possible inhabitant profiles (Professor, Assassin, Student, Billionaire). It utilizes DSPy for structured LM interactions, employing Chain-of-Thought (CoT) reasoning to drive exploration and maintain a probabilistic belief state.

Key Features

  • LLM-Driven Navigation: No hardcoded heuristics. The agent generates admissible ALFWorld commands (go to, examine, open) based on context and current beliefs.
  • Structured Belief Updates: Maintains and updates a probability distribution over possible occupations after every observation.
  • Adaptive Exploration: The agent decides what to examine next based on what it has already learned, aiming to maximize information gain.
  • Custom Clue Injection: Intercepts simulation calls to inject occupation-specific object descriptions (generated via separate LLMs), testing the agent's ability to ground textual clues in decision-making.
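As a rough illustration of the belief state described above, the sketch below keeps a probability distribution over the four occupations and renormalizes whatever scores the LM returns. The function names (`uniform_beliefs`, `normalize_beliefs`) and the fallback to a uniform prior are assumptions made for this example, not code from the repository.

```python
# Hypothetical helpers; the repository's actual belief handling may differ.
OCCUPATIONS = ("Professor", "Assassin", "Student", "Billionaire")


def uniform_beliefs(occupations=OCCUPATIONS):
    """Start with no preference between occupations."""
    p = 1.0 / len(occupations)
    return {name: p for name in occupations}


def normalize_beliefs(raw, occupations=OCCUPATIONS):
    """Turn raw LM scores into a distribution that sums to 1."""
    # Missing occupations count as 0, and negative scores are clamped to 0.
    cleaned = {name: max(float(raw.get(name, 0.0)), 0.0) for name in occupations}
    total = sum(cleaned.values())
    if total == 0.0:
        # Nothing usable came back, so fall back to the uniform prior (assumed policy).
        return uniform_beliefs(occupations)
    return {name: value / total for name, value in cleaned.items()}
```

Renormalizing after every update guards against an LM response whose probabilities do not sum to 1 or that omits an occupation.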

Tech Stack

  • Frameworks: DSPy (LM programming), ALFWorld (Embodied simulation).
  • Models Tested: Gemini 2.0 Flash Lite (API), Gemma 3 4B/12B (Local via Ollama).
  • Environment: Python 3.x, WSL 2 (Ubuntu).

Methodology

The agent loop consists of three primary modules managed via DSPy signatures:

  1. ChooseNextCommand: Analyzes the interaction history and current belief state to select the next optimal action from the environment's valid action space.
  2. Execute: Runs the command in ALFWorld. If the command is examine, the system retrieves context-specific descriptive clues (e.g., "A worn, coffee-stained textbook" for a Student vs. "A pristine, antique quill" for a Professor).
  3. UpdateBeliefs: Re-evaluates the probability distribution of occupations based on the new observation using CoT reasoning.
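The clue substitution in the Execute step could look roughly like the sketch below. The `CLUES` table layout, the object names (`book 1`, `pen 1`), and the `inject_clue` function are hypothetical; the real structure of object_attributes3.json is not shown here, so treat this as one possible shape rather than the project's implementation.

```python
# Hypothetical clue table: occupation -> object name -> injected description.
CLUES = {
    "Student": {"book 1": "A worn, coffee-stained textbook"},
    "Professor": {"pen 1": "A pristine, antique quill"},
}


def inject_clue(command, observation, occupation, clues=CLUES):
    """Replace the simulator's observation for `examine` commands only."""
    # Split "examine book 1" into the verb and the target object.
    verb, _, target = command.partition(" ")
    if verb != "examine":
        return observation
    # Fall back to the original observation when no clue is defined.
    return clues.get(occupation, {}).get(target, observation)
```

For example, `inject_clue("examine book 1", "This is a book.", "Student")` returns the Student-specific description, while a `go to` command passes the simulator's observation through unchanged.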

Termination: The episode ends when confidence in a single occupation exceeds a threshold (0.8) or the step limit is reached.
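A minimal sketch of the loop and its two stopping conditions follows. The three modules are passed in as plain callables so the example runs without DSPy or ALFWorld; `run_episode`, its argument names, and the default `max_steps=50` are assumptions (only the 0.8 threshold comes from the description above).

```python
def run_episode(choose, execute, update, beliefs, threshold=0.8, max_steps=50):
    """Run choose -> execute -> update until confident or out of steps.

    Returns (best_occupation, steps_taken, history).
    """
    history = []
    for step in range(max_steps):
        # Stop as soon as one occupation's probability exceeds the threshold.
        top, confidence = max(beliefs.items(), key=lambda item: item[1])
        if confidence > threshold:
            return top, step, history
        command = choose(history, beliefs)
        observation = execute(command)
        history.append((command, observation))
        beliefs = update(beliefs, observation)
    # Step limit reached: report the current best guess.
    top, _ = max(beliefs.items(), key=lambda item: item[1])
    return top, max_steps, history
```

Passing the full `history` into `choose` mirrors the design in which the command selector sees earlier interactions as well as the current beliefs.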

Setup & Usage

Prerequisites

  • Python 3.x
  • Ollama (for local inference with smaller language models)
  • An ALFWorld-compatible environment (instructions below assume a standard setup).

Installation

  1. Clone the repository:
    git clone https://github.com/tancredelg/scene-investigation-agent.git
    cd scene-investigation-agent
  2. Install dependencies:
    pip install -r requirements.txt
  3. (Optional) Pull local models if using Ollama:
    ollama pull gemma3:4b-it-qat

Configuration & Running

All experimental parameters are configured in main.py:

  • Mode: Select between single-episode debugging or batch evaluation.
  • Model: Switch between local (Ollama) and API (Gemini) backends.
  • Parameters: Adjust confidence_threshold, context_length, and use_cot.
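One way to picture these parameters is as a small settings object, sketched below. The `ExperimentConfig` class, its defaults, and the reading of `context_length` as "number of most recent interactions to keep, with None meaning full history" are assumptions for illustration; main.py may organize them differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical grouping of the tunable parameters named above."""
    confidence_threshold: float = 0.8
    context_length: Optional[int] = None  # None = full history (assumed meaning)
    use_cot: bool = True


def visible_history(history, config):
    """Return the part of the interaction history shown to the LM."""
    if config.context_length is None:
        return list(history)
    if config.context_length <= 0:
        # Guard against history[-0:], which would return everything.
        return []
    return list(history[-config.context_length:])
```

With `ExperimentConfig(context_length=2)`, only the two most recent interactions would be passed to the model.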

To run the agent:

python main.py <path_to_alfworld_config.yaml>

Results

Experiments compared local models (Gemma 3 4B) vs. larger API models (Gemini 2.0 Flash Lite) across different context lengths and reasoning modes.

  • Best Performance: Gemini 2.0 Flash Lite with Chain-of-Thought and full history context achieved an 82.5% success rate across all occupations.
  • Findings: Full interaction history was crucial for preventing loops. CoT significantly improved the agent's ability to correlate subtle object descriptions with specific occupations.

For detailed analysis, ablation studies, and more, please refer to the Project Report.

File Structure

  • agents.py: Core DSPy agent logic and signatures.
  • main.py: Main loop, environment initialization, and experiment runner.
  • utils.py: Helpers for environment restriction and metadata handling.
  • object_attributes3.json: The dictionary of occupation-specific clues injected during simulation.
  • output/: Logs of agent reasoning traces and episode results.
