An agentic framework for autonomous scene understanding. This project leverages Large Language Models (LLMs) to guide an agent through an interactive, simulated environment (ALFWorld), gathering clues to infer the inhabitant's occupation via Bayesian-style belief updates.
Traditional SLAM focuses on geometry. This project explores semantic exploration: using pre-trained LLMs to reason about an environment, form hypotheses, and dynamically generate commands to gather information.
The agent operates in the ALFWorld benchmark (AI2-THOR based), exploring a kitchen environment to identify one of four possible inhabitant profiles (Professor, Assassin, Student, Billionaire). It utilizes DSPy for structured LM interactions, employing Chain-of-Thought (CoT) reasoning to drive exploration and maintain a probabilistic belief state.
- LLM-Driven Navigation: No hardcoded heuristics. The agent generates admissible ALFWorld commands (`go to`, `examine`, `open`) based on context and current beliefs.
- Structured Belief Updates: Maintains and updates a probability distribution over possible occupations after every observation.
- Adaptive Exploration: The agent decides what to examine next based on what it has already learned, aiming to maximize information gain.
- Custom Clue Injection: Intercepts simulation calls to inject occupation-specific object descriptions (generated via separate LLMs), testing the agent's ability to ground textual clues in decision-making.
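The clue-injection step described above can be sketched as a thin wrapper around the simulator's observation. This is a minimal illustration, not the project's actual implementation: the `CLUES` table and `inject_clue` helper are hypothetical stand-ins for the dictionary loaded from `object_attributes3.json` and the interception logic.

```python
# Hypothetical clue table; the real project loads object_attributes3.json.
CLUES = {
    "Student": {"textbook": "A worn, coffee-stained textbook."},
    "Professor": {"quill": "A pristine, antique quill."},
}

def inject_clue(observation: str, command: str, true_occupation: str) -> str:
    """If the agent examined an object that carries an occupation-specific
    clue, append that clue to the raw simulator observation."""
    if not command.startswith("examine "):
        return observation
    obj = command[len("examine "):].strip()
    clue = CLUES.get(true_occupation, {}).get(obj)
    return f"{observation} {clue}" if clue else observation
```

The agent never sees the `true_occupation` directly; it only receives the augmented observation text and must ground the clue in its belief update.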
- Frameworks: DSPy (LM programming), ALFWorld (Embodied simulation).
- Models Tested: Gemini 2.0 Flash Lite (API), Gemma 3 4B/12B (Local via Ollama).
- Environment: Python 3.x, WSL 2 (Ubuntu).
The agent loop consists of three primary modules managed via DSPy signatures:
- ChooseNextCommand: Analyzes the interaction history and current belief state to select the next optimal action from the environment's valid action space.
- Execute: Runs the command in ALFWorld. If the command is `examine`, the system retrieves context-specific descriptive clues (e.g., "A worn, coffee-stained textbook" for a Student vs. "A pristine, antique quill" for a Professor).
- UpdateBeliefs: Re-evaluates the probability distribution of occupations based on the new observation using CoT reasoning.
Termination: The episode ends when confidence in a single occupation exceeds a threshold (0.8) or the step limit is reached.
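The belief-update step can be summarized as a Bayesian-style renormalization: the LLM's CoT judgment supplies per-occupation likelihoods for the latest observation, which scale the prior and are renormalized. The sketch below is illustrative only; `update_beliefs` and its signature are assumptions, not the project's actual code.

```python
def update_beliefs(beliefs: dict, likelihoods: dict) -> dict:
    """Bayesian-style update: multiply the prior over occupations by the
    per-occupation likelihoods (here assumed to come from the LLM's CoT
    assessment of the new observation) and renormalize to sum to 1."""
    posterior = {occ: p * likelihoods.get(occ, 1.0) for occ, p in beliefs.items()}
    total = sum(posterior.values())
    return {occ: p / total for occ, p in posterior.items()}

# Uniform prior over the four profiles; an observation that is three times
# as likely under "Student" shifts mass toward that hypothesis.
prior = {o: 0.25 for o in ("Professor", "Assassin", "Student", "Billionaire")}
posterior = update_beliefs(prior, {"Student": 3.0})
```

The episode then terminates as soon as `max(posterior.values())` exceeds the confidence threshold (0.8) or the step limit is hit.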
- Python 3.x
- Ollama (for local inference with smaller language models)
- An ALFWorld-compatible environment (instructions below assume a standard setup).
- Clone the repository:
git clone https://github.com/tancredelg/scene-investigation-agent.git
cd scene-investigation-agent
- Install dependencies:
pip install -r requirements.txt
- (Optional) Pull local models if using Ollama:
ollama pull gemma3:4b-it-qat
All experimental parameters are configured in main.py:
- Mode: Select between single-episode debugging or batch evaluation.
- Model: Switch between local (Ollama) and API (Gemini) backends.
- Parameters: Adjust `confidence_threshold`, `context_length`, and `use_cot`.
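For orientation, the experiment settings might look roughly like the dictionary below. This is a hypothetical sketch of the shape of the configuration; the actual variable names and values live in `main.py` and may differ.

```python
# Hypothetical configuration shape; see main.py for the real settings.
CONFIG = {
    "mode": "batch",                      # or "single" for one debug episode
    "model": "ollama/gemma3:4b-it-qat",   # local backend, or a Gemini model id
    "confidence_threshold": 0.8,          # belief level that ends the episode
    "context_length": "full",             # how much interaction history the LM sees
    "use_cot": True,                      # enable Chain-of-Thought reasoning
}
```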
To run the agent:
python main.py <path_to_alfworld_config.yaml>

Experiments compared local models (Gemma 3 4B) vs. larger API models (Gemini 2.0 Flash Lite) across different context lengths and reasoning modes.
- Best Performance: Gemini 2.0 Flash Lite with Chain-of-Thought and full history context achieved an 82.5% success rate across all occupations.
- Findings: Full interaction history was crucial for preventing loops. CoT significantly improved the agent's ability to correlate subtle object descriptions with specific occupations.
For detailed analysis, ablation studies, and more, please refer to the Project Report.
- agents.py: Core DSPy agent logic and signatures.
- main.py: Main loop, environment initialization, and experiment runner.
- utils.py: Helpers for environment restriction and metadata handling.
- object_attributes3.json: The dictionary of occupation-specific clues injected during simulation.
- output/: Logs of agent reasoning traces and episode results.