This repository contains a LangGraph-based multi-input AI agent that handles text, image, audio, and video inputs. It uses OpenAI models bound to multiple tools for web search, image analysis, audio transcription, and video summarization, executing them dynamically based on the user's query. The repository includes both a plain LangGraph implementation and an MCP client-server implementation.
- Dynamic Tool Execution: The agent intelligently decides when to call tools via `tool_calls`.
- Multi-Modal Support:
- Web search for up-to-date information.
- Image analysis using OpenAI vision models.
- Audio transcription using Whisper.
- Video summarization (YouTube placeholder implementation).
- Stateful Workflow: LangGraph manages transitions between tool use and the final output.
- Batch Processing: Capable of fetching tasks from an API, running them through the agent, and submitting results.
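As a minimal sketch of what a multi-modal tool looks like (function names and signatures here are illustrative, not the repository's actual ones), each tool is an ordinary documented Python function; in the real code these are wrapped with LangChain's `@tool` decorator so the LLM can invoke them by name:

```python
# Illustrative sketch only: in the repository these functions carry
# LangChain's @tool decorator and call real OpenAI APIs.

def transcribe_audio(file_path: str) -> str:
    """Transcribe an audio file to text (Whisper in the real repo)."""
    # Placeholder standing in for an OpenAI Whisper API call.
    return f"[transcript of {file_path}]"

def analyze_image(file_path: str, question: str) -> str:
    """Answer a question about an image (a vision model in the real repo)."""
    # Placeholder standing in for an OpenAI vision-model call.
    return f"[answer to {question!r} about {file_path}]"
```

The docstrings matter: LangChain surfaces them to the model as tool descriptions, which is how the agent knows when each tool is relevant.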
- LangChain – Framework for building LLM-powered applications.
- LangGraph – Graph-based orchestration for multi-step reasoning and tool use.
- OpenAI Python Client – For GPT models, vision, and Whisper.
- python-dotenv – Environment variable management.
- pandas – Data handling utilities.
- requests – HTTP client for API interaction.
- langsmith – Tracing and debugging for LangChain apps.
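The stack above corresponds to a `requirements.txt` roughly like the following (unpinned and illustrative; check the repository for exact versions):

```
langchain
langgraph
openai
python-dotenv
pandas
requests
langsmith
```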
The core workflow follows this structure:

1. **main_node** – The LLM is invoked (with tools bound) and may emit `tool_calls`.
2. **tool_node** – Runs the requested tools (web search, image analysis, audio transcription, video summarization), each implemented as a Python function under the `@tool` decorator.
3. **Conditional edge** – If further tool calls are needed, control returns to `main_node`; otherwise it proceeds to `output`.
4. **output** – Produces the final answer.

This flow is visualized in `graph_diagram.png`.
- Clone the repository:

  ```bash
  git clone <repo_url>
  cd <repo_dir>
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with your API keys:

  ```
  OPENAI_API_KEY=your_key_here
  HF_USERNAME=your_hf_username
  API_URL=https://your.api.url/
  ```

- Run the agent:

  ```bash
  python multi_input_agent.py
  ```
The evaluation is done on a subset of GAIA: A Benchmark for General AI Assistants, originally released by Meta. The agent is expected to process the files attached to each task, e.g. images, videos, audio, Excel spreadsheets, and code.
The script’s run_and_submit_all() function can fetch multiple tasks from a specified API, run them, and submit answers back.
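The batch loop can be sketched as follows. This is not the repository's actual implementation: the endpoint paths (`/questions`, `/submit`) and payload keys (`task_id`, `submitted_answer`) are assumptions for illustration; check the course API for the real contract.

```python
import requests  # already a project dependency


def build_answers(tasks, agent):
    """Map each fetched task through the agent into the submission shape.

    The "task_id"/"submitted_answer" keys are assumed for illustration.
    """
    return [{"task_id": t["task_id"], "submitted_answer": agent(t)} for t in tasks]


def run_and_submit_all(api_url, username, agent):
    """Fetch tasks, run the agent on each, and POST the answers back.

    Endpoint paths are hypothetical placeholders, not the real API.
    """
    tasks = requests.get(f"{api_url.rstrip('/')}/questions", timeout=30).json()
    payload = {"username": username, "answers": build_answers(tasks, agent)}
    resp = requests.post(f"{api_url.rstrip('/')}/submit", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```

Separating `build_answers` from the network calls keeps the agent-invocation step easy to test in isolation.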
This repository was built as the final project for the Hugging Face AI Agents Course. After finishing the course, you can earn the official certificate from Hugging Face.

This project is licensed under the Apache License 2.0.

