raj-gupta1/Kalki-The-Everything-Web-Agent

Kalki - The Universal Web Agent

1. This project contains a modular Universal Web Agent designed to operate within the AGI SDK REAL benchmark and on the open web.
2. The agent uses a Dynamic Router to intelligently switch between Browsing (Playwright), Researching (Perplexity), and Coding (E2B) strategies based on the user's goal.

System Architecture Diagram

Current repo contains

  • Universal Router: An intelligent "Traffic Controller" that routes tasks to the best specialist (Researcher, Coder, or Navigator).
  • Cloud-Native Tooling: Integrated Perplexity AI (via Docker MCP) and E2B Code Interpreter running in secure cloud sandboxes.
  • High-Level Orchestrator: Manages task lifecycles, planning, and self-correction.
  • Dynamic Prompt Routing: Automatically selects specialized prompts for benchmarks (Omnizon, NetworkIn) vs. strict safety prompts for real web tasks.
  • Agent Memory: Stores interaction history and thought processes (JSON-based).
  • Self-Healing: Implemented self-critique loops to recover from navigation errors.
  • Strict Mode: Special handling for anti-bot sites (Amazon/Google) using keyboard shortcuts and OCR.
  • Benchmark Ready: Full setup and eval on REAL Bench for OMNIZON and NetworkIn.
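A minimal sketch of how dynamic prompt routing could work, assuming task IDs follow the `webclones.<site>-<n>` pattern used elsewhere in this README. The function name `select_prompt_module` and the lookup table are hypothetical; the actual `prompt_selector.py` may be organized differently.

```python
# Hypothetical sketch; not the project's actual prompt_selector.py.
# Module names are taken from the prompts/ directory listed in this README.
PROMPT_MODULES = {
    "omnizon": "omnizon_prompts",
    "networkin": "networkIn_prompts",
}
DEFAULT_MODULE = "general_prompts"  # strict prompts for real-web tasks


def select_prompt_module(task_name=None):
    """Return the prompt module name for a benchmark task, or the strict default."""
    if not task_name:
        return DEFAULT_MODULE
    # Task IDs look like "webclones.networkin-3": keep the site part only.
    site = task_name.lower().split(".")[-1].split("-")[0]
    return PROMPT_MODULES.get(site, DEFAULT_MODULE)


print(select_prompt_module("webclones.networkin-3"))
print(select_prompt_module(None))
```

Falling back to the strict general prompts for anything unrecognized keeps open-web tasks on the safer prompt set by default.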

Possible Improvements

  • Integrating DSPy to optimize the Router and Planner prompts automatically.
  • Using reinforcement learning (RL) for post-training (GRPO, PPO) to improve navigation efficiency.
  • Testing with open-weights models (Llama 3, DeepSeek) to reduce costs.
  • Building on better browser-use frameworks (Nova Act) and fine-tuning parts of the multi-modal LLM.

Known Issues and Future Improvements

  • Screenshot capture and BrowserGym’s HighLevelActionSet occasionally desync on heavy pages.
  • A better semantic map from button tasks to BrowserGym element IDs (bid) could be built by fine-tuning prompts or by using a dedicated VLM for UI element detection.
  • Integrating with more agentic frameworks (LangGraph) for multi-agent collaboration.

Cost & Model Limitations

  • The agent defaults to GPT-4o for planning and routing to ensure high reliability. Switching to gpt-4o-mini in config.py lowers cost but weakens complex reasoning.
  • The advanced tools require E2B and Perplexity credits.
  • Post-training with GRPO, or prompt optimization with DSPy, may improve performance.

🌟 Key Features

  • Universal Router: The agent doesn't just click buttons; it thinks. If you ask "Who is the CEO?", it calls Perplexity. If you ask "Calculate Fibonacci", it spins up an E2B sandbox.
  • Cloud-Native Infrastructure: Heavy tools (Docker containers, Python runtime) run in E2B Cloud Sandboxes, ensuring sub-second startup times and zero load on your local machine.
  • Strict Mode for Real Web: When browsing real sites (Amazon, Google), the agent switches to a "Strict" prompt set that enforces keyboard shortcuts and prevents hallucination.
  • Dynamic Prompt Routing: Uses a smart selector to load the correct "instruction manual" (prompt file) for benchmarks (Omnizon, DashDish) while using safe defaults for the open web.
  • Chain-of-Thought Planning: The agent performs a "self-verification" step after creating a plan, critiquing it for logical flaws before execution.
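The routing behavior described above can be sketched as follows. This is an illustration only: the keyword list is a hypothetical stand-in for the decision the real router presumably delegates to the LLM, and the names `Route` and `route` are not taken from the project.

```python
from enum import Enum


class Route(Enum):
    RESEARCH = "research"  # answered via Perplexity
    CODE = "code"          # executed in an E2B sandbox
    BROWSE = "browse"      # driven through Playwright


# Hypothetical keyword hints; a real router would likely ask the LLM instead.
CODE_HINTS = ("calculate", "compute", "generate", "python", "script")


def route(goal, url=None):
    """Pick a strategy for a goal. A starting URL forces browsing."""
    if url:
        return Route.BROWSE
    text = goal.lower()
    if any(hint in text for hint in CODE_HINTS):
        return Route.CODE
    return Route.RESEARCH


print(route("Who is the CEO?"))
print(route("Calculate Fibonacci"))
print(route("Find Sony Headphones and add to cart.", url="https://www.amazon.com"))
```

Checking the URL first mirrors the documented behavior that `--url` forces browsing regardless of how the goal is phrased.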

📂 Project Structure

The agent's source code is located entirely within the agiwebagent/ directory.

agiwebagent/
├── main.py                 # The "Brain": Universal entry point with the Router logic.
├── requirements.txt        # Python dependencies.
└── agent_src/              # Core source code.
    ├── __init__.py
    ├── agent.py            # The "Specialist": Manages the LLM loop and Tool execution.
    ├── config.py           # Configuration (Models, API Keys, Template IDs).
    ├── memory.py           # Stores history of actions and thoughts.
    ├── orchestrator.py     # The "Manager": Oversees the plan-act-critique loop.
    ├── prompt_selector.py  # The "Librarian": Picks the right prompt file.
    ├── tools.py            # The "Toolbelt": Manages E2B Cloud Sandbox & Perplexity.
    ├── utils.py            # Helper functions (Image processing).
    └── prompts/            # Prompt Engineering logic.
        ├── __init__.py
        ├── general_prompts.py      # Strict prompts for Real Web tasks.
        ├── networkIn_prompts.py    # Specialized brain for NetworkIn.
        ├── omnizon_prompts.py      # Specialized brain for Omnizon.
        └── ... 
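
The tree describes orchestrator.py as overseeing a plan-act-critique loop. A minimal sketch of the planning half of such a loop is shown here, assuming the planner and critic are plain callables. `plan_with_self_check`, `toy_planner`, and `toy_critic` are hypothetical names, and the toy functions stand in for LLM calls so the example runs without an API key.

```python
def plan_with_self_check(goal, make_plan, critique, max_revisions=2):
    """Draft a plan, critique it, and revise until the critique passes.

    make_plan(goal, feedback) -> list of steps
    critique(goal, plan) -> feedback string, or "" when no flaw is found
    Both callables are hypothetical stand-ins for LLM calls.
    """
    feedback = ""
    plan = make_plan(goal, feedback)
    for _ in range(max_revisions):
        feedback = critique(goal, plan)
        if not feedback:
            break  # the critic found nothing to fix
        plan = make_plan(goal, feedback)
    return plan


# Toy stand-ins so the sketch is runnable offline.
def toy_planner(goal, feedback):
    steps = ["open site", "search for item"]
    if feedback:
        steps.append("add item to cart")
    return steps


def toy_critic(goal, plan):
    if "add item to cart" in plan:
        return ""
    return "plan never adds the item to the cart"


final_plan = plan_with_self_check(
    "Find Sony Headphones and add to cart.", toy_planner, toy_critic
)
print(final_plan)
```

Capping the number of revisions keeps a disagreeing planner and critic from looping indefinitely.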

🛠️ Setup Instructions

Follow these steps from the root directory of the project.

1. Create and Activate a Virtual Environment

It's highly recommended to use a virtual environment to manage dependencies.

python -m venv agienv
source agienv/bin/activate
# On Windows: agienv\Scripts\activate

2. Install Dependencies

Install the required Python packages.

pip install -r requirements.txt
pip install -r agiwebagent/requirements.txt
playwright install chromium  # Required for the browser

3. Set Up API Keys

Create a file named .env in the agiwebagent/ directory. You need keys for the LLM and the Tools.

OPENAI_API_KEY="sk-..."
PERPLEXITY_API_KEY="pplx-..."
E2B_API_KEY="e2b_..."
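
Before the first run it can help to confirm that all three keys are present. The following is a small standard-library sketch, not part of the project; `parse_env_text` and `missing_keys` are hypothetical helper names, and the parser handles only simple `KEY="value"` lines like the ones above.

```python
REQUIRED_KEYS = ("OPENAI_API_KEY", "PERPLEXITY_API_KEY", "E2B_API_KEY")


def parse_env_text(text):
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env):
    """Return the required keys that are absent or empty in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example: a .env file that forgot the Perplexity key.
sample = 'OPENAI_API_KEY="sk-..."\nE2B_API_KEY="e2b_..."\n'
print(missing_keys(parse_env_text(sample)))
```

To check a real file, read its contents from agiwebagent/.env and pass the text to `parse_env_text`.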

🚀 Running the Agent

All commands should be run from the root directory. The main script is agiwebagent/main.py.

1. Real World Tasks (Custom Mode)

Use this for open-ended tasks on real websites or for general research. The Router will automatically decide if it needs a browser.

Research (No Browser):

python agiwebagent/main.py --goal "Who is the current CEO of OpenAI?"

Coding (No Browser):

python agiwebagent/main.py --goal "Generate a Fibonacci sequence in Python."

Browsing (Opens Chromium):

python agiwebagent/main.py \
  --url "https://www.amazon.com" \
  --goal "Find Sony Headphones and add to cart." \
  --use_ocr True --headless False

2. Benchmark Tasks (Evaluation Mode)

To run specific scientific benchmarks (WebClones), use the --task_name argument.

Run a specific task:

python agiwebagent/main.py --task_name webclones.networkin-3 --no-cache --headless true

Run a full suite:

python agiwebagent/main.py --task_type omnizon --no-cache --headless true

CLI Arguments Cheatsheet

| Argument | Description | Example |
|---|---|---|
| --goal | (New) The instruction for the agent in Custom Mode. | "Find the price of TSLA" |
| --url | (New) The starting URL for Custom Mode. Forces browsing. | https://google.com |
| --task_name | Runs a single benchmark task by ID. | webclones.dashdish-2 |
| --headless | true/false. Hides the browser window. | --headless false |
| --use_ocr | Enables visual text reading (Crucial for Amazon/Google). | --use_ocr True |
| --model | Specifies the OpenAI model (default: gpt-4o). | --model gpt-4o |
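
For readers who want to see how such flags fit together, here is a hedged argparse sketch of the documented interface. It is not the parser from main.py: the defaults for `--headless` and `--use_ocr` are assumptions, `str_to_bool` and `build_parser` are hypothetical helpers, and `--task_type` and `--no-cache` are included only because they appear in the benchmark commands above.

```python
import argparse


def str_to_bool(value):
    """Accept true/false in any letter case, since the examples mix both."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the documented flags")
    parser.add_argument("--goal", help="Instruction for the agent in Custom Mode")
    parser.add_argument("--url", help="Starting URL for Custom Mode; forces browsing")
    parser.add_argument("--task_name", help="Single benchmark task ID")
    parser.add_argument("--task_type", help="Benchmark suite, e.g. omnizon")
    parser.add_argument("--no-cache", action="store_true")
    # The two defaults below are assumptions, not taken from the project.
    parser.add_argument("--headless", type=str_to_bool, default=True)
    parser.add_argument("--use_ocr", type=str_to_bool, default=False)
    parser.add_argument("--model", default="gpt-4o")
    return parser


args = build_parser().parse_args(
    ["--goal", "Find the price of TSLA", "--headless", "False"]
)
print(args.goal, args.headless, args.model)
```

Parsing booleans through a converter rather than `type=bool` matters here, because `bool("False")` is `True` in Python.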

About

AI Web Agent using E2B Sandbox and Docker MCP
