Kalki - The Universal Web Agent

1. This project contains a modular Universal Web Agent designed to operate within the AGI SDK REAL benchmark and on the open web.
2. The agent uses a Dynamic Router to intelligently switch between Browsing (Playwright), Researching (Perplexity), and Coding (E2B) strategies based on the user's goal.
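
The routing decision described above can be sketched roughly as follows. This is a minimal illustration, not the project's router: the `Strategy` enum, the `route` function, and the keyword list are hypothetical, and the actual router may well ask an LLM rather than match keywords. The only behavior taken from this README is that supplying a URL forces browsing.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    BROWSE = "browse"      # Playwright-driven navigation
    RESEARCH = "research"  # Perplexity lookup
    CODE = "code"          # E2B sandbox execution


# Hypothetical keyword hints, used only for this sketch.
CODE_HINTS = ("calculate", "compute", "generate", "python", "script")


def route(goal: str, url: Optional[str] = None) -> Strategy:
    """Pick a strategy for a goal. A supplied URL forces browsing."""
    if url:
        return Strategy.BROWSE
    text = goal.lower()
    if any(hint in text for hint in CODE_HINTS):
        return Strategy.CODE
    # Simplifying assumption: anything else is treated as a research question.
    return Strategy.RESEARCH
```

A keyword match keeps the sketch deterministic and easy to test; a production router would need a fallback for goals that require a browser but arrive without a URL.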

System Architecture Diagram

View Architecture on Miro

Current repo contains

Universal Router: An intelligent "Traffic Controller" that routes tasks to the best specialist (Researcher, Coder, or Navigator).
Cloud-Native Tooling: Integrated Perplexity AI (via Docker MCP) and E2B Code Interpreter running in secure cloud sandboxes.
High-Level Orchestrator: Manages task lifecycles, planning, and self-correction.
Dynamic Prompt Routing: Automatically selects specialized prompts for benchmarks (Omnizon, NetworkIn) vs. strict safety prompts for real web tasks.
Agent Memory: Stores interaction history and thought processes (JSON-based).
Self-Healing: Implemented self-critique loops to recover from navigation errors.
Strict Mode: Special handling for anti-bot sites (Amazon/Google) using keyboard shortcuts and OCR.
Benchmark Ready: Full setup and evaluation on REAL Bench for Omnizon and NetworkIn.
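
As a rough illustration of the Dynamic Prompt Routing item above, the sketch below maps a benchmark task ID to a prompt module name. The function name `select_prompt_module` and the fallback rule are assumptions made for this example; only the module names are taken from the project tree, and the real prompt_selector.py may work differently.

```python
DEFAULT_PROMPTS = "general_prompts"

# Benchmark site -> prompt module name (file names as listed in the project tree).
BENCHMARK_PROMPTS = {
    "omnizon": "omnizon_prompts",
    "networkin": "networkIn_prompts",
}


def select_prompt_module(task_name=None):
    """Return the prompt module name for a task ID such as 'webclones.networkin-3'."""
    if not task_name:
        # No benchmark task given: assume an open-web run with the strict defaults.
        return DEFAULT_PROMPTS
    # 'webclones.networkin-3' -> 'networkin-3' -> 'networkin'
    site = task_name.split(".")[-1].rsplit("-", 1)[0].lower()
    # Assumption: sites without a dedicated entry fall back to the general prompts.
    return BENCHMARK_PROMPTS.get(site, DEFAULT_PROMPTS)
```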

Possible Improvements

Integrating DSPy to optimize the Router and Planner prompts automatically.
Using RL (Reinforcement Learning) for post-training (GRPO, PPO) to improve navigation efficiency.
Testing with open-weights models (Llama 3, DeepSeek) to reduce costs.
Building on better browser-use frameworks (Nova-act) and fine-tuning parts of the multi-modal LLM.

Future improvements

Known issue: screenshot capture and BrowserGym's HighLevelActionSet occasionally fall out of sync on heavy pages, so an action can target a page state that no longer matches the screenshot.
A better semantic map of clickable elements (identified by their BrowserGym bid) could be built by fine-tuning prompts or by using a dedicated VLM for UI element detection.
Integrating with more agentic frameworks (LangGraph) for multi-agent collaboration.

Cost & Model Limitations

The agent defaults to GPT-4o for planning and routing to ensure high reliability. Switching to gpt-4o-mini in config.py saves costs but reduces complex reasoning capabilities.
Requires E2B and Perplexity credits for the advanced tools functionality.
Post-training with GRPO, combined with DSPy prompt optimization, may improve performance; this has not been evaluated yet.

🌟 Key Features

Universal Router: The agent doesn't just click buttons; it thinks. If you ask "Who is the CEO?", it calls Perplexity. If you ask "Calculate Fibonacci", it spins up an E2B sandbox.
Cloud-Native Infrastructure: Heavy tools (Docker containers, Python runtime) run in E2B Cloud Sandboxes, ensuring sub-second startup times and zero load on your local machine.
Strict Mode for Real Web: When browsing real sites (Amazon, Google), the agent switches to a "Strict" prompt set that enforces keyboard shortcuts and prevents hallucination.
Dynamic Prompt Routing: Uses a smart selector to load the correct "instruction manual" (prompt file) for benchmarks (Omnizon, DashDish) while using safe defaults for the open web.
Chain-of-Thought Planning: The agent performs a "self-verification" step after creating a plan, critiquing it for logical flaws before execution.
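
The self-verification step in the last item can be pictured as a small plan-critique loop. The sketch below is only an illustration under stated assumptions: `plan_with_self_check`, `fake_planner`, and `fake_critic` are hypothetical names, and the two stubs stand in for LLM calls that the real orchestrator would make.

```python
from typing import Callable, List


def plan_with_self_check(
    goal: str,
    make_plan: Callable[[str, str], List[str]],
    critique: Callable[[str, List[str]], str],
    max_rounds: int = 2,
) -> List[str]:
    """Draft a plan, then revise it until the critic has no objections."""
    feedback = ""
    plan = make_plan(goal, feedback)
    for _ in range(max_rounds):
        feedback = critique(goal, plan)
        if not feedback:  # empty string means "no flaws found"
            break
        plan = make_plan(goal, feedback)
    return plan


# Stub planner and critic (stand-ins for LLM calls, hypothetical).
def fake_planner(goal: str, feedback: str) -> List[str]:
    steps = ["open site", "search item"]
    if feedback:
        steps.append("add to cart")
    return steps


def fake_critic(goal: str, plan: List[str]) -> str:
    return "" if "add to cart" in plan else "plan never adds the item to the cart"


final_plan = plan_with_self_check(
    "Find Sony Headphones and add to cart.", fake_planner, fake_critic
)
```

Capping the number of rounds keeps a disagreeing planner and critic from looping forever, at the cost of sometimes executing a plan the critic still dislikes.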

📂 Project Structure

The agent's source code is located entirely within the agiwebagent/ directory.

agiwebagent/
├── main.py                 # The "Brain": Universal entry point with the Router logic.
├── requirements.txt        # Python dependencies.
└── agent_src/              # Core source code.
    ├── __init__.py
    ├── agent.py            # The "Specialist": Manages the LLM loop and Tool execution.
    ├── config.py           # Configuration (Models, API Keys, Template IDs).
    ├── memory.py           # Stores history of actions and thoughts.
    ├── orchestrator.py     # The "Manager": Oversees the plan-act-critique loop.
    ├── prompt_selector.py  # The "Librarian": Picks the right prompt file.
    ├── tools.py            # The "Toolbelt": Manages E2B Cloud Sandbox & Perplexity.
    ├── utils.py            # Helper functions (Image processing).
    └── prompts/            # Prompt Engineering logic.
        ├── __init__.py
        ├── general_prompts.py      # Strict prompts for Real Web tasks.
        ├── networkIn_prompts.py    # Specialized brain for NetworkIn.
        ├── omnizon_prompts.py      # Specialized brain for Omnizon.
        └── ...

🛠️ Setup Instructions

Follow these steps from the root directory of the project.

1. Create and Activate a Virtual Environment

It's highly recommended to use a virtual environment to manage dependencies.

python -m venv agienv
source agienv/bin/activate
# On Windows: agienv\Scripts\activate

2. Install Dependencies

Install the required Python packages.

pip install -r requirements.txt
pip install -r agiwebagent/requirements.txt
playwright install chromium  # Required for the browser

3. Set Up API Keys

Create a file named .env in the agiwebagent/ directory. You need keys for the LLM and the Tools.

OPENAI_API_KEY="sk-..."
PERPLEXITY_API_KEY="pplx-..."
E2B_API_KEY="e2b_..."
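
Before the first run it can help to confirm that all three keys are present. The check below is a standalone sketch using only the standard library; the helper names (`parse_env_text`, `load_env`, `missing_keys`) are hypothetical and are not part of the project, which may load the file differently.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "PERPLEXITY_API_KEY", "E2B_API_KEY")


def parse_env_text(text: str) -> dict:
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path: str) -> dict:
    """Read and parse a .env file from disk."""
    return parse_env_text(Path(path).read_text())


def missing_keys(env: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

For example, `missing_keys(load_env("agiwebagent/.env"))` returns an empty list when the file defines all three keys.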

🚀 Running the Agent

All commands should be run from the root directory. The main script is agiwebagent/main.py.

1. Real World Tasks (Custom Mode)

Use this for open-ended tasks on real websites or for general research. The Router will automatically decide if it needs a browser.

Research (No Browser):

python agiwebagent/main.py --goal "Who is the current CEO of OpenAI?"

Coding (No Browser):

python agiwebagent/main.py --goal "Generate a Fibonacci sequence in Python."

Browsing (Opens Chrome):

python agiwebagent/main.py \
  --url "https://www.amazon.com" \
  --goal "Find Sony Headphones and add to cart." \
  --use_ocr True --headless False

2. Benchmark Tasks (Evaluation Mode)

To run the WebClones benchmarks, use the --task_name argument for a single task or --task_type for a full suite.

Run a specific task:

python agiwebagent/main.py --task_name webclones.networkin-3 --no-cache --headless true

Run a full suite:

python agiwebagent/main.py --task_type omnizon --no-cache --headless true

CLI Arguments Cheatsheet

| Argument | Description | Example |
|---|---|---|
| `--goal` | (New) The instruction for the agent in Custom Mode. | `"Find the price of TSLA"` |
| `--url` | (New) The starting URL for Custom Mode. Forces browsing. | `https://google.com` |
| `--task_name` | Runs a single benchmark task by ID. | `webclones.dashdish-2` |
| `--task_type` | Runs a full benchmark suite. | `--task_type omnizon` |
| `--headless` | `true`/`false`. Hides the browser window. | `--headless false` |
| `--use_ocr` | Enables visual text reading (crucial for Amazon/Google). | `--use_ocr True` |
| `--model` | Specifies the OpenAI model (default: gpt-4o). | `--model gpt-4o` |
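
The flags above take boolean values as text (`true`, `False`), which plain `argparse` does not convert on its own. The sketch below shows one way such flags could be parsed; it is not the project's main.py. The `str2bool` helper, `build_parser`, and the defaults for `--headless` and `--use_ocr` are assumptions, while the `gpt-4o` default comes from the table.

```python
import argparse


def str2bool(value: str) -> bool:
    """Accept 'true'/'True'/'false'/'False' style values."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected true/false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the Kalki CLI flags")
    parser.add_argument("--goal", help="Instruction for Custom Mode")
    parser.add_argument("--url", help="Starting URL; forces browsing")
    parser.add_argument("--task_name", help="Single benchmark task ID")
    # Defaults for the two boolean flags are assumptions made for this sketch.
    parser.add_argument("--headless", type=str2bool, default=True)
    parser.add_argument("--use_ocr", type=str2bool, default=False)
    parser.add_argument("--model", default="gpt-4o")
    return parser


args = build_parser().parse_args(
    ["--goal", "Find the price of TSLA", "--headless", "False", "--use_ocr", "True"]
)
```

Using `type=str2bool` instead of `type=bool` matters here, because `bool("False")` is `True` in Python.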
