Extracts clean, RAG-friendly Markdown from any webpage.
BeepSeq-WebResearch is a powerful web content extraction tool designed to fetch the main text and relevant images from any URL and convert it into a clean, self-contained Markdown file. It intelligently handles complex scenarios like lazy-loading images and bypasses anti-scraping measures by leveraging a real browser environment.
This tool is ideal for developers and researchers building Retrieval-Augmented Generation (RAG) systems, enabling high-quality data pipelines for LLMs.
The tool is designed with a dual-mode architecture for maximum flexibility:
- Standalone CLI: For quick, on-demand tasks, you can use the `webresearch` command directly. It spins up a browser instance, performs the extraction, and shuts down, all in a single process.
- Client-Server Mode: For continuous or programmatic use, you can run `webresearch serve` to launch a persistent FastAPI server. You can then interact with this server via its HTTP API, using the provided Python client or any other HTTP tool.
- RAG-Ready Markdown: Outputs clean, structured Markdown, perfect for LLM ingestion and RAG pipelines.
- Intelligent Content Extraction: Utilizes a robust engine (`readability-lxml`) to accurately extract the main content while preserving structure.
- Embedded Images: Seamlessly fetches and embeds all relevant images as base64, creating fully self-contained documents.
- Robust & Resilient: Built on `nodriver` to handle dynamic content, lazy-loading, and anti-bot measures effectively.
- High-Performance: Employs asynchronous processing and high-speed parallel image downloads.
- Flexible: Offers a fallback extraction engine (`trafilatura`) for compatibility.
- Dual-Mode Interface: Provides both a standalone command-line interface (CLI) and a FastAPI server for programmatic access.
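The "Embedded Images" feature comes down to rewriting each image reference as a base64 data URI inside the Markdown. A minimal sketch of the idea using only the standard library (for illustration only, not the tool's actual implementation):

```python
import base64

def embed_image(alt: str, image_bytes: bytes, mime: str = "image/png") -> str:
    # Encode the raw image bytes and inline them as a data URI, so the
    # resulting Markdown needs no external image requests.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"![{alt}](data:{mime};base64,{encoded})"

# The first four bytes of a PNG header, purely as sample input:
print(embed_image("logo", b"\x89PNG"))
# → ![logo](data:image/png;base64,iVBORw==)
```

Because the image data travels inside the document itself, the output file stays renderable offline and can be chunked for a RAG index without any asset bookkeeping.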
- Python 3.9+
- Google Chrome (or a Chromium-based browser) installed.
```shell
pip install git+https://github.com/BICHENG/BeepSeq-WebResearch.git
```

The CLI is the quickest way to extract content from a URL without running a server.
Extract and Save as Markdown (with embedded images):
```shell
webresearch read "https://your-target-url.com"
```

This will create a `.md` file in the `output/` directory.
Extract and Save as both HTML and Markdown:
```shell
webresearch read "https://your-target-url.com" --html --md
```

Search and Extract Full Text from Results:
```shell
webresearch search "Your search query" --max-results 3 --fulltext
```

This mode is ideal for integrating the tool into your applications.
Step 1: Start the server
```shell
webresearch serve --port 8000
```

The server will now be running in the background.
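Because the server speaks plain HTTP, you can also reach it with curl, httpx, or the standard library instead of the bundled client. A sketch of building such a request URL; note that the endpoint path (`/read`) and parameter name used here are assumptions for illustration — consult the bundled `client.py` for the server's actual routes:

```python
from urllib.parse import urlencode

# NOTE: "/read" and the "url" query parameter are hypothetical here;
# check client.py for the real route names exposed by the server.
base_url = "http://localhost:8000"
params = urlencode({"url": "https://example.com"})
request_url = f"{base_url}/read?{params}"
print(request_url)  # pass this to curl, httpx, urllib.request, etc.
```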
Step 2: Use the Python client to make requests
We provide a simple async client in `client.py`. You can use it as follows:
```python
import asyncio
from webresearch.client import WebResearchClient, CrawlerConfig

async def main():
    client = WebResearchClient(base_url="http://localhost:8000")

    # Example 1: Read a single URL
    try:
        print("--- Reading a single URL ---")
        # Note: The server uses the default config unless one is specified in the
        # request body, which is more complex. For CLI-like control, a direct
        # crawl is better. The client library here shows basic GET interaction.
        content = await client.read(url="https://www.bilibili.com/video/BV1c1421f7bK/")
        print(f"Successfully extracted content (first 200 chars):\n{content[:200]}...")
    except Exception as e:
        print(f"An error occurred: {e}")

    # Example 2: Search for a query
    try:
        print("\n--- Searching for a query ---")
        results = await client.search(query="NVIDIA Blackwell", max_results=2)
        print(f"Search results: {results}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    asyncio.run(main())
```

For the most powerful workflow, you can expose the server as a set of tools for an AI agent like Cursor.
Step 1: Start the server
```shell
webresearch serve --port 8000
```

Step 2: Configure Cursor's MCP
Open Cursor's `mcp.json` file. You can find it at:
- Windows: `C:\Users\<YourUsername>\.cursor\mcp.json`
- macOS/Linux: `~/.cursor/mcp.json`
Add a new entry for your local server:
```json
{
  "mcpServers": {
    "...": {
      "...": "..."
    },
    "webresearch": {
      "url": "http://localhost:8000/mcp"
    }
  }
}
```

Note: Remember to replace `...` with your existing configurations for other servers.
Step 3: Reload Cursor and Use the Tools
After reloading Cursor (you can use the Reload Window command), you can now use the tools directly in the chat:
Example 1: Search the web
@webresearch search_web query="Latest AI advancements"
Example 2: Read a specific URL
@webresearch read_url url="https://en.wikipedia.org/wiki/Artificial_intelligence"
Example 3: Batch read (GET) with comma-separated URLs
@webresearch read_url urls="https://example.com,https://en.wikipedia.org/wiki/WTFPL"
Example 4: Batch read (POST) with advanced config
@webresearch read_urls urls=["https://example.com","https://example.org"] config={"embed_images":true,"save_markdown":true}
For detailed examples, including comparisons and advanced use-cases, please see our Showcase & Examples page.
- MCP Server Integration: Expose the `read` and `search` endpoints as Model Context Protocol (MCP) tools. This will turn the service into a compliant MCP server, allowing AI agents and applications (like Cursor) to discover and use its web research capabilities natively.
Contributions are welcome! Please feel free to open an issue or submit a pull request.
This project uses the WTFPL v2 (Do What The Fuck You Want To Public License). See the LICENSE file for details.