MCP Server

Documentation Retrieval & Web Scraping

Enables retrieval and cleaning of official documentation content for popular AI/Python libraries (uv, langchain, openai, llama-index) through web scraping and LLM-powered content extraction. Uses Serper API for search and Groq API to clean HTML into readable text with source attribution.

README

MCP Server: Documentation Retrieval & Web Scraping (uv + FastMCP)

This project provides a minimal, async MCP (Model Context Protocol) server that exposes a tool for retrieving and cleaning official documentation content for popular AI / Python ecosystem libraries. It uses:

fastmcp to define and run the MCP server over stdio.
httpx for async HTTP calls.
serper.dev for Google-like search (via API).
groq API (LLM) to clean raw HTML into readable text chunks.
python-dotenv for environment variable management.
uv as the package manager & runner (fast, lockfile-based, Python 3.11+).

Features

Search restricted to official docs domains (uv, langchain, openai, llama-index).
Tool: get_docs(query, library) returns concatenated cleaned sections with SOURCE: labels.
Async design that splits large HTML pages into chunks before LLM cleaning, so no single request exceeds model input limits.
Separate client.py demonstrating how to connect as an MCP client and call the tool, then post-process with an LLM.

Quick Start

Prerequisites:

Python 3.11+
uv installed (https://docs.astral.sh/uv/)
API keys for: SERPER_API_KEY, GROQ_API_KEY

1. Clone & Install

git clone <your-repo-url> mcp-server-python
cd mcp-server-python
uv sync

This will create/refresh a .venv based on pyproject.toml + uv.lock.

2. Environment Variables

Create a .env file in the project root:

SERPER_API_KEY=your_serper_api_key_here
GROQ_API_KEY=your_groq_api_key_here

Optional: add other model settings if you later extend functionality.
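
A missing key otherwise only surfaces later as a failed API call, so it can help to check both variables at startup. The following is a minimal sketch under that assumption; the helper name `require_env` is hypothetical and not part of the project.

```python
import os


def require_env(names, env=None):
    """Return the requested variables, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# In the server this would typically run right after
# `from dotenv import load_dotenv; load_dotenv()` has populated os.environ:
# keys = require_env(["SERPER_API_KEY", "GROQ_API_KEY"])
```

Raising at import time makes a misconfigured `.env` visible immediately instead of during the first tool call.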

3. Run the MCP Server Directly

uv run mcp_server.py

The server will start and wait on stdio (no extra output unless you add logging). It registers the tool get_docs.

4. Use the Provided Client

uv run client.py

You should see something like:

Available tools: ['get_docs']
ANSWER: <model-produced answer referencing SOURCE lines>

If the list is empty, ensure the server started correctly and no exceptions were raised (add logging; see Logging & Debugging below).

Tool: get_docs

Signature:

get_docs(query: str, library: str) -> str

Supported libraries (keys): uv, langchain, openai, llama-index.
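
Each library key presumably maps to a docs domain that is then used in the `site:` filter. The sketch below illustrates that lookup and query construction; the domain values and the names `DOCS_DOMAINS` and `build_site_query` are assumptions for illustration, and the actual mapping in `mcp_server.py` may differ.

```python
# Hypothetical mapping; the real one lives in mcp_server.py and may differ.
DOCS_DOMAINS = {
    "uv": "docs.astral.sh/uv",
    "langchain": "python.langchain.com/docs",
    "openai": "platform.openai.com/docs",
    "llama-index": "docs.llamaindex.ai/en/stable",
}


def build_site_query(query: str, library: str) -> str:
    """Build the `site:<docs-domain> <query>` string for a supported library."""
    try:
        domain = DOCS_DOMAINS[library]
    except KeyError:
        supported = ", ".join(sorted(DOCS_DOMAINS))
        raise ValueError(
            f"Unsupported library {library!r}; expected one of: {supported}"
        ) from None
    return f"site:{domain} {query.strip()}"
```

Rejecting unknown keys early gives the calling model a clear error message rather than an unrestricted web search.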

Flow:

Build a site-restricted query: site:<docs-domain> <query>.
Call Serper API for organic results.
Fetch each result URL (async) via httpx.
Split HTML into ~4000‑char chunks (memory safety & LLM limits).
Clean each chunk using Groq LLM (openai/gpt-oss-20b) with a system prompt.
Concatenate and label each block with SOURCE: <url> for traceability.
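
The chunking and labeling steps above can be sketched with two small helpers. This is an illustration only: the function names are hypothetical, the 4000-character size is taken from the description above, and the exact layout of the `SOURCE:` line in the real output is an assumption.

```python
def chunk_text(text: str, size: int = 4000) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def label_block(url: str, cleaned_chunks: list[str]) -> str:
    """Join the cleaned chunks of one page under a SOURCE line (assumed layout)."""
    body = "\n".join(chunk.strip() for chunk in cleaned_chunks if chunk.strip())
    return f"SOURCE: {url}\n{body}"
```

Fixed-size slicing is simple but can cut through an HTML tag or a sentence; splitting on element boundaries would be a possible refinement.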

Returned value: A large text blob suitable for retrieval-augmented prompting, preserving source attribution lines.
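
On the consuming side, the blob can be split back into per-source sections. The sketch below assumes each block begins with a line of the form `SOURCE: <url>` followed by its text; if the server emits the label elsewhere, the parsing would need to change. The name `split_by_source` is hypothetical.

```python
def split_by_source(blob: str) -> list[tuple[str, str]]:
    """Group text under the most recent `SOURCE: <url>` line.

    Any text that appears before the first SOURCE line is dropped.
    """
    sections = []
    url, lines = None, []
    for line in blob.splitlines():
        if line.startswith("SOURCE:"):
            if url is not None:
                sections.append((url, "\n".join(lines).strip()))
            url, lines = line[len("SOURCE:"):].strip(), []
        elif url is not None:
            lines.append(line)
    if url is not None:
        sections.append((url, "\n".join(lines).strip()))
    return sections
```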

Architecture

File overview:

| File | Purpose |
| --- | --- |
| `mcp_server.py` | Defines the `FastMCP` instance and implements `search_web`, `fetch_url`, and the `get_docs` tool. |
| `client.py` | Launches the server via stdio, lists tools, calls `get_docs`, then feeds the result to an LLM for a user-friendly answer. |
| `utils.py` | HTML cleaning helper (uses `trafilatura`, currently only in part, for content extraction and the Groq LLM to clean each chunk). |
| `.env` | Environment variables (excluded from VCS). |
| `pyproject.toml` | Declares dependencies and metadata. |
| `uv.lock` | Reproducible lockfile generated by `uv`. |

Dependency Notes

Core runtime deps (from pyproject.toml):

fastmcp – MCP server helper.
httpx – async HTTP client.
groq – Groq API client.
python-dotenv – load variables from .env.
trafilatura – heuristic content extraction (currently partially used / can be extended).

Tip: If you add more scraping tools, reuse a single httpx.AsyncClient for performance.
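
One way to follow this tip is a module-level accessor that creates the client once and hands the same instance to every tool. This is a sketch, not the project's code: `get_client`, `close_client`, and the `factory` parameter are hypothetical names, and the 30-second timeout is an arbitrary choice.

```python
_client = None


def get_client(factory=None):
    """Return one shared client, creating it on first use."""
    global _client
    if _client is None:
        if factory is None:
            import httpx  # imported lazily so only the default path needs it

            factory = lambda: httpx.AsyncClient(timeout=30.0)
        _client = factory()
    return _client


async def close_client():
    """Close and forget the shared client, e.g. at server shutdown."""
    global _client
    if _client is not None:
        await _client.aclose()
        _client = None
```

Sharing one client lets connections to the same host be pooled and reused across requests. The `factory` argument exists mainly so tests can substitute a fake client.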

Logging & Debugging

To see what the server is doing, you can temporarily add:

import logging, sys
logging.basicConfig(level=logging.INFO, stream=sys.stderr)

Place this near the top of mcp_server.py, after the imports. Because the stdio transport uses stdout for JSON-RPC messages, send logs to stderr only; anything else written to stdout can corrupt the protocol stream.

Common issues:

Empty tool list: the server exited early or crashed; add stderr logging to find out why.
SERPER_API_KEY missing: 401 response or empty search results.
GROQ_API_KEY missing: LLM cleaning fails (exception in get_response_from_llm).
Network timeouts: adjust the timeout passed to the httpx.AsyncClient calls.

Extending

Ideas:

Add caching layer (e.g., sqlite or in-memory dict) to avoid re-fetching same URLs.
Parallelize URL fetch + clean with asyncio.gather() (mind rate limits / LLM cost).
Add another tool (e.g., summarize_diff, list_endpoints).
Provide structured JSON output (list of sources + cleaned text) instead of concatenated string.
Add tests using pytest + pytest-asyncio (mock Serper + LLM APIs).
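
The caching and parallelization ideas can be combined in a few lines of standard-library code. The sketch below is one possible shape, not the project's implementation: `fetch_all` is a hypothetical helper, `fetch` stands in for whatever coroutine fetches and cleans a URL, and the concurrency limit of 3 is an arbitrary default chosen with rate limits in mind.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=3, cache=None):
    """Fetch URLs concurrently, at most `max_concurrency` at a time.

    `fetch` is any coroutine function taking a URL and returning text.
    Results are stored in `cache` (a dict) so repeated URLs are fetched once.
    """
    cache = {} if cache is None else cache
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with semaphore:
            if url not in cache:
                cache[url] = await fetch(url)
            return cache[url]

    # Deduplicate while keeping order, so each URL is requested once per call
    unique = list(dict.fromkeys(urls))
    await asyncio.gather(*(fetch_one(u) for u in unique))
    return [cache[u] for u in urls]
```

Passing the same dict as `cache` across calls gives a simple in-memory cache; a persistent store such as sqlite could replace it behind the same interface. If any fetch raises, `asyncio.gather` propagates the exception to the caller.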

Example Programmatic Use (Without Client Wrapper)

If you want to call the tool directly in a Python script using the client-side MCP library:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def demo():
    # Launch the server as a subprocess and talk to it over stdio
    params = StdioServerParameters(command="uv", args=["run", "mcp_server.py"])
    async with stdio_client(params) as (r, w):
        async with ClientSession(r, w) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            docs = await session.call_tool("get_docs", {"query": "install", "library": "uv"})
            # `content` is a list of content blocks; the tool's string
            # output is in the first block's `.text`
            print(docs.content[0].text[:500])


asyncio.run(demo())

Running With Active Virtualenv

If you have an already activated virtual environment and want to use that instead of the project’s pinned environment, you can force uv to target it:

uv run --active client.py

Otherwise, uv will warn that your active $VIRTUAL_ENV differs from the project .venv but continue using the project environment.

License

Add a license section here (e.g., MIT) if you intend to distribute.

Troubleshooting Cheat Sheet

| Symptom | Cause | Fix |
| --- | --- | --- |
| No tools listed | Server not running / crashed | Add stderr logging; run `uv run mcp_server.py` manually |
| AttributeError on `.text` | Cleaner returned None | Ensure `fetch_url` and the LLM call return an actual string |
| 401 from Serper | Bad/missing API key | Check `.env` and reload shell |
| Empty search results | Query too narrow | Simplify the query or verify the library key |
| High latency | Many sequential LLM chunk calls | Run chunk calls concurrently, or use larger chunks to cut the number of calls |