silkworm-mcp

A full-featured MCP server for building scrapers, with tools for page fetching, HTML parsing, CSS/XPath querying, and spider generation using silkworm-rs and scraper-rs.

This is a full-featured MCP server for building scrapers with:

  • silkworm-rs: async crawling, fetching, link following, and spider execution
  • scraper-rs: fast Rust-backed HTML parsing with CSS and XPath selectors

It is designed for LLM-assisted scraper development, so the server exposes both low-level page inspection tools and higher-level workflow helpers for validating selector plans and generating starter spider code.

An example project is available at https://github.com/BitingSnakes/silkworm-example

Features

  • Fetch pages through silkworm's regular HTTP client or CDP renderer.
  • Query selectors directly against a CDP-rendered DOM snapshot.
  • Analyze inline and linked CSS with tinycss2, then optionally map selectors back onto HTML.
  • Extract structured records from live rendered pages before committing to a full crawl.
  • Cache HTML in a local document store and reuse it via document_handle.
  • Bound the document cache with max-document, max-bytes, and idle-TTL controls.
  • Inspect pages with summaries, parsed DOM trees, prettified HTML, CSS/XPath queries, selector comparisons, and link extraction.
  • Run ad hoc crawls from a structured CrawlBlueprint.
  • Generate reusable silkworm spider templates from the same blueprint and statically validate them, including pattern-specific variants for list-only, list+detail, sitemap/XML, and CDP-heavy crawls.
  • Expose MCP diagnostics plus HTTP /healthz and /readyz routes for production monitoring.
  • Publish MCP resources and prompts so clients can discover workflows, Silkworm idioms, and blueprint schemas.
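
To make the cache bounds in the feature list concrete, here is a minimal sketch of a document store limited by document count, total bytes, and idle time. It is not the server's implementation: the class name, the least-recently-used eviction order, and the injectable clock are assumptions made for illustration.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class BoundedDocumentCache:
    """Illustrative cache with count, byte, and idle-TTL limits (hypothetical)."""

    def __init__(
        self,
        max_count: int,
        max_total_bytes: int,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_count = max_count
        self.max_total_bytes = max_total_bytes
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # handle -> (html, last_access); least recently used entries come first.
        self._docs = OrderedDict()
        self._total_bytes = 0

    def _evict(self, handle: str) -> None:
        html, _ = self._docs.pop(handle)
        self._total_bytes -= len(html.encode("utf-8"))

    def _expire_idle(self) -> None:
        # Remove documents that have not been touched within the TTL.
        now = self._clock()
        idle = [h for h, (_, seen) in self._docs.items() if now - seen > self.ttl_seconds]
        for handle in idle:
            self._evict(handle)

    def put(self, handle: str, html: str) -> None:
        self._expire_idle()
        if handle in self._docs:
            self._evict(handle)
        self._docs[handle] = (html, self._clock())
        self._total_bytes += len(html.encode("utf-8"))
        # Assumed policy: drop least recently used documents until both limits hold.
        while self._docs and (
            len(self._docs) > self.max_count
            or self._total_bytes > self.max_total_bytes
        ):
            self._evict(next(iter(self._docs)))

    def get(self, handle: str) -> Optional[str]:
        self._expire_idle()
        entry = self._docs.get(handle)
        if entry is None:
            return None
        # Reading a document refreshes its idle timer and its recency.
        self._docs[handle] = (entry[0], self._clock())
        self._docs.move_to_end(handle)
        return entry[0]
```

The injectable clock is only there so the idle-TTL behaviour can be exercised without sleeping.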

Tools

  • store_html_document
  • list_documents
  • delete_document
  • clear_documents
  • server_status
  • inspect_document
  • parse_html_document
  • parse_html_fragment
  • prettify_document
  • query_selector
  • analyze_css_selectors
  • find_selectors_by_text
  • compare_selectors
  • extract_links
  • silkworm_fetch
  • silkworm_fetch_cdp
  • query_selector_cdp
  • extract_structured_data_cdp
  • run_crawl_blueprint
  • generate_spider_template
  • validate_spider_code

Run

Install dependencies:

uv sync

Run over stdio for a desktop MCP client:

uv run python mcp_server.py --transport stdio

Run over HTTP:

uv run python mcp_server.py --transport http --host 127.0.0.1 --port 8000

HTTP deployments also expose:

  • GET /healthz: process liveness
  • GET /readyz: readiness, optionally including a CDP browser probe
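
A deployment script can poll the readiness route before sending traffic. The sketch below assumes that /readyz answers with HTTP 200 once the server is ready and with some other status (or no response) otherwise; this README does not state the exact status codes, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses still carry a status code.
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_ready(
    base_url: str,
    attempts: int = 10,
    delay: float = 1.0,
    probe: Callable[[str], int] = http_status,
) -> bool:
    """Poll /readyz until it returns 200 (assumed ready signal) or attempts run out."""
    for attempt in range(attempts):
        if probe(f"{base_url}/readyz") == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For the HTTP command above, the call would be wait_until_ready("http://127.0.0.1:8000").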

The project also exposes a console entrypoint:

uv run silkworm-mcp --transport stdio

Docker

Build the image:

docker build -t silkworm-mcp .

Run the container over HTTP on port 8000:

docker run --rm -it -p 8000:8000 silkworm-mcp

The container entrypoint starts two processes by default:

  • the MCP server over HTTP on 0.0.0.0:8000
  • a bundled Lightpanda browser on 127.0.0.1:9222 for CDP-backed tools such as silkworm_fetch_cdp, query_selector_cdp, and extract_structured_data_cdp

Useful container environment variables:

  • MCP_TRANSPORT (default: http)
  • MCP_HOST (default: 0.0.0.0)
  • MCP_PORT (default: 8000)
  • MCP_PATH
  • LIGHTPANDA_ENABLED (default: 1)
  • LIGHTPANDA_HOST (default: 127.0.0.1)
  • LIGHTPANDA_PORT (default: 9222)
  • LIGHTPANDA_ADVERTISE_HOST (default: unset, falls back to LIGHTPANDA_HOST)
  • LIGHTPANDA_LOG_FORMAT (default: pretty)
  • LIGHTPANDA_LOG_LEVEL (default: info)

When Lightpanda binds to 0.0.0.0 inside a container, set LIGHTPANDA_ADVERTISE_HOST to a reachable hostname such as the container DNS name. Otherwise /json/version can advertise ws://0.0.0.0:9222/, which remote CDP clients cannot use.
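
To show why the advertised address matters, the following sketch rewrites a wildcard WebSocket URL on the client side. It only illustrates the problem; it is not how Lightpanda or the container entrypoint applies LIGHTPANDA_ADVERTISE_HOST, and the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def advertised_ws_url(ws_url: str, advertise_host: Optional[str]) -> str:
    """Swap a wildcard bind address in a CDP WebSocket URL for a reachable host."""
    parts = urlsplit(ws_url)
    # Leave the URL alone unless it points at the wildcard address.
    if not advertise_host or parts.hostname != "0.0.0.0":
        return ws_url
    netloc = advertise_host if parts.port is None else f"{advertise_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


# "silkworm-mcp" stands in for a container DNS name.
print(advertised_ws_url("ws://0.0.0.0:9222/", "silkworm-mcp"))
# prints ws://silkworm-mcp:9222/
```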

Example with custom document-cache limits:

docker run --rm -it \
  -p 8000:8000 \
  -e SILKWORM_MCP_DOCUMENT_MAX_COUNT=256 \
  -e SILKWORM_MCP_DOCUMENT_MAX_TOTAL_BYTES=64000000 \
  -e SILKWORM_MCP_DOCUMENT_TTL_SECONDS=7200 \
  silkworm-mcp

For local development, compose.yml provides the same setup with health checks and a restart policy:

docker compose up --build

Then verify the container is ready:

curl http://127.0.0.1:8000/readyz

Key runtime environment variables:

  • SILKWORM_MCP_DOCUMENT_MAX_COUNT
  • SILKWORM_MCP_DOCUMENT_MAX_TOTAL_BYTES
  • SILKWORM_MCP_DOCUMENT_TTL_SECONDS
  • SILKWORM_MCP_DOCUMENT_STORE_PATH
  • SILKWORM_MCP_LOG_LEVEL
  • SILKWORM_MCP_READINESS_REQUIRE_CDP
  • SILKWORM_MCP_READINESS_CDP_WS_ENDPOINT

Example Workflow

  1. Call silkworm_fetch for the target page.
  2. Use the returned document_handle with inspect_document.
  3. Use parse_html_document or parse_html_fragment when you need exact parser structure, node types, or parser errors.
  4. Use find_selectors_by_text to derive candidates from visible text, then iterate on query_selector, compare_selectors, and analyze_css_selectors when stylesheet structure or hidden elements matter.
  5. For JS-heavy pages, use query_selector_cdp or extract_structured_data_cdp against the rendered DOM.
  6. Use extract_links to verify pagination or detail pages.
  7. Feed the stable plan into run_crawl_blueprint.
  8. Convert the same blueprint into code with generate_spider_template, then check it with validate_spider_code.
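
The first steps of this workflow can be sketched as a short driver. The tool names and document_handle come from this README; everything else is an assumption: call_tool is a stand-in for whatever method your MCP client uses to invoke a tool, and the argument names (url, text) and result shapes are hypothetical, so check the server's tool schemas before relying on them.

```python
from typing import Any, Callable, Dict

# Stand-in signature: (tool_name, arguments) -> result dictionary.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def explore_page(call_tool: ToolCaller, url: str, visible_text: str) -> Dict[str, Any]:
    """Run the fetch, inspect, selector-discovery, and link-check steps in order."""
    # Step 1: fetch the page; the result is assumed to include document_handle.
    fetched = call_tool("silkworm_fetch", {"url": url})
    handle = fetched["document_handle"]

    # Step 2: summarise the cached document.
    summary = call_tool("inspect_document", {"document_handle": handle})

    # Step 4: derive candidate selectors from text visible on the page.
    candidates = call_tool(
        "find_selectors_by_text",
        {"document_handle": handle, "text": visible_text},
    )

    # Step 6: list links to check pagination or detail pages.
    links = call_tool("extract_links", {"document_handle": handle})

    return {
        "document_handle": handle,
        "summary": summary,
        "candidates": candidates,
        "links": links,
    }
```

Passing the caller in as a parameter keeps the sketch independent of any particular client library and lets the call order be checked with a fake.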

Useful built-in MCP references:

  • silkworm://reference/overview
  • silkworm://reference/silkworm-cheatsheet
  • silkworm://reference/silkworm-playbook
  • silkworm://reference/template-variants
  • silkworm://reference/scraper-rs-cheatsheet
  • silkworm://reference/crawl-blueprint-schema

Use transport: "cdp" when pages require JavaScript rendering. run_crawl_blueprint will connect to the configured CDP endpoint, and generate_spider_template will emit a starter spider that runs through CDPClient instead of the default HTTP client.

Both run_crawl_blueprint and generate_spider_template accept a variant override. When omitted, they infer a crawl style from the blueprint:

  • list_only: listing pages emit items directly, with optional pagination
  • list_detail: listing pages schedule detail requests, which are handled by a separate parse_detail callback
  • sitemap_xml: sitemap/XML entrypoints are fetched with meta={"allow_non_html": True} and parsed before scheduling page requests
  • cdp_heavy: rendered-page crawls keep the CDP execution path and a general-purpose parse/follow flow

run_crawl_blueprint returns the resolved execution_variant, and generate_spider_template returns the resolved template_variant, so clients can see which crawl shape was actually used.
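
The override-then-infer behaviour can be sketched as follows. The four variant names and the transport field come from this README; the inference rules, their order, and the start_urls and detail_link_selector fields are hypothetical, since the server's actual heuristics are not described here.

```python
from typing import Any, Mapping, Optional

VARIANTS = ("list_only", "list_detail", "sitemap_xml", "cdp_heavy")


def resolve_variant(blueprint: Mapping[str, Any], override: Optional[str] = None) -> str:
    """Pick a crawl variant: an explicit override wins, otherwise guess from the blueprint."""
    if override is not None:
        if override not in VARIANTS:
            raise ValueError(f"unknown variant: {override!r}")
        return override

    # Hypothetical inference rules; the real schema is published at
    # silkworm://reference/crawl-blueprint-schema.
    start_urls = blueprint.get("start_urls", [])
    if start_urls and all(url.lower().endswith(".xml") for url in start_urls):
        return "sitemap_xml"
    if blueprint.get("transport") == "cdp":
        return "cdp_heavy"
    if blueprint.get("detail_link_selector"):
        return "list_detail"
    return "list_only"
```

Whatever rules the server applies, comparing the returned execution_variant or template_variant with the variant you expected is a cheap way to catch a blueprint that was interpreted differently than intended.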

Testing

Run the automated test suite with:

just test

Acknowledgement

This project builds on the excellent work behind FastMCP, silkworm-rs, and scraper-rs. Together they provide the MCP server framework, crawling runtime, and HTML parsing foundations that make this project possible.
