crawl4ai-mcp

A minimal MCP server for agent-friendly web extraction and search. Offers two tools: fetching real pages with Playwright and Crawl4AI, and searching across 7 engines with automatic fallback.



Quick entry

| Audience | Read this |
| --- | --- |
| Human developer | README.zh-CN.md / README.md |
| Living in the AI era, delegating your remaining sanity to an agent | README_AGENT.md |

At a glance

| Item | Reality in this repo |
| --- | --- |
| MCP tools | 2 tools: fetch_urls + search_web |
| Single-page fetch | urls: ["https://example.com"] |
| Web search | search_web(query="...", engine="auto"), 7 engines with auto fallback |
| Search engines | DuckDuckGo · Bing · Google · Yandex · Sogou · 360Search · Baidu |
| Output | title + content + links + blocked + llm_used/llm_error |
| Non-LLM mode | First-class, default, usable without any model |
| LLM mode | Off by default. Enabled only with use_llm=true + optional llm_instruction |
| Fallback | Missing/failed LLM call automatically falls back to non-LLM result |
| Anti-bot realism | proxy / cookies / persistent profile / randomized browser behavior |
| License | AGPL-3.0-or-later |

How it works

Fetch flow:

---
config:
  theme: base
  themeVariables:
    primaryColor: "#FAF9F5"
    primaryTextColor: "#1A1A1A"
    primaryBorderColor: "#D97757"
    lineColor: "#8B5E3C"
    secondaryColor: "#F5F1E8"
    tertiaryColor: "#FAF9F5"
    fontFamily: "ui-sans-serif, system-ui, -apple-system, sans-serif"
    fontSize: "14px"
---
flowchart LR
    A[URL list] --> B[Playwright + Crawl4AI]
    B --> C{Fast path enough?}
    C -- Yes --> D[Markdown / HTML]
    C -- No --> E[Stronger fallback]
    E --> D
    D --> F{use_llm?}
    F -- No --> G[Return result]
    F -- Yes --> H[OpenAI-compatible cleanup]
    H --> I{LLM success?}
    I -- Yes --> J[Return enhanced result]
    I -- No --> G

Search flow:

---
config:
  theme: base
  themeVariables:
    primaryColor: "#FAF9F5"
    primaryTextColor: "#1A1A1A"
    primaryBorderColor: "#D97757"
    lineColor: "#8B5E3C"
    secondaryColor: "#F5F1E8"
    tertiaryColor: "#FAF9F5"
    fontFamily: "ui-sans-serif, system-ui, -apple-system, sans-serif"
    fontSize: "14px"
---
flowchart LR
    A[query + engine] --> B{engine=auto?}
    B -- Yes --> C[Detect language]
    C --> D[Build engine plan]
    B -- No --> E[Use specified engine]
    D --> F[Try engines in order]
    E --> F
    F --> G{Results?}
    G -- Yes --> H[Aggregate + deduplicate]
    G -- No, next engine --> F
    H --> I[Return results]

Installation

Quick install (recommended)

Step 1: Create a virtual environment

# macOS/Linux - using system Python 3 (3.10-3.13)
python3 -m venv crawl4ai
source crawl4ai/bin/activate

# Windows
python -m venv crawl4ai
crawl4ai\Scripts\activate

Step 2: Install

pip install --upgrade pip
pip install crawl4agent
playwright install chromium

Alternative methods

If python3 is too old (3.9 or below):

# Use a specific Python version (3.10, 3.11, 3.12, or 3.13)
python3.12 -m venv crawl4ai
source crawl4ai/bin/activate
pip install crawl4agent
playwright install chromium

Using conda:

conda create -n crawl4ai python=3.12
conda activate crawl4ai
pip install crawl4agent
playwright install chromium

Using pipx (global command):

pipx install crawl4agent
crawl4ai-mcp --help

Troubleshooting

Problem: "pip install" uses Python 2.7

# macOS: use python3 explicitly
python3 -m pip install crawl4agent

# Or check which pip you're using
which pip
pip --version

Problem: "No matching distribution found for crawl4agent"

  • Check Python version: python3 --version (must be 3.10-3.13)
  • Upgrade pip: python3 -m pip install --upgrade pip

Problem: "playwright install" fails

  • Use mirror (China): export PLAYWRIGHT_DOWNLOAD_HOST=https://npmmirror.com/mirrors/playwright/
  • Then: python3 -m playwright install chromium

Why this project exists

Most generic “web fetch” tools either fail on JS-heavy pages or return too much boilerplate. This project focuses on four things:

  • Non-LLM quality first: usable even with zero model config
  • Minimal MCP surface: easier for agents, easier to maintain
  • Pragmatic anti-bot workflow: proxy / cookies / persistent profile are first-class
  • Golden regression review: full markdown outputs can be saved and inspected page by page

Core capabilities

Non-LLM mode

| Capability | Actual behavior |
| --- | --- |
| Rendering | Real browser rendering via Playwright |
| Extraction | Crawl4AI markdown/html extraction |
| Fallback | Fast path → stronger path when content is too thin |
| Cleanup | Remove obvious noise, compress blanks, strip data-image placeholders |
| Site tuning | Medium / Claude Docs / GitHub and other mainstream sites |
| ChatGPT shared links | Full conversation extraction from chatgpt.com/share/... URLs |
| Video transcripts | YouTube / Bilibili URLs prefer subtitle extraction via yt-dlp, then fall back to webpage extraction |
| Block detection | blocked=true for likely verification/interstitial output |
| Batch control | Bounded concurrency via concurrency |
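
The "Batch control" row can be illustrated with a small sketch. It assumes an asyncio semaphore is used to cap in-flight fetches; `fetch_all`, `fake_fetch`, and `stats` are hypothetical names for illustration, not the project's internals.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], Awaitable[dict]],
    concurrency: int = 3,
) -> List[dict]:
    """Fetch every URL while keeping at most `concurrency` fetches in flight."""
    semaphore = asyncio.Semaphore(max(1, concurrency))

    async def guarded(url: str) -> dict:
        # Each fetch waits here until a slot is free.
        async with semaphore:
            return await fetch_one(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(url) for url in urls))


# Stub fetcher (hypothetical) that records how many calls overlap.
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url: str) -> dict:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stands in for real browser work
    stats["in_flight"] -= 1
    return {"url": url, "blocked": False}


demo_urls = [f"https://example.com/{i}" for i in range(6)]
pages = asyncio.run(fetch_all(demo_urls, fake_fetch, concurrency=2))
```

A semaphore keeps the batch simple: every URL gets its own task, but only `concurrency` of them hold a browser slot at any moment.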

Optional LLM mode

| Input | Meaning |
| --- | --- |
| use_llm=true | Turn on post-cleanup with an OpenAI-compatible model |
| llm_instruction | Tell the model what to keep / remove |

Important reality check:

  • With llm_instruction, the prompt is constraint-heavy and biased toward preserving original lines.
  • Without llm_instruction, the model does a more generic “clean readable markdown” pass.
  • If the LLM call fails for any reason, the tool returns the original non-LLM extraction plus llm_used=false and llm_error.
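
The fallback behavior described above can be sketched as follows. This is an illustrative outline only: `apply_optional_llm` and the `cleanup` callable are hypothetical, and the exact `llm_error` text the server produces is not documented here.

```python
from typing import Callable, Optional


def apply_optional_llm(
    content: str,
    use_llm: bool,
    cleanup: Optional[Callable[[str], str]] = None,
) -> dict:
    """Return LLM-cleaned content when possible, else the original extraction."""
    result = {"content": content, "llm_used": False, "llm_error": None}
    if not use_llm:
        return result  # non-LLM mode is the default path
    if cleanup is None:
        # Assumption: missing model configuration degrades like a failed call.
        result["llm_error"] = "LLM is not configured"
        return result
    try:
        cleaned = cleanup(content)
    except Exception as exc:
        # Any failure keeps the original content and records the reason.
        result["llm_error"] = f"{type(exc).__name__}: {exc}"
        return result
    result["content"] = cleaned
    result["llm_used"] = True
    return result
```

The key property is that the caller always receives usable content; `llm_used` and `llm_error` only describe whether the optional step was applied.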

MCP Tools

fetch_urls

{
  "urls": ["https://a.com", "https://b.com"],
  "format": "markdown",
  "max_chars": 200000,
  "concurrency": 3,
  "use_llm": false,
  "llm_instruction": "keep only the tutorial body and in-body references"
}

Use a single-element list if you only need one page.

For supported video URLs (youtube.com, youtu.be, bilibili.com, b23.tv), fetch_urls prefers transcript extraction and returns readable markdown built from subtitles when available.
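
MCP hosts normally build the tool call for you. For reference, here is a sketch of the JSON-RPC 2.0 `tools/call` message that carries the arguments above. `build_fetch_urls_call` is a hypothetical helper; a real session also needs the MCP initialize handshake first, which is omitted.

```python
import json
from typing import List, Optional


def build_fetch_urls_call(
    urls: List[str],
    request_id: int = 1,
    fmt: str = "markdown",
    use_llm: bool = False,
    llm_instruction: Optional[str] = None,
) -> str:
    """Serialize a tools/call request for the fetch_urls tool."""
    arguments = {"urls": urls, "format": fmt, "use_llm": use_llm}
    if llm_instruction:
        # Only sent when provided; it is optional on the tool side.
        arguments["llm_instruction"] = llm_instruction
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "fetch_urls", "arguments": arguments},
    }
    return json.dumps(message)


request_text = build_fetch_urls_call(["https://example.com"])
```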

Return shape

| Field | Meaning |
| --- | --- |
| url | Original URL |
| final_url | Final resolved URL after redirects |
| title | Extracted title |
| content | Markdown or HTML |
| content_format | markdown or html |
| links | Normalized extracted links |
| video_metadata | Present for supported video transcript extraction results |
| blocked | Likely anti-bot / verification / denied result |
| llm_used | Whether LLM enhancement was actually applied |
| llm_error | Why the LLM step degraded |
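
A caller may want to sort a batch by these fields before using the content. The sketch below assumes the tool returns one object per URL with the fields listed above; `triage` is a hypothetical helper, not part of the project.

```python
from typing import Dict, List


def triage(results: List[dict]) -> Dict[str, List[str]]:
    """Group result URLs into usable, blocked, and LLM-degraded buckets."""
    report: Dict[str, List[str]] = {"ok": [], "blocked": [], "llm_degraded": []}
    for item in results:
        if item.get("blocked"):
            # Likely a verification or interstitial page; content is not trustworthy.
            report["blocked"].append(item["url"])
            continue
        report["ok"].append(item["url"])
        if item.get("llm_error"):
            # Content is still the non-LLM extraction, so it stays in "ok" too.
            report["llm_degraded"].append(item["url"])
    return report


sample = [
    {"url": "https://a.com", "blocked": False, "llm_used": False, "llm_error": None},
    {"url": "https://b.com", "blocked": True, "llm_used": False, "llm_error": None},
    {"url": "https://c.com", "blocked": False, "llm_used": False, "llm_error": "timeout"},
]
report = triage(sample)
```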

search_web

{
  "query": "crawl4ai web scraping",
  "engine": "auto",
  "max_results": 10,
  "lang": ""
}
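
| Parameter | Default | Description |
| --- | --- | --- |
| query | (required) | Search query string |
| engine | auto | Engine to use: auto, google, bing, duckduckgo, baidu. Yandex, Sogou, and 360Search are also listed as supported engines; check crawl4agent search --help for their exact identifiers |
| max_results | 10 | Maximum number of results |
| lang | "" | Language hint (e.g. en, zh-CN) |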

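When engine="auto", the server tries engines in fallback order: DuckDuckGo → Bing → Google → Baidu. As the search flow shows, the engine plan is built after detecting the query language, so the order may differ per query. The first engine that returns results wins, and the engines that failed before it are reported in fallback_engines_tried.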

Search return shape

| Field | Meaning |
| --- | --- |
| engine | Which engine actually returned results |
| query | Original query |
| results | List of {title, url, snippet} |
| total | Number of results |
| fallback_engines_tried | Engines that failed before the successful one |
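
The fallback loop and the return shape above can be sketched together. This is a minimal illustration, not the server's code: the engine callables, the URL-based deduplication, and `engine: None` when every engine fails are all assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical engine signature: (query, max_results) -> list of
# {"title", "url", "snippet"} dicts; it may raise on failure.
Engine = Callable[[str, int], List[Dict[str, str]]]


def search_with_fallback(query: str, plan: Dict[str, Engine], max_results: int = 10) -> dict:
    """Try engines in plan order; the first one that returns results wins."""
    tried: List[str] = []
    for name, engine in plan.items():  # dicts preserve insertion order
        try:
            results = engine(query, max_results)
        except Exception:
            results = []  # a failing engine is treated like an empty one
        if results:
            seen = set()
            unique = []
            for item in results:  # deduplicate by URL, keeping first hits
                if item["url"] not in seen:
                    seen.add(item["url"])
                    unique.append(item)
            unique = unique[:max_results]
            return {"engine": name, "query": query, "results": unique,
                    "total": len(unique), "fallback_engines_tried": tried}
        tried.append(name)
    return {"engine": None, "query": query, "results": [],
            "total": 0, "fallback_engines_tried": tried}


def broken_engine(query: str, max_results: int) -> List[Dict[str, str]]:
    raise RuntimeError("engine unavailable")


def stub_engine(query: str, max_results: int) -> List[Dict[str, str]]:
    return [
        {"title": "A", "url": "https://a.example", "snippet": "first"},
        {"title": "A again", "url": "https://a.example", "snippet": "duplicate"},
    ]


outcome = search_with_fallback("crawl4ai", {"duckduckgo": broken_engine, "bing": stub_engine})
```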

Anti-bot realism

The server already includes randomized browser behavior in code:

| Mechanism | Actual status |
| --- | --- |
| Random viewport | Yes |
| Random user agent mode | Yes, when explicit UA is not provided |
| Delay jitter | Yes |
| override_navigator | Yes |
| simulate_user | Yes, in stronger fallback mode |
| Proxy / cookies / persistent profile | Supported via env vars |
| Cloudflare bypass | Enhanced browser fingerprinting + configurable wait strategies |

Note: For websites outside mainland China (Medium, ProductHunt, etc.), using a proxy is recommended. The server supports HTTP/HTTPS/SOCKS5 proxies via the CRAWL4AI_MCP_PROXY environment variable.

Proxy input formats

CRAWL4AI_MCP_PROXY accepts all of these:

| Input | Interpreted as |
| --- | --- |
| http://127.0.0.1:7890 | HTTP proxy |
| https://127.0.0.1:7890 | HTTPS proxy |
| socks5://127.0.0.1:7890 | SOCKS5 proxy |
| socket5://127.0.0.1:7890 | Auto-normalized to socks5://... |
| 127.0.0.1:7890 | Auto-normalized to http://127.0.0.1:7890 |
| 7890 | Auto-normalized to http://127.0.0.1:7890 |

None of this amounts to “perfect stealth”; what the server does offer is human-like randomization and practical anti-bot knobs.
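
The normalization rules in the proxy input table can be sketched in a few lines. `normalize_proxy` is a hypothetical helper that mirrors the table, not the server's actual implementation, and rejecting empty input is an assumption.

```python
def normalize_proxy(value: str) -> str:
    """Normalize a proxy string following the proxy input table."""
    raw = value.strip()
    if not raw:
        raise ValueError("empty proxy value")
    # Port-only input such as "7890" is treated as a local HTTP proxy.
    if raw.isdigit():
        return f"http://127.0.0.1:{raw}"
    # Tolerate the "socket5://" misspelling of "socks5://".
    if raw.lower().startswith("socket5://"):
        return "socks5://" + raw[len("socket5://"):]
    # Inputs that already carry a scheme pass through unchanged.
    if "://" in raw:
        return raw
    # Bare host:port defaults to HTTP.
    return f"http://{raw}"
```

Accepting loose input keeps MCP host configs short, since a value like `7890` is enough for a local proxy.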


Quickstart

Conda

conda env create -f environment.yml
conda activate crawl4ai-mcp
python -m playwright install
crawl4ai-mcp

venv

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e '.[dev]'
python -m playwright install
crawl4ai-mcp

MCP server config example

{
  "mcpServers": {
    "crawl4ai": {
      "command": "crawl4ai-mcp",
      "env": {
        "CRAWL4AI_MCP_HEADLESS": "true",
        "CRAWL4AI_MCP_PROXY": "127.0.0.1:7890",
        "CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS": "30000",
        "CRAWL4AI_MCP_WAIT_UNTIL": "load",

        "OPENAI_BASE_URL": "https://your-openai-compatible-host",
        "OPENAI_API_KEY": "your-api-key",
        "OPENAI_MODEL": "your-model-name"
      }
    }
  }
}

LLM-related env vars are optional, and use_llm is off by default at call time. If any LLM env var is missing or invalid, or if the model call fails, the server automatically falls back to non-LLM extraction.


Runtime configuration

| Env var | Purpose |
| --- | --- |
| CRAWL4AI_MCP_HEADLESS | Run browser headless |
| CRAWL4AI_MCP_PROXY | Upstream proxy; supports http://, https://, socks5://, host:port, and port-only |
| CRAWL4AI_MCP_COOKIES_JSON | Playwright storage state JSON |
| CRAWL4AI_MCP_YTDLP_COOKIES_FROM_BROWSER | Browser cookies source for video transcript extraction, e.g. chrome, firefox:default |
| CRAWL4AI_MCP_YTDLP_COOKIEFILE | Netscape cookies.txt path for yt-dlp video transcript extraction |
| CRAWL4AI_MCP_USE_PERSISTENT_CONTEXT | Reuse browser profile |
| CRAWL4AI_MCP_USER_DATA_DIR | Profile directory |
| CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS | Maximum wait for a single navigation, in milliseconds (default 30000) |
| CRAWL4AI_MCP_WAIT_UNTIL | Page readiness strategy (default load) |
| OPENAI_BASE_URL | OpenAI-compatible base URL |
| OPENAI_API_KEY | API key |
| OPENAI_MODEL | Model name |
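
To show how a few of these variables and their documented defaults fit together, here is a sketch of an env loader. `RuntimeSettings` and `load_settings` are hypothetical names; the accepted boolean spellings and the headless default of true are assumptions, since the table does not state them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeSettings:
    headless: bool
    proxy: Optional[str]
    navigation_timeout_ms: int
    wait_until: str


def load_settings(env: Mapping[str, str] = os.environ) -> RuntimeSettings:
    """Read a subset of the env vars above into a typed settings object."""
    # Assumption: "1", "true", "yes", "on" mean true; headless defaults to true.
    headless_raw = env.get("CRAWL4AI_MCP_HEADLESS", "true").strip().lower()
    return RuntimeSettings(
        headless=headless_raw in {"1", "true", "yes", "on"},
        proxy=env.get("CRAWL4AI_MCP_PROXY") or None,
        # Documented defaults: 30000 ms and "load".
        navigation_timeout_ms=int(env.get("CRAWL4AI_MCP_NAVIGATION_TIMEOUT_MS", "30000")),
        wait_until=env.get("CRAWL4AI_MCP_WAIT_UNTIL", "load"),
    )
```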

One-shot CLI

This project now exposes a stateless one-shot CLI in addition to the MCP stdio server.

Fetch a single URL once and print JSON:

crawl4agent fetch "https://obsidian.md/help/cli" --format markdown

Search the web once and print JSON:

crawl4agent search "agent framework" --engine auto --max-results 5

Use proxy and browser cookies for video transcript extraction:

crawl4agent fetch "https://www.youtube.com/watch?v=OFfwN23hR8U" \
  --proxy http://127.0.0.1:7890 \
  --cookies-from-browser chrome

Run golden smoke once and print a JSON array:

crawl4agent smoke --out-dir ./_golden_outputs

The existing crawl4ai-mcp command remains the MCP stdio server entrypoint for MCP hosts.

Available help surfaces:

crawl4agent --help
crawl4agent fetch --help
crawl4agent search --help
crawl4agent smoke --help
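
Because the one-shot CLI prints JSON, it can be driven from a script. The sketch below uses only the flags shown above; `build_fetch_command` and `run_fetch` are hypothetical wrappers, and it assumes stdout contains nothing but the JSON document.

```python
import json
import subprocess
from typing import List, Optional


def build_fetch_command(
    url: str,
    fmt: str = "markdown",
    proxy: Optional[str] = None,
    cookies_from_browser: Optional[str] = None,
) -> List[str]:
    """Build the argv list for a one-shot `crawl4agent fetch` call."""
    command = ["crawl4agent", "fetch", url, "--format", fmt]
    if proxy:
        command += ["--proxy", proxy]
    if cookies_from_browser:
        command += ["--cookies-from-browser", cookies_from_browser]
    return command


def run_fetch(url: str, **options):
    """Run the CLI once and parse its JSON output (requires crawl4agent on PATH)."""
    completed = subprocess.run(
        build_fetch_command(url, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Passing an argv list instead of a shell string avoids quoting problems with URLs that contain `?` or `&`.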

Golden smoke regression

CRAWL4AI_MCP_SMOKE_DIR=./_golden_outputs .venv/bin/python -m crawl4ai_mcp.smoke_golden

For overseas video URLs, a local proxy is often needed:

CRAWL4AI_MCP_PROXY=http://127.0.0.1:7890 \
CRAWL4AI_MCP_SMOKE_DIR=./_golden_outputs \
.venv/bin/python -m crawl4ai_mcp.smoke_golden

This writes full markdown outputs to _golden_outputs/ so you can inspect extraction quality page by page.

The golden set now includes the earlier baseline URLs plus ainew.me, openclaw, watcha, producthunt, mydrivers, caihongtu, openrouter, mobile Douban, and video pages from YouTube / Bilibili. For sites outside mainland China, proxy-based verification is recommended.

Some overseas sites may still return Cloudflare or similar verification pages even when a proxy is configured. In those cases the server now marks them with blocked=true. The recommended path is: better proxy quality, valid cookies, or a persistent browser profile after manual verification.

For some video golden URLs, subtitle extraction may require login. If yt-dlp reports login-required subtitles, configure either CRAWL4AI_MCP_YTDLP_COOKIES_FROM_BROWSER or CRAWL4AI_MCP_YTDLP_COOKIEFILE before running golden smoke.


Prior art


License

This project is licensed under AGPL-3.0-or-later.
