
# MCP Server for Crawl4AI

TypeScript implementation of an MCP server for Crawl4AI. Provides tools for web crawling, content extraction, and browser automation, giving AI systems access to web content through 15 specialized tools.
## Table of Contents

- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Client-Specific Instructions](#client-specific-instructions)
- [Available Tools](#available-tools)
- [Advanced Configuration](#advanced-configuration)
- [Development](#development)
- [License](#license)
## Prerequisites

- Node.js 18+ and npm
- A running Crawl4AI server
## Quick Start

### 1. Start the Crawl4AI server (for example, via local Docker)

```bash
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.2
```

Note: Tested with Crawl4AI version 0.7.2.

### 2. Add to your MCP client

This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LM Studio, etc.).
#### Using npx (Recommended)

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```
#### Using local installation

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "node",
      "args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}
```
#### With all optional variables

```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235",
        "CRAWL4AI_API_KEY": "your-api-key",
        "SERVER_NAME": "custom-name",
        "SERVER_VERSION": "1.0.0"
      }
    }
  }
}
```
## Configuration

### Environment Variables

```bash
# Required
CRAWL4AI_BASE_URL=http://localhost:11235

# Optional - Server Configuration
CRAWL4AI_API_KEY=        # If your server requires auth
SERVER_NAME=crawl4ai-mcp # Custom name for the MCP server
SERVER_VERSION=1.0.0     # Custom version
```
## Client-Specific Instructions

### Claude Desktop

Add the configuration to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS).

### Claude Code

```bash
claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts
```
### Other MCP Clients

Consult your client's documentation for MCP server configuration. The key details:

- Command: `npx mcp-crawl4ai-ts` or `node /path/to/dist/index.js`
- Required env: `CRAWL4AI_BASE_URL`
- Optional env: `CRAWL4AI_API_KEY`, `SERVER_NAME`, `SERVER_VERSION`
## Available Tools
### 1. get_markdown - Extract content as markdown with filtering

```typescript
{
  url: string,                       // Required: URL to extract markdown from
  filter?: 'raw'|'fit'|'bm25'|'llm', // Filter type (default: 'fit')
  query?: string,                    // Query for bm25/llm filters
  cache?: string                     // Cache-bust parameter (default: '0')
}
```

Extracts content as markdown with various filtering options. Use the 'bm25' or 'llm' filter with a query for targeted content extraction.
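The filter/query rules above can be sketched as a small validation helper. This is illustrative only (the helper is not part of the server); the parameter names and defaults come from the schema above.

```typescript
// Hypothetical helper showing the get_markdown parameter rules:
// 'bm25' and 'llm' filters need a query, the others ignore it.
type MarkdownFilter = 'raw' | 'fit' | 'bm25' | 'llm';

interface GetMarkdownArgs {
  url: string;
  filter?: MarkdownFilter;
  query?: string;
  cache?: string;
}

function buildGetMarkdownArgs(args: GetMarkdownArgs): GetMarkdownArgs {
  const filter = args.filter ?? 'fit'; // default filter
  if ((filter === 'bm25' || filter === 'llm') && !args.query) {
    throw new Error(`filter '${filter}' requires a query`);
  }
  // Apply the documented defaults without clobbering caller-supplied values.
  return { cache: '0', ...args, filter };
}
```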
### 2. capture_screenshot - Capture webpage screenshot

```typescript
{
  url: string,                 // Required: URL to capture
  screenshot_wait_for?: number // Seconds to wait before screenshot (default: 2)
}
```

Returns a base64-encoded PNG. Note: This tool is stateless - for screenshots after JS execution, use `crawl` with `screenshot: true`.
### 3. generate_pdf - Convert webpage to PDF

```typescript
{
  url: string // Required: URL to convert to PDF
}
```

Returns a base64-encoded PDF. Stateless tool - for PDFs after JS execution, use `crawl` with `pdf: true`.
### 4. execute_js - Execute JavaScript and get return values

```typescript
{
  url: string,               // Required: URL to load
  scripts: string | string[] // Required: JavaScript to execute
}
```

Executes JavaScript and returns the results. Each script can use 'return' to pass values back. Stateless - for persistent JS execution, use `crawl` with `js_code`.
### 5. batch_crawl - Crawl multiple URLs concurrently

```typescript
{
  urls: string[],          // Required: List of URLs to crawl
  max_concurrent?: number, // Parallel request limit (default: 5)
  remove_images?: boolean, // Remove images from output (default: false)
  bypass_cache?: boolean   // Bypass cache for all URLs (default: false)
}
```

Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance.
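The `max_concurrent` behavior follows the standard bounded-concurrency pattern. A sketch of that pattern (not the server's actual implementation):

```typescript
// Run at most `maxConcurrent` tasks at a time, preserving input order
// in the results array.
async function mapConcurrent<T, R>(
  items: T[],
  maxConcurrent: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Each worker repeatedly claims the next unprocessed index.
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(maxConcurrent, items.length) }, worker),
  );
  return results;
}
```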
### 6. smart_crawl - Auto-detect and handle different content types

```typescript
{
  url: string,           // Required: URL to crawl
  max_depth?: number,    // Maximum depth for recursive crawling (default: 2)
  follow_links?: boolean,// Follow links in content (default: true)
  bypass_cache?: boolean // Bypass cache (default: false)
}
```

Intelligently detects the content type (HTML/sitemap/RSS) and processes it accordingly.
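One simple way such detection can work is classifying by URL path, falling back to HTML; this is an illustrative sketch only, and the actual tool may also inspect response headers and body content.

```typescript
type ContentKind = 'sitemap' | 'rss' | 'html';

// Guess the content kind from the URL path alone (heuristic sketch).
function detectContentKind(url: string): ContentKind {
  const path = new URL(url).pathname.toLowerCase();
  if (path.endsWith('sitemap.xml') || path.endsWith('sitemap_index.xml')) return 'sitemap';
  if (path.endsWith('.rss') || path.endsWith('feed.xml') || path.endsWith('/feed')) return 'rss';
  return 'html';
}
```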
### 7. get_html - Get sanitized HTML for analysis

```typescript
{
  url: string // Required: URL to extract HTML from
}
```

Returns preprocessed HTML optimized for structure analysis. Use it for building schemas or analyzing patterns.
### 8. extract_links - Extract and categorize page links

```typescript
{
  url: string,         // Required: URL to extract links from
  categorize?: boolean // Group by type (default: true)
}
```

Extracts all links and groups them by type: internal, external, social media, documents, images.
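The grouping above can be sketched as a classification function. The host list and extension patterns here are illustrative assumptions, not the tool's actual heuristics:

```typescript
type LinkCategory = 'internal' | 'external' | 'social' | 'documents' | 'images';

const SOCIAL_HOSTS = ['twitter.com', 'x.com', 'facebook.com', 'linkedin.com', 'instagram.com', 'youtube.com'];
const DOC_EXT = /\.(pdf|docx?|xlsx?|pptx?)$/i;
const IMG_EXT = /\.(png|jpe?g|gif|webp|svg)$/i;

// Classify a single link relative to the page it was found on.
function categorizeLink(link: string, pageUrl: string): LinkCategory {
  const u = new URL(link, pageUrl); // resolves relative links
  if (IMG_EXT.test(u.pathname)) return 'images';
  if (DOC_EXT.test(u.pathname)) return 'documents';
  if (u.hostname === new URL(pageUrl).hostname) return 'internal';
  if (SOCIAL_HOSTS.some((h) => u.hostname === h || u.hostname.endsWith('.' + h))) return 'social';
  return 'external';
}
```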
### 9. crawl_recursive - Deep crawl website following links

```typescript
{
  url: string,              // Required: Starting URL
  max_depth?: number,       // Maximum depth to crawl (default: 3)
  max_pages?: number,       // Maximum pages to crawl (default: 50)
  include_pattern?: string, // Regex pattern for URLs to include
  exclude_pattern?: string  // Regex pattern for URLs to exclude
}
```

Crawls a website following internal links up to the specified depth. Returns content from all discovered pages.
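A sketch of how `include_pattern` and `exclude_pattern` could be applied when deciding whether to follow a discovered URL (illustrative; exclusion winning over inclusion is an assumption):

```typescript
// Return true when the URL passes both regex filters. An unset pattern
// imposes no constraint; exclude is checked first.
function shouldCrawl(url: string, includePattern?: string, excludePattern?: string): boolean {
  if (excludePattern && new RegExp(excludePattern).test(url)) return false;
  if (includePattern && !new RegExp(includePattern).test(url)) return false;
  return true;
}
```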
### 10. parse_sitemap - Extract URLs from XML sitemaps

```typescript
{
  url: string,            // Required: Sitemap URL (e.g., /sitemap.xml)
  filter_pattern?: string // Optional: Regex pattern to filter URLs
}
```

Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns.
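The core of sitemap parsing is pulling `<loc>` entries out of the XML and applying `filter_pattern`. A minimal sketch (a production parser should use a real XML library):

```typescript
// Extract <loc> URLs from sitemap XML, optionally filtered by a regex.
function parseSitemapUrls(xml: string, filterPattern?: string): string[] {
  const urls = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
  return filterPattern ? urls.filter((u) => new RegExp(filterPattern).test(u)) : urls;
}
```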
### 11. crawl - Advanced web crawling with full configuration

```typescript
{
  url: string, // URL to crawl

  // Browser Configuration
  browser_type?: 'chromium'|'firefox'|'webkit', // Browser engine
  viewport_width?: number,  // Browser width (default: 1080)
  viewport_height?: number, // Browser height (default: 600)
  user_agent?: string,      // Custom user agent
  proxy_server?: string,    // Proxy URL
  proxy_username?: string,  // Proxy auth
  proxy_password?: string,  // Proxy password
  cookies?: Array<{name, value, domain}>, // Pre-set cookies
  headers?: Record<string,string>,        // Custom headers

  // Crawler Configuration
  word_count_threshold?: number,     // Min words per block (default: 200)
  excluded_tags?: string[],          // HTML tags to exclude
  remove_overlay_elements?: boolean, // Remove popups/modals
  js_code?: string | string[],       // JavaScript to execute
  wait_for?: string,                 // Wait condition (selector or JS)
  wait_for_timeout?: number,         // Wait timeout (default: 30000)
  delay_before_scroll?: number,      // Pre-scroll delay
  scroll_delay?: number,             // Between-scroll delay
  process_iframes?: boolean,         // Include iframe content
  exclude_external_links?: boolean,  // Remove external links
  screenshot?: boolean,              // Capture screenshot
  pdf?: boolean,                     // Generate PDF
  session_id?: string,               // Reuse browser session (only works with the crawl tool)
  cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED', // Cache control
  extraction_type?: 'llm',           // Only 'llm' extraction is supported via REST API
  extraction_schema?: object,        // Schema for structured extraction
  extraction_instruction?: string,   // Natural language extraction prompt
  timeout?: number,                  // Overall timeout (default: 60000)
  verbose?: boolean                  // Detailed logging
}
```
### 12. create_session - Create persistent browser session

```typescript
{
  session_id?: string,  // Optional: Custom ID (auto-generated if not provided)
  initial_url?: string, // Optional: URL to load when creating session
  browser_type?: 'chromium'|'firefox'|'webkit' // Optional: Browser engine (default: 'chromium')
}
```

Creates a persistent browser session for maintaining state across multiple requests. Returns the session_id for use with the `crawl` tool.

Important: Only the `crawl` tool supports session_id. Other tools are stateless and create a new browser each time.
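An illustrative request sequence for stateful crawling: create a session, then pass its id to subsequent `crawl` calls so they share one browser. Argument shapes follow the schemas above; the URLs and ids are example values.

```typescript
// Step 1: arguments for create_session (example values).
const createSessionArgs = {
  session_id: 'my-session',
  initial_url: 'https://example.com/login',
};

// Step 2: arguments for a crawl call that reuses the browser created above.
const crawlArgs = {
  url: 'https://example.com/account',
  session_id: 'my-session', // same id, so cookies and page state carry over
  js_code: 'return document.title;',
};
```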
### 13. clear_session - Remove session from tracking

```typescript
{ session_id: string }
```

Removes the session from local tracking. Note: The actual browser session on the server persists until timeout.
### 14. list_sessions - List tracked browser sessions

```typescript
{} // No parameters required
```

Returns all locally tracked sessions with creation time, last used time, and initial URL. Note: These are session references - actual server state may differ.
### 15. extract_with_llm - Extract structured data using AI

```typescript
{
  url: string,  // URL to extract data from
  query: string // Natural language extraction instructions
}
```

Uses AI to extract structured data from webpages, returning results immediately without polling or job management. This is the recommended way to extract specific information, since CSS/XPath extraction is not supported via the REST API.
## Advanced Configuration

For detailed information about all available configuration options, extraction strategies, and advanced features, refer to the official Crawl4AI documentation.

## Changelog

See CHANGELOG.md for detailed version history and recent updates.
## Development

### Setup

```bash
# 1. Start the Crawl4AI server
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# 2. Install the MCP server
git clone https://github.com/omgwtfwow/mcp-crawl4ai-ts.git
cd mcp-crawl4ai-ts
npm install
cp .env.example .env

# 3. Development commands
npm run dev    # Development mode
npm test       # Run tests
npm run lint   # Check code quality
npm run build  # Production build

# 4. Add to your MCP client (see "Using local installation")
```
### Running Integration Tests

Integration tests require a running Crawl4AI server. Configure your environment:

```bash
# Required for integration tests
export CRAWL4AI_BASE_URL=http://localhost:11235
export CRAWL4AI_API_KEY=your-api-key # If authentication is required

# Optional: For LLM extraction tests
export LLM_PROVIDER=openai/gpt-4o-mini
export LLM_API_TOKEN=your-llm-api-key
export LLM_BASE_URL=https://api.openai.com/v1 # If using a custom endpoint

# Run integration tests
npm run test:integration
```
Integration tests cover:
- Dynamic content and JavaScript execution
- Session management and cookies
- Content extraction (LLM-based only)
- Media handling (screenshots, PDFs)
- Performance and caching
- Content filtering
- Bot detection avoidance
- Error handling
## License

MIT