crawl-mcp-server
A comprehensive MCP (Model Context Protocol) server providing 11 powerful tools for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.
✨ Features
- 🔍 SearXNG Web Search - Search the web with automatic browser management
- 📄 4 Crawling Tools - Extract and convert web content to Markdown
- 🚀 Auto-Browser-Launch - Search tools automatically manage browser lifecycle
- 📦 11 Total Tools - Complete toolkit for web interaction
- 💾 Built-in Caching - SHA-256 based caching with graceful fallbacks (see the key-derivation sketch after this list)
- ⚡ Concurrent Processing - Batch up to 50 URLs per call, with up to 20 concurrent requests
- 🎯 LLM-Optimized Output - Clean Markdown perfect for AI consumption
- 🛡️ Robust Error Handling - Graceful failure with detailed error messages
- 🧪 Comprehensive Testing - Full CI/CD with performance benchmarks
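As a rough illustration of the caching feature above: a SHA-256 digest of the fetched URL makes a stable cache key. This is a minimal sketch of the idea under that assumption, not the server's actual implementation:

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache-key derivation: the README only states that caching is
// SHA-256 based, so hashing the URL alone is an assumption for illustration.
function cacheKey(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

// cacheKey("https://example.com") -> a 64-character hex string, usable as a
// filename or map key; a cache miss (or read failure) falls back to a fresh fetch.
```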
📦 Installation
Method 1: npm (Recommended)
npm install crawl-mcp-server
Method 2: Direct from Git
# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main
# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git
Method 3: Clone and Build
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build
Method 4: npx (No Installation)
# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
🔧 Setup for Claude Code
Option 1: MCP Desktop (Recommended)
Add to your Claude Desktop configuration file:
**macOS/Linux:** `~/.config/claude/claude_desktop_config.json`
{
"mcpServers": {
"crawl-server": {
"command": "npx",
"args": [
"git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
],
"env": {
"NODE_ENV": "production"
}
}
}
}
**Windows:** `%APPDATA%\Claude\claude_desktop_config.json`
{
"mcpServers": {
"crawl-server": {
"command": "npx",
"args": [
"git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
],
"env": {
"NODE_ENV": "production"
}
}
}
}
Option 2: Local Installation
If you've installed locally:
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": [
"/path/to/crawl-mcp-server/dist/index.js"
],
"env": {}
}
}
}
Option 3: Custom Path
For a specific installation:
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": [
"/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"
],
"env": {}
}
}
}
After configuration, restart Claude Desktop.
🔧 Setup for Other MCP Clients
Claude CLI
# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git
# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js
Zed Editor
Add to ~/.config/zed/settings.json:
{
"assistant": {
"mcp": {
"servers": {
"crawl-server": {
"command": "node",
"args": ["/path/to/crawl-mcp-server/dist/index.js"]
}
}
}
}
}
VSCode with Copilot Chat
{
"mcpServers": {
"crawl-server": {
"command": "node",
"args": ["/path/to/crawl-mcp-server/dist/index.js"]
}
}
}
🚀 Quick Start
Using MCP Inspector (Testing)
# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector
# Run the server
node dist/index.js
# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list
Development Mode
# Watch mode (auto-rebuild on changes)
npm run dev
# Build TypeScript
npm run build
# Run tests
npm run test:run
📚 Available Tools
Search Tools (7 tools)
1. search_searx
Search the web using SearXNG with automatic browser management.
// Example call
{
"query": "TypeScript MCP server",
"maxResults": 10,
"category": "general",
"timeRange": "week",
"language": "en"
}
Parameters:
- `query` (string, required): Search query
- `maxResults` (number, default: 20): Results to return (1-50)
- `category` (enum, default: general): one of general, images, videos, news, map, music, it, science
- `timeRange` (enum, optional): one of day, week, month, year
- `language` (string, default: en): Language code
Returns: JSON with search results array, URLs, and metadata
2. launch_chrome_cdp
Launch system Chrome with remote debugging enabled for advanced SearXNG usage.
{
"headless": true,
"port": 9222,
"userDataDir": "/path/to/profile"
}
Parameters:
- `headless` (boolean, default: true): Run Chrome headless
- `port` (number, default: 9222): Remote debugging port
- `userDataDir` (string, optional): Custom Chrome profile
3. connect_cdp
Connect to remote CDP browser (Browserbase, etc.).
{
"cdpWsUrl": "http://localhost:9222"
}
Parameters:
- `cdpWsUrl` (string, required): CDP WebSocket URL or HTTP endpoint
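If you only know the HTTP debugging endpoint, Chrome's standard `/json/version` endpoint reveals the WebSocket URL. A hedged sketch (assumes Chrome is already running with `--remote-debugging-port=9222`):

```typescript
// /json/version is Chrome's built-in DevTools metadata endpoint; its
// webSocketDebuggerUrl field is the kind of value connect_cdp accepts.
const res = await fetch("http://localhost:9222/json/version");
const { webSocketDebuggerUrl } = (await res.json()) as {
  webSocketDebuggerUrl: string;
};
console.log(webSocketDebuggerUrl); // e.g. ws://localhost:9222/devtools/browser/<id>
```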
4. launch_local
Launch the bundled Chromium for SearXNG search.
{
"headless": true,
"userAgent": "custom user agent string"
}
Parameters:
- `headless` (boolean, default: true): Run headless
- `userAgent` (string, optional): Custom user agent
5. chrome_status
Check Chrome CDP status and health.
{}
Returns: Running status, health, endpoint URL, and PID
6. close
Close browser session (keeps Chrome CDP running).
{}
7. shutdown_chrome_cdp
Shut down Chrome CDP and clean up resources.
{}
Crawling Tools (4 tools)
1. crawl_read ⭐ (Simple & Fast)
Quick single-page extraction to Markdown.
{
"url": "https://example.com/article",
"options": {
"timeout": 30000
}
}
Best for:
- ✅ News articles
- ✅ Blog posts
- ✅ Documentation pages
- ✅ Simple content extraction
Returns: Clean Markdown content
2. crawl_read_batch ⭐ (Multiple URLs)
Process 1-50 URLs concurrently.
{
"urls": [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3"
],
"options": {
"maxConcurrency": 5,
"timeout": 30000,
"maxResults": 10
}
}
Best for:
- ✅ Processing multiple articles
- ✅ Building content aggregates
- ✅ Bulk content extraction
Returns: Array of Markdown results with summary statistics
3. crawl_fetch_markdown
Single-page extraction with fine-grained option control.
{
"url": "https://example.com/article",
"options": {
"timeout": 30000
}
}
Best for:
- ✅ Advanced crawling options
- ✅ Custom timeout control
- ✅ Detailed extraction
4. crawl_fetch
Multi-page crawling with intelligent link extraction.
{
"url": "https://example.com",
"options": {
"pages": 5,
"maxConcurrency": 3,
"sameOriginOnly": true,
"timeout": 30000,
"maxResults": 20
}
}
Best for:
- ✅ Crawling entire sites
- ✅ Link-based discovery
- ✅ Multi-page scraping
Features:
- Extracts links from starting page
- Crawls discovered pages
- Concurrent processing
- Same-origin filtering (configurable; sketched below)
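The same-origin filter amounts to comparing URL origins against the starting page. A minimal sketch of the idea (not the server's actual code):

```typescript
// Keep only discovered links whose origin matches the starting page,
// mirroring what sameOriginOnly: true is documented to do.
function filterSameOrigin(startUrl: string, links: string[]): string[] {
  const origin = new URL(startUrl).origin;
  return links.filter((link) => {
    try {
      return new URL(link, startUrl).origin === origin;
    } catch {
      return false; // drop malformed links
    }
  });
}
```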
💡 Usage Examples
Example 1: Search + Crawl Workflow
// Step 1: Search for topics
{
"tool": "search_searx",
"arguments": {
"query": "TypeScript best practices 2024",
"maxResults": 5
}
}
// Step 2: Extract URLs from results
// (Parse the search results to get URLs)
// Step 3: Crawl selected articles
{
"tool": "crawl_read_batch",
"arguments": {
"urls": [
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3"
]
}
}
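The same workflow can be driven programmatically with the MCP TypeScript SDK. A hedged sketch: it assumes the server is reachable over STDIO via npx, and that URLs can be pulled out of the search result text as step 2 hints; the exact result shape is not specified by this README:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "workflow-example", version: "1.0.0" });
await client.connect(
  new StdioClientTransport({
    command: "npx",
    args: ["git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"],
  })
);

// Step 1: search
const search = await client.callTool({
  name: "search_searx",
  arguments: { query: "TypeScript best practices 2024", maxResults: 5 },
});

// Step 2: extract URLs from the result text (the shape is assumed here)
const text = (search.content as Array<{ type: string; text?: string }>)
  .map((c) => c.text ?? "")
  .join("\n");
const urls = [...text.matchAll(/https?:\/\/\S+/g)].map((m) => m[0]).slice(0, 3);

// Step 3: batch-crawl the selected articles
const crawled = await client.callTool({
  name: "crawl_read_batch",
  arguments: { urls },
});
```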
Example 2: Batch Content Extraction
{
"tool": "crawl_read_batch",
"arguments": {
"urls": [
"https://news.site/article1",
"https://news.site/article2",
"https://news.site/article3"
],
"options": {
"maxConcurrency": 10,
"timeout": 30000,
"maxResults": 3
}
}
}
Example 3: Site Crawling
{
"tool": "crawl_fetch",
"arguments": {
"url": "https://docs.example.com",
"options": {
"pages": 10,
"maxConcurrency": 5,
"sameOriginOnly": true,
"timeout": 30000,
"maxResults": 10
}
}
}
🎯 Tool Selection Guide
| Use Case | Recommended Tool | Complexity |
|---|---|---|
| Single article | crawl_read | Simple |
| Multiple articles | crawl_read_batch | Simple |
| Advanced options | crawl_fetch_markdown | Medium |
| Site crawling | crawl_fetch | Complex |
| Web search | search_searx | Simple |
| Research workflow | search_searx → crawl_read | Medium |
🏗️ Architecture
Core Components
┌─────────────────────────────────────────┐
│ crawl-mcp-server │
├─────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────┐ │
│ │ MCP Server Core │ │
│ │ - 11 registered tools │ │
│ │ - STDIO/HTTP transport │ │
│ └──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────┐ │
│ │ @just-every/crawl │ │
│ │ - HTML → Markdown │ │
│ │ - Mozilla Readability │ │
│ │ - Concurrent crawling │ │
│ └──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────────┐ │
│ │ Playwright (Browser) │ │
│ │ - SearXNG integration │ │
│ │ - Auto browser management │ │
│ │ - Anti-detection │ │
│ └──────────────────────────────┘ │
│ │
└─────────────────────────────────────────┘
Technology Stack
- Runtime: Node.js 18+
- Language: TypeScript 5.7
- Framework: MCP SDK (@modelcontextprotocol/sdk)
- Crawling: @just-every/crawl
- Browser: Playwright Core
- Validation: Zod
- Transport: STDIO (local) + HTTP (remote)
Data Flow
Client Request
↓
MCP Protocol
↓
Tool Handler
↓
┌─────────────────────┐
│ Crawl/Search │
│ @just-every/crawl │ → HTML content
│ or SearXNG │ → Search results
└─────────────────────┘
↓
HTML → Markdown
↓
Result Formatting
↓
MCP Response
↓
Client
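The HTML → Markdown stage builds on Mozilla Readability (via @just-every/crawl). That library's internals aren't reproduced here, but Readability-based extraction generally follows this pattern:

```typescript
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

// Hedged sketch of the extraction stage: parse the fetched HTML, let
// Readability strip navigation and boilerplate, then hand the cleaned
// article HTML to an HTML -> Markdown converter.
function extractArticle(html: string, url: string) {
  const dom = new JSDOM(html, { url });
  return new Readability(dom.window.document).parse();
  // returns { title, content, textContent, ... } or null if no article found
}
```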
🧪 Testing
Run Test Suite
# All unit tests
npm run test:run
# Performance benchmarks
npm run test:performance
# Full CI suite
npm run test:ci
# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
--method tools/call \
--tool-name crawl_read \
--tool-arg url="https://example.com"
Test Coverage
- ✅ All 11 tools tested
- ✅ Error handling validated
- ✅ Performance benchmarks
- ✅ Integration workflows
- ✅ Multi-Node support (Node 18, 20, 22)
CI/CD Pipeline
┌────────────────────────────────────┐
│ GitHub Actions │
├────────────────────────────────────┤
│ 1. Test (Matrix: Node 18,20,22) │
│ 2. Integration Tests (PR only) │
│ 3. Performance Tests (main) │
│ 4. Security Scan │
│ 5. Coverage Report │
└────────────────────────────────────┘
🔧 Development
Prerequisites
- Node.js 18 or higher
- npm or yarn
Setup
# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run in development mode (watch)
npm run dev
Development Commands
# Build project
npm run build
# Watch mode (auto-rebuild)
npm run dev
# Run tests
npm run test:run
# Lint code
npm run lint
# Type check
npm run typecheck
# Clean build artifacts
npm run clean
Project Structure
crawl-mcp-server/
├── src/
│ ├── index.ts # Main server (11 tools)
│ ├── types.ts # TypeScript interfaces
│ └── cdp.ts # Chrome CDP manager
├── test/
│ ├── run-tests.ts # Unit test suite
│ ├── performance.ts # Performance tests
│ └── config.ts # Test configuration
├── dist/ # Compiled JavaScript
├── .github/workflows/ # CI/CD pipeline
└── package.json
📊 Performance
Benchmarks
| Operation | Avg Duration | Max Memory |
|---|---|---|
| crawl_read | ~1500ms | 32MB |
| crawl_read_batch (2 URLs) | ~2500ms | 64MB |
| search_searx | ~4000ms | 128MB |
| crawl_fetch | ~2000ms | 48MB |
| tools/list | ~100ms | 8MB |
Performance Features
- ✅ Concurrent request processing (up to 20)
- ✅ Built-in caching (SHA-256)
- ✅ Automatic timeout management
- ✅ Memory optimization
- ✅ Resource cleanup
🛡️ Error Handling
All tools include comprehensive error handling:
- Network errors: Graceful degradation with error messages
- Timeout handling: Configurable timeouts
- Partial failures: Batch operations continue past individual failures (see the sketch after this list)
- Structured errors: Clear error codes and messages
- Recovery: Automatic retries where appropriate
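The partial-failure behavior can be pictured with Promise.allSettled. A sketch under the assumption that each URL is fetched independently (not the server's actual code):

```typescript
// A batch that records per-URL failures instead of aborting, mirroring
// the "partial failures" behavior described above.
async function crawlBatch(
  urls: string[],
  fetchOne: (url: string) => Promise<string>
) {
  const settled = await Promise.allSettled(urls.map(fetchOne));
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? { url: urls[i], ok: true, markdown: result.value }
      : { url: urls[i], ok: false, error: String(result.reason) }
  );
}
```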
Example error response:
{
"content": [
{
"type": "text",
"text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
}
],
"structuredContent": {
"error": "Network timeout",
"url": "https://example.com",
"code": "TIMEOUT"
}
}
🔐 Security
- No API keys required for basic crawling
- Respect robots.txt (configurable)
- User agent rotation
- Rate limiting (built-in via concurrency limits)
- Input validation (Zod schemas)
- Dependency scanning (npm audit, Snyk)
🌐 Transport Modes
STDIO (Default)
For local MCP clients:
node dist/index.js
HTTP
For remote access:
TRANSPORT=http PORT=3000 node dist/index.js
Server runs on: http://localhost:3000/mcp
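Assuming the HTTP mode uses the MCP SDK's Streamable HTTP transport (an assumption based on the SDK dependency listed above, not confirmed by this README), a remote client could connect like this:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// Connect to the HTTP endpoint and list the available tools.
const client = new Client({ name: "http-example", version: "1.0.0" });
await client.connect(
  new StreamableHTTPClientTransport(new URL("http://localhost:3000/mcp"))
);
console.log(await client.listTools());
```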
📝 Configuration
Environment Variables
# Transport mode (stdio or http)
TRANSPORT=stdio
# HTTP port (when TRANSPORT=http)
PORT=3000
# Node environment
NODE_ENV=production
Tool Configuration
Each tool accepts an options object:
{
"timeout": 30000, // Request timeout (ms)
"maxConcurrency": 5, // Concurrent requests (1-20)
"maxResults": 10, // Limit results (1-50)
"respectRobots": false, // Respect robots.txt
"sameOriginOnly": true // Only same-origin URLs
}
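Since Zod handles input validation, the options object above plausibly maps to a schema along these lines. The names and ranges mirror the comments above; the actual source schema may differ:

```typescript
import { z } from "zod";

// Hedged reconstruction of the documented options object as a Zod schema.
const CrawlOptionsSchema = z.object({
  timeout: z.number().int().positive().default(30000),        // request timeout (ms)
  maxConcurrency: z.number().int().min(1).max(20).default(5), // concurrent requests
  maxResults: z.number().int().min(1).max(50).default(10),    // limit results
  respectRobots: z.boolean().default(false),                  // respect robots.txt
  sameOriginOnly: z.boolean().default(true),                  // only same-origin URLs
});

type CrawlOptions = z.infer<typeof CrawlOptionsSchema>;
```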
🤝 Contributing
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make changes and add tests
- Run tests: `npm run test:ci`
- Commit: `git commit -m 'Add amazing feature'`
- Push: `git push origin feature/amazing-feature`
- Open a Pull Request
Development Guidelines
- Follow TypeScript strict mode
- Add tests for new features
- Update documentation
- Run linting: `npm run lint`
- Ensure CI passes
📄 License
MIT License - see LICENSE file
🙏 Acknowledgments
- @just-every/crawl - Web crawling
- Model Context Protocol - MCP specification
- SearXNG - Search aggregator
- Playwright - Browser automation
📞 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: your-email@example.com
🚀 What's Next?
- [ ] Add DuckDuckGo search support
- [ ] Implement content filtering
- [ ] Add screenshot capabilities
- [ ] Support for authenticated content
- [ ] PDF extraction
- [ ] Real-time monitoring
Built with ❤️ using TypeScript, MCP, and modern web technologies.