crawl-mcp-server

A comprehensive MCP (Model Context Protocol) server providing 11 powerful tools for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.

✨ Features

  • 🔍 SearXNG Web Search - Search the web with automatic browser management
  • 📄 4 Crawling Tools - Extract and convert web content to Markdown
  • 🚀 Auto-Browser-Launch - Search tools automatically manage browser lifecycle
  • 📦 11 Total Tools - Complete toolkit for web interaction
  • 💾 Built-in Caching - SHA-256 based caching with graceful fallbacks
  • ⚡ Concurrent Processing - Handle multiple URLs simultaneously (up to 50)
  • 🎯 LLM-Optimized Output - Clean Markdown perfect for AI consumption
  • 🛡️ Robust Error Handling - Graceful failure with detailed error messages
  • 🧪 Comprehensive Testing - Full CI/CD with performance benchmarks

📦 Installation

Method 1: npm (Recommended)

npm install crawl-mcp-server

Method 2: Direct from Git

# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main

# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git

Method 3: Clone and Build

git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build

Method 4: npx (No Installation)

# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

🔧 Setup for Claude Code

Option 1: Claude Desktop (Recommended)

Add to your Claude Desktop configuration file:

**macOS/Linux:** ~/.config/claude/claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": [
        "git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
      ],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

**Windows:** %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": [
        "git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
      ],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Option 2: Local Installation

If you've installed locally:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": [
        "/path/to/crawl-mcp-server/dist/index.js"
      ],
      "env": {}
    }
  }
}

Option 3: Custom Path

For a specific installation:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": [
        "/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"
      ],
      "env": {}
    }
  }
}

After configuration, restart Claude Desktop.

🔧 Setup for Other MCP Clients

Claude CLI

# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js

Zed Editor

Add to ~/.config/zed/settings.json:

{
  "assistant": {
    "mcp": {
      "servers": {
        "crawl-server": {
          "command": "node",
          "args": ["/path/to/crawl-mcp-server/dist/index.js"]
        }
      }
    }
  }
}

VSCode with Copilot Chat

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": ["/path/to/crawl-mcp-server/dist/index.js"]
    }
  }
}

🚀 Quick Start

Using MCP Inspector (Testing)

# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector

# Run the server
node dist/index.js

# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list

Development Mode

# Watch mode (auto-rebuild on changes)
npm run dev

# Build TypeScript
npm run build

# Run tests
npm run test:run

📚 Available Tools

Search Tools (7 tools)

1. search_searx

Search the web using SearXNG with automatic browser management.

// Example call
{
  "query": "TypeScript MCP server",
  "maxResults": 10,
  "category": "general",
  "timeRange": "week",
  "language": "en"
}

Parameters:

  • query (string, required): Search query
  • maxResults (number, default: 20): Results to return (1-50)
  • category (enum, default: general): one of general, images, videos, news, map, music, it, science
  • timeRange (enum, optional): one of day, week, month, year
  • language (string, default: en): Language code

Returns: JSON with search results array, URLs, and metadata
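
The exact field names are defined by the server's schema; as a purely illustrative sketch (every field below is an assumption, not taken from the source), the payload can be thought of as:

// Hypothetical shape of a search_searx payload -- illustrative only.
interface SearxResultItem {
  title: string;     // result title reported by SearXNG
  url: string;       // result URL
  snippet?: string;  // short excerpt, when available
}

interface SearxSearchResponse {
  query: string;               // the query that was executed
  results: SearxResultItem[];  // up to maxResults entries
  resultCount: number;         // number of entries returned
}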


2. launch_chrome_cdp

Launch system Chrome with remote debugging for advanced SearXNG usage.

{
  "headless": true,
  "port": 9222,
  "userDataDir": "/path/to/profile"
}

Parameters:

  • headless (boolean, default: true): Run Chrome headless
  • port (number, default: 9222): Remote debugging port
  • userDataDir (string, optional): Custom Chrome profile

3. connect_cdp

Connect to remote CDP browser (Browserbase, etc.).

{
  "cdpWsUrl": "http://localhost:9222"
}

Parameters:

  • cdpWsUrl (string, required): CDP WebSocket URL or HTTP endpoint

4. launch_local

Launch bundled Chromium for SearXNG search.

{
  "headless": true,
  "userAgent": "custom user agent string"
}

Parameters:

  • headless (boolean, default: true): Run headless
  • userAgent (string, optional): Custom user agent

5. chrome_status

Check Chrome CDP status and health.

{}

Returns: Running status, health, endpoint URL, and PID


6. close

Close browser session (keeps Chrome CDP running).

{}

7. shutdown_chrome_cdp

Shutdown Chrome CDP and cleanup resources.

{}

Crawling Tools (4 tools)

1. crawl_read ⭐ (Simple & Fast)

Quick single-page extraction to Markdown.

{
  "url": "https://example.com/article",
  "options": {
    "timeout": 30000
  }
}

Best for:

  • ✅ News articles
  • ✅ Blog posts
  • ✅ Documentation pages
  • ✅ Simple content extraction

Returns: Clean Markdown content


2. crawl_read_batch ⭐ (Multiple URLs)

Process 1-50 URLs concurrently.

{
  "urls": [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3"
  ],
  "options": {
    "maxConcurrency": 5,
    "timeout": 30000,
    "maxResults": 10
  }
}

Best for:

  • ✅ Processing multiple articles
  • ✅ Building content aggregates
  • ✅ Bulk content extraction

Returns: Array of Markdown results with summary statistics


3. crawl_fetch_markdown

Controlled single-page extraction with full option control.

{
  "url": "https://example.com/article",
  "options": {
    "timeout": 30000
  }
}

Best for:

  • ✅ Advanced crawling options
  • ✅ Custom timeout control
  • ✅ Detailed extraction

4. crawl_fetch

Multi-page crawling with intelligent link extraction.

{
  "url": "https://example.com",
  "options": {
    "pages": 5,
    "maxConcurrency": 3,
    "sameOriginOnly": true,
    "timeout": 30000,
    "maxResults": 20
  }
}

Best for:

  • ✅ Crawling entire sites
  • ✅ Link-based discovery
  • ✅ Multi-page scraping

Features:

  • Extracts links from starting page
  • Crawls discovered pages
  • Concurrent processing
  • Same-origin filtering (configurable)

💡 Usage Examples

Example 1: Search + Crawl Workflow

// Step 1: Search for topics
{
  "tool": "search_searx",
  "arguments": {
    "query": "TypeScript best practices 2024",
    "maxResults": 5
  }
}

// Step 2: Extract URLs from results
// (Parse the search results to get URLs)

// Step 3: Crawl selected articles
{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2",
      "https://example.com/article3"
    ]
  }
}
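
Outside of an MCP client UI, the same workflow can be scripted with the MCP TypeScript client SDK. The sketch below is illustrative: it assumes the search tool returns a JSON text payload containing a results array with url fields, which may differ from the server's actual output.

// Sketch: search -> extract URLs -> batch crawl, via the MCP client SDK.
// The parsing of the search payload (results[].url) is an assumption.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function searchAndCrawl(query: string) {
  const client = new Client({ name: "workflow-example", version: "1.0.0" });
  await client.connect(
    new StdioClientTransport({ command: "node", args: ["dist/index.js"] })
  );

  // Step 1: search
  const search = await client.callTool({
    name: "search_searx",
    arguments: { query, maxResults: 5 },
  });

  // Step 2: pull URLs out of the text content (assumed to be JSON)
  const content = (search as { content?: Array<{ type: string; text?: string }> }).content ?? [];
  const text = content.find((c) => c.type === "text")?.text ?? "{}";
  const urls: string[] = (JSON.parse(text).results ?? []).map((r: { url: string }) => r.url);

  // Step 3: batch-crawl the discovered URLs
  const crawled = await client.callTool({
    name: "crawl_read_batch",
    arguments: { urls, options: { maxConcurrency: 5 } },
  });

  await client.close();
  return crawled;
}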

Example 2: Batch Content Extraction

{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://news.site/article1",
      "https://news.site/article2",
      "https://news.site/article3"
    ],
    "options": {
      "maxConcurrency": 10,
      "timeout": 30000,
      "maxResults": 3
    }
  }
}

Example 3: Site Crawling

{
  "tool": "crawl_fetch",
  "arguments": {
    "url": "https://docs.example.com",
    "options": {
      "pages": 10,
      "maxConcurrency": 5,
      "sameOriginOnly": true,
      "timeout": 30000,
      "maxResults": 10
    }
  }
}

🎯 Tool Selection Guide

Use Case             Recommended Tool            Complexity
Single article       crawl_read                  Simple
Multiple articles    crawl_read_batch            Simple
Advanced options     crawl_fetch_markdown        Medium
Site crawling        crawl_fetch                 Complex
Web search           search_searx                Simple
Research workflow    search_searx → crawl_read   Medium

🏗️ Architecture

Core Components

┌────────────────────────────────────┐
│          crawl-mcp-server          │
├────────────────────────────────────┤
│                                    │
│  ┌──────────────────────────────┐  │
│  │       MCP Server Core        │  │
│  │  - 11 registered tools       │  │
│  │  - STDIO/HTTP transport      │  │
│  └──────────────────────────────┘  │
│                 │                  │
│  ┌──────────────────────────────┐  │
│  │      @just-every/crawl       │  │
│  │  - HTML → Markdown           │  │
│  │  - Mozilla Readability       │  │
│  │  - Concurrent crawling       │  │
│  └──────────────────────────────┘  │
│                 │                  │
│  ┌──────────────────────────────┐  │
│  │    Playwright (Browser)      │  │
│  │  - SearXNG integration       │  │
│  │  - Auto browser management   │  │
│  │  - Anti-detection            │  │
│  └──────────────────────────────┘  │
│                                    │
└────────────────────────────────────┘

Technology Stack

  • Runtime: Node.js 18+
  • Language: TypeScript 5.7
  • Framework: MCP SDK (@modelcontextprotocol/sdk)
  • Crawling: @just-every/crawl
  • Browser: Playwright Core
  • Validation: Zod
  • Transport: STDIO (local) + HTTP (remote)
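
As a rough sketch of how the stack fits together (MCP SDK for the protocol, Zod for input validation, STDIO for transport), a tool registration looks roughly like the following. The body is illustrative and is not the project's actual source:

// Minimal MCP SDK + Zod + STDIO pattern; the handler body is a placeholder.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "crawl-mcp-server", version: "1.0.0" });

server.tool(
  "crawl_read",
  "Quick single-page extraction to Markdown",
  {
    url: z.string().url(),
    options: z.object({ timeout: z.number().default(30000) }).optional(),
  },
  async ({ url, options }) => {
    // The real tool delegates to @just-every/crawl; here we only echo the request.
    const timeout = options?.timeout ?? 30000;
    return { content: [{ type: "text", text: `Would crawl ${url} (timeout ${timeout}ms)` }] };
  }
);

// STDIO transport for local clients; HTTP mode is selected via TRANSPORT=http.
await server.connect(new StdioServerTransport());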

Data Flow

Client Request
    ↓
MCP Protocol
    ↓
Tool Handler
    ↓
┌─────────────────────┐
│    Crawl/Search     │
│  @just-every/crawl  │  →  HTML content
│    or SearXNG       │  →  Search results
└─────────────────────┘
    ↓
HTML → Markdown
    ↓
Result Formatting
    ↓
MCP Response
    ↓
Client

🧪 Testing

Run Test Suite

# All unit tests
npm run test:run

# Performance benchmarks
npm run test:performance

# Full CI suite
npm run test:ci

# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call \
  --tool-name crawl_read \
  --tool-arg url="https://example.com"

Test Coverage

  • ✅ All 11 tools tested
  • ✅ Error handling validated
  • ✅ Performance benchmarks
  • ✅ Integration workflows
  • ✅ Multi-Node support (Node 18, 20, 22)

CI/CD Pipeline

┌────────────────────────────────────┐
│        GitHub Actions              │
├────────────────────────────────────┤
│  1. Test (Matrix: Node 18,20,22)   │
│  2. Integration Tests (PR only)    │
│  3. Performance Tests (main)       │
│  4. Security Scan                  │
│  5. Coverage Report                │
└────────────────────────────────────┘

🔧 Development

Prerequisites

  • Node.js 18 or higher
  • npm or yarn

Setup

# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run in development mode (watch)
npm run dev

Development Commands

# Build project
npm run build

# Watch mode (auto-rebuild)
npm run dev

# Run tests
npm run test:run

# Lint code
npm run lint

# Type check
npm run typecheck

# Clean build artifacts
npm run clean

Project Structure

crawl-mcp-server/
├── src/
│   ├── index.ts          # Main server (11 tools)
│   ├── types.ts           # TypeScript interfaces
│   └── cdp.ts            # Chrome CDP manager
├── test/
│   ├── run-tests.ts       # Unit test suite
│   ├── performance.ts     # Performance tests
│   └── config.ts          # Test configuration
├── dist/                  # Compiled JavaScript
├── .github/workflows/      # CI/CD pipeline
└── package.json

📊 Performance

Benchmarks

Operation                   Avg Duration    Max Memory
crawl_read                  ~1500ms         32MB
crawl_read_batch (2 URLs)   ~2500ms         64MB
search_searx                ~4000ms         128MB
crawl_fetch                 ~2000ms         48MB
tools/list                  ~100ms          8MB

Performance Features

  • ✅ Concurrent request processing (up to 20)
  • ✅ Built-in caching (SHA-256)
  • ✅ Automatic timeout management
  • ✅ Memory optimization
  • ✅ Resource cleanup
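
The concurrency cap behaves like a small worker pool: at most maxConcurrency requests are in flight at any time, and each worker picks up the next URL as soon as it finishes. A generic sketch of that pattern (not the project's actual implementation):

// Generic bounded-concurrency helper -- illustrative, not code from this repo.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++;                  // claim the next item
      results[index] = await worker(items[index]);
    }
  }

  // Start at most `limit` workers; each loops until the queue is drained.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}

// e.g. fetch 50 URLs with at most 5 concurrent requests:
// const pages = await mapWithConcurrency(urls, 5, (u) => fetch(u).then((r) => r.text()));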

🛡️ Error Handling

All tools include comprehensive error handling:

  • Network errors: Graceful degradation with error messages
  • Timeout handling: Configurable timeouts
  • Partial failures: Batch operations continue on individual failures
  • Structured errors: Clear error codes and messages
  • Recovery: Automatic retries where appropriate

Example error response:

{
  "content": [
    {
      "type": "text",
      "text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
    }
  ],
  "structuredContent": {
    "error": "Network timeout",
    "url": "https://example.com",
    "code": "TIMEOUT"
  }
}

🔐 Security

  • No API keys required for basic crawling
  • Respect robots.txt (configurable)
  • User agent rotation
  • Rate limiting (built-in via concurrency limits)
  • Input validation (Zod schemas)
  • Dependency scanning (npm audit, Snyk)

🌐 Transport Modes

STDIO (Default)

For local MCP clients:

node dist/index.js

HTTP

For remote access:

TRANSPORT=http PORT=3000 node dist/index.js

Server runs on: http://localhost:3000/mcp
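
A quick connectivity check from TypeScript, assuming the server exposes the MCP SDK's Streamable HTTP transport at /mcp (adjust if your deployment is configured differently):

// Connect to the HTTP endpoint and list the available tools.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

const client = new Client({ name: "http-check", version: "1.0.0" });
await client.connect(
  new StreamableHTTPClientTransport(new URL("http://localhost:3000/mcp"))
);

const { tools } = await client.listTools();
console.log(tools.map((t) => t.name)); // expect 11 tool names

await client.close();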

📝 Configuration

Environment Variables

# Transport mode (stdio or http)
TRANSPORT=stdio

# HTTP port (when TRANSPORT=http)
PORT=3000

# Node environment
NODE_ENV=production

Tool Configuration

Each tool accepts an options object:

{
  "timeout": 30000,          // Request timeout (ms)
  "maxConcurrency": 5,       // Concurrent requests (1-20)
  "maxResults": 10,          // Limit results (1-50)
  "respectRobots": false,    // Respect robots.txt
  "sameOriginOnly": true     // Only same-origin URLs
}

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make changes and add tests
  4. Run tests: npm run test:ci
  5. Commit: git commit -m 'Add amazing feature'
  6. Push: git push origin feature/amazing-feature
  7. Open a Pull Request

Development Guidelines

  • Follow TypeScript strict mode
  • Add tests for new features
  • Update documentation
  • Run linting: npm run lint
  • Ensure CI passes

📄 License

MIT License - see LICENSE file

🚀 What's Next?

  • [ ] Add DuckDuckGo search support
  • [ ] Implement content filtering
  • [ ] Add screenshot capabilities
  • [ ] Support for authenticated content
  • [ ] PDF extraction
  • [ ] Real-time monitoring

Built with ❤️ using TypeScript, MCP, and modern web technologies.
