Dev Tool MCP
Provides web crawling and browser automation capabilities with support for multiple content formats (HTML, JSON, PDF, screenshots, Markdown), page content extraction, console message monitoring, and network request tracking.
MCP Server for Web Crawling
An MCP (Model Context Protocol) server that provides web crawling capabilities using crawl4ai. It supports multiple output formats (HTML, JSON, PDF, screenshots, Markdown) and browser interaction features.
Quick Start
To launch the server, you have several options:
Option 1: Using the Python launch script
python launch_server.py
Option 2: Using the shell script (Unix-based systems)
./launch_server.sh
Option 3: Using the batch script (Windows)
launch_server.bat
Option 4: Direct execution
python mcp_server/server.py
Table of Contents
- MCP Configuration
- Features
- Architecture
- Installation
- Configuration
- Usage
- Testing
- Deployment
- Security
- Troubleshooting
- License
MCP Configuration
To configure this server with MCP, you have multiple options depending on your setup:
Method 1: Using the launch script (Recommended)
This is the recommended approach as it handles virtual environment setup and dependency installation automatically:
{
"mcpServers": {
"dev-tool-mcp": {
"command": "/bin/bash",
"args": [
"-c",
"cd /absolute/path/to/your/dev-tool-mcp && ./launch_server.sh"
],
"description": "MCP development tool server providing web crawling, browser automation, content extraction, and real-time page analysis capabilities"
}
}
}
Make sure to replace /absolute/path/to/your/dev-tool-mcp with the actual absolute path to your project directory.
Method 2: Using the virtual environment Python directly
If you prefer to run the server directly with the virtual environment:
{
"mcpServers": {
"dev-tool-mcp": {
"command": "/absolute/path/to/your/dev-tool-mcp/launch_server.sh",
"args": [],
"description": "MCP development tool server providing web crawling, browser automation, content extraction, and real-time page analysis capabilities"
}
}
}
Method 3: Using the installed console script
If the package is installed in the virtual environment, you can use the console script:
{
"mcpServers": {
"dev-tool-mcp": {
"command": "/absolute/path/to/your/dev-tool-mcp/launch_server.sh",
"args": [],
"description": "MCP development tool server providing web crawling, browser automation, content extraction, and real-time page analysis capabilities"
}
}
}
Method 4: Windows Configuration
For Windows systems, use the batch script:
{
"mcpServers": {
"dev-tool-mcp": {
"command": "cmd",
"args": [
"/c",
"cd /d C:\\absolute\\path\\to\\your\\dev-tool-mcp && launch_server.bat"
],
"description": "MCP development tool server providing web crawling, browser automation, content extraction, and real-time page analysis capabilities"
}
}
}
Note: The launch scripts (Methods 1-3 on Unix/Linux/Mac and Method 4 on Windows) are recommended because they will:
- Check if the virtual environment exists
- Create it if needed
- Install dependencies from pyproject.toml
- Activate the environment
- Start the server with all necessary dependencies
This ensures a consistent and reliable setup for the MCP server.
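The launch-script configurations above differ only in the project path and the platform, so they can be generated rather than edited by hand. The following is a minimal sketch, not part of the project: the helper name `build_mcp_config` is hypothetical, and it only reproduces the Method 1 and Method 4 layouts shown above (without the optional description field).

```python
import json
import shlex


def build_mcp_config(project_dir: str, windows: bool = False) -> dict:
    """Build an mcpServers entry that starts the server via its launch script.

    Hypothetical helper: mirrors Method 1 (Unix) and Method 4 (Windows).
    """
    if windows:
        command = "cmd"
        # Note: a Windows path containing spaces would need extra quoting here.
        args = ["/c", f"cd /d {project_dir} && launch_server.bat"]
    else:
        command = "/bin/bash"
        # shlex.quote protects paths that contain spaces or shell characters.
        args = ["-c", f"cd {shlex.quote(project_dir)} && ./launch_server.sh"]
    return {"mcpServers": {"dev-tool-mcp": {"command": command, "args": args}}}


# Example: print a Unix configuration ready to paste into an MCP client config.
print(json.dumps(build_mcp_config("/absolute/path/to/your/dev-tool-mcp"), indent=2))
```

Pass the real absolute path of the project directory in place of the placeholder.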
Features
- Web Crawling: Advanced web crawling capabilities using crawl4ai
- Multiple Output Formats: Support for HTML, JSON, PDF, screenshots, and Markdown output
- Browser Interaction: Get page content, console messages, and network requests
- LLM Integration: Supports LLM extraction strategies for content processing
- File Download: Automatic download and saving of files found on crawled pages
- Progress Tracking: Streaming progress updates during crawling operations
- Security: URL validation and sanitization to prevent security issues
Architecture
The MCP server is structured into the following modules:
mcp_server/
├── server.py # Main MCP server definition and tool handling
├── utils.py # Utility functions for file operations
├── browser/ # Browser automation functionality
│ ├── browser_service.py # Playwright-based browser service
│ └── README.md # Browser module documentation
└── crawl/ # Web crawling functionality
└── crawl.py # Core crawling implementation with crawl4ai
Core Components
- Server: Implements MCP protocol with tool registration and execution
- BrowserService: Manages Playwright browser instances for page content and network monitoring
- Crawler: Uses crawl4ai for advanced web crawling with multiple output formats
- Utils: Provides file handling utilities for saving content
Installation
Prerequisites
- Python 3.8 or higher
- pip package manager
- System dependencies for Playwright (Chromium browser)
Steps
- Clone the repository:
git clone <repository-url>
cd mcp-server
- Install the package and dependencies:
pip install -e .
- Install Playwright browsers:
python -m playwright install chromium
Dependencies
The server requires the following dependencies (automatically installed via pip):
- crawl4ai>=0.7.7: Web crawling library
- pydantic>=2.0.0: Data validation
- mcp==1.0.0: Model Context Protocol implementation
- httpx[socks]: HTTP client
- litellm: LLM interface
- beautifulsoup4>=4.12.2: HTML parsing
- lxml>=4.9.3: XML/HTML processing
- sentencepiece: Text processing
- playwright>=1.40.0: Browser automation
Configuration
Environment Variables
The server uses the following environment variables (optional):
TEST_URL: URL for testing (used in test files)
Available Tools
The server exposes the following tools via MCP:
say_hello
- Description: A simple greeting tool that returns personalized messages to users
- Parameters:
  - name (string, optional): The name to greet, defaults to "World"
- Returns: Greeting message text
echo_message
- Description: Echo tool that returns user-provided information as-is
- Parameters:
  - message (string, required): The message to echo back
- Returns: Echoed message text
crawl_web_page
- Description: Crawl web page content and save in multiple formats (HTML, JSON, PDF, screenshots) while downloading file resources from the page
- Parameters:
  - url (string, required): The URL of the web page to crawl
  - save_path (string, required): The base file path to save the crawled content and downloaded files
  - instruction (string, optional): The instruction to use for the LLM (default: "")
  - save_screenshot (boolean, optional): Save a screenshot of the page (default: false)
  - save_pdf (boolean, optional): Save a PDF of the page (default: false)
  - generate_markdown (boolean, optional): Generate a Markdown representation of the page (default: false)
- Returns: Success message with file count and save location
get_page_content
- Description: Get complete content of a specified URL webpage, including HTML structure and page data
- Parameters:
  - url (string, required): The URL of the web page to get content from
  - wait_for_selector (string, optional): CSS selector to wait for before getting content
  - wait_timeout (integer, optional): Wait timeout in milliseconds (default: 30000)
- Returns: JSON object containing page content, title, HTML, text, metadata, links, and images
get_console_messages
- Description: Capture console output information from specified URL webpage (including logs, warnings, errors, etc.)
- Parameters:
  - url (string, required): The URL of the web page to get console messages from
  - wait_for_selector (string, optional): CSS selector to wait for before getting console messages
  - wait_timeout (integer, optional): Wait timeout in milliseconds (default: 30000)
- Returns: JSON object containing console messages with type, text, location, and stack information
get_network_requests
- Description: Monitor and retrieve all network requests initiated by specified URL webpage (API calls, resource loading, etc.)
- Parameters:
  - url (string, required): The URL of the web page to get network requests from
  - wait_for_selector (string, optional): CSS selector to wait for before getting network requests
  - wait_timeout (integer, optional): Wait timeout in milliseconds (default: 30000)
- Returns: JSON object containing requests and responses with URLs, status, headers, and timing information
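A client can check tool arguments against the parameter lists above before sending a call. The sketch below is illustrative only: `TOOL_SPECS` and `prepare_arguments` are hypothetical names, the table is transcribed from this section rather than read from the server, and `None` is used here as a stand-in for "no documented default".

```python
from typing import Any, Dict

# Hypothetical table transcribed from the tool list above.
_WAIT_OPTS = {"wait_for_selector": None, "wait_timeout": 30000}
TOOL_SPECS: Dict[str, Dict[str, Any]] = {
    "say_hello": {"required": [], "optional": {"name": "World"}},
    "echo_message": {"required": ["message"], "optional": {}},
    "crawl_web_page": {
        "required": ["url", "save_path"],
        "optional": {
            "instruction": "",
            "save_screenshot": False,
            "save_pdf": False,
            "generate_markdown": False,
        },
    },
    "get_page_content": {"required": ["url"], "optional": dict(_WAIT_OPTS)},
    "get_console_messages": {"required": ["url"], "optional": dict(_WAIT_OPTS)},
    "get_network_requests": {"required": ["url"], "optional": dict(_WAIT_OPTS)},
}


def prepare_arguments(tool: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Reject unknown or missing parameters and fill in documented defaults."""
    spec = TOOL_SPECS.get(tool)
    if spec is None:
        raise ValueError(f"unknown tool: {tool}")
    missing = [k for k in spec["required"] if k not in arguments]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    known = set(spec["required"]) | set(spec["optional"])
    unknown = [k for k in arguments if k not in known]
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    merged = {**spec["optional"], **arguments}
    # Drop parameters that have no default and were not supplied.
    return {k: v for k, v in merged.items() if v is not None}
```

For example, `prepare_arguments("get_page_content", {"url": "https://example.com"})` fills in `wait_timeout` with 30000 and omits `wait_for_selector`.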
Usage
Running the Server
The server can be started using the console script defined in pyproject.toml:
dev-tool-mcp
Or directly via Python:
python -m mcp_server.server
The server uses stdio for MCP communication, making it compatible with MCP clients.
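Over stdio, MCP messages are JSON-RPC 2.0 objects written one per line. As a rough sketch of what a client sends for a tool call, the hypothetical helper `make_tool_call` below serializes a `tools/call` request. It covers only the message framing: a real client must complete the MCP `initialize` handshake first, and an MCP client library normally handles all of this for you.

```python
import json
from typing import Any, Dict


def make_tool_call(request_id: int, name: str, arguments: Dict[str, Any]) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # No indent: a message on the stdio transport must not contain embedded newlines.
    return json.dumps(message, separators=(",", ":")) + "\n"


line = make_tool_call(1, "echo_message", {"message": "hello"})
print(line, end="")
```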
Tool Examples
Crawling a Web Page
To crawl a web page and save content in multiple formats:
{
"name": "crawl_web_page",
"arguments": {
"url": "https://example.com",
"save_path": "/path/to/save",
"save_screenshot": true,
"save_pdf": true,
"generate_markdown": true
}
}
This will create a timestamped subdirectory with:
- output.html - Page HTML content
- output.json - Page content in JSON format
- output.png - Screenshot of the page (if requested)
- output.pdf - PDF of the page (if requested)
- raw_markdown.md - Markdown representation of the page (if requested)
- downloaded_files.json - List of downloaded files
- files/ - Directory containing downloaded files
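After a crawl, it can be useful to confirm that the expected files were written. The sketch below is an illustration under assumptions: the helper names are hypothetical, the name of the timestamped subdirectory is not documented here so the caller passes the run directory explicitly, and `downloaded_files.json` and `files/` are assumed to be created on every run.

```python
from pathlib import Path
from typing import Dict, List


def expected_outputs(save_screenshot: bool = False,
                     save_pdf: bool = False,
                     generate_markdown: bool = False) -> List[str]:
    """List the file names documented above for a given set of crawl options."""
    names = ["output.html", "output.json", "downloaded_files.json", "files"]
    if save_screenshot:
        names.append("output.png")
    if save_pdf:
        names.append("output.pdf")
    if generate_markdown:
        names.append("raw_markdown.md")
    return names


def check_outputs(run_dir: Path, **flags: bool) -> Dict[str, bool]:
    """Map each expected name to whether it exists inside run_dir."""
    return {name: (run_dir / name).exists() for name in expected_outputs(**flags)}
```

Call `check_outputs(Path(run_dir), save_pdf=True)` with the subdirectory the crawl created; any entry mapped to `False` is missing.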
Getting Page Content
To retrieve page content:
{
"name": "get_page_content",
"arguments": {
"url": "https://example.com",
"wait_for_selector": "#main-content",
"wait_timeout": 10000
}
}
Getting Console Messages
To capture console messages from a page:
{
"name": "get_console_messages",
"arguments": {
"url": "https://example.com",
"wait_for_selector": ".app",
"wait_timeout": 15000
}
}
Getting Network Requests
To monitor network requests made by a page:
{
"name": "get_network_requests",
"arguments": {
"url": "https://example.com",
"wait_for_selector": "[data-loaded]",
"wait_timeout": 20000
}
}
Testing
The project includes comprehensive tests for both browser and crawler functionality:
Run Tests
# Run all tests
pytest
# Run specific test files
python -m pytest test/test_crawler.py
python -m pytest test/test_browser.py
# Run browser tests directly
python -m test.test_browser
Test Coverage
- test_crawler.py: Tests the complete crawler functionality, including file saving and format generation
- test_browser.py: Tests browser service functions for page content, console messages, and network requests
The crawler test specifically verifies:
- HTML, JSON, PDF, screenshot, and Markdown file generation
- File download and saving functionality
- Proper error handling and directory creation
Deployment
Production Deployment
- Install the package in your production environment:
pip install dev-tool-mcp
- Install Playwright browsers:
python -m playwright install chromium --with-deps
- Run the server:
dev-tool-mcp
Docker Deployment (Optional)
Create a Dockerfile for containerized deployment:
FROM python:3.11-slim
WORKDIR /app
COPY pyproject.toml .
COPY mcp_server/ ./mcp_server/
RUN pip install -e .
RUN python -m playwright install chromium --with-deps
CMD ["dev-tool-mcp"]
Security
URL Validation
The server includes security measures to prevent malicious URLs:
- URL length limit (2048 characters)
- Protocol validation (only http/https allowed)
- Input sanitization to prevent injection attacks
- Validation to prevent access to local network addresses
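The checks listed above can be sketched with the standard library. This is not the server's actual implementation: the function name `validate_url` is hypothetical, and the local-address check here only covers IP literals and localhost names. It does not resolve DNS names, so a hostname that resolves to a private address would pass.

```python
import ipaddress
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048
ALLOWED_SCHEMES = {"http", "https"}


def validate_url(url: str) -> str:
    """Return url unchanged if it passes the checks, else raise ValueError."""
    if len(url) > MAX_URL_LENGTH:
        raise ValueError("URL exceeds 2048 characters")
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError("only http and https URLs are allowed")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    if host == "localhost" or host.endswith(".localhost"):
        raise ValueError("local addresses are not allowed")
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a regular hostname, accepted without DNS lookup.
        return url
    if addr.is_private or addr.is_loopback or addr.is_link_local:
        raise ValueError("local network addresses are not allowed")
    return url
```

With this sketch, `validate_url("https://example.com/page")` returns the URL, while `ftp://example.com`, `http://127.0.0.1/`, and `http://192.168.1.10/admin` each raise `ValueError`.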
Browser Security
- Playwright runs in headless mode by default
- Security flags are set to disable dangerous features
- Browser runs with limited privileges
File System Security
- Files are saved only to explicitly specified paths
- No arbitrary file system access is allowed
- Temporary files are properly cleaned up
Troubleshooting
Common Issues
Playwright Browser Not Found
If you encounter browser-related errors:
python -m playwright install chromium
Permission Errors
Ensure the server has write permissions to the specified save paths:
mkdir -p /path/to/save
chmod 755 /path/to/save
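The same check can be done from Python before calling crawl_web_page. The helper below is a hypothetical sketch: it creates the directory if needed and confirms that a file can actually be written there, which catches cases where the directory exists but is read-only.

```python
import os
import tempfile


def ensure_writable(save_path: str) -> bool:
    """Create save_path if needed and confirm a file can be written inside it."""
    try:
        os.makedirs(save_path, exist_ok=True)
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=save_path):
            pass
    except OSError:
        return False
    return True
```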
Network Issues
For network-related crawling issues, verify:
- The target URL is accessible
- No firewall restrictions exist
- Appropriate timeout values are set
Memory Usage
Large page crawls can consume significant memory. For large-scale crawling:
- Monitor memory usage
- Process pages sequentially rather than in parallel
- Clean up temporary files regularly
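Processing pages sequentially means awaiting each crawl before starting the next, so only one page is held in memory at a time. A minimal sketch follows, assuming a caller-supplied coroutine `crawl_one` (hypothetical; it stands in for whatever issues the crawl_web_page call):

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def crawl_sequentially(
    urls: Iterable[str],
    crawl_one: Callable[[str], Awaitable[str]],
) -> Dict[str, str]:
    """Crawl URLs one at a time, recording a result or an error per URL."""
    results: Dict[str, str] = {}
    for url in urls:
        try:
            # Await each crawl before starting the next one.
            results[url] = await crawl_one(url)
        except Exception as exc:
            # One failing page should not abort the remaining crawls.
            results[url] = f"error: {exc}"
    return results
```

By contrast, launching all crawls at once with `asyncio.gather` would keep every page in flight simultaneously, which is what the advice above is meant to avoid.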
Debugging
When troubleshooting, work through the following checks:
- Check the server logs for error messages
- Check file system permissions on the save paths
- Verify network connectivity to target URLs
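This README does not document a logging switch, so the following is only a sketch of how debug logging could be configured in Python code that runs alongside the server. The logger name `mcp_server` is an assumption based on the package directory name. Logs go to stderr because the server uses stdout for MCP communication over stdio, and writing anything else to stdout would corrupt the protocol stream.

```python
import logging
import sys


def configure_debug_logging(name: str = "mcp_server") -> logging.Logger:
    """Return a logger that emits DEBUG and above to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Attach the handler only once so repeated calls do not duplicate output.
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = configure_debug_logging()
log.debug("debug logging enabled")
```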
Version Information
- Package: dev-tool-mcp
- Version: 0.1.0
- Python Support: 3.8+
- Dependencies: See pyproject.toml for exact versions
License
This project is licensed under the terms specified in the pyproject.toml file. See the project repository for full license details.