MCP Web Scraper

A lightweight web scraping server that lets Claude Desktop users extract text, links, images, tables, headlines, and metadata from websites, optionally targeting specific elements with CSS selectors.

A lightweight, efficient web scraping MCP server that talks to Claude Desktop directly over the MCP STDIO transport.

🚀 Quick Start

Option 1: Automated Setup

# Clone and setup
git clone https://github.com/navin4078/mcp-web-scraper
cd mcp-web-scraper
chmod +x setup.sh && ./setup.sh

Option 2: Manual Setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install minimal dependencies
pip install -r requirements.txt

⚙️ Claude Desktop Configuration

Step 1: Find Your Paths

# Get absolute paths (run this in your project directory)
echo "Python path: $(pwd)/venv/bin/python"
echo "Script path: $(pwd)/app_mcp.py"

Step 2: Configure Claude Desktop

Open your Claude Desktop config file:

macOS:

~/Library/Application Support/Claude/claude_desktop_config.json

Windows:

%APPDATA%\Claude\claude_desktop_config.json

Linux:

~/.config/Claude/claude_desktop_config.json

Step 3: Add Configuration

Add this to your config file:

{
  "mcpServers": {
    "web-scraper": {
      "command": "/full/path/to/your/venv/bin/python",
      "args": ["/full/path/to/your/app_mcp.py"]
    }
  }
}

Example:

{
  "mcpServers": {
    "web-scraper": {
      "command": "/Users/username/Desktop/mcp-web-scraper/venv/bin/python",
      "args": ["/Users/username/Desktop/mcp-web-scraper/app_mcp.py"]
    }
  }
}
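
If you would rather generate the entry than edit it by hand, the steps above can be sketched in Python. This is a minimal sketch, not part of the project: the helper names build_server_entry and merge_into_config are made up for illustration, and it assumes the macOS/Linux layout (venv/bin/python) used throughout this README.

```python
import json
from pathlib import Path, PurePosixPath


def build_server_entry(project_dir):
    """Return the "web-scraper" entry for a project checked out at project_dir."""
    root = PurePosixPath(project_dir)
    return {
        "command": str(root / "venv" / "bin" / "python"),
        "args": [str(root / "app_mcp.py")],
    }


def merge_into_config(config_text, project_dir):
    """Add or replace the web-scraper entry, keeping any other servers intact."""
    config = json.loads(config_text) if config_text.strip() else {}
    config.setdefault("mcpServers", {})["web-scraper"] = build_server_entry(project_dir)
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Print a config for the current directory; review it before pasting it
    # into claude_desktop_config.json.
    print(merge_into_config("{}", Path.cwd().as_posix()))
```

Run it from the project directory and compare the output with the example above. Passing the existing file contents as config_text keeps other configured servers in place.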

Step 4: Restart Claude Desktop

  1. Completely close Claude Desktop (Cmd+Q on Mac)
  2. Restart the application
  3. Look for the hammer icon (🔨)
  4. You should see "web-scraper" in your MCP servers

🛠 Available Tools

scrape_website

Extract data from websites with flexible options:

  • extract_type: text, links, images, table
  • selector: CSS selector for targeting specific elements
  • max_results: Limit number of results (1-50)
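
Claude Desktop builds the tool call for you, but it can help to see roughly what one looks like on the wire. The sketch below assembles a JSON-RPC 2.0 tools/call message for scrape_website. The option names extract_type, selector, and max_results come from the list above; the url argument name, the default values, and the helper name build_scrape_request are assumptions made for illustration.

```python
import json

# Values listed for extract_type in this README.
ALLOWED_TYPES = {"text", "links", "images", "table"}


def build_scrape_request(url, extract_type="text", selector=None, max_results=10, request_id=1):
    """Build a JSON-RPC 2.0 tools/call message for the scrape_website tool."""
    if extract_type not in ALLOWED_TYPES:
        raise ValueError(f"extract_type must be one of {sorted(ALLOWED_TYPES)}")
    if not 1 <= max_results <= 50:
        raise ValueError("max_results must be between 1 and 50")
    # "url" is an assumed argument name; check the tool schema for the real one.
    arguments = {"url": url, "extract_type": extract_type, "max_results": max_results}
    if selector:
        arguments["selector"] = selector
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "scrape_website", "arguments": arguments},
    }


if __name__ == "__main__":
    request = build_scrape_request("https://example.com", "links", max_results=5)
    print(json.dumps(request, indent=2))
```

Validating the range on the client side mirrors the documented 1-50 limit, so an out-of-range value fails early instead of reaching the server.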

extract_headlines

Get all headlines (h1, h2, h3) from a webpage with hierarchy and attributes.
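
To make "hierarchy and attributes" concrete, here is a rough sketch of headline extraction. It is not the code in app_mcp.py: it uses only Python's standard html.parser instead of the project's dependencies, and the output field names (level, text, attributes) are assumptions.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect h1-h3 headings with their level, text, and attributes."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._current = None  # heading currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self._current = {"level": self.LEVELS[tag], "text": "", "attributes": dict(attrs)}

    def handle_data(self, data):
        # Text inside nested tags (for example <em>) is appended as well.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self._current is not None:
            # Collapse runs of whitespace before storing the heading.
            self._current["text"] = " ".join(self._current["text"].split())
            self.headlines.append(self._current)
            self._current = None


def extract_headlines(html):
    """Return the h1-h3 headings of an HTML string in document order."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines
```

The level number is what carries the hierarchy: a consumer can rebuild the outline by nesting each heading under the nearest preceding heading with a smaller level.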

extract_metadata

Extract comprehensive metadata:

  • Basic: title, description, keywords, author
  • Open Graph: og:title, og:description, og:image
  • Twitter Cards: twitter:title, twitter:description
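
The three groups above map onto the title element and meta tags in the page head. The following sketch shows one way to sort them, using only Python's standard html.parser. It is an illustration under assumptions, not the server's implementation: the grouping keys basic, open_graph, and twitter are invented for this example.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Sort <title> and <meta> tags into basic, Open Graph, and Twitter groups."""

    BASIC_NAMES = {"description", "keywords", "author"}

    def __init__(self):
        super().__init__()
        self.metadata = {"basic": {}, "open_graph": {}, "twitter": {}}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            return
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph tags normally use "property"; the others use "name".
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if not key or content is None:
            return
        if key.startswith("og:"):
            self.metadata["open_graph"][key] = content
        elif key.startswith("twitter:"):
            self.metadata["twitter"][key] = content
        elif key in self.BASIC_NAMES:
            self.metadata["basic"][key] = content

    def handle_data(self, data):
        if self._in_title:
            basic = self.metadata["basic"]
            basic["title"] = basic.get("title", "") + data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_metadata(html):
    """Return grouped metadata for an HTML string."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    basic = parser.metadata["basic"]
    if "title" in basic:
        basic["title"] = basic["title"].strip()
    return parser.metadata
```

Meta tags outside the three groups (viewport, for example) are ignored in this sketch.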

get_page_info

Get page structure overview:

  • Element counts (paragraphs, headings, links, images, tables)
  • Basic metadata
  • Page statistics
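
The element counts can be pictured as a single pass over the page's start tags. This is a simplified sketch built on Python's standard html.parser, assuming that "headings" means h1 through h6; the real tool may count differently and also returns metadata and statistics that are not shown here.

```python
from collections import Counter
from html.parser import HTMLParser

# Map HTML tags to the count they contribute to (h1-h6 is an assumption).
GROUPS = {"p": "paragraphs", "a": "links", "img": "images", "table": "tables"}
GROUPS.update({f"h{level}": "headings" for level in range(1, 7)})


class ElementCounter(HTMLParser):
    """Count start tags by group while the document is parsed."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        group = GROUPS.get(tag)
        if group:
            self.counts[group] += 1


def count_elements(html):
    """Return element counts for an HTML string, with zeros for absent groups."""
    counter = ElementCounter()
    counter.feed(html)
    counter.close()
    names = ("paragraphs", "headings", "links", "images", "tables")
    return {name: counter.counts[name] for name in names}
```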

💡 Usage Examples

Basic Scraping

Scrape the text content from https://example.com

Extract all links from https://news.ycombinator.com

Get headlines from https://www.bbc.com/news

Advanced Examples

Extract all images from https://example.com with their alt text

Scrape text from https://example.com using the CSS selector ".article-content p"

Get metadata and Open Graph tags from https://github.com

What's the page structure of https://stackoverflow.com?

Specific Selectors

Extract text from https://news.ycombinator.com using selector ".titleline a"

Get all table data from https://example.com/data-page

Scrape only paragraph text from articles using selector "article p"

📁 Project Structure

mcp-web-scraper/
├── app_mcp.py             # Main MCP server (STDIO)
├── requirements.txt       # Minimal dependencies
├── setup.sh               # Simple setup script
├── .gitignore             # Git ignore rules
└── README.md              # This file

🔧 Features

Web Scraping Capabilities

  • ✅ Text extraction with CSS selectors
  • ✅ Link extraction with full attributes
  • ✅ Image extraction with metadata
  • ✅ Table data extraction and formatting
  • ✅ Comprehensive metadata extraction
  • ✅ Headline extraction with hierarchy
  • ✅ Custom CSS selector support
  • ✅ Configurable result limits
  • ✅ Error handling and validation

MCP Integration

  • ✅ Direct STDIO protocol (no HTTP needed)
  • ✅ Native Claude Desktop integration
  • ✅ Automatic server lifecycle management
  • ✅ Schema validation and documentation
  • ✅ Comprehensive error handling
  • ✅ Minimal dependencies

🛡 Security & Best Practices

  1. Respect robots.txt: Always check robots.txt before scraping
  2. Request timeout: Built-in 10-second timeout on each request (this bounds slow responses; it does not throttle how often requests are sent)
  3. User-Agent: Uses modern browser headers
  4. Input validation: URL and parameter validation
  5. Error handling: Graceful error handling and reporting
  6. Resource limits: Configurable result limits prevent overload
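
The robots.txt check in item 1 can be automated with Python's standard urllib.robotparser. The sketch below is a suggestion for your own workflow rather than something the server is documented to do; the helper name is_allowed is made up for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/private/page"))
    print(is_allowed(rules, "https://example.com/public"))
```

To check a live site, call set_url() with the site's robots.txt address and then read() on the RobotFileParser instead of passing the text in yourself.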

🐛 Troubleshooting

MCP Server Not Appearing

Check your paths:

# Verify files exist
ls -la /path/to/your/venv/bin/python
ls -la /path/to/your/app_mcp.py

# Test the script manually
/path/to/your/venv/bin/python /path/to/your/app_mcp.py

Validate JSON configuration:

  • Use a JSON validator to check syntax
  • Ensure no trailing commas
  • Use absolute paths (not relative)
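
The checklist above can also be run as a small script. This is a minimal sketch with a hypothetical helper, check_config, that parses the config text and reports problems with the web-scraper entry only, because other servers may legitimately use non-path arguments.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath


def _is_absolute(path):
    """Accept absolute POSIX paths (/...) and absolute Windows paths (C:\\...)."""
    return PurePosixPath(path).is_absolute() or PureWindowsPath(path).is_absolute()


def check_config(config_text, server_name="web-scraper"):
    """Return a list of problems found in a claude_desktop_config.json string."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        # Trailing commas and other syntax errors end up here.
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict):
        return ['missing "mcpServers" object']
    entry = servers.get(server_name)
    if not isinstance(entry, dict):
        return [f'no "{server_name}" entry under "mcpServers"']
    problems = []
    for path in [entry.get("command", ""), *entry.get("args", [])]:
        if not _is_absolute(str(path)):
            problems.append(f"{path!r} is not an absolute path")
    return problems
```

An empty list means the three checks passed. It does not confirm that the files exist, so still run the ls commands above.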

Permission Issues

# Make script executable
chmod +x app_mcp.py

# Check virtual environment
source venv/bin/activate
python --version

Import Errors

# Reinstall dependencies
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Testing the MCP Server

You can test if the server works by running it manually:

source venv/bin/activate
python app_mcp.py

The server should start and then wait for STDIO input from Claude Desktop, so an idle terminal with no error message is the expected result. Press Ctrl+C to stop it.
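
For a slightly stronger check than watching an idle terminal, you can send the server an MCP initialize request yourself. The sketch below assumes the standard MCP stdio framing (one JSON-RPC message per line) and the protocol version string "2024-11-05"; the client name, the helper names, and the behavior of app_mcp.py when its input closes are assumptions, so treat a None result as inconclusive rather than as a failure.

```python
import json
import subprocess


def frame_message(message):
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    return json.dumps(message, separators=(",", ":")) + "\n"


def initialize_request(request_id=1):
    """Build an MCP initialize request (client name and version are placeholders)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "manual-probe", "version": "0.0.1"},
        },
    }


def probe_server(python_path, script_path, timeout=10):
    """Start the server, send initialize, and return the first reply (or None)."""
    try:
        result = subprocess.run(
            [python_path, script_path],
            input=frame_message(initialize_request()),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # the server kept running without exiting; inconclusive
    lines = result.stdout.splitlines()
    return json.loads(lines[0]) if lines else None
```

Call probe_server with the same two absolute paths you put in the Claude Desktop config. A reply whose id matches the request suggests the server starts and speaks JSON-RPC.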

📚 Dependencies

  • requests: HTTP library for web requests
  • beautifulsoup4: HTML/XML parsing
  • lxml: Fast XML and HTML processor
  • mcp: Model Context Protocol library

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Test thoroughly with Claude Desktop
  4. Submit a pull request

📄 License

This project is open source and available under the MIT License.

Simple, efficient web scraping for Claude Desktop! 🕷️✨
