
Fetch as Markdown MCP Server

A Model Context Protocol (MCP) server that fetches web pages and converts them to clean, readable markdown format, focusing on main content extraction while minimizing context overhead.

Overview

This MCP server acts as a bridge between AI assistants and the web, specifically designed to:

  • Clean Content Extraction: Focuses on main article content, removing navigation, ads, and sidebars
  • Context Minimization: Strips unnecessary elements to reduce token usage while preserving content structure
  • Respectful Scraping: Implements rate limiting, an identifying user-agent header, and timeout handling
  • Error Resilience: Gracefully handles common web-related errors and edge cases

Features

🌐 Web Page Fetching

  • Fetch any publicly accessible web page
  • Automatic redirect handling with final URL reporting
  • Configurable timeouts and proper error handling
  • Respectful rate limiting (1-second intervals between requests)

🧹 Content Cleaning

  • Removes navigation, ads, sidebars, and other non-essential elements
  • Focuses on main content areas using semantic HTML detection
  • Strips unnecessary HTML attributes to reduce token usage
  • Preserves content structure and readability
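The cleaning pass can be pictured as a small BeautifulSoup routine. This is a minimal sketch: the tag list and the attributes kept below are assumptions for illustration, not the exact selectors used in main.py.

```python
from bs4 import BeautifulSoup

# Illustrative noise list; the elements main.py actually removes may differ.
NOISE_TAGS = ["script", "style", "nav", "aside", "header", "footer", "form"]
KEEP_ATTRS = {"href", "src", "alt"}  # everything else is stripped

def clean_html(html: str) -> BeautifulSoup:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that rarely carry main content
    for name in NOISE_TAGS:
        for tag in soup.find_all(name):
            tag.decompose()
    # Strip attributes that cost tokens without adding meaning
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}
    return soup
```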

📝 Markdown Conversion

  • Converts HTML to clean, readable markdown
  • Configurable link and image inclusion
  • Proper heading hierarchy and formatting
  • Post-processing to remove excessive whitespace
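The whitespace post-processing step can be sketched with two regular expressions. This is a minimal illustration; the exact rules in main.py may differ.

```python
import re

def tidy_markdown(md: str) -> str:
    """Collapse conversion artifacts: trailing spaces and runs of blank lines."""
    md = re.sub(r"[ \t]+$", "", md, flags=re.MULTILINE)  # trailing whitespace
    md = re.sub(r"\n{3,}", "\n\n", md)  # 3+ newlines -> one blank line
    return md.strip() + "\n"
```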

Installation & Setup

Prerequisites

  • Python 3.12 or higher
  • uv package manager

Quick Start

Run directly with uvx:

uvx git+https://github.com/bhubbb/mcp-fetch-as-markdown

Or install locally:

  1. Clone or download this project

  2. Install dependencies:

    cd mcp-fetch-as-markdown
    uv sync
    
  3. Run the server:

    uv run python main.py
    

Integration with AI Assistants

This MCP server is designed to work with AI assistants that support the Model Context Protocol. Configure your AI assistant to connect to this server via stdio.

Example configuration for Claude Desktop:

{
  "mcpServers": {
    "fetch-as-markdown": {
      "command": "uvx",
      "args": ["git+https://github.com/bhubbb/mcp-fetch-as-markdown"]
    }
  }
}

Or if using a local installation:

{
  "mcpServers": {
    "fetch-as-markdown": {
      "command": "uv",
      "args": ["run", "python", "/path/to/mcp-fetch-as-markdown/main.py"]
    }
  }
}

Available Tools

fetch

Fetch a web page and convert it to clean markdown format.

Parameters:

  • url (required): URL of the web page to fetch and convert
  • include_links (optional): Whether to preserve links in markdown output (default: true)
  • include_images (optional): Whether to include image references (default: false)
  • timeout (optional): Request timeout in seconds (5-30, default: 10)

Returns:

  • Fetch metadata (original URL, final URL, title, content length, status code, content type)
  • Clean markdown content with proper formatting

Example:

{
  "name": "fetch",
  "arguments": {
    "url": "https://example.com/article",
    "include_links": true,
    "include_images": false,
    "timeout": 15
  }
}

How It Works

Content Extraction Strategy

  1. Fetch Page: Makes HTTP request with proper headers and timeout handling
  2. Parse HTML: Uses BeautifulSoup to parse the HTML content
  3. Remove Unwanted Elements: Strips scripts, styles, navigation, ads, sidebars, footers
  4. Find Main Content: Looks for semantic elements like <main>, <article>, or common content classes
  5. Clean Attributes: Removes unnecessary HTML attributes to reduce size
  6. Convert to Markdown: Uses configurable markdown conversion with proper formatting
  7. Post-process: Removes excessive whitespace and blank lines
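Step 4 above can be sketched as a priority search over candidate containers. The selector list here is an assumption for illustration; the real list lives in main.py.

```python
from bs4 import BeautifulSoup

# Checked in priority order; illustrative only.
CONTENT_SELECTORS = ["main", "article", "[role=main]", "#content", ".post"]

def find_main_content(soup: BeautifulSoup):
    for selector in CONTENT_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return node
    return soup.body or soup  # fall back to the whole page
```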

Respectful Web Scraping

  • Rate Limiting: Minimum 1-second interval between requests
  • User Agent: Proper identification as "MCP-Fetch-As-Markdown" tool
  • Timeout Handling: Configurable timeouts to avoid hanging requests
  • Error Handling: Graceful handling of network issues, HTTP errors, and malformed content
  • Redirect Support: Follows redirects and reports final URLs
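The rate-limiting policy can be sketched as a tiny guard object that enforces the 1-second minimum interval. This is an illustration of the policy, not the server's actual code.

```python
import time

class RateLimiter:
    """Enforce a minimum interval between outgoing requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_request = 0.0  # monotonic timestamp of the last request

    def wait(self) -> None:
        # Sleep just long enough to keep requests min_interval apart.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```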

Structured Output Format

All responses include:

  • Metadata Block: Original URL, final URL, page title, content statistics, HTTP status
  • Content Block: Clean markdown conversion of the main page content

This structure makes responses both human-readable and machine-parseable while minimizing token usage.

Error Handling

  • Invalid URLs: Clear validation and error messages
  • Network Issues: Timeout, connection error, and DNS failure handling
  • HTTP Errors: Proper handling of 404, 500, and other HTTP status codes
  • Malformed Content: Graceful handling of broken HTML and encoding issues
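The error categories above map naturally onto the exception hierarchy of the requests library. The message wording below is illustrative; main.py's actual error text may differ.

```python
import requests

def describe_fetch_error(exc: Exception) -> str:
    """Map common request failures to readable messages (illustrative wording)."""
    if isinstance(exc, requests.exceptions.Timeout):
        return "Error: request timed out; try a larger timeout value"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "Error: could not connect (DNS failure or host unreachable)"
    if isinstance(exc, requests.exceptions.HTTPError):
        code = exc.response.status_code if exc.response is not None else "error"
        return f"Error: server returned HTTP {code}"
    return f"Error: {exc}"
```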

Use Cases

For Research & Analysis

  • Convert articles and blog posts to clean markdown for analysis
  • Extract main content from news articles and research papers
  • Gather information while minimizing irrelevant context

For Content Processing

  • Prepare web content for further AI processing
  • Extract clean text from web pages for summarization
  • Convert HTML content to markdown for documentation

For AI Assistants

  • Fetch and process web content with minimal token overhead
  • Extract relevant information while filtering out noise
  • Provide clean, structured content for AI reasoning

Examples

Basic Page Fetching

Ask your AI assistant: "Fetch the content from this article URL as markdown"

The server will:

  1. Fetch the web page with proper headers and rate limiting
  2. Extract the main content area, removing navigation and ads
  3. Convert to clean markdown format
  4. Return structured metadata and content

With Link Preservation

Ask your AI assistant: "Fetch this page but keep all the links intact"

The server will:

  1. Fetch and process the page normally
  2. Preserve all hyperlinks in markdown format [text](url)
  3. Maintain link structure while cleaning other elements

Error Handling Example

Ask your AI assistant: "Try to fetch content from this broken URL"

The server will:

  1. Validate the URL format
  2. Attempt the request with proper timeout
  3. Return a structured error message if the request fails
  4. Provide helpful information about what went wrong

Development

Project Structure

mcp-fetch-as-markdown/
├── main.py          # Main MCP server implementation
├── pyproject.toml   # Project dependencies and metadata
├── AGENT.md         # Development rules and guidelines
├── example.py       # Usage examples and demonstrations
└── .venv/           # Virtual environment (created by uv)

Key Dependencies

  • mcp: Model Context Protocol framework
  • requests: HTTP request handling
  • beautifulsoup4: HTML parsing and content extraction
  • markdownify: HTML to markdown conversion

Customization

The server can be easily customized by modifying main.py:

  • Content Selectors: Modify the CSS selectors used to find main content
  • Rate Limiting: Adjust the minimum interval between requests
  • Timeout Settings: Change default and maximum timeout values
  • Content Filtering: Add custom content processing or filtering rules
  • Markdown Options: Customize markdown conversion settings

Testing the Server

Test the server directly:

uvx git+https://github.com/bhubbb/mcp-fetch-as-markdown

Or with local installation:

cd mcp-fetch-as-markdown
uv run python main.py

For interactive testing, use the example script:

uv run python example.py

Troubleshooting

Common Issues

  1. Import Errors: Make sure all dependencies are installed with uv sync
  2. Connection Timeouts: Some websites may be slow; try increasing the timeout parameter
  3. Rate Limiting: The server enforces 1-second intervals between requests
  4. Blocked Requests: Some websites may block automated requests; this is expected behavior

Debugging

Enable debug logging by modifying the logging level in main.py:

logging.basicConfig(level=logging.DEBUG)

Website Compatibility

  • Modern Websites: Works best with standard HTML structure
  • JavaScript-heavy Sites: Cannot execute JavaScript; fetches initial HTML only
  • Protected Content: Respects robots.txt and website access restrictions
  • Rate Limits: Implements respectful scraping practices

Ethical Usage

This tool is designed for legitimate research, analysis, and content processing. Please:

  • Respect Terms of Service: Always check and comply with website terms of service
  • Avoid Overloading: The built-in rate limiting helps, but be mindful of request frequency
  • Attribution: Give proper credit to original sources when using extracted content
  • Legal Compliance: Ensure your use case complies with applicable laws and regulations

Contributing

This is a simple, single-file implementation designed for clarity and ease of modification. Feel free to:

  • Add support for additional content extraction strategies
  • Implement custom filtering for specific website types
  • Add caching for better performance
  • Extend with additional markdown formatting options

License

This project uses the same license as its dependencies. Content fetched from websites remains subject to the original website's terms of service and copyright.
