MCP Data Fetch Server
Securely fetches web content, extracts links and metadata, and downloads files through a sandboxed MCP server without JavaScript execution. Includes prompt-injection detection and comprehensive HTML sanitization for safe web data retrieval.
📂 MCP Data Fetch Server
MCP Data Fetch Server is a secure, sandboxed server that fetches web content and extracts data via the Model Context Protocol (MCP), without executing JavaScript.
Table of Contents
- Features
- Installation & Quick Start
- Command‑Line Options
- Integration with LM Studio
- MCP API Overview
- Available Tools
- Security Features
🎯 Features
- Secure web page fetching – strips scripts, iframes and cookie banners; no JavaScript execution.
- Rich data extraction – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.
- Safe file downloads – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.
- Built‑in caching – optional cache directory reduces repeated network calls.
- Prompt‑injection detection – validates URLs and fetched content for malicious instructions.
📦 Installation & Quick Start
```bash
# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer

# Make the startup script executable
chmod +x run.sh

# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory
```
📌 Three‑step overview
1️⃣ The script creates a virtual environment and installs dependencies.
2️⃣ It prepares a cache folder (.fetch_cache) inside the project root.
3️⃣ main.py launches the MCP server, listening on stdin/stdout for JSON‑RPC requests.
⚙️ Command‑Line Options
| Option | Description |
|---|---|
| `-d, --working-dir` | Path to the sandboxed working directory where all file operations are confined (default: `~/.mcp_datafetch`). |
| `-c, --cache-dir` | Name of the cache subdirectory relative to the working directory (default: `cache`). |
| `-h, --help` | Show help message and exit. |
🤝 Integration with LM Studio (or any MCP‑compatible client)
Add an entry to your mcp.json configuration so that LM Studio can launch the server automatically.
```json
{
  "mcpServers": {
    "datafetch": {
      "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh",
      "args": [
        "-d",
        "/absolute/path/to/working/directory"
      ],
      "env": { "WORKING_DIR": "." }
    }
  }
}
```
📌 Tip: Ensure run.sh is executable (chmod +x …) and that the virtual environment can install the required Python packages on first launch.
📡 MCP API Overview
All communication follows JSON‑RPC 2.0 over stdin/stdout.
initialize
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {}
}
```
Response contains the protocol version, server capabilities and basic metadata (e.g., name = mcp-datafetch-server, version = 2.1.0).
tools/list
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}
```
Response: { "tools": [ …tool definitions… ] }. Each definition includes name, description and an input schema (JSON Schema).
tools/call
Generic request shape (replace <tool_name> and arguments as needed):
```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": { … }
  }
}
```
The server validates the request against the tool’s schema, executes the operation, and returns a ToolResult containing one or more content blocks.
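The envelopes above can be assembled programmatically before being written, one JSON object per line, to the server's stdin. The sketch below only builds the JSON-RPC messages (it does not spawn the server); the helper names `make_request` and `call_tool` are illustrative, not part of the server's API.

```python
import json

def make_request(req_id, method, params=None):
    """Build a JSON-RPC 2.0 request envelope for the MCP server."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params or {}}

def call_tool(req_id, name, arguments):
    """Wrap a tool invocation in the tools/call envelope."""
    return make_request(req_id, "tools/call", {"name": name, "arguments": arguments})

# Example: serialise a fetch_webpage call as a single stdin line
req = call_tool(10, "fetch_webpage",
                {"url": "https://example.com/article", "format": "markdown"})
line = json.dumps(req)
```

A client would then read one JSON object per line from the server's stdout and match responses to requests by `id`.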
🛠️ Available Tools
fetch_webpage
- Securely fetches a web page and returns clean content in the requested format.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL to fetch (http/https only). |
| `format` | string | ❌ (`markdown`) | Output format – one of `markdown`, `text`, or `html`. |
| `include_links` | boolean | ❌ (`true`) | Whether to append an extracted links list. |
| `include_images` | boolean | ❌ (`false`) | Whether to list image URLs in the output. |
| `remove_banners` | boolean | ❌ (`true`) | Attempt to strip cookie banners & pop‑ups. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "tools/call",
  "params": {
    "name": "fetch_webpage",
    "arguments": {
      "url": "https://example.com/article",
      "format": "markdown",
      "include_links": true,
      "include_images": false,
      "remove_banners": true
    }
  }
}
```
Note: The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.
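The server's actual sanitiser is internal to the project; as a rough standard-library sketch of the idea (skip the contents of risky tags while keeping visible text), something like the following would work. The `Sanitizer` class and `BLOCKED` set here are illustrative assumptions, not the server's real code.

```python
from html.parser import HTMLParser

# Tags whose content is dropped entirely (illustrative list)
BLOCKED = {"script", "iframe", "style", "noscript"}

class Sanitizer(HTMLParser):
    """Collects text while skipping everything inside blocked tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside blocked tags
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in BLOCKED and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

def sanitize(html_doc):
    parser = Sanitizer()
    parser.feed(html_doc)
    return "".join(parser.parts).strip()
```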
extract_links
- Extracts and categorises all hyperlinks from a page.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL of the page to analyse. |
| `filter` | string | ❌ (`all`) | Return only `all`, `internal`, `external`, or `resources`. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "tools/call",
  "params": {
    "name": "extract_links",
    "arguments": {
      "url": "https://example.com/blog",
      "filter": "internal"
    }
  }
}
```
Note: Links are classified as internal (same domain) or external; resource links (images, PDFs…) can be filtered with resources.
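The same-domain test behind this classification can be sketched with the standard library's `urllib.parse`; the function name and the resource-extension list below are illustrative assumptions, not the server's actual logic.

```python
from urllib.parse import urlparse, urljoin

# Illustrative subset of extensions treated as downloadable resources
RESOURCE_EXTS = (".png", ".jpg", ".gif", ".pdf", ".zip")

def classify_link(page_url, href):
    """Classify a hyperlink relative to the page it appears on."""
    absolute = urljoin(page_url, href)          # resolve relative links
    if absolute.lower().endswith(RESOURCE_EXTS):
        return "resources"
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return "internal" if same_host else "external"
```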
download_file
- Safely downloads a remote file into the sandboxed cache directory.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | Direct URL to the file. |
| `filename` | string | ❌ (auto‑generated) | Desired filename; will be sanitised and forced into the cache directory. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 12,
  "method": "tools/call",
  "params": {
    "name": "download_file",
    "arguments": {
      "url": "https://example.com/files/report.pdf",
      "filename": "report_latest.pdf"
    }
  }
}
```
Note: The server enforces a 100 MB download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.
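The filename sanitisation and path-traversal protection mentioned here typically combine basename stripping, a character allow-list, and a realpath containment check. The sketch below shows the general pattern; `safe_cache_path` is a hypothetical helper, not the server's actual function.

```python
import os
import re

def safe_cache_path(cache_dir, filename):
    """Sanitise a requested filename and confine it to the cache directory."""
    # Drop any directory components, then replace unsafe characters
    name = re.sub(r"[^A-Za-z0-9._-]", "_", os.path.basename(filename))
    if not name:
        name = "download.bin"
    path = os.path.realpath(os.path.join(cache_dir, name))
    cache_real = os.path.realpath(cache_dir)
    # Refuse anything that resolved outside the cache directory
    if os.path.commonpath([path, cache_real]) != cache_real:
        raise ValueError("path escapes the cache directory")
    return path
```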
get_page_metadata
- Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL of the page to inspect. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 13,
  "method": "tools/call",
  "params": {
    "name": "get_page_metadata",
    "arguments": { "url": "https://example.com/product/42" }
  }
}
```
Note: The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
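Collecting this kind of metadata amounts to reading `<title>` and the `name`/`property` attributes of `<meta>` tags (Open Graph uses `property="og:…"`, Twitter Cards use `name="twitter:…"`). A standard-library sketch of the idea, with the `MetaCollector` class as an illustrative assumption rather than the server's real extractor:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects <title> text and <meta> name/property -> content pairs."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and "content" in a:
            key = a.get("property") or a.get("name")
            if key:
                self.meta[key] = a["content"]
        elif tag == "title":
            self._in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = data.strip()
```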
check_url
- Performs a lightweight HEAD request to report status code, headers and size without downloading the body.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL to probe. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 14,
  "method": "tools/call",
  "params": {
    "name": "check_url",
    "arguments": { "url": "https://example.com/resource.zip" }
  }
}
```
Note: The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as Content‑Type and Content‑Length.
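A HEAD probe like this can be expressed with the standard library alone; the `head_request` helper below is a hypothetical sketch of the pattern (it performs real network I/O when called), not the server's implementation.

```python
import urllib.request

def head_request(url, timeout=10):
    """Issue a HEAD request and report status, final URL, and key headers."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return {
            "status": resp.status,
            "final_url": resp.geturl(),  # URL after any redirects
            "content_type": resp.headers.get("Content-Type"),
            "content_length": resp.headers.get("Content-Length"),
        }
```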
🔐 Security Features
- Path‑traversal protection – all file operations are confined to the sandboxed working directory.
- Prompt‑injection detection in URLs, fetched HTML and generated content.
- Blocked domains & extensions (localhost, private IP ranges, executable/script files).
- Content‑size limits – max 50 MB for page fetches, max 100 MB for file downloads.
- HTML sanitisation – removes `<script>`, `<iframe>`, event handlers and other risky elements before processing.
- Cookie/banner handling – optional removal of consent banners and pop‑ups during fetch.
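Prompt-injection detection is typically pattern-based: fetched text is scanned for phrasing that tries to redirect the model. The patterns and function below are a minimal illustrative sketch, far smaller than whatever the server actually checks.

```python
import re

# Illustrative patterns only; a real detector would be far more extensive.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]

def looks_like_injection(text):
    """Heuristic check for common prompt-injection phrasing."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```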
© 2025 Undici77 – All rights reserved.