MCP Data Fetch Server
Securely fetches web content, extracts links and metadata, and downloads files through a sandboxed MCP server without JavaScript execution. Includes prompt-injection detection and comprehensive HTML sanitization for safe web data retrieval.
📂 MCP Data Fetch Server
MCP Data Fetch Server is a secure, sandboxed server that fetches web content and extracts data via the Model Context Protocol (MCP), without executing JavaScript.
Table of Contents
- Features
- Installation & Quick Start
- Command‑Line Options
- Integration with LM Studio
- MCP API Overview
- Available Tools
- Security Features
🎯 Features
- Secure web page fetching – strips scripts, iframes and cookie banners; no JavaScript execution.
- Rich data extraction – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.
- Safe file downloads – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.
- Built‑in caching – optional cache directory reduces repeated network calls.
- Prompt‑injection detection – validates URLs and fetched content for malicious instructions.
📦 Installation & Quick Start
```bash
# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer

# Make the startup script executable
chmod +x run.sh

# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory
```
📌 Three‑step overview
1️⃣ The script creates a virtual environment and installs dependencies.
2️⃣ It prepares a cache folder (.fetch_cache) inside the project root.
3️⃣ main.py launches the MCP server, listening on stdin/stdout for JSON‑RPC requests.
⚙️ Command‑Line Options
| Option | Description |
|---|---|
| `-d, --working-dir` | Path to the sandboxed working directory where all file operations are confined (default: `~/.mcp_datafetch`). |
| `-c, --cache-dir` | Name of the cache subdirectory relative to the working directory (default: `cache`). |
| `-h, --help` | Show help message and exit. |
🤝 Integration with LM Studio (or any MCP‑compatible client)
Add an entry to your mcp.json configuration so that LM Studio can launch the server automatically.
```json
{
  "mcpServers": {
    "datafetch": {
      "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh",
      "args": [
        "-d",
        "/absolute/path/to/working/directory"
      ],
      "env": { "WORKING_DIR": "." }
    }
  }
}
```
📌 Tip: Ensure run.sh is executable (chmod +x …) and that the virtual environment can install the required Python packages on first launch.
📡 MCP API Overview
All communication follows JSON‑RPC 2.0 over stdin/stdout.
initialize
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {}
}
```
Response contains the protocol version, server capabilities and basic metadata (e.g., name = mcp-datafetch-server, version = 2.1.0).
tools/list
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}
```
Response: { "tools": [ …tool definitions… ] }. Each definition includes name, description and an input schema (JSON Schema).
tools/call
Generic request shape (replace <tool_name> and arguments as needed):
```json
{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": { … }
  }
}
```
The server validates the request against the tool’s schema, executes the operation, and returns a ToolResult containing one or more content blocks.
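The envelopes above can be assembled programmatically before being written, one JSON object per line, to the server's stdin. The sketch below only builds the JSON-RPC messages (it does not spawn the server); the helper names `make_request` and `call_tool` are illustrative, not part of the server's API.

```python
import json

def make_request(req_id, method, params=None):
    """Build a JSON-RPC 2.0 request envelope for the MCP server."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params or {}}

def call_tool(req_id, name, arguments):
    """Wrap a tool invocation in the tools/call envelope."""
    return make_request(req_id, "tools/call", {"name": name, "arguments": arguments})

# Example: serialise a fetch_webpage call as a single stdin line
req = call_tool(10, "fetch_webpage",
                {"url": "https://example.com/article", "format": "markdown"})
line = json.dumps(req)
```

A client would then read one JSON object per line from the server's stdout and match responses to requests by `id`.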
🛠️ Available Tools
fetch_webpage
- Securely fetches a web page and returns clean content in the requested format.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL to fetch (http/https only). |
| `format` | string | ❌ (`markdown`) | Output format – one of `markdown`, `text`, or `html`. |
| `include_links` | boolean | ❌ (`true`) | Whether to append an extracted links list. |
| `include_images` | boolean | ❌ (`false`) | Whether to list image URLs in the output. |
| `remove_banners` | boolean | ❌ (`true`) | Attempt to strip cookie banners & pop‑ups. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "tools/call",
  "params": {
    "name": "fetch_webpage",
    "arguments": {
      "url": "https://example.com/article",
      "format": "markdown",
      "include_links": true,
      "include_images": false,
      "remove_banners": true
    }
  }
}
```
Note: The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.
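The server's actual sanitiser is internal to the project; as a rough standard-library sketch of the idea (skip the contents of risky tags while keeping visible text), something like the following would work. The `Sanitizer` class and `BLOCKED` set here are illustrative assumptions, not the server's real code.

```python
from html.parser import HTMLParser

# Tags whose content is dropped entirely (illustrative list)
BLOCKED = {"script", "iframe", "style", "noscript"}

class Sanitizer(HTMLParser):
    """Collects text while skipping everything inside blocked tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside blocked tags
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in BLOCKED and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

def sanitize(html_doc):
    parser = Sanitizer()
    parser.feed(html_doc)
    return "".join(parser.parts).strip()
```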
extract_links
- Extracts and categorises all hyperlinks from a page.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL of the page to analyse. |
| `filter` | string | ❌ (`all`) | Return only `all`, `internal`, `external`, or `resources`. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "tools/call",
  "params": {
    "name": "extract_links",
    "arguments": {
      "url": "https://example.com/blog",
      "filter": "internal"
    }
  }
}
```
Note: Links are classified as internal (same domain) or external; resource links (images, PDFs…) can be filtered with resources.
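The same-domain test behind this classification can be sketched with the standard library's `urllib.parse`; the function name and the resource-extension list below are illustrative assumptions, not the server's actual logic.

```python
from urllib.parse import urlparse, urljoin

# Illustrative subset of extensions treated as downloadable resources
RESOURCE_EXTS = (".png", ".jpg", ".gif", ".pdf", ".zip")

def classify_link(page_url, href):
    """Classify a hyperlink relative to the page it appears on."""
    absolute = urljoin(page_url, href)          # resolve relative links
    if absolute.lower().endswith(RESOURCE_EXTS):
        return "resources"
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return "internal" if same_host else "external"
```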
download_file
- Safely downloads a remote file into the sandboxed cache directory.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | Direct URL to the file. |
| `filename` | string | ❌ (auto‑generated) | Desired filename; will be sanitised and forced into the cache directory. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 12,
  "method": "tools/call",
  "params": {
    "name": "download_file",
    "arguments": {
      "url": "https://example.com/files/report.pdf",
      "filename": "report_latest.pdf"
    }
  }
}
```
Note: The server enforces a 100 MB download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.
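The filename sanitisation and path-traversal protection mentioned here typically combine basename stripping, a character allow-list, and a realpath containment check. The sketch below shows the general pattern; `safe_cache_path` is a hypothetical helper, not the server's actual function.

```python
import os
import re

def safe_cache_path(cache_dir, filename):
    """Sanitise a requested filename and confine it to the cache directory."""
    # Drop any directory components, then replace unsafe characters
    name = re.sub(r"[^A-Za-z0-9._-]", "_", os.path.basename(filename))
    if not name:
        name = "download.bin"
    path = os.path.realpath(os.path.join(cache_dir, name))
    cache_real = os.path.realpath(cache_dir)
    # Refuse anything that resolved outside the cache directory
    if os.path.commonpath([path, cache_real]) != cache_real:
        raise ValueError("path escapes the cache directory")
    return path
```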
get_page_metadata
- Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL of the page to inspect. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 13,
  "method": "tools/call",
  "params": {
    "name": "get_page_metadata",
    "arguments": { "url": "https://example.com/product/42" }
  }
}
```
Note: The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
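Collecting this kind of metadata amounts to reading `<title>` and the `name`/`property` attributes of `<meta>` tags (Open Graph uses `property="og:…"`, Twitter Cards use `name="twitter:…"`). A standard-library sketch of the idea, with the `MetaCollector` class as an illustrative assumption rather than the server's real extractor:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects <title> text and <meta> name/property -> content pairs."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and "content" in a:
            key = a.get("property") or a.get("name")
            if key:
                self.meta[key] = a["content"]
        elif tag == "title":
            self._in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = data.strip()
```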
check_url
- Performs a lightweight HEAD request to report status code, headers and size without downloading the body.
| Name | Type | Required | Description |
|---|---|---|---|
| `url` | string | ✅ (no default) | URL to probe. |
Example
```json
{
  "jsonrpc": "2.0",
  "id": 14,
  "method": "tools/call",
  "params": {
    "name": "check_url",
    "arguments": { "url": "https://example.com/resource.zip" }
  }
}
```
Note: The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as Content‑Type and Content‑Length.
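A HEAD probe like this can be expressed with the standard library alone; the `head_request` helper below is a hypothetical sketch of the pattern (it performs real network I/O when called), not the server's implementation.

```python
import urllib.request

def head_request(url, timeout=10):
    """Issue a HEAD request and report status, final URL, and key headers."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return {
            "status": resp.status,
            "final_url": resp.geturl(),  # URL after any redirects
            "content_type": resp.headers.get("Content-Type"),
            "content_length": resp.headers.get("Content-Length"),
        }
```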
🔐 Security Features
- Path‑traversal protection – all file operations are confined to the sandboxed working directory.
- Prompt‑injection detection in URLs, fetched HTML and generated content.
- Blocked domains & extensions (localhost, private IP ranges, executable/script files).
- Content‑size limits – max 50 MB for page fetches, max 100 MB for file downloads.
- HTML sanitisation – removes `<script>`, `<iframe>`, event handlers and other risky elements before processing.
- Cookie/banner handling – optional removal of consent banners and pop‑ups during fetch.
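Prompt-injection detection is typically pattern-based: fetched text is scanned for phrasing that tries to redirect the model. The patterns and function below are a minimal illustrative sketch, far smaller than whatever the server actually checks.

```python
import re

# Illustrative patterns only; a real detector would be far more extensive.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]

def looks_like_injection(text):
    """Heuristic check for common prompt-injection phrasing."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```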
© 2025 Undici77 – All rights reserved.