Fetch MCP Server

Okay, here's a breakdown of how you can fetch URLs from a webpage using Playwright, integrate it with an SSE (Server-Sent Events) MCP (Management Control Plane) server, and use Node.js with Express.js to orchestrate everything. Code snippets and explanations follow.

**Conceptual Overview**

1. **Playwright (Web Scraping):** Playwright launches a browser, navigates to the target webpage, and extracts the URLs you need.
2. **Node.js/Express.js (Server):** Express.js provides a web server that handles requests to start the scraping process and streams the results back to the client.
3. **SSE (Server-Sent Events):** SSE pushes the extracted URLs from the server to the client in real time as they are found. This is more efficient than repeatedly polling the server.
4. **MCP (Management Control Plane):** The MCP part is more abstract. It implies you have a system for managing and controlling the scraping process. This could involve:
   * Configuration: specifying the target URL, selectors for extracting URLs, etc.
   * Monitoring: tracking the progress of the scraping job.
   * Error handling: managing errors that occur during scraping.
   * Scaling: distributing the scraping workload across multiple instances.

**Code Example (Illustrative)**

**1. Project Setup**

```bash
mkdir playwright-sse-mcp
cd playwright-sse-mcp
npm init -y
npm install playwright express
```

**2. `server.js` (Node.js/Express.js Server)**

```javascript
const express = require('express');
const { chromium } = require('playwright');

const app = express();
const port = 3000;

app.use(express.json()); // For parsing JSON request bodies

// In-memory storage for SSE connections (replace with a more robust
// solution for production). Declared with `let` because it is reassigned
// when clients disconnect.
let sseClients = [];

app.get('/sse', (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders();

  const clientId = Date.now(); // Unique ID for the client
  sseClients.push({ id: clientId, res });
  console.log(`${clientId} Connection open`);

  req.on('close', () => {
    console.log(`${clientId} Connection closed`);
    sseClients = sseClients.filter((client) => client.id !== clientId);
  });
});

// Broadcast a data object to all connected SSE clients
function sendSSE(data) {
  sseClients.forEach((client) => {
    client.res.write(`data: ${JSON.stringify(data)}\n\n`);
  });
}

app.post('/scrape', async (req, res) => {
  const { url, selector } = req.body; // Get URL and selector from request body

  if (!url || !selector) {
    return res.status(400).send('URL and selector are required.');
  }

  console.log(`Starting scrape for ${url} with selector ${selector}`);

  try {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto(url);

    // Extract the href attribute of every element matching the selector
    const links = await page
      .locator(selector)
      .evaluateAll((links) => links.map((link) => link.href));

    for (const link of links) {
      sendSSE({ url: link }); // Send each URL via SSE
    }

    await browser.close();
    console.log(`Scrape complete for ${url}`);
    res.status(200).send('Scrape complete; URLs were streamed to connected SSE clients.');
  } catch (error) {
    console.error('Scrape error:', error);
    sendSSE({ error: error.message }); // Send error via SSE
    res.status(500).send('Scrape failed.');
  }
});

app.listen(port, () => {
  console.log(`Server listening at http://localhost:${port}`);
});
```
**3. `client.html` (Simple Client to Receive SSE)**

```html
<!DOCTYPE html>
<html>
<head>
  <title>SSE Client</title>
</head>
<body>
  <h1>SSE Stream</h1>
  <ul id="urlList"></ul>

  <script>
    const urlList = document.getElementById('urlList');
    const eventSource = new EventSource('http://localhost:3000/sse');

    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.url) {
        const listItem = document.createElement('li');
        listItem.textContent = data.url;
        urlList.appendChild(listItem);
      } else if (data.error) {
        const listItem = document.createElement('li');
        listItem.textContent = `Error: ${data.error}`;
        urlList.appendChild(listItem);
      }
    };

    eventSource.onerror = (error) => {
      console.error('SSE error:', error);
    };
  </script>
</body>
</html>
```

**4. Running the Example**

1. **Start the server:** `node server.js`
2. **Open `client.html`** in your browser.
3. **Send a POST request to `/scrape`** using `curl`, Postman, or a similar tool. For example:

   ```bash
   curl -X POST -H "Content-Type: application/json" -d '{"url": "https://www.example.com", "selector": "a"}' http://localhost:3000/scrape
   ```

   Replace `"https://www.example.com"` with the URL you want to scrape and `"a"` with the CSS selector for the links you want to extract.

**Explanation**

* **`server.js`:**
  * Sets up an Express.js server.
  * `/sse` endpoint: handles SSE connections. It sets the correct headers for SSE and keeps track of connected clients.
  * `sendSSE(data)`: sends data to all connected SSE clients.
  * `/scrape` endpoint:
    * Receives the target URL and CSS selector from the request body.
    * Launches a Playwright browser and navigates to the URL.
    * Uses `page.locator(selector).evaluateAll()` to extract the `href` attributes of all elements matching the selector.
    * Sends each URL back to the client via SSE.
    * Handles errors and sends error messages via SSE.
* **`client.html`:**
  * Creates an `EventSource` object to connect to the `/sse` endpoint.
  * Listens for `message` events from the server.
  * Parses the JSON data and displays the URLs in a list.
  * Handles errors.

**Important Considerations for Production (MCP)**

* **Configuration management:** instead of hardcoding the URL and selector in the `curl` command, you'd typically store them in a database or configuration file. Your MCP would provide an interface for managing these configurations.
* **Job queuing:** use a message queue (e.g., RabbitMQ, Kafka) to queue scraping jobs. This allows you to handle a large number of requests without overloading the server.
* **Scaling:** run multiple instances of the scraping server behind a load balancer. The message queue will distribute the jobs across the instances.
* **Monitoring:** use a monitoring system (e.g., Prometheus, Grafana) to track the performance of the scraping servers, the number of jobs in the queue, and any errors that occur.
* **Error handling:** implement robust error handling and retry mechanisms. If a scraping job fails, you might retry it a few times before giving up (see the retry sketch after this list).
* **Rate limiting:** be respectful of the websites you are scraping. Implement rate limiting to avoid overloading their servers.
* **Data storage:** instead of just displaying the URLs in the browser, you'll likely want to store them in a database or other data store.
* **Authentication/authorization:** secure your MCP endpoints with authentication and authorization to prevent unauthorized access.
* **Headless mode:** run the Playwright browser in headless mode (the default) for better performance.
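To make the retry point above concrete, here is a minimal sketch of a retry-with-exponential-backoff helper that could wrap the Playwright scrape. The `scrapeOnce` name and the chosen limits are illustrative assumptions, not part of the server shown earlier.

```javascript
// Minimal retry-with-backoff sketch (illustrative; `scrapeOnce` and the
// limits below are assumptions, not part of the server above).
async function withRetries(fn, { attempts = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === attempts) throw error; // Out of retries: propagate
      const delay = baseDelayMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      console.warn(`Attempt ${attempt} failed (${error.message}); retrying in ${delay}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: wrap the scraping logic from the /scrape handler in a retryable
// function, e.g.
// const links = await withRetries(() => scrapeOnce(url, selector));
```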
**Example of MCP Integration (Conceptual)**

Let's say you have an MCP API endpoint `/api/scrape-jobs` that allows you to create new scraping jobs. The request body might look like this (`callbackUrl` is where the scraped data should be sent):

```json
{
  "url": "https://www.example.com",
  "selector": "a",
  "callbackUrl": "https://your-data-store.com/api/store-data"
}
```

Your server would then:

1. Receive the request to `/api/scrape-jobs`.
2. Validate the request.
3. Create a job in the message queue (e.g., RabbitMQ).
4. A worker process (one of your scraping server instances) picks up the job from the queue.
5. The worker scrapes the URL, extracts the data, and sends it to the `callbackUrl`.
6. The worker updates the job status in the MCP (e.g., "in progress", "completed", "failed").

This is a simplified example, but it illustrates the basic principles of integrating Playwright with an SSE server and an MCP; the specific implementation will depend on your requirements and the architecture of your MCP. A worker sketch for steps 3-5 follows at the end of this section.

**Chinese Translation of Key Concepts**

* **Web Scraping:** 网络爬虫 (wǎngluò páchóng)
* **Server-Sent Events (SSE):** 服务器发送事件 (fúwùqì fāsòng shìjiàn)
* **Management Control Plane (MCP):** 管理控制平面 (guǎnlǐ kòngzhì píngmiàn)
* **Playwright / Node.js / Express.js:** no direct translation; use the English names
* **URL:** 网址 (wǎngzhǐ)
* **CSS Selector:** CSS 选择器 (CSS xuǎnzéqì)
* **Endpoint:** 端点 (duāndiǎn)
* **Message Queue:** 消息队列 (xiāoxi duìliè)
* **Load Balancer:** 负载均衡器 (fùzài jūnhéngqì)
* **Monitoring:** 监控 (jiānkòng)
* **Error Handling:** 错误处理 (cuòwù chǔlǐ)
* **Rate Limiting:** 速率限制 (sùlǜ xiànzhì)

This explanation and code example should give you a solid foundation for building a Playwright-based web scraping solution with SSE and MCP integration. Adapt the code and architecture to your specific needs and environment.
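As promised above, here is a hedged worker sketch for steps 3-5 of the conceptual flow. It assumes RabbitMQ via the `amqplib` package, Node 18+ (for the global `fetch`), and a hypothetical `scrapeUrl(url, selector)` helper extracted from the Playwright code earlier; treat it as a sketch under those assumptions, not a definitive implementation.

```javascript
// Hedged worker sketch for steps 3-5 of the flow above. Assumes RabbitMQ
// (`amqplib` package), Node 18+ (global fetch), and a hypothetical
// scrapeUrl(url, selector) helper built from the earlier Playwright code.
const amqp = require('amqplib');

async function runWorker() {
  const conn = await amqp.connect('amqp://localhost');
  const channel = await conn.createChannel();
  const queue = 'scrape-jobs'; // Assumed queue name

  await channel.assertQueue(queue, { durable: true });
  channel.prefetch(1); // One job at a time per worker

  channel.consume(queue, async (msg) => {
    const job = JSON.parse(msg.content.toString()); // { url, selector, callbackUrl }
    try {
      const links = await scrapeUrl(job.url, job.selector); // Hypothetical helper
      // Step 5: deliver the scraped data to the configured callback URL.
      await fetch(job.callbackUrl, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ url: job.url, links }),
      });
      channel.ack(msg); // Step 6 would also update the job status in the MCP
    } catch (error) {
      console.error('Job failed:', error);
      channel.nack(msg, false, false); // Dead-letter instead of endless redelivery
    }
  });
}

runWorker().catch(console.error);
```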


README

Fetch MCP Server

A server implementing the Model Context Protocol (MCP), with web content fetching capabilities and support for Server-Sent Events (SSE) based communication.

Overview

The Fetch MCP Server provides an implementation of the Model Context Protocol that allows real-time communication using Server-Sent Events (SSE). It features web content fetching and search capabilities, is designed to work with AI applications, and offers a standardized way to manage and communicate with models and resources.

Features

  • Model Context Protocol (MCP) implementation
  • Server-Sent Events (SSE) for real-time communication
  • Web content fetching powered by Playwright
  • Google search using SERP results
  • HTML to Markdown conversion
  • Resource templating and management

Prerequisites

  • Node.js (v14 or higher recommended)
  • Yarn package manager
  • A Kubernetes cluster (for deployment)

Installation

Clone the repository and install dependencies:

```bash
git clone git@github.com:pnparadise/fetch-mcp-server.git
cd fetch-mcp-server
yarn install
```

Usage

Starting the Server

To start the server in development mode:

```bash
yarn dev
```

This runs the server with the MCP Inspector for debugging.


To start the server in production mode:

```bash
yarn start
```

By default, the server runs on port 3001. You can override this by setting the PORT environment variable:

```bash
PORT=5000 yarn start
```

Endpoints

  • /sse - SSE endpoint for establishing persistent connections
  • /messages - Endpoint for sending messages to the MCP server (a client sketch follows below)
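As an illustration of how a client might talk to these two endpoints, here is a minimal sketch. It assumes the server uses the standard MCP SSE transport (as in the MCP TypeScript SDK), where the initial SSE `endpoint` event carries a session-specific `/messages` URL and messages are JSON-RPC 2.0; it also assumes the `eventsource` npm package (v2-style import) for Node. Verify the details against `src/sse-server.ts` before relying on this.

```javascript
// Minimal client sketch. Assumptions: standard MCP SSE transport,
// JSON-RPC 2.0 messages, `eventsource` npm package (v2-style import).
const EventSource = require('eventsource');

const BASE = 'http://localhost:3001';
const es = new EventSource(`${BASE}/sse`);

// The MCP SSE transport first emits an `endpoint` event whose data is the
// session-specific URL to POST JSON-RPC messages to.
es.addEventListener('endpoint', async (event) => {
  const messagesUrl = new URL(event.data, BASE).href;

  // A real client starts with the MCP initialize handshake; tool calls
  // (e.g. tools/list, tools/call) follow after it completes.
  await fetch(messagesUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'initialize',
      params: {
        protocolVersion: '2024-11-05',
        capabilities: {},
        clientInfo: { name: 'sketch-client', version: '0.0.0' },
      },
    }),
  });
});

// Responses arrive back over the SSE stream as `message` events.
es.onmessage = (event) => console.log('Server message:', event.data);
```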

Available Tools

  • fetchUrls - Fetches multiple URLs and converts their content to markdown (a sample payload follows below)
  • search - Searches and retrieves web content using SERP results
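For reference, a `tools/call` request for `fetchUrls` might look like the sketch below, sent to the `/messages` endpoint after the MCP handshake. The `urls` argument name is an assumption; confirm the tool's real input schema via a `tools/list` call.

```javascript
// Hypothetical tools/call payload for fetchUrls. The `urls` argument name
// is an assumption; confirm the real input schema via tools/list.
const callFetchUrls = {
  jsonrpc: '2.0',
  id: 2,
  method: 'tools/call',
  params: {
    name: 'fetchUrls',
    arguments: { urls: ['https://example.com', 'https://example.org'] },
  },
};
```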

Project Structure

  • src/index.ts - Main entry point of the application
  • src/mcp-server.ts - Implementation of the MCP server and tool definitions
  • src/sse-server.ts - SSE server implementation using Express
  • k8s/ - Kubernetes deployment configurations

Dependencies
