Fetch MCP Server

Okay, here's a breakdown of how you can fetch URLs from a webpage using Playwright, integrate it with an SSE (Server-Sent Events) MCP (Management Control Plane) server, and use Node.js with Express.js to orchestrate everything. Code snippets and explanations follow.

**Conceptual Overview**

1. **Playwright (Web Scraping):** Playwright launches a browser, navigates to the target webpage, and extracts the URLs you need.
2. **Node.js/Express.js (Server):** Express.js provides a web server that handles requests to start the scraping process and streams the results back to the client.
3. **SSE (Server-Sent Events):** SSE pushes the extracted URLs from the server to the client in real time as they are found. This is more efficient than repeatedly polling the server.
4. **MCP (Management Control Plane):** The MCP part is more abstract. It implies you have a system for managing and controlling the scraping process. This could involve:
   * Configuration: specifying the target URL, selectors for extracting URLs, etc.
   * Monitoring: tracking the progress of the scraping job.
   * Error handling: managing errors that occur during scraping.
   * Scaling: distributing the scraping workload across multiple instances.

**Code Example (Illustrative)**

**1. Project Setup**

```bash
mkdir playwright-sse-mcp
cd playwright-sse-mcp
npm init -y
npm install playwright express
```

**2. `server.js` (Node.js/Express.js Server)**

```javascript
const express = require('express');
const { chromium } = require('playwright');

const app = express();
const port = 3000;

app.use(express.json()); // For parsing JSON request bodies

// In-memory storage for SSE connections (replace with a more robust
// solution for production). Declared with `let` because it is reassigned
// when clients disconnect.
let sseClients = [];

app.get('/sse', (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders();

  const clientId = Date.now(); // Unique ID for the client
  sseClients.push({ id: clientId, res });
  console.log(`${clientId} Connection open`);

  req.on('close', () => {
    console.log(`${clientId} Connection closed`);
    sseClients = sseClients.filter((client) => client.id !== clientId);
  });
});

// Broadcast a data object to all connected SSE clients
function sendSSE(data) {
  sseClients.forEach((client) => {
    client.res.write(`data: ${JSON.stringify(data)}\n\n`);
  });
}

app.post('/scrape', async (req, res) => {
  const { url, selector } = req.body; // Get URL and selector from request body

  if (!url || !selector) {
    return res.status(400).send('URL and selector are required.');
  }

  console.log(`Starting scrape for ${url} with selector ${selector}`);

  try {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto(url);

    // Extract the href attribute of every element matching the selector
    const links = await page
      .locator(selector)
      .evaluateAll((links) => links.map((link) => link.href));

    for (const link of links) {
      sendSSE({ url: link }); // Send each URL via SSE
    }

    await browser.close();
    console.log(`Scrape complete for ${url}`);
    res.status(200).send('Scrape complete; URLs were streamed to connected SSE clients.');
  } catch (error) {
    console.error('Scrape error:', error);
    sendSSE({ error: error.message }); // Send error via SSE
    res.status(500).send('Scrape failed.');
  }
});

app.listen(port, () => {
  console.log(`Server listening at http://localhost:${port}`);
});
```
**3. `client.html` (Simple Client to Receive SSE)**

```html
<!DOCTYPE html>
<html>
<head>
  <title>SSE Client</title>
</head>
<body>
  <h1>SSE Stream</h1>
  <ul id="urlList"></ul>

  <script>
    const urlList = document.getElementById('urlList');
    const eventSource = new EventSource('http://localhost:3000/sse');

    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.url) {
        const listItem = document.createElement('li');
        listItem.textContent = data.url;
        urlList.appendChild(listItem);
      } else if (data.error) {
        const listItem = document.createElement('li');
        listItem.textContent = `Error: ${data.error}`;
        urlList.appendChild(listItem);
      }
    };

    eventSource.onerror = (error) => {
      console.error('SSE error:', error);
    };
  </script>
</body>
</html>
```

**4. Running the Example**

1. **Start the server:** `node server.js`
2. **Open `client.html`** in your browser.
3. **Send a POST request to `/scrape`** using `curl`, Postman, or a similar tool. For example:

   ```bash
   curl -X POST -H "Content-Type: application/json" -d '{"url": "https://www.example.com", "selector": "a"}' http://localhost:3000/scrape
   ```

   Replace `"https://www.example.com"` with the URL you want to scrape and `"a"` with the CSS selector for the links you want to extract.

**Explanation**

* **`server.js`:**
  * Sets up an Express.js server.
  * `/sse` endpoint: handles SSE connections. It sets the correct headers for SSE and keeps track of connected clients.
  * `sendSSE(data)`: sends data to all connected SSE clients.
  * `/scrape` endpoint:
    * Receives the target URL and CSS selector from the request body.
    * Launches a Playwright browser and navigates to the URL.
    * Uses `page.locator(selector).evaluateAll()` to extract the `href` attributes of all elements matching the selector.
    * Sends each URL back to the client via SSE.
    * Handles errors and sends error messages via SSE.
* **`client.html`:**
  * Creates an `EventSource` object to connect to the `/sse` endpoint.
  * Listens for `message` events from the server.
  * Parses the JSON data and displays the URLs in a list.
  * Handles errors.

**Important Considerations for Production (MCP)**

* **Configuration management:** instead of hardcoding the URL and selector in the `curl` command, you'd typically store them in a database or configuration file. Your MCP would provide an interface for managing these configurations.
* **Job queuing:** use a message queue (e.g., RabbitMQ, Kafka) to queue scraping jobs. This allows you to handle a large number of requests without overloading the server.
* **Scaling:** run multiple instances of the scraping server behind a load balancer. The message queue will distribute the jobs across the instances.
* **Monitoring:** use a monitoring system (e.g., Prometheus, Grafana) to track the performance of the scraping servers, the number of jobs in the queue, and any errors that occur.
* **Error handling:** implement robust error handling and retry mechanisms. If a scraping job fails, you might retry it a few times before giving up (see the retry sketch after this list).
* **Rate limiting:** be respectful of the websites you are scraping. Implement rate limiting to avoid overloading their servers.
* **Data storage:** instead of just displaying the URLs in the browser, you'll likely want to store them in a database or other data store.
* **Authentication/authorization:** secure your MCP endpoints with authentication and authorization to prevent unauthorized access.
* **Headless mode:** run the Playwright browser in headless mode (the default) for better performance.
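To make the retry point above concrete, here is a minimal sketch of a retry-with-exponential-backoff helper that could wrap the Playwright scrape. The `scrapeOnce` name and the chosen limits are illustrative assumptions, not part of the server shown earlier.

```javascript
// Minimal retry-with-backoff sketch (illustrative; `scrapeOnce` and the
// limits below are assumptions, not part of the server above).
async function withRetries(fn, { attempts = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === attempts) throw error; // Out of retries: propagate
      const delay = baseDelayMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      console.warn(`Attempt ${attempt} failed (${error.message}); retrying in ${delay}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: wrap the scraping logic from the /scrape handler in a retryable
// function, e.g.
// const links = await withRetries(() => scrapeOnce(url, selector));
```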
**Example of MCP Integration (Conceptual)**

Let's say you have an MCP API endpoint `/api/scrape-jobs` that allows you to create new scraping jobs. The request body might look like this (`callbackUrl` is where the scraped data should be sent):

```json
{
  "url": "https://www.example.com",
  "selector": "a",
  "callbackUrl": "https://your-data-store.com/api/store-data"
}
```

Your server would then:

1. Receive the request to `/api/scrape-jobs`.
2. Validate the request.
3. Create a job in the message queue (e.g., RabbitMQ).
4. A worker process (one of your scraping server instances) picks up the job from the queue.
5. The worker scrapes the URL, extracts the data, and sends it to the `callbackUrl`.
6. The worker updates the job status in the MCP (e.g., "in progress", "completed", "failed").

This is a simplified example, but it illustrates the basic principles of integrating Playwright with an SSE server and an MCP; the specific implementation will depend on your requirements and the architecture of your MCP. A worker sketch for steps 3-5 follows at the end of this section.

**Chinese Translation of Key Concepts**

* **Web Scraping:** 网络爬虫 (wǎngluò páchóng)
* **Server-Sent Events (SSE):** 服务器发送事件 (fúwùqì fāsòng shìjiàn)
* **Management Control Plane (MCP):** 管理控制平面 (guǎnlǐ kòngzhì píngmiàn)
* **Playwright / Node.js / Express.js:** no direct translation; use the English names
* **URL:** 网址 (wǎngzhǐ)
* **CSS Selector:** CSS 选择器 (CSS xuǎnzéqì)
* **Endpoint:** 端点 (duāndiǎn)
* **Message Queue:** 消息队列 (xiāoxi duìliè)
* **Load Balancer:** 负载均衡器 (fùzài jūnhéngqì)
* **Monitoring:** 监控 (jiānkòng)
* **Error Handling:** 错误处理 (cuòwù chǔlǐ)
* **Rate Limiting:** 速率限制 (sùlǜ xiànzhì)

This explanation and code example should give you a solid foundation for building a Playwright-based web scraping solution with SSE and MCP integration. Adapt the code and architecture to your specific needs and environment.
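As promised above, here is a hedged worker sketch for steps 3-5 of the conceptual flow. It assumes RabbitMQ via the `amqplib` package, Node 18+ (for the global `fetch`), and a hypothetical `scrapeUrl(url, selector)` helper extracted from the Playwright code earlier; treat it as a sketch under those assumptions, not a definitive implementation.

```javascript
// Hedged worker sketch for steps 3-5 of the flow above. Assumes RabbitMQ
// (`amqplib` package), Node 18+ (global fetch), and a hypothetical
// scrapeUrl(url, selector) helper built from the earlier Playwright code.
const amqp = require('amqplib');

async function runWorker() {
  const conn = await amqp.connect('amqp://localhost');
  const channel = await conn.createChannel();
  const queue = 'scrape-jobs'; // Assumed queue name

  await channel.assertQueue(queue, { durable: true });
  channel.prefetch(1); // One job at a time per worker

  channel.consume(queue, async (msg) => {
    const job = JSON.parse(msg.content.toString()); // { url, selector, callbackUrl }
    try {
      const links = await scrapeUrl(job.url, job.selector); // Hypothetical helper
      // Step 5: deliver the scraped data to the configured callback URL.
      await fetch(job.callbackUrl, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ url: job.url, links }),
      });
      channel.ack(msg); // Step 6 would also update the job status in the MCP
    } catch (error) {
      console.error('Job failed:', error);
      channel.nack(msg, false, false); // Dead-letter instead of endless redelivery
    }
  });
}

runWorker().catch(console.error);
```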


README

Fetch MCP Server

A server implementing the Model Context Protocol (MCP), with web content fetching capabilities and support for Server-Sent Events (SSE) based communication.

Overview

The Fetch MCP Server provides an implementation of the Model Context Protocol that allows real-time communication using Server-Sent Events (SSE). It features web content fetching and search capabilities, is designed to work with AI applications, and offers a standardized way to manage and communicate with models and resources.

Features

  • Model Context Protocol (MCP) implementation
  • Server-Sent Events (SSE) for real-time communication
  • Web content fetching powered by Playwright
  • Google search using SERP results
  • HTML to Markdown conversion
  • Resource templating and management

Prerequisites

  • Node.js (v14 or higher recommended)
  • Yarn package manager
  • A Kubernetes cluster (for deployment)

Installation

Clone the repository and install dependencies:

```bash
git clone git@github.com:pnparadise/fetch-mcp-server.git
cd fetch-mcp-server
yarn install
```

Usage

Starting the Server

To start the server in development mode:

```bash
yarn dev
```

This runs the server with the MCP Inspector for debugging.


To start the server in production mode:

```bash
yarn start
```

By default, the server runs on port 3001. You can override this by setting the PORT environment variable:

```bash
PORT=5000 yarn start
```

Endpoints

  • /sse - SSE endpoint for establishing persistent connections
  • /messages - Endpoint for sending messages to the MCP server (a client sketch follows below)
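As an illustration of how a client might talk to these two endpoints, here is a minimal sketch. It assumes the server uses the standard MCP SSE transport (as in the MCP TypeScript SDK), where the initial SSE `endpoint` event carries a session-specific `/messages` URL and messages are JSON-RPC 2.0; it also assumes the `eventsource` npm package (v2-style import) for Node. Verify the details against `src/sse-server.ts` before relying on this.

```javascript
// Minimal client sketch. Assumptions: standard MCP SSE transport,
// JSON-RPC 2.0 messages, `eventsource` npm package (v2-style import).
const EventSource = require('eventsource');

const BASE = 'http://localhost:3001';
const es = new EventSource(`${BASE}/sse`);

// The MCP SSE transport first emits an `endpoint` event whose data is the
// session-specific URL to POST JSON-RPC messages to.
es.addEventListener('endpoint', async (event) => {
  const messagesUrl = new URL(event.data, BASE).href;

  // A real client starts with the MCP initialize handshake; tool calls
  // (e.g. tools/list, tools/call) follow after it completes.
  await fetch(messagesUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'initialize',
      params: {
        protocolVersion: '2024-11-05',
        capabilities: {},
        clientInfo: { name: 'sketch-client', version: '0.0.0' },
      },
    }),
  });
});

// Responses arrive back over the SSE stream as `message` events.
es.onmessage = (event) => console.log('Server message:', event.data);
```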

Available Tools

  • fetchUrls - Fetches multiple URLs and converts their content to markdown (a sample payload follows below)
  • search - Searches and retrieves web content using SERP results
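For reference, a `tools/call` request for `fetchUrls` might look like the sketch below, sent to the `/messages` endpoint after the MCP handshake. The `urls` argument name is an assumption; confirm the tool's real input schema via a `tools/list` call.

```javascript
// Hypothetical tools/call payload for fetchUrls. The `urls` argument name
// is an assumption; confirm the real input schema via tools/list.
const callFetchUrls = {
  jsonrpc: '2.0',
  id: 2,
  method: 'tools/call',
  params: {
    name: 'fetchUrls',
    arguments: { urls: ['https://example.com', 'https://example.org'] },
  },
};
```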

Project Structure

  • src/index.ts - Main entry point of the application
  • src/mcp-server.ts - Implementation of the MCP server and tool definitions
  • src/sse-server.ts - SSE server implementation using Express
  • k8s/ - Kubernetes deployment configurations

Dependencies
