V2.ai Insights Scraper MCP

A Model Context Protocol (MCP) server that scrapes blog posts from V2.ai Insights, extracts content, and provides AI-powered summaries using OpenAI's GPT-4. Currently supports Contentful CMS integration with search capabilities.

📋 Strategic Vision: This project is evolving into a comprehensive AI intelligence platform. See STRATEGIC_VISION.md for the complete roadmap from content API to strategic intelligence platform.

Features

🔍 Multi-Source Content: Fetches from Contentful CMS and V2.ai web scraping
📝 Content Extraction: Extracts title, date, author, and content with intelligent fallbacks
🔎 Full-Text Search: Search across all blog content with Contentful's search API
🤖 AI Summarization: Generates summaries using OpenAI GPT-4
🔧 MCP Integration: Exposes tools for Claude Desktop integration

Tools Available

get_latest_posts() - Retrieves blog posts with metadata (Contentful + V2.ai fallback)
get_contentful_posts(limit) - Fetch posts directly from Contentful CMS
search_blogs(query, limit) - NEW - Search across all blog content
summarize_post(index) - Returns AI-generated summary of a specific post
get_post_content(index) - Returns full content of a specific post
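
The `index` argument taken by `summarize_post` and `get_post_content` presumably refers to a position in the list of posts returned by `get_latest_posts()`. The following is a minimal sketch of that lookup, assuming posts are dictionaries with a `content` key (as the test commands in this README suggest). The helper `select_post` and the injected `summarize` callable are hypothetical names for illustration, not the project's actual code.

```python
from typing import Callable

Post = dict[str, str]


def select_post(posts: list[Post], index: int) -> Post:
    """Return posts[index], rejecting negative and out-of-range indices."""
    if not 0 <= index < len(posts):
        raise IndexError(f"index {index} out of range; {len(posts)} post(s) available")
    return posts[index]


def get_post_content(posts: list[Post], index: int) -> str:
    """Full text of the selected post."""
    return select_post(posts, index)["content"]


def summarize_post(posts: list[Post], index: int, summarize: Callable[[str], str]) -> str:
    """Summary of the selected post; `summarize` stands in for the GPT-4 call."""
    return summarize(select_post(posts, index)["content"])


# Demo with in-memory data and a stand-in summarizer (no network, no API key).
sample_posts = [
    {"title": "First post", "content": "Body of the first post."},
    {"title": "Second post", "content": "Body of the second post."},
]
print(get_post_content(sample_posts, 0))
print(summarize_post(sample_posts, 1, lambda text: text[:10]))
```

Rejecting negative indices is a design choice in this sketch: Python would otherwise silently count from the end of the list, which is rarely what a tool caller intends.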

Setup

Prerequisites

Python 3.12+
uv package manager
OpenAI API key
Contentful CMS credentials (optional, for enhanced functionality)

Installation

Clone the repository, then change into the project directory:
```
cd v2-ai-mcp
```

Install dependencies:

```
uv add fastmcp beautifulsoup4 requests openai
```

Set up environment variables:

Create a .env file based on .env.example:

```
cp .env.example .env
```

Edit .env with your credentials:

```
# Required
OPENAI_API_KEY=your-openai-api-key-here

# Optional (for Contentful integration)
CONTENTFUL_SPACE_ID=your-contentful-space-id
CONTENTFUL_ACCESS_TOKEN=your-contentful-access-token
CONTENTFUL_CONTENT_TYPE=pageBlogPost
```
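
To illustrate the split between required and optional variables, here is a minimal sketch of how a server could validate them at startup. It is an assumption about the approach, not the project's actual loader: `Settings` and `load_settings` are hypothetical names, and the sketch reads only the process environment (it does not parse the `.env` file itself).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    contentful_space_id: Optional[str]
    contentful_access_token: Optional[str]
    contentful_content_type: str

    @property
    def contentful_enabled(self) -> bool:
        # Contentful is usable only when both credentials are present.
        return bool(self.contentful_space_id and self.contentful_access_token)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required but not set")
    return Settings(
        openai_api_key=api_key,
        contentful_space_id=env.get("CONTENTFUL_SPACE_ID") or None,
        contentful_access_token=env.get("CONTENTFUL_ACCESS_TOKEN") or None,
        # Default mirrors the value shown in the .env example above.
        contentful_content_type=env.get("CONTENTFUL_CONTENT_TYPE", "pageBlogPost"),
    )


# Demo with an explicit mapping so the example runs without real credentials.
demo = load_settings({"OPENAI_API_KEY": "sk-demo"})
print("Contentful enabled:", demo.contentful_enabled)
```

Passing the mapping as a parameter keeps the function easy to test; in normal use it would be called with no arguments.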

Running the Server

```
uv run python -m src.v2_ai_mcp.main
```

The server will start and be available for MCP connections.

Testing the Scraper

Test individual components:

```
# Test scraper
uv run python -c "from src.v2_ai_mcp.scraper import fetch_blog_posts; print(fetch_blog_posts()[0]['title'])"

# Test with summarizer (requires OpenAI API key)
uv run python -c "from src.v2_ai_mcp.scraper import fetch_blog_posts; from src.v2_ai_mcp.summarizer import summarize; post = fetch_blog_posts()[0]; print(summarize(post['content'][:1000]))"

# Run unit tests
uv run pytest tests/ -v --cov=src
```

Claude Desktop Integration

Configuration

Install Claude Desktop (if not already installed)

Configure MCP in Claude Desktop:

Add to your Claude Desktop MCP configuration:

```
{
  "mcpServers": {
    "v2-insights-scraper": {
      "command": "/path/to/uv",
      "args": ["run", "--directory", "/path/to/your/v2-ai-mcp", "python", "-m", "src.v2_ai_mcp.main"],
      "env": {
        "OPENAI_API_KEY": "your-api-key-here",
        "CONTENTFUL_SPACE_ID": "your-contentful-space-id",
        "CONTENTFUL_ACCESS_TOKEN": "your-contentful-access-token",
        "CONTENTFUL_CONTENT_TYPE": "pageBlogPost"
      }
    }
  }
}
```

Restart Claude Desktop to load the MCP server
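
The two placeholder paths in the configuration are the easiest part to get wrong. As a convenience, the entry can be generated rather than typed by hand. This is a small sketch, assuming it is run from the project directory; `build_server_entry` is a hypothetical helper, and the output still has to be pasted into the Claude Desktop configuration manually.

```python
import json
import shutil
from pathlib import Path


def build_server_entry(project_dir: str, uv_path: str, env: dict[str, str]) -> dict:
    """Return the mcpServers entry shown above with concrete paths filled in."""
    return {
        "mcpServers": {
            "v2-insights-scraper": {
                "command": uv_path,
                "args": ["run", "--directory", project_dir,
                         "python", "-m", "src.v2_ai_mcp.main"],
                "env": env,
            }
        }
    }


# Resolve uv on PATH; keep the README placeholder if it is not found.
uv_path = shutil.which("uv") or "/path/to/uv"
config = build_server_entry(
    project_dir=str(Path.cwd()),
    uv_path=uv_path,
    env={"OPENAI_API_KEY": "your-api-key-here"},
)
print(json.dumps(config, indent=2))
```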

Using the Tools

Once configured, you can use these tools in Claude Desktop:

Get latest posts: get_latest_posts() (intelligent Contentful + V2.ai fallback)
Get Contentful posts: get_contentful_posts(10) (direct CMS access)
Search blogs: search_blogs("AI automation", 5) (NEW - full-text search)
Summarize post: summarize_post(0) (index 0 for first post)
Get full content: get_post_content(0)
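
The "Contentful + V2.ai fallback" behaviour of `get_latest_posts()` can be pictured as follows. This is a sketch of one plausible policy (prefer Contentful, fall back to scraping when Contentful is unconfigured, fails, or returns nothing), not a copy of the project's implementation. The two fetchers are passed in as callables, and the names here are illustrative.

```python
from typing import Callable, Optional

Post = dict[str, str]
Fetcher = Callable[[], list[Post]]


def get_latest_posts(contentful: Optional[Fetcher], scraper: Fetcher) -> list[Post]:
    """Prefer Contentful; fall back to the web scraper when it cannot deliver."""
    if contentful is not None:
        try:
            posts = contentful()
            if posts:
                return posts
        except Exception as exc:  # broad on purpose: any CMS failure triggers fallback
            print(f"Contentful fetch failed, falling back to scraper: {exc}")
    return scraper()


# Demo with stand-in fetchers instead of real network calls.
def failing_cms() -> list[Post]:
    raise ConnectionError("CMS unreachable")


def fake_scraper() -> list[Post]:
    return [{"title": "Scraped post", "content": "..."}]


print(get_latest_posts(failing_cms, fake_scraper)[0]["title"])
```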

Example Usage

🔍 Search for AI-related content:
search_blogs("artificial intelligence", 3)

📚 Get latest posts with automatic source selection:
get_latest_posts()

🤖 Get AI summary of specific post:
summarize_post(0)
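
For context on what a call like `search_blogs("artificial intelligence", 3)` may translate to: Contentful's Content Delivery API supports full-text search through the `query` parameter on the entries endpoint. The sketch below only builds such a request URL. Whether this project calls the REST endpoint directly or goes through an SDK is not stated in this README, so `build_search_url` and the default `master` environment are assumptions.

```python
from urllib.parse import urlencode


def build_search_url(
    space_id: str,
    query: str,
    limit: int = 10,
    content_type: str = "pageBlogPost",
    environment: str = "master",  # assumed default environment
) -> str:
    """Build a Content Delivery API URL for a full-text entry search."""
    params = {"content_type": content_type, "query": query, "limit": limit}
    return (
        f"https://cdn.contentful.com/spaces/{space_id}"
        f"/environments/{environment}/entries?{urlencode(params)}"
    )


# The access token is deliberately left out of the URL; it would be sent
# separately, for example in an "Authorization: Bearer <token>" header.
print(build_search_url("your-contentful-space-id", "artificial intelligence", limit=3))
```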

Project Structure

```
v2-ai-mcp/
├── src/
│   └── v2_ai_mcp/
│       ├── __init__.py      # Package initialization
│       ├── main.py          # FastMCP server with tool definitions
│       ├── scraper.py       # Web scraping logic
│       └── summarizer.py    # OpenAI GPT-4 integration
├── tests/
│   ├── __init__.py          # Test package initialization
│   ├── test_scraper.py      # Unit tests for scraper
│   └── test_summarizer.py   # Unit tests for summarizer
├── .github/
│   └── workflows/
│       └── ci.yml           # GitHub Actions CI/CD pipeline
├── pyproject.toml           # Project dependencies and config
├── .env.example             # Environment variables template
├── .gitignore               # Git ignore patterns
└── README.md                # This file
```

Current Implementation

The scraper currently targets this specific blog post:

URL: https://www.v2.ai/insights/adopting-AI-assistants-while-balancing-risks

Extracted Data

Title: "Adopting AI Assistants while Balancing Risks"
Author: "Ashley Rodan"
Date: "July 3, 2025"
Content: ~12,785 characters of main content
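
The extracted fields map naturally onto a small record type. The sketch below is an assumption about the shape of each post: the `title` and `content` keys appear in this README's test commands, while `author`, `date`, and `url` are guessed key names and `BlogPost` is a hypothetical type.

```python
from typing import TypedDict


class BlogPost(TypedDict):
    title: str
    author: str   # assumed key name
    date: str     # assumed key name; kept as the string shown on the page
    content: str
    url: str      # assumed key name


example: BlogPost = {
    "title": "Adopting AI Assistants while Balancing Risks",
    "author": "Ashley Rodan",
    "date": "July 3, 2025",
    "content": "...",  # placeholder; the real post body is about 12,785 characters
    "url": "https://www.v2.ai/insights/adopting-AI-assistants-while-balancing-risks",
}

print(f"{example['title']} by {example['author']} ({example['date']})")
```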

Development

Adding More Blog Posts

To scrape multiple posts or different URLs, modify the fetch_blog_posts() function in scraper.py:

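```
def fetch_blog_posts() -> list[dict]:
    """Scrape every URL in the list and return one dict per post."""
    urls = [
        "https://www.v2.ai/insights/post1",
        "https://www.v2.ai/insights/post2",
        # Add more URLs
    ]
    # fetch_blog_post(url) scrapes a single post (assumed to live in scraper.py)
    return [fetch_blog_post(url) for url in urls]
```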
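
With more than one URL, a single unreachable page would abort the whole list comprehension. One possible refinement, sketched here with a hypothetical `fetch_many` helper and an injected single-post fetcher, is to skip failures and keep the posts that did load:

```python
from typing import Callable


def fetch_many(urls: list[str], fetch_one: Callable[[str], dict]) -> list[dict]:
    """Fetch each URL in order, skipping any that raise."""
    posts = []
    for url in urls:
        try:
            posts.append(fetch_one(url))
        except Exception as exc:  # one bad page should not hide the others
            print(f"Skipping {url}: {exc}")
    return posts


# Demo with a stand-in fetcher; a real one would perform the HTTP request.
def fake_fetch(url: str) -> dict:
    if url.endswith("/broken"):
        raise ValueError("page not found")
    return {"title": url.rsplit("/", 1)[-1], "url": url}


results = fetch_many(
    ["https://www.v2.ai/insights/post1", "https://www.v2.ai/insights/broken"],
    fake_fetch,
)
print(len(results), "post(s) fetched")
```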

Improving Content Extraction

The scraper uses multiple fallback strategies for extracting content. You can enhance it by:

Inspecting V2.ai's HTML structure
Adding more specific CSS selectors
Improving date/author extraction patterns
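
As an example of a date extraction pattern, the sketch below finds dates written like "July 3, 2025" (the format shown under Extracted Data) in arbitrary page text. It is an illustration only, not the scraper's current logic, and `extract_date` is a hypothetical name.

```python
import re
from datetime import date, datetime
from typing import Optional

_MONTHS = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)
# Matches e.g. "July 3, 2025": full month name, day, comma, four-digit year.
_DATE_RE = re.compile(rf"\b({_MONTHS}) (\d{{1,2}}), (\d{{4}})\b")


def extract_date(text: str) -> Optional[date]:
    """Return the first 'Month D, YYYY' date found in text, or None."""
    match = _DATE_RE.search(text)
    if match is None:
        return None
    try:
        # %B parses English month names under the default C/English locale.
        return datetime.strptime(match.group(0), "%B %d, %Y").date()
    except ValueError:
        # Pattern matched but the date is impossible, e.g. "February 30, 2025".
        return None


print(extract_date("Published July 3, 2025 by Ashley Rodan"))
```

Other formats (ISO dates, abbreviated months) would need additional patterns tried in order, in the same spirit as the scraper's fallback strategies.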

Troubleshooting

Common Issues

OpenAI API Key Error: Ensure OPENAI_API_KEY is set in your .env file or, when running under Claude Desktop, in the env block of the MCP configuration
Import Errors: Run uv sync to ensure all dependencies are installed
Scraping Issues: Check if the target URL is accessible and the HTML structure hasn't changed

Testing Components

```
# Test scraper only
uv run python -c "from src.v2_ai_mcp.scraper import fetch_blog_posts; posts = fetch_blog_posts(); print(f'Found {len(posts)} posts')"

# Run full test suite
uv run pytest tests/ -v --cov=src

# Test MCP server startup
uv run python -m src.v2_ai_mcp.main
```

Development

Running Tests

```
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html

# Run specific test file
uv run pytest tests/test_scraper.py -v
```

Code Quality

```
# Format code
uv run ruff format src tests

# Lint code
uv run ruff check src tests

# Fix auto-fixable issues
uv run ruff check --fix src tests
```

License

This project is for educational and development purposes.
