Document Extractor MCP Server

Extracts and stores documentation content from Microsoft Learn and GitHub URLs into PocketBase with full-text search, metadata preservation, and automatic collection management for easy retrieval and organization.

A Model Context Protocol (MCP) server that extracts document content from Microsoft Learn and GitHub URLs and stores it in PocketBase for easy retrieval and search.

Features

Latest MCP SDK Features (v1.12.0+)

  • Modern McpServer architecture with enhanced capabilities
  • Multiple transport protocols: STDIO, Streamable HTTP, SSE
  • Dynamic tool management with lazy loading
  • Session management for stateful connections
  • Server-Sent Events support with backwards compatibility
  • Real-time server statistics and metrics

Content Extraction

  • Microsoft Learn articles with rich metadata
  • GitHub files (README, documentation, code files)
  • Intelligent content parsing and cleaning
  • Duplicate detection and updates

PocketBase Integration

  • Persistent document storage
  • Full-text search capabilities
  • Metadata preservation
  • CRUD operations

Advanced Server Features

  • Multiple transport modes (STDIO/HTTP)
  • Health check and info endpoints
  • Read-only mode support
  • Enhanced error handling and debugging
  • Resource endpoints for server metrics

Rich Metadata

  • Word counts and content statistics
  • Source attribution and URLs
  • Extraction timestamps
  • Content headers and descriptions
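
As a rough illustration of the metadata listed above, the sketch below assembles a metadata object for an extracted page. The function name and every field name (source_url, word_count, extracted_at, and so on) are assumptions made for this example; the server's actual metadata keys may differ.

```javascript
// Hypothetical helper: field names are illustrative, not the server's real schema.
function buildMetadata({ url, title, description, content, headers = [] }, now = new Date()) {
  // Split on whitespace and drop empty strings so blank content counts as 0 words.
  const words = content.trim().split(/\s+/).filter(Boolean);
  return {
    source_url: url,
    domain: new URL(url).hostname,
    title,
    description,
    headers,
    word_count: words.length,
    character_count: content.length,
    extracted_at: now.toISOString(),
  };
}

// Synthetic sample input; the title, description, and content are placeholders.
const sampleMetadata = buildMetadata(
  {
    url: "https://learn.microsoft.com/en-us/dotnet/core/introduction",
    title: "Sample title",
    description: "Sample description",
    content: "one two three four five",
    headers: ["Sample header"],
  },
  new Date("2025-01-01T00:00:00Z")
);
console.log(sampleMetadata);
```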

Requirements

  • Node.js 18+ with ES modules support
  • PocketBase server running
  • Network access for content extraction

Installation

1. Install Dependencies

# Navigate to the project directory
cd c:\powershell_scripts\pocketbase_document_mcp\document-extractor-mcp

# Install dependencies
npm install

2. PocketBase Setup

The MCP server works with local, remote, and Dockerized PocketBase instances. Choose the setup that best fits your needs:

Option A: Local PocketBase Instance

  1. Download and install PocketBase:

    # Download from https://pocketbase.io/docs/
    # Extract the executable to your preferred directory
    
  2. Start local PocketBase server:

    # Run from the directory containing pocketbase.exe
    .\pocketbase.exe serve
    
    # Or specify custom port and data directory
    .\pocketbase.exe serve --http="127.0.0.1:8090" --dir="./pb_data"
    
  3. Set up admin account:

    • Access PocketBase Admin UI at http://127.0.0.1:8090/_/
    • Create your admin account
    • Note the email/password for configuration

Option B: Remote PocketBase Instance

  1. Deploy PocketBase to your preferred hosting:

    • Railway, Fly.io, DigitalOcean, AWS, etc.
    • Follow your hosting provider's deployment guide
    • Ensure HTTPS is enabled for production
  2. Configure your remote instance:

    • Set up admin account through the web interface
    • Configure CORS settings if needed
    • Note the full URL (e.g., https://your-pb-instance.com)

Option C: Docker PocketBase

  1. Using Docker Compose:

    version: '3.8'
    services:
      pocketbase:
        image: ghcr.io/muchobien/pocketbase:latest
        ports:
          - "8090:8090"
        volumes:
          - ./pb_data:/pb/pb_data
    
  2. Collection Management (Automatic for all setups):

    • When AUTO_CREATE_COLLECTION=true (the default), the server creates the required documents collection on startup and no manual setup is needed
    • Use the ensure_collection tool to manually verify/create collections
    • Use the collection_info tool to check collection status
  3. Manual Collection Setup (if needed):

    • Access PocketBase Admin UI
    • Create a new collection named documents
    • Add these fields:
      title (Text, required)
      content (Text, required)
      metadata (JSON, required)
      created (Date, auto-generated)
      updated (Date, optional)
      

3. Environment Configuration

Create a .env file in the project root, using the block below that matches your PocketBase setup:

For Local PocketBase Instance:

# PocketBase Configuration - Local
POCKETBASE_URL=http://127.0.0.1:8090
POCKETBASE_ADMIN_EMAIL=admin@example.com
POCKETBASE_ADMIN_PASSWORD=your-secure-password

# Collection Settings
DOCUMENTS_COLLECTION=documents

# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000

# Development Settings
DEBUG=true
NODE_ENV=development
READ_ONLY_MODE=false

# Collection Management ✨ New!
AUTO_CREATE_COLLECTION=true

For Remote PocketBase Instance:

# PocketBase Configuration - Remote
POCKETBASE_URL=https://your-pocketbase-instance.com
POCKETBASE_ADMIN_EMAIL=admin@yourdomain.com
POCKETBASE_ADMIN_PASSWORD=your-secure-password

# Collection Settings
DOCUMENTS_COLLECTION=documents

# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000

# Production Settings
DEBUG=false
NODE_ENV=production
READ_ONLY_MODE=false

# Collection Management
AUTO_CREATE_COLLECTION=true

For Dockerized PocketBase:

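# PocketBase Configuration - Docker
# The "pocketbase" hostname resolves only when the MCP server runs on the same Docker network;
# from the host machine, use http://127.0.0.1:8090 instead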
POCKETBASE_URL=http://pocketbase:8090
POCKETBASE_ADMIN_EMAIL=admin@localhost
POCKETBASE_ADMIN_PASSWORD=admin123

# Collection Settings
DOCUMENTS_COLLECTION=documents

# Transport Configuration
TRANSPORT_MODE=stdio
HTTP_PORT=3000

# Container Settings
DEBUG=false
NODE_ENV=production
READ_ONLY_MODE=false

# Collection Management
AUTO_CREATE_COLLECTION=true

Usage

Starting the Server

The server supports multiple transport modes:

# STDIO mode (default) - for Claude Desktop and CLI clients
npm start
# or explicitly
npm run start:stdio

# HTTP mode - for web clients and testing
npm run start:http

# Development modes with debug logging
npm run dev              # STDIO mode with debugging
npm run dev:http         # HTTP mode with debugging
npm run dev:stdio        # STDIO mode with debugging

# Test the setup
npm run test

Transport Modes

STDIO Mode (Default)

Perfect for Claude Desktop and command-line MCP clients:

npm start

HTTP Mode

Enables web-based clients and testing with multiple protocols:

npm run start:http

Available endpoints in HTTP mode:

  • POST /mcp - Streamable HTTP transport (modern protocol 2025-03-26)
  • GET /sse - Server-Sent Events transport (legacy protocol 2024-11-05)
  • POST /messages - SSE message endpoint
  • GET /health - Health check endpoint
  • GET /info - Server information endpoint

Available Tools

1. extract_document

Extract and store content from URLs.

Parameters:

  • url (string, required): Microsoft Learn or GitHub URL

Example:

{
  "url": "https://learn.microsoft.com/en-us/azure/cognitive-services/openai/"
}
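
On the wire, an MCP client wraps these parameters in a JSON-RPC 2.0 tools/call request. The sketch below builds that envelope by hand to show its shape; in practice an MCP client library normally handles initialization, request IDs, and transport details. The helper name is hypothetical.

```javascript
// Counter for JSON-RPC request IDs (any unique number or string works).
let nextRequestId = 1;

// Wraps a tool name and its arguments in an MCP tools/call request.
function buildToolCall(name, args = {}) {
  return {
    jsonrpc: "2.0",
    id: nextRequestId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const extractRequest = buildToolCall("extract_document", {
  url: "https://learn.microsoft.com/en-us/azure/cognitive-services/openai/",
});
console.log(JSON.stringify(extractRequest, null, 2));
```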

2. list_documents

List stored documents with pagination.

Parameters:

  • limit (number, optional): Max results per page (1-100, default: 20)
  • page (number, optional): Page number (default: 1)
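
The documented bounds can be expressed as a small validation step. This is a plain-JavaScript sketch of those rules only. Whether the server rejects out-of-range values (as here) or clamps them is an assumption, and the function name is hypothetical.

```javascript
// Applies the documented defaults (limit 20, page 1) and the 1-100 limit range.
function normalizeListArgs({ limit = 20, page = 1 } = {}) {
  if (!Number.isInteger(limit) || limit < 1 || limit > 100) {
    throw new RangeError("limit must be an integer between 1 and 100");
  }
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be a positive integer");
  }
  return { limit, page };
}

console.log(normalizeListArgs());                       // defaults applied
console.log(normalizeListArgs({ limit: 50, page: 3 })); // explicit values kept
```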

3. search_documents

Search documents by title or content.

Parameters:

  • query (string, required): Search query
  • limit (number, optional): Max results (1-100, default: 50)

4. get_document

Retrieve a specific document by ID.

Parameters:

  • id (string, required): Document ID

5. delete_document

Delete a document by ID.

Parameters:

  • id (string, required): Document ID to delete

6. ensure_collection ✨ New!

Check if the documents collection exists and create it if needed.

Parameters: None

Description: Automatically verifies the documents collection exists in PocketBase. If not found, creates the collection with the proper schema including all required fields and indexes.

7. collection_info ✨ New!

Get detailed information about the documents collection including statistics.

Parameters: None

Description: Returns comprehensive collection information including schema details, record counts, indexes, and timestamps.

Available Resources

1. stats://server

Real-time server statistics and metrics.

Content:

  • Total document count
  • Server information (name, version, uptime)
  • Memory usage statistics
  • Environment information
  • Read-only mode status
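
For orientation, a payload covering the items above could be assembled from Node.js process information as sketched below. The function name, the key names, and the sample values are assumptions for illustration; the actual resource output may be structured differently.

```javascript
// Hypothetical shape for the stats://server payload, built from Node.js process data.
function buildServerStats({ name, version, documentCount, readOnly }) {
  const memory = process.memoryUsage();
  return {
    server: { name, version, uptime_seconds: Math.round(process.uptime()) },
    documents: { total: documentCount },
    memory: { rss_bytes: memory.rss, heap_used_bytes: memory.heapUsed },
    environment: { node: process.version, platform: process.platform },
    read_only_mode: readOnly,
  };
}

// Sample values only; documentCount would come from a PocketBase query.
const sampleStats = buildServerStats({
  name: "document-extractor",
  version: "1.1.0",
  documentCount: 42,
  readOnly: false,
});
console.log(sampleStats);
```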

Dynamic Tool Management

The server supports dynamic tool management with lazy loading:

// Tools can be dynamically enabled/disabled
if (process.env.READ_ONLY_MODE === 'true') {
  // Write operations are disabled in read-only mode
  deleteDocumentTool.disable();
  extractDocumentTool.disable();
}

// A disabled tool can be re-enabled at runtime
// ("tool" stands for any registered tool handle, such as extractDocumentTool)
tool.enable();

Session Management

In HTTP mode, the server supports session management:

  • Streamable HTTP: Modern session management with automatic session ID generation
  • SSE (Legacy): Backwards compatible session handling
  • Session persistence: Sessions are maintained across requests
  • Automatic cleanup: Sessions are cleaned up when connections close
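
In the Streamable HTTP protocol, the server issues a session ID during initialization and the client echoes it back in the Mcp-Session-Id header on later requests. The sketch below shows one way to track sessions and clean them up on close. The class and its method names are hypothetical, not the server's actual implementation.

```javascript
// Minimal in-memory session registry keyed by session ID.
class SessionRegistry {
  constructor() {
    this.sessions = new Map();
  }

  // Store the transport for a newly initialized session.
  open(sessionId, transport) {
    this.sessions.set(sessionId, { transport, openedAt: Date.now() });
    return sessionId;
  }

  // Look up the transport for a follow-up request; undefined if unknown.
  get(sessionId) {
    return this.sessions.get(sessionId)?.transport;
  }

  // Call when the connection closes so stale sessions do not accumulate.
  close(sessionId) {
    return this.sessions.delete(sessionId);
  }

  get size() {
    return this.sessions.size;
  }
}

const registry = new SessionRegistry();
registry.open("session-1", { kind: "streamable-http" }); // placeholder transport object
console.log(registry.size);
```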

Supported Sources

Microsoft Learn

  • Full article extraction
  • Metadata preservation (description, keywords, author)
  • Section headers extraction
  • Content cleaning and formatting

Example URLs:

  • https://learn.microsoft.com/en-us/azure/cognitive-services/openai/
  • https://learn.microsoft.com/en-us/dotnet/core/introduction

GitHub

  • File content extraction (README, docs, code)
  • Repository metadata
  • Branch handling (main/master fallback)
  • File type detection

Supported URL formats:

  • https://github.com/owner/repo (assumes README.md)
  • https://github.com/owner/repo/blob/main/file.md
  • https://raw.githubusercontent.com/owner/repo/main/file.md
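
The three URL formats can be normalized to a raw-content URL before fetching. The sketch below shows one possible mapping; it is not the server's actual code. It assumes "main" as the default branch, so a caller would retry with "master" on a 404 to get the fallback described above. Branch names containing slashes are not handled.

```javascript
// Hypothetical helper: maps a supported GitHub URL to its raw.githubusercontent.com form.
function toRawGitHubUrl(input, defaultBranch = "main") {
  const url = new URL(input);
  const parts = url.pathname.split("/").filter(Boolean);

  // Already a raw URL: return it unchanged.
  if (url.hostname === "raw.githubusercontent.com") return url.toString();

  if (url.hostname !== "github.com" || parts.length < 2) {
    throw new Error(`Unsupported GitHub URL: ${input}`);
  }

  const [owner, repo, kind, branch, ...filePath] = parts;

  // Bare repository URL: assume README.md on the default branch.
  if (kind === undefined) {
    return `https://raw.githubusercontent.com/${owner}/${repo}/${defaultBranch}/README.md`;
  }

  // File URL of the form /owner/repo/blob/<branch>/<path>.
  if (kind === "blob" && branch && filePath.length > 0) {
    return `https://raw.githubusercontent.com/${owner}/${repo}/${branch}/${filePath.join("/")}`;
  }

  throw new Error(`Unsupported GitHub URL: ${input}`);
}

console.log(toRawGitHubUrl("https://github.com/owner/repo"));
console.log(toRawGitHubUrl("https://github.com/owner/repo/blob/main/file.md"));
```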

Configuration Options

Environment Variables

| Variable | Description | Default |
|---|---|---|
| POCKETBASE_URL | PocketBase server URL | http://127.0.0.1:8090 |
| POCKETBASE_ADMIN_EMAIL | Admin email for authentication | Required |
| POCKETBASE_ADMIN_PASSWORD | Admin password | Required |
| DOCUMENTS_COLLECTION | Collection name for documents | documents |
| DEBUG | Enable debug logging | false |
| NODE_ENV | Environment mode | development |
| READ_ONLY_MODE | Disable write operations | false |
| AUTO_CREATE_COLLECTION | Auto-create collections on startup | true |
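
The defaults above can be applied with a small loader like the sketch below. The function names and the returned property names are hypothetical. Failing fast on missing credentials is an assumption of this sketch; a server that defers authentication until a tool is invoked would skip that check.

```javascript
// Interprets "true"/"false" strings from the environment, falling back when unset.
function parseBoolean(value, fallback) {
  if (value === undefined || value === "") return fallback;
  return value.toLowerCase() === "true";
}

// Reads the documented variables and applies the documented defaults.
function loadConfig(env = process.env) {
  const required = ["POCKETBASE_ADMIN_EMAIL", "POCKETBASE_ADMIN_PASSWORD"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    pocketbaseUrl: env.POCKETBASE_URL || "http://127.0.0.1:8090",
    adminEmail: env.POCKETBASE_ADMIN_EMAIL,
    adminPassword: env.POCKETBASE_ADMIN_PASSWORD,
    collection: env.DOCUMENTS_COLLECTION || "documents",
    debug: parseBoolean(env.DEBUG, false),
    nodeEnv: env.NODE_ENV || "development",
    readOnly: parseBoolean(env.READ_ONLY_MODE, false),
    autoCreateCollection: parseBoolean(env.AUTO_CREATE_COLLECTION, true),
  };
}

// Example with only the required variables set (placeholder credentials).
const exampleConfig = loadConfig({
  POCKETBASE_ADMIN_EMAIL: "admin@example.com",
  POCKETBASE_ADMIN_PASSWORD: "your-secure-password",
});
console.log(exampleConfig.pocketbaseUrl, exampleConfig.collection);
```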

Debug Mode

Enable detailed logging:

$env:DEBUG="true"; node server.js

Debug logs include:

  • Authentication status
  • Content extraction details
  • Database operations
  • Error context

Error Handling

The server implements comprehensive error handling:

  • Network errors: Timeout and connection issues
  • Authentication errors: PocketBase connection problems
  • Validation errors: Invalid input parameters
  • Content errors: Extraction failures
  • Database errors: Storage and retrieval issues

All errors are returned as structured MCP responses with appropriate error codes.
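
For tool calls, MCP reports failures inside the tool result by setting isError to true alongside a text content item. The sketch below shows that result shape. The helper name and the category labels passed to it are hypothetical, and the server's own messages and error codes may differ.

```javascript
// Converts a caught error into an MCP tool result flagged as an error.
function toToolError(category, error) {
  const message = error instanceof Error ? error.message : String(error);
  return {
    isError: true,
    content: [{ type: "text", text: `${category} error: ${message}` }],
  };
}

// Example: a failed extraction reported back to the client.
const errorResult = toToolError("Network", new Error("request timed out"));
console.log(JSON.stringify(errorResult, null, 2));
```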

Development

Scripts

# Start in development mode
npm run dev

# Start in production mode
npm start

# Install dependencies
npm run install-deps

Testing the Server

# Test basic functionality
$env:DEBUG="true"; node server.js

# In another terminal, you can test with MCP tools or:
# Use Claude Desktop with MCP configuration
# Use other MCP-compatible clients

Troubleshooting

Common Issues

  1. Authentication Failed

    • Verify PocketBase is running: http://127.0.0.1:8090
    • Check admin credentials in .env
    • Ensure admin user exists in PocketBase
  2. Content Extraction Errors

    • Check network connectivity
    • Verify URL accessibility
    • Review debug logs for details
  3. Collection Not Found

    • Use the ensure_collection tool to automatically create the collection
    • Check collection name in environment variables
    • Verify AUTO_CREATE_COLLECTION is enabled
    • Check collection permissions
  4. Module Import Errors

    • Ensure "type": "module" in package.json
    • Use Node.js 18+ with ES modules support
    • Check all dependencies are installed

Debug Information

Enable debug mode to see detailed logs:

$env:DEBUG="true"; node server.js

PocketBase Collection Schema

If you need to recreate the collection, use this schema:

{
  "name": "documents",
  "type": "base",
  "schema": [
    {
      "name": "title",
      "type": "text",
      "required": true,
      "options": {
        "max": 255
      }
    },
    {
      "name": "content",
      "type": "text",
      "required": true
    },
    {
      "name": "metadata",
      "type": "json",
      "required": true
    },
    {
      "name": "created",
      "type": "date",
      "required": false
    },
    {
      "name": "updated",
      "type": "date",
      "required": false
    }
  ]
}

MCP Client Configuration

Claude Desktop Configuration

Add this to your Claude Desktop MCP settings:

{
  "mcpServers": {
    "document-extractor": {
      "command": "node",
      "args": ["c:\\powershell_scripts\\pocketbase_document_mcp\\document-extractor-mcp\\server.js"],
      "env": {
        "POCKETBASE_URL": "http://127.0.0.1:8090",
        "POCKETBASE_ADMIN_EMAIL": "your-admin@example.com",
        "POCKETBASE_ADMIN_PASSWORD": "your-password",
        "DEBUG": "false"
      }
    }
  }
}

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Changelog

v1.1.0 ✨ Latest Update

  • Latest MCP SDK v1.13.1+: Upgraded to the newest Model Context Protocol SDK
  • Latest PocketBase SDK v0.26.1+: Updated to the latest PocketBase features
  • Collection Management Tools: Added ensure_collection and collection_info tools
  • Auto-Collection Creation: Automatic database schema setup on startup
  • Enhanced Lazy Loading: Improved dynamic tool management
  • Latest SSE Features: Modern Server-Sent Events implementation
  • Improved Error Handling: Better collection management error recovery
  • Enhanced Documentation: Comprehensive usage examples and troubleshooting

v1.0.0

  • Updated to latest Anthropic MCP SDK
  • Added comprehensive error handling
  • Implemented input validation with Zod
  • Enhanced metadata extraction
  • Added debug logging
  • Improved documentation
  • Added PocketBase integration
  • Support for Microsoft Learn and GitHub

Deployment

Smithery Deployment

This MCP server supports deployment on Smithery, a platform for hosting MCP servers.

TypeScript Deploy (Recommended)

The fastest way to deploy this server on Smithery:

  1. Fork or Clone this repository to your GitHub account
  2. Connect GitHub to Smithery (or claim your server if already listed)
  3. Navigate to the Deployments tab on your server page
  4. Click Deploy - Smithery will automatically build and host your server

The smithery.yaml file is already configured for TypeScript/Node.js deployment.

Note: Despite the name "TypeScript Deploy", this method also works for plain Node.js projects that use ES modules.

Custom Deploy (Docker)

For advanced deployment with full Docker control:

  1. Replace smithery.yaml with the container configuration:
    cp smithery-container.yaml smithery.yaml
    
  2. Push to GitHub with the updated configuration
  3. Deploy via Smithery's Deployments tab

The Dockerfile is optimized for production deployment with security best practices.

Configuration

When deploying on Smithery, you'll configure:

  • PocketBase URL: Your PocketBase instance URL
  • Admin Credentials: Email and password for PocketBase admin
  • Collection Settings: Default collection name and auto-creation
  • Debug Mode: Enable detailed logging (optional)

Best Practices for Smithery

  • Tool Discovery: Tools can be listed without authentication so that clients can discover them
  • Lazy Authentication: PocketBase credentials are validated only when a tool is invoked
  • Environment Variables: Configuration is handled via Smithery's config schema
  • Health Checks: Built-in health monitoring at /health endpoint
