PDF Knowledgebase MCP Server

A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides semantic search capabilities powered by OpenAI embeddings and ChromaDB vector storage.

Table of Contents

🚀 Quick Start

Step 1: Install the Server

uvx pdfkb-mcp

Step 2: Configure Your MCP Client

Claude Desktop (Most Common):

Configuration file locations:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}

VS Code (Native MCP) - Create .vscode/mcp.json in workspace:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}

Step 3: Verify Installation

  1. Restart your MCP client completely
  2. Check for PDF KB tools: Look for add_document, search_documents, list_documents, remove_document
  3. Test functionality: Try adding a PDF and searching for content

🏗️ Architecture Overview

MCP Integration

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │    │   MCP Client     │    │   MCP Client    │
│ (Claude Desktop)│    │(VS Code/Continue)│    │   (Other)       │
└─────────┬───────┘    └─────────┬────────┘    └─────────┬───────┘
          │                      │                       │
          └──────────────────────┼───────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │    Model Context        │
                    │    Protocol (MCP)       │
                    │    Standard Layer       │
                    └────────────┬────────────┘
                                 │
          ┌──────────────────────┼───────────────────────┐
          │                      │                       │
┌─────────┴───────┐    ┌─────────┴────────┐    ┌─────────┴───────┐
│ PDF KB Server   │    │  Other MCP       │    │  Other MCP      │
│ (This Server)   │    │  Server          │    │  Server         │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Available Tools & Resources

Tools (Actions your client can perform):

  • add_document - Add a PDF to the knowledgebase
  • search_documents - Semantic search across indexed documents
  • list_documents - List all indexed documents
  • remove_document - Remove a document from the knowledgebase

Resources (Data your client can access):

  • pdf://{document_id} - Full document content as JSON
  • pdf://{document_id}/page/{page_number} - Specific page content
  • pdf://list - List of all documents with metadata

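The pdf:// resource URIs above follow a simple scheme that a client (or a test harness) can route with a small parser. The sketch below is illustrative only; the function and patterns are assumptions, not part of the server's API:

```python
import re

# Hypothetical router for the pdf:// resource URIs listed above.
# Patterns mirror the documented scheme; the names are illustrative.
_PAGE = re.compile(r"^pdf://(?P<doc>[^/]+)/page/(?P<page>\d+)$")
_DOC = re.compile(r"^pdf://(?P<doc>[^/]+)$")

def route(uri: str) -> dict:
    """Classify a pdf:// resource URI into a request descriptor."""
    if uri == "pdf://list":
        return {"kind": "list"}
    m = _PAGE.match(uri)
    if m:
        return {"kind": "page", "doc": m.group("doc"), "page": int(m.group("page"))}
    m = _DOC.match(uri)
    if m:
        return {"kind": "document", "doc": m.group("doc")}
    raise ValueError(f"unrecognized resource URI: {uri}")
```

Note that the bare `pdf://list` form is checked first, since it would otherwise match the document pattern.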
🎯 Parser Selection Guide

Decision Tree

Document Type & Priority?
├── 🏃 Speed Priority → PyMuPDF4LLM (fastest processing, low memory)
├── 📚 Academic Papers → MinerU (fast with GPU, excellent formulas)
├── 📊 Business Reports → Docling (medium speed, best tables)
├── ⚖️ Balanced Quality → Marker (medium speed, good structure)
└── 🎯 Maximum Accuracy → LLM (slow, vision-based API calls)
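The decision tree can be read as a lookup from priority to a PDF_PARSER value. A minimal sketch (the server itself just reads the PDF_PARSER environment variable; this helper is illustrative):

```python
def pick_parser(priority: str) -> str:
    """Map the decision tree above to a PDF_PARSER value.
    Unknown priorities fall back to the default parser, marker."""
    table = {
        "speed": "pymupdf4llm",   # fastest processing, low memory
        "academic": "mineru",     # excellent formulas, fast with GPU
        "tables": "docling",      # best table extraction
        "balanced": "marker",     # good structure, the default
        "accuracy": "llm",        # vision-based API calls, slowest
    }
    return table.get(priority, "marker")
```
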

Performance Comparison

| Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
|--------|------------------|--------|--------------|---------------|----------|
| **PyMuPDF4LLM** | **Fastest** | Low | Good | Basic | Speed priority |
| **MinerU** | Fast (with GPU) | High | Excellent | Excellent | Scientific papers |
| **Docling** | Medium | Medium | Excellent | **Excellent** | Business documents |
| **Marker** | Medium | Medium | Excellent | Good | **Balanced (default)** |
| **LLM** | Slow | Low | Excellent | Excellent | Maximum accuracy |

*Benchmarks from research studies and technical reports*

⚙️ Configuration

Tier 1: Basic Configurations (80% of users)

Default (Recommended):

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "marker"
      },
      "transport": "stdio"
    }
  }
}

Speed Optimized:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "CHUNK_SIZE": "800"
      },
      "transport": "stdio"
    }
  }
}

Memory Efficient:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "EMBEDDING_BATCH_SIZE": "50"
      },
      "transport": "stdio"
    }
  }
}

Tier 2: Use Case Specific (15% of users)

Academic Papers:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "mineru",
        "CHUNK_SIZE": "1200"
      },
      "transport": "stdio"
    }
  }
}

Business Documents:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE",
        "DOCLING_DO_TABLE_STRUCTURE": "true"
      },
      "transport": "stdio"
    }
  }
}

Multi-language Documents:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_OCR_LANGUAGES": "en,fr,de,es",
        "DOCLING_DO_OCR": "true"
      },
      "transport": "stdio"
    }
  }
}

Maximum Quality:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "anthropic/claude-3.5-sonnet",
        "EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}

Essential Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| KNOWLEDGEBASE_PATH | ./pdfs | Directory containing PDF files |
| CACHE_DIR | ./.cache | Cache directory for processing |
| PDF_PARSER | marker | Parser: marker, pymupdf4llm, mineru, docling, llm |
| CHUNK_SIZE | 1000 | Target chunk size for the LangChain chunker |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI embedding model |
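CHUNK_SIZE works together with CHUNK_OVERLAP (listed in the full reference below): consecutive chunks share an overlap so that sentences spanning a boundary are not lost. The sliding-window sketch below illustrates the interaction; the real LangChain chunker is also header-aware, which this simplified version omits:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap, illustrating how CHUNK_SIZE and
    CHUNK_OVERLAP interact. Each chunk repeats the last `overlap`
    characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

For example, with a chunk size of 4 and overlap of 2, "abcdefghij" splits into "abcd", "cdef", "efgh", "ghij".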

🖥️ MCP Client Setup

Claude Desktop

Configuration File Location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

Configuration:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "CACHE_DIR": "/Users/yourname/Documents/PDFs/.cache"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}

Verification:

  1. Restart Claude Desktop completely
  2. Look for PDF KB tools in the interface
  3. Test with "Add a document" or "Search documents"

VS Code with Native MCP Support

Configuration (.vscode/mcp.json in workspace):

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}

Verification:

  1. Reload VS Code window
  2. Check VS Code's MCP server status in Command Palette
  3. Use MCP tools in Copilot Chat

VS Code with Continue Extension

Configuration (.continue/config.json):

{
  "models": [...],
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}

Verification:

  1. Reload VS Code window
  2. Check Continue panel for server connection
  3. Use @pdfkb in Continue chat

Generic MCP Client

Standard Configuration Template:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "required",
        "KNOWLEDGEBASE_PATH": "required-absolute-path",
        "PDF_PARSER": "optional-default-marker"
      },
      "transport": "stdio",
      "autoRestart": true,
      "timeout": 30000
    }
  }
}

📊 Performance & Troubleshooting

Common Issues

Server not appearing in MCP client:

// ❌ Wrong: Missing transport
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"]
    }
  }
}

// ✅ Correct: Include transport and restart client
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "transport": "stdio"
    }
  }
}

Processing too slow:

// Switch to faster parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "pymupdf4llm"
      },
      "transport": "stdio"
    }
  }
}

Memory issues:

// Reduce memory usage
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "EMBEDDING_BATCH_SIZE": "25",
        "CHUNK_SIZE": "500"
      },
      "transport": "stdio"
    }
  }
}

Poor table extraction:

// Use table-optimized parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE"
      },
      "transport": "stdio"
    }
  }
}

Resource Requirements

| Configuration | RAM Usage | Processing Speed | Best For |
|---------------|-----------|------------------|----------|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |

🔧 Advanced Configuration

Parser-Specific Options

MinerU Configuration:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "MINERU_LANG": "en",
        "MINERU_METHOD": "auto",
        "MINERU_VRAM": "16"
      },
      "transport": "stdio"
    }
  }
}

LLM Parser Configuration:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "google/gemini-2.5-flash-lite",
        "LLM_CONCURRENCY": "5",
        "LLM_DPI": "150"
      },
      "transport": "stdio"
    }
  }
}

Performance Tuning

High-Performance Setup:

{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "KNOWLEDGEBASE_PATH": "/Volumes/FastSSD/Documents/PDFs",
        "CACHE_DIR": "/Volumes/FastSSD/Documents/PDFs/.cache",
        "EMBEDDING_BATCH_SIZE": "200",
        "VECTOR_SEARCH_K": "15",
        "FILE_SCAN_INTERVAL": "30"
      },
      "transport": "stdio"
    }
  }
}
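The FILE_SCAN_INTERVAL setting implies the server polls the knowledgebase directory for changes. The sketch below shows what one polling pass and diff could look like; the function names and diff scheme are assumptions for illustration, not the server's internals:

```python
import os

def scan_pdfs(root: str) -> dict[str, float]:
    """One polling pass: snapshot of PDF paths -> modification times."""
    snapshot = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                snapshot[path] = os.path.getmtime(path)
    return snapshot

def diff_scans(old: dict, new: dict) -> tuple[set, set, set]:
    """Added, removed, and modified files between two snapshots."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    modified = {p for p in set(old) & set(new) if old[p] != new[p]}
    return added, removed, modified
```

A lower FILE_SCAN_INTERVAL means new PDFs are picked up sooner, at the cost of more frequent directory walks.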

Intelligent Caching

The server uses multi-stage caching:

Cache Invalidation Rules:

  • Changing PDF_PARSER → Full reset (parsing + chunking + embeddings)
  • Changing PDF_CHUNKER → Partial reset (chunking + embeddings)
  • Changing EMBEDDING_MODEL → Minimal reset (embeddings only)
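The staged invalidation above can be modeled by giving each pipeline stage a fingerprint over only the settings it depends on: changing EMBEDDING_MODEL then alters only the embeddings fingerprint, while changing PDF_PARSER alters all three. This is a sketch of the idea under assumed names, not the server's actual cache code:

```python
import hashlib
import json

def stage_fingerprints(config: dict) -> dict[str, str]:
    """Per-stage cache fingerprints. Each stage hashes only the settings
    it depends on, so downstream settings never invalidate upstream caches."""
    stages = {
        "parsing": ["PDF_PARSER"],
        "chunking": ["PDF_PARSER", "PDF_CHUNKER", "CHUNK_SIZE", "CHUNK_OVERLAP"],
        "embeddings": ["PDF_PARSER", "PDF_CHUNKER", "CHUNK_SIZE",
                       "CHUNK_OVERLAP", "EMBEDDING_MODEL"],
    }
    out = {}
    for stage, keys in stages.items():
        payload = json.dumps({k: config.get(k) for k in keys}, sort_keys=True)
        out[stage] = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return out
```
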

📚 Appendix

Installation Options

Primary (Recommended):

uvx pdfkb-mcp

With Specific Parser Dependencies:

uvx "pdfkb-mcp[marker]"     # Marker parser
uvx "pdfkb-mcp[mineru]"     # MinerU parser
uvx "pdfkb-mcp[docling]"    # Docling parser
uvx "pdfkb-mcp[llm]"        # LLM parser
uvx "pdfkb-mcp[langchain]"  # LangChain chunker

Development Installation:

git clone https://github.com/juanqui/pdfkb-mcp.git
cd pdfkb-mcp
pip install -e ".[dev]"

Complete Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| OPENROUTER_API_KEY | (optional) | Required for the LLM parser |
| KNOWLEDGEBASE_PATH | ./pdfs | PDF directory path |
| CACHE_DIR | ./.cache | Cache directory |
| PDF_PARSER | marker | PDF parser selection |
| PDF_CHUNKER | unstructured | Chunking strategy |
| CHUNK_SIZE | 1000 | LangChain chunk size |
| CHUNK_OVERLAP | 200 | LangChain chunk overlap |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI model |
| EMBEDDING_BATCH_SIZE | 100 | Embedding batch size |
| VECTOR_SEARCH_K | 5 | Default number of search results |
| FILE_SCAN_INTERVAL | 60 | File monitoring interval (seconds) |
| LOG_LEVEL | INFO | Logging level |
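VECTOR_SEARCH_K caps how many chunks a semantic search returns. Conceptually, each query is embedded and compared against stored chunk vectors by cosine similarity; ChromaDB does this with an approximate-nearest-neighbor index, but the brute-force sketch below shows the operation the setting controls (names are illustrative):

```python
import math

def top_k(query: list[float], index: dict[str, list[float]],
          k: int = 5) -> list[tuple[str, float]]:
    """Return the k chunk IDs most cosine-similar to the query vector."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    scored = sorted(((cid, cos(query, vec)) for cid, vec in index.items()),
                    key=lambda t: t[1], reverse=True)
    return scored[:k]
```

Raising VECTOR_SEARCH_K returns more context per query at the cost of larger responses.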

Parser Comparison Details

| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---------|-------------|--------|--------|---------|-----|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |

Chunking Strategies

LangChain (PDF_CHUNKER=langchain):

  • Header-aware splitting with MarkdownHeaderTextSplitter
  • Configurable via CHUNK_SIZE and CHUNK_OVERLAP
  • Best for customizable chunking

Unstructured (PDF_CHUNKER=unstructured):

  • Intelligent semantic chunking with unstructured library
  • Zero configuration required
  • Best for document structure awareness
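To make header-aware splitting concrete, the sketch below keeps each chunk's section headers as metadata, in the spirit of LangChain's MarkdownHeaderTextSplitter. It is a minimal illustration of the idea, not the library's implementation:

```python
import re

def split_by_headers(markdown: str) -> list[dict]:
    """Split markdown at #/##/### headers, attaching the active header
    path to each chunk as metadata."""
    chunks: list[dict] = []
    headers: dict[str, str] = {}
    body: list[str] = []

    def flush() -> None:
        if body:
            chunks.append({"metadata": dict(headers),
                           "text": "\n".join(body).strip()})
            body.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,3})\s+(.*)$", line)
        if m:
            flush()
            level = len(m.group(1))
            headers[f"h{level}"] = m.group(2)
            # a new header invalidates any deeper headers from before
            for deeper in range(level + 1, 4):
                headers.pop(f"h{deeper}", None)
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the header path with each chunk lets search results carry their section context.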

Troubleshooting Guide

API Key Issues:

  1. Verify key format starts with sk-
  2. Check account has sufficient credits
  3. Test connectivity: curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

Parser Installation Issues:

  1. MinerU: pip install "mineru[all]" and verify with mineru --version
  2. Docling: pip install docling for basic features, pip install "pdfkb-mcp[docling-complete]" for all features
  3. LLM: Requires OPENROUTER_API_KEY environment variable

Performance Optimization:

  1. Speed: Use pymupdf4llm parser
  2. Memory: Reduce EMBEDDING_BATCH_SIZE and CHUNK_SIZE
  3. Quality: Use mineru (GPU) or docling (CPU)
  4. Tables: Use docling with DOCLING_TABLE_MODE=ACCURATE
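The memory tip above works because chunks are embedded in batches: EMBEDDING_BATCH_SIZE controls how many chunks go into each API request, trading peak memory against the number of round-trips. A minimal sketch of the batching itself (illustrative, not the server's code):

```python
def batched(items: list, batch_size: int = 100) -> list[list]:
    """Group chunks into embedding-request batches of at most
    batch_size items each, the knob EMBEDDING_BATCH_SIZE controls."""
    if batch_size < 1:
        raise ValueError("batch size must be >= 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```
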

For additional support, see implementation details in src/pdfkb/main.py and src/pdfkb/config.py.
