PDF RAG MCP Server
Enables intelligent search and question-answering over PDF documents using semantic similarity and keyword search. Supports OCR for scanned PDFs, persistent vector storage with ChromaDB, and maintains source tracking with page numbers.
A Model Context Protocol (MCP) server that provides powerful RAG (Retrieval-Augmented Generation) capabilities for PDF documents. This server uses ChromaDB for vector storage, sentence-transformers for embeddings, and semantic chunking for intelligent text segmentation.
Features
- ✅ Semantic Chunking: Intelligently groups sentences together instead of splitting at arbitrary character limits
- ✅ Vector Search: Find semantically similar content using embeddings
- ✅ Keyword Search: Traditional keyword-based search for exact terms
- ✅ OCR Support: Automatic detection and OCR processing for scanned/image-based PDFs
- ✅ Source Tracking: Maintains document names and page numbers for all chunks
- ✅ Add/Remove PDFs: Easily manage your document collection
- ✅ Persistent Storage: ChromaDB persists your embeddings to disk
- ✅ Multiple Output Formats: Get results in Markdown or JSON format
- ✅ Progress Reporting: Real-time feedback during long operations
Architecture
- Embedding Model: multi-qa-mpnet-base-dot-v1 (optimized for question answering)
- Vector Database: ChromaDB with cosine similarity
- Chunking Strategy: Semantic chunking with configurable sentence grouping and overlap
- PDF Extraction: PyMuPDF for text extraction with OCR fallback for scanned PDFs
Installation
1. Install Python Dependencies
pip install -r requirements.txt
2. Download NLTK Data (Automatic)
The server automatically downloads required NLTK punkt tokenizer data on first run.
3. Install Tesseract (Optional - for OCR)
For scanned PDF support, install Tesseract:
- macOS: brew install tesseract
- Ubuntu/Debian: sudo apt-get install tesseract-ocr
- Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
The server automatically detects scanned pages and uses OCR when Tesseract is available.
4. Test the Server
python pdf_rag_mcp.py --help
Configuration
Database Location
The server stores its ChromaDB database in a configurable location. You can specify the database path using the --db-path command line argument:
# Use default location (~/.dotfiles/files/mcps/pdfrag/chroma_db)
python pdf_rag_mcp.py
# Use custom database location
python pdf_rag_mcp.py --db-path /path/to/your/database
Chunking Parameters
Default chunking settings:
- Chunk Size: 3 sentences per chunk
- Overlap: 1 sentence overlap between chunks
These can be customized when adding PDFs:
{
"pdf_path": "/path/to/document.pdf",
"chunk_size": 5, # Use 5 sentences per chunk
"overlap": 2 # 2 sentences overlap
}
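The effect of chunk_size and overlap can be pictured with a short sketch. This is illustrative only and may differ from the server's actual semantic_chunking() implementation; the function name and toy sentences below are made up for the example:

```python
def chunk_sentences(sentences, chunk_size=3, overlap=1):
    """Group sentences into overlapping chunks (illustrative sketch).

    Each chunk holds `chunk_size` sentences and shares `overlap`
    sentences with the previous chunk, so context carries across
    chunk boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break
    return chunks

sentences = [f"Sentence {i}." for i in range(1, 8)]  # 7 toy sentences
print(chunk_sentences(sentences, chunk_size=3, overlap=1))
# three chunks, each sharing one sentence with its neighbor
```

With the defaults (chunk_size=3, overlap=1), consecutive chunks always share one sentence, which helps a search hit near a chunk boundary still carry its surrounding context.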
Character Limit
Responses are limited to 25,000 characters by default. If exceeded, results are automatically truncated with a warning message.
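The truncation behavior can be sketched as follows (a sketch, not the server's exact code; the function name and warning text are illustrative):

```python
MAX_CHARS = 25_000  # default response limit

def truncate_response(text: str, limit: int = MAX_CHARS) -> str:
    """Cut overly long responses and append a warning (illustrative sketch)."""
    if len(text) <= limit:
        return text
    warning = f"\n\n[WARNING: response truncated at {limit} characters]"
    # Trim enough of the body that the warning still fits within the limit
    return text[:limit - len(warning)] + warning

print(len(truncate_response("x" * 30_000)))  # → 25000
```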
MCP Tools
1. pdf_add
Add a PDF document to the RAG database.
Input:
{
"pdf_path": "/absolute/path/to/document.pdf",
"chunk_size": 3, // optional, default: 3
"overlap": 1 // optional, default: 1
}
Output:
{
"status": "success",
"message": "Successfully added 'document.pdf' to the database",
"document_id": "a1b2c3d4...",
"filename": "document.pdf",
"pages": 15,
"chunks": 127,
"chunk_size": 3,
"overlap": 1
}
Example Use Cases:
- Adding research papers for reference
- Indexing documentation
- Building a searchable knowledge base
2. pdf_remove
Remove a PDF document from the database.
Input:
{
"document_id": "a1b2c3d4..." // Get from pdf_list
}
Output:
{
"status": "success",
"message": "Successfully removed 'document.pdf' from the database",
"document_id": "a1b2c3d4...",
"removed_chunks": 127
}
3. pdf_list
List all PDF documents in the database.
Input:
{
"response_format": "markdown" // or "json"
}
Output (Markdown):
# PDF Documents (2 total)
## research_paper.pdf
**Document ID:** a1b2c3d4...
**Chunks:** 127
**Added:** N/A
## documentation.pdf
**Document ID:** e5f6g7h8...
**Chunks:** 89
**Added:** N/A
Output (JSON):
{
"count": 2,
"documents": [
{
"document_id": "a1b2c3d4...",
"filename": "research_paper.pdf",
"chunk_count": 127
},
{
"document_id": "e5f6g7h8...",
"filename": "documentation.pdf",
"chunk_count": 89
}
]
}
4. pdf_search_similarity
Search using semantic similarity (vector search).
Input:
{
"query": "machine learning techniques for text classification",
"top_k": 5, // optional, default: 5
"document_filter": null, // optional, search specific doc
"response_format": "markdown" // optional, default: markdown
}
Output (Markdown):
# Search Results for: 'machine learning techniques for text classification'
Found 5 relevant chunks:
## Result 1
**Document:** research_paper.pdf
**Page:** 7
**Similarity Score:** 0.8754
**Content:**
Machine learning approaches to text classification have evolved significantly...
---
Use Cases:
- Finding relevant information without exact keywords
- Discovering related concepts
- Question answering over documents
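Conceptually, similarity search embeds the query and ranks stored chunks by cosine similarity. A minimal sketch with toy 3-dimensional vectors (the server actually uses 768-dimensional multi-qa-mpnet-base-dot-v1 embeddings and ChromaDB's index; the chunk labels below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" standing in for stored chunk vectors
chunks = {
    "chunk about ML": [0.9, 0.1, 0.0],
    "chunk about cooking": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.1]  # embedding of the user's query

ranked = sorted(chunks, key=lambda c: cosine_similarity(query, chunks[c]), reverse=True)
print(ranked[0])  # → chunk about ML
```

This is why semantic search finds relevant passages even when the query shares no exact words with the text: nearby vectors, not matching strings, drive the ranking.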
5. pdf_search_keywords
Search using keyword matching.
Input:
{
"keywords": "neural network backpropagation",
"top_k": 5, // optional, default: 5
"document_filter": null, // optional
"response_format": "markdown" // optional, default: markdown
}
Output:
Similar to pdf_search_similarity, but ranked by keyword occurrence count.
Use Cases:
- Finding specific technical terms
- Locating exact phrases or terminology
- Verifying presence of keywords in documents
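Ranking by keyword occurrence count can be sketched in a few lines (illustrative only; the server's tokenization and scoring may differ, and the sample chunks are invented):

```python
import re
from collections import Counter

def keyword_score(chunk: str, keywords: str) -> int:
    """Count case-insensitive occurrences of any keyword in the chunk."""
    words = Counter(re.findall(r"\w+", chunk.lower()))
    return sum(words[kw] for kw in keywords.lower().split())

chunks = [
    "Backpropagation trains the neural network by propagating errors.",
    "The dataset was collected from public sources.",
]
query = "neural network backpropagation"
ranked = sorted(chunks, key=lambda c: keyword_score(c, query), reverse=True)
print(ranked[0])  # the backpropagation chunk ranks first
```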
Usage with Claude Desktop
1. Add to Claude Desktop Configuration
Edit your Claude Desktop config file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Add the server:
{
"mcpServers": {
"pdf-rag": {
"command": "python",
"args": [
"/absolute/path/to/pdf_rag_mcp.py"
]
}
}
}
Custom Database Location:
To use a custom database path with Claude Desktop:
{
"mcpServers": {
"pdf-rag": {
"command": "python",
"args": [
"/absolute/path/to/pdf_rag_mcp.py",
"--db-path",
"/custom/path/to/database"
]
}
}
}
2. Restart Claude Desktop
After adding the configuration, restart Claude Desktop to load the MCP server.
3. Test the Connection
In Claude Desktop, try:
Can you list the PDFs in the RAG database?
Claude will use the pdf_list tool to show available documents.
Example Workflows
Building a Research Database
1. Add documents:
"Add these PDFs to the database: /research/paper1.pdf, /research/paper2.pdf"
2. Search for concepts:
"Search for information about 'gradient descent optimization' in the database"
3. Find specific terms:
"Search for the keyword 'convolutional neural network' and show me the pages"
Document Q&A
1. Add documentation:
"Add this user manual: /docs/product_manual.pdf"
2. Ask questions:
"How do I configure the network settings according to the manual?"
3. Find references:
"Which page discusses troubleshooting connection errors?"
Knowledge Base Management
1. List documents:
"Show me all documents in the RAG database"
2. Remove outdated docs:
"Remove the document with ID a1b2c3d4..."
3. Search across all:
"Search all documents for information about API authentication"
Advanced Configuration
Custom Chunk Sizes
For different document types:
Technical Documents (code, APIs):
- Smaller chunks (2-3 sentences)
- Minimal overlap (0-1 sentences)
- Preserves code structure
Narrative Documents (articles, books):
- Larger chunks (5-7 sentences)
- More overlap (2-3 sentences)
- Maintains context flow
Scientific Papers:
- Medium chunks (3-5 sentences)
- Moderate overlap (1-2 sentences)
- Balances detail and context
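The trade-off above can be quantified: with n sentences, chunk size c, and overlap o, each new chunk advances by c − o sentences, so the chunk count is roughly ⌈(n − o) / (c − o)⌉. A quick sketch of this estimate (an approximation, not the server's exact count):

```python
import math

def estimate_chunks(n_sentences: int, chunk_size: int, overlap: int) -> int:
    """Approximate chunk count for sentence-based chunking with overlap."""
    step = chunk_size - overlap
    return max(1, math.ceil((n_sentences - overlap) / step))

# A ~100-sentence document under three chunking profiles
for c, o in [(2, 0), (5, 2), (3, 1)]:
    print(f"chunk_size={c}, overlap={o}: ~{estimate_chunks(100, c, o)} chunks")
```

Smaller chunks with overlap roughly double the chunk (and embedding) count versus large non-overlapping chunks, which trades storage and indexing time for finer-grained retrieval.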
Document Filtering
Search within specific documents:
{
"query": "data preprocessing",
"document_filter": "a1b2c3d4..." // Only search this doc
}
Output Format Selection
Choose format based on use case:
- Markdown: Best for human reading and Claude's analysis
- JSON: Best for programmatic processing and data extraction
Troubleshooting
"File not found" Error
Ensure you're using absolute paths:
"/home/user/documents/paper.pdf" ✅
"~/documents/paper.pdf" ❌ (needs expansion)
"./paper.pdf" ❌ (relative path)
Empty PDF Results / Scanned PDFs
The server automatically detects and processes scanned PDFs using OCR. If you get an error about no text being extracted:
1. Install Tesseract (if not already installed):
   - macOS: brew install tesseract
   - Ubuntu/Debian: sudo apt-get install tesseract-ocr
   - Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
2. Retry adding the PDF: the server will automatically use OCR for pages with minimal text
The error message will indicate if OCR is needed: "ensure tesseract is installed for scanned PDFs"
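The scanned-page detection can be pictured as a simple heuristic: if direct text extraction yields almost nothing, the page is likely an image and gets routed through OCR. A sketch of that check (the function name and threshold are assumptions, not the server's actual values):

```python
def needs_ocr(extracted_text: str, min_chars: int = 25) -> bool:
    """Heuristic: a page with almost no extractable text is likely scanned.

    Sketch only -- the server's actual threshold and logic may differ.
    """
    return len(extracted_text.strip()) < min_chars

assert needs_ocr("")        # image-only page: no extractable text
assert needs_ocr("  \n  ")  # whitespace-only extraction also triggers OCR
assert not needs_ocr("A full paragraph of real extracted text from the PDF.")
```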
Out of Memory
If processing large PDFs causes memory issues:
- Reduce chunk_size to create more, smaller chunks
- Process documents one at a time
- Increase system swap space
ChromaDB Errors
If ChromaDB complains about existing collections:
# Remove the database directory
rm -rf ./chroma_db
# Restart the server
Performance Considerations
Embedding Generation
The first time you add a document, the model will be downloaded (~400MB). Subsequent operations are faster.
Typical Times:
- 10-page PDF: ~5-10 seconds
- 100-page PDF: ~30-60 seconds
- 1000-page PDF: ~5-10 minutes
Search Performance
- Similarity Search: Fast (< 1 second for most queries)
- Keyword Search: Slower for large collections (scales with document count)
Storage
- Embeddings: ~1.5KB per chunk (768-dimensional vectors)
- Text Storage: Depends on chunk size
- Example: 1000 chunks ≈ 1.5MB in ChromaDB
Best Practices
1. Organize Documents
Use descriptive filenames:
research_ml_2024.pdf ✅
document (1).pdf ❌
2. Test Chunk Sizes
Different documents benefit from different chunking:
# Try multiple chunk sizes for the same document
pdf_add(path="doc.pdf", chunk_size=3, overlap=1) # Test 1
pdf_remove(document_id="...") # Remove
pdf_add(path="doc.pdf", chunk_size=5, overlap=2) # Test 2
3. Use Document Filters
When searching specific documents:
# More focused, faster results
pdf_search_similarity(
query="...",
document_filter="specific_doc_id"
)
4. Combine Search Types
Use both search methods for comprehensive results:
- Semantic search for concepts
- Keyword search for exact terms
Security Notes
- File Access: Server can read any PDF the Python process can access
- Storage: Embeddings and text stored unencrypted in ChromaDB
- No Authentication: MCP servers trust the client (Claude Desktop)
For production use:
- Restrict file system permissions
- Use dedicated database directories
- Consider encryption for sensitive documents
Contributing
To extend this server:
- Add New Tools: Follow the @mcp.tool() decorator pattern
- Custom Chunking: Implement in the semantic_chunking() function
- Additional Embeddings: Swap models in the initialization
- Metadata: Extend the metadatas dict in pdf_add()
License
MIT License - See LICENSE file for details
Acknowledgments
- Anthropic: MCP Protocol and SDK
- ChromaDB: Vector database
- Sentence Transformers: Embedding models
- PyMuPDF: PDF text extraction and OCR support
Support
For issues or questions:
- Check the troubleshooting section
- Review MCP documentation: https://modelcontextprotocol.io
- Check ChromaDB docs: https://docs.trychroma.com
Built with ❤️ using the Model Context Protocol