MCP Server Knowledge Engine
Transforms PDF collections into a searchable knowledge base using TF-IDF indexing and proximity matching. It enables users to search documents, retrieve specific page content, and manage document libraries through natural language via MCP clients.
A Model Context Protocol (MCP) server that turns any collection of PDF documents into a searchable knowledge base, accessible through Claude Desktop and other MCP clients. Search combines TF-IDF scoring, proximity matching, and configurable domain-specific keywords.
🌟 Key Features
- 🔍 Advanced Search Engine: TF-IDF-based inverted index with proximity matching for highly relevant results
- 📄 Universal PDF Support: Process any PDF collection - technical docs, legal papers, research, and more
- ⚡ High Performance: Cached search index, incremental processing, and background initialization
- 🎯 Domain Optimization: Configure domain-specific keywords for enhanced search accuracy
- ⚙️ Fully Configurable: JSON-based configuration with environment variable support
- 🛠️ Comprehensive CLI: Complete server management through intuitive commands
- 🔗 Seamless MCP Integration: Ready-to-use with Claude Desktop, VS Code, and other MCP clients
- 📊 Smart Caching: MD5 hash-based change detection for efficient updates
📋 Quick Start
Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
- Claude Desktop app (for MCP integration)
1. Installation
# Clone the repository
git clone https://github.com/lhstorm/mcp_server_knowledge_engine.git
cd mcp_server_knowledge_engine
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
2. Create Your Server
# Interactive setup
python manage_server.py create-config
# This will ask you for:
# - Server name (e.g., 'legal-docs-server')
# - Display name (e.g., 'Legal Documents Server')
# - PDF folder location
# - Domain-specific keywords
3. Add PDF Documents
# Add individual PDFs
python manage_server.py add-pdf /path/to/document.pdf
python manage_server.py add-pdf /path/to/another-doc.pdf
# Or copy PDFs directly to your configured folder
4. Process Documents
# Convert PDFs to searchable format
python manage_server.py process-pdfs
5. Generate MCP Configuration
# Generate configuration for Claude Desktop
python generate_mcp_config.py --merge
# Or get the config to copy manually
python generate_mcp_config.py
6. Start Using with Claude
Restart Claude Desktop and your server will appear in the MCP tools menu!
💬 Using with Claude Desktop
Once configured, you can interact with your PDFs naturally:
Example prompts:
- "Search for information about [topic] in the documentation"
- "What does the documentation say about [specific feature]?"
- "Find all references to [keyword] across all PDFs"
- "Show me the content of [document name]"
- "List all available documents"
Advanced usage:
- "Search for [term1] near [term2]" - Leverages proximity matching
- "Get page 15 of [document]" - Retrieves specific pages
- "Find the top 10 results for [query]" - Adjusts result count
📁 Project Structure
mcp_server_knowledge_engine/
├── server.py # Main MCP server with search engine
├── config.py # Configuration management & validation
├── manage_server.py # CLI for server management
├── generate_mcp_config.py # MCP configuration generator
├── convert_pdfs.py # Standalone PDF conversion utility
├── server_config.json # Active server configuration
├── requirements.txt # Python dependencies
├── examples/ # Example configurations
│ ├── legal_docs_config.json
│ ├── medical_docs_config.json
│ ├── research_papers_config.json
│ └── tech_docs_config.json
└── your-pdfs/ # Your PDF folder (configurable)
├── document1.pdf
├── document2.pdf
└── markdown/ # Auto-generated cache
├── .pdf_cache.json # Processing metadata
├── .search_index.pkl # Cached search index
├── document1.md # Converted documents
└── document2.md
⚙️ Configuration
The server is configured via server_config.json:
{
"server": {
"name": "my-docs-server",
"display_name": "My Documents Server",
"description": "Search through my PDF collection",
"version": "1.0.0"
},
"storage": {
"pdf_folder": "./docs",
"markdown_folder": "./docs/markdown",
"domain_keywords": ["keyword1", "keyword2", "domain-term"]
},
"tools": {
"search": {
"name": "search_docs",
"description": "Search through PDF documentation"
},
"list": {
"name": "list_docs",
"description": "List all available documents"
},
"content": {
"name": "get_document_content",
"description": "Get full content from documents"
},
"max_results_default": 5
},
"processing": {
"cache_enabled": true,
"parallel_processing": true,
"max_file_size_mb": 50,
"context_size": 500
}
}
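As a rough illustration of how a file like this can be loaded and checked, here is a minimal sketch using dataclasses. It is not the project's actual config.py: the class names, the parse_config and load_config helpers, and the KNOWLEDGE_ENGINE_CONFIG environment variable are all hypothetical, chosen only to mirror the sections shown above.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StorageConfig:
    pdf_folder: str
    markdown_folder: str
    domain_keywords: List[str] = field(default_factory=list)


@dataclass
class ProcessingConfig:
    cache_enabled: bool = True
    parallel_processing: bool = True
    max_file_size_mb: int = 50
    context_size: int = 500


def parse_config(raw: dict) -> Tuple[StorageConfig, ProcessingConfig]:
    """Check that the four top-level sections exist and build typed objects."""
    for section in ("server", "storage", "tools", "processing"):
        if section not in raw:
            raise ValueError(f"missing config section: {section}")
    # Unknown keys raise TypeError here, which catches typos in the JSON file.
    storage = StorageConfig(**raw["storage"])
    processing = ProcessingConfig(**raw["processing"])
    if processing.max_file_size_mb <= 0:
        raise ValueError("max_file_size_mb must be positive")
    return storage, processing


def load_config(path: str = "server_config.json") -> Tuple[StorageConfig, ProcessingConfig]:
    # KNOWLEDGE_ENGINE_CONFIG is a hypothetical variable name, used for illustration only.
    path = os.environ.get("KNOWLEDGE_ENGINE_CONFIG", path)
    with open(path, encoding="utf-8") as fh:
        return parse_config(json.load(fh))
```

Calling load_config() with no arguments reads server_config.json from the current directory unless the (assumed) environment variable points elsewhere.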
🛠️ Management Commands
Server Management
# Create new configuration
python manage_server.py create-config
# Test configuration
python manage_server.py test
# Generate MCP config
python manage_server.py generate-mcp-config
PDF Management
# List all PDFs
python manage_server.py list-pdfs
# Add PDF
python manage_server.py add-pdf document.pdf
# Remove PDF
python manage_server.py remove-pdf document.pdf
# Process all PDFs
python manage_server.py process-pdfs
MCP Configuration
# Print MCP config
python generate_mcp_config.py
# Automatically merge with Claude Desktop config
python generate_mcp_config.py --merge
# Save to file
python generate_mcp_config.py --output my_mcp_config.json
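For orientation, Claude Desktop reads its servers from an mcpServers object in its JSON configuration, where each entry names a command and its arguments. The sketch below shows one way a merge step could work; it is an assumption about what --merge does, not the script's actual code. The build_entry and merge_entry helpers and the KNOWLEDGE_ENGINE_CONFIG variable are hypothetical.

```python
import sys
from pathlib import Path


def build_entry(server_script: str, config_path: str) -> dict:
    """Build one mcpServers entry that launches the server with the current interpreter."""
    return {
        "command": sys.executable,
        "args": [str(Path(server_script).resolve())],
        # Hypothetical variable name; the real project may pass its config differently.
        "env": {"KNOWLEDGE_ENGINE_CONFIG": str(Path(config_path).resolve())},
    }


def merge_entry(client_config: dict, name: str, entry: dict) -> dict:
    """Return a copy of the client config with the entry added under mcpServers.

    Existing servers and unrelated settings are preserved; an entry with the
    same name is overwritten.
    """
    merged = dict(client_config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged
```

Using absolute paths for both the interpreter and the script avoids the common failure where the client launches a different Python than the one holding the installed dependencies.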
💡 Usage Examples
Legal Documents Server
{
"server": {
"name": "legal-docs-server",
"display_name": "Legal Documents Server"
},
"storage": {
"domain_keywords": ["contract", "liability", "jurisdiction", "plaintiff", "defendant"]
}
}
Technical Documentation Server
{
"server": {
"name": "tech-docs-server",
"display_name": "Technical Documentation Server"
},
"storage": {
"domain_keywords": ["API", "function", "class", "method", "parameter", "return"]
}
}
Research Papers Server
{
"server": {
"name": "research-server",
"display_name": "Research Papers Server"
},
"storage": {
"domain_keywords": ["hypothesis", "methodology", "results", "conclusion", "analysis"]
}
}
🔧 Available MCP Tools
Each server provides three configurable tools:
- Search Tool (default: search_docs)
  - Intelligent search through all documents
  - TF-IDF scoring with proximity matching
  - Returns relevant excerpts with context
- List Tool (default: list_docs)
  - Lists all available documents
  - Shows document metadata and page counts
- Content Tool (default: get_document_content)
  - Retrieves full document content
  - Can fetch specific pages
  - Includes complete markdown formatting
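To make the link between the tools section of server_config.json and what an MCP client sees more concrete, here is a small sketch that turns that section into tool descriptors with a name, a description, and a JSON Schema for the input. The descriptor shape follows the MCP tool listing format; the parameter names (query, max_results, document, page) are assumptions for illustration and may differ from what server.py accepts.

```python
def build_tool_descriptors(tools_cfg: dict) -> list:
    """Turn the "tools" config section into MCP-style tool descriptors."""
    default_limit = tools_cfg.get("max_results_default", 5)
    # Parameter names below are assumed, not taken from the project.
    schemas = {
        "search": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": default_limit},
            },
            "required": ["query"],
        },
        "list": {"type": "object", "properties": {}},
        "content": {
            "type": "object",
            "properties": {
                "document": {"type": "string"},
                "page": {"type": "integer", "minimum": 1},
            },
            "required": ["document"],
        },
    }
    return [
        {
            "name": tools_cfg[key]["name"],
            "description": tools_cfg[key]["description"],
            "inputSchema": schema,
        }
        for key, schema in schemas.items()
        if key in tools_cfg
    ]
```

Because names and descriptions come from the configuration, renaming search_docs to something like search_legal_docs changes what the client displays without touching the search logic.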
🎯 Domain Customization
The server adapts to your domain through:
- Domain Keywords: Configure terms important to your field
- Tool Names: Customize tool names (e.g., search_legal_docs)
- Descriptions: Tailor descriptions for your use case
- Context Size: Adjust how much context to return in search results
🔍 How the Search Engine Works
Inverted Index Architecture
The server builds an inverted index, so a query looks up its terms directly instead of scanning every document:
- Document Processing: PDFs are converted to markdown and tokenized
- Index Building: Words are mapped to their locations (document, page, position)
- TF-IDF Scoring:
- TF (Term Frequency): How often a word appears in a document
- IDF (Inverse Document Frequency): How rare a word is across all documents
- Combined score ensures relevant, unique results rank higher
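The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the SearchIndex class from server.py: the TinyIndex name, the tokenizer, the choice of a page as the scoring unit, and the exact TF and IDF formulas are all assumptions.

```python
import math
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    def __init__(self):
        # word -> {(document, page): [token positions]}
        self.postings = defaultdict(lambda: defaultdict(list))
        # (document, page) -> number of tokens on that page
        self.page_lengths = {}

    def add_page(self, document: str, page: int, text: str) -> None:
        tokens = tokenize(text)
        self.page_lengths[(document, page)] = len(tokens)
        for position, word in enumerate(tokens):
            self.postings[word][(document, page)].append(position)

    def search(self, query: str, limit: int = 5) -> list:
        """Return up to `limit` ((document, page), score) pairs, best first."""
        total_pages = len(self.page_lengths)
        scores = defaultdict(float)
        for word in set(tokenize(query)):
            hits = self.postings.get(word)
            if not hits:
                continue
            # Rare words (few matching pages) get a larger IDF weight.
            idf = math.log(1 + total_pages / len(hits))
            for key, positions in hits.items():
                # TF is normalized by page length so long pages do not dominate.
                tf = len(positions) / self.page_lengths[key]
                scores[key] += tf * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```

Storing token positions alongside the document and page is what later makes proximity matching possible without re-reading the text.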
Search Features
- Proximity Boosting: Multi-word queries score higher when terms appear close together
- Context Extraction: Returns relevant snippets with search terms highlighted
- Domain Keyword Recognition: Configured keywords get special treatment
- Page-Level Precision: Results include specific page numbers
- Smart Caching: Search index persists between server restarts
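Two of these features, proximity boosting and context extraction, are easy to illustrate in isolation. The sketch below is one plausible approach and not the server's actual scoring code: the function names, the boost formula, and the use of ** as a highlight marker are assumptions.

```python
import re


def min_span(position_lists: list) -> int:
    """Smallest token distance covering one position from every (non-empty) list."""
    events = sorted((pos, i) for i, positions in enumerate(position_lists) for pos in positions)
    needed = len(position_lists)
    counts = {}
    best = None
    left = 0
    for pos, term in events:
        counts[term] = counts.get(term, 0) + 1
        # Shrink the window from the left while it still covers every term.
        while len(counts) == needed:
            left_pos, left_term = events[left]
            width = pos - left_pos
            if best is None or width < best:
                best = width
            counts[left_term] -= 1
            if counts[left_term] == 0:
                del counts[left_term]
            left += 1
    return best


def proximity_boost(position_lists: list, max_boost: float = 2.0) -> float:
    """Multiplier in [1, max_boost]; terms that sit next to each other get max_boost."""
    if len(position_lists) < 2 or any(not p for p in position_lists):
        return 1.0
    tight = len(position_lists) - 1  # span when all terms are adjacent
    span = max(min_span(position_lists), tight)
    return 1.0 + (max_boost - 1.0) * tight / span


def snippet(text: str, terms: list, context_size: int = 500) -> str:
    """Return roughly context_size characters around the first match, terms wrapped in **."""
    if not terms:
        return text[:context_size]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text[:context_size]
    start = max(0, match.start() - context_size // 2)
    end = min(len(text), match.end() + context_size // 2)
    return pattern.sub(lambda m: f"**{m.group(0)}**", text[start:end])
```

The position lists would come from the inverted index, one list per query term for a given page, and the resulting multiplier would be applied to that page's TF-IDF score. The context_size argument plays the role of the context_size setting in the processing section of the configuration.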
📊 Performance Optimizations
- Incremental Processing: MD5 hash-based change detection - only new/modified PDFs are processed
- Persistent Search Index: The index is pickled to disk and reloaded on server restart instead of being rebuilt
- Background Initialization: Server accepts connections while building index
- Memory Efficiency: Streaming PDF processing and markdown storage
- Configurable Limits: Control file size limits and processing parameters
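Incremental processing based on MD5 hashes can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the helper names are hypothetical, and the real .pdf_cache.json may store more than a filename-to-hash mapping.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are never fully loaded into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_cache(cache_file: Path) -> dict:
    return json.loads(cache_file.read_text(encoding="utf-8")) if cache_file.exists() else {}


def find_changed_pdfs(pdf_folder: Path, cache_file: Path) -> dict:
    """Return {filename: md5} for PDFs that are new or whose content changed."""
    cache = load_cache(cache_file)
    changed = {}
    for pdf in sorted(pdf_folder.glob("*.pdf")):
        digest = file_md5(pdf)
        if cache.get(pdf.name) != digest:
            changed[pdf.name] = digest
    return changed


def record_processed(cache_file: Path, processed: dict) -> None:
    """Store hashes only after conversion succeeded, so failures are retried next run."""
    cache = load_cache(cache_file)
    cache.update(processed)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```

Comparing content hashes rather than modification times means a PDF that is copied or touched without changing is not reprocessed. MD5 is used here only for change detection, not for any security purpose.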
🐛 Troubleshooting
Common Issues & Solutions
Server not appearing in Claude Desktop:
- Ensure MCP configuration was merged: python generate_mcp_config.py --merge
- Check Python path: which python or where python (Windows)
- Verify server_config.json exists and is valid JSON
- Restart Claude Desktop after configuration changes
PDFs not processing:
- Check folder permissions: ls -la /path/to/pdf/folder
- Verify PDF files aren't corrupted: file document.pdf
- Look for errors in stderr: python server.py 2>error.log
- Ensure sufficient disk space for markdown cache
Search returns no/poor results:
- Initial indexing may take time - check stderr for progress
- Verify markdown files exist: ls markdown/*.md
- Check search index exists: ls markdown/.search_index.pkl
- Try single-word queries first, then expand
- Review domain keywords in configuration
Server crashes or hangs:
- Check Python version (3.8+ required): python --version
- Verify all dependencies installed: pip install -r requirements.txt
- Clear cache and reprocess: rm -rf markdown/.pdf_cache.json markdown/.search_index.pkl
- Check for file locking issues on Windows
Debug Mode
# Run with full debug output
python server.py 2>&1 | tee debug.log
# Check server initialization
grep "initialization" debug.log
# Monitor PDF processing
grep "Processing\|Error" debug.log
Validation Commands
# Test configuration validity
python manage_server.py test
# Verify configuration loading
python -c "from config import load_config_from_env_or_file; c=load_config_from_env_or_file(); print(f'✓ Config loaded: {c.server.name}')"
# Check MCP integration
python generate_mcp_config.py # Should output valid JSON
🚀 Advanced Usage
Multiple Servers
You can run multiple specialized servers:
# Legal documents server
python manage_server.py --config legal_config.json create-config
# Technical docs server
python manage_server.py --config tech_config.json create-config
# Research papers server
python manage_server.py --config research_config.json create-config
Batch Processing
# Process multiple PDF folders
for folder in docs legal_docs tech_docs; do
python convert_pdfs.py "$folder" "$folder/markdown"
done
Custom Keywords
Configure domain-specific keywords for better search relevance:
{
"storage": {
"domain_keywords": [
"algorithm", "data structure", "complexity",
"optimization", "performance", "scalability"
]
}
}
🏗️ Architecture Overview
Core Components
- SearchIndex Class (server.py:27-140)
  - Implements inverted index with TF-IDF scoring
  - Handles word tokenization and document indexing
  - Provides proximity-based ranking for multi-word queries
- GenericPDFServer Class (server.py:142-661)
  - Main server implementation with MCP protocol handling
  - Manages PDF processing pipeline
  - Handles async operations and background initialization
- Configuration System (config.py)
  - Dataclass-based type-safe configuration
  - JSON schema validation
  - Environment variable support
- Management CLI (manage_server.py)
  - Interactive configuration creation
  - PDF management operations
  - Server testing and validation
Data Flow
PDFs → PDF Reader → Markdown Converter → Search Index → MCP Tools → Claude
↓ ↓ ↓
[.pdf files] [.md cache files] [.search_index.pkl]
🔄 Current Server Configuration
The repository currently includes a configuration for QuantConnect documentation (server_config.json). To create your own server:
# Option 1: Interactive setup
python manage_server.py create-config
# Option 2: Copy and modify an example
cp examples/tech_docs_config.json server_config.json
# Edit server_config.json with your settings
📚 Example Use Cases
- Legal Firms: Search through contracts, case files, and legal documents
- Research Labs: Query scientific papers and technical reports
- Software Teams: Access API documentation and technical specs
- Medical Practices: Search patient records and medical literature
- Educational Institutions: Browse course materials and textbooks
🤝 Contributing
We welcome contributions! Here are some ways to help:
Enhancement Ideas
- Document Format Support: Add support for Word, HTML, or other formats
- Search Improvements: Implement semantic search, fuzzy matching, or ML-based ranking
- Performance: Add database backend, parallel processing, or distributed indexing
- Tools: Create specialized MCP tools for specific domains
- UI: Build a web interface for configuration management
Development Guidelines
- Follow existing code style and patterns
- Add tests for new functionality
- Update documentation for new features
- Submit PRs with clear descriptions
🔐 Security Considerations
- The server only reads PDFs from the configured PDF folder; it writes its markdown cache and search index to the configured markdown folder
- No external network calls are made during operation
- Sensitive data remains local - nothing is sent to external services
- Configure appropriate file permissions for your PDF folders
📄 License
This project is open source. See LICENSE file for details.
🙏 Acknowledgments
Built with the Model Context Protocol by Anthropic.
Ready to transform your PDFs into a searchable knowledge base?
Run python manage_server.py create-config to get started! 🚀
📦 Dependencies
- mcp: Model Context Protocol SDK for building MCP servers
- PyPDF2: PDF parsing and text extraction
- asyncio: Asynchronous I/O for concurrent operations (part of the Python standard library, no separate installation needed)
- jsonschema: JSON validation for configuration files
All dependencies are lightweight and have minimal system requirements.