
PDF MCP Server

Enables AI agents to query PDF documents using hybrid retrieval (BM25 + vector search), returning raw document chunks with source metadata for the agent to process.


PDF Retrieval MCP Server

A completely free Model Context Protocol (MCP) server for retrieving relevant chunks from PDF documents using hybrid search (BM25 + Vector Search).

🚀 Features

PDF Document Processing: Automatic parsing and indexing of PDF files using Docling
Hybrid Retrieval: Combines BM25 (keyword) and vector search (semantic) for accurate retrieval
Free Embeddings: Uses ChromaDB's default sentence-transformers model (all-MiniLM-L6-v2), run locally, so there are no API costs
Pure Retrieval Mode: Returns raw document chunks for agent processing (no LLM answer generation)
Fresh Start: Clears vector database on each startup for clean indexing
MCP Integration: Exposes retrieve_pdf_chunks tool via FastMCP for seamless agent integration
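
The hybrid retrieval idea above can be sketched as a rank-fusion step: each retriever returns a ranked list of chunk IDs, and the two lists are merged into one. The sketch below uses reciprocal rank fusion (RRF). Whether this server fuses results this way is an assumption, and the names `reciprocal_rank_fusion`, `bm25_ranking`, and `vector_ranking` are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Merge several ranked lists of chunk IDs into one ranking.

    Each chunk earns weight / (k + rank) from every list it appears in,
    where rank starts at 1 for the best hit.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical output of the two retrievers for one query.
bm25_ranking = ["c3", "c1", "c7"]    # keyword matches
vector_ranking = ["c1", "c4", "c3"]  # semantic matches

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Chunks that both retrievers rank highly (`c1`, `c3`) rise to the top, while chunks found by only one retriever still appear further down the list.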

📋 Prerequisites

Python 3.11 or later
PDF documents to index
No API keys required! ✨

🛠️ Installation

1. Clone the Repository (if not already done)

git clone <repository-url>
cd pdf_mcpserver

2. Install Dependencies with uv

uv sync

This will automatically:

Create a virtual environment (.venv)
Install all dependencies from pyproject.toml
Set up the project

3. Add PDF Documents

Create a documents directory and add your PDF files:

mkdir documents
# Copy your PDF files to the documents/ directory

That's it! No API keys or additional configuration needed.

🎯 Usage

Running the Server

uv run python main.py

Or activate the virtual environment first:

source .venv/bin/activate  # On Windows: .venv\Scripts\activate
python main.py

The server will:

Start immediately (lazy initialization)
Load and index PDFs on first query
Be ready to retrieve document chunks via MCP
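
The lazy-initialization behavior described above can be illustrated with a small wrapper that defers expensive work until the first request. This is a minimal sketch, not the server's actual code: `LazyIndex` and `build_index` are hypothetical names, and the dictionary stands in for the real PDF index.

```python
import threading


class LazyIndex:
    """Builds an expensive resource once, on first use."""

    def __init__(self, build):
        self._build = build  # callable that does the slow work
        self._value = None
        self._ready = False
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: only the first caller pays the build cost.
        if not self._ready:
            with self._lock:
                if not self._ready:
                    self._value = self._build()
                    self._ready = True
        return self._value


build_calls = []


def build_index():
    # Stand-in for parsing PDFs and building the retriever.
    build_calls.append(1)
    return {"chunks": ["chunk one", "chunk two"]}


index = LazyIndex(build_index)  # returns immediately; nothing is indexed yet
first = index.get()             # first query triggers the build
second = index.get()            # later queries reuse the same object
print(len(build_calls))
```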

Using the `retrieve_pdf_chunks` Tool

The server exposes a single MCP tool: retrieve_pdf_chunks(query: str, max_chunks: int = 5) -> str

Example Query:

retrieve_pdf_chunks("machine learning algorithms", max_chunks=3)

Example Response:

{
  "query": "machine learning algorithms",
  "chunks": [
    {
      "content": "Machine learning algorithms can be categorized into supervised, unsupervised, and reinforcement learning...",
      "document_name": "ml_guide.pdf",
      "page_number": 12,
      "metadata": {"source": "ml_guide.pdf"}
    },
    {
      "content": "Common supervised learning algorithms include linear regression, decision trees, and neural networks...",
      "document_name": "ml_guide.pdf",
      "page_number": 15,
      "metadata": {"source": "ml_guide.pdf"}
    }
  ],
  "total_chunks": 2
}

Response Structure

| Field | Type | Description |
|---|---|---|
| `query` | string | The original search query |
| `chunks` | array | List of relevant document chunks |
| `chunks[].content` | string | The text content of the chunk |
| `chunks[].document_name` | string | Source PDF filename |
| `chunks[].page_number` | int | Page number (if available) |
| `chunks[].metadata` | object | Additional metadata |
| `total_chunks` | int | Number of chunks returned |
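
Because the tool returns a JSON string, a client has to parse it before use. The sketch below decodes a response shaped like the table above into typed objects using only the standard library. The `Chunk` dataclass and `parse_response` helper are hypothetical, and treating `page_number` and `metadata` as optional is an assumption based on the "if available" note.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    content: str
    document_name: str
    page_number: Optional[int] = None
    metadata: dict = field(default_factory=dict)


def parse_response(raw: str) -> list[Chunk]:
    """Decode the tool's JSON string into a list of Chunk objects."""
    payload = json.loads(raw)
    return [
        Chunk(
            content=item["content"],
            document_name=item["document_name"],
            page_number=item.get("page_number"),
            metadata=item.get("metadata", {}),
        )
        for item in payload["chunks"]
    ]


raw = json.dumps({
    "query": "machine learning algorithms",
    "chunks": [
        {"content": "Machine learning algorithms can be categorized...",
         "document_name": "ml_guide.pdf", "page_number": 12,
         "metadata": {"source": "ml_guide.pdf"}},
        {"content": "A chunk with no page information.",
         "document_name": "ml_guide.pdf"},
    ],
    "total_chunks": 2,
})

chunks = parse_response(raw)
print(chunks[0].document_name, chunks[0].page_number)
```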

How Agents Use This

When an agent (like Claude) calls this tool:

Agent sends a search query
Server returns relevant document chunks
Agent uses chunks in its context to answer questions

Example Agent Flow:

User: "What are the main ML algorithms discussed?"
  ↓
Agent calls: retrieve_pdf_chunks("machine learning algorithms")
  ↓
Server returns: 3 relevant chunks from PDFs
  ↓
Agent reads chunks and generates answer for user
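
The last step of the flow above, where the agent places chunks into its context, might look like the following on the client side. This is a sketch under assumptions: `build_context` is a hypothetical helper, the citation format is arbitrary, and the character budget is a crude stand-in for a real token limit.

```python
def build_context(chunks, max_chars=2000):
    """Join retrieved chunks into a numbered, source-labelled context string."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        page = chunk.get("page_number")
        source = chunk["document_name"]
        if page is not None:
            source += f", p. {page}"
        block = f"[{number}] ({source})\n{chunk['content']}"
        # Stop adding chunks once the budget would be exceeded.
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


retrieved = [
    {"content": "Machine learning algorithms can be categorized...",
     "document_name": "ml_guide.pdf", "page_number": 12},
    {"content": "Common supervised learning algorithms include...",
     "document_name": "ml_guide.pdf", "page_number": 15},
]

context = build_context(retrieved)
print(context)
```

Numbering each chunk lets the agent cite its sources as `[1]`, `[2]`, and so on in the final answer.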

🔍 Testing with MCP Inspector

The MCP Inspector is a web-based tool for testing and debugging MCP servers interactively.

Running the Inspector

npx @modelcontextprotocol/inspector uv run python main.py

This command will:

Start the MCP Inspector proxy server
Launch your PDF Retrieval Server
Open a web browser with the Inspector UI

What You'll See

The Inspector provides:

Tool Discovery: View available tools (retrieve_pdf_chunks)
Interactive Testing: Test queries with custom parameters
Real-time Responses: See JSON responses in real-time
Request/Response Logs: Debug the MCP protocol communication

Example Inspector Workflow

Open the Inspector - Browser opens automatically at http://localhost:6274
Wait for Initialization - Server loads and indexes PDFs on first query (~1-2 minutes)
Select Tool - Click on retrieve_pdf_chunks in the tools list
Enter Query - Type your search query (e.g., "machine learning")
Set Parameters - Optionally adjust max_chunks (default: 5)
Execute - Click "Run" to see the results
View Response - Inspect the returned chunks and metadata

Inspector Tips

First query is slow: PDF indexing happens on first query (roughly 1-2 minutes for typical PDFs)
Subsequent queries are fast: Embeddings are cached in ChromaDB
Fresh start: Server clears ChromaDB on each restart for clean indexing
Check logs: Terminal shows detailed logging of the indexing process
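
The "fresh start" behavior amounts to wiping the vector store directory before indexing. How the server actually does this is not shown here; the sketch below is one assumed approach using only the standard library, with a hypothetical `reset_directory` helper demonstrated on a temporary directory rather than a real `./chroma_db`.

```python
import shutil
import tempfile
from pathlib import Path


def reset_directory(path: Path) -> None:
    """Delete the directory (if present) and recreate it empty."""
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)


# Simulate a leftover index from a previous run.
base = Path(tempfile.mkdtemp())
db_dir = base / "chroma_db"
db_dir.mkdir()
(db_dir / "stale.bin").write_bytes(b"old index data")

reset_directory(db_dir)
remaining = list(db_dir.iterdir())
print(remaining)

shutil.rmtree(base)  # clean up the temporary directory
```

The trade-off is that every restart pays the full indexing cost again, which is why the first query after startup is slow.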

🏗️ Architecture

pdf_mcpserver/
├── src/
│   ├── config.py              # Configuration management
│   ├── constants.py           # Configuration constants
│   ├── models.py              # Pydantic response models
│   ├── pdf_processor.py       # PDF loading and hybrid retrieval
│   └── retrieval_handler.py   # Document chunk retrieval
├── main.py                    # MCP server entry point
├── pyproject.toml             # Project metadata
└── .env                       # Environment configuration

Key Components

PDFProcessor: Singleton class that loads PDFs, converts them to Markdown using Docling, and builds a hybrid retriever (BM25 + Vector Search)
RetrievalHandler: Retrieves relevant chunks for queries; no LLM answer generation

🔧 Configuration

Configuration is managed through environment variables. Create a .env file in the project root:

# Optional: PDF Documents Directory (defaults to ./documents)
PDF_DOCUMENTS_DIR=./documents

# Optional: ChromaDB Directory (defaults to ./chroma_db)
CHROMA_DB_DIR=./chroma_db

# Optional: Log Level (defaults to INFO)
LOG_LEVEL=INFO

Configuration Options

| Variable | Required | Default | Description |
|---|---|---|---|
| `PDF_DOCUMENTS_DIR` | No | `./documents` | Directory containing PDF files to index |
| `CHROMA_DB_DIR` | No | `./chroma_db` | Directory for ChromaDB vector storage |
| `LOG_LEVEL` | No | `INFO` | Logging level (DEBUG, INFO, WARNING, ERROR) |

Note: No API keys required! ChromaDB uses free local embeddings (sentence-transformers).
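
The defaults listed above can be expressed as a small settings loader. This is a sketch, not the contents of `src/config.py`: the `Settings` class and `load_settings` function are hypothetical, and the code reads only the process environment (loading the `.env` file itself is left out).

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    pdf_documents_dir: Path
    chroma_db_dir: Path
    log_level: str


def load_settings(env=None) -> Settings:
    """Read the three optional variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        pdf_documents_dir=Path(env.get("PDF_DOCUMENTS_DIR", "./documents")),
        chroma_db_dir=Path(env.get("CHROMA_DB_DIR", "./chroma_db")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


defaults = load_settings({})
custom = load_settings({"PDF_DOCUMENTS_DIR": "/data/pdfs", "LOG_LEVEL": "debug"})
print(defaults)
print(custom)
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test without touching the real environment.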

🧪 Testing

Run unit tests:

uv run pytest tests/

📝 Troubleshooting

No PDF files found

Error: No PDF files found in ./documents

Solution: Add PDF files to the documents/ directory or update PDF_DOCUMENTS_DIR in .env
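
To check quickly what the server would find, a directory scan along these lines can help. It is only a diagnostic sketch: `find_pdfs` is a hypothetical helper, and matching the `.pdf` extension case-insensitively without recursing into subdirectories is an assumption about how discovery works.

```python
import shutil
import tempfile
from pathlib import Path


def find_pdfs(directory: Path) -> list[Path]:
    """Return the PDF files directly inside `directory`, sorted by name."""
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory does not exist: {directory}")
    pdfs = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in {directory}")
    return pdfs


workdir = Path(tempfile.mkdtemp())

# An empty directory reproduces the error message.
try:
    find_pdfs(workdir)
    empty_error = None
except FileNotFoundError as exc:
    empty_error = str(exc)

# After adding a PDF, only that file is reported.
(workdir / "Guide.PDF").write_bytes(b"%PDF-1.4")
(workdir / "notes.txt").write_text("not a pdf")
found = [p.name for p in find_pdfs(workdir)]
print(found)

shutil.rmtree(workdir)
```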

Import errors

Error: ModuleNotFoundError: No module named 'docling'

Solution: Ensure all dependencies are installed: uv sync

CUDA out of memory

Error: CUDA out of memory

Solution: The server is configured to use CPU-only mode. If you still see this error, check that CUDA_VISIBLE_DEVICES="" is set in src/pdf_processor.py
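
For reference, hiding GPUs from CUDA-aware libraries is usually done by setting the variable before those libraries are imported. The snippet below shows the general pattern; where exactly the assignment sits in `src/pdf_processor.py` is an assumption.

```python
import os

# Must run before importing any library that initializes CUDA,
# otherwise the setting has no effect on that library.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Imports of GPU-capable libraries would follow here.
print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```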

📚 Dependencies

fastmcp: MCP server framework
docling: Document processing and parsing
chromadb: Vector database with free sentence-transformers embeddings
langchain: RAG framework and retrievers
loguru: Logging

No paid APIs required! All embeddings are generated locally using ChromaDB's default model (all-MiniLM-L6-v2).

🤝 Contributing

This is a Proof of Concept (PoC) implementation. For production use, consider:

Adding caching for processed documents
Implementing multi-agent workflow with fact verification
Supporting additional document formats (DOCX, TXT, etc.)
Adding authentication and rate limiting

📄 License

[Your License Here]

🙏 Acknowledgments

Based on the docchat-docling architecture.