OCR-MCP

OCR-MCP

Provides advanced OCR capabilities with multiple state-of-the-art backends (DeepSeek-OCR, Florence-2, DOTS.OCR, PP-OCRv5), supporting document processing, scanner integration, and multi-format output with layout preservation.

Category
访问服务器

README

OCR-MCP: Advanced Document Processing Server

Python FastMCP GOT-OCR2.0 License Status

FastMCP 2.13+ server providing advanced OCR capabilities including GOT-OCR2.0 integration, WIA scanner control, and multi-format document processing.

📋 Table of Contents

What is OCR-MCP?

OCR-MCP is a FastMCP server that provides comprehensive OCR (Optical Character Recognition) capabilities to MCP clients. It processes various document formats and integrates with scanner hardware.

State-of-the-Art OCR Integration

OCR-MCP integrates multiple current state-of-the-art OCR models for comprehensive document processing:

Primary OCR Engines

🔥 DeepSeek-OCR (October 2025) - Current State-of-the-Art

  • Downloads: 4.7M+ on Hugging Face (most downloaded OCR model)
  • Capabilities: Vision-language OCR with advanced text understanding
  • Strengths: Multilingual support, complex layouts, mathematical formulas
  • Repository: https://huggingface.co/deepseek-ai/DeepSeek-OCR
  • Paper: https://arxiv.org/abs/2510.18234

🎯 Florence-2 (June 2024) - Microsoft's Vision Foundation Model

  • Architecture: Unified vision-language model for various vision tasks
  • OCR Capabilities: Excellent text extraction and layout understanding
  • Strengths: Multi-task learning, fine-grained text recognition
  • Repository: https://huggingface.co/microsoft/Florence-2-base

📊 DOTS.OCR (July 2025) - Document Understanding Specialist

  • Focus: Document layout analysis, table recognition, formula extraction
  • Strengths: Structured document parsing, multilingual support
  • Repository: https://huggingface.co/rednote-hilab/dots.ocr

🚀 PP-OCRv5 (2025) - Industrial-Grade OCR

  • Performance: PaddlePaddle's latest production-ready OCR system
  • Strengths: High accuracy, fast inference, edge deployment
  • Repository: https://huggingface.co/PaddlePaddle/PP-OCRv5

🎨 Qwen-Image-Layered (December 2025) - Advanced Image Decomposition

  • Technology: Decomposes images into multiple independent RGBA layers
  • OCR Integration: Isolate text, background, and structural elements for better OCR
  • Capabilities: Layer-independent editing, resizing, repositioning, recoloring
  • Repository: https://huggingface.co/Qwen/Qwen-Image-Layered
  • Paper: https://arxiv.org/abs/2512.15603
  • Use Case: Pre-process complex documents by separating text layers from backgrounds

OCR Capabilities

  • Plain Text OCR: Standard text extraction from images
  • Formatted Text OCR: Preserves layout and formatting structure
  • Fine-Grained OCR: Extract text from specific regions with coordinate precision
  • Multi-Crop OCR: Process documents with complex layouts by dividing into regions
  • HTML Rendering: Generate HTML output with visual layout preservation
  • Document Understanding: Table extraction, formula recognition, layout analysis

Auto-Backend Selection

OCR-MCP automatically selects the best backend based on:

  • Document Type: PDF, image, scanned document, or comic
  • Content Complexity: Plain text vs. structured documents
  • Language Requirements: Multilingual content detection
  • Performance Needs: Speed vs. accuracy trade-offs

Advanced Document Pre-processing

Qwen-Image-Layered Integration revolutionizes OCR through intelligent image decomposition:

  • Layer Separation: Decompose documents into independent RGBA layers (text, background, images, graphics)
  • Selective OCR: Process text layers independently for improved accuracy on complex documents
  • Noise Reduction: Isolate and remove background noise, watermarks, and interfering elements
  • Content Isolation: Separate handwritten notes, stamps, and annotations from main text
  • Layout Preservation: Maintain document structure while enabling targeted OCR processing
  • Multi-modal Enhancement: Combine with traditional OCR for hybrid processing pipelines

Community & Industry Adoption

Current OCR landscape shows rapid evolution:

  • DeepSeek-OCR: Leading downloads indicate community preference
  • Florence-2: Academic and research adoption
  • DOTS.OCR: Document processing industry standard
  • PP-OCRv5: Production deployment in enterprise applications

Key Features

  • Multiple OCR Backends: GOT-OCR2.0, Tesseract, EasyOCR
  • Processing Modes: Plain text, formatted text, layout preservation, HTML rendering, fine-grained region extraction
  • Document Formats: PDF, CBZ/CBR comic archives, JPG/PNG/TIFF images, scanner input
  • Scanner Integration: Direct WIA control for Windows flatbed scanners
  • Batch Processing: Concurrent processing of multiple documents
  • Output Formats: Text, HTML, Markdown, JSON, XML

🏗️ Architecture

Backend Support Matrix

Backend Plain OCR Formatted OCR Multi-language GPU Support Offline
GOT-OCR2.0
Tesseract
EasyOCR
PaddleOCR
TrOCR

Tool Ecosystem

  • process_document - Main OCR processing with backend selection
  • process_batch - Batch document processing with progress tracking
  • extract_regions - Fine-grained region-based OCR
  • analyze_layout - Document structure and layout analysis
  • convert_format - OCR result format conversion
  • ocr_health_check - Backend availability and diagnostics

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • GPU recommended (for GOT-OCR2.0 and other ML models)
  • 8GB+ VRAM for optimal performance

Installation

# Clone the repository
git clone https://github.com/sandraschi/ocr-mcp.git
cd ocr-mcp

# Install dependencies with Poetry (recommended)
poetry install

# For GPU support (optional but recommended)
poetry run pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

MCP Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "ocr-mcp": {
      "command": "python",
      "args": ["-m", "ocr_mcp.server"],
      "env": {
        "OCR_CACHE_DIR": "/path/to/model/cache",
        "OCR_DEVICE": "cuda"
      }
    }
  }
}

WebApp Mode

OCR-MCP includes a full-featured web interface for document processing:

# Run the web application
poetry run ocr-mcp-webapp

# Or use the script directly
python scripts/run_webapp.py

The web interface provides:

  • 📤 Drag & drop file upload - Support for PDF, images, CBZ
  • 🔄 Real-time processing - Live status updates and progress
  • 📷 Scanner integration - Direct scanner control via web interface
  • 📊 Batch processing - Process multiple documents simultaneously
  • 🎨 OCR backend selection - Choose from 5 different OCR engines
  • 📋 Results visualization - Text, JSON, and HTML output formats

Access the webapp at: http://localhost:8000

🌐 WebApp Interface

OCR-MCP provides a modern web interface for document processing and scanner control:

Features

  • 📤 File Upload: Drag & drop interface supporting PDF, PNG, JPG, TIFF, BMP, CBZ, CBR
  • 🔄 Live Processing: Real-time status updates with progress indicators
  • 📷 Scanner Control: Discover and control WIA-compatible scanners
  • 📊 Batch Operations: Process multiple documents simultaneously
  • 🎨 Backend Selection: Choose from 5 different OCR engines per task
  • 📋 Multi-format Output: View results as plain text, JSON, or HTML
  • 💾 Export Options: Download results or copy to clipboard

Interface Sections

Upload & Process Tab

  • Single document processing with drag-and-drop upload
  • OCR backend selection (DeepSeek-OCR, Florence-2, DOTS.OCR, PP-OCRv5, Qwen-Image-Layered)
  • Processing mode selection (Text, Formatted, Fine-grained)
  • Real-time processing status and results display

Scanner Control Tab

  • Automatic scanner discovery
  • Scanner properties configuration (DPI, color mode, paper size)
  • Single document scanning
  • Direct integration with OCR processing

Batch Processing Tab

  • Multiple file selection and management
  • Concurrent processing with progress tracking
  • Batch results aggregation

Settings Tab

  • System health monitoring
  • OCR backend availability status
  • Configuration diagnostics

WebApp Architecture

The webapp consists of:

  • FastAPI Backend: RESTful API server with async processing
  • MCP Integration: Direct communication with OCR-MCP server
  • Modern Frontend: Responsive HTML/CSS/JavaScript interface
  • File Management: Secure temporary file handling
  • Real-time Updates: WebSocket-like status polling

💡 Usage Examples

Basic OCR Processing

# Auto-select best available backend
result = await process_document(
    image_path="/path/to/document.png"
)
print(result["text"])  # Extracted text

Formatted OCR with HTML Output

# GOT-OCR2.0 formatted text preservation
result = await process_document(
    image_path="/path/to/scanned_page.png",
    backend="got-ocr",
    mode="format",
    output_format="html"
)
# Returns: HTML with preserved layout and formatting

Fine-grained Region Extraction

# Extract text from specific coordinates
result = await extract_regions(
    image_path="/path/to/document.png",
    regions=[
        {"x1": 100, "y1": 200, "x2": 400, "y2": 300, "label": "title"},
        {"x1": 100, "y1": 350, "x2": 500, "y2": 600, "label": "content"}
    ]
)
# Returns: Structured text extraction by region

Batch Processing

# Process multiple documents
results = await process_batch(
    image_paths=[
        "/path/to/doc1.png",
        "/path/to/doc2.png",
        "/path/to/doc3.png"
    ],
    backend="got-ocr",
    output_format="json"
)
# Returns: Array of OCR results with progress tracking

🎨 Advanced Features

Document Layout Analysis

# Analyze document structure
layout = await analyze_layout(
    image_path="/path/to/complex_document.png"
)
# Returns: Detected tables, columns, headers, text blocks

Multi-Backend Comparison

# Compare OCR accuracy across backends
comparison = await compare_backends(
    image_path="/path/to/test_image.png",
    backends=["got-ocr", "tesseract", "easyocr"]
)
# Returns: Accuracy scores, processing times, confidence metrics

Format Conversion

# Convert OCR results between formats
html_result = await convert_format(
    ocr_result=raw_result,
    from_format="text",
    to_format="html",
    preserve_layout=True
)

🔧 Configuration Options

Environment Variables

  • OCR_CACHE_DIR: Model cache directory (default: ~/.cache/ocr-mcp)
  • OCR_DEVICE: Computing device (cuda, cpu, auto)
  • OCR_MAX_MEMORY: Maximum GPU memory usage in GB
  • OCR_DEFAULT_BACKEND: Default OCR backend (got-ocr, tesseract, etc.)
  • OCR_BATCH_SIZE: Default batch processing size

Backend-Specific Settings

# config/ocr_config.yaml
backends:
  got_ocr:
    model_size: "base"  # or "large"
    cache_dir: "/models/got-ocr"
    device: "cuda:0"

  tesseract:
    language: "eng+fra+deu"
    config: "--psm 6"

  easyocr:
    languages: ["en", "fr", "de"]
    gpu: true

📊 Performance Benchmarks

Single Image Processing (GTX 3080)

Backend Plain OCR Formatted OCR Fine-grained
GOT-OCR2.0 2.3s 3.1s 4.2s
Tesseract 0.8s N/A 1.2s
EasyOCR 1.5s N/A 2.1s
PaddleOCR 1.8s 2.9s 3.5s

Accuracy Comparison (Clean Documents)

Backend Print Text Handwriting Mixed Content
GOT-OCR2.0 97.2% 89.1% 94.8%
Tesseract 92.1% 45.3% 78.9%
EasyOCR 94.7% 78.2% 88.5%
PaddleOCR 95.8% 82.1% 91.2%

🛠️ Development Status

  • Planning: Complete master plan and architecture
  • 🟡 Phase 1: Core infrastructure (In Progress)
  • Phase 2: GOT-OCR2.0 integration
  • Phase 3: Multi-backend support
  • Phase 4: Advanced features
  • Phase 5: Specialized tools
  • Phase 6: Production deployment

See OCR-MCP_MASTER_PLAN.md for detailed roadmap.

🤝 Integration with Existing MCP Servers

CalibreMCP Integration

OCR-MCP enhances CalibreMCP's OCR capabilities:

# CalibreMCP can now use OCR-MCP for advanced processing
result = await calibre_ocr(
    source="/path/to/scanned_book.pdf",
    provider="ocr-mcp",  # New option!
    mode="format",
    render_html=True
)

Document Processing Workflows

  • Research Papers: Extract structured text from academic PDFs
  • Receipt Processing: Automated data extraction from scanned receipts
  • Book Digitization: High-quality OCR for scanned books
  • Accessibility: Convert images to readable text for screen readers

📈 Roadmap

Immediate (Next 4 weeks)

  • [ ] Complete core infrastructure
  • [ ] GOT-OCR2.0 integration
  • [ ] Basic tool implementation
  • [ ] Documentation and examples

Medium-term (2-3 months)

  • [ ] Multi-backend support
  • [ ] Advanced processing modes
  • [ ] Batch processing optimization
  • [ ] Performance benchmarking

Long-term (6+ months)

  • [ ] Community backend integrations
  • [ ] Specialized domain models
  • [ ] Real-time processing capabilities
  • [ ] Mobile app integration

🤝 Contributing

OCR-MCP welcomes contributions! Areas of particular interest:

  • New OCR Backends: Integration of additional OCR engines
  • Performance Optimization: GPU memory management, batch processing
  • Specialized Models: Domain-specific OCR improvements
  • Documentation: Usage examples, integration guides
  • Testing: Comprehensive test coverage and benchmarks

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

  • GOT-OCR2.0 Team (UCAS): Revolutionary OCR model that inspired this project
  • FastMCP Community: Excellent framework for MCP server development
  • Open Source OCR Community: Tesseract, EasyOCR, PaddleOCR, and others

OCR-MCP: Democratizing state-of-the-art document understanding for the MCP ecosystem! 🌟

See OCR-MCP_MASTER_PLAN.md for technical details and implementation roadmap.

推荐服务器

Baidu Map

Baidu Map

百度地图核心API现已全面兼容MCP协议,是国内首家兼容MCP协议的地图服务商。

官方
精选
JavaScript
Playwright MCP Server

Playwright MCP Server

一个模型上下文协议服务器,它使大型语言模型能够通过结构化的可访问性快照与网页进行交互,而无需视觉模型或屏幕截图。

官方
精选
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

一个由人工智能驱动的工具,可以从自然语言描述生成现代化的用户界面组件,并与流行的集成开发环境(IDE)集成,从而简化用户界面开发流程。

官方
精选
本地
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

通过模型上下文协议启用与 Audiense Insights 账户的交互,从而促进营销洞察和受众数据的提取和分析,包括人口统计信息、行为和影响者互动。

官方
精选
本地
TypeScript
VeyraX

VeyraX

一个单一的 MCP 工具,连接你所有喜爱的工具:Gmail、日历以及其他 40 多个工具。

官方
精选
本地
graphlit-mcp-server

graphlit-mcp-server

模型上下文协议 (MCP) 服务器实现了 MCP 客户端与 Graphlit 服务之间的集成。 除了网络爬取之外,还可以将任何内容(从 Slack 到 Gmail 再到播客订阅源)导入到 Graphlit 项目中,然后从 MCP 客户端检索相关内容。

官方
精选
TypeScript
Kagi MCP Server

Kagi MCP Server

一个 MCP 服务器,集成了 Kagi 搜索功能和 Claude AI,使 Claude 能够在回答需要最新信息的问题时执行实时网络搜索。

官方
精选
Python
e2b-mcp-server

e2b-mcp-server

使用 MCP 通过 e2b 运行代码。

官方
精选
Neon MCP Server

Neon MCP Server

用于与 Neon 管理 API 和数据库交互的 MCP 服务器

官方
精选
Exa MCP Server

Exa MCP Server

模型上下文协议(MCP)服务器允许像 Claude 这样的 AI 助手使用 Exa AI 搜索 API 进行网络搜索。这种设置允许 AI 模型以安全和受控的方式获取实时的网络信息。

官方
精选