OfficeReader-MCP
Converts Microsoft Office documents (Word, Excel, PowerPoint) to Markdown format with intelligent image extraction and optimization, preserving document structure and metadata.
A Model Context Protocol (MCP) server that converts Microsoft Office documents (Word, Excel, PowerPoint) to Markdown format with intelligent image extraction and optimization.
Features
- Multi-Format Support: Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt)
- Intelligent Image Processing: Automatic extraction and optimization with WebP compression
- Format Preservation: Maintains document structure including headings, tables, lists, and formatting
- Metadata Extraction: Access document properties (author, title, creation date, etc.)
- Efficient Caching: Smart caching system for quick reuse of converted documents
- Cross-Platform: Works on Windows, macOS, and Linux
Supported Formats
| Format | Extensions | Features |
|---|---|---|
| Word | .docx, .doc | Text formatting, headings, lists, tables, images |
| Excel | .xlsx, .xls | Multi-sheet support, tables, charts, embedded images |
| PowerPoint | .pptx, .ppt | Slides, text boxes, images, speaker notes, tables |
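The table above amounts to a lookup from file extension to document family. A minimal sketch of such a lookup follows; the name `detect_format` and the dictionary are illustrative assumptions and are not taken from the package's source.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping derived from the Supported Formats table;
# the package's real detection logic may differ.
EXTENSION_TO_FORMAT = {
    ".docx": "word",
    ".doc": "word",
    ".xlsx": "excel",
    ".xls": "excel",
    ".pptx": "powerpoint",
    ".ppt": "powerpoint",
}


def detect_format(file_path: str) -> Optional[str]:
    """Return the document family for a path, or None if it is unsupported."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    return EXTENSION_TO_FORMAT.get(suffix)
```

Lower-casing the suffix means files such as `REPORT.DOCX` are still recognized.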
Installation
Prerequisites
- Python 3.10 or higher
- Claude Desktop or Claude Code
Step 1: Install the Package
# Clone the repository
git clone https://github.com/Asunainlove/office-reader-mcp.git
cd office-reader-mcp
# Install in editable mode
pip install -e .
Step 2: Configure Claude
For Claude Desktop
Add to your Claude Desktop config file:
Windows: %APPDATA%\Claude\claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "officereader": {
      "command": "python",
      "args": ["-m", "officereader_mcp.server"],
      "env": {
        "OFFICEREADER_CACHE_DIR": "/path/to/cache"
      }
    }
  }
}
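If the config file already lists other MCP servers, the `officereader` entry has to be merged in rather than pasted over them. The sketch below shows one way to do that with the standard library; the helper names `add_officereader_entry` and `update_config_file` are hypothetical, and the entry itself is copied from the JSON above.

```python
import json
from pathlib import Path


def add_officereader_entry(config: dict, cache_dir: str) -> dict:
    """Return a copy of a Claude config dict with the officereader server added."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))  # keep existing servers
    servers["officereader"] = {
        "command": "python",
        "args": ["-m", "officereader_mcp.server"],
        "env": {"OFFICEREADER_CACHE_DIR": cache_dir},
    }
    updated["mcpServers"] = servers
    return updated


def update_config_file(config_path: Path, cache_dir: str) -> None:
    """Read the config file if it exists, merge the entry, and write it back."""
    existing = {}
    if config_path.exists():
        existing = json.loads(config_path.read_text(encoding="utf-8"))
    merged = add_officereader_entry(existing, cache_dir)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
```

Back up the config file before running anything like this, since a malformed file will stop Claude Desktop from loading any server.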
For Claude Code
Add to your Claude Code settings:
Windows: %LOCALAPPDATA%\claude-code\settings.json
macOS/Linux: ~/.config/claude-code/settings.json
{
  "mcpServers": {
    "officereader": {
      "command": "python",
      "args": ["-m", "officereader_mcp.server"],
      "env": {
        "OFFICEREADER_CACHE_DIR": "/path/to/cache"
      }
    }
  }
}
Step 3: Restart Claude
Restart Claude Desktop or Claude Code to load the MCP server.
Quick Start
After installation, you can use OfficeReader-MCP directly in your conversations with Claude:
Convert my Excel file at D:\Reports\sales_2024.xlsx to markdown
Extract text and images from D:\Presentations\keynote.pptx
Get metadata from my document at C:\Documents\report.docx
Available Tools
1. convert_document
Convert any supported Office document to Markdown format.
Parameters:
- file_path (required): Absolute path to the document
- extract_images (optional, default: true): Extract embedded images
- image_format (optional, default: "file"): How to handle images
  - "file": Save images to disk (recommended)
  - "base64": Embed images as base64 in markdown
  - "both": Both save and embed
- output_name (optional): Custom name for output files
Example:
Convert D:\Documents\report.xlsx with images
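Behind a prompt like the one above, an MCP client sends a JSON-RPC `tools/call` request to the server. The sketch below builds such a request from the documented parameters. The helper `build_convert_request` is hypothetical, and the server's exact input schema is assumed to match the parameter list rather than confirmed.

```python
import json
from typing import Optional

ALLOWED_IMAGE_FORMATS = {"file", "base64", "both"}


def build_convert_request(
    file_path: str,
    extract_images: bool = True,
    image_format: str = "file",
    output_name: Optional[str] = None,
    request_id: int = 1,
) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for convert_document."""
    if image_format not in ALLOWED_IMAGE_FORMATS:
        raise ValueError(f"image_format must be one of {sorted(ALLOWED_IMAGE_FORMATS)}")
    arguments = {
        "file_path": file_path,
        "extract_images": extract_images,
        "image_format": image_format,
    }
    if output_name is not None:  # optional parameter, omitted when unset
        arguments["output_name"] = output_name
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "convert_document", "arguments": arguments},
    }


print(json.dumps(build_convert_request(r"D:\Documents\report.xlsx"), indent=2))
```

In normal use Claude constructs this request for you; the sketch is mainly useful when testing the server with a custom MCP client.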
2. read_converted_markdown
Read the full content of a previously converted markdown file.
Parameters:
- markdown_path (required): Path to the markdown file
Example:
Read the markdown at D:\cache\output\report_abc12345\report_abc12345.md
3. list_conversions
List all cached document conversions with details.
Example:
List all converted documents
4. clear_cache
Clear all cached conversions to free up disk space.
Example:
Clear the document cache
5. get_document_metadata
Extract metadata from a document without full conversion (faster).
Parameters:
- file_path (required): Path to the document
Example:
Get metadata from D:\Documents\presentation.pptx
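To illustrate why metadata can be read without a full conversion: the newer formats (.docx, .xlsx, .pptx) are ZIP archives that keep core properties in `docProps/core.xml`. The sketch below reads that part directly with the standard library. It is not necessarily how `get_document_metadata` is implemented, it does not apply to the legacy .doc/.xls/.ppt formats, and the function name `read_core_properties` is an assumption.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the OOXML core-properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> qualified element name inside docProps/core.xml.
FIELDS = {"title": "dc:title", "author": "dc:creator", "created": "dcterms:created"}


def read_core_properties(source) -> dict:
    """Return title, author, and creation date from an OOXML file.

    `source` may be a path or a binary file-like object.
    Properties that are absent or empty are left out of the result.
    """
    with zipfile.ZipFile(source) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for key, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            result[key] = element.text
    return result
```

Only one small XML part is decompressed, which is why this kind of lookup is much cheaper than converting the whole document.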
6. get_supported_formats
Get list of all supported file formats and extensions.
Example:
What file formats does officereader support?
Output Structure
Converted documents are organized in the cache directory:
cache/
└── output/
    └── document_name_abc12345/
        ├── document_name_abc12345.md    # Converted markdown
        └── images/
            ├── image_001.webp           # Optimized images
            ├── slide2_image_002.webp
            └── excel_image_003.webp
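The layout above can be expressed as a small path helper. In this sketch the 8-character suffix is taken from a SHA-256 digest of the source path; that choice is purely an assumption for illustration (the real suffix may be derived from file contents or another hash), and `output_paths` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def output_paths(cache_dir: str, document_path: str) -> dict:
    """Compute where a converted document and its images would be stored."""
    source = Path(document_path)
    # Assumption: 8 hex characters from a hash of the path string.
    digest = hashlib.sha256(str(source).encode("utf-8")).hexdigest()[:8]
    folder_name = f"{source.stem}_{digest}"
    folder = Path(cache_dir) / "output" / folder_name
    return {
        "folder": folder,
        "markdown": folder / f"{folder_name}.md",
        "images": folder / "images",
    }


paths = output_paths("cache", "report.docx")
print(paths["markdown"])
```

Because the folder name embeds a hash, two documents with the same file name in different locations do not overwrite each other under this scheme.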
Image Optimization
Images are automatically optimized to reduce file size while maintaining quality:
- Max Dimensions: 1920×1080 pixels (configurable)
- Format: WebP (preferred) or PNG/JPEG fallback
- Quality: 80 for photos, 85 for the JPEG fallback; lossless PNG for graphics with transparency
- Typical Compression: 50-80% size reduction
- Smart Detection: Automatically distinguishes between photos and graphics
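The maximum-dimension rule can be illustrated with a little arithmetic. The sketch below scales an image to fit inside the 1920×1080 box while keeping its aspect ratio and never enlarging it. Whether the optimizer treats the limit exactly this way is an assumption; Pillow's `Image.thumbnail` behaves similarly. The function name `fit_within` is hypothetical.

```python
def fit_within(width: int, height: int, max_width: int = 1920, max_height: int = 1080):
    """Scale (width, height) down to fit the bounding box, preserving aspect ratio."""
    # Use the tighter of the two constraints; cap at 1.0 so images are never upscaled.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))  # (1440, 1080)
print(fit_within(800, 600))    # (800, 600)
```

A 4000×3000 photo is limited by its height, so it shrinks to 1440×1080, while an 800×600 image already fits and is left unchanged.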
Technical Details
Architecture
OfficeReader-MCP/
├── src/officereader_mcp/
│   ├── server.py              # MCP server implementation
│   ├── converter.py           # Word converter (DocxConverter, OfficeConverter)
│   ├── excel_converter.py     # Excel to Markdown converter
│   ├── pptx_converter.py      # PowerPoint to Markdown converter
│   ├── image_optimizer.py     # Image compression utility
│   └── __init__.py            # Package initialization
├── test/
│   ├── test_converter.py      # Basic functionality tests
│   └── test_all_formats.py    # Comprehensive test suite
├── pyproject.toml             # Project configuration
└── README.md                  # Documentation
Dependencies
| Package | Version | Purpose |
|---|---|---|
| mcp | >=1.0.0 | Model Context Protocol SDK |
| python-docx | >=1.1.0 | DOCX file parsing and manipulation |
| mammoth | >=1.6.0 | DOC/DOCX to HTML conversion (fallback) |
| Pillow | >=10.0.0 | Image processing and optimization |
| markdownify | >=0.11.0 | HTML to Markdown conversion |
| openpyxl | >=3.1.0 | Excel file parsing |
| python-pptx | >=0.6.21 | PowerPoint file parsing |
All dependencies are automatically installed when you run pip install -e .
Testing
Run Tests
# Basic converter test
python test/test_converter.py
# Comprehensive test suite for all formats
python test/test_all_formats.py
# Test with a specific document
python test/test_converter.py path/to/your/document.docx
Test Coverage
The test suite verifies:
- Module imports and initialization
- Converter functionality for all formats
- Image extraction and optimization
- File type detection
- Cache management
- Metadata extraction
Configuration
OfficeReader-MCP supports multiple configuration methods to customize cache locations and behavior.
Quick Configuration (Recommended)
- Copy the example config file:
  cp config.example.json config.json
- Edit config.json to set your cache directory:
  {
    "cache_dir": "D:/MyDocuments/OfficeReaderCache",
    "image_optimization": {
      "enabled": true,
      "max_dimension": 1920,
      "quality": 80
    }
  }
- The config file will be automatically loaded on startup.
For detailed configuration options, see CONFIG.md.
Environment Variables
| Variable | Description | Default |
|---|---|---|
| OFFICEREADER_CACHE_DIR | Directory for cached conversions | System temp directory |
Example usage:
# Set custom cache directory
export OFFICEREADER_CACHE_DIR=/path/to/custom/cache
# Or on Windows (Command Prompt)
set OFFICEREADER_CACHE_DIR=C:\path\to\custom\cache
Note: Environment variables take priority over config file settings.
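The precedence described in the note can be sketched as a three-step lookup: environment variable first, then the config file, then the system temp directory. The variable name and the `cache_dir` key come from this README; the function `resolve_cache_dir` is hypothetical, and the real default may be a subdirectory of the temp directory rather than the directory itself.

```python
import json
import os
import tempfile
from pathlib import Path


def resolve_cache_dir(env=None, config_path="config.json") -> Path:
    """Pick the cache directory: environment variable, then config file, then temp."""
    env = os.environ if env is None else env

    # 1. The environment variable wins when it is set and non-empty.
    from_env = env.get("OFFICEREADER_CACHE_DIR")
    if from_env:
        return Path(from_env)

    # 2. Otherwise use "cache_dir" from the config file, if the file exists.
    config_file = Path(config_path)
    if config_file.is_file():
        config = json.loads(config_file.read_text(encoding="utf-8"))
        if config.get("cache_dir"):
            return Path(config["cache_dir"])

    # 3. Fall back to the system temp directory (assumed default).
    return Path(tempfile.gettempdir())
```

Passing `env` explicitly makes the function easy to test without touching the real process environment.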
Usage Examples
Converting Excel with Multiple Sheets
User: Convert my Excel file at D:\Reports\Q4_sales.xlsx
Claude: I'll convert that Excel file. Each sheet will be converted to a separate
section in the markdown with properly formatted tables...
[Output includes all sheets as markdown tables with preserved formatting]
Extracting PowerPoint Content
User: Extract all text and images from D:\Presentations\product_launch.pptx
Claude: Converting the PowerPoint presentation. I'll extract text from each slide,
including speaker notes, along with all embedded images...
[Output includes slide-by-slide breakdown with images and notes]
Batch Processing
User: Convert all Office documents in D:\Documents\
Claude: I'll convert each document and cache the results for quick access...
[Processes all supported files and provides summary]
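For a batch run like this one, the first step is finding the files to convert. The sketch below lists supported documents in a folder using the extensions from the Supported Formats table. The function name `find_office_documents` is hypothetical, and skipping `~$` files (the lock files Office creates while a document is open) is a design choice of this sketch rather than documented behavior.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"}


def find_office_documents(folder: str, recursive: bool = False) -> list:
    """Return supported Office files in a folder, sorted by path."""
    root = Path(folder)
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(
        path
        for path in candidates
        if path.is_file()
        and path.suffix.lower() in SUPPORTED_EXTENSIONS
        and not path.name.startswith("~$")  # skip Office lock files
    )
```

Each returned path can then be passed to `convert_document` as its `file_path` argument.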
Troubleshooting
"Module not found" Error
# Reinstall the package
pip install -e .
Configuration Not Loading
- Verify the config file location is correct
- Check JSON syntax is valid (use a JSON validator)
- Restart Claude Desktop or Claude Code completely
- Check logs for error messages
Images Not Extracting
Possible causes:
- Document contains linked images (not embedded)
- Insufficient write permissions for cache directory
- Image format not supported by the document library
Solution:
# Verify cache directory is writable
# Unix/macOS
ls -la /path/to/cache
# Windows (Command Prompt)
dir C:\path\to\cache
# Check if images are embedded
# Use convert_document with extract_images=true explicitly
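Both checks can also be scripted. The newer formats are ZIP archives that store embedded images under `word/media/`, `xl/media/`, or `ppt/media/`, so an empty result from the first helper suggests the images are linked rather than embedded. The second helper tests write access by creating a temporary file. Both function names are hypothetical, and neither applies to the legacy .doc/.xls/.ppt formats.

```python
import tempfile
import zipfile

# Folders where OOXML packages store embedded media.
MEDIA_PREFIXES = ("word/media/", "xl/media/", "ppt/media/")


def list_embedded_media(source) -> list:
    """List embedded media parts in a .docx/.xlsx/.pptx file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith(MEDIA_PREFIXES) and not name.endswith("/")
        ]


def is_writable(directory: str) -> bool:
    """Return True if a temporary file can be created in the directory."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:  # missing directory, permission denied, read-only volume
        return False
```

If `list_embedded_media` returns names but no images appear in the output, the cache directory permissions are the next thing to check.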
Encoding Issues
The converter uses UTF-8 encoding throughout. If you see garbled text:
- Check the source document encoding
- Ensure your terminal/console supports UTF-8
- Try converting with different system locale settings
Changelog
v2.0.0 (2024-11)
Major Features:
- Added Excel (.xlsx, .xls) support with multi-sheet conversion
- Added PowerPoint (.pptx, .ppt) support with slide extraction
- Implemented intelligent image optimization with WebP compression
- Added unified OfficeConverter interface for all document types
- Enhanced metadata extraction for all formats
Improvements:
- Smart caching system with hash-based file identification
- Lazy-loading of format-specific converters for better performance
- Better error handling and validation
- Comprehensive test suite for all formats
Tools:
- Added get_supported_formats tool
- Enhanced get_document_metadata for all formats
- Improved list_conversions with detailed cache information
v1.0.0 (2024-09)
- Initial release
- Word document (.docx, .doc) conversion
- Basic image extraction
- MCP server implementation
Contributing
Contributions are welcome! Here's how you can help:
- Report Bugs: Open an issue with details and steps to reproduce
- Suggest Features: Describe your idea and use case
- Submit Pull Requests:
  - Fork the repository
  - Create a feature branch (git checkout -b feature/amazing-feature)
  - Commit your changes (git commit -m 'Add amazing feature')
  - Push to your branch (git push origin feature/amazing-feature)
  - Open a Pull Request
Development Setup
# Clone and install with dev dependencies
git clone https://github.com/Asunainlove/office-reader-mcp.git
cd office-reader-mcp
pip install -e ".[dev]"
# Run tests
python test/test_all_formats.py
# Run linting (if configured)
black src/
ruff check src/
License
MIT License - see LICENSE file for details.
Author
Asunainlove
- GitHub: @Asunainlove
- Repository: office-reader-mcp
- Issues: Report a bug
Acknowledgments
This project uses the following open-source libraries:
- Model Context Protocol (MCP) by Anthropic
- python-docx for Word processing
- openpyxl for Excel processing
- python-pptx for PowerPoint processing
- Pillow for image processing
Support
If you find this project helpful, please:
- ⭐ Star the repository
- 🐛 Report bugs and issues
- 💡 Suggest new features
- 🔀 Contribute code improvements
Happy converting! 🚀