DataBeak
Provides 40+ specialized tools for AI assistants to load, transform, analyze, and validate CSV data from URLs and string content through the Model Context Protocol.
AI-Powered CSV Processing via Model Context Protocol
Transform how AI assistants work with CSV data. DataBeak provides 40+ specialized tools for data manipulation, analysis, and validation through the Model Context Protocol (MCP).
Features
- 🔄 Complete Data Operations - Load, transform, and analyze CSV data from URLs and string content
- 📊 Advanced Analytics - Statistics, correlations, outlier detection, data profiling
- ✅ Data Validation - Schema validation, quality scoring, anomaly detection
- 🎯 Stateless Design - Clean MCP architecture with external context management
- ⚡ High Performance - Async I/O, streaming downloads, chunked processing
- 🔒 Session Management - Multi-user support with isolated sessions
- 🛡️ Web-Safe - No file system access; designed for secure web hosting
- 🌟 Code Quality - Zero ruff violations, 100% mypy compliance, documentation that follows MCP standards, comprehensive test coverage
Getting Started
The fastest way to use DataBeak is with uvx (no installation required):
For Claude Desktop
Add this to your MCP Settings file:
{
"mcpServers": {
"databeak": {
"command": "uvx",
"args": [
"--from",
"git+https://github.com/jonpspri/databeak.git",
"databeak"
]
}
}
}
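If the settings file already lists other servers, the entry above has to be merged in rather than pasted over the file. A minimal sketch of that merge follows; the helper names `databeak_entry`, `add_databeak`, and `update_settings_file` are hypothetical, and the location of the settings file depends on your client and platform, so it is passed in by the caller rather than assumed.

```python
import json
from pathlib import Path


def databeak_entry() -> dict:
    """Return the DataBeak server entry exactly as shown in the README."""
    return {
        "command": "uvx",
        "args": [
            "--from",
            "git+https://github.com/jonpspri/databeak.git",
            "databeak",
        ],
    }


def add_databeak(config: dict) -> dict:
    """Return a copy of an MCP settings dict with the DataBeak entry added.

    Other servers already present under "mcpServers" are preserved.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["databeak"] = databeak_entry()
    merged["mcpServers"] = servers
    return merged


def update_settings_file(path: Path) -> None:
    """Read the settings file (if it exists), add DataBeak, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(json.dumps(add_databeak(config), indent=2) + "\n", encoding="utf-8")
```

Calling `update_settings_file(Path("path/to/your/settings.json"))` with the path your client documents leaves existing entries in place; back up the file first if it was edited by hand.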
For Other AI Clients
DataBeak works with Continue, Cline, Windsurf, and Zed. See the installation guide for specific configuration examples.
HTTP Mode (Advanced)
For HTTP-based AI clients or custom deployments:
# Run in HTTP mode
uv run databeak --transport http --host 0.0.0.0 --port 8000
# Access server at http://localhost:8000/mcp
# Health check at http://localhost:8000/health
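To confirm from a script that the HTTP server is up, you can poll the health endpoint listed above. The sketch below uses only the Python standard library. It assumes that a healthy server answers `/health` with HTTP 200, which this README does not state explicitly, and the function name `is_healthy` is hypothetical.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200.

    Assumption: a 200 status means "healthy"; the response body is not inspected.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status raised as HTTPError.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "not reachable")
```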
Quick Test
Once configured, ask your AI assistant:
"Load this CSV data: name,price\nWidget,10.99\nGadget,25.50"
"Load CSV from URL: https://example.com/data.csv"
"Remove duplicate rows and show me the statistics"
"Find outliers in the price column"
Documentation
- Installation Guide - Setup for all AI clients
- Quick Start Tutorial - Learn in 10 minutes
- API Reference - All 40+ tools documented
- Architecture - Technical details
Environment Variables
Configure DataBeak behavior with environment variables (all use the DATABEAK_ prefix):
| Variable | Default | Description |
|---|---|---|
| DATABEAK_SESSION_TIMEOUT | 3600 | Session timeout (seconds) |
| DATABEAK_MAX_DOWNLOAD_SIZE_MB | 100 | Maximum URL download size (MB) |
| DATABEAK_MAX_MEMORY_USAGE_MB | 1000 | Max DataFrame memory (MB) |
| DATABEAK_MAX_ROWS | 1,000,000 | Max DataFrame rows |
| DATABEAK_URL_TIMEOUT_SECONDS | 30 | URL download timeout (seconds) |
| DATABEAK_HEALTH_MEMORY_THRESHOLD_MB | 2048 | Health monitoring memory threshold (MB) |
See settings.py for complete configuration options.
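To preview which values a deployment would run with, the defaults from the table can be combined with whatever is set in the environment. This is an illustrative sketch only, not DataBeak's settings.py: the names `DEFAULTS` and `resolve_settings` are hypothetical, and it assumes every listed variable is parsed as a plain integer.

```python
import os
from collections.abc import Mapping

# Defaults copied from the table above.
DEFAULTS: dict[str, int] = {
    "DATABEAK_SESSION_TIMEOUT": 3600,
    "DATABEAK_MAX_DOWNLOAD_SIZE_MB": 100,
    "DATABEAK_MAX_MEMORY_USAGE_MB": 1000,
    "DATABEAK_MAX_ROWS": 1_000_000,
    "DATABEAK_URL_TIMEOUT_SECONDS": 30,
    "DATABEAK_HEALTH_MEMORY_THRESHOLD_MB": 2048,
}


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict[str, int]:
    """Return the effective value for each variable: the override if set, else the default.

    Assumption: values are integers; a non-numeric override raises ValueError.
    """
    resolved: dict[str, int] = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        resolved[name] = int(raw) if raw is not None else default
    return resolved


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(f"{name}={value}")
```

For example, starting the server with `DATABEAK_MAX_DOWNLOAD_SIZE_MB=250` in the environment would change only that entry and leave the other five at their defaults.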
Known Limitations
DataBeak is designed for interactive CSV processing with AI assistants. Be aware of these constraints:
- Data Loading: URLs and string content only (no local file system access for web hosting security)
- Download Size: Maximum 100MB per URL download (configurable via DATABEAK_MAX_DOWNLOAD_SIZE_MB)
- DataFrame Size: Maximum 1GB memory and 1M rows per DataFrame (configurable)
- Session Management: Maximum 100 concurrent sessions, 1-hour timeout (configurable)
- Memory: Large datasets may require significant memory; monitor with the health_check tool
- CSV Dialects: Assumes standard CSV format; complex dialects may require pre-processing
- Concurrency: Async I/O for concurrent URL downloads; parallel sessions supported
- Data Types: Automatic type inference; complex types may need explicit conversion
- URL Loading: HTTPS only; blocks loopback and private network addresses (127.0.0.1, 192.168.x.x, 10.x.x.x) for security
For production deployments with larger datasets, adjust environment variables
and monitor resource usage with health_check and get_server_info tools.
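The URL Loading constraint can be checked on the client side before a URL is handed to the assistant. The sketch below approximates the stated rule with Python's standard `ipaddress` module. It is not DataBeak's own validation code: the function name `is_allowed_url` is hypothetical, and it only inspects literal IP addresses, whereas a server-side check would presumably also need to handle hostnames that resolve to private addresses.

```python
import ipaddress
from urllib.parse import urlparse


def is_allowed_url(url: str) -> bool:
    """Approximate the README's rule: HTTPS only, no loopback or private addresses."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = parsed.hostname
    if not host:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address (e.g. example.com); DNS is not resolved here.
        return True
    return not (address.is_private or address.is_loopback or address.is_link_local)


if __name__ == "__main__":
    for candidate in (
        "https://example.com/data.csv",
        "http://example.com/data.csv",
        "https://192.168.1.5/data.csv",
    ):
        print(candidate, is_allowed_url(candidate))
```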
Contributing
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes with tests
- Run quality checks: uv run -m pytest
- Submit a pull request
Note: All changes must go through pull requests. Direct commits to main are blocked by pre-commit hooks.
Development
# Setup development environment
git clone https://github.com/jonpspri/databeak.git
cd databeak
uv sync
# Run the server locally
uv run databeak
# Run tests
uv run -m pytest tests/unit/ # Unit tests (primary)
uv run -m pytest # All tests
# Run quality checks
uv run ruff check
uv run mypy src/databeak/
Testing Structure
DataBeak implements comprehensive unit and integration testing:
- Unit Tests (tests/unit/) - 940+ fast, isolated module tests
- Integration Tests (tests/integration/) - 43 FastMCP Client-based protocol tests across 7 test files
- E2E Tests (tests/e2e/) - Planned: Complete workflow validation
Test Execution:
uv run pytest -n auto tests/unit/ # Run unit tests (940+ tests)
uv run pytest -n auto tests/integration/ # Run integration tests (43 tests)
uv run pytest -n auto --cov=src/databeak # Run with coverage analysis
See Testing Guide for comprehensive testing details.
License
Apache 2.0 - see LICENSE file.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: jonpspri.github.io/databeak