Dataproc MCP Server
A production-ready Model Context Protocol (MCP) server for Google Cloud Dataproc operations with intelligent parameter injection, enterprise-grade security, and comprehensive tooling. Designed for seamless integration with Roo (VS Code).
🚀 Quick Start
Recommended: Roo (VS Code) Integration
Add this to your Roo MCP settings:
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}
With Custom Config File
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info",
        "DATAPROC_CONFIG_PATH": "/path/to/your/config.json"
      }
    }
  }
}
Alternative: Global Installation
# Install globally
npm install -g @dipseth/dataproc-mcp-server
# Start the server
dataproc-mcp-server
# Or run directly
npx @dipseth/dataproc-mcp-server@latest
5-Minute Setup
1. Install the package:
   npm install -g @dipseth/dataproc-mcp-server@latest
2. Run the setup:
   dataproc-mcp --setup
3. Configure authentication:
   # Edit the generated config file
   nano config/server.json
4. Start the server:
   dataproc-mcp
🌐 Claude.ai Web App Compatibility
✅ PRODUCTION-READY: Full Claude.ai Integration with HTTPS Tunneling & OAuth
The Dataproc MCP Server now provides complete Claude.ai web app compatibility with a working solution that includes all 22 MCP tools!
🚀 Working Solution (Tested & Verified)
Terminal 1 - Start MCP Server:
DATAPROC_CONFIG_PATH=config/github-oauth-server.json npm start -- --http --oauth --port 8080
Terminal 2 - Start Cloudflare Tunnel:
cloudflared tunnel --url https://localhost:8443 --origin-server-name localhost --no-tls-verify
Result: Claude.ai can see and use all tools successfully! 🎉
Key Features:
- ✅ Complete Tool Access - All 22 MCP tools available in Claude.ai
- ✅ HTTPS Tunneling - Cloudflare tunnel for secure external access
- ✅ OAuth Authentication - GitHub OAuth for secure authentication
- ✅ Trusted Certificates - No browser warnings or connection issues
- ✅ WebSocket Support - Full WebSocket compatibility with Claude.ai
- ✅ Production Ready - Tested and verified working solution
Quick Setup:
1. Set up GitHub OAuth (5 minutes)
2. Generate SSL certificates:
   npm run ssl:generate
3. Start the services (two terminals, as shown above)
4. Connect Claude.ai to your tunnel URL
📖 Complete Guide: See docs/claude-ai-integration.md for detailed setup instructions, troubleshooting, and advanced features.
📖 Certificate Setup: See docs/trusted-certificates.md for SSL certificate configuration.
✨ Features
🎯 Core Capabilities
- 22 Production-Ready MCP Tools - Complete Dataproc management suite
- 🧠 Knowledge Base Semantic Search - Natural language queries with optional Qdrant integration
- 🚀 Response Optimization - 60-96% token reduction with Qdrant storage
- 🔄 Generic Type Conversion System - Automatic, type-safe data transformations
- 60-80% Parameter Reduction - Intelligent default injection
- Multi-Environment Support - Dev/staging/production configurations
- Service Account Impersonation - Enterprise authentication
- Real-time Job Monitoring - Comprehensive status tracking
🚀 Response Optimization
- 96.2% Token Reduction - list_clusters: 7,651 → 292 tokens
- Automatic Qdrant Storage - Full data preserved and searchable
- Resource URI Access - dataproc://responses/clusters/list/abc123
- Graceful Fallback - Works without Qdrant, falls back to full responses
- 9.95ms Processing - Lightning-fast optimization with <1MB memory usage
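The bullets above describe storing the full response while returning only a compact summary plus a resource URI. A minimal sketch of that idea follows; the function name, the in-memory `Map` (standing in for Qdrant), and the URI-building details are assumptions for illustration, not the server's actual implementation.

```typescript
import { createHash } from "node:crypto";

// Stand-in for Qdrant storage: full responses kept here, keyed by a short id.
const responseStore = new Map<string, unknown>();

// Hypothetical optimizer: preserve the full payload, return the summary
// plus a dataproc:// resource URI the client can use to fetch the rest.
function optimizeResponse<T extends object>(kind: string, full: unknown, summary: T) {
  const id = createHash("sha256")
    .update(JSON.stringify(full))
    .digest("hex")
    .slice(0, 8);
  responseStore.set(id, full); // full data preserved and retrievable later
  return { ...summary, resourceUri: `dataproc://responses/${kind}/${id}` };
}
```

The summary (a few hundred tokens) goes back to the model immediately, while the multi-thousand-token payload stays retrievable through the resource URI on demand.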
🔄 Generic Type Conversion System
- 75% Code Reduction - Eliminates manual conversion logic across services
- Type-Safe Transformations - Automatic field detection and mapping
- Intelligent Compression - Field-level compression with configurable thresholds
- 0.50ms Conversion Times - Lightning-fast processing with 100% compression ratios
- Zero-Configuration - Works automatically with existing TypeScript types
- Backward Compatible - Seamless integration with existing functionality
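To make "field-level compression with configurable thresholds" concrete, here is a hypothetical sketch: small fields pass through unchanged, while fields above a size threshold are gzip-compressed and base64-encoded. The names (`compressField`, `THRESHOLD_BYTES`) and the wire format are illustrative assumptions, not the server's actual converter.

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Fields whose JSON encoding is at or below this size pass through untouched.
const THRESHOLD_BYTES = 1024;

interface CompressedField {
  compressed: true;
  encoding: "gzip+base64";
  data: string;
}

// Compress a single field's value only when it exceeds the threshold.
function compressField(value: unknown): unknown {
  const json = JSON.stringify(value);
  if (Buffer.byteLength(json, "utf8") <= THRESHOLD_BYTES) return value;
  return {
    compressed: true,
    encoding: "gzip+base64",
    data: gzipSync(json).toString("base64"),
  };
}

// Reverse the transformation; non-compressed values are returned as-is.
function decompressField(value: unknown): unknown {
  const v = value as Partial<CompressedField>;
  if (v && v.compressed === true && typeof v.data === "string") {
    return JSON.parse(gunzipSync(Buffer.from(v.data, "base64")).toString("utf8"));
  }
  return value;
}
```

Keeping the marker object self-describing (`compressed`, `encoding`) is what makes the scheme backward compatible: readers that understand it can decompress, and small untouched values look exactly as they always did.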
🔒 Enterprise Security
- Input Validation - Zod schemas for all 22 tools
- Rate Limiting - Configurable abuse prevention
- Credential Management - Secure handling and rotation
- Audit Logging - Comprehensive security event tracking
- Threat Detection - Injection attack prevention
📊 Quality Assurance
- 90%+ Test Coverage - Comprehensive test suite
- Performance Monitoring - Configurable thresholds
- Multi-Environment Testing - Cross-platform validation
- Automated Quality Gates - CI/CD integration
- Security Scanning - Vulnerability management
🚀 Developer Experience
- 5-Minute Setup - Quick start guide
- Interactive Documentation - HTML docs with examples
- Comprehensive Examples - Multi-environment configs
- Troubleshooting Guides - Common issues and solutions
- IDE Integration - TypeScript support
🛠️ Complete MCP Tools Suite (22 Tools)
🔄 Enhanced with Generic Type Conversion: All tools now benefit from automatic, type-safe data transformations with intelligent compression and field mapping.
🚀 Cluster Management (8 Tools)
| Tool | Description | Smart Defaults | Key Features |
|---|---|---|---|
| start_dataproc_cluster | Create and start new clusters | ✅ 80% fewer params | Profile-based, auto-config |
| create_cluster_from_yaml | Create from YAML configuration | ✅ Project/region injection | Template-driven setup |
| create_cluster_from_profile | Create using predefined profiles | ✅ 85% fewer params | 8 built-in profiles |
| list_clusters | List all clusters with filtering | ✅ No params needed | Semantic queries, pagination |
| list_tracked_clusters | List MCP-created clusters | ✅ Profile filtering | Creation tracking |
| get_cluster | Get detailed cluster information | ✅ 75% fewer params | Semantic data extraction |
| delete_cluster | Delete existing clusters | ✅ Project/region defaults | Safe deletion |
| get_zeppelin_url | Get Zeppelin notebook URL | ✅ Auto-discovery | Web interface access |
💼 Job Management (7 Tools)
| Tool | Description | Smart Defaults | Key Features |
|---|---|---|---|
| submit_hive_query | Submit Hive queries to clusters | ✅ 70% fewer params | Async support, timeouts |
| submit_dataproc_job | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support, local file staging |
| cancel_dataproc_job | Cancel running or pending jobs | ✅ Job ID only needed | Emergency cancellation, cost control |
| get_job_status | Get job execution status | ✅ Job ID only needed | Real-time monitoring |
| get_job_results | Get job outputs and results | ✅ Auto-pagination | Result formatting |
| get_query_status | Get Hive query status | ✅ Minimal params | Query tracking |
| get_query_results | Get Hive query results | ✅ Smart pagination | Enhanced async support |
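For the async-capable job tools above, a client typically submits a job and then polls get_job_status until it reaches a terminal state. A hypothetical client-side polling loop is sketched below; the `callTool` wrapper and the `jobId` argument name are assumptions, while the DONE/ERROR/CANCELLED states match Dataproc's terminal job states.

```typescript
// Hypothetical MCP client wrapper: sends a tools/call request and returns the result.
type ToolCaller = (name: string, args: Record<string, unknown>) => Promise<any>;

// Poll get_job_status until the job reaches a terminal Dataproc state.
async function waitForJob(
  callTool: ToolCaller,
  jobId: string,
  intervalMs = 5000
): Promise<any> {
  for (;;) {
    const status = await callTool("get_job_status", { jobId });
    if (["DONE", "ERROR", "CANCELLED"].includes(status.state)) return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Because submission is non-blocking, the model (or script) stays responsive between polls and can cancel via cancel_dataproc_job if the job runs long.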
📋 Configuration & Profiles (3 Tools)
| Tool | Description | Smart Defaults | Key Features |
|---|---|---|---|
| list_profiles | List available cluster profiles | ✅ Category filtering | 8 production profiles |
| get_profile | Get detailed profile configuration | ✅ Profile ID only | Template access |
| query_cluster_data | Query stored cluster data | ✅ Natural language | Semantic search |
📊 Analytics & Insights (4 Tools)
| Tool | Description | Smart Defaults | Key Features |
|---|---|---|---|
| check_active_jobs | Quick status of all active jobs | ✅ No params needed | Multi-project view |
| get_cluster_insights | Comprehensive cluster analytics | ✅ Auto-discovery | Machine types, components |
| get_job_analytics | Job performance analytics | ✅ Success rates | Error patterns, metrics |
| query_knowledge | Query comprehensive knowledge base | ✅ Natural language | Clusters, jobs, errors |
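Every tool in the tables above is invoked through the standard MCP tools/call JSON-RPC request. The sketch below builds such a request; the "tools/call" method comes from the MCP specification, while the argument names (`clusterName`, `query`) are assumptions for illustration only.

```typescript
// Build the JSON-RPC 2.0 envelope for an MCP tool invocation.
function buildToolCall(id: number, name: string, args: Record<string, unknown>) {
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Example: invoking submit_hive_query (argument names are hypothetical).
const request = buildToolCall(1, "submit_hive_query", {
  clusterName: "my-cluster",
  query: "SHOW TABLES",
});
```

Smart defaults mean fields like project and region can often be omitted from `arguments` entirely; the server injects them from its configuration.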
🎯 Key Capabilities
- 🧠 Semantic Search: Natural language queries with Qdrant integration
- ⚡ Smart Defaults: 60-80% parameter reduction through intelligent injection
- 📊 Response Optimization: 96% token reduction with full data preservation
- 🔄 Async Support: Non-blocking job submission and monitoring
- 🏷️ Profile System: 8 production-ready cluster templates
- 📈 Analytics: Comprehensive insights and performance tracking
📋 Configuration
Project-Based Configuration
The server supports a project-based configuration format:
# profiles/@analytics-workloads.yaml
my-company-analytics-prod-1234:
  region: us-central1
  tags:
    - DataProc
    - analytics
    - production
  labels:
    service: analytics-service
    owner: data-team
    environment: production
  cluster_config:
    # ... cluster configuration
Authentication Methods
- Service Account Impersonation (Recommended)
- Direct Service Account Key
- Application Default Credentials
- Hybrid Authentication with fallbacks
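"Hybrid Authentication with fallbacks" means trying each credential source in order until one succeeds. A minimal sketch of that pattern follows; the function and type names are hypothetical and the server's real strategy order and credential APIs may differ.

```typescript
// A strategy resolves to a token, returns null if its credentials are
// unavailable, or throws on hard failure.
type AuthStrategy = () => Promise<string | null>;

// Try each strategy in priority order (e.g. impersonation, then a key
// file, then Application Default Credentials); first success wins.
async function authenticateWithFallback(strategies: AuthStrategy[]): Promise<string> {
  for (const tryAuth of strategies) {
    try {
      const token = await tryAuth();
      if (token) return token;
    } catch {
      // Swallow the error and fall through to the next strategy.
    }
  }
  throw new Error("All authentication methods failed");
}
```

The ordering encodes the recommendation above: impersonation first, direct keys and Application Default Credentials as fallbacks.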
📚 Documentation
- Quick Start Guide - Get started in 5 minutes
- Knowledge Base Semantic Search - Natural language queries and setup
- Generic Type Conversion System - Architectural design and implementation
- Generic Converter Migration Guide - Migration from manual conversions
- API Reference - Complete tool documentation
- Configuration Examples - Real-world configurations
- Security Guide - Best practices and compliance
- Installation Guide - Detailed setup instructions
🔧 MCP Client Integration
Claude Desktop
{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}
Roo (VS Code)
{
  "mcpServers": {
    "dataproc-server": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "disabled": false,
      "alwaysAllow": [
        "list_clusters",
        "get_cluster",
        "list_profiles"
      ]
    }
  }
}
🏗️ Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ MCP Client │────│ Dataproc MCP │────│ Google Cloud │
│ (Claude/Roo) │ │ Server │ │ Dataproc │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
┌──────┴──────┐
│ Features │
├─────────────┤
│ • Security │
│ • Profiles │
│ • Validation│
│ • Monitoring│
│ • Generic │
│ Converter │
└─────────────┘
🔄 Generic Type Conversion System Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Source Types │────│ Generic Converter │────│ Qdrant Payloads │
│ • ClusterData │ │ System │ │ • Compressed │
│ • QueryResults │ │ │ │ • Type-Safe │
│ • JobData │ │ ┌──────────────┐ │ │ • Optimized │
└─────────────────┘ │ │Field Analyzer│ │ └─────────────────┘
│ │Transformation│ │
│ │Engine │ │
│ │Compression │ │
│ │Service │ │
│ └──────────────┘ │
└──────────────────┘
🚦 Performance
Response Time Achievements
- Schema Validation: ~2ms (target: <5ms) ✅
- Parameter Injection: ~1ms (target: <2ms) ✅
- Generic Type Conversion: ~0.50ms (target: <2ms) ✅
- Credential Validation: ~25ms (target: <50ms) ✅
- MCP Tool Call: ~50ms (target: <100ms) ✅
Throughput Achievements
- Schema Validation: ~2000 ops/sec ✅
- Parameter Injection: ~5000 ops/sec ✅
- Generic Type Conversion: ~2000 ops/sec ✅
- Credential Validation: ~200 ops/sec ✅
- MCP Tool Call: ~100 ops/sec ✅
Compression Achievements
- Field-Level Compression: Up to 100% compression ratios ✅
- Memory Optimization: 30-60% reduction in memory usage ✅
- Type Safety: Zero runtime type errors with automatic validation ✅
🧪 Testing
# Run all tests
npm test
# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:performance
# Run with coverage
npm run test:coverage
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Clone the repository
git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp
# Install dependencies
npm install
# Build the project
npm run build
# Run tests
npm test
# Start development server
npm run dev
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
- GitHub Issues: Report bugs and request features
- Documentation: Complete documentation
- NPM Package: Package information
🏆 Acknowledgments
- Model Context Protocol - The protocol that makes this possible
- Google Cloud Dataproc - The service we're integrating with
- Qdrant - High-performance vector database powering our semantic search and knowledge indexing
- TypeScript - For type safety and developer experience
Made with ❤️ for the MCP and Google Cloud communities