
🤖 MCP Dataset Onboarding Server
A FastAPI-based MCP (Model Context Protocol) server for automating dataset onboarding, using Google Drive as both the input source and a mock catalog.
🔒 SECURITY FIRST - READ THIS BEFORE SETUP
⚠️ This repository contains template files only. You MUST configure your own credentials before use.
📖 Read SECURITY_SETUP.md for complete security instructions.
🚨 Never commit service account keys or real folder IDs to version control!
Features
- Automated Dataset Processing: Complete workflow from raw CSV/Excel files to cataloged datasets
- Google Drive Integration: Uses Google Drive folders as input source and catalog storage
- Metadata Extraction: Automatically extracts column information, data types, and basic statistics
- Data Quality Rules: Suggests DQ rules based on data characteristics
- Contract Generation: Creates Excel contracts with schema and DQ information
- Mock Catalog: Publishes processed artifacts to a catalog folder
- 🤖 Automated Processing: Watches folders and processes files automatically
- 🌐 Multiple Interfaces: FastAPI server, MCP server, CLI tools, and dashboards
Project Structure
├── main.py # FastAPI server and endpoints
├── mcp_server.py # True MCP protocol server for LLM integration
├── utils.py # Google Drive helpers and DQ functions
├── dataset_processor.py # Centralized dataset processing logic
├── auto_processor.py # 🤖 Automated file monitoring
├── start_auto_processor.py # 🚀 Easy startup for auto-processor
├── processor_dashboard.py # 📊 Monitoring dashboard
├── dataset_manager.py # CLI tool for managing datasets
├── local_test.py # Local processing script
├── auto_config.py # ⚙️ Configuration management
├── requirements.txt # Python dependencies
├── Dockerfile # Container configuration
├── .env.template # Environment variables template
├── .gitignore # Security: excludes sensitive files
├── SECURITY_SETUP.md # 🔒 Security configuration guide
├── processed_datasets/ # Organized output folder
│   └── [dataset_name]/ # Individual dataset folders
│       ├── [dataset].csv # Original dataset
│       ├── [dataset]_metadata.json
│       ├── [dataset]_contract.xlsx
│       ├── [dataset]_dq_report.json
│       └── README.md # Dataset summary
└── README.md # This file
🚀 Quick Start
1. Security Setup (REQUIRED)
# 1. Read the security guide
cat SECURITY_SETUP.md
# 2. Set up your Google service account (outside this repo)
# 3. Configure your environment variables
cp .env.template .env
# Edit .env with your actual values
# 4. Verify no sensitive files will be committed
git status
2. Installation
# Install dependencies
pip install -r requirements.txt
# Test the setup
python local_test.py
3. Choose Your Interface
🤖 Fully Automated (Recommended)
# Start auto-processor - upload files and walk away!
python start_auto_processor.py
🌐 API Server
# Start FastAPI server
python main.py
🧠 LLM Integration (MCP)
# Start MCP server for Claude Desktop, etc.
python mcp_server.py
🖥️ Command Line
# Manual dataset management
python dataset_manager.py list
python dataset_manager.py process YOUR_FILE_ID
🎯 Usage Scenarios
Scenario 1: Set-and-Forget Automation
python start_auto_processor.py
- Upload files to Google Drive
- Files processed automatically within 30 seconds
- Monitor with: python processor_dashboard.py --live
Scenario 2: LLM-Powered Data Analysis
- Configure MCP server in Claude Desktop
- Chat: "Analyze the dataset I just uploaded"
- Claude uses MCP tools to process and explain your data
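Claude Desktop discovers MCP servers through its claude_desktop_config.json file. A minimal entry for this server might look like the following sketch; the server name and paths are placeholders you must adapt to your machine, and the env block mirrors the variables from .env.template:

```json
{
  "mcpServers": {
    "dataset-onboarding": {
      "command": "python",
      "args": ["/absolute/path/to/mcp_server.py"],
      "env": {
        "GOOGLE_SERVICE_ACCOUNT_KEY_PATH": "/secure/path/to/key.json"
      }
    }
  }
}
```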
Scenario 3: API Integration
python main.py
- Integrate with your data pipelines via REST API
- Programmatic dataset onboarding
📊 What You Get
For each processed dataset:
- 📄 Original File: Preserved in organized folder
- 📋 Metadata JSON: Column info, types, statistics
- 📊 Excel Contract: Professional multi-sheet contract
- 🔍 Quality Report: Data quality assessment
- 📖 README: Human-readable summary
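The metadata JSON above is produced by the server's own extraction logic in utils.py; as an illustration only, here is a stdlib sketch of the kind of per-column structure such a file records. The key names (`dtype`, `null_count`, `row_count`) are assumptions for this sketch, not the server's actual schema:

```python
import csv
import io

def extract_metadata(csv_text: str) -> dict:
    """Sketch: derive per-column info from CSV text with naive type inference."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = {}
    for name in rows[0].keys():
        values = [r[name] for r in rows]
        non_null = [v for v in values if v != ""]
        # Naive inference order: int -> float -> string
        try:
            [int(v) for v in non_null]
            dtype = "int"
        except ValueError:
            try:
                [float(v) for v in non_null]
                dtype = "float"
            except ValueError:
                dtype = "string"
        columns[name] = {
            "dtype": dtype,
            "null_count": len(values) - len(non_null),
        }
    return {"row_count": len(rows), "columns": columns}

sample = "id,score,city\n1,3.5,Oslo\n2,,Bergen\n"
meta = extract_metadata(sample)
print(meta["row_count"], meta["columns"]["score"]["dtype"])  # 2 float
```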
🛠️ Available Tools
FastAPI Endpoints
- /tool/extract_metadata - Analyze dataset structure
- /tool/apply_dq_rules - Generate quality rules
- /process_dataset - Complete workflow
- /health - System health check
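As a sketch of how a client might call one of these endpoints with only the standard library (the payload field name `file_id` is an assumption here; the real request schema is defined in main.py):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default local FastAPI address; adjust as needed

def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the server's tool endpoints."""
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical payload shape; with the server running,
# urllib.request.urlopen(req) would send it.
req = build_request("/process_dataset", {"file_id": "YOUR_DRIVE_FILE_ID"})
print(req.full_url)  # http://localhost:8000/process_dataset
```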
MCP Tools (for LLMs)
- extract_dataset_metadata - Dataset analysis
- generate_data_quality_rules - Quality assessment
- process_complete_dataset - Full pipeline
- list_catalog_files - Catalog browsing
CLI Commands
- dataset_manager.py list - Show processed datasets
- auto_processor.py --once - Single check cycle
- processor_dashboard.py --live - Real-time monitoring
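The CLI tools above use a subcommand style; as a hedged illustration of how such an interface is typically built with argparse (the exact argument names in dataset_manager.py are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of a dataset_manager-style CLI: `list` and `process FILE_ID`."""
    parser = argparse.ArgumentParser(prog="dataset_manager.py")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="Show processed datasets")
    process = sub.add_parser("process", help="Process one Drive file")
    process.add_argument("file_id", help="Google Drive file ID")
    return parser

args = build_parser().parse_args(["process", "abc123"])
print(args.command, args.file_id)  # process abc123
```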
🔧 Configuration
Environment Variables (.env)
GOOGLE_SERVICE_ACCOUNT_KEY_PATH=path/to/your/key.json
MCP_SERVER_FOLDER_ID=your_input_folder_id
MCP_CLIENT_FOLDER_ID=your_output_folder_id
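The server presumably loads these via python-dotenv or os.environ; as a stdlib-only sketch, parsing a simple KEY=VALUE .env file (ignoring comments and blank lines) can be done like this:

```python
import os

def load_env(path: str) -> dict:
    """Parse simple KEY=VALUE lines from a .env-style file."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

# Example usage with a throwaway file:
with open("demo.env", "w") as fh:
    fh.write("# comment\nMCP_SERVER_FOLDER_ID=abc123\n")
env = load_env("demo.env")
os.remove("demo.env")
print(env["MCP_SERVER_FOLDER_ID"])  # abc123
```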
Auto-Processor Settings (auto_config.py)
- Check interval: 30 seconds
- Supported formats: CSV, Excel
- File age threshold: 1 minute
- Max files per cycle: 5
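The settings above live in auto_config.py; the identifiers below are illustrative, not necessarily the names that file uses, but the defaults mirror the documented values:

```python
from dataclasses import dataclass

@dataclass
class AutoProcessorConfig:
    """Illustrative config object mirroring the documented defaults."""
    check_interval_seconds: int = 30
    supported_extensions: tuple = (".csv", ".xlsx", ".xls")
    min_file_age_seconds: int = 60   # skip files that may still be uploading
    max_files_per_cycle: int = 5

config = AutoProcessorConfig()
print(config.check_interval_seconds, config.max_files_per_cycle)  # 30 5
```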
📈 Monitoring & Analytics
# Current status
python processor_dashboard.py
# Live monitoring (auto-refresh)
python processor_dashboard.py --live
# Detailed statistics
python processor_dashboard.py --stats
# Processing history
python auto_processor.py --list
🐳 Docker Deployment
# Build
docker build -t mcp-dataset-server .
# Run (mount your service account key securely)
docker run -p 8000:8000 \
-v /secure/path/to/key.json:/app/keys/key.json \
-e GOOGLE_SERVICE_ACCOUNT_KEY_PATH=/app/keys/key.json \
-e MCP_SERVER_FOLDER_ID=your_folder_id \
mcp-dataset-server
🔍 Troubleshooting
Common Issues
- No files detected: Check Google Drive permissions
- Processing errors: Verify service account access
- MCP not working: Check Claude Desktop configuration
Debug Commands
# Test Google Drive connection (actually builds the service, not just the import)
python -c "from utils import get_drive_service; get_drive_service(); print('✅ Connected')"
# Check auto-processor status
python auto_processor.py --once
# Verify MCP server
python test_mcp_server.py
🤝 Contributing
- Fork the repository
- Create a feature branch
- Never commit sensitive data
- Test your changes
- Submit a pull request
📚 Documentation
- SECURITY_SETUP.md - Security configuration
- AUTOMATION_GUIDE.md - Automation features
- MCP_INTEGRATION_GUIDE.md - LLM integration
📄 License
MIT License
🎉 What Makes This Special
- 🔒 Security First: Proper credential management
- 🤖 True Automation: Zero manual intervention
- 🧠 LLM Integration: Natural language data processing
- 📊 Professional Output: Enterprise-ready documentation
- 🔧 Multiple Interfaces: API, CLI, MCP, Dashboard
- 📈 Real-time Monitoring: Live processing status
- 🗂️ Perfect Organization: Structured output folders
Transform your messy data files into professional, documented, quality-checked datasets automatically! 🚀