Multi-Cloud Infrastructure MCP Server

Enables deployment and management of GPU workloads across multiple cloud providers (RunPod, Vast.ai) with intelligent GPU selection, resource monitoring, and telemetry tracking through Redis, ClickHouse, and SkyPilot integration.

MCP (Multi-Cloud Platform) Server

This repository provides a working, extensible reference implementation of an MCP server with multiple agent types and a SkyPilot-backed autoscaling/deployment path. It now includes integration hooks to report resource lifecycle and telemetry to an "AI Envoy" endpoint (a generic HTTP ingestion endpoint).
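
The Envoy integration amounts to an HTTP POST of JSON events. As a hedged illustration only (the ENVOY_INGEST_URL default and the event fields below are assumptions, not the server's actual schema), a report call might look like:

# envoy_report_sketch.py — illustrative only; field names are assumptions
import os
import time

import requests

# Hypothetical default; point this at your real ingestion endpoint.
ENVOY_INGEST_URL = os.getenv("ENVOY_INGEST_URL", "http://localhost:9000/ingest")

def report_event(event_type: str, resource_id: str, payload: dict) -> None:
    """POST a lifecycle/telemetry event to the ingestion endpoint."""
    event = {
        "event_type": event_type,   # e.g. "resource.provisioned"
        "resource_id": resource_id,
        "timestamp": time.time(),
        "payload": payload,
    }
    requests.post(ENVOY_INGEST_URL, json=event, timeout=5).raise_for_status()

if __name__ == "__main__":
    report_event("resource.provisioned", "pod_abc123", {"provider": "runpod"})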

Highlights

  • Evaluation Agent (prompt + rules) reads tasks from Redis and outputs resource plans.
  • SkyPilot Agent builds dynamic YAML and executes the sky CLI.
  • OnPrem Agent runs on-prem deployments (a placeholder that shells out to kubectl/helm).
  • Orchestrator wires agents together using Redis queues and ClickHouse telemetry.
  • Pluggable LLM client, configured by default to call a local LiteLLM gateway for minimax-m1.
  • Phoenix observability hooks and Envoy integration for telemetry events.

Additional files

  • scripts/resource_heartbeat.py — example script that runs inside a provisioned resource and posts periodic GPU utilization/heartbeat to the orchestrator.
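
For orientation, a stripped-down sketch of what such a heartbeat loop can look like (the /api/v1/heartbeat route and payload fields here are assumptions; scripts/resource_heartbeat.py is authoritative):

# heartbeat_sketch.py — simplified illustration of scripts/resource_heartbeat.py
import os
import subprocess
import time

import requests

ORCHESTRATOR_URL = os.getenv("ORCHESTRATOR_URL", "http://localhost:8000")
RESOURCE_ID = os.getenv("RESOURCE_ID", "pod_abc123")
INTERVAL = int(os.getenv("HEARTBEAT_INTERVAL", "60"))

def gpu_utilization() -> list[int]:
    """Read per-GPU utilization (%) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

while True:
    beat = {"resource_id": RESOURCE_ID, "gpu_util": gpu_utilization()}
    # Endpoint path is hypothetical; see the real script for the actual route.
    requests.post(f"{ORCHESTRATOR_URL}/api/v1/heartbeat", json=beat, timeout=5)
    time.sleep(INTERVAL)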

Quick start (local dry-run)

  1. Install Python packages: pip install -r requirements.txt
  2. Start Redis (e.g. docker run -p 6379:6379 -d redis) and optionally ClickHouse.
  3. Start the MCP server: python -m src.mcp.main
  4. Push a demo task into Redis (see scripts/run_demo.sh; a minimal Python equivalent is sketched below)
  5. Verify telemetry is forwarded to Phoenix and Envoy endpoints (configurable in .env).
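
A minimal Python equivalent of the demo push in step 4, assuming the orchestrator consumes from a Redis list (the queue name mcp:tasks and the task fields are illustrative; scripts/run_demo.sh is authoritative):

# push_demo_task.py — sketch; queue name and task schema are assumptions
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
task = {
    "task_type": "inference",
    "spec": {"name": "demo-task", "image": "vllm/vllm-openai:latest"},
}
r.lpush("mcp:tasks", json.dumps(task))  # Evaluation Agent consumes from here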

Notes & caveats

  • This is a reference implementation. You will need to install and configure real services (SkyPilot CLI, LiteLLM/minimax-m1, Phoenix, and the Envoy ingestion endpoint) to get a fully working pipeline.

MCP Orchestrator - Quick Reference

🚀 Installation (5 minutes)

# 1. Configure environment
cp .env.example .env
nano .env  # Add your API keys

# 2. Deploy everything
chmod +x scripts/deploy.sh
./scripts/deploy.sh

# 3. Verify
curl http://localhost:8000/health

📡 Common API Calls

Deploy with Auto GPU Selection

# Inference workload (will select a cost-effective GPU)
curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "inference",
    "spec": {
      "name": "llm-server",
      "image": "vllm/vllm-openai:latest",
      "command": "python -m vllm.entrypoints.api_server"
    }
  }'

# Training workload (will select a powerful GPU)
curl -X POST http://localhost:8000/api/v1/providers/vastai/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "training",
    "spec": {
      "name": "fine-tune-job",
      "image": "pytorch/pytorch:latest"
    }
  }'
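
The same deploy calls are easy to script. A sketch using Python and requests, mirroring the curl bodies above (no fields beyond those already shown are assumed):

# deploy_sketch.py — programmatic equivalent of the curl examples above
import requests

BASE = "http://localhost:8000/api/v1"

def deploy(provider: str, task_type: str, spec: dict) -> dict:
    resp = requests.post(
        f"{BASE}/providers/{provider}/deploy",
        json={"task_type": task_type, "spec": spec},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

print(deploy("runpod", "inference", {
    "name": "llm-server",
    "image": "vllm/vllm-openai:latest",
    "command": "python -m vllm.entrypoints.api_server",
}))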

Deploy with Specific GPU

curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "spec": {
      "name": "custom-pod",
      "gpu_name": "RTX 4090",
      "resources": {
        "accelerators": "RTX 4090:2"
      }
    }
  }'

Deploy to Provider (Default: ON_DEMAND + RTX 3060)

curl -X POST http://localhost:8000/api/v1/providers/runpod/deploy \
  -H "Content-Type: application/json" \
  -d '{"spec": {"name": "simple-pod"}}'

Register Existing Infrastructure

# Vast.ai instance
curl -X POST http://localhost:8000/api/v1/register \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "vastai",
    "resource_id": "12345",
    "credentials": {"api_key": "YOUR_VASTAI_KEY"}
  }'

# Bulk registration
curl -X POST http://localhost:8000/api/v1/register \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "vastai",
    "resource_ids": ["12345", "67890"],
    "credentials": {"api_key": "YOUR_VASTAI_KEY"}
  }'

List Resources

# All RunPod resources
curl http://localhost:8000/api/v1/providers/runpod/list

# All Vast.ai resources
curl http://localhost:8000/api/v1/providers/vastai/list

Terminate Resource

curl -X POST http://localhost:8000/api/v1/providers/runpod/delete/pod_abc123

🎯 GPU Rules Management

View Rules

curl http://localhost:8000/api/v1/gpu-rules

Add Rule

curl -X POST http://localhost:8000/api/v1/gpu-rules \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_family": "H100",
    "type": "Enterprise",
    "min_use_case": "large-scale training",
    "optimal_use_case": "foundation models",
    "power_rating": "700W",
    "typical_cloud_instance": "RunPod",
    "priority": 0
  }'

Delete Rule

curl -X DELETE http://localhost:8000/api/v1/gpu-rules/RTX%203060

🔍 Monitoring

ClickHouse Queries

-- Active resources
SELECT provider, status, count() as total
FROM resources
WHERE status IN ('running', 'active')
GROUP BY provider, status;

-- Recent deployments
SELECT *
FROM deployments
ORDER BY created_at DESC
LIMIT 10;

-- Latest heartbeats
SELECT resource_id, status, timestamp
FROM heartbeats
WHERE timestamp > now() - INTERVAL 5 MINUTE
ORDER BY timestamp DESC;

-- Cost analysis
SELECT
    provider,
    sum(price_hour) as total_hourly_cost,
    avg(price_hour) as avg_cost
FROM resources
WHERE status = 'running'
GROUP BY provider;

-- Event volume
SELECT
    event_type,
    count() as count,
    toStartOfHour(timestamp) as hour
FROM events
WHERE timestamp > now() - INTERVAL 24 HOUR
GROUP BY event_type, hour
ORDER BY hour DESC, count DESC;
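
These queries can also be run programmatically. A sketch using the clickhouse-connect client, assuming the HTTP port 8123 and the mcp database created by scripts/init_clickhouse.sql:

# ch_query_sketch.py — run the monitoring queries from Python
import os

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    password=os.getenv("CLICKHOUSE_PASSWORD", ""),
    database="mcp",  # assumed database name; see scripts/init_clickhouse.sql
)

rows = client.query(
    "SELECT provider, status, count() AS total "
    "FROM resources WHERE status IN ('running', 'active') "
    "GROUP BY provider, status"
).result_rows
for provider, status, total in rows:
    print(f"{provider:10} {status:10} {total}")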

View Logs

# All services
docker-compose logs -f

# API only
docker-compose logs -f mcp-api

# Heartbeat monitor
docker-compose logs -f heartbeat-worker

# ClickHouse
docker-compose logs -f clickhouse

🛠️ Maintenance

Restart Services

# Restart all
docker-compose restart

# Restart API only
docker-compose restart mcp-api

# Reload with new code
docker-compose up -d --build

Backup ClickHouse

# Backup database
docker-compose exec clickhouse clickhouse-client --query \
  "BACKUP DATABASE mcp TO Disk('default', 'backup_$(date +%Y%m%d).zip')"

# Export table
docker-compose exec -T clickhouse clickhouse-client --query \
  "SELECT * FROM resources FORMAT CSVWithNames" > resources_backup.csv

Clean Up

# Stop all services
docker-compose down

# Stop and remove volumes (WARNING: deletes data)
docker-compose down -v

# Force a merge so TTL-expired rows are dropped (events older than 90 days auto-expire)
docker-compose exec clickhouse clickhouse-client --query \
  "OPTIMIZE TABLE events FINAL"

🐛 Troubleshooting

Service won't start

# Check status
docker-compose ps

# Check logs
docker-compose logs mcp-api

# Verify config
cat .env | grep -v '^#' | grep -v '^$'

ClickHouse connection issues

# Test connection
docker-compose exec clickhouse clickhouse-client --query "SELECT 1"

# Reinitialize
docker-compose exec -T clickhouse clickhouse-client --multiquery < scripts/init_clickhouse.sql

# Check tables
docker-compose exec clickhouse clickhouse-client --query "SHOW TABLES FROM mcp"

API returns 404 for provider

# Check if agent initialized
docker-compose logs mcp-api | grep -i "AgentRegistry initialized"

# Restart with fresh logs
docker-compose restart mcp-api && docker-compose logs -f mcp-api

Heartbeat not working

# Check heartbeat worker
docker-compose logs heartbeat-worker

# Manual health check
curl http://localhost:8000/api/v1/providers/runpod/list

📝 Environment Variables

Key variables in .env:

# Required
RUNPOD_API_KEY=xxx          # Your RunPod API key
VASTAI_API_KEY=xxx          # Your Vast.ai API key (used per-request only)

# ClickHouse
CLICKHOUSE_PASSWORD=xxx     # Set strong password

# Optional
LOG_LEVEL=INFO              # DEBUG for verbose logs
WORKERS=4                   # API worker processes
HEARTBEAT_INTERVAL=60       # Seconds between health checks
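
Inside the services these typically reduce to plain environment lookups. A sketch of how the values above might be read in Python (variable names match .env; defaults mirror the listing):

# config_sketch.py — how these variables might be consumed
import os

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]    # required: fail fast if unset
VASTAI_API_KEY = os.environ["VASTAI_API_KEY"]    # required
CLICKHOUSE_PASSWORD = os.getenv("CLICKHOUSE_PASSWORD", "")

LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
WORKERS = int(os.getenv("WORKERS", "4"))
HEARTBEAT_INTERVAL = int(os.getenv("HEARTBEAT_INTERVAL", "60"))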

🔐 Security Checklist

  • [ ] Change default ClickHouse password
  • [ ] Store .env securely (add to .gitignore)
  • [ ] Use separate API keys for prod/staging
  • [ ] Enable ClickHouse authentication
  • [ ] Configure AI Envoy Gateway policies
  • [ ] Rotate API keys regularly
  • [ ] Review ClickHouse access logs
  • [ ] Set up alerting for unhealthy resources

📚 Resources

  • API Docs: http://localhost:8000/docs
  • ClickHouse UI: http://localhost:8124 (with --profile debug)
  • Health Check: http://localhost:8000/health
  • Full README: See README.md
