AI-Driven Remediation Testing
Orchestrates end-to-end testing of AI-powered incident remediation workflows through declarative YAML scenarios, fault injection, AI response evaluation, and automated command execution with comprehensive reporting.
MCP Server - AI-Driven Remediation Testing
A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.
Overview
MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.
Features
- Declarative Scenarios: Define test scenarios in YAML with variable substitution
- FSM-Based Orchestration: 13-state finite state machine for reliable execution
- Fault Injection: Pluggable fault service (stub included) designed to integrate with chaos engineering tools such as Chaos Mesh and Litmus
- AI Evaluation: Score AI responses using regex, JSON Schema, and semantic similarity
- Secure Execution: Sandboxed command execution with deny patterns
- Remediation API Integration: Full HTTP/WebSocket client for workflow APIs
- Comprehensive Logging: DEBUG+ file logs, INFO+ console, artifact management
- Production-Ready: Type-safe Python 3.11+ with pydantic validation
Architecture
┌─────────────────────────────────────────────────────────────┐
│ MCP Server (gRPC) │
├─────────────────────────────────────────────────────────────┤
│ ScenarioService │ FaultService │ ExecutorService │ EvalService│
└─────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────┐ ┌──────────────────┐
│ Orchestration │ │ Fault │ │ Command │
│ Engine (FSM) │ │ Injection │ │ Executor │
└──────────────────┘ └──────────────┘ └──────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Remediation Workflow API Client │
│ (HTTP + WebSocket, InitiateEnsemble, Resume) │
└─────────────────────────────────────────────────────────────┘
Installation
# Install dependencies
pip install -r requirements.txt
# Optional: generate gRPC code (the MVP uses a simplified implementation instead)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto
Configuration
Configuration can be provided via config.yaml or environment variables:
# config.yaml
log_dir: "./log"
session_timeout_sec: 300
ws:
  ping_interval: 300
  ping_timeout: 300
grpc:
  host: "localhost"
  port: 50051
  timeout: 300
http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"
Environment variables (override config.yaml):
export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901
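The variable names above suggest a convention: an `MCP_` prefix, with a double underscore separating nested keys. The following is a minimal sketch of that layering using only the standard library. The function name `apply_env_overrides` is hypothetical, and the real server may resolve settings differently (for example through pydantic), including type coercion that this sketch skips.

```python
import copy
import os
from typing import Any, Mapping


def apply_env_overrides(
    config: dict[str, Any],
    environ: Mapping[str, str] = os.environ,
    prefix: str = "MCP_",
    delimiter: str = "__",
) -> dict[str, Any]:
    """Return a copy of config with prefixed environment variables layered on top.

    Assumed convention: MCP_GRPC__PORT=50051 sets config["grpc"]["port"].
    Values stay strings here; no type coercion is attempted.
    """
    merged = copy.deepcopy(config)
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        # "MCP_GRPC__HOST" -> ["grpc", "host"]
        path = name[len(prefix):].lower().split(delimiter)
        node = merged
        for key in path[:-1]:
            child = node.get(key)
            if not isinstance(child, dict):
                child = {}
                node[key] = child
            node = child
        node[path[-1]] = value
    return merged


file_config = {"log_dir": "./log", "grpc": {"host": "localhost", "port": 50051}}
env = {"MCP_GRPC__HOST": "0.0.0.0", "MCP_LOG_DIR": "/var/log/mcp"}
print(apply_env_overrides(file_config, env))
```

Passing the environment as an argument keeps the function easy to test without mutating `os.environ`.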
Scenario Definition
Scenarios are defined in YAML with the following structure:
meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"
defaults:
  model: "gpt-4"
  timeout: 300
bindings:
  namespace: "production"
  service: "api-gateway"
fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"
stabilize:
  wait_for:
    timeout: 120
assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
  guards:
    - type: "regex"
      pattern: "(?i)root cause"
assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]
execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
  policies:
    deny_patterns:
      - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]
verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]
cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]
report:
  formats: ["json"]
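The `${namespace}` and `${service}` placeholders in the scenario are resolved from `bindings`. A minimal sketch of how such substitution could work on a parsed scenario follows. The function name `substitute` and the choice to raise on unbound variables are assumptions for illustration, not the behavior of `mcp_server/utils/variables.py`.

```python
import re
from typing import Any

# Matches ${name} where name is made of letters, digits, and underscores.
_VAR = re.compile(r"\$\{(\w+)\}")


def substitute(value: Any, bindings: dict[str, str]) -> Any:
    """Recursively replace ${name} placeholders in strings, lists, and dicts."""
    if isinstance(value, str):
        def lookup(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in bindings:
                # Assumption: failing loudly is safer than leaving a placeholder.
                raise KeyError(f"unbound scenario variable: {name}")
            return str(bindings[name])

        return _VAR.sub(lookup, value)
    if isinstance(value, list):
        return [substitute(item, bindings) for item in value]
    if isinstance(value, dict):
        return {key: substitute(item, bindings) for key, item in value.items()}
    # Numbers, booleans, and None pass through unchanged.
    return value


command = {"cmd": "kubectl", "args": ["rollout", "restart", "deployment/${service}"]}
print(substitute(command, {"namespace": "staging", "service": "api-gateway"}))
```

Bindings passed at run time (such as `{"namespace": "staging"}`) would presumably be merged over the scenario's own `bindings` before this step.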
FSM States
The orchestration engine follows this state machine:
- INIT: Initialize scenario, resolve bindings
- PRECHECK: Run pre-execution checks (SignalFlow)
- FAULT_INJECT: Inject fault using FaultService
- STABILIZE: Wait for system stabilization
- ASSISTANT_RCA: Get RCA from remediation API
- EVAL_RCA: Evaluate RCA response
- ASSISTANT_REMEDY: Get remediation commands
- EVAL_REMEDY: Evaluate remedy response
- EXECUTE_REMEDY: Execute commands
- VERIFY: Verify system state
- PASS: Scenario passed
- FAIL: Scenario failed
- CLEANUP: Clean up resources
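The thirteen states above can be sketched as an enum with a small transition function. The transition rules here are assumptions: the README lists the states but not the edges, so this sketch assumes a linear happy path, a jump to FAIL on any failed step, and CLEANUP after both PASS and FAIL.

```python
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    INIT = auto()
    PRECHECK = auto()
    FAULT_INJECT = auto()
    STABILIZE = auto()
    ASSISTANT_RCA = auto()
    EVAL_RCA = auto()
    ASSISTANT_REMEDY = auto()
    EVAL_REMEDY = auto()
    EXECUTE_REMEDY = auto()
    VERIFY = auto()
    PASS = auto()
    FAIL = auto()
    CLEANUP = auto()


# Assumed happy path: every step succeeds and the run ends in PASS.
HAPPY_PATH = [
    State.INIT, State.PRECHECK, State.FAULT_INJECT, State.STABILIZE,
    State.ASSISTANT_RCA, State.EVAL_RCA, State.ASSISTANT_REMEDY,
    State.EVAL_REMEDY, State.EXECUTE_REMEDY, State.VERIFY, State.PASS,
]


def next_state(current: State, ok: bool = True) -> Optional[State]:
    """Return the state that follows current, or None once CLEANUP is done."""
    if current is State.CLEANUP:
        return None
    if current in (State.PASS, State.FAIL):
        return State.CLEANUP  # assumed: cleanup always runs
    if not ok:
        return State.FAIL
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]


def walk(fail_at: Optional[State] = None) -> List[str]:
    """Trace a run, optionally simulating a failure at one state."""
    trace, state = [], State.INIT
    while state is not None:
        trace.append(state.name)
        state = next_state(state, ok=state is not fail_at)
    return trace


print(walk())
print(walk(fail_at=State.EVAL_RCA))
```

Routing both terminal outcomes through CLEANUP matches the `cleanup.always` section of the scenario format, which implies cleanup commands run regardless of the result.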
Usage
Start Server
python -m mcp_server.server
Run Scenario (Programmatic)
import asyncio
from pathlib import Path

from mcp_server.server import MCPServer
from mcp_server.config import get_settings


async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Read the scenario file, then run it with bindings that override its defaults
    scenario_yaml = Path("scenarios/example_scenario.yaml").read_text()
    result = await server.scenario_service.run_scenario(
        scenario_yaml=scenario_yaml,
        bindings={"namespace": "staging"},
    )
    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")


asyncio.run(main())
Check Results
Results are stored in log/runs/{run_id}/:
- scenario.yaml: Original scenario
- transcript.json: RCA/remedy responses
- report.json: Final test report
- cmd_*.txt: Command outputs
Services
FaultService
Injects and cleans up faults. Stub implementation provided; integrate with:
- Chaos Mesh (Kubernetes)
- Litmus (Kubernetes)
- Gremlin (Cloud)
ExecutorService
Executes commands with sandboxing:
- Local execution via asyncio.subprocess
- Deny pattern enforcement
- Output capture and artifact storage
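The combination of deny patterns and `asyncio.subprocess` can be sketched as follows. This is an illustration under stated assumptions, not the ExecutorService itself: `run_guarded` and `CommandDenied` are hypothetical names, and the sketch matches each pattern against the joined command line with `re.search`.

```python
import asyncio
import re
import shlex
import sys
from typing import List, Sequence, Tuple


class CommandDenied(Exception):
    """Raised when a command line matches a deny pattern."""


async def run_guarded(
    cmd: str,
    args: Sequence[str],
    deny_patterns: List[str],
    timeout: float = 60.0,
) -> Tuple[int, str]:
    """Check deny patterns, then run the command and capture its output."""
    command_line = shlex.join([cmd, *args])
    for pattern in deny_patterns:
        if re.search(pattern, command_line):
            # Refuse before any process is spawned.
            raise CommandDenied(f"{command_line!r} matches {pattern!r}")

    # Exec form (no shell), so arguments are not re-interpreted by a shell.
    proc = await asyncio.create_subprocess_exec(
        cmd,
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        output, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, output.decode(errors="replace")


# Demo with the current Python interpreter so it runs on any platform.
code, text = asyncio.run(
    run_guarded(sys.executable, ["-c", "print('hello')"], [".*rm -rf.*"])
)
print(code, text.strip())
```

Pattern matching on command text is a coarse filter. The `sandbox` settings in the scenario (service account, namespace) suggest it is meant to be one layer among several rather than the only safeguard.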
EvalService
Evaluates AI responses:
- Regex guards: Pattern matching
- JSON Schema: Structure validation
- Semantic similarity: Token-based Jaccard
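Two of the checks above, regex guards and token-based Jaccard similarity, fit in a few lines of standard-library Python. The sketch below is illustrative: the function names and the tokenization rule (lowercased alphanumeric runs) are assumptions, and the EvalService may tokenize or score differently.

```python
import re
from typing import Set


def regex_guard(pattern: str, response: str) -> bool:
    """Return True when the pattern occurs anywhere in the response."""
    return re.search(pattern, response) is not None


def tokens(text: str) -> Set[str]:
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: |A intersect B| / |A union B| over token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # assumption: two empty texts count as identical
    return len(ta & tb) / len(ta | tb)


response = "Root cause: the api-gateway pod was killed and entered a crash loop."
print(regex_guard("(?i)root cause", response))
print(round(jaccard(response, "pod crash in api gateway"), 3))
```

Jaccard over tokens measures word overlap only, so paraphrases with different vocabulary score low even when the meaning matches.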
RemediationClient
HTTP client for remediation workflow API:
- initiate_remediation(): Start new workflow
- resume_remediation(): Resume with input
- JSON pointer resolution for graph navigation
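JSON pointer resolution can be sketched as a short function following RFC 6901. The response shape in the example is hypothetical; the actual workflow graph returned by the remediation API is not documented here.

```python
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer such as '/graph/nodes/0/id'."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON pointer: {pointer!r}")
    node = document
    for raw in pointer[1:].split("/"):
        # Unescape in the order required by the RFC: "~1" first, then "~0".
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]
        else:
            node = node[token]
    return node


# Hypothetical response shape, for illustration only.
response = {
    "stateIdentifier": {"threadId": "thread-123", "interruptId": "int-456"},
    "graph": {"nodes": [{"id": "node-789", "type": "input"}]},
}
print(resolve_pointer(response, "/graph/nodes/0/id"))
print(resolve_pointer(response, "/stateIdentifier/threadId"))
```

A missing key or index raises `KeyError` or `IndexError`, which a client would likely want to wrap in a domain-specific error.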
API Reference
ScenarioService
service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}
Remediation API
InitiateEnsemble:
{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}
ResumeEnsemble:
{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}
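The two request bodies above can be built with small helper functions. This sketch only mirrors the JSON shown in this section; the helper names are hypothetical and are not the signatures of `initiate_remediation()` or `resume_remediation()`.

```python
import json
from typing import Any, Dict


def initiate_payload(
    incident_id: str, title: str, summary: str, next_steps: str
) -> Dict[str, Any]:
    """Build an InitiateEnsemble request body for the REMEDIATION ensemble."""
    return {
        "apiMethod": "InitiateEnsemble",
        "apiVersion": "1",
        "ensembleName": "REMEDIATION",
        "payload": {
            "incidentId": incident_id,
            "rcaAnalysis": {
                "title": title,
                "summary": summary,
                "nextSteps": next_steps,
            },
        },
    }


def resume_payload(
    thread_id: str, interrupt_id: str, node_id: str, user_input: str
) -> Dict[str, Any]:
    """Build a ResumeEnsemble request body carrying input for one node."""
    return {
        "apiMethod": "ResumeEnsemble",
        "apiVersion": "1",
        "payload": {
            "messageType": "node_input",
            "stateIdentifier": {
                "threadId": thread_id,
                "interruptId": interrupt_id,
            },
            "nodeId": node_id,
            "inputProperties": {"input": user_input},
        },
    }


body = resume_payload("thread-123", "int-456", "node-789", "User input text")
print(json.dumps(body, indent=2))
```

Keeping payload construction separate from transport makes the shapes easy to unit test without a running remediation API.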
Logging
- Console: INFO+ (concise)
- File: DEBUG+ at log/mcp_server.log (rotating, 10MB, 5 backups)
- Artifacts: Per-run in log/runs/{run_id}/
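The console and file levels described above map directly onto the standard `logging` module. The following is a sketch of one way to set that up, not the contents of `logging_config.py`: the logger name `mcp_server`, the function name, and the format strings are assumptions.

```python
import logging
import tempfile
from logging.handlers import RotatingFileHandler
from pathlib import Path


def configure_logging(log_dir: str = "./log") -> logging.Logger:
    """Console at INFO+, rotating file at DEBUG+ (10 MB per file, 5 backups)."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("mcp_server")
    logger.setLevel(logging.DEBUG)  # handlers apply their own thresholds

    # Drop handlers from any earlier call so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    file_handler = RotatingFileHandler(
        Path(log_dir) / "mcp_server.log",
        maxBytes=10 * 1024 * 1024,
        backupCount=5,
    )
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger


# Demo writes to a temporary directory instead of ./log.
log = configure_logging(tempfile.mkdtemp())
log.debug("written to the file only")
log.info("written to both console and file")
```

Setting the logger itself to DEBUG and filtering per handler is what lets the file capture more detail than the console.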
Development
Project Structure
Remidiation-MCP/
├── config.yaml # Configuration
├── requirements.txt # Dependencies
├── proto/ # gRPC definitions
│ ├── common.proto
│ ├── scenario_service.proto
│ ├── fault_service.proto
│ ├── executor_service.proto
│ └── eval_service.proto
├── mcp_server/
│ ├── __init__.py
│ ├── config.py # Settings
│ ├── logging_config.py # Logging
│ ├── server.py # gRPC server
│ ├── models/ # Pydantic models
│ │ └── scenario.py
│ ├── services/ # Service implementations
│ │ ├── fault_service.py
│ │ ├── executor_service.py
│ │ └── eval_service.py
│ ├── clients/ # API clients
│ │ └── remediation_client.py
│ ├── orchestration/ # Orchestration engine
│ │ ├── fsm.py
│ │ └── engine.py
│ └── utils/ # Utilities
│ ├── variables.py
│ └── artifacts.py
├── scenarios/ # Test scenarios
│ └── example_scenario.yaml
└── log/ # Logs and artifacts
Testing
# Run example scenario
python -m mcp_server.server
# In another terminal, verify logs
tail -f log/mcp_server.log
# Check results
ls -la log/runs/
cat log/runs/run-*/report.json
Production Deployment
Docker
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "-m", "mcp_server.server"]
Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: mcp-server:latest
          ports:
            - containerPort: 50051
          env:
            - name: MCP_GRPC__HOST
              value: "0.0.0.0"
            - name: MCP_HTTP__BASE_URL
              value: "http://remediation-api:8901"
Contributing
- Follow PEP 8 style guidelines
- Add type hints to all functions
- Write docstrings for public APIs
- Update tests for new features
License
MIT License - See LICENSE file for details
Support
For issues and questions:
- GitHub Issues: https://github.com/your-org/mcp-server/issues
- Documentation: https://docs.your-org.com/mcp-server
Roadmap
- [ ] Full gRPC code generation from .proto files
- [ ] WebSocket streaming for real-time events
- [ ] Chaos Mesh integration
- [ ] Prometheus metrics export
- [ ] OpenTelemetry tracing
- [ ] Multi-scenario parallel execution
- [ ] Scenario templates and library