AI-Driven Remediation Testing

Orchestrates end-to-end testing of AI-powered incident remediation workflows through declarative YAML scenarios, fault injection, AI response evaluation, and automated command execution with comprehensive reporting.

MCP Server - AI-Driven Remediation Testing

A production-ready Model Context Protocol (MCP) server for orchestrating AI-driven remediation test scenarios with gRPC, WebSocket, and HTTP integrations.

Overview

MCP Server provides end-to-end orchestration for testing AI-powered incident remediation workflows. It reads declarative YAML scenarios, injects faults, interacts with remediation APIs, evaluates AI responses, executes remediation commands, and produces comprehensive test reports.

Features

Declarative Scenarios: Define test scenarios in YAML with variable substitution
FSM-Based Orchestration: 13-state finite state machine for reliable execution
Fault Injection: Integrate with chaos engineering tools (Chaos Mesh, Litmus, etc.)
AI Evaluation: Score AI responses using regex guards, JSON Schema validation, and token-based (Jaccard) similarity
Secure Execution: Sandboxed command execution with deny patterns
Remediation API Integration: Full HTTP/WebSocket client for workflow APIs
Comprehensive Logging: DEBUG+ file logs, INFO+ console, artifact management
Production-Ready: Type-safe Python 3.11+ with pydantic validation

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     MCP Server (gRPC)                        │
├─────────────────────────────────────────────────────────────┤
│  ScenarioService │ FaultService │ ExecutorService │ EvalService│
└─────────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐
│ Orchestration    │  │ Fault        │  │ Command          │
│ Engine (FSM)     │  │ Injection    │  │ Executor         │
└──────────────────┘  └──────────────┘  └──────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────────────────┐
│              Remediation Workflow API Client                 │
│         (HTTP + WebSocket, InitiateEnsemble, Resume)        │
└─────────────────────────────────────────────────────────────┘

Installation

# Install dependencies
pip install -r requirements.txt

# Generate gRPC code (optional; the MVP uses a simplified implementation)
# python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/*.proto

Configuration

Configuration can be provided via config.yaml or environment variables:

# config.yaml
log_dir: "./log"
session_timeout_sec: 300

ws:
  ping_interval: 300
  ping_timeout: 300

grpc:
  host: "localhost"
  port: 50051
  timeout: 300

http:
  base_url: "http://localhost:8901"
  ws_url: "ws://localhost:8765/chatsocket"
  token_url: "https://app.lab0.signalfx.com/v2/jwt/token"

Environment variables (override config.yaml):

export CONFIG_PATH=./config.yaml
export MCP_LOG_DIR=./log
export MCP_GRPC__HOST=localhost
export MCP_GRPC__PORT=50051
export MCP_HTTP__BASE_URL=http://localhost:8901
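
The variable names above suggest a convention: an MCP_ prefix, with a double underscore separating nested keys (MCP_GRPC__HOST maps to grpc.host). A minimal sketch of that override logic, assuming this convention holds; the function name apply_env_overrides is hypothetical and is not taken from mcp_server/config.py:

```python
import copy
import os
from typing import Any, Mapping


def apply_env_overrides(
    config: dict[str, Any],
    environ: Mapping[str, str] = os.environ,
    prefix: str = "MCP_",
    delimiter: str = "__",
) -> dict[str, Any]:
    """Return a copy of `config` with prefixed environment variables applied.

    Assumed convention: MCP_GRPC__HOST overrides config["grpc"]["host"].
    Values stay strings here; a pydantic model would normally coerce types.
    """
    merged = copy.deepcopy(config)
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # e.g. CONFIG_PATH carries no prefix and is not an override
        # "MCP_GRPC__HOST" -> ["grpc", "host"]
        path = name[len(prefix):].lower().split(delimiter)
        node = merged
        for key in path[:-1]:
            if not isinstance(node.get(key), dict):
                node[key] = {}
            node = node[key]
        node[path[-1]] = value
    return merged
```

Passing a plain dict as `environ` keeps the function easy to exercise without touching the real process environment.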

Scenario Definition

Scenarios are defined in YAML with the following structure:

meta:
  id: scenario-001
  title: "Test Scenario"
  owner: "team-name"

defaults:
  model: "gpt-4"
  timeout: 300

bindings:
  namespace: "production"
  service: "api-gateway"

fault:
  type: "pod_kill"
  params:
    namespace: "${namespace}"

stabilize:
  wait_for:
    timeout: 120

assistant_rca:
  system: "You are an SRE expert."
  user: "Analyze the incident."
  expect:
    references: ["pod", "crash"]
    metrics: ["cpu", "memory"]
    guards:
      - type: "regex"
        pattern: "(?i)root cause"

assistant_remedy:
  system: "Provide remediation."
  user: "What commands should we run?"
  expect:
    references: ["kubectl"]

execute_remedy:
  sandbox:
    service_account: "sre-bot"
    namespace: "${namespace}"
    policies:
      deny_patterns:
        - ".*rm -rf.*"
  commands:
    - name: "Restart pods"
      cmd: "kubectl"
      args: ["rollout", "restart", "deployment/${service}"]

verify:
  signalflow:
    - program: "data('cpu.utilization').mean().publish()"
      assert_rules: ["value < 70"]

cleanup:
  always:
    - name: "Reset state"
      cmd: "kubectl"
      args: ["delete", "pod", "-l", "app=${service}"]

report:
  formats: ["json"]
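
The ${namespace} and ${service} placeholders above are filled from the bindings section. A minimal sketch of how such substitution could work on the parsed YAML, assuming placeholders always take the ${name} form; resolve_bindings is a hypothetical helper, not the code in mcp_server/utils/variables.py:

```python
import re
from typing import Any

# Matches ${name}; a lone "$" (for example a regex anchor) is left untouched.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def resolve_bindings(node: Any, bindings: dict[str, Any]) -> Any:
    """Recursively replace ${name} placeholders in every string of `node`.

    Raises KeyError if a placeholder has no matching binding.
    """
    if isinstance(node, str):
        return _PLACEHOLDER.sub(lambda m: str(bindings[m.group(1)]), node)
    if isinstance(node, dict):
        return {key: resolve_bindings(value, bindings) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_bindings(item, bindings) for item in node]
    return node  # numbers, booleans, None pass through unchanged
```

Raising on an unbound placeholder, rather than leaving it in place, surfaces a misconfigured scenario before any fault is injected.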

FSM States

The orchestration engine follows this state machine:

INIT: Initialize scenario, resolve bindings
PRECHECK: Run pre-execution checks (SignalFlow)
FAULT_INJECT: Inject fault using FaultService
STABILIZE: Wait for system stabilization
ASSISTANT_RCA: Get RCA from remediation API
EVAL_RCA: Evaluate RCA response
ASSISTANT_REMEDY: Get remediation commands
EVAL_REMEDY: Evaluate remedy response
EXECUTE_REMEDY: Execute commands
VERIFY: Verify system state
PASS: Scenario passed
FAIL: Scenario failed
CLEANUP: Clean up resources
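
The 13 states above can be sketched as an enum with a transition function. The transition rules here are assumptions, since this README does not spell them out: the happy path follows the listed order up to PASS, any failed step jumps to FAIL, and both PASS and FAIL lead to CLEANUP.

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    INIT = auto()
    PRECHECK = auto()
    FAULT_INJECT = auto()
    STABILIZE = auto()
    ASSISTANT_RCA = auto()
    EVAL_RCA = auto()
    ASSISTANT_REMEDY = auto()
    EVAL_REMEDY = auto()
    EXECUTE_REMEDY = auto()
    VERIFY = auto()
    PASS = auto()
    FAIL = auto()
    CLEANUP = auto()


# Assumed happy path: the documented order, ending in PASS.
HAPPY_PATH = [
    State.INIT, State.PRECHECK, State.FAULT_INJECT, State.STABILIZE,
    State.ASSISTANT_RCA, State.EVAL_RCA, State.ASSISTANT_REMEDY,
    State.EVAL_REMEDY, State.EXECUTE_REMEDY, State.VERIFY, State.PASS,
]


def next_state(current: State, ok: bool) -> Optional[State]:
    """Return the state that follows `current`, or None once CLEANUP is done."""
    if current is State.CLEANUP:
        return None
    if current in (State.PASS, State.FAIL):
        return State.CLEANUP  # cleanup runs regardless of the outcome
    if not ok:
        return State.FAIL
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]
```

Routing both terminal outcomes through CLEANUP matches the cleanup.always block in the scenario format, which implies cleanup commands run whether or not the scenario passed.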

Usage

Start Server

python -m mcp_server.server

Run Scenario (Programmatic)

import asyncio
from pathlib import Path

from mcp_server.server import MCPServer
from mcp_server.config import get_settings

async def main():
    settings = get_settings()
    server = MCPServer(settings)

    # Load the scenario YAML, then run it with a `namespace` binding
    scenario_yaml = Path("scenarios/example_scenario.yaml").read_text()
    result = await server.scenario_service.run_scenario(
        scenario_yaml=scenario_yaml,
        bindings={"namespace": "staging"},
    )

    print(f"Run ID: {result['run_id']}")
    print(f"Status: {result['status']}")

asyncio.run(main())

Check Results

Results are stored in log/runs/{run_id}/:

scenario.yaml: Original scenario
transcript.json: RCA/remedy responses
report.json: Final test report
cmd_*.txt: Command outputs
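
To inspect results from a script, something like the following could locate the most recent run and parse its report. This is a sketch that assumes only the directory layout described above; the helper name load_latest_report is hypothetical, and no particular keys inside report.json are assumed.

```python
import json
from pathlib import Path
from typing import Any


def load_latest_report(log_dir: str = "./log") -> tuple[str, Any]:
    """Return (run_id, parsed report.json) for the most recently written report."""
    runs_dir = Path(log_dir) / "runs"
    # Only consider run directories that actually contain a report
    run_dirs = [p for p in runs_dir.iterdir() if (p / "report.json").is_file()]
    if not run_dirs:
        raise FileNotFoundError(f"no report.json found under {runs_dir}")
    latest = max(run_dirs, key=lambda p: (p / "report.json").stat().st_mtime)
    # The directory name is the run ID, per the log/runs/{run_id}/ layout
    return latest.name, json.loads((latest / "report.json").read_text())
```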

Services

FaultService

Injects and cleans up faults. Stub implementation provided; integrate with:

Chaos Mesh (Kubernetes)
Litmus (Kubernetes)
Gremlin (Cloud)

ExecutorService

Executes commands with sandboxing:

Local execution via asyncio.subprocess
Deny pattern enforcement
Output capture and artifact storage
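
A minimal sketch of the deny-pattern check and output capture listed above, assuming patterns are matched against the space-joined command line. The names is_denied and run_command are hypothetical. The sketch does not cover service-account or namespace isolation, so on its own it is a filter rather than a full sandbox.

```python
import asyncio
import re
from typing import Sequence


def is_denied(cmd: str, args: Sequence[str], deny_patterns: Sequence[str]) -> bool:
    """True if the full command line matches any deny pattern."""
    command_line = " ".join([cmd, *args])
    return any(re.search(pattern, command_line) for pattern in deny_patterns)


async def run_command(
    cmd: str,
    args: Sequence[str],
    deny_patterns: Sequence[str],
    timeout: float = 300.0,
) -> tuple[int, str]:
    """Run a command without a shell and return (exit code, combined output)."""
    if is_denied(cmd, args, deny_patterns):
        # Refuse before any process is spawned
        raise PermissionError(f"command blocked by deny pattern: {cmd}")
    proc = await asyncio.create_subprocess_exec(
        cmd,
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,  # merge stderr into stdout
    )
    try:
        output, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, output.decode(errors="replace")
```

Using create_subprocess_exec instead of a shell means arguments are passed as a list and are never re-parsed, which narrows what a deny pattern has to catch.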

EvalService

Evaluates AI responses:

Regex guards: Pattern matching
JSON Schema: Structure validation
Semantic similarity: Token-based Jaccard
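
Two of the checks above are simple enough to sketch with the standard library. The function names are hypothetical, and the tokenization (lowercased word characters) is an assumption; the actual EvalService may tokenize differently. JSON Schema validation is omitted because it needs a third-party validator.

```python
import re


def passes_regex_guard(response: str, pattern: str) -> bool:
    """True if `pattern` matches anywhere in the AI response."""
    return re.search(pattern, response) is not None


def jaccard_similarity(a: str, b: str) -> float:
    """Token-based Jaccard similarity: |A ∩ B| / |A ∪ B| over lowercased words."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

For example, "pod crash loop" and "Pod crash" share two of three distinct tokens, giving a score of 2/3. Because the measure compares token sets only, it ignores word order and synonyms.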

RemediationClient

HTTP client for remediation workflow API:

initiate_remediation(): Start new workflow
resume_remediation(): Resume with input
JSON pointer resolution for graph navigation
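
JSON pointer resolution can be sketched in a few lines. This version assumes the client follows RFC 6901 syntax (for example "/graph/nodes/0/id"), which this README does not confirm; resolve_pointer is a hypothetical name.

```python
from typing import Any


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer against parsed JSON data.

    Raises KeyError, IndexError, or ValueError if the path does not exist.
    """
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON pointer: {pointer!r}")
    node = document
    for raw in pointer[1:].split("/"):
        # Unescape in this order: "~1" -> "/", then "~0" -> "~"
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]
        else:
            node = node[token]
    return node
```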

API Reference

ScenarioService

service ScenarioService {
  rpc RunScenario(RunScenarioRequest) returns (RunScenarioResponse);
  rpc ListScenarios(Empty) returns (ListScenariosResponse);
  rpc GetScenario(GetScenarioRequest) returns (GetScenarioResponse);
  rpc StreamEvents(StreamEventsRequest) returns (stream ScenarioEvent);
}

Remediation API

InitiateEnsemble:

{
  "apiMethod": "InitiateEnsemble",
  "apiVersion": "1",
  "ensembleName": "REMEDIATION",
  "payload": {
    "incidentId": "inc-123",
    "rcaAnalysis": {
      "title": "Pod Crash",
      "summary": "API gateway pod crashed",
      "nextSteps": "Awaiting analysis"
    }
  }
}

ResumeEnsemble:

{
  "apiMethod": "ResumeEnsemble",
  "apiVersion": "1",
  "payload": {
    "messageType": "node_input",
    "stateIdentifier": {
      "threadId": "thread-123",
      "interruptId": "int-456"
    },
    "nodeId": "node-789",
    "inputProperties": {
      "input": "User input text"
    }
  }
}
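
The two request bodies above can be assembled with small builder functions. This is a sketch: the field names come from the examples, while the function names and parameters are hypothetical and do not describe the RemediationClient interface.

```python
from typing import Any


def build_initiate_payload(
    incident_id: str, title: str, summary: str, next_steps: str
) -> dict[str, Any]:
    """Build an InitiateEnsemble request body for the REMEDIATION ensemble."""
    return {
        "apiMethod": "InitiateEnsemble",
        "apiVersion": "1",
        "ensembleName": "REMEDIATION",
        "payload": {
            "incidentId": incident_id,
            "rcaAnalysis": {
                "title": title,
                "summary": summary,
                "nextSteps": next_steps,
            },
        },
    }


def build_resume_payload(
    thread_id: str, interrupt_id: str, node_id: str, user_input: str
) -> dict[str, Any]:
    """Build a ResumeEnsemble request body that answers a pending interrupt."""
    return {
        "apiMethod": "ResumeEnsemble",
        "apiVersion": "1",
        "payload": {
            "messageType": "node_input",
            "stateIdentifier": {"threadId": thread_id, "interruptId": interrupt_id},
            "nodeId": node_id,
            "inputProperties": {"input": user_input},
        },
    }
```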

Logging

Console: INFO+ (concise)
File: DEBUG+ at log/mcp_server.log (rotating, 10MB, 5 backups)
Artifacts: Per-run in log/runs/{run_id}/
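
The console and file settings above map directly onto the standard logging module. A minimal sketch, assuming a logger named "mcp_server" and that 10MB means 10 * 1024 * 1024 bytes; the function name and log formats are illustrative, not those of mcp_server/logging_config.py:

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def configure_logging(log_dir: str = "./log") -> logging.Logger:
    """Attach an INFO+ console handler and a DEBUG+ rotating file handler."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("mcp_server")
    logger.setLevel(logging.DEBUG)  # let handlers decide what to drop
    logger.handlers.clear()         # avoid duplicate handlers on repeat calls

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    file_handler = RotatingFileHandler(
        Path(log_dir) / "mcp_server.log",
        maxBytes=10 * 1024 * 1024,  # rotate at roughly 10 MB
        backupCount=5,              # keep five rotated files
    )
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```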

Development

Project Structure

Remidiation-MCP/
├── config.yaml              # Configuration
├── requirements.txt         # Dependencies
├── proto/                   # gRPC definitions
│   ├── common.proto
│   ├── scenario_service.proto
│   ├── fault_service.proto
│   ├── executor_service.proto
│   └── eval_service.proto
├── mcp_server/
│   ├── __init__.py
│   ├── config.py            # Settings
│   ├── logging_config.py    # Logging
│   ├── server.py            # gRPC server
│   ├── models/              # Pydantic models
│   │   └── scenario.py
│   ├── services/            # Service implementations
│   │   ├── fault_service.py
│   │   ├── executor_service.py
│   │   └── eval_service.py
│   ├── clients/             # API clients
│   │   └── remediation_client.py
│   ├── orchestration/       # Orchestration engine
│   │   ├── fsm.py
│   │   └── engine.py
│   └── utils/               # Utilities
│       ├── variables.py
│       └── artifacts.py
├── scenarios/               # Test scenarios
│   └── example_scenario.yaml
└── log/                     # Logs and artifacts

Testing

# Run example scenario
python -m mcp_server.server

# In another terminal, verify logs
tail -f log/mcp_server.log

# Check results
ls -la log/runs/
cat log/runs/run-*/report.json

Production Deployment

Docker

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "-m", "mcp_server.server"]

Kubernetes

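apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value below is illustrative
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
      - name: mcp-server
        image: mcp-server:latest
        ports:
        - containerPort: 50051
        env:
        - name: MCP_GRPC__HOST
          value: "0.0.0.0"
        - name: MCP_HTTP__BASE_URL
          value: "http://remediation-api:8901"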

Contributing

Follow PEP 8 style guidelines
Add type hints to all functions
Write docstrings for public APIs
Update tests for new features

License

MIT License - See LICENSE file for details

Support

For issues and questions:

GitHub Issues: https://github.com/your-org/mcp-server/issues
Documentation: https://docs.your-org.com/mcp-server

Roadmap

[ ] Full gRPC code generation from .proto files
[ ] WebSocket streaming for real-time events
[ ] Chaos Mesh integration
[ ] Prometheus metrics export
[ ] OpenTelemetry tracing
[ ] Multi-scenario parallel execution
[ ] Scenario templates and library