Conclave MCP

An MCP (Model Context Protocol) server that provides access to a "conclave" of LLM models, enabling any MCP-compatible client to consult multiple frontier models for diverse opinions, peer-ranked evaluations, and synthesized answers.

Why This Exists

When working with an AI assistant, you're getting one model's perspective. Sometimes that's exactly what you need. But for important decisions—technical architecture, business strategy, creative direction, complex analysis, or any situation where blind spots matter—a plurality of opinions surfaces alternatives you might miss.

Conclave brings democratic AI consensus to any workflow.

Instead of manually querying multiple AI services, you can consult the conclave through Claude Desktop, Claude Code, or any MCP client. Get ranked opinions from multiple frontier models (GPT, Claude, Gemini, Grok, DeepSeek) and receive a synthesized answer representing collective AI wisdom.

Use cases include:

  • Technical: Architecture decisions, code review, debugging, API design
  • Business: Strategy analysis, proposal review, market research synthesis
  • Creative: Writing feedback, brainstorming, editorial perspectives
  • Research: Literature review, fact-checking, multi-perspective analysis
  • Decision-making: Pros/cons analysis, risk assessment, option evaluation

Inspired by Andrej Karpathy's llm-council concept. This project reimplements the core ideas as an MCP server for seamless integration with AI-assisted workflows.

How It Works

The conclave operates in up to 3 stages:

┌─────────────────────────────────────────────────────────────────┐
│  Stage 1: OPINIONS                                              │
│  Query multiple LLMs in parallel for independent responses      │
│  (GPT, Claude, Gemini, Grok, DeepSeek, etc.)                   │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  Stage 2: PEER RANKING                                          │
│  Each model anonymously evaluates and ranks all responses       │
│  Aggregate scores reveal best performers (lower = better)       │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  Stage 3: SYNTHESIS                                             │
│  Chairman model synthesizes final answer from collective wisdom │
│  Consensus level reported (strong/moderate/weak/split)          │
│  Tiebreaker vote cast if conclave is split                      │
└─────────────────────────────────────────────────────────────────┘
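
In code form, the pipeline is roughly the following. This is an illustrative sketch, not the actual conclave.py implementation: it assumes the openai SDK pointed at OpenRouter's OpenAI-compatible endpoint, and the member/chairman lists mirror the standard tier and chairman pool shown later in this README.

# Illustrative sketch of the 3-stage flow (not the real conclave.py code).
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

MEMBERS = [
    "anthropic/claude-sonnet-4.5",
    "google/gemini-2.5-pro",
    "openai/o4-mini",
    "deepseek/deepseek-chat-v3.1",
]
CHAIRMAN = "deepseek/deepseek-r1"

async def ask(model: str, prompt: str) -> str:
    resp = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def conclave(question: str) -> str:
    # Stage 1: independent opinions, gathered in parallel
    opinions = await asyncio.gather(*(ask(m, question) for m in MEMBERS))

    # Stage 2: each member ranks the anonymized answers
    # (simplified here; the real server aggregates numeric scores)
    ranking_prompt = question + "\n\nRank these anonymous answers, best first:\n" + \
        "\n\n".join(f"Answer {i + 1}:\n{o}" for i, o in enumerate(opinions))
    rankings = await asyncio.gather(*(ask(m, ranking_prompt) for m in MEMBERS))

    # Stage 3: chairman synthesizes a final answer from opinions + rankings
    synthesis_prompt = (
        question
        + "\n\nMember answers:\n" + "\n\n".join(opinions)
        + "\n\nPeer rankings:\n" + "\n\n".join(rankings)
        + "\n\nSynthesize the single best answer."
    )
    return await ask(CHAIRMAN, synthesis_prompt)

if __name__ == "__main__":
    print(asyncio.run(conclave("Redis vs PostgreSQL for session storage?")))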

Features

  • Tiered queries: Choose cost/depth tradeoff (quick | ranked | full)
  • Three council tiers: Premium (frontier), Standard (balanced), Budget (fast/cheap)
  • Consensus protocol: Detects agreement level, triggers tiebreaker on splits
  • Odd conclave size: Ensures tiebreaker votes can break deadlocks
  • Rotating chairmanship: Weekly rotation prevents single-model bias
  • Chairman presets: Context-aware chairman selection (code, creative, reasoning)
  • Cost estimation: Know what you'll spend before querying
  • Eval-light: Standalone benchmark runner for tracking performance over time

Installation

Prerequisites

  1. Get an OpenRouter API key from https://openrouter.ai/keys
  2. Add credits to your OpenRouter account (pay-as-you-go)

Setup

# Clone the repository
git clone https://github.com/stephenpeters/conclave-mcp.git
cd conclave-mcp

# Create .env file with your API key
echo "OPENROUTER_API_KEY=sk-or-v1-your-key-here" > .env

# Install dependencies
uv sync

Configure Claude Desktop

Option 1: Desktop Extensions (Recommended)

  1. Open Claude Desktop
  2. Go to Settings > Extensions > Advanced settings > Install Extension...
  3. Navigate to the conclave-mcp directory
  4. Follow prompts to configure your OPENROUTER_API_KEY
  5. Restart Claude Desktop

Option 2: Manual Config

Open Claude Desktop, go to Settings > Developer > Edit Config, and add the following to claude_desktop_config.json:

{
  "mcpServers": {
    "conclave": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/conclave-mcp", "python", "server.py"],
      "env": {
        "OPENROUTER_API_KEY": "sk-or-v1-your-key-here"
      }
    }
  }
}

Replace /path/to/conclave-mcp with your actual path, save, and restart Claude Desktop.

Configure Claude Code

Add the server using the CLI:

claude mcp add conclave --transport stdio --env OPENROUTER_API_KEY=sk-or-v1-your-key-here -- uv run --directory /path/to/conclave-mcp python server.py

Or copy .mcp.json.example to .mcp.json and update paths:

cp .mcp.json.example .mcp.json
# Edit .mcp.json with your paths and API key

Verify with /mcp in Claude Code or claude mcp list in terminal.

Available Tools

conclave_quick

Fast parallel opinions (Stage 1 only). Queries all conclave models and returns individual responses.

Cost: ~$0.01-0.03 per query

Use for: Quick brainstorming, getting diverse perspectives fast

conclave_ranked

Opinions with peer rankings (Stage 1 + 2). Shows which model performed best on this specific question.

Cost: ~$0.05-0.10 per query

Use for: Code review, comparing approaches, seeing which model "won"

conclave_full

Complete conclave with synthesis (all 3 stages). Includes consensus detection and chairman tiebreaker.

Cost: ~$0.10-0.20 per query

Options:

  • tier: Model tier - "premium", "standard" (default), "budget"
  • chairman: Override chairman model (e.g., "anthropic/claude-sonnet-4")
  • chairman_preset: Use a preset ("code", "creative", "reasoning", "concise", "balanced")

Use for: Important decisions, architecture choices, complex debugging
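
For example (illustrative call; the question is passed as shown in the other tool examples, while tier and chairman_preset are the documented options):

conclave_full(
    "Should we move session storage from PostgreSQL to Redis?",
    tier="premium",
    chairman_preset="code"
)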

conclave_config

View current configuration: conclave members, chairman rotation status, consensus thresholds.

conclave_estimate

Estimate costs before running a query.

conclave_models

List all available models with selection numbers. Shows models grouped by tier with stable numbering:

  • Premium tier: 1-10
  • Standard tier: 11-20
  • Budget tier: 21-30
  • Chairman pool: 31-40

conclave_select

Create a custom conclave from model numbers. The first model becomes the chairman.

conclave_select(models="31,1,11,21")

Creates:

  • Chairman: #31 (deepseek-r1)
  • Members: #1 (claude-opus-4.5), #11 (claude-sonnet-4.5), #21 (gemini-2.5-flash)

Custom selection persists until server restart or conclave_reset.

conclave_reset

Clear custom conclave selection and return to tier-based configuration.

Custom Model Selection

For full control over which models participate in the conclave:

  1. List available models: Use conclave_models to see all models with their numbers
  2. Select your lineup: Use conclave_select(models="31,1,11,21") - first number is chairman
  3. Query: Use conclave_quick, conclave_ranked, or conclave_full as normal
  4. Reset: Use conclave_reset to return to tier-based config

Example workflow:

> conclave_models
## Available Models
### Premium Tier (1-10)
   1. anthropic/claude-opus-4.5
   2. google/gemini-3-pro-preview
   ...

> conclave_select(models="31,1,12,21")
## Custom Conclave Created
Chairman (#31): deepseek/deepseek-r1
Members:
  - #1: anthropic/claude-opus-4.5
  - #12: google/gemini-2.5-pro
  - #21: google/gemini-2.5-flash

> conclave_quick("What is the best approach for...")
[Uses your custom selection]

> conclave_reset
## Custom Conclave Cleared

Configuration

Edit config.py to customize:

Conclave Tiers

Each tier has unique models (no overlap) for proper price/performance differentiation:

# Premium: 6 frontier models for complex questions (~$0.30-0.50/query)
COUNCIL_PREMIUM = [
    "anthropic/claude-opus-4.5",        # Claude Opus 4.5
    "google/gemini-3-pro-preview",      # Gemini 3 Pro
    "x-ai/grok-4",                      # Grok 4 (full reasoning)
    "openai/gpt-5.1",                   # GPT-5.1 (flagship)
    "deepseek/deepseek-v3.2-speciale",  # DeepSeek V3.2 Speciale
    "moonshotai/kimi-k2-thinking",      # Kimi K2 Thinking (1T MoE)
]

# Standard: 4 balanced models (default) (~$0.10-0.20/query)
COUNCIL_STANDARD = [
    "anthropic/claude-sonnet-4.5",      # Claude Sonnet 4.5
    "google/gemini-2.5-pro",            # Gemini 2.5 Pro
    "openai/o4-mini",                   # OpenAI o4-mini
    "deepseek/deepseek-chat-v3.1",      # DeepSeek Chat V3.1
]

# Budget: 4 cheap/fast models (~$0.02-0.05/query)
COUNCIL_BUDGET = [
    "google/gemini-2.5-flash",          # Gemini 2.5 Flash
    "qwen/qwen3-235b-a22b:free",        # Qwen 3 235B (free tier)
    "openai/gpt-4.1-mini",              # GPT-4.1 Mini
    "moonshotai/kimi-k2:free",          # Kimi K2 (free tier)
]

Chairman Rotation

The chairman pool uses reasoning models only (not chat models) for high-quality synthesis:

CHAIRMAN_ROTATION_ENABLED = True
CHAIRMAN_ROTATION_DAYS = 7  # Rotate weekly

CHAIRMAN_POOL = [
    "deepseek/deepseek-r1",          # DeepSeek R1 reasoning
    "openai/o3-mini",                # OpenAI o3-mini reasoning
    "anthropic/claude-sonnet-4",     # Claude Sonnet 4 (strong reasoning)
    "qwen/qwq-32b",                  # Qwen QWQ reasoning model
]
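
As a rough sketch (not necessarily the exact logic in config.py), a weekly rotation can be derived from the number of whole rotation periods elapsed:

import time

def current_chairman(pool: list[str], rotation_days: int = CHAIRMAN_ROTATION_DAYS) -> str:
    # Count whole rotation periods since the Unix epoch and cycle through the pool.
    period = int(time.time() // (rotation_days * 86400))
    return pool[period % len(pool)]

chairman = current_chairman(CHAIRMAN_POOL)  # e.g. "deepseek/deepseek-r1" this week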

Consensus Thresholds

CONSENSUS_STRONG_THRESHOLD = 0.75   # 75%+ agreement
CONSENSUS_MODERATE_THRESHOLD = 0.50  # 50-75% agreement
CHAIRMAN_TIEBREAKER_ENABLED = True   # Chairman breaks ties
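
Roughly, these thresholds map an agreement score onto the reported consensus level. The sketch below is illustrative; the actual split detection in conclave.py may differ.

def consensus_level(agreement: float, tied: bool = False) -> str:
    # 'agreement' is the fraction of members backing the winning answer;
    # 'tied' stands in for the server's own split detection.
    if tied:
        return "split"      # chairman casts a tiebreaker vote when enabled
    if agreement >= CONSENSUS_STRONG_THRESHOLD:    # 0.75 and above
        return "strong"
    if agreement >= CONSENSUS_MODERATE_THRESHOLD:  # 0.50 up to 0.75
        return "moderate"
    return "weak"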

Eval-Light

A standalone benchmark runner for testing and comparing conclave performance across tiers and over time.

Test Suite Overview

The eval suite includes 16 tasks across 9 categories, designed to test different model capabilities:

Category          Tasks  Difficulty   What It Tests
math              2      Easy-Medium  Arithmetic, word problems, step-by-step reasoning
code              2      Easy-Medium  Bug detection, concept explanation, code examples
reasoning         2      Medium-Hard  Syllogisms, multi-step logic puzzles
analysis          2      Medium       Logical fallacies, tradeoff analysis
summarization     2      Medium       Technical docs, business reports
writing_business  2      Easy-Medium  Professional emails, proposals
writing_creative  2      Easy-Medium  Story openings, original metaphors
creative          1      Easy         Analogies with explanations
factual           1      Easy         Science explanations for general audience

Running Evaluations

# Run all 16 tests at standard tier (default)
python eval.py

# Run at different tiers
python eval.py --tier premium    # 6 frontier models (~$0.30-0.50/query)
python eval.py --tier standard   # 4 balanced models (~$0.10-0.20/query)
python eval.py --tier budget     # 4 cheap/fast models (~$0.02-0.05/query)

# Different modes
python eval.py --mode quick      # Stage 1 only (fastest, cheapest)
python eval.py --mode ranked     # Stage 1 + 2 (adds peer rankings)
python eval.py --mode full       # All 3 stages (default, includes synthesis)

# Filter by category
python eval.py --category math
python eval.py --category code
python eval.py --category reasoning

# Don't save results to disk
python eval.py --no-save

# Combine options
python eval.py --tier premium --mode full --category reasoning

Output Format

Results are saved to evals/eval_<tier>_<mode>_<timestamp>.json with:

  • metadata: Timestamp, tier, mode, chairman model
  • summary: Success rate, total time, average time per task
  • results: Per-task details including:
    • Individual model responses
    • Peer rankings (for ranked/full modes)
    • Chairman synthesis (for full mode)
    • Consensus level
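
The saved file roughly follows this shape (field names here are illustrative; open a saved file under evals/ for the exact schema):

{
  "metadata": {"timestamp": "20251204_143052", "tier": "standard", "mode": "full", "chairman": "deepseek/deepseek-r1"},
  "summary": {"tasks_total": 16, "tasks_successful": 16, "total_time_s": 287.45, "avg_time_s": 17.97},
  "results": [
    {
      "task": "math_arithmetic",
      "category": "math",
      "responses": {"anthropic/claude-sonnet-4.5": "...", "google/gemini-2.5-pro": "..."},
      "rankings": {"anthropic/claude-sonnet-4.5": 1.5, "google/gemini-2.5-pro": 2.0},
      "synthesis": "...",
      "consensus": "strong"
    }
  ]
}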

Example Output

🏛️  Conclave Eval-Light
   Tier: standard | Mode: full | Tasks: 16
--------------------------------------------------

[1/16] Running: math_arithmetic (math)
   ✓ Completed in 12.34s

[2/16] Running: math_word_problem (math)
   ✓ Completed in 15.67s
...

==================================================
📊 EVAL SUMMARY
==================================================
Tier: standard | Mode: full
Chairman: deepseek/deepseek-r1
Tasks: 16/16 successful
Total time: 287.45s
Avg per task: 17.97s

📋 Results by Task:
  ✓ math_arithmetic (easy) - 12.34s
  ✓ math_word_problem (medium) - 15.67s
  ✓ code_debug (easy) - 11.23s
  ...

💾 Results saved to: evals/eval_standard_full_20251204_143052.json

Comparing Tiers

Run the same eval across all tiers to compare model quality vs cost:

python eval.py --tier budget --category reasoning
python eval.py --tier standard --category reasoning
python eval.py --tier premium --category reasoning

Then compare the JSON outputs to see how different model tiers perform on the same tasks.
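
A small helper along these lines can line up the run summaries side by side (field names follow the sketch in Output Format above; adjust if the real schema differs):

# Illustrative comparison of saved eval runs.
import glob
import json

for path in sorted(glob.glob("evals/eval_*.json")):
    with open(path) as f:
        data = json.load(f)
    meta, summary = data.get("metadata", {}), data.get("summary", {})
    print(f"{meta.get('tier', '?'):>8} {meta.get('mode', '?'):>6}  "
          f"{summary.get('tasks_successful', '?')}/{summary.get('tasks_total', '?')} ok  "
          f"avg {summary.get('avg_time_s', '?')}s")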

Use Cases

Scenario                              Recommended Tool                             Why
"Review this function"                conclave_ranked                              See which model catches the most issues
"Redis vs PostgreSQL for sessions?"   conclave_full                                Important decision, need synthesis
"Ideas for this feature"              conclave_quick                               Fast diverse brainstorming
"Debug this error"                    conclave_quick                               Quick parallel diagnosis
"Rewrite this paragraph"              conclave_full + chairman_preset="creative"   Creative synthesis
"Is this architecture sound?"         conclave_full + chairman_preset="code"       Technical synthesis

Example Tool Output

## Conclave Full Result

**Consensus: ✅ STRONG** (75% agreement)

---

### Chairman's Synthesis

_Chairman: deepseek/deepseek-r1_

[Synthesized answer incorporating best points from all models...]

---

### Model Rankings (lower is better)

1. **claude-sonnet-4.5**: 1.50
2. **o4-mini**: 2.00
3. **gemini-2.5-pro**: 2.75
4. **deepseek-v3.1**: 3.75

_First-place votes:_ claude-sonnet-4.5=3, o4-mini=1

Project Structure

conclave-mcp/
├── server.py      # MCP server entry point (8 tools)
├── conclave.py    # Core 3-stage council logic
├── config.py      # Model tiers, chairman rotation, cost estimates
├── eval.py        # Standalone benchmark runner
└── evals/         # Saved evaluation results

Adding Models

OpenRouter supports 200+ models. Find model IDs at https://openrouter.ai/models

# Add to COUNCIL_* lists in config.py
"x-ai/grok-4"                    # xAI Grok
"meta-llama/llama-4-maverick"    # Meta Llama
"mistralai/mistral-large-2"      # Mistral
"deepseek/deepseek-r1"           # DeepSeek reasoning

Important: Keep each tier's models unique (no overlap) for proper differentiation.
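
A quick way to check for accidental overlap after editing config.py (uses only the COUNCIL_* names shown above):

from config import COUNCIL_PREMIUM, COUNCIL_STANDARD, COUNCIL_BUDGET

all_models = COUNCIL_PREMIUM + COUNCIL_STANDARD + COUNCIL_BUDGET
duplicates = {m for m in all_models if all_models.count(m) > 1}
assert not duplicates, f"Models listed in more than one tier: {duplicates}"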

How OpenRouter Works

OpenRouter is a unified API gateway—you don't need separate accounts with OpenAI, Google, Anthropic, etc. One API key, one credit balance, access to all models.

  • Sign up: https://openrouter.ai
  • Add credits (prepaid, or enable auto-top-up)
  • Use your single API key for all models
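
Because the endpoint is OpenAI-compatible, a standalone sanity check of your key can look like this (using the openai Python package; conclave-mcp itself may use a different client):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="google/gemini-2.5-flash",           # any model ID from openrouter.ai/models
    messages=[{"role": "user", "content": "Reply with the word 'ok'."}],
)
print(resp.choices[0].message.content)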

License

MIT

Attribution

Inspired by Andrej Karpathy's llm-council. The original is a web application for interactively exploring LLM comparisons. This project reimplements the council concept as an MCP server for integration with AI-assisted editors, adding consensus protocol and tiebreaker mechanics.
