MCP Server with Local LLM

Integrates local language models (like Qwen3-8B) with MCP clients, providing tools for chat, code analysis, text generation, translation, and content summarization using your own hardware.

A Model Context Protocol (MCP) server that integrates with your local Large Language Model, providing tools and resources that can be used by MCP-compatible clients like Claude Desktop, IDEs, and other AI assistants.

Features

Local LLM Integration: Uses your local language model (default: Qwen3-8B) for all operations
MCP-Compatible: Implements the Model Context Protocol specification
Multiple Tools: Provides various tools for different use cases
Resource Access: Exposes model information and conversation history
Memory Efficient: Uses 4-bit quantization to reduce the model's memory footprint

Available Tools

1. `chat_with_llm`

Basic chat interface with the local LLM.

Parameters:

message (required): The message to send to the LLM
system_prompt (optional): System prompt to guide behavior
temperature (optional): Temperature for generation (0-1, default: 0.7)
max_tokens (optional): Maximum tokens to generate (default: 512)
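
As a rough illustration of how an MCP client might invoke this tool, the sketch below builds a JSON-RPC 2.0 `tools/call` request from the parameters listed above. The helper name `build_chat_request` and the request `id` are hypothetical, and the range check simply mirrors the documented 0-1 temperature range; an MCP client library would normally construct this message for you.

```python
import json


def build_chat_request(message, system_prompt=None, temperature=0.7,
                       max_tokens=512, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for chat_with_llm.

    Hypothetical helper for illustration; not part of this repository.
    """
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    arguments = {
        "message": message,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # Optional parameters are only sent when the caller provides them.
    if system_prompt is not None:
        arguments["system_prompt"] = system_prompt
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "chat_with_llm", "arguments": arguments},
    }


request = build_chat_request("Explain what a mutex is.", temperature=0.2)
print(json.dumps(request, indent=2))
```

The same envelope applies to the other tools; only `name` and `arguments` change.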

2. `analyze_code`

Analyze and review code using the local LLM.

Parameters:

code (required): The code to analyze
language (optional): Programming language
analysis_type (optional): Type of analysis (review, explain, optimize, debug)

3. `generate_text`

Generate text content in various styles and lengths.

Parameters:

prompt (required): The prompt for text generation
style (optional): Writing style (formal, casual, technical, creative)
length (optional): Desired length (short, medium, long)

4. `translate_text`

Translate text between languages.

Parameters:

text (required): Text to translate
target_language (required): Target language
source_language (optional): Source language (auto-detect if not provided)

5. `summarize_content`

Summarize content in different formats.

Parameters:

content (required): Content to summarize
summary_type (optional): Type of summary (brief, detailed, bullet_points)
max_length (optional): Maximum length in words (default: 150)
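
MCP tools describe their parameters with a JSON Schema under `inputSchema`. The sketch below shows what such a definition could look like for `summarize_content`, together with a deliberately small argument check. Both are assumptions for illustration: the schema actually used in mcp_server.py may differ, and `check_arguments` is a hypothetical helper.

```python
# Hypothetical tool definition; the actual schema in mcp_server.py may differ.
SUMMARIZE_TOOL = {
    "name": "summarize_content",
    "description": "Summarize content in different formats.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "content": {"type": "string"},
            "summary_type": {
                "type": "string",
                "enum": ["brief", "detailed", "bullet_points"],
            },
            "max_length": {"type": "integer", "default": 150},
        },
        "required": ["content"],
    },
}


def check_arguments(tool, arguments):
    """Return a list of problems found in `arguments` (empty if none).

    Covers only required fields, unknown names, and enum values; a real
    server would more likely rely on a full JSON Schema validator.
    """
    schema = tool["inputSchema"]
    problems = []
    for name in schema["required"]:
        if name not in arguments:
            problems.append(f"missing required parameter: {name}")
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown parameter: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems


print(check_arguments(SUMMARIZE_TOOL, {"content": "some text", "summary_type": "brief"}))
print(check_arguments(SUMMARIZE_TOOL, {"summary_type": "haiku"}))
```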

Available Resources

1. `llm://model/info`

Information about the loaded local LLM model.

2. `llm://conversations/history`

History of conversations with the LLM.

3. `llm://system/status`

Current system status and resource usage.
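
Resources are fetched with the MCP `resources/read` method, which takes the resource URI as its only parameter. The following sketch builds one request per URI listed above; `build_read_requests` and the request ids are hypothetical, and the shape of each response depends on the server.

```python
import json

RESOURCE_URIS = [
    "llm://model/info",
    "llm://conversations/history",
    "llm://system/status",
]


def build_read_requests(uris, first_id=1):
    """Build one JSON-RPC `resources/read` request per resource URI."""
    return [
        {
            "jsonrpc": "2.0",
            "id": first_id + offset,
            "method": "resources/read",
            "params": {"uri": uri},
        }
        for offset, uri in enumerate(uris)
    ]


for request in build_read_requests(RESOURCE_URIS):
    print(json.dumps(request))
```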

Installation

Clone or download this repository to your local machine.
Install dependencies:
```
pip install -r requirements.txt
```

Set up environment variables (create a .env file):

MODEL_NAME=Qwen/Qwen3-8B
MAX_NEW_TOKENS=512
TEMPERATURE=0.7

Ensure you have sufficient hardware:
- GPU with at least 8GB VRAM (recommended)
- 16GB+ system RAM
- CUDA-compatible GPU (for optimal performance)

Usage

Running the Server

Start the MCP server:
```
python mcp_server.py
```
The server will:
- Load the specified model (Qwen3-8B by default)
- Initialize the MCP server
- Wait for JSON-RPC requests via stdin/stdout
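
To make the stdin/stdout exchange concrete, here is a minimal sketch of message framing. It assumes the server follows the standard MCP stdio transport, in which each JSON-RPC message is written as a single line of JSON terminated by a newline; this README does not confirm that detail. The function names are hypothetical, and an in-memory stream stands in for the real pipes.

```python
import io
import json


def write_message(stream, message):
    """Write one JSON-RPC message as a single newline-terminated line."""
    # json.dumps without `indent` escapes newlines inside strings, so the
    # encoded message never spans more than one line.
    stream.write(json.dumps(message) + "\n")
    stream.flush()


def read_messages(stream):
    """Yield decoded JSON-RPC messages, one per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Demonstrate a round trip through an in-memory stream instead of a real pipe.
buffer = io.StringIO()
write_message(buffer, {"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
write_message(buffer, {"jsonrpc": "2.0", "id": 2, "method": "resources/list"})
buffer.seek(0)
received = list(read_messages(buffer))
print([m["method"] for m in received])
```

In practice a client would pass the server process's stdin to `write_message` and iterate over its stdout with `read_messages`.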

Integrating with Claude Desktop

To use this server with Claude Desktop, add it to your MCP configuration:

Create or edit your Claude Desktop MCP configuration file:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json

Add the server configuration, replacing the example path in args with the absolute path to mcp_server.py on your machine:

{
  "mcpServers": {
    "local-llm": {
      "command": "python",
      "args": ["/Users/sazzad/Documents/mcp-server/mcp_server.py"],
      "env": {
        "MODEL_NAME": "Qwen/Qwen3-8B"
      }
    }
  }
}

Restart Claude Desktop to load the new server.
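
If you would rather not edit the JSON by hand, the configuration step can be scripted. The sketch below adds a `local-llm` entry under `mcpServers` while leaving any existing servers untouched. The function name `add_server` and the placeholder script path are hypothetical, and the demo writes to a temporary file rather than the real Claude Desktop configuration.

```python
import json
import tempfile
from pathlib import Path


def add_server(config_path, name, script_path, model_name="Qwen/Qwen3-8B"):
    """Add or update one entry under `mcpServers`, keeping other entries."""
    config_path = Path(config_path)
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    servers = config.setdefault("mcpServers", {})
    servers[name] = {
        "command": "python",
        "args": [str(script_path)],
        "env": {"MODEL_NAME": model_name},
    }
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Try it against a temporary file rather than the real Claude Desktop config.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "claude_desktop_config.json"
    path.write_text('{"mcpServers": {"other": {"command": "node"}}}', encoding="utf-8")
    updated = add_server(path, "local-llm", "/path/to/mcp-server/mcp_server.py")
    print(sorted(updated["mcpServers"]))
```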

Example Usage in Claude Desktop

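Once configured, you can ask Claude Desktop to use the tools in a conversation. For example, send a prompt such as the following, then paste the code you want reviewed:

Please use the local LLM to analyze this Python code:

Claude will then call the analyze_code tool and return the analysis produced by your local LLM.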

Configuration

Environment Variables

MODEL_NAME: HuggingFace model name (default: "Qwen/Qwen3-8B")
MAX_NEW_TOKENS: Maximum tokens to generate (default: 512)
TEMPERATURE: Default temperature for generation (default: 0.7)
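
The variables and defaults above could be read along these lines. This is a sketch, not the code in mcp_server.py: the `Settings` class and `load_settings` function are hypothetical. It reads variables that are already present in the environment; loading them from a .env file usually needs an extra step such as the python-dotenv package, and whether this project uses it is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_name: str
    max_new_tokens: int
    temperature: float


def load_settings(env=None):
    """Read the three documented variables, falling back to the defaults."""
    env = os.environ if env is None else env
    return Settings(
        model_name=env.get("MODEL_NAME", "Qwen/Qwen3-8B"),
        max_new_tokens=int(env.get("MAX_NEW_TOKENS", "512")),
        temperature=float(env.get("TEMPERATURE", "0.7")),
    )


# Passing a plain dict makes the behaviour easy to check without touching
# the real environment.
print(load_settings({}))
print(load_settings({"MAX_NEW_TOKENS": "256", "TEMPERATURE": "0.2"}))
```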

Supported Models

This server works with most HuggingFace transformers models that support:

Chat templates
4-bit quantization via BitsAndBytesConfig
Causal language modeling

Popular choices:

Qwen/Qwen3-8B (default)
microsoft/DialoGPT-medium
meta-llama/Llama-2-7b-chat-hf
mistralai/Mistral-7B-Instruct-v0.1

Hardware Requirements

Minimum Requirements

8GB RAM
4GB GPU VRAM (with 4-bit quantization)
10GB disk space (for model storage)

Recommended Requirements

16GB+ RAM
8GB+ GPU VRAM
CUDA-compatible GPU
SSD storage for faster model loading
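
A back-of-the-envelope calculation helps explain the VRAM figures. The sketch below estimates the memory taken by the model weights alone at different precisions, assuming a nominal 8 billion parameters. It ignores the KV cache, activations, and quantization overhead, so actual usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Nominal 8 billion parameters; the exact count for a given model differs.
NUM_PARAMS = 8_000_000_000

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights need roughly 3.7 GiB at 4 bits versus roughly 14.9 GiB at 16 bits, which is why the 4GB minimum leaves little headroom for an 8B model and 8GB or more is recommended.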

Troubleshooting

Common Issues

Out of Memory Error:
- Reduce MAX_NEW_TOKENS
- Use a smaller model
- Ensure 4-bit quantization is working

Model Loading Fails:
- Check internet connection for initial download
- Verify model name is correct
- Ensure sufficient disk space

CUDA Errors:
- Update CUDA drivers
- Check GPU compatibility
- Fall back to CPU if needed
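
One way to implement the CPU fallback is to choose the device at startup. The sketch below uses PyTorch's `torch.cuda.is_available()` and also tolerates PyTorch being absent. The function `pick_device` is hypothetical, and whether mcp_server.py selects its device this way is an assumption.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used from here.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

CPU inference is typically much slower, so a smaller model may be the more practical fallback.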

Performance Optimization

Use GPU acceleration:

pip install torch --index-url https://download.pytorch.org/whl/cu118

Enable Flash Attention (if compatible):
```
pip install flash-attn
```
Adjust batch size and sequence length based on your hardware.

Development

Project Structure

mcp-server/
├── mcp_server.py          # Main MCP server implementation
├── mcp-server.py          # Legacy FastAPI server (kept for reference)
├── requirements.txt       # Python dependencies
├── mcp/                   # MCP protocol implementation
│   ├── __init__.py
│   ├── server.py          # MCP server base
│   └── types.py           # MCP protocol types
├── .env                   # Environment variables (create this)
└── README.md              # This file