
Gemini TTS MCP Server

Provides text-to-speech capabilities using Google's Gemini TTS API, with support for multiple voices, automatic chunking of long text, multi-speaker dialogue, and audio playback on Windows via PowerShell's System.Media.SoundPlayer.

A Model Context Protocol (MCP) server that provides text-to-speech capabilities using Google's Gemini TTS API.

Features

Single-speaker TTS: Generate speech with 30 available voices
Automatic chunking: Handles long text by splitting into logical episodes at sentence boundaries
Multi-speaker support: Create dialogue with different voices for each speaker
Automatic playback: Plays generated audio on Windows through PowerShell's System.Media.SoundPlayer
File saving: Optional parameter to save audio files permanently
Temporary file management: Auto-cleanup of generated audio files
Environment-based API key: Secure API key management via environment variables

Prerequisites

Required

Python 3.10 or higher: Required for FastMCP and async features
Google API Key: Must have access to Gemini API (specifically gemini-2.5-flash-preview-tts model)
Windows 11: Required for audio playback via PowerShell's System.Media.SoundPlayer
PowerShell 5.1 or higher: Built into Windows 11, used for audio playback

Optional

MCP Client: Such as Claude Desktop, to interact with the server through the Model Context Protocol
Audio Output Device: Speakers or headphones for audio playback testing

API Access Requirements

Active Google Cloud account with billing enabled
Gemini API access (may require waitlist approval during preview)
API key with TTS permissions enabled

Installation

Step 1: Clone the Repository

git clone <repository-url>
cd mcp-gemini-tts

Step 2: Install Dependencies

# Install required packages
pip install -r requirements.txt

Verify installation:

# Check Python version (should be 3.10+)
python --version

# Verify FastMCP is installed
python -c "from mcp.server.fastmcp import FastMCP; print('FastMCP installed successfully')"

Step 3: Configure API Key

Option A: User Environment Variable (Recommended)

# Set for current user (persists across sessions)
[System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-api-key-here', 'User')

# Restart PowerShell to apply changes

Option B: Session Environment Variable (Temporary)

# Set for current session only
$env:GOOGLE_API_KEY = "your-api-key-here"

Verify API key is set:

# Check environment variable
echo $env:GOOGLE_API_KEY
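
Echoing the variable prints the full key to the terminal. As an alternative, here is a minimal Python sketch that confirms the key is set while showing only its last four characters. The function name describe_api_key is hypothetical and not part of this server.

```python
import os


def describe_api_key(env=os.environ, name="GOOGLE_API_KEY"):
    """Return a status string for the API key without revealing the whole value."""
    value = env.get(name, "")
    if not value:
        return f"{name} is not set"
    # Show only the tail so the key can be recognized but not copied.
    return f"{name} is set ({len(value)} characters, ends with ...{value[-4:]})"


if __name__ == "__main__":
    print(describe_api_key())
```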

Step 4: Test Installation

# Test the server can start (Ctrl+C to stop)
python src/server.py

# Run example scripts to verify functionality
python src/examples/test_playback.py

Expected output: Server should start without errors, and test script should generate and play audio.

Configuration

MCP Client Setup

To use this server with Claude Desktop or other MCP clients, configure the MCP settings file.

Claude Desktop Configuration Location:

Windows: %APPDATA%\Claude\claude_desktop_config.json

Configuration:

{
  "mcpServers": {
    "gemini-tts": {
      "command": "python",
      "args": [
        "C:\\Projects\\mcp-gemini-tts\\src\\server.py"
      ],
      "env": {
        "GOOGLE_API_KEY": "your-api-key-here"
      }
    }
  }
}

Important Notes:

Replace C:\\Projects\\mcp-gemini-tts\\src\\server.py with your actual installation path
Use double backslashes (\\) in Windows paths for JSON
Replace your-api-key-here with your actual Google API key
Restart Claude Desktop after modifying the configuration file
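
Hand-escaping backslashes is easy to get wrong. One option is to generate the JSON from Python, since json.dumps doubles each backslash automatically. This is a sketch only: build_mcp_config is a hypothetical helper, and the path and key are the placeholders from the example above.

```python
import json


def build_mcp_config(server_path, api_key):
    """Build the Claude Desktop config entry for this server as a JSON string."""
    config = {
        "mcpServers": {
            "gemini-tts": {
                "command": "python",
                "args": [server_path],
                "env": {"GOOGLE_API_KEY": api_key},
            }
        }
    }
    # json.dumps escapes every backslash, so a raw Windows path comes out doubled.
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(build_mcp_config(r"C:\Projects\mcp-gemini-tts\src\server.py", "your-api-key-here"))
```

Paste the printed JSON into claude_desktop_config.json, merging it with any mcpServers entries already present.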

Verify Configuration:

Restart Claude Desktop
Check that the gemini-tts tools appear in the available tools list
Test with a simple command: "Generate speech saying 'test' using the Kore voice"

Alternative: Direct Server Usage

You can also run the server directly without an MCP client:

# Start the MCP server (communicates via stdio)
python src/server.py

# Or use the example scripts for direct testing
python src/examples/test_playback.py
python src/examples/test_chunking.py

Available Tools

1. generate_speech

Generate and play speech from text using a single voice. Automatically chunks long text (>3900 bytes) into logical episodes.

Parameters:

text (required): Text to convert to speech (automatically chunked if >3900 bytes)
voice (optional): Voice name (default: "Kore")
play (optional): Whether to play audio after generation (default: true)
save_file (optional): File path to save audio (e.g., "output.wav")

Example:

{
  "text": "Hello, this is a test of the Gemini TTS system!",
  "voice": "Puck",
  "play": true,
  "save_file": "my_audio.wav"
}

Long Text Example:

{
  "text": "Very long text that exceeds 3900 bytes will be automatically split into chunks at sentence boundaries, then combined into a single audio file...",
  "voice": "Aoede",
  "play": true
}

2. generate_multi_speaker_speech

Generate and play speech with multiple speakers/voices. Note: Multi-speaker does not support automatic chunking - text must be under 4000 bytes.

Parameters:

text (required): Text to convert (use speaker tags like "Alice: Hello!"). Must be under 4000 bytes.
speakers (required): Array of speaker configurations
- speaker: Speaker name/identifier
- voice: Voice to use for this speaker
play (optional): Whether to play audio after generation (default: true)
save_file (optional): File path to save audio (e.g., "dialogue.wav")

Example:

{
  "text": "Alice: Hello! How are you? Bob: I'm doing great, thanks!",
  "speakers": [
    {"speaker": "Alice", "voice": "kore"},
    {"speaker": "Bob", "voice": "puck"}
  ],
  "play": true,
  "save_file": "conversation.wav"
}

3. list_available_voices

List all available voice options for speech generation.

Parameters: None

Available Voices

The server supports 30 prebuilt voices:

Kore, Puck, Charon (base voices)
Kore-F, Puck-F, Charon-F (female variants)
Kore-M, Puck-M, Charon-M (male variants)
Aoede, Arcas, Fenrir (specialty voices)
Regional variants: -G, -H, -I, -J, -K, -L suffixes

Use the list_available_voices tool to see the complete list.

Technical Details

Model: gemini-2.5-flash-preview-tts
Audio Format: PCM WAV, 24kHz, mono, 16-bit
API: Google Gemini API via google-genai Python SDK
MCP Framework: Uses FastMCP (Python MCP implementation)
Playback: System.Media.SoundPlayer (.NET), invoked through PowerShell
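
To make the audio format concrete, the sketch below wraps raw PCM samples in a WAV container using Python's standard wave module and derives duration from byte count. It assumes the API hands back headerless 16-bit PCM; the function names are illustrative, not taken from gemini_tts.py.

```python
import io
import wave

SAMPLE_RATE = 24_000   # Hz, from the format listed above
CHANNELS = 1           # mono
SAMPLE_WIDTH = 2       # bytes per sample (16-bit)


def pcm_to_wav_bytes(pcm):
    """Wrap raw PCM samples in a WAV container and return the file contents."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


def pcm_seconds(pcm):
    """Duration of raw PCM data: bytes / (rate * channels * width)."""
    return len(pcm) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)


if __name__ == "__main__":
    one_second = b"\x00\x00" * SAMPLE_RATE  # one second of silence
    print(len(pcm_to_wav_bytes(one_second)), pcm_seconds(one_second))
```

At this format one second of audio is 48,000 bytes, so a minute is about 2.9 MB plus a 44-byte header.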

API Limitations

Text Input Limits

Character vs. Byte Counting

API Limit: 4,000 bytes (not characters) per text field
UTF-8 Encoding: Multi-byte characters (emoji, non-ASCII) consume more bytes than their character count
Example: "Hello 👋" = 10 bytes (5 chars + 4-byte emoji + space), not 7 characters
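
The difference between characters and bytes is easy to check in Python. A minimal sketch follows; the helper names are hypothetical, and it assumes the 4,000-byte limit is inclusive.

```python
API_LIMIT_BYTES = 4_000  # per-field limit stated above


def utf8_size(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_one_request(text, limit=API_LIMIT_BYTES):
    """True when the encoded text is within the byte limit (assumed inclusive)."""
    return utf8_size(text) <= limit


if __name__ == "__main__":
    sample = "Hello 👋"
    print(len(sample), utf8_size(sample))  # 7 characters, 10 bytes
```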

Single-Speaker Mode

Maximum per request: 4,000 bytes (hard API limit)
Recommended chunk size: 3,900 bytes (100-byte safety buffer)
Automatic chunking: Enabled by default for text >3,900 bytes
Chunk boundaries: Splits at sentence endings (., !, ?, \n) to maintain natural pauses
Maximum combined length: ~25,000 characters with chunking (~27,000 bytes, 7 chunks)

Multi-Speaker Mode

Maximum text length: 4,000 bytes (no chunking support)
Speaker count: Exactly 2 speakers required (API constraint, not 1 or 3+)
Text format: Must use speaker tags (e.g., "Alice: Hello! Bob: Hi there!")
Voice assignment: Each speaker must have a distinct voice from MULTI_SPEAKER_VOICES list
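
The multi-speaker rules above can be checked before any API call is made. The sketch below is an assumption about how such a check might look, not the server's actual validation. The voice set is a hypothetical three-name subset standing in for the real MULTI_SPEAKER_VOICES list.

```python
# Hypothetical subset; the real list lives in the server's MULTI_SPEAKER_VOICES.
MULTI_SPEAKER_VOICES = {"kore", "puck", "charon"}
MAX_BYTES = 4_000


def validate_multi_speaker(text, speakers, voices=MULTI_SPEAKER_VOICES):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        problems.append("text exceeds 4,000 bytes")
    if len(speakers) != 2:
        problems.append("exactly 2 speakers are required")
    chosen = [s.get("voice") for s in speakers]
    for voice in chosen:
        if voice not in voices:
            problems.append(f"unknown multi-speaker voice: {voice!r}")
    if len(set(chosen)) != len(chosen):
        problems.append("each speaker needs a distinct voice")
    for s in speakers:
        # Each speaker name should appear as a "Name:" tag somewhere in the text.
        if f"{s.get('speaker')}:" not in text:
            problems.append(f"no '{s.get('speaker')}:' tag found in text")
    return problems
```

Returning every problem at once, rather than raising on the first, lets a client fix the request in a single pass.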

Common Edge Cases

❌ Multi-speaker with 1 speaker → API Error
❌ Multi-speaker with 3+ speakers → API Error
❌ Multi-speaker with 5,000 bytes → Validation Error
❌ Single-speaker with emoji-heavy text → May hit byte limit unexpectedly
✅ Single-speaker with 10,000 bytes → Auto-chunks into 3 segments
✅ Multi-speaker with 3,500 bytes, 2 speakers → Works perfectly

Audio Duration Limits

Per-Chunk Duration Cap

Maximum audio per API call: ~5 minutes 27 seconds (327 seconds)
Undocumented limit: This is a preview-phase restriction not officially documented
Behavior: API silently truncates audio at this duration
Text-to-duration ratio: Approximately 1,800 bytes = 1 minute of audio (varies by content)

Combined Audio with Chunking

Maximum combined duration: ~12 minutes (720 seconds)
Implementation: Multiple 3,900-byte chunks concatenated into single WAV file
Calculation: 7 chunks × ~2.2 min/chunk = ~15 min theoretical (limited by playback timeout)
Practical limit: ~25,000 characters due to 12-minute playback timeout
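
Concatenating chunk audio into a single WAV file can be done with the standard wave module. The following is a sketch under the assumption that every chunk shares one format. concat_wavs and silent_wav are hypothetical names, and the server's own implementation may differ.

```python
import io
import wave


def concat_wavs(sources, destination):
    """Append the frames of each WAV in `sources` into one WAV at `destination`."""
    sources = list(sources)
    if not sources:
        raise ValueError("no chunks to combine")
    params = None
    with wave.open(destination, "wb") as out:
        for source in sources:
            with wave.open(source, "rb") as chunk:
                fmt = chunk.getparams()[:3]  # (channels, sample width, frame rate)
                if params is None:
                    params = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != params:
                    raise ValueError("chunks must share the same audio format")
                out.writeframes(chunk.readframes(chunk.getnframes()))


def silent_wav(seconds, rate=24_000):
    """Demo helper: an in-memory mono 16-bit WAV containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer
```

Both arguments accept file paths or binary file objects, so the same function works for temp files on disk and for in-memory buffers.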

Duration Estimation Examples

Text Length    Chunks    Est. Duration    Chunking?
-----------    ------    -------------    ---------
1,000 bytes    1         ~33 seconds      No
3,900 bytes    1         ~2.2 minutes     No
8,000 bytes    3         ~4.5 minutes     Yes
15,000 bytes   4         ~8.3 minutes     Yes
25,000 bytes   7         ~13.9 minutes    Yes (may time out)
30,000 bytes   8         ~16.7 minutes    Yes (⚠️ exceeds timeout)
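
The table follows from two ratios stated in this README: 3,900 bytes per chunk and roughly 1,800 bytes per minute of audio. A small sketch that approximately reproduces it is shown below. The chunk count is a lower bound, since sentence-boundary splitting can leave chunks short of 3,900 bytes and so produce more of them.

```python
import math

CHUNK_BYTES = 3_900        # recommended chunk size from this README
BYTES_PER_MINUTE = 1_800   # rough text-to-duration ratio from this README
PLAYBACK_TIMEOUT_S = 720   # 12-minute playback timeout


def estimate(text_bytes):
    """Return (chunks, estimated_seconds, exceeds_timeout) for a text size."""
    chunks = max(1, math.ceil(text_bytes / CHUNK_BYTES))
    seconds = text_bytes / BYTES_PER_MINUTE * 60
    return chunks, seconds, seconds > PLAYBACK_TIMEOUT_S


if __name__ == "__main__":
    for size in (1_000, 3_900, 8_000, 15_000, 25_000, 30_000):
        chunks, seconds, too_long = estimate(size)
        print(f"{size:>6} bytes  {chunks} chunks  ~{seconds / 60:.1f} min  timeout={too_long}")
```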

Playback Timeout

Timeout Configuration

Hard limit: 720 seconds (12 minutes)
Rationale: Balances usability with resource management
Applies to: PowerShell System.Media.SoundPlayer playback only
Does not affect: File generation (saved files can be any length)

Timeout Behavior

On timeout: Audio file is preserved with manual playback instructions
Error message: Includes file path for manual playback via Windows Media Player
File cleanup: Skipped on timeout to allow manual access
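
The timeout behavior described here maps naturally onto subprocess.run with a timeout argument. The sketch below shows the general pattern only. run_player is a hypothetical function, and the demo uses a sleeping Python child process as a stand-in for the real PowerShell player.

```python
import subprocess
import sys

PLAYBACK_TIMEOUT_S = 720  # 12 minutes, as stated above


def run_player(command, wav_path, timeout=PLAYBACK_TIMEOUT_S):
    """Run a playback command; on timeout, keep the file and say where it is."""
    try:
        subprocess.run(command, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Cleanup is skipped on purpose so the audio can be played manually.
        return {
            "played": False,
            "keep_file": True,
            "message": f"Playback timed out after {timeout} seconds. File kept at: {wav_path}",
        }
    return {"played": True, "keep_file": False, "message": "Playback finished"}


if __name__ == "__main__":
    # Stand-in for a real player: a child process that sleeps too long.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_player(slow, "output.wav", timeout=1))
```

subprocess.run kills the child process when the timeout expires, so no player is left running in the background.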

Avoiding Timeout

✅ Keep text under 20,000 characters for reliable playback
✅ Use save_file parameter for long content, play manually
✅ Split very long content into multiple generation calls
❌ Attempting 30,000+ characters in a single call (playback will time out)

Voice System Limitations

Voice List Separation

Single-speaker voices: 30 voices with capitalized names (Kore, Puck, Charon, etc.)
Multi-speaker voices: 30 different voices with lowercase names (kore, puck, charon, etc.)
Not interchangeable: Using single-speaker voice in multi-speaker mode causes API error

Example Validation Errors

# ❌ Lowercase voice name in single-speaker mode
generate_speech(text="Hello", voice="kore")  # Error: "kore" not in AVAILABLE_VOICES

# ❌ Capitalized voice names in multi-speaker mode
generate_multi_speaker_speech(
    text="A: Hi! B: Hello!",
    speakers=[
        {"speaker": "A", "voice": "Kore"},  # Error: "Kore" not in MULTI_SPEAKER_VOICES
        {"speaker": "B", "voice": "Puck"},
    ],
)

# ✅ Correct usage
generate_speech(text="Hello", voice="Kore")  # Single-speaker: capitalized name
generate_multi_speaker_speech(
    text="A: Hi! B: Hello!",
    speakers=[
        {"speaker": "A", "voice": "kore"},  # Multi-speaker: lowercase names
        {"speaker": "B", "voice": "puck"},
    ],
)

Automatic Chunking

For text longer than 3900 bytes, the system automatically:

Splits at sentence boundaries: Intelligently chunks text at periods, exclamation points, question marks, and newlines
Generates episodes: Creates a separate audio file for each chunk (up to 3,900 bytes of text each)
Concatenates seamlessly: Combines all chunks into a single WAV file
Cleans up: Removes temporary chunk files after concatenation
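
The splitting step can be sketched as a greedy packer that keeps whole sentences together and measures size in UTF-8 bytes. This is an illustration of the approach described above, not the code in gemini_tts.py. split_into_chunks is a hypothetical name, and the simple regex also splits at periods inside numbers or abbreviations.

```python
import re

CHUNK_BYTES = 3_900


def _size(text):
    return len(text.encode("utf-8"))


def split_into_chunks(text, max_bytes=CHUNK_BYTES):
    """Greedily pack whole sentences into chunks of at most max_bytes (UTF-8)."""
    if max_bytes < 4:
        raise ValueError("max_bytes must hold at least one UTF-8 character")
    # Keep each terminator (., !, ?, newline) attached to its sentence.
    sentences = re.findall(r"[^.!?\n]*[.!?\n]+\s*|[^.!?\n]+$", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and _size(current + sentence) > max_bytes:
            chunks.append(current)
            current = ""
        # A single sentence longer than the limit is cut between characters.
        while _size(sentence) > max_bytes:
            piece = sentence.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
            chunks.append(piece)
            sentence = sentence[len(piece):]
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks gives back the original text unchanged, which makes the function easy to test.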

Optimized for Efficiency:

3900-byte chunks maximize each API call while maintaining 100-byte safety buffer
Reduces API calls by ~11% compared to smaller chunk sizes
Example: 25,000-byte text = ~7 API calls (vs 8 with smaller chunks)

Example Use Cases:

Text: 10,000 bytes (3 chunks)
→ Chunk 1: 3,900 bytes (~2.2 min audio)
→ Chunk 2: 3,900 bytes (~2.2 min audio)
→ Chunk 3: 2,200 bytes (~1.2 min audio)
→ Combined: 10,000 bytes (~5.6 min audio)
→ Only 3 API calls

Text: 25,000 characters (~27,000 bytes, 7 chunks)
→ Supports up to 12 minutes of combined audio
→ Only 7 API calls with 3900-byte chunks
→ Automatically managed with intelligent sentence-boundary splitting

Operational Limitations

Platform Requirements

Windows-Only Playback

Audio playback: Requires Windows 11 with PowerShell 5.1+
Limitation: Uses System.Media.SoundPlayer which is Windows-specific
Alternative: On non-Windows platforms, use save_file parameter and play manually
File generation: Works on any platform (Linux, macOS, Windows)

PowerShell Dependencies

Required for playback: PowerShell execution policy must allow script execution
Check policy: Get-ExecutionPolicy (should be RemoteSigned or Unrestricted)
Set policy: Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
Bypass for testing: Not recommended for security reasons

Performance Constraints

API Rate Limits

Gemini API: Subject to standard Gemini API rate limits (varies by account tier)
Chunking impact: Each chunk = 1 API call (e.g., 7 chunks = 7 API calls)
Error handling: Rate limit errors return 429 status with retry guidance
Recommended: Add retry logic with exponential backoff for production use
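
A minimal sketch of retry with exponential backoff is given below. RateLimitError is a placeholder for whichever exception the API client raises on HTTP 429. It is not a class from google-genai, and with_backoff is likewise a hypothetical helper.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for the exception the API client raises on HTTP 429."""


def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()`; on RateLimitError wait base_delay * 2**attempt plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Jitter spreads out retries from concurrent callers.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

With chunked generation, wrapping each per-chunk API call separately means a rate-limit error does not discard chunks that already succeeded.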

Memory Usage

Audio buffering: Full audio loaded into memory before playback
Chunking: Each chunk temporarily stored in memory during concatenation
Large files: 25,000-character audio (~30MB WAV) requires ~100MB memory overhead
Recommendation: Monitor memory for very long audio generation

File System

Temporary files: Created in system temp directory (tempfile.mkstemp())
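Disk space: Roughly 6 MB per full 3,900-byte chunk by the estimates in this README (about 2.2 minutes at 48,000 bytes per second); chunk files are cleaned up after concatenation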
Concurrent usage: Multiple simultaneous generations may exhaust temp space
Cleanup: Automatic on success, manual cleanup needed after crashes
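
One way to limit leftover temp files is a context manager around tempfile.mkstemp() that always removes the file unless told to keep it on failure. This is a sketch of that pattern. temp_wav and its keep_on_error flag are hypothetical and not taken from the server code.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_wav(keep_on_error=False):
    """Yield a temp .wav path and delete it afterwards.

    With keep_on_error=True the file survives an exception, mirroring the
    preserve-on-timeout behavior described in this README.
    """
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # mkstemp returns an open descriptor; close it before reuse
    failed = False
    try:
        yield path
    except BaseException:
        failed = True
        raise
    finally:
        if not (failed and keep_on_error) and os.path.exists(path):
            os.remove(path)
```

This covers Python exceptions only. A hard crash or killed process still leaves files behind for manual cleanup.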

Known Issues & Workarounds

Issue: Playback Timeout on Long Audio

Symptom: "Playback timed out" error after 12 minutes
Workaround: Use save_file parameter and play manually
Fix: Split content into multiple shorter generations

Issue: Silent API Truncation at 5:27

Symptom: Audio cuts off before text ends (no error)
Cause: Undocumented API duration limit during preview phase
Workaround: Use automatic chunking (enabled by default)
Detection: Compare generated audio duration to expected duration
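
The detection step can be sketched by reading the WAV duration with the standard wave module and comparing it to an estimate from text length. The 1,800 bytes-per-minute ratio comes from this README and varies by content. The 0.5 tolerance and the function names are assumptions chosen for illustration.

```python
import io
import wave

BYTES_PER_MINUTE = 1_800  # rough ratio from this README; varies by content


def wav_seconds(wav_file):
    """Duration in seconds of a WAV given as a path or binary file object."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def looks_truncated(text, wav_file, tolerance=0.5):
    """True when the audio is shorter than `tolerance` times the estimate."""
    expected = len(text.encode("utf-8")) / BYTES_PER_MINUTE * 60
    return wav_seconds(wav_file) < expected * tolerance


def silent_wav(seconds, rate=24_000):
    """Demo helper: an in-memory mono 16-bit WAV containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer
```

Because speaking rate varies, a loose tolerance avoids false alarms. The check is meant to catch audio that is far too short, not small deviations.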

Issue: Multi-Byte Character Byte Count

Symptom: Text seems under 4,000 characters but still fails validation
Cause: Emoji and non-ASCII characters consume multiple bytes
Workaround: Use len(text.encode('utf-8')) to check actual byte count
Example: "Hello 👋👋👋" = 18 bytes (6 bytes for "Hello " plus 3 × 4 bytes of emoji), not 9

Issue: PowerShell Window Flash

Symptom: PowerShell window briefly appears during playback
Cause: Windows subprocess creation for audio playback
Impact: Minor visual distraction, does not affect functionality
No workaround: Inherent to PowerShell-based playback approach

Issue: Audio Device Conflicts

Symptom: Playback fails with "device busy" or no sound
Cause: Another application using audio device exclusively
Workaround: Close other audio applications, use save_file to play later
Detection: Check Windows audio mixer for exclusive-mode applications

MCP Client-Specific Limitations

Claude Desktop Integration

Configuration: Must restart Claude Desktop after config changes
API Key: Cannot be changed during session (requires restart)
Tool discovery: May take 5-10 seconds after startup
Concurrency: One generation at a time (stdio-based communication)

Stdio Protocol

Single-threaded: Server processes one request at a time
No streaming: Audio must complete before playback begins
Large responses: JSON responses with embedded audio data can be large (>100KB)
Timeout: MCP client timeout separate from playback timeout (check client docs)

Security Considerations

API Key Exposure

Environment variables: Visible to all processes running as same user
Configuration files: Store API key in plaintext (ensure proper file permissions)
Recommendation: Use User-level environment variable, not System-level
Best practice: Rotate API keys regularly, never commit to version control

Temporary File Security

File permissions: Temp files inherit system temp directory permissions
Content exposure: Audio files contain generated speech (consider sensitive content)
Cleanup: Failed operations may leave temp files (contain generated audio)
Recommendation: Use save_file with explicit permissions for sensitive content

Network Security

TLS: All Gemini API calls use HTTPS (enforced by SDK)
Data transmission: Text sent to Google servers for processing
Privacy: Google's data processing terms apply (review Gemini API ToS)
Consideration: Avoid sending PII or confidential information without proper authorization

Usage Example

Once configured, you can use the tools through your MCP client:

User: Generate speech saying "Welcome to Gemini TTS!" using the Aoede voice
Assistant: [Uses generate_speech tool with text and voice parameters]

File Structure

mcp-gemini-tts/
├── src/
│   ├── gemini_tts.py    # TTS wrapper class
│   └── server.py         # MCP server implementation
├── pyproject.toml        # Project metadata
├── requirements.txt      # Python dependencies
└── README.md            # This file

Troubleshooting

Installation & Setup Issues

Problem: ImportError or ModuleNotFoundError

Solution:
1. Verify Python version: python --version (must be 3.10+)
2. Reinstall dependencies: pip install -r requirements.txt
3. Check virtual environment: Ensure you're in the correct venv
4. Try: pip install --upgrade mcp google-genai

Problem: "GOOGLE_API_KEY not found" error

Solution:
1. Check environment variable: echo $env:GOOGLE_API_KEY
2. Set if missing: [System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-key', 'User')
3. Restart PowerShell/terminal after setting
4. Verify API key is valid at https://aistudio.google.com/app/apikey

Problem: "Server failed to start" in Claude Desktop

Solution:
1. Check config file path: %APPDATA%\Claude\claude_desktop_config.json
2. Verify absolute path to server.py (use double backslashes)
3. Check Python is in PATH: python --version
4. Review Claude Desktop logs for specific errors
5. Try running server manually: python src/server.py

Audio Playback Issues

Problem: No sound during playback

Solution:
1. Check Windows audio mixer (other apps using audio exclusively?)
2. Test audio device: Right-click speaker icon → Sound settings → Test
3. Verify PowerShell execution: Get-ExecutionPolicy (should be RemoteSigned/Unrestricted)
4. Try manual playback: Use save_file parameter, open in Windows Media Player
5. Check audio service: services.msc → Windows Audio service (should be Running)

Problem: "Playback timed out after 720 seconds"

Solution:
1. Cause: the text is too long for the 12-minute playback window (roughly >20,000 characters)
2. Use save_file parameter: save_file="output.wav"
3. Play manually after generation
4. Or split into multiple shorter generations
5. File is preserved at path shown in error message

Problem: PowerShell window flashes during playback

This is normal behavior:
- Caused by subprocess creation for PowerShell audio player
- Does not affect functionality
- No workaround available (inherent to implementation)

API & Generation Issues

Problem: "Text exceeds maximum length" error

Solution:
Single-speaker mode:
- Automatic chunking should handle this
- If error persists, check for multi-byte characters (emoji)
- Calculate bytes: len(text.encode('utf-8'))

Multi-speaker mode:
- No automatic chunking (hard 4,000 byte limit)
- Reduce text length or split into multiple calls
- Check byte count, not character count

Problem: Audio cuts off before text ends (silent truncation)

Cause: API duration limit of about 5 minutes 27 seconds per call

Solution:
- Automatic chunking handles this (enabled by default)
- If issue persists, manually split text into smaller segments
- Each segment should be <3,500 bytes for safety

Problem: "Rate limit exceeded" (429 error)

Solution:
1. Wait 60 seconds before retrying
2. Reduce chunking (use shorter text)
3. Check account tier limits at Google AI Studio
4. Implement exponential backoff in production code

Problem: "Invalid voice name" error

Solution:
Single-speaker: Use capitalized names (Kore, Puck, Charon)
Multi-speaker: Use lowercase names (kore, puck, charon)
List voices: Use list_available_voices tool
Common mistake: Mixing voice cases between modes

MCP Client Issues

Problem: Tools don't appear in Claude Desktop

Solution:
1. Restart Claude Desktop completely (not just reload)
2. Check config file syntax (valid JSON, double backslashes in paths)
3. Verify server starts: python src/server.py (should show no errors)
4. Check Claude Desktop logs: %APPDATA%\Claude\logs
5. Wait 5-10 seconds after startup for tool discovery

Problem: "Server not responding" in MCP client

Solution:
1. Check server process is running
2. Verify stdio communication (server uses stdin/stdout)
3. Test server standalone: python src/server.py
4. Review MCP client logs for connection errors
5. Rule out a firewall blocking outbound HTTPS calls to the Gemini API (the stdio transport itself does not use the network)

Platform-Specific Issues

Problem: "PowerShell execution policy restricted"

Solution:
1. Check policy: Get-ExecutionPolicy
2. Set for current user: Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
3. Confirm: Get-ExecutionPolicy (should show RemoteSigned or Unrestricted)
4. If corporate policy prevents: Use save_file, play manually

Problem: Running on macOS or Linux

Audio playback will not work:
- Playback requires Windows-specific System.Media.SoundPlayer
- File generation works on all platforms
- Workaround: Always use save_file parameter, play with system audio player
- Alternative: Modify gemini_tts.py to use platform-specific WAV playback (afplay on macOS, aplay on Linux, etc.)
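
A platform-specific playback change could start from a function that picks the player command for each operating system. The sketch below is an assumption about how that might look. The README does not show the server's actual PowerShell invocation, so the Windows command here is illustrative, and playback_command is a hypothetical name.

```python
import platform


def playback_command(wav_path, system=None):
    """Return the argv list for playing a WAV file on the given platform."""
    system = system or platform.system()
    if system == "Windows":
        # SoundPlayer driven from PowerShell; paths containing a single quote
        # would need escaping before being placed in this script string.
        script = f"(New-Object System.Media.SoundPlayer '{wav_path}').PlaySync()"
        return ["powershell", "-NoProfile", "-Command", script]
    if system == "Darwin":
        return ["afplay", wav_path]  # ships with macOS
    if system == "Linux":
        return ["aplay", wav_path]   # ALSA utility; may need to be installed
    raise RuntimeError(f"No playback command known for {system!r}")
```

The returned list can be passed to subprocess.run together with the existing playback timeout.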

Debug Mode

Enable verbose logging:

# Add to top of src/server.py
import logging
logging.basicConfig(level=logging.DEBUG)

Test components individually:

# Test TTS API access
python -c "from src.gemini_tts import GeminiTTS; tts = GeminiTTS(); print('API access OK')"

# Test audio playback
python src/examples/test_playback.py

# Test chunking
python src/examples/test_chunking.py

Check generated files:

# View temp directory
echo $env:TEMP
cd $env:TEMP
Get-ChildItem *.wav | Sort-Object LastWriteTime -Descending | Select-Object -First 5

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
