Gemini TTS MCP Server
Provides text-to-speech capabilities using Google's Gemini TTS API with support for multiple voices, automatic chunking of long text, multi-speaker dialogue, and audio playback on Windows through PowerShell's System.Media.SoundPlayer.
A Model Context Protocol (MCP) server that provides text-to-speech capabilities using Google's Gemini TTS API.
Features
- Single-speaker TTS: Generate speech with 30 available voices
- Automatic chunking: Handles long text by splitting into logical episodes at sentence boundaries
- Multi-speaker support: Create dialogue with different voices for each speaker
- Automatic playback: Plays generated audio on Windows through PowerShell's System.Media.SoundPlayer
- File saving: Optional parameter to save audio files permanently
- Temporary file management: Auto-cleanup of generated audio files
- Environment-based API key: Secure API key management via environment variables
Prerequisites
Required
- Python 3.10 or higher: Required for FastMCP and async features
- Google API Key: Must have access to the Gemini API (specifically the gemini-2.5-flash-preview-tts model)
- Windows 11: Required for audio playback via PowerShell's System.Media.SoundPlayer
- PowerShell 5.1 or higher: Built into Windows 11, used for audio playback
Optional
- MCP Client: Such as Claude Desktop, to interact with the server through the Model Context Protocol
- Audio Output Device: Speakers or headphones for audio playback testing
API Access Requirements
- Active Google Cloud account with billing enabled
- Gemini API access (may require waitlist approval during preview)
- API key with TTS permissions enabled
Installation
Step 1: Clone the Repository
git clone <repository-url>
cd mcp-gemini-tts
Step 2: Install Dependencies
# Install required packages
pip install -r requirements.txt
Verify installation:
# Check Python version (should be 3.10+)
python --version
# Verify FastMCP is installed
python -c "import mcp; print('FastMCP installed successfully')"
Step 3: Configure API Key
Option A: User Environment Variable (Recommended)
# Set for current user (persists across sessions)
[System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-api-key-here', 'User')
# Restart PowerShell to apply changes
Option B: Session Environment Variable (Temporary)
# Set for current session only
$env:GOOGLE_API_KEY = "your-api-key-here"
Verify API key is set:
# Check environment variable
echo $env:GOOGLE_API_KEY
Step 4: Test Installation
# Test the server can start (Ctrl+C to stop)
python src/server.py
# Run example scripts to verify functionality
python src/examples/test_playback.py
Expected output: Server should start without errors, and test script should generate and play audio.
Configuration
MCP Client Setup
To use this server with Claude Desktop or other MCP clients, configure the MCP settings file.
Claude Desktop Configuration Location:
- Windows:
%APPDATA%\Claude\claude_desktop_config.json
Configuration:
{
  "mcpServers": {
    "gemini-tts": {
      "command": "python",
      "args": [
        "C:\\Projects\\mcp-gemini-tts\\src\\server.py"
      ],
      "env": {
        "GOOGLE_API_KEY": "your-api-key-here"
      }
    }
  }
}
Important Notes:
- Replace C:\\Projects\\mcp-gemini-tts\\src\\server.py with your actual installation path
- Use double backslashes (\\) in Windows paths for JSON
- Replace your-api-key-here with your actual Google API key
- Restart Claude Desktop after modifying the configuration file
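If hand-escaping backslashes is error-prone, one option is to let Python's json module produce the entry. This is a minimal sketch: build_config is a hypothetical helper, and the path and key are the same placeholders used above.

```python
import json


def build_config(server_path: str, api_key: str) -> str:
    """Return a claude_desktop_config.json body for the gemini-tts server."""
    config = {
        "mcpServers": {
            "gemini-tts": {
                "command": "python",
                "args": [server_path],
                "env": {"GOOGLE_API_KEY": api_key},
            }
        }
    }
    # json.dumps escapes every backslash in the Windows path as \\ for us.
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Raw string: single backslashes here come out doubled in the JSON text.
    print(build_config(r"C:\Projects\mcp-gemini-tts\src\server.py", "your-api-key-here"))
```

Paste the printed JSON into the configuration file (or merge the gemini-tts entry into an existing mcpServers object).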
Verify Configuration:
- Restart Claude Desktop
- Check that the gemini-tts tools appear in the available tools list
- Test with a simple command: "Generate speech saying 'test' using the Kore voice"
Alternative: Direct Server Usage
You can also run the server directly without an MCP client:
# Start the MCP server (communicates via stdio)
python src/server.py
# Or use the example scripts for direct testing
python src/examples/test_playback.py
python src/examples/test_chunking.py
Available Tools
1. generate_speech
Generate and play speech from text using a single voice. Automatically chunks long text (>3900 bytes) into logical episodes.
Parameters:
- text (required): Text to convert to speech (automatically chunked if >3900 bytes)
- voice (optional): Voice name (default: "Kore")
- play (optional): Whether to play audio after generation (default: true)
- save_file (optional): File path to save audio (e.g., "output.wav")
Example:
{
  "text": "Hello, this is a test of the Gemini TTS system!",
  "voice": "Puck",
  "play": true,
  "save_file": "my_audio.wav"
}
Long Text Example:
{
  "text": "Very long text that exceeds 3900 bytes will be automatically split into chunks at sentence boundaries, then combined into a single audio file...",
  "voice": "Aoede",
  "play": true
}
2. generate_multi_speaker_speech
Generate and play speech with multiple speakers/voices. Note: Multi-speaker does not support automatic chunking - text must be under 4000 bytes.
Parameters:
- text (required): Text to convert (use speaker tags like "Alice: Hello!"). Must be under 4000 bytes.
- speakers (required): Array of speaker configurations
  - speaker: Speaker name/identifier
  - voice: Voice to use for this speaker
- play (optional): Whether to play audio after generation (default: true)
- save_file (optional): File path to save audio (e.g., "dialogue.wav")
Example:
{
  "text": "Alice: Hello! How are you? Bob: I'm doing great, thanks!",
  "speakers": [
    {"speaker": "Alice", "voice": "kore"},
    {"speaker": "Bob", "voice": "puck"}
  ],
  "play": true,
  "save_file": "conversation.wav"
}
3. list_available_voices
List all available voice options for speech generation.
Parameters: None
Available Voices
The server supports 30 prebuilt voices:
- Kore, Puck, Charon (base voices)
- Kore-F, Puck-F, Charon-F (female variants)
- Kore-M, Puck-M, Charon-M (male variants)
- Aoede, Arcas, Fenrir (specialty voices)
- Regional variants: -G, -H, -I, -J, -K, -L suffixes
Use the list_available_voices tool to see the complete list.
Technical Details
- Model: gemini-2.5-flash-preview-tts
- Audio Format: PCM WAV, 24kHz, mono, 16-bit
- API: Google Gemini API via the google-genai Python SDK
- MCP Framework: Uses FastMCP (Python MCP implementation)
- Playback: PowerShell System.Media.SoundPlayer (Windows only)
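The audio format listed above maps directly onto Python's standard wave module. The following is a minimal sketch, assuming the audio has already been obtained as raw 16-bit PCM bytes; write_wav is a hypothetical helper and not necessarily how gemini_tts.py does it.

```python
import wave

SAMPLE_RATE = 24_000  # 24 kHz
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # 16-bit samples are 2 bytes each


def write_wav(path: str, pcm: bytes) -> None:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)


def pcm_bytes_per_second() -> int:
    """Raw audio data rate implied by the format constants."""
    return SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH
```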
API Limitations
Text Input Limits
Character vs. Byte Counting
- API Limit: 4,000 bytes (not characters) per text field
- UTF-8 Encoding: Multi-byte characters (emoji, non-ASCII) consume more bytes than their character count
- Example: "Hello 👋" is 10 bytes in UTF-8 (5 letters + 1 space + a 4-byte emoji), even though it is only 7 characters
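The difference between characters and bytes is easy to check in Python. In this short sketch, utf8_len and fits_api_limit are hypothetical helper names; the 4,000-byte default mirrors the limit stated above.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_api_limit(text: str, limit: int = 4000) -> bool:
    """True when the text is within the per-request byte limit."""
    return utf8_len(text) <= limit


sample = "Hello 👋"
print(len(sample), utf8_len(sample))  # 7 characters, 10 bytes
```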
Single-Speaker Mode
- Maximum per request: 4,000 bytes (hard API limit)
- Recommended chunk size: 3,900 bytes (100-byte safety buffer)
- Automatic chunking: Enabled by default for text >3,900 bytes
- Chunk boundaries: Splits at sentence endings (., !, ?, \n) to maintain natural pauses
- Maximum combined length: ~25,000 characters with chunking (~27,000 bytes, 7 chunks)
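The chunking rules above can be illustrated with a greedy splitter that packs whole sentences into chunks measured in UTF-8 bytes. This is a sketch under stated assumptions, not the server's actual algorithm: split_into_chunks is a hypothetical name, and the fallback for a single over-long sentence (a hard split on a character boundary) is a choice made here for completeness.

```python
import re

MAX_CHUNK_BYTES = 3900  # 4,000-byte API limit minus a 100-byte safety buffer


def split_into_chunks(text: str, max_bytes: int = MAX_CHUNK_BYTES) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_bytes UTF-8 bytes."""
    if max_bytes < 4:
        raise ValueError("max_bytes must hold at least one 4-byte character")
    # Split after ., !, ? or a newline, keeping the delimiter with its sentence.
    sentences = [s for s in re.split(r"(?<=[.!?\n])", text) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = current + sentence
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is hard-split on characters.
        while len(sentence.encode("utf-8")) > max_bytes:
            cut = max_bytes
            while len(sentence[:cut].encode("utf-8")) > max_bytes:
                cut -= 1
            chunks.append(sentence[:cut])
            sentence = sentence[cut:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Because no text is dropped, joining the chunks reproduces the input exactly, which makes the splitter easy to test.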
Multi-Speaker Mode
- Maximum text length: 4,000 bytes (no chunking support)
- Speaker count: Exactly 2 speakers required (API constraint, not 1 or 3+)
- Text format: Must use speaker tags (e.g., "Alice: Hello! Bob: Hi there!")
- Voice assignment: Each speaker must have a distinct voice from the MULTI_SPEAKER_VOICES list
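The multi-speaker constraints above lend themselves to a pre-flight check before any API call is made. The sketch below is illustrative only: validate_multi_speaker is a hypothetical function, and the three-voice set is a placeholder for the real MULTI_SPEAKER_VOICES list defined in the server code.

```python
MAX_TEXT_BYTES = 4000
# Placeholder subset (assumption); the full list lives in the server code.
MULTI_SPEAKER_VOICES = {"kore", "puck", "charon"}


def validate_multi_speaker(text: str, speakers: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    size = len(text.encode("utf-8"))
    if size > MAX_TEXT_BYTES:
        problems.append(f"text is {size} bytes; limit is {MAX_TEXT_BYTES}")
    if len(speakers) != 2:
        problems.append(f"expected exactly 2 speakers, got {len(speakers)}")
    voices = [s.get("voice") for s in speakers]
    for voice in voices:
        if voice not in MULTI_SPEAKER_VOICES:
            problems.append(f"voice {voice!r} is not in MULTI_SPEAKER_VOICES")
    if len(set(voices)) != len(voices):
        problems.append("each speaker needs a distinct voice")
    for s in speakers:
        name = s.get("speaker")
        # Speaker tags look like "Alice: Hello!" in the text.
        if name and f"{name}:" not in text:
            problems.append(f"no '{name}:' tag found in text")
    return problems
```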
Common Edge Cases
❌ Multi-speaker with 1 speaker → API Error
❌ Multi-speaker with 3+ speakers → API Error
❌ Multi-speaker with 5000 bytes → Validation Error
❌ Single-speaker with emoji-heavy text → May hit byte limit unexpectedly
✅ Single-speaker with 10,000 bytes → Auto-chunks into 3 segments
✅ Multi-speaker with 3,500 bytes, 2 speakers → Works perfectly
Audio Duration Limits
Per-Chunk Duration Cap
- Maximum audio per API call: ~5 minutes 27 seconds (327 seconds)
- Undocumented limit: This is a preview-phase restriction not officially documented
- Behavior: API silently truncates audio at this duration
- Text-to-duration ratio: Approximately 1,800 bytes = 1 minute of audio (varies by content)
Combined Audio with Chunking
- Maximum combined duration: ~12 minutes (720 seconds)
- Implementation: Multiple 3,900-byte chunks concatenated into single WAV file
- Calculation: 7 chunks × ~2.2 min/chunk = ~15 min theoretical (limited by playback timeout)
- Practical limit: ~25,000 characters due to 12-minute playback timeout
Duration Estimation Examples
Text Length     Chunks   Est. Duration    Chunking?
------------    ------   -------------    ---------
1,000 bytes     1        ~33 seconds      No
3,900 bytes     1        ~2.2 minutes     No
8,000 bytes     3        ~4.5 minutes     Yes
15,000 bytes    4        ~8.3 minutes     Yes
25,000 bytes    7        ~13.8 minutes    Yes (may time out)
30,000 bytes    8        ~16.5 minutes    ⚠️ Exceeds timeout
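The estimates in this table follow from two figures quoted earlier: 3,900-byte chunks and roughly 1,800 bytes per minute of audio. A rough calculator, assuming that ratio holds (it varies by content, so results can differ from the table by a few tenths of a minute); estimate is a hypothetical helper:

```python
import math

CHUNK_BYTES = 3900        # chunk size used by automatic chunking
BYTES_PER_MINUTE = 1800   # rough text-to-audio ratio; varies by content
PLAYBACK_TIMEOUT_S = 720  # 12-minute playback timeout


def estimate(text_bytes: int) -> dict:
    """Estimate chunk count and audio duration for a given UTF-8 byte length."""
    # Sentence-boundary splitting can produce more chunks than this lower bound.
    chunks = max(1, math.ceil(text_bytes / CHUNK_BYTES))
    seconds = text_bytes / BYTES_PER_MINUTE * 60
    return {
        "chunks": chunks,
        "minutes": round(seconds / 60, 1),
        "exceeds_timeout": seconds > PLAYBACK_TIMEOUT_S,
    }


for size in (1_000, 3_900, 8_000, 15_000, 25_000, 30_000):
    print(size, estimate(size))
```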
Playback Timeout
Timeout Configuration
- Hard limit: 720 seconds (12 minutes)
- Rationale: Balances usability with resource management
- Applies to: PowerShell System.Media.SoundPlayer playback only
- Does not affect: File generation (saved files can be any length)
Timeout Behavior
- On timeout: Audio file is preserved with manual playback instructions
- Error message: Includes file path for manual playback via Windows Media Player
- File cleanup: Skipped on timeout to allow manual access
Avoiding Timeout
✅ Keep text under 20,000 characters for reliable playback
✅ Use save_file parameter for long content, play manually
✅ Split very long content into multiple generation calls
❌ Attempting 30,000+ characters in single call (will timeout)
Voice System Limitations
Voice List Separation
- Single-speaker voices: 30 voices with capitalized names (Kore, Puck, Charon, etc.)
- Multi-speaker voices: 30 different voices with lowercase names (kore, puck, charon, etc.)
- Not interchangeable: Using a single-speaker voice in multi-speaker mode causes an API error
Example Validation Errors
# ❌ Lowercase voice in single-speaker mode
generate_speech(text="Hello", voice="kore")  # Error: "kore" not in AVAILABLE_VOICES
# ❌ Capitalized voices in multi-speaker mode
generate_multi_speaker_speech(
    text="A: Hi! B: Hello!",
    speakers=[
        {"speaker": "A", "voice": "Kore"},  # Error: "Kore" not in MULTI_SPEAKER_VOICES
        {"speaker": "B", "voice": "Puck"},
    ],
)
# ✅ Correct usage
generate_speech(text="Hello", voice="Kore")  # Single-speaker: capitalized name
generate_multi_speaker_speech(
    text="A: Hi! B: Hello!",
    speakers=[
        {"speaker": "A", "voice": "kore"},  # Multi-speaker: lowercase names
        {"speaker": "B", "voice": "puck"},
    ],
)
Automatic Chunking
For text longer than 3900 bytes, the system automatically:
- Splits at sentence boundaries: Intelligently chunks text at periods, exclamation points, question marks, and newlines
- Generates episodes: Creates separate audio files for each chunk (3900 bytes each)
- Concatenates seamlessly: Combines all chunks into a single WAV file
- Cleans up: Removes temporary chunk files after concatenation
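The concatenation step can be sketched with Python's standard wave module. This assumes every chunk is a WAV file in the same format (24kHz, mono, 16-bit per this README); concatenate_wavs is a hypothetical helper rather than the project's actual function.

```python
import wave


def concatenate_wavs(chunk_paths: list[str], output_path: str) -> None:
    """Append the frames of each chunk WAV, in order, into one output WAV."""
    if not chunk_paths:
        raise ValueError("no chunk files to concatenate")
    expected_format = None
    with wave.open(output_path, "wb") as out:
        for path in chunk_paths:
            with wave.open(path, "rb") as chunk:
                fmt = (chunk.getnchannels(), chunk.getsampwidth(), chunk.getframerate())
                if expected_format is None:
                    # The first chunk defines the output format.
                    expected_format = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected_format:
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(chunk.readframes(chunk.getnframes()))
```

Copying frames rather than raw file bytes matters here: each WAV file starts with its own header, so simply joining the files would leave header bytes in the middle of the audio.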
Optimized for Efficiency:
- 3900-byte chunks maximize each API call while maintaining 100-byte safety buffer
- Reduces API calls by ~11% compared to smaller chunk sizes
- Example: 25,000-byte text = ~7 API calls (vs 8 with smaller chunks)
Example Use Cases:
Text: 10,000 bytes (3 chunks)
→ Chunk 1: 3,900 bytes (~2.2 min audio)
→ Chunk 2: 3,900 bytes (~2.2 min audio)
→ Chunk 3: 2,200 bytes (~1.2 min audio)
→ Combined: 10,000 bytes (~5.6 min audio)
→ Only 3 API calls
Text: 25,000 characters (~27,000 bytes, 7 chunks)
→ Supports up to 12 minutes of combined audio
→ Only 7 API calls with 3900-byte chunks
→ Automatically managed with intelligent sentence-boundary splitting
Operational Limitations
Platform Requirements
Windows-Only Playback
- Audio playback: Requires Windows 11 with PowerShell 5.1+
- Limitation: Uses System.Media.SoundPlayer, which is Windows-specific
- Alternative: On non-Windows platforms, use the save_file parameter and play manually
- File generation: Works on any platform (Linux, macOS, Windows)
PowerShell Dependencies
- Required for playback: PowerShell execution policy must allow script execution
- Check policy: Get-ExecutionPolicy (should be RemoteSigned or Unrestricted)
- Set policy: Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
- Bypass for testing: Not recommended for security reasons
Performance Constraints
API Rate Limits
- Gemini API: Subject to standard Gemini API rate limits (varies by account tier)
- Chunking impact: Each chunk = 1 API call (e.g., 7 chunks = 7 API calls)
- Error handling: Rate limit errors return 429 status with retry guidance
- Recommended: Add retry logic with exponential backoff for production use
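Retry with exponential backoff can be sketched as a small generic wrapper. Everything here is an assumption for illustration: retry_with_backoff is a hypothetical helper, and how a 429 is detected depends on the exception type the SDK raises, so the caller supplies an is_retryable predicate instead of this sketch guessing at SDK internals.

```python
import random
import time


def retry_with_backoff(call, *, is_retryable, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Run call() until it succeeds, waiting longer after each retryable failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay,
            # plus up to 10% random jitter so parallel callers spread out.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

A caller might pass something like is_retryable=lambda exc: getattr(exc, "code", None) == 429, assuming the raised exception exposes the HTTP status as a code attribute. With chunked text, wrapping each chunk's API call separately avoids regenerating chunks that already succeeded.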
Memory Usage
- Audio buffering: Full audio loaded into memory before playback
- Chunking: Each chunk temporarily stored in memory during concatenation
- Large files: 25,000-character audio (~30MB WAV) requires ~100MB memory overhead
- Recommendation: Monitor memory for very long audio generation
File System
- Temporary files: Created in system temp directory (tempfile.mkstemp())
- Disk space: Each chunk ~3-5MB, cleaned up after concatenation
- Concurrent usage: Multiple simultaneous generations may exhaust temp space
- Cleanup: Automatic on success, manual cleanup needed after crashes
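One way to make temp-file cleanup robust against ordinary exceptions is a context manager around tempfile.mkstemp(). This is a sketch, with temp_wav as a hypothetical helper; it cannot help when the process is killed outright, which is the case where manual cleanup is still needed.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_wav(keep: bool = False):
    """Yield the path of a fresh temp .wav file; delete it afterwards unless keep is True."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    # mkstemp returns an open descriptor; close it so other code can reopen the path.
    os.close(fd)
    try:
        yield path
    finally:
        # Runs on success and on exceptions raised inside the with-block.
        if not keep and os.path.exists(path):
            os.remove(path)
```

Passing keep=True mirrors the timeout behavior described earlier, where the file is preserved for manual playback.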
Known Issues & Workarounds
Issue: Playback Timeout on Long Audio
- Symptom: "Playback timed out" error after 12 minutes
- Workaround: Use the save_file parameter and play manually
- Fix: Split content into multiple shorter generations
Issue: Silent API Truncation at 5:27
- Symptom: Audio cuts off before text ends (no error)
- Cause: Undocumented API duration limit during preview phase
- Workaround: Use automatic chunking (enabled by default)
- Detection: Compare generated audio duration to expected duration
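The detection step can be sketched by reading the WAV header and comparing the real duration against the rough 1,800-bytes-per-minute ratio quoted in this README. Both function names and the 50% tolerance are assumptions made for illustration; since the ratio varies by content, treat a positive result as a hint rather than proof.

```python
import wave

BYTES_PER_MINUTE = 1800  # rough text-to-audio ratio; varies by content


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file, computed from frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def looks_truncated(path: str, text: str, tolerance: float = 0.5) -> bool:
    """True when the audio is much shorter than the text length suggests."""
    expected = len(text.encode("utf-8")) / BYTES_PER_MINUTE * 60
    return wav_duration_seconds(path) < expected * tolerance
```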
Issue: Multi-Byte Character Byte Count
- Symptom: Text seems under 4,000 characters but still fails validation
- Cause: Emoji and non-ASCII characters consume multiple bytes
- Workaround: Use len(text.encode('utf-8')) to check actual byte count
- Example: "Hello 👋👋👋" = 18 bytes (6 ASCII bytes + 3 × 4-byte emoji), not 9 characters
Issue: PowerShell Window Flash
- Symptom: PowerShell window briefly appears during playback
- Cause: Windows subprocess creation for audio playback
- Impact: Minor visual distraction, does not affect functionality
- No workaround: Inherent to PowerShell-based playback approach
Issue: Audio Device Conflicts
- Symptom: Playback fails with "device busy" or no sound
- Cause: Another application using audio device exclusively
- Workaround: Close other audio applications, use save_file to play later
- Detection: Check Windows audio mixer for exclusive-mode applications
MCP Client-Specific Limitations
Claude Desktop Integration
- Configuration: Must restart Claude Desktop after config changes
- API Key: Cannot be changed during session (requires restart)
- Tool discovery: May take 5-10 seconds after startup
- Concurrency: One generation at a time (stdio-based communication)
Stdio Protocol
- Single-threaded: Server processes one request at a time
- No streaming: Audio must complete before playback begins
- Large responses: JSON responses with embedded audio data can be large (>100KB)
- Timeout: MCP client timeout separate from playback timeout (check client docs)
Security Considerations
API Key Exposure
- Environment variables: Visible to all processes running as same user
- Configuration files: Store API key in plaintext (ensure proper file permissions)
- Recommendation: Use User-level environment variable, not System-level
- Best practice: Rotate API keys regularly, never commit to version control
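When checking that the key is available, it helps to fail fast and to avoid printing the full value into logs. A minimal sketch follows; load_api_key and mask are hypothetical helpers, and only the GOOGLE_API_KEY variable name comes from this README.

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the API key from the environment, raising if it is missing or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


def mask(key: str) -> str:
    """Show only the last 4 characters so the key can appear in logs."""
    if len(key) <= 4:
        return "*" * len(key)
    return "*" * (len(key) - 4) + key[-4:]
```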
Temporary File Security
- File permissions: Temp files inherit system temp directory permissions
- Content exposure: Audio files contain generated speech (consider sensitive content)
- Cleanup: Failed operations may leave temp files (contain generated audio)
- Recommendation: Use save_file with explicit permissions for sensitive content
Network Security
- TLS: All Gemini API calls use HTTPS (enforced by SDK)
- Data transmission: Text sent to Google servers for processing
- Privacy: Google's data processing terms apply (review Gemini API ToS)
- Consideration: Avoid sending PII or confidential information without proper authorization
Usage Example
Once configured, you can use the tools through your MCP client:
User: Generate speech saying "Welcome to Gemini TTS!" using the Aoede voice
Assistant: [Uses generate_speech tool with text and voice parameters]
File Structure
mcp-gemini-tts/
├── src/
│ ├── gemini_tts.py # TTS wrapper class
│ └── server.py # MCP server implementation
├── pyproject.toml # Project metadata
├── requirements.txt # Python dependencies
└── README.md # This file
Troubleshooting
Installation & Setup Issues
Problem: ImportError or ModuleNotFoundError
Solution:
1. Verify Python version: python --version (must be 3.10+)
2. Reinstall dependencies: pip install -r requirements.txt
3. Check virtual environment: Ensure you're in the correct venv
4. Try: pip install --upgrade mcp google-genai
Problem: "GOOGLE_API_KEY not found" error
Solution:
1. Check environment variable: echo $env:GOOGLE_API_KEY
2. Set if missing: [System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-key', 'User')
3. Restart PowerShell/terminal after setting
4. Verify API key is valid at https://aistudio.google.com/app/apikey
Problem: "Server failed to start" in Claude Desktop
Solution:
1. Check config file path: %APPDATA%\Claude\claude_desktop_config.json
2. Verify absolute path to server.py (use double backslashes)
3. Check Python is in PATH: python --version
4. Review Claude Desktop logs for specific errors
5. Try running server manually: python src/server.py
Audio Playback Issues
Problem: No sound during playback
Solution:
1. Check Windows audio mixer (other apps using audio exclusively?)
2. Test audio device: Right-click speaker icon → Sound settings → Test
3. Verify PowerShell execution: Get-ExecutionPolicy (should be RemoteSigned/Unrestricted)
4. Try manual playback: Use save_file parameter, open in Windows Media Player
5. Check audio service: services.msc → Windows Audio service (should be Running)
Problem: "Playback timed out after 720 seconds"
Solution:
1. Text is too long (>20,000 characters)
2. Use save_file parameter: save_file="output.wav"
3. Play manually after generation
4. Or split into multiple shorter generations
5. File is preserved at path shown in error message
Problem: PowerShell window flashes during playback
This is normal behavior:
- Caused by subprocess creation for PowerShell audio player
- Does not affect functionality
- No workaround available (inherent to implementation)
API & Generation Issues
Problem: "Text exceeds maximum length" error
Solution:
Single-speaker mode:
- Automatic chunking should handle this
- If error persists, check for multi-byte characters (emoji)
- Calculate bytes: len(text.encode('utf-8'))
Multi-speaker mode:
- No automatic chunking (hard 4,000 byte limit)
- Reduce text length or split into multiple calls
- Check byte count, not character count
Problem: Audio cuts off before text ends (silent truncation)
Cause: 5:27 minute API duration limit
Solution:
- Automatic chunking handles this (enabled by default)
- If issue persists, manually split text into smaller segments
- Each segment should be <3,500 bytes for safety
Problem: "Rate limit exceeded" (429 error)
Solution:
1. Wait 60 seconds before retrying
2. Reduce chunking (use shorter text)
3. Check account tier limits at Google AI Studio
4. Implement exponential backoff in production code
Problem: "Invalid voice name" error
Solution:
Single-speaker: Use capital case (Kore, Puck, Charon)
Multi-speaker: Use lowercase (kore, puck, charon)
List voices: Use list_available_voices tool
Common mistake: Mixing voice cases between modes
MCP Client Issues
Problem: Tools don't appear in Claude Desktop
Solution:
1. Restart Claude Desktop completely (not just reload)
2. Check config file syntax (valid JSON, double backslashes in paths)
3. Verify server starts: python src/server.py (should show no errors)
4. Check Claude Desktop logs: %APPDATA%\Claude\logs
5. Wait 5-10 seconds after startup for tool discovery
Problem: "Server not responding" in MCP client
Solution:
1. Check server process is running
2. Verify stdio communication (server uses stdin/stdout)
3. Test server standalone: python src/server.py
4. Review MCP client logs for connection errors
5. Ensure no firewall blocking (though stdio doesn't use network)
Platform-Specific Issues
Problem: "PowerShell execution policy restricted"
Solution:
1. Check policy: Get-ExecutionPolicy
2. Set for current user: Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
3. Confirm: Get-ExecutionPolicy (should show RemoteSigned or Unrestricted)
4. If corporate policy prevents: Use save_file, play manually
Problem: Running on macOS or Linux
Audio playback will not work:
- Playback requires Windows-specific System.Media.SoundPlayer
- File generation works on all platforms
- Workaround: Always use save_file parameter, play with system audio player
- Alternative: Modify gemini_tts.py to use platform-specific playback (afplay on macOS, aplay on Linux, etc.)
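A cross-platform variant could choose the player command by operating system. The sketch below is hypothetical and not part of gemini_tts.py: playback_command is an invented name, and aplay (from alsa-utils) is an assumption picked because it plays WAV files; any installed WAV-capable player would do.

```python
import platform
from typing import Optional


def playback_command(path: str, system: Optional[str] = None) -> list[str]:
    """Return the argv list that plays a WAV file on the given (or current) OS."""
    system = system or platform.system()
    if system == "Windows":
        # Double any single quotes so the path survives PowerShell quoting.
        escaped = path.replace("'", "''")
        script = f"(New-Object System.Media.SoundPlayer '{escaped}').PlaySync()"
        return ["powershell", "-NoProfile", "-Command", script]
    if system == "Darwin":
        return ["afplay", path]  # ships with macOS
    if system == "Linux":
        return ["aplay", path]   # part of alsa-utils; may need installing
    raise RuntimeError(f"no known audio player for {system}")
```

The result can be handed to subprocess.run(playback_command(path), check=True, timeout=720) to keep the same 12-minute playback timeout described earlier.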
Debug Mode
Enable verbose logging:
# Add to top of src/server.py
import logging
logging.basicConfig(level=logging.DEBUG)
Test components individually:
# Test TTS API access
python -c "from src.gemini_tts import GeminiTTS; tts = GeminiTTS(); print('API access OK')"
# Test audio playback
python src/examples/test_playback.py
# Test chunking
python src/examples/test_chunking.py
Check generated files:
# View temp directory
echo $env:TEMP
cd $env:TEMP
Get-ChildItem *.wav | Sort-Object LastWriteTime -Descending | Select-Object -First 5
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.