Wayland MCP Server

Enables AI assistants to automate Wayland desktop environments through screenshot analysis, mouse control, and keyboard input simulation. It supports visual context via VLM providers like Gemini and OpenRouter to perform complex, multi-step desktop actions.

<div align="center">

License: GPL3 Python Version MCP Platform

Model Context Protocol server for Wayland desktop automation

Features · Installation · Usage · API · Security


</div>

Overview

Wayland MCP Server enables AI assistants to interact with your Wayland desktop through the Model Context Protocol. It provides screenshot capture with VLM analysis, mouse control, keyboard input, and action chaining capabilities.

Why This Project?

Existing Wayland screenshot and automation tools often have reliability issues. This project provides a robust, MCP-native solution specifically designed for AI-driven desktop automation on modern Linux systems.

Quick Example

# AI Assistant: "Take a screenshot and tell me what's on screen"
→ Captures screen, analyzes with VLM, responds with description

# AI Assistant: "Click the OK button"  
→ Identifies button location from screenshot, moves mouse, clicks

# AI Assistant: "Fill out this form with test data"
→ Chains clicks and keyboard input to complete form automatically

Features

Visual Analysis

  • Screenshot capture with precision ruler overlays
  • VLM-powered image analysis via OpenRouter or Google Gemini
  • Multiple vision model support (Claude, GPT-4V, Gemini, Qwen)
  • Side-by-side image comparison and diff detection

Mouse Automation

  • Absolute and relative cursor positioning
  • Click operations (left, right, middle button)
  • Drag and drop with coordinate precision
  • Bidirectional scrolling (vertical/horizontal)

Keyboard Control

  • Text input simulation
  • Individual key press events
  • Complex key combinations

Action Sequences

  • Chain multiple operations together
  • Flexible syntax: chain:action1;action2;action3
  • Example: chain:click:100,200;type:hello;press:Enter

Installation

Prerequisites

  • Python 3.8 or higher
  • Wayland compositor (GNOME, KDE Plasma, Hyprland, Sway, etc.)
  • grim and slurp for screenshots (install them from your distribution's packages if they are missing)

Quick Install

uvx wayland-mcp

From Source

git clone https://github.com/kurojs/wayland-mcp.git
cd wayland-mcp
pip install -e .

Input Control Setup

For mouse and keyboard automation, run the setup script:

sudo ./setup.sh

What it does:

  • Installs evemu-tools package
  • Configures setuid for evemu-event
  • Adds user to input group
  • Creates udev rules for device access

After setup, log out and back in for group changes to take effect.
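
To confirm that the group change has reached your current session, a small check like the following can help. This is a minimal sketch rather than part of the project: the `in_group` helper is a hypothetical name, and it only inspects the groups of the process that runs it, which is what matters after logging back in.

```python
import grp
import os


def in_group(group_name, user_gids=None):
    """Return True if the given GIDs (default: this process's) include group_name."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this system at all.
        return False
    if user_gids is None:
        # Supplementary groups plus the effective primary group of this process.
        user_gids = set(os.getgroups()) | {os.getegid()}
    return gid in user_gids


if __name__ == "__main__":
    if in_group("input"):
        print("OK: this session is in the 'input' group")
    else:
        print("Not in 'input' yet: run setup.sh, then log out and back in")
```

If the check still fails after a fresh login, the setup script probably did not complete.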

Usage

MCP Configuration

The server supports two VLM providers:

Option 1: OpenRouter (multiple models via proxy)

{
  "mcpServers": {
    "wayland": {
      "command": "uvx",
      "args": ["wayland-mcp"],
      "env": {
        "OPENROUTER_API_KEY": "sk-or-v1-...",
        "VLM_PROVIDER": "openrouter",
        "VLM_MODEL": "qwen/qwen2.5-vl-72b-instruct:free",
        "XDG_RUNTIME_DIR": "/run/user/1000",
        "WAYLAND_DISPLAY": "wayland-0"
      }
    }
  }
}

Option 2: Google Gemini Direct (native API, faster)

{
  "mcpServers": {
    "wayland": {
      "command": "uvx",
      "args": ["wayland-mcp"],
      "env": {
        "GEMINI_API_KEY": "AIza...",
        "VLM_PROVIDER": "gemini",
        "VLM_MODEL": "gemini-2.5-flash",
        "XDG_RUNTIME_DIR": "/run/user/1000",
        "WAYLAND_DISPLAY": "wayland-0"
      }
    }
  }
}

Example for Claude Desktop (~/.config/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "wayland": {
      "command": "uvx",
      "args": ["wayland-mcp"],
      "env": {
        "GEMINI_API_KEY": "AIza...",
        "VLM_PROVIDER": "gemini",
        "VLM_MODEL": "gemini-2.5-flash",
        "XDG_RUNTIME_DIR": "/run/user/1000",
        "WAYLAND_DISPLAY": "wayland-0"
      }
    }
  }
}

Note: See CONFIG_EXAMPLES.md for more configuration examples including Cursor, OpenRouter models, and VLM provider options.
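
The examples above hard-code `/run/user/1000` and `wayland-0`, which are common values but depend on your user ID and session. The sketch below generates the same JSON from the current environment. It is illustrative only: `build_config` is a hypothetical helper, not something shipped with the server, and the API key is a placeholder.

```python
import json
import os


def build_config(provider="gemini", model="gemini-2.5-flash", api_key="AIza..."):
    """Build an mcpServers entry using this machine's Wayland session values."""
    key_name = {"gemini": "GEMINI_API_KEY", "openrouter": "OPENROUTER_API_KEY"}[provider]
    env = {
        key_name: api_key,  # placeholder; substitute your real key
        "VLM_PROVIDER": provider,
        "VLM_MODEL": model,
        # Fall back to the conventional per-user path when the variable is unset.
        "XDG_RUNTIME_DIR": os.environ.get("XDG_RUNTIME_DIR", f"/run/user/{os.getuid()}"),
        "WAYLAND_DISPLAY": os.environ.get("WAYLAND_DISPLAY", "wayland-0"),
    }
    return {"mcpServers": {"wayland": {"command": "uvx", "args": ["wayland-mcp"], "env": env}}}


if __name__ == "__main__":
    # Run this inside the Wayland session so the environment values are the live ones.
    print(json.dumps(build_config(), indent=2))
```

Paste the printed JSON into your MCP client's configuration file.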

Environment Variables

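VLM Provider Options

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| VLM_PROVIDER | Vision provider: openrouter or gemini | openrouter | No |
| OPENROUTER_API_KEY | OpenRouter API key | - | For OpenRouter |
| GEMINI_API_KEY | Google Gemini API key | - | For Gemini |
| VLM_MODEL | Model identifier | qwen/qwen2.5-vl-72b-instruct:free (OpenRouter) or gemini-2.5-flash (Gemini) | No |

Wayland Environment

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| XDG_RUNTIME_DIR | Wayland runtime directory | /run/user/1000 | Yes |
| WAYLAND_DISPLAY | Display identifier | wayland-0 | Yes |

Optional

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| WAYLAND_MCP_PORT | Server listen port | 4999 | No |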
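
The defaults and requirements listed above can be expressed as a small resolver. This is a sketch that mirrors the table, not the server's actual startup code; `resolve_settings`, `DEFAULT_MODELS`, and `KEY_VARS` are hypothetical names.

```python
import os

# Per-provider default models, as listed in the table.
DEFAULT_MODELS = {
    "openrouter": "qwen/qwen2.5-vl-72b-instruct:free",
    "gemini": "gemini-2.5-flash",
}
# Which API key variable each provider needs.
KEY_VARS = {"openrouter": "OPENROUTER_API_KEY", "gemini": "GEMINI_API_KEY"}


def resolve_settings(env=None):
    """Apply the documented defaults and report missing required variables."""
    env = os.environ if env is None else env
    provider = env.get("VLM_PROVIDER", "openrouter")
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"unknown VLM_PROVIDER: {provider!r}")
    required = (KEY_VARS[provider], "XDG_RUNTIME_DIR", "WAYLAND_DISPLAY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    return {
        "provider": provider,
        "model": env.get("VLM_MODEL", DEFAULT_MODELS[provider]),
        "port": int(env.get("WAYLAND_MCP_PORT", "4999")),
    }
```

Calling `resolve_settings()` with no argument reads the real environment; passing a dict makes it easy to check a configuration before writing it to a client config file.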

Getting API Keys:

Desktop Environment Compatibility

| Desktop | Status | Notes |
|---------|--------|-------|
| GNOME | ✅ Tested | Wayland by default on modern versions |
| KDE Plasma | ✅ Tested | Enable Wayland session at login |
| Hyprland | ✅ Tested | Native Wayland compositor |
| Sway | ✅ Should work | i3-compatible Wayland compositor |
| Others | ⚠️ Untested | Any wlroots-based compositor should work |

Example Commands

Through an MCP client, you can request actions like:

  • "Take a screenshot and analyze what's on the screen"
  • "Move the mouse to coordinates (100, 200) and click"
  • "Type 'hello world' and press Enter"
  • "Click at (50, 50), then drag to (200, 200)"

Available Tools

The server exposes the following MCP tools:

Screen Capture

  • capture_screenshot - Take a screenshot with optional ruler overlays
  • capture_and_analyze - Capture and analyze using VLM in one step

Vision Analysis

  • analyze_screenshot - Analyze an existing screenshot with custom prompt
  • compare_images - Compare two screenshots to detect differences

Mouse Control

  • move_mouse - Move cursor to coordinates (absolute or relative)
  • click_mouse - Perform left click at current position
  • drag_mouse - Drag between two coordinate points
  • scroll_mouse - Vertical scroll (positive=up, negative=down)

Action Execution

  • execute_action - Execute single action or chain multiple actions

Action Chain Syntax

Combine multiple actions with semicolons:

chain:action1;action2;action3

Supported Actions:

  • type:text - Type a text string
  • press:key - Press a specific key
  • click: or click:x,y - Click at the current location (click:) or at the given position (click:x,y)
  • move_to:x,y - Move to absolute coordinates
  • move_to:rel:x,y - Move relative to current position
  • drag:x1,y1:x2,y2 - Drag from point to point
  • scroll:amount - Scroll vertically (typical values: 15-120)
  • scroll:horizontal:amount - Scroll horizontally

Example Chains:

chain:move_to:100,200;click:;type:hello;press:Enter
chain:click:50,50;drag:50,50:200,200
chain:scroll:120;move_to:rel:0,-50;click:
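
To make the syntax concrete, here is a minimal sketch of how such a chain could be split into individual actions. It is an illustration, not the project's `chain_processor.py`: `parse_chain` and `parse_point` are hypothetical names, and the sketch assumes typed text never contains a semicolon.

```python
def parse_chain(command):
    """Split 'chain:a;b;c' (or a single action) into (name, argument) pairs."""
    body = command[len("chain:"):] if command.startswith("chain:") else command
    actions = []
    for part in body.split(";"):
        if not part:
            continue  # tolerate a trailing or doubled semicolon
        # Only the first colon separates the action name from its argument,
        # so 'drag:50,50:200,200' keeps '50,50:200,200' intact.
        name, _, arg = part.partition(":")
        actions.append((name, arg))
    return actions


def parse_point(text):
    """Turn 'x,y' into a pair of integers; negative values are allowed."""
    x, y = text.split(",")
    return int(x), int(y)


if __name__ == "__main__":
    for name, arg in parse_chain("chain:move_to:100,200;click:;type:hello;press:Enter"):
        print(name, repr(arg))
```

An empty argument, as in `click:`, corresponds to acting at the current cursor position.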

Security

⚠️ IMPORTANT SECURITY CONSIDERATIONS

This server grants extensive control over your desktop environment:

  • Full mouse and keyboard control
  • Screen capture capabilities
  • Ability to execute arbitrary input sequences

Best Practices

  • Only use with trusted AI models and MCP clients
  • Review action chains before execution in sensitive contexts
  • Consider running in a sandboxed or test environment
  • Be aware that the AI can perform any action you could perform manually
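
One way to review chains before execution is to check them against an allowlist of action names. The sketch below is a hypothetical client-side filter, not a feature of the server; `ALLOWED_ACTIONS` and `rejected_actions` are illustrative names, and the policy shown (pointer actions only) is just an example.

```python
# Hypothetical policy: permit pointer actions, block keyboard input.
ALLOWED_ACTIONS = frozenset({"move_to", "click", "scroll"})


def rejected_actions(command, allowed=ALLOWED_ACTIONS):
    """Return the action names in a chain that the allowlist does not permit."""
    body = command[len("chain:"):] if command.startswith("chain:") else command
    # The action name is everything before the first colon of each segment.
    names = [part.partition(":")[0] for part in body.split(";") if part]
    return [name for name in names if name not in allowed]


if __name__ == "__main__":
    chain = "chain:move_to:100,200;click:;type:hello;press:Enter"
    blocked = rejected_actions(chain)
    if blocked:
        print("Refusing chain; disallowed actions:", ", ".join(blocked))
```

A filter like this only narrows what a chain may contain; it does not replace running the server in a sandboxed or test environment.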

Permission Model

The setup script requires sudo access to:

  • Install system packages (evemu-tools)
  • Modify file permissions
  • Configure udev rules

After setup, the server runs with your user privileges but can control input devices through configured permissions.

Architecture

                    ┌─────────────────────────────────┐
                    │      MCP Client Layer           │
                    │   (Claude, Cursor, VS Code)     │
                    └───────────────┬─────────────────┘
                                    │
                            MCP Protocol (stdio/HTTP)
                                    │
                    ┌───────────────▼─────────────────┐
                    │    Wayland MCP Server           │
                    │    ┌─────────────────────┐      │
                    │    │  Core Components    │      │
                    │    ├─────────────────────┤      │
                    │    │ • FastMCP Handler   │      │
                    │    │ • Action Processor  │      │
                    │    │ • Chain Parser      │      │
                    │    └─────────────────────┘      │
                    └────┬────────────────┬───────────┘
                         │                │
         ┌───────────────┴────┐      ┌────┴──────────────┐
         │                    │      │                   │
    ┌────▼─────┐       ┌──────▼────┐ │ ┌──────────────┐  │
    │  Vision  │       │   Input   │ │ │   Screen     │  │
    │          │       │  Control  │ │ │   Capture    │  │
    ├──────────┤       ├───────────┤ │ ├──────────────┤  │
    │ • VLM    │       │ • evemu   │ │ │ • grim       │  │
    │ • Compare│       │ • Mouse   │ │ │ • slurp      │  │
    │          │       │ • Keyboard│ │ │ • PIL        │  │
    └──────────┘       └───────────┘ │ └──────────────┘  │
                                     │                   │
                                     └───────────────────┘
                                      Wayland Compositor

Troubleshooting

Input control not working

  • Ensure you ran sudo ./setup.sh
  • Log out and back in after setup
  • Verify you're in the input group: groups | grep input

Screenshots failing

  • Check if grim is installed: which grim
  • Verify WAYLAND_DISPLAY matches your session: echo $WAYLAND_DISPLAY

VLM analysis not working

  • Confirm the API key for your provider is set correctly: OPENROUTER_API_KEY for OpenRouter or GEMINI_API_KEY for Gemini, matching VLM_PROVIDER
  • Check API key permissions on OpenRouter dashboard
  • Test model availability: some models have usage limits

Server won't start

  • Check Python version: python3 --version (needs 3.8+)
  • Verify all dependencies: pip install -e .
  • Look for port conflicts if using custom WAYLAND_MCP_PORT
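
Several of the checks above can be run in one pass. The following is a minimal sketch, assuming the tools are looked up on `PATH` and the compositor socket lives under `XDG_RUNTIME_DIR`; `preflight` is a hypothetical helper and not part of the package.

```python
import os
import shutil
import sys


def preflight(env=None):
    """Return a dict of basic environment checks as booleans."""
    env = os.environ if env is None else env
    runtime_dir = env.get("XDG_RUNTIME_DIR", "")
    display = env.get("WAYLAND_DISPLAY", "")
    # WAYLAND_DISPLAY is normally a socket name inside XDG_RUNTIME_DIR,
    # but it may also be given as an absolute path.
    socket_path = display if os.path.isabs(display) else os.path.join(runtime_dir, display)
    return {
        "python": sys.version_info >= (3, 8),
        "grim": shutil.which("grim") is not None,
        "slurp": shutil.which("slurp") is not None,
        # Only checks that the path exists, not that a compositor answers on it.
        "wayland_socket": bool(display) and os.path.exists(socket_path),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

Run it from the same environment that launches the server, since an MCP client may start the server with different variables than your interactive shell.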

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Project Structure

wayland-mcp/
├── wayland_mcp/          # Main package
│   ├── server_mcp.py     # MCP server implementation
│   ├── screen_utils.py   # Screenshot & VLM analysis
│   ├── mouse_utils.py    # Mouse control functions
│   ├── keyboard_utils.py # Keyboard input handling
│   ├── chain_processor.py# Action chain parser
│   └── ...
├── README.md             # This file
├── CONFIG_EXAMPLES.md    # Configuration examples
├── CONTRIBUTING.md       # Contribution guidelines
├── setup.sh              # Permission setup script
└── pyproject.toml        # Package metadata

License

GPL-3.0 License - See LICENSE for details.

Acknowledgments


<div align="center"> Made for the Wayland desktop environment </div>
