Ubuntu Desktop Control MCP

Enables AI assistants to control Ubuntu desktops through screenshots, mouse clicks, and keyboard interactions using AT-SPI integration and computer vision. It features optimized element detection and workflow batching for fast and accurate visual interaction with desktop applications.

Ubuntu Desktop Control MCP Server

An MCP (Model Context Protocol) server that lets LLMs control your Ubuntu desktop by taking screenshots, moving and clicking the mouse, and sending keyboard input. This allows AI assistants to interact visually with your desktop applications.

⚡ NEW: Optimized Production Workflow

5x faster, 5x more accurate! Now using the same optimization techniques as Anthropic's Computer Use API:

  • 📸 Smart Screenshots: Auto-downsampled to 1280x720 (5x smaller)
  • 🎯 Numbered Elements: See what's clickable at a glance with overlaid IDs
  • 🤖 AT-SPI Integration: Automatic UI element detection using accessibility API
  • 📐 Percentage Coords: Resolution-agnostic positioning (no more pixel hunting!)
  • ⚡ Workflow Batching: Execute multiple actions in one MCP call
  • 🎪 Element Cache: Direct element interaction - "click element #5"

Example - Old way (8+ calls, ~15s):

take_screenshot() → analyze → grid overlay → zoom quadrant → find pixel → click → miss

Example - New way (1 call, ~3s):

take_screenshot() → "I see Pinta is element #5" → click_screen(element_id=5) → ✓
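The "click element #5" flow above implies a lookup from a numbered ID to a screen position. The sketch below shows one way such an element cache could work, assuming each detected element carries a bounding box and that IDs are renumbered from 1 on every screenshot. The `Element` and `ElementCache` classes are hypothetical and are not taken from the server's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A detected UI element with a bounding box in screen pixels (assumed layout)."""
    name: str
    x: int
    y: int
    width: int
    height: int

    def center(self):
        # Clicking the centre of the bounding box is the safest default target.
        return self.x + self.width // 2, self.y + self.height // 2


class ElementCache:
    """Keeps the elements from the most recent screenshot, keyed by overlay ID."""

    def __init__(self):
        self._elements = {}

    def replace(self, elements):
        # IDs start at 1 and are reassigned on every new screenshot.
        self._elements = {i: el for i, el in enumerate(elements, start=1)}

    def click_point(self, element_id):
        try:
            return self._elements[element_id].center()
        except KeyError:
            raise KeyError(
                f"unknown element id {element_id}; take a new screenshot"
            ) from None


cache = ElementCache()
cache.replace([Element("Files", 0, 0, 40, 40), Element("Pinta", 100, 200, 60, 40)])
print(cache.click_point(2))
```

A cache like this goes stale as soon as the screen changes, which is why the sketch asks for a fresh screenshot when an ID is not found.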

See README.md for full details.

Features

  • 📸 Screenshot Capture: Annotated screenshots with automatic element detection
  • 🔢 Element Detection: AT-SPI + CV fallback for robust UI element identification
  • 🖱️ Smart Clicking: Click by element ID or percentage coordinates
  • ⌨️ Keyboard Control: Type text and press keys/hotkeys
  • 🎯 Mouse Movement: Smooth cursor positioning with animation
  • 🚀 Workflow Batching: Execute multi-step tasks in single MCP call
  • 📊 Diagnostics: Display scaling detection, warnings, and recommendations

Quick Start

1. Prerequisites

  • Ubuntu Linux (X11 required, Wayland not fully supported)
  • Python 3.9+

2. Installation

From PyPI (Recommended)

pip install ubuntu-desktop-control

From Source

# Clone repository
git clone https://github.com/charettep/ubuntu-desktop-control-mcp.git
cd ubuntu-desktop-control-mcp

# Install system dependencies (requires sudo)
chmod +x scripts/install.sh
./scripts/install.sh

# Install Python dependencies
pip install -e .

Configuration

Claude Code

<details> <summary>Installation Methods</summary>

Method 1: CLI (Recommended)

claude mcp add --transport stdio ubuntu-desktop-control -- \
  ubuntu-desktop-control

Method 2: Manual Config

Edit ~/.claude/claude_desktop_config.json:

{
  "mcpServers": {
    "ubuntu-desktop-control": {
      "command": "ubuntu-desktop-control",
      "args": []
    }
  }
}

</details>

VS Code Insiders

<details> <summary>Installation Methods</summary>

Method 1: MCP Command

  1. Open Command Palette (Ctrl+Shift+P)
  2. Run MCP: Open Workspace Folder Configuration
  3. Add the server configuration below.

Method 2: Manual Config

Create .vscode/mcp.json in your workspace:

{
  "servers": {
    "ubuntu-desktop-control": {
      "type": "stdio",
      "command": "ubuntu-desktop-control",
      "args": []
    }
  }
}

</details>

Codex CLI

<details> <summary>Installation Methods</summary>

Method 1: CLI

codex mcp add ubuntu-desktop-control -- \
  ubuntu-desktop-control

Method 2: Manual Config

Edit ~/.config/codex/config.toml:

[mcp_servers.ubuntu-desktop-control]
type = "stdio"
command = "ubuntu-desktop-control"
args = []

</details>

Tools

Core Capabilities

| Tool | Description |
| --- | --- |
| take_screenshot | Capture the desktop (optionally per-monitor) with annotated elements. |
| click_screen | Click by element ID or percentage coordinates (supports per-monitor). |
| move_mouse | Move the cursor by element ID or percentage coordinates (supports per-monitor). |
| drag_mouse | Drag the cursor to coordinates while holding a mouse button. |
| type_text | Type text using the keyboard. |
| press_key | Press a specific key (e.g., 'enter', 'esc'). |
| press_hotkey | Press a combination of keys simultaneously (e.g., Ctrl+Shift+C). |
| get_screen_info | Get screen dimensions and display server type (X11/Wayland). |
| get_display_diagnostics | Troubleshoot scaling and coordinate mismatches. |
| map_GUI_elements_location | Detect and map UI elements (hitboxes) using computer vision. |
| convert_screenshot_coordinates | Convert pixels from a screenshot to logical click coordinates. |
| list_prompt_templates | List available prompt templates (for clients without native prompt support). |
| execute_workflow | Execute a batch of actions (screenshot/click/move/type/wait). |
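The idea behind execute_workflow, running several actions in one call, can be illustrated with a small dispatcher. This is a sketch under stated assumptions: the step format (a dict with an `action` key plus parameters) and the `run_workflow` function are hypothetical and do not describe the server's actual request schema.

```python
def run_workflow(steps, handlers):
    """Run steps in order and stop at the first failure.

    steps:    list of dicts such as {"action": "click", "element_id": 5}
              (hypothetical format, chosen for illustration).
    handlers: mapping from action name to a callable taking keyword arguments.
    """
    results = []
    for index, step in enumerate(steps):
        params = dict(step)  # copy so the caller's step is not mutated
        action = params.pop("action")
        handler = handlers.get(action)
        if handler is None:
            results.append({"step": index, "action": action, "ok": False,
                            "error": "unknown action"})
            break
        try:
            output = handler(**params)
        except Exception as exc:
            # Later steps usually depend on earlier ones, so abort the batch.
            results.append({"step": index, "action": action, "ok": False,
                            "error": str(exc)})
            break
        results.append({"step": index, "action": action, "ok": True,
                        "output": output})
    return results


# Stand-in handlers; a real server would drive the mouse and keyboard here.
demo_handlers = {
    "click": lambda element_id: f"clicked #{element_id}",
    "type": lambda text: f"typed {text!r}",
}
for result in run_workflow(
    [{"action": "click", "element_id": 5}, {"action": "type", "text": "hello"}],
    demo_handlers,
):
    print(result)
```

Stopping at the first failed step keeps a missed click from cascading into keystrokes sent to the wrong window.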

Prompt Rendering Tools

These tools allow clients without native prompt support (like Codex CLI) to render prompt templates as text.

| Tool | Description |
| --- | --- |
| render_prompt_baseline_display_check | Render the baseline display check prompt. |
| render_prompt_capture_full_desktop | Render the full desktop capture prompt. |
| render_prompt_capture_region_for_task | Render the region capture prompt. |
| render_prompt_convert_screenshot_coordinates | Render the coordinate conversion prompt. |
| render_prompt_safe_click | Render the safe click prompt. |
| render_prompt_hover_and_capture | Render the hover and capture prompt. |
| render_prompt_coordinate_mismatch_recovery | Render the mismatch recovery prompt. |
| render_prompt_end_to_end_capture_and_act | Render the end-to-end workflow prompt. |

Prompts

| Prompt | Description |
| --- | --- |
| baseline_display_check | Check display settings and scaling before starting tasks. |
| capture_full_desktop | Capture and summarize the full desktop state. |
| capture_region_for_task | Capture a specific region for detailed inspection. |
| safe_click | Perform a click with safety checks and scaling awareness. |
| hover_and_capture | Hover to reveal UI elements, then capture. |
| coordinate_mismatch_recovery | Diagnose and fix missed clicks. |
| end_to_end_capture_and_act | Plan and execute a full interaction loop. |

Configuration & Customization

Environment Variables

The server relies on standard Linux/X11 environment variables to locate and interact with the desktop session.

| Variable | Description | Default |
| --- | --- | --- |
| DISPLAY | X11 display identifier. Required for the server to know which screen to control. | :0 |
| XDG_SESSION_TYPE | Used to detect if running on X11 or Wayland. | unknown |
| XAUTHORITY | Path to X11 authority file. Required if running from a different user context (e.g., sudo, docker) or over SSH. | ~/.Xauthority |
| UDC_FORCE_COORDS | Force coordinate clicks (disable AT-SPI action clicks). | unset |
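To see which values the server would pick up in a given shell, the variables above can be read with their documented defaults. This is a minimal sketch: `read_display_env` is a hypothetical helper, and treating any non-empty UDC_FORCE_COORDS value as "enabled" is an assumption, since the accepted values are not stated here.

```python
import os
from pathlib import Path


def read_display_env(environ=None):
    """Resolve the display-related variables, applying the documented defaults."""
    env = os.environ if environ is None else environ
    return {
        "DISPLAY": env.get("DISPLAY", ":0"),
        "XDG_SESSION_TYPE": env.get("XDG_SESSION_TYPE", "unknown"),
        "XAUTHORITY": env.get("XAUTHORITY", str(Path.home() / ".Xauthority")),
        # Assumption: any non-empty value switches coordinate clicks on.
        "UDC_FORCE_COORDS": bool(env.get("UDC_FORCE_COORDS")),
    }


for name, value in read_display_env().items():
    print(f"{name} = {value}")
```

Running this in the same context as the MCP client (for example over SSH or inside a container) shows quickly whether DISPLAY or XAUTHORITY need to be passed explicitly.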

Passing Environment Variables

You can customize these variables in your MCP client configuration.

Claude Desktop (claude_desktop_config.json)

{
  "mcpServers": {
    "ubuntu-desktop-control": {
      "command": "ubuntu-desktop-control",
      "args": [],
      "env": {
        "DISPLAY": ":0",
        "XAUTHORITY": "/home/user/.Xauthority"
      }
    }
  }
}

VS Code (.vscode/mcp.json)

{
  "servers": {
    "ubuntu-desktop-control": {
      "command": "ubuntu-desktop-control",
      "args": [],
      "env": {
        "DISPLAY": ":0"
      }
    }
  }
}

Display Scaling & Coordinates

If clicks land in the wrong place, you likely have a HiDPI display scaling mismatch (e.g., logical 1920x1080 vs physical 3840x2160).

Solutions:

  1. Auto-scale: Use click_screen(..., auto_scale=True) to let the server handle it.
  2. Diagnostics: Run get_display_diagnostics() to see the scaling factor.
  3. Element IDs: Use take_screenshot(detect_elements=True) and click via element_id or percentage coordinates.
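The mismatch described above is a ratio between the physical screenshot size and the logical desktop size. The sketch below works through the 3840x2160 versus 1920x1080 example. `scale_factor` and `screenshot_to_logical` are hypothetical names, and this is not the implementation behind convert_screenshot_coordinates or auto_scale.

```python
def scale_factor(physical, logical):
    """Return the (x, y) ratio between physical and logical resolution."""
    return physical[0] / logical[0], physical[1] / logical[1]


def screenshot_to_logical(px, py, physical, logical):
    """Convert a pixel picked on a physical-resolution screenshot into the
    logical coordinate that a click should use."""
    sx, sy = scale_factor(physical, logical)
    # Truncate toward zero so the result stays inside the logical screen.
    return int(px / sx), int(py / sy)


physical = (3840, 2160)  # what the screenshot contains
logical = (1920, 1080)   # what the desktop reports

print(scale_factor(physical, logical))
print(screenshot_to_logical(1920, 1080, physical, logical))
```

With a scale factor of 2, a pixel at the centre of the screenshot (1920, 1080) corresponds to the logical point (960, 540); clicking at the unscaled value would land in the bottom-right corner instead.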

Troubleshooting

<details> <summary><strong>Common Issues</strong></summary>

  • "Screenshot failed": Ensure gnome-screenshot or scrot is installed (sudo apt install gnome-screenshot).
  • "PyAutoGUI not installed": Make sure you are running the Python interpreter from the virtual environment (.venv) where the package was installed.
  • Wayland Issues: This server requires X11. Check with echo $XDG_SESSION_TYPE. If "wayland", switch to "GNOME on Xorg" at login.
  • Permission Denied: Run xhost +local: if you have X11 permission issues.

</details>
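Several of the common issues above can be checked before starting the server. The following preflight sketch uses only the Python standard library. The `preflight` function is a hypothetical helper, not part of this package, and it covers only the session type, DISPLAY, and screenshot tool checks.

```python
import os
import shutil


def preflight(environ=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means no issue found."""
    env = os.environ if environ is None else environ
    problems = []
    if env.get("XDG_SESSION_TYPE", "unknown").lower() == "wayland":
        problems.append("Wayland session detected; switch to 'GNOME on Xorg' at login")
    if not env.get("DISPLAY"):
        problems.append("DISPLAY is not set")
    # Either screenshot tool is enough.
    if not any(which(tool) for tool in ("gnome-screenshot", "scrot")):
        problems.append("no screenshot tool found; try: sudo apt install gnome-screenshot")
    return problems


for problem in preflight():
    print("WARNING:", problem)
```

The `environ` and `which` parameters exist only so the checks can be exercised without touching the real session.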

Security

⚠️ Warning: This server gives LLMs full control over your mouse and keyboard, and full visibility of your screen.

  • Only use with trusted clients.
  • Be aware that screenshots may capture sensitive data.
  • Automated clicks can be destructive.

License

MIT License
