desktop-touch-mcp

desktop-touch-mcp

Windows desktop automation MCP server — screenshot, mouse, keyboard & UI Automation. Lets LLM agents see and control your Windows desktop directly.

Category
访问服务器

README

desktop-touch-mcp

日本語

Glama Link

Stop pasting screenshots. Let Claude see and control your desktop directly.

An MCP server that gives Claude eyes and hands on Windows — 25 tools covering screenshots, mouse, keyboard, and Windows UI Automation, designed from the ground up for LLM efficiency.

Applies MPEG P-frame diffing to window capture: only changed windows are sent after the first frame, cutting token usage by ~60–80% in typical automation loops.


Features

  • LLM-native design — Built around how LLMs think, not how humans click. run_macro batches multiple operations into a single API call; diffMode sends only the windows that changed since the last frame. Minimal tokens, minimal round-trips.
  • Full CJK support — Uses Win32 GetWindowTextW for window titles, avoiding nut-js garbling. IME bypass input supported for Japanese/Chinese/Korean environments.
  • 3-tier token reductiondetail="image" (~443 tok) / detail="text" (~100–300 tok) / diffMode=true (~160 tok). Send pixels only when you actually need to see them.
  • 1:1 coordinate modedotByDot=true captures at native resolution (WebP). Image pixel = screen coordinate — no scale math needed. With origin+scale passed to mouse_click, the server converts coords for you — eliminating off-by-one / scale bugs.
  • Chrome / AWS data reductiongrayscale=true (~50% size), dotByDotMaxDimension=1280 (auto-scaled with coord preservation), windowTitle + region sub-crop (exclude browser chrome). Targeted at heavy dotByDot captures. Typical token reduction: 50–70%.
  • Chromium smart fallbackdetail="text" on Chrome/Edge/Brave auto-skips UIA (prohibitively slow there) and runs Windows OCR. hints.chromiumGuard + hints.ocrFallbackFired flag the path taken.
  • UIA element extractiondetail="text" returns button names and clickAt coords as JSON. Claude can click the right element without ever looking at a screenshot.
  • Auto-dock CLIdock_window snaps any window to a screen corner with always-on-top. Set DESKTOP_TOUCH_DOCK_TITLE='@parent' to auto-dock the terminal hosting Claude on MCP startup — the process-tree walker finds the right window regardless of title.
  • Emergency stop (Failsafe) — Move the mouse to the top-left corner (within 10px of 0,0) to immediately terminate the MCP server.

Requirements

OS Windows 10 / 11 (64-bit)
Node.js v20+ recommended (tested on v22+)
PowerShell 5.1+ (bundled with Windows)
Claude CLI claude command must be available

Note: nut-js native bindings require the Visual C++ Redistributable. Download from Microsoft if not already installed.


Installation

git clone https://github.com/Harusame64/desktop-touch-mcp.git
cd desktop-touch-mcp
npm install
npm run build

Register with Claude CLI

Add to ~/.claude.json under mcpServers:

{
  "mcpServers": {
    "desktop-touch": {
      "type": "stdio",
      "command": "node",
      "args": ["D:/path/to/desktop-touch-mcp/dist/index.js"]
    }
  }
}

Note: Replace D:/path/to/desktop-touch-mcp with the actual path where you cloned this repository.

No system prompt needed. The command reference is automatically injected into Claude via the MCP initialize response's instructions field.


Tools (34 total)

Screenshot (5)

Tool Description
screenshot Main capture. Supports detail, dotByDot, dotByDotMaxDimension, grayscale, region sub-crop, diffMode
screenshot_background Capture a background window without focusing it (PrintWindow API)
screenshot_ocr Windows.Media.Ocr on a window; returns word-level text + screen clickAt coords
get_screen_info Monitor layout, DPI, cursor position
scroll_capture Full-page stitch by scrolling (MAE overlap detection + 10% fallback)

Window management (4)

Tool Description
get_windows List all windows in Z-order
get_active_window Info about the focused window
focus_window Bring a window to foreground by partial title match
dock_window Snap a window to a screen corner at a small size + always-on-top (for keeping CLI visible)

Mouse (5)

Tool Description
mouse_move / mouse_click / mouse_drag Move, click, drag. Accept speed and homing parameters
scroll Scroll in any direction. Accepts speed and homing parameters
get_cursor_position Current cursor coordinates

Keyboard (2)

Tool Description
keyboard_type Type text (use_clipboard=true bypasses IME)
keyboard_press Key combos (ctrl+c, alt+f4, etc.)

UI Automation (4)

Tool Description
get_ui_elements Full UIA element tree for a window
click_element Click a button by name or automationId — no coordinates needed
set_element_value Write directly to a text field
scope_element High-res zoom crop of an element + its child tree

Browser CDP (7)

Tool Description
browser_connect Connect to Chrome/Edge via CDP; lists open tabs
browser_find_element CSS selector → exact physical screen coords
browser_click_element Find DOM element + click in one step
browser_eval Evaluate JS expression in the browser tab
browser_get_dom Get outerHTML of element or document.body
browser_navigate Navigate via CDP Page.navigate (no address bar needed)
browser_disconnect Close cached CDP WebSocket sessions

Workspace (2)

Tool Description
workspace_snapshot All windows: thumbnails + UI summaries in one call
workspace_launch Launch an app and auto-detect the new window

Pin / Macro (3)

Tool Description
pin_window / unpin_window Always-on-top toggle
run_macro Execute up to 50 steps sequentially in one MCP call

Browser CDP automation

For web automation, connect Chrome or Edge with the remote debugging port enabled — no Selenium or Playwright needed.

# Launch Chrome in CDP mode
chrome.exe --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp
browser_connect()                       → list open tabs + get tabIds
browser_find_element("#submit")         → CSS selector → physical screen coords
browser_click_element("#submit")        → find + click in one step (auto-focuses browser)
browser_eval("document.title")          → evaluate JS, returns result
browser_get_dom("#main", maxLength=5000)→ outerHTML, truncated to maxLength chars
browser_navigate("https://example.com") → navigate via CDP (no address bar interaction)
browser_disconnect()                    → clean up WebSocket sessions

Coordinates returned by browser_find_element account for the browser chrome (tab strip + address bar height) and devicePixelRatio, so they can be passed directly to mouse_click without any scaling.

Recommended web workflow:

browser_connect() → browser_get_dom() → browser_find_element(selector) → browser_click_element(selector)

Auto-dock CLI on startup

Keep Claude CLI visible while operating other apps full-screen. Set env vars in your MCP config and the docked window auto-snaps into place every MCP startup.

{
  "mcpServers": {
    "desktop-touch": {
      "type": "stdio",
      "command": "node",
      "args": ["D:/path/to/desktop-touch-mcp/dist/index.js"],
      "env": {
        "DESKTOP_TOUCH_DOCK_TITLE": "@parent",
        "DESKTOP_TOUCH_DOCK_CORNER": "bottom-right",
        "DESKTOP_TOUCH_DOCK_WIDTH": "480",
        "DESKTOP_TOUCH_DOCK_HEIGHT": "360",
        "DESKTOP_TOUCH_DOCK_PIN": "true"
      }
    }
  }
}
Env var Default Notes
DESKTOP_TOUCH_DOCK_TITLE (unset = off) @parent walks the MCP process tree to find the hosting terminal — immune to title / branch / project changes. Or use a literal substring.
DESKTOP_TOUCH_DOCK_CORNER bottom-right top-left / top-right / bottom-left / bottom-right
DESKTOP_TOUCH_DOCK_WIDTH / HEIGHT 480 / 360 px ("480") or ratio of work area ("25%") — 4K/8K auto-adapts
DESKTOP_TOUCH_DOCK_PIN true Always-on-top toggle
DESKTOP_TOUCH_DOCK_MONITOR primary Monitor id from get_screen_info
DESKTOP_TOUCH_DOCK_SCALE_DPI false If true, multiply px values by dpi / 96 (opt-in per-monitor scaling)
DESKTOP_TOUCH_DOCK_MARGIN 8 Screen-edge padding (px)
DESKTOP_TOUCH_DOCK_TIMEOUT_MS 5000 Max wait for the target window to appear

Input routing gotcha: when a pinned window is active (e.g. Claude CLI), keyboard_type / keyboard_press send keys to it, not the app you wanted to type into. Always call focus_window(title=...) before keyboard operations, then verify isActive=true via screenshot(detail='meta').


Mouse homing correction

When Claude calls screenshot(detail='text') to read coordinates and then mouse_click seconds later, the target window may have moved. The homing system corrects this automatically.

Tier How to enable Latency What it does
1 Always-on (if cache exists) <1ms Applies (dx, dy) offset when window moved
2 Pass windowTitle hint ~100ms Auto-focuses window if it went behind another
3 Pass elementName/elementId + windowTitle 1–3s UIA re-query for fresh coords on resize
# Tier 1 only (automatic)
mouse_click(x=500, y=300)

# Tier 1 + 2: also bring window to front if hidden
mouse_click(x=500, y=300, windowTitle="メモ帳")

# Tier 1 + 2 + 3: also re-query UIA if window resized
mouse_click(x=500, y=300, windowTitle="メモ帳", elementName="保存")

# Traction control OFF — no correction
mouse_click(x=500, y=300, homing=false)

The homing parameter is available on mouse_click, mouse_move, mouse_drag, and scroll. The cache is updated automatically on every screenshot(), get_windows(), focus_window(), and workspace_snapshot() call.

mouse_click image-local coords (origin + scale)

When you take a dotByDot screenshot with dotByDotMaxDimension, the response prints the origin and scale values. Instead of computing screen coords manually, copy them into mouse_click:

# Screenshot response:
#   origin: (0, 120) | scale: 0.6667
#   To click image pixel (ix, iy): mouse_click(x=ix, y=iy, origin={x:0, y:120}, scale=0.6667)

mouse_click(x=640, y=300, origin={x:0, y:120}, scale=0.6667, windowTitle="Chrome")
# Server converts: screen = (0 + 640/0.6667, 120 + 300/0.6667) = (960, 570)

This eliminates a whole class of off-by-one and scale bugs. Without origin/scale, x/y remain absolute screen pixels (unchanged behavior).


screenshot key parameters

detail="image"          — PNG/WebP pixels (default)
detail="text"           — UIA element JSON + clickAt coords (no image, ~100–300 tok)
detail="meta"           — Title + region only (cheapest, ~20 tok/window)
dotByDot=true           — 1:1 WebP; image_px + origin = screen_px
dotByDotMaxDimension=N  — cap longest edge (response includes scale for coord math)
grayscale=true          — ~50% smaller for text-heavy captures (code/AWS console)
region={x,y,w,h}        — with windowTitle: window-local coords (exclude browser chrome)
                          without: virtual screen coords
diffMode=true           — I-frame first call, P-frame (changed windows only) after (~160 tok)
ocrFallback="auto"      — detail='text' auto-fires Windows OCR on uiaSparse or empty

Recommended Chrome combo (50–70% data reduction):

screenshot(windowTitle="Chrome",
           dotByDot=true, dotByDotMaxDimension=1280, grayscale=true,
           region={x:0, y:120, width:1920, height:900})  # skip browser chrome

Recommended workflow:

workspace_snapshot()                     → full orientation (resets diff buffer)
screenshot(detail="text", windowTitle=X) → get actionable[].clickAt coords
mouse_click(x, y)                        → click directly, no math needed
screenshot(diffMode=true)                → check only what changed (~160 tok)

Security

Emergency stop (Failsafe)

Move the mouse to the top-left corner of the screen (within 10px of 0,0) to immediately terminate the MCP server.

  • Per-tool check: checkFailsafe() runs before every tool handler
  • Background monitor: 500ms polling as a backup for long-running operations
  • Trigger radius: 10px

Blocked operations

workspace_launch blocklist: cmd.exe, powershell.exe, pwsh.exe, wscript.exe, cscript.exe, mshta.exe, regsvr32.exe, rundll32.exe, msiexec.exe, bash.exe, wsl.exe are blocked. Script extensions (.bat, .ps1, .vbs, etc.) are rejected. Arguments containing ;, &, |, `, $(, ${ are also rejected.

keyboard_press blocklist: Win+R (Run dialog), Win+X (admin menu), Win+S (search), Win+L (lock screen) are blocked.

PowerShell injection protection

All -like patterns in the UIA bridge are sanitized with escapeLike(), which escapes wildcard characters (*, ?, [, ]) before they reach PowerShell.

Allowlist for workspace_launch

Shell interpreters are blocked by default. To allow specific executables, create an allowlist file:

File locations (searched in order):

  1. Path in DESKTOP_TOUCH_ALLOWLIST environment variable
  2. ~/.claude/desktop-touch-allowlist.json
  3. desktop-touch-allowlist.json in the server's working directory

Format:

{
  "allowedExecutables": [
    "pwsh.exe",
    "C:\\Tools\\myapp.exe"
  ]
}

Changes take effect immediately — no restart needed.


Mouse movement speed

All mouse tools (mouse_move, mouse_click, mouse_drag, scroll) accept an optional speed parameter:

Value Behavior
Omitted Uses the configured default (see below)
0 Instant teleport — setPosition(), no animation
1–N Animated movement at N px/sec

Default speed is 1500 px/sec. Change it permanently via the DESKTOP_TOUCH_MOUSE_SPEED environment variable:

{
  "mcpServers": {
    "desktop-touch": {
      "type": "stdio",
      "command": "node",
      "args": ["D:/path/to/desktop-touch-mcp/dist/index.js"],
      "env": {
        "DESKTOP_TOUCH_MOUSE_SPEED": "3000"
      }
    }
  }
}

Common values: 0 = teleport, 1500 = default gentle, 3000 = fast, 5000 = very fast.


Known limitations

Limitation Detail Workaround
Games / video players may return black or hang in background capture DirectX fullscreen apps may not work even with PW_RENDERFULLCONTENT Retry with screenshot_background(fullContent=false); if still black, use foreground screenshot
UIA call overhead ~300ms per call via PowerShell; workspace_snapshot uses a 2s timeout internally Batch with workspace_snapshot upfront, then use diffMode for incremental checks
Chrome / WinUI3 UIA elements are empty Chromium exposes only limited UIA screenshot(detail='text') auto-detects Chromium and falls back to Windows OCR (hints.chromiumGuard=true). For richer DOM access use browser_connect + browser_find_element
Chromium title-regex misses when sites rewrite document.title Guard relies on the - Google Chrome suffix being present; some sites push it off the end of a long title Title is treated as plain Chrome (UIA runs). OCR path is still reachable via ocrFallback='always' or when UIA returns <5 elements (uiaSparse)
browser_* CDP tools need Chrome launched with --remote-debugging-port If Chrome is already running on the default profile without the flag, browser_launch / browser_connect fail. The CDP E2E suite (tests/e2e/browser-cdp.test.ts) will also fail in that state Close Chrome first, then browser_launch will relaunch it in debug mode, or start Chrome manually with --remote-debugging-port=9222 --user-data-dir=C:\tmp\cdp
Layer buffer TTL Buffer auto-clears after 90s of inactivity → next diffMode becomes an I-frame After long waits, call workspace_snapshot to explicitly reset the buffer
keyboard_type / keyboard_press follow focus When dock_window(pin=true) keeps another window on top (e.g. Claude CLI), keystrokes may be absorbed by that window Call focus_window(title=...) first and verify isActive=true via screenshot(detail='meta') before sending keys

Token cost reference

Mode Tokens Use case
screenshot (768px PNG) ~443 tok General visual check
screenshot(dotByDot=true) window ~800 tok Precise clicking (no coordinate math)
screenshot(diffMode=true) ~160 tok Post-action diff
screenshot(detail="text") ~100–300 tok UI interaction (no image)
workspace_snapshot ~2000 tok Full session orientation

License

MIT

推荐服务器

Baidu Map

Baidu Map

百度地图核心API现已全面兼容MCP协议,是国内首家兼容MCP协议的地图服务商。

官方
精选
JavaScript
Playwright MCP Server

Playwright MCP Server

一个模型上下文协议服务器,它使大型语言模型能够通过结构化的可访问性快照与网页进行交互,而无需视觉模型或屏幕截图。

官方
精选
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

一个由人工智能驱动的工具,可以从自然语言描述生成现代化的用户界面组件,并与流行的集成开发环境(IDE)集成,从而简化用户界面开发流程。

官方
精选
本地
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

通过模型上下文协议启用与 Audiense Insights 账户的交互,从而促进营销洞察和受众数据的提取和分析,包括人口统计信息、行为和影响者互动。

官方
精选
本地
TypeScript
VeyraX

VeyraX

一个单一的 MCP 工具,连接你所有喜爱的工具:Gmail、日历以及其他 40 多个工具。

官方
精选
本地
graphlit-mcp-server

graphlit-mcp-server

模型上下文协议 (MCP) 服务器实现了 MCP 客户端与 Graphlit 服务之间的集成。 除了网络爬取之外,还可以将任何内容(从 Slack 到 Gmail 再到播客订阅源)导入到 Graphlit 项目中,然后从 MCP 客户端检索相关内容。

官方
精选
TypeScript
Kagi MCP Server

Kagi MCP Server

一个 MCP 服务器,集成了 Kagi 搜索功能和 Claude AI,使 Claude 能够在回答需要最新信息的问题时执行实时网络搜索。

官方
精选
Python
e2b-mcp-server

e2b-mcp-server

使用 MCP 通过 e2b 运行代码。

官方
精选
Neon MCP Server

Neon MCP Server

用于与 Neon 管理 API 和数据库交互的 MCP 服务器

官方
精选
Exa MCP Server

Exa MCP Server

模型上下文协议(MCP)服务器允许像 Claude 这样的 AI 助手使用 Exa AI 搜索 API 进行网络搜索。这种设置允许 AI 模型以安全和受控的方式获取实时的网络信息。

官方
精选