computer-use

Exposes Anthropic's computer-use action surface (screenshot, click, move, keyboard, clipboard, batch) against a persistent desktop display via the MCP stdio protocol. Enables AI agents to control a virtual desktop environment through natural language instructions.

plugin-computer-use

A persistent stdio MCP server that exposes the Anthropic computer-use action surface (screenshot, click, move, keyboard, clipboard, batch) against the SideButton agent desktop on DISPLAY=:10.

This repo is the scaffold + dispatch core for the Computer Use epic (SCRUM-1399). It is delivered by SCRUM-1397, which covers:

  • the long-lived stdio MCP server loop (initialize / tools/list / tools/call),
  • the ported computer.py dispatch base (DISPLAY targeting, screenshot → base64 PNG, coordinate scaling, single-owner lock, xdotool runner),
  • the full tool surface declared so tools/list returns it,
  • screenshot wired end-to-end as the proof action.

The individual tool bodies land in sibling tickets (SCRUM-1400…1405) and hosting this as a runtime: "service" plugin is SCRUM-1406.

Why a persistent server

The current SideButton plugin model (the-assistant packages/server/src/plugins) spawns a fresh, stateless handler process per tools/call and SIGKILLs it at a 30s timeout. That cannot host the computer-use surface, which needs cross-call state: a held mouse button (left_mouse_down → left_mouse_up), the screenshot → coordinate session, session grants, and holds of up to ~100s. So this is a single, long-lived child process that speaks MCP over stdio.
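To make the cross-call state concrete, here is a minimal sketch of a newline-delimited JSON-RPC loop that shares one state dict across tools/call requests. It is not the code in src/server.py: the `TOOLS` registry, the `tool` decorator, the `state` dict, and the simplified initialize payload are all assumptions made for illustration.

```python
import json
import sys
from typing import Any, Callable, Dict, Optional, TextIO

# Hypothetical in-memory registry; the real surface lives in src/tools.py.
TOOLS: Dict[str, Callable[[dict, dict], str]] = {}


def tool(name: str):
    """Register a handler under a tool name (illustrative helper)."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("left_mouse_down")
def left_mouse_down(args: dict, state: dict) -> str:
    state["button_held"] = True  # must survive until a later tools/call
    return "button down"


@tool("left_mouse_up")
def left_mouse_up(args: dict, state: dict) -> str:
    was_held = state.pop("button_held", False)
    return "button up" if was_held else "button was not held"


def handle(request: dict, state: dict) -> Optional[dict]:
    """Dispatch one JSON-RPC request; notifications (no id) get no reply."""
    method, req_id = request.get("method"), request.get("id")
    if req_id is None:
        return None
    if method == "initialize":
        # Simplified handshake; a real server returns more fields.
        result: Any = {"serverInfo": {"name": "sketch"}, "capabilities": {"tools": {}}}
    elif method == "tools/list":
        result = {"tools": [{"name": name} for name in sorted(TOOLS)]}
    elif method == "tools/call":
        params = request.get("params", {})
        fn = TOOLS.get(params.get("name"))
        if fn is None:
            text, is_error = f"unknown tool: {params.get('name')}", True
        else:
            text, is_error = fn(params.get("arguments", {}), state), False
        result = {"content": [{"type": "text", "text": text}], "isError": is_error}
    else:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32601, "message": "method not found"}}
    return {"jsonrpc": "2.0", "id": req_id, "result": result}


def serve(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Read newline-delimited JSON-RPC until EOF, sharing one state dict."""
    state: dict = {}
    for line in stdin:
        if not line.strip():
            continue
        response = handle(json.loads(line), state)
        if response is not None:
            stdout.write(json.dumps(response) + "\n")
            stdout.flush()
```

Because `state` lives as long as the process, a left_mouse_up call can observe the button held by an earlier left_mouse_down, which a process-per-call handler cannot do.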

Tool surface

24 tools, grouped by the sibling ticket that owns each body. The capture group (screenshot, zoom, SCRUM-1400), the click group (left_click / right_click / middle_click / double_click / triple_click, SCRUM-1401), the keyboard group (type / key / hold_key, SCRUM-1403), and the clipboard + session group (SCRUM-1404) are implemented; the rest are declared and return a clear pending-owner error until their ticket lands. Full input schemas: docs/computer-use-mcp-tools-schema.md.

| Group | Ticket | Tools |
| --- | --- | --- |
| capture | SCRUM-1400 | screenshot ✅, zoom ✅ |
| click | SCRUM-1401 | left_click ✅, right_click ✅, middle_click ✅, double_click ✅, triple_click ✅ |
| move / drag / scroll | SCRUM-1402 | mouse_move, left_click_drag, scroll, left_mouse_down, left_mouse_up |
| keyboard | SCRUM-1403 | type ✅, key ✅, hold_key ✅ |
| clipboard + session | SCRUM-1404 | read_clipboard ✅, write_clipboard ✅, request_access ✅, list_granted_applications ✅, open_application ✅, switch_display ✅ |
| utility / batch | SCRUM-1405 | computer_batch, wait, cursor_position |

Clipboard + session behaviour (SCRUM-1404)

The macOS session/permission model has no XFCE/Xvfb equivalent, so these degrade gracefully instead of erroring — keeping cross-runner (macOS-authored) skills working — while honouring the native grant flags so call shapes match:

  • request_access auto-grants the requested apps (no compositor dialog), records the clipboardRead / clipboardWrite / systemKeyCombos flags (additive across calls), and returns screenshotFiltering: false.
  • list_granted_applications echoes the allowlist + active grant flags.
  • read_clipboard / write_clipboard shell out to xclip -selection clipboard, gated on the clipboardRead / clipboardWrite grants (a call without the grant returns an isError result, matching native).
  • open_application is best-effort window focus (wmctrl -a, then xdotool search --name … windowactivate); the primary target is the single RDP window. With neither binary installed it returns a non-error no-op note.
  • switch_display is a no-op on the single Xvfb :10 and reports the current display (accepts "auto").
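The grant behaviour above can be sketched as follows. This is an illustration under assumptions, not the plugin's code: the `Session` class, its return shapes, and the injected `run` callable are hypothetical; only the flag names and the `xclip -selection clipboard` invocation come from the description above (`-o` is xclip's print-to-stdout flag).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

GRANT_FLAGS = ("clipboardRead", "clipboardWrite", "systemKeyCombos")


@dataclass
class Session:
    """Hypothetical session-grant store; flags only ever accumulate."""
    apps: Set[str] = field(default_factory=set)
    flags: Dict[str, bool] = field(
        default_factory=lambda: {name: False for name in GRANT_FLAGS})

    def request_access(self, apps: List[str], **requested: bool) -> dict:
        self.apps.update(apps)  # auto-grant: there is no compositor dialog
        for name in GRANT_FLAGS:
            # Additive across calls: a flag granted once stays granted.
            self.flags[name] = self.flags[name] or bool(requested.get(name))
        return {"granted": sorted(self.apps), "screenshotFiltering": False, **self.flags}

    def read_clipboard(self, run: Callable[[List[str]], str]) -> dict:
        if not self.flags["clipboardRead"]:
            return {"isError": True, "text": "clipboardRead not granted"}
        # `run` is injected so the sketch can be exercised without xclip.
        return {"isError": False,
                "text": run(["xclip", "-selection", "clipboard", "-o"])}
```

Injecting `run` keeps the gate testable on machines without an X display; a real caller would pass a function that wraps `subprocess.run`.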

Surface count. This is the 24-tool surface the epic (SCRUM-1399) specifies. The clipboard + session group follows the explicit enumeration in SCRUM-1404 (read_clipboard / write_clipboard split + list_granted_applications), which is the 2-tool delta over the work plan's interim count of 22. src/tools.py is the single source of truth; docs/computer-use-mcp-tools-schema.md (AC4) is generated from it.

Bare names + collisions. Names are the canonical Anthropic action ids. screenshot, type, scroll, wait, click collide with core SideButton MCP tools, and the current loader drops the entire plugin on any collision. That is fine standalone (this server owns its namespace); namespacing on aggregation is deferred to SCRUM-1406 (recommended: bare names in the child, prefix/slug-namespace on the host).

Layout

plugin-computer-use/
├── plugin.json        # generated service-plugin manifest (proposes runtime:"service")
├── src/
│   ├── server.py      # stdio MCP loop: initialize / tools/list / tools/call
│   ├── computer.py    # dispatch base (ported computer.py)
│   └── tools.py       # canonical tool surface (single source of truth)
├── scripts/
│   └── build_manifest.py   # regenerates plugin.json + the schema doc from tools.py
├── tests/             # unittest: dispatch-base unit + stdio round-trip + manifest
├── docs/
│   └── computer-use-mcp-tools-schema.md   # generated; the AC4 schema doc
├── run_tests.sh       # runs the suite (xvfb-wrapped when no DISPLAY)
├── pyproject.toml     # dependency-free, python>=3.10
└── README.md  LICENSE  .gitignore

Run it standalone

# speak MCP by hand (newline-delimited JSON-RPC):
printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/list"}' \
  '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"screenshot","arguments":{"save_to_disk":true}}}' \
  '{"jsonrpc":"2.0","id":4,"method":"tools/call","params":{"name":"zoom","arguments":{"region":[600,300,900,500]}}}' \
  '{"jsonrpc":"2.0","id":5,"method":"tools/call","params":{"name":"type","arguments":{"text":"hello"}}}' \
  '{"jsonrpc":"2.0","id":6,"method":"tools/call","params":{"name":"key","arguments":{"text":"ctrl+a","repeat":1}}}' \
  '{"jsonrpc":"2.0","id":7,"method":"tools/call","params":{"name":"hold_key","arguments":{"text":"shift","duration":2}}}' \
  '{"jsonrpc":"2.0","id":8,"method":"tools/call","params":{"name":"left_click","arguments":{"coordinate":[600,300],"text":"ctrl"}}}' \
  | DISPLAY=:10 python3 src/server.py

initialize returns the handshake, tools/list the 24-tool surface, the screenshot call a base64 PNG image block (plus a Saved to disk: <path> text block when save_to_disk is set), and zoom a magnified PNG of the region. The keyboard and click calls each return a short text ack (isError:false); they need xdotool on PATH. The left_click maps its [600, 300] against the id:3 screenshot's coordinate session — a click before any screenshot returns a clear no screenshot session yet error instead of clicking blind.

Capture & coordinates

screenshot captures DISPLAY=:10 and, when the measured size matches a model resolution, downscales it (on the live 1920×1080 :10 it returns 1366×768). Each capture records a screenshot → coordinate session: the measured device geometry and the returned image geometry. Coordinates the model returns are in image space (relative to the last screenshot); the server maps them back to device pixels via Computer.to_device(x, y) — the foundation the click/move siblings (SCRUM-1401/1402) consume. Both the downscale and the coordinate mapping are derived from the same measured geometry, so they can never use different bases (the wrong-pixel-click failure mode).
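A worked sketch of the image-space to device-pixel mapping, using the 1920×1080 device and 1366×768 image sizes quoted above. The `CoordinateSession` class, the round-to-nearest policy, and the bounds check are assumptions; the actual `Computer.to_device` may round or validate differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CoordinateSession:
    """Geometry recorded by one screenshot (hypothetical shape)."""
    device_w: int
    device_h: int
    image_w: int
    image_h: int

    def to_device(self, x: int, y: int) -> Tuple[int, int]:
        """Map an image-space point back to device pixels."""
        if not (0 <= x < self.image_w and 0 <= y < self.image_h):
            raise ValueError(f"({x}, {y}) is outside the {self.image_w}x{self.image_h} image")
        dx = round(x * self.device_w / self.image_w)
        dy = round(y * self.device_h / self.image_h)
        # Clamp so rounding can never land one pixel past the screen edge.
        return min(dx, self.device_w - 1), min(dy, self.device_h - 1)


session = CoordinateSession(device_w=1920, device_h=1080, image_w=1366, image_h=768)
print(session.to_device(600, 300))  # (843, 422)
```

Both scale factors come from the one recorded session, which is the property the paragraph above relies on: the downscale and the mapping share a single basis.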

zoom takes region: (x0, y0, x1, y1) in image space, maps it to a device rect, and crops it from a fresh full-resolution capture — genuine magnification, not an upscale of the downscaled screenshot. It is read-only: it never moves the click-coordinate origin (clicks still refer to the last screenshot). If no screenshot has been taken yet, zoom establishes the session lazily.
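The region-to-device-rect step might look like the sketch below. The function name, the validation rules, and the floor/ceil policy (which grows the rect outward so the crop never loses an edge pixel) are assumptions rather than the plugin's documented behaviour.

```python
from typing import Tuple


def _ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def region_to_device_rect(
    region: Tuple[int, int, int, int],
    image_size: Tuple[int, int],
    device_size: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Map an image-space (x0, y0, x1, y1) to a device (left, top, width, height)."""
    x0, y0, x1, y1 = region
    iw, ih = image_size
    dw, dh = device_size
    if not (0 <= x0 < x1 <= iw and 0 <= y0 < y1 <= ih):
        raise ValueError(f"region {region} is empty or outside the {iw}x{ih} image")
    left, top = x0 * dw // iw, y0 * dh // ih                       # floor
    right, bottom = _ceil_div(x1 * dw, iw), _ceil_div(y1 * dh, ih)  # ceil
    return left, top, right - left, bottom - top


print(region_to_device_rect((600, 300, 900, 500), (1366, 768), (1920, 1080)))
# (843, 421, 423, 283)
```

The resulting rect is in device pixels, so it can be cropped from a full-resolution capture without touching the stored screenshot session.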

True 1:1 (no downscale) would require pinning :10 / the RDP window to a model-friendly size — that is provisioning (SCRUM-1396), out of scope here.

Click group (SCRUM-1401)

Pointer clicks at a screenshot-session coordinate. The [x, y] coordinate is image space (relative to the last screenshot) and is mapped to device pixels via Computer.to_device — a click before any screenshot returns a clear no screenshot session yet error (look before you click). Optional text modifier(s) ('ctrl', 'shift+alt', …) are held for the click and always released (keyup in a finally, the same guarantee as hold_key).

| Tool | xdotool | Button |
| --- | --- | --- |
| left_click | mousemove --sync <dx> <dy> click 1 | left (1) |
| right_click | mousemove --sync <dx> <dy> click 3 | right (3) |
| middle_click | mousemove --sync <dx> <dy> click 2 | middle (2) |
| double_click | mousemove --sync <dx> <dy> click --repeat 2 --delay 100 1 | left ×2 |
| triple_click | mousemove --sync <dx> <dy> click --repeat 3 --delay 100 1 | left ×3 |

With a modifier the click is wrapped in keydown -- <text> → click → keyup -- <text>. On-screen pixel accuracy is validated live in SCRUM-1408 (xdotool is absent on the current runner image, so the unit tests assert the device-pixel argv).
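The argv construction and the modifier wrap can be sketched as below. The xdotool arguments follow the table above; the `BUTTONS` map, the function names, and the injected `run` callable are hypothetical, and placing keydown inside the `try` is a choice of this sketch so that keyup is attempted even if keydown fails part-way.

```python
from typing import Callable, List, Optional

# tool name -> (xdotool button number, repeat count), as in the table above
BUTTONS = {
    "left_click": (1, 1),
    "right_click": (3, 1),
    "middle_click": (2, 1),
    "double_click": (1, 2),
    "triple_click": (1, 3),
}


def click_argv(tool: str, dx: int, dy: int) -> List[str]:
    """Build the xdotool argv for a click at device pixel (dx, dy)."""
    button, repeat = BUTTONS[tool]
    argv = ["xdotool", "mousemove", "--sync", str(dx), str(dy), "click"]
    if repeat > 1:
        argv += ["--repeat", str(repeat), "--delay", "100"]
    return argv + [str(button)]


def click(tool: str, dx: int, dy: int,
          run: Callable[[List[str]], None],
          text: Optional[str] = None) -> None:
    """Click, holding the optional modifier(s) and always releasing them."""
    if text is None:
        run(click_argv(tool, dx, dy))
        return
    try:
        run(["xdotool", "keydown", "--", text])
        run(click_argv(tool, dx, dy))
    finally:
        # Runs even when the click raises, so no modifier stays stuck down.
        run(["xdotool", "keyup", "--", text])
```

Passing a recording function as `run` is enough to assert the device-pixel argv without xdotool installed, which matches how the unit tests are described above.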

Keyboard group (SCRUM-1403)

| Tool | xdotool | Notes |
| --- | --- | --- |
| type | xdotool type --delay 12 -- <text> | types text at the current focus |
| key | xdotool key --repeat <repeat> -- <text> | chords, e.g. ctrl+s; optional repeat (default 1) |
| hold_key | keydown -- <text> → sleep <duration> → keyup -- <text> | the hold runs in the persistent server (Python time.sleep), so durations up to ~100s do not trip the per-call subprocess timeout; keyup runs in a finally so a held key/modifier is always released |

Test

./run_tests.sh          # uses $DISPLAY if set, else wraps in xvfb-run
# or directly:
DISPLAY=:10 python3 -m unittest discover -s tests -v
  • tests/test_dispatch_base.py: coordinate-scaling math, the screenshot → coordinate session + to_device mapping, the measured-basis downscale target, zoom region validation + region → device-rect math, xdotool command construction, single-owner lock, screenshot-backend detection, surface shape, plus live screenshot/zoom + save_to_disk (DISPLAY-gated).
  • tests/test_stdio_roundtrip.py: initialize → tools/list → tools/call screenshot (incl. save_to_disk path block) and zoom over a spawned server, plus error paths.
  • tests/test_manifest.py: plugin.json + schema doc are present and in sync with src/tools.py.

The screenshot round-trip needs an X display; run_tests.sh provides one via xvfb-run when $DISPLAY is unset, so AC3 is still exercised in headless CI.

System dependencies

System packages (apt), not pip — the plugin install copies no node_modules/venv and runs no build step, so the server is stdlib-only and shells out to:

| Tool | Used for | Notes |
| --- | --- | --- |
| a screenshot backend | screenshot | gnome-screenshot or scrot or ImageMagick (import/convert). The runner ships ImageMagick. |
| xdotool | pointer/keyboard actions; open_application fallback | required by the click/move/keyboard groups (siblings); absent on the runner image. |
| xclip | read_clipboard / write_clipboard | already on the runner; grant-gated. |
| wmctrl | open_application window focus | best-effort; open_application degrades to a no-op when absent. |

scrot and gnome-screenshot are absent on the runner image, so the screenshot backend falls through to ImageMagick import -window root (verified on DISPLAY=:10). When SCRUM-1407 adds this plugin to the agent-runners catalog, declare xdotool, a screenshot backend, and xclip in its system_deps.
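A sketch of how the backend fall-through could be detected with the standard library. The preference order, the `BACKENDS` table, and the function name are assumptions (the text above only says the backend falls through to ImageMagick when the other two are absent); the command lines are the standard file-output forms of each tool.

```python
import shutil
from typing import Callable, List, Optional, Tuple

# (binary to look up on PATH, argv builder). The order is an assumption.
BACKENDS: List[Tuple[str, Callable[[str], List[str]]]] = [
    ("gnome-screenshot", lambda out: ["gnome-screenshot", "-f", out]),
    ("scrot", lambda out: ["scrot", out]),
    ("import", lambda out: ["import", "-window", "root", out]),  # ImageMagick
]


def capture_argv(
    out_path: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the argv for the first screenshot backend found on PATH."""
    for binary, build in BACKENDS:
        if which(binary):
            return build(out_path)
    raise RuntimeError(
        "no screenshot backend found (gnome-screenshot, scrot, or ImageMagick import)")
```

`which` is a parameter so detection can be tested against a fake PATH; on a runner image with only ImageMagick, the first two lookups miss and the `import -window root` form is returned.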

DISPLAY and single-owner

  • The server targets the inherited $DISPLAY, defaulting to :10 (the runner desktop). It never hardcodes a display — the screen-record plugin's bug was capturing a non-existent :1.0.
  • It takes a process-lifetime single-owner lock (flock, /tmp/sidebutton-computer-use.lock, override with CU_LOCK_PATH) so only one session drives the shared pointer/keyboard; a second instance exits non-zero.
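The single-owner lock can be approximated with `fcntl.flock` in non-blocking mode, as in this sketch. The function and exception names are hypothetical; the default path and the `CU_LOCK_PATH` override are the ones listed above. This is POSIX-only.

```python
import fcntl
import os
from typing import IO, Optional

DEFAULT_LOCK_PATH = "/tmp/sidebutton-computer-use.lock"


class LockHeldError(RuntimeError):
    """Raised when another session already owns the lock (hypothetical name)."""


def acquire_single_owner_lock(path: Optional[str] = None) -> IO[str]:
    """Take an exclusive, non-blocking flock and return the open handle.

    The caller must keep the handle open for the life of the process;
    closing it (or exiting) releases the lock.
    """
    path = path or os.environ.get("CU_LOCK_PATH", DEFAULT_LOCK_PATH)
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        raise LockHeldError(f"another session holds {path}") from None
    return handle
```

A server entry point would catch `LockHeldError` and exit non-zero. Because the kernel drops the lock when the process dies, a crashed owner does not leave a stale lock behind.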

Service-manifest contract (SCRUM-1406)

plugin.json targets the merged runtime: "service" tier: the SideButton server keeps the child alive, discovers its tools via tools/list, and forwards tools/call to it.

{
  "name": "computer-use",
  "runtime": "service",
  "service": {
    "command": "python3 src/server.py",  // non-empty string; the engine splits on
                                          // whitespace and spawns with cwd=plugin dir
    "toolNamespace": "computer_use",      // tools surface as computer_use_<tool>
    "tools": {                            // per-tool timeout overrides (ms)
      "hold_key": { "timeoutMs": 120000 },
      "wait":     { "timeoutMs": 120000 }
    }
  },
  "tools": []                             // service plugins declare no static tools
}

The loader (the-assistant packages/server/src/plugins/loader.ts) recognizes only command / timeoutMs / toolNamespace / tools under service, and hard-rejects the manifest unless command is a non-empty string — an array fails validation and the plugin never loads. Tools are discovered live, so the top-level tools array is normalized to []. This repo owns only plugin.json; the agent-runners catalog entry + system_deps are SCRUM-1407.
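The validation rules just described can be mirrored in a few lines. This is a Python illustration only: the real loader is TypeScript and its code is not shown here, so the function names, the choice to silently drop unrecognized keys, and the treatment of a whitespace-only command as empty are all assumptions.

```python
from typing import List, Tuple

RECOGNIZED_SERVICE_KEYS = ("command", "timeoutMs", "toolNamespace", "tools")


def normalize_service_manifest(manifest: dict) -> Tuple[dict, List[str]]:
    """Validate a runtime:"service" manifest; return (normalized manifest, argv)."""
    service = manifest.get("service") or {}
    command = service.get("command")
    # An array (or any non-string) fails here, as does an empty string.
    if not isinstance(command, str) or not command.split():
        raise ValueError("service.command must be a non-empty string")
    kept = {key: service[key] for key in RECOGNIZED_SERVICE_KEYS if key in service}
    # Tools are discovered live via tools/list, so static tools become [].
    normalized = {**manifest, "service": kept, "tools": []}
    return normalized, command.split()


def surfaced_name(namespace: str, tool: str) -> str:
    """Name a tool as the host would expose it, e.g. computer_use_hold_key."""
    return f"{namespace}_{tool}"
```

Splitting on whitespace means an argument containing spaces cannot be expressed in `command`, which is one reason to keep it to a plain interpreter-plus-script form.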

Configuration (env)

| Var | Default | Purpose |
| --- | --- | --- |
| DISPLAY | :10 | target X display |
| CU_WIDTH / CU_HEIGHT | 1920 / 1080 | screen size for coordinate scaling |
| CU_SCREENSHOT_DELAY | 2.0 | post-action settle before a screenshot |
| CU_LOCK_PATH | /tmp/sidebutton-computer-use.lock | single-owner lock file |
| CU_SAVE_DIR | /tmp/sidebutton-computer-use/ | where save_to_disk writes shareable PNGs (host-pruned; saved files are not auto-deleted) |
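For reference, the variables above could be read into a typed config like this. The `Config` dataclass and `load_config` are hypothetical names; the variable names and defaults are taken from the table, and treating CU_SCREENSHOT_DELAY as seconds is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    display: str
    width: int
    height: int
    screenshot_delay: float  # assumed to be seconds
    lock_path: str
    save_dir: str


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Read the CU_* variables, falling back to the documented defaults."""
    return Config(
        display=env.get("DISPLAY", ":10"),
        width=int(env.get("CU_WIDTH", "1920")),
        height=int(env.get("CU_HEIGHT", "1080")),
        screenshot_delay=float(env.get("CU_SCREENSHOT_DELAY", "2.0")),
        lock_path=env.get("CU_LOCK_PATH", "/tmp/sidebutton-computer-use.lock"),
        save_dir=env.get("CU_SAVE_DIR", "/tmp/sidebutton-computer-use/"),
    )
```

Taking `env` as a parameter keeps the parsing testable without mutating the real process environment.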

License

MIT © 2026 SideButton
