computer-use
Exposes Anthropic's computer-use action surface (screenshot, click, move, keyboard, clipboard, batch) against a persistent desktop display via MCP stdio protocol. Enables AI agents to control a virtual desktop environment through natural language instructions.
plugin-computer-use
A persistent stdio MCP server that exposes the Anthropic computer-use action surface (screenshot, click, move, keyboard, clipboard, batch) against the SideButton agent desktop on DISPLAY=:10.
This repo is the scaffold + dispatch core for the Computer Use epic (SCRUM-1399). It is delivered by SCRUM-1397:
- the long-lived stdio MCP server loop (initialize / tools/list / tools/call),
- the ported computer.py dispatch base (DISPLAY targeting, screenshot → base64 PNG, coordinate scaling, single-owner lock, xdotool runner),
- the full tool surface declared so tools/list returns it,
- screenshot wired end-to-end as the proof action.
The individual tool bodies land in sibling tickets (SCRUM-1400…1405) and
hosting this as a runtime: "service" plugin is SCRUM-1406.
Why a persistent server
The current SideButton plugin model
(the-assistant packages/server/src/plugins)
spawns a fresh, stateless handler process per tools/call and SIGKILLs it at
a 30s timeout. That cannot host the computer-use surface, which needs cross-call
state: a held mouse button (left_mouse_down … left_mouse_up), the
screenshot→coordinate session, session grants, and holds up to ~100s. So this is
a single, long-lived child process that speaks MCP over stdio.
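As a rough illustration of why one long-lived process matters, the sketch below keeps a held-button flag alive across several tools/call requests read from a newline-delimited JSON-RPC stream. It is a minimal sketch, not the code in src/server.py: `ServerState`, `handle`, `serve`, and `demo` are hypothetical names, the reply text is invented, and details such as notifications and error responses are left out.

```python
import io
import json


class ServerState:
    """Cross-call state that only a long-lived process can keep (hypothetical shape)."""

    def __init__(self):
        self.left_button_held = False


def handle(request, state):
    """Dispatch one JSON-RPC request and return the response dict."""
    params = request.get("params") or {}
    result = {}
    if request.get("method") == "tools/call":
        name = params.get("name")
        if name == "left_mouse_down":
            state.left_button_held = True
        elif name == "left_mouse_up":
            state.left_button_held = False
        text = f"{name}: left_button_held={state.left_button_held}"
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}


def serve(stdin, stdout):
    """Read newline-delimited JSON-RPC until EOF, writing one response line per request."""
    state = ServerState()  # created once, shared by every call in this process
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        stdout.write(json.dumps(handle(json.loads(line), state)) + "\n")
        stdout.flush()


def demo():
    """Drive the loop with in-memory streams instead of a real stdin/stdout pair."""
    calls = ["left_mouse_down", "mouse_move", "left_mouse_up"]
    lines = [
        json.dumps({"jsonrpc": "2.0", "id": i, "method": "tools/call",
                    "params": {"name": name, "arguments": {}}})
        for i, name in enumerate(calls, start=1)
    ]
    out = io.StringIO()
    serve(io.StringIO("\n".join(lines) + "\n"), out)
    return [json.loads(reply) for reply in out.getvalue().splitlines()]
```

A per-call handler process would start each request with a fresh `ServerState`, so the second call could not know that the button is still down.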
Tool surface
24 tools, grouped by the sibling ticket that owns each body. The capture group
(screenshot, zoom, SCRUM-1400), the click group (left_click / right_click
/ middle_click / double_click / triple_click, SCRUM-1401), the keyboard
group (type / key / hold_key, SCRUM-1403), and the clipboard + session
group (SCRUM-1404) are implemented; the rest are declared and return a clear
pending-owner error until their ticket lands. Full input schemas:
docs/computer-use-mcp-tools-schema.md.
| Group | Ticket | Tools |
|---|---|---|
| capture | SCRUM-1400 | screenshot ✅, zoom ✅ |
| click | SCRUM-1401 | left_click ✅, right_click ✅, middle_click ✅, double_click ✅, triple_click ✅ |
| move / drag / scroll | SCRUM-1402 | mouse_move, left_click_drag, scroll, left_mouse_down, left_mouse_up |
| keyboard | SCRUM-1403 | type ✅, key ✅, hold_key ✅ |
| clipboard + session | SCRUM-1404 | read_clipboard ✅, write_clipboard ✅, request_access ✅, list_granted_applications ✅, open_application ✅, switch_display ✅ |
| utility / batch | SCRUM-1405 | computer_batch, wait, cursor_position |
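The pending-owner behaviour for the unchecked tools in the table can be pictured as follows. This is a sketch under assumptions: the ownership map is copied from the table, but `PENDING_OWNERS`, `pending_result`, and the message wording are hypothetical, and the real declarations live in src/tools.py.

```python
# Tools that are declared but not yet implemented, keyed to the ticket that
# owns their body (taken from the table above; the dict itself is hypothetical).
PENDING_OWNERS = {
    "mouse_move": "SCRUM-1402",
    "left_click_drag": "SCRUM-1402",
    "scroll": "SCRUM-1402",
    "left_mouse_down": "SCRUM-1402",
    "left_mouse_up": "SCRUM-1402",
    "computer_batch": "SCRUM-1405",
    "wait": "SCRUM-1405",
    "cursor_position": "SCRUM-1405",
}


def pending_result(tool_name):
    """Return an MCP tool result flagged isError, or None if the tool is implemented."""
    ticket = PENDING_OWNERS.get(tool_name)
    if ticket is None:
        return None
    text = f"{tool_name} is declared but not implemented yet (owner: {ticket})"
    return {"content": [{"type": "text", "text": text}], "isError": True}
```

Returning a tool result with isError set, rather than a JSON-RPC protocol error, lets the caller see which ticket to wait for while tools/list keeps advertising the full surface.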
Clipboard + session behaviour (SCRUM-1404)
The macOS session/permission model has no XFCE/Xvfb equivalent, so these degrade gracefully instead of erroring — keeping cross-runner (macOS-authored) skills working — while honouring the native grant flags so call shapes match:
- request_access auto-grants the requested apps (no compositor dialog), records the clipboardRead / clipboardWrite / systemKeyCombos flags (additive across calls), and returns screenshotFiltering: false.
- list_granted_applications echoes the allowlist + active grant flags.
- read_clipboard / write_clipboard shell out to xclip -selection clipboard, gated on the clipboardRead / clipboardWrite grants (a call without the grant returns an isError result, matching native).
- open_application is best-effort window focus (wmctrl -a, then xdotool search --name … windowactivate); the primary target is the single RDP window. With neither binary installed it returns a non-error no-op note.
- switch_display is a no-op on the single Xvfb :10 and reports the current display (accepts "auto").
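The additive grant flags and the clipboard gate described above might look roughly like this. It is a sketch, not the plugin's code: `GrantState` and the shape of the request_access reply are hypothetical (only screenshotFiltering: false comes from the text), and the clipboard reader is injected so the example runs without xclip.

```python
class GrantState:
    """Session grants kept for the lifetime of the server (hypothetical shape)."""

    FLAGS = ("clipboardRead", "clipboardWrite", "systemKeyCombos")

    def __init__(self):
        self.apps = []
        self.flags = {flag: False for flag in self.FLAGS}

    def request_access(self, apps, **flags):
        """Auto-grant the apps and merge the flags; no dialog is shown."""
        for app in apps:
            if app not in self.apps:
                self.apps.append(app)
        for flag in self.FLAGS:
            # Additive: a later call can add a grant but never clears one.
            self.flags[flag] = self.flags[flag] or bool(flags.get(flag, False))
        return {"granted": list(self.apps), "flags": dict(self.flags),
                "screenshotFiltering": False}

    def read_clipboard(self, reader):
        """Return the clipboard text, or an isError result without the grant.

        `reader` stands in for the xclip call named in the text.
        """
        if not self.flags["clipboardRead"]:
            text = "clipboardRead has not been granted"
            return {"content": [{"type": "text", "text": text}], "isError": True}
        return {"content": [{"type": "text", "text": reader()}], "isError": False}
```

For example, calling `request_access(["Terminal"], clipboardRead=True)` and later `request_access([], clipboardWrite=True)` leaves both clipboard flags set.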
Surface count. This is the 24-tool surface the epic (SCRUM-1399) specifies. The clipboard + session group follows the explicit enumeration in SCRUM-1404 (read_clipboard / write_clipboard split + list_granted_applications), which is the 2-tool delta over the work plan's interim count of 22. src/tools.py is the single source of truth; docs/computer-use-mcp-tools-schema.md (AC4) is generated from it.
Bare names + collisions. Names are the canonical Anthropic action ids. screenshot, type, scroll, wait, click collide with core SideButton MCP tools, and the current loader drops the entire plugin on any collision. That is fine standalone (this server owns its namespace); namespacing on aggregation is deferred to SCRUM-1406 (recommended: bare names in the child, prefix/slug-namespace on the host).
Layout
plugin-computer-use/
├── plugin.json # generated service-plugin manifest (proposes runtime:"service")
├── src/
│ ├── server.py # stdio MCP loop: initialize / tools/list / tools/call
│ ├── computer.py # dispatch base (ported computer.py)
│ └── tools.py # canonical tool surface (single source of truth)
├── scripts/
│ └── build_manifest.py # regenerates plugin.json + the schema doc from tools.py
├── tests/ # unittest: dispatch-base unit + stdio round-trip + manifest
├── docs/
│ └── computer-use-mcp-tools-schema.md # generated; the AC4 schema doc
├── run_tests.sh # runs the suite (xvfb-wrapped when no DISPLAY)
├── pyproject.toml # dependency-free, python>=3.10
├── README.md LICENSE .gitignore
Run it standalone
# speak MCP by hand (newline-delimited JSON-RPC):
printf '%s\n' \
'{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' \
'{"jsonrpc":"2.0","id":2,"method":"tools/list"}' \
'{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"screenshot","arguments":{"save_to_disk":true}}}' \
'{"jsonrpc":"2.0","id":4,"method":"tools/call","params":{"name":"zoom","arguments":{"region":[600,300,900,500]}}}' \
'{"jsonrpc":"2.0","id":5,"method":"tools/call","params":{"name":"type","arguments":{"text":"hello"}}}' \
'{"jsonrpc":"2.0","id":6,"method":"tools/call","params":{"name":"key","arguments":{"text":"ctrl+a","repeat":1}}}' \
'{"jsonrpc":"2.0","id":7,"method":"tools/call","params":{"name":"hold_key","arguments":{"text":"shift","duration":2}}}' \
'{"jsonrpc":"2.0","id":8,"method":"tools/call","params":{"name":"left_click","arguments":{"coordinate":[600,300],"text":"ctrl"}}}' \
| DISPLAY=:10 python3 src/server.py
initialize returns the handshake, tools/list the 24-tool surface, the
screenshot call a base64 PNG image block (plus a Saved to disk: <path> text
block when save_to_disk is set), and zoom a magnified PNG of the region. The
keyboard and click calls each return a short text ack (isError:false); they need
xdotool on PATH. The left_click maps its [600, 300] against the id:3
screenshot's coordinate session — a click before any screenshot returns a clear
no screenshot session yet error instead of clicking blind.
Capture & coordinates
screenshot captures DISPLAY=:10 and, when the measured size matches a model
resolution, downscales it (on the live 1920×1080 :10 it returns 1366×768).
Each capture records a screenshot → coordinate session: the measured device
geometry and the returned image geometry. Coordinates the model returns are in
image space (relative to the last screenshot); the server maps them back to
device pixels via Computer.to_device(x, y) — the foundation the click/move
siblings (SCRUM-1401/1402) consume. Both the downscale and the coordinate mapping
are derived from the same measured geometry, so they can never use different
bases (the wrong-pixel-click failure mode).
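The image-space to device-space mapping can be sketched as below, using the 1920×1080 device and 1366×768 image from the example above. This is an illustration under assumptions, not the body of Computer.to_device: `ScreenshotSession` and the free function `to_device` are hypothetical names, and rounding to the nearest pixel is an assumed choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenshotSession:
    """Geometry recorded by one capture (hypothetical shape)."""

    device_w: int  # measured display width in pixels
    device_h: int  # measured display height in pixels
    image_w: int   # width of the image returned to the model
    image_h: int   # height of the image returned to the model


def to_device(session, x, y):
    """Map an image-space point back to device pixels."""
    if session is None:
        # Look before you click: without a capture there is nothing to map against.
        raise RuntimeError("no screenshot session yet")
    if not (0 <= x <= session.image_w and 0 <= y <= session.image_h):
        raise ValueError(f"({x}, {y}) is outside the last screenshot")
    return (round(x * session.device_w / session.image_w),
            round(y * session.device_h / session.image_h))
```

Because the same `ScreenshotSession` would also carry the downscale target, the capture and the mapping cannot drift onto different bases.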
zoom takes region: (x0, y0, x1, y1) in image space, maps it to a device rect,
and crops it from a fresh full-resolution capture — genuine magnification, not
an upscale of the downscaled screenshot. It is read-only: it never moves the
click-coordinate origin (clicks still refer to the last screenshot). If no
screenshot has been taken yet, zoom establishes the session lazily.
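The region-to-device-rect step of zoom can be sketched in the same spirit. The function name `region_to_device_rect`, the (left, top, width, height) return shape, and the nearest-pixel rounding are assumptions for illustration; the validation rules in the real code may differ.

```python
def region_to_device_rect(region, image_size, device_size):
    """Map an image-space (x0, y0, x1, y1) region to a device (left, top, width, height)."""
    x0, y0, x1, y1 = region
    image_w, image_h = image_size
    device_w, device_h = device_size
    if not (0 <= x0 < x1 <= image_w and 0 <= y0 < y1 <= image_h):
        raise ValueError(f"region {region} is empty or outside the {image_w}x{image_h} image")
    left = round(x0 * device_w / image_w)
    top = round(y0 * device_h / image_h)
    right = round(x1 * device_w / image_w)
    bottom = round(y1 * device_h / image_h)
    # The crop is taken from a fresh full-resolution capture, so the result
    # carries more pixels than the same region in the downscaled screenshot.
    return left, top, right - left, bottom - top
```

With the region from the standalone example, `region_to_device_rect((600, 300, 900, 500), (1366, 768), (1920, 1080))` gives a 422×281 device rect at (843, 422), compared with 300×200 pixels in image space.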
True 1:1 (no downscale) would require pinning :10 / the RDP window to a
model-friendly size — that is provisioning (SCRUM-1396),
out of scope here.
Click group (SCRUM-1401)
Pointer clicks at a screenshot-session coordinate. The [x, y] coordinate
is image space (relative to the last screenshot) and is mapped to device pixels
via Computer.to_device — a click before any screenshot returns a clear
no screenshot session yet error (look before you click). Optional text
modifier(s) ('ctrl', 'shift+alt', …) are held for the click and always
released (keyup in a finally, the same guarantee as hold_key).
| Tool | xdotool | Button |
|---|---|---|
| left_click | mousemove --sync <dx> <dy> click 1 | left (1) |
| right_click | mousemove --sync <dx> <dy> click 3 | right (3) |
| middle_click | mousemove --sync <dx> <dy> click 2 | middle (2) |
| double_click | mousemove --sync <dx> <dy> click --repeat 2 --delay 100 1 | left ×2 |
| triple_click | mousemove --sync <dx> <dy> click --repeat 3 --delay 100 1 | left ×3 |
With a modifier the click is wrapped in keydown -- <text> → click → keyup -- <text>. On-screen pixel accuracy is validated live in
SCRUM-1408 (xdotool is absent
on the current runner image, so the unit tests assert the device-pixel argv).
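The argv construction and the modifier wrapping can be sketched as follows. The xdotool arguments are the ones listed in the table; `CLICK_ARGS`, `click_argv`, `click`, and the injected `run` callable are hypothetical, which also keeps the example runnable on a machine without xdotool.

```python
# Per-tool click arguments, copied from the table above.
CLICK_ARGS = {
    "left_click": ["click", "1"],
    "right_click": ["click", "3"],
    "middle_click": ["click", "2"],
    "double_click": ["click", "--repeat", "2", "--delay", "100", "1"],
    "triple_click": ["click", "--repeat", "3", "--delay", "100", "1"],
}


def click_argv(tool, dx, dy):
    """Build the xdotool argv for a click at device pixel (dx, dy)."""
    return ["xdotool", "mousemove", "--sync", str(dx), str(dy), *CLICK_ARGS[tool]]


def click(tool, dx, dy, run, modifiers=None):
    """Click, holding optional modifiers; `run` executes one argv list."""
    try:
        if modifiers:
            run(["xdotool", "keydown", "--", modifiers])
        run(click_argv(tool, dx, dy))
    finally:
        if modifiers:
            # Always release, even if the click itself failed.
            run(["xdotool", "keyup", "--", modifiers])
```

In a real server `run` would wrap something like `subprocess.run(argv, check=True)`; a unit test can pass `list.append` and assert on the recorded argv, which matches how the text says the device-pixel argv is checked.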
Keyboard group (SCRUM-1403)
| Tool | xdotool | Notes |
|---|---|---|
| type | xdotool type --delay 12 -- <text> | types text at the current focus |
| key | xdotool key --repeat <repeat> -- <text> | chords, e.g. ctrl+s; optional repeat (default 1) |
| hold_key | keydown -- <text> → sleep <duration> → keyup -- <text> | the hold runs in the persistent server (Python time.sleep), so durations up to ~100s do not trip the per-call subprocess timeout; keyup runs in a finally so a held key/modifier is always released |
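A sketch of the three keyboard actions, assuming the argv shapes in the table. The function names, the injected `run` and `sleep` callables, and the ack text are hypothetical; the point is that the wait happens in Python between two short xdotool calls, so no single subprocess runs for the full duration.

```python
import time


def type_argv(text):
    """argv for typing text at the current focus."""
    return ["xdotool", "type", "--delay", "12", "--", text]


def key_argv(text, repeat=1):
    """argv for pressing a key or chord such as ctrl+s."""
    return ["xdotool", "key", "--repeat", str(repeat), "--", text]


def hold_key(text, duration, run, sleep=time.sleep):
    """Hold a key for `duration` seconds; `run` executes one argv list."""
    try:
        run(["xdotool", "keydown", "--", text])
        sleep(duration)  # in-process wait, not a long-running subprocess
    finally:
        # Runs even if keydown or the sleep raised, so the key cannot stay stuck.
        run(["xdotool", "keyup", "--", text])
    ack = f"held {text} for {duration}s"
    return {"content": [{"type": "text", "text": ack}], "isError": False}
```

Passing a fake `sleep` in tests keeps the suite fast while still checking that keyup follows keydown.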
Test
./run_tests.sh # uses $DISPLAY if set, else wraps in xvfb-run
# or directly:
DISPLAY=:10 python3 -m unittest discover -s tests -v
- tests/test_dispatch_base.py: coordinate-scaling math, the screenshot → coordinate session + to_device mapping, the measured-basis downscale target, zoom region validation + region→device-rect math, xdotool command construction, single-owner lock, screenshot-backend detection, surface shape, plus live screenshot / zoom + save_to_disk (DISPLAY-gated).
- tests/test_stdio_roundtrip.py: initialize → tools/list → tools/call screenshot (incl. save_to_disk path block) and zoom over a spawned server, plus error paths.
- tests/test_manifest.py: plugin.json + schema doc are present and in sync with src/tools.py.
The screenshot round-trip needs an X display; run_tests.sh provides one via
xvfb-run when $DISPLAY is unset, so AC3 is still exercised in headless CI.
System dependencies
System packages (apt), not pip — the plugin install copies no node_modules/venv
and runs no build step, so the server is stdlib-only and shells out to:
| Tool | Used for | Notes |
|---|---|---|
| a screenshot backend | screenshot | gnome-screenshot or scrot or ImageMagick (import/convert). The runner ships ImageMagick. |
| xdotool | pointer/keyboard actions; open_application fallback | required by the click/move/keyboard groups (siblings); absent on the runner image. |
| xclip | read_clipboard / write_clipboard | already on the runner; grant-gated. |
| wmctrl | open_application window focus | best-effort; open_application degrades to a no-op when absent. |
scrot and gnome-screenshot are absent on the runner image, so the
screenshot backend falls through to ImageMagick import -window root (verified
on DISPLAY=:10). When SCRUM-1407 adds this plugin to the agent-runners catalog,
declare xdotool, a screenshot backend, and xclip in its system_deps.
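Backend detection with fall-through could be sketched like this. The probe order (gnome-screenshot, then scrot, then ImageMagick import) is an assumption inferred from the wording above, and `BACKENDS` and `detect_backend` are hypothetical names; only `import -window root` is confirmed by the text as the command used on the runner.

```python
import shutil

# (binary, argv builder) pairs in assumed probe order.
BACKENDS = [
    ("gnome-screenshot", lambda path: ["gnome-screenshot", "-f", path]),
    ("scrot", lambda path: ["scrot", path]),
    ("import", lambda path: ["import", "-window", "root", path]),
]


def detect_backend(which=shutil.which):
    """Return (binary, argv_builder) for the first backend found on PATH."""
    for binary, build in BACKENDS:
        if which(binary):
            return binary, build
    raise RuntimeError(
        "no screenshot backend found; install gnome-screenshot, scrot, or ImageMagick"
    )
```

On a runner image with only ImageMagick installed, this would return the `import` entry, and `build("/tmp/shot.png")` would yield `["import", "-window", "root", "/tmp/shot.png"]`.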
DISPLAY and single-owner
- The server targets the inherited $DISPLAY, defaulting to :10 (the runner desktop). It never hardcodes a display; the screen-record plugin's bug was capturing a non-existent :1.0.
- It takes a process-lifetime single-owner lock (flock, /tmp/sidebutton-computer-use.lock, override with CU_LOCK_PATH) so only one session drives the shared pointer/keyboard; a second instance exits non-zero.
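A single-owner lock of this kind can be taken with a non-blocking exclusive flock. The sketch below is a minimal version under assumptions: `acquire_lock` is a hypothetical name, and returning None (rather than exiting) is chosen so the caller decides how to fail. It relies on `fcntl`, so it applies to Linux and other Unix-like hosts only.

```python
import fcntl
import os

DEFAULT_LOCK_PATH = "/tmp/sidebutton-computer-use.lock"


def acquire_lock(path=None):
    """Try to take the exclusive lock; return the fd, or None if already held."""
    path = path or os.environ.get("CU_LOCK_PATH", DEFAULT_LOCK_PATH)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes a second owner fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # Keep the fd open for the life of the process; closing it drops the lock.
    return fd
```

A server entry point could call `acquire_lock()` once at startup and exit with a non-zero status when it returns None. The kernel releases the lock when the process ends, so a crashed owner does not leave a stale lock behind.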
Service-manifest contract (SCRUM-1406)
plugin.json targets the merged runtime: "service" tier: the SideButton server
keeps the child alive, discovers its tools via tools/list, and forwards
tools/call to it.
{
"name": "computer-use",
"runtime": "service",
"service": {
"command": "python3 src/server.py", // non-empty string; the engine splits on
// whitespace and spawns with cwd=plugin dir
"toolNamespace": "computer_use", // tools surface as computer_use_<tool>
"tools": { // per-tool timeout overrides (ms)
"hold_key": { "timeoutMs": 120000 },
"wait": { "timeoutMs": 120000 }
}
},
"tools": [] // service plugins declare no static tools
}
The loader (the-assistant packages/server/src/plugins/loader.ts) recognizes only command / timeoutMs / toolNamespace / tools under service, and hard-rejects the manifest unless command is a non-empty string: an array fails validation and the plugin never loads. Tools are discovered live, so the top-level tools array is normalized to []. This repo owns only plugin.json; the agent-runners catalog entry + system_deps are SCRUM-1407.
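The contract above can be checked locally before the manifest reaches the loader. The sketch below is a Python illustration of the stated rules, not a port of loader.ts: `check_service_manifest` is a hypothetical helper, and reporting unrecognized keys as problems is an assumption, since the text only says the loader recognizes four keys.

```python
ALLOWED_SERVICE_KEYS = {"command", "timeoutMs", "toolNamespace", "tools"}


def check_service_manifest(manifest):
    """Return a list of problems; an empty list means the contract is met."""
    problems = []
    if manifest.get("runtime") != "service":
        problems.append('runtime must be "service"')
    service = manifest.get("service")
    if not isinstance(service, dict):
        problems.append("service must be an object")
        return problems
    command = service.get("command")
    if not isinstance(command, str) or not command.strip():
        # An argv-style array is rejected: command has to be one string.
        problems.append("service.command must be a non-empty string")
    unknown = sorted(set(service) - ALLOWED_SERVICE_KEYS)
    if unknown:
        problems.append("unrecognized service keys: " + ", ".join(unknown))
    return problems
```

A check like this could run in tests/test_manifest.py or scripts/build_manifest.py so that a manifest with, say, `"command": ["python3", "src/server.py"]` fails fast instead of silently never loading.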
Configuration (env)
| Var | Default | Purpose |
|---|---|---|
| DISPLAY | :10 | target X display |
| CU_WIDTH / CU_HEIGHT | 1920 / 1080 | screen size for coordinate scaling |
| CU_SCREENSHOT_DELAY | 2.0 | post-action settle before a screenshot |
| CU_LOCK_PATH | /tmp/sidebutton-computer-use.lock | single-owner lock file |
| CU_SAVE_DIR | /tmp/sidebutton-computer-use/ | where save_to_disk writes shareable PNGs (host-pruned; saved files are not auto-deleted) |
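Reading these variables with their defaults might look like the following. The variable names and defaults come from the table; the `Config` dataclass, its field names, `load_config`, and treating CU_SCREENSHOT_DELAY as seconds are assumptions for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    display: str
    width: int
    height: int
    screenshot_delay: float  # assumed to be seconds
    lock_path: str
    save_dir: str


def load_config(env=None):
    """Build a Config from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Config(
        display=env.get("DISPLAY", ":10"),
        width=int(env.get("CU_WIDTH", "1920")),
        height=int(env.get("CU_HEIGHT", "1080")),
        screenshot_delay=float(env.get("CU_SCREENSHOT_DELAY", "2.0")),
        lock_path=env.get("CU_LOCK_PATH", "/tmp/sidebutton-computer-use.lock"),
        save_dir=env.get("CU_SAVE_DIR", "/tmp/sidebutton-computer-use/"),
    )
```

Passing a plain dict as `env` makes the defaults easy to test without touching the real environment.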
License
MIT © 2026 SideButton