Touchpoint

Touchpoint

Playwright for the entire OS. Give AI agents eyes and hands on any desktop app — find, click, type, and read UI elements across Linux, macOS, and Windows.

Category
访问服务器

README

<p align="center"> <h1 align="center">Touchpoint</h1> <p align="center"> <strong>Give your AI agent eyes and hands on any desktop.</strong> </p> <p align="center"> <a href="https://pypi.org/project/touchpoint-py/"><img src="https://img.shields.io/pypi/v/touchpoint-py?color=blue" alt="PyPI"></a> <a href="https://pypi.org/project/touchpoint-py/"><img src="https://img.shields.io/pypi/pyversions/touchpoint-py" alt="Python"></a> <a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="MIT License"></a> <a href="#status"><img src="https://img.shields.io/badge/status-alpha-orange" alt="Alpha"></a> <br> <img src="https://img.shields.io/badge/Linux-FCC624?logo=linux&logoColor=black" alt="Linux"> <img src="https://img.shields.io/badge/macOS-000000?logo=apple&logoColor=white" alt="macOS"> <img src="https://img.shields.io/badge/Windows-0078D6?logo=windows&logoColor=white" alt="Windows"> </p> <p align="center"> <code>pip install touchpoint-py</code> </p> </p>

<p align="center"><img src="docs/demo.gif" width="720" alt="Touchpoint demo — AI agent creates a formatted Excel table using Touchpoint"></p> <p align="center"><em>AI agent researches data in Chrome, then creates a formatted Excel table — full task completed in ~12 minutes</em></p>


Touchpoint is a cross-platform Python library for reading and interacting with desktop UI through native accessibility APIs. One import, one API — works on Linux, macOS, and Windows, with built-in support for Chromium and Electron apps via CDP (Chrome DevTools Protocol).

Instead of scraping pixels or running vision models, Touchpoint reads the real accessibility tree — structured names, roles, states, and positions for every element on screen. Fast and reliable, with no model inference needed. Ships with an MCP server so LLM agents like Claude or Cursor can control any desktop app out of the box.

import touchpoint as tp

elements = tp.find("Send", role=tp.Role.BUTTON, app="Slack")
tp.click(elements[0])

Why Touchpoint?

Screenshot / vision Browser automation Touchpoint
Native desktop apps ⚠️ inaccurate or slow ✅ structured access
Browsers ⚠️ inaccurate or slow ✅ via CDP
Electron apps (Slack, VS Code, ...) ⚠️ inaccurate or slow ⚠️ web content only ✅ native + web
Structured element data ❌ needs OCR/vision models ✅ web only ✅ names, roles, states, positions
Works across Linux, macOS, Windows

Table of Contents


Install

Requires Python 3.10+.

pip install touchpoint-py

Everything is included: your platform's native backend, CDP support for browsers and Electron apps, the MCP server, and screenshot capabilities. Platform-specific dependencies are installed automatically via pip environment markers.

Platform requirements

Platform Backend Requirement
Linux AT-SPI2 Install xdotool for input. Most desktops include python3-gi and gir1.2-atspi-2.0 — install them if missing.
Windows UI Automation None — uses built-in COM APIs
macOS Accessibility (AX) Grant permission: System Settings → Privacy & Security → Accessibility

Quick Start

import touchpoint as tp

# Discover
apps = tp.apps()                            # ["Firefox", "Slack", "Terminal", ...]
windows = tp.windows()                      # Window objects with title, position, size
all_els = tp.elements(app="Firefox", named_only=True)  # only elements with text labels

# Find
results = tp.find("Search", role=tp.Role.TEXT_FIELD, app="Firefox")

# Act
tp.set_value(results[0], "touchpoint python", replace=True)
tp.press_key("enter")
tp.hotkey("ctrl", "s")                      # keyboard shortcuts

# Wait for UI changes
tp.wait_for("results", app="Firefox", timeout=10)

# Screenshot
img = tp.screenshot()                       # full desktop → PIL.Image
img = tp.screenshot(app="Firefox")           # cropped to app window

Element IDs

Every element has a unique ID like atspi:1234:1:2.0 or cdp:9222:TID:4. Action functions accept either an Element object or a bare ID string — useful for storing references across steps:

results = tp.find("Send", max_results=1)
element_id = results[0].id                  # "atspi:1234:1:5.2"

# later...
tp.click(element_id)                        # works with just the string

Output formats

Control how results are returned:

tp.elements(app="Slack", format="flat")     # one compact line per element (best for LLMs)
tp.elements(app="Slack", format="tree")     # indented parent/child hierarchy
tp.elements(app="Slack", format="json")     # full JSON with all fields

MCP Server

Touchpoint ships an MCP (Model Context Protocol) server with 19 tools, ready for any MCP-compatible client. Use it to let LLM agents like Claude, Cursor, or Copilot control your desktop.

Tools

Category Tools
Discovery apps, windows, find, elements, get_element
Screenshot screenshot (returns image data the LLM can see)
Actions click (left/right/double), set_value, set_numeric_value, focus, action
Keyboard type_text, press_key (single key or combo)
Mouse mouse_move, scroll
Window activate_window
Waiting wait_for, wait_for_app, wait_for_window

The MCP server includes built-in instructions that teach LLM agents how to work effectively — the orient → locate → act → verify loop, how to use find(), and how to recover from errors.

         ┌──────────┐
    ┌───▶│  ORIENT  │  screenshot · apps · windows
    │    └────┬─────┘
    │         ▼
    │    ┌──────────┐
    │    │  LOCATE  │  find · elements · get_element
    │    └────┬─────┘
    │         ▼
    │    ┌──────────┐
    │    │   ACT    │  click · set_value · type_text · press_key
    │    └────┬─────┘
    │         ▼
    │    ┌──────────┐
    │    │  VERIFY  │───▶ Done ✅
    │    └────┬─────┘
    │         │ not yet
    └─────────┘

Client setup

<details> <summary><strong>Claude Desktop</strong></summary>

Config file location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
{
  "mcpServers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

If using a virtualenv, use the full path: "/path/to/venv/bin/touchpoint-mcp"

</details>

<details> <summary><strong>VS Code / GitHub Copilot</strong></summary>

Add to .vscode/mcp.json in your workspace:

{
  "servers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

</details>

<details> <summary><strong>Cursor</strong></summary>

Create or edit ~/.cursor/mcp.json:

{
  "mcpServers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

</details>

<details> <summary><strong>Windsurf</strong></summary>

Edit ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

</details>

<details> <summary><strong>Claude Code (CLI)</strong></summary>

claude mcp add touchpoint -- touchpoint-mcp

</details>

<details> <summary><strong>OpenClaw</strong></summary>

Add to mcpServers in ~/.openclaw/openclaw.json:

{
  "mcpServers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

</details>

Environment variables

<details> <summary>All optional — click to see available settings</summary>

<br>

Variable Example Description
TOUCHPOINT_CDP_DISCOVER true Auto-discover CDP ports from running processes
TOUCHPOINT_CDP_PORTS {"Chrome": 9222} Explicit app-to-port mapping (JSON)
TOUCHPOINT_CDP_APP Google Chrome Single app name (pair with _PORT)
TOUCHPOINT_CDP_PORT 9222 Single port (pair with _APP)
TOUCHPOINT_CDP_REFRESH_INTERVAL 5.0 Seconds between CDP port scans
TOUCHPOINT_SCALE_FACTOR 1.25 Display scale override
TOUCHPOINT_FUZZY_THRESHOLD 0.6 Minimum match score for find() (0.0–1.0)
TOUCHPOINT_FALLBACK_INPUT true Use coordinate fallback when native actions fail
TOUCHPOINT_MAX_ELEMENTS 5000 Maximum elements per query
TOUCHPOINT_MAX_DEPTH 10 Default tree depth limit

</details>


Browser & Electron Apps (CDP)

Native accessibility APIs return limited data for Electron and Chromium apps (Slack, Discord, VS Code, etc.). Touchpoint's CDP backend connects via Chrome DevTools Protocol to get the full web content.

Auto-discovery is enabled by default — Touchpoint automatically finds running browsers and Electron apps that were launched with a debug port. No manual configuration needed beyond launching the app with the flag.

Setup

  1. Launch the app with a debug port:
# Linux
google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/tp-chrome

# macOS
open -na "Google Chrome" --args --remote-debugging-port=9222 --user-data-dir=/tmp/tp-chrome

# Windows
start chrome --remote-debugging-port=9222 --user-data-dir=%TEMP%\tp-chrome
  1. Configure Touchpoint:
import touchpoint as tp

tp.configure(cdp_discover=True)             # auto-discover from running processes
# or
tp.configure(cdp_ports={"Google Chrome": 9222})  # explicit mapping
  1. Control what you get with the source parameter:
tp.elements(app="Google Chrome", source="full")     # native chrome + web content (default)
tp.elements(app="Google Chrome", source="ax")       # web content only (CDP accessibility tree)
tp.elements(app="Google Chrome", source="native")   # native UI only (toolbar, tabs, menus)
tp.elements(app="Google Chrome", source="dom")      # DOM walker (catches what AX misses)

CDP results are merged with native backend results — you get the toolbar and window controls from AT-SPI2/UIA/AX, combined with the full web page content from CDP, in a single elements() call.


API Reference

Discovery

Function Description
tp.apps() List application names in the accessibility tree
tp.windows() All windows with id, title, app, position, size, active state
tp.elements(app, role, states, ...) UI elements, with filtering, tree mode, and formatting
tp.element_at(x, y) Deepest element at screen coordinates
tp.get_element(id) Fresh snapshot of a single element by ID

Search & Wait

Function Description
tp.find(query, app, role, ...) Search by name — 4-stage matching: exact → contains → word → fuzzy
tp.wait_for(query, ...) Poll until elements appear (or disappear with gone=True)
tp.wait_for_app(app, ...) Poll until an app appears or disappears
tp.wait_for_window(title, ...) Poll until a window appears or disappears

Actions

Function Description
tp.click(element) Click via accessibility action, with coordinate fallback
tp.double_click(element) Double-click
tp.right_click(element) Right-click / context menu
tp.set_value(element, text) Set text content (replace=True to clear first)
tp.set_numeric_value(element, n) Set slider or spinbox value
tp.focus(element) Move keyboard focus
tp.action(element, name) Execute a raw accessibility action by name
tp.activate_window(window) Bring a window to the foreground

Input

Function Description
tp.type_text(text) Type into the currently focused element
tp.press_key(key) Press and release a key ("enter", "tab", "escape")
tp.hotkey(*keys) Key combination (tp.hotkey("ctrl", "s"))
tp.click_at(x, y) Click at screen coordinates
tp.double_click_at(x, y) Double-click at coordinates
tp.right_click_at(x, y) Right-click at coordinates
tp.mouse_move(x, y) Move the cursor
tp.scroll(direction, amount) Scroll at current cursor position

Screenshot & Config

Function Description
tp.screenshot(app, element, ...) Full desktop or cropped to app/window/element/monitor
tp.monitor_count() Number of connected monitors
tp.configure(...) Set runtime options (see Configuration)

All action functions accept an Element object or a string ID. elements(), find(), and get_element() support format="flat", format="json", or format="tree" (elements only) to return pre-formatted strings instead of objects.


Architecture

┌───────────────────────────────────────────────────────┐
│               import touchpoint as tp                 │
│  tp.find() · tp.click() · tp.screenshot() · ...       │
│                    (Public API)                       │
├─────────────────────────┬─────────────────────────────┤
│     Backend (ABC)       │    InputProvider (ABC)      │
├─────────────────────────┼─────────────────────────────┤
│  AT-SPI2     (Linux)    │  Xdotool       (X11)        │
│  UIA         (Windows)  │  SendInput     (Win32)      │
│  AX          (macOS)    │  CGEvent       (macOS)      │
│  CDP         (browsers) │                             │
├─────────────────────────┴─────────────────────────────┤
│  Utilities: formatter · matcher · screenshot · scale  │
└───────────────────────────────────────────────────────┘

Two-layer design:

  • Backend reads the accessibility tree and runs structured actions (click, set_value, focus). Element-aware and reliable.
  • InputProvider simulates raw keyboard and mouse input. Coordinate-based and element-blind. Used as an automatic fallback when a native accessibility action isn't available.

CDP runs alongside the platform backend. Their results are merged: native window chrome (toolbar, tabs, menus) from AT-SPI2/UIA/AX, plus full web content from CDP, unified under one API.

For detailed internals, see ARCHITECTURE.md.


Configuration

tp.configure(
    fuzzy_threshold=0.6,          # minimum match score for find() (0.0–1.0)
    fallback_input=True,          # use InputProvider when native actions fail
    type_chunk_size=40,           # split long text into chunks for typing (0 = disable)
    max_elements=5000,            # max elements per query
    max_depth=10,                 # default tree depth limit
    scale_factor=None,            # display scale override (None = auto-detect)
    cdp_ports={"Chrome": 9222},   # explicit CDP port mapping
    cdp_discover=True,            # auto-discover CDP ports from running processes
    cdp_refresh_interval=5.0,     # seconds between CDP target scans
)

Development

git clone https://github.com/Touchpoint-Labs/touchpoint.git
cd touchpoint
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest

Status

Alpha — fully functional and tested on all three platforms. The API may change before 1.0 based on user feedback.

Platform Backend Input CDP Tests
Linux (X11) ✅ AT-SPI2 ✅ xdotool
Windows ✅ UIA ✅ SendInput
macOS ✅ AX ✅ CGEvent

Known limitations

  • Wayland input — The Linux InputProvider uses xdotool, which requires X11. On pure Wayland (no XWayland), keyboard/mouse simulation is unavailable. The accessibility tree and native actions still work.

  • Synchronous CDP — CDP calls block on WebSocket responses. JavaScript dialogs (alert, confirm, prompt) are auto-dismissed to prevent deadlocks. An async rewrite is planned.

  • No browser navigation API — Touchpoint doesn't have built-in URL navigation. Agents can navigate by interacting with UI elements directly: find the address bar, type a URL, press Enter.


Roadmap

High Priority

  • Async CDP architecture — non-blocking WebSocket, proper dialog queuing, concurrent multi-tab queries

Medium Priority

  • Text selection tool — select text within an element by content (e.g. "select this word"), across all backends
  • Window management tools — minimize, maximize, close, move, resize windows
  • Backend role/state parity — close remaining role mapping gaps (especially UIA on Windows)
  • Batch action tool — execute a sequence of actions in one call to reduce LLM round trips
  • Wayland input backendlibei / xdg-desktop-portal RemoteDesktop when X11 isn't available

Lower Priority

  • Tooltip and notification visibility
  • Code organisation (split large files)
  • Element caching

License

MIT

推荐服务器

Baidu Map

Baidu Map

百度地图核心API现已全面兼容MCP协议,是国内首家兼容MCP协议的地图服务商。

官方
精选
JavaScript
Playwright MCP Server

Playwright MCP Server

一个模型上下文协议服务器,它使大型语言模型能够通过结构化的可访问性快照与网页进行交互,而无需视觉模型或屏幕截图。

官方
精选
TypeScript
Magic Component Platform (MCP)

Magic Component Platform (MCP)

一个由人工智能驱动的工具,可以从自然语言描述生成现代化的用户界面组件,并与流行的集成开发环境(IDE)集成,从而简化用户界面开发流程。

官方
精选
本地
TypeScript
Audiense Insights MCP Server

Audiense Insights MCP Server

通过模型上下文协议启用与 Audiense Insights 账户的交互,从而促进营销洞察和受众数据的提取和分析,包括人口统计信息、行为和影响者互动。

官方
精选
本地
TypeScript
VeyraX

VeyraX

一个单一的 MCP 工具,连接你所有喜爱的工具:Gmail、日历以及其他 40 多个工具。

官方
精选
本地
graphlit-mcp-server

graphlit-mcp-server

模型上下文协议 (MCP) 服务器实现了 MCP 客户端与 Graphlit 服务之间的集成。 除了网络爬取之外,还可以将任何内容(从 Slack 到 Gmail 再到播客订阅源)导入到 Graphlit 项目中,然后从 MCP 客户端检索相关内容。

官方
精选
TypeScript
Kagi MCP Server

Kagi MCP Server

一个 MCP 服务器,集成了 Kagi 搜索功能和 Claude AI,使 Claude 能够在回答需要最新信息的问题时执行实时网络搜索。

官方
精选
Python
e2b-mcp-server

e2b-mcp-server

使用 MCP 通过 e2b 运行代码。

官方
精选
Neon MCP Server

Neon MCP Server

用于与 Neon 管理 API 和数据库交互的 MCP 服务器

官方
精选
Exa MCP Server

Exa MCP Server

模型上下文协议(MCP)服务器允许像 Claude 这样的 AI 助手使用 Exa AI 搜索 API 进行网络搜索。这种设置允许 AI 模型以安全和受控的方式获取实时的网络信息。

官方
精选