UI Perception Engine
Enables AI agents to perceive and interact with web interfaces by extracting a unified UI Scene Graph from live URLs, providing tools for navigation, element detection, visual analysis, and state tracking.
A perception layer for AI agents — human-like understanding of web interfaces.
Fuses structural (DOM + a11y tree + CSS), visual (three-tier vision pipeline), and temporal data into a unified UI Scene Graph that an LLM can reason over fluently. Exposed as an MCP server so Claude can navigate and inspect any live URL.
What It Does
- Navigates to any URL in a real Playwright browser
- Extracts a compact, LLM-readable scene graph from the live DOM
- Detects UI elements visually via OmniParser V2, understands layout via Qwen3-VL, and performs deep UX analysis via Claude Vision
- Tracks UI transitions and diffs between states
- Predicts affordances (what you can interact with, and what happens when you do)
- Exposes everything as MCP tools that Claude can call directly
MCP Tools
| Tool | Description |
|---|---|
| `navigate` | Navigate to a URL and return the initial scene graph |
| `get_scene` | Re-capture the current scene (compact text or full JSON) |
| `get_affordances` | List interactive elements ranked by priority |
| `act` | Execute browser actions: click, type, scroll, hover, keypress, navigate, wait |
| `get_console_logs` | Return captured browser console messages |
| `get_network_errors` | Return failed network requests |
| `get_screenshot` | Capture a screenshot and return it as an image |
| `detect_elements` | Run element detection (OmniParser or Claude Vision fallback) |
| `analyze_visual` | Visual understanding via Qwen3-VL: hierarchy, contrast, spacing, UX |
| `compare_states` | Diff current vs previous scene graph; shows what changed |
| `watch` | Start real-time keyframe capture (CDP screencast + perceptual hashing) |
| `stop_watch` | Stop keyframe capture and return summary of changes |
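MCP clients invoke tools like these through JSON-RPC 2.0 `tools/call` requests. The sketch below builds such a request for `navigate`. The `url` and `visual` argument names follow this README; the helper name `buildToolCall` and the example URL are illustrative assumptions, not part of the server's code.

```typescript
// Shape of a JSON-RPC 2.0 request for MCP's tools/call method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wraps a tool name and its arguments in a tools/call request.
function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown> = {},
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// Example: ask the server to open a page with the vision pipeline enabled.
const navigateRequest = buildToolCall(1, "navigate", {
  url: "https://example.com",
  visual: true,
});
console.log(JSON.stringify(navigateRequest, null, 2));
```

In practice an MCP client such as Claude Code constructs these requests for you; the sketch only shows what travels over the wire.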
Three-tier vision pipeline (visual=true)
Both navigate and get_scene accept an optional visual: true parameter. When enabled, the engine runs a three-tier vision pipeline:
- Tier A: OmniParser V2. Fast element detection (~0.8s). A YOLOv8 + Florence-2 model running as a Python sidecar on port 8100. Detects buttons, inputs, images, icons, and other UI elements with bounding boxes and labels.
- Tier B: Qwen3-VL via Ollama. Visual understanding (~2-4s). Analyzes the screenshot with detected element context to assess visual hierarchy, contrast issues, spacing problems, affordance clarity, and state indicators.
- Tier C: Claude Vision API. Deep UX analysis (~3-5s, on-demand only). Provides detailed qualitative analysis when requested via `analyze_visual` or `depth: 'deep'`.
Graceful degradation: Each tier skips silently if its backing service is unavailable. The system works with any combination of services running — from all three tiers down to structural-only analysis with no vision services at all.
Use detect_elements for fast Tier A detection only, or analyze_visual for the full Tier A + B pipeline.
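The degradation rules can be pictured as a small planning step that decides which tiers run for a capture. This is a minimal sketch under stated assumptions: `planTiers`, `ServiceStatus`, and the tier labels are hypothetical names, and the real orchestrator may probe services and combine results differently.

```typescript
type Tier = "A:omniparser" | "B:qwen3-vl" | "C:claude-vision";

// Assumed availability flags; how the engine actually checks services is not shown here.
interface ServiceStatus {
  omniparser: boolean; // Tier A sidecar reachable
  ollama: boolean; // Tier B model server reachable
  claude: boolean; // Tier C usable, e.g. ANTHROPIC_API_KEY is set
}

// Hypothetical planner: returns the tiers to run, in order, for one capture.
function planTiers(status: ServiceStatus, deep = false): Tier[] {
  const tiers: Tier[] = [];
  if (status.omniparser) tiers.push("A:omniparser");
  if (status.ollama) tiers.push("B:qwen3-vl");
  // Tier C runs on demand, or as the fallback when no local service is up.
  const noLocalService = !status.omniparser && !status.ollama;
  if (status.claude && (deep || noLocalService)) tiers.push("C:claude-vision");
  return tiers; // an empty list means structural-only analysis
}

console.log(planTiers({ omniparser: true, ollama: false, claude: true }));
console.log(planTiers({ omniparser: false, ollama: false, claude: false }));
```

Modelling availability as plain booleans keeps the decision logic separate from health checks, so each unavailable tier is skipped without affecting the others.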
act action types
| Type | Parameters |
|---|---|
| `click` | `x`, `y` |
| `clickSelector` | `selector` |
| `type` | `text`, `selector` (optional) |
| `scroll` | `direction` (up/down), `amount` (optional) |
| `hover` | `x`, `y` |
| `wait` | `ms` |
| `navigate` | `url` |
| `back` | (none) |
| `pressKey` | `key` |
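The action table maps naturally onto a discriminated union. The sketch below is one way a caller might type its `act` payloads. The type and parameter names come from the table; the value types (numbers for `x`, `y`, `ms`, `amount`) and the `describeAction` helper are assumptions made for illustration.

```typescript
// One variant per row of the action table; value types are assumed.
type Action =
  | { type: "click"; x: number; y: number }
  | { type: "clickSelector"; selector: string }
  | { type: "type"; text: string; selector?: string }
  | { type: "scroll"; direction: "up" | "down"; amount?: number }
  | { type: "hover"; x: number; y: number }
  | { type: "wait"; ms: number }
  | { type: "navigate"; url: string }
  | { type: "back" }
  | { type: "pressKey"; key: string };

// Hypothetical helper: renders an action as a short log line.
function describeAction(action: Action): string {
  switch (action.type) {
    case "click":
    case "hover":
      return `${action.type} at (${action.x}, ${action.y})`;
    case "clickSelector":
      return `click ${action.selector}`;
    case "type":
      return action.selector
        ? `type "${action.text}" into ${action.selector}`
        : `type "${action.text}"`;
    case "scroll":
      return action.amount !== undefined
        ? `scroll ${action.direction} by ${action.amount}`
        : `scroll ${action.direction}`;
    case "wait":
      return `wait ${action.ms}ms`;
    case "navigate":
      return `navigate to ${action.url}`;
    case "back":
      return "go back";
    case "pressKey":
      return `press ${action.key}`;
  }
}

console.log(describeAction({ type: "clickSelector", selector: "a[href='/orders']" }));
```

With a union like this, the compiler rejects payloads that mix parameters from different action types, such as a `click` without coordinates.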
Installation
From npm
npm install -g ui-perception-engine
Or use without installing:
npx ui-perception-engine
Install Playwright browsers (first time only):
npx playwright install chromium
From source
git clone https://github.com/dirkknibbe/uipe.git
cd uipe
pnpm install
pnpm build
Claude Code MCP Configuration
Add to your Claude Code MCP config (~/.claude/mcp.json or project .mcp.json):
{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "npx",
      "args": ["ui-perception-engine"]
    }
  }
}
Or if installed globally:
{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "uipe"
    }
  }
}
Or from a local clone:
{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "node",
      "args": ["/path/to/uipe/ui-perception-engine/packages/core/dist/src/mcp/index.js"],
      "env": {
        "OLLAMA_URL": "http://localhost:11434",
        "OLLAMA_MODEL": "qwen3-vl:8b",
        "OMNIPARSER_URL": "http://localhost:8100",
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}
Environment variables
| Variable | Purpose |
|---|---|
| `ANTHROPIC_API_KEY` | Claude Vision API: detection fallback + deep analysis (Tier C) |
| `OLLAMA_URL` | Ollama server URL (default: http://localhost:11434) |
| `OLLAMA_MODEL` | Vision model name (default: qwen3-vl:8b) |
| `OMNIPARSER_URL` | OmniParser V2 sidecar URL (default: http://localhost:8100) |
See .env.example for the full list of configurable variables including frame capture, browser, and temporal settings.
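To illustrate how the documented defaults apply, here is a minimal sketch of a config loader. The variable names and default values are taken from the table above; `loadVisionConfig` and `VisionConfig` are hypothetical names, and the project's own `config.ts` may be structured differently.

```typescript
interface VisionConfig {
  ollamaUrl: string;
  ollamaModel: string;
  omniparserUrl: string;
  anthropicApiKey: string | undefined; // no default: Tier C is skipped without it
}

type Env = Record<string, string | undefined>;

// Hypothetical loader: reads the four documented variables and applies their defaults.
function loadVisionConfig(env: Env): VisionConfig {
  return {
    ollamaUrl: env["OLLAMA_URL"] ?? "http://localhost:11434",
    ollamaModel: env["OLLAMA_MODEL"] ?? "qwen3-vl:8b",
    omniparserUrl: env["OMNIPARSER_URL"] ?? "http://localhost:8100",
    anthropicApiKey: env["ANTHROPIC_API_KEY"],
  };
}

// In a Node.js process you would typically pass process.env here.
const visionConfig = loadVisionConfig({ OMNIPARSER_URL: "http://localhost:9000" });
console.log(visionConfig);
```

Passing the environment in as an argument, rather than reading a global, keeps the loader easy to test with plain objects.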
Local Vision Services
The three-tier vision pipeline uses two local services. Both are optional — the system degrades gracefully without them.
Ollama (Tier B — visual understanding):
# Install Ollama: https://ollama.com
ollama pull qwen3-vl:8b
ollama list # verify model is available
# Ollama serves on http://localhost:11434 by default
OmniParser V2 (Tier A — element detection):
OmniParser runs as a Python FastAPI sidecar on port 8100. See the Local Vision Handoff doc section 5 for full setup instructions.
# Quick check if OmniParser is running:
curl -s http://localhost:8100/health
Without local services: If neither Ollama nor OmniParser is running, visual=true falls back to Claude Vision API (requires ANTHROPIC_API_KEY). If no vision service is available at all, the engine uses structural-only analysis (DOM + a11y tree).
Using with the live-deployment-check Skill
The live-deployment-check skill pairs directly with this MCP server to visually verify a deployed site or app — catching broken images, empty routes, stuck spinners, and placeholder text that only surface in a real browser.
Workflow
1. navigate(url) → load the page, get initial scene
2. get_scene() → re-capture after JS hydrates (critical for SPAs)
3. get_console_logs() → check for JS errors (type="error")
4. get_network_errors() → check for failed API/resource requests
5. Scan scene output → look for broken signals (see below)
6. act() on nav links → walk routes, verify each one loads
7. Report findings → list what's working and what's broken
Common Signals in Scene Output
# Broken image:
img[img]:"broken"
# Empty SPA route (component failed to load):
router-outlet[element] ← no children = problem
# Stuck loading spinner:
progressbar[progressbar] ← present after JS settles = API error
# Route loaded correctly:
router-outlet[element]
  app-order-list[element]:"Order Management..." ← has content = good
Example
// After deploying an Angular app
navigate("http://your-app.vercel.app")
get_scene() // wait for hydration
→ check router-outlet has content, no broken img nodes
// Walk routes
act({ type: "clickSelector", selector: "a[href='/orders']" })
get_scene()
→ verify orders page loaded
act({ type: "clickSelector", selector: "a[href='/customers']" })
get_scene()
→ verify customers page loaded
What to Check
- Broken images: `img` nodes with `"broken"` content
- Empty routes: `router-outlet` with no child elements
- Stuck spinners: `progressbar` still present after `get_scene()`
- Placeholder text: `undefined`, `null`, `TODO`, `<repo-url>` in visible text
- Error pages: 404 or error component rendered instead of expected content
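Most of the checks above can be automated by scanning the compact scene text. The sketch below assumes one node per line in the `name[role]:"text"` form shown earlier, with children indented deeper than their parent; that layout, and the names `scanScene` and `Finding`, are assumptions rather than a documented format. It does not cover error pages, which depend on the app.

```typescript
type Issue = "broken-image" | "empty-route" | "stuck-spinner" | "placeholder-text";

interface Finding {
  line: number; // 1-based line number within the scene text
  issue: Issue;
}

// Placeholder strings from the checklist, matched as whole words.
const PLACEHOLDER_PATTERN = /\b(undefined|null|TODO)\b|<repo-url>/;

// Number of leading whitespace characters (assumed to encode nesting depth).
function indentOf(line: string): number {
  const firstNonSpace = line.search(/\S/);
  return firstNonSpace < 0 ? 0 : firstNonSpace;
}

function scanScene(scene: string): Finding[] {
  const lines = scene.split("\n");
  const findings: Finding[] = [];
  lines.forEach((raw, i) => {
    const node = raw.trim();
    if (node === "") return;
    const report = (issue: Issue) => findings.push({ line: i + 1, issue });

    if (node.startsWith("img[") && node.includes(':"broken"')) report("broken-image");
    // Only meaningful for a scene captured after the page has settled.
    if (node.startsWith("progressbar[")) report("stuck-spinner");
    if (node.startsWith("router-outlet[")) {
      const next = lines[i + 1];
      const hasChild =
        next !== undefined && next.trim() !== "" && indentOf(next) > indentOf(raw);
      if (!hasChild) report("empty-route");
    }
    // Inspect only the quoted visible text, not element names.
    const quoted = /:"(.*)"$/.exec(node);
    const text = quoted ? quoted[1] ?? "" : "";
    if (PLACEHOLDER_PATTERN.test(text)) report("placeholder-text");
  });
  return findings;
}

const sampleScene = [
  'img[img]:"broken"',
  "router-outlet[element]",
  "progressbar[progressbar]",
].join("\n");
console.log(scanScene(sampleScene));
```

A scanner like this is a first pass only; anything it flags is worth confirming with `get_screenshot` or the console and network logs.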
Development
# Root workspace (delegates to all packages via -r)
pnpm test # run all tests
pnpm build # compile TypeScript
pnpm lint # lint all packages
# @uipe/core package only
pnpm -F @uipe/core test:watch # watch mode
pnpm -F @uipe/core mcp # start MCP server (after build)
pnpm -F @uipe/core start:dev # check services + start MCP server
pnpm -F @uipe/core exec vitest run --reporter=verbose # verbose test output
Architecture
packages/
├── contracts/                       ← shared types (@uipe/contracts)
└── core/                            ← perception engine + MCP server (@uipe/core)
    └── src/
        ├── config.ts                ← centralized config (dotenv)
        ├── types/                   ← internal types
        ├── browser/                 ← BrowserRuntime (Playwright)
        ├── pipelines/
        │   ├── structural/          ← DOM + a11y tree extraction
        │   ├── visual/
        │   │   ├── index.ts         ← Three-tier orchestrator (detect/understand/deep)
        │   │   ├── omniparser.ts    ← OmniParser V2 client (Tier A)
        │   │   ├── claude-vision.ts ← Claude Vision API (Tier C)
        │   │   ├── ollama-vision.ts ← Qwen3-VL via Ollama (Tier B)
        │   │   └── frame-capture.ts ← CDP screencast + perceptual hashing
        │   ├── fusion/              ← merge visual + structural → SceneGraph
        │   ├── temporal/            ← change detection + state tracking
        │   └── affordance/          ← predict interaction outcomes
        ├── mcp/                     ← MCP server (12 tools)
        └── utils/
- Viewport default: 1280x720 (configurable via env)
- Screenshot format: PNG (lossless, required for vision models)
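The frame-capture module is described as using perceptual hashing to pick keyframes. This README does not say which hash is used, so the sketch below swaps in a simple average hash with Hamming distance purely to illustrate the idea. All function names are hypothetical, and frames are assumed to be already decoded and downscaled to small grayscale arrays.

```typescript
// Average hash: one bit per pixel, set when the pixel is brighter than the mean.
function averageHash(gray: number[]): boolean[] {
  const mean = gray.reduce((sum, v) => sum + v, 0) / gray.length;
  return gray.map((v) => v > mean);
}

// Number of bit positions where two equal-length hashes differ.
function hammingDistance(a: boolean[], b: boolean[]): number {
  if (a.length !== b.length) throw new Error("hash lengths differ");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

// Keeps the first frame, then every frame whose hash differs from the
// last kept keyframe by more than `threshold` bits. Returns frame indices.
function selectKeyframes(grayFrames: number[][], threshold: number): number[] {
  const keyframes: number[] = [];
  let lastHash: boolean[] | undefined;
  grayFrames.forEach((frame, i) => {
    const hash = averageHash(frame);
    if (lastHash === undefined || hammingDistance(lastHash, hash) > threshold) {
      keyframes.push(i);
      lastHash = hash;
    }
  });
  return keyframes;
}

// Three tiny 2x2 "frames": the second is a near-duplicate, the third is inverted.
const sampleFrames = [
  [0, 0, 255, 255],
  [10, 5, 250, 240],
  [255, 255, 0, 0],
];
console.log(selectKeyframes(sampleFrames, 1));
```

Comparing against the last kept keyframe rather than the previous frame keeps slow, gradual changes from slipping under the threshold.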