UI Perception Engine

Enables AI agents to perceive and interact with web interfaces by extracting a unified UI Scene Graph from live URLs, providing tools for navigation, element detection, visual analysis, and state tracking.

A perception layer for AI agents — human-like understanding of web interfaces.

Fuses structural (DOM + a11y tree + CSS), visual (three-tier vision pipeline), and temporal data into a unified UI Scene Graph that an LLM can reason over fluently. Exposed as an MCP server so Claude can navigate and inspect any live URL.

What It Does

  • Navigates to any URL in a real Playwright browser
  • Extracts a compact, LLM-readable scene graph from the live DOM
  • Detects UI elements visually via OmniParser V2, understands layout via Qwen3-VL, and performs deep UX analysis via Claude Vision
  • Tracks UI transitions and diffs between states
  • Predicts affordances (what you can interact with, and what happens when you do)
  • Exposes everything as MCP tools that Claude can call directly

MCP Tools

Tool                Description
navigate            Navigate to a URL and return the initial scene graph
get_scene           Re-capture the current scene (compact text or full JSON)
get_affordances     List interactive elements ranked by priority
act                 Execute browser actions: click, type, scroll, hover, keypress, navigate, wait
get_console_logs    Return captured browser console messages
get_network_errors  Return failed network requests
get_screenshot      Capture a screenshot and return it as an image
detect_elements     Run element detection (OmniParser or Claude Vision fallback)
analyze_visual      Visual understanding via Qwen3-VL: hierarchy, contrast, spacing, UX
compare_states      Diff current vs previous scene graph; shows what changed
watch               Start real-time keyframe capture (CDP screencast + perceptual hashing)
stop_watch          Stop keyframe capture and return summary of changes
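
The watch tool is described as selecting keyframes with perceptual hashing. As an illustration of the general idea only, here is a minimal sketch using an average hash and Hamming distance. It assumes frames have already been downscaled to 8x8 grayscale; the function names and the 5-bit threshold are hypothetical, and the engine's actual hash may well differ.

```typescript
// Hypothetical sketch of perceptual-hash keyframe selection. Not the engine's actual code.

/** A frame already downscaled to 8x8 grayscale: 64 luminance values in 0..255. */
type GrayFrame = readonly number[];

/** Average hash: one bit per pixel, set when the pixel is brighter than the frame mean. */
function averageHash(frame: GrayFrame): boolean[] {
  if (frame.length !== 64) throw new Error("expected 64 pixels (8x8)");
  const mean = frame.reduce((sum, v) => sum + v, 0) / frame.length;
  return frame.map((v) => v > mean);
}

/** Hamming distance: number of bit positions where two hashes differ. */
function hammingDistance(a: readonly boolean[], b: readonly boolean[]): number {
  if (a.length !== b.length) throw new Error("hash length mismatch");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

/** Indices of frames whose hash differs from the last keyframe by more than `threshold` bits. */
function selectKeyframes(frames: readonly GrayFrame[], threshold = 5): number[] {
  const keyframes: number[] = [];
  let lastHash: boolean[] | null = null;
  frames.forEach((frame, index) => {
    const hash = averageHash(frame);
    if (lastHash === null || hammingDistance(lastHash, hash) > threshold) {
      keyframes.push(index);
      lastHash = hash; // compare later frames against the most recent keyframe
    }
  });
  return keyframes;
}
```

The first frame is always kept, and near-identical frames after it are dropped until the picture changes by more than the threshold.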

Three-tier vision pipeline (visual=true)

Both navigate and get_scene accept an optional visual: true parameter. When enabled, the engine runs a three-tier vision pipeline:

  • Tier A: OmniParser V2 — Fast element detection (~0.8s). A YOLOv8 + Florence-2 model running as a Python sidecar on port 8100. Detects buttons, inputs, images, icons, and other UI elements with bounding boxes and labels.
  • Tier B: Qwen3-VL via Ollama — Visual understanding (~2-4s). Analyzes the screenshot with detected element context to assess visual hierarchy, contrast issues, spacing problems, affordance clarity, and state indicators.
  • Tier C: Claude Vision API — Deep UX analysis (~3-5s, on-demand only). Provides detailed qualitative analysis when requested via analyze_visual or depth: 'deep'.

Graceful degradation: Each tier skips silently if its backing service is unavailable. The system works with any combination of services running — from all three tiers down to structural-only analysis with no vision services at all.

Use detect_elements for fast Tier A detection only, or analyze_visual for the full Tier A + B pipeline.
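
The graceful degradation described above can be pictured as running each tier in turn and skipping any tier whose service fails. The sketch below is an assumption about the shape of such an orchestrator, not the engine's code: the Tier interface, runVisionTiers, and the tier names used as keys are all hypothetical, and it ignores the detail that Tier B receives Tier A's detections as context.

```typescript
// Hypothetical sketch of tiered execution with graceful degradation.
type TierName = "omniparser" | "ollama" | "claude";

interface Tier<T> {
  name: TierName;
  run: (screenshot: Uint8Array) => Promise<T>;
}

interface PipelineResult<T> {
  results: Partial<Record<TierName, T>>;
  skipped: TierName[];
}

/** Runs every tier in order; a tier that throws is recorded as skipped, not fatal. */
async function runVisionTiers<T>(
  tiers: readonly Tier<T>[],
  screenshot: Uint8Array,
): Promise<PipelineResult<T>> {
  const results: Partial<Record<TierName, T>> = {};
  const skipped: TierName[] = [];
  for (const tier of tiers) {
    try {
      results[tier.name] = await tier.run(screenshot);
    } catch {
      // Service unreachable or errored: skip it and continue with the rest.
      skipped.push(tier.name);
    }
  }
  return { results, skipped };
}
```

With no tiers available, the result is simply empty, which corresponds to the structural-only case.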

act action types

Type           Parameters
click          x, y
clickSelector  selector
type           text, selector (optional)
scroll         direction (up/down), amount (optional)
hover          x, y
wait           ms
navigate       url
back           (none)
pressKey       key
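
For callers written in TypeScript, the action types above map naturally onto a discriminated union. The types below are a sketch inferred from the table, not the engine's published schema: the numeric types for x, y, ms, and amount are assumptions, and describeAction is a hypothetical helper.

```typescript
// Hypothetical types mirroring the action table; the engine's real schema may differ.
type ActAction =
  | { type: "click"; x: number; y: number }
  | { type: "clickSelector"; selector: string }
  | { type: "type"; text: string; selector?: string }
  | { type: "scroll"; direction: "up" | "down"; amount?: number }
  | { type: "hover"; x: number; y: number }
  | { type: "wait"; ms: number }
  | { type: "navigate"; url: string }
  | { type: "back" }
  | { type: "pressKey"; key: string };

/** One-line, human-readable summary of an action, e.g. for logging. */
function describeAction(action: ActAction): string {
  switch (action.type) {
    case "click":
    case "hover":
      return `${action.type} at (${action.x}, ${action.y})`;
    case "clickSelector":
      return `click ${action.selector}`;
    case "type":
      return action.selector !== undefined
        ? `type "${action.text}" into ${action.selector}`
        : `type "${action.text}"`;
    case "scroll":
      return action.amount !== undefined
        ? `scroll ${action.direction} (amount ${action.amount})`
        : `scroll ${action.direction}`;
    case "wait":
      return `wait ${action.ms} ms`;
    case "navigate":
      return `navigate to ${action.url}`;
    case "back":
      return "go back";
    case "pressKey":
      return `press ${action.key}`;
  }
}
```

Because the switch is exhaustive over the type field, adding a new action to the union makes the compiler flag every place that needs updating.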

Installation

From npm

npm install -g ui-perception-engine

Or use without installing:

npx ui-perception-engine

Install Playwright browsers (first time only):

npx playwright install chromium

From source

git clone https://github.com/dirkknibbe/uipe.git
cd uipe
pnpm install
pnpm build

Claude Code MCP Configuration

Add to your Claude Code MCP config (~/.claude/mcp.json or project .mcp.json):

{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "npx",
      "args": ["ui-perception-engine"]
    }
  }
}

Or if installed globally:

{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "uipe"
    }
  }
}

Or from a local clone:

{
  "mcpServers": {
    "ui-perception-engine": {
      "command": "node",
      "args": ["/path/to/uipe/ui-perception-engine/packages/core/dist/src/mcp/index.js"],
      "env": {
        "OLLAMA_URL": "http://localhost:11434",
        "OLLAMA_MODEL": "qwen3-vl:8b",
        "OMNIPARSER_URL": "http://localhost:8100",
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}

Environment variables

Variable           Purpose
ANTHROPIC_API_KEY  Claude Vision API: detection fallback + deep analysis (Tier C)
OLLAMA_URL         Ollama server URL (default: http://localhost:11434)
OLLAMA_MODEL       Vision model name (default: qwen3-vl:8b)
OMNIPARSER_URL     OmniParser V2 sidecar URL (default: http://localhost:8100)

See .env.example for the full list of configurable variables including frame capture, browser, and temporal settings.
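
To make the defaults concrete, here is a minimal sketch of how these four variables could be resolved. The VisionConfig shape and loadVisionConfig are hypothetical names for illustration; the project's own config.ts may be organized differently.

```typescript
// Hypothetical sketch: resolve the documented variables with their documented defaults.
interface VisionConfig {
  ollamaUrl: string;
  ollamaModel: string;
  omniparserUrl: string;
  anthropicApiKey: string | undefined; // no default: Claude Vision is unavailable without it
}

function loadVisionConfig(env: Record<string, string | undefined>): VisionConfig {
  return {
    ollamaUrl: env["OLLAMA_URL"] ?? "http://localhost:11434",
    ollamaModel: env["OLLAMA_MODEL"] ?? "qwen3-vl:8b",
    omniparserUrl: env["OMNIPARSER_URL"] ?? "http://localhost:8100",
    anthropicApiKey: env["ANTHROPIC_API_KEY"],
  };
}
```

In Node, this would typically be called as loadVisionConfig(process.env).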

Local Vision Services

The three-tier vision pipeline uses two local services. Both are optional — the system degrades gracefully without them.

Ollama (Tier B — visual understanding):

# Install Ollama: https://ollama.com
ollama pull qwen3-vl:8b
ollama list                    # verify model is available
# Ollama serves on http://localhost:11434 by default

OmniParser V2 (Tier A — element detection):

OmniParser runs as a Python FastAPI sidecar on port 8100. See section 5 of the Local Vision Handoff doc for full setup instructions.

# Quick check if OmniParser is running:
curl -s http://localhost:8100/health

Without local services: If neither Ollama nor OmniParser is running, visual=true falls back to Claude Vision API (requires ANTHROPIC_API_KEY). If no vision service is available at all, the engine uses structural-only analysis (DOM + a11y tree).
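
The fallback rules in the paragraph above reduce to a small decision function. This is a sketch of that logic as described here, with hypothetical names (ServiceStatus, resolveVisionMode, and the mode labels); how the engine actually probes its services is not shown.

```typescript
// Hypothetical sketch of the documented fallback order for visual=true.
type VisionMode = "local" | "claude-fallback" | "structural-only";

interface ServiceStatus {
  omniparserUp: boolean;
  ollamaUp: boolean;
  hasAnthropicKey: boolean;
}

function resolveVisionMode(status: ServiceStatus): VisionMode {
  // Any local service running: use the local tiers that are available.
  if (status.omniparserUp || status.ollamaUp) return "local";
  // Neither local service is running: fall back to Claude Vision if a key is set.
  if (status.hasAnthropicKey) return "claude-fallback";
  // No vision service at all: DOM + a11y tree only.
  return "structural-only";
}
```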

Using with the live-deployment-check Skill

The live-deployment-check skill pairs directly with this MCP server to visually verify a deployed site or app — catching broken images, empty routes, stuck spinners, and placeholder text that only surface in a real browser.

Workflow

1. navigate(url)            → load the page, get initial scene
2. get_scene()              → re-capture after JS hydrates (critical for SPAs)
3. get_console_logs()       → check for JS errors (type="error")
4. get_network_errors()     → check for failed API/resource requests
5. Scan scene output        → look for broken signals (see below)
6. act() on nav links       → walk routes, verify each one loads
7. Report findings          → list what's working and what's broken
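
Steps 1 to 4 of the workflow above can be sketched as a small driver. Everything here beyond the tool names is an assumption: CallTool is a hypothetical stand-in for whatever MCP client you use, the url argument name for navigate is a guess, and tool results are treated as plain strings for simplicity.

```typescript
// Hypothetical driver for steps 1-4. `CallTool` abstracts over the MCP client in use.
type CallTool = (name: string, args?: Record<string, unknown>) => Promise<string>;

interface CheckReport {
  url: string;
  scene: string;
  consoleLogs: string;
  networkErrors: string;
}

async function checkDeployment(callTool: CallTool, url: string): Promise<CheckReport> {
  await callTool("navigate", { url });                        // 1. load the page
  const scene = await callTool("get_scene");                  // 2. re-capture after hydration
  const consoleLogs = await callTool("get_console_logs");     // 3. look for JS errors
  const networkErrors = await callTool("get_network_errors"); // 4. look for failed requests
  return { url, scene, consoleLogs, networkErrors };
}
```

Scanning the scene, walking routes, and reporting (steps 5 to 7) are left to the caller, since they depend on the app being checked.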

Common Signals in Scene Output

# Broken image:
img[img]:"broken"

# Empty SPA route (component failed to load):
router-outlet[element]          ← no children = problem

# Stuck loading spinner:
progressbar[progressbar]        ← present after JS settles = API error

# Route loaded correctly:
router-outlet[element]
  app-order-list[element]:"Order Management..."   ← has content = good

Example

// After deploying an Angular app
navigate("http://your-app.vercel.app")
get_scene()                     // re-capture after hydration
// → check router-outlet has content, no broken img nodes

// Walk routes
act({ type: "clickSelector", selector: "a[href='/orders']" })
get_scene()
// → verify orders page loaded

act({ type: "clickSelector", selector: "a[href='/customers']" })
get_scene()
// → verify customers page loaded

What to Check

  • Broken images: img nodes with "broken" content
  • Empty routes: router-outlet with no child elements
  • Stuck spinners: progressbar still present after get_scene()
  • Placeholder text: undefined, null, TODO, <repo-url> in visible text
  • Error pages: 404 or error component rendered instead of expected content
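
Most of these checks can be automated with a simple line scan. The sketch below assumes the compact scene text looks like the samples in this README (one node per line as name[role], an optional :"text" suffix, children indented under their parent); scanScene and the Finding type are hypothetical, and error pages are not covered because they depend on the app.

```typescript
// Hypothetical scanner over compact scene text, based on the sample output in this README.
interface Finding {
  line: number; // 1-based line number in the scene text
  kind: "broken-image" | "empty-route" | "stuck-spinner" | "placeholder-text";
  text: string;
}

const PLACEHOLDER = /\b(undefined|null|TODO)\b|<repo-url>/;

function scanScene(scene: string): Finding[] {
  const lines = scene.split("\n");
  const findings: Finding[] = [];
  const indentOf = (s: string): number => s.search(/\S/);

  lines.forEach((raw, i) => {
    const node = raw.trim();
    if (node === "") return;
    const add = (kind: Finding["kind"]): void => {
      findings.push({ line: i + 1, kind, text: node });
    };

    if (/^img\[img\]:"broken"/.test(node)) add("broken-image");
    if (node.startsWith("progressbar[")) add("stuck-spinner");

    if (node.startsWith("router-outlet[")) {
      // A child is the next non-blank line, if it is indented deeper than this one.
      const next = lines.slice(i + 1).find((l) => l.trim() !== "");
      const hasChild = next !== undefined && indentOf(next) > indentOf(raw);
      if (!hasChild) add("empty-route");
    }

    // Only inspect the quoted visible text, not node names or roles.
    const quoted = node.match(/:"(.*)"$/);
    if (PLACEHOLDER.test(quoted?.[1] ?? "")) add("placeholder-text");
  });
  return findings;
}
```

A progressbar is only a problem if it is still present once the page has settled, so run this on a scene captured after get_scene(), not on the initial navigate result.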

Development

# Root workspace (delegates to all packages via -r)
pnpm test          # run all tests
pnpm build         # compile TypeScript
pnpm lint          # lint all packages

# @uipe/core package only
pnpm -F @uipe/core test:watch          # watch mode
pnpm -F @uipe/core mcp                 # start MCP server (after build)
pnpm -F @uipe/core start:dev           # check services + start MCP server
pnpm -F @uipe/core exec vitest run --reporter=verbose  # verbose test output

Architecture

packages/
├── contracts/          ← shared types (@uipe/contracts)
└── core/               ← perception engine + MCP server (@uipe/core)
    └── src/
        ├── config.ts           ← centralized config (dotenv)
        ├── types/              ← internal types
        ├── browser/            ← BrowserRuntime (Playwright)
        ├── pipelines/
        │   ├── structural/     ← DOM + a11y tree extraction
        │   ├── visual/
        │   │   ├── index.ts        ← Three-tier orchestrator (detect/understand/deep)
        │   │   ├── omniparser.ts   ← OmniParser V2 client (Tier A)
        │   │   ├── claude-vision.ts ← Claude Vision API (Tier C)
        │   │   ├── ollama-vision.ts ← Qwen3-VL via Ollama (Tier B)
        │   │   └── frame-capture.ts ← CDP screencast + perceptual hashing
        │   ├── fusion/         ← merge visual + structural → SceneGraph
        │   ├── temporal/       ← change detection + state tracking
        │   └── affordance/     ← predict interaction outcomes
        ├── mcp/                ← MCP server (12 tools)
        └── utils/

Viewport default: 1280x720 (configurable via env)
Screenshot format: PNG (lossless, required for vision models)
