AI Test Pilot
An MCP server that generates, runs, and triages tests by introspecting Python modules or web pages, using structured LLM outputs for scenario generation and failure analysis.
An LLM-driven test generator with a shared core and pluggable adapters: point it at a Python module or a web page, and it introspects the target, proposes test scenarios as schema-validated JSON, renders them into runnable tests, runs them, and triages the failures.
Overview
Most "AI writes your tests" tools let the model emit test code directly — which hallucinates imports, fabricates inputs, and asserts wrong things. AI Test Pilot takes the opposite stance: the LLM only ever returns structured, schema-validated JSON describing a scenario; every line of runnable code is rendered deterministically from that JSON. The LLM is used for the two genuinely fuzzy steps — proposing scenarios and judging ambiguous failures — and nothing else.
The same engine drives two target types through one adapter seam, so adding a new kind of target is a single new file with zero changes to the core:
- python_pytest: points at a Python module and generates pytest tests.
- web_playwright: points at a web page and generates Playwright end-to-end tests (and exports idiomatic TypeScript alongside the runnable Python). In served mode it goes deeper: a base_url and auth_state fixture pair (real storage_state reuse), page.route network interception, page.route_web_socket WebSocket mocking, and an async_playwright variant.
Features
- Structured-output pipeline: the LLM returns JSON validated against a Pydantic schema (with a one-shot repair retry); code is generated from the validated objects, never written by the model.
- Typed-input construction: recursively resolves a function's parameter types from source (dataclass + Pydantic, nested, Decimal/datetime/Enum, defaults) via ast, without importing the target, and builds real constructor calls. This lets it test domain/OO code, not just functions taking primitives.
- Characterization (golden) mode: runs each call once and locks the assertion to the real result, turning a generated test into a regression guard. Guarded against time-bombs: it double-runs and skips any clock/RNG-reading unit whose time isn't pinned.
- File & fixture inputs: creates real temp files for file-processing functions, and can optionally seed inputs from a companion synthetic data factory.
- Failure triage: a deterministic signal table classifies most failures for free (bad_scenario / env_issue / a broken golden lock → real_bug); the LLM is called only for the genuinely ambiguous ones.
- Advanced Playwright (served web mode): fixtures (base_url, auth_state), authenticated sessions via saved storage_state, network interception (page.route) to stub APIs deterministically, in-process WebSocket mocking (page.route_web_socket, server-push + echo), and an async_playwright variant. Each is just structured JSON the LLM emits; the model writes no Playwright code.
- Self-tracking ledger + self-improving tuning: every run is recorded to DuckDB; accept backfills how many tests you kept. The tool then proposes the best prompt version and (in auto mode) injects your previously-accepted scenarios for the same target as few-shot exemplars, closing the loop with zero extra LLM calls.
- Draft → suite workflow: discover scans a project and prints ready-to-run commands per module; promote strips a draft's boilerplate, rewrites golden locks into value assertions, and appends only the non-duplicate tests into an existing suite. Both are deterministic and zero-token.
- MCP server: exposes the engine as tools (introspect, generate_tests, triage_failures, run_metrics, accept_run) so it's callable from any MCP client.
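The failure-triage idea in the list above (cheap deterministic signals first, the LLM only as a fallback) might look roughly like the sketch below. The verdict labels come from the feature list. The regular expressions, the "golden lock mismatch" marker, and the `classify` and `triage` helpers are assumptions made for illustration, not the project's real signal table.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical signal table: (pattern over the failure text, verdict).
SIGNALS = [
    (re.compile(r"ModuleNotFoundError|ImportError|ConnectionRefusedError"), "env_issue"),
    (re.compile(r"TypeError: .*(unexpected keyword|missing \d+ required)"), "bad_scenario"),
    (re.compile(r"golden lock mismatch"), "real_bug"),
]


def classify(failure_text: str) -> Optional[str]:
    """Return a verdict from the signal table, or None when nothing matches."""
    for pattern, verdict in SIGNALS:
        if pattern.search(failure_text):
            return verdict
    return None


def triage(failures: Dict[str, str], ask_llm: Callable[[str], str]) -> Dict[str, str]:
    """Classify each failure, calling ask_llm only for unmatched ones."""
    return {
        test_id: classify(text) or ask_llm(text)
        for test_id, text in failures.items()
    }


failures = {
    "test_a": "ModuleNotFoundError: No module named 'yaml'",
    "test_b": "TypeError: f() got an unexpected keyword argument 'x'",
    "test_c": "AssertionError: assert 3 == 4",
}

llm_calls = []


def stub_llm(text: str) -> str:
    """Stand-in for the real model call; records what it was asked."""
    llm_calls.append(text)
    return "real_bug"


verdicts = triage(failures, ask_llm=stub_llm)
print(verdicts, len(llm_calls))
```

In this sketch only the bare assertion failure reaches the stubbed model; the other two are settled by the table at no token cost.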
How it works
flowchart LR
T([target: module or URL]) --> I[1 · introspect<br/>ast / DOM — deterministic]
I --> G[2 · generate<br/>LLM → schema-validated JSON]
G --> M[3 · materialize<br/>render code — deterministic]
M --> R[4 · run<br/>pytest / Playwright]
R --> TR[5 · triage<br/>signals + LLM for ambiguous]
TR --> L[6 · record<br/>DuckDB ledger]
L -. 7 · propose tuning .-> G
Stages 1, 3, 4, 6 cost zero tokens. Stage 2 is one batched LLM call; stage 5 calls the LLM only for failures the deterministic signal table can't classify. The core never imports an adapter directly — only through a name registry — which is what keeps the two target types fully decoupled:
flowchart TD
C[shared core<br/>introspect · generate · materialize · run · triage · record] --> RG[registry]
RG --> A1[python_pytest adapter]
RG --> A2[web_playwright adapter]
A1 --> P[(pytest)]
A2 --> PW[(Playwright + TS export)]
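A name registry of the kind the diagram describes could be sketched as follows. The `Adapter` protocol, the `register` decorator, and `get_adapter` are hypothetical names; the project's real registry module may be shaped differently. The point is only that the core looks adapters up by string and never imports an adapter module itself.

```python
from typing import Callable, Dict, List, Protocol


class Adapter(Protocol):
    """Hypothetical minimal adapter interface for this sketch."""

    def introspect(self, target: str) -> List[str]: ...


_REGISTRY: Dict[str, Callable[[], Adapter]] = {}


def register(name: str):
    """Class decorator that files an adapter factory under a name."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


def get_adapter(name: str) -> Adapter:
    """Core-side lookup: the core only ever knows the adapter's name."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown adapter {name!r}; known: {known}") from None


# In a real layout each adapter would live in its own file and register
# itself on import; both are inlined here to keep the sketch self-contained.
@register("python_pytest")
class PythonPytestAdapter:
    def introspect(self, target: str) -> List[str]:
        return [f"python units in {target}"]


@register("web_playwright")
class WebPlaywrightAdapter:
    def introspect(self, target: str) -> List[str]:
        return [f"DOM elements in {target}"]


print(get_adapter("python_pytest").introspect("module.py"))
```

Adding a third target type under this scheme means adding one more decorated class; `get_adapter` and its callers stay untouched.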
Demo — sample run
$ python scripts/main.py --target path/to/rules/commission.py --selector compute_commission --golden
introspected 1 unit(s); resolved types: OrderView, LineItemView, RulesConfig, CommissionRules
generated 5 scenario(s)
golden mode: locked 5 characterization assertion(s)
run complete: 5/5 passed
✓ 5 passed · 0 failed · 0 error / 5 generated
tests: scripts/outputs/tests/test_commission_<ts>.py
report: scripts/outputs/reports/report_<ts>.md
A generated test constructs the real typed inputs and locks the computed result:
def test_standard_commission():
"""Commission for a multi-item order."""
result = compute_commission(
order=OrderView(currency="PLN", status="DELIVERED", line_items=[
LineItemView(category="electronics", unit_amount=Decimal("100.00"), quantity=2)]),
config=RulesConfig())
assert repr(result) == ("CommissionBreakdown(currency='PLN', "
"items_commission=Decimal('4.00'), transaction_fee=Decimal('1.00'), "
"total_commission=Decimal('5.00'), rule_version='v1')")
For the web_playwright adapter, the same pipeline produces self-contained Playwright tests and an
idiomatic .spec.ts export. Three sample targets are included: demo/signup.html (simple form),
demo/login_app/ (served — auth/storage_state + API interception), and demo/ws_app/ (served —
WebSocket push/echo). The served demos are run with --serve:
# deep web: emits base_url + auth_state fixtures, page.route interception, async variant
python scripts/main.py --adapter web_playwright --target demo/login_app/index.html --serve
# websocket: emits page.route_web_socket mock (server push + echo) + expect_ws_message
python scripts/main.py --adapter web_playwright --target demo/ws_app/index.html --serve
Tech Stack
- Language: Python 3.10+
- Core: pydantic (the schema spine), jinja2 (pytest emission), duckdb (the run ledger)
- LLM: langchain-openai against any OpenAI-compatible gateway (LLM_BASE_URL / LLM_MODEL / LLM_API_KEY)
- Adapters: pytest (Python target runner), playwright (web; bundles its own driver, no Node needed)
- Integration: mcp (FastMCP), which exposes the engine over the Model Context Protocol
Getting Started
Prerequisites
- Python 3.10+ on PATH
Installation
git clone https://github.com/Drzymek92/ai-test-pilot.git
cd ai-test-pilot
python -m venv .venv
.venv\Scripts\activate # Windows (source .venv/bin/activate on macOS/Linux)
pip install -r requirements.txt
cp config/.env.example config/.env # then fill in your LLM gateway values
# for the web adapter only:
python -m playwright install chromium
Usage
# generate pytest tests for selected functions
python scripts/main.py --target path/to/module.py --selector func_a,func_b
# lock assertions to real results (characterization / regression mode)
python scripts/main.py --target path/to/module.py --golden
# generate Playwright tests for a web page
python scripts/main.py --adapter web_playwright --target path/to/page.html
# record how many proposed tests you kept (feeds tuning)
python scripts/main.py accept <run_id> --kept 4
# scan a project for testable targets (deterministic, no LLM)
python scripts/main.py discover path/to/project
# clean a draft for the suite: strip boilerplate, rewrite golden locks, append non-duplicates
python scripts/main.py promote <run_id> --into tests/test_module.py
Run as an MCP server (callable from any MCP client) — register it with:
{ "command": "python", "args": ["/path/to/ai-test-pilot/scripts/mcp_server.py"] }
Project Structure
scripts/
main.py # CLI + run_pipeline() (the one pipeline every interface reuses)
mcp_server.py # MCP server (FastMCP) exposing the engine as tools
core/ # adapter-agnostic engine: models, generate, materialize, runner, triage,
# ledger, tuning, context, fixtures, registry, discover, promote
adapters/ # python_pytest · web_playwright (one file per target type)
prompts/ # scenario-generation prompts + the pytest Jinja template
config/ # ai_test_pilot.toml (defaults) + .env.example
demo/ # signup.html · login_app/ (auth) · ws_app/ (websocket) — web adapter targets
tests/ # 89 unit tests
Design notes
- Determinism first. Introspection, code emission, running, the triage signal table, and the ledger are all plain code. The LLM is a tool for the two irreducibly fuzzy steps only.
- Never imports the target. Introspection is ast-only, so generating tests for a target never triggers its heavy or optional dependencies.
- Human-in-the-loop. Generated tests are proposed into scripts/outputs/ and never written into a target repository. Promoting them is a separate, explicit step.
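To make the "never imports the target" note concrete, here is a minimal sketch of reading parameter and field types with the standard `ast` module alone. `describe_functions` and `describe_class_fields` are hypothetical helpers, and the sample source (including `some_heavy_dependency`) is invented for illustration. The project's real resolver is recursive and handles more cases than this.

```python
import ast

# Sample target source. It is parsed, never executed, so the missing
# some_heavy_dependency package is irrelevant.
SOURCE = '''
from dataclasses import dataclass
from decimal import Decimal
import some_heavy_dependency

@dataclass
class Order:
    total: Decimal
    currency: str = "PLN"

def compute_fee(order: Order, rate: Decimal = Decimal("0.02")) -> Decimal:
    return order.total * rate
'''


def describe_functions(source: str) -> dict:
    """Map each top-level function to its parameter names and annotations."""
    tree = ast.parse(source)
    return {
        node.name: {
            a.arg: ast.unparse(a.annotation) if a.annotation else None
            for a in node.args.args
        }
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def describe_class_fields(source: str, class_name: str) -> dict:
    """Map the annotated fields of one top-level class to their annotations."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {
                stmt.target.id: ast.unparse(stmt.annotation)
                for stmt in node.body
                if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name)
            }
    return {}


functions = describe_functions(SOURCE)
fields = describe_class_fields(SOURCE, "Order")
print(functions)
print(fields)
```

Following the `Order` annotation from `compute_fee` to the class fields is the step a recursive resolver would repeat for nested types before building constructor calls.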
Validated end-to-end across a typed business-rules engine, pure data-transformation helpers, a web form, an authenticated app with API interception, and a WebSocket feed — producing runnable, correctly-typed, regression-grade tests in each case.
CI note: the published CI runs the full unit suite, which is browser-free by design: the web_playwright tests assert on the generated test source, not a live browser. The --serve demos are run locally (after playwright install chromium); CI doesn't download a browser.
License
Licensed under the MIT License — see LICENSE.