MCP 服务器

Cutible MCP Server

Enables AI agents to perform headless video editing through 35 tools for project creation, clip manipulation, rendering, quality control, and semantic search, all via JSON-RPC 2.0 over stdio.

README

Cutible — Agent-Native Montage Engine

A headless video-editing engine whose primary operator is an AI agent, not a human with a mouse. The agent reads the project as data, calls editing verbs, renders deterministically, and inspects the result through a QC loop — then iterates.

Architecture

              AGENT-REVISOR (LLM: planning, reasoning, decisions)
                    │
        ┌───────────┼───────────┐
        │ HANDS     │ EYES      │ MEMORY
        ▼           ▼           ▼
   Verb API    Perception    Semantic Media
   (14 low +   Loop          Index
   8 high)     (VLM+QC)     (scenes+transcript+VLM+embeddings)
        │           │           │
        └─────┬─────┘           │
              ▼                 │
     Timeline-as-Data ◄────────┘
     (JSON, diffable, auditable)
              │
              ▼
     ┌─────────────────────┐
     │ Deterministic Render │  ← FFmpeg (Contour A)
     │ Remotion (Contour B) │  ← Motion graphics
     │ Render Farm          │  ← Distributed GPU
     └─────────────────────┘
              │
              ▼
         QC Gate (deterministic + VLM)
              │
              ▼
     Final Video / OTIO → DaVinci/Premiere

What's Implemented

Plan concept	Module	Status
§4 Timeline-as-Data	`cutible/schema.py`	✅ pydantic, 3 zooms, content hash
§3.1 Low-level verbs (14)	`cutible/verbs.py`	✅ diffs, checkpoint/undo/branch
§3.1 High-level verbs (8)	`cutible/verbs_high.py`	✅ remove_silences, reframe, beat-sync, captions, ducking, assemble, make_short
§5 Ingest Pipeline	`cutible/ingest/`	✅ scenes, Whisper, VLM, audio analysis, embeddings
§5 Semantic Media Index	`cutible/index/`	✅ models, store, text/time/speaker/B-roll search
§3.2 Perception Loop	`cutible/perception/`	✅ VLM review + proxy render
§7 Multi-agent Swarm	`cutible/agents/`	✅ Planner, Editor, Sound, QC, Orchestrator
§6.1 Contour A (FFmpeg)	`cutible/compiler.py`	✅ deterministic render
§6.1 Contour B (Remotion)	`cutible/remotion/`	✅ TSX generation, config
§9 OTIO Bridge	`cutible/otio_bridge/`	✅ export/import to DaVinci/Premiere
§6.2 Distributed Render Farm	`cutible/render_farm/`	✅ scheduler, workers, assembly
§8.1 MCP Server	`cutible/mcp_server.py`	✅ 35 tools, JSON-RPC 2.0/stdio
§8.2 REST API	`cutible/api/`	✅ FastAPI, full CRUD
§8.3 Python SDK	`cutible/sdk/`	✅ in-process + HTTP client
§8.4 CLI	`cutible/cli.py`	✅ render/probe/view/qc/ingest/search/agent/export/import/farm
§12.3 Tests	`tests/`	✅ 30+ tests

Quick Start

pip install -e .                    # core
pip install -e ".[api]"             # + REST API (FastAPI/uvicorn)
pip install -e ".[whisper]"         # + Whisper transcription
pip install -e ".[all]"             # everything

# Generate synthetic assets
bash examples/make_assets.sh

# Watch the agent assemble a recap
python examples/agent_recap_demo.py

CLI

# Render
python -m cutible render project.json -o out.mp4 --qc

# Ingest a video into the semantic index
python -m cutible ingest speaker /path/to/speaker.mp4

# Search the index
python -m cutible search "moment where speaker discusses AI"

# Run the multi-agent swarm
python -m cutible agent "make a 60s recap about AI" --duration 60

# Export/Import OTIO
python -m cutible export project.json --otio output.otio
python -m cutible import output.otio --save imported.json

# Distributed render farm
python -m cutible farm project.json -o out.mp4 --workers 4

# Start REST API
python -m cutible serve-api --port 8000

Python SDK

from cutible.sdk import CutibleClient

# In-process mode
client = CutibleClient()
client.create_project("demo", fps=30, width=1920, height=1080)
client.add_asset("speaker", "video", uri="speaker.mp4", duration=60)
client.add_track("v_main", "video")
client.add_clip("v_main", "speaker", src_in=0, src_out=10)
result = client.render("output.mp4")

# Run the agent swarm
result = client.run_agent("make a 30s recap", target_duration=30)

REST API

# Start server
python -m cutible serve-api

# Create project
curl -X POST http://localhost:8000/projects \
  -H "Content-Type: application/json" \
  -d '{"id": "demo", "fps": 30}'

# Add clip
curl -X POST http://localhost:8000/projects/demo/verbs \
  -H "Content-Type: application/json" \
  -d '{"verb": "add_clip", "args": {"track_id": "v1", "asset": "a", "src_out": 5}}'

# Render
curl -X POST http://localhost:8000/projects/demo/render \
  -H "Content-Type: application/json" \
  -d '{"output": "out.mp4", "run_qc": true}'

MCP Server (primary agent interface)

python -m cutible.mcp_server   # speaks JSON-RPC 2.0 over stdio

35 tools exposed including: create_project, add_clip, trim, split, ripple_delete, add_transition, add_text_layer, render, qc, ingest_asset, search_index, remove_silences, reframe_to, sync_cuts_to_beat, generate_captions, auto_ducking, make_short, vlm_review, render_proxy, run_agent_swarm, export_otio, import_otio, render_farm.

Project Layout

cutible/
  schema.py           Timeline-as-Data models + zoom views + content hash
  verbs.py            Editor: low-level verbs (14 primitives)
  verbs_high.py       High-level composite verbs (8 intentions)
  compiler.py         Timeline → FFmpeg filtergraph → mp4
  qc.py               Deterministic QC (duration / black frames / LUFS)
  cli.py              Headless CLI (12 commands)
  mcp_server.py       MCP stdio server (35 tools)
  ingest/
    pipeline.py       Ingest orchestrator
    scenes.py         Scene/shot detection (ffmpeg)
    audio_transcribe.py  Whisper transcription + diarization
    vlm.py            VLM visual analysis (Gemini/OpenAI)
    audio_analysis.py  Beat/silence/tempo detection (librosa/ffmpeg)
    embeddings.py     Embedding generation (CLIP/OpenAI)
  index/
    models.py         Semantic index data models
    store.py          Index persistence
    search.py         Text/time/speaker/B-roll search
  perception/
    vlm_review.py     VLM semantic review of renders
    proxy_render.py   Fast low-res proxy renderer
  agents/
    base.py           Base agent + message types
    planner.py        Director/Planner agent
    editor.py         Editor/Montageur agent
    sound.py          Sound Engineer agent
    qc_agent.py       QC/Reviewer agent
    orchestrator.py   Multi-agent swarm coordinator
  remotion/
    compiler.py       Timeline → Remotion (React) project
  otio_bridge/
    exporter.py       Cutible → OpenTimelineIO
    importer.py       OpenTimelineIO → Cutible
  render_farm/
    worker.py         Segment render worker
    scheduler.py      Task scheduler
    manager.py        Distributed render farm manager
  api/
    app.py            FastAPI REST application
  sdk/
    client.py         Python SDK client (in-process + HTTP)
tests/
  test_core.py        Original 15 tests
  test_new_modules.py 20+ tests for new modules
examples/
  agent_recap_demo.py End-to-end agent demo
  make_assets.sh      Synthetic asset generator

Design Principles (Agent-Native)

State is data, not pixels. The agent reads/diffs/mutates a JSON timeline.
Verbs return diffs. Each call reports what changed.
Errors teach. Structured errors with hint and context.
Try / inspect / revert. Checkpoint/undo/branch for exploration.
Deterministic render. Same project → identical frames.
Closed perception loop. QC gate + VLM review → self-correction.
Semantic memory. Ingest → indexed content the agent can search.
Multi-agent swarm. Specialized roles: plan → edit → sound → QC → iterate.
Industry bridge. OTIO export → DaVinci/Premiere for human finishing.