VideoScan MCP
An MCP (Model Context Protocol) server for comprehensive video analysis — AI-powered transcription, visual frame analysis, and metadata extraction from 1000+ platforms.
Features
- Full video analysis — combines transcription, frame extraction, and metadata in a single call
- AI vision analysis — describes frames and extracts on-screen text (OCR) using GPT-4o, Claude, or Gemini
- Audio transcription — Whisper-based transcription with timestamps and language detection
- Auto-tuning — automatically adjusts frame extraction density, interval, and detail level based on video duration
- Smart frame extraction — scene-change detection, interval sampling, or combined strategy
- Deduplication — perceptual hashing removes near-duplicate frames before analysis
- Metadata extraction — title, duration, chapters, tags, view count, and more without full download
- Multi-provider — OpenAI, Anthropic, and Google vision providers with per-request override
- Caching — persistent cache for downloads, frames, and results to minimize repeat costs
- 1000+ platforms — powered by yt-dlp (YouTube, Vimeo, Twitter/X, TikTok, and more)
Installation
pip install videoscan-mcp
System dependencies
VideoScan requires ffmpeg for video processing and yt-dlp for downloading from URLs.
# macOS
brew install ffmpeg yt-dlp
# Ubuntu/Debian
apt install ffmpeg
pip install yt-dlp
# Windows — install ffmpeg from https://ffmpeg.org/download.html, then:
pip install yt-dlp
Configuration
Copy .env.example to .env and fill in at least one API key:
# Vision provider (frame analysis)
VISION_PROVIDER=openai # openai | anthropic | google
VISION_MODEL= # optional — defaults: gpt-4o / claude-sonnet-4-20250514 / gemini-2.0-flash
# Transcription provider
TRANSCRIPTION_PROVIDER=openai # openai only for now
TRANSCRIPTION_MODEL=whisper-1
# Concurrency
VISION_CONCURRENCY=5 # max parallel vision API calls
# API keys — only need the key for your chosen provider
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
# Cache
CACHE_ENABLED=true
CACHE_DIR=~/.videoscan/cache
CACHE_MAX_SIZE_GB=5
CACHE_DOWNLOAD_TTL=3600 # 1 hour
CACHE_FRAMES_TTL=86400 # 24 hours
CACHE_RESULTS_TTL=604800 # 7 days
# Safety limits (set to 0 for unlimited)
MAX_VIDEO_DURATION=3600 # 60 minutes in seconds
MAX_DOWNLOAD_SIZE=2147483648 # 2 GB in bytes
MAX_ANALYZED_FRAMES=100
DOWNLOAD_TIMEOUT=300
FRAME_ANALYSIS_TIMEOUT=30
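To make the "only need the key for your chosen provider" rule concrete, here is a minimal sketch that reports which API keys a given configuration would still need. The helper `missing_keys` and the `PROVIDER_KEYS` mapping are hypothetical and not part of VideoScan. The sketch assumes both providers default to `openai` when unset, and that transcription always needs the OpenAI key because `TRANSCRIPTION_PROVIDER` is OpenAI-only for now.

```python
# Hypothetical helper, not part of VideoScan: maps each provider name
# from the configuration above to the API key variable it would need.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_keys(env):
    """Return the sorted names of required API keys that are absent from env."""
    # Assumption: both providers fall back to "openai" when not set.
    vision = env.get("VISION_PROVIDER", "openai").strip().lower()
    transcription = env.get("TRANSCRIPTION_PROVIDER", "openai").strip().lower()

    required = set()
    for provider in (vision, transcription):
        if provider not in PROVIDER_KEYS:
            raise ValueError(f"unknown provider: {provider!r}")
        required.add(PROVIDER_KEYS[provider])

    # A key counts as missing when it is unset or empty.
    return sorted(name for name in required if not env.get(name))
```

Calling `missing_keys(dict(os.environ))` before starting the server would list any keys that still need to be filled in. If you only ever run with skip_audio, the OpenAI key may not be needed in practice.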
Quick Start — Claude Code
Add VideoScan to your Claude Code settings.json (usually at ~/.claude/settings.json):
{
"mcpServers": {
"videoscan": {
"command": "videoscan",
"env": {
"OPENAI_API_KEY": "sk-..."
}
}
}
}
Or using uvx without a global install:
{
"mcpServers": {
"videoscan": {
"command": "uvx",
"args": ["videoscan-mcp"],
"env": {
"OPENAI_API_KEY": "sk-..."
}
}
}
}
Once connected, you can ask Claude things like:
- "Analyze this YouTube video: https://youtube.com/watch?v=..."
- "Transcribe the audio from this video file"
- "What's on screen at the 2:30 mark of this video?"
- "Extract frames from this video and describe what you see"
Auto-Tuning
When max_frames and interval are not explicitly set, VideoScan automatically adjusts frame extraction parameters based on video duration to optimize cost and coverage:
| Duration | Frames | Interval | Strategy | Detail |
|---|---|---|---|---|
| < 2 min | ~1/sec (dense) | 1s | combined | detailed |
| 2–10 min | ~40 | 3s | combined | standard |
| 10–30 min | ~30 | 10s | combined | standard |
| 30–60 min | ~30 | 20s | combined | brief |
| > 60 min | ~20 | 30s | scene only | brief |
Short videos get dense frame extraction for maximum detail, while longer videos use lighter sampling to keep costs down. You can always override by setting max_frames or interval explicitly.
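The table above can be read as a simple lookup on duration. The following is a sketch of that mapping, not VideoScan's actual implementation. The function name `auto_tune` is hypothetical, and the boundary handling is an assumption: the table lists both "2–10 min" and "10–30 min", so this sketch treats each lower bound as inclusive and sends exactly 60 minutes to the 30–60 min tier.

```python
def auto_tune(duration_s):
    """Map a video duration in seconds to the parameters in the table above.

    Hypothetical sketch: boundary handling is assumed, not documented.
    """
    if duration_s < 120:
        # Dense sampling: roughly one frame per second of video.
        return {"max_frames": max(1, int(duration_s)), "interval": 1,
                "strategy": "combined", "detail": "detailed"}
    if duration_s < 600:
        return {"max_frames": 40, "interval": 3,
                "strategy": "combined", "detail": "standard"}
    if duration_s < 1800:
        return {"max_frames": 30, "interval": 10,
                "strategy": "combined", "detail": "standard"}
    if duration_s <= 3600:
        return {"max_frames": 30, "interval": 20,
                "strategy": "combined", "detail": "brief"}
    # "scene only" in the table is mapped to the "scene" strategy value.
    return {"max_frames": 20, "interval": 30,
            "strategy": "scene", "detail": "brief"}
```

For example, `auto_tune(45 * 60)` falls in the 30–60 min tier and returns about 30 frames at a 20-second interval with brief detail.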
Tool Reference
analyze_video
Full pipeline — transcription + AI frame analysis + metadata in one call. Uses auto-tuning by default.
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | string | required | URL or local file path |
| detail | string | "standard" | Vision level: "brief", "standard", "detailed" |
| max_frames | int | auto | Maximum frames to analyze. Set to -1 (default) for auto-tuning based on duration |
| threshold | float | 0.3 | Scene change sensitivity (0.0–1.0) |
| strategy | string | "combined" | Frame extraction: "scene", "interval", "combined" |
| interval | int | auto | Seconds between frames. Set to -1 (default) for auto-tuning based on duration |
| skip_frames | bool | false | Skip visual analysis (transcription only) |
| skip_audio | bool | false | Skip transcription (frames only) |
| language | string | "auto" | Transcription language or "auto" |
| provider | string | null | Override vision provider |
| force_refresh | bool | false | Bypass cache |
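MCP clients normally build tool calls for you, but it can help to see what one looks like on the wire. The sketch below assembles a JSON-RPC `tools/call` request for analyze_video using only the standard library. The helper `build_tool_call` and the local path `lecture.mp4` are hypothetical, and the argument values are illustrative choices from the table above.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Build an MCP `tools/call` request as a plain dict (JSON-RPC 2.0)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# "lecture.mp4" is a placeholder local file, not a real asset.
request = build_tool_call(
    "analyze_video",
    {
        "source": "lecture.mp4",
        "detail": "brief",      # cheaper vision pass
        "max_frames": 20,       # explicit value disables auto-tuning
        "skip_audio": False,    # keep transcription
        "provider": "google",   # per-request vision provider override
    },
)

print(json.dumps(request, indent=2))
```

Parameters left out of `arguments` fall back to the defaults listed in the table.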
transcribe
Transcribe video or audio to text with timestamps.
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | string | required | URL or local file path |
| language | string | "auto" | Preferred language or "auto" for detection |
extract_frames
Extract and AI-analyze frames from a video.
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | string | required | URL or local file path |
| max_frames | int | 30 | Maximum frames to extract (1–100) |
| threshold | float | 0.3 | Scene change sensitivity (0.0–1.0) |
| strategy | string | "combined" | "scene", "interval", or "combined" |
| interval | int | 5 | Seconds between frames in interval mode |
| detail | string | "standard" | Vision analysis level |
| deduplicate | bool | true | Remove near-duplicate frames via dHash |
| provider | string | null | Override vision provider |
| force_refresh | bool | false | Bypass cache |
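For readers unfamiliar with dHash, the deduplicate option can be pictured roughly as follows. This is a generic sketch of difference hashing, not VideoScan's code. It assumes each frame has already been reduced to a small grayscale grid (typically 9 columns by 8 rows), and both the `max_distance` threshold of 5 bits and the choice to compare against every kept frame are assumptions.

```python
def dhash(gray):
    """Difference hash of a small grayscale grid (rows of pixel values).

    Each bit records whether a pixel is brighter than its right neighbor,
    so a 9x8 grid yields a 64-bit integer.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(hashes, max_distance=5):
    """Return indices of frames to keep.

    A frame is dropped when its hash is within max_distance bits of any
    frame already kept (assumed policy and threshold).
    """
    kept = []
    for i, h in enumerate(hashes):
        if all(hamming(h, hashes[j]) > max_distance for j in kept):
            kept.append(i)
    return kept
```

Because the hash encodes brightness gradients rather than exact pixels, small changes such as compression noise flip only a few bits, while a genuine scene change flips many.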
analyze_moment
Deep-dive analysis on a specific time range.
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | string | required | URL or local file path |
| start | float | required | Start time in seconds |
| end | float | required | End time in seconds |
| dense | bool | true | Extract 1 frame per second in the range |
| detail | string | "detailed" | Vision analysis level |
| provider | string | null | Override vision provider |
| force_refresh | bool | false | Bypass cache |
get_frame_at
Get a single frame at a specific timestamp, optionally analyzed by AI.
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | string | required | URL or local file path |
| timestamp | float | required | Time in seconds |
| analyze | bool | true | Run AI vision analysis |
| provider | string | null | Override vision provider |
| force_refresh | bool | false | Bypass cache |
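Both get_frame_at and analyze_moment take times in seconds, while people usually write marks like "2:30". If you call the tools programmatically, a small converter can bridge the two. The helper `to_seconds` below is hypothetical and not part of VideoScan.

```python
def to_seconds(mark):
    """Convert "SS", "MM:SS", or "HH:MM:SS" into seconds as a float."""
    parts = mark.strip().split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported time format: {mark!r}")
    seconds = 0.0
    for part in parts:
        # Each step shifts the running total up one unit (x60) and adds
        # the next field; float() rejects empty or non-numeric fields.
        seconds = seconds * 60 + float(part)
    return seconds
```

With this, the "2:30 mark" from the Quick Start examples becomes `to_seconds("2:30")`, which can be passed as timestamp to get_frame_at or as start to analyze_moment.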
get_metadata
Fetch video metadata without downloading the full video.
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | string | required | URL or local file path |
| include | list | null | Specific fields to return: "title", "duration", "channel", "description", "thumbnail", "chapters", "tags", "view_count". Returns all fields if omitted. |
Supported Platforms
VideoScan uses yt-dlp under the hood, which supports 1000+ video platforms including:
- YouTube, YouTube Shorts, YouTube Live
- Vimeo, Dailymotion, Twitch
- Twitter/X, Instagram, TikTok, Facebook
- Reddit, LinkedIn, Pinterest
- BBC iPlayer, CNN, NBC, CBS
- SoundCloud, Bandcamp (audio)
- And hundreds more — see the yt-dlp supported sites list
Local files in any format supported by ffmpeg (mp4, mov, avi, mkv, webm, mp3, wav, etc.) are also supported.
Cost Estimates
Costs depend on your chosen provider and usage:
| Operation | Provider | Approx. Cost |
|---|---|---|
| Vision analysis | OpenAI GPT-4o | ~$0.015 per frame |
| Vision analysis | Anthropic Claude | ~$0.024 per frame |
| Vision analysis | Google Gemini | ~$0.002 per frame |
| Transcription | OpenAI Whisper | ~$0.006 per minute |
A typical 10-minute video analyzed with analyze_video (30 frames + transcription) costs approximately $0.51 with OpenAI: about $0.45 for vision analysis (30 frames × $0.015) plus $0.06 for transcription (10 minutes × $0.006).
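The arithmetic behind such an estimate is simple enough to script. The sketch below uses the approximate per-frame and per-minute rates from the table above. The function `estimate_cost` is hypothetical, and actual provider pricing may differ from these figures.

```python
# Approximate rates copied from the table above (USD).
VISION_COST_PER_FRAME = {"openai": 0.015, "anthropic": 0.024, "google": 0.002}
WHISPER_COST_PER_MINUTE = 0.006


def estimate_cost(frames, minutes, provider="openai", transcribe=True):
    """Rough cost in USD for analyzing `frames` frames of a `minutes`-long video."""
    vision = frames * VISION_COST_PER_FRAME[provider]
    audio = minutes * WHISPER_COST_PER_MINUTE if transcribe else 0.0
    return round(vision + audio, 4)


# 10-minute video, 30 frames, OpenAI vision plus Whisper transcription.
print(estimate_cost(30, 10))
```

Swapping the provider shows how much of the total comes from vision analysis: the same 30 frames with `provider="google"` cost a small fraction of the OpenAI figure, while the transcription share stays constant.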
Development
git clone https://github.com/guguborbh/videoscan-mcp
cd videoscan-mcp
pip install -e ".[dev]"
pytest
License
MIT License — see LICENSE for details.