vision-reader

Enables reading images (diagrams, screenshots) directly via the model's own vision, with no external API key needed, and can extract embedded images from .doc/MHTML documents.

README

Give Kiro Eyes: Reading Diagrams — Even the Ones Buried in Documents

Kiro is great at reading code, configs, and docs. But hand it a PNG architecture diagram and the file tools shrug:

Caught error reading: ... File seems to be binary and cannot be opened as text

The file-reading tools treat everything as text, so a binary image just bounces off. It gets worse: a huge amount of architecture knowledge doesn't even live in loose .png files — it's embedded inside documents. Word and Confluence "Export to Word" produce a single MHTML file (often with a .doc extension), and the diagrams are buried inside that envelope. There's no image on disk to point at, so neither the file tools nor a vision tool can see them.

This post builds a small two-part toolkit that closes both gaps:

  1. An extractor that pulls embedded images out of .doc/MHTML documents into real image files, organized by the document's section structure.
  2. A tiny Model Context Protocol (MCP) server that hands those images straight to Kiro so it can read them with its own vision — no external vision API, no API key, no per-image cost.

By the end, you can take a folder of exported design docs and ask Kiro to "summarize the architecture section by section," and it will actually see every diagram.


The key insight

There are two ways to make an agent "read" an image:

  1. Call an external vision API (OpenAI, Anthropic, Google) inside the MCP server, get back a text description, and hand that text to Kiro. This works, but it needs an API key, costs money per image, and Kiro only ever sees someone else's description — not the image itself.

  2. Hand the raw image to Kiro directly. Kiro is already a multimodal model. MCP has a first-class ImageContent type for exactly this. If the server reads the file, base64-encodes it, and returns ImageContent, Kiro looks at the actual pixels with its own vision.

Option 2 is simpler, free, and higher fidelity. That's what we'll build — and we'll feed it from an extractor that frees diagrams trapped inside documents.
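
As a rough illustration of option 2, the sketch below builds the JSON-shaped item that carries an image over MCP: a type of "image", the base64 data, and a mimeType. It uses only the standard library. The helper name image_payload is hypothetical, and the real server further down uses the SDK's ImageContent class rather than a plain dict.

```python
import base64


def image_payload(raw: bytes, mime_type: str = "image/png") -> dict:
    """Hypothetical helper: wrap raw image bytes in an MCP-style image item."""
    return {
        "type": "image",
        "data": base64.standard_b64encode(raw).decode("ascii"),
        "mimeType": mime_type,
    }


# The 8-byte PNG signature stands in for a real file's contents here.
sample = b"\x89PNG\r\n\x1a\n"
payload = image_payload(sample)
```

The point of the shape is that nothing in it is a description of the image: the model receives the encoded bytes themselves.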


The full pipeline

                 step 1: extract                 step 2: read
┌──────────────┐   (stdlib only)   ┌──────────────┐   tool call   ┌──────────┐
│  *.doc /     │ ────────────────► │  image files │ ────────────► │   Kiro   │
│  MHTML docs  │  extract_doc_     │  on disk     │  read_image / │  (model) │
│  (diagrams   │  images.py        │  (organized  │  read_all_    │          │
│   embedded)  │                   │  by section) │  images       │          │
└──────────────┘                   └──────────────┘               └────┬─────┘
                                          ▲                             │
                                          │      ImageContent (base64)  │
                                          └─────────────────────────────┘
                                                                         ▼
                                            Kiro "sees" each diagram with
                                            its own vision and explains it

Two cooperating pieces:

  • extract_doc_images.py — turns "diagrams locked inside a document" into "image files on disk," mirroring the document's heading hierarchy so each diagram keeps its section context.
  • vision_server.py — an MCP server with read_image and read_all_images tools that return ImageContent. Kiro does the actual "looking."

If your diagrams are already loose .png/.jpg files, you can skip step 1 and go straight to the MCP server. But for design docs exported from a wiki, step 1 is what makes them readable at all.


Step 1 — Extract images from documents

Many documentation systems export a page as a single MHTML file with a .doc extension. Inside that envelope the diagrams are real binary images (PNG, JPG, etc.), but they're attached as MIME parts, not saved as files. extract_doc_images.py parses the envelope (using Python's built-in email module — no third-party deps), pulls every embedded image out, and writes it to disk.
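
The envelope parsing can be sketched with the standard library alone. This is not the repo's extract_doc_images.py, just a minimal illustration of the idea; the function name embedded_images and the sample document are assumptions made up for the example.

```python
import base64
import email
from email import policy


def embedded_images(mhtml_bytes: bytes) -> list:
    """Return (content_location, content_type, data) for each image MIME part."""
    msg = email.message_from_bytes(mhtml_bytes, policy=policy.default)
    found = []
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        data = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if data:
            location = str(part.get("Content-Location", ""))
            found.append((location, part.get_content_type(), data))
    return found


# A tiny hand-built MHTML document: one HTML part and one base64 image part.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUND"\r\n'
    b"\r\n"
    b"--BOUND\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<html><body><h1>Overview</h1><img src='diagram.png'></body></html>\r\n"
    b"--BOUND\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: diagram.png\r\n"
    b"\r\n" + base64.b64encode(PNG_SIGNATURE) + b"\r\n"
    b"--BOUND--\r\n"
)

images = embedded_images(SAMPLE)
```

Each returned tuple has what is needed to write a file: a name hint from Content-Location, a type, and the decoded bytes.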

Crucially, it walks the document's headings (h1 > h2 > h3 ...) as it goes and drops each image into the folder of the deepest section that owns it. So an image under "2. Solution > 2.1 Network" lands in .../2. Solution/2.1 Network/. That folder structure is gold later: the names tell you — and the model — exactly which section each diagram belongs to.
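
One way to track "the deepest section that owns an image" is a heading stack: each new heading pops every entry at the same or deeper level, then pushes itself. The sketch below does that with html.parser from the standard library. The class name SectionImageParser is hypothetical, and the real script may handle this differently.

```python
from html.parser import HTMLParser


class SectionImageParser(HTMLParser):
    """Hypothetical sketch: record the heading path in effect at each <img>."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.stack = []     # [(level, title), ...] for the current section path
        self.images = []    # [(section_path_tuple, src), ...]
        self._level = None  # level of the heading being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level, self._text = self.HEADINGS[tag], []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                path = tuple(title for _, title in self.stack)
                self.images.append((path, src))

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            title = " ".join("".join(self._text).split())
            # A new heading closes every section at the same or deeper level.
            while self.stack and self.stack[-1][0] >= self._level:
                self.stack.pop()
            if title:
                self.stack.append((self._level, title))
            self._level = None


parser = SectionImageParser()
parser.feed(
    "<h1>2. Solution</h1><img src='a.png'>"
    "<h2>2.1 Network</h2><img src='b.png'/>"
    "<h1>3. Ops</h1><img src='c.png'>"
)
```

Joining each recorded path with the platform's path separator gives the folder an image should land in.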

python extract_doc_images.py ./docs
# -> writes images to ./docs/extracted_images/<doc-name>/<section>/...

You'll get a short report like:

[OK] design-overview.doc: extracted 12 embedded image(s), 0 external reference(s) skipped
[OK] network-flows.doc: extracted 8 embedded image(s), 1 external reference(s) skipped

Images written to: ./docs/extracted_images

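The script auto-detects PNG/JPG/GIF/BMP/WEBP/SVG by magic bytes, sanitizes section names into valid folder names, and skips external (non-embedded) image references. Note that the server in Step 3 only accepts .png, .jpg, .jpeg, .gif, .webp, and .bmp, so any extracted .svg files are skipped by read_image and read_all_images unless you convert them first.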
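
For a feel of what signature sniffing and name sanitizing involve, here is a minimal sketch. The function names, the set of replaced characters, and the 80-character cap are all assumptions for illustration and are not taken from extract_doc_images.py.

```python
import re

# Leading byte signatures for the common raster formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"BM", ".bmp"),
]


def sniff_image_ext(data: bytes):
    """Guess a file extension from the first bytes, or return None."""
    for signature, ext in SIGNATURES:
        if data.startswith(signature):
            return ext
    # WEBP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    # SVG is text, so look for an <svg tag near the start instead.
    if b"<svg" in data[:1024].lower():
        return ".svg"
    return None


def sanitize_folder_name(name: str, max_len: int = 80) -> str:
    """Replace characters that are invalid in folder names on common systems."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", name)
    cleaned = " ".join(cleaned.split())[:max_len].rstrip(". ")
    return cleaned or "untitled"
```

Sniffing the bytes rather than trusting a declared type or file name helps when an export labels its parts loosely.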

Keep the extracted folder out of version control if the documents are internal — the diagrams and their folder names can reveal sensitive detail. The included .gitignore already ignores extracted_images/.

Step 2 — Install the MCP server's dependencies

The vision server needs only the MCP SDK and Pillow (for resizing / format conversion):

pip install "mcp>=1.0.0" "Pillow>=10.0.0"

No ANTHROPIC_API_KEY, no OPENAI_API_KEY. There is no external API call.

Step 3 — The vision MCP server

Save this as vision_server.py. It reads an image, downscales it if needed, and returns ImageContent so Kiro sees the pixels directly:

"""
MCP Server: Vision Reader (native model vision)

Reads an image file, base64-encodes it, and returns ImageContent so the
host model (Kiro) can look at it directly. No external API key required.
"""

import base64
import io
from pathlib import Path

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import TextContent, ImageContent, Tool

app = Server("vision-reader")

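SUPPORTED = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp"}
# Used only when Pillow is missing and the original bytes are sent unchanged,
# so each extension must map to its real media type.
MEDIA_TYPE = {
    ".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
    ".gif": "image/gif", ".webp": "image/webp", ".bmp": "image/bmp",
}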

MAX_DIMENSION = 1568          # longest edge (px) recommended for vision
MAX_BASE64_BYTES = 4_500_000  # ~4.5 MB after base64 encoding


def resolve_path(file_path: str) -> Path:
    p = Path(file_path)
    return p if p.is_absolute() else Path.cwd() / p


def image_to_base64(path: Path) -> tuple[str, str]:
    """Read an image, normalize/shrink it, return (base64, media_type)."""
    ext = path.suffix.lower()

    try:
        from PIL import Image
    except ImportError:
        # Without Pillow the original bytes go out unchanged: no resize, no size cap.
        with open(path, "rb") as f:
            return base64.standard_b64encode(f.read()).decode(), MEDIA_TYPE.get(ext, "image/png")

    with Image.open(path) as src:
        # Normalize every source mode (P, LA, CMYK, ...) to RGB, or to RGBA
        # when the image carries transparency.
        has_alpha = src.mode in ("RGBA", "LA", "PA") or "transparency" in src.info
        img = src.convert("RGBA" if has_alpha else "RGB")

    # Downscale if the longest edge is too large.
    longest = max(img.size)
    if longest > MAX_DIMENSION:
        scale = MAX_DIMENSION / longest
        img = img.resize((max(1, int(img.size[0] * scale)),
                          max(1, int(img.size[1] * scale))), Image.LANCZOS)

    # Flatten transparency onto white. A bare convert("RGB") drops the alpha
    # channel and can turn transparent diagram backgrounds black.
    if img.mode == "RGBA":
        flat = Image.new("RGB", img.size, (255, 255, 255))
        flat.paste(img, mask=img.getchannel("A"))
        img = flat

    # Prefer PNG (keeps diagram text crisp).
    buf = io.BytesIO()
    img.save(buf, format="PNG", optimize=True)
    data = base64.standard_b64encode(buf.getvalue()).decode()
    if len(data) <= MAX_BASE64_BYTES:
        return data, "image/png"

    # Too big -> fall back to JPEG with decreasing quality.
    for quality in (90, 80, 70, 60, 50):
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality, optimize=True)
        data = base64.standard_b64encode(buf.getvalue()).decode()
        if len(data) <= MAX_BASE64_BYTES:
            return data, "image/jpeg"
    # Still over the cap at quality 50: return the smallest attempt anyway.
    return data, "image/jpeg"


@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="read_image",
            description="Read an image file (PNG/JPG/JPEG/WEBP/GIF/BMP) and return "
                        "it for the model to analyze with vision. Great for "
                        "architecture diagrams, flowcharts, and screenshots.",
            inputSchema={
                "type": "object",
                "properties": {
                    "file_path": {"type": "string",
                                  "description": "Relative or absolute path to the image."},
                    "question": {"type": "string", "default": "",
                                 "description": "Optional question to guide analysis."},
                },
                "required": ["file_path"],
            },
        ),
        Tool(
            name="read_all_images",
            description="Read every image in a folder (optionally recursive) and "
                        "return them for the model to analyze. Pair this with the "
                        "doc extractor to read whole design docs at once.",
            inputSchema={
                "type": "object",
                "properties": {
                    "folder_path": {"type": "string", "default": "."},
                    "question": {"type": "string", "default": ""},
                    "recursive": {"type": "boolean", "default": False},
                    "max_images": {"type": "integer", "default": 20},
                },
                "required": [],
            },
        ),
    ]


def _image_content(path: Path) -> ImageContent:
    data, media_type = image_to_base64(path)
    return ImageContent(type="image", data=data, mimeType=media_type)


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list:
    if name == "read_image":
        path = resolve_path(arguments.get("file_path", ""))
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            return [TextContent(type="text", text=f"Cannot read image: {path}")]
        header = f"Image: {path.name}"
        if arguments.get("question"):
            header += f"\nQuestion: {arguments['question']}"
        return [TextContent(type="text", text=header), _image_content(path)]

    if name == "read_all_images":
        folder = resolve_path(arguments.get("folder_path", "."))
        if not folder.is_dir():
            return [TextContent(type="text", text=f"Not a folder: {folder}")]
        pattern = "**/*" if arguments.get("recursive") else "*"
        images = sorted(f for f in folder.glob(pattern)
                        if f.is_file() and f.suffix.lower() in SUPPORTED)
        images = images[: int(arguments.get("max_images", 20))]
        if not images:
            return [TextContent(type="text", text=f"No images in: {folder}")]
        out: list = [TextContent(type="text", text=f"Found {len(images)} image(s).")]
        for img in images:
            # Label with the path relative to the folder so section names survive.
            out.append(TextContent(type="text", text=f"--- {img.relative_to(folder).as_posix()} ---"))
            out.append(_image_content(img))
        return out

    return [TextContent(type="text", text=f"Unknown tool: {name}")]


async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())


if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

The full version in this repo adds per-file error handling on top of this; the snippet above, which already applies the max_images cap, is the heart of it.
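
Per-file error handling could look something like the sketch below: try each file on its own, and collect failures as messages instead of letting one corrupt image abort the whole batch. The helper load_each is hypothetical and is not the repo's implementation. In the server, the loader would be image_to_base64, and each error would become a TextContent entry.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_each(paths: list, loader: Callable[[Path], bytes]):
    """Apply loader to every path, collecting failures instead of raising."""
    loaded, errors = [], []
    for path in paths:
        try:
            loaded.append((path, loader(path)))
        except Exception as exc:  # e.g. OSError, or a decoder failing on a corrupt file
            errors.append((path, f"{type(exc).__name__}: {exc}"))
    return loaded, errors


# Demo: one readable file and one path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "ok.png"
    good.write_bytes(b"\x89PNG\r\n\x1a\n")
    missing = Path(tmp) / "missing.png"
    loaded, errors = load_each([good, missing], Path.read_bytes)
```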

Step 4 — Register the server in Kiro

Kiro reads MCP config from .kiro/settings/mcp.json (workspace-level) or ~/.kiro/settings/mcp.json (user-level). Add the server:

{
  "mcpServers": {
    "vision-reader": {
      "command": "python",
      "args": ["/absolute/path/to/vision_server.py"],
      "disabled": false,
      "autoApprove": ["read_image", "read_all_images"]
    }
  }
}

Use the absolute path to your vision_server.py. On Windows, escape the backslashes (C:\\path\\to\\vision_server.py) or use forward slashes.
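
If hand-escaping backslashes feels error-prone, one option is to generate the JSON from Python, since json.dumps escapes them for you. This is only a convenience sketch: the function name mcp_config is made up, and the Windows path is the placeholder from the note above.

```python
import json


def mcp_config(server_script: str, python_cmd: str = "python") -> str:
    """Build the mcp.json text for the vision-reader server entry."""
    config = {
        "mcpServers": {
            "vision-reader": {
                "command": python_cmd,
                "args": [server_script],
                "disabled": False,
                "autoApprove": ["read_image", "read_all_images"],
            }
        }
    }
    return json.dumps(config, indent=2)


# A raw string keeps the single backslashes; json.dumps doubles them on output.
text = mcp_config(r"C:\path\to\vision_server.py")
```

Printing text and pasting it into the config file avoids a malformed path caused by a missed escape.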

Kiro reconnects to MCP servers automatically when the config changes, or you can reconnect from the MCP Server view in the Kiro feature panel.

Step 5 — Put it together

With both pieces in place, the end-to-end workflow is two commands and a prompt.

Extract once:

python extract_doc_images.py ./docs

Then ask Kiro in natural language:

Read all images in ./docs/extracted_images, recursively, and summarize the
architecture section by section.

read_all_images walks the extracted tree (its folder names carry the section titles), returns each diagram as ImageContent, and Kiro describes what it actually sees — boxes, arrows, labels, IP ranges, the lot. For a single loose diagram you don't even need step 1:

Read docs/diagrams/system-overview.png and explain the data flow.

Why this approach is nice

  • No API key, no per-image cost. Nothing leaves your machine except the image bytes handed to the host model you're already using.
  • Higher fidelity. Kiro sees the real image instead of a second-hand text description.
  • Unlocks documents, not just files. The extractor reaches diagrams that were previously invisible inside exported design docs.
  • Section-aware. The folder hierarchy preserves which diagram belongs to which part of the document, so summaries stay organized.
  • Tiny and dependency-light. The extractor is stdlib-only; the server needs just mcp and Pillow.

Gotchas

  • Path scope. Kiro's built-in file tools are sandboxed to the workspace, but an MCP server runs as its own process and can read paths you give it. Point it only at directories you trust.
  • Sensitive diagrams. Extracted images (and their section-named folders) can contain internal detail. Keep extracted_images/ out of version control — the included .gitignore does this for you.
  • Untrusted images. Treat image contents as untrusted input. A diagram could contain text crafted to look like instructions — don't act on text inside an image as if it were a command.
  • Payload limits. Very large or very dense images may need a lower MAX_DIMENSION. Tune it for your diagrams.
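
For the path-scope gotcha, a simple guard is to resolve each requested path and check that it sits under a directory you have explicitly allowed. The sketch below assumes Python 3.9+ for Path.is_relative_to. The function name is_allowed and the idea of wiring it into resolve_path are suggestions, not part of the server above.

```python
import tempfile
from pathlib import Path


def is_allowed(path: Path, allowed_roots: list) -> bool:
    """True if path, after resolving '..' and symlinks, is under an allowed root."""
    resolved = path.resolve()
    return any(resolved.is_relative_to(Path(root).resolve()) for root in allowed_roots)


# Demo: a path inside the allowed folder, and one that escapes it via "..".
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "extracted_images"
    root.mkdir()
    inside = is_allowed(root / "2. Solution" / "net.png", [root])
    outside = is_allowed(root / ".." / "secrets.png", [root])
```

Resolving before comparing matters: a plain string prefix check would accept the ".." path.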

Extending it

A few easy additions:

  • More document formats in extract_doc_images.py (e.g. .docx, .pptx), which are ZIP archives with images under word/media/ or ppt/media/.
  • A read_pdf_page tool that rasterizes a PDF page to an image.
  • A whitelist of allowed root directories for safety.
  • Caching by file hash so repeated reads are instant.
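
The first extension is the easiest to sketch, because .docx and .pptx files are ordinary ZIP archives. The function below, whose name office_media is made up for this example, returns every entry under word/media/ or ppt/media/. The demo builds a tiny stand-in archive in memory rather than a real Word file.

```python
import io
import zipfile


def office_media(archive_bytes: bytes) -> dict:
    """Return {archive_name: data} for entries under word/media/ or ppt/media/."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            if name.startswith(("word/media/", "ppt/media/")) and not name.endswith("/")
        }


# Demo: an in-memory ZIP laid out like a .docx, with one embedded image.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/media/image1.png", b"\x89PNG\r\n\x1a\n")
media = office_media(buf.getvalue())
```

Unlike the MHTML case, this alone does not tell you which section an image belongs to. That would require reading the document XML and its relationships as well.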

That's the whole toolkit: an MHTML extractor to free diagrams from documents, plus MCP's ImageContent and a model that can already see. Stdlib parsing on one side, a few dozen lines of real vision logic on the other, and Kiro goes from "this file is binary" (or worse, "this image doesn't exist as a file yet") to "here's what your architecture diagrams are telling me, section by section."
