vision-reader

Enables reading images (diagrams, screenshots) directly via the model's own vision, with no external API key needed, and can extract embedded images from .doc/MHTML documents.

Give Kiro Eyes: Reading Diagrams — Even the Ones Buried in Documents

Kiro is great at reading code, configs, and docs. But hand it a PNG architecture diagram and the file tools shrug:

Caught error reading: ... File seems to be binary and cannot be opened as text

The file-reading tools treat everything as text, so a binary image just bounces off. It gets worse: a huge amount of architecture knowledge doesn't even live in loose .png files: it's embedded inside documents. Word, and Confluence's "Export to Word" feature, can produce a single MHTML file (often with a .doc extension), and the diagrams are buried inside that envelope. There's no image on disk to point at, so neither the file tools nor a vision tool can see them.

This post builds a small two-part toolkit that closes both gaps:

An extractor that pulls embedded images out of .doc/MHTML documents into real image files, organized by the document's section structure.
A tiny Model Context Protocol (MCP) server that hands those images straight to Kiro so it can read them with its own vision — no external vision API, no API key, no per-image cost.

By the end, you can take a folder of exported design docs and ask Kiro to "summarize the architecture section by section," and it will actually see every diagram.

The key insight

There are two ways to make an agent "read" an image:

1. Call an external vision API (OpenAI, Anthropic, Google) inside the MCP server, get back a text description, and hand that text to Kiro. This works, but it needs an API key, costs money per image, and Kiro only ever sees someone else's description, not the image itself.
2. Hand the raw image to Kiro directly. Kiro is already a multimodal model. MCP has a first-class ImageContent type for exactly this. If the server reads the file, base64-encodes it, and returns ImageContent, Kiro looks at the actual pixels with its own vision.

Option 2 is simpler, free, and higher fidelity. That's what we'll build — and we'll feed it from an extractor that frees diagrams trapped inside documents.
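To make option 2 concrete, here is a rough sketch of what an image content block amounts to on the wire: a small object with a type, base64 data, and a MIME type. The helper name image_block is hypothetical, and it builds a plain dict rather than using the SDK's ImageContent class that the real server uses later in this post.

```python
import base64
import json


def image_block(raw: bytes, mime_type: str) -> dict:
    """Build a dict shaped like an MCP image content block (illustrative only)."""
    return {
        "type": "image",
        "data": base64.standard_b64encode(raw).decode("ascii"),
        "mimeType": mime_type,
    }


# The 8-byte PNG signature stands in for a real image file's bytes.
block = image_block(b"\x89PNG\r\n\x1a\n", "image/png")
print(json.dumps(block, indent=2))
```

The host model receives the decoded pixels, not a description of them, which is the whole point of this option.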

The full pipeline

                 step 1: extract                 step 2: read
┌──────────────┐   (stdlib only)   ┌──────────────┐   tool call   ┌──────────┐
│  *.doc /     │ ────────────────► │  image files │ ────────────► │   Kiro   │
│  MHTML docs  │  extract_doc_     │  on disk     │  read_image / │  (model) │
│  (diagrams   │  images.py        │  (organized  │  read_all_    │          │
│   embedded)  │                   │  by section) │  images       │          │
└──────────────┘                   └──────────────┘               └────┬─────┘
                                          ▲                             │
                                          │      ImageContent (base64)  │
                                          └─────────────────────────────┘
                                                                         ▼
                                            Kiro "sees" each diagram with
                                            its own vision and explains it

Two cooperating pieces:

extract_doc_images.py — turns "diagrams locked inside a document" into "image files on disk," mirroring the document's heading hierarchy so each diagram keeps its section context.
vision_server.py — an MCP server with read_image and read_all_images tools that return ImageContent. Kiro does the actual "looking."

If your diagrams are already loose .png/.jpg files, you can skip step 1 and go straight to the MCP server. But for design docs exported from a wiki, step 1 is what makes them readable at all.

Step 1 — Extract images from documents

Many documentation systems export a page as a single MHTML file with a .doc extension. Inside that envelope the diagrams are real binary images (PNG, JPG, etc.), but they're attached as MIME parts, not saved as files. extract_doc_images.py parses the envelope (using Python's built-in email module — no third-party deps), pulls every embedded image out, and writes it to disk.
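The envelope parsing can be sketched in a few lines. This is not the code from extract_doc_images.py, just a minimal illustration under the assumption that the export is a well-formed multipart MIME message; the function name iter_embedded_images and the demo envelope are made up for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def iter_embedded_images(mhtml_bytes: bytes):
    """Yield (content_location, content_type, raw_bytes) for each image part."""
    msg = message_from_bytes(mhtml_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            # decode=True undoes the base64 transfer encoding of the part.
            yield (str(part.get("Content-Location", "")),
                   part.get_content_type(),
                   part.get_payload(decode=True))


# Build a tiny stand-in envelope: one HTML part plus one related PNG part.
demo = EmailMessage()
demo.set_content("<html><body><img src='img1.png'></body></html>", subtype="html")
demo.add_related(b"\x89PNG\r\n\x1a\n", maintype="image", subtype="png",
                 headers=["Content-Location: img1.png"])

for location, ctype, data in iter_embedded_images(demo.as_bytes()):
    print(location, ctype, len(data))
```

Real exports may carry quirks (odd charsets, missing Content-Location headers) that a production extractor has to tolerate; the sketch ignores those.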

Crucially, it walks the document's headings (h1 > h2 > h3 ...) as it goes and drops each image into the folder of the deepest section that owns it. So an image under "2. Solution > 2.1 Network" lands in .../2. Solution/2.1 Network/. That folder structure is gold later: the names tell you — and the model — exactly which section each diagram belongs to.
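One way to picture the heading walk is a stack of open sections: each new heading pops every section at the same or deeper level, then pushes itself, and each image is filed under whatever is on the stack at that moment. The SectionTracker class below is a hypothetical sketch of that idea using the stdlib HTML parser, not the script's actual implementation.

```python
from html.parser import HTMLParser


class SectionTracker(HTMLParser):
    """Record, for each <img>, the heading path that was open when it appeared."""

    def __init__(self):
        super().__init__()
        self.stack = []            # [(level, title), ...] of open sections
        self.images = []           # [(section_path, src), ...]
        self._heading_level = None
        self._heading_text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._heading_level = int(tag[1])
            self._heading_text = []
        elif tag == "img":
            path = "/".join(title for _, title in self.stack)
            self.images.append((path, dict(attrs).get("src", "")))

    def handle_data(self, data):
        if self._heading_level is not None:
            self._heading_text.append(data)

    def handle_endtag(self, tag):
        if self._heading_level is not None and tag == f"h{self._heading_level}":
            # A new heading closes every section at the same or deeper level.
            while self.stack and self.stack[-1][0] >= self._heading_level:
                self.stack.pop()
            title = "".join(self._heading_text).strip()
            self.stack.append((self._heading_level, title))
            self._heading_level = None


tracker = SectionTracker()
tracker.feed(
    "<h1>2. Solution</h1><img src='a.png'>"
    "<h2>2.1 Network</h2><img src='b.png'>"
    "<h1>3. Operations</h1><img src='c.png'>"
)
print(tracker.images)
```

In this sketch the section path is joined with "/" purely for display; a real extractor would sanitize each title before using it as a folder name.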

python extract_doc_images.py ./docs
# -> writes images to ./docs/extracted_images/<doc-name>/<section>/...

You'll get a short report like:

[OK] design-overview.doc: extracted 12 embedded image(s), 0 external reference(s) skipped
[OK] network-flows.doc: extracted 8 embedded image(s), 1 external reference(s) skipped

Images written to: ./docs/extracted_images

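The script auto-detects PNG/JPG/GIF/BMP/WEBP/SVG by magic bytes, sanitizes section names into valid folder names, and skips external (non-embedded) image references. Note that the server's SUPPORTED set in Step 3 has no .svg entry, so extracted SVG files are skipped by read_image and read_all_images unless you convert them to a raster format first.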
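For a feel of what signature sniffing and name sanitizing involve, here is a small sketch. The function names detect_extension and sanitize_folder_name, and the exact character set being replaced, are assumptions for illustration and may differ from what the script does.

```python
import re
from typing import Optional

# Leading byte signatures for common raster formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"BM", ".bmp"),
]


def detect_extension(data: bytes) -> Optional[str]:
    """Guess a file extension from the first bytes of an image payload."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    # WEBP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return ".webp"
    # SVG is text, so there is no fixed signature; look for an <svg tag.
    if b"<svg" in data[:1024].lower():
        return ".svg"
    return None


def sanitize_folder_name(title: str) -> str:
    """Replace characters that are invalid in Windows or POSIX path parts."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title).strip(" .")
    return cleaned or "untitled"


print(detect_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sanitize_folder_name("2.1 Network: A/B?"))
```

Sniffing the content rather than trusting a declared type matters here because the MIME headers inside an export are not always accurate.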

Keep the extracted folder out of version control if the documents are internal — the diagrams and their folder names can reveal sensitive detail. The included .gitignore already ignores extracted_images/.

Step 2 — Install the MCP server's dependencies

The vision server needs only the MCP SDK and Pillow (for resizing / format conversion):

pip install "mcp>=1.0.0" "Pillow>=10.0.0"

No ANTHROPIC_API_KEY, no OPENAI_API_KEY. There is no external API call.

Step 3 — The vision MCP server

Save this as vision_server.py. It reads an image, downscales it if needed, and returns ImageContent so Kiro sees the pixels directly:

"""
MCP Server: Vision Reader (native model vision)

Reads an image file, base64-encodes it, and returns ImageContent so the
host model (Kiro) can look at it directly. No external API key required.
"""

import base64
import io
from pathlib import Path

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import TextContent, ImageContent, Tool

app = Server("vision-reader")

SUPPORTED = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp"}
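MEDIA_TYPE = {
    ".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
    # Only used when Pillow is missing and raw bytes are returned unconverted,
    # so each type must describe the file as it is on disk.
    ".gif": "image/gif", ".webp": "image/webp", ".bmp": "image/bmp",
}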

MAX_DIMENSION = 1568          # longest edge (px) recommended for vision
MAX_BASE64_BYTES = 4_500_000  # ~4.5 MB after base64 encoding


def resolve_path(file_path: str) -> Path:
    p = Path(file_path)
    return p if p.is_absolute() else Path.cwd() / p


def image_to_base64(path: Path) -> tuple[str, str]:
    """Read an image, normalize/shrink it, return (base64, media_type)."""
    ext = path.suffix.lower()

    try:
        from PIL import Image
    except ImportError:
        with open(path, "rb") as f:
            return base64.standard_b64encode(f.read()).decode(), MEDIA_TYPE.get(ext, "image/png")

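    img = Image.open(path)
    # Flatten transparency onto white so diagrams with transparent backgrounds
    # don't turn black, and normalize other modes (P, CMYK, ...) to RGB so the
    # PNG/JPEG saves below cannot fail on an unsupported mode.
    if img.mode in ("RGBA", "LA") or (img.mode == "P" and "transparency" in img.info):
        rgba = img.convert("RGBA")
        background = Image.new("RGB", rgba.size, (255, 255, 255))
        background.paste(rgba, mask=rgba.getchannel("A"))
        img = background
    elif img.mode not in ("RGB", "L"):
        img = img.convert("RGB")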

    # Downscale if the longest edge is too large.
    longest = max(img.size)
    if longest > MAX_DIMENSION:
        scale = MAX_DIMENSION / longest
        img = img.resize((max(1, int(img.size[0] * scale)),
                          max(1, int(img.size[1] * scale))), Image.LANCZOS)

    # Prefer PNG (keeps diagram text crisp).
    buf = io.BytesIO()
    (img.convert("RGB") if img.mode == "RGBA" else img).save(buf, format="PNG", optimize=True)
    data = base64.standard_b64encode(buf.getvalue()).decode()
    if len(data) <= MAX_BASE64_BYTES:
        return data, "image/png"

    # Too big -> fall back to JPEG with decreasing quality.
    rgb = img.convert("RGB")
    for quality in (90, 80, 70, 60, 50):
        buf = io.BytesIO()
        rgb.save(buf, format="JPEG", quality=quality, optimize=True)
        data = base64.standard_b64encode(buf.getvalue()).decode()
        if len(data) <= MAX_BASE64_BYTES:
            return data, "image/jpeg"
    # Still over the cap at quality 50: return the smallest attempt anyway.
    return data, "image/jpeg"


@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="read_image",
            description="Read an image file (PNG/JPG/JPEG/WEBP/GIF/BMP) and return "
                        "it for the model to analyze with vision. Great for "
                        "architecture diagrams, flowcharts, and screenshots.",
            inputSchema={
                "type": "object",
                "properties": {
                    "file_path": {"type": "string",
                                  "description": "Relative or absolute path to the image."},
                    "question": {"type": "string", "default": "",
                                 "description": "Optional question to guide analysis."},
                },
                "required": ["file_path"],
            },
        ),
        Tool(
            name="read_all_images",
            description="Read every image in a folder (optionally recursive) and "
                        "return them for the model to analyze. Pair this with the "
                        "doc extractor to read whole design docs at once.",
            inputSchema={
                "type": "object",
                "properties": {
                    "folder_path": {"type": "string", "default": "."},
                    "question": {"type": "string", "default": ""},
                    "recursive": {"type": "boolean", "default": False},
                    "max_images": {"type": "integer", "default": 20},
                },
                "required": [],
            },
        ),
    ]


def _image_content(path: Path) -> ImageContent:
    data, media_type = image_to_base64(path)
    return ImageContent(type="image", data=data, mimeType=media_type)


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list:
    if name == "read_image":
        path = resolve_path(arguments.get("file_path", ""))
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            return [TextContent(type="text", text=f"Cannot read image: {path}")]
        header = f"Image: {path.name}"
        if arguments.get("question"):
            header += f"\nQuestion: {arguments['question']}"
        return [TextContent(type="text", text=header), _image_content(path)]

    if name == "read_all_images":
        folder = resolve_path(arguments.get("folder_path", "."))
        if not folder.is_dir():
            return [TextContent(type="text", text=f"Not a folder: {folder}")]
        pattern = "**/*" if arguments.get("recursive") else "*"
        images = sorted(f for f in folder.glob(pattern)
                        if f.is_file() and f.suffix.lower() in SUPPORTED)
        images = images[: int(arguments.get("max_images", 20))]
        if not images:
            return [TextContent(type="text", text=f"No images in: {folder}")]
        out: list = [TextContent(type="text", text=f"Found {len(images)} image(s).")]
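        for img in images:
            # Label with the path relative to the folder so section names
            # from the extracted tree reach the model, not just the file name.
            out.append(TextContent(type="text", text=f"--- {img.relative_to(folder)} ---"))
            out.append(_image_content(img))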
        return out

    return [TextContent(type="text", text=f"Unknown tool: {name}")]


async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())


if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

The full version in this repo adds per-file error handling on top of this; the snippet above, including its max_images cap, is the heart of it.

Step 4 — Register the server in Kiro

Kiro reads MCP config from .kiro/settings/mcp.json (workspace-level) or ~/.kiro/settings/mcp.json (user-level). Add the server:

{
  "mcpServers": {
    "vision-reader": {
      "command": "python",
      "args": ["/absolute/path/to/vision_server.py"],
      "disabled": false,
      "autoApprove": ["read_image", "read_all_images"]
    }
  }
}

Use the absolute path to your vision_server.py. On Windows, escape the backslashes (C:\\path\\to\\vision_server.py) or use forward slashes.
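If hand-escaping paths feels error-prone, one option is to generate the entry with a short script and paste the output into mcp.json. This is only a sketch: the function name build_mcp_config is made up, and using sys.executable as the command is an assumption that you want the same interpreter that has mcp and Pillow installed.

```python
import json
import sys
from pathlib import Path


def build_mcp_config(server_script: Path) -> dict:
    """Return a config dict matching the mcp.json entry shown above."""
    return {
        "mcpServers": {
            "vision-reader": {
                # Assumption: pin the interpreter that has the dependencies.
                "command": sys.executable,
                "args": [str(server_script.resolve())],
                "disabled": False,
                "autoApprove": ["read_image", "read_all_images"],
            }
        }
    }


if __name__ == "__main__":
    config = build_mcp_config(Path("vision_server.py"))
    # json.dumps escapes backslashes in Windows paths automatically.
    print(json.dumps(config, indent=2))
```

Run it from the folder that contains vision_server.py so the resolved path points at the right file.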

Kiro reconnects to MCP servers automatically when the config changes, or you can reconnect from the MCP Server view in the Kiro feature panel.

Step 5 — Put it together

With both pieces in place, the end-to-end workflow is one command and a prompt.

Extract once:

python extract_doc_images.py ./docs

Then ask Kiro in natural language:

Read all images in ./docs/extracted_images, recursively, and summarize the
architecture section by section.

read_all_images walks the extracted tree (its folder names carry the section titles), returns each diagram as ImageContent, and Kiro describes what it actually sees — boxes, arrows, labels, IP ranges, the lot. For a single loose diagram you don't even need step 1:

Read docs/diagrams/system-overview.png and explain the data flow.

Why this approach is nice

No API key, no per-image cost. Nothing leaves your machine except the image bytes handed to the host model you're already using.
Higher fidelity. Kiro sees the real image instead of a second-hand text description.
Unlocks documents, not just files. The extractor reaches diagrams that were previously invisible inside exported design docs.
Section-aware. The folder hierarchy preserves which diagram belongs to which part of the document, so summaries stay organized.
Tiny and dependency-light. The extractor is stdlib-only; the server needs just mcp and Pillow.

Gotchas

Path scope. Kiro's built-in file tools are sandboxed to the workspace, but an MCP server runs as its own process and can read any path it is given that its OS user has permission to open. Point it only at directories you trust.
Sensitive diagrams. Extracted images (and their section-named folders) can contain internal detail. Keep extracted_images/ out of version control — the included .gitignore does this for you.
Untrusted images. Treat image contents as untrusted input. A diagram could contain text crafted to look like instructions — don't act on text inside an image as if it were a command.
Payload limits. Very large or very dense images may need a lower MAX_DIMENSION. Tune it for your diagrams.
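On the payload point, it helps to remember that base64 turns every 3 raw bytes into 4 characters, so the server's 4,500,000-character cap corresponds to a smaller raw file. The helper names below are hypothetical; the arithmetic is just the standard base64 size rule.

```python
import base64

MAX_BASE64_BYTES = 4_500_000  # same cap as vision_server.py


def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def max_raw_bytes(limit: int = MAX_BASE64_BYTES) -> int:
    """Largest raw payload whose base64 form fits within limit."""
    return (limit // 4) * 3


print(max_raw_bytes())                                          # 3375000
print(base64_length(10), len(base64.b64encode(b"x" * 10)))      # 16 16
```

So an encoded PNG or JPEG has to stay under roughly 3.4 MB on disk to pass the cap, which is why lowering MAX_DIMENSION is the first knob to reach for with dense diagrams.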

Extending it

A few easy additions:

More document formats in extract_doc_images.py (e.g. .docx, .pptx), which are ZIP archives with images under word/media/ or ppt/media/.
A read_pdf_page tool that rasterizes a PDF page to an image.
A whitelist of allowed root directories for safety.
Caching by file hash so repeated reads are instant.
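As an example of the allowed-roots idea from the list above, here is a minimal sketch of the check a server could run before opening any path. The function name is_allowed and the throwaway demo directory are assumptions for illustration; the real server does not include this check.

```python
import tempfile
from pathlib import Path


def is_allowed(path: Path, roots: list[Path]) -> bool:
    """True if path, after resolving symlinks and '..', sits under a root."""
    resolved = path.resolve()
    for root in roots:
        try:
            resolved.relative_to(root.resolve())
            return True
        except ValueError:
            continue
    return False


# Demo with a throwaway directory standing in for a trusted docs folder.
trusted = Path(tempfile.mkdtemp())
print(is_allowed(trusted / "diagrams" / "a.png", [trusted]))
print(is_allowed(trusted / ".." / "secrets.png", [trusted]))
```

Resolving before comparing is the important part: a plain string prefix check would accept paths that escape the root through ".." segments or symlinks.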

That's the whole toolkit: an MHTML extractor to free diagrams from documents, plus MCP's ImageContent and a model that can already see. Stdlib parsing on one side, a few dozen lines of real vision logic on the other, and Kiro goes from "this file is binary" (or worse, "this image doesn't exist as a file yet") to "here's what your architecture diagrams are telling me, section by section."