MCP Data Server

An MCP server that connects AI agents to cloud-native geospatial data via STAC metadata and DuckDB with H3 spatial indexing, enabling zero-configuration SQL queries on terabyte-scale datasets over S3.

An open Model Context Protocol (MCP) server that connects AI agents to cloud-native data: it grounds the agent in STAC metadata so it finds the right dataset and reads its schema, and confines it to validated cloud-native engines so it queries terabyte-scale data over S3 without downloading it, misreading it, or silently failing at scale. Today it serves SQL over Parquet via DuckDB with H3 spatial indexing; see the roadmap for array (Zarr) and hardware-accelerated engines.

It is one of three open-source components (alongside data-workflows, which produces the AI-ready data and metadata, and jupyter-geoagent) that together make the cloud-native stack reachable by the AI tools researchers already use. It runs locally for sensitive data or on autoscaling Kubernetes for scale.

Quick Start

Add the hosted MCP endpoint to your LLM client, like so:

Using VSCode

Create a .vscode/mcp.json like this (as in this repo):

{
	"servers": {
		"duckdb-geo": {
			"url": "https://duckdb-mcp.nrp-nautilus.io/mcp"
		}
	}
}

Now simply ask your chat client a question about the datasets and it should answer by querying the database in SQL:

Examples:

  • What fraction of Australia is protected area?

Using Claude Code (CLI)

Run this command once in your terminal:

claude mcp add --transport http duckdb-geo https://duckdb-mcp.nrp-nautilus.io/mcp

To make it available across all your projects, add --scope user:

claude mcp add --transport http --scope user duckdb-geo https://duckdb-mcp.nrp-nautilus.io/mcp

Using Claude Desktop

Add to your Claude Desktop configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "duckdb-geo": {
      "url": "https://duckdb-mcp.nrp-nautilus.io/mcp"
    }
  }
}

After adding the configuration, restart Claude Desktop.

Features

  • Zero-Configuration SQL Access: Query terabyte-scale geospatial data without database setup
  • H3 Geospatial Indexing: Efficient spatial operations using Uber's H3 hexagonal grid system
  • Isolated Execution: Each query runs in a fresh DuckDB instance for security
  • Stateless HTTP Mode: Fully horizontally scalable for cloud deployment
  • Rich Dataset Catalog: Access to 10+ curated environmental and biodiversity datasets
  • MCP Resources & Prompts: Browse datasets and get query guidance through MCP protocol

Available Datasets

The example configuration provides access to the following datasets via S3:

  1. GLWD - Global Lakes and Wetlands Database
  2. Vulnerable Carbon - Conservation International carbon vulnerability data
  3. NCP - Nature Contributions to People biodiversity scores
  4. Countries & Regions - Global administrative boundaries (Overture Maps)
  5. WDPA - World Database on Protected Areas
  6. Ramsar Sites - Wetlands of International Importance
  7. HydroBASINS - Global watershed boundaries (levels 3-6)
  8. iNaturalist - Species occurrence range maps
  9. Corruption Index 2024 - Transparency International data

Datasets are discovered dynamically from the STAC catalog via the list_datasets and get_dataset tools.

Local Development

You can also run the server locally. Install the dependencies and start it directly:

pip install -r requirements.txt
python server.py

You can now connect to the server over localhost (note http not https here), e.g. in VSCode:

{
	"servers": {
		"duckdb-geo": {
			"url": "http://localhost:8000/mcp"
		}
	}
}

You can adjust the instructions given to the LLM in the corresponding .md files (e.g. query-optimization.md, h3-guide.md). To run the server locally you will need to adjust query-setup.md, because it uses an endpoint and thread count that only work from inside our k8s cluster. Running locally uses your own CPU and network resources for the computation, which will likely be much slower than the hosted k8s endpoint.

Architecture

We have a fully-hosted version

Core Components

  • server.py - Main MCP server with FastMCP framework
  • stac.py - STAC catalog integration for dynamic dataset discovery

Runtime Prompt Files

The .md files in this repo are not documentation: they are curated prompt content loaded by server.py at startup and injected directly into MCP tool descriptions and prompts at runtime. They are read by the agent (LLM) as instructions, not by humans.

| File | How it is used |
| --- | --- |
| query-setup.md | SQL parsed and executed in every fresh DuckDB connection before a query runs |
| query-optimization.md | Injected verbatim into the query tool description |
| h3-guide.md | Injected verbatim into the query tool description |
| assistant-role.md | Served as the geospatial-analyst MCP prompt (role + response style) |

Editing these files changes what the agent is told to do. They must be written for a stateless LLM — short, concrete, and unambiguous. See AGENTS.md for editing rules.
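As a rough illustration of the injection described above, the following sketch appends prompt files verbatim to a base tool description. It is not the code in server.py: the helper name `build_query_description`, the `PROMPT_FILES` list, and the blank-line join format are assumptions made for this example; only the file names come from the table.

```python
from pathlib import Path

# File names are taken from the table above; everything else here is a
# hypothetical helper for illustration, not the actual server.py code.
PROMPT_FILES = ["query-optimization.md", "h3-guide.md"]


def build_query_description(base: str, prompt_dir: Path) -> str:
    """Append each prompt file verbatim to a base tool description."""
    parts = [base.strip()]
    for name in PROMPT_FILES:
        path = prompt_dir / name
        if path.exists():  # skip missing files instead of failing at startup
            parts.append(path.read_text(encoding="utf-8").strip())
    # Blank lines keep the injected sections visually separate for the LLM.
    return "\n\n".join(parts)
```

Because the text is read once and embedded in the description, an edit to one of these files would only reach the agent after the server restarts.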

Key Design Patterns

  1. Stateless transport: FastMCP runs in stateless streamable-HTTP mode (stateless_http=True in server.py). Every POST /mcp is a complete, independent request/response — no Mcp-Session-Id, no per-pod session cache, no in-memory state that survives across requests. Replicas behind the load balancer are interchangeable on a per-request basis. (The protocol's stateful SSE mode is not used.)
  2. Isolation Engine: Each query runs in a fresh duckdb.connect(":memory:") — no DuckDB connection, credential, or query state survives between requests
  3. Context Injection: Prompt files are embedded into tool descriptions so even MCP clients that don't support prompts/list receive the guidance
  4. Partition Pruning: H3 resolution columns (h0) enable DuckDB to skip S3 partitions, giving 5–20× speedups on large datasets
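The isolation pattern in item 2 can be sketched in a few lines. To keep the example runnable with only the standard library, it uses sqlite3 as a stand-in for DuckDB; the connection lifecycle (open a fresh `:memory:` database, run one statement, close) is the point, not the engine. The function name `run_isolated` is hypothetical.

```python
import sqlite3


def run_isolated(sql: str) -> list:
    """Run one statement in a brand-new in-memory database, then discard it."""
    # Stand-in for duckdb.connect(":memory:"); sqlite3 is used only so this
    # sketch runs without extra dependencies.
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()  # tables, settings, and credentials are gone after this


# The first "request" creates a table...
run_isolated("CREATE TABLE t (x INTEGER)")
# ...and the second one cannot see it, because it got a new database.
leftover = run_isolated("SELECT name FROM sqlite_master WHERE type = 'table'")
```

Here `leftover` is an empty list, which is the property the design relies on: nothing created by one request is visible to the next.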

Kubernetes Deployment

Deploy to Kubernetes using the provided manifests:

kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/ingress.yaml

The deployment:

  • Runs multiple replicas for high availability (prod: 6, dev: 2)
  • Allocates up to 160 Gi memory / 16 CPU per pod for large queries
  • Bakes the application code and dependencies into the image (no runtime clone)
  • Includes /healthz readiness + liveness probes for safe rollouts

Releases and production rollouts

Application code is baked into the image (COPY . /app in the Dockerfile); pods no longer git clone at startup. CI (.github/workflows/docker.yml) builds on every push to main and on vX.Y.Z tags:

  • dev pins the moving :main tag (imagePullPolicy: Always) and tracks the latest main.
  • prod pins an immutable vX.Y.Z@sha256:<digest> (imagePullPolicy: IfNotPresent) — every replica is identical by construction.

By convention, every release tag is a GitHub Release. When you cut a version, push the tag (CI builds :vX.Y.Z), publish a release with notes (gh release create vX.Y.Z --generate-notes), then pin prod to that build's digest. The full step-by-step (including reading the digest) lives in AGENTS.md → Rollout workflow. The latest GitHub Release is the source of truth for "what should be running in prod."

To confirm prod is on the intended release:

# Digest the prod manifest pins:
kubectl -n biodiversity get deploy duckdb-mcp \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Every running pod should report that same digest:
kubectl -n biodiversity get pods -l app=duckdb-mcp \
  -o custom-columns=NAME:.metadata.name,IMAGE:.status.containerStatuses[0].imageID

If every pod's imageID matches the pinned digest, prod is current and consistent.
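If you want to script the comparison, a minimal sketch could look like the following. It assumes you have already captured the output of the two kubectl commands as strings; the helper names and the example registry path are hypothetical, and the only format assumption is that both values end in `@sha256:<digest>`.

```python
def extract_digest(image_ref: str) -> str:
    """Return the 'sha256:...' part of an image reference or pod imageID."""
    _, sep, digest = image_ref.rpartition("@")
    if not sep or not digest.startswith("sha256:"):
        raise ValueError(f"no digest found in {image_ref!r}")
    return digest


def prod_is_consistent(pinned_image: str, pod_image_ids: list) -> bool:
    """True when at least one pod exists and every pod matches the pinned digest."""
    pinned = extract_digest(pinned_image)
    return bool(pod_image_ids) and all(
        extract_digest(image_id) == pinned for image_id in pod_image_ids
    )


# Hypothetical values standing in for real kubectl output.
pinned = "registry.example/duckdb-mcp:v1.2.3@sha256:abc"
pods = ["registry.example/duckdb-mcp@sha256:abc"] * 2
consistent = prod_is_consistent(pinned, pods)
```

An empty pod list is treated as inconsistent on purpose, so that a deployment scaled to zero does not pass the check silently.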

MCP Protocol Features

Tools

  • list_datasets(catalog_url?, catalog_token?) - List available datasets from the STAC catalog
  • get_dataset(dataset_id, catalog_url?, catalog_token?) - Get S3 paths and schema for a dataset
  • query(sql_query, s3_key?, s3_secret?, s3_endpoint?, s3_scope?) - Execute DuckDB SQL against S3 parquet files
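For readers curious what a tool call looks like on the wire, here is a small sketch that builds the JSON-RPC 2.0 body for an MCP `tools/call` request. The helper name `build_tool_call` is hypothetical, and the example SQL is a placeholder. Real clients normally let an MCP library handle the handshake and HTTP headers, so treat this as an illustration of the message shape only.

```python
import json


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(payload)


# Placeholder query; only the required sql_query argument is passed.
body = build_tool_call("query", {"sql_query": "SELECT 42 AS answer"})
```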

Resources

NOTE: Some MCP clients, such as the one in VSCode, do not recognize "resources" and "prompts". Newer clients (Claude Code, Continue.dev, Antigravity) do.

  • catalog://list - List all available datasets
  • catalog://{name} - Get detailed schema for a specific dataset

Prompts

  • geospatial-analyst - Load complete context for geospatial analysis persona

Query Optimization Tips

  1. Always include h0 in joins - Enables partition pruning for 5-20x speedup
  2. Use APPROX_COUNT_DISTINCT(h8) - Fast area calculations with H3 hexagons
  3. Filter small tables first - Create CTEs to reduce join cardinality
  4. Set THREADS=100 - Parallel S3 reads are I/O bound, not CPU bound
  5. Enable object cache - Reduces redundant S3 requests

See query-optimization.md for detailed guidance.
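Tips 1 to 3 can be combined in a single query. The sketch below builds such a query as a string; the S3 paths and the `country` column are hypothetical, and the real schemas should be read from the catalog before adapting it. The helper only does string assembly, so the paths are assumed to be trusted constants rather than user input.

```python
def protected_area_sql(country_path: str, wdpa_path: str, country_code: str) -> str:
    """Build a query applying tips 1-3: CTE filter, h0 in the join, approximate count."""
    if not (country_code.isascii() and country_code.isalpha() and len(country_code) == 2):
        raise ValueError("expected a two-letter country code")
    return f"""
WITH region AS (              -- tip 3: shrink the small table first
    SELECT h0, h8
    FROM read_parquet('{country_path}')
    WHERE country = '{country_code.upper()}'   -- hypothetical column name
)
SELECT APPROX_COUNT_DISTINCT(p.h8) * 0.737327598 AS protected_km2  -- tip 2
FROM read_parquet('{wdpa_path}') AS p
JOIN region AS r
  ON p.h0 = r.h0              -- tip 1: h0 in the join enables partition pruning
 AND p.h8 = r.h8
""".strip()


# Hypothetical bucket paths, for illustration only.
sql = protected_area_sql("s3://example/countries/**", "s3://example/wdpa/**", "au")
```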

H3 Spatial Operations

All datasets use Uber's H3 hexagonal grid system for spatial indexing:

  • Resolution 8 (h8): ~0.737 km² per hex
  • Resolution 0-4 (h0-h4): Coarser resolutions for global analysis
  • Use h3_cell_to_parent() to join datasets at different resolutions
  • Use APPROX_COUNT_DISTINCT(h8) * 0.737327598 to calculate areas in km²
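The area arithmetic is simple enough to check by hand. The sketch below uses the per-hexagon constant from the list above; the function names are made up for this example.

```python
H8_AREA_KM2 = 0.737327598  # area per resolution-8 hexagon, as given above


def hexes_to_km2(hex_count: int) -> float:
    """Convert a count of distinct h8 cells into an area in km²."""
    return hex_count * H8_AREA_KM2


def fraction_covered(covered_hexes: int, total_hexes: int) -> float:
    """Share of a region covered, computed from hexagon counts alone."""
    if total_hexes <= 0:
        raise ValueError("total_hexes must be positive")
    return covered_hexes / total_hexes
```

Note that a fraction such as "what share of a country is protected" needs no area constant at all, because the constant cancels when one hexagon count is divided by another.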

Testing

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_server.py

# Run with coverage
pytest --cov=. tests/

Configuration

Environment Variables

  • THREADS - DuckDB thread count (default: 100 for S3 workloads)
  • PORT - HTTP server port (default: 8000)
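A minimal sketch of reading these two variables with their documented defaults might look like this. The function name and the validation rules are assumptions for illustration, not a description of how server.py parses its environment.

```python
import os


def load_settings(env=None) -> dict:
    """Read THREADS and PORT, falling back to the documented defaults."""
    env = os.environ if env is None else env
    threads = int(env.get("THREADS", "100"))
    port = int(env.get("PORT", "8000"))
    if threads < 1:
        raise ValueError("THREADS must be at least 1")
    if not 1 <= port <= 65535:
        raise ValueError("PORT must be a valid TCP port")
    return {"threads": threads, "port": port}
```

Passing a plain dict instead of the real environment, as in `load_settings({"THREADS": "8"})`, makes the parsing easy to exercise in tests.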

DuckDB Settings

Required settings are defined in query-setup.md and applied automatically to every fresh DuckDB connection before a query runs.

Private Data Access

The server supports private STAC catalogs and private S3 buckets. Credentials are supplied per-call by the client and are scoped to that request only — they are never logged, cached, or shared between clients.

Private STAC catalog

If your STAC catalog requires authentication, pass a bearer token alongside the catalog URL:

{ "tool": "list_datasets", "arguments": {
    "catalog_url": "https://your-app.example.org/stac/catalog.json",
    "catalog_token": "YOUR_BEARER_TOKEN"
}}

The token is forwarded as Authorization: Bearer <token> when fetching catalog JSON. Pass the same catalog_url and catalog_token to get_dataset as well.
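The header forwarding described above can be sketched with the standard library. The function name `catalog_request` is hypothetical and the `Accept` header is an assumption; the sketch only constructs the request object and sends nothing over the network.

```python
import urllib.request
from typing import Optional


def catalog_request(catalog_url: str, catalog_token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for catalog JSON, adding a bearer token if given."""
    headers = {"Accept": "application/json"}
    if catalog_token:
        headers["Authorization"] = f"Bearer {catalog_token}"
    return urllib.request.Request(catalog_url, headers=headers)


# Placeholder URL and token, matching the example above.
req = catalog_request(
    "https://your-app.example.org/stac/catalog.json", "YOUR_BEARER_TOKEN"
)
```

Public catalogs simply omit the token, in which case no Authorization header is attached.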

Serving a private catalog: The catalog endpoint needs to accept bearer token authentication for machine-to-machine access. If you are using oauth2-proxy for human (browser) access, add a parallel nginx auth_request bypass for the /stac/ path that accepts a static shared token via the Authorization header. This allows the MCP server to fetch catalog metadata without requiring a browser OAuth session.

Private S3 data

Pass S3 credentials directly to the query tool. The server injects them as a scoped DuckDB secret for the duration of that query, then destroys the connection:

{ "tool": "query", "arguments": {
    "sql_query": "SELECT * FROM read_parquet('s3://my-private-bucket/data/**') LIMIT 10",
    "s3_key": "YOUR_ACCESS_KEY_ID",
    "s3_secret": "YOUR_SECRET_ACCESS_KEY",
    "s3_endpoint": "minio.example.org"
}}

s3_endpoint defaults to s3-west.nrp-nautilus.io if omitted. SSL is enabled automatically for non-Ceph endpoints.
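To make the "scoped DuckDB secret" idea concrete, here is a hedged sketch of how such a statement could be assembled. It is not the server's actual code: the helper names, the secret name `request_secret`, and the way `use_ssl` is chosen are assumptions. The option names (TYPE, KEY_ID, SECRET, ENDPOINT, USE_SSL, SCOPE) are standard DuckDB S3 secret parameters.

```python
from typing import Optional


def sql_literal(value: str) -> str:
    """Quote a string for SQL, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_s3_secret_sql(
    key: str,
    secret: str,
    endpoint: str = "s3-west.nrp-nautilus.io",  # documented default endpoint
    scope: Optional[str] = None,
    use_ssl: bool = True,  # assumption: the caller decides; the server's rule is not shown here
) -> str:
    """Assemble a CREATE SECRET statement for one connection's lifetime."""
    options = [
        "TYPE S3",
        f"KEY_ID {sql_literal(key)}",
        f"SECRET {sql_literal(secret)}",
        f"ENDPOINT {sql_literal(endpoint)}",
        f"USE_SSL {'true' if use_ssl else 'false'}",
    ]
    if scope:
        options.append(f"SCOPE {sql_literal(scope)}")
    return "CREATE SECRET request_secret (" + ", ".join(options) + ")"
```

The returned string contains the credentials in clear text, so it should be executed directly on the per-request connection and never logged.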

Security properties

| Concern | How it is handled |
| --- | --- |
| Credential bleed between clients | Each request uses a separate duckdb.connect(":memory:"); DuckDB secrets are connection-scoped and destroyed on close |
| Credentials in server logs | CREATE SECRET statements are constructed internally and never written to stderr |
| Credentials in transit | All traffic is TLS-terminated at the ingress |
| Credential persistence | stateless_http=True means no session state survives between requests |

Deploying private apps without a separate server

Rather than maintaining a forked server deployment per app, private geo-agent apps can share the public MCP server endpoint and pass their credentials per-call. This reduces idle deployments and ensures all apps benefit from server improvements automatically.

Security

  • Stateless Design: No persistent database or user data
  • Query Isolation: Each request gets a fresh DuckDB instance; client credentials cannot bleed across requests
  • DNS Rebinding Protection: Disabled for MCP HTTP mode

License

BSD-3-Clause License — see LICENSE.

Contributing

Contributions welcome! Key areas:

  • Additional dataset integrations
  • Query optimization patterns
  • STAC catalog enhancements
  • Documentation improvements

Support

For issues and questions:
