Unstructured API MCP Server
An MCP server implementation that interacts with the Unstructured API, providing tools to list, create, update, and manage sources, destinations, and workflows.
Tools
create_s3_source
Create an S3 source connector.
Args:
  name: A unique name for this connector
  remote_url: The S3 URI to the bucket or folder (e.g., s3://my-bucket/)
  recursive: Whether to access subfolders within the bucket
Returns: String containing the created source connector information
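For example, a typical argument payload for this tool might look like the following (the connector name and bucket path are placeholders):

{
  "name": "my-s3-source",
  "remote_url": "s3://my-bucket/documents/",
  "recursive": true
}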
update_s3_source
Update an S3 source connector.
Args:
  source_id: ID of the source connector to update
  remote_url: The S3 URI to the bucket or folder
  recursive: Whether to access subfolders within the bucket
Returns: String containing the updated source connector information
delete_s3_source
Delete an S3 source connector.
Args:
  source_id: ID of the source connector to delete
Returns: String containing the result of the deletion
create_azure_source
Create an Azure source connector.
Args:
  name: A unique name for this connector
  remote_url: The Azure Storage remote URL, with the format az://<container-name>/<path/to/file/or/folder/in/container/as/needed>
  recursive: Whether to access subfolders within the bucket
Returns: String containing the created source connector information
update_azure_source
Update an Azure source connector.
Args:
  source_id: ID of the source connector to update
  remote_url: The Azure Storage remote URL, with the format az://<container-name>/<path/to/file/or/folder/in/container/as/needed>
  recursive: Whether to access subfolders within the bucket
Returns: String containing the updated source connector information
delete_azure_source
Delete an Azure source connector.
Args:
  source_id: ID of the source connector to delete
Returns: String containing the result of the deletion
create_gdrive_source
Create a gdrive (Google Drive) source connector.
Args:
  name: A unique name for this connector
  remote_url: The gdrive URI to the bucket or folder (e.g., gdrive://my-bucket/)
  recursive: Whether to access subfolders within the bucket
Returns: String containing the created source connector information
update_gdrive_source
Update a gdrive source connector.
Args:
  source_id: ID of the source connector to update
  remote_url: The gdrive URI to the bucket or folder
  recursive: Whether to access subfolders within the bucket
Returns: String containing the updated source connector information
delete_gdrive_source
Delete a gdrive source connector.
Args:
  source_id: ID of the source connector to delete
Returns: String containing the result of the deletion
create_s3_destination
Create an S3 destination connector.
Args:
  name: A unique name for this connector
  remote_url: The S3 URI to the bucket or folder
  key: The AWS access key ID
  secret: The AWS secret access key
  token: The AWS STS session token for temporary access (optional)
  endpoint_url: Custom URL if connecting to a non-AWS S3 bucket
Returns: String containing the created destination connector information
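As an illustration, an argument payload for this tool might look like the following (all values are placeholders):

{
  "name": "my-s3-destination",
  "remote_url": "s3://my-bucket/processed/",
  "key": "<AWS access key ID>",
  "secret": "<AWS secret access key>"
}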
update_s3_destination
Update an S3 destination connector.
Args:
  destination_id: ID of the destination connector to update
  remote_url: The S3 URI to the bucket or folder
Returns: String containing the updated destination connector information
delete_s3_destination
Delete an S3 destination connector.
Args:
  destination_id: ID of the destination connector to delete
Returns: String containing the result of the deletion
create_weaviate_destination
Create a Weaviate vector database destination connector.
Args:
  cluster_url: URL of the Weaviate cluster
  collection: Name of the collection to use in the Weaviate cluster
Note: The collection is a table in the Weaviate cluster. The platform has dedicated code to generate a collection for users; to keep this server simple, we do not generate one here.
Returns: String containing the created destination connector information
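For example, an argument payload might look like the following (the cluster URL and collection name are placeholders; the collection must already exist):

{
  "cluster_url": "https://my-cluster.weaviate.cloud",
  "collection": "Documents"
}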
update_weaviate_destination
Update a Weaviate destination connector.
Args:
  destination_id: ID of the destination connector to update
  cluster_url (optional): URL of the Weaviate cluster
  collection (optional): Name of the collection (like a file) to use in the Weaviate cluster
Returns: String containing the updated destination connector information
delete_weaviate_destination
Delete a Weaviate destination connector.
Args:
  destination_id: ID of the destination connector to delete
Returns: String containing the result of the deletion
create_astradb_destination
Create an AstraDB destination connector.
Args:
  name: A unique name for this connector
  collection_name: The name of the collection to use
  keyspace: The AstraDB keyspace
  batch_size: The batch size for inserting documents, must be positive (default: 20)
Note: A collection in AstraDB is a schemaless document store optimized for NoSQL workloads, equivalent to a table in traditional databases. A keyspace is the top-level namespace in AstraDB that groups multiple collections. We require the users to create their own collection and keyspace before creating the connector.
Returns: String containing the created destination connector information
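As an illustration, an argument payload might look like the following (the collection and keyspace are placeholders that must already exist in AstraDB):

{
  "name": "my-astradb-destination",
  "collection_name": "documents",
  "keyspace": "default_keyspace",
  "batch_size": 20
}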
update_astradb_destination
Update an AstraDB destination connector.
Args:
  destination_id: ID of the destination connector to update
  collection_name: The name of the collection to use (optional)
  keyspace: The AstraDB keyspace (optional)
  batch_size: The batch size for inserting documents (optional)
Note: We require the users to create their own collection and keyspace before creating the connector.
Returns: String containing the updated destination connector information
delete_astradb_destination
Delete an AstraDB destination connector.
Args:
  destination_id: ID of the destination connector to delete
Returns: String containing the result of the deletion
create_neo4j_destination
Create a Neo4j destination connector.
Args:
  name: A unique name for this connector
  database: The Neo4j database, e.g. "neo4j"
  uri: The Neo4j URI, e.g. neo4j+s://<neo4j_instance_id>.databases.neo4j.io
  username: The Neo4j username
Returns: String containing the created destination connector information
update_neo4j_destination
Update a Neo4j destination connector.
Args:
  destination_id: ID of the destination connector to update
  database: The Neo4j database, e.g. "neo4j"
  uri: The Neo4j URI, e.g. neo4j+s://<neo4j_instance_id>.databases.neo4j.io
  username: The Neo4j username
Returns: String containing the updated destination connector information
delete_neo4j_destination
Delete a Neo4j destination connector.
Args:
  destination_id: ID of the destination connector to delete
Returns: String containing the result of the deletion
invoke_firecrawl_crawlhtml
Start an asynchronous web crawl job using Firecrawl to retrieve HTML content.
Args:
  url: URL to crawl
  s3_uri: S3 URI where results will be uploaded
  limit: Maximum number of pages to crawl (default: 100)
Returns: Dictionary with crawl job information including the job ID
check_crawlhtml_status
Check the status of an existing Firecrawl HTML crawl job.
Args:
  crawl_id: ID of the crawl job to check
Returns: Dictionary containing the current status of the crawl job
invoke_firecrawl_llmtxt
Start an asynchronous llmfull.txt generation job using Firecrawl. This file is a standardized markdown file containing information to help LLMs use a website at inference time. The llmstxt endpoint leverages Firecrawl to crawl your website and extract data using gpt-4o-mini.
Args:
  url: URL to crawl
  s3_uri: S3 URI where results will be uploaded
  max_urls: Maximum number of pages to crawl (1-100, default: 10)
Returns: Dictionary with job information including the job ID
check_llmtxt_status
Check the status of an existing llmfull.txt generation job.
Args:
  job_id: ID of the llmfull.txt generation job to check
Returns: Dictionary containing the current status of the job and text content if completed
cancel_crawlhtml_job
Cancel an in-progress Firecrawl HTML crawl job.
Args:
  crawl_id: ID of the crawl job to cancel
Returns: Dictionary containing the result of the cancellation
list_sources
List available sources from the Unstructured API.
Args:
  source_type: Optional source connector type to filter by
Returns: String containing the list of sources
get_source_info
Get detailed information about a specific source connector.
Args:
  source_id: ID of the source connector to get information for; should be a valid UUID
Returns: String containing the source connector information
list_destinations
List available destinations from the Unstructured API.
Args:
  destination_type: Optional destination connector type to filter by
Returns: String containing the list of destinations
get_destination_info
Get detailed information about a specific destination connector.
Args:
  destination_id: ID of the destination connector to get information for
Returns: String containing the destination connector information
list_workflows
List workflows from the Unstructured API.
Args:
  destination_id: Optional destination connector ID to filter by
  source_id: Optional source connector ID to filter by
  status: Optional workflow status to filter by
Returns: String containing the list of workflows
get_workflow_info
Get detailed information about a specific workflow.
Args:
  workflow_id: ID of the workflow to get information for
Returns: String containing the workflow information
create_workflow
Create a new workflow.
Args:
  workflow_config: A Typed Dictionary containing required fields (destination_id - should be a valid UUID, name, source_id - should be a valid UUID, workflow_type) and non-required fields (schedule, and workflow_nodes). Note that workflow_nodes is only enabled when workflow_type is `custom` and is a list of WorkflowNodeTypedDict: partition, prompter, chunk, embed.
Below is an example of a partition workflow node:
{
  "name": "vlm-partition",
  "type": "partition",
  "sub_type": "vlm",
  "settings": {
    "provider": "your favorite provider",
    "model": "your favorite model"
  }
}
Returns: String containing the created workflow information

Custom workflow DAG nodes
- If WorkflowType is set to custom, you must also specify the settings for the workflow's directed acyclic graph (DAG) nodes. These nodes' settings are specified in the workflow_nodes array.
- A Source node is automatically created when you specify the source_id value outside of the workflow_nodes array.
- A Destination node is automatically created when you specify the destination_id value outside of the workflow_nodes array.
- You can specify Partitioner, Chunker, Prompter, and Embedder nodes.
- The order of the nodes in the workflow_nodes array will be the same order that these nodes appear in the DAG, with the first node in the array added directly after the Source node. The Destination node follows the last node in the array.
- Be sure to specify nodes in the allowed order. The following DAG placements are all allowed:
  - Source -> Partitioner -> Destination
  - Source -> Partitioner -> Chunker -> Destination
  - Source -> Partitioner -> Chunker -> Embedder -> Destination
  - Source -> Partitioner -> Prompter -> Chunker -> Destination
  - Source -> Partitioner -> Prompter -> Chunker -> Embedder -> Destination

Partitioner node
A Partitioner node has a type of partition and a subtype of auto, vlm, hi_res, or fast. Examples:
- auto strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "vlm",
  "settings": {
    "provider": "anthropic", (required)
    "model": "claude-3-5-sonnet-20241022", (required)
    "output_format": "text/html",
    "user_prompt": null,
    "format_html": true,
    "unique_element_ids": true,
    "is_dynamic": true,
    "allow_fast": true
  }
}
- vlm strategy: Allowed values are provider and model. Below are examples:
  - "provider": "anthropic", "model": "claude-3-5-sonnet-20241022"
  - "provider": "openai", "model": "gpt-4o"
- hi_res strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "unstructured_api",
  "settings": {
    "strategy": "hi_res",
    "include_page_breaks": <true|false>,
    "pdf_infer_table_structure": <true|false>,
    "exclude_elements": ["<element-name>", "<element-name>"],
    "xml_keep_tags": <true|false>,
    "encoding": "<encoding>",
    "ocr_languages": ["<language>", "<language>"],
    "extract_image_block_types": ["image", "table"],
    "infer_table_structure": <true|false>
  }
}
- fast strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "unstructured_api",
  "settings": {
    "strategy": "fast",
    "include_page_breaks": <true|false>,
    "pdf_infer_table_structure": <true|false>,
    "exclude_elements": ["<element-name>", "<element-name>"],
    "xml_keep_tags": <true|false>,
    "encoding": "<encoding>",
    "ocr_languages": ["<language-code>", "<language-code>"],
    "extract_image_block_types": ["image", "table"],
    "infer_table_structure": <true|false>
  }
}

Chunker node
A Chunker node has a type of chunk and a subtype of chunk_by_character or chunk_by_title.
- chunk_by_character:
{
  "name": "Chunker",
  "type": "chunk",
  "subtype": "chunk_by_character",
  "settings": {
    "include_orig_elements": <true|false>,
    "new_after_n_chars": <new-after-n-chars>, (required; if not provided, set same as max_characters)
    "max_characters": <max-characters>, (required)
    "overlap": <overlap>, (required; if not provided, default to 0)
    "overlap_all": <true|false>,
    "contextual_chunking_strategy": "v1"
  }
}
- chunk_by_title:
{
  "name": "Chunker",
  "type": "chunk",
  "subtype": "chunk_by_title",
  "settings": {
    "multipage_sections": <true|false>,
    "combine_text_under_n_chars": <combine-text-under-n-chars>,
    "include_orig_elements": <true|false>,
    "new_after_n_chars": <new-after-n-chars>, (required; if not provided, set same as max_characters)
    "max_characters": <max-characters>, (required)
    "overlap": <overlap>, (required; if not provided, default to 0)
    "overlap_all": <true|false>,
    "contextual_chunking_strategy": "v1"
  }
}

Prompter node
A Prompter node has a type of prompter and a subtype of:
- openai_image_description
- anthropic_image_description
- bedrock_image_description
- vertexai_image_description
- openai_table_description
- anthropic_table_description
- bedrock_table_description
- vertexai_table_description
- openai_table2html
- openai_ner
Example:
{
  "name": "Prompter",
  "type": "prompter",
  "subtype": "<subtype>",
  "settings": {}
}

Embedder node
An Embedder node has a type of embed. Allowed values for subtype and model_name include:
- "subtype": "azure_openai"
  - "model_name": "text-embedding-3-small"
  - "model_name": "text-embedding-3-large"
  - "model_name": "text-embedding-ada-002"
- "subtype": "bedrock"
  - "model_name": "amazon.titan-embed-text-v2:0"
  - "model_name": "amazon.titan-embed-text-v1"
  - "model_name": "amazon.titan-embed-image-v1"
  - "model_name": "cohere.embed-english-v3"
  - "model_name": "cohere.embed-multilingual-v3"
- "subtype": "togetherai"
  - "model_name": "togethercomputer/m2-bert-80M-2k-retrieval"
  - "model_name": "togethercomputer/m2-bert-80M-8k-retrieval"
  - "model_name": "togethercomputer/m2-bert-80M-32k-retrieval"
Example:
{
  "name": "Embedder",
  "type": "embed",
  "subtype": "<subtype>",
  "settings": {
    "model_name": "<model-name>"
  }
}
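Putting the pieces together, a minimal workflow_config for a custom workflow might look like the following sketch (the UUIDs are placeholders, and the node settings follow the examples above):

{
  "name": "my-custom-workflow",
  "source_id": "<source-connector-uuid>",
  "destination_id": "<destination-connector-uuid>",
  "workflow_type": "custom",
  "workflow_nodes": [
    {
      "name": "Partitioner",
      "type": "partition",
      "subtype": "vlm",
      "settings": {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022"}
    },
    {
      "name": "Chunker",
      "type": "chunk",
      "subtype": "chunk_by_title",
      "settings": {"max_characters": 1500, "new_after_n_chars": 1500, "overlap": 0}
    }
  ]
}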
run_workflow
Run a specific workflow.
Args:
  workflow_id: ID of the workflow to run
Returns: String containing the response from the workflow execution
update_workflow
Update an existing workflow.
Args:
  workflow_id: ID of the workflow to update
  workflow_config: A Typed Dictionary containing required fields (destination_id, name, source_id, workflow_type) and non-required fields (schedule, and workflow_nodes)
Returns: String containing the updated workflow information

Custom workflow DAG nodes
- If WorkflowType is set to custom, you must also specify the settings for the workflow's directed acyclic graph (DAG) nodes. These nodes' settings are specified in the workflow_nodes array.
- A Source node is automatically created when you specify the source_id value outside of the workflow_nodes array.
- A Destination node is automatically created when you specify the destination_id value outside of the workflow_nodes array.
- You can specify Partitioner, Chunker, Prompter, and Embedder nodes.
- The order of the nodes in the workflow_nodes array will be the same order that these nodes appear in the DAG, with the first node in the array added directly after the Source node. The Destination node follows the last node in the array.
- Be sure to specify nodes in the allowed order. The following DAG placements are all allowed:
  - Source -> Partitioner -> Destination
  - Source -> Partitioner -> Chunker -> Destination
  - Source -> Partitioner -> Chunker -> Embedder -> Destination
  - Source -> Partitioner -> Prompter -> Chunker -> Destination
  - Source -> Partitioner -> Prompter -> Chunker -> Embedder -> Destination

Partitioner node
A Partitioner node has a type of partition and a subtype of auto, vlm, hi_res, or fast. Examples:
- auto strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "vlm",
  "settings": {
    "provider": "anthropic", (required)
    "model": "claude-3-5-sonnet-20241022", (required)
    "output_format": "text/html",
    "user_prompt": null,
    "format_html": true,
    "unique_element_ids": true,
    "is_dynamic": true,
    "allow_fast": true
  }
}
- vlm strategy: Allowed values are provider and model. Below are examples:
  - "provider": "anthropic", "model": "claude-3-5-sonnet-20241022"
  - "provider": "openai", "model": "gpt-4o"
- hi_res strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "unstructured_api",
  "settings": {
    "strategy": "hi_res",
    "include_page_breaks": <true|false>,
    "pdf_infer_table_structure": <true|false>,
    "exclude_elements": ["<element-name>", "<element-name>"],
    "xml_keep_tags": <true|false>,
    "encoding": "<encoding>",
    "ocr_languages": ["<language>", "<language>"],
    "extract_image_block_types": ["image", "table"],
    "infer_table_structure": <true|false>
  }
}
- fast strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "unstructured_api",
  "settings": {
    "strategy": "fast",
    "include_page_breaks": <true|false>,
    "pdf_infer_table_structure": <true|false>,
    "exclude_elements": ["<element-name>", "<element-name>"],
    "xml_keep_tags": <true|false>,
    "encoding": "<encoding>",
    "ocr_languages": ["<language-code>", "<language-code>"],
    "extract_image_block_types": ["image", "table"],
    "infer_table_structure": <true|false>
  }
}

Chunker node
A Chunker node has a type of chunk and a subtype of chunk_by_character or chunk_by_title.
- chunk_by_character:
{
  "name": "Chunker",
  "type": "chunk",
  "subtype": "chunk_by_character",
  "settings": {
    "include_orig_elements": <true|false>,
    "new_after_n_chars": <new-after-n-chars>, (required; if not provided, set same as max_characters)
    "max_characters": <max-characters>, (required)
    "overlap": <overlap>, (required; if not provided, default to 0)
    "overlap_all": <true|false>,
    "contextual_chunking_strategy": "v1"
  }
}
- chunk_by_title:
{
  "name": "Chunker",
  "type": "chunk",
  "subtype": "chunk_by_title",
  "settings": {
    "multipage_sections": <true|false>,
    "combine_text_under_n_chars": <combine-text-under-n-chars>,
    "include_orig_elements": <true|false>,
    "new_after_n_chars": <new-after-n-chars>, (required; if not provided, set same as max_characters)
    "max_characters": <max-characters>, (required)
    "overlap": <overlap>, (required; if not provided, default to 0)
    "overlap_all": <true|false>,
    "contextual_chunking_strategy": "v1"
  }
}

Prompter node
A Prompter node has a type of prompter and a subtype of:
- openai_image_description
- anthropic_image_description
- bedrock_image_description
- vertexai_image_description
- openai_table_description
- anthropic_table_description
- bedrock_table_description
- vertexai_table_description
- openai_table2html
- openai_ner
Example:
{
  "name": "Prompter",
  "type": "prompter",
  "subtype": "<subtype>",
  "settings": {}
}

Embedder node
An Embedder node has a type of embed. Allowed values for subtype and model_name include:
- "subtype": "azure_openai"
  - "model_name": "text-embedding-3-small"
  - "model_name": "text-embedding-3-large"
  - "model_name": "text-embedding-ada-002"
- "subtype": "bedrock"
  - "model_name": "amazon.titan-embed-text-v2:0"
  - "model_name": "amazon.titan-embed-text-v1"
  - "model_name": "amazon.titan-embed-image-v1"
  - "model_name": "cohere.embed-english-v3"
  - "model_name": "cohere.embed-multilingual-v3"
- "subtype": "togetherai"
  - "model_name": "togethercomputer/m2-bert-80M-2k-retrieval"
  - "model_name": "togethercomputer/m2-bert-80M-8k-retrieval"
  - "model_name": "togethercomputer/m2-bert-80M-32k-retrieval"
Example:
{
  "name": "Embedder",
  "type": "embed",
  "subtype": "<subtype>",
  "settings": {
    "model_name": "<model-name>"
  }
}
delete_workflow
Delete a specific workflow.
Args:
  workflow_id: ID of the workflow to delete
Returns: String containing the response from the workflow deletion
list_jobs
List jobs via the Unstructured API.
Args:
  workflow_id: Optional workflow ID to filter by
  status: Optional job status to filter by
Returns: String containing the list of jobs
get_job_info
Get detailed information about a specific job.
Args:
  job_id: ID of the job to get information for
Returns: String containing the job information
cancel_job
Cancel a specific job.
Args:
  job_id: ID of the job to cancel
Returns: String containing the response from the job cancellation
README
Unstructured API MCP Server
An MCP server implementation for interacting with the Unstructured API. This server provides tools to list sources and workflows.
Available Tools
Tool | Description |
---|---|
list_sources | Lists available sources from the Unstructured API. |
get_source_info | Get detailed information about a specific source connector. |
create_[connector]_source | Create a source connector. Currently, we have s3/google drive/azure connectors (more to come!) |
update_[connector]_source | Update an existing source connector by params. |
delete_[connector]_source | Delete a source connector by source id. |
list_destinations | Lists available destinations from the Unstructured API. |
get_destination_info | Get detailed information about a specific destination connector. Currently, we have s3/weaviate/astra/neo4j/mongo DB (more to come!) |
create_[connector]_destination | Create a destination connector by params. |
update_[connector]_destination | Update an existing destination connector by destination id. |
delete_[connector]_destination | Delete a destination connector by destination id. |
list_workflows | Lists workflows from the Unstructured API. |
get_workflow_info | Get detailed information about a specific workflow. |
create_workflow | Create a new workflow with source, destination id, etc. |
run_workflow | Run a specific workflow with workflow id. |
update_workflow | Update an existing workflow by params. |
delete_workflow | Delete a specific workflow by id. |
list_jobs | Lists jobs for a specific workflow from the Unstructured API. |
get_job_info | Get detailed information about a specific job by job id. |
cancel_job | Delete a specific job by id. |
Below is the list of connectors the UNS-MCP server currently supports; see the full list of source connectors supported by the Unstructured platform here and destinations here. We plan to add more!
Source | Destination |
---|---|
S3 | S3 |
Azure | Weaviate |
Google Drive | Pinecone |
OneDrive | AstraDB |
Salesforce | MongoDB |
Sharepoint | Neo4j |
Databricks Volumes | |
Databricks Volumes Delta Table | |
To use the tools that create/update/delete a connector, the credentials for that specific connector must be defined in your .env file. Below is the list of credentials for the connectors we support:

Credential Name | Description |
---|---|
ANTHROPIC_API_KEY | Required to run the minimal_client to interact with our server. |
AWS_KEY, AWS_SECRET | Required to create an S3 connector via the uns-mcp server; see the instructions in the documentation and here |
WEAVIATE_CLOUD_API_KEY | Required to create a Weaviate vector database connector; see the instructions in the documentation |
FIRECRAWL_API_KEY | Required to use the Firecrawl tools in external/firecrawl.py; sign up on Firecrawl and obtain an API key. |
ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_API_ENDPOINT | Required to create an AstraDB connector via the uns-mcp server; see the instructions in the documentation |
AZURE_CONNECTION_STRING | Required option 1 for creating an Azure connector via the uns-mcp server; see the instructions in the documentation |
AZURE_ACCOUNT_NAME + AZURE_ACCOUNT_KEY | Required option 2 for creating an Azure connector via the uns-mcp server; see the instructions in the documentation |
AZURE_ACCOUNT_NAME + AZURE_SAS_TOKEN | Required option 3 for creating an Azure connector via the uns-mcp server; see the instructions in the documentation |
NEO4J_PASSWORD | Required to create a Neo4j connector via the uns-mcp server; see the instructions in the documentation |
MONGO_DB_CONNECTION_STRING | Required to create a MongoDB connector via the uns-mcp server; see the instructions in the documentation |
GOOGLEDRIVE_SERVICE_ACCOUNT_KEY | A string value. The original service account key (follow the documentation) is in a JSON file; run `cat /path/to/google_service_account_key.json` in a terminal |
DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET | Required to create a Databricks volume/delta table connector via the uns-mcp server; see the instructions in the documentation and here |
ONEDRIVE_CLIENT_ID, ONEDRIVE_CLIENT_CRED, ONEDRIVE_TENANT_ID | Required to create a OneDrive connector via the uns-mcp server; see the instructions in the documentation |
PINECONE_API_KEY | Required to create a Pinecone vector database connector via the uns-mcp server; see the instructions in the documentation |
SALESFORCE_CONSUMER_KEY, SALESFORCE_PRIVATE_KEY | Required to create a Salesforce source connector via the uns-mcp server; see the documentation |
SHAREPOINT_CLIENT_ID, SHAREPOINT_CLIENT_CRED, SHAREPOINT_TENANT_ID | Required to create a SharePoint connector via the uns-mcp server; see the instructions in the documentation |
LOG_LEVEL | Used to set the logging level for our minimal_client, e.g. set to ERROR to get everything |
CONFIRM_TOOL_USE | Set to true so that the minimal_client confirms execution before each tool call |
DEBUG_API_REQUESTS | Set to true so that uns_mcp/server.py outputs request parameters for better debugging |
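As an illustration, a .env that enables the minimal client plus the S3 and Weaviate connectors might contain entries like the following (all values are placeholders):

UNSTRUCTURED_API_KEY="<your-unstructured-api-key>"
ANTHROPIC_API_KEY="<your-anthropic-api-key>"
AWS_KEY="<your-aws-access-key-id>"
AWS_SECRET="<your-aws-secret-access-key>"
WEAVIATE_CLOUD_API_KEY="<your-weaviate-cloud-api-key>"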
Firecrawl Source

Firecrawl is a web crawling API that provides two main capabilities in our MCP:

- HTML content retrieval: Use invoke_firecrawl_crawlhtml to start crawl jobs and check_crawlhtml_status to monitor them
- LLM-optimized text generation: Use invoke_firecrawl_llmtxt to generate text and check_llmtxt_status to retrieve results

How Firecrawl works:

Web crawling process:
- Starts with a specified URL and analyzes it to identify links
- Uses the sitemap if available; otherwise follows links found on the website
- Recursively traverses each link to discover all subpages
- Gathers content from every visited page, handling JavaScript rendering and rate limits
- Jobs can be cancelled with cancel_crawlhtml_job if needed
- Use this if you need all the information extracted into raw HTML; Unstructured's workflows clean it up really well :smile:

LLM text generation:
- After crawling, extracts clean, meaningful text content from the crawled pages
- Generates optimized text formats specifically formatted for large language models
- Results are automatically uploaded to the specified S3 location
- Note: LLM text generation jobs cannot be cancelled once started. The cancel_llmtxt_job function is provided for consistency, but it is not currently supported by the Firecrawl API.

Note: The FIRECRAWL_API_KEY environment variable must be set to use these functions.
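For illustration, a typical crawl-and-monitor sequence with these tools might look like the following (the URL, S3 URI, and job ID are placeholders):

1. invoke_firecrawl_crawlhtml with { "url": "https://docs.example.com", "s3_uri": "s3://my-bucket/crawled-html/", "limit": 50 }
2. check_crawlhtml_status with { "crawl_id": "<job ID returned in step 1>" } until the job reports completion
3. Optionally, cancel_crawlhtml_job with { "crawl_id": "<job ID returned in step 1>" } to stop the crawl early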
Installation & Configuration

This guide provides step-by-step instructions to set up and configure the UNS_MCP server using Python 3.12 and the uv tool.

Prerequisites
- Python 3.12+
- uv for environment management
- An API key from Unstructured. You can sign up and obtain your API key here.

Using uv (Recommended)

No additional installation is required when using uvx, as it handles execution. However, if you prefer to install the package directly:
uv pip install uns_mcp
Configuring Claude Desktop

To integrate with Claude Desktop, add the following to your claude_desktop_config.json:

Note: The file is located in the ~/Library/Application Support/Claude/ directory.

Using the uvx command:
{
"mcpServers": {
"UNS_MCP": {
"command": "uvx",
"args": ["uns_mcp"],
"env": {
"UNSTRUCTURED_API_KEY": "<your-key>"
}
}
}
}
Alternatively, use the Python package:
{
"mcpServers": {
"UNS_MCP": {
"command": "python",
"args": ["-m", "uns_mcp"],
"env": {
"UNSTRUCTURED_API_KEY": "<your-key>"
}
}
}
}
Using source code

- Clone the repository.

- Install dependencies:

uv sync

- Set your Unstructured API key as an environment variable. Create a .env file in the root directory with the following content:
UNSTRUCTURED_API_KEY="YOUR_KEY"
Refer to .env.template for the configurable environment variables.

You can now run the server using one of the following methods:
<details> <summary> Using editable package installation </summary> Install as an editable package:
uvx pip install -e .
Update your Claude Desktop config:
{
"mcpServers": {
"UNS_MCP": {
"command": "uvx",
"args": ["uns_mcp"]
}
}
}
Note: Remember to point to the uvx executable in the environment where the package is installed.
</details>
<details> <summary> Using the SSE server protocol </summary>

Note: Not supported by Claude Desktop.

With the SSE protocol, you can debug more easily by decoupling the client and the server:

- Start the server in one terminal:

uv run python uns_mcp/server.py --host 127.0.0.1 --port 8080 # or make sse-server

- Test the server with a local client in another terminal:

uv run python minimal_client/client.py "http://127.0.0.1:8080/sse" # or make sse-client

Note: To stop the services, use Ctrl+C on the client first, then on the server.
</details>
<details> <summary> Using the Stdio server protocol </summary>

Configure Claude Desktop to use stdio:
{
"mcpServers": {
"UNS_MCP": {
"command": "ABSOLUTE/PATH/TO/.local/bin/uv",
"args": [
"--directory",
"ABSOLUTE/PATH/TO/YOUR-UNS-MCP-REPO/uns_mcp",
"run",
"server.py"
]
}
}
}
Alternatively, run the local client:
uv run python minimal_client/client.py uns_mcp/server.py
</details>
Additional local client configuration

Configure the minimal client using environment variables:

- LOG_LEVEL="ERROR": set this to suppress debug output from the LLM and show clean messages to users.
- CONFIRM_TOOL_USE='false': disables tool-use confirmation before execution. Use with caution, especially during development, as the LLM may execute expensive workflows or delete data.
Debugging tools

Anthropic provides the MCP Inspector tool to debug/test your MCP server. Run the following command to spin up the debugging UI. From there, you can add environment variables (pointing to your local environment) in the left pane; include your personal API key there as an env var. Go to tools, where you can test the capabilities you add to the MCP server.
mcp dev uns_mcp/server.py
If you need to log the request call parameters made to UnstructuredClient, set the environment variable DEBUG_API_REQUESTS=true.
The logs are stored in a file named unstructured-client-{date}.log, which can be inspected to debug the request call parameters passed to UnstructuredClient functions.
Adding terminal access to the minimal client

We are going to use @wonderwhy-er/desktop-commander to add terminal access to the minimal client. It is built on top of the MCP Filesystem Server. Be careful, as the client (also the LLM) now has access to private files.

Execute the following command to install the package:
npx @wonderwhy-er/desktop-commander setup
Then start the client with an extra parameter:
uv run python minimal_client/client.py "http://127.0.0.1:8080/sse" "@wonderwhy-er/desktop-commander"
# or
make sse-client-terminal
Using a subset of tools

If your client supports using only a subset of the tools, be aware of the following:
- The update_workflow tool must be loaded into the context together with the create_workflow tool, because it contains detailed instructions on how to create and configure custom nodes.
Known issues
- update_workflow - needs to have the configuration of the workflow it is updating in the context, either provided by the user or obtained by calling the get_workflow_info tool, because this tool does not work as a patch applier; it fully replaces the workflow config.
CHANGELOG.md
Any newly developed features/fixes/enhancements will be added to CHANGELOG.md. The 0.x.x-dev pre-release format is preferred until we bump to a stable version.
Troubleshooting

- If you encounter the error Error: spawn <command> ENOENT, it means <command> is not installed or not visible in your PATH:
  - Make sure to install it and add it to your PATH.
  - Or provide the absolute path to the command in the command field of your config, e.g. replace python with /opt/miniconda3/bin/python