Unstructured API MCP Server
An MCP server implementation that interacts with the Unstructured API, providing tools to list, create, update, and manage sources, destinations, and workflows.
Tools
create_s3_source
Create an S3 source connector.
Args:
  name: A unique name for this connector
  remote_url: The S3 URI to the bucket or folder (e.g., s3://my-bucket/)
  recursive: Whether to access subfolders within the bucket
Returns: String containing the created source connector information
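For example, a typical argument payload for this tool might look like the following (the connector name and bucket path are placeholders):

{
  "name": "my-s3-source",
  "remote_url": "s3://my-bucket/documents/",
  "recursive": true
}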
update_s3_source
Update an S3 source connector.
Args:
  source_id: ID of the source connector to update
  remote_url: The S3 URI to the bucket or folder
  recursive: Whether to access subfolders within the bucket
Returns: String containing the updated source connector information
delete_s3_source
Delete an S3 source connector.
Args:
  source_id: ID of the source connector to delete
Returns: String containing the result of the deletion
create_azure_source
Create an Azure source connector.
Args:
  name: A unique name for this connector
  remote_url: The Azure Storage remote URL, with the format az://<container-name>/<path/to/file/or/folder/in/container/as/needed>
  recursive: Whether to access subfolders within the bucket
Returns: String containing the created source connector information
update_azure_source
Update an Azure source connector.
Args:
  source_id: ID of the source connector to update
  remote_url: The Azure Storage remote URL, with the format az://<container-name>/<path/to/file/or/folder/in/container/as/needed>
  recursive: Whether to access subfolders within the bucket
Returns: String containing the updated source connector information
delete_azure_source
Delete an Azure source connector.
Args:
  source_id: ID of the source connector to delete
Returns: String containing the result of the deletion
create_gdrive_source
Create a gdrive (Google Drive) source connector.
Args:
  name: A unique name for this connector
  remote_url: The gdrive URI to the bucket or folder (e.g., gdrive://my-bucket/)
  recursive: Whether to access subfolders within the bucket
Returns: String containing the created source connector information
update_gdrive_source
Update a gdrive source connector.
Args:
  source_id: ID of the source connector to update
  remote_url: The gdrive URI to the bucket or folder
  recursive: Whether to access subfolders within the bucket
Returns: String containing the updated source connector information
delete_gdrive_source
Delete a gdrive source connector.
Args:
  source_id: ID of the source connector to delete
Returns: String containing the result of the deletion
create_s3_destination
Create an S3 destination connector.
Args:
  name: A unique name for this connector
  remote_url: The S3 URI to the bucket or folder
  key: The AWS access key ID
  secret: The AWS secret access key
  token: The AWS STS session token for temporary access (optional)
  endpoint_url: Custom URL if connecting to a non-AWS S3 bucket
Returns: String containing the created destination connector information
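As an illustration, an argument payload for this tool might look like the following (all values are placeholders):

{
  "name": "my-s3-destination",
  "remote_url": "s3://my-bucket/processed/",
  "key": "<AWS access key ID>",
  "secret": "<AWS secret access key>"
}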
update_s3_destination
Update an S3 destination connector.
Args:
  destination_id: ID of the destination connector to update
  remote_url: The S3 URI to the bucket or folder
Returns: String containing the updated destination connector information
delete_s3_destination
Delete an S3 destination connector.
Args:
  destination_id: ID of the destination connector to delete
Returns: String containing the result of the deletion
create_weaviate_destination
Create a Weaviate vector database destination connector.
Args:
  cluster_url: URL of the Weaviate cluster
  collection: Name of the collection to use in the Weaviate cluster
Note: The collection is a table in the Weaviate cluster. The platform has dedicated code to generate a collection for users; to keep this server simple, we do not generate one here.
Returns: String containing the created destination connector information
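For example, an argument payload might look like the following (the cluster URL and collection name are placeholders; the collection must already exist):

{
  "cluster_url": "https://my-cluster.weaviate.cloud",
  "collection": "Documents"
}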
update_weaviate_destination
Update a Weaviate destination connector.
Args:
  destination_id: ID of the destination connector to update
  cluster_url (optional): URL of the Weaviate cluster
  collection (optional): Name of the collection (like a file) to use in the Weaviate cluster
Returns: String containing the updated destination connector information
delete_weaviate_destination
Delete a Weaviate destination connector.
Args:
  destination_id: ID of the destination connector to delete
Returns: String containing the result of the deletion
create_astradb_destination
Create an AstraDB destination connector.
Args:
  name: A unique name for this connector
  collection_name: The name of the collection to use
  keyspace: The AstraDB keyspace
  batch_size: The batch size for inserting documents, must be positive (default: 20)
Note: A collection in AstraDB is a schemaless document store optimized for NoSQL workloads, equivalent to a table in traditional databases. A keyspace is the top-level namespace in AstraDB that groups multiple collections. We require the users to create their own collection and keyspace before creating the connector.
Returns: String containing the created destination connector information
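As an illustration, an argument payload might look like the following (the collection and keyspace are placeholders that must already exist in AstraDB):

{
  "name": "my-astradb-destination",
  "collection_name": "documents",
  "keyspace": "default_keyspace",
  "batch_size": 20
}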
update_astradb_destination
Update an AstraDB destination connector.
Args:
  destination_id: ID of the destination connector to update
  collection_name: The name of the collection to use (optional)
  keyspace: The AstraDB keyspace (optional)
  batch_size: The batch size for inserting documents (optional)
Note: We require the users to create their own collection and keyspace before creating the connector.
Returns: String containing the updated destination connector information
delete_astradb_destination
Delete an AstraDB destination connector.
Args:
  destination_id: ID of the destination connector to delete
Returns: String containing the result of the deletion
create_neo4j_destination
Create a Neo4j destination connector.
Args:
  name: A unique name for this connector
  database: The Neo4j database, e.g. "neo4j"
  uri: The Neo4j URI, e.g. neo4j+s://<neo4j_instance_id>.databases.neo4j.io
  username: The Neo4j username
Returns: String containing the created destination connector information
update_neo4j_destination
Update a Neo4j destination connector.
Args:
  destination_id: ID of the destination connector to update
  database: The Neo4j database, e.g. "neo4j"
  uri: The Neo4j URI, e.g. neo4j+s://<neo4j_instance_id>.databases.neo4j.io
  username: The Neo4j username
Returns: String containing the updated destination connector information
delete_neo4j_destination
Delete a Neo4j destination connector.
Args:
  destination_id: ID of the destination connector to delete
Returns: String containing the result of the deletion
invoke_firecrawl_crawlhtml
Start an asynchronous web crawl job using Firecrawl to retrieve HTML content.
Args:
  url: URL to crawl
  s3_uri: S3 URI where results will be uploaded
  limit: Maximum number of pages to crawl (default: 100)
Returns: Dictionary with crawl job information including the job ID
check_crawlhtml_status
Check the status of an existing Firecrawl HTML crawl job.
Args:
  crawl_id: ID of the crawl job to check
Returns: Dictionary containing the current status of the crawl job
invoke_firecrawl_llmtxt
Start an asynchronous llmfull.txt generation job using Firecrawl. This file is a standardized markdown file containing information to help LLMs use a website at inference time. The llmstxt endpoint leverages Firecrawl to crawl your website and extract data using gpt-4o-mini.
Args:
  url: URL to crawl
  s3_uri: S3 URI where results will be uploaded
  max_urls: Maximum number of pages to crawl (1-100, default: 10)
Returns: Dictionary with job information including the job ID
check_llmtxt_status
Check the status of an existing llmfull.txt generation job.
Args:
  job_id: ID of the llmfull.txt generation job to check
Returns: Dictionary containing the current status of the job and text content if completed
cancel_crawlhtml_job
Cancel an in-progress Firecrawl HTML crawl job.
Args:
  crawl_id: ID of the crawl job to cancel
Returns: Dictionary containing the result of the cancellation
list_sources
List available sources from the Unstructured API.
Args:
  source_type: Optional source connector type to filter by
Returns: String containing the list of sources
get_source_info
Get detailed information about a specific source connector.
Args:
  source_id: ID of the source connector to get information for; should be a valid UUID
Returns: String containing the source connector information
list_destinations
List available destinations from the Unstructured API.
Args:
  destination_type: Optional destination connector type to filter by
Returns: String containing the list of destinations
get_destination_info
Get detailed information about a specific destination connector.
Args:
  destination_id: ID of the destination connector to get information for
Returns: String containing the destination connector information
list_workflows
List workflows from the Unstructured API.
Args:
  destination_id: Optional destination connector ID to filter by
  source_id: Optional source connector ID to filter by
  status: Optional workflow status to filter by
Returns: String containing the list of workflows
get_workflow_info
Get detailed information about a specific workflow.
Args:
  workflow_id: ID of the workflow to get information for
Returns: String containing the workflow information
create_workflow
Create a new workflow.
Args:
  workflow_config: A Typed Dictionary containing required fields (destination_id - should be a valid UUID, name, source_id - should be a valid UUID, workflow_type) and non-required fields (schedule, and workflow_nodes). Note that workflow_nodes is only enabled when workflow_type is `custom` and is a list of WorkflowNodeTypedDict: partition, prompter, chunk, embed.
Below is an example of a partition workflow node:
{
  "name": "vlm-partition",
  "type": "partition",
  "sub_type": "vlm",
  "settings": {
    "provider": "your favorite provider",
    "model": "your favorite model"
  }
}
Returns: String containing the created workflow information

Custom workflow DAG nodes
- If WorkflowType is set to custom, you must also specify the settings for the workflow's directed acyclic graph (DAG) nodes. These nodes' settings are specified in the workflow_nodes array.
- A Source node is automatically created when you specify the source_id value outside of the workflow_nodes array.
- A Destination node is automatically created when you specify the destination_id value outside of the workflow_nodes array.
- You can specify Partitioner, Chunker, Prompter, and Embedder nodes.
- The order of the nodes in the workflow_nodes array will be the same order that these nodes appear in the DAG, with the first node in the array added directly after the Source node. The Destination node follows the last node in the array.
- Be sure to specify nodes in the allowed order. The following DAG placements are all allowed:
  - Source -> Partitioner -> Destination
  - Source -> Partitioner -> Chunker -> Destination
  - Source -> Partitioner -> Chunker -> Embedder -> Destination
  - Source -> Partitioner -> Prompter -> Chunker -> Destination
  - Source -> Partitioner -> Prompter -> Chunker -> Embedder -> Destination

Partitioner node
A Partitioner node has a type of partition and a subtype of auto, vlm, hi_res, or fast. Examples:
- auto strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "vlm",
  "settings": {
    "provider": "anthropic", (required)
    "model": "claude-3-5-sonnet-20241022", (required)
    "output_format": "text/html",
    "user_prompt": null,
    "format_html": true,
    "unique_element_ids": true,
    "is_dynamic": true,
    "allow_fast": true
  }
}
- vlm strategy: Allowed values are provider and model. Below are examples:
  - "provider": "anthropic", "model": "claude-3-5-sonnet-20241022"
  - "provider": "openai", "model": "gpt-4o"
- hi_res strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "unstructured_api",
  "settings": {
    "strategy": "hi_res",
    "include_page_breaks": <true|false>,
    "pdf_infer_table_structure": <true|false>,
    "exclude_elements": ["<element-name>", "<element-name>"],
    "xml_keep_tags": <true|false>,
    "encoding": "<encoding>",
    "ocr_languages": ["<language>", "<language>"],
    "extract_image_block_types": ["image", "table"],
    "infer_table_structure": <true|false>
  }
}
- fast strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "unstructured_api",
  "settings": {
    "strategy": "fast",
    "include_page_breaks": <true|false>,
    "pdf_infer_table_structure": <true|false>,
    "exclude_elements": ["<element-name>", "<element-name>"],
    "xml_keep_tags": <true|false>,
    "encoding": "<encoding>",
    "ocr_languages": ["<language-code>", "<language-code>"],
    "extract_image_block_types": ["image", "table"],
    "infer_table_structure": <true|false>
  }
}

Chunker node
A Chunker node has a type of chunk and a subtype of chunk_by_character or chunk_by_title.
- chunk_by_character:
{
  "name": "Chunker",
  "type": "chunk",
  "subtype": "chunk_by_character",
  "settings": {
    "include_orig_elements": <true|false>,
    "new_after_n_chars": <new-after-n-chars>, (required; if not provided, set same as max_characters)
    "max_characters": <max-characters>, (required)
    "overlap": <overlap>, (required; if not provided, default to 0)
    "overlap_all": <true|false>,
    "contextual_chunking_strategy": "v1"
  }
}
- chunk_by_title:
{
  "name": "Chunker",
  "type": "chunk",
  "subtype": "chunk_by_title",
  "settings": {
    "multipage_sections": <true|false>,
    "combine_text_under_n_chars": <combine-text-under-n-chars>,
    "include_orig_elements": <true|false>,
    "new_after_n_chars": <new-after-n-chars>, (required; if not provided, set same as max_characters)
    "max_characters": <max-characters>, (required)
    "overlap": <overlap>, (required; if not provided, default to 0)
    "overlap_all": <true|false>,
    "contextual_chunking_strategy": "v1"
  }
}

Prompter node
A Prompter node has a type of prompter and a subtype of:
- openai_image_description
- anthropic_image_description
- bedrock_image_description
- vertexai_image_description
- openai_table_description
- anthropic_table_description
- bedrock_table_description
- vertexai_table_description
- openai_table2html
- openai_ner
Example:
{
  "name": "Prompter",
  "type": "prompter",
  "subtype": "<subtype>",
  "settings": {}
}

Embedder node
An Embedder node has a type of embed. Allowed values for subtype and model_name include:
- "subtype": "azure_openai"
  - "model_name": "text-embedding-3-small"
  - "model_name": "text-embedding-3-large"
  - "model_name": "text-embedding-ada-002"
- "subtype": "bedrock"
  - "model_name": "amazon.titan-embed-text-v2:0"
  - "model_name": "amazon.titan-embed-text-v1"
  - "model_name": "amazon.titan-embed-image-v1"
  - "model_name": "cohere.embed-english-v3"
  - "model_name": "cohere.embed-multilingual-v3"
- "subtype": "togetherai"
  - "model_name": "togethercomputer/m2-bert-80M-2k-retrieval"
  - "model_name": "togethercomputer/m2-bert-80M-8k-retrieval"
  - "model_name": "togethercomputer/m2-bert-80M-32k-retrieval"
Example:
{
  "name": "Embedder",
  "type": "embed",
  "subtype": "<subtype>",
  "settings": {
    "model_name": "<model-name>"
  }
}
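Putting the pieces together, a minimal workflow_config for a custom workflow might look like the following sketch (the UUIDs are placeholders, and the node settings follow the examples above):

{
  "name": "my-custom-workflow",
  "source_id": "<source-connector-uuid>",
  "destination_id": "<destination-connector-uuid>",
  "workflow_type": "custom",
  "workflow_nodes": [
    {
      "name": "Partitioner",
      "type": "partition",
      "subtype": "vlm",
      "settings": {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022"}
    },
    {
      "name": "Chunker",
      "type": "chunk",
      "subtype": "chunk_by_title",
      "settings": {"max_characters": 1500, "new_after_n_chars": 1500, "overlap": 0}
    }
  ]
}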
run_workflow
Run a specific workflow.
Args:
  workflow_id: ID of the workflow to run
Returns: String containing the response from the workflow execution
update_workflow
Update an existing workflow.
Args:
  workflow_id: ID of the workflow to update
  workflow_config: A Typed Dictionary containing required fields (destination_id, name, source_id, workflow_type) and non-required fields (schedule, and workflow_nodes)
Returns: String containing the updated workflow information

Custom workflow DAG nodes
- If WorkflowType is set to custom, you must also specify the settings for the workflow's directed acyclic graph (DAG) nodes. These nodes' settings are specified in the workflow_nodes array.
- A Source node is automatically created when you specify the source_id value outside of the workflow_nodes array.
- A Destination node is automatically created when you specify the destination_id value outside of the workflow_nodes array.
- You can specify Partitioner, Chunker, Prompter, and Embedder nodes.
- The order of the nodes in the workflow_nodes array will be the same order that these nodes appear in the DAG, with the first node in the array added directly after the Source node. The Destination node follows the last node in the array.
- Be sure to specify nodes in the allowed order. The following DAG placements are all allowed:
  - Source -> Partitioner -> Destination
  - Source -> Partitioner -> Chunker -> Destination
  - Source -> Partitioner -> Chunker -> Embedder -> Destination
  - Source -> Partitioner -> Prompter -> Chunker -> Destination
  - Source -> Partitioner -> Prompter -> Chunker -> Embedder -> Destination

Partitioner node
A Partitioner node has a type of partition and a subtype of auto, vlm, hi_res, or fast. Examples:
- auto strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "vlm",
  "settings": {
    "provider": "anthropic", (required)
    "model": "claude-3-5-sonnet-20241022", (required)
    "output_format": "text/html",
    "user_prompt": null,
    "format_html": true,
    "unique_element_ids": true,
    "is_dynamic": true,
    "allow_fast": true
  }
}
- vlm strategy: Allowed values are provider and model. Below are examples:
  - "provider": "anthropic", "model": "claude-3-5-sonnet-20241022"
  - "provider": "openai", "model": "gpt-4o"
- hi_res strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "unstructured_api",
  "settings": {
    "strategy": "hi_res",
    "include_page_breaks": <true|false>,
    "pdf_infer_table_structure": <true|false>,
    "exclude_elements": ["<element-name>", "<element-name>"],
    "xml_keep_tags": <true|false>,
    "encoding": "<encoding>",
    "ocr_languages": ["<language>", "<language>"],
    "extract_image_block_types": ["image", "table"],
    "infer_table_structure": <true|false>
  }
}
- fast strategy:
{
  "name": "Partitioner",
  "type": "partition",
  "subtype": "unstructured_api",
  "settings": {
    "strategy": "fast",
    "include_page_breaks": <true|false>,
    "pdf_infer_table_structure": <true|false>,
    "exclude_elements": ["<element-name>", "<element-name>"],
    "xml_keep_tags": <true|false>,
    "encoding": "<encoding>",
    "ocr_languages": ["<language-code>", "<language-code>"],
    "extract_image_block_types": ["image", "table"],
    "infer_table_structure": <true|false>
  }
}

Chunker node
A Chunker node has a type of chunk and a subtype of chunk_by_character or chunk_by_title.
- chunk_by_character:
{
  "name": "Chunker",
  "type": "chunk",
  "subtype": "chunk_by_character",
  "settings": {
    "include_orig_elements": <true|false>,
    "new_after_n_chars": <new-after-n-chars>, (required; if not provided, set same as max_characters)
    "max_characters": <max-characters>, (required)
    "overlap": <overlap>, (required; if not provided, default to 0)
    "overlap_all": <true|false>,
    "contextual_chunking_strategy": "v1"
  }
}
- chunk_by_title:
{
  "name": "Chunker",
  "type": "chunk",
  "subtype": "chunk_by_title",
  "settings": {
    "multipage_sections": <true|false>,
    "combine_text_under_n_chars": <combine-text-under-n-chars>,
    "include_orig_elements": <true|false>,
    "new_after_n_chars": <new-after-n-chars>, (required; if not provided, set same as max_characters)
    "max_characters": <max-characters>, (required)
    "overlap": <overlap>, (required; if not provided, default to 0)
    "overlap_all": <true|false>,
    "contextual_chunking_strategy": "v1"
  }
}

Prompter node
A Prompter node has a type of prompter and a subtype of:
- openai_image_description
- anthropic_image_description
- bedrock_image_description
- vertexai_image_description
- openai_table_description
- anthropic_table_description
- bedrock_table_description
- vertexai_table_description
- openai_table2html
- openai_ner
Example:
{
  "name": "Prompter",
  "type": "prompter",
  "subtype": "<subtype>",
  "settings": {}
}

Embedder node
An Embedder node has a type of embed. Allowed values for subtype and model_name include:
- "subtype": "azure_openai"
  - "model_name": "text-embedding-3-small"
  - "model_name": "text-embedding-3-large"
  - "model_name": "text-embedding-ada-002"
- "subtype": "bedrock"
  - "model_name": "amazon.titan-embed-text-v2:0"
  - "model_name": "amazon.titan-embed-text-v1"
  - "model_name": "amazon.titan-embed-image-v1"
  - "model_name": "cohere.embed-english-v3"
  - "model_name": "cohere.embed-multilingual-v3"
- "subtype": "togetherai"
  - "model_name": "togethercomputer/m2-bert-80M-2k-retrieval"
  - "model_name": "togethercomputer/m2-bert-80M-8k-retrieval"
  - "model_name": "togethercomputer/m2-bert-80M-32k-retrieval"
Example:
{
  "name": "Embedder",
  "type": "embed",
  "subtype": "<subtype>",
  "settings": {
    "model_name": "<model-name>"
  }
}
delete_workflow
Delete a specific workflow.
Args:
  workflow_id: ID of the workflow to delete
Returns: String containing the response from the workflow deletion
list_jobs
List jobs via the Unstructured API.
Args:
  workflow_id: Optional workflow ID to filter by
  status: Optional job status to filter by
Returns: String containing the list of jobs
get_job_info
Get detailed information about a specific job.
Args:
  job_id: ID of the job to get information for
Returns: String containing the job information
cancel_job
Cancel a specific job.
Args:
  job_id: ID of the job to cancel
Returns: String containing the response from the job cancellation
README
Unstructured API MCP Server
An MCP server implementation for interacting with the Unstructured API. This server provides tools to list sources and workflows.
Available Tools
Tool | Description |
---|---|
list_sources | Lists available sources from the Unstructured API. |
get_source_info | Get detailed information about a specific source connector. |
create_[connector]_source | Create a source connector. Currently, we have s3/google drive/azure connectors (more to come!) |
update_[connector]_source | Update an existing source connector by params. |
delete_[connector]_source | Delete a source connector by source id. |
list_destinations | Lists available destinations from the Unstructured API. |
get_destination_info | Get detailed information about a specific destination connector. Currently, we have s3/weaviate/astra/neo4j/mongo DB (more to come!) |
create_[connector]_destination | Create a destination connector by params. |
update_[connector]_destination | Update an existing destination connector by destination id. |
delete_[connector]_destination | Delete a destination connector by destination id. |
list_workflows | Lists workflows from the Unstructured API. |
get_workflow_info | Get detailed information about a specific workflow. |
create_workflow | Create a new workflow with source, destination id, etc. |
run_workflow | Run a specific workflow with workflow id. |
update_workflow | Update an existing workflow by params. |
delete_workflow | Delete a specific workflow by id. |
list_jobs | Lists jobs for a specific workflow from the Unstructured API. |
get_job_info | Get detailed information about a specific job by job id. |
cancel_job | Delete a specific job by id. |
Below is the list of connectors the UNS-MCP server currently supports; see the full list of source connectors supported by the Unstructured platform here and destinations here. We plan to add more!
Source | Destination |
---|---|
S3 | S3 |
Azure | Weaviate |
Google Drive | Pinecone |
OneDrive | AstraDB |
Salesforce | MongoDB |
Sharepoint | Neo4j |
Databricks Volumes | |
Databricks Volumes Delta Table | |
To use the tools that create/update/delete a connector, the credentials for that specific connector must be defined in your .env file. Below is the list of credentials for the connectors we support:

Credential Name | Description |
---|---|
ANTHROPIC_API_KEY | Required to run the minimal_client to interact with our server. |
AWS_KEY, AWS_SECRET | Required to create an S3 connector via the uns-mcp server; see the instructions in the documentation and here |
WEAVIATE_CLOUD_API_KEY | Required to create a Weaviate vector database connector; see the instructions in the documentation |
FIRECRAWL_API_KEY | Required to use the Firecrawl tools in external/firecrawl.py; sign up on Firecrawl and obtain an API key. |
ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_API_ENDPOINT | Required to create an AstraDB connector via the uns-mcp server; see the instructions in the documentation |
AZURE_CONNECTION_STRING | Required option 1 for creating an Azure connector via the uns-mcp server; see the instructions in the documentation |
AZURE_ACCOUNT_NAME + AZURE_ACCOUNT_KEY | Required option 2 for creating an Azure connector via the uns-mcp server; see the instructions in the documentation |
AZURE_ACCOUNT_NAME + AZURE_SAS_TOKEN | Required option 3 for creating an Azure connector via the uns-mcp server; see the instructions in the documentation |
NEO4J_PASSWORD | Required to create a Neo4j connector via the uns-mcp server; see the instructions in the documentation |
MONGO_DB_CONNECTION_STRING | Required to create a MongoDB connector via the uns-mcp server; see the instructions in the documentation |
GOOGLEDRIVE_SERVICE_ACCOUNT_KEY | A string value. The original service account key (follow the documentation) is in a JSON file; run `cat /path/to/google_service_account_key.json` in a terminal |
DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET | Required to create a Databricks volume/delta table connector via the uns-mcp server; see the instructions in the documentation and here |
ONEDRIVE_CLIENT_ID, ONEDRIVE_CLIENT_CRED, ONEDRIVE_TENANT_ID | Required to create a OneDrive connector via the uns-mcp server; see the instructions in the documentation |
PINECONE_API_KEY | Required to create a Pinecone vector database connector via the uns-mcp server; see the instructions in the documentation |
SALESFORCE_CONSUMER_KEY, SALESFORCE_PRIVATE_KEY | Required to create a Salesforce source connector via the uns-mcp server; see the documentation |
SHAREPOINT_CLIENT_ID, SHAREPOINT_CLIENT_CRED, SHAREPOINT_TENANT_ID | Required to create a SharePoint connector via the uns-mcp server; see the instructions in the documentation |
LOG_LEVEL | Used to set the logging level for our minimal_client, e.g. set to ERROR to get everything |
CONFIRM_TOOL_USE | Set to true so that the minimal_client confirms execution before each tool call |
DEBUG_API_REQUESTS | Set to true so that uns_mcp/server.py outputs request parameters for better debugging |
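As an illustration, a .env that enables the minimal client plus the S3 and Weaviate connectors might contain entries like the following (all values are placeholders):

UNSTRUCTURED_API_KEY="<your-unstructured-api-key>"
ANTHROPIC_API_KEY="<your-anthropic-api-key>"
AWS_KEY="<your-aws-access-key-id>"
AWS_SECRET="<your-aws-secret-access-key>"
WEAVIATE_CLOUD_API_KEY="<your-weaviate-cloud-api-key>"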
Firecrawl Source

Firecrawl is a web crawling API that provides two main capabilities in our MCP:

- HTML content retrieval: Use invoke_firecrawl_crawlhtml to start crawl jobs and check_crawlhtml_status to monitor them
- LLM-optimized text generation: Use invoke_firecrawl_llmtxt to generate text and check_llmtxt_status to retrieve results

How Firecrawl works:

Web crawling process:
- Starts with a specified URL and analyzes it to identify links
- Uses the sitemap if available; otherwise follows links found on the website
- Recursively traverses each link to discover all subpages
- Gathers content from every visited page, handling JavaScript rendering and rate limits
- Jobs can be cancelled with cancel_crawlhtml_job if needed
- Use this if you need all the information extracted into raw HTML; Unstructured's workflows clean it up really well :smile:

LLM text generation:
- After crawling, extracts clean, meaningful text content from the crawled pages
- Generates optimized text formats specifically formatted for large language models
- Results are automatically uploaded to the specified S3 location
- Note: LLM text generation jobs cannot be cancelled once started. The cancel_llmtxt_job function is provided for consistency, but it is not currently supported by the Firecrawl API.

Note: The FIRECRAWL_API_KEY environment variable must be set to use these functions.
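For illustration, a typical crawl-and-monitor sequence with these tools might look like the following (the URL, S3 URI, and job ID are placeholders):

1. invoke_firecrawl_crawlhtml with { "url": "https://docs.example.com", "s3_uri": "s3://my-bucket/crawled-html/", "limit": 50 }
2. check_crawlhtml_status with { "crawl_id": "<job ID returned in step 1>" } until the job reports completion
3. Optionally, cancel_crawlhtml_job with { "crawl_id": "<job ID returned in step 1>" } to stop the crawl early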
Installation & Configuration

This guide provides step-by-step instructions to set up and configure the UNS_MCP server using Python 3.12 and the uv tool.

Prerequisites
- Python 3.12+
- uv for environment management
- An API key from Unstructured. You can sign up and obtain your API key here.

Using uv (Recommended)

No additional installation is required when using uvx, as it handles execution. However, if you prefer to install the package directly:
uv pip install uns_mcp
Configuring Claude Desktop

To integrate with Claude Desktop, add the following to your claude_desktop_config.json:

Note: The file is located in the ~/Library/Application Support/Claude/ directory.

Using the uvx command:
{
"mcpServers": {
"UNS_MCP": {
"command": "uvx",
"args": ["uns_mcp"],
"env": {
"UNSTRUCTURED_API_KEY": "<your-key>"
}
}
}
}
Alternatively, use the Python package:
{
"mcpServers": {
"UNS_MCP": {
"command": "python",
"args": ["-m", "uns_mcp"],
"env": {
"UNSTRUCTURED_API_KEY": "<your-key>"
}
}
}
}
Using source code

- Clone the repository.

- Install dependencies:

uv sync

- Set your Unstructured API key as an environment variable. Create a .env file in the root directory with the following content:
UNSTRUCTURED_API_KEY="YOUR_KEY"
Refer to .env.template for the configurable environment variables.

You can now run the server using one of the following methods:
<details> <summary> Using editable package installation </summary> Install as an editable package:
uvx pip install -e .
Update your Claude Desktop config:
{
"mcpServers": {
"UNS_MCP": {
"command": "uvx",
"args": ["uns_mcp"]
}
}
}
Note: Remember to point to the uvx executable in the environment where the package is installed.
</details>
<details> <summary> Using the SSE server protocol </summary>

Note: Not supported by Claude Desktop.

With the SSE protocol, you can debug more easily by decoupling the client and the server:

- Start the server in one terminal:

uv run python uns_mcp/server.py --host 127.0.0.1 --port 8080 # or make sse-server

- Test the server with a local client in another terminal:

uv run python minimal_client/client.py "http://127.0.0.1:8080/sse" # or make sse-client

Note: To stop the services, use Ctrl+C on the client first, then on the server.
</details>
<details> <summary> Using the Stdio server protocol </summary>

Configure Claude Desktop to use stdio:
{
"mcpServers": {
"UNS_MCP": {
"command": "ABSOLUTE/PATH/TO/.local/bin/uv",
"args": [
"--directory",
"ABSOLUTE/PATH/TO/YOUR-UNS-MCP-REPO/uns_mcp",
"run",
"server.py"
]
}
}
}
Alternatively, run the local client:
uv run python minimal_client/client.py uns_mcp/server.py
</details>
Additional local client configuration

Configure the minimal client using environment variables:

- LOG_LEVEL="ERROR": set this to suppress debug output from the LLM and show clean messages to users.
- CONFIRM_TOOL_USE='false': disables tool-use confirmation before execution. Use with caution, especially during development, as the LLM may execute expensive workflows or delete data.
Debugging tools

Anthropic provides the MCP Inspector tool to debug/test your MCP server. Run the following command to spin up the debugging UI. From there, you can add environment variables (pointing to your local environment) in the left pane; include your personal API key there as an env var. Go to tools, where you can test the capabilities you add to the MCP server.
mcp dev uns_mcp/server.py
If you need to log the request call parameters made to UnstructuredClient, set the environment variable DEBUG_API_REQUESTS=true.
The logs are stored in a file named unstructured-client-{date}.log, which can be inspected to debug the request call parameters passed to UnstructuredClient functions.
Adding terminal access to the minimal client

We are going to use @wonderwhy-er/desktop-commander to add terminal access to the minimal client. It is built on top of the MCP Filesystem Server. Be careful, as the client (also the LLM) now has access to private files.

Execute the following command to install the package:
npx @wonderwhy-er/desktop-commander setup
Then start the client with an extra parameter:
uv run python minimal_client/client.py "http://127.0.0.1:8080/sse" "@wonderwhy-er/desktop-commander"
# or
make sse-client-terminal
Using a subset of tools

If your client supports using only a subset of the tools, be aware of the following:
- The update_workflow tool must be loaded into the context together with the create_workflow tool, because it contains detailed instructions on how to create and configure custom nodes.
Known issues
- update_workflow - needs to have the configuration of the workflow it is updating in the context, either provided by the user or obtained by calling the get_workflow_info tool, because this tool does not work as a patch applier; it fully replaces the workflow config.
CHANGELOG.md
Any newly developed features/fixes/enhancements will be added to CHANGELOG.md. The 0.x.x-dev pre-release format is preferred until we bump to a stable version.
Troubleshooting

- If you encounter the error Error: spawn <command> ENOENT, it means <command> is not installed or not visible in your PATH:
  - Make sure to install it and add it to your PATH.
  - Or provide the absolute path to the command in the command field of your config, e.g. replace python with /opt/miniconda3/bin/python