diff --git a/docs/en/part14/p15_dataagent_semantic_nl2sql_agent.md b/docs/en/part14/p15_dataagent_semantic_nl2sql_agent.md
index d5d26c20..214246f3 100644
--- a/docs/en/part14/p15_dataagent_semantic_nl2sql_agent.md
+++ b/docs/en/part14/p15_dataagent_semantic_nl2sql_agent.md
@@ -6,20 +6,13 @@
 
 This project uses DataAgent to build a semantic BI assistant for enterprise structured data. The goal is not to let a model directly "guess SQL." Instead, the project organizes natural-language questions, the business semantic layer, database metadata, an NL2SQL sub-agent, result files, report generation, and runtime audit trails into a reusable data engineering chain.
 
-Read in engineering order, the chain is:
+Read in engineering order, the project follows a complete chain from business question to trustworthy delivery, as shown in Figure P15-1.
 
-```text
-business question
- -> scenario prompt and task planning
- -> semantic-layer schema retrieval
- -> NL2SQL sub-agent
- -> SQL validation and execution
- -> CSV/SQL asset persistence
- -> main-agent summary and report generation
- -> execution trace and acceptance evaluation
-```
+![DataAgent semantic BI assistant engineering chain](../../images/part14/p15_dataagent_engineering_chain_en.png)
+
+*Figure P15-1: Engineering chain of the DataAgent semantic BI assistant.*
 
-The core objective is to turn enterprise BI Q&A from a one-off conversational capability into a configurable, auditable, and extensible data application.
+The chain starts with a business question. A scenario prompt and task-planning step then clarifies the analysis goal, metric definitions, filters, and expected output. Semantic-layer schema retrieval supplies the model with table, field, business-description, and relationship context before the main agent delegates structured querying to the NL2SQL sub-agent. The generated SQL is validated and executed, and the resulting SQL and CSV files are persisted as reviewable workspace assets rather than treated as transient intermediate outputs. The main agent uses those assets to summarize results and generate a report. Finally, execution traces, tool states, and acceptance signals are retained for audit, regression testing, and future iteration. The core objective is to turn enterprise BI Q&A from a one-off conversational capability into a configurable, auditable, and extensible data application.
 
 The chapter follows four main threads:
 
@@ -72,7 +65,7 @@ Common failures include input-distribution drift, missing schema fields, overly
 
 ## Reproducible Resource Notes
 
-Reproduction materials should include data-source notes, minimal samples, configuration files, run commands, metric scripts, inspection reports, and artifact directories. The main text keeps the necessary snippets; complete notebooks, long scripts, and large files should be maintained separately as companion resources.
+Reproduction materials should include data-source notes, minimal samples, configuration files, run commands, metric scripts, inspection reports, and artifact directories. The main text keeps the necessary snippets; complete notebooks, long scripts, and large files should be maintained separately as companion resources. If the project is extended with enterprise metric definitions, a modeled semantic layer, or data-build pipelines, dbt's documentation, testing, and modeling mechanisms can be used as an optional reference (dbt Labs, 2026), but dbt is not a required dependency for this chapter.
 
 ## 1. Project Background: Why Enterprise BI Needs Agent Data Engineering
 
@@ -145,17 +138,17 @@ This case highlights DataAgent's distinctive capabilities: NL2SQL, Semantic Serv
 
 ## 4. Overall Architecture: From Business Question to Auditable Data Asset
 
-The project architecture has six layers. The following figure organizes entry, orchestration, query, asset, and governance relationships into a layered architecture.
+The semantic BI assistant in this chapter is not an isolated NL2SQL component. It is one application within the broader DataGallery ecosystem. It uses DataAgent for task understanding, tool orchestration, and execution control; Semantic Service for metadata, metric definitions, and semantic retrieval; and Data Studio and Data Ops to connect BI capability with applications, evaluation, observation, and continuous improvement. Figure P15-2 shows the architectural relationships that support this project.
 
-![Layered architecture for a DataAgent enterprise semantic BI assistant](../../images/part14/p15_dataagent_semantic_bi_layered_architecture_en.png)
+![DataGallery ecosystem architecture around DataAgent, Semantic Service, Data Studio, Data Ops, and foundation infrastructure](../../images/part14/p15_datagallery_architecture_vector.svg)
 
-*Figure P15-1: Layered architecture for the DataAgent enterprise semantic BI assistant.*
+*Figure P15-2: DataGallery ecosystem architecture around DataAgent.*
 
-The overall architecture diagram from the DataAgent repository is:
+At the core of the architecture, DataGallery Core provides the capabilities used most directly by the semantic BI assistant. The Data Intelligence Framework exposes the DataAgent SDK, while the Data Agent Engine is responsible for intent understanding, schema linking, SQL generation, execution validation, reflection and repair, and confidence selection. The Data Agent Shell defines runtime boundaries through privilege control, tool guardrails, and privacy-aware routing. In enterprise BI, these capabilities determine whether the model can understand a business question, select the right tool, and translate natural language into a controlled data operation.
 
-![DataAgent overall architecture](../../images/part14/p15_dataagent_agent_excalidraw_en.png)
+The semantic and data-support layers provide the context needed for reliable query generation. Semantic Distributed Runtime turns metadata ingestion, ontology modeling, and metric registry management into callable semantic services, so NL2SQL can use business descriptions, field meanings, metric definitions, and join relations instead of relying only on table and column names. Data System Service supplies data access, sandboxing, memory storage, and Agent UI/Web support, allowing SQL files, CSV outputs, reports, and runtime traces to become reviewable engineering artifacts.
 
-*Figure P15-2: Overall architecture of the DataAgent repository.*
+At the application and operations level, Data Studio corresponds to text-to-SQL, data analysis, data engineering, and business-process scenarios. Data Ops provides lifecycle capabilities such as full-chain visual tuning, benchmark evaluation, and observation. The foundation layer consists of models, data infrastructure, and hardware or chipsets. The key path in this project is therefore: a business question enters DataAgent; DataAgent uses semantic services to understand the data context and invoke NL2SQL capabilities; the resulting assets then flow into workspace, evaluation, and observation mechanisms so the answer can be audited, regression-tested, and improved over time.
 
 ### 4.1 Interface Layer
 
@@ -245,12 +238,12 @@ This chapter depends on DataAgent, Semantic Service, a value-match service, and
 
 ### 5.2 Install the Project
 
-DataAgent is an agent data engineering framework in the DataGallery open-source ecosystem. The DataGallery open-source entry is [https://gitcode.com/datagallery](https://gitcode.com/datagallery), and the DataAgent source repository is [https://gitcode.com/datagallery/DataAgent](https://gitcode.com/datagallery/DataAgent). For DataGallery's role in this book, reproduction boundaries, and project-governance usage, see [Appendix G: DataGallery Open-source Ecosystem Overview](../appendix_g_datagallery_note.md).
+DataAgent is an agent data engineering framework in the DataGallery open-source ecosystem (DataGallery Contributors, 2026a). The DataGallery open-source entry is [https://gitcode.com/datagallery](https://gitcode.com/datagallery), and the DataAgent source repository is [https://gitcode.com/datagallery/dataagent](https://gitcode.com/datagallery/dataagent) (DataGallery Contributors, 2026b). For DataGallery's role in this book, reproduction boundaries, and project-governance usage, see [Appendix G: DataGallery Open-source Ecosystem Overview](../appendix_g_datagallery_note.md).
 
 First pin the version, then install dependencies from the repository root:
 
 ```bash
-git clone https://gitcode.com/datagallery/DataAgent.git
+git clone https://gitcode.com/datagallery/dataagent.git
-cd DataAgent
+cd dataagent
 git checkout
 python -m venv .venv
@@ -938,5 +931,5 @@ As part of Part 14, this chapter validates earlier methods at the project level.
 3. Schick, T., Dwivedi-Yu, J., Dessi, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761.
 4. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
 5. dbt Labs. (2026). dbt Documentation. https://docs.getdbt.com/.
-6. DataGallery Contributors. (2026). DataGallery organization page. https://gitcode.com/datagallery.
-7. DataGallery Contributors. (2026). DataAgent source repository. https://gitcode.com/datagallery/DataAgent.
+6. DataGallery Contributors. (2026a). DataGallery organization page. https://gitcode.com/datagallery.
+7. DataGallery Contributors. (2026b). DataAgent source repository. https://gitcode.com/datagallery/dataagent.
diff --git a/docs/images/part14/p15_dataagent_engineering_chain_en.png b/docs/images/part14/p15_dataagent_engineering_chain_en.png
new file mode 100644
index 00000000..4299fc92
Binary files /dev/null and b/docs/images/part14/p15_dataagent_engineering_chain_en.png differ
diff --git a/docs/images/part14/p15_datagallery_architecture_vector.svg b/docs/images/part14/p15_datagallery_architecture_vector.svg
new file mode 100644
index 00000000..48119f3d
--- /dev/null
+++ b/docs/images/part14/p15_datagallery_architecture_vector.svg
@@ -0,0 +1,305 @@
[SVG figure, markup not preserved. Title: DataGallery Ecosystem Architecture for DataAgent. Description: Architecture diagram showing DataGallery Core on the left, Data-Studio and Data-Ops on the right, and foundation layers for models, data infrastructure, and hardware at the bottom.]
[Text labels, in file order. Panels: DataGallery Core; Data-Studio & Data-Ops. Data Intelligence Framework: DataAgent SDK. Data Agent Engine: Intent Understanding, Schema Linking, SQL Generation, Execution Validation, Reflection & Repair, Confidence Selector. Data Agent Shell: Privilege Control, Tool Guardrail, Privacy-aware Routing. Semantic Distributed Runtime: Semantic Service SDK, Metadata Ingestion, Ontology Modeling, Metric Registry. Data System Service: Data POSIX, System Sandbox, Memory Storage, Agent UI / Web. Data Studio: Text-to-SQL, Data Analysis, Data Engineering, Business Process. Data Ops: Full-chain Visual Tuning, Benchmark & Eval, Observation. Foundation: Models, Data Infrastructure, Hardware / Chipsets.]
diff --git a/docs/zh/part14/p15_dataagent_semantic_nl2sql_agent.md
b/docs/zh/part14/p15_dataagent_semantic_nl2sql_agent.md
index e5e77e10..4041b284 100644
--- a/docs/zh/part14/p15_dataagent_semantic_nl2sql_agent.md
+++ b/docs/zh/part14/p15_dataagent_semantic_nl2sql_agent.md
@@ -5,20 +5,13 @@
 ## 摘要
 本章以 DataAgent 为项目实战对象,构建一个面向企业结构化数据的语义问数助手。项目目标不是让模型直接“猜 SQL”,而是把自然语言问题、业务语义层、数据库元数据、NL2SQL 子 Agent、结果文件、报告生成和运行审计组织成一条可复用的数据工程链路。
 
-如果按工程顺序阅读,本章对应的是一条完整链路:
+如果按工程顺序阅读,本章对应的是一条从业务问题到可信交付的完整链路,如图 P15-1 所示。
 
-```text
-业务问题
- -> 场景提示词与任务规划
- -> 语义层 schema 召回
- -> NL2SQL 子 Agent
- -> SQL 校验与执行
- -> CSV/SQL 资产落盘
- -> 主 Agent 汇总与报告生成
- -> 运行轨迹与评估验收
-```
+![DataAgent 语义问数助手工程链路](../../images/part14/p15_dataagent_engineering_chain_en.png)
+
+*图 P15-1:DataAgent 语义问数助手工程链路。*
 
-这一结构对应的核心目标,是把企业问数从一次性对话能力,改造成可配置、可审计、可扩展的数据应用能力。
+这条链路从业务问题开始,先由场景提示词和任务规划明确分析目标、指标口径和输出要求,再通过语义层完成 schema 召回,使模型能够获得表、字段、业务描述和关联关系等上下文。随后,主 Agent 将结构化查询任务委托给 NL2SQL 子 Agent,由子 Agent 生成 SQL,并经过校验与执行得到可信结果。执行后的 SQL 和 CSV 不只是临时中间产物,而是进入 workspace 的可复核资产;主 Agent 基于这些资产汇总结果并生成报告。最后,运行轨迹、工具调用状态和验收指标被保留下来,用于审计、回归测试和后续迭代。这一结构对应的核心目标,是把企业问数从一次性对话能力,改造成可配置、可审计、可扩展的数据应用能力。
 
 本章重点围绕四条主线展开:
 
@@ -71,7 +64,7 @@ DataAgent;语义层;NL2SQL;企业问数;Agent 编排
 
 ## 可复现资源说明
 
-复现材料应包括数据来源说明、最小样本、配置文件、运行命令、指标脚本、检查报告和产物目录。正文保留必要片段;完整 notebook、长脚本和大文件作为配套资源独立维护。若项目进一步接入企业指标口径、模型化语义层或数据构建流水线,可以参考 dbt 文档中的模型、测试和文档化机制 (dbt Labs 2026),但本章不把 dbt 作为必需依赖。
+复现材料应包括数据来源说明、最小样本、配置文件、运行命令、指标脚本、检查报告和产物目录。正文保留必要片段;完整 notebook、长脚本和大文件作为配套资源独立维护。若项目进一步接入企业指标口径、模型化语义层或数据构建流水线,可以参考 dbt 文档中的模型、测试和文档化机制(dbt Labs, 2026),但本章不把 dbt 作为必需依赖。
 
 ## 1. 项目背景:企业问数为什么需要 Agent 数据工程
 
@@ -144,17 +137,17 @@ DataAgent 的价值在于把这些环节组织成一个可配置的 Agent 数据
 
 ## 4. 整体架构:从业务问题到可审计数据资产
 
-项目整体架构可以拆成六层。为了便于出版排版,下图将入口、编排、查询、资产和治理关系整理成分层架构。
+本章的语义问数助手不是孤立的 NL2SQL 组件,而是 DataGallery 生态中的一个应用形态。它依赖 DataAgent 完成任务理解、工具编排和执行控制,依赖 Semantic Service 承载元数据、指标口径和语义检索能力,并通过 Data Studio 与 Data Ops 将问数能力进一步连接到业务应用、评测、观测和持续改进环节。图 P15-2 展示了这一项目所处的整体技术关系。
 
-![DataAgent 企业语义问数助手分层架构](../../images/part14/p15_dataagent_semantic_bi_layered_architecture.svg)
+![围绕 DataAgent、Semantic Service、Data Studio、Data Ops 与基础设施展开的 DataGallery 生态架构](../../images/part14/p15_datagallery_architecture_vector.svg)
 
-*图 P15-1:DataAgent 企业语义问数助手分层架构。*
+*图 P15-2:围绕 DataAgent 展开的 DataGallery 生态架构。*
 
-DataAgent 仓库中的整体架构图如下:
+从核心能力看,图中左侧的 DataGallery Core 是语义问数助手的主要承载区。Data Intelligence Framework 提供 DataAgent SDK,并通过 Data Agent Engine 完成意图理解、schema linking、SQL 生成、执行校验、反思修复和置信度选择;Data Agent Shell 则承担权限控制、工具护栏和隐私感知路由等运行边界控制。对于企业问数场景,这些能力共同决定了模型是否能够在受控范围内理解业务问题、选择合适工具,并把自然语言请求转化为可执行的数据操作。
 
-![DataAgent 整体架构图](../../images/part14/p15_dataagent_agent_excalidraw.png)
+从语义和数据支撑看,Semantic Distributed Runtime 将元数据接入、本体建模和指标注册沉淀为可调用的语义服务,使 NL2SQL 不只依赖表名和字段名,而能够利用业务描述、字段含义、指标口径和关联关系完成更稳健的 schema 召回。Data System Service 则提供数据访问、系统沙箱、记忆存储和 Agent UI/Web 等基础能力,使一次问数任务产生的 SQL、CSV、报告和运行轨迹能够进入可复核的工程闭环。
 
-*图 P15-2:DataAgent 仓库整体架构图。*
+从应用和运营看,右侧的 Data Studio 对应 Text-to-SQL、数据分析、数据工程和业务流程等上层场景,Data Ops 对应全链路可视化调优、基准评估和观测能力。底部的 Models、Data Infrastructure 和 Hardware/Chipsets 构成基础层,为模型调用、数据库访问和运行资源提供支撑。因此,本项目的关键链路可以概括为:业务问题进入 DataAgent,DataAgent 通过语义服务理解数据上下文并调用 NL2SQL 能力,结果再通过 workspace、评测和观测机制沉淀为可审计、可回归、可持续改进的数据资产。
 
 ### 4.1 接口层
 
@@ -244,12 +237,12 @@ DataAgent 的运行状态、消息轨迹、工具返回和 workspace 文件为
 
 ### 5.2 安装项目
 
-DataAgent 是 DataGallery 开源生态中的 Agent 数据工程框架,DataGallery 开源入口见 [https://gitcode.com/datagallery](https://gitcode.com/datagallery),DataAgent 项目仓库见 [https://gitcode.com/datagallery/DataAgent](https://gitcode.com/datagallery/DataAgent)。关于 DataGallery 在本书中的定位、复现边界和项目治理方式,可参见[附录G:DataGallery 开源生态简介](../appendix_g_datagallery_note.md)。
+DataAgent 是 DataGallery 开源生态中的 Agent 数据工程框架(DataGallery Contributors, 2026a)。DataGallery 开源入口见 [https://gitcode.com/datagallery](https://gitcode.com/datagallery),DataAgent 项目仓库见 [https://gitcode.com/datagallery/dataagent](https://gitcode.com/datagallery/dataagent)(DataGallery Contributors, 2026b)。关于 DataGallery 在本书中的定位、复现边界和项目治理方式,可参见[附录G:DataGallery 开源生态简介](../appendix_g_datagallery_note.md)。
 
 建议先固定版本,再在仓库根目录安装依赖:
 
 ```bash
-git clone https://gitcode.com/datagallery/DataAgent.git
+git clone https://gitcode.com/datagallery/dataagent.git
-cd DataAgent
+cd dataagent
 git checkout
 python -m venv .venv
@@ -906,5 +899,5 @@ NL2SQL -> CSV -> 图表 -> Markdown 报告 -> 业务交付
 3. Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761.
 4. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
 5. dbt Labs. (2026). dbt Documentation. https://docs.getdbt.com/.
-6. DataGallery Contributors. (2026). DataGallery organization page. https://gitcode.com/datagallery.
-7. DataGallery Contributors. (2026). DataAgent source repository. https://gitcode.com/datagallery/DataAgent.
+6. DataGallery Contributors. (2026a). DataGallery organization page. https://gitcode.com/datagallery.
+7. DataGallery Contributors. (2026b). DataAgent source repository. https://gitcode.com/datagallery/dataagent.