12 changes: 6 additions & 6 deletions docs/en/appendix_a_tools_and_frameworks_quick_reference.md
@@ -66,7 +66,7 @@ Object storage stores files but does not automatically provide version semantics
| lakeFS | Branches and commits over object storage | Lakehouse-style data governance and collaboration |
| Delta Lake / Apache Iceberg | Large tabular data governance | Large-scale structured samples and metadata |

For cross-institution dataset construction, public evaluation, and teaching reproduction, a minimal combination is often enough: **Git for scripts and specifications, DVC or an equivalent for data versions, object storage for large files, and release pages for external documentation**.
For cross-institution dataset construction, public evaluation, and teaching reproduction, a minimal combination is often enough: **Git for scripts and specifications, DVC or an equivalent for data versions, object storage for large files, and release pages for external documentation**. This combination is easy to hand off, easy to reproduce in courses, and consistent with the governance language used in Part VIII and Part XII. Concrete data-versioning commands, remote configuration, and pipeline syntax should follow the official DVC documentation (DVC Contributors 2026).

## A.4 Cleaning, Validation, and Training Preparation Tools

@@ -124,7 +124,7 @@ If a project will become an open benchmark or course experiment, preserve annota

### A.5.2 Experiment Tracking Must Bind Data Versions

Tools such as `MLflow` and `Weights & Biases` are often misused by recording only model parameters and metrics while omitting data versions, slice results, and evaluation-script versions. Logs then look rich but cannot explain where improvement came from.
Tools such as `MLflow` and `Weights & Biases` are often misused by recording only model parameters and metrics while omitting data versions, slice results, and evaluation-script versions. Logs then look rich but cannot explain where improvement came from. If MLflow is used as the experiment-tracking entry point, run records, artifact management, and model registry details should follow the official MLflow documentation (MLflow Authors 2026).
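
A minimal sketch of what binding a data version to a run record can look like, using only the Python standard library. The function names (`fingerprint_files`, `build_run_record`) and the record layout are hypothetical illustrations, not the MLflow or Weights & Biases API; in practice the resulting fields would be logged through whichever tracker the team uses.

```python
import hashlib
import json


def fingerprint_files(files: dict[str, bytes]) -> str:
    """Return a SHA-256 fingerprint over file names and contents.

    Names are sorted so the result does not depend on dict order.
    (Hypothetical helper; a tool such as DVC would normally supply this.)
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and content hash
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


def build_run_record(params, metrics, data_files, eval_script_source, slice_metrics):
    """Assemble one run record that carries data and evaluation versions."""
    return {
        "params": params,
        "metrics": metrics,
        # Data version: changes whenever any file name or content changes.
        "data_version": fingerprint_files(data_files),
        # Evaluation-script version: hash of the script source text.
        "eval_script_version": hashlib.sha256(
            eval_script_source.encode("utf-8")
        ).hexdigest(),
        # Per-slice results, kept next to the aggregate metrics.
        "slice_metrics": slice_metrics,
    }


record = build_run_record(
    params={"lr": 0.001},
    metrics={"accuracy": 0.9},
    data_files={"train.jsonl": b"example", "test.jsonl": b"held out"},
    eval_script_source="print('evaluate')",
    slice_metrics={"long_documents": {"accuracy": 0.8}},
)
print(json.dumps(record, indent=2))
```

With a record of this shape, two runs whose metrics differ can be compared field by field: if `data_version` or `eval_script_version` also differs, the improvement cannot be attributed to the model change alone.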

Track at least:

@@ -194,7 +194,7 @@ Without these capabilities, a team may get good final accuracy but still be unab

This is suitable for cross-institution specialized datasets, course reproduction, and medium-scale research projects. It is lightweight and relatively easy to hand off.

If a dataset is organized and distributed through the Hugging Face Datasets ecosystem, the loading script, dataset card, and split configuration should follow the Hugging Face Datasets Documentation.
If a dataset is organized and distributed through the Hugging Face Datasets ecosystem, the loading script, dataset card, and split configuration should follow the official Hugging Face Datasets documentation (Hugging Face 2026).

### A.7.2 Enterprise Data Platform Combination

@@ -286,11 +286,11 @@ Third, for university collaboration, open benchmarks, and teaching reproduction,

## References

Gebru T, Morgenstern J, Vecchione B, Vaughan J W, Wallach H, Daumé III H, Crawford K (2021) Datasheets for Datasets. Communications of the ACM 64(12): 86-92.
Gebru T, Morgenstern J, Vecchione B, Vaughan J W, Wallach H, Daumé III H, Crawford K (2021) Datasheets for Datasets. Communications of the ACM 64(12): 86-92. https://doi.org/10.1145/3458723.

Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji I D, Gebru T (2019) Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 220-229.
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji I D, Gebru T (2019) Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 220-229. https://doi.org/10.1145/3287560.3287596.

Pushkarna M, Zaldivar A, Kjartansson O, Cicconi P, Chen V, Efrat A, Zou Y, Mueller J, Taly A, Ehyaei A, Karkkainen K, Marathe A, Han X, Mittal A, Schuster T, Yarmand M, Sohn H, Dwarakanath N C, McCann B (2022) Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp 1776-1826.
Pushkarna M, Zaldivar A, Kjartansson O, Cicconi P, Chen V, Efrat A, Zou Y, Mueller J, Taly A, Ehyaei A, Karkkainen K, Marathe A, Han X, Mittal A, Schuster T, Yarmand M, Sohn H, Dwarakanath N C, McCann B (2022) Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp 1776-1826. https://doi.org/10.1145/3531146.3533231.

DVC Contributors (2026) Data Version Control Documentation. Available at: https://dvc.org/doc.

6 changes: 3 additions & 3 deletions docs/en/appendix_b_compliance_and_release_checklist.md
@@ -8,13 +8,13 @@ In large-model data engineering, the most dangerous situation is often not that

This appendix therefore does not provide legal advice, medical advice, financial or investment advice, nor does it constitute regulatory approval, ethics review, or release permission. It is a checklist framework better suited to engineering-team execution and traceability. Its goal is to let technical leads, project managers, course owners, and compliance contacts use the same vocabulary and reduce cross-role communication cost.

In scenarios involving law, medicine, finance, minors, cross-border data, sensitive personal information, or industry regulation, readers should rely on their institution's formal policies, the current laws of the relevant jurisdiction, data-provider contracts, ethics-review requirements, and professional compliance opinions. In the mainland China context, cybersecurity, data security, and personal-information protection should be understood in relation to the Cybersecurity Law of the People's Republic of China, the Data Security Law of the People's Republic of China, and the Personal Information Protection Law of the People's Republic of China. The checklists in this appendix can only help teams identify issues that need escalated review in advance; they cannot replace the professional judgment of lawyers, physicians, financial compliance personnel, security leads, or ethics committees.
In scenarios involving law, medicine, finance, minors, cross-border data, sensitive personal information, or industry regulation, readers should rely on their institution's formal policies, the current laws of the relevant jurisdiction, data-provider contracts, ethics-review requirements, and professional compliance opinions. In the mainland China context, cybersecurity, data security, and personal-information protection should be understood in relation to the Cybersecurity Law of the People's Republic of China, the Data Security Law of the People's Republic of China, and the Personal Information Protection Law of the People's Republic of China (National People's Congress of the People's Republic of China 2016, 2021a, 2021b). The checklists in this appendix can only help teams identify issues that need escalated review in advance; they cannot replace the professional judgment of lawyers, physicians, financial compliance personnel, security leads, or ethics committees.

## B.2 Why Compliance Checks Must Shift Left

If compliance is checked only before release, teams usually encounter three expensive forms of rework. First, **source rework**: the data has already been collected and cleaned before the team discovers that the original authorization does not allow model training or redistribution. Second, **annotation rework**: annotation is complete before the team realizes that sensitive fields were not properly anonymized. Third, **release rework**: a benchmark is ready to publish before the team discovers unstable train/test boundaries or conflicts between external licenses and leaderboard rules.

A more stable approach is to split compliance into four gates:
A more stable approach is to split compliance into four gates. This split can also align with risk-management frameworks: the NIST AI RMF emphasizes organizing AI risk through governance, mapping, measurement, and management, while the EU Artificial Intelligence Act further reflects a regulatory approach that assigns obligations and boundaries by risk level (National Institute of Standards and Technology 2023; European Parliament and Council of the European Union 2024).

1. Source and authorization checks before data ingestion.
2. Sensitivity and delegation-boundary checks before annotation and processing.
@@ -321,4 +321,4 @@ National Institute of Standards and Technology (2023) AI Risk Management Framewo

European Parliament and Council of the European Union (2024) Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Available at: https://eur-lex.europa.eu/eli/reg/2024/1689/oj.

Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji I D, Gebru T (2019) Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 220-229.
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji I D, Gebru T (2019) Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 220-229. https://doi.org/10.1145/3287560.3287596.
10 changes: 5 additions & 5 deletions docs/en/appendix_c_cost_estimation_and_resource_templates.md
@@ -114,7 +114,7 @@ The safest budget is not "the API bill for generating 100,000 samples." It is th

### C.6.1 Training Estimates Should Not Look Only at GPU Count

Training budgets are often simplified to "how many cards for how many days." The real cost also depends on effective throughput, probability of failed reruns, and number of tuning rounds.
Training budgets are often simplified to "how many cards for how many days." The real cost also depends on effective throughput, probability of failed reruns, and number of tuning rounds. Large-scale training systems such as Megatron-LM show that model parallelism, data parallelism, and pipeline parallelism can significantly affect throughput, memory footprint, and failure-recovery cost (Narayanan et al. 2021).
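
The gap between "cards times days" and a realistic budget can be sketched as a small calculation. This is an illustrative model under stated assumptions, not a standard formula: `utilization` is the fraction of ideal throughput actually achieved, each tuning round is assumed to fail and be rerun once in full with probability `rerun_probability`, and all names and numbers are hypothetical.

```python
def estimate_training_cost(
    gpu_count: int,
    planned_hours: float,
    hourly_rate: float,
    utilization: float,
    rerun_probability: float,
    tuning_rounds: int,
) -> dict[str, float]:
    """Compare a naive GPU budget with one adjusted for real-world factors.

    Assumptions (illustrative only):
    - planned_hours is the wall-clock time at ideal throughput;
    - a failed round is rerun once in full, so the expected multiplier
      per round is (1 + rerun_probability).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if not 0 <= rerun_probability <= 1:
        raise ValueError("rerun_probability must be in [0, 1]")
    if tuning_rounds < 1:
        raise ValueError("tuning_rounds must be at least 1")

    # Naive estimate: one round, ideal throughput, no failures.
    naive_cost = gpu_count * planned_hours * hourly_rate

    # Lower effective throughput stretches the wall-clock time.
    effective_hours = planned_hours / utilization
    # Expected GPU-hours per round, including the chance of a full rerun.
    gpu_hours_per_round = gpu_count * effective_hours * (1 + rerun_probability)
    adjusted_gpu_hours = gpu_hours_per_round * tuning_rounds

    return {
        "naive_cost": naive_cost,
        "adjusted_gpu_hours": adjusted_gpu_hours,
        "adjusted_cost": adjusted_gpu_hours * hourly_rate,
    }


estimate = estimate_training_cost(
    gpu_count=8,
    planned_hours=100,
    hourly_rate=2.0,
    utilization=0.5,
    rerun_probability=0.25,
    tuning_rounds=3,
)
print(estimate)
```

With these hypothetical inputs the adjusted cost is several times the naive figure, which is why throughput, rerun, and tuning assumptions belong in the budget sheet as explicit line items.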

| Item | Unit | Quantity | Unit Cost / Hours | Subtotal | Notes |
| :-- | :-- | :-- | :-- | :-- | :-- |
@@ -128,7 +128,7 @@ If a team does not reserve resources for failed reruns, the budget usually becom

### C.6.2 Split Inference Cost by Scenario

Inference cost should be split into at least three scenarios.
Inference cost should be split into at least three scenarios. For long-context and high-concurrency serving, memory-management mechanisms such as PagedAttention have become important references for serving-cost estimation, and vLLM's engineering documentation provides a practical entry point for deployment and tuning (Kwon et al. 2023; vLLM Project 2026).

| Scenario | Characteristics | Estimation Focus |
| :-- | :-- | :-- |
@@ -186,7 +186,7 @@ Text projects can sometimes survive rough disk estimates. Document, image, audio
| Release images | External release and course reproduction versions |
| Archive layer | Cold storage and long-term preservation |

Without this layering, teams often discover late that training was not the expensive part; permanently retaining every intermediate artifact was.
Without this layering, teams often discover late that training was not the expensive part; permanently retaining every intermediate artifact was. If Kubernetes is used to host training, evaluation, or teaching environments, resource quotas, storage volumes, namespaces, and job lifecycle should also be included in the budget sheet, with resource objects and scheduling semantics following the official Kubernetes documentation (Kubernetes Authors 2026).

### C.8.2 Archival Strategy Determines Maintainability Over the Next Three Years

@@ -315,9 +315,9 @@ Third, mature cost management is not only about saving money. It makes the relat

Patterson D, Gonzalez J, Le Q, Liang C, Munguia L, Rothchild D, So D, Texier M, Dean J (2021) Carbon Emissions and Large Neural Network Training. arXiv preprint arXiv:2104.10350.

Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Catanzaro B (2021) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Catanzaro B (2021) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. arXiv:2104.04473.

Kwon W, Li Z, Zhuang S, Sheng Y, Zheng L, Yu C H, Gonzalez J E, Zhang H, Stoica I (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. In: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pp 611-626.
Kwon W, Li Z, Zhuang S, Sheng Y, Zheng L, Yu C H, Gonzalez J E, Zhang H, Stoica I (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. In: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pp 611-626. https://doi.org/10.1145/3600006.3613165.

Kubernetes Authors (2026) Kubernetes Documentation. Available at: https://kubernetes.io/docs/.

14 changes: 11 additions & 3 deletions docs/en/appendix_d_paper_to_implementation_guide.md
@@ -393,12 +393,20 @@ These materials let later readers know not only what was built, but why this des

## References

Kapoor S, Narayanan A (2023) Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9):100804.
Kapoor S, Narayanan A (2023) Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9):100804. https://doi.org/10.1016/j.patter.2023.100804.

Kreuzberger D, Kuhl N, Hirschl S (2023) Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 11:31866-31879.
Gebru T, Morgenstern J, Vecchione B, Vaughan J W, Wallach H, Daumé H, Crawford K (2021) Datasheets for Datasets. Communications of the ACM 64(12):86-92. https://doi.org/10.1145/3458723.

Kreuzberger D, Kuhl N, Hirschl S (2023) Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 11:31866-31879. arXiv:2205.02302.

Longpre S, Mahari R, Lee A, et al. (2023) The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing and Attribution in AI. arXiv preprint arXiv:2310.16787.

Mazumder M, Banbury C, Yao X, et al. (2023) DataPerf: Benchmarks for Data-Centric AI Development. In: Advances in Neural Information Processing Systems 36, Datasets and Benchmarks Track.
Mazumder M, Banbury C, Yao X, et al. (2023) DataPerf: Benchmarks for Data-Centric AI Development. In: Advances in Neural Information Processing Systems 36, Datasets and Benchmarks Track. https://doi.org/10.52202/075280-0235.

Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji I D, Gebru T (2019) Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 220-229. https://doi.org/10.1145/3287560.3287596.

Pushkarna M, Zaldivar A, Kjartansson O (2022) Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp 1776-1826. https://doi.org/10.1145/3531146.3533231.

Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo J-F, Dennison D (2015) Hidden Technical Debt in Machine Learning Systems. In: Advances in Neural Information Processing Systems 28.

Zha D, Bhat Z P, Lai K-H, Yang F, Jiang Z, Zhong S, Hu X (2023) Data-centric Artificial Intelligence: A Survey. arXiv preprint arXiv:2303.10158.
6 changes: 4 additions & 2 deletions docs/en/appendix_e_common_bug_debugging_manual.md
@@ -446,10 +446,12 @@ If these three actions continue, this manual becomes part of daily engineering r

Blecher L, Cucurull G, Scialom T, Stojnic R (2023) Nougat: Neural Optical Understanding for Academic Documents. arXiv preprint arXiv:2308.13418.

Breck E, Cai S, Nielsen E, Salib M, Sculley D (2017) The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. In: IEEE International Conference on Big Data, pp 1123-1132.

Pfitzmann B, Auer C, Dolfi M, Nassar A S, Staar P (2022) DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp 3743-3751.

Chen D, Huang Y, Ma Z, Chen H, Pan X, Ge C, Gao D, Xie Y, Liu Z, Gao J, Li Y, Ding B, Zhou J (2024) Data-Juicer: A One-Stop Data Processing System for Large Language Models. In: Companion of the 2024 International Conference on Management of Data, pp 120-134.
Chen D, Huang Y, Ma Z, Chen H, Pan X, Ge C, Gao D, Xie Y, Liu Z, Gao J, Li Y, Ding B, Zhou J (2024) Data-Juicer: A One-Stop Data Processing System for Large Language Models. In: Companion of the 2024 International Conference on Management of Data, pp 120-134. https://doi.org/10.1145/3626246.3653385.

Chen Y, Shetty M, Somashekar G, Ma M, Simmhan Y, Mace J, Bansal C, Wang R, Rajmohan S (2025) AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. arXiv preprint arXiv:2501.06706.

Kapoor S, Narayanan A (2023) Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9):100804.
Kapoor S, Narayanan A (2023) Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9):100804. https://doi.org/10.1016/j.patter.2023.100804.
12 changes: 9 additions & 3 deletions docs/en/appendix_f_terminology_and_chinese_english_mapping.md
@@ -404,8 +404,14 @@ This is especially risky around privacy, compliance, and release boundaries. Ter

Bommasani R, Klyman K, Zhang D, Liang P (2023) The Foundation Model Transparency Index. arXiv preprint arXiv:2310.12941.

Liang P, Bommasani R, Lee T, et al. (2023) Holistic Evaluation of Language Models. Transactions on Machine Learning Research.
Gebru T, Morgenstern J, Vecchione B, Vaughan J W, Wallach H, Daumé H, Crawford K (2021) Datasheets for Datasets. Communications of the ACM 64(12):86-92. https://doi.org/10.1145/3458723.

Wang B, Chen W, Pei H, et al. (2023) DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In: Advances in Neural Information Processing Systems 36.
Liang P, Bommasani R, Lee T, et al. (2023) Holistic Evaluation of Language Models. Transactions on Machine Learning Research. arXiv:2211.09110.

Weidinger L, Uesato J, Rauh M, Griffin C, Huang P-S, Mellor J, Glaese A, Cheng M, Balle B, Kasirzadeh A, Kenton Z, Brown S, Hawkins W, Stepleton T, Birhane A, Haas J, Rimell L, Hendricks L A, Isaac W, Legassick S, Irving G, Gabriel I (2022) Taxonomy of Risks posed by Language Models. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp 214-229.
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji I D, Gebru T (2019) Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 220-229. https://doi.org/10.1145/3287560.3287596.

Pushkarna M, Zaldivar A, Kjartansson O (2022) Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp 1776-1826. https://doi.org/10.1145/3531146.3533231.

Wang B, Chen W, Pei H, et al. (2023) DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In: Advances in Neural Information Processing Systems 36. https://doi.org/10.52202/075280-1361.

Weidinger L, Uesato J, Rauh M, Griffin C, Huang P-S, Mellor J, Glaese A, Cheng M, Balle B, Kasirzadeh A, Kenton Z, Brown S, Hawkins W, Stepleton T, Birhane A, Haas J, Rimell L, Hendricks L A, Isaac W, Legassick S, Irving G, Gabriel I (2022) Taxonomy of Risks posed by Language Models. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp 214-229. https://doi.org/10.1145/3531146.3533088.