GenAI China Replication

Firm-year generative AI exposure measures from Chinese listed-firm recruitment data, used in asset-pricing tests.

Status (2026-05-28)

The DeepSeek API is labeling ~3.65M unique job titles (2014–2026) into the PKU 100-class occupation scheme. The job is running on AutoDL, with ~3M titles remaining.
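
Because most titles are still unlabeled, a labeling job of this kind is usually written so it can resume. The following is a minimal sketch of that idea, not the repository's actual script: `chunks`, `label_remaining`, and `fake_label_batch` are hypothetical names, and the real DeepSeek API call is replaced by a stub, since its request format is not described here.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def label_remaining(titles, already_labeled, label_batch, batch_size=100):
    """Label only titles that have no label yet, one batch at a time.

    `label_batch` is a hypothetical callable that takes a list of titles
    and returns a {title: occupation} dict (a stand-in for the API call).
    """
    labels = dict(already_labeled)  # copy so the caller's dict is untouched
    # dict.fromkeys de-duplicates while preserving first-seen order
    pending = [t for t in dict.fromkeys(titles) if t not in labels]
    for chunk in chunks(pending, batch_size):
        labels.update(label_batch(chunk))
    return labels


# Toy run with a stub labeler (hypothetical data, no network access).
calls = []


def fake_label_batch(chunk):
    calls.append(list(chunk))
    return {title: "placeholder_occupation" for title in chunk}


titles = ["engineer", "engineer", "nurse", "teacher"]
done = {"nurse": "health"}
labels = label_remaining(titles, done, fake_label_batch, batch_size=1)
```

Only the two titles without an existing label reach the stub, so an interrupted run can be restarted from whatever labels were already saved.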

Key Results (DeepSeek Labels — 542K titles validated)

| Metric | Value |
| --- | --- |
| Human-AI agreement (1,000 titles) | 94.7% |
| DeepSeek E0 vs. old E0 (53 occupations) | r = 0.9186 |
| DeepSeek Ef vs. canonical Ef (13,599 firm-years) | r = 0.8560 |
| Fama-MacBeth t (DeepSeek) | 1.995 |
| Canonical t | 2.035 |
| R1 [0,10] event t (DeepSeek) | 3.93 |
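
For readers unfamiliar with the Fama-MacBeth statistic reported above, here is a minimal sketch of the textbook two-step procedure. It assumes a single regressor and plain (not Newey-West adjusted) standard errors, which may differ from the specification in `src/canonical/`; the function names and toy numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def cross_sectional_slope(x, y):
    """OLS slope of y on x (with intercept) for one period."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def fama_macbeth_t(panels):
    """panels: list of (exposures, returns) pairs, one pair per period.

    Step 1: estimate a cross-sectional slope in every period.
    Step 2: t = mean(slopes) / (stdev(slopes) / sqrt(number of periods)).
    """
    slopes = [cross_sectional_slope(x, y) for x, y in panels]
    standard_error = stdev(slopes) / math.sqrt(len(slopes))
    return mean(slopes), mean(slopes) / standard_error


# Toy panel: three periods, three firms each (hypothetical numbers).
panels = [
    ([0.1, 0.2, 0.3], [0.01, 0.02, 0.03]),  # period slope 0.1
    ([0.1, 0.2, 0.3], [0.03, 0.06, 0.09]),  # period slope 0.3
    ([0.1, 0.2, 0.3], [0.02, 0.04, 0.06]),  # period slope 0.2
]
avg_slope, t_stat = fama_macbeth_t(panels)
```

In the toy panel the average slope is 0.2 and the period slopes have a sample standard deviation of 0.1, so the t-statistic is 0.2 / (0.1 / √3) ≈ 3.46.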

Important: Why DeepSeek, Not BERT

The BERT classifiers (title-only, title+category, chinese-roberta-wwm-ext) peaked at 77.4% accuracy, well below DeepSeek's 94.7% agreement with human labels. Full experiment details are in `E0/scripts/train_v2.py`; logs are in `E0/logs/`.

Pipeline

`上市公司招聘大数据2014-2026.3_cleaned.csv.gz` (cleaned listed-firm recruitment data) → unique titles → DeepSeek API → merge → update `jd_class2` in task candidates → rebuild E0 (occupation exposure) → rebuild Ef (firm-year exposure) → asset-pricing tests.
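
The middle of this chain (unique titles, label merge, Ef rebuild) can be sketched as below. This is an illustration only: it assumes Ef is the simple posting-weighted mean of E0 within a firm-year, which is not confirmed by this README, and every name and value in it is hypothetical.

```python
from collections import defaultdict


def firm_year_exposure(postings, title_to_occ, occ_e0):
    """Average occupation exposure over a firm-year's labeled postings.

    ASSUMPTION: Ef = mean of E0 across postings; the canonical pipeline
    may use a different weighting.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for firm, year, title in postings:
        occ = title_to_occ.get(title)
        if occ is None or occ not in occ_e0:
            continue  # skip postings without a label or an E0 score
        sums[(firm, year)] += occ_e0[occ]
        counts[(firm, year)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical toy inputs: (firm_id, year, job_title) rows.
postings = [
    ("A", 2023, "data analyst"),
    ("A", 2023, "data analyst"),
    ("A", 2023, "warehouse clerk"),
    ("B", 2023, "warehouse clerk"),
    ("B", 2023, "unlabeled title"),
]
# Unique titles are what would be sent for labeling.
unique_titles = sorted({title for _, _, title in postings})
# Stand-ins for the merged labels and the rebuilt E0 table.
title_to_occ = {"data analyst": "analyst", "warehouse clerk": "logistics"}
occ_e0 = {"analyst": 0.8, "logistics": 0.2}

ef = firm_year_exposure(postings, title_to_occ, occ_e0)
```

Labeling unique titles rather than raw postings is what keeps the API workload at millions of titles instead of the full posting count; the labels are then merged back onto every posting before aggregation.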

Code Areas

| Area | Purpose |
| --- | --- |
| `src/canonical/` | Formal E0/Ef pipeline, asset pricing, event studies |
| `jdclass_mapper/` | Title → occupation mapping (original 6-layer) |
| `E0/scripts/` | DeepSeek labeling, BERT training, E0 rebuild |
| `Ef_factor_asset_pricing/` | Asset pricing diagnostics |
| `scripts/` | Publishing, comparison utilities |

Canonical Data

Canonical outputs are in `data/processed/exposure/`: `occupation_E0.csv` (E0, occupation exposure) and `firm_year_Ef.csv` (Ef, firm-year exposure).
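
A rebuilt Ef can be checked against the canonical file with a Pearson correlation over shared firm-years. The sketch below shows one way to do that with the standard library; the column names `firm_id`, `year`, and `Ef` are assumptions (the actual schema of `firm_year_Ef.csv` is not documented here), and in-memory CSV text stands in for the real files.

```python
import csv
import io
import math


def load_ef(handle):
    """Read {(firm_id, year): Ef} from a CSV with assumed column names."""
    return {
        (row["firm_id"], row["year"]): float(row["Ef"])
        for row in csv.DictReader(handle)
    }


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical stand-ins for the canonical and rebuilt files.
canonical_csv = io.StringIO(
    "firm_id,year,Ef\nA,2022,0.1\nA,2023,0.2\nB,2023,0.3\n"
)
rebuilt_csv = io.StringIO(
    "firm_id,year,Ef\nA,2022,0.1\nA,2023,0.3\nB,2023,0.2\nC,2023,0.9\n"
)

canonical = load_ef(canonical_csv)
rebuilt = load_ef(rebuilt_csv)
# Compare only firm-years present in both versions.
shared = sorted(canonical.keys() & rebuilt.keys())
r = pearson_r([canonical[k] for k in shared], [rebuilt[k] for k in shared])
```

Restricting the comparison to shared keys means firm-years present in only one version (firm `C` in the toy data) do not affect the correlation.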

See GOAL.md for full project tracking.

