Leo984357/genai_china_replication

GenAI China Replication

Firm-year generative AI exposure measures from Chinese listed-firm recruitment data, used in asset-pricing tests.

Status (2026-05-28)

The DeepSeek API is labeling ~3.65M unique job titles (2014–2026) into PKU 100-class occupations. The job is running on AutoDL, with ~3M titles remaining.

Key Results (DeepSeek Labels — 542K titles validated)

| Metric | Value |
| --- | --- |
| Human-AI agreement (1,000 titles) | 94.7% |
| DeepSeek E0 vs. old E0 (53 occupations) | r = 0.9186 |
| DeepSeek Ef vs. canonical Ef (13,599 firm-years) | r = 0.8560 |
| Fama-MacBeth t (DeepSeek) | 1.995 |
| Fama-MacBeth t (canonical) | 2.035 |
| R1 [0,10] event t (DeepSeek) | 3.93 |
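
For readers unfamiliar with the Fama-MacBeth t-statistics reported above, the following is a minimal sketch of the textbook two-pass procedure: estimate a cross-sectional slope of returns on exposure in each period, then test whether the time-series mean of those slopes differs from zero. It is an illustration only. The function names, the toy `panel`, and the univariate specification without controls or Newey-West adjustment are all assumptions, and the repository's actual tests in `src/canonical/` may differ.

```python
from math import sqrt
from statistics import mean, stdev


def ols_slope(x, y):
    """Slope from a univariate OLS regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def fama_macbeth_t(panel):
    """Return (mean slope, t-statistic) over per-period cross-sections.

    `panel` is a list of (exposures, returns) pairs, one pair per period.
    Assumption: plain standard errors, no controls, no Newey-West lags.
    """
    slopes = [ols_slope(x, y) for x, y in panel]
    avg = mean(slopes)
    t_stat = avg / (stdev(slopes) / sqrt(len(slopes)))
    return avg, t_stat


# Toy data (hypothetical): three periods whose true slopes are 1, 2, and 3.
panel = [
    ([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]),
    ([0.0, 1.0, 2.0], [0.0, 2.0, 4.0]),
    ([0.0, 1.0, 2.0], [0.0, 3.0, 6.0]),
]
avg_slope, t_stat = fama_macbeth_t(panel)
```

In this toy panel the mean slope is 2 and the t-statistic is 2·√3, since the three slopes have a sample standard deviation of 1.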

Important: Why DeepSeek, Not BERT

BERT classifiers (title-only, title+category, chinese-roberta-wwm-ext) peaked at 77.4% accuracy, well below DeepSeek's 94.7% human-AI agreement. Full experiment details are in E0/scripts/train_v2.py, with logs in E0/logs/.
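
As a rough illustration of how an accuracy or agreement figure like those above can be computed, the sketch below compares model labels against human labels for the same titles. The function name and the sample labels are hypothetical and are not taken from the repository's scripts.

```python
def agreement_rate(human, model):
    """Share of items where the model label equals the human label."""
    if len(human) != len(model):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("label lists must not be empty")
    matches = sum(h == m for h, m in zip(human, model))
    return matches / len(human)


# Hypothetical labels for four job titles; the last one disagrees.
human_labels = ["software", "accounting", "sales", "sales"]
model_labels = ["software", "accounting", "sales", "marketing"]
rate = agreement_rate(human_labels, model_labels)
print(f"{rate:.1%}")  # prints 75.0%
```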

Pipeline

  • 上市公司招聘大数据2014-2026.3_cleaned.csv.gz → unique titles → DeepSeek API → merge → update jd_class2 in task candidates → rebuild E0 (occupation exposure) → rebuild Ef (firm-year exposure) → asset-pricing tests.
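
The data flow above can be sketched in a few lines of standard-library Python. This is an assumption-laden illustration, not the repository's code: the field names `firm`, `year`, and `title`, the toy label and E0 values, and the use of a posting-weighted mean to aggregate E0 into Ef are all hypothetical. Only `jd_class2` comes from the pipeline description.

```python
from collections import defaultdict


def unique_titles(postings):
    """Distinct job titles in first-seen order, so each is labeled once."""
    seen = {}
    for row in postings:
        seen.setdefault(row["title"].strip(), None)
    return list(seen)


def merge_labels(postings, title_to_class):
    """Attach jd_class2 to every posting; unlabeled titles get None."""
    return [
        {**row, "jd_class2": title_to_class.get(row["title"].strip())}
        for row in postings
    ]


def firm_year_exposure(labeled, occupation_e0):
    """Assumed aggregation: mean E0 over a firm-year's labeled postings."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in labeled:
        e0 = occupation_e0.get(row["jd_class2"])
        if e0 is None:  # skip postings without a usable occupation
            continue
        key = (row["firm"], row["year"])
        sums[key] += e0
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy inputs (hypothetical values throughout).
postings = [
    {"firm": "A", "year": 2023, "title": "Algorithm Engineer"},
    {"firm": "A", "year": 2023, "title": "Accountant"},
    {"firm": "B", "year": 2023, "title": "Algorithm Engineer "},
]
titles = unique_titles(postings)  # what would be sent to the labeling API
title_to_class = {"Algorithm Engineer": "software", "Accountant": "accounting"}
occupation_e0 = {"software": 0.75, "accounting": 0.25}
ef = firm_year_exposure(merge_labels(postings, title_to_class), occupation_e0)
```

Deduplicating titles before labeling is what keeps the API workload at the number of unique titles rather than the number of postings.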

Code Areas

| Area | Purpose |
| --- | --- |
| src/canonical/ | Formal E0/Ef pipeline, asset pricing, event studies |
| jdclass_mapper/ | Title → occupation mapping (original 6-layer) |
| E0/scripts/ | DeepSeek labeling, BERT training, E0 rebuild |
| Ef_factor_asset_pricing/ | Asset pricing diagnostics |
| scripts/ | Publishing, comparison utilities |

Canonical Data

data/processed/exposure/: occupation_E0.csv (E0), firm_year_Ef.csv (Ef).

See GOAL.md for full project tracking.

About

Firm-year GenAI exposure measures from 7.5M Chinese listed-firm job postings (2014-2024)
