SEC-EDGAR GPT-2 124M

A GPT-2 (124M) language model trained from scratch on SEC EDGAR filings (10-K, 10-Q, 8-K, etc.).

Model Details

Property	Value
Architecture	GPT-2 124M (12 layers, 12 heads, 768 hidden)
Parameters	124,475,904
Context Length	1,024 tokens
Tokenizer	GPT-2 BPE (tiktoken)
Training Tokens	~1.55B (1 epoch)
Training Steps	47,000
Validation Loss	2.28
Training Framework	nanoGPT
Training Hardware	NVIDIA RTX 4070 12GB
Training Time	~8 hours
Bias	No (`bias=False`)
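
The parameter figure in the table can be cross-checked with a short count. This is only a sketch: it assumes the standard GPT-2 layout with tied input/output embeddings, the function name `gpt2_param_count` is made up for illustration, and the 50,304 vocabulary size (nanoGPT's padded default when training from scratch) is an assumption that this README does not state.

```python
def gpt2_param_count(n_layer=12, n_embd=768, block_size=1024,
                     vocab_size=50257, bias=True):
    """Count parameters of a GPT-2 style decoder with tied embeddings."""
    b = 1 if bias else 0
    wte = vocab_size * n_embd                      # token embeddings (shared with the LM head)
    wpe = block_size * n_embd                      # learned position embeddings
    ln = n_embd + b * n_embd                       # LayerNorm weight (+ optional bias)
    attn_qkv = n_embd * 3 * n_embd + b * 3 * n_embd
    attn_proj = n_embd * n_embd + b * n_embd
    mlp_fc = n_embd * 4 * n_embd + b * 4 * n_embd
    mlp_proj = 4 * n_embd * n_embd + b * n_embd
    per_layer = 2 * ln + attn_qkv + attn_proj + mlp_fc + mlp_proj
    return wte + wpe + n_layer * per_layer + ln    # final term is the last LayerNorm


# Assumed configurations to compare against the table
padded_with_bias = gpt2_param_count(vocab_size=50304, bias=True)
padded_no_bias = gpt2_param_count(vocab_size=50304, bias=False)
print(f"{padded_with_bias:,}")
print(f"{padded_no_bias:,}")
```

Under these assumptions, the padded vocabulary with bias terms gives exactly 124,475,904, while the same configuration without bias terms gives 124,373,760. The published figure may therefore count bias tensors present in the exported checkpoint, but the README does not confirm this.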

Training Data

The training data consists of SEC EDGAR filings sourced from the SEC-EDGAR corpus on Hugging Face, covering annual reports (10-K), quarterly reports (10-Q), current reports (8-K), and other filing types. The text was tokenized with GPT-2 BPE into ~1.55B tokens across 16 shards.
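
The sharding step can be sketched with the standard library alone. The on-disk format used by this repo is not stated, so everything here is an assumption for illustration: the helper names `write_shards` and `read_shard`, the file naming, and the choice of unsigned 16-bit integers (which suffices because the GPT-2 vocabulary has fewer than 65,536 entries).

```python
import array
import os
import tempfile


def write_shards(token_ids, shard_size, out_dir, prefix="shard"):
    """Split a flat list of token ids into fixed-size uint16 shard files."""
    paths = []
    for start in range(0, len(token_ids), shard_size):
        chunk = token_ids[start:start + shard_size]
        path = os.path.join(out_dir, f"{prefix}_{len(paths):04d}.bin")
        with open(path, "wb") as f:
            array.array("H", chunk).tofile(f)  # "H" = unsigned short
        paths.append(path)
    return paths


def read_shard(path):
    """Load one shard back into a Python list of token ids."""
    ids = array.array("H")
    with open(path, "rb") as f:
        ids.frombytes(f.read())
    return ids.tolist()


# Toy demonstration with 10 fake token ids and 4 tokens per shard
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = write_shards(list(range(10)), shard_size=4, out_dir=tmp)
    print([os.path.basename(p) for p in demo_paths])
    print(read_shard(demo_paths[-1]))
```

In a real pipeline the token ids would come from the tokenizer and each shard would hold tens of millions of tokens; the last shard is simply shorter when the total is not a multiple of the shard size.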

Training Config

Batch size: 4 sequences × 1,024 tokens, gradient accumulation 8 → effective batch 32,768 tokens/step
Optimizer: GPT-3 style (AdamW, lr=6e-4, warmup=2000, cosine decay to 6e-5)
No dropout, no bias terms (`bias=False`)
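
The numbers above can be tied together with a short sketch. The constants are taken from this README; two things are assumptions: that `warmup=2000` counts optimizer steps, and that the cosine decay spans the full 47,000-step run. The `get_lr` helper is a generic linear-warmup plus cosine-decay schedule written for illustration, not the repo's own code.

```python
import math

# Values from the README
MICRO_BATCH = 4          # sequences per forward pass
BLOCK_SIZE = 1024        # tokens per sequence
GRAD_ACCUM = 8           # micro-batches per optimizer step
TOTAL_STEPS = 47_000
MAX_LR = 6e-4
MIN_LR = 6e-5
WARMUP_STEPS = 2_000     # assumption: measured in optimizer steps
DECAY_STEPS = TOTAL_STEPS  # assumption: decay ends with the run

tokens_per_step = MICRO_BATCH * BLOCK_SIZE * GRAD_ACCUM
total_tokens = tokens_per_step * TOTAL_STEPS


def get_lr(step):
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / (WARMUP_STEPS + 1)
    if step > DECAY_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
for s in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(s, f"{get_lr(s):.2e}")
```

The total comes to 1,540,096,000 tokens, which is consistent with the ~1.55B tokens and single epoch reported under Model Details.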

Usage

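from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the model and its GPT-2 BPE tokenizer from the Hugging Face Hub
model = GPT2LMHeadModel.from_pretrained("lzwjava/sec-edgar-gpt")
tokenizer = GPT2Tokenizer.from_pretrained("lzwjava/sec-edgar-gpt")

prompt = "UNITED STATES SECURITIES AND EXCHANGE COMMISSION"
inputs = tokenizer(prompt, return_tensors="pt")  # input_ids and attention_mask

# Sample up to 200 new tokens; GPT-2 has no pad token, so EOS is reused for padding
output = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))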
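
To illustrate what `temperature=0.8` does during sampling, here is a minimal pure-Python sketch. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper; the library applies the same scaling internally to the model's real logits.

```python
import math


def softmax_with_temperature(logits, temperature=0.8):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (1.0, 0.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution, so the most likely token is sampled more often and the output becomes more conservative.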

Limitations

Trained for only 1 epoch: output stays coherent for ~200-500 tokens before falling into repetitive loops
No instruction tuning or RLHF: this is a raw language model
124M parameters is small; don't expect state-of-the-art quality
The GPT-2 tokenizer may not handle all financial notation optimally
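
One rough way to notice when a sample has fallen into a repetitive loop is to measure how many of its n-grams repeat. The sketch below is a generic heuristic, not something the repo provides; the function name, the choice of 4-grams, and any cutoff applied to the score are assumptions.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n=4):
    """Fraction of n-grams that duplicate an n-gram seen elsewhere in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeats / len(ngrams)


looping = [1, 2, 3, 4] * 5     # a short cycle repeated five times
varied = list(range(20))       # no repetition at all
print(repeated_ngram_fraction(looping))
print(repeated_ngram_fraction(varied))
```

The score works on token ids or on whitespace-split words. A value close to 1 indicates a loop, at which point the sample can be truncated or regenerated.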

Training Code

Trained with nanoGPT. The training config is available in the source repo.

Development Notes

Date	Topic	Notes
2026-06-25	10-K Download Summary	SEC EDGAR 10-K filing download process
2026-06-25	Financial Pretraining Corpus	Corpus preparation and tokenization
2026-06-26	GPT-2 on SEC-EDGAR Data	Paper structure and training overview
2026-06-26	Training Loss Recovery	Loss spike at 20k steps, recovery analysis
2026-06-26	Prompt File Setup	Inference prompt configuration
2026-06-26	Model Quality Check	Output quality evaluation
2026-06-26	124M Generation Test	Generation samples across prompts
2026-06-26	124M Generation Review	Detailed review of generated outputs
2026-06-26	124M Upload	Model upload to HuggingFace

Citation

@misc{sec-edgar-gpt-124m,
  author = {Zhiwei Li},
  title = {SEC-EDGAR GPT-2 124M},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/lzwjava/sec-edgar-gpt}
}
