This repository provides the code and data needed to get started with the Splits! paper. It lets users explore sociocultural linguistic differences across demographics and topics using our Lift and Triviality metrics.
| Version | Size | # Posts / Instances | Demographic Labels | Topic Labels |
|---|---|---|---|---|
| 1. All Seed Users | ~115 GB | 350M posts | ✅ | ❌ |
| 2. High-Groupness Users | ~34 GB | 90M posts | ✅ (less noisy) | ❌ |
| 3. Splits! | ~2.6 GB | 3.6M posts | ✅ (less noisy) | ✅ |

See `DETAILS.md` for group-ness thresholds, sampling procedure, and detailed schema.

- All Seed Users (1): Includes all posts made by any user who has posted in a labeled seed subreddit. High-recall, low-precision demographic labels.
- High-Groupness Users (2): Refines (1) using a group-ness metric to select users more likely to belong to a demographic group. Higher precision, decent recall. Group-ness thresholds are in `DETAILS.md`.
- Splits! (3): Builds on (2) by organizing posts into topics using ColBERT retrieval. Useful for studying topic-based differences across groups.

The full demographic subreddit seed sets, self-ID phrases, and anti-self-ID phrases can be found in metadata/demographics.json. All topics and topic keywords can be found in metadata/topics.json.
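
To get a first look at these metadata files, a minimal sketch like the following may help. It deliberately assumes nothing about the internal schema of either file (see `DETAILS.md` for that); `load_metadata` and `summarize` are hypothetical helper names introduced here for illustration and are not part of the repository.

```python
import json
from pathlib import Path


def load_metadata(path):
    """Parse one metadata JSON file and return the decoded object."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(obj, limit=5):
    """Describe the top-level shape of decoded JSON without assuming a schema."""
    if isinstance(obj, dict):
        # For a dict, sample the first few keys in file order.
        return {"type": "dict", "size": len(obj), "sample": list(obj)[:limit]}
    if isinstance(obj, list):
        # For a list, sample the first few elements.
        return {"type": "list", "size": len(obj), "sample": obj[:limit]}
    # Scalars (string, number, bool, null) are reported as a single item.
    return {"type": type(obj).__name__, "size": 1, "sample": [obj]}
```

From the repository root, `summarize(load_metadata("metadata/topics.json"))` would report the top-level container type, its size, and a few sample entries; the same call works for `metadata/demographics.json`.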
We use conda to provision the base Python and Java (OpenJDK) requirements, and uv for fast Python dependency installation.
- Create the Conda environment: `conda create -y -n splits_demo python=3.10 openjdk uv`
- Activate the environment: `conda activate splits_demo`
- Install dependencies using `uv pip`: `uv pip install -r requirements.txt`

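
After activating the environment, a quick sanity check can confirm that the tools provisioned above are visible on `PATH`. This is only a sketch: `missing_tools` is a hypothetical helper, and the default tool names (`python`, `java`, `uv`) are assumptions based on the packages named in the `conda create` command.

```python
import shutil


def missing_tools(tools=("python", "java", "uv")):
    """Return the names from `tools` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in tools if shutil.which(name) is None]
```

An empty list from `missing_tools()` suggests the environment is active and complete; any names returned point to a tool that was not installed or an environment that was not activated.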
Run the data download script to fetch the demo data into `lexica/`:

    ./download_data.sh

To access the full original variants of the data remotely via Hugging Face, you can run the following:

    from datasets import load_dataset
    v1 = load_dataset("ecaplan/splits", "all_seed_user_posts")['train']
    v2 = load_dataset("ecaplan/splits", "high_groupness_user_posts")['train']
    v3 = load_dataset("ecaplan/splits", "high_groupness_by_topic")['train']

To run the Lift and Triviality metrics, you just need to run the `splits_metrics_demo.ipynb` notebook!

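
Because the full variants are large (up to ~115 GB), it may be preferable to stream records rather than download a whole variant before inspecting it. The sketch below uses the `streaming=True` option of `datasets.load_dataset`; `stream_variant` and `peek` are hypothetical helper names, and no column names are assumed.

```python
from itertools import islice


def peek(rows, n=3):
    """Return the first n items from any iterable of rows."""
    return list(islice(rows, n))


def stream_variant(config="high_groupness_by_topic"):
    """Open one dataset variant as a streaming (lazily fetched) dataset."""
    # Imported lazily so that `peek` stays usable without `datasets` installed.
    from datasets import load_dataset

    # With streaming=True, records are fetched on iteration instead of
    # downloading the full variant up front.
    return load_dataset("ecaplan/splits", config, split="train", streaming=True)
```

Calling `peek(stream_variant())` should return the first few records as dictionaries, which is a cheap way to check the available fields before committing to a full download.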
If you use this dataset, please cite the paper:

    @misc{caplan2026splitsflexiblesocioculturallinguistic,
        title={Splits! Flexible Sociocultural Linguistic Investigation at Scale},
        author={Eylon Caplan and Tania Chakraborty and Dan Goldwasser},
        year={2026},
        eprint={2504.04640},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2504.04640},
    }
