This document contains additional information about the Splits! dataset, including:
- Group-ness thresholds
- Topic structure
- Field definitions

We use a group-ness metric to identify users who are more likely to authentically belong to a demographic group.
Thresholds are determined by comparing self-identification and anti-self-identification rates and selecting the percentile at which the two rates diverge significantly.

Percentile Cutoffs:

| Demographic Group | Threshold (Percentile) |
|---|---|
| Teacher | 75 |
| Catholic | 75 |
| Black | 75 |
| Construction Worker | 90 |
| Jewish | 90 |
| Hindu/Jain/Sikh | 80 |
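
As an illustration of how these cutoffs might be applied, the sketch below filters posts by the group-ness percentile of their author. It is not the authors' filtering code. It assumes that each post is a dictionary with `demographic` and `metric_percentile` keys, that the demographic labels in the data match the group names in the table exactly, that percentiles are on a 0 to 100 scale, and that a user exactly at the cutoff is kept. The helper name `passes_threshold` is hypothetical.

```python
# Percentile cutoffs copied from the table above.
# Assumption: label strings in the data match these names exactly.
THRESHOLDS = {
    "Teacher": 75,
    "Catholic": 75,
    "Black": 75,
    "Construction Worker": 90,
    "Jewish": 90,
    "Hindu/Jain/Sikh": 80,
}


def passes_threshold(post):
    """Return True if the post's author meets the cutoff for its demographic.

    Groups without a listed cutoff are rejected rather than guessed at.
    """
    cutoff = THRESHOLDS.get(post["demographic"])
    if cutoff is None:
        return False
    # Assumption: the cutoff is inclusive (>=).
    return post["metric_percentile"] >= cutoff


# Hypothetical records, for illustration only.
posts = [
    {"demographic": "Jewish", "metric_percentile": 92},
    {"demographic": "Jewish", "metric_percentile": 80},
]
kept = [p for p in posts if passes_threshold(p)]
```

With the two hypothetical records, only the first is kept, because the cutoff for the Jewish group is the 90th percentile.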

Topic structure:

- 10 neutral topic categories
- 20 specific topics per category (200 total)
- Posts for each topic retrieved with BM25, using manually defined keyword prompts

Each post in version 3 is retrieved for a specific neutral topic. Topic categories, specific topics, and retrieval keywords are listed in the main paper and included in this repo under metadata/demographics.json and metadata/neutral_topics.json.
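
For readers unfamiliar with BM25, the sketch below scores a handful of documents against a list of keywords using the standard Okapi BM25 formula. It is only meant to show what a BM25 similarity score is. The tokenization (lowercased whitespace split), the parameter values `k1=1.5` and `b=0.75`, and the function name `bm25_scores` are assumptions, and the retrieval pipeline actually used for this dataset may differ in all of them.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document against the query terms with Okapi BM25.

    Assumes `docs` is a non-empty list containing at least one non-empty string.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturation with document length normalization.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical keywords and posts, for illustration only.
keywords = ["cooking", "recipes"]
example_docs = [
    "i love cooking pasta at home",
    "the game last night was great",
    "cooking cooking recipes",
]
example_scores = bm25_scores(keywords, example_docs)
```

In this toy example the second document shares no terms with the keywords and scores zero, while the third, which is short and matches both keywords, scores highest.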

Field definitions (Versions 1 and 2):

| Field | Description |
|---|---|
| id | Unique identifier for the post |
| user | Reddit username of the author |
| timestamp | Unix timestamp (in milliseconds) of when the post was created |
| text | The body of the post |
| demographic | The demographic label assigned to the post |
| subreddit | The subreddit where the post appeared |
| metric_percentile (v2 only) | Group-ness percentile of the user within their demographic |
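
Because `timestamp` is stored in milliseconds, it has to be divided by 1000 before being passed to Python's `datetime.fromtimestamp`, which expects seconds. The following is a minimal sketch, assuming the field holds a plain integer or float; the helper name `post_datetime` is hypothetical.

```python
from datetime import datetime, timezone


def post_datetime(timestamp_ms):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


# Hypothetical timestamp value, for illustration only.
created = post_datetime(1700000000000)
print(created.isoformat())  # 2023-11-14T22:13:20+00:00
```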

Field definitions (Version 3):

| Field | Description |
|---|---|
| id | Unique identifier for the post |
| description | The specific neutral topic the post was retrieved for |
| demographic | The demographic group assigned to the post |
| content | The text of the post |
| metadata | A dictionary containing post metadata: timestamp (when the post was created), score (Reddit score, upvotes minus downvotes), subreddit (the subreddit where the post appeared), user (the username of the author) |
| score | BM25 similarity score between the post and the topic keywords |
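
Note that a topic-retrieved record carries two different scores: the top-level `score` is the BM25 similarity, while `metadata["score"]` is the Reddit score. The sketch below groups such records by topic and keeps the best BM25 matches for each. It assumes records are already loaded as a list of dictionaries with the fields listed above; the function name `top_posts_per_topic` and the sample records are hypothetical.

```python
from collections import defaultdict


def top_posts_per_topic(posts, k=3):
    """Group posts by topic and keep the k highest BM25-scored posts in each."""
    by_topic = defaultdict(list)
    for post in posts:
        by_topic[post["description"]].append(post)
    return {
        # Sort on the top-level BM25 score, not on metadata["score"].
        topic: sorted(group, key=lambda p: p["score"], reverse=True)[:k]
        for topic, group in by_topic.items()
    }


# Hypothetical records, for illustration only.
sample = [
    {"id": "a", "description": "cooking", "score": 12.5, "metadata": {"score": 3}},
    {"id": "b", "description": "cooking", "score": 20.1, "metadata": {"score": 150}},
    {"id": "c", "description": "gardening", "score": 7.0, "metadata": {"score": 1}},
]
best = top_posts_per_topic(sample, k=1)
```

Here `best["cooking"]` contains only post `b`, the closer BM25 match of the two cooking posts.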

Recommended use by version:

- Version 1: Broad demographic analysis, exploratory studies
- Version 2: Precision-sensitive analysis of group language
- Version 3: Topic-based and group-based discourse comparison

For questions, feel free to open an issue or reach out via GitHub.