DETAILS.md

This document contains additional information about the Splits! dataset, including:

  • Group-ness thresholds
  • Topic structure
  • Field definitions

🧠 Group-ness Thresholds

We use a group-ness metric to identify users who are more likely to authentically belong to a demographic group.
Each threshold is chosen by comparing self-identification and anti-self-identification rates and selecting the point where the two diverge significantly.

Percentile Cutoffs:

| Demographic Group | Threshold (Percentile) |
| --- | --- |
| Teacher | 75 |
| Catholic | 75 |
| Black | 75 |
| Construction Worker | 90 |
| Jewish | 90 |
| Hindu/Jain/Sikh | 80 |
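
As a rough illustration of how the cutoffs listed above might be applied, the sketch below filters posts by the `metric_percentile` field (Version 2 only). It rests on several assumptions that this document does not confirm: that percentiles are on a 0 to 100 scale, that the cutoff is inclusive, and that the `demographic` strings match the group names exactly as written here. The `passes_threshold` helper and the sample records are hypothetical.

```python
# Percentile cutoffs copied from the list above.
# Assumption: the dataset's demographic labels use exactly these strings.
GROUPNESS_THRESHOLDS = {
    "Teacher": 75,
    "Catholic": 75,
    "Black": 75,
    "Construction Worker": 90,
    "Jewish": 90,
    "Hindu/Jain/Sikh": 80,
}


def passes_threshold(post: dict) -> bool:
    """Return True if the post's author meets the cutoff for their group.

    Assumes an inclusive cutoff (>=) on a 0-100 percentile scale. Posts
    with an unknown demographic or a missing percentile are rejected.
    """
    cutoff = GROUPNESS_THRESHOLDS.get(post.get("demographic"))
    percentile = post.get("metric_percentile")
    if cutoff is None or percentile is None:
        return False
    return percentile >= cutoff


# Hypothetical records, for illustration only.
posts = [
    {"id": "a1", "demographic": "Teacher", "metric_percentile": 80.0},
    {"id": "b2", "demographic": "Jewish", "metric_percentile": 80.0},
    {"id": "c3", "demographic": "Hindu/Jain/Sikh", "metric_percentile": 80.0},
]
kept = [p["id"] for p in posts if passes_threshold(p)]
print(kept)
```

The same percentile passes for one group and fails for another, which is why the cutoff is looked up per demographic rather than applied globally.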

📚 Topic Structure (Version 3)

  • 10 neutral topic categories
  • 20 specific topics per category (200 total)
  • Topics selected using BM25 retrieval based on manually defined keyword prompts

Each post in Version 3 is retrieved for a specific neutral topic. Topic categories, specific topics, and retrieval keywords are listed in the main paper and included in this repo under `metadata/demographics.json` and `metadata/neutral_topics.json`.
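
To make the retrieval step more concrete, here is a minimal, self-contained sketch of Okapi BM25 scoring. It is not the implementation used to build the dataset: the tokenization (whitespace splitting), the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the sample keywords and posts are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are common default values, not values taken from the dataset.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical topic keywords and posts, whitespace-tokenized.
keywords = ["coffee", "espresso"]
posts = [
    "i drink coffee every morning".split(),
    "my garden needs more rain".split(),
    "espresso beats drip coffee".split(),
]
scores = bm25_scores(keywords, posts)
ranked = sorted(range(len(posts)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Posts that match more of the keywords, and match rarer keywords, rank higher; a post with no keyword overlap scores zero.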


🧾 Field Descriptions

Version 1 & 2

| Field | Description |
| --- | --- |
| `id` | Unique identifier for the post |
| `user` | Reddit username of the author |
| `timestamp` | Unix timestamp (in milliseconds) of when the post was created |
| `text` | The body of the post |
| `demographic` | The demographic label assigned to the post |
| `subreddit` | The subreddit where the post appeared |
| `metric_percentile` (v2 only) | Percentile for group-ness of the user within their demographic |
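
The sketch below shows one way to read a Version 1 or 2 record, assuming records are available as Python dictionaries keyed by the field names above. The record values and the `created_at` helper are hypothetical; the file format and loading step are not covered here.

```python
from datetime import datetime, timezone

# Hypothetical Version 2 record; all values are made up for illustration.
record = {
    "id": "abc123",
    "user": "example_user",
    "timestamp": 1609459200000,  # milliseconds since the Unix epoch
    "text": "Example post body.",
    "demographic": "Teacher",
    "subreddit": "example_subreddit",
    "metric_percentile": 82.5,  # present in Version 2 only
}


def created_at(rec: dict) -> datetime:
    """Convert the millisecond timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(rec["timestamp"] / 1000, tz=timezone.utc)


# Version 1 records lack metric_percentile, so read it defensively.
percentile = record.get("metric_percentile")
print(created_at(record).isoformat(), percentile)
```

Dividing by 1000 matters: passing the raw millisecond value to `datetime.fromtimestamp` would be interpreted as seconds and yield a date far in the future or raise an error.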

Version 3 (Splits!)

| Field | Description |
| --- | --- |
| `id` | Unique identifier for the post |
| `description` | The specific neutral topic the post was retrieved for |
| `demographic` | The demographic group assigned to the post |
| `content` | The text of the post |
| `metadata` | A dictionary containing post metadata: `timestamp` (when the post was created); `score` (Reddit score, upvotes - downvotes); `subreddit` (the subreddit where the post appeared); `user` (the username of the one who posted) |
| `score` | BM25 similarity score between the post and the topic keywords |
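
Version 3 records carry two different values named `score`: the top-level one is the BM25 similarity, while `metadata["score"]` is the Reddit score. The following sketch, which assumes records are plain Python dictionaries with the fields above, keeps the two apart and groups posts by topic and demographic. The sample records, the `group_by_topic_and_demographic` helper, and the integer used for `timestamp` (whose format this document does not specify) are hypothetical.

```python
from collections import defaultdict

# Hypothetical Version 3 records; all values are made up for illustration.
records = [
    {"id": "p1", "description": "coffee", "demographic": "Teacher",
     "content": "Example text one.", "score": 12.4,
     "metadata": {"timestamp": 0, "score": 57,
                  "subreddit": "example_a", "user": "user_1"}},
    {"id": "p2", "description": "coffee", "demographic": "Teacher",
     "content": "Example text two.", "score": 9.1,
     "metadata": {"timestamp": 0, "score": 3,
                  "subreddit": "example_b", "user": "user_2"}},
    {"id": "p3", "description": "coffee", "demographic": "Catholic",
     "content": "Example text three.", "score": 10.0,
     "metadata": {"timestamp": 0, "score": 21,
                  "subreddit": "example_c", "user": "user_3"}},
]


def group_by_topic_and_demographic(recs):
    """Bucket records by (description, demographic) for side-by-side comparison."""
    groups = defaultdict(list)
    for rec in recs:
        groups[(rec["description"], rec["demographic"])].append(rec)
    return groups


groups = group_by_topic_and_demographic(records)
for (topic, demographic), recs in groups.items():
    bm25 = [r["score"] for r in recs]                 # top-level: BM25 similarity
    reddit = [r["metadata"]["score"] for r in recs]   # nested: Reddit score
    print(topic, demographic, len(recs), max(bm25), sum(reddit))
```

Grouping on the pair of fields lines up posts about the same neutral topic across demographic groups, which matches the topic-based comparison use case for Version 3.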

🧭 Suggested Use Cases

  • Version 1: Broad demographic analysis, exploratory studies
  • Version 2: Precision-sensitive analysis of group language
  • Version 3: Topic-based and group-based discourse comparison

📬 Questions?

Feel free to open an issue or reach out via GitHub.