Skip to content

RandyHaddad/EDA_Cosmopedia

Repository files navigation

Cosmopedia-100k Dataset: Comprehensive EDA

🎯 Executive Summary

This comprehensive Exploratory Data Analysis (EDA) of the Cosmopedia-100k dataset focuses on optimizing dataset selection for Mixture of Experts (MoE) training efficiency. Through advanced statistical analysis, topic clustering, and quality assessment, we provide actionable recommendations for selecting the most effective 1,000 samples from the 100k dataset.

📊 Key Findings

Dataset Overview

  • Total Samples: 100,000 synthetic text samples
  • Average Token Length: 790 tokens (std: 277)
  • Content Formats: 14 distinct types
  • Target Audiences: 8 categories
  • Seed Sources: 9 different data sources
  • Topic Clusters: 19 distinct semantic clusters identified

Quality Metrics

  • Reading Level: Graduate level (Flesch Reading Ease: 25.1)
  • Complexity: High academic complexity (Flesch-Kincaid Grade: 14.6)
  • Information Density: Strong technical vocabulary across domains
  • Token Efficiency: Optimal range 600-1200 tokens for MoE specialists

🗂️ File Structure

eda_output/
├── README.md                          # This overview document
├── detailed_findings.md               # Comprehensive analysis results
├── moe_recommendations.md             # Specific MoE training recommendations
├── technical_appendix.md              # Technical details and methodology
├── basic_statistics.txt               # Raw statistical summaries
├── clustering_analysis.txt            # Clustering results
├── moe_optimization.txt               # MoE optimization strategies
├── comprehensive_analysis.png         # Main statistical visualizations
├── advanced_clustering_analysis.png   # Clustering and dimensionality reduction
├── comprehensive_summary_dashboard.png # Executive summary dashboard
├── wordclouds_by_format.png          # Word clouds by content format
├── wordclouds_by_audience.png        # Word clouds by target audience
├── interactive_token_distribution.html # Interactive token analysis
├── interactive_scatter_plot.html      # Interactive format/audience analysis
└── interactive_sunburst.html          # Hierarchical data visualization

🎯 MoE Training Recommendations

Optimal 1000-Sample Selection Strategy: BALANCED APPROACH

Metric Random Quality-based Diversity-based Balanced
Avg Tokens 810 826 1,004 983
Format Diversity 12 types 8 types 12 types 11 types
Cluster Coverage 19/19 19/19 19/19 19/19
Total Tokens 809k 826k 1,004k 983k

Implementation Strategy

  1. Cluster-Balanced Sampling: Ensure representation from all 19 topic clusters
  2. Format Prioritization: 70% educational content, 30% diverse formats
  3. Token Range Optimization: Target 700-1000 tokens per sample
  4. Quality Filtering: Prioritize academic and educational piece formats

📈 Topic Diversity Analysis

The dataset exhibits rich topic diversity across 19 distinct clusters:

Major Topic Areas:

  • Education & Learning (29.1% of samples)
  • Business & Marketing (13.7% of samples)
  • Health & Healthcare (15.7% of samples)
  • Technology & Data (15.8% of samples)
  • Arts & Culture (8.9% of samples)
  • Science & Mathematics (7.6% of samples)
  • Storytelling & Narratives (9.2% of samples)

🔍 Content Format Analysis

Format Count Percentage Avg Tokens Key Characteristics
blogpost 37,927 37.9% 726 General audience, practical focus
textbook_academic_tone 28,261 28.3% 935 Academic rigor, college-level
educational_piece 6,203 6.2% 712 Structured learning content
story_reddit 4,235 4.2% 625 Narrative, conversational
story_children 4,160 4.2% 452 Simple language, engaging

🎓 Target Audience Distribution

Audience Count Percentage Content Characteristics
general 57,297 57.3% Broad accessibility
college_students 32,161 32.2% Academic depth
young_children 5,081 5.1% Simplified language
grade_school_students 3,153 3.2% Age-appropriate complexity

🚀 Next Steps

  1. Implement Balanced Selection: Use the recommended balanced approach for 1000-sample selection
  2. Monitor Training Efficiency: Track MoE specialist utilization and convergence rates
  3. Quality Validation: Validate selection quality through downstream task performance
  4. Iterative Refinement: Adjust selection criteria based on training outcomes

For detailed technical analysis, see detailed_findings.md
For specific MoE recommendations, see moe_recommendations.md
For technical methodology, see technical_appendix.md

About

Comprehensive EDA of Cosmopedia-100k dataset optimized for Mixture-of-Experts (MoE) language models' training efficiency. Features topic clustering, quality assessments. Includes interactive visualizations, and implementation guides for training efficiency improvemen

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors