This Exploratory Data Analysis (EDA) of the Cosmopedia-100k dataset focuses on selecting data for efficient Mixture of Experts (MoE) training. Using statistical analysis, topic clustering, and quality assessment, it recommends how to select the most effective 1,000 samples from the 100k dataset.
- Total Samples: 100,000 synthetic text samples
- Average Token Length: 790 tokens (std: 277)
- Content Formats: 14 distinct types
- Target Audiences: 8 categories
- Seed Sources: 9 different data sources
- Topic Clusters: 19 distinct semantic clusters identified
- Reading Level: Graduate level (Flesch Reading Ease: 25.1)
- Complexity: High academic complexity (Flesch-Kincaid Grade: 14.6)
- Information Density: Strong technical vocabulary across domains
- Token Efficiency: Optimal range 600-1200 tokens for MoE specialists
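
The two readability figures above follow standard formulas based on words per sentence and syllables per word. The sketch below shows how such scores can be computed. It assumes a rough vowel-group syllable heuristic (`count_syllables` is a hypothetical helper, not necessarily what this analysis used), so its output may differ from the reported 25.1 and 14.6.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): count vowel groups, drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher is easier; scores below about 30 are usually read as graduate level."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def readability(text: str) -> tuple[float, float]:
    """Return (reading ease, grade level) for a piece of English text."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if not tokens:
        raise ValueError("text contains no words")
    # Count runs of terminal punctuation as sentence boundaries.
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(t) for t in tokens)
    return (
        flesch_reading_ease(len(tokens), sentences, syllables),
        flesch_kincaid_grade(len(tokens), sentences, syllables),
    )
```

Calling `readability(sample_text)` on each sample and averaging the results would give dataset-level figures comparable to those listed above.
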
eda_output/
├── README.md # This overview document
├── detailed_findings.md # Comprehensive analysis results
├── moe_recommendations.md # Specific MoE training recommendations
├── technical_appendix.md # Technical details and methodology
├── basic_statistics.txt # Raw statistical summaries
├── clustering_analysis.txt # Clustering results
├── moe_optimization.txt # MoE optimization strategies
├── comprehensive_analysis.png # Main statistical visualizations
├── advanced_clustering_analysis.png # Clustering and dimensionality reduction
├── comprehensive_summary_dashboard.png # Executive summary dashboard
├── wordclouds_by_format.png # Word clouds by content format
├── wordclouds_by_audience.png # Word clouds by target audience
├── interactive_token_distribution.html # Interactive token analysis
├── interactive_scatter_plot.html # Interactive format/audience analysis
└── interactive_sunburst.html # Hierarchical data visualization
| Metric | Random | Quality-based | Diversity-based | Balanced |
|---|---|---|---|---|
| Avg Tokens | 810 | 826 | 1,004 | 983 |
| Format Diversity | 12 types | 8 types | 12 types | 11 types |
| Cluster Coverage | 19/19 | 19/19 | 19/19 | 19/19 |
| Total Tokens | 809k | 826k | 1,004k | 983k |
- Cluster-Balanced Sampling: Ensure representation from all 19 topic clusters
- Format Prioritization: 70% educational content, 30% diverse formats
- Token Range Optimization: Target 700-1000 tokens per sample
- Quality Filtering: Prioritize academic and educational piece formats
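
The four criteria above can be combined into one selection pass. The following is a minimal sketch, not the implementation behind the reported results. It assumes each sample is a dict with hypothetical keys `cluster`, `format`, and `tokens`, and that "educational content" means the `textbook_academic_tone` and `educational_piece` formats.

```python
import random
from collections import defaultdict

# Assumption: these two formats count as "educational content".
EDUCATIONAL_FORMATS = {"textbook_academic_tone", "educational_piece"}


def balanced_select(samples, n_total=1000, token_range=(700, 1000),
                    edu_share=0.7, seed=0):
    """Pick about n_total samples, spread evenly over clusters, ~edu_share educational."""
    rng = random.Random(seed)
    lo, hi = token_range

    # Keep only samples in the target token range, grouped by cluster and pool.
    pools = defaultdict(lambda: {"edu": [], "other": []})
    for s in samples:
        if lo <= s["tokens"] <= hi:
            pool = "edu" if s["format"] in EDUCATIONAL_FORMATS else "other"
            pools[s["cluster"]][pool].append(s)

    clusters = sorted(pools)
    if not clusters:
        return []

    selected = []
    base, extra = divmod(n_total, len(clusters))
    for i, cluster in enumerate(clusters):
        # Spread the remainder over the first clusters so quotas sum to n_total.
        quota = base + (1 if i < extra else 0)
        edu, other = pools[cluster]["edu"], pools[cluster]["other"]
        rng.shuffle(edu)
        rng.shuffle(other)
        n_edu = min(round(quota * edu_share), len(edu))
        n_other = min(quota - n_edu, len(other))
        picked = edu[:n_edu] + other[:n_other]
        # If one pool ran short, top up from whatever is left in the cluster.
        leftovers = edu[n_edu:] + other[n_other:]
        picked += leftovers[: quota - len(picked)]
        selected.extend(picked)
    return selected
```

If a cluster has too few in-range samples, the function returns fewer than `n_total` items rather than relaxing the token range. Whether to relax the range instead is a design choice left open here.
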
The dataset exhibits rich topic diversity across 19 distinct clusters:
Major Topic Areas:
- Education & Learning (29.1% of samples)
- Business & Marketing (13.7% of samples)
- Health & Healthcare (15.7% of samples)
- Technology & Data (15.8% of samples)
- Arts & Culture (8.9% of samples)
- Science & Mathematics (7.6% of samples)
- Storytelling & Narratives (9.2% of samples)
| Format | Count | Percentage | Avg Tokens | Key Characteristics |
|---|---|---|---|---|
| blogpost | 37,927 | 37.9% | 726 | General audience, practical focus |
| textbook_academic_tone | 28,261 | 28.3% | 935 | Academic rigor, college-level |
| educational_piece | 6,203 | 6.2% | 712 | Structured learning content |
| story_reddit | 4,235 | 4.2% | 625 | Narrative, conversational |
| story_children | 4,160 | 4.2% | 452 | Simple language, engaging |
| Audience | Count | Percentage | Content Characteristics |
|---|---|---|---|
| general | 57,297 | 57.3% | Broad accessibility |
| college_students | 32,161 | 32.2% | Academic depth |
| young_children | 5,081 | 5.1% | Simplified language |
| grade_school_students | 3,153 | 3.2% | Age-appropriate complexity |
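
Count, percentage, and average-token columns like those in the format and audience tables can be derived with a single grouping pass. This sketch assumes each record is a dict with `format`, `audience`, and `text_token_length` fields. Those field names are an assumption about the dataset schema and should be checked against the actual data.

```python
from collections import defaultdict


def summarize_by(records, key, token_field="text_token_length"):
    """Group records by `key`; return rows sorted by count, largest first."""
    totals = defaultdict(lambda: [0, 0])  # value -> [count, token sum]
    for record in records:
        entry = totals[record[key]]
        entry[0] += 1
        entry[1] += record[token_field]

    n = sum(count for count, _ in totals.values())
    rows = [
        {
            "value": value,
            "count": count,
            "percentage": 100 * count / n,
            "avg_tokens": token_sum / count,
        }
        for value, (count, token_sum) in totals.items()
    ]
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows
```

For example, `summarize_by(records, "format")` yields one row per content format, and `summarize_by(records, "audience")` does the same per target audience.
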
- Implement Balanced Selection: Use the recommended balanced approach for the 1,000-sample selection
- Monitor Training Efficiency: Track MoE specialist utilization and convergence rates
- Quality Validation: Validate selection quality through downstream task performance
- Iterative Refinement: Adjust selection criteria based on training outcomes
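
One simple way to track specialist utilization is to log which expert the router picks for each token and summarize how evenly the load is spread. The sketch below is illustrative only. It assumes top-1 routing and a flat list of expert indices, and `expert_utilization` is a hypothetical helper rather than part of any training framework.

```python
import math
from collections import Counter


def expert_utilization(assignments, n_experts):
    """Return (per-expert load fractions, normalized routing entropy in [0, 1])."""
    total = len(assignments)
    if total == 0:
        raise ValueError("no routing assignments to summarize")
    counts = Counter(assignments)
    fractions = [counts.get(e, 0) / total for e in range(n_experts)]

    # Entropy of the load distribution, scaled so 1.0 means perfectly even use.
    entropy = -sum(p * math.log(p) for p in fractions if p > 0)
    normalized = entropy / math.log(n_experts) if n_experts > 1 else 1.0
    return fractions, normalized
```

A normalized entropy near 1.0 suggests all specialists receive similar load, while a value near 0.0 suggests routing has collapsed onto a few experts. Tracking this value across training steps, alongside loss curves, gives a rough view of utilization and convergence.
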
For detailed technical analysis, see detailed_findings.md
For specific MoE recommendations, see moe_recommendations.md
For technical methodology, see technical_appendix.md