Curio is a research-based, unsupervised topic modelling pipeline for social media, written in Swift. It draws on existing libraries to support data collection, document encoding (e.g., Core ML, Model2Vec, Apple's Natural Language framework), dimensionality reduction (e.g., PCA, t-SNE, UMAP), clustering (e.g., HDBSCAN, k-means), and topic modelling. Our goal is to provide a modular, efficient set of tools that works across a variety of data sources. We use modern Swift concurrency and libraries such as MLX to provide fast, safe implementations that run well on commodity Mac hardware. Curio is intended to enable new qualitative data analysis tools for edge devices such as laptops, tablets, and smartphones.

Data Collection
- Reddit API Endpoints
- Pushshift Reddit Archives
- Additional data sources (e.g., X, Steam, GitHub)

Encoding
- Static Embeddings
- Contextual Embeddings (e.g., Sentence-Transformers)
- OpenAI API
- Core ML Models (e.g., All-MiniLM-L6)

Dimensionality Reduction

Clustering

Topic Models
- c-TF-IDF Keyword Generation
- Evaluation Metrics (Cosine Similarity, Topic Diversity)
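
For readers unfamiliar with the two evaluation metrics named above, here is a minimal sketch in plain Swift. It is not Curio's API: the function names `cosineSimilarity` and `topicDiversity` and their signatures are hypothetical, chosen only for illustration. Topic diversity is taken here to mean the fraction of unique words among the top-k keywords of all topics, which is one common definition and an assumption about what the roadmap item refers to.

```swift
import Foundation

/// Cosine similarity between two equal-length embedding vectors.
/// Returns 0 when either vector has zero magnitude.
/// Hypothetical helper for illustration; not part of Curio.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "Vectors must have the same dimension")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for (x, y) in zip(a, b) {
        dot += x * y
        normA += x * x
        normB += y * y
    }
    let denominator = normA.squareRoot() * normB.squareRoot()
    return denominator == 0 ? 0 : dot / denominator
}

/// Topic diversity: unique words divided by total words, counted over the
/// first `topK` keywords of every topic. 1.0 means no keyword is shared.
/// Hypothetical helper for illustration; not part of Curio.
func topicDiversity(_ topics: [[String]], topK: Int) -> Double {
    let topWords = topics.flatMap { $0.prefix(topK) }
    guard !topWords.isEmpty else { return 0 }
    return Double(Set(topWords).count) / Double(topWords.count)
}
```

Cosine similarity compares two embeddings by direction rather than magnitude. A diversity score well below 1.0 suggests that several topics share the same keywords.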
To install Curio with Swift Package Manager, add it to the dependencies in your Package.swift:
.package(url: "https://git.uwaterloo.ca/jrWallac/curio.git", from: "0.0.8")
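
For context, a complete manifest using that dependency might look like the sketch below. The package and target name `MyTopicTool` is hypothetical, and the library product name `"Curio"` is an assumption; check the products declared in Curio's own Package.swift before copying this. The tools version shown is also only an example.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyTopicTool", // hypothetical name for your own package
    dependencies: [
        // Dependency line taken from the installation instructions.
        .package(url: "https://git.uwaterloo.ca/jrWallac/curio.git", from: "0.0.8")
    ],
    targets: [
        .executableTarget(
            name: "MyTopicTool",
            dependencies: [
                // Assumption: the library product is named "Curio".
                // The package identity "curio" comes from the last URL path component.
                .product(name: "Curio", package: "curio")
            ]
        )
    ]
)
```

With `from: "0.0.8"`, Swift Package Manager resolves to the newest available version from 0.0.8 up to, but not including, 1.0.0.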
This project is developed by a team of researchers from the Human-Computer Interaction and Health Lab at the University of Waterloo. The project is led by Prof. Jim Wallace, with contributions from:
- Jason Zhao
- Nicole Mathis
- Peter Li
- Adrian Davila
- Henry Tian
- Jean Nordmann
- Mingchung Xia
- Abhinav Jain
- George Wang
- Ali Raza Zaidi
If you would like to contribute to the project, contact Prof. Wallace with "Curio" in the subject line, and mention one or more of the roadmap items above that you would like to work on.
All original code is released under the MIT license for commercial and non-commercial use.