Recent advancements in single-cell RNA sequencing have greatly enhanced our ability to dissect cellular heterogeneity. However, unsupervised clustering often struggles to identify transitional or developmental boundary cells, as existing methods rely on highly variable genes without considering expression levels, thereby overlooking subtle but crucial signals.
To address this challenge, we developed scGEN (single-cell Gene-aware Embedded Network), which captures complex relationships among cells. scGEN employs adaptive feature weighting and iterative fine-tuning to prioritize ambiguous or transitional cells with overlapping transcriptional profiles.
- Adaptive feature weighting for better cell type identification
- Iterative fine-tuning to capture transitional cell states
- Superior performance on ambiguous cell classification
- Enhanced detection of subtle biological differences
Evaluation across eight distinct scRNA-seq datasets demonstrated that scGEN consistently outperformed nine leading clustering approaches. Additionally, scGEN refined the classification of ~10% ambiguous cells and uncovered biologically significant differences, providing a more comprehensive view of cellular heterogeneity in the human fetal pituitary than existing methods.
    git clone https://github.com/hurlab/scGEN.git
    cd scGEN

scGEN accepts input data in `.mat` (MATLAB) format. You can convert your data to the required format using the provided `csv2mat.m` script in MATLAB.
1. Select HVGs: Use the `hvgs2csv.py` file in the scGEN directory to filter the normalized data to the top 2000 highly variable genes.
2. Create .mat file: Use the `csv2mat.m` file in the scGEN directory to create a `.mat` file in MATLAB.
3. Place your data: Put your `.mat` file in the `dataset` folder under the scGEN directory.
4. Run scGEN: Execute the main training script:

       python3 train.py
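To make the HVG selection step above more concrete, here is a minimal sketch of keeping the most variable genes from a cells-by-genes matrix. It is not the code in `hvgs2csv.py`: the function name `top_variable_genes` is hypothetical, and ranking by plain variance is a simplifying assumption (HVG methods in common single-cell toolkits typically use dispersion-based scores instead).

```python
from statistics import pvariance


def top_variable_genes(matrix, gene_names, n_top=2000):
    """Keep the n_top genes with the highest variance across cells.

    matrix: list of rows, one row per cell and one column per gene.
    Hypothetical helper for illustration; not part of scGEN.
    """
    n_genes = len(gene_names)
    # Variance of each gene (column) across all cells.
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_genes)]
    # Rank gene indices by variance, highest first, and keep the top n_top.
    ranked = sorted(range(n_genes), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:n_top])  # restore the original column order
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [gene_names[j] for j in keep]


# Toy example (made-up values): 3 cells x 4 genes, keep the 2 most variable.
cells = [
    [1.0, 0.0, 5.0, 2.0],
    [1.0, 3.0, 0.0, 2.1],
    [1.0, 6.0, 9.0, 1.9],
]
genes = ["g1", "g2", "g3", "g4"]
filtered, kept = top_variable_genes(cells, genes, n_top=2)
print(kept)
```

For real data, use the provided `hvgs2csv.py` script; the sketch only illustrates what "filter to the top 2000 highly variable genes" means.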
You can download example datasets and scripts from: https://zenodo.org/uploads/16945598
scGEN utilizes two key hyperparameters:
- α: Balances the contributions of the Regularized ZINB loss and the structure-guided hard-sample contrastive loss functions
- γ: Adjusts the attention weight assigned to hard samples in the learning process
Based on extensive parameter sensitivity analyses (α: 0.01-100, γ: 1-5), the optimal parameters for benchmark datasets are:
| Dataset | γ (gamma) | α (alpha) |
|---|---|---|
| Bell | 1 | 1 |
| hrvatin_B1 | 1 | 1 |
| hrvatin_B2 | 1 | 1 |
| pbmc3k | 4 | 0.1 |
| Savas | 4 | 0.1 |
| Scala | 2 | 1 |
| Schwalbe | 4 | 100 |
| zhang | 4 | 10 |
- Start with default parameters: α=1, γ=1
- If results are unsatisfactory:
  - Adjust γ for better hard-sample mining
  - Modify α based on dataset complexity
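The tuning guidance above can be sketched as a small lookup with a fallback to the defaults. The dictionary values are copied from the benchmark table; the helper names `suggested_params` and `candidate_grid` are hypothetical, and the specific grid points are an assumption (the README states only the ranges α: 0.01-100 and γ: 1-5). How the values are passed to `train.py` is not specified here, so the sketch only returns them.

```python
# Values copied from the benchmark table: dataset -> (gamma, alpha).
BENCHMARK_PARAMS = {
    "Bell": (1, 1),
    "hrvatin_B1": (1, 1),
    "hrvatin_B2": (1, 1),
    "pbmc3k": (4, 0.1),
    "Savas": (4, 0.1),
    "Scala": (2, 1),
    "Schwalbe": (4, 100),
    "zhang": (4, 10),
}

# Recommended starting point for datasets not in the table.
DEFAULT_PARAMS = (1, 1)


def suggested_params(dataset):
    """Return the tabulated parameters, or the defaults for unknown datasets."""
    gamma, alpha = BENCHMARK_PARAMS.get(dataset, DEFAULT_PARAMS)
    return {"gamma": gamma, "alpha": alpha}


def candidate_grid(alphas=(0.01, 0.1, 1, 10, 100), gammas=(1, 2, 3, 4, 5)):
    """List (alpha, gamma) pairs to try if the defaults are unsatisfactory.

    The grid points are assumed: log-spaced alpha and integer gamma
    within the ranges stated in the README.
    """
    return [(alpha, gamma) for gamma in gammas for alpha in alphas]


print(suggested_params("pbmc3k"))
print(suggested_params("my_new_dataset"))
```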
The output file `result.csv` contains performance metrics (ACC, NMI, ARI, and F1 values) for each dataset across 20 runs, including the two best-performing seeds with their average and standard deviation values.
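As an illustration of the "two best-performing seeds" summary, the sketch below picks the top two runs for one metric and reports their mean and standard deviation. This is one possible reading of the description, not the code that writes `result.csv`: the column layout of that file is not documented here, so the example works on an in-memory mapping of seed to score, the scores are made up, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def summarize_top_runs(scores_by_seed, k=2):
    """Summarize the k best-scoring seeds for a single metric.

    scores_by_seed: dict mapping seed -> metric value (e.g. ACC).
    Hypothetical helper for illustration; not part of scGEN.
    """
    # Sort (seed, score) pairs by score, best first, and keep the top k.
    top = sorted(scores_by_seed.items(), key=lambda kv: kv[1], reverse=True)[:k]
    values = [score for _, score in top]
    return {
        "seeds": [seed for seed, _ in top],
        "mean": mean(values),
        "std": pstdev(values),  # population std; an assumption
    }


# Made-up ACC values for four seeds.
acc_by_seed = {0: 0.81, 1: 0.86, 2: 0.79, 3: 0.84}
summary = summarize_top_runs(acc_by_seed)
print(summary["seeds"], round(summary["mean"], 3), round(summary["std"], 3))
```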
For questions or issues, please contact guokai8@gmail.com or open an issue on GitHub.
The study was partially supported by the United States National Institute of Diabetes and Digestive and Kidney Diseases (R01DK130913 to Junguk Hur); the Computational Data Analysis Core of the University of North Dakota (supported by the National Institute of General Medical Sciences award P20GM113123); the Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN202200479); the Natural Science Foundation of Chongqing (CSTB2022NSCQ-LZX0033); the Chongqing Medical University Program for Youth Innovation in Future Medicine (W0158); and the National Natural Science Foundation of China (82200592).