This document provides guidance for AI assistants working on the OTK (ecDNA prediction) project.
OTK is a machine learning toolkit for extrachromosomal DNA (ecDNA) prediction. The project uses deep learning and gradient boosting models to predict ecDNA presence in cancer samples.
- All models must use the same data split: 80/10/10 (train/val/test)
- Random seed: 2026 (fixed across all models)
- Implementation: src/otk/data/data_split.py
- Usage:
from otk.data import load_split
train_df, val_df, test_df = load_split()
# Returns: train=5808151, val=715972, test=755098 rows

All models inherit from BaseEcDNAModel and implement:
- fit(X_train, y_train, X_val, y_val) - Training
- predict_proba(X) - Probability prediction
- predict(X) - Binary prediction
- save(path) / load(path) - Persistence
- evaluate_gene_level() - Gene-level metrics
- evaluate_sample_level() - Sample-level metrics
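As a rough illustration of the interface listed above, the sketch below uses a hypothetical stand-in class, `BaseModelSketch`, rather than the real `BaseEcDNAModel` in base_model.py, whose actual signatures and defaults may differ. The 0.5 decision threshold and the toy `PriorModel` subclass are assumptions made for the example; persistence and evaluation methods are omitted for brevity.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseModelSketch(ABC):
    """Hypothetical stand-in for BaseEcDNAModel (not the real class)."""

    threshold: float = 0.5  # assumed decision threshold

    @abstractmethod
    def fit(self, X_train, y_train, X_val, y_val) -> "BaseModelSketch":
        """Train the model; validation data may be used for early stopping."""

    @abstractmethod
    def predict_proba(self, X) -> List[float]:
        """Return one positive-class probability per row of X."""

    def predict(self, X) -> List[int]:
        # Binary prediction derived from the probabilities.
        return [int(p >= self.threshold) for p in self.predict_proba(X)]


class PriorModel(BaseModelSketch):
    """Toy model: predicts the training positive rate for every row."""

    def __init__(self):
        self.rate = 0.0

    def fit(self, X_train, y_train, X_val, y_val):
        self.rate = sum(y_train) / len(y_train)
        return self

    def predict_proba(self, X):
        return [self.rate for _ in X]
```

Keeping `predict` in the base class means a subclass only has to supply probabilities, so every model applies the same thresholding rule.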
Each model has a config.yml in its directory:
otk_api/models/{model_name}/config.yml
Models should read configuration from this file for:
- Hyperparameters
- Architecture settings
- Training configuration
After training, each model saves a training_summary.yml in a unified format:
model_name: xgb_new
gene_level:
  train:
    auPRC: 0.9519
    AUC: 0.9993
    Precision: 0.7987
    Recall: 0.9299
    F1: 0.8593
  val:
    auPRC: 0.6838
    ...
  test:
    auPRC: 0.8339
    ...
sample_level:
  train:
    auPRC: 0.9906
    AUC: 0.9670
    ...
  val:
    ...
  test:
    ...

IMPORTANT: Never use numpy objects in YAML files. Always convert to Python floats:
# WRONG
metrics = {'auPRC': np.float64(0.85)} # Will cause YAML serialization issues
# CORRECT
metrics = {'auPRC': float(0.85)}

otk/
├── src/otk/
│   ├── data/
│   │   ├── data_split.py  # Unified data split (seed=2026)
│   │   ├── data_processor.py  # Data preprocessing
│   │   └── sorted_modeling_data.csv.gz  # Main dataset
│   ├── models/
│   │   ├── base_model.py  # Base class for all models
│   │   ├── xgb11_model.py  # XGBoost models (XGB11, XGBNew)
│   │   ├── neural_models.py  # Neural network models
│   │   ├── tabpfn_model.py  # TabPFN model
│   │   ├── custom_losses.py  # Loss functions
│   │   └── config_generator.py  # Config file generator
│   ├── train/
│   │   └── trainer.py  # Training utilities
│   ├── predict/
│   │   └── predictor.py  # Prediction utilities
│   └── utils/
│       └── __init__.py
├── otk_api/
│   ├── models/  # Trained models directory
│   │   ├── xgb_new/
│   │   ├── xgb_paper/
│   │   ├── baseline_mlp/
│   │   ├── transformer/
│   │   ├── deep_residual/
│   │   ├── optimized_residual/
│   │   ├── dgit_super/
│   │   └── tabpfn/
│   └── model_analyzer.py  # Model analysis and reporting
├── train_unified.py  # Unified training script
└── README.md
| Model | Type | File | Description |
|---|---|---|---|
| xgb_new | XGBoost | xgb11_model.py | Optimized with feature engineering |
| xgb_paper | XGBoost | xgb11_model.py | Paper reproduction (11 features) |
| baseline_mlp | Neural | neural_models.py | Simple MLP baseline |
| transformer | Neural | neural_models.py | Transformer architecture |
| deep_residual | Neural | neural_models.py | Deep residual network |
| optimized_residual | Neural | neural_models.py | Optimized residual network |
| dgit_super | Neural | neural_models.py | Deep gated interaction transformer |
| tabpfn | TabPFN | tabpfn_model.py | TabPFN ensemble |
- Gene-level auPRC: ≥ 0.85
- Gene-level Precision: ≥ 0.8
- Sample-level auROC: ≥ 0.9
- Sample-level auPRC: ≥ 0.99
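Assuming the thresholds above are meant to be checked against one split of a parsed training_summary.yml, a check could look like the sketch below. The function name `check_targets`, the choice of the test split as default, and the mapping of "auROC" to the summary's `AUC` key are all assumptions, not part of the project code.

```python
# Thresholds copied from the list above; the (level, key) layout is an assumption.
TARGETS = {
    ("gene_level", "auPRC"): 0.85,
    ("gene_level", "Precision"): 0.8,
    ("sample_level", "AUC"): 0.9,  # assumes the summary's AUC key means auROC
    ("sample_level", "auPRC"): 0.99,
}


def check_targets(summary, split="test"):
    """Return {(level, metric): passed} for one split of a parsed summary dict."""
    results = {}
    for (level, metric), minimum in TARGETS.items():
        value = summary.get(level, {}).get(split, {}).get(metric)
        # A missing metric counts as not passing.
        results[(level, metric)] = value is not None and value >= minimum
    return results
```

Treating a missing metric as a failure is a deliberate choice here: a truncated summary should not silently pass.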
# Train single model
python train_unified.py --model xgb_new
# Train all models
python train_unified.py --all

# Analyze trained models
cd otk_api
python model_analyzer.py

# Start the API
cd otk_api
OTK_BASE_PATH=/otk bash start_api.sh

- Never change the random seed (2026) without updating all models
- Always use the unified data split from data_split.py
- Convert numpy types to Python types before saving to YAML
- Keep config.yml synchronized with model implementation
- Test model loading after training to ensure persistence works
If you see the error "could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:numpy...'":
- The training_summary.yml contains numpy objects
- Solution: Re-train the model with proper float conversion
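One way to apply the float conversion consistently is to clean the whole metrics dictionary just before it is written. The helper below is a minimal sketch and not project code: the name `to_builtin` is hypothetical, and it relies on the fact that numpy scalars and arrays expose a `tolist()` method returning built-in Python values, so it does not need to import numpy itself.

```python
def to_builtin(obj):
    """Recursively convert numpy-like values to plain Python types."""
    if isinstance(obj, dict):
        return {key: to_builtin(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_builtin(value) for value in obj]
    if hasattr(obj, "tolist"):
        # Duck-typed check (an assumption of this sketch): numpy scalars and
        # arrays both provide tolist(), which yields built-in floats/ints/lists.
        return obj.tolist()
    return obj
```

Calling `to_builtin(metrics)` on the full summary before dumping it would cover nested gene-level and sample-level entries in one pass, instead of wrapping each value in `float()` by hand.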
If models have different sample counts:
- Check that all models use load_split() from data_split.py
- Verify the split file exists: src/otk/data/split_2026.json
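To illustrate why a shared seed keeps sample counts identical across models, here is a minimal sketch of a seeded 80/10/10 split over sample IDs. It is not the logic in data_split.py, which may differ (for example, in how it persists the split to split_2026.json); the function name `split_ids` and the per-ID splitting are assumptions.

```python
import random


def split_ids(sample_ids, seed=2026, fractions=(0.8, 0.1, 0.1)):
    """Deterministically split unique IDs into train/val/test lists."""
    ids = sorted(set(sample_ids))  # sort so the input order does not matter
    rng = random.Random(seed)      # local RNG; global random state is untouched
    rng.shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    # The test split takes whatever remains after train and val.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

Any model that calls the same function with the same seed gets the same three partitions, whereas a model that shuffles on its own will end up with different counts.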
If a model cannot be imported:
- Check that src/otk/models/__init__.py includes the model
- Verify that the model class inherits from BaseEcDNAModel
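Both import checks can be automated with a small diagnostic. The sketch below is generic and hypothetical (the helper name `check_model_class` is not part of the project); it uses only `importlib` from the standard library and returns a reason string rather than raising.

```python
import importlib


def check_model_class(module_name, class_name, base):
    """Import module_name and report whether class_name subclasses base."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return False, f"cannot import {module_name}: {exc}"
    cls = getattr(module, class_name, None)
    if cls is None:
        return False, f"{module_name} does not export {class_name}"
    if not (isinstance(cls, type) and issubclass(cls, base)):
        return False, f"{class_name} does not inherit from {base.__name__}"
    return True, "ok"
```

In this project it would presumably be called with the module `otk.models`, the model's class name, and `BaseEcDNAModel` as the base.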