A comprehensive toolkit for cancer genomics analysis and biomarker discovery using RNA-seq data from The Cancer Genome Atlas (TCGA). OncoLearn leverages machine learning and statistical methods for cancer subtyping and identifying potential diagnostic and prognostic markers.
Aryan Sharan Guda (aryanshg@andrew.cmu.edu), Seungjin Han (seungjih@andrew.cmu.edu), Seohyun Lee (seohyun4@andrew.cmu.edu), Yosen Lin (yosenl@andrew.cmu.edu), Isha Parikh (parikh.i@northeastern.edu), Diya Patidar (dpatidar@andrew.cmu.edu), Arunannamalai Sujatha Bharath Raj (asujatha@andrew.cmu.edu), Andrew Scouten (yzb2@txstate.edu), Jeffrey Wang (jdw2@andrew.cmu.edu), Qiyu (Charlie) Yang (qiyuy@andrew.cmu.edu), Xinru Zhang (mayzxr2203@gmail.com), River Zhu (riverz@andrew.cmu.edu), Zhaoyi (Zoey) You (zhaoyiyou.zoey@gmail.com), Heena Dalal (dalalhina@gmail.com/heena.dalal@kcl.ac.uk)
- Install Docker Desktop from docker.com
- Clone and setup:
  git clone https://github.com/collaborativebioinformatics/OncoLearn.git
  cd OncoLearn
  git submodule update --init --recursive
  docker compose up -d
- Download sample data:
  docker compose exec dev bash ./scripts/data/download_tcga_brca.sh
- Start exploring with the Jupyter notebooks in notebooks/data/
For detailed setup options and local installation, see Getting Started.
This project supports two installation methods:
Option A: Docker (Recommended)
- Docker Desktop or Docker Engine
- Docker Compose
- VSCode with Dev Containers extension (optional but recommended)
Option B: Local Installation
- Python 3.10+
- R 4.0+
- uv - Fast Python package installer and resolver
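Before picking an installation method, it can help to confirm which of the prerequisites above are already present. The following is a minimal sketch, not part of the repository: the helper names `missing_tools` and `python_ok` are hypothetical, and the executable names (`docker`, `uv`, `Rscript`) are assumptions about how these tools appear on PATH. It does not check the R 4.0+ version requirement.

```python
import shutil
import sys

# Minimum Python version stated in the local installation prerequisites.
MIN_PYTHON = (3, 10)

# Executable names are assumptions: "Rscript" ships with standard R installs,
# and "uv" and "docker" are the commands the setup steps invoke.
LOCAL_TOOLS = ("uv", "Rscript")
DOCKER_TOOLS = ("docker",)


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print("Python 3.10+:", "ok" if python_ok() else "too old")
    for label, tools in (("Docker", DOCKER_TOOLS), ("Local", LOCAL_TOOLS)):
        missing = missing_tools(tools)
        print(f"{label} tools:", "ok" if not missing else f"missing {missing}")
```

If the Docker tools are found, Option A needs nothing else on the host; the local tools only matter for Option B.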
Docker provides a consistent development environment and eliminates dependency and compatibility issues.
- Install Docker Desktop:
  - Download from docker.com
  - Or install Docker Engine on Linux
- Clone the repository:
  git clone https://github.com/collaborativebioinformatics/OncoLearn.git
  cd OncoLearn
  git submodule update --init --recursive
- Start the environment:
  # Build and start the container
  docker compose up -d
- Open in VSCode Dev Container (optional):
  - Install the Dev Containers extension
  - Press F1 → "Dev Containers: Reopen in Container"
  - VSCode will connect to the container with all extensions and tools configured
  - Jupyter notebooks (.ipynb files) will work natively in VSCode without a browser
Useful Docker Commands:
# Stop containers
docker compose down
# Rebuild after dependency changes
docker compose build
# Execute commands in container
docker compose exec dev bash
# Add new Python packages
docker compose exec dev uv add <package-name>
# View running containers
docker compose ps
- Install uv (if not already installed) from here.
- Clone the repository:
  git clone https://github.com/collaborativebioinformatics/OncoLearn.git
  cd OncoLearn
  git submodule update --init --recursive
- Install Python dependencies:
  # Install base dependencies
  uv sync
  # Or install with PyTorch extras (choose one based on your hardware):
  uv sync --extra cpu    # CPU-only version
  uv sync --extra cu128  # CUDA 12.8
  uv sync --extra cu130  # CUDA 13.0
  uv sync --extra rocm   # AMD ROCm
- Install R dependencies with renv:
  # Install renv if not already installed
  install.packages("renv")
  # Restore R package dependencies
  renv::restore()
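For scripted setups, the choice of PyTorch extra in the Python dependency step can be made explicit and validated before `uv` is invoked. This is a hedged sketch rather than anything shipped with the project: `uv_sync_command` is a hypothetical helper, and the extra names are copied from the step above on the assumption that they match the project's configuration.

```python
import shlex

# Extras named in the Python dependency step, with the hardware each targets.
PYTORCH_EXTRAS = {
    "cpu": "CPU-only version",
    "cu128": "CUDA 12.8",
    "cu130": "CUDA 13.0",
    "rocm": "AMD ROCm",
}


def uv_sync_command(extra=None):
    """Build the `uv sync` argument list, optionally with one PyTorch extra."""
    if extra is None:
        return ["uv", "sync"]  # base dependencies only
    if extra not in PYTORCH_EXTRAS:
        raise ValueError(
            f"unknown extra {extra!r}; choose from {sorted(PYTORCH_EXTRAS)}"
        )
    return ["uv", "sync", "--extra", extra]


if __name__ == "__main__":
    # Print the command for each supported backend without running it.
    for name, description in PYTORCH_EXTRAS.items():
        print(f"{description}: {shlex.join(uv_sync_command(name))}")
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root; rejecting unknown names up front avoids a failed sync partway through a setup script.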
For the best development experience, we recommend installing the following VSCode extensions:
- Python (ms-python.python) - IntelliSense, debugging, and linting for Python
- Ruff (charliermarsh.ruff) - Fast Python linter and formatter
- autopep8 (ms-python.autopep8) - Python code formatter following PEP 8 style guide
- R (REditorSupport.r) - R language support with syntax highlighting and code execution
- Jupyter (ms-toolsai.jupyter) - Interactive Jupyter notebook support
- Dev Containers (ms-vscode-remote.remote-containers) - For Docker development (if using Docker)
Comprehensive guides and documentation are available in the docs/ folder:
- TCGA Data Download Guide - Detailed instructions for downloading and managing TCGA datasets
- TCIA Data Download Guide - Guide for downloading imaging data from TCIA
- GitHub Authentication Setup - Configure SSH authentication for GitHub access
- Models Documentation - Overview of machine learning models and architectures
- data/ - Data storage directory (downloaded TCGA datasets)
- docs/ - Project documentation and guides
- notebooks/ - Jupyter notebooks for data exploration and analysis
- scripts/ - Data download and preprocessing scripts
- src/oncolearn/ - Core Python package for cancer genomics analysis
- src/multimodal/ - Multimodal learning framework for integrating multi-omic data
- configs/ - Configuration files for training and testing
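After cloning, a quick check that the directories listed above exist can catch an incomplete checkout, for example when the submodule step was skipped. The sketch below is illustrative only: `missing_dirs` is a hypothetical helper, and it assumes every listed directory is present after setup. In practice `data/` may only appear once a dataset has been downloaded.

```python
from pathlib import Path

# Directories taken from the project layout listed above.
EXPECTED_DIRS = (
    "data",
    "docs",
    "notebooks",
    "scripts",
    "src/oncolearn",
    "src/multimodal",
    "configs",
)


def missing_dirs(repo_root):
    """Return the expected directories that are absent under `repo_root`."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_dirs(".")
    print("layout ok" if not missing else f"missing: {missing}")
```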
For more information on downloading and working with TCGA data, see the TCGA Data Download Guide.
This project is licensed under the MIT License - see the LICENSE file for details.
Artificial intelligence tools, including large language models (LLMs), were used during the development of this project to support writing, clarify technical concepts, and assist in generating code snippets. These tools served as an aid for idea refinement, debugging, and improving the readability of explanations and documentation. All AI-generated text and code were thoroughly reviewed, verified for correctness, and understood in full before being incorporated into this work. The responsibility for all final decisions, interpretations, and implementations remains solely with the contributors.
