Data Management Using Python Library

https://data-management-python.readthedocs.io

This repository contains the core Python library developed and maintained by the NIHR Imperial BRC Genomics Facility for managing raw and processed genomic datasets efficiently.

Key Features

1. Metadata Management

  • Utilizes an extended ENA metadata model for managing information about:
    • Projects
    • Samples
    • Sequencing runs
    • Analysis
    • File paths
    • Pipeline instances

2. Genomic Sequencing Run Processing

  • Tracks ongoing sequencing runs and initiates processing upon completion.
  • Generates summary reports and sends email notifications to users.

3. Analysis Pipelines

  • Includes wrappers for both community-developed and vendor-provided data pipelines.
  • Automates:
    • Configuration generation
    • Input formatting
  • Executes external pipelines on HPC using bash script wrappers.
  • Manages post-processing, including:
    • Custom report generation
    • Analysis data validation

Requirements

• Python v3.10
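
A minimal sketch for checking the interpreter version before installing, assuming the requirement means any 3.10.x release. The helper name `is_python_3_10` is hypothetical and not part of this library.

```shell
# Hypothetical helper: succeeds when the given version string is 3.10 or 3.10.x.
is_python_3_10() {
  case "$1" in
    3.10|3.10.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage with the active interpreter (assumes `python3` is on PATH):
# is_python_3_10 "$(python3 -c 'import platform; print(platform.python_version())')" \
#   || echo "Python 3.10 is required"
```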

Installation

1. Clone the Repository

git clone https://github.com/imperial-genomics-facility/data-management-python.git

2. Install Dependencies

Install the required Python libraries:

pip install -r requirements_2.10.4.txt  # For compatibility with Apache Airflow v2.10.4

3. Update PYTHONPATH

Add the core library path to PYTHONPATH:

export PYTHONPATH=/PATH/data-management-python
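
The export above overwrites any existing PYTHONPATH value. If other entries need to be kept, a sketch along the following lines prepends the library path instead. `/PATH` is the same placeholder as above, and `IGF_LIB_PATH` is a hypothetical variable name used only for illustration.

```shell
# Hypothetical variable name; replace /PATH with the directory that holds your clone.
IGF_LIB_PATH="/PATH/data-management-python"

# Prepend the library path and keep any existing entries.
# ${PYTHONPATH:+:${PYTHONPATH}} expands to ":<old value>" only when
# PYTHONPATH is already set and non-empty, so no stray colon is added.
export PYTHONPATH="${IGF_LIB_PATH}${PYTHONPATH:+:${PYTHONPATH}}"

# Print the result to confirm the library path comes first.
echo "${PYTHONPATH}"
```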

Update Airflow version

1. Set env variables

export AIRFLOW_VERSION=VERSION  # Airflow release to install
export PYTHON_VERSION=VERSION   # Python major.minor version of the environment
export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
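
As an illustration, the sketch below fills in Airflow 2.10.4 (the version named in the requirements file in the Installation section) and Python 3.10 (from Requirements). These are example values only, not a recommendation for which version to move to. Airflow's constraint files are named by major.minor Python version, so PYTHON_VERSION is 3.10 rather than a full 3.10.x string.

```shell
# Example values only: 2.10.4 matches requirements_2.10.4.txt,
# 3.10 matches the Requirements section.
export AIRFLOW_VERSION=2.10.4
export PYTHON_VERSION=3.10
export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

# Print the resolved URL to check it before passing it to pip.
echo "${CONSTRAINT_URL}"
# prints https://raw.githubusercontent.com/apache/airflow/constraints-2.10.4/constraints-3.10.txt
```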

2. Install core Airflow libraries

pip install "apache-airflow[celery,postgres,redis,graphviz,pandas,apache-spark,airbyte,amazon,slack,singularity,ssh,sftp,smtp]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

3. Install additional libraries

pip install asana gviz-api html5lib matplotlib PyMySQL pytest pytest-cov tox slackclient --constraint "${CONSTRAINT_URL}"

4. List Python library versions in the requirements file

pip freeze > requirements_vVERSION.txt
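
A small sketch of deriving the file name from the Airflow version rather than typing it by hand. It assumes AIRFLOW_VERSION was set in step 1, and falls back to an example value so the snippet runs on its own. `REQUIREMENTS_FILE` is a hypothetical variable name, and the `requirements_v` prefix follows the command above.

```shell
# Fall back to an example value if AIRFLOW_VERSION is not already set
# (assumption for illustration only).
AIRFLOW_VERSION="${AIRFLOW_VERSION:-2.10.4}"

# Hypothetical variable; the name pattern follows the command above.
REQUIREMENTS_FILE="requirements_v${AIRFLOW_VERSION}.txt"
echo "${REQUIREMENTS_FILE}"

# Uncomment to write the pinned versions of the current environment:
# pip freeze > "${REQUIREMENTS_FILE}"
```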

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.