-
Notifications
You must be signed in to change notification settings - Fork 1
permutation tests #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
enryH
wants to merge
30
commits into
main
Choose a base branch
from
anglup-learning
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
30 commits
Select commit
Hold shift + click to select a range
13b1b21
init internal functions for permutation testing
angelphanth cdb1705
user facing in init file
angelphanth a13be15
:construction: some tests
angelphanth 13149df
Merge branch 'main' into anglup-learning
angelphanth 5a5a42a
:bug: degeneracy check for paired permutations
angelphanth 179ce05
use internal functions
angelphanth a144f7e
:memo: copilot docstrings
angelphanth e66b6b3
tests for accept reject null
angelphanth 53d5f46
init tutorial
angelphanth 0a7e33a
diff conditions were diff mgys studies
angelphanth dc648f6
other go term
angelphanth 719b9e3
Merge branch 'main' into anglup-learning
angelphanth 5492203
:construction: instead saving processed dataset to example_data/mgnify
angelphanth 8ebab69
:memo:docs: Update API examples for permutation testing
angelphanth b6d049c
jupytext of the permutation testing demo notebook
angelphanth 74255e3
:lipstick:style: lint and reformatd code
angelphanth afc2283
Merge branch 'main' into anglup-learning
angelphanth 23a79ff
:bug:fix: update PR module link in contributing guide
angelphanth 2194b08
:bug: fix: add quotation marks to command
angelphanth 5b0124e
:sparkles: feat: add pydantic model for perm test output
angelphanth 83bbfda
:sparkles: feat: add pydantic model for perm test output
angelphanth e614049
:memo: docs: jupytexted the notebook
angelphanth 6bea62d
:lipstick: style: lint and format
angelphanth f27cf2e
Update docs/api_examples/permutation_testing.py
angelphanth 8415f0f
Update docs/api_examples/permutation_testing.py
angelphanth fcc6f4e
Merge branch 'main' into anglup-learning
enryH ccfe00e
Apply suggestions from code review
enryH 0ef5a84
Merge branch 'main' into anglup-learning
enryH e49a103
Merge branch 'main' into anglup-learning
enryH 7f74321
:truck: renamed file and moved to 'transform' dir
angelphanth File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,262 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "e9bdb7c9", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Tutorial: Permutation Testing using `acore`\n", | ||
| "\n", | ||
| "In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).\n", | ||
| "\n", | ||
| "The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).\n", | ||
| "\n", | ||
| "For this demo we look at the GO term abundance tables generated by the Mgnify pipeline. The values in the table are the absolute abundance of selected GO terms for each sample, which we then transform to relative abundances and centred-log ratios. \n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "dc9f17b2", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Data preparation details\n", | ||
| "\n", | ||
| "### Downloading\n", | ||
| "The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The inffluent (INF) and effluent (EFFF) datasets have paired samples and we also needed to download the sample metadata (also available via Mgnify API) to assign the correct pairing.\n", | ||
| "\n", | ||
| "### Preprocessing of abundances\n", | ||
| "- To account for technical variation due to sequencing technology limitations, we first transform the abundance values so they are relative to the total reads for the sample aka getting relative abundances. \n", | ||
| "- The relative abundances are compositional data (CoDa) so we map them to unconstrained vectors using centred log-ratio transformation `acore.microbiome.internal_functions.calc_clr()` to not violate assumptions of any frequentist stats we do\n", | ||
| "\n", | ||
| "### Preprocessing of the metadata \n", | ||
| "- the sample metadata needed for this demo (sampling location) were available in their \"sample-desc\" \n", | ||
| "- the sample-desc for each sample in both INF and EFF were parsed and used for pairing off\n", | ||
| "\n", | ||
| "### Subset of data for demo\n", | ||
| "- For this demo we only look at [go term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001)\n", | ||
| "- It's expected that antibiotic catabolic processes to be higher in INF vs EFF\n", | ||
| "\n", | ||
| "### Saving the demo dataset\n", | ||
| "This example subset of data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below:\n", | ||
| "\n", | ||
| "| column | description | dtype |\n", | ||
| "|-------------------|-------------------------------------------------------------------------------------------------------------------|-------|\n", | ||
| "| eff_id | The run id for the mgnify analysis of the effluent sample. | str |\n", | ||
| "| inf_id | The run id for the mgnify analysis of the influent sample. | str |\n", | ||
| "| sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str |\n", | ||
| "| sampling_read | Replicates? | str |\n", | ||
| "| eff_abundance | The relative abundance of the GO term for a given effluent sample following preprocessing (i.e., CoDA and CLR) | float |\n", | ||
| "| inf_abundance | The relative abundance of the GO term for a given influent sample following preprocessing (i.e., CoDA and CLR) | float |\n", | ||
| "\n", | ||
| "-----\n", | ||
| "\n", | ||
| "We will now proceed with reading in the prepared dataset. " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 1, | ||
| "id": "00d8f038", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "text/html": [ | ||
| "<div>\n", | ||
| "<style scoped>\n", | ||
| " .dataframe tbody tr th:only-of-type {\n", | ||
| " vertical-align: middle;\n", | ||
| " }\n", | ||
| "\n", | ||
| " .dataframe tbody tr th {\n", | ||
| " vertical-align: top;\n", | ||
| " }\n", | ||
| "\n", | ||
| " .dataframe thead th {\n", | ||
| " text-align: right;\n", | ||
| " }\n", | ||
| "</style>\n", | ||
| "<table border=\"1\" class=\"dataframe\">\n", | ||
| " <thead>\n", | ||
| " <tr style=\"text-align: right;\">\n", | ||
| " <th></th>\n", | ||
| " <th>eff_id</th>\n", | ||
| " <th>inf_id</th>\n", | ||
| " <th>sampling_location</th>\n", | ||
| " <th>sampling_read</th>\n", | ||
| " <th>eff_abundance</th>\n", | ||
| " <th>inf_abundance</th>\n", | ||
| " </tr>\n", | ||
| " </thead>\n", | ||
| " <tbody>\n", | ||
| " <tr>\n", | ||
| " <th>0</th>\n", | ||
| " <td>ERR2985255</td>\n", | ||
| " <td>ERR2814663</td>\n", | ||
| " <td>TG</td>\n", | ||
| " <td>READ2 Taxonomy ID:256318</td>\n", | ||
| " <td>3.257283</td>\n", | ||
| " <td>4.226819</td>\n", | ||
| " </tr>\n", | ||
| " <tr>\n", | ||
| " <th>1</th>\n", | ||
| " <td>ERR2985256</td>\n", | ||
| " <td>ERR2814664</td>\n", | ||
| " <td>MN</td>\n", | ||
| " <td>READ2 Taxonomy ID:256318</td>\n", | ||
| " <td>2.572841</td>\n", | ||
| " <td>3.847191</td>\n", | ||
| " </tr>\n", | ||
| " <tr>\n", | ||
| " <th>2</th>\n", | ||
| " <td>ERR2985257</td>\n", | ||
| " <td>ERR2814651</td>\n", | ||
| " <td>AH</td>\n", | ||
| " <td>READ1 Taxonomy ID:256318</td>\n", | ||
| " <td>4.298777</td>\n", | ||
| " <td>4.086841</td>\n", | ||
| " </tr>\n", | ||
| " <tr>\n", | ||
| " <th>3</th>\n", | ||
| " <td>ERR2985258</td>\n", | ||
| " <td>ERR2814667</td>\n", | ||
| " <td>TE</td>\n", | ||
| " <td>READ1 Taxonomy ID:256318</td>\n", | ||
| " <td>2.758982</td>\n", | ||
| " <td>3.436752</td>\n", | ||
| " </tr>\n", | ||
| " <tr>\n", | ||
| " <th>4</th>\n", | ||
| " <td>ERR2985259</td>\n", | ||
| " <td>ERR2814660</td>\n", | ||
| " <td>FD</td>\n", | ||
| " <td>READ1 Taxonomy ID:256318</td>\n", | ||
| " <td>3.364675</td>\n", | ||
| " <td>3.486673</td>\n", | ||
| " </tr>\n", | ||
| " </tbody>\n", | ||
| "</table>\n", | ||
| "</div>" | ||
| ], | ||
| "text/plain": [ | ||
| " eff_id inf_id sampling_location sampling_read \\\n", | ||
| "0 ERR2985255 ERR2814663 TG READ2 Taxonomy ID:256318 \n", | ||
| "1 ERR2985256 ERR2814664 MN READ2 Taxonomy ID:256318 \n", | ||
| "2 ERR2985257 ERR2814651 AH READ1 Taxonomy ID:256318 \n", | ||
| "3 ERR2985258 ERR2814667 TE READ1 Taxonomy ID:256318 \n", | ||
| "4 ERR2985259 ERR2814660 FD READ1 Taxonomy ID:256318 \n", | ||
| "\n", | ||
| " eff_abundance inf_abundance \n", | ||
| "0 3.257283 4.226819 \n", | ||
| "1 2.572841 3.847191 \n", | ||
| "2 4.298777 4.086841 \n", | ||
| "3 2.758982 3.436752 \n", | ||
| "4 3.364675 3.486673 " | ||
| ] | ||
| }, | ||
| "execution_count": 1, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "import pandas as pd \n", | ||
| "\n", | ||
| "df_data = pd.read_csv(\n", | ||
| " 'https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv'\n", | ||
| ")\n", | ||
| "# sanity check \n", | ||
| "df_data.head()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "7cf6c3d0", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## The permutation test\n", | ||
| "\n", | ||
| "Since these are paird samples we will proceed with paired sample permutation test using `acore.perumutation_test.paired_permutation()`. \n", | ||
| "\n", | ||
| "The permutation test compares the actual observed chosen metric (e.g., t-statistic, mean difference) with metrics calculated when the dataset values are randomly shuffled permutations of the dataset. \n", | ||
| "\n", | ||
| "If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect size having occurred by chance. " | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 2, | ||
| "id": "5eefb720", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from acore.permutation_test import paired_permutation\n", | ||
| "\n", | ||
| "# optional choice of random number generator for repro\n", | ||
| "import numpy as np\n", | ||
| "rng = np.random.default_rng(12345)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 3, | ||
| "id": "9eabb911", | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "{'metric': <function ttest_rel at 0x10fee80e0>, 'observed': (np.float64(6.7389860601792275), np.float64(7.122287781830209e-07)), 'p_value': 0.0}\n", | ||
| "{'metric': <function mean at 0x10920a370>, 'observed': np.float64(0.5350826397500547), 'p_value': 0.0}\n", | ||
| "{'metric': <function mean at 0x10920a370>, 'observed': np.float64(0.5350826397500547), 'p_value': 0.0}\n" | ||
| ] | ||
| } | ||
| ], | ||
| "source": [ | ||
| "# trying diff metrics to demo functionality also\n", | ||
| "for metric in ['t-statistic', 'mean', np.mean]:\n", | ||
| " result = paired_permutation(\n", | ||
| " df_data['inf_abundance'].to_numpy(),\n", | ||
| " df_data['eff_abundance'].to_numpy(), \n", | ||
| " metric=metric, \n", | ||
| " n_permutations=10000, \n", | ||
| " rng=rng\n", | ||
| " )\n", | ||
| " # verbosity\n", | ||
| " print(result)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "1ad22530", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Result\n", | ||
| "\n", | ||
| "Based on the permutation tests by test statistic and mean difference, the probability of the observed metrics (t=6.739 and mean diff=0.535) occurring at random would be <0.00001." | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": ".venv", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.11.0" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,101 @@ | ||
| # --- | ||
| # jupyter: | ||
| # jupytext: | ||
| # text_representation: | ||
| # extension: .py | ||
| # format_name: percent | ||
| # format_version: '1.3' | ||
| # jupytext_version: 1.17.3 | ||
| # kernelspec: | ||
| # display_name: .venv | ||
| # language: python | ||
| # name: python3 | ||
| # --- | ||
|
|
||
| # %% [markdown] | ||
| # # Permutation Tests | ||
| # | ||
| # In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8). | ||
| # | ||
| # The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058). | ||
| # | ||
| # For this demo we look at the GO term abundance tables generated by the Mgnify pipeline. The values in the table are the absolute abundance of selected GO terms for each sample, which we then transform to relative abundances and centred-log ratios. | ||
| # | ||
|
|
||
| # %% [markdown] | ||
| # ## Data preparation details | ||
| # | ||
| # ### Downloading | ||
| # The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The inffluent (INF) and effluent (EFFF) datasets have paired samples and we also needed to download the sample metadata (also available via Mgnify API) to assign the correct pairing. | ||
|
||
| # | ||
| # ### Preprocessing of abundances | ||
| # - To account for technical variation due to sequencing technology limitations, we first transform the abundance values so they are relative to the total reads for the sample aka getting relative abundances. | ||
| # - The relative abundances are compositional data (CoDa) so we map them to unconstrained vectors using centred log-ratio transformation [`acore.microbiome.internal_functions.calc_clr`](`acore.microbiome.internal_functions.calc_clr`) to not violate assumptions of any frequentist stats we do | ||
| # | ||
| # ### Preprocessing of the metadata | ||
| # - the sample metadata needed for this demo (sampling location) were available in their "sample-desc" | ||
| # - the sample-desc for each sample in both INF and EFF were parsed and used for pairing off | ||
| # | ||
| # ### Subset of data for demo | ||
| # - For this demo we only look at [go term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001) | ||
| # - It's expected that antibiotic catabolic processes to be higher in INF vs EFF | ||
| # | ||
| # ### Saving the demo dataset | ||
| # This example subset of data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below: | ||
| # | ||
| # | column | description | dtype | | ||
| # |-------------------|-------------------------------------------------------------------------------------------------------------------|-------| | ||
| # | eff_id | The run id for the mgnify analysis of the effluent sample. | str | | ||
| # | inf_id | The run id for the mgnify analysis of the influent sample. | str | | ||
| # | sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str | | ||
| # | sampling_read | Replicates? | str | | ||
| # | eff_abundance | The relative abundance of the GO term for a given effluent sample following preprocessing (i.e., CoDA and CLR) | float | | ||
| # | inf_abundance | The relative abundance of the GO term for a given influent sample following preprocessing (i.e., CoDA and CLR) | float | | ||
| # | ||
| # ----- | ||
| # | ||
| # We will now proceed with reading in the prepared dataset. | ||
|
|
||
| # %% | ||
| import pandas as pd | ||
|
|
||
| df_data = pd.read_csv( | ||
| "https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv" | ||
| ) | ||
| # sanity check | ||
| df_data.head() | ||
|
|
||
| # %% [markdown] | ||
| # ## The permutation test | ||
| # | ||
| # Since these are paird samples we will proceed with paired sample permutation test using `acore.perumutation_test.paired_permutation()`. | ||
| # | ||
| # The permutation test compares the actual observed chosen metric (e.g., t-statistic, mean difference) with metrics calculated when the dataset values are randomly shuffled permutations of the dataset. | ||
| # | ||
| # If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect size having occurred by chance. | ||
|
|
||
| # %% | ||
| from acore.permutation_test import paired_permutation | ||
|
|
||
| # optional choice of random number generator for repro | ||
| import numpy as np | ||
|
|
||
| rng = np.random.default_rng(12345) | ||
|
|
||
| # %% | ||
| # trying diff metrics to demo functionality also | ||
| for metric in ["t-statistic", "mean", np.mean]: | ||
| result = paired_permutation( | ||
| df_data["inf_abundance"].to_numpy(), | ||
| df_data["eff_abundance"].to_numpy(), | ||
| metric=metric, | ||
| n_permutations=10000, | ||
| rng=rng, | ||
| ) | ||
| # verbosity | ||
| print(result) | ||
|
|
||
| # %% [markdown] | ||
| # ## Result | ||
| # | ||
| # Based on the permutation tests by test statistic and mean difference, the probability of the observed metrics (t=6.739 and mean diff=0.535) occurring at random would be <0.00001. | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple spelling errors: "inffluent" should be "influent" and "EFFF" should be "EFF".