Multiomics-Analytics-Group · enryH · Sep 9, 2025
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -64,8 +64,8 @@ git clone https://github.com/Multiomics-Analytics-Group/acore.git
 cd acore/
 python -m venv .env
 source .env/bin/activate
-pip install -e .[dev]
-```
+
+pip install -e ".[dev]"
 
 If you work on Windows, see the docs: https://docs.python.org/3/library/venv.html#how-venvs-work
 
@@ -119,7 +119,7 @@ Before you submit a pull request, check that it meets these guidelines:
 3. The pull request should pass the GitHub workflows.
 
 See the PR template example:
-[Add module PR template](https://github.com/Multiomics-Analytics-Group/acore/blob/main/.github/PULL_REQUEST_TEMPLATE/module.md)
+[Add module PR template](https://github.com/Multiomics-Analytics-Group/acore/blob/main/.github/workflows/PULL_REQUEST_TEMPLATE/module.md)
 
 ## Deploying
 

diff --git a/docs/api_examples/permutation_testing.ipynb b/docs/api_examples/permutation_testing.ipynb
@@ -0,0 +1,262 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e9bdb7c9",
+   "metadata": {},
+   "source": [
+    "# Tutorial: Permutation Testing using `acore`\n",
+    "\n",
+    "In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).\n",
+    "\n",
+    "The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).\n",
+    "\n",
+    "For this demo we look at the GO term abundance tables generated by the MGnify pipeline. The values in the table are the absolute abundances of selected GO terms for each sample, which we then transform to relative abundances and centred log-ratios.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc9f17b2",
+   "metadata": {},
+   "source": [
+    "## Data preparation details\n",
+    "\n",
+    "### Downloading\n",
+    "The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The influent (INF) and effluent (EFF) datasets contain paired samples, so we also downloaded the sample metadata (also available via the MGnify API) to assign the correct pairing.\n",
+    "\n",
+    "### Preprocessing of abundances\n",
+    "- To account for technical variation due to sequencing technology limitations, we first divide the abundance values by the total reads for the sample, which gives relative abundances.\n",
+    "- The relative abundances are compositional data (CoDa), so we map them to unconstrained vectors using the centred log-ratio transformation `acore.microbiome.internal_functions.calc_clr()` to avoid violating the assumptions of the frequentist statistics we apply.\n",
+    "\n",
+    "### Preprocessing of the metadata \n",
+    "- The sample metadata needed for this demo (sampling location) were available in each sample's \"sample-desc\".\n",
+    "- The sample-desc for each sample in both INF and EFF was parsed and used to pair the samples.\n",
+    "\n",
+    "### Subset of data for demo\n",
+    "- For this demo we only look at [GO term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001).\n",
+    "- Antibiotic catabolic processes are expected to be higher in INF than in EFF.\n",
+    "\n",
+    "### Saving the demo dataset\n",
+    "This example subset of data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below:\n",
+    "\n",
+    "| column            | description                                                                                                       | dtype |\n",
+    "|-------------------|-------------------------------------------------------------------------------------------------------------------|-------|\n",
+    "| eff_id            | The run id for the MGnify analysis of the effluent sample.                                                        | str   |\n",
+    "| inf_id            | The run id for the MGnify analysis of the influent sample.                                                        | str   |\n",
+    "| sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str   |\n",
+    "| sampling_read     | Replicates?                                                                                                       | str   |\n",
+    "| eff_abundance     | The relative abundance of the GO term for a given effluent sample following preprocessing (i.e., CoDa and CLR)    | float |\n",
+    "| inf_abundance     | The relative abundance of the GO term for a given influent sample following preprocessing (i.e., CoDa and CLR)    | float |\n",
+    "\n",
+    "-----\n",
+    "\n",
+    "We will now proceed with reading in the prepared dataset. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "00d8f038",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>eff_id</th>\n",
+       "      <th>inf_id</th>\n",
+       "      <th>sampling_location</th>\n",
+       "      <th>sampling_read</th>\n",
+       "      <th>eff_abundance</th>\n",
+       "      <th>inf_abundance</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>ERR2985255</td>\n",
+       "      <td>ERR2814663</td>\n",
+       "      <td>TG</td>\n",
+       "      <td>READ2 Taxonomy ID:256318</td>\n",
+       "      <td>3.257283</td>\n",
+       "      <td>4.226819</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>ERR2985256</td>\n",
+       "      <td>ERR2814664</td>\n",
+       "      <td>MN</td>\n",
+       "      <td>READ2 Taxonomy ID:256318</td>\n",
+       "      <td>2.572841</td>\n",
+       "      <td>3.847191</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>ERR2985257</td>\n",
+       "      <td>ERR2814651</td>\n",
+       "      <td>AH</td>\n",
+       "      <td>READ1 Taxonomy ID:256318</td>\n",
+       "      <td>4.298777</td>\n",
+       "      <td>4.086841</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>ERR2985258</td>\n",
+       "      <td>ERR2814667</td>\n",
+       "      <td>TE</td>\n",
+       "      <td>READ1 Taxonomy ID:256318</td>\n",
+       "      <td>2.758982</td>\n",
+       "      <td>3.436752</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>ERR2985259</td>\n",
+       "      <td>ERR2814660</td>\n",
+       "      <td>FD</td>\n",
+       "      <td>READ1 Taxonomy ID:256318</td>\n",
+       "      <td>3.364675</td>\n",
+       "      <td>3.486673</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "       eff_id      inf_id sampling_location             sampling_read  \\\n",
+       "0  ERR2985255  ERR2814663                TG  READ2 Taxonomy ID:256318   \n",
+       "1  ERR2985256  ERR2814664                MN  READ2 Taxonomy ID:256318   \n",
+       "2  ERR2985257  ERR2814651                AH  READ1 Taxonomy ID:256318   \n",
+       "3  ERR2985258  ERR2814667                TE  READ1 Taxonomy ID:256318   \n",
+       "4  ERR2985259  ERR2814660                FD  READ1 Taxonomy ID:256318   \n",
+       "\n",
+       "   eff_abundance  inf_abundance  \n",
+       "0       3.257283       4.226819  \n",
+       "1       2.572841       3.847191  \n",
+       "2       4.298777       4.086841  \n",
+       "3       2.758982       3.436752  \n",
+       "4       3.364675       3.486673  "
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pandas as pd \n",
+    "\n",
+    "df_data = pd.read_csv(\n",
+    "    'https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv'\n",
+    ")\n",
+    "# sanity check \n",
+    "df_data.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7cf6c3d0",
+   "metadata": {},
+   "source": [
+    "## The permutation test\n",
+    "\n",
+    "Since these are paired samples, we will use a paired-sample permutation test via `acore.permutation_test.paired_permutation()`.\n",
+    "\n",
+    "The permutation test compares the observed value of the chosen metric (e.g., t-statistic, mean difference) with the values of that metric calculated on randomly shuffled permutations of the dataset.\n",
+    "\n",
+    "If we do 100 permutations of our data (in practice we should do many more) and only 1 of those permutations shows an effect at least as large as the observed effect, then it suggests there is a 1/100 chance (p-value of 0.01) of an effect at least this large occurring by chance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "5eefb720",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from acore.permutation_test import paired_permutation\n",
+    "\n",
+    "# optional: seed the random number generator for reproducibility\n",
+    "import numpy as np\n",
+    "rng = np.random.default_rng(12345)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "9eabb911",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'metric': <function ttest_rel at 0x10fee80e0>, 'observed': (np.float64(6.7389860601792275), np.float64(7.122287781830209e-07)), 'p_value': 0.0}\n",
+      "{'metric': <function mean at 0x10920a370>, 'observed': np.float64(0.5350826397500547), 'p_value': 0.0}\n",
+      "{'metric': <function mean at 0x10920a370>, 'observed': np.float64(0.5350826397500547), 'p_value': 0.0}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# try several metrics to demonstrate the available options\n",
+    "for metric in ['t-statistic', 'mean', np.mean]:\n",
+    "    result = paired_permutation(\n",
+    "        df_data['inf_abundance'].to_numpy(),\n",
+    "        df_data['eff_abundance'].to_numpy(), \n",
+    "        metric=metric, \n",
+    "        n_permutations=10000, \n",
+    "        rng=rng\n",
+    "    )\n",
+    "    # print the result for each metric\n",
+    "    print(result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1ad22530",
+   "metadata": {},
+   "source": [
+    "## Result\n",
+    "\n",
+    "Based on the permutation tests using the t-statistic and the mean difference, the reported permutation p-value is 0.0 with 10,000 permutations, so the probability of the observed metrics (t=6.739 and mean diff=0.535) occurring at random is estimated to be <0.0001."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
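
The preprocessing described in the notebook (relative abundances, then a centred log-ratio transform) can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the implementation of `acore.microbiome.internal_functions.calc_clr`: the function names, the toy count table and the `pseudocount` parameter are all hypothetical.

```python
import numpy as np


def relative_abundance(counts):
    """Scale each sample (row) so that its values sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def clr(composition, pseudocount=0.0):
    """Centred log-ratio: log of each part minus the mean log of its sample.

    `pseudocount` is a hypothetical knob for zero counts (log(0) is undefined);
    how acore handles zeros is not covered by this sketch.
    """
    composition = np.asarray(composition, dtype=float) + pseudocount
    log_parts = np.log(composition)
    return log_parts - log_parts.mean(axis=1, keepdims=True)


# Hypothetical toy table: 2 samples (rows) x 3 GO terms (columns)
counts = np.array([[10, 30, 60], [5, 5, 90]])
rel = relative_abundance(counts)
clr_values = clr(rel)
print(clr_values)
```

Each row of `clr_values` sums to zero, and the result is the same whether the raw counts or the relative abundances are passed in, because the CLR is invariant to per-sample scaling.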
diff --git a/docs/api_examples/permutation_testing.py b/docs/api_examples/permutation_testing.py
@@ -0,0 +1,101 @@
+# ---
+# jupyter:
+#   jupytext:
+#     text_representation:
+#       extension: .py
+#       format_name: percent
+#       format_version: '1.3'
+#       jupytext_version: 1.17.3
+#   kernelspec:
+#     display_name: .venv
+#     language: python
+#     name: python3
+# ---
+
+# %% [markdown]
+# # Permutation Tests
+#
+# In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).
+#
+# The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).
+#
+# For this demo we look at the GO term abundance tables generated by the MGnify pipeline. The values in the table are the absolute abundances of selected GO terms for each sample, which we then transform to relative abundances and centred log-ratios.
+#
+
+# %% [markdown]
+# ## Data preparation details
+#
+# ### Downloading
+# The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The influent (INF) and effluent (EFF) datasets contain paired samples, so we also downloaded the sample metadata (also available via the MGnify API) to assign the correct pairing.
+#
+# ### Preprocessing of abundances
+# - To account for technical variation due to sequencing technology limitations, we first divide the abundance values by the total reads for the sample, which gives relative abundances.
+# - The relative abundances are compositional data (CoDa), so we map them to unconstrained vectors using the centred log-ratio transformation [`acore.microbiome.internal_functions.calc_clr`](`acore.microbiome.internal_functions.calc_clr`) to avoid violating the assumptions of the frequentist statistics we apply.
+#
+# ### Preprocessing of the metadata
+# - The sample metadata needed for this demo (sampling location) were available in each sample's "sample-desc".
+# - The sample-desc for each sample in both INF and EFF was parsed and used to pair the samples.
+#
+# ### Subset of data for demo
+# - For this demo we only look at [GO term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001).
+# - Antibiotic catabolic processes are expected to be higher in INF than in EFF.
+#
+# ### Saving the demo dataset
+# This example subset of data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below:
+#
+# | column            | description                                                                                                       | dtype |
+# |-------------------|-------------------------------------------------------------------------------------------------------------------|-------|
+# | eff_id            | The run id for the MGnify analysis of the effluent sample.                                                        | str   |
+# | inf_id            | The run id for the MGnify analysis of the influent sample.                                                        | str   |
+# | sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str   |
+# | sampling_read     | Replicates?                                                                                                       | str   |
+# | eff_abundance     | The relative abundance of the GO term for a given effluent sample following preprocessing (i.e., CoDa and CLR)    | float |
+# | inf_abundance     | The relative abundance of the GO term for a given influent sample following preprocessing (i.e., CoDa and CLR)    | float |
+#
+# -----
+#
+# We will now proceed with reading in the prepared dataset.
+
+# %%
+import pandas as pd
+
+df_data = pd.read_csv(
+    "https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv"
+)
+# sanity check
+df_data.head()
+
+# %% [markdown]
+# ## The permutation test
+#
+# Since these are paired samples, we will use a paired-sample permutation test via `acore.permutation_test.paired_permutation()`.
+#
+# The permutation test compares the observed value of the chosen metric (e.g., t-statistic, mean difference) with the values of that metric calculated on randomly shuffled permutations of the dataset.
+#
+# If we do 100 permutations of our data (in practice we should do many more) and only 1 of those permutations shows an effect at least as large as the observed effect, then it suggests there is a 1/100 chance (p-value of 0.01) of an effect at least this large occurring by chance.
+
+# %%
+from acore.permutation_test import paired_permutation
+
+# optional: seed the random number generator for reproducibility
+import numpy as np
+
+rng = np.random.default_rng(12345)
+
+# %%
+# try several metrics to demonstrate the available options
+for metric in ["t-statistic", "mean", np.mean]:
+    result = paired_permutation(
+        df_data["inf_abundance"].to_numpy(),
+        df_data["eff_abundance"].to_numpy(),
+        metric=metric,
+        n_permutations=10000,
+        rng=rng,
+    )
+    # print the result for each metric
+    print(result)
+
+# %% [markdown]
+# ## Result
+#
+# Based on the permutation tests using the t-statistic and the mean difference, the reported permutation p-value is 0.0 with 10,000 permutations, so the probability of the observed metrics (t=6.739 and mean diff=0.535) occurring at random is estimated to be <0.0001.
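
To make the idea behind the paired test concrete, here is a minimal sketch of one common scheme: randomly flipping the sign of each within-pair difference and recomputing the mean difference. It is an assumption that this resembles what `paired_permutation` does internally. The function name `paired_sign_flip_test`, its parameters and the two-sided comparison are hypothetical, and the five pairs are the rows printed by `df_data.head()` in the notebook, so the numbers will not match the full-dataset results.

```python
import numpy as np


def paired_sign_flip_test(x, y, n_permutations=10_000, seed=12345):
    """Two-sided paired permutation test on the mean difference (illustrative)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    observed = diffs.mean()

    # Under the null hypothesis the labels within a pair are exchangeable,
    # so the sign of each difference is arbitrary: flip signs at random.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = (signs * diffs).mean(axis=1)

    # Count permutations at least as extreme as the observed statistic.
    n_extreme = int(np.sum(np.abs(permuted) >= abs(observed)))
    p_raw = n_extreme / n_permutations
    # The +1 correction keeps the estimate strictly above zero.
    p_corrected = (n_extreme + 1) / (n_permutations + 1)
    return observed, p_raw, p_corrected


# First five pairs from the df_data.head() output
inf = [4.226819, 3.847191, 4.086841, 3.436752, 3.486673]
eff = [3.257283, 2.572841, 4.298777, 2.758982, 3.364675]

observed, p_raw, p_corrected = paired_sign_flip_test(inf, eff)
print(observed, p_raw, p_corrected)
```

With `n_permutations` shuffles the smallest non-zero raw p-value is `1 / n_permutations`, so a raw value of 0.0 is best read as "below that resolution" rather than as an exact zero. The corrected estimate `(n_extreme + 1) / (n_permutations + 1)` avoids reporting zero at all.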
diff --git a/docs/index.md b/docs/index.md
@@ -23,6 +23,7 @@ api_examples/batch_correction
 api_examples/exploratory_analysis
 api_examples/ANCOVA_analysis
 api_examples/enrichment_analysis
+api_examples/permutation_testing
 ```
 
 ```{toctree}