mhtjsh/covid19-scRNAseq-Tcell-analysis

Decoding Immune Signatures in Post-Acute COVID-19 Lung Sequelae

Project Overview

This project investigates the persistent immune alterations in the lungs of COVID-19 survivors by analyzing single-cell RNA sequencing (scRNA-seq) data. Building on the findings from the original study, "Immune signatures underlying post-acute COVID-19 lung sequelae", this analysis identifies and interprets the functional significance of differentially expressed genes (DEGs) to characterize the immune cell populations associated with post-acute COVID-19 complications.

The primary goal is to move beyond simple gene lists to understand the biological pathways that remain dysregulated in convalescent patients, contributing to a deeper understanding of Post-Acute Sequelae of COVID-19 (PASC). To explore the long-term consequences of these findings, this project culminates in the development of a hypothesized Ordinary Differential Equation (ODE) model that simulates the dynamics of lung damage and immune resolution.

Analysis Pipeline

The computational analysis was performed using the R-based Seurat package. The complete, commented code and step-by-step implementation details can be found in the Covid19 Patients Data analysis whole.ipynb notebook. Due to the high computational demand of the integration and normalization steps, the analysis was executed on a Google Cloud Console VM.

The high-level workflow included:

  1. Data Loading & Quality Control: Filtering of low-quality cells from the raw count matrices.

  2. Normalization & Integration: Normalization via SCTransform and integration using Seurat's anchor-based workflow to correct for technical batch effects across the 7 unique patient samples.

  3. Dimensionality Reduction & Clustering: PCA and UMAP for visualization, followed by graph-based clustering to identify distinct cell populations.

  4. Differential Gene Expression & Functional Analysis: Identification of DEGs between conditions and clusters, followed by Gene Ontology (GO) and pathway analysis to determine their biological significance.
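The quality-control idea in step 1 can be illustrated with a minimal, library-free Python sketch. This is not the notebook's Seurat code: the function names, the toy cells, and the thresholds (`min_genes`, `max_pct_mito`) are hypothetical, and the only domain fact assumed is that human mitochondrial gene symbols start with `MT-`.

```python
def qc_metrics(gene_counts):
    """Compute simple per-cell QC metrics from a {gene: count} mapping."""
    total = sum(gene_counts.values())
    n_genes = sum(1 for count in gene_counts.values() if count > 0)
    mito = sum(count for gene, count in gene_counts.items()
               if gene.startswith("MT-"))
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"n_genes": n_genes, "total_counts": total, "pct_mito": pct_mito}


def filter_cells(cells, min_genes, max_pct_mito):
    """Return the sorted names of cells that pass both QC thresholds."""
    kept = []
    for name, gene_counts in cells.items():
        metrics = qc_metrics(gene_counts)
        if metrics["n_genes"] >= min_genes and metrics["pct_mito"] <= max_pct_mito:
            kept.append(name)
    return sorted(kept)


# Toy cells with made-up counts (illustrative only, not project data).
toy_cells = {
    "cell_a": {"CD8A": 5, "NKG7": 3, "CCL5": 4, "MT-CO1": 1},
    "cell_b": {"CD8A": 1, "MT-CO1": 9, "MT-ND1": 5},  # mostly mitochondrial
    "cell_c": {"IL32": 2},                            # too few genes detected
}

# Hypothetical thresholds chosen to suit the toy data.
kept = filter_cells(toy_cells, min_genes=3, max_pct_mito=20.0)
print(kept)
```

Real datasets need much larger gene-count thresholds than this toy example; suitable values depend on the tissue and chemistry and are not stated in this README.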

Results & Discussion

The analysis successfully identified significant differences between the immune cell profiles of healthy donors and convalescent COVID-19 patients. The following sections detail the progressive steps of dimensionality reduction, visualization, and functional interpretation.

Principal Component Analysis (PCA) and Dimensionality

After data integration, Principal Component Analysis (PCA) was performed to reduce the high-dimensional gene expression data into its most significant components of variation.

The elbow plot below visualizes the standard deviation of each principal component (PC). We use this plot to select the number of significant PCs to include in downstream analysis, typically choosing the point where the variance explained begins to plateau (the "elbow"). This ensures we capture the majority of the biological signal while excluding technical noise present in higher-dimension PCs.
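As a rough illustration of reading an elbow plot programmatically, the sketch below applies one crude heuristic: stop at the first PC whose standard deviation drops by less than a chosen relative amount. The heuristic, the `min_drop` value, and the standard deviations are all assumptions made for illustration; in this project the number of PCs was chosen by inspecting the plot.

```python
def cumulative_variance_fraction(stdevs):
    """Cumulative share of variance, relative to the PCs supplied
    (not relative to the total variance of the full dataset)."""
    variances = [s * s for s in stdevs]
    total = sum(variances)
    running, fractions = 0.0, []
    for variance in variances:
        running += variance
        fractions.append(running / total)
    return fractions


def elbow_by_drop(stdevs, min_drop=0.05):
    """Return the number of PCs kept before the relative drop in
    standard deviation between consecutive PCs falls below min_drop."""
    for k in range(1, len(stdevs)):
        relative_drop = (stdevs[k - 1] - stdevs[k]) / stdevs[k - 1]
        if relative_drop < min_drop:
            return k
    return len(stdevs)


# Hypothetical standard deviations for 15 PCs (illustrative values only).
example_stdevs = [12.0, 9.0, 7.0, 5.5, 4.5, 3.8, 3.3, 2.9, 2.6, 2.4,
                  2.35, 2.3, 2.26, 2.22, 2.19]

n_pcs = elbow_by_drop(example_stdevs)
fractions = cumulative_variance_fraction(example_stdevs)
print(n_pcs, round(fractions[n_pcs - 1], 3))
```

A threshold-based rule like this is sensitive to `min_drop`, so it is best treated as a sanity check on the visual choice rather than a replacement for it.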

A PCA plot provides a linear, two-dimensional representation of the data based on the first two principal components. This plot offers an initial glimpse into the data's structure, showing how cells cluster based on the greatest sources of variance.

Figure: The elbow plot above led us to use 10 PCs in our downstream analysis, as these seemed to explain most of the variance in the data. Plotting the data along the axes of the first two principal components showed significant separation between healthy and COVID-19 convalescent (diseased) groups. Seurat-determined clusters revealed less distinct separation.

Cellular Heterogeneity and Immune State

Visualization of the integrated dataset via UMAP reveals a clear separation of cells based on disease status, indicating a profound and persistent transcriptomic shift in the immune cells of post-COVID patients.

Figure: To the left is a UMAP of our integrated dataset, grouped by status as healthy or COVID-19 convalescent (diseased), which shows a clear distinction between the groups. To the right is a heat map of the top DEGs in each group.

Further clustering identified multiple distinct cell populations, suggesting significant cellular heterogeneity within the T-cell compartment of the lung.

Figure: To the left is a UMAP of our integrated dataset, grouped by Seurat-determined clusters. To the right is a heat map of the top DEGs in each cluster.

Further analysis of the Seurat-determined clusters reveals significant immunological shifts. We identified several clusters composed almost exclusively of cells from the COVID-19 recovery group. Within these recovery-dominated clusters, we observed high expression of the cell markers CD8A and NKG7. This finding indicates a persistent and localized population of CD8+ T-cells and Natural Killer (NK) cells in the lungs of convalescent patients. This aligns with the "abnormal CD8+ T cell population" discussed in the original paper and suggests these cells may be key drivers of the tissue damage associated with PASC.
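One simple way to check whether a cluster is composed almost exclusively of cells from one group is to tabulate the condition labels per cluster. The sketch below does this in plain Python; the toy labels and the 0.9 cutoff are hypothetical, and only the group names "COVID_Recovery" and "Healthy" come from this analysis.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_labels, condition_labels):
    """Return {cluster: {condition: fraction of the cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, condition in zip(cluster_labels, condition_labels):
        counts[cluster][condition] += 1
    composition = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        composition[cluster] = {cond: n / total for cond, n in counter.items()}
    return composition


def dominated_clusters(composition, condition, threshold=0.9):
    """Clusters in which `condition` makes up at least `threshold` of cells."""
    return sorted(cluster for cluster, fractions in composition.items()
                  if fractions.get(condition, 0.0) >= threshold)


# Toy per-cell labels (illustrative only, not project data).
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
conditions = ["Healthy", "Healthy", "COVID_Recovery",
              "COVID_Recovery", "COVID_Recovery", "COVID_Recovery",
              "COVID_Recovery", "Healthy", "COVID_Recovery", "Healthy"]

composition = cluster_composition(clusters, conditions)
dominated = dominated_clusters(composition, "COVID_Recovery")
print(dominated)
```

With only 7 patient samples, a cluster dominated by one condition could also reflect a single donor, so a per-sample breakdown would be a natural companion check.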

Differential Gene Expression Highlights a Pro-Inflammatory Environment

To understand the molecular basis for the separation observed in the UMAP plots, we performed differential gene expression analysis. The heatmap below shows the top 10 genes that distinguish the "COVID_Recovery" and "Healthy" groups. The expression pattern points towards a sustained inflammatory state in the convalescent group.

Similarly, identifying marker genes for each cell cluster revealed unique gene signatures, which are visualized in the following heatmap.

Functional Enrichment Analysis Reveals Sustained Immune Activation

The core of this project's outcome lies in the functional analysis of the differentially expressed genes. By performing Gene Ontology (GO) and pathway analysis (detailed results available in report.pdf and result.csv), we can interpret the biological meaning behind the long lists of DEGs.
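For readers unfamiliar with over-representation analysis, the sketch below shows one common formulation: a one-sided hypergeometric test asking whether a DEG list overlaps a gene set more than expected by chance. It is an illustration only. The exact statistics applied by the tools used in this project are not described here, and all gene counts in the example are made up.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_term, n_deg, n_overlap):
    """One-sided p-value P(X >= n_overlap), where X is the number of
    term-annotated genes among n_deg genes drawn without replacement
    from a universe of n_universe genes, n_term of which are annotated."""
    max_overlap = min(n_term, n_deg)
    total = comb(n_universe, n_deg)
    tail = sum(comb(n_term, k) * comb(n_universe - n_term, n_deg - k)
               for k in range(n_overlap, max_overlap + 1))
    return tail / total


# Hypothetical numbers: 20,000 background genes, 400 annotated to a term,
# 200 DEGs, of which 20 carry the annotation (4 expected by chance).
p_example = hypergeom_enrichment_p(20000, 400, 200, 20)
print(p_example)
```

In practice each term's p-value would also need multiple-testing correction, since hundreds of terms are tested at once.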

Our analysis revealed that the genes upregulated in convalescent COVID-19 patients are significantly enriched in pathways associated with:

  • T-Cell Activation and Cytotoxicity: Terms such as "T cell activation" and "leukocyte mediated cytotoxicity" were highly significant. This aligns with the original paper's hypothesis of persistent T-cell activity due to lingering viral antigens. Our pathway analysis corroborated this in detail, with databases like Reactome specifically identifying the "phosphorylation of CD3 and TCR zeta chains" and DAVID highlighting "T cytotoxic cell surface molecules"—both critical upstream events in T-cell activation and proliferation. This suggests that cytotoxic T-cells remain in a heightened state of alert long after the initial infection, potentially contributing to chronic lung damage.

  • Interferon Signaling: A strong enrichment for "response to interferon-gamma" and "type I interferon signaling pathway" was observed. Interferons are critical for antiviral defense, but their prolonged expression is a hallmark of many chronic inflammatory and autoimmune conditions. This sustained interferon signature is a key indicator of an immune system that has failed to return to homeostasis.

  • Inflammatory Response: General inflammatory pathways, including "inflammatory response" and "cytokine-mediated signaling pathway," were also prominent. This points to a broad, non-specific inflammatory environment persisting in the lungs of post-COVID patients.

These findings strongly suggest that the long-term sequelae of COVID-19 are not due to a failure to clear the virus, but rather to an immune system that has become "stuck" in a pro-inflammatory, antiviral state. This sustained activation, particularly of cytotoxic T-cells and interferon pathways, likely drives the chronic inflammation and impaired lung function seen in PASC.

Translating Static Signatures into a Dynamic Model with ODEs

The scRNA-seq and functional analyses revealed a persistent, pro-inflammatory state driven by T-cell activation and interferon-gamma signaling. To translate this static snapshot into a dynamic hypothesis, we developed an Ordinary Differential Equation (ODE) model. The goal was to simulate the interaction between the immune response and lung tissue over time to explore a biologically plausible mechanism for the development of PASC.

The complete, commented Python implementation can be found in the ode_model_for_covid19_data.ipynb notebook in the project's root directory.

The Conceptual Model

The model is based on the interaction between three key populations derived from our functional analysis:

  • Healthy Lung Cells (L)

  • Activated Cytotoxic T-Cells (T)

  • Interferon-gamma (IFN-γ), a key inflammatory cytokine (I)

This system tells a story: a strong, lingering T-cell population damages lung cells while producing IFN-γ. Over time, without a persistent stimulus, the T-cell population wanes, and the immune system eventually resolves, but not before causing permanent tissue damage.

The Mathematical Model

This biological narrative is translated into the following system of equations:

Equation 1: Change in Lung Cells (dL/dt)

The rate of lung cell destruction is proportional to the interaction between lung cells and T-cells.

$$ \frac{dL}{dt} = -\gamma \cdot L \cdot T $$

Equation 2: Change in T-cells (dT/dt)

The T-cell population is amplified by IFN-γ feedback but is primarily reduced by natural decay.

$$ \frac{dT}{dt} = \alpha \cdot T \cdot I - \delta_T \cdot T $$

Equation 3: Change in IFN-γ (dI/dt)

IFN-γ concentration increases from T-cell production and decreases via natural decay.

$$ \frac{dI}{dt} = \beta \cdot T - \delta_I \cdot I $$
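The three equations above can be integrated numerically with a few lines of Python. The following is a minimal sketch using a hand-written fourth-order Runge-Kutta step so that it needs no external libraries; it is not the notebook's implementation. The parameter values and initial conditions are hypothetical, chosen only so that natural decay outweighs the IFN-γ feedback and the T-cell population resolves.

```python
def derivatives(state, params):
    """Right-hand side of the L, T, I system defined in Equations 1-3."""
    L, T, I = state
    dL = -params["gamma"] * L * T
    dT = params["alpha"] * T * I - params["delta_T"] * T
    dI = params["beta"] * T - params["delta_I"] * I
    return (dL, dT, dI)


def rk4_step(state, dt, params):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    k1 = derivatives(state, params)
    k2 = derivatives(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), params)
    k3 = derivatives(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), params)
    k4 = derivatives(tuple(s + dt * k for s, k in zip(state, k3)), params)
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def simulate(initial_state, params, t_end=50.0, dt=0.01):
    """Return the list of (L, T, I) states from t = 0 to t = t_end."""
    n_steps = int(round(t_end / dt))
    trajectory = [initial_state]
    state = initial_state
    for _ in range(n_steps):
        state = rk4_step(state, dt, params)
        trajectory.append(state)
    return trajectory


# Hypothetical, unit-free parameters (not fitted to any data).
params = {"gamma": 0.1, "alpha": 0.05, "beta": 0.5,
          "delta_T": 0.2, "delta_I": 1.0}

# Start with intact lung (L = 1), a large T-cell pool (T = 1), no IFN-gamma.
trajectory = simulate((1.0, 1.0, 0.0), params)
final_L, final_T, final_I = trajectory[-1]
print(final_L, final_T, final_I)
```

With these illustrative values the lung-cell fraction falls and then levels off while T and I decay toward zero; the exact plateau depends on the parameters and should not be read as a result of this project.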

Simulation Results and Interpretation

Running the model with biologically plausible parameters that reflect a damaging but resolving immune response yields the following dynamics:

Figure: Simulation of the ODE model over 50 days. The top panel shows the decline in healthy lung cells. The bottom panel shows the dynamics of the immune mediators.

  • Lung Cell Count (Top Panel): The population of healthy lung cells declines sharply when the immune response is strongest. The decline slows as the T-cell population wanes, eventually stabilizing at a new, lower baseline (~50% of the original count). This represents the permanent tissue damage characteristic of sequelae.

  • Immune Concentration (Bottom Panel): The Activated T-cell population (red line) starts high and steadily decays, representing the eventual resolution of the inflammatory response. The IFN-γ concentration (purple dashed line) spikes briefly due to T-cell activity before decaying as its source disappears.
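A short derivation, included as a worked step and not taken from the notebook, shows why the lung-cell curve levels off at a non-zero value in this model. Dividing Equation 1 by L and integrating over time gives:

$$ L(t) = L(0) \cdot \exp\left( -\gamma \int_0^t T(s) \, ds \right) $$

If T decays quickly enough for its cumulative exposure A (the integral of T over all time) to be finite, as in the resolving scenario simulated here, then L approaches a fixed limit:

$$ L(\infty) = L(0) \cdot e^{-\gamma A} $$

Under this model the size of the permanent loss depends only on the product of γ and A. A plateau near 50% of the starting count, for example, corresponds to γ · A = ln 2 ≈ 0.69.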

This model successfully visualizes our core hypothesis: PASC may result from an acute, aggressive inflammatory phase that causes lasting damage, even if the underlying immune activity eventually returns to a non-inflammatory baseline. It provides a dynamic framework that connects the molecular signatures from the scRNA-seq data to the clinical outcome of stable, long-term sequelae.

Project Contributions and Future Directions

Contributions of This Project

This project serves as a focused re-analysis of the scRNA-seq data from the original paper, "Immune signatures underlying post-acute COVID-19 lung sequelae". While the original study performed a broad, multi-faceted investigation, this project deliberately narrows its scope to the bronchoalveolar lavage (BAL) T-cells to perform a more controlled and detailed comparison between healthy and convalescent individuals.

By employing a distinct and arguably more robust bioinformatic pipeline, this project not only validates the original findings but also contributes several new and more detailed insights:

  1. Advanced Normalization and Analysis: This project utilized a more advanced normalization method, SCTransform, which uses a regularized negative binomial regression model to more effectively remove technical variability from sequencing depth. This choice was made with the goal of reducing the rate of false positives and increasing the reliability of the downstream differential gene expression results. The clustering was performed using Seurat's SNN modularity optimization-based approach.

  2. Identification of a Specific Inflammatory Signature: A key finding of this re-analysis is the significant upregulation of pro-inflammatory genes like IL32 and CCL5 in cell clusters dominated by the COVID-19 recovery group. This points to a distinct and consistent inflammatory response in sequelae individuals—an aspect not explored in detail with the scRNA-seq data in the original publication. The clear separation of healthy and diseased cells in our UMAP plots provides strong visual support for this specific transcriptomic state.

  3. Detailed and Corroborated Pathway Analysis: By leveraging two distinct databases, Reactome and DAVID, this project provides a more granular view of the dysregulated pathways. While validating the original paper's findings of T-cell activation, our dual analysis offers more specific mechanistic details. For instance, Reactome highlighted the "phosphorylation of CD3 and TCR zeta chains," while DAVID pointed to "T cytotoxic cell surface molecules," both of which are critical upstream events in T-cell activation. This provides stronger, more direct evidence for the mechanisms hypothesized in the source study.

  4. Dynamic Hypothesis Modeling: A major contribution of this project is the development of a conceptual and mathematical ODE model. This model translates the static gene expression signatures (persistent T-cell activation, IFN-γ signaling) into a dynamic system that mechanistically links the initial immune response to the clinical outcome of permanent but stable lung damage. This provides a testable, quantitative framework for understanding the temporal dynamics of PASC.
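To make the normalization idea in item 1 more concrete, the sketch below computes Pearson residuals under a negative binomial model, which is the kind of quantity SCTransform produces. It is a deliberately simplified stand-in, not SCTransform itself: the expected count comes from a plain depth-proportional model instead of a regularized regression, and the overdispersion `theta` is a fixed, hypothetical value.

```python
from math import sqrt


def pearson_residuals(counts, theta=100.0):
    """Negative binomial Pearson residuals for a cells x genes count matrix.

    Expected count: mu = cell_total * gene_total / grand_total
    Residual:       z  = (x - mu) / sqrt(mu + mu^2 / theta)
    Residuals are clipped to +/- sqrt(number of cells).
    """
    n_cells = len(counts)
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(column) for column in zip(*counts)]
    grand_total = sum(cell_totals)
    clip = sqrt(n_cells)
    residuals = []
    for row, cell_total in zip(counts, cell_totals):
        out = []
        for x, gene_total in zip(row, gene_totals):
            mu = cell_total * gene_total / grand_total
            if mu == 0:
                out.append(0.0)  # gene or cell with no counts at all
                continue
            z = (x - mu) / sqrt(mu + mu * mu / theta)
            out.append(max(-clip, min(clip, z)))
        residuals.append(out)
    return residuals


# Toy matrix: the second cell is the first cell sequenced twice as deeply.
depth_only = pearson_residuals([[10, 20], [20, 40]])

# Toy matrix: the third cell expresses the second gene above expectation.
with_signal = pearson_residuals([[10, 0], [10, 0], [10, 30]])
print(depth_only, with_signal[2][1])
```

In the first toy matrix every residual is zero, because the two cells differ only in depth; that is the sense in which this kind of model removes sequencing-depth variability while keeping genuine expression differences.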

In essence, this project acts as a valuable case study in how the re-analysis of publicly available data with alternative computational strategies can yield more detailed and novel biological insights, successfully building upon the foundational work of the original authors.

Future Directions

Building on the foundation of this work, several exciting avenues for future research emerge:

  • Granular Sub-Clustering: The major T-cell lineages (CD4+, CD8+) can be isolated and re-clustered to identify more specific subtypes, such as effector memory, central memory, resident memory, and exhausted T-cells. Understanding which of these specific subsets are expanded or dysregulated in PASC is a critical next step.

  • Cell-Cell Communication Modeling: The current analysis treats cells as independent entities. A powerful next step would be to use tools like CellChat or NicheNet to model receptor-ligand interactions. This could reveal how different T-cell subsets are communicating with each other and with other lung cells, and how this communication network is rewired after COVID-19.

  • Integration with T-Cell Receptor (TCR) Sequencing: The original dataset contains paired TCR-seq data. Integrating our gene expression analysis with this TCR data would be a highly impactful step. This would allow us to link the functional state (transcriptome) of a T-cell with its antigen specificity (clonotype), answering questions like: "Are the most expanded T-cell clones also the ones showing the highest expression of cytotoxic or exhaustion markers?" This would provide a direct link between the adaptive immune response to the virus and the long-term functional state of the T-cells.

  • Refining the ODE Model: The current three-component ODE model provides a strong conceptual framework for understanding the dynamics of inflammation. Future work could involve refining this model by incorporating additional cell types identified in the analysis (e.g., natural killer (NK) cells). Another important step would be to use experimental data to estimate key model parameters such as α (alpha), β (beta), and γ (gamma). Accurately calibrating these parameters would improve the model's predictive power and enable in-silico testing of potential therapeutic interventions aimed at reducing the initial burst of inflammation.

Conclusion

This project successfully leveraged single-cell RNA sequencing data to decode the persistent immune signatures in patients recovering from COVID-19. By moving beyond DEG identification to functional pathway analysis, we have demonstrated that the post-acute phase is characterized by a sustained and specific pattern of immune dysregulation. The key takeaway is the prolonged activation of cytotoxic T-cell and interferon-gamma signaling pathways, which provides a compelling molecular explanation for the chronic inflammation and tissue damage underlying post-COVID lung sequelae.

These findings highlight potential therapeutic targets aimed at resolving this persistent inflammation and restoring immune homeostasis in patients suffering from the long-term effects of COVID-19.

About

Analysis pipeline for single-cell RNA sequencing (scRNA-seq) data focused on T cells in COVID-19. Includes scripts and workflows for data processing, visualization, and identifying immune cell signatures and responses. Designed to accelerate reproducible research in immunology and single-cell genomics.
