Data input, cleaning and pre-processing

This is the first step of any network analysis. We show here how to load typical expression data, pre-process them into a format suitable for network analysis, and clean the data by removing obvious outlier samples as well as genes and samples with excessive numbers of missing entries.

Data Input

We store the raw expression data, along with the gene and sample information, in anndata format in the geneExpr variable. You can pass your expression data, gene information and sample information either all together or separately:

Expression data, gene and sample information all together in anndata format

If your expression data is already in anndata format, you can define your PyWGCNA object by passing that variable directly. Keep in mind that X should be the expression matrix, var the gene information and obs the sample information.
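
As a minimal sketch of that layout, the snippet below builds an AnnData object from toy values taken from the example tables in this document. The `anndata` import is guarded so the snippet still runs without the package installed. The final `PyWGCNA.WGCNA(...)` call is left as a comment because its keyword names (`name`, `anndata`) are an assumption here and should be checked against the version you have installed.

```python
import numpy as np
import pandas as pd

# Toy values copied from the example tables in this document.
genes = ["ENSMUSG00000000003", "ENSMUSG00000000028"]
samples = ["sample_11615", "sample_11616"]

# X: expression matrix, one row per sample and one column per gene.
X = np.array([[12.04, 11.56],
              [1.35, 1.63]])
# obs: sample information, indexed by sample ID.
obs = pd.DataFrame({"Tissue": ["Cortex", "Cortex"]}, index=samples)
# var: gene information, indexed by gene ID.
var = pd.DataFrame({"gene_name": ["Pbsn", "Cdc45"]}, index=genes)

try:
    import anndata as ad
except ImportError:  # anndata is optional for this illustration
    ad = None

if ad is not None:
    adata = ad.AnnData(X=X, obs=obs, var=var)
    # Assumed constructor keywords, not verified against a specific release:
    # pyWGCNA_obj = PyWGCNA.WGCNA(name="example", anndata=adata)
```

The point of the sketch is the orientation: the number of rows in X must match obs and the number of columns must match var.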

Expression data, gene and sample information separately

You can pass either the path to the file that stores each piece of information or the table that contains it.

Gene Expression

The expression data is a table in which the rows are samples and the columns are genes. The first column (the index of the dataframe) holds the sample ID or sample name, and the first row (the columns of the dataframe) holds the gene ID or gene name. Both must be unique.

sample_id	ENSMUSG00000000003	ENSMUSG00000000028	ENSMUSG00000000031	ENSMUSG00000000037
sample_11615	12.04	11.56	16.06	13.18
sample_11616	1.35	1.63	1.28	1
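
A table like the one above can be loaded and checked with pandas before it is handed to PyWGCNA. The sketch below is only an illustration: `read_expression` is a hypothetical helper, not part of PyWGCNA, and the tab separator is an assumption about how the file is stored. Duplicate gene IDs are checked on the raw header line because `pandas.read_csv` renames duplicated column names while parsing.

```python
import io

import pandas as pd

# Tab-separated text standing in for an expression file on disk.
raw = (
    "sample_id\tENSMUSG00000000003\tENSMUSG00000000028\t"
    "ENSMUSG00000000031\tENSMUSG00000000037\n"
    "sample_11615\t12.04\t11.56\t16.06\t13.18\n"
    "sample_11616\t1.35\t1.63\t1.28\t1\n"
)


def read_expression(text, sep="\t"):
    """Hypothetical helper: parse a samples x genes table and check unique IDs."""
    # Check gene IDs on the raw header, before pandas can rename duplicates.
    gene_ids = text.splitlines()[0].split(sep)[1:]
    if len(gene_ids) != len(set(gene_ids)):
        raise ValueError("gene IDs (first row) must be unique")
    # The first column becomes the dataframe index (sample IDs).
    table = pd.read_csv(io.StringIO(text), sep=sep, index_col=0)
    if not table.index.is_unique:
        raise ValueError("sample IDs (first column) must be unique")
    return table.astype(float)


expr = read_expression(raw)
```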

Gene Information

The gene information is a table that contains additional information about each gene. The first column is the index and must contain the same gene IDs as the first row of the gene expression data.

gene_id	gene_name	gene_type
ENSMUSG00000000003	Pbsn	protein_coding
ENSMUSG00000000028	Cdc45	protein_coding
ENSMUSG00000000031	H19	lncRNA
ENSMUSG00000000037	Scml2	protein_coding

Sample Information

The sample information is a table that contains additional information about each sample. The first column is the index and must contain the same sample IDs as the first column of the gene expression data.

Sample_id	Age	Tissue	Sex	Genotype
sample_11615	4mon	Cortex	Female	5xFADHEMI
sample_11616	4mon	Cortex	Female	5xFADWT
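
Because the three tables are linked only through their IDs, it can help to confirm that they line up before building the object. The following sketch uses a hypothetical helper, `align_tables`, which is not part of PyWGCNA. It reports IDs present in the expression table but missing from the gene or sample information, and returns both information tables reordered to match the expression table.

```python
import pandas as pd

expr = pd.DataFrame(
    [[12.04, 11.56], [1.35, 1.63]],
    index=["sample_11615", "sample_11616"],
    columns=["ENSMUSG00000000003", "ENSMUSG00000000028"],
)
gene_info = pd.DataFrame(
    {"gene_name": ["Cdc45", "Pbsn"], "gene_type": ["protein_coding"] * 2},
    index=["ENSMUSG00000000028", "ENSMUSG00000000003"],  # deliberately shuffled
)
sample_info = pd.DataFrame(
    {"Age": ["4mon", "4mon"], "Genotype": ["5xFADHEMI", "5xFADWT"]},
    index=["sample_11615", "sample_11616"],
)


def align_tables(expr, gene_info, sample_info):
    """Hypothetical helper: fail on missing IDs, then reorder to match expr."""
    missing_genes = expr.columns.difference(gene_info.index)
    missing_samples = expr.index.difference(sample_info.index)
    if len(missing_genes) or len(missing_samples):
        raise ValueError(
            f"missing gene IDs: {list(missing_genes)}; "
            f"missing sample IDs: {list(missing_samples)}"
        )
    return gene_info.loc[expr.columns], sample_info.loc[expr.index]


gene_info_aligned, sample_info_aligned = align_tables(expr, gene_info, sample_info)
```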

Other parameters

We suggest checking the following parameters before starting any analysis.

name: name of the WGCNA object, used when visualizing data (default: WGCNA)
save: whether to save the results of important steps (if you set it to True, you need write access to the output directory)
outputPath: where to save your data; otherwise it is stored next to the code
TPMcutoff: cutoff for removing genes that are expressed below this value across samples
networkType: type of network (default: signed hybrid; options: unsigned, signed and signed hybrid)
adjacencyType: type of adjacency matrix (default: signed hybrid; options: unsigned, signed and signed hybrid)
TOMType: type of topological overlap matrix (TOM) (default: signed; options: unsigned and signed)
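
To make the options above concrete, here is a small sketch that collects the settings in a dictionary and rejects values outside the documented choices. `check_params` is a hypothetical helper written for this illustration, not a PyWGCNA function. The defaults and options are the ones listed above, and the write-access test falls back to the current working directory as an assumption about what "next to the code" means.

```python
import os

# Defaults and allowed values as listed in this document.
DEFAULTS = {
    "name": "WGCNA",
    "networkType": "signed hybrid",
    "adjacencyType": "signed hybrid",
    "TOMType": "signed",
}
OPTIONS = {
    "networkType": {"unsigned", "signed", "signed hybrid"},
    "adjacencyType": {"unsigned", "signed", "signed hybrid"},
    "TOMType": {"unsigned", "signed"},
}


def check_params(**overrides):
    """Hypothetical helper: merge overrides into the defaults and validate them."""
    params = {**DEFAULTS, **overrides}
    for key, allowed in OPTIONS.items():
        if params[key] not in allowed:
            raise ValueError(
                f"{key}={params[key]!r}; expected one of {sorted(allowed)}"
            )
    if params.get("save"):
        # Saving results requires write access to the output directory.
        out_dir = params.get("outputPath") or os.getcwd()
        if not os.access(out_dir, os.W_OK):
            raise PermissionError(f"no write access to {out_dir!r}")
    return params


params = check_params(networkType="signed", TPMcutoff=1)
```

Checking the values up front gives a clear error message before any expensive network construction starts.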

For more in-depth documentation, look here.

Data cleaning and pre-processing

PyWGCNA checks data for genes and samples with too many missing values.

Remove genes whose expression does not exceed the TPMcutoff value (default: 1) in any sample.
Use the goodSamplesGenes() function to find genes and samples with too many missing values.
Cluster the samples (using hierarchical clustering from scipy) to see if there are any obvious outliers. You can set the height at which the tree is cut with the cut value. By default, no sample is removed by hierarchical clustering.
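
The three steps above can be sketched with pandas and scipy as follows. This is an illustration under stated assumptions, not PyWGCNA's implementation: the function names, the 50% missing-value threshold, the average linkage, the zero-filling of missing values before clustering, and the rule of keeping the largest cluster are all choices made for this example.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage


def drop_low_expressed_genes(expr, tpm_cutoff=1.0):
    """Keep genes (columns) that exceed the cutoff in at least one sample."""
    return expr.loc[:, (expr > tpm_cutoff).any(axis=0)]


def drop_sparse(expr, max_missing=0.5):
    """Drop genes, then samples, whose fraction of missing values is too high."""
    expr = expr.loc[:, expr.isna().mean(axis=0) <= max_missing]
    return expr.loc[expr.isna().mean(axis=1) <= max_missing]


def drop_outlier_samples(expr, cut=float("inf")):
    """Cluster samples and keep the largest cluster below the cut height."""
    if not np.isfinite(cut):
        return expr  # default: no sample is removed
    tree = linkage(expr.fillna(0).to_numpy(), method="average")
    labels = fcluster(tree, t=cut, criterion="distance")
    largest = pd.Series(labels).value_counts().idxmax()
    return expr.loc[labels == largest]


# Toy data: g2 is never expressed above 1, and s4 is far from the other samples.
toy = pd.DataFrame(
    {"g1": [5.0, 6.0, 5.5, 50.0],
     "g2": [0.1, 0.2, 0.0, 0.3],
     "g3": [2.0, np.nan, 3.0, 40.0]},
    index=["s1", "s2", "s3", "s4"],
)
filtered = drop_sparse(drop_low_expressed_genes(toy))
kept_default = drop_outlier_samples(filtered)
kept_cut = drop_outlier_samples(filtered, cut=10)
```

With the default cut, every sample is kept. With `cut=10`, the toy sample `s4` falls into its own cluster and is dropped.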
