iCRISPR can summary the result from CrisprCasFinder gracefully, and further CRISPR analysis including CRISPR identification, prepare, summary and visualization.
You can install the development version of iCRISPR from GitHub with:
# install.packages("devtools")
devtools::install_github("Asa12138/iCRISPR")Imports: dplyr, ggplot2, utils, pcutils (>= 0.2.5), ggnewscale, reshape2, scales
You need to use CrisprCasFinder to identify crispr in your genomes first.
MAG_test is a test result folder from CrisprCasFinder (contains
“TSV”,“GFF”,“result.json”,…), we can use pre_CCF_res to get a crispr
object.
library(iCRISPR)
# help analysis
library(pcutils)
library(dplyr)
library(ggplot2)
MAG_test <- system.file("extdata/MAG_test", package = "iCRISPR")
crispr <- pre_CCF_res(MAG_test)
crispr
class(crispr)crispr object contains CRISPR, Cas, Array and Spacer:
- CRISPR found in the whole genome (with multi-contigs), CRISPR_id is the key
- Cas system found in the whole genome, Cas_id is the key, one Cas_id contains multi cas protein
- Array: the array structure of each CRISPR, “LeftFLANK”,“CRISPRdr”,“CRISPRspacer”,“RightFLANK”
- Spacer: all spacer come from each CRISPR, Spacer_id is the key
If you have more than one result from CrisprCasFinder (maybe you perform
this analysis on lots of genomes or MAGs), you can use
multi_pre_CCF_res to get all crispr information.
# multi_crispr=multi_pre_CCF_res(input_folder="/path/your/folder",output_folder="./pre_CCF_res_out",threads=1)
data("multi_crispr", package = "iCRISPR")
multi_crispr
class(multi_crispr)multi_crispr is a list contains lots of crispr object.
We can use plot_crispr to visualize a CRISPR-Cas system in one
sequence easily.
Choose the crispr with evidence_level = 4 will be better.
plot_crispr(crispr, genome = "MAG_test", contig = "AAB-S01R1_k55_9399631_flag=0_multi=9.8751_len=26518")
plot_crispr(multi_crispr$GCA_002396005.1_ASM239600v1, "DGVW01000088.1", array = F)First, you can have a look at distribution of Cas-system type.
cas_type_res <- summary_cas_type(multi_crispr, each_genome = F)
plot(cas_type_res) + scale_fill_manual(values = pcutils::get_cols(15))
data(multi_crispr2)
cas_type_res2 <- summary_cas_type(multi_crispr2, each_genome = F)
# compare two multi_crispr, maybe from two group.
plot(cas_type_res, cas_type_res2) + scale_fill_manual(values = pcutils::get_cols(15))As we all know, Cas-system can be classified into 2 classes and at least six type, each have own feature and signature cas protein.
Use show_cas_type get an overview. (refer to Makarova, K. S. et
al. Nat Rev Microbiol
2020.)
show_cas_type()Secondly, summary the crispr and spacer number at different evidence levels:
level_res <- summary_levels(multi_crispr, each_genome = F)
plot(level_res)The evidence levels comes from CrisprCasFinder (Couvin, D. et al. Nucleic Acids Research 2018.):
Short candidate arrays made of one to three spacers often do not correspond to CRISPRs and are therefore given the lowest evidence level (rated 1). Evidence levels 2–4 are attributed based on combined degrees of similarity of repeats and spacers.
Putative CRISPR arrays with at least four spacers are assigned to levels 2–4 as follows: - repeats EBcons < 70 (level 2) - repeats EBcons ≥ 70 and spacers overall percentage identity > 8% (level 3) - repeats EBcons ≥ 70 and spacers overall percentage identity ≤ 8% (level 4).
CRISPR arrays having evidence-levels 3 and 4 may be considered as highly likely candidates, whereas evidence-levels 1 and 2 indicate potentially invalid CRISPR arrays.
And we can use get_spacer_fa to get spacer fasta sequence easily, use
evidence_level or cas to filter spacer in crispr array at
evidence_level 4 and with cas.
level4_fa <- get_spacer_fa(crispr, evidence_level = 4, cas = T)
head(level4_fa)
# then use write_fasta to save the fasta file
# pcutils::write_fasta()The Spacer_id includes these information splited by @:
- which genome the spacer comes from
- which sequence (contig) the spacer comes from
- which crispr array is on the sequence
- the start and end sites of the crispr array
- the start position of a spacer,the spacer’s length,and the number id
of spacers in the array combined with
_.
The sequence features of spacer is very important. You can use
summary_seq to get basic information and sequence complexity (Shannon
complexity and Sequence complexity):
level4_fa$sequence[1:5]
summary_seq(level4_fa$sequence[1:5])The Shannon Complexity value refers to (Konopka, A. K. Wiley, 2005.) and Sequence complexity refers to (https://resources.qiagenbioinformatics.com).
Then we can compare the sequence features between spacer with other type sequence like random sequence, tandem repeat, and consensus repeat in CRISPR.
# spacer
level4_fa <- get_spacer_fa(multi_crispr, evidence_level = 4, cas = T)
# random sequence
random_fa <- random_seq(500, mean_length = 35, sd_length = 3)
# tandem repeat
tandem_rep_fa <- random_tandem_repeat(500)
# consensus repeat
consensus_rep_fa <- get_consensus_rep_fa(multi_crispr)ps: tandem repeat here come from trf software result performed on
1,000 random bacteria genome.
seq_stat <- lapply(list(level4_fa, random_fa, tandem_rep_fa[, 1:2], consensus_rep_fa), \(i)summary_seq(i)[, c(-1, -2)])
seq_stat2 <- do.call(rbind, lapply(1:4, \(i)cbind(type = c("spacer", "random", "tandem_rep", "consensus_rep")[i], seq_stat[[i]])))
seq_stat2$type <- factor(seq_stat2$type, levels = c("random", "spacer", "consensus_rep", "tandem_rep"))
pcutils::get_cols(4, pal = "col3") -> cols
pcutils::add_theme()
pcutils::group_box(seq_stat2[c("shannon", "complexity", "GC_content")], "type", seq_stat2, alpha = T, mode = 3) + mytheme +
theme(axis.text.x = element_text(angle = 30, vjust = .5), legend.position = "none") +
scale_color_manual(values = cols) + scale_fill_manual(values = cols)The important question for CRISPR research is what these CRISPR do and how to do? Although it has been proved that most CRISPR play a role in targeting intruder as immune system, beyond this, many people also guess it can do something special like regulating gene expression.
So we need to figure out what these spacer target. Using Blast to find where these spacer come from is a good idea (which taxa, gene or inter-gene region).
We recommend to build a big database and running blast in Linux server, then come back to iCRISPR for next analysis.
blastn -task blastn-short -query $sample -db $bac23 -out blastn_output_big/bac/$sample.out -qcov_hsp_perc 95 -perc_identity 95 -max_hsps 3 -num_threads 12 -outfmt 6We try to compare the structure variation of two arrays. First, try to alignment all spacers, if there are identity between two spacers bigger than 90%, we think that is same spacer. Then, try to alignment whole array, as followed:
array_test <- random_seq(5)[, 2]
res <- compare_array(array1 = array_test, array2 = array_test[c(5, 1:3)])
plot(res)
# try to alignment whole array
align_array(res) -> align_res
plot(align_res)