This is collection of tools for dna, rna, protein and fastq.
table of contents:
bioinf_tools — is project with fuctions for dna, rna, protein and fastq. Documentation contais recomendations and examples for execution. You can simply copy and execute this project. bioinf_tools consists of 3 tools:
- na_tools - tools for DNA and RNA.
- p_tools - tools for proteins.
- fq_tools - tools for filtering fastq files.
To install this project you need to download bioinf_tools.py and packages folder.
You need to import module bioinf_tools. In this example we will import bioinf_tools as bi for short.
import bioinf_tools as bina_tools takes only DNA and RNA sequences and command to execute. It takes at least 2 arguments first: first to second to last arguments dna/rna sequences and last argument must be command. Arguments must be string type. Arguments other then DNA and RNA will be ignored.
If count of sequences is greater than 1 run_dna_rna_tools returns list, if count of seqeunces is 1 it returns str or int depending on command.
There are 7 functions:
- transcribe
Returns string of DNA/RNA variant of inputed sequence.
- reverse
Returns reverse sequence of inputed DNA/RNA.
- complement
Returns compelement sequence of inputed DNA/RNA.
- reverse_complement
Returns reverse complement of inputed DNA/RNA.
- count_gc
Takes sequence as input in string format. Counts GC content and returns string of percent of GC content in sequence.
- dna_or_rna
Takes sequence as input in string format. Returns 'dna' or 'rna' string depending on content of input sequences. Precision is up to 2 digits after dot.
- get_sequence_length
Takes sequence as input in string format. Returns length of sequences as integer type.
# We aldready imported bioinf_tools, if you did not uncomment next line
#import bioinf_tools as bi
bi.na_tools('ATG', 'transcribe') # 'AUG'
bi.na_tools('ATG', 'reverse') # 'GTA'
bi.na_tools('AtG', 'complement') # 'TaC'
bi.na_tools('ATg', 'reverse_complement') # 'cAT'
bi.na_tools('ATGC', 'count_gc') # '50%'
bi.na_tools('AtG', 'dna_or_rna') # 'dna'
bi.na_tools('ATG', 'aT', 'get_sequence_length') # ['3', '2']Provide a tool with the sequence(s) of the protein(s) in 1-letter format (for example, DYKDDDDK) and the function needed. If you occasionally write down a non-peptide sequence, the programm will return an error.
Here is the catalogue of actions the user can choose:
- count_length: gives the length(s) of the protein sequence(s)
- count_nucleotide_length: counts the length(s) of the coding nucleotide sequence(s) of the protein sequence(s)
- count_molecular_mass: calculates molecular mass of the input (the algorithm takes into consideration water mass and subtracts it)
- show_content: shows the aminoacid content of the protein(s)
- convert_1_to_3: converts 1-letter format into 3-letter one
- count_extinction_280nm: counts the molar extinction coefficient (this function counts cystine contribution to extinction coefficient as two cysteins give 1 SS-bond)
bi.p_tools('count_length', 'DYKDDDDK') #8
bi.p_tools('count_nucleotide_length', 'DYKDDDDK') #24
bi.p_tools('count_molecular_mass', 'DYKDDDDK') #1012.982
bi.p_tools('show_content', 'DYKDDDDK') #{'A': 0, 'C': 0, 'D': 5, 'E': 0, 'F': 0, 'G': 0, 'H': 0, 'I': 0, 'K': 2, 'L': 0, 'M': 0, 'N': 0, 'P': 0, 'Q': 0, 'R': 0, 'S': 0, 'T': 0, 'V': 0, 'W': 0, 'Y': 1}
bi.p_tools('convert_1_to_3', 'DYKDDDDK') #'AspTyrLysAspAspAspAspLys'
bi.p_tools('count_extinction_280nm', 'DYKDDDDK') #1490Filters fastq files by specifiable parametrs.
- seqs (dict[str,str]) - a dictionary consisting of fastq sequences. Key - string, sequence name. The value is a tuple of two strings: sequence and quality.
- gc_bounds (int,tuple) - GC interval (in percent) for filtering, default is (0, 100). If input is single int - it will be ceiling (0, n).
- length_bounds (int,tuple) - length interval for filtering. Works exactly as gc_bounds. Default is (0, 2**32)
- quality_treshholds (int) - The threshold value of average read quality for the filter is 0 by default (phred33 scale).
- output (dict[str,str]) - filtered dictionary, which consists of entities that fulfill entered parametrs.
As example we will use first read from 1_control_psbA3_2019_minq7.fastq (https://zenodo.org/record/3736457/files/1_control_psbA3_2019_minq7.fastq?download=1) Which consist of 192 nucleotides, has 46.88% GC content and has mean quality score ~17.1.
fastq = {'ee15a423-b008-44be-a4b2': ('GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGACAAAGAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCATTATGCATGAACGTAATGCCCATTAGTTGTGAATCCACCATGCGCGGAAGATAGAGCGACAGGCAAGTCACAAAGACACCGACAACTGTC', "##$&$&/035881()'$0&*('-.=;685()$.%($'%%&#&)+..0,&+&%.-/+,%&()$3:0&@09BF=>CC8(78029F7=<=)+@+.6CCFFC@-8%2579<B8;88412134,,;:8./,#1#&(%((09;B=??48<=<@79*-:B540,8=B=444:<571-B5=ED2.56;110.5+,*)%%*")}
bi.fq_tools(fastq_file, (50, 100),(100,200),16) #{}
bi.fq_tools(fastq_file, (40, 100),(195,200),16) #{}
bi.fq_tools(fastq_file, (40, 100),(100,200),18) #{}
bi.fq_tools(fastq_file, (40, 100),(100,200),16) #{'ee15a423-b008-44be-a4b2': ('GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGACAAAGAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCATTATGCATGAACGTAATGCCCATTAGTTGTGAATCCACCATGCGCGGAAGATAGAGCGACAGGCAAGTCACAAAGACACCGACAACTGTC',
"##$&$&/035881()'$0&*('-.=;685()$.%($'%%&#&)+..0,&+&%.-/+,%&()$3:0&@09BF=>CC8(78029F7=<=)+@+.6CCFFC@-8%2579<B8;88412134,,;:8./,#1#&(%((09;B=??48<=<@79*-:B540,8=B=444:<571-B5=ED2.56;110.5+,*)%%*")}Sivtsev Aleksei