57 commits
607a749
Create filter_dna_utils.py
CaptnClementine Oct 6, 2023
d279d38
Add 'is_dna' function to check input sequence
CaptnClementine Oct 6, 2023
2165478
Add 'count_gc_content' function
CaptnClementine Oct 6, 2023
a934e0c
Add 'is_in_gc_bounds' function to check sequence threshold
CaptnClementine Oct 6, 2023
85f43f7
Add 'is_in_length_bounds' function to check sequence threshold
CaptnClementine Oct 6, 2023
d3a3456
Add 'check_quality' function to check sequence threshold
CaptnClementine Oct 6, 2023
547e14e
Create gene_code_main_operations
CaptnClementine Oct 6, 2023
44a9be2
Add 'filter_dna' to filter sequences by GC-content, length and quality
CaptnClementine Oct 6, 2023
499b0e9
Create amino_analyzer_utils.py
CaptnClementine Oct 7, 2023
7a35ab5
Add 'is_aa' function to check if a sequence contains only amino acids
CaptnClementine Oct 7, 2023
5349b7f
Add 'choose_weight' function to choose the weight type
CaptnClementine Oct 7, 2023
e5d408b
Add 'aa_weight' function to calculate the protein weight
CaptnClementine Oct 7, 2023
f142af3
Add 'count_hydroaffinity' function to count it in protein sequence
CaptnClementine Oct 7, 2023
1d0610c
Add 'peptide_cutter' to identifies cleavage sites for "trypsin" and "…
CaptnClementine Oct 7, 2023
17859ad
Add 'one_to_three_letter_code' function to convert protein sequence
CaptnClementine Oct 7, 2023
8e2e973
Add 'sulphur_containing_aa_counter' function
CaptnClementine Oct 7, 2023
8b2a489
Add 'run_amino_analyzer' function to analyse protein sequence in one-…
CaptnClementine Oct 7, 2023
167f505
Create dna_rna_tools_utils.py
CaptnClementine Oct 7, 2023
9690bbd
Add 'is_dna' function to check if a sequence is DNA
CaptnClementine Oct 7, 2023
9cacb05
Add 'is_rna' function to check if a sequence is RNA
CaptnClementine Oct 7, 2023
11d5112
Add 'reverse' function to reverse a sequence
CaptnClementine Oct 7, 2023
bed0dcc
Add 'complement' function to find complement of sequence
CaptnClementine Oct 7, 2023
e40468c
Add 'reverse_complement' function to find reverse complement sequence
CaptnClementine Oct 7, 2023
d9dbaa8
Add 'reverse_transcription' function to reverse transcription of RNA
CaptnClementine Oct 7, 2023
0a64a42
Add 'type_rna_or_dna' function to detect type of sequence
CaptnClementine Oct 7, 2023
8855e07
Add 'has_start_codon' function to check if RNA has start codon
CaptnClementine Oct 7, 2023
1af134a
Add 'is_palindrome' function to check if a sequence is a palindrome
CaptnClementine Oct 7, 2023
49e1d63
Add 'transcribe' function to transcribe a DNA sequence into RNA
CaptnClementine Oct 7, 2023
a95b168
Add 'run_dna_rna_tools' function to make various opertaions with DNA/RNA
CaptnClementine Oct 7, 2023
40b9e86
Update README.md
CaptnClementine Oct 7, 2023
94751f3
Correct bounds in 'filter_dna' function
CaptnClementine Oct 8, 2023
6172210
Correct bounds in GC and length filter functions
CaptnClementine Oct 8, 2023
f412a04
Update README.md to correct bounds descrition in 'filter_dna' function
CaptnClementine Oct 8, 2023
a06d117
Update README.md
CaptnClementine Oct 8, 2023
5f682c4
Add typing library
CaptnClementine Oct 14, 2023
7c5ab39
Add 'read_fastq_file' to read fastq from file
CaptnClementine Oct 16, 2023
c0f0e82
Add 'write_filtered_fastq' to write the file
CaptnClementine Oct 16, 2023
5a984df
Change 'filter_dna' to read-write fastq file
CaptnClementine Oct 16, 2023
01c1dd5
Add file-option description of 'filter_dna'
CaptnClementine Oct 16, 2023
54de4ee
Create bio_files_processor.py and the directory for it
CaptnClementine Oct 16, 2023
3ef1ba8
Add 'convert_multiline_fasta_to_oneline' function
CaptnClementine Oct 16, 2023
777f05b
Add 'select_genes_from_gbk_to_fasta' function
CaptnClementine Oct 16, 2023
5da5b65
Add 'change_fasta_start_pos' function to shift the start position
CaptnClementine Oct 16, 2023
565bf3c
Add 'parse_blast_output' function to search and extract sequence id
CaptnClementine Oct 16, 2023
3018829
Create README.md
CaptnClementine Oct 17, 2023
f053940
Create 'example_fasta_for_convert_multiline_fasta_to_oneline.fasta'
CaptnClementine Oct 17, 2023
88e034d
Create example_for_change_fasta_start_pos
CaptnClementine Oct 17, 2023
7c591a7
Create 'example_for_select_genes_from_gbk_to_fasta'
CaptnClementine Oct 17, 2023
423d95d
Rename example_for_change_fasta_start_pos to example_for_change_fasta…
CaptnClementine Oct 17, 2023
bc5297f
Create example_for_parse_blast_output
CaptnClementine Oct 17, 2023
ed69608
Update 'dna_rna_tools_utils.py' for better PEP8-style
CaptnClementine Oct 17, 2023
560242d
Update 'amino_analyzer_utils.py' to improve PEP8 style
CaptnClementine Oct 17, 2023
e820d51
Update 'filter_dna_utils.py' for better PEP8 style
CaptnClementine Oct 17, 2023
0394377
Add float type for gc_bound in 'gene_code_main_operations' function
CaptnClementine Oct 17, 2023
5368bb3
Update 'filter_dna' function to write output file in directory
CaptnClementine Oct 18, 2023
96e99a7
Rename gene_code_main_operations to gene_code_main_operations.py
CaptnClementine Oct 31, 2023
3279f69
Update gene_code_utils/filter_dna_utils.py
CaptnClementine Oct 31, 2023
293 changes: 292 additions & 1 deletion README.md
@@ -1 +1,292 @@
# gene_code_tools
![image](https://github.com/CaptnClementine/gene_code_tools/assets/131146976/68f2999b-5b6e-4668-9865-fae0d4e0b778)
# gene_code_tools


`gene_code_tools` is a collection of Python functions for working with DNA, RNA, and protein sequences. It provides utility functions to check and manipulate sequences based on various criteria.

The main file, **gene_code_main_operations.py**, contains the three most important functions:

- [ ] filter_dna
- Filter a FASTQ file of sequences by GC content, length, and quality.
- [ ] run_amino_analyzer
- Perform basic protein analytics.
- [ ] run_dna_rna_tools
- Conduct fundamental analytics on RNA and DNA sequences.

| Function | Description | Returns | Arguments |
| --- | --- | --- | --- |
| filter_dna | Filter FASTQ sequences based on criteria like GC content, length, and quality. | Filtered sequences (file) | input_path (str), output_path (str), gc_bounds (tuple, int, or float), length_bounds (tuple or int, optional), quality_threshold (int, optional) |
| run_amino_analyzer | Perform various protein sequence operations. | Result of specified operation(s) | seq (str), args (Union[str, Tuple[str, ...]]) |
| run_dna_rna_tools | Perform DNA and RNA sequence operations. | Result of specified operation(s) | seq (str), args (Union[str, Tuple[str, ...]]) |

This README is a long one! If you just want to try one function, press Ctrl+F and search for the Usage paragraph 💜

Here's more detailed information and examples for each function:

## ⭐ function filter_dna

### Features

- [ ] Check if a sequence consists only of DNA characters.
- [ ] Calculate the GC content percentage of a DNA sequence.
- [ ] Check if a DNA sequence falls within specified GC content bounds.
- [ ] Check if a DNA sequence falls within specified length bounds.
- [ ] Check if the average quality score of a sequence exceeds a threshold.
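
The two numeric checks above can be sketched as follows. This is a minimal illustration, not the code from `filter_dna_utils.py`: the function names `gc_content` and `mean_quality` are hypothetical, and the quality calculation assumes the common Phred+33 encoding (ASCII code minus 33).

```python
def gc_content(seq: str) -> float:
    """Return the GC content of a sequence as a percentage (0-100)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Return the average quality score of a FASTQ quality string.

    Assumes Phred+33 encoding: each character's score is ord(char) - 33.
    """
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


print(gc_content("ATGC"))    # 50.0
print(mean_quality("II"))    # 40.0
```

A read would then pass the filter if its GC percentage lies within `gc_bounds` and its mean quality is at least `quality_threshold`.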

### Usage

Here's an example of how to use the functions provided by `gene_code_tools`:

```python
# Create an input file of FASTQ sequences like this (four lines per read):
# @SRX079804:1:SRR292678:1:1101:21885:21885 1:N:0:1 BH:ok
# ACAGCAACATAAACATGATGGGATGGCGTAAGCCCCCGAGATATCAGTTTACCCAGGATAAGAGATTAAATTATGAGCAACATTATTAA
# +SRX079804:1:SRR292678:1:1101:21885:21885 1:N:0:1 BH:ok
# FGGGFGGGFGGGFGDFGCEBB@CCDFDDFFFFBFFGFGEFDFFFF;D@DD>C@DDGGGDFGDGG?GFGFEGFGGEF@FDGGGFGFBGGD
# Add more sequences as needed

# Assumes the main file is importable as a module under this name
from gene_code_main_operations import filter_dna

# Paths to the input and output files (placeholder names; replace with your own)
input_file_name = "example.fastq"
output_file_name = "example_filtered"

# Specify your filtering criteria
gc_bounds = (0, 80)  # GC content bounds, in percent
length_bounds = 100  # Upper bound for sequence length
quality_threshold = 30  # Average quality threshold

# Filter the sequences based on the criteria
filtered_seqs = filter_dna(input_file_name, output_file_name, gc_bounds, length_bounds, quality_threshold)

# Use the filtered sequences as needed
print(filtered_seqs)
```

### Common Errors

When using Gene Code Tools, you might encounter common errors such as invalid input values or incorrect sequence formats. Here are some typical errors and how to handle them:

1. **Invalid gc_bounds or length_bounds**: Ensure that each bound is either a tuple of two non-negative values or a single non-negative number, which is treated as the upper limit. For example, `gc_bounds=(20, 80)` is valid, and `gc_bounds=44.4` sets an upper GC content limit of 44.4%. All bounds are inclusive.
2. **Invalid quality_threshold**: The `quality_threshold` should be an integer between 0 and 42 (inclusive).
3. **Invalid sequence characters**: When working with DNA sequences, make sure that the input sequences contain only valid DNA characters (A, T, G, C, a, t, g, c).
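
The bound rules in item 1 can be illustrated with a short sketch. The helper names `normalize_bounds` and `in_bounds` are hypothetical and are not taken from the package; the sketch only assumes what is stated above (a single number is an upper limit, values are non-negative, bounds are inclusive).

```python
from typing import Tuple, Union

Number = Union[int, float]
Bounds = Union[Number, Tuple[Number, Number]]


def normalize_bounds(bounds: Bounds) -> Tuple[Number, Number]:
    """Turn a single upper limit or a (lower, upper) pair into a pair."""
    if isinstance(bounds, (int, float)):
        lower, upper = 0, bounds  # a single number is the upper limit
    else:
        lower, upper = bounds
    if lower < 0 or upper < lower:
        raise ValueError(f"Invalid bounds: {bounds!r}")
    return lower, upper


def in_bounds(value: Number, bounds: Bounds) -> bool:
    """Check that value lies within the bounds, both ends inclusive."""
    lower, upper = normalize_bounds(bounds)
    return lower <= value <= upper


print(normalize_bounds(44.4))     # (0, 44.4)
print(in_bounds(80, (20, 80)))    # True
print(in_bounds(81, (20, 80)))    # False
```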

### Specified Variables and Parameters

Gene Code Tools provides the following specified variables and parameters:

- `input_path` (str): Path to the input FASTQ file, including its directory. You can build it with `os.path.join(dir_name, file_name)`.
- `output_filename` (str): Name of the output FASTQ file (without the file extension). By default, it is the same as the input name.
- `gc_bounds` (tuple, int, or float): GC content filtering bounds.
- `length_bounds` (tuple or int, optional): Length filtering bounds.
- `quality_threshold` (int, optional): Quality threshold for filtering sequences.
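
For orientation, here is a rough sketch of how a FASTQ file could be read into a dictionary before filtering. It is an assumption about the general approach, not the package's `read_fastq_file`: the name `parse_fastq` and the `{name: (sequence, quality)}` layout are hypothetical, and the sketch expects plain four-line records.

```python
from typing import Dict, Iterable, Tuple


def parse_fastq(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Group FASTQ lines into {read name: (sequence, quality)}.

    Assumes each record is exactly four lines: @name, sequence, +, quality.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    records = {}
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, quality = stripped[i:i + 4]
        records[header[1:]] = (seq, quality)  # drop the leading "@"
    return records


example = ["@read1\n", "ACGT\n", "+read1\n", "IIII\n"]
print(parse_fastq(example))  # {'read1': ('ACGT', 'IIII')}
```

With a real file, the same function can be fed an open file handle, for example `with open(input_path) as fastq: records = parse_fastq(fastq)`.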

### Examples

Here are some examples of how to use Gene Code Tools:

```python
# Example 1: Filtering DNA sequences
filtered_seqs = filter_dna(seqs_file, gc_bounds=(20, 80), length_bounds=50, quality_threshold=30)

# Example 2: Using a single upper bound for GC content
filtered_seqs = filter_dna(seqs_file, gc_bounds=44.4, length_bounds=(10, 100))

# Example 3: Using a single upper bound for sequence length
filtered_seqs = filter_dna(seqs_file, gc_bounds=(20, 80), length_bounds=1000)

# Example 4: Filtering without specifying bounds
filtered_seqs = filter_dna(seqs_file)
```

## ⭐ function run_amino_analyzer

### **Features**

- [ ] Calculate the weight of a protein sequence (average or monoisotopic).
- [ ] Count hydrophobic and hydrophilic amino acids in a protein sequence.
- [ ] Identify trypsin or chymotrypsin cleavage sites in a peptide sequence.
- [ ] Convert a protein sequence from one-letter to three-letter code.
- [ ] Count sulphur-containing amino acids in a protein sequence.

### Usage

To run the amino_analyzer tool, call ***run_amino_analyzer*** with the following arguments:

```python
from amino_analyzer import run_amino_analyzer

# Signature (the arguments after * are keyword-only):
# run_amino_analyzer(sequence, procedure, *, weight_type='average', enzyme='trypsin')
```

- `sequence (str):` The input protein sequence in one-letter code.
- `procedure (str):` The procedure to perform on your protein sequence.
- `weight_type: str = 'average':` keyword argument used by the `aa_weight` procedure. `weight_type = 'monoisotopic'` is the other option.
- `enzyme: str = 'trypsin':` keyword argument used by the `peptide_cutter` procedure. `enzyme = 'chymotrypsin'` is the other option.


**Available procedures list**
- `aa_weight` - calculates the weight of the amino acids in a protein sequence.
- `count_hydroaffinity` - counts the hydrophobic and hydrophilic amino acids in a protein sequence.
- `peptide_cutter` - identifies cleavage sites in a given peptide sequence using a specified enzyme (trypsin or chymotrypsin).
- `one_to_three_letter_code` - converts a protein sequence from one-letter amino acid code to three-letter code.
- `sulphur_containing_aa_counter` - counts sulphur-containing amino acids in a protein sequence.
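
To make the `peptide_cutter` idea concrete, here is a minimal sketch of cleavage-site detection. It is not the package's implementation: the function name `cleavage_sites` is hypothetical, and the rules are simplified assumptions (trypsin cuts after K or R, chymotrypsin after F, W, or Y, and the final residue is never a cleavage site), chosen so that the positions match the examples in this README.

```python
from typing import List

# Simplified assumption: residues after which each enzyme cuts
CLEAVAGE_RESIDUES = {
    "trypsin": {"K", "R"},
    "chymotrypsin": {"F", "W", "Y"},
}


def cleavage_sites(seq: str, enzyme: str = "trypsin") -> List[int]:
    """Return 1-based positions after which the enzyme would cut."""
    if enzyme not in CLEAVAGE_RESIDUES:
        raise ValueError(
            "You have chosen an enzyme that is not provided. "
            "Please choose between trypsin and chymotrypsin."
        )
    targets = CLEAVAGE_RESIDUES[enzyme]
    # Skip the last residue: there is nothing after it to cut off
    return [
        position
        for position, residue in enumerate(seq.upper()[:-1], start=1)
        if residue in targets
    ]


print(cleavage_sites("VLSPADKTNVKAAW"))                   # [7, 11]
print(cleavage_sites("VLSPADKTNVKAAWW", "chymotrypsin"))  # [14]
```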

You can also use each function separately by importing them in advance.

### Examples
To calculate protein molecular weight:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "aa_weight") # Output: 1481.715

run_amino_analyzer("VLSPADKTNVKAAW", "aa_weight", weight_type = 'monoisotopic') # Output: 1480.804
```

To count hydroaffinity:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "count_hydroaffinity") # Output: (8, 6)
```

To find trypsin or chymotrypsin cleavage sites:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "peptide_cutter") # Output: 'Found 2 trypsin cleavage sites at positions 7, 11'

run_amino_analyzer("VLSPADKTNVKAAWW", "peptide_cutter", enzyme = 'chymotrypsin') # Output: 'Found 1 chymotrypsin cleavage sites at positions 14'
```

To convert to three-letter code and count sulphur-containing amino acids:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "one_to_three_letter_code") # Output: 'ValLeuSerProAlaAspLysThrAsnValLysAlaAlaTrp'

run_amino_analyzer("VLSPADKTNVKAAWM", "sulphur_containing_aa_counter") # Output: The number of sulphur-containing amino acids in the sequence is equal to 1
```
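
The two procedures above are simple enough to sketch in a few lines. The names `to_three_letter` and `count_sulphur` are hypothetical stand-ins, not the package's functions; the mapping is the standard three-letter code for the 20 amino acids, and cysteine (C) and methionine (M) are the sulphur-containing ones.

```python
# Standard one-letter to three-letter amino acid codes
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(seq: str) -> str:
    """Convert a one-letter protein sequence to three-letter code."""
    return "".join(THREE_LETTER[residue] for residue in seq.upper())


def count_sulphur(seq: str) -> int:
    """Count cysteine (C) and methionine (M) residues."""
    return sum(residue in "CM" for residue in seq.upper())


print(to_three_letter("VLSPADKTNVKAAW"))
# ValLeuSerProAlaAspLysThrAsnValLysAlaAlaTrp
print(count_sulphur("VLSPADKTNVKAAWM"))  # 1
```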

### Common Errors
Here are some common issues you can come across while using the amino-analyzer tool and their possible solutions:

1. **ValueError: Incorrect procedure**
If you receive this error, it means that you provided an incorrect procedure when calling `run_amino_analyzer`. Make sure you choose one of the following procedures: `aa_weight`, `count_hydroaffinity`, `peptide_cutter`, `one_to_three_letter_code`, or `sulphur_containing_aa_counter`.

Example:
```python
run_amino_analyzer("VLSPADKTNVKAAW", "incorrect_procedure")
# Output: ValueError: Incorrect procedure. Acceptable procedures: aa_weight, count_hydroaffinity, peptide_cutter, one_to_three_letter_code, sulphur_containing_aa_counter
```

2. **ValueError: Incorrect sequence**
This error occurs if the input sequence provided to run_amino_analyzer contains characters that are not valid amino acids. Make sure your sequence only contains valid amino acid characters (V, I, L, E, Q, D, N, H, W, F, Y, R, K, S, T, M, A, G, P, C, v, i, l, e, q, d, n, h, w, f, y, r, k, s, t, m, a, g, p, c).

Example:
```python
run_amino_analyzer("VLSPADKTNVKAAW!", "aa_weight")
# Output: ValueError: Incorrect sequence. Only amino acids are allowed (V, I, L, E, Q, D, N, H, W, F, Y, R, K, S, T, M, A, G, P, C, v, i, l, e, q, d, n, h, w, f, y, r, k, s, t, m, a, g, p, c).
```

3. **ValueError: You have chosen an enzyme that is not provided**
This error occurs if you provide an enzyme other than "trypsin" or "chymotrypsin" when calling peptide_cutter. Make sure to use one of the specified enzymes.

Example:
```python
peptide_cutter("VLSPADKTNVKAAW", "unknown_enzyme")
# Output: You have chosen an enzyme that is not provided. Please choose between trypsin and chymotrypsin.
```
4. **TypeError: 'int' object is not iterable**
If you encounter this error, it means that you passed a non-iterable value, such as a number, where a sequence string is expected. Ensure that you're using the correct function and passing the correct arguments.

Example:
```python
result = count_hydroaffinity(123)
# Output: TypeError: 'int' object is not iterable
```



## ⭐ function run_dna_rna_tools

### **Features**

- [ ] Transcribe DNA sequences into RNA.
- [ ] Reverse transcribe RNA sequences into DNA.
- [ ] Check for the presence of a start codon in RNA sequences.
- [ ] Reverse sequences.
- [ ] Find the complement of DNA or RNA sequences.
- [ ] Find the reverse complement of DNA or RNA sequences.
- [ ] Check if a sequence is a palindrome.
- [ ] Determine the type of RNA or DNA sequences (DNA, RNA, or mixed).

### **Usage**

### **Main Function: `run_dna_rna_tools`**

The **`run_dna_rna_tools`** function is the main entry point for performing various DNA and RNA sequence operations. It takes a sequence and an optional set of additional arguments to specify the operation to perform.

```python
from dna_rna_tools_utils import run_dna_rna_tools

# Example 1: Transcribe DNA to RNA
result = run_dna_rna_tools("ATGC", "transcribe")
print(result) # Output: "AUGC"

# Example 2: Reverse RNA sequence
result = run_dna_rna_tools("AUGC", "reverse")
print(result) # Output: "CGUA"

```

### Arguments

- **`seq (str)`**: The input DNA or RNA sequence.
- **`args (Union[str, Tuple[str, ...]])`**: Additional sequences or options. If the last argument is a string, it specifies the operation to perform.

### Supported Operations

- **`transcribe`**: Transcribe a DNA sequence into RNA.
- **`reverse_transcription`**: Reverse transcribe an RNA sequence into DNA.
- **`has_start_codon`**: Check if an RNA sequence has a start codon.
- **`reverse`**: Reverse a sequence.
- **`complement`**: Find the complement of a DNA or RNA sequence.
- **`reverse_complement`**: Find the reverse complement of a DNA or RNA sequence.
- **`is_palindrome`**: Check if a sequence is a palindrome.

### **Gene Code Utilities (`dna_rna_tools_utils.py`)**

The **`dna_rna_tools_utils.py`** module contains the core functions used by the main function. It includes the following functions:

- **`is_dna(seq: str) -> bool`**: Check if a sequence is DNA.
- **`is_rna(seq: str) -> bool`**: Check if a sequence is RNA.
- **`transcribe(seq: str) -> str`**: Transcribe a DNA sequence into RNA.
- **`reverse(seq: str) -> str`**: Reverse a sequence.
- **`complement(seq: str) -> str`**: Find the complement of a DNA or RNA sequence.
- **`reverse_complement(seq: str) -> str`**: Find the reverse complement of a DNA or RNA sequence.
- **`reverse_transcription(seq: str) -> str`**: Perform reverse transcription on an RNA sequence.
- **`is_palindrome(seq: str) -> bool`**: Check if a sequence is a palindrome.
- **`has_start_codon(seq: str) -> Union[bool, str]`**: Check if an RNA sequence has a start codon.
- **`type_rna_or_dna(seqs: List[str]) -> str`**: Determine the type of RNA or DNA from a list of sequences.

You can use these functions directly if needed.
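
As a rough idea of what such helpers can look like, here is a sketch of `transcribe`, `complement`, and `reverse_complement` built on `str.translate`. It is an illustration under assumptions, not the contents of `dna_rna_tools_utils.py`: in particular, treating any sequence that contains U as RNA is a simplification made here.

```python
# Translation tables keep the case of each letter
DNA_COMPLEMENT = str.maketrans("ATGCatgc", "TACGtacg")
RNA_COMPLEMENT = str.maketrans("AUGCaugc", "UACGuacg")
TRANSCRIPTION = str.maketrans("Tt", "Uu")


def transcribe(seq: str) -> str:
    """Transcribe DNA into RNA by replacing T with U."""
    return seq.translate(TRANSCRIPTION)


def complement(seq: str) -> str:
    """Return the complement; sequences containing U are treated as RNA."""
    table = RNA_COMPLEMENT if "U" in seq.upper() else DNA_COMPLEMENT
    return seq.translate(table)


def reverse_complement(seq: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]


print(transcribe("ATGC"))          # AUGC
print(complement("AUGC"))          # UACG
print(reverse_complement("ATGC"))  # GCAT
```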

### **Common Errors**

- **Invalid Procedure**: If you specify an invalid operation, you will receive an "Invalid procedure. Check your sequences and try again." error.
- **Sequence Type Mismatch**: If you try to perform an operation on the wrong sequence type (e.g., transcribing an RNA sequence), you will receive a type-specific error message.
- **Unsupported Sequences**: If your sequences contain characters other than A, T, G, C, U, a, t, g, c, u, you will receive an "Input sequence must be DNA or RNA." error.

### **Specified Variables and Parameters**

- **`seq (str)`**: The input DNA or RNA sequence.
- **`args (Union[str, Tuple[str, ...]])`**: Additional sequences or options.
- **`procedure (str)`**: The specified operation to perform.
- **`seqs (List[str])`**: List of sequences to operate on.
- **`dna_or_rna (str)`**: Indicates whether the sequences are DNA, RNA, or mixed.
- **`new_seq (List[str])`**: List to store the results of operations.

### **Example**

```python
from dna_rna_tools_utils import run_dna_rna_tools

# Find the reverse complement of a DNA sequence
result = run_dna_rna_tools("ATGC", "reverse_complement")
print(result) # Output: "GCAT"

```

If you have any questions or suggestions, or encounter any issues while using gene_code_tools, feel free to reach out to [CaptnClementine](https://github.com/CaptnClementine) 💛