Commit 17c4aaf

update doc and strand opp
2 parents bf6d730 + f06fc19 commit 17c4aaf

40 files changed

Lines changed: 1612 additions & 647 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,8 @@
22
# Orfmap data #
33
##########################
44
orfmap/data/*
5+
orfmap/data/scerevisiae/mapping*
6+
orfmap/data/scerevisiae/*/mapping*
57

68
# Byte-compiled / optimized / DLL files
79
__pycache__/

README.md

Lines changed: 123 additions & 119 deletions
@@ -1,17 +1,21 @@
1-
ORFMap
2-
===========
3-
1+
# ORFMap
42
ORFMap - A tool aimed at scanning a genome for stop-codon-delimited sequences (ORFs) and annotating them.
53

4+
## Summary
5+
* <p><a href="#descr">Description</a></p>
6+
* <p><a href="#installation">Installation</a></p>
7+
* <p><a href="#usage_descr">Usage description</a></p>
8+
* <p><a href="#usage_ex">Some usage examples</a></p>
69

7-
8-
Description
9-
-----------
10+
<h2><a name="descr">Description</a></h2>
1011

1112
From a genomic fasta file and its associated GFF, the program first scans the genome to retrieve all sequences
1213
delimited by stop codons. By default, only sequences at least 60 nucleotides long are kept.
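
The stop-to-stop scan described above can be illustrated with a minimal sketch. It is not ORFMap's actual code: the function name `stop_to_stop_segments`, the use of the standard stop codons (TAA, TAG, TGA), the forward-strand-only scan, and the choice to exclude the stop codons from the reported segment are all assumptions made for illustration.

```python
# Hypothetical sketch, not ORFMap's implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def stop_to_stop_segments(seq, frame=0, min_len=60):
    """Yield (start, end) 0-based half-open coordinates of the segments
    lying between two consecutive in-frame stop codons on the forward strand.

    Segments shorter than min_len nucleotides are skipped.
    """
    seq = seq.upper()
    prev_stop_end = None  # index just after the previous in-frame stop codon
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if prev_stop_end is not None and i - prev_stop_end >= min_len:
                yield prev_stop_end, i
            prev_stop_end = i + 3


# Toy sequence: one 60-nt segment and one 15-nt segment in frame 0.
toy = "TAA" + "GCT" * 20 + "TAG" + "GCT" * 5 + "TGA"
segments = list(stop_to_stop_segments(toy))
```

Only the 60-nucleotide segment passes the default length filter here; lowering `min_len` to 15 would also keep the second one.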
1314

14-
Those so-called ORF sequences are then annotated depending upon GFF element type(s) used as a reference. The CDS element type is always used as a reference but others can be added. By default an ORF sequence has 5 possible annotations:
15+
Those so-called ORF sequences are then annotated depending upon GFF element type(s) used as a reference.
16+
The CDS element type is always used as a reference but others can be added.
17+
18+
By default an ORF sequence has 5 possible annotations:
1519

1620
| ORF annotation | Condition |
1721
| --- | --- |
@@ -21,9 +25,11 @@ Those so-called ORF sequences are then annotated depending upon GFF element type
2125
| nc_ovp-CDS | if the ORF overlaps with a CDS in a different phase |
2226
| nc_intergenic | if the ORF does not overlap with anything |
2327

24-
**Note** that if an ORF sequence is tagged as 'c_CDS', this sequence is further processed to be cut at its 5' and 3' extremities that do not overlap with the CDS. If their length is above or equal to 60 nucleotides, then those subsequences can be assigned as nc_5-CDS and/or nc_3-CDS.
28+
**Note:**
29+
If an ORF sequence is tagged as 'c_CDS', this sequence is further processed to be cut at its 5' and 3' extremities that do not overlap with the CDS. If their length is above or equal to 60 nucleotides, then those subsequences can be assigned as nc_5-CDS and/or nc_3-CDS.
30+
<br>
31+
<br>
2532

26-
2733
The user can also specify what GFF element type(s) can be used as reference(s) to annotate ORF sequences in addition to the CDS type. For instance, if the user adds the tRNA element type, ORF sequences could now be assigned as nc_ovp-tRNA if they overlap with a tRNA. Thus 6 assignments would now be possible for an ORF sequence:
2834

2935
| ORF annotation | Condition |
@@ -36,93 +42,122 @@ The user can also specify what GFF element type(s) can be used as reference(s) t
3642
| nc_intergenic | if the ORF does not overlap with anything |
3743

3844
**Note on default parameters**:
39-
* CDS in the only element type used as a reference to annotate ORF sequences.
45+
* CDS is the only element type used as a reference to annotate ORF sequences.
4046
* the minimum nucleotide number required to consider an ORF sequence is set at 60 nucleotides
4147
* an ORF sequence is considered as overlapping with an element (e.g. CDS) if at least 70 % of its sequence overlaps with the element or if this element is totally included within the ORF sequence
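
The overlap rule in the last bullet can be sketched as follows. This is an illustrative reading of the rule, not ORFMap's code: the function name `overlaps`, the 1-based inclusive coordinates (as in GFF), and the `>=` comparison against the cutoff are assumptions.

```python
# Hypothetical sketch of the default overlap rule, not ORFMap's implementation.
def overlaps(orf_start, orf_end, elt_start, elt_end, co_ovp=0.7):
    """Return True if the ORF is considered to overlap the element.

    Coordinates are 1-based and inclusive (GFF convention, assumed here).
    """
    shared = min(orf_end, elt_end) - max(orf_start, elt_start) + 1
    if shared <= 0:
        return False  # no nucleotide in common
    orf_len = orf_end - orf_start + 1
    # Rule 1: the element is totally included within the ORF sequence.
    element_inside_orf = orf_start <= elt_start and elt_end <= orf_end
    # Rule 2: at least co_ovp of the ORF sequence is covered by the element.
    return element_inside_orf or shared / orf_len >= co_ovp
```

With the default cutoff, a 100-nucleotide ORF sharing 70 nucleotides with a CDS is counted as overlapping, while one sharing 69 nucleotides is not, unless the CDS lies entirely inside the ORF.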
4248

4349

44-
----------------------------------------
45-
Installation procedure from distribution
46-
----------------------------------------
50+
<h2><a name="installation">Installation</a></h2>
4751

52+
### 1. Download and uncompress the latest release archive
4853

49-
I. First steps
50-
------------
54+
#### Download the latest release
55+
Latest release:
56+
[ ![](./docs/images/download-flat/16x16.png "Click to download the latest release")](https://github.com/nchenche/orfmap/releases/latest/)
5157

52-
1. Uncompress and untar the package:
58+
#### Uncompress the archive
59+
If you downloaded:
60+
* the *.zip* file: ```unzip ORFMap-x.x.x.zip```
61+
* the *.tar.gz* file: ```tar -xzvf ORFMap-x.x.x.tar.gz```
62+
63+
64+
### 2. Create an isolated environment
65+
Although not strictly necessary, this step is highly recommended (it will allow you to work on different projects without having
66+
any conflicting library versions).
67+
68+
#### Install virtualenv
69+
```bash
70+
python3 -m pip install virtualenv
71+
```
5372

73+
#### Create a virtual python3 environment
5474
```bash
55-
tar -xzvf orfmap-0.0.tgz
75+
virtualenv -p python3 my_env
5676
```
5777

58-
2. Go to the ORFMap directory
59-
78+
#### Activate the created environment
6079
```bash
61-
cd ORFMap-0.0
80+
source my_env/bin/activate
6281
```
6382

83+
Once activated, any python library you install using pip will be installed solely in this isolated environment.
84+
Every time you need to work with libraries installed in this environment (i.e. work on your project), you'll have
85+
to activate it.
6486

65-
II. Install the package in a virtual environment (the recommended way to avoid dependencies conflict)
66-
------------------------------------------------------------------
87+
Once you're done working on your project, simply type `deactivate` to exit the environment.
6788

68-
1. Install virtualenv (if necessary)
6989

70-
```bash
71-
pip install virtualenv
72-
```
90+
### 3. Install ORFMap in your isolated environment
7391

74-
2. Create a virtual environment (for python3)
92+
Be sure your virtual environment is activated, and then follow the procedure described below.
7593

94+
#### Go to the ORFMap directory
95+
7696
```bash
77-
virtualenv -p python3 env_orfmap
97+
cd ORFMap-x.x.x/
7898
```
7999

80-
3. Activate your virtual environment
100+
#### Install
81101

82-
```bash
83-
source env_orfmap/bin/activate
102+
```bash
103+
python setup.py install
84104
```
85105

86-
**Note that once activated:**:
87-
* you should see the name of your virtual environment in brackets on your terminal line
88-
* any python commands will now work within your virtual environment
89-
90-
106+
or
107+
```bash
108+
pip install .
109+
```
91110

92-
4. Install ORFMap in your virtual environment
93111

94-
```bash
95-
python setup.py install
96-
```
112+
<h2><a name="usage_descr">Usage description</a></h2>
97113

98-
**Note**: once installed, you should be able to run orfmap (see below). Once you don't need to use it, you can deactivate or exit your virtual environment by executing in the terminal:
114+
To see all options available:
99115

100-
```bash
101-
deactivate
116+
```
117+
run_orfmap -h
102118
```
103119

104-
From this installation, everytime you'll want to use orfmap you'll need to activate your dedicated virtual environment.
120+
This command will show:
105121

122+
<pre>usage: run_orfmap [-h] -fna [FNA] -gff [GFF] [-chr [CHR]] [-types_only TYPES_ONLY [TYPES_ONLY ...]]
123+
[-types_except TYPES_EXCEPT [TYPES_EXCEPT ...]] [-o_include O_INCLUDE [O_INCLUDE ...]] [-o_exclude O_EXCLUDE [O_EXCLUDE ...]]
124+
[-orf_len [ORF_LEN]] [-co_ovp [CO_OVP]] [-out [OUT]] [--show-types] [--show-chrs]
106125

126+
Genomic mapping of pseudo-ORF
107127

108-
II_bis. Install the package in the standard python libraries of your system (not recommended)
109-
---------------------------------------------------------------------------------------------
128+
optional arguments:
129+
-h, --help show this help message and exit
130+
-fna [FNA] Genomic fasta file (.fna)
131+
-gff [GFF] GFF annotation file (.gff)
132+
-chr [CHR] Chromosome name
133+
-types_only TYPES_ONLY [TYPES_ONLY ...]
134+
Type feature(s) to use as reference(s) (&apos;CDS&apos; in included by default).
135+
-types_except TYPES_EXCEPT [TYPES_EXCEPT ...]
136+
Type feature(s) to not consider as reference(s) (None by default).
137+
-o_include O_INCLUDE [O_INCLUDE ...]
138+
Type feature(s) and/or Status attribute(s) desired to be written in the output (all by default).
139+
-o_exclude O_EXCLUDE [O_EXCLUDE ...]
140+
Type feature(s) and/or Status attribute(s) desired to be excluded (None by default).
141+
-orf_len [ORF_LEN] Minimum number of nucleotides required to define a sequence between two consecutive stop codons as an ORF sequence (60
142+
nucleotides by default).
143+
-co_ovp [CO_OVP] Cutoff defining the minimum CDS overlapping ORF fraction required to label on ORF as a CDS. By default, an ORF sequence
144+
will be tagged as a CDS if at least 70 per cent of its sequence overlap with the CDS sequence.
145+
-out [OUT] Output directory
146+
--show-types Print all type features
147+
--show-chrs Print all chromosome names
148+
</pre>
110149

111-
In ORFMap-0.0/:
150+
Except -fna and -gff arguments that are mandatory, all others are optional.
112151

113-
```bash
114-
python setup.py install
115-
```
116152

153+
### Basic run
117154

118-
----------------------------------------
119-
Running ORFMap
120-
----------------------------------------
155+
ORFMap requires two input files:
156+
* a genomic fasta file (-fna)
157+
* its associated GFF file (-gff).
121158

122-
Basic run
123-
---------
124159

125-
ORFMap requires two input files: a genomic fasta file (-fna) and its associated GFF file (-gff). The most basic run can be executed by typing:
160+
The most basic run can be executed by typing:
126161

127162
```
128163
run_orfmap -fna mygenome.fna -gff mygenome.gff
@@ -146,100 +181,69 @@ The output will be two separated files with the prefix "mapping_orf_":
146181
By default, the two output files will contain all 5 possible annotations mentioned above.
147182

148183

149-
Usage description
150-
-----------------
151-
152-
To see all options available:
153-
154-
```
155-
run_orfmap -h
156-
```
184+
<h2><a name="usage_ex">Some usage examples</a></h2>
157185

158-
This command will show:
186+
By default, all element types (except 'region' and 'chromosome') found in the GFF file are used as reference
187+
to annotate ORF sequences. If an ORF sequence overlaps with two or more elements, then the ORF sequence will be assigned
188+
according to the element with which it overlaps the most. For instance, let's say an ORF sequence overlaps at 85% with
189+
a tRNA and at 90% with a sRNA, then the ORF will be assigned as nc_ovp-sRNA.
190+
Note that the CDS element type always has the priority relative to any other element types. Therefore, if an ORF
191+
sequence overlaps at 72% with a CDS and at 95% with another element that is not a CDS, then the ORF will be assigned as
192+
c_CDS. When an ORF sequence entirely overlaps with multiple elements, then the choice for its assignment is quite
193+
arbitrary: the ORF will be assigned depending on the first element met in the GFF. That case could appear for
194+
intrinsically related elements such as gene, exon and mRNA. For example, let's say an ORF sequence equally overlaps with
195+
an exon and a gene region (but there's no overlap with the CDS part). Since the gene normally appears first in the GFF
196+
file, the ORF will be assigned as nc_ovp-gene. In order to avoid those special cases, an option allows the user to specify
197+
element types that should not be considered as reference for the ORF assignment.
159198

160-
usage: run_orfmap [-h] -fna [FNA] -gff [GFF] [-type TYPE [TYPE ...]] [-o_include O_INCLUDE [O_INCLUDE ...]] [-o_exclude O_EXCLUDE [O_EXCLUDE ...]] [-orf_len [ORF_LEN]]
161-
[-co_ovp [CO_OVP]] [-out [OUT]]
162199

163-
Genomic mapping of pseudo-ORF
164200

165201

166-
| Arguments | Description |
167-
| --- | --- |
168-
| -h, --help | show this help message and exit |
169-
| -fna [FNA] | Genomic fasta file (.fna) |
170-
| -gff [GFF] | GFF annotation file (.gff) |
171-
| -type TYPE [TYPE ...] | Type feature(s) a flag is desired for ('CDS' in included by default). |
172-
| -o_include O_INCLUDE [O_INCLUDE ...] | Type feature(s) and/or Status attribute(s) desired to be written in the output (all by default). |
173-
| -o_exclude O_EXCLUDE [O_EXCLUDE ...] | Type feature(s) and/or Status attribute(s) desired to be excluded (None by default). |
174-
| -orf_len [ORF_LEN] | Minimum number of nucleotides required to define a sequence between two consecutive stop codons as an ORF sequence (60 nucleotides by default). |
175-
| -co_ovp [CO_OVP] | Cutoff defining the minimum CDS overlapping ORF fraction required to label on ORF as a CDS. By default, an ORF sequence will be tagged as a CDS if at least 70 per cent of its sequence overlap with the CDS sequence. |
176-
| -out [OUT] | Output directory |
202+
In the case where an ORF sequence overlaps
177203

178-
179-
Except -fna and -gff arguments that are mandatory, all others are optional.
180-
181-
| Arguments | Default value |
182-
| --- | --- |
183-
| -type | CDS |
184-
| -o_include | 'all' |
185-
| -o_exclude | None |
186-
| -orf_len | 60 |
187-
| -co_ovp | 0.7 |
188-
| -out | './' |
204+
##### Use tRNA and snRNA element types as references to annotate ORF sequences:
205+
```
206+
run_orfmap -fna mygenome.fna -gff mygenome.gff -types_only tRNA snRNA -out myResults
207+
```
189208

190209

191-
Usage examples
192-
-----------------
193210

194-
This command will use tRNA and snRNA element types as a reference to annotate ORF sequences and the outputs will be created in myResults/
211+
##### Use tRNA and snRNA element types as references to annotate ORF sequences:
195212
```
196-
run_orfmap -fna mygenome.fna -gff mygenome.gff -type tRNA snRNA -out myResults
213+
run_orfmap -fna mygenome.fna -gff mygenome.gff -types_only tRNA snRNA -out myResults
197214
```
198215

199-
7 annotations will be possible for an ORF sequence:
200-
201-
| ORF annotation | Condition |
202-
| --- | --- |
203-
| c_CDS |if the ORF overlap with a CDS in the same phase |
204-
| nc_5-CDS | if the 5' extremity of the c_CDS is at least 60 nucleotides long |
205-
| nc_3-CDS | if the 3' extremity of the c_CDS is at least 60 nucleotides long |
206-
| nc_ovp-CDS | if the ORF overlap with a CDS in a different phase |
207-
| nc_ovp-tRNA | if the ORF overlap with a tRNA |
208-
| nc_ovp-snRNA | if the ORF overlap with a snRNA |
209-
| nc_intergenic | if the ORF do not overlap with anything |
210-
211-
212-
To make the output files having only ORF sequences mapped as nc_ovp-tRNA and nc_ovp-snRNA:
216+
##### Write in output files only ORF sequences mapped as nc_ovp-tRNA and nc_ovp-snRNA:
213217
```
214-
run_orfmap -fna mygenome.fna -gff mygenome.gff -type tRNA snRNA -o_include nc_ovp-tRNA nc_ovp-snRNA -out myResults
218+
run_orfmap -fna mygenome.fna -gff mygenome.gff -types_only tRNA snRNA -o_include nc_ovp-tRNA nc_ovp-snRNA -out myResults
215219
```
216220

217-
To make the output files having all ORF sequences except those mapped as c_CDS:
221+
##### Write in output files all ORF sequences except those mapped as c_CDS:
218222
```
219223
run_orfmap -fna mygenome.fna -gff mygenome.gff -types_only tRNA snRNA -o_exclude c_CDS -out myResults
220224
```
221225

222-
or:
226+
##### or:
223227
```
224228
run_orfmap -fna mygenome.fna -gff mygenome.gff -types_only tRNA snRNA -o_exclude coding -out myResults
225229
```
226230

227-
**Note**: -o_include and -o_exclude take either feature types or a status attribute as arguments. Feature types have to be amongst the possible annotations for ORF sequences (e.g. c_CDS, nc_5-CDS, nc_intergenic...) while status attribute is either 'coding' or 'non-coding' ('coding' refers to c_CDS and 'non-coding' refers to the other ones).
231+
<em>Note</em>:
232+
<p>
233+
-o_include and -o_exclude take either feature types or a status attribute as arguments.
234+
Feature types have to be amongst the possible annotations for ORF sequences (e.g. c_CDS, nc_5-CDS, nc_intergenic...)
235+
while status attribute is either 'coding' or 'non-coding' ('coding' refers to c_CDS and 'non-coding' refers to the other ones).
236+
</p>
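
The behaviour described in this note can be sketched as a small filter. This is only an illustration under stated assumptions, not ORFMap's code: the function name `keep_orf` and its `include`/`exclude` parameters are hypothetical stand-ins for the values passed to -o_include and -o_exclude.

```python
# Hypothetical sketch of the output filter, not ORFMap's implementation.
def keep_orf(orf_type, include=None, exclude=None):
    """Decide whether an ORF with the given annotation is written to the output.

    include / exclude hold feature types (e.g. 'c_CDS', 'nc_intergenic')
    and/or status attributes ('coding', 'non-coding').
    """
    # 'coding' refers to c_CDS; every other annotation is 'non-coding'.
    status = "coding" if orf_type == "c_CDS" else "non-coding"
    labels = {orf_type, status}
    if include and not labels & set(include):
        return False  # matches neither an included type nor an included status
    if exclude and labels & set(exclude):
        return False  # matches an excluded type or status
    return True
```

Under this reading, `exclude=["coding"]` and `exclude=["c_CDS"]` select the same ORFs, which matches the two equivalent commands shown above.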
228237

229238

230-
231-
This command will define ORF sequences if they are at least 50 nucleotides
239+
##### Assign ORF sequences if stop-to-stop length is at least 50 nucleotides:
232240
```
233241
run_orfmap -fna mygenome.fna -gff mygenome.gff -orf_len 50
234242
```
235243

236-
237-
This command will consider an ORF sequence as overlapping with an element (e.g. CDS) if at least 60 % of its sequence overlap with the element or if this element is totally included within the ORF sequence
244+
##### Consider an ORF sequence as overlapping with any element if at least 60 % of its sequence overlaps with the element:
238245
```
239246
run_orfmap -fna mygenome.fna -gff mygenome.gff -co_ovp 0.6
240247
```
241248

242249

243-
244-
245-

documentation/docs/annotation.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
1+
## ORF annotation
2+
3+
All ORFs are annotated according to their potential overlapping GFF element(s).
4+
Globally, an ORF can be assigned either as a non-coding (nc) sequence or a coding (c) sequence.
5+
6+
7+
### Non-coding ORF sequences
8+
9+
All GFF elements present in both strands are used to define if an ORF is overlapping or not. Thus,
10+
non-coding ORF sequences have three possible feature types:
11+
12+
* `nc_intergenic` if the ORF sequence has no overlapping GFF element
13+
* `nc_ovp-element_type` if the ORF sequence overlaps with a GFF element in the same strand
14+
* `nc_ovp-element_type-opp` if the ORF sequence overlaps with a GFF element in the opposite strand
15+
16+
For instance, if an ORF sequence overlaps with a tRNA, the ORF type will be `nc_ovp-tRNA`.
17+
18+
If an ORF sequence overlaps with multiple GFF elements, the one that will be considered to annotate
19+
the ORF is the one selected according to the following priority sequence:
20+
21+
1. an element on the same strand has the priority over an element on the opposite strand
22+
2. an element with which the ORF sequence overlaps the most has the priority over other elements
23+
3. the element appearing the first in the GFF file has the priority
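
The three priority rules above can be read as a single sort key. The sketch below is an illustration under stated assumptions, not ORFMap's code: the `Overlap` record, its field names, and the functions `pick_reference` and `orf_type` are hypothetical, and the special handling of CDS elements (coding ORFs) is deliberately left out.

```python
# Hypothetical sketch of the priority rules, not ORFMap's implementation.
from dataclasses import dataclass


@dataclass
class Overlap:
    element_type: str   # GFF type of the overlapping element, e.g. "tRNA"
    same_strand: bool   # True if the element is on the ORF's strand
    fraction: float     # fraction of the ORF sequence covered by the element
    gff_rank: int       # position of the element in the GFF file


def pick_reference(overlaps):
    """Apply the rules in order: same strand, largest overlap, first in GFF."""
    return min(overlaps, key=lambda o: (not o.same_strand, -o.fraction, o.gff_rank))


def orf_type(overlaps):
    """Return the non-coding feature type for an ORF given its overlaps."""
    if not overlaps:
        return "nc_intergenic"
    best = pick_reference(overlaps)
    suffix = "" if best.same_strand else "-opp"
    return f"nc_ovp-{best.element_type}{suffix}"
```

For example, an ORF fully covered by a gene on the opposite strand, and covered at 80 % by a tRNA and 90 % by an sRNA on its own strand, would be typed `nc_ovp-sRNA` under this reading.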
24+
25+
26+
27+
28+
If an ORF sequence overlaps with multiple GFF elements, the ORF type will be assigned according to
29+
the element with which it overlaps the most. If the ORF overlaps equally with multiple
30+
elements, then it will be arbitrarily assigned according to the first element met in the GFF file.
31+
32+
33+

documentation/docs/download.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
## Download
441 Bytes
Loading
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
,nicolas,nicolas-VivoBook-ASUSLaptop-X571GT-X571GT,09.11.2020 13:00,file:///home/nicolas/.config/libreoffice/4;
61.1 KB
Binary file not shown.
41.9 KB
Loading
23.3 KB
Loading
