Commit 66f858d

Merge pull request #18 from jzhang-dev/copilot/update-readme-file
Update README.md to reflect current overlaps.tsv output format
2 parents 5f5167a + b9db377 commit 66f858d

1 file changed

Lines changed: 30 additions & 48 deletions

README.md

@@ -61,53 +61,33 @@ Fedrann generates the following output files to help you understand your analysi
 
 - `fedrann.log`: A log file that details the pipeline's progress, from start to finish.
 
-- `overlaps.tsv`: This file contains a list of every sequence analyzed, including its name and orientation.
-
-- `metadata.tsv`: This file lists all identified candidate overlaps and their similarity metrics.
+- `overlaps.tsv`: This file lists all identified candidate overlaps with sequence names, orientations, and their similarity metrics.
 
 - `feature_matrix.npz`: (Optional) This sparse-format file contains the feature matrix generated during the analysis.
 
 ### `overlaps.tsv`
-This file serves as a reference for all input sequences. The index column provides a numerical identifier for each sequence. The `read_name` column contains the original name of the sequence, while the `strand` column specifies its orientation. A strand value of `0` denotes the original sequence, and a value of `1` denotes its reverse complement.
+This file details the candidate overlaps identified by the tool. Each row represents a potential overlap between two sequences.
 
 Example `overlaps.tsv` file:
 ```
-index read_name strand
-0 c2924806-d5c6-4564-b31a-c701c0226fbc 0
-1 c29246d8-d6c6-4564-b31a-c701c0226fbc 1
-2 b5ec3070-2ba3-430d-a55d-1d7b178c8d36 0
-3 b5ec3070-2ba3-430d-a55d-1d7b178c8d36 1
-4 5b1d405b-08b4-448d-b601-dd922aa9380c 0
-5 5b1d405b-08b4-448d-b601-dd922aa9380c 1
-6 f3d50991-ad3e-4564-98de-97bf986f992c 0
-7 f3d50991-ad3e-4564-98de-97bf986f992c 1
-8 316caa78-7d27-4d49-b0f3-684fa17063e4 0
-9 316caa78-7d27-4d49-b0f3-684fa17063e4 1
-10 7b2ee773-fc72-4b54-b6fe-a412df3f8744 0
-```
-
-### `metadata.tsv`
-This file details the candidate overlaps identified by the tool. You can use the `query_index` and `target_index` to look up the full sequence names in the `overlaps.tsv` file.
-
-Example `metadata.tsv` file:
-```
-query_index target_index distance rank
-0 6541 0.6878114767745782 1
-0 828 0.6921228205223058 2
-0 10329 0.7078562118327599 3
-0 2847 0.7284434279308162 4
-0 9642 0.7328719782924367 5
-0 7857 0.7369192193171591 6
-1 10328 0.661215875322702 1
-1 6540 0.6803799447921752 2
-1 5053 0.724938780681381 3
-1 2846 0.7418480245039423 4
-1 829 0.7578067926101968 5
-1 9622 0.7603659996459153 6
+query_name query_orientation target_name target_orientation neighbor_rank distance
+c2924806-d5c6-4564-b31a-c701c0226fbc + 7e3d8f0e-fcf0-4073-ad6b-6d742245f29b + 1 0.6606640366100363
+c2924806-d5c6-4564-b31a-c701c0226fbc + dd1acbc0-f219-4701-b1eb-00b9850d3d9e - 2 0.6690033249609217
+c2924806-d5c6-4564-b31a-c701c0226fbc + 9b58f527-6e4d-4005-92cb-4863b6d42229 + 3 0.7207873470869667
+c2924806-d5c6-4564-b31a-c701c0226fbc + 932c6e73-e5d9-4bbe-b88f-e4babfc043a1 - 4 0.7347354015555427
+c2924806-d5c6-4564-b31a-c701c0226fbc + 416c786a-11f1-487f-99cc-830a40ee1c6c - 5 0.7421239213106542
+c2924806-d5c6-4564-b31a-c701c0226fbc + 99640d4e-d299-414c-ade3-ebe567b4d1ef + 6 0.7520030538483087
+c2924806-d5c6-4564-b31a-c701c0226fbc + ffdda27d-9d97-4d01-8e61-8a6e281e0f69 + 7 0.7670946853284661
+c2924806-d5c6-4564-b31a-c701c0226fbc + 2e92fbd0-4577-405f-a0c4-5123179a9e78 + 8 0.7840616303663797
 ```
-`distance`: Measures the dissimilarity between the embedded vectors of the query and target sequences. A smaller value indicates higher similarity between the sequences.
 
-`rank`: The similarity rank of the `target_index` sequence among all potential matches for the `query_index`. A lower rank (closer to 1) signifies a better match.
+Column descriptions:
+- `query_name`: The name of the query sequence
+- `query_orientation`: The orientation of the query sequence (`+` for forward, `-` for reverse complement)
+- `target_name`: The name of the target sequence
+- `target_orientation`: The orientation of the target sequence (`+` for forward, `-` for reverse complement)
+- `neighbor_rank`: The similarity rank of the target sequence among all potential matches for the query sequence, where `1` is the best match, `2` is the second best, and so on.
+- `distance`: Measures the dissimilarity between the embedded vectors of the query and target sequences. A smaller value indicates higher similarity between the sequences.
 
 
 

@@ -215,16 +195,18 @@ The following flowchart illustrates the main steps of the FEDRANN pipeline. The
 ===============================================================
 |
 +------------------------+------------------------+
-| | |
-v v v
-+------------------+ +---------------------+ +---------------------+
-| metadata.tsv | | overlaps.tsv | | feature_matrix.npz |
-| | | | | (optional) |
-| - Sequence names | | - query_index | | |
-| - Strand info | | - target_index | | - Embedding vectors |
-| - Read indices | | - distance | | - Sparse format |
-| | | - rank | | |
-+------------------+ +---------------------+ +---------------------+
+| |
+v v
++---------------------+ +---------------------+
+| overlaps.tsv | | feature_matrix.npz |
+| | | (optional) |
+| - query_name | | |
+| - query_orientation | | - Embedding vectors |
+| - target_name | | - Sparse format |
+| - target_orientation| | |
+| - neighbor_rank | +---------------------+
+| - distance |
++---------------------+
 ```
 
 ### Key Implementation Details
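
For anyone consuming the updated `overlaps.tsv` layout introduced by this commit, the following is a minimal sketch of a reader, not code from the Fedrann repository. It assumes the file is tab-delimited (inferred from the `.tsv` extension) and starts with the six-column header row shown in the README example. The names `Hit`, `COLUMNS`, and `read_overlaps` are hypothetical and exist only for illustration.

```python
import csv
import io
from typing import Dict, List, NamedTuple, TextIO, Tuple

# Column order taken from the README example in this commit.
COLUMNS = [
    "query_name",
    "query_orientation",
    "target_name",
    "target_orientation",
    "neighbor_rank",
    "distance",
]


class Hit(NamedTuple):
    """One candidate overlap for a query (hypothetical helper type)."""
    rank: int
    target_name: str
    target_orientation: str
    distance: float


def read_overlaps(handle: TextIO) -> Dict[Tuple[str, str], List[Hit]]:
    """Group candidate overlaps by (query_name, query_orientation).

    Assumes tab-delimited input with a header row; raises ValueError
    if the header does not match the documented columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    grouped: Dict[Tuple[str, str], List[Hit]] = {}
    for row in reader:
        key = (row["query_name"], row["query_orientation"])
        hit = Hit(
            rank=int(row["neighbor_rank"]),
            target_name=row["target_name"],
            target_orientation=row["target_orientation"],
            distance=float(row["distance"]),
        )
        grouped.setdefault(key, []).append(hit)

    # Rank 1 is the best match, so sort ascending by rank.
    for hits in grouped.values():
        hits.sort(key=lambda h: h.rank)
    return grouped


# Usage with two rows copied from the README example (tabs assumed).
sample = (
    "\t".join(COLUMNS) + "\n"
    "c2924806-d5c6-4564-b31a-c701c0226fbc\t+\t"
    "7e3d8f0e-fcf0-4073-ad6b-6d742245f29b\t+\t1\t0.6606640366100363\n"
    "c2924806-d5c6-4564-b31a-c701c0226fbc\t+\t"
    "dd1acbc0-f219-4701-b1eb-00b9850d3d9e\t-\t2\t0.6690033249609217\n"
)
neighbors = read_overlaps(io.StringIO(sample))
best = neighbors[("c2924806-d5c6-4564-b31a-c701c0226fbc", "+")][0]
print(best.target_name, best.target_orientation, best.distance)
```

Keying on the name and orientation together keeps the forward and reverse-complement copies of a read separate, which mirrors how each row in the file pairs a name with a `+` or `-` orientation.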
