bug in parse_gene_presence_absence.py causing crash/ error message

Corekaburra version 0.0.5 (and presumably earlier)

ERROR MESSAGE:
```
---------------------Processing started----------------------


Initial checks successful

------------Opening the gene presence/absence file------------
Core genes must be found in 530 or more genomes
Low frequency genes must be found in less than 27 genomes

Traceback (most recent call last):
  File "/home/ahhayes/.conda/envs/Corekaburra/bin/Corekaburra", line 8, in <module>
    sys.exit(main())
  File "/home/ahhayes/.conda/envs/Corekaburra/lib/python3.9/site-packages/Corekaburra/__main__.py", line 173, in main
    core_dict, low_freq_dict, acc_gene_dict = read_gene_presence_absence(input_pres_abs_file_path, args.core_cutoff,
  File "/home/ahhayes/.conda/envs/Corekaburra/lib/python3.9/site-packages/Corekaburra/parse_gene_presence_absence.py", line 185, in read_gene_presence_absence
    fragments_close = check_fragmented_gene(fragment_info, input_gffs, tmp_folder_path, logger)
  File "/home/ahhayes/.conda/envs/Corekaburra/lib/python3.9/site-packages/Corekaburra/parse_gene_presence_absence.py", line 67, in check_fragmented_gene
    first_fragment_contig = gff_database[fragment_pieces[0]][0]
  File "/home/ahhayes/.conda/envs/Corekaburra/lib/python3.9/site-packages/gffutils/interface.py", line 280, in __getitem__
    raise FeatureNotFoundError(key)
gffutils.exceptions.FeatureNotFoundError: SF370_01117
```
- Corekaburra has a bug that occurs when there are two non-Unique basename strings in a genome (eg. AEG_SF370.gff and SF370.gff)
- This leads to incorrect assignment of the locus tags for one of the genomes. (but maybe is only an issue when one of them has a unique gene id/ merged/ refound gene).
- Offending line of code appears to be line 50 of parse_gene_presence_absence.py where the default is to take the first matching element [0] of the filename matches.
`gff_file = [file for file in input_gffs if f'{genome}.gff' in file][0]`
- potential bugfix proposed below. `gff_file = next(file for file in input_gffs if os.path.basename(file) == f'{genome}.gff') `(old code commented out - some extra troubleshooting error messages also suggested) - when tested with the fix on a problematic dataset corekaburra ran without any crashing.

```
        # if '.gff' not in genome:
        #     try:
        #         gff_file = [file for file in input_gffs if f'{genome}.gff' in file][0]
        #         db_name = os.path.join(tmp_folder_path, f'{genome}_db')
        #     except IndexError:
        #         raise NotImplementedError(f'No gff match was found when searching fragments for genome: {genome}')
        # else:
        #     gff_file = genome
        #     db_name = f"{os.path.basename(genome)}_db"
        #     db_name = os.path.join(tmp_folder_path, db_name)

        if '.gff' not in genome:
            try:
                gff_file = next(file for file in input_gffs if os.path.basename(file) == f'{genome}.gff')
                db_name = os.path.join(tmp_folder_path, f'{genome}_db')
            except StopIteration:
                logger.error(f"No exact gff match was found for genome: {genome}")
                logger.debug(f"Available GFF files: {[os.path.basename(f) for f in input_gffs]}")
                raise NotImplementedError(f'No exact gff match was found for genome: {genome}')
        else:
            gff_file = genome
            db_name = f"{os.path.basename(genome)}_db"
            db_name = os.path.join(tmp_folder_path, db_name)
```
will aim to make change in forked version and pull request bugfix soon but wanted to document the issue ASAP incase anyone else had the same issue.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug in parse_gene_presence_absence.py causing crash/ error message #24

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

bug in parse_gene_presence_absence.py causing crash/ error message #24

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions