I am working on rice RNA-seq data using the https://nf-co.re/rnaseq pipeline, which runs RSEM internally.
RSEM fails on this command:
rsem-extract-reference-transcripts rsem/genome 0 GCF_034140825.1.filtered.gtf None 0 rsem/GCF_034140825.1.fna
The GTF file might be corrupted!
Stop at line : NC_011033.1 RefSeq transcript 11024 315294 . ? . gene_id "OrsajM_p01"; transcript_id "unassigned_transcript_653"; db_xref "GeneID:6450162"; exception "trans-splicing, RNA editing"; gbkey "mRNA"; gene "nad1"; locus_tag "OrsajM_p01"; transcript_biotype "mRNA";
The GTF2.2 specification that I could find does not mention ? as an allowed value in the strand column, so I understand why RSEM performs this specification-based check.
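To see how many records are affected, a scan of the strand column (column 7) can help. This is only a minimal sketch: it assumes a tab-separated GTF with 9 columns and treats anything other than + or - as invalid, which is my reading of the check rather than RSEM's actual code. The function name `find_bad_strand_lines` is made up for illustration.

```python
import io

# Assumption: only "+" and "-" are accepted in the strand column.
VALID_STRANDS = {"+", "-"}


def find_bad_strand_lines(handle):
    """Yield (line_number, strand, line) for records with an unexpected strand."""
    for number, line in enumerate(handle, start=1):
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; ignore anything shorter.
        if len(fields) < 9:
            continue
        strand = fields[6]
        if strand not in VALID_STRANDS:
            yield number, strand, line.rstrip("\n")


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("GCF_034140825.1.filtered.gtf").
    example = io.StringIO(
        'NC_011033.1\tRefSeq\ttranscript\t11024\t315294\t.\t?\t.\tgene_id "OrsajM_p01";\n'
        'NC_011033.1\tRefSeq\texon\t11024\t11409\t.\t+\t.\tgene_id "OrsajM_p01";\n'
    )
    for number, strand, line in find_bad_strand_lines(example):
        print(number, strand, line)
```

Running it against the real file instead of the in-memory example would list every record that the check is likely to reject.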
The reason for the ? is that some unusual splicing is happening in this mRNA (the exception attribute says "trans-splicing, RNA editing"). This is beyond my current knowledge, but it looks like even the start codon and the stop codon are on different strands. The whole transcript is thus a patchwork of sequences from the positive and negative strands, so it cannot be assigned a single strand.
See https://www.ncbi.nlm.nih.gov/nuccore/NC_011033.1/, where this kind of complement(...) location appears for about 4 different genes.

And here is a view of the feature in the GTF file (first 8 columns):
NC_011033.1 RefSeq gene 11024 11409 . + .
NC_011033.1 RefSeq gene 239890 315294 . + .
NC_011033.1 RefSeq transcript 11024 315294 . ? .
NC_011033.1 RefSeq exon 11024 11409 . + .
NC_011033.1 RefSeq exon 241499 241580 . - .
NC_011033.1 RefSeq exon 239890 240081 . - .
NC_011033.1 RefSeq exon 251354 251412 . - .
NC_011033.1 RefSeq exon 315036 315294 . - .
NC_011033.1 RefSeq CDS 11024 11409 . + 0
NC_011033.1 RefSeq CDS 241499 241580 . - 1
NC_011033.1 RefSeq CDS 239890 240081 . - 0
NC_011033.1 RefSeq CDS 251354 251412 . - 0
NC_011033.1 RefSeq CDS 315036 315291 . - 1
NC_011033.1 RefSeq start_codon 11024 11026 . + 0
NC_011033.1 RefSeq stop_codon 315036 315038 . - 0
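As a possible workaround while this is unresolved, the affected transcripts could be identified and dropped from the GTF before it reaches RSEM. The sketch below assumes a tab-separated GTF whose attribute column contains `transcript_id "..."`; the helper names `mixed_strand_transcripts` and `drop_transcripts` are hypothetical. Note that this removes the transcript from quantification, so it sidesteps the problem rather than fixing it.

```python
import re
from collections import defaultdict

# Assumption: attributes follow the usual GTF form transcript_id "value";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def mixed_strand_transcripts(lines):
    """Return transcript IDs whose records use a non +/- strand or more than one strand."""
    strands = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        # Gene records may carry an empty transcript_id; ignore those.
        if match and match.group(1):
            strands[match.group(1)].add(fields[6])
    return {
        tid
        for tid, seen in strands.items()
        if seen - {"+", "-"} or len(seen) > 1
    }


def drop_transcripts(lines, bad_ids):
    """Return the lines that do not belong to any transcript in bad_ids."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        match = TRANSCRIPT_ID.search(fields[8]) if len(fields) >= 9 else None
        if match and match.group(1) in bad_ids:
            continue
        kept.append(line)
    return kept
```

A possible usage would be to read the GTF into a list of lines, call `mixed_strand_transcripts` on it, and write the result of `drop_transcripts` to a new file. Gene records are left in place, since they are not tied to a single transcript ID here.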
Since this is not an obscure organism but rice (and I had hoped that, working with a model organism for once, everything would be fine), should RSEM be able to handle this case?
Thanks,
-- Jirka