Bug: SamParser silently corrupts data when encountering non-numeric values in integer fields #22

@Hapsa21

Bug

While testing edge cases for SAM file ingestion, I discovered that SamParser::ParseLine silently accepts non-numeric string values in fields that strictly require integers (FLAG, POS, MAPQ, PNEXT, and TLEN). Instead of throwing a validation error, the parser converts the string to 0 and writes the corrupted record to the ROOT file, causing silent downstream data corruption.

To Reproduce

  1. Create a corrupted SAM record with text in the FLAG and POS columns:
    echo -e "read_name\tBROKEN_FLAG\tchr1\tBROKEN_POS\t60\t100M\t=\t1200\t200\tATGC\tIIII" > type_crash.sam
  2. Run the converter:
    ./tools/samtoramntuple type_crash.sam output.root

Current Behavior

The parser outputs Processed 1 SAM records and successfully creates output.root. The corrupted string values are silently cast to 0.

Expected behavior

The parser should throw a validation error and safely abort or reject the record, preventing the creation of a corrupted RNTuple.

Root Cause

In src/ramcore/SamParser.cxx, the parser uses the C function atoi(), which returns 0 when no conversion can be performed and has no way to report the failure. A malformed field is therefore indistinguishable from a legitimate 0.
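A minimal sketch of the failure mode, for illustration only. ParseFieldWithAtoi is a hypothetical helper, not code taken from SamParser.cxx; it only assumes that an integer column is passed to atoi() as described above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical stand-in for reading an integer column with atoi();
// the actual code in SamParser.cxx may be structured differently.
int ParseFieldWithAtoi(const std::string& field) {
    // atoi() has no error channel: it returns 0 when no digits can be read,
    // and it stops quietly at the first non-digit character otherwise.
    return std::atoi(field.c_str());
}
```

With this helper, "BROKEN_FLAG" and "0" both yield 0, and a partially numeric field such as "60abc" yields 60, so neither kind of malformed input can be detected from the return value.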

Proposed Solution

I have already tested a local fix that replaces atoi() with a strict C++ wrapper class built on std::stoi. The wrapper catches std::invalid_argument and halts execution, which preserves data integrity.
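One way such a strict conversion could look is sketched below. This is not the local fix itself: ParseStrictInt and its fieldName parameter are hypothetical names chosen for illustration. Beyond catching std::invalid_argument, the sketch also handles std::out_of_range (which std::stoi throws for values that do not fit in an int) and rejects trailing characters, since std::stoi on its own would accept "60abc" as 60.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts one SAM column to int or throws.
// fieldName is used only to build the error message.
int ParseStrictInt(const std::string& text, const std::string& fieldName) {
    std::size_t consumed = 0;
    int value = 0;
    try {
        value = std::stoi(text, &consumed);
    } catch (const std::invalid_argument&) {
        // No digits at all, e.g. "BROKEN_FLAG" or an empty field.
        throw std::runtime_error("SAM field " + fieldName +
                                 " is not an integer: '" + text + "'");
    } catch (const std::out_of_range&) {
        // Numeric, but too large or too small for int.
        throw std::runtime_error("SAM field " + fieldName +
                                 " is out of range: '" + text + "'");
    }
    // std::stoi stops at the first non-digit; require the whole field to match.
    if (consumed != text.size()) {
        throw std::runtime_error("SAM field " + fieldName +
                                 " has trailing characters: '" + text + "'");
    }
    return value;
}
```

Note that std::stoi skips leading whitespace and accepts a leading '+', so a stricter check would be needed if those forms should also be rejected. Whether the caller aborts the whole conversion or skips only the offending record is a separate design choice.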
