
PGEN backend: plink2 conversion canonicalizes unphased allele order and promotes haploid GTs #203

@d-laub


When variants are supplied via PGEN, plink2 --make-pgen (used to build the .pgen) represents some genotype classes differently from the source VCF. As a result, PGEN-backed haplotype output diverges from VCF/SVAR-backed output (and from a VCF-derived bcftools consensus oracle) on those classes. Verified empirically by round-tripping VCF → plink2 --make-pgen --vcf-half-call r → genoray PGEN read:

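| Input GT | PGEN (hap0,hap1,phased) | Note |
| --- | --- | --- |
| `1` (haploid) | (1,1,phased) | promoted to homozygous diploid |
| `1/0`, `0/1` (unphased het) | both (0,1,unphased) | allele order canonicalized |
| `1/.` (unphased half-call) | (0,1,unphased) | sorted |
| `1\|0`, `0\|1` (phased het) | faithful | allele order and phase kept |
| `1\|.` (phased half-call) | faithful | under --vcf-half-call r |
| `./.` | (-1,-1) | missing |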

So phased diploid genotypes (including phased half-calls under --vcf-half-call r) are faithful, but unphased genotypes lose their listed allele order and haploid genotypes are promoted to homozygous diploid. For unphased data the per-haplotype assignment is arbitrary anyway, but the consequence is that PGEN-backed per-haplotype output cannot be compared allele-for-allele against VCF allele order.
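
For anyone who wants to predict what the PGEN side will look like, here is a rough Python model of the observed behavior. It is only a sketch of the cases listed in this issue, not plink2's or genoray's actual logic. The name `model_pgen_gt` is hypothetical, and the handling of haploid missing calls, of the phased flag on missing genotypes, and of multi-allelic sorting is my assumption rather than something verified here.

```python
from typing import Tuple

MISSING = -1  # sentinel for a missing allele, matching the (-1,-1) case reported above


def model_pgen_gt(gt: str) -> Tuple[int, int, bool]:
    """Map a VCF GT string to (hap0, hap1, phased) as observed after
    VCF -> plink2 --make-pgen --vcf-half-call r -> PGEN read.

    Hypothetical helper: it models the cases reported in this issue only.
    """
    phased = "|" in gt
    if phased:
        alleles = gt.split("|")
    elif "/" in gt:
        alleles = gt.split("/")
    else:
        alleles = [gt]  # haploid call such as "1"

    if len(alleles) == 1:
        if alleles[0] == ".":
            # Assumption: a haploid missing call stays missing.
            return (MISSING, MISSING, False)
        a = int(alleles[0])
        # Observed: haploid GTs are promoted to homozygous diploid, flagged phased.
        return (a, a, True)

    if all(a == "." for a in alleles):
        # Fully missing; the phased flag here is an assumption.
        return (MISSING, MISSING, False)

    # --vcf-half-call r: the missing half of a half-call is treated as reference (0).
    h0, h1 = (0 if a == "." else int(a) for a in alleles)

    if not phased:
        # Observed: unphased genotypes lose their listed order (sorted ascending).
        h0, h1 = sorted((h0, h1))
    return (h0, h1, phased)
```

With this model, `model_pgen_gt("1/0")` and `model_pgen_gt("0/1")` return the same tuple, while `model_pgen_gt("1|0")` keeps its listed order.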

This is largely inherent to how PGEN/plink2 represents data; filing for documentation/visibility. Found via property-based testing (Phase 2); the property test restricts the PGEN backend to phased-diploid inputs for the haplotype-vs-consensus comparison.
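
The restriction to phased-diploid inputs can be expressed as a small predicate. This is a sketch of the idea and not the code used in the property test. The name `is_phased_diploid` and the `allow_half_calls` switch are hypothetical, and whether the actual test admits phased half-calls is not something I am asserting here.

```python
def is_phased_diploid(gt: str, allow_half_calls: bool = False) -> bool:
    """Return True if a VCF GT string is a phased diploid call.

    Hypothetical filter: by default only fully called genotypes such as
    "1|0" pass; with allow_half_calls=True, calls with exactly one missing
    allele such as "1|." pass too.
    """
    if "/" in gt:
        return False  # unphased separator present
    parts = gt.split("|")
    if len(parts) != 2:
        return False  # haploid or higher ploidy
    n_called = sum(p.isdigit() for p in parts)
    n_missing = sum(p == "." for p in parts)
    if n_called == 2:
        return True
    # A half-call has one called allele and one missing allele.
    return allow_half_calls and n_called == 1 and n_missing == 1
```

Genotypes rejected by this predicate (haploid, unphased, fully missing) are exactly the classes where PGEN-backed and VCF-backed haplotypes are expected to differ.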
