When variants are supplied via PGEN, `plink2 --make-pgen` (used to build the `.pgen`) represents some genotype classes differently from the source VCF, so PGEN-backed haplotype output diverges from VCF/SVAR-backed output (and from a VCF-derived `bcftools consensus` oracle) on those classes. Verified empirically by round-tripping VCF → `plink2 --make-pgen --vcf-half-call r` → genoray PGEN read:

| Input GT | PGEN (hap0,hap1,phased) |
| --- | --- |
| `1` (haploid) | (1,1,phased); promoted to homozygous diploid |
| `1/0`, `0/1` (unphased het) | both (0,1,unphased); allele order canonicalized |
| `1/.` (unphased half-call) | (0,1,unphased); sorted |
| `1\|0`, `0\|1` (phased het) | faithful: allele order and phase preserved |
| `1\|.` (phased half-call) | faithful under `--vcf-half-call r` |
| `./.` | (-1,-1); missing |

So phased diploid genotypes (incl. phased half-calls under `--vcf-half-call r`) are faithful, but unphased genotypes lose their listed allele order and haploid genotypes are promoted to homozygous diploid. For unphased data the per-haplotype assignment is arbitrary anyway, but the consequence is that PGEN-backed per-haplotype output cannot be compared against VCF allele order for those genotypes.
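
As a rough illustration, the observed mapping can be written down as a small pure-Python model that a test could use as the expected value for a PGEN read. This is a sketch of the round-trip behavior reported here, not plink2's or genoray's implementation; the name `model_pgen_call` is hypothetical, and the handling of a haploid `.`, the phased flag on `./.`, and the exact tuple for phased half-calls are assumptions this issue does not confirm.

```python
from typing import Tuple

MISSING = -1  # sentinel used for missing alleles in the table above


def _parse_allele(token: str) -> int:
    """Convert one VCF GT allele token ('.', '0', '1', ...) to an int."""
    return MISSING if token == "." else int(token)


def model_pgen_call(gt: str) -> Tuple[int, int, bool]:
    """Model (hap0, hap1, phased) after VCF -> plink2 --vcf-half-call r -> PGEN.

    Hypothetical helper: it encodes only the behavior observed in this issue.
    """
    phased = "|" in gt

    # Haploid call: observed to be promoted to a phased homozygous diploid.
    # Assumption: a haploid '.' is promoted the same way.
    if not phased and "/" not in gt:
        allele = _parse_allele(gt)
        return (allele, allele, True)

    alleles = [_parse_allele(tok) for tok in gt.replace("|", "/").split("/")]

    # Fully missing call stays missing (phased flag here is an assumption).
    if all(a == MISSING for a in alleles):
        return (MISSING, MISSING, phased)

    # Half-call mode 'r' treats the missing part as the reference allele.
    alleles = [0 if a == MISSING else a for a in alleles]

    # Unphased calls lose their listed order; the observed order is sorted.
    if not phased:
        alleles.sort()

    return (alleles[0], alleles[1], phased)
```

With such a model, a comparison against the VCF-derived oracle can be skipped or relaxed wherever `model_pgen_call(gt)` differs from the genotype as listed in the VCF.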
This is largely inherent to how PGEN/plink2 represents data; filing for documentation/visibility. Found via property-based testing (Phase 2); the property test restricts the PGEN backend to phased-diploid inputs for the haplotype-vs-consensus comparison.
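
The restriction to phased-diploid inputs could be expressed as a simple syntactic predicate over GT strings. The following is a minimal sketch, assuming genotypes are available as VCF-style strings; `is_phased_diploid` is a hypothetical name and not part of genoray or the existing test suite.

```python
def is_phased_diploid(gt: str) -> bool:
    """Return True for GT strings of the form 'a|b' (exactly two phased alleles).

    Purely syntactic: each allele must be '.' or a non-negative integer.
    Phased half-calls such as '1|.' are accepted, since they were observed
    to round-trip faithfully under --vcf-half-call r. Whether a fully
    missing '.|.' should be accepted is left open; this sketch accepts it.
    """
    if "/" in gt:
        return False
    parts = gt.split("|")
    return len(parts) == 2 and all(p == "." or p.isdigit() for p in parts)


# Example: keep only the genotypes that are safe to compare per haplotype.
candidates = ["1", "1/0", "0/1", "1/.", "1|0", "0|1", "1|.", "./."]
comparable = [gt for gt in candidates if is_phased_diploid(gt)]
```

A property test would apply this kind of filter to generated genotypes before running the haplotype-vs-consensus comparison on the PGEN backend, while leaving the VCF/SVAR backends unrestricted.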