The reference genome for the human embryonic stem cell H9
We have assembled the first fully phased, telomere-to-telomere (T2T) diploid reference genome for H9 (WAe009-A), one of the most widely used and ethically approved human embryonic stem cell (hESC) lines in biomedical research, registered in both the European and NIH Human Pluripotent Stem Cell Registries.
The H9 T2T diploid assembly was generated using high-coverage PacBio HiFi, Oxford Nanopore, and Hi-C data. It reaches a quality value (QV) exceeding 63, contains end-to-end chromosomes with complete telomeres and centromeres, and shows greater than 99.8% k-mer and BUSCO gene completeness, placing it on par with the most accurate human genome assemblies to date.
Our H9 assembly showcases comprehensive annotations, including genes, segmental duplications, methylation, chromatin conformation, specific variants and structural rearrangements, and centromeric sequence. Crucially, it enables haplotype-resolved gene expression and chromatin accessibility analyses, which highlights the power of this resource for allele-specific, high-precision transcriptomic, genetic, and epigenetic analyses.
This repository contains documentation, scripts, and processed data related to the H9 diploid genome assembly, its annotations, and the multi-omics analyses.
A UCSC Browser hub with data associated with this assembly is available at:
https://public.gi.ucsc.edu/hausslerlab/t2t-h9-hub/.
GenBank accessions of assembly:
GCA_054883195.1 (haplotype 1)
GCA_054883265.1 (haplotype 2)
SRA accessions/Bioproject of reads used to construct the assembly:
BioProject PRJNA1431686
SRA study SRP680790
The reference genome from the H9 hESC line was generated using a combination of Pacific Biosciences (PacBio) HiFi reads (coverage 75×), Oxford Nanopore Technologies (ONT) R10 ligation reads (coverage 123×, including 47× in reads >100 Kbps), and Arima high-throughput chromosome conformation capture (Hi-C) long-range information (coverage 87×). Two genome assembly strategies were attempted using Verkko v2.2.1:
The first assembly (asm1) uses HiFi reads for graph construction with ONT reads for graph resolution.
The second assembly (asm2) incorporates hifiasm-corrected ONT reads into the graph construction.
The quality of the H9 genome was assessed using a variety of tools:
Assembly basic statistics, computed using gfastats, showed that H9 is essentially gapless (contig N50=155.2 Mbps for hap1; 153.7 Mbps for hap2).
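For readers unfamiliar with the statistic, contig N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The sketch below illustrates that definition on made-up toy lengths. It is not how gfastats computes the value internally, and the function name `n50` is chosen here for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy lengths (hypothetical, not taken from the H9 assembly).
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A contig N50 close to the size of the largest chromosomes, as reported above, is what one would expect when most chromosomes are assembled as single contigs.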
Primary alignments of HiFi reads to H9 haploid genomes were used to generate genome-wide coverage plots with NucFreq. In addition, primary alignments of HiFi and ONT reads to the H9 diploid genome were used to generate a genome-wide coverage plot with NucFlag. Coverage plots display the frequencies of the most and second most common bases at each genomic position, and they showed an overall homogeneous distribution across chromosomes in both haplotypes.
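The quantity plotted at each position can be illustrated with a small sketch. It assumes the bases observed in the reads covering one position are already available as a string (a "pileup column"); the function name and input format are hypothetical and do not reflect the actual NucFreq or NucFlag implementations.

```python
from collections import Counter

BASES = set("ACGT")


def top_two_base_counts(pileup_column):
    """Return (most common, second most common) base counts at one position.

    `pileup_column` is an iterable of single-character base calls from the
    reads covering the position. Characters other than A, C, G, T are ignored.
    """
    counts = Counter(
        base.upper() for base in pileup_column if base.upper() in BASES
    )
    # Pad with zeros so positions with fewer than two bases still work.
    ranked = sorted(counts.values(), reverse=True) + [0, 0]
    return ranked[0], ranked[1]


# Hypothetical column: nine reads support A, one read supports G.
first, second = top_two_base_counts("AAAAAAAAAG")
print(first, second)
```

In a well-assembled haploid region the second count is expected to stay near zero, so a sustained rise of the second count over many neighboring positions is typically taken as a hint of a collapsed or misassembled region.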
By comparing the k-mers in the HiFi reads to the k-mers found in the assembly, we obtained a quality value (QV) of 63.6 for Hap1 and 66.1 for Hap2, and 99.87% completeness for both haplotypes. Furthermore, k-mer spectra revealed a multiplicity profile consistent with a near-complete assembly, with no detectable duplications.
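As a rough guide to how a k-mer based QV relates to a per-base error rate, the sketch below uses the formula popularized by Merqury. This is an assumption: the text above does not name the tool or the k-mer size, so `k=21`, the function names, and the example counts are illustrative only.

```python
import math


def kmer_qv(assembly_only_kmers, total_assembly_kmers, k=21):
    """Estimate a Phred-scaled QV from k-mer counts (Merqury-style formula).

    assembly_only_kmers: k-mers present in the assembly but absent from reads.
    total_assembly_kmers: all k-mers in the assembly.
    k: k-mer size (21 is an assumed value).
    """
    if total_assembly_kmers <= 0:
        raise ValueError("total_assembly_kmers must be positive")
    if not 0 <= assembly_only_kmers <= total_assembly_kmers:
        raise ValueError("assembly_only_kmers must lie in [0, total]")
    if assembly_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    fraction = assembly_only_kmers / total_assembly_kmers
    if fraction == 1:
        return 0.0  # every k-mer unsupported: error rate of 1
    # Per-base error = 1 - (1 - fraction) ** (1 / k), computed with
    # log1p/expm1 to keep precision when the fraction is tiny.
    error = -math.expm1(math.log1p(-fraction) / k)
    return -10 * math.log10(error)


def qv_to_error_rate(qv):
    """Convert a Phred-scaled QV to an expected per-base error rate."""
    return 10 ** (-qv / 10)


# Made-up counts: 21 unsupported k-mers out of one million.
print(round(kmer_qv(21, 1_000_000), 2))
# Error rates implied by the QVs reported above.
print(qv_to_error_rate(63.6), qv_to_error_rate(66.1))
```

On this scale, each increase of 10 in QV corresponds to a tenfold lower error rate, so a QV above 60 implies fewer than one expected error per million bases.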
Based on the alignment of HiFi reads to the H9 assembly, HMM-Flagger classified 99.29% of the diploid assembly (6.06 Gbps) as reliable haploid sequence. Regions flagged as assembly errors were rare, totaling 5.45 Mbps (0.09%), and collapsed regions totaled 4.10 Mbps (0.07%). The genomic coordinates of these low-confidence regions were used to define a low-confidence annotation track in the final assembly.
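The percentages above can be cross-checked from the reported base counts. The sketch below assumes that 6.06 Gbps is the amount of sequence classified as reliable (so the total is 6.06 Gbps divided by 0.9929); reading 6.06 Gbps as the total assembly size instead gives the same values after rounding. Variable and function names are illustrative.

```python
RELIABLE_BP = 6.06e9        # reported reliable haploid sequence
RELIABLE_FRACTION = 0.9929  # reported share of the diploid assembly
ERROR_BP = 5.45e6           # regions flagged as assembly errors
COLLAPSED_BP = 4.10e6       # regions flagged as collapsed


def percent_of_total(region_bp, total_bp):
    """Return region_bp as a percentage of total_bp, rounded to 2 decimals."""
    return round(100 * region_bp / total_bp, 2)


# Assumption: total size is inferred from the reliable amount and its share.
inferred_total_bp = RELIABLE_BP / RELIABLE_FRACTION

print(percent_of_total(ERROR_BP, inferred_total_bp))
print(percent_of_total(COLLAPSED_BP, inferred_total_bp))
```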
Gene completeness was determined by Compleasm using mammalia_odb12 as the gene dataset. Compleasm identified 99.11% complete genes in both haplotypes, with a small difference in duplicated genes (0.7% in haplotype 1; 0.69% in haplotype 2). Missing genes account for 0.04% and 0.03%, and fragmented genes for 0.15% and 0.16%, in haplotype 1 and haplotype 2, respectively.
...
...
...
...
...
Publicly available raw ATAC-Seq data generated from H9 cells undergoing early neural differentiation (BioProject: PRJNA1235757) were downloaded to analyze chromatin accessibility across the H9 haplotypes. Specifically, we downloaded paired-end ATAC-Seq data from 10 samples with the following accession numbers: SRR32687946, SRR32687947, SRR32687948, SRR32687949, SRR32687950, SRR32687951, SRR32687952, SRR32687953, SRR32687954, and SRR32687955.
Scripts are available in the "Chromatin accessibility analysis" folder within this repository.
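Because the ten ATAC-Seq accessions listed above are consecutive, the list can be generated rather than typed by hand. The helper below is a hypothetical convenience, not one of the repository's scripts, and it only builds the accession strings without downloading anything.

```python
def accession_range(first, last):
    """Return all run accessions from `first` to `last`, inclusive.

    Both accessions must share the same alphabetic prefix (e.g. "SRR")
    and differ only in their numeric suffix.
    """
    prefix = first.rstrip("0123456789")
    if not last.startswith(prefix) or not last[len(prefix):].isdigit():
        raise ValueError("accessions must share the same prefix")
    start = int(first[len(prefix):])
    end = int(last[len(prefix):])
    if end < start:
        raise ValueError("last accession must not precede the first")
    width = len(first) - len(prefix)  # preserve zero padding, if any
    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


atac_runs = accession_range("SRR32687946", "SRR32687955")
print(len(atac_runs))
print("\n".join(atac_runs))
```

The resulting list could then be passed, one accession at a time, to a download tool such as `fasterq-dump` from the SRA Toolkit; the scripts in the repository may handle this step differently.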
Warning
This repository is currently being edited.