Remap is a projection-based re-mapping framework designed to improve the alignment consistency of long-read sequencing data (PacBio HiFi and ONT).
Standard read-to-reference alignment often suffers from ambiguity in repetitive regions (e.g., segmental duplications) and breakpoint jitter in structural variants (SVs). Remap addresses this by using assembly contigs as a bridge: reads are first mapped to their local assembly, and then projected onto the reference genome. This approach significantly improves the precision and recall of downstream SV and SNP/Indel calling.
- Assembly-Mediated Projection: Resolves multi-mapping ambiguity by utilizing the longer context of assembled contigs.
- Exact Anchor Consistency: Uses strict one-to-one mapping anchors to prevent breakpoint drifting.
- High Precision in Complex Regions: Specifically optimized for low-mappability regions, segmental duplications, and extreme GC regions.
- Versatile Support: Works with PacBio HiFi, ONT R9.x (requires an existing assembly), and ONT R10+ (HQLR) data.
- Compatible Output: Generates standard BAM files compatible with existing callers like Sniffles2, SVIM, and DeepVariant.

Instead of aligning reads directly to the reference ($R \to G$), Remap maps them in three steps:
- Read-to-Assembly ($R \to A$): Reads are aligned to de novo assembled contigs.
- Assembly-to-Reference ($A \to G$): Contigs are aligned to the reference genome.
- Coordinate Projection: Valid anchors are projected from $R$ to $G$ via $A$, and gaps are filled using local alignment.

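
The projection step above can be illustrated with a small sketch. It assumes each alignment has already been reduced to ungapped, forward-strand match blocks; the `Block` layout and the function names `compose_blocks` and `project_position` are hypothetical and are not taken from the Remap code base. Reverse-strand handling and anchor filtering are left out.

```python
from typing import List, Optional, Tuple

# Hypothetical block layout: (src_start, dst_start, length), an ungapped
# forward-strand match between two coordinate systems.
Block = Tuple[int, int, int]


def compose_blocks(read_to_asm: List[Block], asm_to_ref: List[Block]) -> List[Block]:
    """Compose R->A blocks with A->G blocks into R->G blocks."""
    projected: List[Block] = []
    for r_start, a_start, r_len in read_to_asm:
        a_end = a_start + r_len
        for a2_start, g_start, a_len in asm_to_ref:
            # Intersect the two blocks in assembly coordinates.
            lo = max(a_start, a2_start)
            hi = min(a_end, a2_start + a_len)
            if lo < hi:
                projected.append(
                    (r_start + (lo - a_start), g_start + (lo - a2_start), hi - lo)
                )
    return sorted(projected)


def project_position(read_pos: int, blocks: List[Block]) -> Optional[int]:
    """Map one read position to the reference, or None if it lies in a gap."""
    for r_start, g_start, length in blocks:
        if r_start <= read_pos < r_start + length:
            return g_start + (read_pos - r_start)
    return None  # position falls between anchors; gap filling would handle it


# Toy coordinates, invented for illustration only.
read_to_asm = [(0, 100, 50), (60, 160, 40)]
asm_to_ref = [(0, 1000, 120), (130, 1135, 200)]
read_to_ref = compose_blocks(read_to_asm, asm_to_ref)
```

Positions that return `None` fall between projected anchors; in the pipeline described here, those stretches are the ones handed to local alignment.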

We provide a conda environment file, remap_environment.yml:

    git clone https://github.com/micahvista/Remap.git
    cd Remap
    conda env create -f remap_environment.yml
    conda activate remap_env
    pip install .

Remap requires an input read set, a reference genome, and a working directory. It can perform de novo assembly internally or use a pre-computed assembly.

Run full pipeline (Assembly + Remap) for PacBio HiFi data:

    remap -i "reads/*.fastq.gz" \
        -r reference.fasta \
        -w ./workdir \
        -o ./output/sample_name \
        -d hifi \
        -t 16

Run full pipeline (Assembly + Remap) for ONT R10.4 data:


    remap -i "reads/*.fastq.gz" \
        -r reference.fasta \
        -w ./workdir \
        -o ./output/sample_name \
        -d hqlr \
        -t 16

| Argument | Required | Description |
|---|---|---|
| -i, --inputreadpath | Yes | Path to input reads (supports wildcards, e.g., *.fq.gz). |
| -r, --refpath | Yes | Path to the reference genome (FASTA). |
| -w, --workdir | Yes | Working directory for intermediate files. |
| -o, --outputprefix | Yes | Prefix for the final output BAM file. |
| -d, --datatype | Yes | Data type: ont (R9.x), hqlr (R10.x), or hifi (PacBio HiFi). |
| -t, --threads | No | Number of threads (default: 8). |
| -a, --asmdata | No (yes if ont) | Path to existing assembly (FASTA or GFA). If omitted, hifiasm is run. |
| --localasm | No | Enable local assembly mode for reads containing SVs. |
| --asm2ref | No | Path to existing Assembly-to-Reference BAM. |
| --read2asm | No | Path to existing Read-to-Assembly BAM. |
| --rutg | No | Use raw unitig graph for assembly (default: False). |

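
For scripted runs, the argument rules in the table can be checked before launching the pipeline. The sketch below is only an illustration: `build_remap_command` is a hypothetical helper, not part of Remap, and it uses only the flags documented above. It assumes, as the quoted examples suggest, that the wildcard pattern is passed to `remap` unexpanded.

```python
from typing import List, Optional

# Data types listed in the argument table.
VALID_DATATYPES = {"ont", "hqlr", "hifi"}


def build_remap_command(
    reads: str,
    ref: str,
    workdir: str,
    output_prefix: str,
    datatype: str,
    threads: int = 8,
    asmdata: Optional[str] = None,
) -> List[str]:
    """Build an argv list for remap, enforcing the documented requirements."""
    if datatype not in VALID_DATATYPES:
        raise ValueError(f"datatype must be one of {sorted(VALID_DATATYPES)}")
    if datatype == "ont" and asmdata is None:
        # The table marks -a/--asmdata as required for ont (R9.x) data.
        raise ValueError("ont data requires an existing assembly (-a)")
    cmd = [
        "remap",
        "-i", reads,
        "-r", ref,
        "-w", workdir,
        "-o", output_prefix,
        "-d", datatype,
        "-t", str(threads),
    ]
    if asmdata is not None:
        cmd += ["-a", asmdata]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; because no shell is involved, the read pattern reaches `remap` as a literal string, the same as the quoted form in the shell examples.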

The pipeline generates a standard BAM file containing the re-mapped reads:
- {outputprefix}all.bam: contains both re-mapped and rescued reads, with standard SAM tags plus custom tags for alignment consistency.

In more detail, the pipeline runs four stages:
- Reads are aligned to high-quality contigs to resolve local ambiguities.
- Contigs are aligned to the reference; their longer context lets them span repetitive regions and anchor uniquely.
- Coordinates are projected mathematically from $R$ to $G$ via $A$.
- Gaps between anchors are filled using local alignment (Smith-Waterman) to ensure base-level precision.

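
As a reminder of what the gap-filling stage computes, here is a textbook Smith-Waterman sketch with linear gap penalties. It is not Remap's implementation: the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for illustration, and a production aligner would typically use affine gaps and banding.

```python
from typing import Tuple


def smith_waterman(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> Tuple[int, str, str]:
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; cells are floored at 0 (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

In the setting described above, `a` would be the read segment between two projected anchors and `b` the reference segment between the same anchors, so the quadratic cost applies only to short gaps.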

Based on the GIAB HG002 T2T benchmark:
- Structural Variants: Remap increases F1 scores by 0.04 to 0.05 over direct mapping.
- Segmental Duplications: significant recall improvement.

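
For readers comparing their own call sets, the F1 score quoted above is the harmonic mean of precision and recall. The helper below is a generic sketch; the function name is hypothetical and it is not tied to any particular benchmarking tool.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive and false-negative call counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With made-up counts of 90 true positives, 10 false positives and 10 false negatives, precision and recall are both 0.9, so F1 is 0.9; these numbers are purely illustrative and are not benchmark results.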

If you use Remap in your research, please cite:

This project is licensed under the MIT License.