3' terminal exon capture diagnostics for long-read single-cell RNA-seq.
tecap classifies long-read alignments by where their 3' end lands relative to the terminal exon (TE), its UTR, and a polyA site atlas. It sorts reads into nine mechanism buckets (successful capture, truncation at a real polyA site, internal priming in the UTR, internal priming in the CDS, 3' ends in the TE of a non-coding gene, alternative polyadenylation, upstream-exon mispriming, intronic mispriming, downstream readthrough) and measures reference base composition downstream of each cleavage site, separating classical A-tract internal priming (>=60% A) from moderate-A priming (30-50% A). Empirically, the two regimes split single-cell preps from bulk Iso-Seq; the biochemical driver of the split is currently uncharacterized.
Designed for PacBio Iso-Seq / Kinnex and Oxford Nanopore cDNA BAMs. Direct-RNA sequencing is explicitly unsupported (no RT, no priming artifact to diagnose).
Every classified read lands in exactly one of nine buckets, defined by where its 3' end falls relative to the terminal exon (TE), the TE's UTR / CDS, and the nearest annotated PolyASite cluster.
| Bucket | What it means | Why it matters |
|---|---|---|
| Captured | 3' end in the TE; read covers >=50% of it. | Successful full-length capture of the mRNA 3' end; the goal of any 3'-end protocol. |
| MechA-correct | 3' end in the TE 3' UTR within ±25 bp of an annotated polyA cluster, but read covers <50% of TE. | Truncated transcript that nonetheless terminates at a real polyA site; common with degraded input or short-fragment library prep. |
| MechA-internalUTR | 3' end in the TE 3' UTR but not at any annotated polyA cluster. | Internal oligo-dT priming on an A-rich stretch in the UTR; classic mispriming signature. |
| IP-TE-CDS | 3' end inside the terminal exon's CDS portion. | Internal priming on the coding portion of the TE; strong mispriming signal. |
| MechA-noCDS | 3' end inside the TE of a non-coding gene. | Reported separately so the coding-gene buckets stay clean. |
| MechB-APA | 3' end upstream of the TE at an annotated polyA cluster on an upstream exon. | Alternative polyadenylation isoform; biological, not a mispriming artifact. |
| MechB-exon | 3' end on an upstream exon, no nearby polyA cluster. | Internal priming on an upstream exon. |
| MechB-aspecific | 3' end upstream of the TE in an intron or gene flank. | Pre-mRNA priming or off-target alignment. |
| MechC | 3' end downstream of the TE end. | Read-through, unannotated 3' UTR extension, or alignment artifact. |
The basecomp subcommand also splits Captured / MechA / MechB-APA reads by whether their cluster carries a canonical AAUAAA-like hexamer (PAS+/-).
Run tecap explain to print these definitions on the terminal, or
tecap explain --mechanism MechA-correct --format json for a single entry.
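
The bucket definitions in the table above can be read as a decision procedure. The following is a minimal sketch of that reading, not tecap's actual implementation. The `ReadEnd` fields, the region labels, and the order of the checks are assumptions made for illustration. In particular, it assumes that the >=50% coverage test is applied first to any read ending in the TE, and that the same ±25 bp window is used for upstream-exon polyA clusters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadEnd:
    """Hypothetical per-read summary; field names are illustrative only."""
    region: str                   # where the 3' end falls (see labels below)
    te_coverage: float            # fraction of the terminal exon covered, 0..1
    dist_to_polya: Optional[int]  # bp to nearest annotated cluster, None if none


# Assumed region labels for 3' ends that fall inside the terminal exon.
TE_REGIONS = {"te_utr", "te_cds", "te_noncoding"}


def classify_read(read: ReadEnd, min_coverage: float = 0.5,
                  polya_window: int = 25) -> str:
    """Assign one of the nine buckets from the table (assumed check order)."""
    near_polya = (read.dist_to_polya is not None
                  and abs(read.dist_to_polya) <= polya_window)

    # Assumption: coverage is tested first for any 3' end inside the TE.
    if read.region in TE_REGIONS and read.te_coverage >= min_coverage:
        return "Captured"
    if read.region == "te_utr":
        return "MechA-correct" if near_polya else "MechA-internalUTR"
    if read.region == "te_cds":
        return "IP-TE-CDS"
    if read.region == "te_noncoding":
        return "MechA-noCDS"
    if read.region == "upstream_exon":
        return "MechB-APA" if near_polya else "MechB-exon"
    if read.region == "upstream_other":   # intron or gene flank
        return "MechB-aspecific"
    if read.region == "downstream":
        return "MechC"
    raise ValueError(f"unknown region label: {read.region!r}")
```

For example, `classify_read(ReadEnd("te_utr", 0.2, 10))` falls through the coverage test and lands in MechA-correct, while the same read with `dist_to_polya=40` would be MechA-internalUTR.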
- {sample}_terminal_exon.png: three panels showing bucket fractions, read-length density (Captured vs MechA-correct), and rates by 3' UTR length bin. Mispriming bias concentrates in the long-UTR bins.
- {sample}_mecha_scatter.png: read length vs TE coverage for MechA-correct reads only; reads above the dashed coverage threshold get promoted to Captured.
- {sample}_basecomp.png: eight panels, one per bucket, showing %A in the reference window downstream of cleavage. Grey band (30-50% A): moderate-A priming. Dashed line (>=60% A): classical A-tract priming. Empirically, single-cell prep datasets (10x, in-house FLASH-seq variants on the BD Rhapsody platform and on plates, ArgenTag) cluster in the grey band; bulk Iso-Seq datasets cluster past the dashed line. The biochemical driver of this split is currently uncharacterized.
- comparison_*.png: same panels, multiple samples grouped on the same axes. Generated by tecap compare or tecap report (multi-sample mode).
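
The %A measurement behind the basecomp panels can be illustrated with a short sketch. This is not tecap's code: the function names, the 0-based coordinate convention, and the choice to start the window at the base after the cleavage position are assumptions. The 20 nt window size and the 30-50% / >=60% thresholds are the ones stated in this document.

```python
# Complement table used to read minus-strand windows in transcript orientation.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def downstream_window(ref_seq: str, cleavage: int, strand: str,
                      size: int = 20) -> str:
    """Reference bases downstream of a cleavage site, in transcript orientation.

    Assumes `cleavage` is a 0-based index into `ref_seq` and that the
    window starts at the first base past the cleavage position.
    """
    if strand == "+":
        window = ref_seq[cleavage + 1:cleavage + 1 + size]
    elif strand == "-":
        # Downstream of a minus-strand gene is to the left on the genome.
        start = max(0, cleavage - size)
        window = ref_seq[start:cleavage].translate(_COMPLEMENT)[::-1]
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return window.upper()


def percent_a(window: str) -> float:
    """Percentage of A bases in the window (0.0 for an empty window)."""
    if not window:
        return 0.0
    return 100.0 * window.count("A") / len(window)


def priming_regime(pct_a: float) -> str:
    """Map %A onto the two regimes described in this document."""
    if pct_a >= 60.0:
        return "classical A-tract"
    if 30.0 <= pct_a <= 50.0:
        return "moderate-A"
    return "other"
```

A window with 8 A bases out of 20 gives 40% A and falls in the grey band; a window of 20 A bases gives 100% and counts as classical A-tract priming. Values between 50% and 60% belong to neither named regime and are labelled "other" here.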
pip install git+https://github.com/FullLengthFanatic/tecap@v0.4.0
Development install:
git clone https://github.com/FullLengthFanatic/tecap
cd tecap
pip install -e .[dev]
pytest
# Classify reads. References are auto-fetched on first run and cached
# under ~/.cache/tecap/GRCh38/.
tecap classify \
--bam sample.bam \
--genome GRCh38 \
--gtf-version 45 \
--sample S1 \
--out-dir results/ \
--threads 8 \
--platform cdna-pacbio \
--verbose
# Or pass references explicitly (no auto-download):
tecap classify \
--bam sample.bam \
--gtf gencode.v45.annotation.gtf.gz \
--polya-sites atlas.clusters.3.0.GRCh38.GENCODE_42.bed.gz \
--sample S1 --out-dir results/ --threads 8
# Measure base composition in the 20 nt window downstream of each cleavage site
tecap basecomp \
--bam sample.bam \
--genome GRCh38 \
--gtf-version 45 \
--fasta GRCh38.primary_assembly.genome.fa.gz \
--sample S1 \
--out-dir results/ \
--threads 8 \
--verbose
# Render a self-contained HTML report (per-sample)
tecap report \
--classify-json results/S1_terminal_exon.json \
--basecomp-json results/S1_basecomp.json \
--out-html results/S1_report.html
# Cross-sample HTML report (space-separated paths)
tecap report \
--classify-json results/A_terminal_exon.json results/B_terminal_exon.json \
--basecomp-json results/A_basecomp.json results/B_basecomp.json \
--out-html results/compare.html
# Print the mechanism glossary
tecap explain
tecap explain --mechanism MechA-correct --format json
# Cross-sample comparison plots only (no HTML)
tecap compare \
--mode classify \
--inputs results/A_terminal_exon.json,results/B_terminal_exon.json \
--out-dir results/
# Fetch references explicitly (otherwise --genome handles this)
tecap download-atlas \
--genome GRCh38 \
--gtf-version 45 \
--out-dir ref/
Per sample (classify):
- {sample}_terminal_exon.json: bucket counts, fractions, PAS split, UTR-length stratification, orientation sanity check, read-length medians.
- {sample}_terminal_exon.png: 3-panel summary plot.
- {sample}_mecha_scatter.png: read length vs TE coverage for MechA-correct reads.
- {sample}_tecap_mqc.json: MultiQC custom-content table (auto-detected by the _mqc.json suffix).
- {sample}_per_gene.tsv (optional, with --per-gene-table): per-gene bucket counts.
Per sample (basecomp):
- {sample}_basecomp.json: %A histograms per bucket, medians, >=60% and 30-50% fractions.
- {sample}_basecomp.png: 8-panel histogram grid.
Cross-sample:
- comparison_terminal_exon.png: grouped bars across samples.
- comparison_basecomp.png: per-bucket histogram overlays.
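
When several samples go through the same steps, the command lines can be assembled programmatically. The sketch below only builds argument lists from flags and output file names that appear in this document; the helper names `classify_cmd` and `report_cmd` are hypothetical, and it assumes that both classify and basecomp have been run for every sample into the same output directory before the report is rendered.

```python
import posixpath
import shlex
from typing import List


def classify_cmd(sample: str, bam: str, out_dir: str = "results/",
                 genome: str = "GRCh38", gtf_version: int = 45,
                 threads: int = 8) -> List[str]:
    """Argument list for one `tecap classify` run with auto-fetched references."""
    return [
        "tecap", "classify",
        "--bam", bam,
        "--genome", genome,
        "--gtf-version", str(gtf_version),
        "--sample", sample,
        "--out-dir", out_dir,
        "--threads", str(threads),
    ]


def report_cmd(samples: List[str], out_html: str,
               out_dir: str = "results/") -> List[str]:
    """Argument list for a per-sample or cross-sample `tecap report` run.

    Input paths follow the {sample}_terminal_exon.json and
    {sample}_basecomp.json naming listed in the outputs above.
    """
    classify_json = [posixpath.join(out_dir, f"{s}_terminal_exon.json")
                     for s in samples]
    basecomp_json = [posixpath.join(out_dir, f"{s}_basecomp.json")
                     for s in samples]
    return (["tecap", "report", "--classify-json"] + classify_json
            + ["--basecomp-json"] + basecomp_json
            + ["--out-html", out_html])


if __name__ == "__main__":
    # Print the commands for inspection; pass each list to subprocess.run to execute.
    for name in ("A", "B"):
        print(shlex.join(classify_cmd(name, f"{name}.bam")))
    print(shlex.join(report_cmd(["A", "B"], "results/compare.html")))
```

Building lists rather than shell strings avoids quoting problems with unusual file names, and printing them first makes it easy to check the exact invocation before anything runs.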
Five of the six samples are PacBio HiFi sequenced after Kinnex
concatenation: most use the Kinnex Full-length kit (8x), while the
public ArgenTag sample uses the Kinnex Single Cell kit (12x).
PBMC_FS_ONT is the exception, sequenced as Oxford Nanopore cDNA
without Kinnex.
| Sample | Tissue | Organism | Source |
|---|---|---|---|
| 10x_FL_v02_full | Retinal organoid | Human | In-house. 10x GEM-X 3' RNA-seq kit, then Kinnex FL (8x), PacBio HiFi. |
| BD46_FS_SEQ | Retinal organoid | Human | In-house. FLASH-seq variant on the BD Rhapsody platform (bead-bound oligo-dT capture), then Kinnex FL (8x), PacBio HiFi. |
| PBMC_FS_ONT | PBMC | Human | In-house. Plate-based FLASH-seq, sequenced as ONT cDNA (no Kinnex). |
| MDA_argentag_kinnex12x | MDA-MB-453 cell line | Human | Public ArgenTag, nanowell single-cell + Kinnex SC (12x), PacBio HiFi. Source: downloads.pacbcloud.com/public/dataset/Kinnex-single-cell-RNA/DATA-RevioSPRQ-Kinnex-ArgenTag-MDAcellLine/MDA-12fold/. |
| kinnex_cerebellum | Cerebellum (bulk) | Human | Public PacBio bulk Iso-Seq + Kinnex FL (8x). Source: downloads.pacbcloud.com/public/dataset/Kinnex-full-length-RNA/. |
| kinnex_heart | Heart (bulk) | Human | Public PacBio bulk Iso-Seq + Kinnex FL (8x). Source: downloads.pacbcloud.com/public/dataset/Kinnex-full-length-RNA/ (same parent bucket as cerebellum). |
In-house datasets are not deposited in SRA / GEO; the chemistry descriptions above are sufficient to identify what was run. Public datasets have no associated DOI / paper at this time; URLs link to the PacBio public download locations.
Two pairwise comparisons rendered with tecap report. All samples are
human (GRCh38) and were sequenced as FL Kinnex / MAS-ISO / PacBio HiFi.
Single-cell vs single-cell: 10x Kinnex (10x_FL_v02_full) vs
BD46_FS_SEQ, an in-house FLASH-seq variant on the BD Rhapsody
platform with Kinnex concatenation (not stock Rhapsody chemistry).
Single-cell vs bulk Iso-Seq: 10x Kinnex vs PacBio Kinnex bulk cerebellum.
HTML report (tecap report):
- Self-contained .html per sample (and per comparison) with embedded PNGs, executive summary tiles, mechanism legend, per-bucket tables, PAS split, and UTR-length stratification. Single file, no JS.
If you use tecap, please cite the GitHub release DOI (see CITATION.cff).
License: MIT.



