Course: Bioinformatics (Fall 2025) Project Date: January 2026
This project implements a complete bioinformatics pipeline for analyzing sequencing data from the Major Histocompatibility Complex (MHC) region. The MHC region is located on chromosome 6 and contains many immune-related genes, including HLA genes, which are essential for antigen presentation and immune response.
The MHC region is biologically important but technically challenging because it is:
- highly polymorphic
- gene-dense
- rich in homologous sequences
- difficult to analyze with short-read sequencing
The main purpose of this project is to go from raw alignment files (BAM) to interpretable results by completing three sequential tasks:
- Gene-level annotation and sequence extraction
- Mapping quality (MAPQ) quality control
- MHC-specific gene extraction and allele typing via alignment
The project dataset contains multiple aligned sequencing samples:
- 20 BAM files (paired-end sequencing alignments)
To interpret BAM alignments correctly, two reference files were required:
- GTF gene annotation file (must match BAM genome build)
- Reference genome FASTA (for extracting gene sequences)
Because the BAM files were aligned to UCSC hg19, the annotation and reference genome also had to be hg19-compatible.
This pipeline was executed using standard command-line genomics tools and Python scripting:
-
samtools
- Inspect BAM headers
- Extract MAPQ values
- Index reference FASTA
-
featureCounts (Subread package)
- Count fragments per gene (paired-end mode)
-
bedtools
- Extract gene sequences from hg19 FASTA using BED coordinates
-
Python 3
- Parse GTF
- Filter gene sets
- Merge tables
- Convert CSV → FASTA
- Implement global alignment and allele typing
The goal of Task 1 was to annotate each BAM file with gene information and produce a structured table of genes covered by sequencing reads.
A gene was considered covered/present if:
gene is covered ⇐⇒ gene fragment count > 0
The required output was one CSV per sample, containing gene metadata and reference sequences.
Task 1 required building a reproducible pipeline with the following steps.
Before using any annotation, I checked the BAM header to confirm the genome build:
samtools view -H <sample>.bamThe BAM header included a reference path containing:
/genomics/opt/RefGenomes/hg19/ucsc.hg19.fasta
This confirmed that all BAM files were aligned to UCSC hg19, so the annotation and FASTA must match hg19 coordinate space.
Because the reference was hg19, I selected an hg19-compatible GTF:
GENCODE v37lift37:
gencode.v37lift37.annotation.gtf
This ensured gene coordinates aligned correctly with BAM coordinates and chromosome naming conventions (chr6, not 6).
The BAM files are paired-end, so gene counting must be done in paired-end mode (fragment counting).
Command used:
featureCounts -p -B \
-a project/gtf/gencode.v37lift37.annotation.gtf \
-o project/counts/<sample>.counts.txt \
-g gene_name \
-t exon \
project/bam/<sample>.bamKey settings:
-p: paired-end fragment counting-B: require properly paired fragments-t exon: count exons and aggregate to gene-level-g gene_name: group features by gene name
This produced a gene-level count table for each sample.
To get only covered genes:
awk 'BEGIN{FS="\t"} $1 !~ /^#/ && NR>2 { if($NF>0) print $1 }' \
project/counts/<sample>.counts.txt \
> project/genes/<sample>.genes.txtExample result:
- Sample MOT36308 contained 734 genes with count > 0.
Counts alone are not enough. The CSV must contain:
- chromosome
- start
- end
- strand
A Python script was implemented to:
- Load
<sample>.genes.txt - Parse the GTF
- Keep only
feature == "gene" - Extract gene_name from attributes
- Output a coordinate table (TSV)
- Output a BED file for sequence extraction
Coordinate conversion:
- GTF uses 1-based inclusive
- BED uses 0-based half-open
So:
- BED start = GTF start − 1
- BED end = GTF end
This step was essential for correct sequence extraction.
I downloaded UCSC hg19 chromosome FASTA files and concatenated them:
cat chr*.fa > ucsc.hg19.fastaThen indexed the FASTA:
samtools faidx project/ref/ucsc.hg19.fastaThis enabled fast extraction of sequences by coordinate.
Using the BED intervals from Step 5:
bedtools getfasta \
-fi project/ref/ucsc.hg19.fasta \
-bed project/genes/<sample>.genes.bed \
-s -name \
-fo project/seq/<sample>.genes.faImportant flags:
-s: strand-aware extraction-name: use gene names in FASTA headers
The final deliverable is one CSV per sample:
project/csv/<sample>.task1.csv
A key technical issue occurred:
bedtools FASTA headers looked like:
>ABCF1::chr6:30539169-30564956(+)
But the gene table contained only:
ABCF1
So the merge initially failed and returned:
Missing sequences for 734 genes.
Fix:
The merge script was updated to normalize FASTA headers by splitting on :: and keeping only the gene name.
After the fix, the pipeline produced:
Wrote 734 rows. Missing sequences for 0 genes.
For each BAM file:
project/csv/<sample>.task1.csv
And intermediate files for reproducibility:
project/counts/<sample>.counts.txtproject/genes/<sample>.genes.txtproject/genes/<sample>.genes.tsvproject/genes/<sample>.genes.bedproject/seq/<sample>.genes.fa
The pipeline was successfully executed for 20 BAM files, producing 20 final CSV outputs.
The goal of Task 2 was to assess whether read alignments are reliable enough for downstream MHC interpretation.
This step is especially important because the MHC region often produces ambiguous alignments due to homology and repeats.
MAPQ (mapping quality) was used as the main metric:
- MAPQ ≥ 30 → confidently mapped reads
- MAPQ = 0 → ambiguous multi-mapped reads
For each BAM file:
-
Extract MAPQ values using samtools
-
Generate plots for each sample:
- histogram (counts)
- histogram (normalized)
- KDE (density)
- CDF (cumulative distribution)
-
Across all samples:
- bar plot: % reads with MAPQ ≥ 30
- bar plot: % reads with MAPQ = 0
These plots were used to identify outlier samples and evaluate alignment confidence.
Across the dataset:
- Most samples showed strong mapping quality
- Typically ~80–85% of reads had MAPQ ≥ 30
- Some samples had elevated MAPQ = 0 fractions, indicating more ambiguity
A representative example sample (MOT36308) showed a bimodal distribution:
- a strong high-MAPQ peak
- a smaller MAPQ=0 peak
This pattern matches known characteristics of MHC sequencing.
Conclusion from Task 2:
-
The dataset is generally high-quality and suitable for downstream analysis.
-
Samples with higher MAPQ=0 require caution because ambiguous mappings can bias:
- coverage depth estimates
- gene-level comparisons
- allele typing results
Task 3 focused on MHC-specific analysis by:
-
Extracting MHC genes from Task 1 outputs
-
Converting extracted genes to FASTA
-
Performing global alignment to reference HLA alleles
-
Selecting the best allele for each gene based on:
- alignment score
- percent identity
Task 3 used:
-
Task 1 per-sample gene tables:
project/csv/MOT*.task1.csv
-
Reference HLA allele sequences:
-
downloaded from the IPD-IMGT/HLA database mirror
-
combined into:
project/ref/hla_alleles.fa
-
Each Task 1 table was filtered to keep only MHC-related genes (Class I, II, III).
Output:
project/task3/step1_mhc_tables/MOTxxxx.mhc_step1.csv
Technical issue: Some sequence fields exceeded Python’s default CSV field size limit.
Fix: Increased CSV field size limit in the Step 1 script to allow robust parsing of long sequences.
Each filtered MHC table was converted into FASTA.
Example header format:
>HLA-A|class=I|chr6:...
Output:
project/task3/step2_mhc_fasta/MOTxxxx.mhc_step2.fa
Sanity check:
The number of FASTA records (>) was verified to match the number of extracted MHC genes.
Reference HLA FASTA files were assembled from locus-specific IMGT/HLA nucleotide FASTAs and concatenated into one file.
A key challenge:
IMGT/HLA headers often include internal IDs like HLA00001, meaning allele names are not always the first token.
Fix: A robust parser was implemented that searches the full header line for allele patterns like:
A*01:01:01:01DRB1*15:01:01
For each sample gene sequence:
-
perform global alignment against allele sequences for the same locus
-
use Needleman–Wunsch dynamic programming
-
scoring:
- match = +2
- mismatch = −1
- gap = −2
Best allele selection:
- maximum alignment score
- tie-breaker: maximum percent identity
Final allele typing table per sample:
project/task3/step3_alignment/MOTxxxx.mhc_allele_typing.real.csv
Columns:
sample_idgene_namemhc_classbest_alleleread_supportmean_alignment_scorepercent_identity
Task 3 was executed successfully for all 20 samples.
Outputs generated:
- 20 MHC filtered tables (Step 1)
- 20 MHC FASTA files (Step 2)
- allele typing results per sample (Step 3)
Observations:
- HLA Class I and II genes usually produced valid allele matches
- Many Class III genes produced empty allele results (not present in the reference database)
Most samples contain high MAPQ values, suggesting reliable alignments.
Each sample contains hundreds of genes with nonzero coverage (example: 734 genes in MOT36308).
Coverage differences across HLA genes reflect known MHC complexity and mapping difficulty.
High-confidence allele calls were obtained for some loci (example: HLA-A), but many loci remain difficult to type using short-read data alone.
This project is scientifically meaningful, but several limitations apply:
-
Short-read sequencing limits HLA typing resolution
- many alleles are too similar to distinguish reliably
-
MAPQ is aligner-dependent
- MAPQ values are not absolute truth, only confidence estimates
-
Reference-based gene sequences are not sample haplotypes
- Task 1 sequences come from hg19 reference, not individual variants
-
MHC multi-mapping affects coverage interpretation
- ambiguous reads can bias gene-level depth and allele calls
project/csv/<sample>.task1.csv(20 files)
- MAPQ distribution plots (per sample)
- summary bar plots across samples
- written QC interpretation
- MHC-only tables (20 files)
- MHC FASTA outputs (20 files)
- allele typing outputs per sample
This project successfully built and executed a full MHC-focused genomics pipeline:
- BAM files were correctly annotated using hg19-compatible GTF + FASTA
- mapping quality was evaluated to validate alignment reliability
- MHC genes were extracted and typed using pairwise global alignment against IMGT/HLA alleles
The results demonstrate that:
- most samples have strong alignment quality
- MHC gene coverage is present but uneven (expected)
- allele typing is feasible for some loci but remains challenging overall
This workflow provides a reproducible foundation for deeper MHC analysis, including improved allele typing methods, variant calling, or long-read sequencing validation.