| 1 | +# ReadFaker |
| 2 | + |
| 3 | +A tool for simulating Oxford Nanopore sequencing reads with realistic quality profiles by extracting empirical models |
| 4 | +from real FASTQ data. |
| 5 | + |
| 6 | +## Features |
| 7 | + |
| 8 | +- Creates empirical models for read length and quality scores (quality scores are grouped by length batches). |
| 9 | +- Supports compressed input and output FASTQ files. |
| 10 | +- Fast: can generate a million reads in around a minute. |
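
As an illustration of the first feature, here is a minimal sketch of an empirical model that stores observed read lengths and pools quality strings by length batch. It is not ReadFaker's actual code: `EmpiricalModel`, `BATCH_SIZE`, `observe`, `sample`, and `pick` are hypothetical names, the 1000-base batch width is an assumption, and resizing a sampled quality string by cycling it is just one simple choice.

```rust
use std::collections::BTreeMap;

/// Hypothetical batch width: reads whose lengths fall in the same
/// 1000-base window share one pool of quality strings.
const BATCH_SIZE: usize = 1000;

/// Empirical model: every observed length, plus quality strings per length batch.
#[derive(Default)]
struct EmpiricalModel {
    lengths: Vec<usize>,
    quals_by_batch: BTreeMap<usize, Vec<Vec<u8>>>,
}

/// Map a uniform draw `u` in [0, 1) to an index in `0..n` (requires n > 0).
fn pick(u: f64, n: usize) -> usize {
    ((u * n as f64) as usize).min(n - 1)
}

impl EmpiricalModel {
    /// Record the quality scores of one real read (empty reads are skipped).
    fn observe(&mut self, quals: &[u8]) {
        if quals.is_empty() {
            return;
        }
        self.lengths.push(quals.len());
        self.quals_by_batch
            .entry(quals.len() / BATCH_SIZE)
            .or_default()
            .push(quals.to_vec());
    }

    /// Draw a length, then a quality string from the matching batch.
    /// `u1` and `u2` are uniform draws in [0, 1) supplied by the caller.
    fn sample(&self, u1: f64, u2: f64) -> Option<(usize, Vec<u8>)> {
        if self.lengths.is_empty() {
            return None;
        }
        let len = self.lengths[pick(u1, self.lengths.len())];
        let pool = self.quals_by_batch.get(&(len / BATCH_SIZE))?;
        let template = &pool[pick(u2, pool.len())];
        // Cycle or truncate the template so it has exactly `len` scores.
        let quals: Vec<u8> = template.iter().copied().cycle().take(len).collect();
        Some((len, quals))
    }
}
```

Passing the uniform draws in as arguments keeps the sketch free of external crates and makes sampling trivially reproducible; a real implementation would more likely draw them from a seeded random number generator.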
| 11 | + |
| 12 | +## Motivation |
| 13 | + |
| 14 | +Oxford Nanopore data quality depends on many factors, mainly the kit used and the version and precision of the basecalling model.
| 15 | +These last two change quite often, making it challenging to simulate realistic data with fixed parameters. |
| 16 | + |
| 17 | +This tool takes a different approach: instead of hardcoded models, it extracts length and quality profiles directly from |
| 18 | +your real data, ensuring the simulated reads match the characteristics of actual sequencing runs. |
| 19 | + |
| 20 | +## Limitations |
| 21 | + |
| 22 | +- Insertions and deletions are limited to a single nucleotide. Alteration ratios are fixed.
| 23 | +- Only generates reads with sequence alterations; chimeras, junk reads, and other artifacts are not simulated.
| 24 | +- No BAM file support. |
| 25 | + |
| 26 | +## Usage |
| 27 | + |
| 28 | +```bash |
| 29 | +readfaker -r <reference.fasta> -i <input.fastq> -o <output.fastq> [-n <num_reads>]
| 30 | +``` |
| 31 | + |
| 32 | +### Required Arguments |
| 33 | + |
| 34 | +- `-r, --reference <FASTA>` - Reference sequences to sample reads from |
| 35 | +- `-i, --input <FASTQ>` - Input FASTQ file to extract quality and length models from
| 36 | +- `-o, --output <FASTQ>` - Output FASTQ file for simulated reads |
| 37 | + |
| 38 | +### Optional Arguments |
| 39 | + |
| 40 | +- `-n, --num-reads <N>` - Number of reads to generate (default: 100000) |
| 41 | +- `-s, --seed <N>` - Random seed for reproducibility |
| 42 | +- `-v, --verbose` - Enable verbose output |
| 43 | + |
| 44 | +### Example |
| 45 | + |
| 46 | +```bash |
| 47 | +# Generate 10000 reads with verbose output |
| 48 | +readfaker -r genome.fasta -i real_reads.fastq.gz -o simulated_reads.fastq.gz -n 10000 -v |
| 49 | + |
| 50 | +# Generate reproducible reads with a fixed seed |
| 51 | +readfaker -r genome.fasta -i real_reads.fastq -o simulated_reads.fastq -s 42 |
| 52 | +``` |
| 53 | + |
| 54 | +## How It Works |
| 55 | + |
| 56 | +1. **Model Extraction**: Reads an existing FASTQ file to build empirical models of read lengths and quality scores |
| 57 | +2. **Reference Loading**: Parses reference genome sequences from FASTA format |
| 58 | +3. **Read Generation**: Samples read lengths, selects random reference positions, applies quality profiles, and |
| 59 | + introduces errors based on quality scores |
| 60 | +4. **Output**: Writes FASTQ records with automatic BGZF compression for `.gz`, `.bgz`, or `.bgzf` files |
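
The error step in item 3 can be sketched as follows. This is a rough illustration under stated assumptions, not the tool's implementation. The Phred relation p = 10^(-Q/10) is standard, but the 50/25/25 split between substitutions, insertions, and deletions is a made-up stand-in for the fixed alteration ratios, and `XorShift64`, `phred_to_error_prob`, and `introduce_errors` are hypothetical names. Quality scores are taken as raw Phred values, not ASCII-encoded characters, and adjusting the quality string after an indel is omitted.

```rust
/// Convert a Phred quality score to an error probability: p = 10^(-Q/10).
fn phred_to_error_prob(q: u8) -> f64 {
    10f64.powf(-(q as f64) / 10.0)
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

const BASES: [u8; 4] = *b"ACGT";

/// Walk the read and, at each position, introduce an error with the
/// probability implied by that position's Phred score.
/// Assumed split: 50% substitution, 25% insertion, 25% deletion.
fn introduce_errors(seq: &[u8], quals: &[u8], rng: &mut XorShift64) -> Vec<u8> {
    let mut out = Vec::with_capacity(seq.len());
    for (&base, &q) in seq.iter().zip(quals) {
        if rng.next_f64() >= phred_to_error_prob(q) {
            out.push(base); // no error at this position
            continue;
        }
        let kind = rng.next_f64();
        if kind < 0.5 {
            // Substitution: redraw until the base differs from the original.
            let mut b = BASES[(rng.next_u64() % 4) as usize];
            while b == base {
                b = BASES[(rng.next_u64() % 4) as usize];
            }
            out.push(b);
        } else if kind < 0.75 {
            // Insertion: keep the base and add one random nucleotide.
            out.push(base);
            out.push(BASES[(rng.next_u64() % 4) as usize]);
        }
        // Otherwise: deletion, the base is dropped.
    }
    out
}
```

For example, calling `introduce_errors(b"ACGTACGT", &[20; 8], &mut XorShift64(42))` alters each base with probability 0.01, and reusing the same seed reproduces the same read, mirroring what the `--seed` option is for.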
| 61 | + |
| 62 | +## Building from Source |
| 63 | + |
| 64 | +```bash |
| 65 | +cargo build --release |
| 66 | +``` |
| 67 | + |
| 68 | +The binary will be available at `target/release/readfaker`. |
| 69 | + |
| 70 | +## Running Tests |
| 71 | + |
| 72 | +```bash |
| 73 | +cargo test |
| 74 | +``` |