Commit 29c8e9d

chore: Add README
1 parent 4a827b1 commit 29c8e9d

1 file changed

Lines changed: 74 additions & 0 deletions

README.md

@@ -0,0 +1,74 @@
# ReadFaker

A tool for simulating Oxford Nanopore sequencing reads with realistic quality profiles by extracting empirical models
from real FASTQ data.

## Features

- Creates empirical models for read length and quality scores (quality scores are grouped by length batches).
- Supports compressed input and output FASTQ files.
- Fast: can generate a million reads in around a minute.

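As a rough illustration of the first feature, the sketch below stores observed read lengths and groups per-read quality profiles by length batch. It is not ReadFaker's actual code: `EmpiricalModel`, `observe`, `profiles_for`, `sample_length`, and the `BATCH_WIDTH` value are all hypothetical names and parameters chosen for this example, and quality strings are assumed to be Phred+33 encoded.

```rust
use std::collections::BTreeMap;

/// Hypothetical batch width in bases; the README does not state the real value.
const BATCH_WIDTH: usize = 1_000;

/// Empirical model: every observed read length, plus the per-read quality
/// profiles grouped by the length batch the read falls into.
#[derive(Default)]
struct EmpiricalModel {
    lengths: Vec<usize>,
    qualities_by_batch: BTreeMap<usize, Vec<Vec<u8>>>,
}

impl EmpiricalModel {
    /// Record one FASTQ quality line (assumed Phred+33 ASCII).
    fn observe(&mut self, quality_line: &str) {
        let phred: Vec<u8> = quality_line.bytes().map(|b| b.saturating_sub(33)).collect();
        let len = phred.len();
        self.lengths.push(len);
        self.qualities_by_batch
            .entry(len / BATCH_WIDTH)
            .or_default()
            .push(phred);
    }

    /// Quality profiles observed for reads in the same length batch as `len`.
    fn profiles_for(&self, len: usize) -> &[Vec<u8>] {
        match self.qualities_by_batch.get(&(len / BATCH_WIDTH)) {
            Some(profiles) => profiles.as_slice(),
            None => &[],
        }
    }

    /// Pick one observed length from a caller-supplied random number.
    /// Returns None if no reads have been observed yet.
    fn sample_length(&self, random: u64) -> Option<usize> {
        if self.lengths.is_empty() {
            return None;
        }
        let index = (random % self.lengths.len() as u64) as usize;
        Some(self.lengths[index])
    }
}
```

Sampling from the stored lengths reproduces the observed length distribution without fitting a parametric curve, and looking up profiles by batch keeps quality patterns tied to reads of similar length.
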
## Motivation

Oxford Nanopore data quality depends on many factors, mainly the kit used, the model version, and the model precision.
The last two change often, which makes it hard to simulate realistic data with fixed parameters.

This tool takes a different approach: instead of relying on hardcoded models, it extracts length and quality profiles
directly from your real data, so the simulated reads match the characteristics of actual sequencing runs.

## Limitations

- Insertions and deletions are limited to a single nucleotide. Alteration ratios are fixed.
- Generates only modified reference sequences; chimeras, junk reads, and other artifacts are not simulated.
- No BAM file support.

## Usage

```bash
readfaker -r <reference.fasta> -i <input.fastq> -o <output.fastq> -n <num_reads>
```

### Required Arguments

- `-r, --reference <FASTA>` - Reference sequences to sample reads from
- `-i, --input <FASTQ>` - Input FASTQ file from which the quality and length models are extracted
- `-o, --output <FASTQ>` - Output FASTQ file for simulated reads

### Optional Arguments

- `-n, --num-reads <N>` - Number of reads to generate (default: 100000)
- `-s, --seed <N>` - Random seed for reproducibility
- `-v, --verbose` - Enable verbose output

### Example

```bash
# Generate 10000 reads with verbose output
readfaker -r genome.fasta -i real_reads.fastq.gz -o simulated_reads.fastq.gz -n 10000 -v

# Generate reproducible reads with a fixed seed
readfaker -r genome.fasta -i real_reads.fastq -o simulated_reads.fastq -s 42
```

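The first example writes to a `.gz` path, and the How It Works section states that `.gz`, `.bgz`, and `.bgzf` outputs are BGZF-compressed automatically. A minimal sketch of how such an extension check might look is shown here; `wants_bgzf` is a hypothetical helper, not a function taken from ReadFaker, and the case-sensitive match is an assumption.

```rust
use std::path::Path;

/// Returns true when the output path ends in one of the extensions the
/// README lists for BGZF output. Matching is case-sensitive in this sketch.
fn wants_bgzf(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|ext| ext.to_str()),
        Some("gz" | "bgz" | "bgzf")
    )
}
```

Under this sketch, `simulated_reads.fastq.gz` selects compressed output and `simulated_reads.fastq` selects plain text.
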
## How It Works

1. **Model Extraction**: Reads an existing FASTQ file to build empirical models of read lengths and quality scores
2. **Reference Loading**: Parses reference genome sequences from FASTA format
3. **Read Generation**: Samples read lengths, selects random reference positions, applies quality profiles, and
   introduces errors based on quality scores
4. **Output**: Writes FASTQ records with automatic BGZF compression for `.gz`, `.bgz`, or `.bgzf` files

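The read generation step can be sketched as follows. This is an illustration under stated assumptions, not ReadFaker's implementation: `SplitMix64` is a small stand-in generator so the example needs no external crates, `sample_window` and `introduce_errors` are hypothetical names, and the 50/25/25 split between substitutions, insertions, and deletions is invented for the example (the README only says the ratios are fixed). The one piece taken from the standard Phred definition is the error probability, 10^(-Q/10).

```rust
/// SplitMix64: a tiny seedable generator used here as a stand-in RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

const BASES: [u8; 4] = [b'A', b'C', b'G', b'T'];

/// Standard Phred relation: Q -> probability that the base call is wrong.
fn error_probability(phred: u8) -> f64 {
    10f64.powf(-(phred as f64) / 10.0)
}

/// Pick a random window of `len` bases, or None if the reference is too short.
fn sample_window<'a>(reference: &'a [u8], len: usize, rng: &mut SplitMix64) -> Option<&'a [u8]> {
    if len == 0 || len > reference.len() {
        return None;
    }
    let start = (rng.next_u64() % (reference.len() - len + 1) as u64) as usize;
    Some(&reference[start..start + len])
}

/// Alter each base with the probability implied by its quality score.
/// Returns the altered sequence and a quality vector of the same length.
fn introduce_errors(seq: &[u8], quals: &[u8], rng: &mut SplitMix64) -> (Vec<u8>, Vec<u8>) {
    let mut out_seq = Vec::with_capacity(seq.len());
    let mut out_qual = Vec::with_capacity(quals.len());
    for (&base, &q) in seq.iter().zip(quals) {
        if rng.next_f64() >= error_probability(q) {
            out_seq.push(base);
            out_qual.push(q);
            continue;
        }
        // Hypothetical fixed ratios: 50% substitution, 25% insertion, 25% deletion.
        let kind = rng.next_f64();
        if kind < 0.5 {
            // Substitution: replace with a different base.
            let alternatives: Vec<u8> = BASES.iter().copied().filter(|&b| b != base).collect();
            out_seq.push(alternatives[(rng.next_u64() % alternatives.len() as u64) as usize]);
            out_qual.push(q);
        } else if kind < 0.75 {
            // Single-nucleotide insertion after the original base.
            out_seq.push(base);
            out_qual.push(q);
            out_seq.push(BASES[(rng.next_u64() % 4) as usize]);
            out_qual.push(q);
        }
        // Otherwise: single-nucleotide deletion, so nothing is emitted.
    }
    (out_seq, out_qual)
}
```

Constructing the generator from a fixed seed, for example `SplitMix64(42)`, makes repeated runs produce identical reads, which is the behaviour the `--seed` option describes.
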
## Building from Source

```bash
cargo build --release
```

The binary will be available at `target/release/readfaker`.

## Running Tests

```bash
cargo test
```
