Context
The /index/{dcc}/{local_id} endpoint currently supports 4DN only. 4DN provides index files (.px2, .bai) as structured entries in its API response (extra_files array), which are stored in extra.extra_files on materialized file documents and served via the index endpoint.
For genomic visualization (e.g., tiling BAM alignments in a genome browser), clients need to fetch the index file first to determine byte ranges, then stream specific tiles from /data/{dcc}/{local_id} using byte-range requests.
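The tile fetch described above can be sketched as a plain HTTP Range request. This is a minimal illustration, not the service's client code: the function name, the example base URL, and the byte offsets are assumptions, and in practice the offsets would come from parsing the index file.

```python
from urllib.request import Request


def build_tile_request(base_url: str, dcc: str, local_id: str,
                       start: int, end: int) -> Request:
    """Build (but do not send) a byte-range request for one tile.

    `start` and `end` are inclusive byte offsets, as in the HTTP Range header.
    The base URL is a placeholder; only the /data/{dcc}/{local_id} path
    comes from the endpoint described above.
    """
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    url = f"{base_url.rstrip('/')}/data/{dcc}/{local_id}"
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical host and file ID, for illustration only.
req = build_tile_request("https://example.org", "4dn", "SOME_FILE_ID", 0, 65535)
print(req.full_url, req.get_header("Range"))
```

Passing the request to urllib.request.urlopen would perform the fetch; a server that honors the range is expected to answer with 206 Partial Content.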
Investigation
ENCODE was investigated as a candidate for extending the index endpoint. Findings:
ENCODE does not provide genomic index files
- No .bai for BAM files: ENCODE releases BAM files without accompanying indexes. Users generate them locally with samtools index. See broadinstitute/gdr-ingest#8 for precedent.
- No .tbi/.csi for VCF/BED: No tabix or CSI indexes are provided.
- No extra_files equivalent: Unlike 4DN's API, which returns an extra_files array with index file metadata, ENCODE file objects have no field for companion files.
- index_of field is unrelated: ENCODE's index_of refers to FASTQ index reads (barcode sequences for demultiplexing), not .bai/.tbi genomic indexes. The ENCODE file schema defines it as linking output_type: "index reads" files to parent FASTQs.
Self-indexed formats already work
bigWig, bigBed, and hic files are self-indexed and support random access via HTTP Range requests on the existing /data/encode/{local_id} endpoint — no separate index endpoint needed.
| Format | Index Type | Provided by ENCODE? | Self-Indexed? |
| --- | --- | --- | --- |
| BAM | .bai | No | No |
| VCF.gz | .tbi | No | No |
| BED.gz | .tbi | No | No |
| bigWig | none | N/A | Yes |
| bigBed | none | N/A | Yes |
| hic | none | N/A | Yes |
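As a rough illustration of how a client or the index endpoint might act on the table above, the sketch below maps a format label to an access plan. The format keys and the returned labels are made up for this example and are not ENCODE API values.

```python
# Mirrors the table above. Keys are illustrative lowercase labels (assumption).
EXTERNAL_INDEX_SUFFIX = {"bam": ".bai", "vcf.gz": ".tbi", "bed.gz": ".tbi"}
SELF_INDEXED = {"bigwig", "bigbed", "hic"}


def access_plan(file_format: str) -> str:
    """Classify a format by how random access to it can be served."""
    fmt = file_format.lower()
    if fmt in SELF_INDEXED:
        # Range requests against /data/encode/{local_id} are sufficient.
        return "range-requests-only"
    if fmt in EXTERNAL_INDEX_SUFFIX:
        # A separate index file would have to be obtained or generated.
        return "needs-external-index"
    return "unknown"


print(access_plan("bigWig"), access_plan("BAM"))
```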
Possible approaches
- Server-side index generation during sync: Run samtools index on BAM files and tabix on VCF/BED files during ENCODE sync; store the generated index files (e.g., in object storage) along with their metadata and serve them via the /index endpoint. Adds compute time to sync and storage cost.
- On-demand index generation: Generate the index on first /index/encode/{local_id} request, cache it, and serve subsequent requests from cache. Adds significant latency on first access (must download full BAM to index it). Users may need to request indexing explicitly via a new endpoint for files missing indexes.
- Proxy to cloud storage with convention-based URLs: If ENCODE ever co-locates .bai files alongside BAMs in S3 (s3://encode-public/.../ENCFF*.bam.bai), we could try fetching from the predicted URL. Currently no evidence this exists.
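The first two approaches share the same indexing step and differ mainly in when it runs. A minimal sketch of that step with a generate-once cache check is shown below, assuming the data file has already been downloaded to local disk. The function names and the injected run callable are hypothetical; the commands use the default output naming of samtools index and tabix. Note that tabix requires BGZF-compressed, position-sorted input, and whether ENCODE's .gz files meet that requirement is not established here.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Tuple


def index_command(data_path: Path) -> Tuple[List[str], Path]:
    """Return the indexing command and the index path it is expected to write."""
    name = data_path.name
    if name.endswith(".bam"):
        # `samtools index in.bam` writes in.bam.bai by default.
        return ["samtools", "index", str(data_path)], data_path.with_name(name + ".bai")
    if name.endswith(".vcf.gz"):
        return ["tabix", "-p", "vcf", str(data_path)], data_path.with_name(name + ".tbi")
    if name.endswith(".bed.gz"):
        return ["tabix", "-p", "bed", str(data_path)], data_path.with_name(name + ".tbi")
    raise ValueError(f"no external index defined for {name}")


def run_checked(cmd: List[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(cmd, check=True)


def get_or_build_index(data_path: Path,
                       run: Callable[[List[str]], None] = run_checked) -> Path:
    """Return the index path, generating the index only if it is missing."""
    cmd, index_path = index_command(data_path)
    if not index_path.exists():
        run(cmd)
    return index_path
```

During sync this would be called once per file before uploading the result to storage; for on-demand generation the same call would sit behind the first /index/encode/{local_id} request, which is where the first-access latency comes from.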