Summary
Implement a functional/integration test suite that validates GIQL operator correctness against bedtools. Each test generates controlled genomic interval datasets, executes the equivalent operation in both GIQL (transpiled to SQL and executed via DataFusion) and bedtools (via pybedtools), then compares the results.
Operators to cover
INTERSECTS — validate against bedtools intersect, including strand-aware modes (-s, -S)
MERGE — validate against bedtools merge, including strand-aware merging
NEAREST — validate against bedtools closest, including k-nearest and distance calculations
CLUSTER — validate against bedtools cluster
DISTANCE — validate against bedtools closest -d distance output
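For reference, the overlap rule behind the strand-aware intersect modes can be sketched in plain Python. This is only an illustration, assuming BED-style 0-based half-open coordinates and records whose strand is always "+" or "-". The `Interval` type and `overlaps` helper are hypothetical names and are not part of GIQL, bedtools, or pybedtools.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    """Hypothetical record type used only for this sketch."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    strand: str = "+"


def overlaps(a: Interval, b: Interval, strand_mode: Optional[str] = None) -> bool:
    """Return True if a and b share at least one base.

    strand_mode is None (ignore strand), "same" (like bedtools -s),
    or "opposite" (like bedtools -S).
    """
    if a.chrom != b.chrom:
        return False
    # Half-open intervals overlap when each starts before the other ends.
    if not (a.start < b.end and b.start < a.end):
        return False
    if strand_mode == "same":
        return a.strand == b.strand
    if strand_mode == "opposite":
        return a.strand != b.strand
    return True
```

A pure-Python predicate like this can serve as a third opinion when the GIQL and bedtools results disagree on a small dataset.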
Architecture
Test pattern
Each test follows a consistent pattern:
Arrange — Generate intervals using IntervalGenerator with a deterministic seed. Load into both a DataFusion session (as Arrow tables) and pybedtools BedTool objects.
Act (bedtools) — Execute the operation via the pybedtools wrapper.
Act (GIQL) — Transpile the GIQL query and execute it against DataFusion.
Assert — Compare results using order-independent comparison with epsilon tolerance for floats and exact matching for integer positions.
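The Assert step could be sketched roughly as follows. This is a minimal example, not the suite's actual helper: it assumes both engines' results have already been converted to lists of plain tuples, and the names `values_match`, `rows_match`, and `same_rows` are hypothetical. Floats are compared with an absolute tolerance; everything else, including integer positions, must match exactly.

```python
import math
from typing import Any, Sequence


def values_match(x: Any, y: Any, eps: float) -> bool:
    """Compare floats within eps; compare everything else exactly."""
    if isinstance(x, float) or isinstance(y, float):
        if not (isinstance(x, (int, float)) and isinstance(y, (int, float))):
            return False
        return math.isclose(x, y, rel_tol=0.0, abs_tol=eps)
    return x == y


def rows_match(a: Sequence[Any], b: Sequence[Any], eps: float) -> bool:
    return len(a) == len(b) and all(values_match(x, y, eps) for x, y in zip(a, b))


def same_rows(expected: Sequence[Sequence[Any]],
              actual: Sequence[Sequence[Any]],
              eps: float = 1e-9) -> bool:
    """Order-independent multiset comparison of two result sets."""
    remaining = list(actual)
    for row in expected:
        for i, candidate in enumerate(remaining):
            if rows_match(row, candidate, eps):
                del remaining[i]  # consume the match so duplicates are counted
                break
        else:
            return False  # no actual row matches this expected row
    return not remaining  # leftover rows mean the actual set is larger
```

The greedy matching is quadratic in the number of rows, which should be acceptable for small generated datasets. A sort-based comparison would be faster, but it needs care so that near-equal floats do not change the sort order between the two sides.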
DataFusion execution
Use datafusion (PyArrow-based Python bindings) as the execution engine. GIQL transpiles to SQL; DataFusion executes it. This validates that GIQL's generated SQL is portable and correct on the engine the project targets for production use. The test engine wrapper registers Arrow tables from the interval generator and executes transpiled GIQL queries via SessionContext.sql().
Dependencies
pybedtools — Python wrapper for bedtools CLI
bedtools — System dependency (tests skip gracefully if not installed)
datafusion — Apache DataFusion Python bindings
hypothesis — Property-based test data generation for edge-case discovery
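One way to detect the bedtools system dependency is to look for the executable on PATH. The sketch below uses only the standard library; `bedtools_available` and `requires_bedtools` are hypothetical names, and the suite may well use a different mechanism.

```python
import shutil
import unittest


def bedtools_available() -> bool:
    """Return True if a bedtools executable is found on PATH."""
    return shutil.which("bedtools") is not None


# Decorator that skips a test (or test class) when bedtools is missing.
requires_bedtools = unittest.skipUnless(
    bedtools_available(), "bedtools is not installed"
)
```

Under pytest, the same check could be expressed as `pytest.mark.skipif(not bedtools_available(), reason="bedtools is not installed")`.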
Motivation
GIQL transpiles spatial genomic queries into SQL, but the existing unit tests only verify that the generated SQL has the expected structure — they do not verify that the SQL produces correct results on real data. bedtools is the de facto standard for genomic interval operations, making it the ideal oracle for correctness testing. Using DataFusion as the execution engine ensures the suite validates correctness on GIQL's target production engine and catches any SQL dialect incompatibilities early.
Expected outcome
Integration test suite under tests/integration/bedtools/ covers the five merged GIQL operators (INTERSECTS, MERGE, NEAREST, CLUSTER, DISTANCE)
Tests use DataFusion as the execution engine for GIQL queries
Tests skip gracefully when bedtools is not installed
Interval generation is seeded and reproducible
Strand-aware modes are tested for operators that support them
The test suite passes against the current GIQL transpiler output
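To illustrate what seeded, reproducible interval generation can look like, here is a minimal stand-in. It is not the project's IntervalGenerator, whose interface is not described here; the function name, parameters, and six-column BED row layout are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

# Assumed row layout: chrom, start, end, name, score, strand (BED6).
BedRow = Tuple[str, int, int, str, int, str]


def generate_intervals(seed: int,
                       n: int = 10,
                       chroms: Sequence[str] = ("chr1", "chr2"),
                       chrom_size: int = 10_000,
                       max_len: int = 500) -> List[BedRow]:
    """Generate n random intervals; the same seed always yields the same rows."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    rows: List[BedRow] = []
    for i in range(n):
        chrom = rng.choice(chroms)
        start = rng.randrange(0, chrom_size - max_len)
        end = start + rng.randint(1, max_len)  # guarantees start < end
        strand = rng.choice("+-")
        rows.append((chrom, start, end, f"iv{i}", 0, strand))
    return rows
```

Keeping the random state in a local `random.Random` instance means a failing test can be reproduced from its seed alone, regardless of what other tests ran before it.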