🚧 WORK IN PROGRESS 🚧
This is an extension for DuckDB for reading and writing raster files data using SQL.
The DuckDB Raster Extension is available as a signed community extension. See more details on its DuckDB CE web page.
To install and load it, you can run the following SQL commands in DuckDB:
INSTALL raster FROM community;
LOAD raster;This extension is based on the DuckDB extension template.
First, make sure to load the extension in your DuckDB session.
Then you can use the extension to read and write raster files data using SQL.
This is the list of available functions:
-
Returns the list of supported GDAL raster drivers.
SELECT short_name, long_name, help_url FROM RT_Drivers(); ┌────────────────┬──────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┐ │ short_name │ long_name │ help_url │ │ varchar │ varchar │ varchar │ ├────────────────┼──────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │ VRT │ Virtual Raster │ https://gdal.org/drivers/raster/vrt.html │ │ DERIVED │ Derived datasets using VRT pixel functions │ https://gdal.org/drivers/raster/derived.html │ │ GTI │ GDAL Raster Tile Index │ https://gdal.org/drivers/raster/gti.html │ │ SNAP_TIFF │ Sentinel Application Processing GeoTIFF │ https://gdal.org/drivers/raster/snap_tiff.html │ │ GTiff │ GeoTIFF │ https://gdal.org/drivers/raster/gtiff.html │ │ COG │ Cloud optimized GeoTIFF generator │ https://gdal.org/drivers/raster/cog.html │ │ · │ · │ · │ │ · │ · │ · │ │ · │ · │ · │ │ ENVI │ ENVI .hdr Labelled │ https://gdal.org/drivers/raster/envi.html │ │ EHdr │ ESRI .hdr Labelled │ https://gdal.org/drivers/raster/ehdr.html │ │ ISCE │ ISCE raster │ https://gdal.org/drivers/raster/isce.html │ │ Zarr │ Zarr │ NULL │ │ HTTP │ HTTP Fetching Wrapper │ NULL │ └────────────────┴──────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┘
-
Reads a raster file and returns a table with the raster data.
SELECT * FROM RT_Read('path/to/raster/file.tif'); ┌───────┬───────────┬────────────┬────────────────────────────────┬─────────────────────────┬───────┬────────┬────────┬───────┬───────┬────────────┬────────────┐ │ id │ x │ y │ bbox │ geometry │ level │ tile_x │ tile_y │ cols │ rows │ metadata │ databand_1 │ │ int64 │ double │ double │ struct(xmin, ymin, xmax, ymax) │ geometry('epsg:25830') │ int32 │ int32 │ int32 │ int32 │ int32 │ JSON │ BLOB │ ├───────┼───────────┼────────────┼────────────────────────────────┼─────────────────────────┼───────┼────────┼────────┼───────┼───────┤────────────┤────────────┤ │ 0 │ 545619.75 │ 4724508.25 │ { │ POLYGON ((...)) │ 0 │ 0 │ 0 │ 320 │ 8 │ {...} │ ... │ │ │ │ │ 'xmin': 545539.75, │ │ │ │ │ │ │ │ │ │ │ │ │ 'ymin': 4724506.25, │ │ │ │ │ │ │ │ │ │ │ │ │ 'xmax': 545699.75, │ │ │ │ │ │ │ │ │ │ │ │ │ 'ymax': 4724510.25 │ │ │ │ │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ │ ├───────┼───────────┼────────────┼────────────────────────────────┼─────────────────────────┼───────┼────────┼────────┼───────┼───────┤────────────┤────────────┤ │ 1 │ 545619.75 │ 4724504.25 │ { │ POLYGON ((...)) │ 0 │ 0 │ 1 │ 320 │ 8 │ {...} │ ... │ │ │ │ │ 'xmin': 545539.75, │ │ │ │ │ │ │ │ │ │ │ │ │ 'ymin': 4724502.25, │ │ │ │ │ │ │ │ │ │ │ │ │ 'xmax': 545699.75, │ │ │ │ │ │ │ │ │ │ │ │ │ 'ymax': 4724506.25 │ │ │ │ │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ │ └───────┴───────────┴────────────┴────────────────────────────────┴─────────────────────────┴───────┴────────┴────────┴───────┴───────┴────────────┴────────────┘
The
RT_Readtable function is based on the GDAL translator library and enables reading raster data from a variety of geospatial raster file formats as if they were DuckDB tables.See RT_Drivers for a list of supported file formats and drivers.
The table returned by
RT_Readis a tiled representation of the raster file, where each row corresponds to a tile of the raster. The tile size is determined by the original block size of the raster file, but it can be overridden by the user using theblocksize_xandblocksize_yparameters.geometrycolumn is aGEOMETRYtype of typePOLYGONthat represents the footprint of each tile and you can use it to create a new geoparquet file adding the optionGEOPARQUET_VERSION.The
RT_Readfunction accepts parameters, most of them optional:Parameter Type Description pathVARCHAR The path to the file to read. The unique mandatory parameter. open_optionsVARCHAR[] A list of key-value pairs that are passed to the GDAL driver to control the opening of the file. Read GDAL documentation for available options. allowed_driversVARCHAR[] A list of GDAL driver names that are allowed to be used to open the file. If empty, all drivers are allowed. sibling_filesVARCHAR[] A list of sibling files that are required to open the file. compressionVARCHAR The compression method to use when packing the databand column. NONEis the unique option now.blocksize_xINTEGER The block size of the tile in the x direction. You can use this parameter to override the original block size of the raster. blocksize_yINTEGER The block size of the tile in the y direction. You can use this parameter to override the original block size of the raster. skip_empty_tilesBOOLEAN truemeans that empty tiles (tiles with no data) will be skipped (It checksGDAL_DATA_COVERAGE_STATUS_DATAflag if supported).trueis the default.datacubeBOOLEAN truemeans that extension returns one unique N-dimensional databand column with all bands interleaved, otherwise each band is returned as a separate column.falseis the default.This is the list of columns returned by
RT_Read:idis a unique identifier for each tile of the raster.xandyare the coordinates of the center of each tile. The coordinate reference system is the same as the one of the raster file.bboxis the bounding box of each tile, which is a struct withxmin,ymin,xmax, andymaxfields.geometryis the footprint of each tile as a polygon.level,tile_x, andtile_yare the tile coordinates of each tile. The raster is read in tiles of sizeblocksize_xxblocksize_y(or the original block size of the raster if not overridden by the parameters). Each row of the output table corresponds to a tile of the raster, and thedataband_xcolumns contain the data of that tile for each band.colsandrowsare the number of columns and rows of each tile, which can be different from the original raster if theblocksize_xandblocksize_yparameters are used to override the block size.metadatais a JSON column that contains the metadata of the raster file, including the list of bands and their properties (data type, no data value, etc), the spatial reference system, the geotransform, and any other metadata provided by the GDAL driver.databand_xare BLOB columns that contain the data of the raster bands and a header metadata describing the schema of the data. If thedatacubeoption is set totrue, only a single column calleddatacubewill contain all bands interleaved in a single N-dimensional array.
The data band columns are a BLOB with the following internal structure:
- A Header describes the raster tile data stored in the BLOB.
-
magic(uint16_t): Magic code to identify a BLOB as a raster block (0x5253) -
compression(uint8_t): Compression algorithm code used for the tile data.0=NONEis the unique option now, but more can be added in the future. -
data_type(uint8_t): RasterDataType of the tile data:Code Data Type Description 0 UNKNOWN Unknown or unspecified type 1 UINT8 Eight bit unsigned integer 2 INT8 8-bit signed integer 3 UINT16 Sixteen bit unsigned integer 4 INT16 Sixteen bit signed integer 5 UINT32 Thirty two bit unsigned integer 6 INT32 Thirty two bit signed integer 7 UINT64 64 bit unsigned integer 8 INT64 64 bit signed integer 9 FLOAT Thirty two bit floating point 10 DOUBLE Sixty four bit floating point -
bands(int32_t): Number of bands or layers in the data buffer -
cols(int32_t): Number of columns in the data buffer -
rows(int32_t): Number of rows in the data buffer -
no_data(double): NoData value for the tile (To consider when applying algebra operations).-infinityif not defined.
-
data[] (uint8_t): Interleaved pixel data for all bands, stored in row-major order. The size of this array depends on the data type, number of bands, and tile dimensions.
By using
RT_Read, the extension also provides “replacement scans” for common raster file formats, allowing you to query files of these formats as if they were tables directly.SELECT * FROM './path/to/some/raster/dataset.tif';
In practice this is just syntax-sugar for calling
RT_Read, so there is no difference in performance. If you want to pass additional options, you should use theRT_Readtable function directly.The following formats are currently recognized by their file extension:
Format Extension GeoTiff COG .tif, .tiff Erdas Imagine .img GDAL Virtual .vrt RT_Readsupports filter pushdown on the non-BLOB columns, which allows you to prefilter the tiles that are loaded based on their metadata or spatial location. As you have noticed,bboxandgeometrycolumns are available for spatial filtering, so for example, you can filter the tiles that intersect with a certain geometry:SELECT x, y, bbox, geometry FROM RT_Read('path/to/raster/file.tif') WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON((...))')::GEOMETRY('EPSG:4326')) AND ST_Area(geometry) > 1000 ;
-
You can write raster files using the
COPYcommand in DuckDB.The extension fetches the geometry and data band columns of the input table and creates a raster file with the desired properties. The
geometrycolumn is used to calculate the spatial extent and the resolution of the output raster, and the data band columns are used to populate the pixel values of the output raster.The extension provides the format
RASTERand a set of options to control the writing process:Parameter Type Description FORMATVARCHAR Must be set to 'RASTER' to use the raster writing functionality. DRIVERVARCHAR The GDAL driver to use to write the raster file. You can check available drivers using RT_Driversfunction.CREATION_OPTIONSVARCHAR[] A list of key-value pairs that are passed to the GDAL driver to control the creation of the file. Read GDAL documentation for available options. RESAMPLINGVARCHAR The resampling method to use when the tile size of the input data does not match the block size of the output raster. Available options are nearest,bilinear,cubic,cubicspline,lanczos,average,mode, andmax.nearestis the default.ENVELOPEDOUBLE[] The spatial extent of the output raster in the format [xmin, ymin, xmax, ymax]. If not provided, the extent will be calculated from the input tiles. SRSVARCHAR The spatial reference system of the output raster in WKT or EPSG code format. GEOMETRY_COLUMNVARCHAR The name of the column that contains the geometry of the tiles. This column will be used to calculate the spatial extent and resolution of the output raster. It must be a column of type GEOMETRY.geometryis the default.DATABAND_COLUMNSVARCHAR[] A list of columns that contain the data bands of the raster. These columns must be of type BLOB and have the same internal structure as the data band columns returned by RT_Read.Raster rotation is not supported, so the input
geometrycolumn must contain axis-aligned polygons that represent the footprint of each tile.For example, you can write a new raster file from an existing one by running:
COPY ( SELECT geometry, databand_1, databand_2, databand_3 FROM RT_Read('./test/data/overlay-sample.tiff') ) TO './test/data/copytoraster.tiff' WITH ( FORMAT RASTER, DRIVER 'COG', CREATION_OPTIONS ('COMPRESS=LZW'), RESAMPLING 'nearest', ENVELOPE [545539.750, 4724420.250, 545699.750, 4724510.250], SRS 'EPSG:25830', GEOMETRY_COLUMN 'geometry', DATABAND_COLUMNS ['databand_3', 'databand_2', 'databand_1'] );
Also, because the
geometrycolumn is available, you can create a newgeoparquetfile (or any other geospatial format supported by thespatialextension) with the tile data and their geometries by just running:COPY ( SELECT * EXCLUDE(databand_1,databand_2,databand_3) FROM RT_Read('path/to/raster/file.tif') ) TO 'path/to/output/file.parquet' WITH ( FORMAT PARQUET, GEOPARQUET_VERSION 'V1' ); -- Or using the spatial extension, for example, writing a GeoPackage file: LOAD spatial; COPY ( SELECT * EXCLUDE(databand_1,databand_2,databand_3) FROM RT_Read('path/to/raster/file.tif') ) TO 'path/to/output/file.gpkg' WITH ( FORMAT GDAL, DRIVER 'GPKG', SRS 'EPSG:4326' );
-
Transforms the BLOB data of the data band columns into an array of a numeric data type.
SELECT RT_Blob2ArrayInt32(databand_1, true) AS r, RT_Blob2ArrayInt32(databand_2, true) AS g, RT_Blob2ArrayInt32(databand_3, true) AS b FROM RT_Read('path/to/raster/file.tif') ;
Function accepts the following parameters:
Parameter Type Description blobBLOB The BLOB column of the data band to transform. filter_nodataBOOLEAN Whether to filter out NoData values from the array. If true, the function will exclude NoData values from the resulting array.Extension provides a different function for each numeric data type:
Function Description RT_Blob2ArrayUInt8Transforms a BLOB data column into an array of UINT8 values RT_Blob2ArrayInt8Transforms a BLOB data column into an array of INT8 values RT_Blob2ArrayUInt16Transforms a BLOB data column into an array of UINT16 values RT_Blob2ArrayInt16Transforms a BLOB data column into an array of INT16 values RT_Blob2ArrayUInt32Transforms a BLOB data column into an array of UINT32 values RT_Blob2ArrayInt32Transforms a BLOB data column into an array of INT32 values RT_Blob2ArrayUInt64Transforms a BLOB data column into an array of UINT64 values RT_Blob2ArrayInt64Transforms a BLOB data column into an array of INT64 values RT_Blob2ArrayFloatTransforms a BLOB data column into an array of FLOAT values RT_Blob2ArrayDoubleTransforms a BLOB data column into an array of DOUBLE values Functions return a struct with the following fields:
data_type(INT): RasterDataType code of the data in the BLOB.bands(INT): Number of bands or layers in the data buffer.cols(INT): Number of columns in the tile.rows(INT): Number of rows in the tile.no_data(DOUBLE): NoData value for the tile (To consider when applying algebra operations).-infinityif not defined.values(ARRAY): An array with the pixel values of the tile for the corresponding band and data type.
This allows you to do algebraic operations with the data of the tiles directly in SQL:
WITH __input AS ( SELECT RT_Blob2ArrayInt32(databand_1, false) AS r FROM RT_Read('path/to/raster/file.tif', blocksize_x := 512, blocksize_y := 512) ) SELECT list_min(r.values) AS r_min, list_stddev_pop(r.values) AS r_avg, list_max(r.values) AS r_max FROM __input ;
Choose carefully which
RT_Blob2Array<type>function you invoke, if the array element type in the output does not match the data type in the BLOB data column, the function need to adjust values accordingly, and the performance may be affected. You can check the data type of the bands in themetadatacolumn returned byRT_Read. -
Transforms an array of numeric values into a BLOB data column.
WITH __input AS ( SELECT RT_Blob2ArrayInt32(databand_1, false) AS r FROM RT_Read('path/to/raster/file.tif', blocksize_x := 512, blocksize_y := 512) ) SELECT RT_Array2Blob(r.values, 'none', r.bands, r.cols, r.rows, r.no_data) AS r_array FROM __input ;
Function accepts the following parameters:
Parameter Type Description arrayARRAY The array of numeric values to transform into a BLOB column. compressionVARCHAR The compression method to use when packing the data into the BLOB. NONEis the unique option now.bandsINT Number of bands or layers in the data buffer. colsINT Number of columns in the tile. rowsINT Number of rows in the tile. no_dataDOUBLE NoData value for the tile (To consider when applying algebra operations). This function allows you to transform the results of algebraic operations on the tile data back into a BLOB column with the internal structure required by
COPYwithFORMAT RASTER, so you can write the results of your operations into a new raster file.
The full list of functions and their documentation is available in the function reference
This is the list of things I have in mind for the future, but if you want to contribute or have any suggestion please let me know!
- Add basic math operations for the BLOB databand columns, like
+,-,*,/. - Compression formats for the BLOB databand columns (
GZip,ZSTD?). - Integration with DuckDB File System.
You need a recent version of CMake (3.5) and a C++14 compatible compiler.
We also highly recommend that you install Ninja which you can select when building by setting the GEN=ninja environment variable.
git clone --recurse-submodules https://github.com/ahuarte47/duckdb-raster
cd duckdb-raster
make release
You can then invoke the built DuckDB (with the extension statically linked)
./build/release/duckdb
Please see the Makefile for more options, or the extension template documentation for more details.
Different tests can be created for DuckDB extensions. The primary way of testing DuckDB extensions should be the SQL tests in ./test/sql. These SQL tests can be run using:
make testTo install your extension binaries from S3, you will need to do two things. Firstly, DuckDB should be launched with the
allow_unsigned_extensions option set to true. How to set this will depend on the client you're using. Some examples:
CLI:
duckdb -unsignedPython:
con = duckdb.connect(':memory:', config={'allow_unsigned_extensions' : 'true'})NodeJS:
db = new duckdb.Database(':memory:', {"allow_unsigned_extensions": "true"});Secondly, you will need to set the repository endpoint in DuckDB to the HTTP url of your bucket + version of the extension you want to install. To do this run the following SQL query in DuckDB:
SET custom_extension_repository='bucket.s3.eu-west-1.amazonaws.com/<your_extension_name>/latest';Note that the /latest path will allow you to install the latest extension version available for your current version of
DuckDB. To specify a specific version, you can pass the version instead.
After running these steps, you can install and load your extension using the regular INSTALL/LOAD commands in DuckDB:
INSTALL raster;
LOAD raster;Enjoy!