R Package Structure
The structure indicated in Database Backend can, in principle, be edited by any software capable of writing to DuckDB/Parquet. However, a language-agnostic representation of statistical model objects (e.g., random forests) is out of scope, so a concrete implementation must be tied to a specific software environment. The implementation of this logic will be in R and will be organised in a package structure.
See the package dev intro for an overview and principles. See the workflow phases for an outline of what package functions are associated with which phase.
`parquet_db` is the base database class, providing domain-agnostic parquet-backed storage with DuckDB for in-memory SQL operations.
Key features:
- Uses a folder structure where each table is stored as a parquet file (or hive-partitioned directory)
- DuckDB runs in-memory for SQL operations while data is persisted to disk in parquet format
- Supports ZSTD compression for efficient storage
- Can load DuckDB extensions (e.g., "spatial")
Core methods:
- `initialize(path, extensions)` - Create/connect to a database folder
- `execute(statement)` - Execute SQL statements
- `get_query(statement)` - Execute SQL queries and return `data.table` results
- `commit(x, table_name, method, ...)` - Write data to parquet files
- `fetch(table_name, where, limit, ...)` - Read data from parquet files
- `attach_table(table_name)` / `detach_table(table_name)` - Register/unregister tables in DuckDB
- `with_tables(tables, fn)` - Execute a function with tables temporarily attached
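To illustrate how these methods compose, here is a minimal sketch of driving the base class directly; in practice `evoland_db` wraps it. The table name and column layout are illustrative, and the exact forms of the `where` and `fn` arguments are assumptions:

```r
library(evoland)

# Create/connect to a parquet-backed database folder with an extension loaded
db <- parquet_db$new(path = "scratch_db", extensions = "spatial")

# Persist a data.table as a parquet file
db$commit(
  data.table::data.table(id = 1:3, value = c("a", "b", "c")),
  table_name = "example_t",
  method = "overwrite"
)

# Read it back, then run ad-hoc SQL with the table temporarily attached
db$fetch("example_t", limit = 10)
db$with_tables("example_t", function() {
  db$get_query("SELECT count(*) AS n FROM example_t")
})
```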
An R6 class that inherits from `parquet_db` and provides the domain-specific interface for land use change analysis. The class is defined across multiple files:

- `evoland_db.R` - Core class definition and methods
- `evoland_db_tables.R` - Table active bindings (read/write access)
- `evoland_db_views.R` - View active bindings and query methods
- `evoland_db_neighbors.R` - Neighbor analysis methods
```r
db <- evoland_db$new(
  path = "myproject.evolanddb",
  report_name = "my_scenario",
  report_name_pretty = "My Scenario Description"
)
```

Tables can be read and written using active bindings with automatic validation:
```r
# Read a table
coords <- db$coords_t

# Write/upsert a table
db$lulc_meta_t <- create_lulc_meta_t(lulc_spec)
db$lulc_data_t <- as_lulc_data_t(lulc_data)
```

Available table bindings: `reporting_t`, `coords_t`, `periods_t`, `runs_t`, `lulc_meta_t`, `lulc_data_t`, `pred_meta_t`, `pred_data_t_float`, `pred_data_t_int`, `pred_data_t_bool`, `trans_meta_t`, `trans_preds_t`, `trans_rates_t`, `intrv_meta_t`, `intrv_masks_t`, `trans_models_t`, `alloc_params_t`, `neighbors_t`
Computed views that don't store additional data:
- `lulc_meta_long_v` - Unrolled LULC metadata with one row per source class
- `pred_sources_v` - Distinct predictor URLs and their MD5 checksums
- `trans_v` - Land use transitions derived from consecutive LULC observations
- `extent` - Spatial extent of `coords_t` as a `terra::SpatExtent`
- `coords_minimal` - Minimal coordinate representation (id_coord, lon, lat)
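Views are read through active bindings in the same way as tables; a minimal sketch:

```r
trans <- db$trans_v      # transitions derived from consecutive LULC observations
ext <- db$extent         # terra::SpatExtent of the coordinate grid
xy <- db$coords_minimal  # data with id_coord, lon, lat
```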
Setup methods:

- `set_report(...)` - Set reporting metadata key-value pairs
- `set_coords(type, epsg, extent, resolution)` - Initialize coordinate grid
- `set_periods(period_length_str, start_observed, end_observed, end_extrapolated)` - Define time periods
- `set_neighbors(max_distance, distance_breaks)` - Compute neighbor relationships
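`set_coords()`, `set_periods()`, and `set_neighbors()` are demonstrated in the workflow further below; `set_report()` takes key-value pairs, for example (hypothetical keys, not a fixed schema):

```r
db$set_report(author = "Jane Doe", institution = "Example Lab")
```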
Ingestion methods:

- `add_predictor(pred_spec, pred_data, pred_type)` - Add a predictor variable to the database
Query methods:

- `trans_pred_data_v(id_trans, id_period, id_pred, na_value)` - Wide table of transition results and predictor data
- `pred_data_wide_v(id_trans, id_period, na_value)` - Wide predictor data for transition probability prediction
- `trans_rates_dinamica_v(id_period)` - Transition rates formatted for Dinamica export
- `lulc_data_as_rast(extent, resolution, id_period)` - Convert LULC data to a terra `SpatRaster`
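A short sketch of two of these methods; argument values are illustrative, and omitted arguments (`id_pred`, `extent`, `resolution`) are assumed to fall back to defaults stored in the database:

```r
# Wide table of transition outcomes joined with predictor values
wide <- db$trans_pred_data_v(id_trans = 1, id_period = 2, na_value = 0)

# Rasterize the LULC data of one period for visual inspection
r <- db$lulc_data_as_rast(id_period = 2)
terra::plot(r)
```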
Calibration methods:

- `set_full_trans_preds(overwrite)` - Initialize full transition-predictor relationships
- `get_pruned_trans_preds_t(filter_fun, na_value, cores, ...)` - Feature selection for transitions
- `fit_partial_models(fit_fun, gof_fun, sample_frac, seed, na_value, cores, ...)` - Fit models on stratified samples
- `fit_full_models(partial_models, gof_criterion, maximize, na_value, cores)` - Refit best models on full data
- `predict_trans_pot(id_period)` - Predict transition potential
Allocation methods:

- `create_alloc_params_t(n_perturbations, sd)` - Compute allocation parameters from historical data
- `eval_alloc_params_t(id_runs, work_dir, keep_intermediate)` - Evaluate allocation parameters via simulation
- `alloc_dinamica(id_periods, id_run, work_dir, keep_intermediate)` - Run Dinamica EGO simulation
For multi-run scenarios with hierarchical inheritance:
```r
db$use_run(id_run = 1) # Activate run context
# ... operations scoped to run 1 ...
db$use_run(NULL) # Return to global context
```

Each table in the schema has a corresponding S3 class that inherits from `data.table`.
Creating objects is done via `as_*` functions, for instance:

```r
coords <- as_coords_t(my_data)
periods <- as_periods_t(period_data)
```

Some tables also have `create_*` constructor functions that generate data from specifications:
```r
lulc_meta <- create_lulc_meta_t(list(
  forest = list(pretty_name = "Forest", src_classes = 1:3),
  urban = list(pretty_name = "Urban", src_classes = 4:6)
))

periods <- create_periods_t(
  period_length_str = "P10Y",
  start_observed = "1985-01-01",
  end_observed = "2020-01-01",
  end_extrapolated = "2060-01-01"
)
```

Upon creation, type coercion and validation are performed via `validate.*` S3 methods. A specific S3 print method is implemented for each class, showing the class name, summary statistics, and a preview of the data.
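As an illustration, a `validate.*` method might follow the pattern sketched below; the specific checks and column names (taken from `coords_minimal` above) are assumptions, not the actual implementation:

```r
# Hypothetical sketch of a validate.* method for the coords_t class
validate.coords_t <- function(x, ...) {
  stopifnot(
    data.table::is.data.table(x),
    all(c("id_coord", "lon", "lat") %in% names(x)),
    is.numeric(x$lon),
    is.numeric(x$lat)
  )
  invisible(x)
}
```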
The package supports four main phases: Setup, Ingestion, Calibration, and Prediction/Allocation.
Initialize the database with spatial and temporal configuration.
```r
library(evoland)

db <- evoland_db$new(
  path = "switzerland.evolanddb",
  report_name = "ch_lulc",
  report_name_pretty = "Swiss Land Use Change Model"
)

# Define coordinate grid
db$set_coords(
  type = "square",
  epsg = 2056,
  extent = terra::ext(c(
    xmin = 2480000,
    xmax = 2840000,
    ymin = 1070000,
    ymax = 1300000
  )),
  resolution = 100
)

# Define time periods
db$set_periods(
  period_length_str = "P10Y",
  start_observed = "1985-01-01",
  end_observed = "2020-01-01",
  end_extrapolated = "2060-01-01"
)
```

```r
# Define LULC classes with mappings from source data
db$lulc_meta_t <- create_lulc_meta_t(list(
  forest = list(
    pretty_name = "Forest",
    description = "All forest types",
    src_classes = c(50:60)
  ),
  urban = list(
    pretty_name = "Urban Areas",
    description = "Built-up areas",
    src_classes = c(1:14)
  )
  # ... more classes
))

# Ingest LULC observations
db$lulc_data_t <- as_lulc_data_t(lulc_observations)
```

```r
# Add predictors one at a time with metadata
db$add_predictor(
  pred_spec = list(
    elevation = list(
      unit = "masl",
      pretty_name = "Elevation",
      description = "Digital elevation model",
      sources = list(list(url = "...", md5sum = "..."))
    )
  ),
  pred_data = elevation_data, # data.table with id_coord, id_period, value
  pred_type = "float"
)
```

```r
# Compute spatial neighbors (can be slow for large datasets)
db$set_neighbors(
  max_distance = 1000,
  distance_breaks = c(0, 100, 500, 1000)
)

# Generate neighbor-based LULC count predictors
db$generate_neighbor_predictors()
```

```r
# Analyze observed transitions and determine viability
db$trans_meta_t <- create_trans_meta_t(
  db$trans_v,
  min_cardinality_abs = 10000,
  exclude_anterior = 9 # e.g., exclude "static" class
)
```

```r
# Initialize full predictor set
db$set_full_trans_preds(overwrite = TRUE)

# Apply covariance filtering
trans_preds_filtered <- db$get_pruned_trans_preds_t(
  filter_fun = covariance_filter,
  corcut = 0.7,
  na_value = 0,
  cores = 4
)
db$commit(trans_preds_filtered, "trans_preds_t", method = "overwrite")

# Optional: Apply guided regularized random forest filtering
trans_preds_grrf <- db$get_pruned_trans_preds_t(
  filter_fun = grrf_filter,
  num.trees = 100,
  gamma = 0.8,
  cores = 4
)
db$commit(trans_preds_grrf, "trans_preds_t", method = "overwrite")
```

```r
# Fit partial models with train/test split
partial_models <- db$fit_partial_models(
  fit_fun = fit_glm, # or fit_ranger for random forests
  gof_fun = gof_glm, # or gof_ranger
  sample_frac = 0.7,
  seed = 42,
  na_value = 0,
  cores = 4
)

# Select best models and refit on full data
full_models <- db$fit_full_models(
  partial_models = partial_models,
  gof_criterion = "auc",
  maximize = TRUE,
  cores = 4
)
db$trans_models_t <- full_models
```

```r
# Calculate observed historical rates
obs_rates <- create_obs_trans_rates_t(db$trans_v, db$trans_meta_t)
db$trans_rates_t <- obs_rates

# Extrapolate to future periods
db$trans_rates_t <- create_extr_trans_rates_t(obs_rates, db$periods_t)
```

```r
# Compute patch expansion/patcher parameters from historical data
db$alloc_params_t <- db$create_alloc_params_t(
  n_perturbations = 5,
  sd = 0.05
)

# Optional: Evaluate parameters against observed data (requires Dinamica EGO)
db$alloc_params_t <- db$eval_alloc_params_t()
```

```r
# Predict transition potential for a future period
trans_pot <- db$predict_trans_pot(id_period = 5)

# Run full simulation with Dinamica EGO
db$alloc_dinamica(
  id_periods = c(4, 5, 6, 7, 8),
  id_run = 0,
  work_dir = "dinamica_runs"
)
```

It is generally easiest to follow the structures shown in Hadley Wickham's and Jennifer Bryan's *R Packages*.
- `DESCRIPTION` holds metadata and declares dependencies, but does not import them.
- `NAMESPACE` exports and imports objects out of and into the package namespace. We use roxygen2 to populate this file.
- `LICENSE.md` holds the license text. We use the AGPL.
- `README.md` should welcome developers, whom we can assume to be identical with users for now.
- `R/` contains all exportable R logic. No nested directories are allowed.
- `src/` contains C++ code interfacing with R via Rcpp.
- `man/` and `vignettes/` contain manual pages and vignettes, respectively. The former is populated using roxygen2; the latter can be written in Quarto's markdown.
- `data-raw/` contains logic to populate `data/`, which is used to deliver (really small) sample datasets.
- `inst/tinytest/` contains tests using the tinytest framework.
- `inst/` contains any data that should be available verbatim when the package is installed.
- `.Rbuildignore` indicates which files from the source package structure should not be included in built/installed packages.
The `R/` directory is organized as follows:
| File(s) | Purpose |
|---|---|
| `parquet_db.R` | Base database class (domain-agnostic) |
| `evoland_db.R` | Main evoland database class |
| `evoland_db_tables.R` | Table active bindings |
| `evoland_db_views.R` | View methods and active bindings |
| `evoland_db_neighbors.R` | Neighbor analysis methods |
| `*_t.R` (e.g., `coords_t.R`) | Table class definitions with `as_*`, `create_*`, `validate.*`, `print.*` |
| `trans_models_glm.R`, `trans_models_rf.R` | Model fitting implementations |
| `covariance_filter.R`, `grrf_filter.R` | Feature selection algorithms |
| `alloc_dinamica.R` | Dinamica EGO integration |
| `fuzzy_similarity.R` | Map comparison metrics |
| `util*.R` | Utility functions |
| `init.R` | Package initialization |
The style should follow the tidyverse style guide. The code should be autoformatted with air before committing; the configuration for air is in `air.toml`.
The dependencies used in the package are declared in `DESCRIPTION`; see the R Packages chapter.
Dependencies should be kept minimal.
Core dependencies (Imports):
- `R6` - OOP for database classes
- `data.table` - Efficient data manipulation
- `DBI`, `duckdb` - Database operations
- `terra` - Spatial data handling
- `glue` - String interpolation
- `qs2` - Fast object serialization
- `Rcpp` - C++ integration
- `stringi`, `curl` - Utilities
Optional dependencies (Suggests):
- `ranger` - Random forest models
- `pROC` - ROC/AUC calculations
- `processx` - External process management (Dinamica)
- `tinytest` - Testing framework
- `quarto` - Vignette building
The roxygen `@importFrom` tag should be used for functions that are called so often that the `package::function()` syntax becomes cumbersome.
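For example, a package-level roxygen block might import a handful of frequently used `data.table` functions (which functions warrant this is a case-by-case judgment):

```r
# Import frequently used functions into the package namespace via roxygen2
#' @importFrom data.table data.table setnames
NULL
```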
This package uses tinytest for testing:
```sh
# Test the full package (build, install, test)
R -e "tinytest::build_install_test()"

# Test individual files during development
R -e "pkgload::load_all(); tinytest::run_test_file('inst/tinytest/test_coords_t.R')"
```

Non-exported functions are tested using the `evoland:::private_function` syntax.