R Package Structure

mmyrte edited this page Jan 29, 2026 · 2 revisions

The structure indicated in Database Backend can theoretically be edited by any software capable of writing to DuckDB/Parquet. However, a language-agnostic representation of statistical model objects (e.g. random forests) is out of scope, so a concrete implementation needs to be tied to a specific software environment. The logic will be implemented in R and organised as a package.

See the package dev intro for an overview and principles. See the workflow phases for an outline of what package functions are associated with which phase.

R Package Classes

Database Class parquet_db

The base database class providing domain-agnostic parquet-backed storage with DuckDB for in-memory SQL operations.

Key features:

  • Uses a folder structure where each table is stored as a parquet file (or hive-partitioned directory)
  • DuckDB runs in-memory for SQL operations while data is persisted to disk in parquet format
  • Supports ZSTD compression for efficient storage
  • Can load DuckDB extensions (e.g., "spatial")

Core methods:

  • initialize(path, extensions) - Create/connect to a database folder
  • execute(statement) - Execute SQL statements
  • get_query(statement) - Execute SQL queries and return data.table results
  • commit(x, table_name, method, ...) - Write data to parquet files
  • fetch(table_name, where, limit, ...) - Read data from parquet files
  • attach_table(table_name) / detach_table(table_name) - Register/unregister tables in DuckDB
  • with_tables(tables, fn) - Execute a function with tables temporarily attached
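
As a hedged sketch of how these methods compose (table and column names here are illustrative, not part of the actual schema), a typical round trip through a parquet_db might look like this:

```r
library(data.table)

# Create/connect to a database folder, loading the spatial extension
db <- parquet_db$new(path = "scratch.parquetdb", extensions = "spatial")

# Persist a data.table as a parquet file backing the table "obs"
db$commit(
  data.table(id = 1:3, value = c(0.1, 0.2, 0.3)),
  table_name = "obs",
  method = "overwrite"
)

# Read it back, optionally filtered
obs <- db$fetch("obs", where = "value > 0.1")

# Run arbitrary SQL with the table temporarily attached in DuckDB
res <- db$with_tables("obs", function() {
  db$get_query("SELECT count(*) AS n FROM obs")
})
```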

Database Class evoland_db

An R6 class that inherits from parquet_db and provides the domain-specific interface for land use change analysis. The class is defined across multiple files:

  • evoland_db.R - Core class definition and methods
  • evoland_db_tables.R - Table active bindings (read/write access)
  • evoland_db_views.R - View active bindings and query methods
  • evoland_db_neighbors.R - Neighbor analysis methods

Initialization

db <- evoland_db$new(
  path = "myproject.evolanddb",
  report_name = "my_scenario",
  report_name_pretty = "My Scenario Description"
)

Active Bindings for Tables

Tables can be read and written using active bindings with automatic validation:

# Read a table
coords <- db$coords_t

# Write/upsert a table
db$lulc_meta_t <- create_lulc_meta_t(lulc_spec)
db$lulc_data_t <- as_lulc_data_t(lulc_data)

Available table bindings: reporting_t, coords_t, periods_t, runs_t, lulc_meta_t, lulc_data_t, pred_meta_t, pred_data_t_float, pred_data_t_int, pred_data_t_bool, trans_meta_t, trans_preds_t, trans_rates_t, intrv_meta_t, intrv_masks_t, trans_models_t, alloc_params_t, neighbors_t

Active Bindings for Views

Computed views that don't store additional data:

  • lulc_meta_long_v - Unrolled LULC metadata with one row per source class
  • pred_sources_v - Distinct predictor URLs and their MD5 checksums
  • trans_v - Land use transitions derived from consecutive LULC observations
  • extent - Spatial extent of coords_t as terra::SpatExtent
  • coords_minimal - Minimal coordinate representation (id_coord, lon, lat)
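
Based on the bindings listed above, views are read like ordinary fields (they are computed on access, so assignment is not supported):

```r
# Read computed views via active bindings
transitions <- db$trans_v          # derived from consecutive LULC observations
ext         <- db$extent           # terra::SpatExtent of the coordinate grid
xy          <- db$coords_minimal   # id_coord, lon, lat only
```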

Setter Methods

  • set_report(...) - Set reporting metadata key-value pairs
  • set_coords(type, epsg, extent, resolution) - Initialize coordinate grid
  • set_periods(period_length_str, start_observed, end_observed, end_extrapolated) - Define time periods
  • set_neighbors(max_distance, distance_breaks) - Compute neighbor relationships

Adder Methods

  • add_predictor(pred_spec, pred_data, pred_type) - Add a predictor variable to the database

Query Methods

  • trans_pred_data_v(id_trans, id_period, id_pred, na_value) - Wide table of transition results and predictor data
  • pred_data_wide_v(id_trans, id_period, na_value) - Wide predictor data for transition probability prediction
  • trans_rates_dinamica_v(id_period) - Transition rates formatted for Dinamica export
  • lulc_data_as_rast(extent, resolution, id_period) - Convert LULC data to a terra::SpatRaster
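
A hedged sketch of calling these query methods (argument values are illustrative; optional arguments such as id_pred are left at their defaults):

```r
# Wide table of transition outcomes plus predictor columns, e.g. for model fitting
train_dt <- db$trans_pred_data_v(id_trans = 1, id_period = 2, na_value = 0)

# Wide predictor data for predicting transition potential in a future period
pred_dt <- db$pred_data_wide_v(id_trans = 1, id_period = 5, na_value = 0)

# Rasterize LULC observations for a given period
lulc_rast <- db$lulc_data_as_rast(
  extent = db$extent, resolution = 100, id_period = 2
)
```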

Analysis Methods

  • set_full_trans_preds(overwrite) - Initialize full transition-predictor relationships
  • get_pruned_trans_preds_t(filter_fun, na_value, cores, ...) - Feature selection for transitions
  • fit_partial_models(fit_fun, gof_fun, sample_frac, seed, na_value, cores, ...) - Fit models on stratified samples
  • fit_full_models(partial_models, gof_criterion, maximize, na_value, cores) - Refit best models on full data
  • predict_trans_pot(id_period) - Predict transition potential

Allocation Methods

  • create_alloc_params_t(n_perturbations, sd) - Compute allocation parameters from historical data
  • eval_alloc_params_t(id_runs, work_dir, keep_intermediate) - Evaluate allocation parameters via simulation
  • alloc_dinamica(id_periods, id_run, work_dir, keep_intermediate) - Run Dinamica EGO simulation

Context Management

For multi-run scenarios with hierarchical inheritance:

db$use_run(id_run = 1)  # Activate run context
# ... operations scoped to run 1 ...
db$use_run(NULL)        # Return to global context

Table Classes

Each table in the schema has a corresponding S3 class that inherits from data.table. Objects are created via as_* functions, for instance:

coords <- as_coords_t(my_data)
periods <- as_periods_t(period_data)

Some tables also have create_* constructor functions that generate data from specifications:

lulc_meta <- create_lulc_meta_t(list(
  forest = list(pretty_name = "Forest", src_classes = 1:3),
  urban = list(pretty_name = "Urban", src_classes = 4:6)
))

periods <- create_periods_t(
  period_length_str = "P10Y",
  start_observed = "1985-01-01",
  end_observed = "2020-01-01",
  end_extrapolated = "2060-01-01"
)

Upon creation, type coercion and validation are performed via validate.* S3 methods. Each class also has a specific S3 print method showing the class name, summary statistics, and a preview of the data.
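
As a sketch of that pattern (the class name, columns, and checks below are hypothetical, not an actual evoland schema):

```r
library(data.table)

# Coerce input to the table class, then validate
as_example_t <- function(x) {
  x <- as.data.table(x)
  setattr(x, "class", c("example_t", class(x)))
  validate.example_t(x)
}

# Type coercion and invariant checks; returns the object invisibly on success
validate.example_t <- function(x) {
  stopifnot(all(c("id", "value") %in% names(x)))
  x[, id := as.integer(id)]
  x
}

# Compact preview instead of the default data.table print
print.example_t <- function(x, ...) {
  cat("<example_t>", nrow(x), "rows\n")
  print(head(as.data.table(x)), ...)
  invisible(x)
}
```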

Workflow Phases

The package supports four main phases: Setup, Ingestion, Calibration, and Prediction/Allocation.

Phase 0: Setup

Initialize the database with spatial and temporal configuration.

library(evoland)

db <- evoland_db$new(
  path = "switzerland.evolanddb",
  report_name = "ch_lulc",
  report_name_pretty = "Swiss Land Use Change Model"
)

# Define coordinate grid
db$set_coords(
  type = "square",
  epsg = 2056,
  extent = terra::ext(c(
    xmin = 2480000,
    xmax = 2840000,
    ymin = 1070000,
    ymax = 1300000
  )),
  resolution = 100
)

# Define time periods
db$set_periods(
  period_length_str = "P10Y",
  start_observed = "1985-01-01",
  end_observed = "2020-01-01",
  end_extrapolated = "2060-01-01"
)

Phase 1: Data Ingestion

LULC Data

# Define LULC classes with mappings from source data
db$lulc_meta_t <- create_lulc_meta_t(list(
  forest = list(
    pretty_name = "Forest",
    description = "All forest types",
    src_classes = c(50:60)
  ),
  urban = list(
    pretty_name = "Urban Areas",
    description = "Built-up areas",
    src_classes = c(1:14)
  )
  # ... more classes
))

# Ingest LULC observations
db$lulc_data_t <- as_lulc_data_t(lulc_observations)

Predictor Data

# Add predictors one at a time with metadata
db$add_predictor(
  pred_spec = list(
    elevation = list(
      unit = "masl",
      pretty_name = "Elevation",
      description = "Digital elevation model",
      sources = list(list(url = "...", md5sum = "..."))
    )
  ),
  pred_data = elevation_data,  # data.table with id_coord, id_period, value
  pred_type = "float"
)

Neighbor Relationships

# Compute spatial neighbors (can be slow for large datasets)
db$set_neighbors(
  max_distance = 1000,
  distance_breaks = c(0, 100, 500, 1000)
)

# Generate neighbor-based LULC count predictors
db$generate_neighbor_predictors()

Phase 2: Calibration

Transition Metadata

# Analyze observed transitions and determine viability
db$trans_meta_t <- create_trans_meta_t(
  db$trans_v,
  min_cardinality_abs = 10000,
  exclude_anterior = 9  # e.g., exclude "static" class
)

Feature Selection

# Initialize full predictor set
db$set_full_trans_preds(overwrite = TRUE)

# Apply covariance filtering
trans_preds_filtered <- db$get_pruned_trans_preds_t(
  filter_fun = covariance_filter,
  corcut = 0.7,
  na_value = 0,
  cores = 4
)
db$commit(trans_preds_filtered, "trans_preds_t", method = "overwrite")

# Optional: Apply guided regularized random forest filtering
trans_preds_grrf <- db$get_pruned_trans_preds_t(
  filter_fun = grrf_filter,
  num.trees = 100,
  gamma = 0.8,
  cores = 4
)
db$commit(trans_preds_grrf, "trans_preds_t", method = "overwrite")

Model Training

# Fit partial models with train/test split
partial_models <- db$fit_partial_models(
  fit_fun = fit_glm,      # or fit_ranger for random forests
  gof_fun = gof_glm,      # or gof_ranger
  sample_frac = 0.7,
  seed = 42,
  na_value = 0,
  cores = 4
)

# Select best models and refit on full data
full_models <- db$fit_full_models(
  partial_models = partial_models,
  gof_criterion = "auc",
  maximize = TRUE,
  cores = 4
)

db$trans_models_t <- full_models

Transition Rates

# Calculate observed historical rates
obs_rates <- create_obs_trans_rates_t(db$trans_v, db$trans_meta_t)
db$trans_rates_t <- obs_rates

# Extrapolate to future periods
db$trans_rates_t <- create_extr_trans_rates_t(obs_rates, db$periods_t)

Allocation Parameters

# Compute patch expansion/patcher parameters from historical data
db$alloc_params_t <- db$create_alloc_params_t(
  n_perturbations = 5,
  sd = 0.05
)

# Optional: Evaluate parameters against observed data (requires Dinamica EGO)
db$alloc_params_t <- db$eval_alloc_params_t()

Phase 3: Prediction and Allocation

# Predict transition potential for a future period
trans_pot <- db$predict_trans_pot(id_period = 5)

# Run full simulation with Dinamica EGO
db$alloc_dinamica(
  id_periods = c(4, 5, 6, 7, 8),
  id_run = 0,
  work_dir = "dinamica_runs"
)

Very Short Intro to R Package Development

It's generally easiest to follow the structures shown in Hadley Wickham and Jennifer Bryan's *R Packages*.

File Structure

  • DESCRIPTION holds metadata and declares dependencies but does not import them.
  • NAMESPACE exports and imports objects out of and into the package namespace. We use roxygen2 to populate this file.
  • LICENSE.md holds a license text. We use the AGPL.
  • README.md should welcome developers, whom we can for now assume to be identical with users.
  • R/ contains all exportable R logic. No nested directories are allowed.
  • src/ contains C++ code interfacing with R via Rcpp.
  • man/ and vignettes/ contain manual pages and vignettes respectively. The former is populated using roxygen2. The latter can be written in Quarto markdown.
  • data-raw/ contains logic to populate data/, used to deliver (really small) sample datasets.
  • inst/tinytest/ contains tests using the tinytest framework.
  • inst/ contains any data that should be available verbatim when the package is installed.
  • .Rbuildignore indicates which files from the source package structure should not be included in built/installed packages.

R Source Organization

The R/ directory is organized as follows:

| File(s) | Purpose |
| --- | --- |
| parquet_db.R | Base database class (domain-agnostic) |
| evoland_db.R | Main evoland database class |
| evoland_db_tables.R | Table active bindings |
| evoland_db_views.R | View methods and active bindings |
| evoland_db_neighbors.R | Neighbor analysis methods |
| *_t.R (e.g., coords_t.R) | Table class definitions with as_*, create_*, validate.*, print.* |
| trans_models_glm.R, trans_models_rf.R | Model fitting implementations |
| covariance_filter.R, grrf_filter.R | Feature selection algorithms |
| alloc_dinamica.R | Dinamica EGO integration |
| fuzzy_similarity.R | Map comparison metrics |
| util*.R | Utility functions |
| init.R | Package initialization |

Coding Style

The style should follow the tidyverse style guide. Code should be autoformatted with air before committing; the configuration for the latter is in air.toml.

Dependency Management

The dependencies used in the package are declared in DESCRIPTION, see the R Packages chapter. Dependencies should be kept minimal.

Core dependencies (Imports):

  • R6 - OOP for database classes
  • data.table - Efficient data manipulation
  • DBI, duckdb - Database operations
  • terra - Spatial data handling
  • glue - String interpolation
  • qs2 - Fast object serialization
  • Rcpp - C++ integration
  • stringi, curl - Utilities

Optional dependencies (Suggests):

  • ranger - Random forest models
  • pROC - ROC/AUC calculations
  • processx - External process management (Dinamica)
  • tinytest - Testing framework
  • quarto - Vignette building

The roxygen2 @importFrom tag should be used for functions that are called so often that the package::function() syntax becomes cumbersome.
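
For example, data.table's := operator cannot reasonably be written with the package:: prefix, so a roxygen2 tag like the following (conventionally attached to a package-level NULL object) is the usual solution; the exact set of imported names shown here is illustrative:

```r
#' Internal imports
#'
#' @importFrom data.table := .N .SD
#' @keywords internal
NULL
```

Running devtools::document() (or roxygen2::roxygenise()) then writes the corresponding importFrom() directives into NAMESPACE.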

Testing

This package uses tinytest for testing:

# Test the full package (build, install, test)
R -e "tinytest::build_test_install()"

# Test individual files during development
R -e "pkgload::load_all(); tinytest::run_test_file('inst/tinytest/test_coords_t.R')"

Non-exported functions are tested using the evoland:::private_function syntax.
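
A test file in inst/tinytest/ is a plain R script in which each expect_* call is one test case; a minimal hypothetical example:

```r
# inst/tinytest/test_example.R -- names and checks here are illustrative
library(data.table)

# Each expect_* call is reported individually by tinytest
expect_equal(1 + 1, 2)
expect_true(inherits(data.table(x = 1), "data.table"))

# Non-exported helpers would be reached via the triple-colon syntax, e.g.:
# expect_equal(evoland:::some_private_helper(1), 2)  # hypothetical
```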