dbstan is glue code for mapping rstan::stanfit objects to
relational database schemas. It leverages the DBI package to provide a
simple, DBMS-agnostic interface for:
INSERTing a representation of astanfitobject into a DB- Retrieving a particular
stanfitobject's information from the DB. - Materializing a
dbplyr-based list of tables which contain allstanfitinformation.
This makes it easy to move (batches of) sampler results into a database, which makes it more convenient to run analyses and share data.
The package maps a stanfit object into records stored in nine SQL tables:
stanfit.run_idsstanfit.run_infostanfit.model_parsstanfit.stanmodelstanfit.summarystanfit.c_summarystanfit.samplesstanfit.log_posteriorstanfit.sampler_params
library(rstan)
library(dbstan)
# Establish a DB connection using DBI
conn <- DBI::dbConnect(
RPostgres::Postgres(),
user = 'postgres',
password = 'password',
host = 'mydb.abcdefg1234567.us-east-2.rds.amazonaws.com'
)
# Get samples from any sampled Stan model. Currently only NUTS is supported.
fit1 <- stan(
file = "schools.stan", # Stan program
data = schools_data, # named list of data
)
# INSERT `stanfit` into db and store the generated primary key, optionally
# including all posterior samples.
id <- stanfit_insert(fit1, conn, insert_samples = T)
# Retrieve all relevant tables as a list of dbplyr-backed tables
tbl_dict(conn)
# Or, retrieve all relevant tables for a particular `id`
tables <- get_stanfit(id, conn)This helps avoid situations where:
-
Multiple researchers swap
.RDSarchives back and forth to exchange results.Now, just write queries against a database to get the results
-
Researchers want to do analysis across multiple (possibly many) sampled runs, but don't have enough RAM to do so, or don't want to keep track of various slimmed-down representations of the original
stanfitobject.Now, the RDBMS does the heavy lifting, and gracefully adapts to a model that may change over time in its parameter schema
-
Researchers are using a computing cluster and want to avoid shuffling around tons of
.RDSfiles, running out of space on either the cluster or their dev machine, etc.Now, the cluster can just
INSERTthestanfitobject once it is created, and move onto the next task - no write to disk necessary.
Tested with Postgres 14.2, but it should work with many other SQL-based databases.
-
Execute
init.sqlagainst your database, whichCREATEs a"stanfit"schema in your database andCREATEs all tables. We recommend readinginit.sqlfirst!If you're just testing out
dbstan, you could use a SQLite db for this, or Postgres in a container.psql -f init.sql -
Make sure you can connect to the database using
DBI::dbConnect(). -
Pass a
stanfitobject tostanfit_insert(stanfit_object, conn). The returned number is a unique identifieridwhich is a field in all the tables.
Results from calls to rstan::sampling() or rstan::stan() are stored in
stanfit objects. The contents of the object are described in detail
here. Each object is an S4 class with a bunch of slots that
represent various parts of the sampling process, such as the model code, the
samples drawn from the posterior, and diagnostic messages from the NUTS sampler.
dbstan organizes these slots into a relational model. The slots are
summarized below:
model_nameName of the modelmodel_parsParameters in the model, including thegenerated quantities{}block andtransformed parameters{}blockpar_dimsDimensions of said paramaetersmodeStatus code indicating success or failure of the samplersimMatrix containing the individual samples.initsInitial values of all parameters on the first iterationstan_argsArguments for the samplerstanmodelModel code, as astanmodelobjectdate- Date of object creation
The schema can be viewed in init.sql - here is a more descriptive mapping
between the stanfit object and its relational model.
This table is an index of all of the stanfit objects represented in the
schema. The method column identifies the run as a sampled run or an optimized
run.
| field | stanfit slot | notes |
|---|---|---|
| id | None | Serial, assigned by the database on insertion using stanfit_insert |
| method | None | enum, sampling/optimizing |
Basic information about the run including how long it took.
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | serial | Foreign key |
| model_name | @model_name |
text | The name of the file, minus the extension |
| date | @date |
timestamptz | The time the Stanfit object was created |
| duration | get_elapsed_time(r) %>% as_tibble(rownames='chain') |
JSON | Retval is in SECONDS* |
| mode | @mode |
numeric | |
| stan_args | @stan_args[[1]] |
JSON | Take the first chain and represent it as JSON |
Optimizing runs have a lot less information associated with them - here we only keep track of the optimizer's return code and the value of the log-posterior.
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | serial | Foreign key |
| return_code | $return_code |
integer | Return code from the optimizing routine |
| log_posterior | $value |
double precision | The value of the log-posterior at the point-estimate |
Model parameters and their dimension. Includes variables declared in:
parameters{}blocktransformed parameters{}blockgenerated quantities{}block
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | serial | Foreign key |
| par | names(r@par_dims) |
text | Includes all parameters/transformed ps/generated |
| dim | @par_dims |
numeric | 0 represents scalars |
Model code.
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | serial | ||
| code | get_stancode(r) |
text |
The output of calling rstan::summary() on a stanfit object and indexing
into the combined-chain summary (rstan::summary(obj)$summary).
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | ||
| par | summary(r)$summary$par |
text | |
| idx | Slightly complicated | int | |
| mean | summary(r)$summary$mean |
double precision | |
| se_mean | summary(r)$summary$se_mean |
double precision | |
| sd | summary(r)$summary$sd |
double precision | |
| P2_5 | summary(r)$summary[["2.5%"]] |
double precision | |
| P25 | summary(r)$summary[["25%"]] |
double precision | |
| P50 | summary(r)$summary[["50%"]] |
double precision | |
| P75 | summary(r)$summary[["75%"]] |
double precision | |
| P97_5 | summary(r)$summary[["97.5%"]] |
double precision | |
| n_eff | summary(r)$summary$n_eff |
double precision | |
| Rhat | summary(r)$summary$Rhat |
double precision |
Per-chain summary.
- There's no
n_efforRhat - The column names in the object returned by
summary(r)are confusing, as all vars exceptparare named[metric].chain:[n].
We pivot the data a bit to get it into nearly the same structure as the summary table.
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | ||
| chain | No exact mapping | ||
| par | summary(r)$summary$par |
text | |
| idx | Slightly complicated | double precision | |
| mean | summary(r)$summary$mean |
double precision | |
| se_mean | summary(r)$summary$se_mean |
double precision | |
| sd | summary(r)$summary$sd |
double precision | |
| P2_5 | summary(r)$summary[["2.5%"]] |
double precision | |
| P25 | summary(r)$summary[["25%"]] |
double precision | |
| P50 | summary(r)$summary[["50%"]] |
double precision | |
| P75 | summary(r)$summary[["75%"]] |
double precision | |
| P97_5 | summary(r)$summary[["97.5%"]] |
double precision |
All samples. Call stanfit_insert(sft, include_samples = T) to insert samples
from your posterior into the database. Otherwise, the default behavior is to
save (potentially lots) of space and not insert any samples.
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | smallint | using smallint to save space |
| chain | No exact mapping | smallint | |
| iter | see R/get_table_entries.R |
smallint | |
| par | text | ||
| idx | smallint | ||
| value | double precision |
Optimizing results are sufficiently different in structure from sampler results that they are given their own table.
| field | list key | type | notes |
|---|---|---|---|
| id | None | ||
| par | names(r$par) |
text | |
| idx | names(r$par) |
integer | |
| point_est | r$value |
double precision |
The log posterior at each iteration, for each chain. Note: log_posterior
values for optimized runs are not located here - they're in the
optimizing_run_info table.
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | integer | |
| chain | get_logposterior(r)[[1,2,...]] |
numeric | |
| iter | Row number of vector | numeric | |
| value | The vector from each chain | double precision |
Diagnostic parameters, by chain.
| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | integer | |
| chain | No exact mapping | numeric | |
| iter | Row number of vector | numeric | |
| accept_stat | get_sampler_params(r)[[chain]]$accept_stat__ |
double precision | |
| stepsize | get_sampler_params(r)[[chain]]$stepsize__ |
double precision | |
| treedepth | get_sampler_params(r)[[chain]]$treedepth__ |
double precision | |
| n_leapfrog | get_sampler_params(r)[[chain]]$n_leapfrog__ |
double precision | |
| divergent | get_sampler_params(r)[[chain]]$divergent__ |
double precision | 1=divergent, 0=not divergent |
| energy | get_sampler_params(r)[[chain]]$energy__ |
double precision |