PORPOISE: Development and External Validation of ML Models for Identifying Patients at Risk of Postoperative Prolonged Opioid Use
The PORPOISE (PostOpeRative Prolonged OpioId uSE) study develops and validates machine learning (ML) models that identify patients at risk of prolonged opioid use in a diverse, multisite cohort, evaluating not only the models' performance but also their generalizability, discrimination, and calibration across different subgroups, including patients with diabetes, depression, and obesity.
This study has been validated in four countries, and all internal and external validation results will soon be available via a Shiny app.
As soon as the study paper is published, all pre-trained models will be made available on this GitHub page for future validation by researchers in additional countries and comparison with other models.
- Study protocol: PORPOISE-Study-Protocol-V2.0.pdf
- Preliminary results from Protocol V1.0: Shiny App
- Final external validation results: Shiny app will be available soon
- Participation call: OHDSI Forum
- Project introduction: Slides and Presentation
- Oral presentations:
  - Naderalvojoud, B., Hond, A., Shapiro, A., Coquet, J., Seto, T., Hernandez-Boussard, T. Predicting Prolonged Opioid Use Following Surgery Using Machine Learning: Challenges and Outcomes. American Medical Informatics Association (AMIA) Annual Symposium, Washington, DC, 2022.
  - Naderalvojoud, B. and Hernandez-Boussard, T. Machine Learning for Predicting Patients at Risk of Prolonged Opioid Use Following Surgery. Observational Health Data Sciences and Informatics (OHDSI) Annual Symposium, Rockville, MD, 2022.

The objectives of this project are to:
- Improve pain management following surgery.
- Identify patients at risk for prolonged opioid use prior to prescribing pain management regimens.
- Develop and validate ML models in a diverse, multisite cohort by evaluating their generalizability, discrimination, and calibration abilities.
- Evaluate the transportability of ML models based on population differences in the various CDM databases.
- Train and validate five machine learning algorithms using a multiple prediction module.
- Externally validate pre-trained models on various CDM databases using three subgroup cohorts for diabetes, depression, and obesity.
PORPOISE is being developed in R Studio using the OHDSI PatientLevelPrediction R package.
- R version 4.1.3
- RStudio 2022.02.0
- JAVA
- RTools
- PatientLevelPrediction R package version >= 6.0.4
- FeatureExtraction R package version >= 3.2.0
- DatabaseConnector R package version >= 5.1.0
- OhdsiShinyModules R package version >= 1.0.0
- SqlRender R package version >= 1.10.0
- OMOP CDM version >= 5.0.0
The prediction module requires the PatientLevelPrediction (PLP) R package. The PLP package requires installing:
- R
- RStudio
- Java
- RTools (for Windows users)
- Xcode command line tools (for Mac and Linux users)
To install all the above requirements, please follow the instructions in the R setup document provided by the OHDSI HADES team. After setting up the R environment, you can install the PLP package as follows (more information).
```r
install.packages("remotes")
remotes::install_github("ohdsi/FeatureExtraction")
remotes::install_github("ohdsi/OhdsiShinyModules")
remotes::install_github("ohdsi/PatientLevelPrediction")
```

`remotes` will automatically install the latest release and all the latest dependencies. To install a specific version of a package, you can run:
```r
remotes::install_github("ohdsi/FeatureExtraction@v3.2.0")
remotes::install_github("ohdsi/PatientLevelPrediction@v6.0.4")
```

To install SqlRender and DatabaseConnector, run:
```r
install.packages("SqlRender")
install.packages("DatabaseConnector")
```

If the CDM dataset is in BigQuery, the following package must be installed:
```r
remotes::install_github("jdposada/BQJdbcConnectionStringR")
```

The project's materials and methods are divided into the cohort and prediction studies.
The OHDSI ATLAS software was used to define and characterize the project cohorts. The cohort study of PORPOISE is carried out by sharing JSON rather than R code, so project partners can obtain the results without knowing R and visually explore the cohort inclusion criteria and characterization settings in ATLAS. The cohort-study folder contains the JSON file pertaining to the cohorts' definition and characterization settings that can be imported and executed on any ATLAS instance.
To run the cohort definition in ATLAS:

1. Go to `Cohort Definitions` from the left menu bar
2. Click on the `New Cohort` button
3. Go to the `Export` tab
4. Go to the `JSON` tab
5. Add the JSON into the text box
6. Click on the `Reload` button
7. Save the cohort
8. Go to the `Generation` tab and click on the `Generate` button
To run cohort characterization in ATLAS:

1. Go to `Characterizations` from the left menu bar
2. Click on the `New Characterization` button
3. Go to the `Utilities` tab
4. Go to the `Import` tab
5. Add the JSON into the text box
6. Click on the `Import` button
7. Save the setting
8. Go to the `Executions` tab and click on the `Generate` button
All JSONs include the necessary concept sets and can be used independently.
The PORPOISE prediction study was developed using the OHDSI PLP R package and can be run with two settings:
- Model training with internal validation
- External validation of pre-trained models
Both settings make use of the same code in different configurations. All the configuration is done through config.yml in the config folder. The config file is classified into four groups: run, bq, db, and cdm.
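As an orientation, the skeleton below sketches which parameters belong to which group. All values are placeholders and the exact YAML nesting may differ from your copy of config.yml; the individual parameters themselves are documented in the project's parameter tables:

```yaml
run:
  type: "multiple"            # 'multiple' or 'cohort'
  plp_output_folder_name: "PlpMultiOutput"
  external_validation: "No"
  cohort_generator: "Yes"
bq:                           # only when the CDM dataset is in BigQuery
  credentials: "/path/to/credentials.json"
  projectId: "my-project"
db:                           # for all other databases
  dbms: "postgresql"
  server: "localhost/ohdsi"
cdm:
  target_database_schema: "my_work_schema"
  cohort_table: "cohort"
  target_cohort_id: 1
```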
All the parameters in the run class are described in the table below:
| run Parameter | Description |
|---|---|
| type ('multiple' or 'cohort') | If set to 'multiple', the multiple prediction module is run. If set to 'cohort', the prediction is not run; the cohort and CDM subset generators are run if their parameters are set to Yes. |
| plp_output_folder_name | The name of the output folder for PLP multiple training when the type parameter is set to multiple. Do not reuse a folder name that already exists: leftover files will prevent the PLP module from running completely, and the results will not be as expected. |
| feature_selection_output | Path to the feature selection output file. This file is created by running ./src/featureSelection.R |
| external_validation (Yes/No) | If Yes, previously trained models will be validated on the specified CDM dataset. If the value is No, local training with internal validation is performed. |
| pretrained_models_folder_name | The name of the output folder containing the pretrained models, which should be moved into the project folder to be validated in the target CDM database. This must be set if the external_validation parameter is set to Yes |
| validation_subgroup (Yes/No) | If Yes, previously trained models will be validated on the target cohort as well as three subgroups: diabetes, depression, and obesity. |
| validation_noPostCriteria (Yes/No) | If Yes, previously trained models will be validated on the target cohort in which all post-index criteria have been removed. |
| only_validation_test_set (Yes/No) | If Yes, previously trained models will be validated on the target and subgroup cohorts in which all patients used in training have been removed. Please note that the cohorts must be created first by running ./src/createTestSetCohorts.R |
| cohort_generator (Yes/No) | If Yes, target and outcome cohorts will be generated prior to training. This parameter can only be set to Yes once. After the cohorts are generated in the first run, you no longer need to set this parameter to Yes for subsequent runs, and it can be set to No. |
| cohort_subgroup_generator (Yes/No) | If Yes, the three subgroup cohorts are generated from the target cohort and saved in the cohort table. These sub-cohorts will be used for model validation. |
| cdm_subset_generator (Yes/No) | Because some CDM tables are big and are maintained in the cloud with a query charge, you can create a subset of clinical tables in a working schema (determined as target_database_schema in the config file). If it is set to Yes, a subset of CDM tables with records related to the subjects in the target cohort is created. |
| models (LR, RF, AB, GB, NB) | This parameter indicates all models included in the multiple prediction module. It includes 'LR' (Lasso Logistic Regression), 'RF' (Random Forest), 'AB' (AdaBoost), 'GB' (Gradient Boosting Machine), and 'NB' (Naive Bayes). The same models can be run using features previously selected. To this end, you should first run ./src/featureSelection.R and set this parameter as LRFS, RFFS, ABFS, GBFS, and NBFS. |
Some CDM data warehouses, e.g., BigQuery, require credentials for access. If you use BQ, you must configure all of the following parameters.
| bq Parameter | Description |
|---|---|
| credentials | Credentials JSON path |
| driverPath | JDBC Driver path for BigQuery |
| projectId | Project name |
| defaultDataset | Working dataset name |
For other databases, you should configure the following parameters. If the target dataset is not in BigQuery, the bq group must be removed from the config file; when bq is not found, the system automatically falls back to db. You can directly modify the getDbConnectionDetails() function in ./src/databaseConnection.R according to your target database settings.
| db Parameter | Description |
|---|---|
| dbms | The dbms name, e.g., postgresql |
| server | Database Server URL |
| port | Database port |
| user | Database user name |
| password | Database password |
| driverPath | Path to the db driver |
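For illustration, these db parameters correspond closely to the arguments of DatabaseConnector::createConnectionDetails(). The following is a minimal R sketch with placeholder values, not the project's actual code (that lives in getDbConnectionDetails() in ./src/databaseConnection.R):

```r
library(DatabaseConnector)

# Placeholder values; in the project these come from the db group of config.yml.
connectionDetails <- createConnectionDetails(
  dbms         = "postgresql",            # db: dbms
  server       = "localhost/ohdsi",       # db: server
  port         = 5432,                    # db: port
  user         = "cdm_user",              # db: user
  password     = Sys.getenv("DB_PASSWORD"),
  pathToDriver = "/path/to/jdbc/drivers"  # db: driverPath
)
```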
All the parameters related to the CDM dataset are described below:

| cdm Parameter | Description |
|---|---|
| target_database_schema | working schema containing the cohort table |
| cohort_table | cohort table name |
| target_cohort_id | target cohort id |
| outcome_cohort_id | outcome cohort id |
| diabetes_cohort_id | diabetes cohort id |
| depression_cohort_id | depression cohort id |
| obesity_cohort_id | obesity cohort id |
| targetNoPostCriteria_cohort_id | target cohort with no post criteria id |
| vocabulary_database_schema | CDM vocabulary database schema |
| cdm_database_schema | CDM database schema containing all clinical tables |
| cdm_database_name | CDM database name |
Before training models, you need to generate the target and outcome cohorts in the target database schema (your working schema), as well as a subset of the CDM dataset. To that end, you need to set cohort_generator and cohort_subgroup_generator to 'Yes' and run the main.R in the src folder.
The cohort_subgroup_generator parameter generates external validation subgroups for later use. To train the local models, you only need to set the external_validation parameter to 'No', and run the main.R as follows:
```shell
Rscript ./src/main.R
```

You can run the cohort generator and prediction modules simultaneously or separately. To only run the cohort generator, the config file should be configured as follows:
```yaml
type: "cohort"
external_validation: "No"
validation_subgroup: "No"
cohort_generator: "Yes"
cohort_subgroup_generator: "Yes"
cdm_subset_generator: "No"
```
To only run the multiple prediction module, the config file should be set up as follows:
```yaml
type: "multiple"
plp_output_folder_name: "<optional name, e.g. PlpMultiOutput>"
external_validation: "No"
validation_subgroup: "No"
cohort_generator: "No"
cohort_subgroup_generator: "No"
cdm_subset_generator: "No"
```
To only run the external validation module, the config file should be set up as follows:
```yaml
type: "multiple"
external_validation: "Yes"
pretrained_models_folder_name: ""
validation_subgroup: "Yes"
cohort_generator: "No"
cohort_subgroup_generator: "No"
cdm_subset_generator: "No"
```
The default cohort table and ids are as follows. Any new cohort table with any ids can be created.
```yaml
cohort_table: "cohort"
target_cohort_id: 1
outcome_cohort_id: 2
diabetes_cohort_id: 3
depression_cohort_id: 4
obesity_cohort_id: 5
targetNoPostCriteria_cohort_id: 6
```
By executing the multiple prediction module, the PlpMultiOutput folder will be created. This folder will contain five Analysis folders corresponding to prediction models, two target cohort folders, and a settings.csv file, as well as a sqlite folder containing databaseFile.sqlite to store all prediction results.
- Each Analysis folder must include:
  - plpLog.txt
  - diagnosePlp.rds
  - plpResult folder including:
    - runPlp.rds
    - model folder including:
      - attributes.json
      - covariateImportance.csv (only for LR)
      - model.json, or a model folder including model.pkl
      - modelDesign.json
      - preprocessing.json
      - trainDetails.json
- Each targetId folder must include:
  - cohorts.rds
  - covariates
  - metaData.rds
  - outcomes.rds
  - timeRef.rds
To share the models for external validation, partners only need to keep the content of the model folder in ./PlpMultiOutput/Analysis_*/plpResult/model and the databaseFile.sqlite file in the ./PlpMultiOutput/sqlite folder. All other folders can be deleted before sharing the output with us.
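The pruning step above can be sketched in shell. The mock files are created first only so the snippet is self-contained and runnable; the folder names follow the layout described above:

```shell
# Build a mock PlpMultiOutput tree matching the layout described above,
# then prune it down to the shareable pieces.
OUT=PlpMultiOutput
mkdir -p "$OUT/Analysis_1/plpResult/model" "$OUT/sqlite" "$OUT/targetId_1"
touch "$OUT/Analysis_1/plpLog.txt" \
      "$OUT/Analysis_1/plpResult/runPlp.rds" \
      "$OUT/Analysis_1/plpResult/model/model.json" \
      "$OUT/sqlite/databaseFile.sqlite" \
      "$OUT/targetId_1/cohorts.rds"

# Keep only Analysis_*/plpResult/model and the sqlite folder; delete the rest.
find "$OUT" -mindepth 1 -maxdepth 1 ! -name 'Analysis_*' ! -name sqlite -exec rm -rf {} +
for a in "$OUT"/Analysis_*; do
  find "$a" -mindepth 1 -maxdepth 1 ! -name plpResult -exec rm -rf {} +
  find "$a/plpResult" -mindepth 1 -maxdepth 1 ! -name model -exec rm -rf {} +
done
```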
To validate pre-trained models, you need to:

1. Set up the config.yml file, including the db (or bq) and cdm parameters for the external validation database, and set the external_validation and cohort_subgroup_generator parameters to 'Yes' (if you have already generated the subgroups, there is no need to set cohort_subgroup_generator to 'Yes')
2. Place the already trained results in the PlpMultiOutput folder
3. Run main.R
The results of external validation will be generated in the ./PlpMultiOutput/Validation path.
The Validation folder contains four sub-folders for each cohort id, including target and evaluation subgroups. Each cohort id folder must contain five Analysis folders that correspond to those created in the PlpMultiOutput folder.
After running the prediction module, both internal and external validation results are automatically inserted into a databaseFile.sqlite file in the ./PlpMultiOutput/sqlite folder. If the sqlite file is created correctly, you will be able to see both the internal and external validation results using the Shiny app. To run the shiny app after running the prediction module, run the following code by giving the output folder path:
```r
viewMultiplePlp("./PlpMultiOutput")
```

In addition to the sqlite file, the prediction module also exports the results through a set of CSV files in the ./PlpMultiOutput/csv folder. To share the results, you only need to send us the databaseFile.sqlite file. The results do not include any PHI or other high-risk data. However, partners may review the content of the results in the CSV files according to their institution's policy.
It is worth noting that the prediction module generates an output folder called "PlpMultiOutput". As a result, when training new models or validating pre-trained models, the output folders must be managed manually to avoid overwriting.
If you require assistance with the project, please contact Dr. Behzad Naderalvojoud at behzadn[at]stanford[dot]edu or OHDSI forum.
The following errors may occur while the prediction module is running. If you encounter an error, please send us the corresponding log file. Separate log files are created in the corresponding folders for each model. You can also save and send us all the console process logs (for the entire process).
When running the multiple prediction module, some errors may occur, such as insufficient memory for plpData. The module should be run with at least 16 GB of RAM; an out-of-memory error may corrupt the results, and you may need to restart the module.
Any interruption during feature extraction causes temp tables to remain in the target database schema (your working schema). As a result, you must explore your database schema to see if any temp tables exist and then delete them all. Otherwise, the feature extraction module would throw an exception.
When the prediction module is executed, some temp tables in the target database schema are created. If you do not have write permission, you may encounter an error during the run. This is another reason why the CDM database schema is separated from the target schema and a subset of CDM is generated in the target database schema.
`Object has no attribute 'feature_importances_'`: this error may occur for models that do not provide a function to calculate feature importance. It has no effect on the prediction results.
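The harmless skip can be pictured with a small, purely illustrative Python sketch. The class names here are hypothetical stand-ins, not part of the PLP package; in practice this concerns models such as Naive Bayes, which expose no feature_importances_ attribute:

```python
# Illustrative only: how a missing 'feature_importances_' attribute can be
# skipped without affecting predictions. The classes are stand-ins.
class TreeEnsemble:
    feature_importances_ = [0.7, 0.3]

class NaiveBayes:
    pass  # no feature_importances_ attribute

def covariate_importance(model):
    # Return None instead of raising AttributeError for models that
    # cannot compute feature importance; predictions are unaffected.
    return getattr(model, "feature_importances_", None)

print(covariate_importance(TreeEnsemble()))  # [0.7, 0.3]
print(covariate_importance(NaiveBayes()))    # None
```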
PORPOISE is licensed under Apache License 2.0.
