Releases: karimhalal/Thesis

Security Audit Report For DSH Import

21 Jan 18:52

Pre-release

Repository: Thesis | Scan Date: 2026-01-21 | Risk Level: LOW


Summary

Category | Findings | Status
Secrets (R Scanner) | 0 | N/A
PII Instances (Presidio) | 242 | All false positives
R Security Patterns | 3 | Actionable
Files Scanned (PII) | 41 |
Files Scanned (R) | 26 |
Files Excluded | mock_data_NMS.R | Synthetic test data

Executive Summary

After manual review, no actual security vulnerabilities or personally identifiable information (PII) were found:

  • 242 PII detections: All are false positives (drug approval dates and R code patterns)
  • 3 R pattern detections: Hardcoded file paths that should use here::here() for portability

1. PII Detection (Microsoft Presidio)

Status: Completed | Threshold: 0.9 | Findings: 242 (all false positives)

PII Entity Types Scanned

The following 14 entity types were scanned using Microsoft Presidio's NLP-based detection:

Entity Type | Description
PERSON | Names of individuals that could identify real people
EMAIL_ADDRESS | Email addresses that could be used to contact or identify individuals
PHONE_NUMBER | Phone numbers that could be used to contact individuals
LOCATION | Geographic locations, addresses, or place names
DATE_TIME | Dates and times that could be linked to individuals or events
NRP | Nationalities, religious or political groups
MEDICAL_LICENSE | Medical license numbers or healthcare identifiers
URL | Web addresses that could contain sensitive endpoints or parameters
IP_ADDRESS | IP addresses that could identify networks or individuals
CREDIT_CARD | Credit card numbers
IBAN_CODE | International Bank Account Numbers
SSN | US Social Security Numbers
US_PASSPORT | US Passport numbers
US_DRIVER_LICENSE | US Driver's License numbers

Summary of Findings by Entity Type

Entity Type | Count | Status | Explanation
DATE_TIME | 238 | False Positive | Drug approval dates from Health Canada DIN database (public regulatory data)
IP_ADDRESS | 3 | False Positive | R code patterns misidentified as IP addresses: indentation and namespace calls (::) were incorrectly identified as potential components of an IP address

Analysis

DATE_TIME Detections (238 - False Positives)

All DATE_TIME detections are from CSV files containing Drug Identification Number (DIN) reference data from Health Canada. The dates represent:

  • Drug market authorization dates
  • Drug discontinuation dates

These are publicly available regulatory dates, not personally identifiable information. They can be used during NMS variable cleaning to check that drug prescription dates are feasible, that is, that each prescription falls within the period when the drug was publicly available. For more information, visit the Drug Product Database.

Files affected:

  • worksheets/din_list2up.csv (44 detections)
  • worksheets/NMS_sheets/DIN _list.csv (55 detections)
  • worksheets/NMS_sheets/din_list_oral.csv (56 detections)
  • worksheets/NMS_sheets/DIN_list2.csv (44 detections)
  • worksheets/NMS_sheets/NMS_Datasheet.csv (37 detections)
  • worksheets/NMS_sheets/bzd.csv (2 detections)
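The date-feasibility check mentioned above (a prescription date should fall within the period when the drug was on the market) could be sketched as follows. This is a minimal illustration only: the data frames, column names (din, authorized_on, discontinued_on, rx_date), and values are hypothetical and do not reflect the actual headers or contents of the worksheets listed.

```r
# Hypothetical DIN reference table; column names and values are assumptions,
# not the actual contents of the worksheets listed above.
din_ref <- data.frame(
  din = c("00000001", "00000002"),
  authorized_on = as.Date(c("2001-03-15", "2010-06-01")),
  discontinued_on = as.Date(c(NA, "2018-12-31")),
  stringsAsFactors = FALSE
)

# Hypothetical prescription records to validate.
rx <- data.frame(
  din = c("00000001", "00000002", "00000002"),
  rx_date = as.Date(c("2015-01-10", "2009-05-20", "2019-02-01")),
  stringsAsFactors = FALSE
)

# Attach the regulatory dates to each prescription, keeping prescriptions
# whose DIN has no match in the reference table.
checked <- merge(rx, din_ref, by = "din", all.x = TRUE)

# A prescription is feasible when it falls on or after the authorization date
# and on or before the discontinuation date. A missing discontinuation date
# is treated as "still marketed"; a missing authorization date is infeasible.
checked$feasible <- with(
  checked,
  !is.na(authorized_on) &
    rx_date >= authorized_on &
    (is.na(discontinued_on) | rx_date <= discontinued_on)
)

print(checked[, c("din", "rx_date", "feasible")])
```

Treating a missing discontinuation date as "still marketed" is a design assumption; a stricter cleaning step might instead flag those rows for manual review.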

IP_ADDRESS Detections (3 - False Positives)

R code patterns where whitespace/indentation was misidentified:

  • R/get-desc-data.R:180, 182, 260 - dplyr filter operations

2. R-Specific Security Patterns

Status: Completed | Findings: 3 (all actionable)

Security Patterns Scanned

The following 10 R-specific security patterns were scanned using regex-based detection:

Pattern | Risk Level | Description
system_call | High | Direct system command execution via system() - could allow arbitrary command injection if user input is passed
system2_call | Medium | System command execution via system2() - safer than system() but still requires input validation
shell_call | High | Shell command execution - similar risks to system() calls
eval_parse | High | Dynamic code evaluation via eval(parse()) - high risk if parsing untrusted input, can execute arbitrary R code
source_url | High | Sourcing R code from remote URLs - risk of executing malicious code from compromised sources
download_file | Medium | File downloads from external sources via download.file() - could introduce malicious files or data
db_connection | Medium | Database connections (dbConnect, odbcConnect, mongoConnect) - ensure credentials are not hardcoded and connections are secure
hardcoded_path | Low | Hardcoded file paths (e.g., /Users/, /home/, C:\\) - reduces portability and may expose system structure
password_var | High | Potential password or secret storage in variables (password, passwd, api_key, token, etc.) - credentials should be stored securely, not in code
connection_string | High | Database connection strings (mysql://, postgres://, mongodb://, redis://) - may contain embedded credentials
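To illustrate how regex-based detection of this kind can work, here is a minimal sketch covering three of the patterns above. The regular expressions and the scan_lines() helper are assumptions written for illustration; they are not the expressions used by the custom scanner in this audit.

```r
# Illustrative subset of patterns. These regexes are assumptions, not the
# ones used by the audit's custom scanner.
patterns <- c(
  system_call    = "\\bsystem\\s*\\(",
  eval_parse     = "\\beval\\s*\\(\\s*parse\\s*\\(",
  hardcoded_path = "(/Users/|/home/|\\b[A-Za-z]:\\\\)"
)

# Hypothetical helper: return one row per (pattern, line) match,
# or NULL when nothing matches.
scan_lines <- function(lines, patterns) {
  hits <- lapply(names(patterns), function(name) {
    idx <- grep(patterns[[name]], lines, perl = TRUE)
    if (length(idx) == 0) {
      return(NULL)
    }
    data.frame(
      pattern = name,
      line = idx,
      text = trimws(lines[idx]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, hits)
}

# Made-up input lines; in practice these would come from readLines() on a file.
example_lines <- c(
  'df <- read.csv("/Users/someone/project/data.csv")',
  'result <- eval(parse(text = user_input))',
  'df <- read.csv(here::here("data", "data.csv"))'
)

findings <- scan_lines(example_lines, patterns)
print(findings)
```

A line-based regex scan like this also matches text inside comments and strings, so every hit still needs manual review.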

Actionable Findings

These hardcoded paths should be converted to relative paths using here::here():

File | Line | Issue | Recommended Fix
R/conversion_factor_assignment.R | 6 | Hardcoded path | Use here::here("worksheets/DIN _list.csv")
R/dependency_table.R | 33 | Hardcoded path | Use here::here("worksheets/...")
R/dose_cat_fun.R | 1 | Hardcoded path | Use here::here("R/DIN_utils.R")
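As a rough illustration of the recommended fix, the sketch below contrasts an absolute path with a project-relative one built by here::here(). The "before" path is hypothetical (the original hardcoded paths are not reproduced in this report), and the example assumes the here package is installed.

```r
# Before (hypothetical absolute path, shown only for contrast):
# din_list <- read.csv("/Users/someone/Thesis/worksheets/DIN _list.csv")

# After: build the path relative to the project root, which here::here()
# locates from markers such as an .Rproj file or a .git directory.
din_path <- here::here("worksheets", "DIN _list.csv")

# Guard the read so the sketch also runs outside the actual repository.
if (file.exists(din_path)) {
  din_list <- read.csv(din_path)
} else {
  message("File not found at: ", din_path)
}
```

Passing path components as separate arguments is equivalent to the single-string form in the table above and avoids hardcoding a path separator.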

3. Solutions

Actions taken

  1. Convert hardcoded paths to relative paths - Use here::here() for all file operations to ensure portability - Action completed

No Action Required

  • DATE_TIME detections - Public drug regulatory dates, not personally identifiable information. These are publicly available through Health Canada's Drug Product Database
  • IP_ADDRESS detections - R code patterns, not actual IP addresses

Best Practices (Already Followed)

  • No secrets or credentials detected
  • No actual PII (names, emails, SSNs) found in data or code
  • Mock data file appropriately excluded from attached zip

4. Tools Used

Tool | Purpose | Status
Microsoft Presidio | PII detection using NLP | Completed
spaCy (en_core_web_lg) | Named Entity Recognition | Completed
Custom R Scanner | R-specific security patterns | Completed

5. Package and Dependency Check

All packages and dependencies were assessed to determine their source using the code enclosed in the following block:

# Load the lockfile
lockfile <- renv::lockfile_read("renv.lock")

# Identify packages not from CRAN
non_cran <- Filter(function(pkg) {
  is.null(pkg$Repository) || pkg$Repository != "CRAN"
}, lockfile$Packages)

# Print results
if (length(non_cran) > 0) {
  print("Non-CRAN packages found:")
  print(names(non_cran))
} else {
  print("All packages are from CRAN.")
}

The analysis identified no non-CRAN packages. The renv lockfile was created and populated using R 4.4.2, the most recent version available in DSH.

Conclusion

The repository contains no actual PII or security vulnerabilities. All 242 automated PII detections were false positives caused by:

  1. Public regulatory dates in drug reference datasets
  2. R code patterns misidentified as IP addresses

The only recommended action was to convert 3 hardcoded file paths to relative paths for improved code portability; this has been addressed.


Report generated by security_audit_safe.R | Manual review completed: 2026-01-21