Releases: karimhalal/Thesis
Security Audit Report For DSH Import
Repository: Thesis | Scan Date: 2026-01-21 | Risk Level: LOW
Summary
| Category | Findings | Status |
|---|---|---|
| Secrets (R Scanner) | 0 | N/A |
| PII Instances (Presidio) | 242 | All false positives |
| R Security Patterns | 3 | Actionable |
| Files Scanned (PII) | 41 | |
| Files Scanned (R) | 26 | |
| Files Excluded | mock_data_NMS.R | Synthetic test data |
Executive Summary
After review, no actual security vulnerabilities or personally identifiable information (PII) were found:
- 242 PII detections: All are false positives (drug approval dates and R code patterns)
- 3 R pattern detections: Hardcoded file paths that should use here::here() for portability
1. PII Detection (Microsoft Presidio)
Status: Completed | Threshold: 0.9 | Findings: 242 (all false positives)
PII Entity Types Scanned
The following 14 entity types were scanned using Microsoft Presidio's NLP-based detection:
| Entity Type | Description |
|---|---|
| PERSON | Names of individuals that could identify real people |
| EMAIL_ADDRESS | Email addresses that could be used to contact or identify individuals |
| PHONE_NUMBER | Phone numbers that could be used to contact individuals |
| LOCATION | Geographic locations, addresses, or place names |
| DATE_TIME | Dates and times that could be linked to individuals or events |
| NRP | Nationalities, religious or political groups |
| MEDICAL_LICENSE | Medical license numbers or healthcare identifiers |
| URL | Web addresses that could contain sensitive endpoints or parameters |
| IP_ADDRESS | IP addresses that could identify networks or individuals |
| CREDIT_CARD | Credit card numbers |
| IBAN_CODE | International Bank Account Numbers |
| SSN | US Social Security Numbers |
| US_PASSPORT | US Passport numbers |
| US_DRIVER_LICENSE | US Driver's License numbers |
Summary of Findings by Entity Type
| Entity Type | Count | Status | Explanation |
|---|---|---|---|
| DATE_TIME | 238 | False Positive | Drug approval dates from Health Canada DIN database (public regulatory data) |
| IP_ADDRESS | 3 | False Positive | R code patterns misidentified as IP addresses: indentation and namespace calls (::) were incorrectly identified as potential components of an IP address |
Analysis
DATE_TIME Detections (238 - False Positives)
All DATE_TIME detections are from CSV files containing Drug Identification Number (DIN) reference data from Health Canada. The dates represent:
- Drug market authorization dates
- Drug discontinuation dates
These are publicly available regulatory dates, not personally identifiable information. They can be used during NMS variable cleaning to check that drug prescription dates are feasible, i.e. that each prescription falls within the period when the drug was publicly available. For more information, visit the Drug Product Database.
Files affected:
- worksheets/din_list2up.csv (44 detections)
- worksheets/NMS_sheets/DIN _list.csv (55 detections)
- worksheets/NMS_sheets/din_list_oral.csv (56 detections)
- worksheets/NMS_sheets/DIN_list2.csv (44 detections)
- worksheets/NMS_sheets/NMS_Datasheet.csv (37 detections)
- worksheets/NMS_sheets/bzd.csv (2 detections)
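The feasibility check described above could be sketched roughly as follows. This is a minimal illustration, not the project's cleaning code: the column names (`din`, `market_date`, `discontinued_date`, `rx_date`), the function `check_rx_feasible()`, and all values are hypothetical, and a missing discontinuation date is assumed to mean the drug is still marketed.

```r
# Hypothetical DIN reference rows; column names and values are assumptions,
# not taken from the CSV files listed above.
din_ref <- data.frame(
  din = c("00000001", "00000002"),
  market_date = as.Date(c("2001-05-14", "2010-03-01")),
  discontinued_date = as.Date(c(NA, "2018-09-30")),
  stringsAsFactors = FALSE
)

# Hypothetical prescription records to validate.
rx <- data.frame(
  din = c("00000001", "00000002", "00000002"),
  rx_date = as.Date(c("2015-06-01", "2009-12-15", "2019-01-10")),
  stringsAsFactors = FALSE
)

# Flag prescriptions dated outside the drug's period of public availability.
check_rx_feasible <- function(rx, din_ref) {
  idx <- match(rx$din, din_ref$din)           # look up each prescription's DIN
  start <- din_ref$market_date[idx]
  end <- din_ref$discontinued_date[idx]
  # Assumption: a missing discontinuation date means "still marketed".
  # A DIN with no match in din_ref yields NA rather than TRUE or FALSE.
  rx$feasible <- rx$rx_date >= start & (is.na(end) | rx$rx_date <= end)
  rx
}

result <- check_rx_feasible(rx, din_ref)
print(result)
```

In this sketch only the first record is flagged as feasible; the second predates market authorization and the third postdates discontinuation.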
IP_ADDRESS Detections (3 - False Positives)
R code patterns where whitespace/indentation was misidentified:
- R/get-desc-data.R:180, 182, 260 - dplyr filter operations
2. R-Specific Security Patterns
Status: Completed | Findings: 3 (all actionable)
Security Patterns Scanned
The following 10 R-specific security patterns were scanned using regex-based detection:
| Pattern | Risk Level | Description |
|---|---|---|
| system_call | High | Direct system command execution via system() - could allow arbitrary command injection if user input is passed |
| system2_call | Medium | System command execution via system2() - safer than system() but still requires input validation |
| shell_call | High | Shell command execution - similar risks to system() calls |
| eval_parse | High | Dynamic code evaluation via eval(parse()) - high risk if parsing untrusted input, can execute arbitrary R code |
| source_url | High | Sourcing R code from remote URLs - risk of executing malicious code from compromised sources |
| download_file | Medium | File downloads from external sources via download.file() - could introduce malicious files or data |
| db_connection | Medium | Database connections (dbConnect, odbcConnect, mongoConnect) - ensure credentials are not hardcoded and connections are secure |
| hardcoded_path | Low | Hardcoded file paths (e.g., /Users/, /home/, C:\\) - reduces portability and may expose system structure |
| password_var | High | Potential password or secret storage in variables (password, passwd, api_key, token, etc.) - credentials should be stored securely, not in code |
| connection_string | High | Database connection strings (mysql://, postgres://, mongodb://, redis://) - may contain embedded credentials |
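As a rough illustration of how regex-based detection of this kind can work, the sketch below scans lines of R code against a small subset of the patterns. The regular expressions, the function `scan_lines()`, and the sample code are assumptions made for this example; they are not the expressions used by security_audit_safe.R.

```r
# Illustrative subset of patterns. These regexes are assumptions,
# not the ones used by the actual scanner.
patterns <- c(
  system_call    = "\\bsystem\\s*\\(",
  eval_parse     = "\\beval\\s*\\(\\s*parse\\s*\\(",
  hardcoded_path = "[\"'](/Users/|/home/|[A-Za-z]:\\\\)"
)

# Return one row per (pattern, line number) match.
scan_lines <- function(lines, patterns) {
  hits <- lapply(names(patterns), function(name) {
    line_no <- grep(patterns[[name]], lines, perl = TRUE)
    data.frame(
      pattern = rep(name, length(line_no)),
      line = line_no,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, hits)
}

# Hypothetical code to scan.
code <- c(
  'library(dplyr)',
  'dat <- read.csv("/Users/someone/project/data.csv")',
  'out <- system("ls")',
  'res <- system2("ls", "-l")'
)

findings <- scan_lines(code, patterns)
print(findings)
```

Line-by-line matching like this is simple, but it also flags patterns inside comments and strings, so findings still need manual review.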
Actionable Findings
These hardcoded paths should be converted to relative paths using here::here():
| File | Line | Issue | Recommended Fix |
|---|---|---|---|
| R/conversion_factor_assignment.R | 6 | Hardcoded path | Use here::here("worksheets/DIN _list.csv") |
| R/dependency_table.R | 33 | Hardcoded path | Use here::here("worksheets/...") |
| R/dose_cat_fun.R | 1 | Hardcoded path | Use here::here("R/DIN_utils.R") |
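A minimal before-and-after sketch of such a conversion is shown below. The absolute path in the "before" comment is hypothetical, since the original hardcoded paths are not reproduced in this report, and the example assumes the here package is installed and the script is run from inside the project.

```r
# Before: an absolute path tied to one machine (hypothetical example).
# din_list <- read.csv("/Users/someone/Thesis/worksheets/DIN _list.csv")

# After: here::here() builds the path from the project root,
# so the same code resolves correctly on another machine.
din_path <- here::here("worksheets", "DIN _list.csv")
print(din_path)

# The file would then be read with, for example:
# din_list <- read.csv(din_path)
```

Passing path components as separate arguments lets here::here() insert the separators itself; a single string such as "worksheets/DIN _list.csv" works as well.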
3. Solutions
Actions taken
- Convert hardcoded paths to relative paths - Use here::here() for all file operations to ensure portability - Action completed
No Action Required
- DATE_TIME detections - Public drug regulatory dates, not personally identifiable information. These are publicly available through
- IP_ADDRESS detections - R code patterns, not actual IP addresses
Best Practices (Already Followed)
- No secrets or credentials detected
- No actual PII (names, emails, SSNs) found in data or code
- Mock data file appropriately excluded from attached zip
4. Tools Used
| Tool | Purpose | Status |
|---|---|---|
| Microsoft Presidio | PII detection using NLP | Completed |
| spaCy (en_core_web_lg) | Named Entity Recognition | Completed |
| Custom R Scanner | R-specific security patterns | Completed |
5. Package and Dependency Check
All packages and dependencies were assessed to determine their source using the code enclosed in the following block:

    # Load the lockfile
    lockfile <- renv::lockfile_read("renv.lock")

    # Identify packages not from CRAN
    # (no Repository field recorded, or a repository other than CRAN)
    non_cran <- Filter(function(pkg) {
      is.null(pkg$Repository) || pkg$Repository != "CRAN"
    }, lockfile$Packages)

    # Print results
    if (length(non_cran) > 0) {
      print("Non-CRAN packages found:")
      print(names(non_cran))
    } else {
      print("All packages are from CRAN.")
    }

The analysis identified no non-CRAN packages. The renv environment was created and populated using R 4.4.2, the most recent version available in DSH.
Conclusion
The repository contains no actual PII or security vulnerabilities. All 242 automated PII detections were false positives caused by:
- Public regulatory dates in drug reference datasets
- R code patterns misidentified as IP addresses
The only recommended action was to convert 3 hardcoded file paths to relative paths for improved code portability; this has been completed.
Report generated by security_audit_safe.R | Manual review completed: 2026-01-21