Releases · karimhalal/Thesis

Repository: Thesis | Scan Date: 2026-01-21 | Risk Level: LOW

Summary

Category	Findings	Status
Secrets (R Scanner)	0	N/A
PII Instances (Presidio)	242	All false positives
R Security Patterns	3	Actionable
Files Scanned (PII)	41
Files Scanned (R)	26
Files Excluded	`mock_data_NMS.R`	Synthetic test data

Executive Summary

After review, no actual security vulnerabilities or personally identifiable information (PII) were found:

242 PII detections: All are false positives (drug approval dates and R code patterns)
3 R pattern detections: Hardcoded file paths that should use here::here() for portability

1. PII Detection (Microsoft Presidio)

Status: Completed | Threshold: 0.9 | Findings: 242 (all false positives)

PII Entity Types Scanned

The following 14 entity types were scanned using Microsoft Presidio's NLP-based detection:

Entity Type	Description
PERSON	Names of individuals that could identify real people
EMAIL_ADDRESS	Email addresses that could be used to contact or identify individuals
PHONE_NUMBER	Phone numbers that could be used to contact individuals
LOCATION	Geographic locations, addresses, or place names
DATE_TIME	Dates and times that could be linked to individuals or events
NRP	Nationalities, religious or political groups
MEDICAL_LICENSE	Medical license numbers or healthcare identifiers
URL	Web addresses that could contain sensitive endpoints or parameters
IP_ADDRESS	IP addresses that could identify networks or individuals
CREDIT_CARD	Credit card numbers
IBAN_CODE	International Bank Account Numbers
SSN	US Social Security Numbers
US_PASSPORT	US Passport numbers
US_DRIVER_LICENSE	US Driver's License numbers

Summary of Findings by Entity Type

Entity Type	Count	Status	Explanation
DATE_TIME	238	False Positive	Drug approval dates from Health Canada DIN database (public regulatory data)
IP_ADDRESS	3	False Positive	R code indentation patterns misidentified as IP addresses. Namespace calls (::) were incorrectly identified as potential elements of an IP address

Analysis

DATE_TIME Detections (238 - False Positives)

All DATE_TIME detections are from CSV files containing Drug Identification Number (DIN) reference data from Health Canada. The dates represent:

Drug market authorization dates
Drug discontinuation dates

These are publicly available regulatory dates, not personally identifiable information. They can be used during NMS variable cleaning to check the feasibility of drug prescription dates, that is, to confirm that each prescription date falls within the period when the drug was publicly available. For more information, visit the Drug Product Database.
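The feasibility check described above could be sketched in base R as follows. This is only an illustration: the column names (`din`, `market_date`, `discontinued_date`, `rx_date`) and all sample values are hypothetical and are not taken from the repository's CSV files.

```r
# Hypothetical DIN reference data; column names and values are illustrative only.
din_ref <- data.frame(
  din               = c("00000001", "00000002"),
  market_date       = as.Date(c("2001-03-15", "2010-06-01")),
  discontinued_date = as.Date(c("2015-12-31", NA)),  # NA = still marketed
  stringsAsFactors  = FALSE
)

# Hypothetical prescription records to validate.
rx <- data.frame(
  din     = c("00000001", "00000001", "00000002"),
  rx_date = as.Date(c("2005-01-10", "2018-02-01", "2009-05-20")),
  stringsAsFactors = FALSE
)

# Attach the availability window to each prescription (left join on DIN).
checked <- merge(rx, din_ref, by = "din", all.x = TRUE)

# A prescription date is feasible when it falls on or after market authorization
# and on or before discontinuation (if the product was discontinued at all).
checked$feasible <- checked$rx_date >= checked$market_date &
  (is.na(checked$discontinued_date) |
     checked$rx_date <= checked$discontinued_date)

print(checked[, c("din", "rx_date", "feasible")])
```

Rows flagged `FALSE` would be candidates for manual review rather than automatic removal, since a mismatch may also point to an error in the reference dates.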

Files affected:

worksheets/din_list2up.csv (44 detections)
worksheets/NMS_sheets/DIN _list.csv (55 detections)
worksheets/NMS_sheets/din_list_oral.csv (56 detections)
worksheets/NMS_sheets/DIN_list2.csv (44 detections)
worksheets/NMS_sheets/NMS_Datasheet.csv (37 detections)
worksheets/NMS_sheets/bzd.csv (2 detections)

IP_ADDRESS Detections (3 - False Positives)

R code patterns where whitespace/indentation was misidentified:

R/get-desc-data.R:180, 182, 260 - dplyr filter operations

2. R-Specific Security Patterns

Status: Completed | Findings: 3 (all actionable)

Security Patterns Scanned

The following 10 R-specific security patterns were scanned using regex-based detection:

Pattern	Risk Level	Description
`system_call`	High	Direct system command execution via `system()` - could allow arbitrary command injection if user input is passed
`system2_call`	Medium	System command execution via `system2()` - safer than `system()` but still requires input validation
`shell_call`	High	Shell command execution - similar risks to `system()` calls
`eval_parse`	High	Dynamic code evaluation via `eval(parse())` - high risk if parsing untrusted input, can execute arbitrary R code
`source_url`	High	Sourcing R code from remote URLs - risk of executing malicious code from compromised sources
`download_file`	Medium	File downloads from external sources via `download.file()` - could introduce malicious files or data
`db_connection`	Medium	Database connections (dbConnect, odbcConnect, mongoConnect) - ensure credentials are not hardcoded and connections are secure
`hardcoded_path`	Low	Hardcoded file paths (e.g., `/Users/`, `/home/`, `C:\\`) - reduces portability and may expose system structure
`password_var`	High	Potential password or secret storage in variables (password, passwd, api_key, token, etc.) - credentials should be stored securely, not in code
`connection_string`	High	Database connection strings (mysql://, postgres://, mongodb://, redis://) - may contain embedded credentials
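A regex-based scan of this kind might look roughly like the sketch below. The three regular expressions and the helper `scan_lines()` are assumptions made for illustration; they are not the expressions or functions used by the scanner that produced this report.

```r
# Illustrative subset of patterns. These regexes are assumptions, not the
# expressions used by the report's scanner.
patterns <- c(
  system_call    = "\\bsystem\\s*\\(",
  eval_parse     = "\\beval\\s*\\(\\s*parse\\s*\\(",
  hardcoded_path = "[\"'](/Users/|/home/|[A-Za-z]:[\\\\/])"
)

# Hypothetical helper: return one row per (pattern, line) match.
scan_lines <- function(lines, patterns) {
  hits <- lapply(names(patterns), function(name) {
    idx <- grep(patterns[[name]], lines, perl = TRUE)
    data.frame(
      pattern = rep(name, length(idx)),
      line    = idx,
      code    = trimws(lines[idx]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, hits)
}

# Made-up input lines; in practice these would come from readLines() on a file.
example_lines <- c(
  'din <- read.csv("/Users/someone/project/worksheets/din.csv")',
  'result <- eval(parse(text = user_input))',
  'clean <- dplyr::filter(din, !is.na(din))'
)

findings <- scan_lines(example_lines, patterns)
print(findings)
```

Line-by-line matching like this is cheap but purely textual, so it also flags matches inside comments and strings, and every finding still needs manual review.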

Actionable Findings

These hardcoded paths should be converted to relative paths using here::here():

File	Line	Issue	Recommended Fix
`R/conversion_factor_assignment.R`	6	Hardcoded path	Use `here::here("worksheets/DIN _list.csv")`
`R/dependency_table.R`	33	Hardcoded path	Use `here::here("worksheets/...")`
`R/dose_cat_fun.R`	1	Hardcoded path	Use `here::here("R/DIN_utils.R")`
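As a minimal sketch of the recommended fix, the first finding could be rewritten along these lines. The absolute path in the "before" comment is hypothetical; `here::here()` builds the path from the project root, which the `here` package locates from markers such as an `.Rproj` file or a `.git` directory.

```r
# Before (hypothetical absolute path, tied to one machine):
# din_list <- read.csv("/Users/someone/Thesis/worksheets/DIN _list.csv")

# After: build the path relative to the project root.
din_path <- here::here("worksheets", "DIN _list.csv")

# Read only if the file is present, so this sketch also runs outside the repository.
if (file.exists(din_path)) {
  din_list <- read.csv(din_path)
} else {
  message("File not found: ", din_path)
}
```

Passing the path components as separate arguments leaves the choice of path separator to `here::here()`, which keeps the call identical across operating systems.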

3. Solutions

Actions taken

Convert hardcoded paths to relative paths - Use here::here() for all file operations to ensure portability - Action completed

No Action Required

DATE_TIME detections - Public drug regulatory dates, not personally identifiable information. These are publicly available through the Drug Product Database
IP_ADDRESS detections - R code patterns, not actual IP addresses

Best Practices (Already Followed)

No secrets or credentials detected
No actual PII (names, emails, SSNs) found in data or code
Mock data file appropriately excluded from attached zip

4. Tools Used

Tool	Purpose	Status
Microsoft Presidio	PII detection using NLP	Completed
spaCy (en_core_web_lg)	Named Entity Recognition	Completed
Custom R Scanner	R-specific security patterns	Completed

5. Package and Dependency Check

All packages and dependencies were assessed to determine their source using the code enclosed in the following block:

# Load the lockfile
lockfile <- renv::lockfile_read("renv.lock")

# Identify packages not from CRAN
non_cran <- Filter(function(pkg) {
  is.null(pkg$Repository) || pkg$Repository != "CRAN"
}, lockfile$Packages)

# Print results
if (length(non_cran) > 0) {
  print("Non-CRAN packages found:")
  print(names(non_cran))
} else {
  print("All packages are from CRAN.")
}

The analysis identified no non-CRAN packages. The renv lockfile was created and populated using R 4.4.2, the most recent version available in DSH.

Conclusion

The repository contains no actual PII or security vulnerabilities. All 242 automated PII detections were false positives caused by:

Public regulatory dates in drug reference datasets
R code patterns misidentified as IP addresses

The only recommended action was to convert 3 hardcoded file paths to relative paths for improved code portability; this has been addressed.

Report generated by security_audit_safe.R | Manual review completed: 2026-01-21

Security Audit Report For DSH Import
