extra tools#72
Conversation
There was a problem hiding this comment.
CRITICAL SEVERITY - INFORMATION DISCLOSURE AND PATH TRAVERSAL
The most critical vulnerability involves multiple path traversal and information disclosure risks. The user-supplied args.path is expanded and resolved but not validated against any sandbox or expected directory, allowing a malicious user to specify arbitrary paths like /etc/passwd, enabling the tool to read and potentially exfiltrate sensitive system files. Additionally, the _rank_candidate_files function uses os.walk without protection against symbolic links. If a repository contains a symlink pointing to external directories such as /etc or /home, it will follow the link and include those files in the repository profile, leading to further exposure of sensitive files outside project boundaries. The final output path for skill files is constructed by concatenating skills_dir with a filename that originates from an LLM response, and the sanitization in _safe_skill_filename could be bypassed, allowing path traversal that results in writing files outside the intended skills_dir.
HIGH SEVERITY - SENSITIVE DATA EXPOSURE AND PROMPT INJECTION
The inclusion of .env files in the TEXT_EXTENSIONS set means that the _rank_candidate_files function will score and include .env files from the target repository in the profile sent to the LLM for skill generation. Since .env files commonly contain secrets like API keys, database passwords, and tokens, this results in credential exposure to the LLM provider. Furthermore, the repository profile is directly interpolated into the user_prompt sent to the LLM without sanitization or separation. A malicious repository can contain crafted content in its source files, such as "ignore previous instructions" style attacks, allowing an attacker to manipulate the LLM's skill generation behavior, effectively performing a prompt injection attack that could override system-level guardrails.
HIGH SEVERITY - UNVALIDATED SKILL FILE CONTENT AND INJECTION RISK
The skill file content is received from the LLM response and written to disk after only whitespace stripping, with no validation, sanitization, or security review. If the LLM is manipulated through prompt injection, it could write malicious or sensitive content into skill files that could later influence future security reviews in arbitrary ways. Similarly, the load_analysis_skills function reads .md files from the skills directory without verifying file integrity or origin. An attacker who can write files to this directory, for example through a compromised repository, can inject malicious skill content that will directly influence the LLM's security analysis behavior.
MEDIUM SEVERITY - DATA CORRUPTION AND CHARACTER HANDLING ISSUES
The truncation of content at an arbitrary byte boundary can split multi-byte UTF-8 characters, producing garbled text. The use of errors='ignore' silently drops incomplete characters, which could corrupt security-relevant context or instructions within skill files and cause the LLM to receive malformed instructions. Additionally, when _read_sample decodes binary file content, it uses errors='replace', which silently replaces invalid UTF-8 bytes with the replacement character U+FFFD. This could inadvertently expose sensitive binary data, misrepresent file contents to the LLM, and more critically, mask encoding-based attacks or cause security-relevant data to be misinterpreted.
| TEXT_EXTENSIONS = { | ||
| ".cs", | ||
| ".css", | ||
| ".env", |
There was a problem hiding this comment.
Security Issue: The TEXT_EXTENSIONS set includes '.env' as a text extension. The _rank_candidate_files function will score and include .env files in the repository profile sent to the LLM for skill generation. Since .env files commonly contain secrets (API keys, database passwords, tokens), this could result in credential exposure to the LLM provider. The IMPORTANT_FILENAMES set includes '.env.example' but actual .env files are also included via TEXT_EXTENSIONS.
Priority: HIGH
CWE: CWE-200
Recommendation: Remove '.env' from TEXT_EXTENSIONS or add explicit filtering to exclude .env files from file sampling. Consider adding '.env' (without .example) to an exclusion list.
Snippet: ".env",
| {repository_profile} | ||
| """ | ||
|
|
||
| generated = await llm.prompt_structured(system_prompt, user_prompt, GeneratedSkillFiles) |
There was a problem hiding this comment.
Security Issue: The repository profile (repository_profile) is constructed from user-controlled files in the target repository being analyzed and is directly interpolated into the user_prompt sent to the LLM. If a malicious repository contains specially crafted content (e.g., 'ignore previous instructions' style attacks) in its source files, this content gets included in the LLM prompt, potentially allowing an attacker to manipulate the LLM's skill generation behavior.
Priority: HIGH
CWE: CWE-94
Recommendation: Sanitize or escape the repository_profile content before including it in the LLM prompt. Consider wrapping it in an indented block or delimiters that clearly separate it from instructions, and add system-level guardrails.
Snippet: generated = await llm.prompt_structured(system_prompt, user_prompt, GeneratedSkillFiles)
| skipped.append(output_path) | ||
| continue | ||
|
|
||
| output_path.write_text(content + "\n", encoding="utf-8") |
There was a problem hiding this comment.
Security Issue: The skill filename comes from an LLM-generated response (skill.filename). While _safe_skill_filename sanitizes the filename, the LLM could theoretically generate a filename resulting in path traversal (e.g., if the sanitization is bypassed or a subdirectory path is embedded). The output_path = skills_dir / filename concatenation could write outside the intended skills_dir if filename contains path separators that survive sanitization.
Priority: MEDIUM
CWE: CWE-22
Recommendation: Validate that the final resolved path is still within the intended skills_dir before writing. Add a check like 'if not output_path.resolve().relative_to(skills_dir.resolve())' after constructing the output path.
Snippet: output_path.write_text(content + "\n", encoding="utf-8")
| skills: list[GeneratedSkillFile] | ||
|
|
||
|
|
||
| def project_root_from_args(args) -> Path: |
There was a problem hiding this comment.
Security Issue: The function uses args.path from user-supplied arguments to construct file system paths. It calls expanduser() (allowing ~ expansion) and resolve() but does not validate that the resulting path is within an expected sandbox or directory. A malicious user could specify an arbitrary path (e.g., /etc/passwd or /proc/...) causing the tool to traverse, read, and exfiltrate content from arbitrary locations on the filesystem.
Priority: MEDIUM
CWE: CWE-22
Recommendation: Validate that the resolved path is within an expected project directory or set of allowed directories. Add checks to prevent path traversal beyond the intended scope.
Snippet: def project_root_from_args(args) -> Path:
| skipped.append(output_path) | ||
| continue | ||
|
|
||
| output_path.write_text(content + "\n", encoding="utf-8") |
There was a problem hiding this comment.
Security Issue: The skill file content comes from an LLM response via generate_skill_files. The content is only stripped of whitespace before being written to disk. There is no validation, sanitization, or security review of the content before writing. An LLM could be manipulated (via prompt injection) to write malicious or sensitive content into the skill files that could influence future security reviews.
Priority: MEDIUM
CWE: CWE-913
Recommendation: Add content validation before writing. At minimum, scan for attempts to override system instructions, inject prompt manipulation content, or include sensitive data.
Snippet: output_path.write_text(content + "\n", encoding="utf-8")
| def _rank_candidate_files(project_root: Path, skills_dir: Path) -> list[tuple[int, str, Path]]: | ||
| candidates: list[tuple[int, str, Path]] = [] | ||
|
|
||
| for current_root, dirnames, filenames in os.walk(project_root): |
There was a problem hiding this comment.
Security Issue: The _rank_candidate_files function uses os.walk on user-controlled project_root without protection against symbolic links. If a repository contains a symlink pointing to an external directory (e.g., /etc, /home), the walker will follow it and include those files in the repository profile. This could lead to reading sensitive files outside the intended project boundaries.
Priority: MEDIUM
CWE: CWE-41
Recommendation: Use followlinks=False (default) with os.walk but add additional checks to ensure resolved paths are within the project_root boundary using path.resolve() comparison. Alternatively, use pathlib.Path.rglob() which has safer symlink handling.
Snippet: for current_root, dirnames, filenames in os.walk(project_root):
| continue | ||
|
|
||
| try: | ||
| content = path.read_text(encoding="utf-8") |
There was a problem hiding this comment.
Security Issue: The load_analysis_skills function reads .md files from the skills_dir path which is derived from user-controlled input (args.path / configured_path). There is no verification of file integrity or origin. If an attacker can write files to this directory (e.g., via a compromised repository), they can inject malicious skill content that influences the LLM's security analysis behavior in arbitrary ways.
Priority: MEDIUM
CWE: CWE-345
Recommendation: Consider signing or checksum-verifying skill files, or restricting write access to the skills directory. At minimum, add a warning when skill content has been modified since generation.
Snippet: content = path.read_text(encoding="utf-8")
| truncated = False | ||
| encoded_length = len(content.encode("utf-8")) | ||
| if encoded_length > remaining: | ||
| content = content.encode("utf-8")[:remaining].decode("utf-8", errors="ignore").strip() |
There was a problem hiding this comment.
Security Issue: When truncating content at an arbitrary byte boundary, a multi-byte UTF-8 character can be split, producing garbled text. The use of errors='ignore' silently drops incomplete characters at the split point, which could corrupt security-relevant context or instructions within skill files. This could cause the LLM to receive malformed instructions.
Priority: LOW
CWE: CWE-172
Recommendation: Use proper character-aware truncation, e.g., truncate by character count or use a method that ensures no multi-byte characters are split. Consider using string slicing on the decoded string instead of encoding and truncating at byte boundaries.
Snippet: content = content.encode("utf-8")[:remaining].decode("utf-8", errors="ignore").strip()
| return "" | ||
|
|
||
| truncated = len(data) > max_bytes | ||
| text = data[:max_bytes].decode("utf-8", errors="replace").strip() |
There was a problem hiding this comment.
Security Issue: When _read_sample decodes binary file content, it uses errors='replace' which silently replaces invalid UTF-8 bytes with the replacement character (U+FFFD). This could inadvertently expose sensitive binary data or misrepresent file contents to the LLM. More critically, it could mask encoding-based attacks or cause security-relevant data to be misinterpreted.
Priority: LOW
CWE: CWE-172
Recommendation: Consider what behavior is appropriate for non-UTF-8 files. Either skip files that are not valid UTF-8 (using errors='strict') or log warnings when replacement occurs. For security-sensitive contexts, transparent data corruption should be avoided.
Snippet: text = data[:max_bytes].decode("utf-8", errors="replace").strip()
|
2026-05-09 10:13:49,958 - saist.reportlab_pdf - ERROR - Unable to write ReportLab PDF file to 'reporting/report.pdf': Flowable <Table@0x781A8C69CE10 1 rows x 1 cols(tallest row 1378)> with cell(0,0) containing |
No description provided.