Conversation
…if it is an existing path; adding click.Paths to all options that should be existing texts
…into add_source_checks
…es the files that are passed with source argument; removing the text processing from StructSenseFlow and doing the processing before; updating cli.run_agent
… to api (but keep it in a separate function); adding the same arguments to StructSenseFlow as in cli: source and source_text
…ssed properly; removing src/tests from gitignore
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request primarily focuses on enhancing the testing suite by introducing new ENR tests and improving the robustness of input handling within the application. The changes streamline how the system accepts and processes different types of source data, making it more flexible and user-friendly. Additionally, various dependencies have been updated to ensure compatibility and leverage the latest features. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Ignored Files
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request primarily refactors how input sources are handled within the StructSenseFlow class and its CLI interface. The input_source parameter has been replaced with source (for file paths) and source_text (for raw text), which are now mutually exclusive. The utils.utils.process_input_data function has been removed and its functionality integrated into a new process_file function and a private _structured_data_to_text helper, which handles PDF, CSV, and TXT files. The CLI commands (extract and run_agent) have been updated to reflect these new options and include validation for file existence and mutual exclusivity. Additionally, several Python dependencies in poetry.lock have been updated, added (e.g., abnf, fastmcp, srsly), or removed (e.g., grpcio-health-checking, mcp, pytest-xdist, pytube, stack-data), and the click and litellm packages were downgraded. Minor code formatting and logging adjustments were made across app.py and cli.py for readability and consistency. New test files were added to cover CLI functionality and StructSenseFlow initialization with the updated source handling. Review comments highlighted a potential security vulnerability in process_file due to lack of path validation against directory traversal, and a general concern about untrusted user input in LLM prompts without proper sanitization. A minor formatting suggestion was also made for a logging.basicConfig call, and a str_to_bool import was moved to the top of cli.py for PEP 8 compliance.
| elif source: | ||
| self.source_text = process_file(source) | ||
| elif source_text: | ||
| self.source_text = source_text |
There was a problem hiding this comment.
Untrusted user input from source_text is directly incorporated into LLM prompts without sufficient sanitization or the use of secure delimiters. This can allow an attacker to manipulate the LLM's behavior, potentially leading to unauthorized tool usage or leakage of system prompts. Implement robust input sanitization and use clear delimiters to separate user input from instructions.
| def process_file(source_path: Union[str, Path]) -> str: | ||
| """Process a file and return its contents as a plain text string. | ||
|
|
||
| # Use the first path that exists, or default to the first path | ||
| source_path = next((p for p in paths_to_try if p.exists()), paths_to_try[0]) | ||
|
|
||
| if not source_path.exists(): | ||
| error_msg = f"Source path does not exist: {source}\n" f"Tried the following paths:\n" + "\n".join( | ||
| f"- {p}" for p in paths_to_try | ||
| ) | ||
| logger.error(error_msg) | ||
| return {"status": "Error", "error": error_msg} | ||
|
|
||
| logger.info(f"Using path: {source_path}") | ||
| Raises: | ||
| ValueError: If the file does not exist, format is unsupported, or processing fails. | ||
| """ | ||
| if isinstance(source_path, str): | ||
| source_path = Path(source_path) | ||
| if not source_path.is_file(): | ||
| raise ValueError(f"File not found: {source_path}") | ||
|
|
||
| logger.info(f"Processing file: {source_path}") | ||
| ext = source_path.suffix.lower() | ||
|
|
||
| if ext == ".pdf": | ||
| grobid_server = os.getenv("GROBID_SERVER_URL_OR_EXTERNAL_SERVICE", "http://localhost:8070") | ||
| external_service = os.getenv("EXTERNAL_PDF_EXTRACTION_SERVICE", "False") | ||
| raw = extract_pdf_content(file_path=source_path, grobid_server=grobid_server, external_service=external_service) | ||
| return _structured_data_to_text(raw) | ||
| elif ext == ".csv": | ||
| try: | ||
| df = pd.read_csv(source_path) | ||
| return df.to_csv(index=False) | ||
| except Exception as e: | ||
| logger.error(f"Error reading CSV file: {e}") | ||
| raise ValueError(f"Error reading CSV file: {e}") | ||
| elif ext == ".txt": | ||
| try: | ||
| with open(source_path, "r", encoding="utf-8") as f: | ||
| return f.read() | ||
| except Exception as e: | ||
| logger.error(f"Error reading TXT file: {e}") | ||
| raise ValueError(f"Error reading TXT file: {e}") | ||
| else: | ||
| source_path = Path(source) | ||
| if not source_path.exists(): | ||
| error_msg = f"Source path does not exist: {source}" | ||
| logger.error(error_msg) | ||
| return {"status": "Error", "error": error_msg} | ||
|
|
||
| # Process single file | ||
| if source_path.is_file(): | ||
| logger.info(f"Processing single file: {source_path}") | ||
| ext = source_path.suffix.lower() | ||
| if ext == ".pdf": | ||
| GROBID_SERVER_URL_OR_EXTERNAL_SERVICE = os.getenv("GROBID_SERVER_URL_OR_EXTERNAL_SERVICE", "http://localhost:8070") | ||
| EXTERNAL_PDF_EXTRACTION_SERVICE = os.getenv("EXTERNAL_PDF_EXTRACTION_SERVICE", "False") | ||
| return extract_pdf_content( | ||
| file_path=source_path, grobid_server=GROBID_SERVER_URL_OR_EXTERNAL_SERVICE, external_service=EXTERNAL_PDF_EXTRACTION_SERVICE | ||
| ) | ||
| elif ext == ".csv": | ||
| try: | ||
| df = pd.read_csv(source_path) | ||
| return df.to_dict(orient="records") | ||
| except Exception as e: | ||
| logger.error(f"Error reading CSV file: {e}") | ||
| return {"status": "Error", "error": str(e)} | ||
| elif ext == ".txt": | ||
| try: | ||
| with open(source_path, "r", encoding="utf-8") as f: | ||
| return f.read() | ||
| except Exception as e: | ||
| logger.error(f"Error reading TXT file: {e}") | ||
| return {"status": "Error", "error": str(e)} | ||
| raise ValueError(f"Unsupported file format: {ext}. Supported formats are PDF, CSV, and TXT.") |
There was a problem hiding this comment.
The process_file function accepts a user-supplied path and reads the file without validating that it resides within an expected directory. This can allow an attacker to read arbitrary files on the system if the library is used in a context where the input path is untrusted (e.g., a web service). Validate that the provided path is within an allowed base directory and sanitize the path to prevent traversal sequences (e.g., ../).
| level=logging.INFO, | ||
| format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s" | ||
| ) | ||
| logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s") |
There was a problem hiding this comment.
For better readability and to adhere to common line length conventions (like PEP 8), it would be better to format this long line across multiple lines, as it was before.
| logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s") | |
| logging.basicConfig( | |
| level=logging.INFO, | |
| format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s", | |
| ) |
| import yaml | ||
|
|
||
| from utils.utils import load_config, process_input_data | ||
| from utils.utils import load_config |
There was a problem hiding this comment.
| @@ -51,14 +95,16 @@ def extract(config, api_key, source, env_file, save_file, chunk_size, max_worker | |||
| enable_human_feedback = bool(human_in_loop.get("humanfeedback_agent", False)) | |||
| if "ENABLE_HUMAN_FEEDBACK" in os.environ: | |||
| from utils.utils import str_to_bool | |||
…runs these tests only when openrouter is available
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## improvement #79 +/- ##
==============================================
Coverage ? 13.93%
==============================================
Files ? 22
Lines ? 4938
Branches ? 0
==============================================
Hits ? 688
Misses ? 4250
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@djarecka thanks for this test but i don't see the value why we want a test just to see that app runs? Also it's NER not ENR. |
|
@tekrajchhetri - I'm getting very inconsistent result for a pretty simple example, even when I came back to the model you used originally in your NER example ( |
adding a simple test that runs ENR using a free model from OPENROUTER
(the test will run only when merged to the main repository or from an account that set the openrouter key independently)
it should be reviewed and merged after adding source checks and source_text option #67, since it contains the changes from it