Add NER tests by djarecka · Pull Request #79 · sensein/structsense

djarecka · 2026-02-27T18:03:46Z

adding a simple test that runs ENR using a free model from OPENROUTER
(the test will run only when merged to the main repository or from an account that set the openrouter key independently)
it should be reviewed and merged after adding source checks and source_text option #67, since it contains the changes from it

…if it is an existing path; adding click.Paths to all options that should be existing texts

…into add_source_checks

…es the files that are passed with source argument; removing the text processing from StructSenseFlow and doing the processing before; updating cli.run_agent

… to api (but keep it in a separate function); adding the same arguments to StructSenseFlow as in cli: source and source_text

…ssed properly; removing src/tests from gitignore

gemini-code-assist · 2026-02-27T18:04:17Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the testing suite by introducing new ENR tests and improving the robustness of input handling within the application. The changes streamline how the system accepts and processes different types of source data, making it more flexible and user-friendly. Additionally, various dependencies have been updated to ensure compatibility and leverage the latest features.

Highlights

New ENR Tests: Introduced new tests for Entity-Named Recognition (ENR) functionality, including a simple test using a free model from OPENROUTER.
Input Handling Refactor: Refactored the input handling mechanism in StructSenseFlow and the CLI, separating file-based input (--source) from direct text input (--source_text) and ensuring mutual exclusivity.
Dependency Updates: Updated poetry.lock to reflect various dependency changes, including Poetry version update, package additions (e.g., abnf, srsly), removals (e.g., grpcio-health-checking, mcp, pytest-xdist, pytube), and version adjustments for existing packages (e.g., click, grpcio, litellm).
New Test Infrastructure: Added new test files (cli_test.py, enr_test.py, structsense_flow_test.py) and configuration examples (.env_example, ner-config_free.yaml) to support the new input handling and ENR tests.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

.gitignore
- Removed 'src/tests' from ignored paths to allow new test files to be tracked.
poetry.lock
- Updated Poetry version from 2.2.1 to 2.3.2.
- Added new package 'abnf' version 2.2.0.
- Removed docs, test, and typing extras from package.extras section.
- Downgraded 'click' package from 8.3.1 to 8.1.8 and relaxed Python version constraints.
- Added 'fastmcp' package version 2.1.2.
- Updated 'grpcio' package from 1.71.0 to 1.76.0.
- Removed 'grpcio-health-checking' package.
- Updated 'jsonschema-specifications' version from 2023.03.6 to 2023.3.6.
- Updated 'certifi' version from 14.05.14 to 14.5.14.
- Downgraded 'litellm' package from 1.75.3 to 1.53.3 and adjusted its dependencies and extras.
- Removed 'mcp' package.
- Removed 'pytest-xdist' package.
- Removed 'pytube' package.
- Added 'srsly' package version 2.5.2.
- Removed 'stack-data' package.
- Added new platform-specific wheel files for 'torch' version 2.10.0.
- Reordered markers for 'uvloop' package.
- Updated the content hash.
src/structsense/app.py
- Added an empty line for formatting.
- Renamed process_input_data to process_file in imports.
- Simplified logging configuration for basicConfig.
- Added an empty line for formatting.
- Reformatted DOWNSTREAM_CONTAINER_KEYS for readability.
- Modified __init__ method signature to replace input_source with source and source_text.
- Updated input processing logic in __init__ to handle source and source_text mutually exclusively using process_file.
- Adjusted logging messages and token management logic for clarity and consistency.
- Reformatted _split_downstream_payload to include new container keys and improve readability.
src/structsense/cli.py
- Removed process_input_data import.
- Updated extract CLI command to accept --source (file path) and --source_text (raw text) options, with mutual exclusivity validation.
- Added click.Path(exists=True) validator to --config and --env_file options.
- Updated run_agent CLI command to accept --source and --source_text options, with mutual exclusivity validation.
- Adjusted StructSenseFlow initialization in both extract and run_agent commands to use the new source and source_text parameters.
src/tests/app_test.py
- Removed the file as it contained a placeholder test.
src/tests/cli_test.py
- Added new test file for CLI commands.
- Implemented test_extract_invalid_source_path to verify error handling for non-existent source files.
src/tests/configs/.env_example
- Added new example environment file with various configuration settings.
src/tests/configs/ner-config_free.yaml
- Added new configuration file for NER with agent, task, and embedder settings using a free model.
src/tests/enr_test.py
- Added new test file for ENR extraction.
- Implemented test_enr_1 to test simple ENR extraction with a text input and a free model, verifying entity detection.
src/tests/structsense_flow_test.py
- Added new test file for StructSenseFlow initialization and source validation.
- Implemented tests for invalid source paths, mutual exclusivity of source and source_text, and correct initialization with both input types.
src/utils/utils.py
- Removed the process_input_data function.
- Added a new process_file function to handle reading content from PDF, CSV, and TXT files, returning plain text.
- Introduced a new helper function _structured_data_to_text to convert structured data from file processors into a plain text string.
- Applied minor formatting adjustments to str_to_bool and check_ollama_health functions.
tutorial/python-example/run_structsense_full_pipeline.py
- Updated input_source parameter to source in StructSenseFlow initialization.
tutorial/python-example/run_structsense_partial_extraction_only.py
- Updated input_source parameter to source in StructSenseFlow initialization.
tutorial/python-example/run_structsense_partial_extraction_only_without_key.py
- Updated input_source parameter to source in StructSenseFlow initialization.
tutorial/readme.md
- Updated documentation for CLI and Python examples to reflect the new --source and --source_text options.

Ignored Files

Ignored by pattern: .github/workflows/** (1)
- .github/workflows/test.yaml

Activity

The pull request was created by djarecka to add ENR tests.
The author noted that this PR depends on adding source checks and source_text option #67 and should be merged after it.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request primarily refactors how input sources are handled within the StructSenseFlow class and its CLI interface. The input_source parameter has been replaced with source (for file paths) and source_text (for raw text), which are now mutually exclusive. The utils.utils.process_input_data function has been removed and its functionality integrated into a new process_file function and a private _structured_data_to_text helper, which handles PDF, CSV, and TXT files. The CLI commands (extract and run_agent) have been updated to reflect these new options and include validation for file existence and mutual exclusivity. Additionally, several Python dependencies in poetry.lock have been updated, added (e.g., abnf, fastmcp, srsly), or removed (e.g., grpcio-health-checking, mcp, pytest-xdist, pytube, stack-data), and the click and litellm packages were downgraded. Minor code formatting and logging adjustments were made across app.py and cli.py for readability and consistency. New test files were added to cover CLI functionality and StructSenseFlow initialization with the updated source handling. Review comments highlighted a potential security vulnerability in process_file due to lack of path validation against directory traversal, and a general concern about untrusted user input in LLM prompts without proper sanitization. A minor formatting suggestion was also made for a logging.basicConfig call, and a str_to_bool import was moved to the top of cli.py for PEP 8 compliance.

gemini-code-assist · 2026-02-27T18:07:50Z

src/structsense/app.py

+        elif source:
+            self.source_text = process_file(source)
+        elif source_text:
            self.source_text = source_text


Untrusted user input from source_text is directly incorporated into LLM prompts without sufficient sanitization or the use of secure delimiters. This can allow an attacker to manipulate the LLM's behavior, potentially leading to unauthorized tool usage or leakage of system prompts. Implement robust input sanitization and use clear delimiters to separate user input from instructions.

gemini-code-assist · 2026-02-27T18:07:50Z

src/utils/utils.py

+def process_file(source_path: Union[str, Path]) -> str:
+    """Process a file and return its contents as a plain text string.

-        # Use the first path that exists, or default to the first path
-        source_path = next((p for p in paths_to_try if p.exists()), paths_to_try[0])
-
-        if not source_path.exists():
-            error_msg = f"Source path does not exist: {source}\n" f"Tried the following paths:\n" + "\n".join(
-                f"- {p}" for p in paths_to_try
-            )
-            logger.error(error_msg)
-            return {"status": "Error", "error": error_msg}
-
-        logger.info(f"Using path: {source_path}")
+    Raises:
+        ValueError: If the file does not exist, format is unsupported, or processing fails.
+    """
+    if isinstance(source_path, str):
+        source_path = Path(source_path)
+    if not source_path.is_file():
+        raise ValueError(f"File not found: {source_path}")
+
+    logger.info(f"Processing file: {source_path}")
+    ext = source_path.suffix.lower()
+
+    if ext == ".pdf":
+        grobid_server = os.getenv("GROBID_SERVER_URL_OR_EXTERNAL_SERVICE", "http://localhost:8070")
+        external_service = os.getenv("EXTERNAL_PDF_EXTRACTION_SERVICE", "False")
+        raw = extract_pdf_content(file_path=source_path, grobid_server=grobid_server, external_service=external_service)
+        return _structured_data_to_text(raw)
+    elif ext == ".csv":
+        try:
+            df = pd.read_csv(source_path)
+            return df.to_csv(index=False)
+        except Exception as e:
+            logger.error(f"Error reading CSV file: {e}")
+            raise ValueError(f"Error reading CSV file: {e}")
+    elif ext == ".txt":
+        try:
+            with open(source_path, "r", encoding="utf-8") as f:
+                return f.read()
+        except Exception as e:
+            logger.error(f"Error reading TXT file: {e}")
+            raise ValueError(f"Error reading TXT file: {e}")
    else:
-        source_path = Path(source)
-        if not source_path.exists():
-            error_msg = f"Source path does not exist: {source}"
-            logger.error(error_msg)
-            return {"status": "Error", "error": error_msg}
-
-        # Process single file
-    if source_path.is_file():
-        logger.info(f"Processing single file: {source_path}")
-        ext = source_path.suffix.lower()
-        if ext == ".pdf":
-            GROBID_SERVER_URL_OR_EXTERNAL_SERVICE = os.getenv("GROBID_SERVER_URL_OR_EXTERNAL_SERVICE", "http://localhost:8070")
-            EXTERNAL_PDF_EXTRACTION_SERVICE = os.getenv("EXTERNAL_PDF_EXTRACTION_SERVICE", "False")
-            return extract_pdf_content(
-                file_path=source_path, grobid_server=GROBID_SERVER_URL_OR_EXTERNAL_SERVICE, external_service=EXTERNAL_PDF_EXTRACTION_SERVICE
-            )
-        elif ext == ".csv":
-            try:
-                df = pd.read_csv(source_path)
-                return df.to_dict(orient="records")
-            except Exception as e:
-                logger.error(f"Error reading CSV file: {e}")
-                return {"status": "Error", "error": str(e)}
-        elif ext == ".txt":
-            try:
-                with open(source_path, "r", encoding="utf-8") as f:
-                    return f.read()
-            except Exception as e:
-                logger.error(f"Error reading TXT file: {e}")
-                return {"status": "Error", "error": str(e)}
+        raise ValueError(f"Unsupported file format: {ext}. Supported formats are PDF, CSV, and TXT.")


The process_file function accepts a user-supplied path and reads the file without validating that it resides within an expected directory. This can allow an attacker to read arbitrary files on the system if the library is used in a context where the input path is untrusted (e.g., a web service). Validate that the provided path is within an allowed base directory and sanitize the path to prevent traversal sequences (e.g., ../).

gemini-code-assist · 2026-02-27T18:07:50Z

src/structsense/app.py

-    level=logging.INFO,
-    format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s"
-)
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s")


For better readability and to adhere to common line length conventions (like PEP 8), it would be better to format this long line across multiple lines, as it was before.

Suggested change

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s")

logging.basicConfig(

level=logging.INFO,

format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s",

)

gemini-code-assist · 2026-02-27T18:07:50Z

src/structsense/cli.py

 import yaml

-from utils.utils import load_config, process_input_data
+from utils.utils import load_config


To adhere to PEP 8 guidelines, it's best to group all imports at the top of the file. The import for str_to_bool is currently inside a function on line 97. It should be moved here.

Suggested change

from utils.utils import load_config

from utils.utils import load_config, str_to_bool

gemini-code-assist · 2026-02-27T18:07:50Z

src/structsense/cli.py

@@ -51,14 +95,16 @@ def extract(config, api_key, source, env_file, save_file, chunk_size, max_worker
    enable_human_feedback = bool(human_in_loop.get("humanfeedback_agent", False))
    if "ENABLE_HUMAN_FEEDBACK" in os.environ:
        from utils.utils import str_to_bool


This local import should be moved to the top of the file to adhere to PEP 8 guidelines. I've added a suggestion on line 8 to move it there.

…runs these tests only when openrouter is available

codecov-commenter · 2026-02-28T23:39:23Z

Codecov Report

❌ Patch coverage is 52.94118% with 80 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (improvement@9c0c9d3). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/utils/utils.py	23.80%	48 Missing ⚠️
src/structsense/app.py	28.57%	15 Missing ⚠️
src/structsense/cli.py	64.00%	9 Missing ⚠️
src/tests/ner_test.py	55.55%	8 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             improvement      #79   +/-   ##
==============================================
  Coverage               ?   13.93%           
==============================================
  Files                  ?       22           
  Lines                  ?     4938           
  Branches               ?        0           
==============================================
  Hits                   ?      688           
  Misses                 ?     4250           
  Partials               ?        0

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

tekrajchhetri · 2026-03-02T17:28:20Z

@djarecka thanks for this test but i don't see the value why we want a test just to see that app runs? Also it's NER not ENR.

djarecka · 2026-03-05T05:08:22Z

@tekrajchhetri - I'm getting very inconsistent result for a pretty simple example, even when I came back to the model you used originally in your NER example (GPT-4o-mini). Is it the best config to check consistency of extracting entities? As an example, I would expect, that getting "synapses" should work every time?

djarecka added 9 commits February 20, 2026 09:58

[wip] adding source_text option for text and adding check for source …

33927b3

…if it is an existing path; adding click.Paths to all options that should be existing texts

Merge branch 'improvement' of https://github.com/sensein/structsense …

d659a70

…into add_source_checks

changing process_input_source to process_files, since it only process…

f5ee673

…es the files that are passed with source argument; removing the text processing from StructSenseFlow and doing the processing before; updating cli.run_agent

finishing changes to cli.run_agent

edb5fee

moving back processing to the StructSenseFlow to minimize the changes…

0f335c9

… to api (but keep it in a separate function); adding the same arguments to StructSenseFlow as in cli: source and source_text

updating docs and tutorial

36ae0a4

adding simple tests to check that the source and source_text is proce…

3eaaafd

…ssed properly; removing src/tests from gitignore

update poetry.lock

65f7ec0

adding a simple enr test using a free model

3e5f256

gemini-code-assist bot reviewed Feb 27, 2026

View reviewed changes

djarecka added 2 commits February 28, 2026 17:58

testing if the env var is set properly

7e59b8f

adding requires_openrouter pytest marker; changing GA workflow so it …

68e7668

…runs these tests only when openrouter is available

djarecka requested a review from tekrajchhetri March 2, 2026 14:39

tekrajchhetri changed the title ~~Add enr tests~~ Add NER tests Mar 2, 2026

djarecka added 4 commits March 4, 2026 21:08

creating a new job for integration test

4c506ea

adding echo th the check-secrets job

610aefc

removing one assert from the test

cb0cf09

renaming

e7a402f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add NER tests#79

Add NER tests#79
djarecka wants to merge 15 commits intosensein:improvementfrom
djarecka:add_enr_tests

djarecka commented Feb 27, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Feb 27, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 27, 2026

Uh oh!

gemini-code-assist bot Feb 27, 2026

Uh oh!

gemini-code-assist bot Feb 27, 2026

Uh oh!

gemini-code-assist bot Feb 27, 2026

Uh oh!

gemini-code-assist bot Feb 27, 2026

Uh oh!

codecov-commenter commented Feb 28, 2026 •

edited

Loading

Uh oh!

tekrajchhetri commented Mar 2, 2026

Uh oh!

djarecka commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	from utils.utils import load_config
	from utils.utils import load_config, str_to_bool

Conversation

djarecka commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot commented Feb 27, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

tekrajchhetri commented Mar 2, 2026

Uh oh!

djarecka commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

djarecka commented Feb 27, 2026 •

edited

Loading

codecov-commenter commented Feb 28, 2026 •

edited

Loading