Skip to content

Add NER tests#79

Open
djarecka wants to merge 15 commits intosensein:improvementfrom
djarecka:add_enr_tests
Open

Add NER tests#79
djarecka wants to merge 15 commits intosensein:improvementfrom
djarecka:add_enr_tests

Conversation

@djarecka
Copy link
Contributor

@djarecka djarecka commented Feb 27, 2026

  • adding a simple test that runs ENR using a free model from OPENROUTER
    (the test will run only when merged to the main repository or from an account that set the openrouter key independently)

  • it should be reviewed and merged after adding source checks and source_text option #67, since it contains the changes from it

…if it is an existing path; adding click.Paths to all options that should be existing texts
…es the files that are passed with source argument; removing the text processing from StructSenseFlow and doing the processing before; updating cli.run_agent
… to api (but keep it in a separate function); adding the same arguments to StructSenseFlow as in cli: source and source_text
…ssed properly; removing src/tests from gitignore
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the testing suite by introducing new ENR tests and improving the robustness of input handling within the application. The changes streamline how the system accepts and processes different types of source data, making it more flexible and user-friendly. Additionally, various dependencies have been updated to ensure compatibility and leverage the latest features.

Highlights

  • New ENR Tests: Introduced new tests for Entity-Named Recognition (ENR) functionality, including a simple test using a free model from OPENROUTER.
  • Input Handling Refactor: Refactored the input handling mechanism in StructSenseFlow and the CLI, separating file-based input (--source) from direct text input (--source_text) and ensuring mutual exclusivity.
  • Dependency Updates: Updated poetry.lock to reflect various dependency changes, including Poetry version update, package additions (e.g., abnf, srsly), removals (e.g., grpcio-health-checking, mcp, pytest-xdist, pytube), and version adjustments for existing packages (e.g., click, grpcio, litellm).
  • New Test Infrastructure: Added new test files (cli_test.py, enr_test.py, structsense_flow_test.py) and configuration examples (.env_example, ner-config_free.yaml) to support the new input handling and ENR tests.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • .gitignore
    • Removed 'src/tests' from ignored paths to allow new test files to be tracked.
  • poetry.lock
    • Updated Poetry version from 2.2.1 to 2.3.2.
    • Added new package 'abnf' version 2.2.0.
    • Removed docs, test, and typing extras from package.extras section.
    • Downgraded 'click' package from 8.3.1 to 8.1.8 and relaxed Python version constraints.
    • Added 'fastmcp' package version 2.1.2.
    • Updated 'grpcio' package from 1.71.0 to 1.76.0.
    • Removed 'grpcio-health-checking' package.
    • Updated 'jsonschema-specifications' version from 2023.03.6 to 2023.3.6.
    • Updated 'certifi' version from 14.05.14 to 14.5.14.
    • Downgraded 'litellm' package from 1.75.3 to 1.53.3 and adjusted its dependencies and extras.
    • Removed 'mcp' package.
    • Removed 'pytest-xdist' package.
    • Removed 'pytube' package.
    • Added 'srsly' package version 2.5.2.
    • Removed 'stack-data' package.
    • Added new platform-specific wheel files for 'torch' version 2.10.0.
    • Reordered markers for 'uvloop' package.
    • Updated the content hash.
  • src/structsense/app.py
    • Added an empty line for formatting.
    • Renamed process_input_data to process_file in imports.
    • Simplified logging configuration for basicConfig.
    • Added an empty line for formatting.
    • Reformatted DOWNSTREAM_CONTAINER_KEYS for readability.
    • Modified __init__ method signature to replace input_source with source and source_text.
    • Updated input processing logic in __init__ to handle source and source_text mutually exclusively using process_file.
    • Adjusted logging messages and token management logic for clarity and consistency.
    • Reformatted _split_downstream_payload to include new container keys and improve readability.
  • src/structsense/cli.py
    • Removed process_input_data import.
    • Updated extract CLI command to accept --source (file path) and --source_text (raw text) options, with mutual exclusivity validation.
    • Added click.Path(exists=True) validator to --config and --env_file options.
    • Updated run_agent CLI command to accept --source and --source_text options, with mutual exclusivity validation.
    • Adjusted StructSenseFlow initialization in both extract and run_agent commands to use the new source and source_text parameters.
  • src/tests/app_test.py
    • Removed the file as it contained a placeholder test.
  • src/tests/cli_test.py
    • Added new test file for CLI commands.
    • Implemented test_extract_invalid_source_path to verify error handling for non-existent source files.
  • src/tests/configs/.env_example
    • Added new example environment file with various configuration settings.
  • src/tests/configs/ner-config_free.yaml
    • Added new configuration file for NER with agent, task, and embedder settings using a free model.
  • src/tests/enr_test.py
    • Added new test file for ENR extraction.
    • Implemented test_enr_1 to test simple ENR extraction with a text input and a free model, verifying entity detection.
  • src/tests/structsense_flow_test.py
    • Added new test file for StructSenseFlow initialization and source validation.
    • Implemented tests for invalid source paths, mutual exclusivity of source and source_text, and correct initialization with both input types.
  • src/utils/utils.py
    • Removed the process_input_data function.
    • Added a new process_file function to handle reading content from PDF, CSV, and TXT files, returning plain text.
    • Introduced a new helper function _structured_data_to_text to convert structured data from file processors into a plain text string.
    • Applied minor formatting adjustments to str_to_bool and check_ollama_health functions.
  • tutorial/python-example/run_structsense_full_pipeline.py
    • Updated input_source parameter to source in StructSenseFlow initialization.
  • tutorial/python-example/run_structsense_partial_extraction_only.py
    • Updated input_source parameter to source in StructSenseFlow initialization.
  • tutorial/python-example/run_structsense_partial_extraction_only_without_key.py
    • Updated input_source parameter to source in StructSenseFlow initialization.
  • tutorial/readme.md
    • Updated documentation for CLI and Python examples to reflect the new --source and --source_text options.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/test.yaml
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request primarily refactors how input sources are handled within the StructSenseFlow class and its CLI interface. The input_source parameter has been replaced with source (for file paths) and source_text (for raw text), which are now mutually exclusive. The utils.utils.process_input_data function has been removed and its functionality integrated into a new process_file function and a private _structured_data_to_text helper, which handles PDF, CSV, and TXT files. The CLI commands (extract and run_agent) have been updated to reflect these new options and include validation for file existence and mutual exclusivity. Additionally, several Python dependencies in poetry.lock have been updated, added (e.g., abnf, fastmcp, srsly), or removed (e.g., grpcio-health-checking, mcp, pytest-xdist, pytube, stack-data), and the click and litellm packages were downgraded. Minor code formatting and logging adjustments were made across app.py and cli.py for readability and consistency. New test files were added to cover CLI functionality and StructSenseFlow initialization with the updated source handling. Review comments highlighted a potential security vulnerability in process_file due to lack of path validation against directory traversal, and a general concern about untrusted user input in LLM prompts without proper sanitization. A minor formatting suggestion was also made for a logging.basicConfig call, and a str_to_bool import was moved to the top of cli.py for PEP 8 compliance.

elif source:
self.source_text = process_file(source)
elif source_text:
self.source_text = source_text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Untrusted user input from source_text is directly incorporated into LLM prompts without sufficient sanitization or the use of secure delimiters. This can allow an attacker to manipulate the LLM's behavior, potentially leading to unauthorized tool usage or leakage of system prompts. Implement robust input sanitization and use clear delimiters to separate user input from instructions.

Comment on lines +58 to +92
def process_file(source_path: Union[str, Path]) -> str:
"""Process a file and return its contents as a plain text string.

# Use the first path that exists, or default to the first path
source_path = next((p for p in paths_to_try if p.exists()), paths_to_try[0])

if not source_path.exists():
error_msg = f"Source path does not exist: {source}\n" f"Tried the following paths:\n" + "\n".join(
f"- {p}" for p in paths_to_try
)
logger.error(error_msg)
return {"status": "Error", "error": error_msg}

logger.info(f"Using path: {source_path}")
Raises:
ValueError: If the file does not exist, format is unsupported, or processing fails.
"""
if isinstance(source_path, str):
source_path = Path(source_path)
if not source_path.is_file():
raise ValueError(f"File not found: {source_path}")

logger.info(f"Processing file: {source_path}")
ext = source_path.suffix.lower()

if ext == ".pdf":
grobid_server = os.getenv("GROBID_SERVER_URL_OR_EXTERNAL_SERVICE", "http://localhost:8070")
external_service = os.getenv("EXTERNAL_PDF_EXTRACTION_SERVICE", "False")
raw = extract_pdf_content(file_path=source_path, grobid_server=grobid_server, external_service=external_service)
return _structured_data_to_text(raw)
elif ext == ".csv":
try:
df = pd.read_csv(source_path)
return df.to_csv(index=False)
except Exception as e:
logger.error(f"Error reading CSV file: {e}")
raise ValueError(f"Error reading CSV file: {e}")
elif ext == ".txt":
try:
with open(source_path, "r", encoding="utf-8") as f:
return f.read()
except Exception as e:
logger.error(f"Error reading TXT file: {e}")
raise ValueError(f"Error reading TXT file: {e}")
else:
source_path = Path(source)
if not source_path.exists():
error_msg = f"Source path does not exist: {source}"
logger.error(error_msg)
return {"status": "Error", "error": error_msg}

# Process single file
if source_path.is_file():
logger.info(f"Processing single file: {source_path}")
ext = source_path.suffix.lower()
if ext == ".pdf":
GROBID_SERVER_URL_OR_EXTERNAL_SERVICE = os.getenv("GROBID_SERVER_URL_OR_EXTERNAL_SERVICE", "http://localhost:8070")
EXTERNAL_PDF_EXTRACTION_SERVICE = os.getenv("EXTERNAL_PDF_EXTRACTION_SERVICE", "False")
return extract_pdf_content(
file_path=source_path, grobid_server=GROBID_SERVER_URL_OR_EXTERNAL_SERVICE, external_service=EXTERNAL_PDF_EXTRACTION_SERVICE
)
elif ext == ".csv":
try:
df = pd.read_csv(source_path)
return df.to_dict(orient="records")
except Exception as e:
logger.error(f"Error reading CSV file: {e}")
return {"status": "Error", "error": str(e)}
elif ext == ".txt":
try:
with open(source_path, "r", encoding="utf-8") as f:
return f.read()
except Exception as e:
logger.error(f"Error reading TXT file: {e}")
return {"status": "Error", "error": str(e)}
raise ValueError(f"Unsupported file format: {ext}. Supported formats are PDF, CSV, and TXT.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The process_file function accepts a user-supplied path and reads the file without validating that it resides within an expected directory. This can allow an attacker to read arbitrary files on the system if the library is used in a context where the input path is untrusted (e.g., a web service). Validate that the provided path is within an allowed base directory and sanitize the path to prevent traversal sequences (e.g., ../).

level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s"
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better readability and to adhere to common line length conventions (like PEP 8), it would be better to format this long line across multiple lines, as it was before.

Suggested change
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - [%(threadName)s] - %(message)s",
)

import yaml

from utils.utils import load_config, process_input_data
from utils.utils import load_config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To adhere to PEP 8 guidelines, it's best to group all imports at the top of the file. The import for str_to_bool is currently inside a function on line 97. It should be moved here.

Suggested change
from utils.utils import load_config
from utils.utils import load_config, str_to_bool

@@ -51,14 +95,16 @@ def extract(config, api_key, source, env_file, save_file, chunk_size, max_worker
enable_human_feedback = bool(human_in_loop.get("humanfeedback_agent", False))
if "ENABLE_HUMAN_FEEDBACK" in os.environ:
from utils.utils import str_to_bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This local import should be moved to the top of the file to adhere to PEP 8 guidelines. I've added a suggestion on line 8 to move it there.

@codecov-commenter
Copy link

codecov-commenter commented Feb 28, 2026

Codecov Report

❌ Patch coverage is 52.94118% with 80 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (improvement@9c0c9d3). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/utils/utils.py 23.80% 48 Missing ⚠️
src/structsense/app.py 28.57% 15 Missing ⚠️
src/structsense/cli.py 64.00% 9 Missing ⚠️
src/tests/ner_test.py 55.55% 8 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             improvement      #79   +/-   ##
==============================================
  Coverage               ?   13.93%           
==============================================
  Files                  ?       22           
  Lines                  ?     4938           
  Branches               ?        0           
==============================================
  Hits                   ?      688           
  Misses                 ?     4250           
  Partials               ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@djarecka djarecka requested a review from tekrajchhetri March 2, 2026 14:39
@tekrajchhetri tekrajchhetri changed the title Add enr tests Add NER tests Mar 2, 2026
@tekrajchhetri
Copy link
Collaborator

@djarecka thanks for this test but i don't see the value why we want a test just to see that app runs? Also it's NER not ENR.

@djarecka
Copy link
Contributor Author

djarecka commented Mar 5, 2026

@tekrajchhetri - I'm getting very inconsistent result for a pretty simple example, even when I came back to the model you used originally in your NER example (GPT-4o-mini). Is it the best config to check consistency of extracting entities? As an example, I would expect, that getting "synapses" should work every time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants