Fix json op issue #85

Open
puja-trivedi wants to merge 19 commits into improvement from fix_json_op_issue

Conversation


@puja-trivedi puja-trivedi commented Mar 2, 2026

Resolving issue #84

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request establishes a robust framework for benchmarking Named Entity Recognition (NER) capabilities of different large language models within the biomedical domain. It integrates a new dataset, a data processing utility, and a suite of evaluation results and analysis tools. The changes aim to provide a clear and quantifiable assessment of model performance, highlighting the impact of human-in-the-loop processes on NER accuracy and ontology alignment.

Highlights

  • New Benchmarking Dataset: Introduced a new benchmarking dataset for evaluating StructSense, including named entity recognition (NER) annotations from NCBI Disease and JNLPBA biomedical corpora.
  • BIO-tagged Data Conversion Utility: Added a Python utility script (bio_txt.py) to convert BIO-tagged TSV/CSV data into continuous text and JSONL entity mappings, handling various parsing complexities.
  • Comprehensive NER Evaluation Results: Incorporated extensive Named Entity Recognition (NER) evaluation results for various Large Language Models (LLMs) such as Claude 3.7 Sonnet, GPT-4o-mini, and DeepSeek V3 0324 on biomedical literature.
  • Performance Metrics and Human-in-the-Loop Analysis: Provided detailed performance metrics, including token usage, entity detection rates, ontology mapping completeness, and judge scores, comparing model performance both with and without human-in-the-loop (HIL) intervention.
  • New Analysis Scripts and Configurations: Included new Python analysis scripts and YAML configuration files to support comprehensive NER evaluation, visualization of results, and detailed statistical analysis of model performance.
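The BIO-to-text conversion highlighted above can be sketched roughly as follows. This is an illustrative sketch only: the function name, span fields, and detokenization strategy are assumptions, not bio_txt.py's actual API.

```python
# Illustrative sketch of a BIO -> continuous-text + entity-span conversion
# in the spirit of bio_txt.py; names and output shape are assumptions.
def bio_to_text_and_entities(rows):
    """rows: list of (token, tag) pairs in BIO format.

    Returns the detokenized text and a list of entity spans with
    token-level start (inclusive) and end (exclusive) offsets.
    """
    rows = list(rows)
    tokens, entities = [], []
    start = label = None

    def close(end):
        nonlocal start, label
        if start is not None:
            entities.append({"label": label, "start": start, "end": end})
        start = label = None

    for i, (token, tag) in enumerate(rows):
        tokens.append(token)
        if tag.startswith("B-"):
            close(i)                  # a new entity begins; close any open one
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                  # entity continues
        else:                         # "O" or a mismatched continuation
            close(i)
    close(len(rows))
    return " ".join(tokens), entities

text, ents = bio_to_text_and_entities(
    [("IL-2", "B-protein"), ("gene", "I-protein"), ("expression", "O")]
)
# text -> "IL-2 gene expression"; ents -> one "protein" span over tokens 0..2
```

Each span could then be serialized as one JSONL record per entity, which matches the "continuous text and JSONL entity mappings" output described above.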


Changelog
  • evaluation/benchmark/readme.md
    • Documented the new benchmarking dataset for StructSense and its structure.
  • evaluation/benchmark/script/bio_txt.py
    • Implemented a Python script for converting BIO-tagged data to continuous text and JSONL formats.
  • evaluation/combined_all_token_cost_data/old/combined_all_csv.csv
    • Added a comprehensive CSV file detailing token usage, cost, and speed for various LLMs across multiple NER tasks.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/Integrating-brainstem-token-usage-with-hl.csv
    • Added token usage data for NER evaluation on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_claudesonet_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_deepseek_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_gpt_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/Integrating-brainstem-token-usage-without-hl.csv
    • Added token usage data for NER evaluation on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_claudesonet_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_deepseek_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_gpt_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/comprehensive_analysis_report.txt
    • Added a text report summarizing comprehensive NER analysis results for the 'Latent-circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/comprehensive_summary_table.csv
    • Added a CSV table summarizing NER evaluation metrics for the 'Latent-circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entities_missing_ontology.csv
    • Added a CSV file listing entities that lacked ontology information.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entity_pool_summary_with_hil.csv
    • Added a CSV file summarizing the entity pool for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entity_pool_summary_without_hil.csv
    • Added a CSV file summarizing the entity pool for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_detailed_statistics.csv
    • Added a CSV file with detailed judge score statistics for NER evaluations.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_statistics_with_hil.csv
    • Added a CSV file with judge score statistics for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_statistics_without_hil.csv
    • Added a CSV file with judge score statistics for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/label_distribution_statistics.csv
    • Added a CSV file detailing the distribution of entity labels.
  • evaluation/ner/old/evaluation/Latent-circuit/results/location_statistics.csv
    • Added a CSV file with statistics on entity detection by paper location.
  • evaluation/ner/old/evaluation/Latent-circuit/results/missed_entities_details_with_hil.csv
    • Added a CSV file listing entities missed by models for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/missed_entities_details_without_hil.csv
    • Added a CSV file listing entities missed by models for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/model_rankings.csv
    • Added a CSV file containing model rankings based on NER evaluation.
  • evaluation/ner/old/evaluation/Latent-circuit/results/ontology_coverage_summary.csv
    • Added a CSV file summarizing ontology coverage in NER evaluations.
  • evaluation/ner/old/evaluation/Latent-circuit/results/shared_entities_all_models_with_hil.csv
    • Added a CSV file listing entities shared by all models for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/shared_entities_all_models_without_hil.csv
    • Added a CSV file listing entities shared by all models for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_claudesonet_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_deepseek_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_gpt_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_token_usage_with_hil_latent.csv
    • Added token usage data for NER evaluation on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_claudesonet_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_deepseek_s41593-025-01869-7_without_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_gpt_s41593-025-01869-7_without_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_token_usage_without_hil_latent.csv
    • Added token usage data for NER evaluation on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/ner_config_claudesonet.yaml
    • Added YAML configuration for the Claude 3.7 Sonnet NER agent.
  • evaluation/ner/old/ner_config_deepseek.yaml
    • Added YAML configuration for the DeepSeek V3 0324 NER agent.
  • evaluation/ner/old/ner_config_gpt.yaml
    • Added YAML configuration for the GPT-4o-mini NER agent.
  • evaluation/notebook/README.md
    • Documented the token cost and speed analysis script.
  • evaluation/notebook/ner_comprehensive_summary.py
    • Added a Python script for generating comprehensive NER evaluation summaries and visualizations.
  • evaluation/notebook/ner_data_loader.py
    • Added a Python script for loading and preprocessing NER evaluation data.
  • evaluation/notebook/ner_entity_pool_analysis.py
    • Added a Python script for analyzing entity detection performance and false negatives.
  • evaluation/notebook/ner_judge_score_analysis.py
    • Added a Python script for analyzing judge scores and quality assessments in NER.
  • evaluation/notebook/ner_label_distribution.py
    • Added a Python script for analyzing the distribution of entity labels.
  • evaluation/notebook/ner_location_analysis.py
    • Added a Python script for analyzing entity detection patterns across paper locations.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_cost_violin.svg
    • Added an SVG plot visualizing cost distribution for the 'Integrating brainstem' task with human-in-the-loop.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_speed_violin.svg
    • Added an SVG plot visualizing speed distribution for the 'Integrating brainstem' task with human-in-the-loop.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_speed_vs_cost.svg
    • Added an SVG plot visualizing speed versus cost for the 'Integrating brainstem' task with human-in-the-loop.
Activity
  • The pull request was created by puja-trivedi.
  • New benchmarking data and scripts for NER evaluation were added.
  • Comprehensive evaluation results for various LLMs were included.
  • Analysis and visualization scripts for NER performance metrics were introduced.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new benchmarking dataset for StructSense, including a Python script (bio_txt.py) to convert BIO-tagged data into continuous text and JSONL formats. It also adds numerous evaluation results for Named Entity Recognition (NER) tasks on the 'Integrating brainstem' and 'Latent-circuit' papers, comparing several LLMs (Claude 3.7 Sonnet, DeepSeek V3 0324, GPT-4o-mini) both with and without Human-in-the-Loop (HIL). These results are presented in new CSV and JSON files detailing token usage, costs, speed, entity detection, ontology mapping, and judge scores. New Python scripts (ner_comprehensive_summary.py, ner_data_loader.py, ner_entity_pool_analysis.py, ner_judge_score_analysis.py, ner_label_distribution.py, ner_location_analysis.py, ner_ontology_analysis.py) analyze and visualize these metrics, alongside a README.md for token cost analysis.

Review comments highlight several issues:
  • redundant nested judge_ner_terms keys in JSON output files,
  • an inconsistency in schema definitions where paper_title is specified as a string but examples show it as an array,
  • hardcoded file paths in analysis scripts, and
  • extraneous \ No newline at end of file artifacts in CSV files.

Note: Security Review is unavailable for this PR.

@@ -0,0 +1,369 @@
{
"judge_ner_terms": {
"judge_ner_terms": {

high

The JSON structure contains a redundant, nested judge_ner_terms key. This is likely a bug in the file generation process and makes the data structure unnecessarily complex and error-prone to parse. The structure should be flattened to have only one judge_ner_terms level.
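Until the generation bug is fixed, the existing files could be repaired after the fact. A minimal, hypothetical cleanup helper (not part of this PR) that collapses the doubled key to one level:

```python
# Hypothetical post-hoc repair for the doubled key: collapse
# {"judge_ner_terms": {"judge_ner_terms": {...}}} to a single level.
def flatten_judge_terms(data):
    inner = data.get("judge_ner_terms")
    if isinstance(inner, dict) and "judge_ner_terms" in inner:
        data["judge_ner_terms"] = inner["judge_ner_terms"]
    return data
```

Applied over each JSON file (json.load, flatten, json.dump), this would leave already-flat files untouched and normalize the nested ones.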

Comment on lines +133 to +146
- paper_title: string # ⚠️ Not an array
- doi: string

notes:
- The following fields must be arrays of the same length:
- sentence
- start
- end
- remarks
- paper_location
- If ontology match is missing, set:
- ontology_id: null
- ontology_label: null
- paper_title must be a string, even if repeated in multiple entries

high

There is a contradiction in the schema definition for paper_title. Line 133 requires it to be a string, and line 151 disallows arrays. However, the notes on line 141 and the example output (line 174) show it as an array. This inconsistency should be resolved to ensure the schema is clear and correctly implemented by the agents.
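Until the schema is made consistent, downstream consumers could read the field defensively. A hypothetical helper (not part of this PR) that accepts either shape and always yields a string:

```python
# Defensive reader for the ambiguous schema: accept paper_title as either
# a string or a one-element list and always return a string.
def normalize_paper_title(value):
    if isinstance(value, list):
        if len(value) != 1:
            raise ValueError(f"expected one paper_title, got {value!r}")
        return value[0]
    if not isinstance(value, str):
        raise TypeError(f"paper_title must be str or [str], got {type(value)}")
    return value
```

This keeps parsers working against both the documented schema and the array-shaped example output while the contradiction is resolved.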

@@ -0,0 +1,336 @@
{
"judge_ner_terms": {
"judge_ner_terms": {

high

The JSON structure contains a redundant, nested judge_ner_terms key. This is likely a bug in the file generation process and makes the data structure unnecessarily complex and error-prone to parse. The structure should be flattened to have only one judge_ner_terms level.

@@ -0,0 +1,1184 @@
{
"judge_ner_terms": {
"judge_ner_terms": {

high

The JSON structure contains a redundant, nested judge_ner_terms key. This is likely a bug in the file generation process and makes the data structure unnecessarily complex and error-prone to parse. The structure should be flattened to have only one judge_ner_terms level.

@@ -0,0 +1,531 @@
{
"judged_structured_information": {
"judge_ner_terms": {

high

The JSON structure contains a redundant, nested judge_ner_terms key. This is likely a bug in the file generation process and makes the data structure unnecessarily complex and error-prone to parse. The structure should be flattened to have only one judge_ner_terms level.

Comment on lines +263 to +264
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.
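The suggested argparse refactor could look like the sketch below. The flag name --output-dir and the default value are assumptions for illustration, not the scripts' actual interface; the current hardcoded path is kept as the default so existing invocations keep working.

```python
# Sketch of the suggested argparse refactor; flag name and default
# are illustrative assumptions.
import argparse
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NER evaluation analysis")
    parser.add_argument(
        "--output-dir",
        type=Path,
        # Keep the previously hardcoded location as the default.
        default=Path("evaluation/ner/evaluation/Latent-circuit/results"),
        help="Directory where result CSVs and reports are written.",
    )
    return parser.parse_args(argv)

args = parse_args(["--output-dir", "/tmp/ner-results"])
# args.output_dir is now a Path the rest of the script can use.
```

The same pattern would apply to the other three analysis scripts flagged below, so one evaluation set can be swapped for another without code changes.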

Comment on lines +522 to +523
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.

Comment on lines +510 to +511
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.

Comment on lines +317 to +318
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.

"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,1236,495,0.000482,47.0 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,300,262,0.000202,69.8 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59016,634,0.00923,67.6 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59239,262,0.00904,80.6 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO No newline at end of file

medium

The file ends with the text \ No newline at end of file. This appears to be an artifact from a version control or diff tool and should be removed. This extra text can cause parsing errors in tools that expect a clean CSV format.

"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59239,262,0.00904,80.6,tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
