Fix json op issue #85

Open
puja-trivedi wants to merge 19 commits into improvement from fix_json_op_issue

Conversation


@puja-trivedi puja-trivedi commented Mar 2, 2026

Resolving issue #84

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request establishes a robust framework for benchmarking Named Entity Recognition (NER) capabilities of different large language models within the biomedical domain. It integrates a new dataset, a data processing utility, and a suite of evaluation results and analysis tools. The changes aim to provide a clear and quantifiable assessment of model performance, highlighting the impact of human-in-the-loop processes on NER accuracy and ontology alignment.

Highlights

  • New Benchmarking Dataset: Introduced a new benchmarking dataset for evaluating StructSense, including named entity recognition (NER) annotations from NCBI Disease and JNLPBA biomedical corpora.
  • BIO-tagged Data Conversion Utility: Added a Python utility script (bio_txt.py) to convert BIO-tagged TSV/CSV data into continuous text and JSONL entity mappings, handling various parsing complexities.
  • Comprehensive NER Evaluation Results: Incorporated extensive Named Entity Recognition (NER) evaluation results for various Large Language Models (LLMs) such as Claude 3.7 Sonnet, GPT-4o-mini, and DeepSeek V3 0324 on biomedical literature.
  • Performance Metrics and Human-in-the-Loop Analysis: Provided detailed performance metrics, including token usage, entity detection rates, ontology mapping completeness, and judge scores, comparing model performance both with and without human-in-the-loop (HIL) intervention.
  • New Analysis Scripts and Configurations: Included new Python analysis scripts and YAML configuration files to support comprehensive NER evaluation, visualization of results, and detailed statistical analysis of model performance.
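The BIO-to-text conversion highlighted above can be sketched roughly as follows. This is an illustrative sketch only: the function name, span fields, and detokenization strategy are assumptions, not bio_txt.py's actual API.

```python
# Illustrative sketch of a BIO -> continuous-text + entity-span conversion
# in the spirit of bio_txt.py; names and output shape are assumptions.
def bio_to_text_and_entities(rows):
    """rows: list of (token, tag) pairs in BIO format.

    Returns the detokenized text and a list of entity spans with
    token-level start (inclusive) and end (exclusive) offsets.
    """
    rows = list(rows)
    tokens, entities = [], []
    start = label = None

    def close(end):
        nonlocal start, label
        if start is not None:
            entities.append({"label": label, "start": start, "end": end})
        start = label = None

    for i, (token, tag) in enumerate(rows):
        tokens.append(token)
        if tag.startswith("B-"):
            close(i)                  # a new entity begins; close any open one
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                  # entity continues
        else:                         # "O" or a mismatched continuation
            close(i)
    close(len(rows))
    return " ".join(tokens), entities

text, ents = bio_to_text_and_entities(
    [("IL-2", "B-protein"), ("gene", "I-protein"), ("expression", "O")]
)
# text -> "IL-2 gene expression"; ents -> one "protein" span over tokens 0..2
```

Each span could then be serialized as one JSONL record per entity, which matches the "continuous text and JSONL entity mappings" output described above.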


Changelog
  • evaluation/benchmark/readme.md
    • Documented the new benchmarking dataset for StructSense and its structure.
  • evaluation/benchmark/script/bio_txt.py
    • Implemented a Python script for converting BIO-tagged data to continuous text and JSONL formats.
  • evaluation/combined_all_token_cost_data/old/combined_all_csv.csv
    • Added a comprehensive CSV file detailing token usage, cost, and speed for various LLMs across multiple NER tasks.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/Integrating-brainstem-token-usage-with-hl.csv
    • Added token usage data for NER evaluation on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_claudesonet_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_deepseek_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/with_hil/ner_config_gpt_s41593-024-01787-0_with_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Integrating brainstem' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/Integrating-brainstem-token-usage-without-hl.csv
    • Added token usage data for NER evaluation on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_claudesonet_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_deepseek_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Integrating-brainstem/without_hil/ner_config_gpt_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Integrating brainstem' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/comprehensive_analysis_report.txt
    • Added a text report summarizing comprehensive NER analysis results for the 'Latent-circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/comprehensive_summary_table.csv
    • Added a CSV table summarizing NER evaluation metrics for the 'Latent-circuit' task.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entities_missing_ontology.csv
    • Added a CSV file listing entities that lacked ontology information.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entity_pool_summary_with_hil.csv
    • Added a CSV file summarizing the entity pool for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/entity_pool_summary_without_hil.csv
    • Added a CSV file summarizing the entity pool for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_detailed_statistics.csv
    • Added a CSV file with detailed judge score statistics for NER evaluations.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_statistics_with_hil.csv
    • Added a CSV file with judge score statistics for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/judge_score_statistics_without_hil.csv
    • Added a CSV file with judge score statistics for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/label_distribution_statistics.csv
    • Added a CSV file detailing the distribution of entity labels.
  • evaluation/ner/old/evaluation/Latent-circuit/results/location_statistics.csv
    • Added a CSV file with statistics on entity detection by paper location.
  • evaluation/ner/old/evaluation/Latent-circuit/results/missed_entities_details_with_hil.csv
    • Added a CSV file listing entities missed by models for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/missed_entities_details_without_hil.csv
    • Added a CSV file listing entities missed by models for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/model_rankings.csv
    • Added a CSV file containing model rankings based on NER evaluation.
  • evaluation/ner/old/evaluation/Latent-circuit/results/ontology_coverage_summary.csv
    • Added a CSV file summarizing ontology coverage in NER evaluations.
  • evaluation/ner/old/evaluation/Latent-circuit/results/shared_entities_all_models_with_hil.csv
    • Added a CSV file listing entities shared by all models for the 'Latent-circuit' task with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/results/shared_entities_all_models_without_hil.csv
    • Added a CSV file listing entities shared by all models for the 'Latent-circuit' task without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_claudesonet_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_deepseek_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_config_gpt_s41593-025-01869-7_with_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/with_hil/ner_token_usage_with_hil_latent.csv
    • Added token usage data for NER evaluation on the 'Latent-circuit' paper with human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_claudesonet_s41593-024-01787-0_without_hil.json
    • Added judged NER terms for Claude 3.7 Sonnet on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_deepseek_s41593-025-01869-7_without_hil.json
    • Added judged NER terms for DeepSeek V3 0324 on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_config_gpt_s41593-025-01869-7_without_hil.json
    • Added judged NER terms for GPT-4o-mini on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/evaluation/Latent-circuit/without_hil/ner_token_usage_without_hil_latent.csv
    • Added token usage data for NER evaluation on the 'Latent-circuit' paper without human-in-the-loop.
  • evaluation/ner/old/ner_config_claudesonet.yaml
    • Added YAML configuration for the Claude 3.7 Sonnet NER agent.
  • evaluation/ner/old/ner_config_deepseek.yaml
    • Added YAML configuration for the DeepSeek V3 0324 NER agent.
  • evaluation/ner/old/ner_config_gpt.yaml
    • Added YAML configuration for the GPT-4o-mini NER agent.
  • evaluation/notebook/README.md
    • Documented the token cost and speed analysis script.
  • evaluation/notebook/ner_comprehensive_summary.py
    • Added a Python script for generating comprehensive NER evaluation summaries and visualizations.
  • evaluation/notebook/ner_data_loader.py
    • Added a Python script for loading and preprocessing NER evaluation data.
  • evaluation/notebook/ner_entity_pool_analysis.py
    • Added a Python script for analyzing entity detection performance and false negatives.
  • evaluation/notebook/ner_judge_score_analysis.py
    • Added a Python script for analyzing judge scores and quality assessments in NER.
  • evaluation/notebook/ner_label_distribution.py
    • Added a Python script for analyzing the distribution of entity labels.
  • evaluation/notebook/ner_location_analysis.py
    • Added a Python script for analyzing entity detection patterns across paper locations.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_cost_violin.svg
    • Added an SVG plot visualizing cost distribution for the 'Integrating brainstem' task with human-in-the-loop.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_speed_violin.svg
    • Added an SVG plot visualizing speed distribution for the 'Integrating brainstem' task with human-in-the-loop.
  • evaluation/notebook/old/integrating-brainstem_w-hil/integrating-brainstem_w-hil_speed_vs_cost.svg
    • Added an SVG plot visualizing speed versus cost for the 'Integrating brainstem' task with human-in-the-loop.
Activity
  • The pull request was created by puja-trivedi.
  • New benchmarking data and scripts for NER evaluation were added.
  • Comprehensive evaluation results for various LLMs were included.
  • Analysis and visualization scripts for NER performance metrics were introduced.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new benchmarking dataset for StructSense, including a Python script (bio_txt.py) to convert BIO-tagged data into continuous text and JSONL formats. It also adds numerous evaluation results for Named Entity Recognition (NER) tasks on the 'Integrating brainstem' and 'Latent-circuit' papers, comparing several LLMs (Claude 3.7 Sonnet, DeepSeek V3 0324, GPT-4o-mini) both with and without Human-in-the-Loop (HIL). These results are presented in new CSV and JSON files detailing token usage, costs, speed, entity detection, ontology mapping, and judge scores. New Python scripts (ner_comprehensive_summary.py, ner_data_loader.py, ner_entity_pool_analysis.py, ner_judge_score_analysis.py, ner_label_distribution.py, ner_location_analysis.py, ner_ontology_analysis.py) analyze and visualize these metrics, alongside a README.md for token cost analysis.

Review comments highlight several issues:
  • redundant nested judge_ner_terms keys in JSON output files,
  • an inconsistency in schema definitions where paper_title is specified as a string but examples show it as an array,
  • hardcoded file paths in analysis scripts, and
  • extraneous \ No newline at end of file artifacts in CSV files.

Note: Security Review is unavailable for this PR.

@@ -0,0 +1,369 @@
{
"judge_ner_terms": {
"judge_ner_terms": {

high

The JSON structure contains a redundant, nested judge_ner_terms key. This is likely a bug in the file generation process and makes the data structure unnecessarily complex and error-prone to parse. The structure should be flattened to have only one judge_ner_terms level.
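Until the generation bug is fixed, the existing files could be repaired after the fact. A minimal, hypothetical cleanup helper (not part of this PR) that collapses the doubled key to one level:

```python
# Hypothetical post-hoc repair for the doubled key: collapse
# {"judge_ner_terms": {"judge_ner_terms": {...}}} to a single level.
def flatten_judge_terms(data):
    inner = data.get("judge_ner_terms")
    if isinstance(inner, dict) and "judge_ner_terms" in inner:
        data["judge_ner_terms"] = inner["judge_ner_terms"]
    return data
```

Applied over each JSON file (json.load, flatten, json.dump), this would leave already-flat files untouched and normalize the nested ones.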

Comment on lines +133 to +146
- paper_title: string # ⚠️ Not an array
- doi: string

notes:
- The following fields must be arrays of the same length:
- sentence
- start
- end
- remarks
- paper_location
- If ontology match is missing, set:
- ontology_id: null
- ontology_label: null
- paper_title must be a string, even if repeated in multiple entries

high

There is a contradiction in the schema definition for paper_title. Line 133 requires it to be a string, and line 151 disallows arrays. However, the notes on line 141 and the example output (line 174) show it as an array. This inconsistency should be resolved to ensure the schema is clear and correctly implemented by the agents.
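Until the schema is made consistent, downstream consumers could read the field defensively. A hypothetical helper (not part of this PR) that accepts either shape and always yields a string:

```python
# Defensive reader for the ambiguous schema: accept paper_title as either
# a string or a one-element list and always return a string.
def normalize_paper_title(value):
    if isinstance(value, list):
        if len(value) != 1:
            raise ValueError(f"expected one paper_title, got {value!r}")
        return value[0]
    if not isinstance(value, str):
        raise TypeError(f"paper_title must be str or [str], got {type(value)}")
    return value
```

This keeps parsers working against both the documented schema and the array-shaped example output while the contradiction is resolved.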

@@ -0,0 +1,336 @@
{
"judge_ner_terms": {
"judge_ner_terms": {

high

The JSON structure contains a redundant, nested judge_ner_terms key. This is likely a bug in the file generation process and makes the data structure unnecessarily complex and error-prone to parse. The structure should be flattened to have only one judge_ner_terms level.

@@ -0,0 +1,1184 @@
{
"judge_ner_terms": {
"judge_ner_terms": {

high

The JSON structure contains a redundant, nested judge_ner_terms key. This is likely a bug in the file generation process and makes the data structure unnecessarily complex and error-prone to parse. The structure should be flattened to have only one judge_ner_terms level.

@@ -0,0 +1,531 @@
{
"judged_structured_information": {
"judge_ner_terms": {

high

The JSON structure contains a redundant, nested judge_ner_terms key. This is likely a bug in the file generation process and makes the data structure unnecessarily complex and error-prone to parse. The structure should be flattened to have only one judge_ner_terms level.

Comment on lines +263 to +264
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.
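The suggested argparse refactor could look like the sketch below. The flag name --output-dir and the default value are assumptions for illustration, not the scripts' actual interface; the current hardcoded path is kept as the default so existing invocations keep working.

```python
# Sketch of the suggested argparse refactor; flag name and default
# are illustrative assumptions.
import argparse
from pathlib import Path

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NER evaluation analysis")
    parser.add_argument(
        "--output-dir",
        type=Path,
        # Keep the previously hardcoded location as the default.
        default=Path("evaluation/ner/evaluation/Latent-circuit/results"),
        help="Directory where result CSVs and reports are written.",
    )
    return parser.parse_args(argv)

args = parse_args(["--output-dir", "/tmp/ner-results"])
# args.output_dir is now a Path the rest of the script can use.
```

The same pattern would apply to the other three analysis scripts flagged below, so one evaluation set can be swapped for another without code changes.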

Comment on lines +522 to +523
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.

Comment on lines +510 to +511
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.

Comment on lines +317 to +318
structsense_root = Path(__file__).parent.parent.parent
output_dir = structsense_root / "evaluation/ner/evaluation/Latent-circuit/results"

medium

The output directory path is hardcoded. This reduces the script's flexibility. It would be better to pass the output directory as a command-line argument, for example using Python's argparse module. This would allow the script to be run for different evaluation sets without code modification.

"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,1236,495,0.000482,47.0 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,300,262,0.000202,69.8 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59016,634,0.00923,67.6 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59239,262,0.00904,80.6 tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO No newline at end of file

medium

The file ends with the text \ No newline at end of file. This appears to be an artifact from a version control or diff tool and should be removed. This extra text can cause parsing errors in tools that expect a clean CSV format.

"Jun 27, 12:53 PM",Favicon for OpenAI,GPT-4o-mini,liteLLM,59239,262,0.00904,80.6,tps,stop,"Multi-animal pose estimation, identification and tracking with DeepLabCut",Resource extraction,NO
