End-to-end, rule-based pipeline for rebuilding an analytic corpus of Web of Science articles about artificial intelligence in oncology. The repository supports the manuscript "A reproducible bibliographic landscape of AI in oncology".
OncoTagger is a deterministic, abstract-level evidence-surveillance pipeline. It starts from user-supplied Web of Science Core Collection (WoSCC) export batches in data/raw/; raw WoSCC exports are not redistributed because they are licensed Clarivate content. The core outputs are a filtered workbook, a binary annotation workbook, an aggregate analysis workbook, manual validation artifacts, and manuscript-facing derived release artifacts such as population-normalized country outputs and candidate translational-signal subsets. Reproducibility depends on rerunning the documented script order from a fixed WoSCC snapshot and fixed dictionaries under sources/.
Run scripts from the repository root in this order:
- WoS export merge
  src/combine_wos_exports.py reads data/raw/savedrecs*.{csv,xls,xlsx} and writes data/raw/combined_dataset.xlsx. This means files in data/raw/ with names beginning with savedrecs and one of the supported extensions.
- Deduplication and year exclusion
  src/to_delete_duplicates_by_DOI.py reads the combined workbook, normalizes non-empty DOI values, applies a conservative title/year fallback, excludes publication year 2026, reports each removal count, and writes data/processed/processed_dataset.xlsx.
- Eligibility filter
  src/filter_dataset.py scores oncology and AI relevance, keeps a publication-year exclusion safeguard, applies severe-negative gates, and writes:
  - data/filtered/filtered_dataset.xlsx - automatically included records.
  - data/filtered/filtered_dataset_manual_review.xlsx - borderline records requiring manual review.
  - data/filtered/filtered_dataset_excluded.xlsx - automatically excluded records.
  - data/filtered/filtered_dataset_audit_all_decisions.xlsx - full audit table with scores, hits, flags, final decision, and decision_reason.
- Cancer typing
  src/main_binary.py scans title, abstract, and author keywords with hard and soft cancer vocabularies, then creates one-hot cancer columns plus detection metadata.
- AI model annotation
  src/main_binary.py detects AI model families from curated keyword columns and records where the AI signal was found.
- Task detection
  src/main_binary.py detects study task labels. The script writes a single task label to the primary_task output column for downstream metric interpretation and preserves all matched task labels in the all_tasks output column.
- Metric extraction
  src/main_binary.py extracts performance metrics from abstracts, normalizes numeric values, bins them with metric-specific thresholds, and adds trace columns for context and raw values.
- Aggregation, counters, and output tables
  src/counter.py reads the binary annotation workbook and writes an analysis workbook with frequencies, year trends, cross-tabs, country summaries, source-title summaries, no-metrics summaries, and AI class mapping tables.
- Article-facing helper outputs
  - src/build_article_to_population_ratio.py
    - Input: data/results/filtered_dataset_binary_classification_analysis.xlsx
    - Input: sources/total-population-by-country-2025.csv
    - Output: data/results/article to population ratio.xlsx
    - Purpose: derives population-normalized corresponding-author country outputs.
  - src/build_translational_subset.py
    - Input: data/results/filtered_dataset_binary_classification.xlsx
    - Output: manuscript-facing translational subset workbook and summary files.
    - Purpose: derives a cautious candidate translational-signal subset using external-validation context and implementation-oriented title signals.
data/
  manual validation/
    manual reference sets, audit CSVs, and validation summaries
  raw/
    combined_dataset.xlsx
  processed/
    processed_dataset.xlsx
  filtered/
    filtered_dataset.xlsx
    filtered_dataset_manual_review.xlsx
    filtered_dataset_excluded.xlsx
    filtered_dataset_audit_all_decisions.xlsx
  results/
    filtered_dataset_binary_classification.xlsx
    filtered_dataset_binary_classification_analysis.xlsx
    article to population ratio.xlsx
  supplementary material/
    synchronized supplementary workbooks and release summaries
docs/
  samples/
    sample_savedrecs.xlsx
    sample_combined_dataset.xlsx
    sample_filtered_dataset_binary_classification.xlsx
sources/
  controlled vocabularies, thresholds, task priorities, mappings, population source tables
src/
  build_article_to_population_ratio.py
  build_translational_subset.py
  combine_wos_exports.py
  to_delete_duplicates_by_DOI.py
  filter_dataset.py
  main_binary.py
  counter.py
Several files under data/ are generated outputs from the pipeline rather than manually edited source files. To reproduce them, place your own WoSCC export batches in data/raw/ and run the scripts in the documented Run order. The raw WoSCC exports are not redistributed because they are licensed Clarivate content.
Input and output roles:
- User-supplied input: WoSCC export batches under data/raw/.
- Controlled source files: dictionaries, thresholds, priorities, and mappings under sources/.
- Generated outputs: data/raw/combined_dataset.xlsx, data/processed/processed_dataset.xlsx, data/filtered/*.xlsx, and data/results/*.xlsx.
- Manuscript-facing derived release artifacts: supplementary packages and submission-stage derived files, when present, under data/supplementary material/ or the local submission staging area.
Place WoS export batches in data/raw/. The merge script looks for WoS export files whose filenames start with savedrecs and end in .csv, .xls, or .xlsx, for example savedrecs1.xlsx, savedrecs(2).csv, or savedrecs_batch_03.xlsx. In shell-style notation this is data/raw/savedrecs*.{csv,xls,xlsx}. The asterisk is a wildcard pattern, not a footnote marker.
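The filename rule above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code in src/combine_wos_exports.py: the function name find_wos_exports is hypothetical, and treating extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions named in the README; case-insensitive matching is an assumption.
SUPPORTED_SUFFIXES = {".csv", ".xls", ".xlsx"}


def find_wos_exports(raw_dir):
    """Return sorted files in raw_dir named savedrecs*.{csv,xls,xlsx}."""
    matches = []
    for path in Path(raw_dir).iterdir():
        if not path.is_file():
            continue
        has_prefix = path.name.startswith("savedrecs")
        has_suffix = path.suffix.lower() in SUPPORTED_SUFFIXES
        if has_prefix and has_suffix:
            matches.append(path)
    return sorted(matches)
```

Under this rule, a file such as combined_dataset.xlsx in the same folder is not picked up, because its name does not begin with savedrecs.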
Expected WoS columns retained by the merge step:
- Authors
- Article Title
- Source Title
- Author Keywords
- Keywords Plus
- Abstract
- Publication Year
- Reprint Addresses
- DOI
- DOI Link
- Book DOI
- WoS Categories
If a retained column is missing in one export, the merge script creates it as empty. Later scripts require at least Article Title, Author Keywords, Abstract, and Publication Year; DOI deduplication requires DOI.
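A minimal sketch of this column handling, using plain dictionaries instead of the workbook the real script reads. The function names align_columns and missing_required are hypothetical, and representing an empty cell as an empty string is an assumption.

```python
RETAINED_COLUMNS = [
    "Authors", "Article Title", "Source Title", "Author Keywords",
    "Keywords Plus", "Abstract", "Publication Year", "Reprint Addresses",
    "DOI", "DOI Link", "Book DOI", "WoS Categories",
]
# Columns that later scripts need, per the README.
REQUIRED_LATER = ["Article Title", "Author Keywords", "Abstract", "Publication Year"]


def align_columns(record):
    """Keep only retained columns; create absent ones as empty strings."""
    return {column: record.get(column, "") for column in RETAINED_COLUMNS}


def missing_required(header):
    """List required columns that are absent from an export header."""
    return [column for column in REQUIRED_LATER if column not in header]
```

A check like missing_required can be run on each export header before merging, so that a batch exported without abstracts is noticed early rather than at the annotation step.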
The pipeline is driven by files in sources/:
- cancer_keywords.csv - hard cancer type keywords used for cancer one-hot columns.
- cancer_keywords_soft.csv - fallback organ/site proxies used only when no hard cancer match is found.
- onco_terms_filter_strong.csv, onco_terms_filter_moderate.csv, onco_terms_filter_weak.csv, onco_terms_filter_remove.csv - oncology eligibility terms grouped by evidence strength.
- onco_terms_filter.csv - legacy single-file oncology term source used as fallback.
- ai_terms_filter_strong.csv, ai_terms_filter_moderate.csv, ai_terms_filter_weak.csv, ai_terms_filter_remove.csv - AI eligibility terms grouped by evidence strength.
- raw_ai_terms_filter.csv - legacy single-file AI term source used as fallback.
- ai_keywords.csv - AI model and model-family keywords for binary annotation.
- ai_family_map.csv - maps AI subfamily columns to broader AI classes.
- task_keywords.csv - task vocabularies for classification, segmentation, prognosis, synthesis, genomic, integration, NLP, and auxiliary tasks.
- task_priority.csv - priority order used to select primary_task.
- task_metric_priority.csv - task-specific metric ladders used for composite and weighted performance categories.
- metric_synonyms.csv - metric names and textual synonyms used by the abstract parser.
- thresholds.csv - metric-specific cutoffs for Very High, High, Medium, Low, and Very Low.
- category_scores.csv - numeric scores for performance categories used by weighted aggregation.
- country_synonyms.csv - country aliases used when parsing Reprint Addresses.
- total-population-by-country-2025.csv - population denominators for the article-to-population helper output.
- wos_exclusion_categories.tsv - WoS category trace layer for non-oncology or ambiguous records.
src/combine_wos_exports.py
Input: data/raw/savedrecs*.{csv,xls,xlsx}. This means files in data/raw/ with names beginning with savedrecs and one of the supported extensions.
Output: data/raw/combined_dataset.xlsx
Merges WoS batches, keeps the core WoS columns listed above, and fills missing retained columns with empty values.
src/to_delete_duplicates_by_DOI.py
Input: data/raw/combined_dataset.xlsx
Output: data/processed/processed_dataset.xlsx
Removes duplicate rows by normalized non-empty DOI, keeping the first occurrence. Rows without DOI are not collapsed together. A conservative Article Title + Publication Year fallback removes likely no-DOI duplicates or DOI/no-DOI copies while preserving rows with conflicting DOI values. Publication year 2026 is excluded at this preprocessing stage and the terminal summary reports how many records were removed for that year.
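The rules in this paragraph can be sketched on plain dictionaries as follows. This is a hedged illustration, not the code in src/to_delete_duplicates_by_DOI.py: the normalization details (lowercasing, stripping a doi.org prefix, collapsing punctuation in titles), the order in which the three checks run, and all function names are assumptions.

```python
import re

# Assumed normalization: strip a resolver or "doi:" prefix.
DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value):
    return DOI_PREFIX.sub("", str(value or "").strip().lower())


def normalize_title(value):
    return re.sub(r"[^a-z0-9]+", " ", str(value or "").lower()).strip()


def to_year(value):
    try:
        return int(float(str(value).strip()))
    except (ValueError, OverflowError):
        return None


def deduplicate(rows, excluded_year=2026):
    """Return (kept_rows, removal_counts) following the README rules."""
    kept = []
    removed = {"year": 0, "doi": 0, "title_year": 0}
    seen_dois = set()
    dois_by_title_year = {}  # (title, year) -> DOIs of kept rows ("" = no DOI)
    for row in rows:
        year = to_year(row.get("Publication Year"))
        if year == excluded_year:
            removed["year"] += 1
            continue
        doi = normalize_doi(row.get("DOI"))
        if doi and doi in seen_dois:  # keep the first occurrence only
            removed["doi"] += 1
            continue
        key = (normalize_title(row.get("Article Title")), year)
        kept_dois = dois_by_title_year.get(key)
        if key[0] and kept_dois is not None:
            # Two different non-empty DOIs: preserve both rows.
            conflicting = bool(doi) and any(kept_dois)
            if not conflicting:
                removed["title_year"] += 1
                continue
        if doi:
            seen_dois.add(doi)
        dois_by_title_year.setdefault(key, set()).add(doi)
        kept.append(row)
    return kept, removed
```

In this sketch the earlier row always wins, so a DOI-less copy that appears before its DOI-bearing twin is the one retained; the actual script may resolve that case differently.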
src/filter_dataset.py
Input: data/processed/processed_dataset.xlsx
Outputs:
- data/filtered/filtered_dataset.xlsx
- data/filtered/filtered_dataset_manual_review.xlsx
- data/filtered/filtered_dataset_excluded.xlsx
- data/filtered/filtered_dataset_audit_all_decisions.xlsx
Adds oncology and AI scores, hit traces, flags, WoS exclusion hits, final decision, and decision_reason. The default run keeps a publication-year exclusion safeguard for 2026, which should normally remove zero records after the preprocessing step above.
src/main_binary.py
Input: data/filtered/filtered_dataset.xlsx
Output: data/results/filtered_dataset_binary_classification.xlsx
src/main_binary.py adds:
- binary cancer-site columns;
- binary AI model/family columns;
- binary task columns;
- primary_task;
- all_tasks;
- metric category columns;
- metric trace columns;
- composite_metric;
- composite_source;
- weighted_score;
- weighted_category.
src/counter.py
Input: data/results/filtered_dataset_binary_classification.xlsx
Output: data/results/filtered_dataset_binary_classification_analysis.xlsx
Builds analysis sheets grouped as:
- frequency tables:
  - cancer-site frequencies;
  - AI model frequencies;
  - AI class frequencies;
  - task frequencies;
- temporal tables:
  - task-by-year;
  - cancer-by-year;
  - AI model-by-year;
  - AI class-by-year;
  - metric-by-year;
- cross-tabulations:
  - task x cancer;
  - task x model;
  - task x AI class;
  - metric-category cross-tabs against cancer, AI model, and AI class outputs;
- reporting-quality summaries:
  - no-metrics summaries;
  - source-title summaries;
- geography outputs:
  - corresponding-author country summaries;
- trend summaries:
  - top-10 temporal trends.
src/build_article_to_population_ratio.py
Inputs:
- data/results/filtered_dataset_binary_classification_analysis.xlsx
- sources/total-population-by-country-2025.csv
Output:
data/results/article to population ratio.xlsx
Purpose: derives population-normalized corresponding-author country outputs.
src/build_translational_subset.py
Input:
data/results/filtered_dataset_binary_classification.xlsx
Output:
- manuscript-facing translational subset workbook and summary files.
Purpose: derives a cautious candidate translational-signal subset using external-validation context and implementation-oriented title signals.
Cancer detection scans Article Title, Abstract, and Author Keywords in that order. Hard cancer keywords are preferred. The first field with a hard match wins and stops the scan. If no hard match is found anywhere, the first soft-only match is used as a fallback. The script keeps one-hot cancer columns plus:
- cancer_detected_in
- cancer_match_level
- cancer_hard_detected_in
- cancer_soft_detected_in
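The hard-first scan order can be illustrated with a small sketch. It is not the code in main_binary.py: plain substring matching stands in for whatever matcher the script uses, the vocabularies are passed as hypothetical keyword-to-label dictionaries, and the level values "hard", "soft", and "none" are assumed names.

```python
FIELDS = ["Article Title", "Abstract", "Author Keywords"]  # scan order


def find_terms(text, vocabulary):
    """Return labels whose keyword occurs in text (simplified substring match)."""
    lowered = str(text or "").lower()
    return sorted({label for keyword, label in vocabulary.items() if keyword in lowered})


def detect_cancer(record, hard_vocab, soft_vocab):
    soft_fallback = None
    for field in FIELDS:
        text = record.get(field, "")
        hard = find_terms(text, hard_vocab)
        if hard:
            # First field with a hard match wins and stops the scan.
            return {"labels": hard, "cancer_detected_in": field,
                    "cancer_match_level": "hard"}
        if soft_fallback is None:
            soft = find_terms(text, soft_vocab)
            if soft:
                # Remember only the first soft match; use it if no hard match appears.
                soft_fallback = {"labels": soft, "cancer_detected_in": field,
                                 "cancer_match_level": "soft"}
    if soft_fallback is not None:
        return soft_fallback
    return {"labels": [], "cancer_detected_in": "", "cancer_match_level": "none"}
```

With this ordering, a soft organ term in the title never overrides a hard cancer term that appears later in the abstract.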
The binary classifier can mark several cancer columns for one article. counter.py counts selected cancer columns and writes:
- number_of_cancer_types
- how_many_cancer_studied
If more than one cancer type is detected, how_many_cancer_studied becomes various cancers. If exactly one is detected, it records just one cancer - <cancer type>. If none is detected, it records not specified.
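The three cases above map directly onto a short function. The sketch below assumes that a detected cancer column holds the value 1; the function name and the example column names are hypothetical.

```python
def summarize_cancer_columns(row, cancer_columns):
    """Derive the two summary fields from one-hot cancer columns."""
    detected = [column for column in cancer_columns if row.get(column) == 1]
    count = len(detected)
    if count > 1:
        label = "various cancers"
    elif count == 1:
        label = f"just one cancer - {detected[0]}"
    else:
        label = "not specified"
    return {"number_of_cancer_types": count, "how_many_cancer_studied": label}


# Example with hypothetical column names:
columns = ["breast", "lung", "colorectal"]
print(summarize_cancer_columns({"breast": 0, "lung": 1, "colorectal": 0}, columns))
```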
main_binary.py creates one-hot AI model or subfamily columns from ai_keywords.csv. counter.py then loads ai_family_map.csv and creates broader AI class columns by taking the maximum value across mapped subfamilies. The analysis workbook includes both an AI Class Map sheet and an AI Class Breakdown sheet.
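Taking the maximum across mapped subfamilies means a class column is 1 as soon as any of its subfamilies is 1. A minimal sketch, assuming the map is available as a dictionary from subfamily column to class column; the function name and all column names in the example are hypothetical, not taken from ai_family_map.csv.

```python
def add_ai_classes(row, family_map):
    """Add one column per AI class, set to the max of its mapped subfamily columns."""
    classes = {}
    for subfamily, ai_class in family_map.items():
        value = int(row.get(subfamily, 0) or 0)
        classes[ai_class] = max(classes.get(ai_class, 0), value)
    return {**row, **classes}


# Hypothetical mapping and row:
family_map = {"cnn": "deep_learning", "transformer": "deep_learning",
              "random_forest": "classical_ml"}
row = {"cnn": 0, "transformer": 1, "random_forest": 0}
print(add_ai_classes(row, family_map))
```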
Task detection uses task_keywords.csv and task_priority.csv. The current priority order is:
- segmentation
- classification
- prognosis
- synthesis
- genomic
- integration
- nlp
- auxiliary
The first matched task by priority becomes primary_task. All matched tasks are preserved in all_tasks.
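Priority-based selection can be sketched as below. The priority list is the one given in this README; the function name, the "; " separator used for all_tasks, and the empty string for articles with no matched task are assumptions.

```python
TASK_PRIORITY = [
    "segmentation", "classification", "prognosis", "synthesis",
    "genomic", "integration", "nlp", "auxiliary",
]


def select_tasks(matched_tasks):
    """Order matched tasks by priority; the first one becomes primary_task."""
    ordered = [task for task in TASK_PRIORITY if task in matched_tasks]
    return {
        "primary_task": ordered[0] if ordered else "",
        "all_tasks": "; ".join(ordered),
    }


print(select_tasks({"prognosis", "classification"}))
```

An abstract matching both prognosis and classification keywords is therefore labeled classification for metric interpretation, while both labels remain visible in all_tasks.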
For each article, main_binary.py selects the first usable metric from the primary_task ladder in task_metric_priority.csv. That category is written to composite_metric, and the source metric name is written to composite_source.
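The ladder lookup can be pictured as a walk down an ordered list of metric names. In this sketch a metric counts as usable when it has a non-empty category, which is an assumption; the function name and the ladder contents in the example are hypothetical and not taken from task_metric_priority.csv.

```python
def pick_composite(primary_task, ladders, metric_categories):
    """Return the category and name of the first usable metric on the task ladder."""
    for metric in ladders.get(primary_task, []):
        category = metric_categories.get(metric)
        if category:  # assumed meaning of "usable": a category was assigned
            return {"composite_metric": category, "composite_source": metric}
    return {"composite_metric": "", "composite_source": ""}


# Hypothetical ladder and extracted categories:
ladders = {"segmentation": ["dice", "iou", "accuracy"]}
categories = {"iou": "High", "accuracy": "Very High"}
print(pick_composite("segmentation", ladders, categories))
```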
Metric extraction uses sentence-level candidates, ignores relative-change language such as improvement-by or reduction-by phrasing, and ranks context as:
external_validation > test > validation > holdout > cross_validation_summary > train > unknown
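One way to apply this ranking is to pick, among the candidates extracted for a metric, the one whose context ranks highest. The sketch below assumes each candidate is a dictionary with a context key; the function name, the candidate structure, and the tie rule (earliest candidate wins) are assumptions.

```python
CONTEXT_ORDER = [
    "external_validation", "test", "validation", "holdout",
    "cross_validation_summary", "train", "unknown",
]
CONTEXT_RANK = {name: index for index, name in enumerate(CONTEXT_ORDER)}


def best_candidate(candidates):
    """Return the candidate with the highest-ranked context, or None if empty."""
    def rank(candidate):
        # Unrecognized or missing contexts are treated as "unknown".
        return CONTEXT_RANK.get(candidate.get("context"), CONTEXT_RANK["unknown"])
    return min(candidates, key=rank, default=None)


candidates = [
    {"context": "train", "value": 0.99},
    {"context": "test", "value": 0.91},
    {"context": "validation", "value": 0.93},
]
print(best_candidate(candidates))
```

Here the test-set value is preferred over the numerically higher training value, which is the purpose of ranking by context rather than by magnitude.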
For the same task-specific metric ladder, all usable detected metrics are converted to numeric scores via category_scores.csv. Metrics earlier in the task ladder receive higher weights. The weighted mean is written to weighted_score, and the nearest category label is written to weighted_category.
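The README does not state the weighting scheme, so the sketch below uses a simple position-based weight (ladder length minus position) purely as an assumption. The category scores in the example are likewise hypothetical stand-ins for category_scores.csv, and the function name is illustrative.

```python
def weighted_performance(ladder, metric_categories, category_scores):
    """Weighted mean of category scores; earlier ladder metrics weigh more."""
    usable = [
        (position, metric_categories[metric])
        for position, metric in enumerate(ladder)
        if metric_categories.get(metric) in category_scores
    ]
    if not usable:
        return {"weighted_score": None, "weighted_category": ""}
    # Assumed weights: first ladder entry gets len(ladder), the last gets 1.
    weights = [len(ladder) - position for position, _ in usable]
    scores = [category_scores[category] for _, category in usable]
    score = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    # Nearest category label by absolute distance to the weighted mean.
    nearest = min(category_scores, key=lambda label: abs(category_scores[label] - score))
    return {"weighted_score": score, "weighted_category": nearest}


# Hypothetical scores and ladder:
scores = {"Very High": 5, "High": 4, "Medium": 3, "Low": 2, "Very Low": 1}
ladder = ["dice", "iou", "accuracy"]
print(weighted_performance(ladder, {"dice": "High", "accuracy": "Very High"}, scores))
```

In this example dice carries weight 3 and accuracy weight 1, giving (3*4 + 1*5) / 4 = 4.25, which is closest to High.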
Typical generated files:
- data/raw/combined_dataset.xlsx - merged WoS exports.
- data/processed/processed_dataset.xlsx - deduplicated records after publication year 2026 exclusion.
- data/filtered/filtered_dataset.xlsx - included records after eligibility filtering.
- data/filtered/filtered_dataset_manual_review.xlsx - borderline records selected for manual review.
- data/filtered/filtered_dataset_excluded.xlsx - excluded records with scores and reasons.
- data/filtered/filtered_dataset_audit_all_decisions.xlsx - full filter audit table.
- data/results/filtered_dataset_binary_classification.xlsx - article-level annotation and metric workbook.
- data/results/filtered_dataset_binary_classification_analysis.xlsx - aggregated counters, trends, and cross-tabs.
- data/results/article to population ratio.xlsx - population-normalized corresponding-author country output.
- data/manual validation/ - manual audit files, ordinal validation tables, and detection-audit summaries.
- data/supplementary material/ - synchronized article-supporting release artifacts and summary manifests.
The repository also contains sample workbooks in docs/samples/ for orientation.
The workflow includes several audit layers:
- Filter-level manual_review output for ambiguous but potentially eligible records.
- Full audit workbook with include, manual-review, and exclude decisions.
- Hit-trace columns for oncology and AI evidence by field and bucket.
- decision_reason explaining the final eligibility decision.
- Cancer hard/soft source metadata.
- Task source metadata through task_source_field.
- Metric trace columns:
  - context;
  - raw value;
  - sentence;
  - source type;
  - suspicious extraction flag.
- no_metrics_reported column and no-metrics analysis sheets.
- Dictionary enrichment via editable files in sources/.
- Repository-facing manual validation artifacts in data/manual validation/.
- Repository-facing supplementary release artifacts in data/supplementary material/.
Manual validation and supplementary release folders are not required for a basic pipeline run, but they are useful for article support, reproducibility, and auditability.
Recommended environment:
- Python 3.11 or newer
- Python packages from requirements.txt, including pandas, numpy, spacy, tqdm, and pycountry
- Excel engine dependency: openpyxl for .xlsx files
- Optional Excel engine dependency for legacy .xls exports: xlrd>=2.0.1
- spaCy English language model: en_core_web_sm
Setup:
python -m venv .venv
.venv\Scripts\activate
# on Linux or macOS, use instead: source .venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_sm

Run order:
python src/combine_wos_exports.py
python src/to_delete_duplicates_by_DOI.py
python src/filter_dataset.py
python src/main_binary.py
python src/counter.py
python src/build_article_to_population_ratio.py
# optional article-facing helper
python src/build_translational_subset.py

Given a fixed WoS export snapshot and fixed sources/ dictionaries, the pipeline is deterministic. WoS itself can change over time, so record the export date/time and repository commit when publishing derived results.
- Site-unspecified cancers may remain difficult to assign to a precise organ class.
- Metastatic-site language can be ambiguous when the primary tumor site and metastatic site are both mentioned.
- Country parsing depends on Reprint Addresses, address formatting, and country_synonyms.csv; it is useful for summaries but not a full affiliation parser.
- Metric extraction is abstract-only and may miss values reported only in full text, tables, supplements, or figures.
- Deduplication depends on DOI and title/year metadata quality; rows with conflicting DOI values but the same normalized title/year are preserved for manual review rather than automatically collapsed.
- Rule-based keywords are transparent and auditable, but they require periodic enrichment when new terminology appears.
- No files matching savedrecs*.{csv,xls,xlsx} found: put WoS batches in data/raw/ and keep the savedrecs filename prefix.
- Excel engine errors: install openpyxl for .xlsx files; install xlrd>=2.0.1 only for legacy .xls files.
- Missing spaCy English language model: if en_core_web_sm is missing, run python -m spacy download en_core_web_sm.
- Unexpected filtering decisions: inspect data/filtered/filtered_dataset_audit_all_decisions.xlsx, especially columns containing score, hit, or flag, plus the exact decision_reason column. These fields explain why a record was included, excluded, or routed to manual review.
- Country aliases missing: add them to sources/country_synonyms.csv.
- README/release mismatch: treat data/supplementary material/ as a synchronized release layer, not as the sole source of truth for rerunning the core pipeline.
Code is released under the MIT License. See LICENSE.txt.