PUBTOOL (LaTeX publication helper scripts)

This directory contains small shell and Python utilities that support a LaTeX manuscript workflow: merging PDFs, extracting figures/tables as EPS, listing or freezing LaTeX references, counting words, converting TeX to DOCX, uploading drafts, and augmenting BibTeX entries with DOIs.

All commands below assume you are running from the directory that contains these scripts.

0) Setup

Make scripts executable:

chmod +x *.sh

Common dependencies (install names vary by OS):

TeX Live utilities: pdfjam, pdfjoin (often provided by the texlive-extra-utils package or similar)
Ghostscript: gs
Pandoc: pandoc
TeXcount: texcount
Python 3 for DOI scripts, plus packages (see below)

Example install (Ubuntu-like):

sudo apt-get update
sudo apt-get install -y texlive-extra-utils ghostscript pandoc texcount python3-pip

Python packages for DOI scripts:

python3 -m pip install --user bibtexparser requests unidecode tqdm habanero

Quick dependency sanity checks:

command -v pdfjam
command -v pdfjoin
command -v gs
command -v pandoc
command -v texcount
python3 -c "import bibtexparser, requests, unidecode, tqdm; import habanero; print('python deps ok')"

1) Typical workflows

A) Build an author-facing combined PDF (main + SI)

Option 1: getauthorpdf.sh (pdfjam)

./getauthorpdf.sh main.pdf SI.pdf
# or, if your SI file is literally named SI.pdf
./getauthorpdf.sh main.pdf

Expected output:

authorpdf.pdf

Option 2: merge.sh (pdfjoin)

./merge.sh main.pdf SI.pdf

Expected output:

authorpdf.pdf

Notes:

Both scripts produce authorpdf.pdf. They differ only in the tool they call: getauthorpdf.sh uses pdfjam and merge.sh uses pdfjoin, so pick whichever your TeX Live installation provides.

B) Convert a figure PDF to EPS

./geteps.sh figure.pdf

Expected output:

figure.eps

C) Extract pages from a multi-page PDF and convert to EPS

Example: your extended data PDF has multiple pages and you want pages 1, 3, and 5:

./getextfigaseps.sh extended_data.pdf 1 3 5

Expected outputs:

ED_Table_1.eps (from page 1)
ED_Table_2.eps (from page 3)
ED_Table_3.eps (from page 5)

D) Inspect label numbering (figures, tables, sections) from an AUX file

./getfigtaborder.sh main.aux
# or default:
./getfigtaborder.sh

Output:

Prints lines mapping label keys to resolved numbers (and possibly a page field depending on your AUX format).

E) List reference keys in the order they appear in a TeX file

./getrefseq.sh main.tex
# or default:
./getrefseq.sh

Output:

One reference key per line, unique, in first-seen order.

F) Freeze references into a “static” TeX file (for copyediting)

Prereq: compile your TeX file to generate an up-to-date .aux file (often compile twice):

pdflatex main.tex
pdflatex main.tex

Then create the static file:

./getstatictex.sh main.tex

Expected output:

main_static.tex

G) Word count (Nature Medicine style slice)

./wordcount.sh

Expected output:

nature med word count: intro+result+discussion: <N>

H) Convert LaTeX to Word (DOCX) via Pandoc

Prereq: ensure these files exist in the working directory:

manuscript.tex
vancouver-superscript.csl
custom.docx

Then:

./todocx.sh

Expected output:

manuscript.docx

I) Upload a draft PDF to a fixed remote location via SFTP

Default behavior uploads main.pdf as IPF.pdf:

./upload.sh

Upload a specific file:

./upload.sh paper.pdf

Expected:

Local file IPF.pdf is created/overwritten
Remote upload occurs to the configured destination

Prereq:

Set up key-based sftp authentication first. The script runs sftp in batch mode, which cannot prompt for a password.

J) Add missing DOIs to a BibTeX file

Option 1: API-based (Crossref REST via habanero)

python3 getdoi.py refs.bib

Expected output file naming quirk:

The script uses .replace('.bib','new.bib')
Example: refs.bib becomes refsnew.bib

Option 2: More aggressive, HTML scraping of Crossref guest query

python3 pydoi.py refs.bib

Expected output:

refs.bib_doi.bib (the script appends "_doi.bib" to the full input filename)

2) Script reference (detailed)

`getauthorpdf.sh`

Goal: Combine main PDF and SI PDF into authorpdf.pdf using pdfjam (letter paper formatting).

Run:

./getauthorpdf.sh main.pdf
# uses SI.pdf as the SI default

Or:

./getauthorpdf.sh main.pdf Supplement.pdf

Output:

authorpdf.pdf

Failure modes:

pdfjam: command not found
Install TeX Live utilities (example above) and ensure TeX binaries are on PATH.
Missing SI file
Provide the second argument or name the SI file SI.pdf.

`merge.sh`

Goal: Merge two PDFs into authorpdf.pdf using pdfjoin.

Run:

./merge.sh main.pdf SI.pdf

Output:

authorpdf.pdf

Failure modes:

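pdfjoin: command not found
Install TeX Live extra utils. Newer pdfjam releases no longer ship the pdfjoin wrapper (it moved to the separate pdfjam-extras collection), so on such systems use getauthorpdf.sh instead.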

`geteps.sh`

Goal: Convert one PDF into an EPS using Ghostscript (eps2write).

Run:

./geteps.sh figure.pdf

Output:

figure.eps

Notes and caveats:

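Output name is computed by the Bash substitution ${1/pdf/eps}, which replaces the first occurrence of pdf anywhere in the argument, not just the extension. Use filenames that end with .pdf and contain no other pdf substring (including in the directory part).
If the file is named my.figure.PDF (uppercase), the substitution does not match and the computed output name is identical to the input name. Rename to lowercase .pdf first.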
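The naming caveat can be reproduced without Ghostscript. The following is a minimal Bash sketch; the function names eps_name_substitute and eps_name_suffix are hypothetical and are not taken from geteps.sh. The first mirrors the substitution described above, the second shows a suffix-stripping alternative.

```shell
#!/usr/bin/env bash
# Illustration only: compare two ways of deriving the EPS name (requires Bash).

eps_name_substitute() {
  # Replaces the FIRST occurrence of "pdf" anywhere in the string.
  printf '%s\n' "${1/pdf/eps}"
}

eps_name_suffix() {
  # Strips a trailing ".pdf" only, then appends ".eps".
  printf '%s\n' "${1%.pdf}.eps"
}

eps_name_substitute figure.pdf          # figure.eps
eps_name_substitute pdf_figs/fig1.pdf   # eps_figs/fig1.pdf (directory renamed, extension kept)
eps_name_substitute my.figure.PDF       # my.figure.PDF (no match, name unchanged)
eps_name_suffix     pdf_figs/fig1.pdf   # pdf_figs/fig1.eps
```

The suffix form still does not handle an uppercase .PDF extension (it would yield my.figure.PDF.eps), so lowercase extensions remain the safer convention.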

Troubleshooting:

gs: command not found
Install Ghostscript.

`getextfigaseps.sh`

Goal: Extract specific pages from a multi-page PDF and convert each extracted page to EPS.

Run (example extracting pages 2 and 4):

./getextfigaseps.sh extended_data.pdf 2 4

What happens:

Creates page2.pdf, converts to page2.eps, then renames to ED_Table_1.eps
Creates page4.pdf, converts to page4.eps, then renames to ED_Table_2.eps

Outputs:

ED_Table_1.eps, ED_Table_2.eps, ... (numbered by argument order, not by page number)

Common gotchas:

Page numbers are passed through to pdfjam, which counts pages from 1.
If you want output names that preserve the page number, you can adjust the script, but as written it uses sequential k.
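To preview which output name each requested page will receive, the sequential numbering can be sketched as below. This is an illustration of the naming rule only; plan_ed_names is a hypothetical helper, it does not call pdfjam or Ghostscript, and it is not the script's own code.

```shell
# plan_ed_names PAGE...: print the output name assigned to each page argument.
plan_ed_names() {
  k=1                                   # counter follows argument order
  for page in "$@"; do
    printf 'page %s -> ED_Table_%s.eps\n' "$page" "$k"
    k=$((k + 1))
  done
}

plan_ed_names 2 4
# page 2 -> ED_Table_1.eps
# page 4 -> ED_Table_2.eps
```

Using "$page" in place of "$k" in the output name would give page-preserving names, if that is preferred.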

`getfigtaborder.sh`

Goal: Inspect label numbering recorded in a LaTeX AUX file.

Default run:

./getfigtaborder.sh
# reads main.aux

Specify AUX explicitly:

./getfigtaborder.sh paper.aux

Output:

Lines derived from \newlabel{...}{...} entries
Intended to help you confirm the order and numbers of figure/table labels

Notes:

AUX formats differ depending on packages and engines, so the extracted fields can vary.
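As a rough guide to what such an extraction involves, here is a minimal sketch that assumes the common entry layout \newlabel{key}{{number}{page}...}. The function name list_labels and the exact pattern are assumptions for illustration; getfigtaborder.sh may parse the file differently.

```shell
# list_labels AUX: print "key number page" for each \newlabel entry.
list_labels() {
  # Capture the key, the first braced field (number) and the second (page).
  # Extra fields written by packages such as hyperref are ignored.
  sed -n 's/^\\newlabel{\([^}]*\)}{{\([^{}]*\)}{\([^{}]*\)}.*/\1 \2 \3/p' "$1"
}
```

For an AUX line such as \newlabel{fig:a}{{1}{3}} this prints fig:a 1 3. Entries whose number field itself contains braces do not match this simple pattern and are skipped.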

`getrefseq.sh`

Goal: Print unique \ref{...} keys in the order they appear in a TeX file.

Default run:

./getrefseq.sh
# reads main.tex

Specify TeX:

./getrefseq.sh paper.tex

Output:

One label key per line, duplicates removed, order preserved.

Uses:

Sanity check: confirm expected reference labels exist
Quick inventory of referenced labels (figures, tables, sections)
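The behavior described in this section (unique keys, first-seen order) can be approximated with standard tools. The sketch below is an assumption about one way to do it, with list_ref_keys as a hypothetical name; it is not necessarily how getrefseq.sh is written.

```shell
# list_ref_keys TEX: print each \ref{...} key once, in first-seen order.
list_ref_keys() {
  # grep -o emits every match on its own line; the pattern requires a
  # backslash directly before "ref", so \eqref, \pageref, \cref are skipped.
  grep -o '\\ref{[^}]*}' "$1" |
    sed 's/^\\ref{//; s/}$//' |         # keep only the key
    awk '!seen[$0]++'                   # drop repeats, preserve order
}
```

Unlike sort -u, the awk filter keeps the original order, which is what makes the output usable for checking figure and table sequence.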

`getstatictex.sh`

Goal: Produce a static TeX file by replacing \ref{label} with its resolved number from the AUX file.

Run:

./getstatictex.sh main.tex

Prereq:

main.aux must exist and be up to date. Compile the LaTeX file first:

pdflatex main.tex
pdflatex main.tex

Output:

main_static.tex

Important caveats:

Replacement is done with plain sed and can over-replace if your label name appears as plain text elsewhere.
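The AUX filename is computed by ${FILE/tex/aux}, which replaces the first occurrence of tex and therefore misbehaves if tex appears earlier in the path (for example, latex/main.tex becomes laaux/main.tex). If you hit issues, hardcode AUX derivation or adjust the substitution.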
Always diff the result:

diff -u main.tex main_static.tex | head
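For comparison, a token-only replacement (touching \ref{label} but not bare label text) could look like the sketch below. It assumes AUX entries of the form \newlabel{key}{{number}{page}...}; make_static is a hypothetical name and this is not the code in getstatictex.sh.

```shell
# make_static TEX AUX: print TEX with each \ref{key} replaced by its number.
make_static() {
  tex=$1
  aux=$2
  rules=$(mktemp)
  # Turn every \newlabel{key}{{num}... line into a sed rule: s|\\ref{key}|num|g
  sed -n 's/^\\newlabel{\([^}]*\)}{{\([^{}]*\)}.*/s|\\\\ref{\1}|\2|g/p' "$aux" > "$rules"
  # Apply all rules; text that merely contains a label name is left alone.
  sed -f "$rules" "$tex"
  rm -f "$rules"
}

# Usage (writes to stdout; redirect to create the static file):
#   make_static main.tex main.aux > main_static.tex
```

Limitations of the sketch: label keys are used as regular expressions, so a dot in a key matches any character, and keys containing | would break the generated rule.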

`wordcount.sh`

Goal: Print a word count intended to approximate Nature Medicine counting for intro + results + discussion.

Run:

./wordcount.sh

Assumptions:

The script runs texcount main_new.tex
It then slices a fixed region of the texcount output and sums a field

If you rename your TeX file, update the script or create a symlink:

ln -sf yourfile.tex main_new.tex
./wordcount.sh

`todocx.sh`

Goal: Convert a LaTeX manuscript to a Word document using Pandoc and a specific CSL citation style.

Inputs expected:

manuscript.tex
vancouver-superscript.csl
custom.docx

Run:

./todocx.sh

Output:

manuscript.docx

Troubleshooting:

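If citations do not render, confirm that your Pandoc supports --citeproc (Pandoc 2.11 or later; older versions use --filter pandoc-citeproc instead).
If math renders poorly, check the target journal requirements. Pandoc writes DOCX math as native Word equations, so simplifying custom macros in the TeX source may help; the --mathml option applies to HTML output, not DOCX.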

`upload.sh`

Goal: Upload a PDF to a fixed server path using sftp batch mode.

Default run:

./upload.sh
# uploads main.pdf as IPF.pdf

Specify input:

./upload.sh authorpdf.pdf

Behavior:

Copies the selected PDF to IPF.pdf locally
Uploads IPF.pdf via sftp to a hardcoded destination

Prereq:

Key-based auth typically required for non-interactive sftp workflows

3) DOI tools (Python)

Both scripts read BibTeX, fill missing DOI fields, then write a new BibTeX file.

`getdoi.py` (Crossref API via habanero)

Run:

python3 getdoi.py refs.bib

Behavior:

For each entry missing doi, queries Crossref using title and takes the top hit (limit 1)
Writes a new bib file

Output file naming:

refs.bib -> refsnew.bib (quirk of .replace('.bib','new.bib'))

Tip:

If you want refs_new.bib, edit that output filename line.

Potential failure modes:

If Crossref returns no items, indexing the first result can raise an error (depending on the current code). If the script crashes on some entries, use pydoi.py or harden the empty-result handling.

`pydoi.py` (Crossref guest query via HTML scraping)

Run:

python3 pydoi.py refs.bib

Behavior:

Normalizes the title (ASCII, strips LaTeX-ish constructs)
Tries multiple author last names to increase hit probability
Extracts DOI by regex matching of doi.org/<...> in the HTML response
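pydoi.py itself is Python; purely to illustrate the regex step, here is the same idea as a shell sketch in line with the other command-line examples in this README. The pattern and the name extract_doi are assumptions, not the script's actual expression.

```shell
# extract_doi: read HTML on stdin, print the first DOI found after "doi.org/".
extract_doi() {
  # Match doi.org/10.<registrant>/<suffix>, case-insensitively.
  grep -oiE 'doi\.org/10\.[0-9]{4,9}/[-._;()/:a-z0-9]+' |
    head -n 1 |
    sed 's|^[^/]*/||'                   # drop the leading "doi.org/"
}

printf '%s\n' '<a href="https://doi.org/10.1000/xyz123">link</a>' | extract_doi
# 10.1000/xyz123
```

When the page contains no DOI link the function prints nothing, which is the case a caller needs to treat as "no match" instead of writing an empty doi field.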

Output:

refs.bib_doi.bib

Tradeoffs:

More aggressive on messy entries
More brittle because it depends on Crossref HTML layout

4) File inventory

Shell scripts:

getauthorpdf.sh
merge.sh
geteps.sh
getextfigaseps.sh
getfigtaborder.sh
getrefseq.sh
getstatictex.sh
wordcount.sh
todocx.sh
upload.sh
clean.sh
polllatex-latex.sh
mergebib.sh (section 6 refers to this script as merge_bibs.sh)

Python scripts:

getdoi.py
pydoi.py

5) Notes for maintainers

If you want to make these tools more robust, the highest-value changes are:

geteps.sh: output naming via ${1%.pdf}.eps instead of ${1/pdf/eps}
getstatictex.sh: replace only \ref{label} tokens (not raw label substrings) and use safer parsing of AUX entries
getdoi.py: handle Crossref empty results without indexing [0]
Harmonize DOI script output naming, for example refs_new.bib for getdoi.py and refs_doi.bib for pydoi.py, instead of refsnew.bib and refs.bib_doi.bib

6) Merging multiple `.bib` files with deduplication (`merge_bibs.sh`)

This utility merges all BibTeX files in a directory into a single consolidated .bib file while removing duplicates using bibtool.

Purpose

Combine multiple .bib sources (Zotero exports, PubMed, manual entries, etc.)
Normalize entries
Remove duplicates based on:
- BibTeX key (first pass)
- Content similarity (title + author + year) (second pass)

Prerequisites

Install bibtool:

sudo apt-get install bibtool

Check installation:

bibtool --version

Script

File: merge_bibs.sh

Make executable:

chmod +x merge_bibs.sh

Usage

Merge all `.bib` files in current directory

./merge_bibs.sh

Output:

merged.bib

Specify input directory and output file

./merge_bibs.sh ./refs final.bib

What the script does

Finds all .bib files in the specified directory
Concatenates them into a temporary file
Runs bibtool in two passes:

Pass 1: Key-based deduplication

--duplicates=key

Removes entries with identical BibTeX keys.

Pass 2: Content-based deduplication

--duplicates=field \
--duplicate.field=title \
--duplicate.field=author \
--duplicate.field=year

Removes entries that are the same paper but have different keys.
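The find-and-concatenate step that precedes the two bibtool passes might look like the sketch below. concat_bibs is a hypothetical helper written for illustration; it does not invoke bibtool and is not copied from the script.

```shell
# concat_bibs DIR: print the contents of every .bib file in DIR to stdout.
concat_bibs() {
  for f in "$1"/*.bib; do
    [ -f "$f" ] || continue             # glob matched nothing
    cat "$f"
    printf '\n'                         # keep entries apart if a file lacks a final newline
  done
}

# Usage: write the combined file outside DIR so a later run does not re-read it.
#   concat_bibs ./refs > /tmp/all_refs.bib
```

Because every file ending in .bib in the directory is read, intermediate outputs left in the same directory will be merged back in on the next run.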

Example workflow (recommended)

# Step 1: Merge raw bibliographies
./merge_bibs.sh ./bib_sources merged_raw.bib

# Step 2: Add DOIs
python3 getdoi.py merged_raw.bib
# OR
python3 pydoi.py merged_raw.bib

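# Step 3: Re-deduplicate after DOI enrichment
# Keep the enriched file in a directory of its own: running on "." would also merge merged_raw.bib.
# (getdoi.py writes merged_rawnew.bib; pydoi.py writes merged_raw.bib_doi.bib)
mkdir -p enriched
mv merged_rawnew.bib enriched/
./merge_bibs.sh ./enriched merged_clean.bib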

Optional enhancements

Prefer DOI-based uniqueness

--duplicate.field=doi

Aggressive normalization

--normalize.fields=title

Prefer entries with DOI

--prefer.field=doi

Notes and caveats

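bibtool deduplication is heuristic for content-based matches
Check the option names shown in this section against the manual for your installed bibtool version; its documented duplicate detection uses the check.double and check.double.delete resources, which compare sort keys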
Title normalization is not perfect across LaTeX vs Unicode sources
Always inspect final output before submission:

less merged.bib

Integration points

Before todocx.sh
After DOI enrichment (getdoi.py, pydoi.py)
Before journal submission formatting

Repository: zeroknowledgediscovery/pubtools
