Document formatting tools often required for manuscript preparation

zeroknowledgediscovery/pubtools

PUBTOOL (LaTeX publication helper scripts)

This directory contains small shell and Python utilities that support a LaTeX manuscript workflow: merging PDFs, extracting figures/tables as EPS, listing or freezing LaTeX references, counting words, converting TeX to DOCX, uploading drafts, augmenting BibTeX entries with DOIs, and merging BibTeX files with deduplication.

All commands below assume you are running from the directory that contains these scripts.


0) Setup

Make scripts executable:

chmod +x *.sh

Common dependencies (install names vary by OS):

  • TeX Live utilities: pdfjam, pdfjoin (often provided by TeX Live texlive-extra-utils or similar)
  • Ghostscript: gs
  • Pandoc: pandoc
  • TeXcount: texcount
  • Python 3 for DOI scripts, plus packages (see below)

Example install (Ubuntu-like):

sudo apt-get update
# texcount is shipped inside texlive-extra-utils on Debian/Ubuntu, so it is not listed separately
sudo apt-get install -y texlive-extra-utils ghostscript pandoc python3-pip

Python packages for DOI scripts:

python3 -m pip install --user bibtexparser requests unidecode tqdm habanero

Quick dependency sanity checks:

command -v pdfjam
command -v pdfjoin
command -v gs
command -v pandoc
command -v texcount
python3 -c "import bibtexparser, requests, unidecode, tqdm; import habanero; print('python deps ok')"
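
The same check can be scripted so that missing tools are reported in one pass. The following is a minimal sketch using only the Python standard library; the function name `find_missing` and the `REQUIRED_TOOLS` list are illustrative and are not part of the repository.

```python
import shutil

# Tool names taken from the dependency list above.
REQUIRED_TOOLS = ["pdfjam", "pdfjoin", "gs", "pandoc", "texcount"]


def find_missing(tools):
    """Return the tools that cannot be found on PATH, in input order."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all command-line dependencies found")
```

A missing pdfjoin is not necessarily a problem if you only use getauthorpdf.sh, which relies on pdfjam.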

1) Typical workflows

A) Build an author-facing combined PDF (main text + supplementary information, SI)

Option 1: getauthorpdf.sh (pdfjam)

./getauthorpdf.sh main.pdf SI.pdf
# or, if your SI file is literally named SI.pdf
./getauthorpdf.sh main.pdf

Expected output:

  • authorpdf.pdf

Option 2: merge.sh (pdfjoin)

./merge.sh main.pdf SI.pdf

Expected output:

  • authorpdf.pdf

Notes:

  • Both scripts produce authorpdf.pdf. They differ only in the backend: getauthorpdf.sh calls pdfjam directly, while merge.sh calls pdfjoin, a wrapper around pdfjam that not every TeX Live install provides.

B) Convert a figure PDF to EPS

./geteps.sh figure.pdf

Expected output:

  • figure.eps

C) Extract pages from a multi-page PDF and convert to EPS

Example: your extended data PDF has multiple pages and you want pages 1, 3, and 5:

./getextfigaseps.sh extended_data.pdf 1 3 5

Expected outputs:

  • ED_Table_1.eps (from page 1)
  • ED_Table_2.eps (from page 3)
  • ED_Table_3.eps (from page 5)

D) Inspect label numbering (figures, tables, sections) from an AUX file

./getfigtaborder.sh main.aux
# or default:
./getfigtaborder.sh

Output:

  • Prints lines mapping label keys to resolved numbers (and possibly a page field depending on your AUX format).

E) List \ref label keys in the order they appear in a TeX file

./getrefseq.sh main.tex
# or default:
./getrefseq.sh

Output:

  • One \ref label key per line, unique, in first-seen order.

F) Freeze references into a “static” TeX file (for copyediting)

Prereq: compile your TeX file so that the .aux file is up to date. Two passes are usually needed for cross-references to resolve:

pdflatex main.tex
pdflatex main.tex

Then create the static file:

./getstatictex.sh main.tex

Expected output:

  • main_static.tex

G) Word count (Nature Medicine style slice)

Prereq: the script reads main_new.tex from the working directory.

./wordcount.sh

Expected output:

  • nature med word count: intro+result+discussion: <N>

H) Convert LaTeX to Word (DOCX) via Pandoc

Prereq: ensure these files exist in the working directory:

  • manuscript.tex
  • vancouver-superscript.csl
  • custom.docx

Then:

./todocx.sh

Expected output:

  • manuscript.docx

I) Upload a draft PDF to a fixed remote location via SFTP

Default behavior uploads main.pdf as IPF.pdf:

./upload.sh

Upload a specific file:

./upload.sh paper.pdf

Expected:

  • Local file IPF.pdf is created/overwritten
  • Remote upload occurs to the configured destination

Prereq:

  • Working sftp key-based auth is strongly recommended because the script runs in batch mode.

J) Add missing DOIs to a BibTeX file

Option 1: API-based (Crossref REST via habanero)

python3 getdoi.py refs.bib

Expected output (note the naming quirk):

  • The script uses .replace('.bib','new.bib')
  • Example: refs.bib becomes refsnew.bib

Option 2: More aggressive (scrapes the HTML of the Crossref guest query page)

python3 pydoi.py refs.bib

Expected output:

  • refs.bib_doi.bib (the script appends "_doi.bib" to the full input filename)

2) Script reference (detailed)

getauthorpdf.sh

Goal: Combine main PDF and SI PDF into authorpdf.pdf using pdfjam (letter paper formatting).

Run:

./getauthorpdf.sh main.pdf
# uses SI.pdf as the SI default

Or:

./getauthorpdf.sh main.pdf Supplement.pdf

Output:

  • authorpdf.pdf

Failure modes:

  • pdfjam: command not found
    Install TeX Live utilities (example above) and ensure TeX binaries are on PATH.
  • Missing SI file
    Provide the second argument or name the SI file SI.pdf.

merge.sh

Goal: Merge two PDFs into authorpdf.pdf using pdfjoin.

Run:

./merge.sh main.pdf SI.pdf

Output:

  • authorpdf.pdf

Failure modes:

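  • pdfjoin: command not found
    Install TeX Live extra utils. Recent pdfjam releases (3.02 and later) no longer include the pdfjoin wrapper, so on a current TeX Live install use getauthorpdf.sh instead.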

geteps.sh

Goal: Convert one PDF into an EPS using Ghostscript (eps2write).

Run:

./geteps.sh figure.pdf

Output:

  • figure.eps

Notes and caveats:

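  • The output name is computed by the Bash substitution ${1/pdf/eps}, which replaces the first occurrence of pdf anywhere in the argument, not just the extension. A path such as pdf_figs/figure.pdf therefore becomes eps_figs/figure.pdf. Use filenames whose only pdf is the .pdf extension.
  • The substitution is case-sensitive: for my.figure.PDF nothing matches, so the output name equals the input name. Rename the file to end in lowercase .pdf first.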
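
The naming caveats above can be reproduced outside the shell. This sketch emulates the first-occurrence behavior of `${1/pdf/eps}` in Python and compares it with swapping only the final extension; both helper names are hypothetical and the file names are made up for illustration.

```python
from pathlib import PurePosixPath


def bash_first_sub(name, old, new):
    """Emulate Bash ${name/old/new} for a literal pattern: first match only."""
    return name.replace(old, new, 1)


def swap_suffix(name, new_suffix):
    """Replace only the final extension, whatever its case."""
    return str(PurePosixPath(name).with_suffix(new_suffix))


print(bash_first_sub("figure.pdf", "pdf", "eps"))           # figure.eps
print(bash_first_sub("pdf_figs/figure.pdf", "pdf", "eps"))  # eps_figs/figure.pdf
print(bash_first_sub("my.figure.PDF", "pdf", "eps"))        # my.figure.PDF (unchanged)
print(swap_suffix("pdf_figs/figure.pdf", ".eps"))           # pdf_figs/figure.eps
print(swap_suffix("my.figure.PDF", ".eps"))                 # my.figure.eps
```

The suffix-based variant corresponds to the `${1%.pdf}.eps` idea listed under the maintainer notes, with the difference that this Python version also handles an uppercase extension.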

Troubleshooting:

  • gs: command not found
    Install Ghostscript.

getextfigaseps.sh

Goal: Extract specific pages from a multi-page PDF and convert each extracted page to EPS.

Run (example extracting pages 2 and 4):

./getextfigaseps.sh extended_data.pdf 2 4

What happens:

  • Creates page2.pdf, converts to page2.eps, then renames to ED_Table_1.eps
  • Creates page4.pdf, converts to page4.eps, then renames to ED_Table_2.eps

Outputs:

  • ED_Table_1.eps, ED_Table_2.eps, ... (numbered by argument order, not by page number)

Common gotchas:

  • Page numbers are passed through to pdfjam, which counts pages from 1.
  • Output names follow argument order, not page number. To keep the page number in the name, edit the script; as written it uses a sequential counter k.
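
To make the numbering rule concrete, here is a small sketch of the page-to-filename mapping described above. It does not call pdfjam or Ghostscript; it only computes names. `plan_outputs_by_page` and its `ED_Table_p` prefix are a hypothetical alternative, not something the script offers.

```python
def plan_outputs(pages, prefix="ED_Table_"):
    """Map each requested page to its output name, numbered by argument order."""
    return [(page, f"{prefix}{k}.eps") for k, page in enumerate(pages, start=1)]


def plan_outputs_by_page(pages, prefix="ED_Table_p"):
    """Hypothetical alternative: keep the source page number in the name."""
    return [(page, f"{prefix}{page}.eps") for page in pages]


for page, name in plan_outputs([2, 4]):
    print(f"page {page} -> {name}")
# page 2 -> ED_Table_1.eps
# page 4 -> ED_Table_2.eps
```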

getfigtaborder.sh

Goal: Inspect label numbering recorded in a LaTeX AUX file.

Default run:

./getfigtaborder.sh
# reads main.aux

Specify AUX explicitly:

./getfigtaborder.sh paper.aux

Output:

  • Lines derived from \newlabel{...}{...} entries
  • Intended to help you confirm the order and numbers of figure/table labels

Notes:

  • AUX formats differ depending on packages and engines, so the extracted fields can vary.
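
For readers who want to see what the AUX entries look like, the following sketch parses `\newlabel` lines with a regular expression. It is an independent illustration, not the logic of getfigtaborder.sh, and it assumes the common layout `\newlabel{key}{{number}{page}...}`; entries whose number field contains nested braces are skipped. The sample labels are invented.

```python
import re

# Matches \newlabel{key}{{number}{page}; extra groups added by hyperref are ignored.
NEWLABEL = re.compile(r"\\newlabel\{([^{}]+)\}\{\{([^{}]*)\}\{([^{}]*)\}")


def parse_aux(aux_text):
    """Return (key, number, page) tuples in the order they appear in the AUX text."""
    return [m.groups() for m in NEWLABEL.finditer(aux_text)]


sample = "\n".join([
    r"\relax",
    r"\newlabel{fig:overview}{{1}{3}}",
    r"\newlabel{tab:cohort}{{2}{5}{Cohort summary}{table.2}{}}",
])
for key, number, page in parse_aux(sample):
    print(key, number, page)
# fig:overview 1 3
# tab:cohort 2 5
```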

getrefseq.sh

Goal: Print unique \ref{...} keys in the order they appear in a TeX file.

Default run:

./getrefseq.sh
# reads main.tex

Specify TeX:

./getrefseq.sh paper.tex

Output:

  • One label key per line, duplicates removed, order preserved.
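
The "unique, order preserved" behavior can be sketched in a few lines of Python. This is an illustration of the idea rather than a port of getrefseq.sh; it matches only plain `\ref{...}` and would not pick up variants such as `\eqref` or `\cref`.

```python
import re

REF = re.compile(r"\\ref\{([^{}]+)\}")


def ref_keys_in_order(tex_text):
    """Return unique \\ref keys in first-seen order."""
    # dict preserves insertion order, so fromkeys() deduplicates without sorting.
    return list(dict.fromkeys(REF.findall(tex_text)))


text = r"See Fig.~\ref{fig:a} and Table~\ref{tab:b}; compare \ref{fig:a} with \eqref{eq:c}."
print(ref_keys_in_order(text))  # ['fig:a', 'tab:b']
```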

Uses:

  • Sanity check: confirm expected reference labels exist
  • Quick inventory of referenced labels (figures, tables, sections)

getstatictex.sh

Goal: Produce a static TeX file by replacing \ref{label} with its resolved number from the AUX file.

Run:

./getstatictex.sh main.tex

Prereq:

  • main.aux must exist and be up to date. Compile the LaTeX file first:
pdflatex main.tex
pdflatex main.tex

Output:

  • main_static.tex

Important caveats:

  • Replacement is done with plain sed and can over-replace if your label name appears as plain text elsewhere.
  • The AUX filename is computed by ${FILE/tex/aux}, which replaces the first occurrence of tex in the path, not the extension. For example, latex/main.tex becomes laaux/main.tex. If this affects you, derive the name with ${FILE%.tex}.aux instead.
  • Always diff the result:
diff -u main.tex main_static.tex | head
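
One way to avoid the over-replacement caveat is to substitute only complete `\ref{label}` tokens. The sketch below shows that approach in Python; it is an alternative to the sed-based script, not a description of it, and the `numbers` mapping is a hypothetical stand-in for values read from main.aux.

```python
import re

REF = re.compile(r"\\ref\{([^{}]+)\}")


def freeze_refs(tex_text, numbers):
    """Replace each \\ref{key} with its number; leave unknown keys untouched."""
    def substitute(match):
        key = match.group(1)
        return numbers.get(key, match.group(0))
    return REF.sub(substitute, tex_text)


numbers = {"fig:roc": "2"}  # hypothetical mapping, e.g. parsed from main.aux
source = r"Figure~\ref{fig:roc} (label fig:roc) and Table~\ref{tab:missing}."
print(freeze_refs(source, numbers))
# Figure~2 (label fig:roc) and Table~\ref{tab:missing}.
```

The plain-text mention of fig:roc is left alone, and a label missing from the mapping stays as a live `\ref` so it is easy to spot in the diff.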

wordcount.sh

Goal: Print a word count intended to approximate Nature Medicine counting for intro + results + discussion.

Run:

./wordcount.sh

Assumptions:

  • The script runs texcount main_new.tex
  • It then slices a fixed region of the texcount output and sums a field

If you rename your TeX file, update the script or create a symlink:

ln -sf yourfile.tex main_new.tex
./wordcount.sh

todocx.sh

Goal: Convert a LaTeX manuscript to a Word document using Pandoc and a specific CSL citation style.

Inputs expected:

  • manuscript.tex
  • vancouver-superscript.csl
  • custom.docx

Run:

./todocx.sh

Output:

  • manuscript.docx

Troubleshooting:

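  • If citations do not render, confirm that your Pandoc supports --citeproc (Pandoc 2.11 or later; older versions use --filter pandoc-citeproc instead).
  • If math renders poorly, check the target journal's requirements. Pandoc writes DOCX math as native Word equations, and --mathml does not apply to DOCX output.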

upload.sh

Goal: Upload a PDF to a fixed server path using sftp batch mode.

Default run:

./upload.sh
# uploads main.pdf as IPF.pdf

Specify input:

./upload.sh authorpdf.pdf

Behavior:

  • Copies the selected PDF to IPF.pdf locally
  • Uploads IPF.pdf via sftp to a hardcoded destination

Prereq:

  • Key-based auth is typically required, since sftp batch mode cannot prompt for a password.

3) DOI tools (Python)

Both scripts read BibTeX, fill missing DOI fields, then write a new BibTeX file.

getdoi.py (Crossref API via habanero)

Run:

python3 getdoi.py refs.bib

Behavior:

  • For each entry missing doi, queries Crossref using title and takes the top hit (limit 1)
  • Writes a new bib file

Output file naming:

  • refs.bib -> refsnew.bib (quirk of .replace('.bib','new.bib'))

Tip:

  • If you want refs_new.bib, edit that output filename line.
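
The naming quirk comes from `str.replace` rewriting every `.bib` substring rather than only the extension. The sketch below contrasts that with a path-aware helper; `suffixed_name` is a hypothetical function, and the directory name in the example is made up to show the side effect.

```python
from pathlib import PurePosixPath


def quirky_name(path):
    """What str.replace does: every '.bib' substring is rewritten."""
    return path.replace(".bib", "new.bib")


def suffixed_name(path, tag="_new"):
    """Insert a tag before the final extension only (hypothetical helper)."""
    p = PurePosixPath(path)
    return str(p.with_name(p.stem + tag + p.suffix))


print(quirky_name("refs.bib"))             # refsnew.bib
print(quirky_name("my.bibs/refs.bib"))     # mynew.bibs/refsnew.bib
print(suffixed_name("my.bibs/refs.bib"))   # my.bibs/refs_new.bib
```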

Potential failure modes:

  • If Crossref returns no items for a title, the script can fail when it indexes the first result (depending on the current code). If it crashes on some entries, use pydoi.py or harden the empty-result handling.
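
A defensive way to read the top hit is sketched below. It assumes the usual Crossref REST response shape (a dict with `message` and `items`), which is what habanero returns from a call such as `Crossref().works(query=title, limit=1)`; the helper name `first_doi` is hypothetical, and the sample DOI is made up.

```python
def first_doi(response):
    """Return the DOI of the top Crossref hit, or None when there are no items."""
    # Assumed shape: {"message": {"items": [{"DOI": "..."}, ...]}}
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    return items[0].get("DOI")


print(first_doi({"message": {"items": [{"DOI": "10.1000/xyz123"}]}}))  # 10.1000/xyz123
print(first_doi({"message": {"items": []}}))                           # None
```

Returning None lets the caller skip the entry and leave it without a doi field instead of aborting the whole run.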

pydoi.py (Crossref guest query via HTML scraping)

Run:

python3 pydoi.py refs.bib

Behavior:

  • Normalizes the title (ASCII, strips LaTeX-ish constructs)
  • Tries multiple author last names to increase hit probability
  • Extracts DOI by regex matching of doi.org/<...> in the HTML response
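
The regex step can be illustrated as follows. The pattern here is an assumption about what a reasonable matcher looks like, not a copy of the one in pydoi.py, and the HTML snippet and DOI are invented.

```python
import re

# A DOI starts with "10.", a 4-9 digit registrant code, a slash, then a suffix.
DOI_LINK = re.compile(r'doi\.org/(10\.\d{4,9}/[^\s"<>]+)', re.IGNORECASE)


def extract_doi(html):
    """Return the first DOI found in a doi.org link, or None."""
    match = DOI_LINK.search(html)
    # Trailing sentence punctuation is not part of the DOI in running text.
    return match.group(1).rstrip(".,;") if match else None


page = '<a href="https://doi.org/10.1000/example.123">https://doi.org/10.1000/example.123</a>'
print(extract_doi(page))                       # 10.1000/example.123
print(extract_doi("<p>No results found</p>"))  # None
```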

Output:

  • refs.bib_doi.bib

Tradeoffs:

  • More aggressive on messy entries
  • More brittle because it depends on Crossref HTML layout

4) File inventory

Shell scripts:

  • getauthorpdf.sh
  • merge.sh
  • geteps.sh
  • getextfigaseps.sh
  • getfigtaborder.sh
  • getrefseq.sh
  • getstatictex.sh
  • wordcount.sh
  • todocx.sh
  • upload.sh
  • merge_bibs.sh (documented in section 6)

Python scripts:

  • getdoi.py
  • pydoi.py

5) Notes for maintainers

If you want to make these tools more robust, the highest-value changes are:

  • geteps.sh: output naming via ${1%.pdf}.eps instead of ${1/pdf/eps}
  • getstatictex.sh: replace only \ref{label} tokens (not raw label substrings) and use safer parsing of AUX entries
  • getdoi.py: handle Crossref empty results without indexing [0]
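  • Harmonize DOI script output naming, for example refs_new.bib for getdoi.py and refs_doi.bib for pydoi.py, both derived from the input name without its .bib extension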

6) Merging multiple .bib files with deduplication (merge_bibs.sh)

This utility merges all BibTeX files in a directory into a single consolidated .bib file while removing duplicates using bibtool.

Purpose

  • Combine multiple .bib sources (Zotero exports, PubMed, manual entries, etc.)
  • Normalize entries
  • Remove duplicates based on:
    • BibTeX key (first pass)
    • Content similarity (title + author + year) (second pass)

Prerequisites

Install bibtool:

sudo apt-get install bibtool

Check installation:

bibtool --version

Script

File: merge_bibs.sh

Make executable:

chmod +x merge_bibs.sh

Usage

Merge all .bib files in current directory

./merge_bibs.sh

Output:

merged.bib

Specify input directory and output file

./merge_bibs.sh ./refs final.bib

What the script does

  1. Finds all .bib files in the specified directory
  2. Concatenates them into a temporary file
  3. Runs bibtool in two passes:

Pass 1: Key-based deduplication

--duplicates=key

Removes entries with identical BibTeX keys.


Pass 2: Content-based deduplication

--duplicates=field \
--duplicate.field=title \
--duplicate.field=author \
--duplicate.field=year

Removes entries that are the same paper but have different keys.
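
The idea behind content-based matching can be sketched in Python. This is not bibtool's algorithm; it is a simplified illustration that compares a normalized title, author string, and year. Entries are assumed to be plain dicts with `ID`, `title`, `author`, and `year` keys, and the sample entries are invented.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents, braces and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def content_key(entry):
    """Build a comparison key from title, author, and year."""
    return (
        normalize(entry.get("title", "")),
        normalize(entry.get("author", "")),
        entry.get("year", "").strip(),
    )


def dedupe(entries):
    """Keep the first entry for each content key, preserving input order."""
    seen = set()
    unique = []
    for entry in entries:
        key = content_key(entry)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


entries = [
    {"ID": "smith2020", "title": "{Deep} Learning for {ECG}", "author": "Smith, J.", "year": "2020"},
    {"ID": "Smith20", "title": "Deep learning for ECG", "author": "Smith, J.", "year": "2020"},
]
print([e["ID"] for e in dedupe(entries)])  # ['smith2020']
```

Exact matching on a normalized key misses near-duplicates such as abbreviated author lists, which is one reason to inspect the merged file by hand.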


Example workflow (recommended)

# Step 1: Merge raw bibliographies
./merge_bibs.sh ./bib_sources merged_raw.bib

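# Step 2: Add DOIs
python3 getdoi.py merged_raw.bib   # writes merged_rawnew.bib
# OR
python3 pydoi.py merged_raw.bib    # writes merged_raw.bib_doi.bib

# Step 3: Re-deduplicate. "." merges every .bib file in the current directory,
# so merged_raw.bib and the step 2 output are both read. Matching on DOI
# applies only if the DOI option under "Optional enhancements" is enabled.
./merge_bibs.sh . merged_clean.bib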

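Optional enhancements

These options are candidates to add to the bibtool call inside merge_bibs.sh; check each against the documentation for your bibtool version before relying on it.
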
Prefer DOI-based uniqueness

--duplicate.field=doi

Aggressive normalization

--normalize.fields=title

Prefer entries with DOI

--prefer.field=doi
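
DOI-based uniqueness can also be prototyped outside bibtool. The sketch below is a hypothetical Python helper, not part of merge_bibs.sh: it treats DOIs as case-insensitive, strips a leading resolver URL, and keeps entries that have no DOI. Entries are assumed to be dicts with an optional `doi` key; the sample values are invented.

```python
RESOLVER_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and drop a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    for prefix in RESOLVER_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def dedupe_by_doi(entries):
    """Drop later entries whose DOI was already seen; keep entries without a DOI."""
    seen = set()
    kept = []
    for entry in entries:
        doi = normalize_doi(entry.get("doi", ""))
        if doi and doi in seen:
            continue
        if doi:
            seen.add(doi)
        kept.append(entry)
    return kept


refs = [
    {"ID": "a", "doi": "10.1000/ABC"},
    {"ID": "b", "doi": "https://doi.org/10.1000/abc"},
    {"ID": "c"},
]
print([e["ID"] for e in dedupe_by_doi(refs)])  # ['a', 'c']
```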

Notes and caveats

  • bibtool deduplication is heuristic for content-based matches
  • Title normalization is not perfect across LaTeX vs Unicode sources
  • Always inspect final output before submission:
less merged.bib

Integration points

  • Before todocx.sh
  • After DOI enrichment (getdoi.py, pydoi.py)
  • Before journal submission formatting
