Originally developed for the Beilstein Diamond Open Access Journals.
BChemXtract is an open-source, pure-Java parser for ChemDraw files
(both binary .cdx and XML .cdxml) that extracts and validates
chemical structures and reactions, enriching every structure with
InChI / SMILES descriptors (and RInChI / reaction SMILES for
reactions).
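Since both flavours are accepted, it can be handy to know which one a file is before handing it to a parser. The sketch below is not part of BChemXtract; the class and method names are hypothetical. It relies on the `VjCD0100` signature that opens binary CDX files and treats anything that looks like XML mentioning `CDXML` in its first bytes as CDXML, which is a heuristic rather than a validation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Illustrative helper (hypothetical, not a BChemXtract class): guesses the ChemDraw flavour. */
class ChemDrawFormatSniffer {

  enum Format { CDX, CDXML, UNKNOWN }

  // Binary CDX files begin with the ASCII signature "VjCD0100".
  private static final byte[] CDX_MAGIC = "VjCD0100".getBytes(StandardCharsets.US_ASCII);

  /** Classifies a file from its leading bytes; pure function, so it is easy to test. */
  static Format detect(byte[] head) {
    if (head.length >= CDX_MAGIC.length
        && Arrays.equals(Arrays.copyOf(head, CDX_MAGIC.length), CDX_MAGIC)) {
      return Format.CDX;
    }
    // CDXML is XML text: skip an optional byte-order mark and leading whitespace,
    // then look for markup that names CDXML (DOCTYPE or root element).
    String text = new String(head, StandardCharsets.UTF_8);
    if (text.startsWith("\uFEFF")) {
      text = text.substring(1);
    }
    text = text.stripLeading();
    if (text.startsWith("<") && text.contains("CDXML")) {
      return Format.CDXML;
    }
    return Format.UNKNOWN;
  }

  /** Reads only the first 512 bytes of the file and classifies them. */
  static Format detect(Path file) throws IOException {
    try (InputStream in = Files.newInputStream(file)) {
      return detect(in.readNBytes(512));
    }
  }

  public static void main(String[] args) throws IOException {
    for (String arg : args) {
      System.out.println(arg + " -> " + detect(Path.of(arg)));
    }
  }
}
```

Sniffing the content rather than trusting the file extension helps with corpora where files were renamed or exported inconsistently.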
If you have a corpus of ChemDraw documents — published manuscripts, internal SOPs, supporting information PDFs — BChemXtract turns them into machine-readable, FAIR-aligned chemistry you can index, search, and cite.
Maturity: structure extraction is mature and battle-tested in Beilstein's Diamond Open Access publishing pipeline. Reaction extraction is experimental and under active development.
- 🧩 Pure-Java — no native dependencies on the parsing path; runs anywhere a JVM does
- 📦 Both formats — binary .cdx and XML .cdxml
- ⚛️ Structures → atoms, bonds, stereo, charges, isotopes, rings
- 🔁 Reactions → reactants, products, agents, RInChI (experimental)
- 🧪 Descriptors out of the box — InChI, InChIKey, canonical SMILES, MDL V3000 mol
- 🅰️ Markush support — abbreviations and generic structures
- 🛡️ Hard safety limits — refuses to process pathologically large inputs
- 🖼️ Depiction — renders extracted structures as PNG via CDK
- 🧰 Battle-tested CI — lint, multi-JDK tests, JaCoCo coverage, Gitleaks, Trivy, OWASP Dependency-Check, CodeQL
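The safety-limit bullet above does not say how the library enforces its limits. As a complementary caller-side precaution, a size check can be run before any file reaches the parser. The sketch below is only an illustration: `InputSizeGuard` is a hypothetical class, and the 64 MiB cap is an arbitrary value chosen for this example, not a limit taken from BChemXtract.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative caller-side guard (hypothetical); the cap is an assumed value, not BChemXtract's. */
class InputSizeGuard {

  // Arbitrary limit chosen for this sketch only.
  static final long MAX_BYTES = 64L * 1024 * 1024;

  /** Returns true when a file of the given size may be handed to the parser. */
  static boolean withinLimit(long sizeInBytes, long maxBytes) {
    return sizeInBytes >= 0 && sizeInBytes <= maxBytes;
  }

  /** Throws before any parsing starts if the file exceeds the cap; otherwise returns the path. */
  static Path requireReasonableSize(Path file) throws IOException {
    long size = Files.size(file);
    if (!withinLimit(size, MAX_BYTES)) {
      throw new IOException(
          "Refusing to parse " + file + ": " + size + " bytes exceeds " + MAX_BYTES);
    }
    return file;
  }

  public static void main(String[] args) throws IOException {
    for (String arg : args) {
      System.out.println("OK to parse: " + requireReasonableSize(Path.of(arg)));
    }
  }
}
```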

    # 1. Grab the standalone jar from the latest release
    curl -L -o bchemxtract.jar \
      https://repo1.maven.org/maven2/org/beilstein/bchemxtract/1.1/bchemxtract-1.1-jar-with-dependencies.jar
    # 2. Extract every structure in a CDX file as PNG into the current directory
    java -jar bchemxtract.jar example.cdx

Or use the API from your own code:


    import java.io.FileInputStream;
    import org.beilstein.chemxtract.io.IOUtils;
    import org.beilstein.chemxtract.cdx.CDXVisitor;
    // (imports for CDXReader, SubstanceXtractor, and BCXSubstanceInfo omitted)

    // Read a CDX or CDXML file and walk its object tree
    var in = new FileInputStream("example.cdx");
    var doc = CDXReader.readCDX(in);

    // Extract structures, enriched with InChI / SMILES / mol coordinates
    var xtractor = new SubstanceXtractor();
    var structures = xtractor.xtract(doc, new BCXSubstanceInfo());
    structures.forEach(s -> {
      System.out.println(s.getInchi());
      System.out.println(s.getSmiles());
    });

For end-to-end snippets see the HOWTO page.
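For a whole corpus rather than a single file, the first step is usually to collect every ChemDraw document under a directory. The following sketch uses only the JDK; `ChemDrawCorpus` and its methods are hypothetical names, and the hand-off to the reader and extractor is left as a comment because it depends on the API snippet above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

/** Illustrative helper (hypothetical): finds .cdx and .cdxml files below a root directory. */
class ChemDrawCorpus {

  /** Case-insensitive extension check for the two ChemDraw formats. */
  static boolean isChemDrawFile(String fileName) {
    String lower = fileName.toLowerCase(Locale.ROOT);
    return lower.endsWith(".cdx") || lower.endsWith(".cdxml");
  }

  /** Walks the tree recursively and returns matching regular files in sorted order. */
  static List<Path> collect(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> isChemDrawFile(p.getFileName().toString()))
          .sorted()
          .toList();
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Path.of(args.length > 0 ? args[0] : ".");
    for (Path file : collect(root)) {
      // Each path would be handed to the reader and extractor from the API snippet.
      System.out.println(file);
    }
  }
}
```

Sorting the result keeps runs reproducible, which is convenient when extraction output is diffed between releases.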
Maven:

    <dependency>
      <groupId>org.beilstein</groupId>
      <artifactId>bchemxtract</artifactId>
      <version>1.1</version>
    </dependency>

Gradle:

    implementation("org.beilstein:bchemxtract:1.1")

Standalone jar: download the *-jar-with-dependencies.jar from the Releases page or from Maven Central.

| Doc | Purpose |
|---|---|
| doc/CONCEPTS.md | The architecture and parsing model behind BChemXtract |
| doc/HOWTO.md | Code recipes — read a CDX, extract structures, render PNGs |
| CONTRIBUTING.md | Local dev setup, Conventional Commits policy, CI overview |
| RELEASING.md | End-to-end release pipeline (release-please → Maven Central) |
| CHANGELOG.md | Release notes (auto-generated by release-please from v1.2.0+) |

The implementation follows the historical ChemDraw file specification
originally hosted at cambridgesoft.com. The site is gone, but the
spec lives on:
Prerequisites: JDK 17+ and Maven 3.9+.

    git clone https://github.com/Beilstein-Institut/BChemXtract.git
    cd BChemXtract
    mvn -B clean package

Two artifacts land in target/:

| Artifact | Use |
|---|---|
| bchemxtract-X.Y.Z.jar | Slim jar — bundle with your own application |
| bchemxtract-X.Y.Z-jar-with-dependencies.jar | Fat jar — run standalone |

For the full quality sweep (Checkstyle, PMD, SpotBugs, OWASP Dependency-Check):

    mvn -B -Pquality verify

| Workflow | Triggers | Purpose |
|---|---|---|
| lint.yml | PR + push to main | Spotless · Checkstyle · PMD · SpotBugs · actionlint |
| test.yml | PR + push to main | mvn verify on JDK 17 + 21, JaCoCo → Codecov |
| security.yml | PR + push + weekly | Gitleaks · Trivy · OWASP Dependency-Check |
| codeql.yml | PR + push + weekly | GitHub CodeQL Java SAST |
| release-please.yml | push to main | Drafts release PRs from Conventional Commits |
| publish.yml | GitHub Release published | GPG-signs and deploys to Maven Central |
| format.yml | manual | Applies Spotless and commits the diff |

The release flow is fully automated:

    Conventional Commits on main
            ↓
    release-please opens "Release vX.Y.Z" PR
            ↓
    maintainer reviews + merges
            ↓
    GitHub Release tagged → publish.yml fires → signs → ships to Maven Central

See RELEASING.md for the complete pipeline, including
the active GPG signing key fingerprint.
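To make the first step of that flow concrete, the sketch below shows how Conventional Commit titles are commonly mapped to a semantic-version bump: `fix` gives a patch release, `feat` a minor release, and a `!` after the type or scope a major release. This is a simplified illustration with hypothetical class and method names, not release-please's actual implementation; it ignores `BREAKING CHANGE:` footers and any per-repository configuration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical): derives a version bump from Conventional Commit titles. */
class ReleaseBump {

  enum Bump { NONE, PATCH, MINOR, MAJOR }

  // type, optional "(scope)", optional "!", then ": description"
  private static final Pattern TITLE =
      Pattern.compile("^(?<type>[a-z]+)(\\((?<scope>[^)]+)\\))?(?<breaking>!)?: .+");

  /** Classifies a single commit title; titles that do not follow the convention yield NONE. */
  static Bump classify(String commitTitle) {
    Matcher m = TITLE.matcher(commitTitle);
    if (!m.matches()) {
      return Bump.NONE;
    }
    if (m.group("breaking") != null) {
      return Bump.MAJOR;
    }
    return switch (m.group("type")) {
      case "feat" -> Bump.MINOR;
      case "fix" -> Bump.PATCH;
      default -> Bump.NONE;
    };
  }

  /** The release bump is the largest bump requested by any commit since the last release. */
  static Bump highest(List<String> titles) {
    Bump result = Bump.NONE;
    for (String title : titles) {
      Bump bump = classify(title);
      if (bump.compareTo(result) > 0) {
        result = bump;
      }
    }
    return result;
  }

  /** Applies a bump to a dotted version; missing minor or patch parts count as 0. */
  static String next(String version, Bump bump) {
    String[] parts = version.split("\\.");
    int major = Integer.parseInt(parts[0]);
    int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
    int patch = parts.length > 2 ? Integer.parseInt(parts[2]) : 0;
    return switch (bump) {
      case MAJOR -> (major + 1) + ".0.0";
      case MINOR -> major + "." + (minor + 1) + ".0";
      case PATCH -> major + "." + minor + "." + (patch + 1);
      case NONE -> major + "." + minor + "." + patch;
    };
  }

  public static void main(String[] args) {
    List<String> titles = List.of("fix: handle empty fragment", "feat(cdxml): read colour table");
    System.out.println(next("1.1", highest(titles)));
  }
}
```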
We welcome contributions! Please read CONTRIBUTING.md
for local setup, coding standards, and the Conventional Commits
policy that drives our changelog.
In short:
- Fork the repo
- Create a feature branch off main
- Add tests; run mvn -B verify locally
- Open a pull request with a Conventional Commit title
If you use BChemXtract in published work, please cite it using the metadata
in CITATION.cff, or the BibTeX entry below:

    @software{bchemxtract,
      author       = {Bänsch, Felix and Rajan, Kohulan and Nietfeld, Markus},
      title        = {{BChemXtract: a pure-Java extractor of ChemDraw structures}},
      organization = {Beilstein-Institut},
      year         = {2026},
      url          = {https://github.com/Beilstein-Institut/BChemXtract}
    }

Released under the MIT License. Use it freely in research, software, education, or commercial applications.
This project is grounded in the open-science values of the cheminformatics community and aligns with FAIR data principles.
BChemXtract makes extensive use of the Chemistry Development Kit (CDK) (LGPL-2.1). Please cite the CDK papers below if you use BChemXtract:
Willighagen et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform. 2017, 9, 33. doi:10.1186/s13321-017-0220-4
May & Steinbeck. Efficient ring perception for the Chemistry Development Kit. J. Cheminform. 2014, 6, 3. doi:10.1186/1758-2946-6-3
Steinbeck et al. Recent Developments of the Chemistry Development Kit (CDK) — An Open-Source Java Library for Chemo- and Bioinformatics. Curr. Pharm. Des. 2006, 12(17), 2111–2120. doi:10.2174/138161206777585274
Steinbeck et al. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci. 2003, 43(2), 493–500. doi:10.1021/ci025584y
Open a GitHub issue or reach out to the maintainers at open-source@beilstein-institut.de.
Made with ⚗️ at the Beilstein-Institut · GitHub · Diamond Open Access Journals