BChemXtract

A pure-Java extractor of ChemDraw structures and reactions

Originally developed for the Beilstein Diamond Open Access Journals.


License: MIT · Java 17+ · CDK 2.12

✨ What is BChemXtract?

BChemXtract is an open-source, pure-Java parser for ChemDraw files (both binary .cdx and XML .cdxml) that extracts and validates chemical structures and reactions, enriching every structure with InChI / SMILES descriptors (and RInChI / reaction SMILES for reactions).

If you have a corpus of ChemDraw documents — published manuscripts, internal SOPs, supporting information PDFs — BChemXtract turns them into machine-readable, FAIR-aligned chemistry you can index, search, and cite.

Maturity: structure extraction is mature and battle-tested in Beilstein's Diamond Open Access publishing pipeline. Reaction extraction is experimental and under active development.

🔬 Features

  • 🧩 Pure-Java — no native dependencies on the parsing path; runs anywhere a JVM does
  • 📦 Both formats — binary .cdx and XML .cdxml
  • ⚛️ Structures → atoms, bonds, stereo, charges, isotopes, rings
  • 🔁 Reactions → reactants, products, agents, RInChI (experimental)
  • 🧪 Descriptors out of the box — InChI, InChIKey, canonical SMILES, MDL V3000 mol
  • 🅰️ Markush support — abbreviations and generic structures
  • 🛡️ Hard safety limits — refuses to process pathologically large inputs
  • 🖼️ Depiction — renders extracted structures as PNG via CDK
  • 🧰 Battle-tested CI — lint, multi-JDK tests, JaCoCo coverage, Gitleaks, Trivy, OWASP Dependency-Check, CodeQL
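
Supporting both formats means a caller sometimes has to tell them apart before parsing. The following is a minimal sketch of one way to do that, assuming the 8-byte `VjCD0100` signature that binary CDX files begin with. `ChemDrawSniffer` and its `Format` enum are hypothetical names, not part of BChemXtract, and the XML check is a loose heuristic rather than a validation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper, not part of BChemXtract: guesses the ChemDraw flavour from leading bytes. */
final class ChemDrawSniffer {

    enum Format { CDX, CDXML, UNKNOWN }

    // Binary CDX files start with the 8-byte ASCII signature "VjCD0100".
    private static final byte[] CDX_MAGIC = "VjCD0100".getBytes(StandardCharsets.US_ASCII);

    /** Classifies a buffer holding the first bytes of a file. */
    static Format sniff(byte[] head) {
        if (head.length >= CDX_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, CDX_MAGIC.length), CDX_MAGIC)) {
            return Format.CDX;
        }
        // CDXML is XML text: drop an optional byte-order mark and leading whitespace,
        // then look for an XML declaration, a CDXML doctype, or the root element.
        String text = new String(head, StandardCharsets.UTF_8)
                .replace("\uFEFF", "")
                .stripLeading();
        if (text.startsWith("<?xml")
                || text.startsWith("<!DOCTYPE CDXML")
                || text.startsWith("<CDXML")) {
            return Format.CDXML;
        }
        return Format.UNKNOWN;
    }

    /** Reads at most 64 bytes from the file and classifies them. */
    static Format sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(64));
        }
    }
}
```

Checking content rather than the file extension guards against misnamed files; any XML document passes the `<?xml` test, so a real pipeline would still rely on the parser to reject non-CDXML input.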

🚀 Quick Start

# 1. Download the standalone jar (version 1.1) from Maven Central
curl -L -o bchemxtract.jar \
  https://repo1.maven.org/maven2/org/beilstein/bchemxtract/1.1/bchemxtract-1.1-jar-with-dependencies.jar

# 2. Extract every structure in a CDX file as PNG into the current directory
java -jar bchemxtract.jar example.cdx
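
The command above handles one file. For a whole corpus, the list of inputs has to come from somewhere; the sketch below collects ChemDraw files under a directory using only the JDK. `ChemDrawFileFinder` is a hypothetical helper, not part of BChemXtract, and since this page does not say whether the CLI accepts several files or `.cdxml` input, it assumes one invocation per file and leaves that check to you.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

/** Hypothetical helper, not part of BChemXtract: lists ChemDraw files below a directory. */
final class ChemDrawFileFinder {

    /** True for names ending in .cdx or .cdxml, ignoring case. */
    static boolean isChemDrawFile(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".cdx") || name.endsWith(".cdxml");
    }

    /** Walks root recursively and returns the matching regular files in sorted order. */
    static List<Path> find(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(ChemDrawFileFinder::isChemDrawFile)
                        .sorted()
                        .toList();
        }
    }

    public static void main(String[] args) throws IOException {
        // Prints one path per line; each can be passed to the jar in turn.
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        for (Path p : find(root)) {
            System.out.println(p);
        }
    }
}
```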

Or use the API from your own code:

import java.io.FileInputStream;
// Also import CDXReader, SubstanceXtractor and BCXSubstanceInfo from the
// org.beilstein.chemxtract packages (exact package names: see the Javadoc).

// Read a ChemDraw file (a binary .cdx here) and walk its object tree.
// Assumption: readCDX accepts an InputStream; check the Javadoc for the exact signature.
try (var in = new FileInputStream("example.cdx")) {
    var doc = CDXReader.readCDX(in);

    // Extract structures, enriched with InChI / SMILES / mol coordinates
    var xtractor = new SubstanceXtractor();
    var structures = xtractor.xtract(doc, new BCXSubstanceInfo());

    // Print the descriptors of every extracted structure
    structures.forEach(s -> {
        System.out.println(s.getInchi());
        System.out.println(s.getSmiles());
    });
}

For end-to-end snippets see the HOWTO page.
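
Once descriptors are extracted, a common next step is indexing them. The sketch below groups structures by InChIKey so that the same compound drawn in several files collapses to one key. It deliberately avoids the library's own result type: `DescriptorIndex` and its `Entry` record are hypothetical stand-ins that you would fill from the extracted structures.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical post-processing step, not part of BChemXtract. */
final class DescriptorIndex {

    /** Stand-in for an extracted structure; the library's own result type is not used here. */
    record Entry(String inchiKey, String smiles, String sourceFile) {}

    /** Groups entries by InChIKey, preserving the first-seen order of keys and entries. */
    static Map<String, List<Entry>> groupByInchiKey(List<Entry> entries) {
        Map<String, List<Entry>> index = new LinkedHashMap<>();
        for (Entry e : entries) {
            index.computeIfAbsent(e.inchiKey(), k -> new ArrayList<>()).add(e);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
                new Entry("UHOVQNZJYSORNB-UHFFFAOYSA-N", "c1ccccc1", "a.cdx"),  // benzene
                new Entry("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "CCO", "a.cdx"),       // ethanol
                new Entry("UHOVQNZJYSORNB-UHFFFAOYSA-N", "c1ccccc1", "b.cdx")); // benzene again
        // Print each distinct key with the number of drawings that produced it.
        groupByInchiKey(entries).forEach((key, hits) ->
                System.out.println(key + " occurs in " + hits.size() + " drawing(s)"));
    }
}
```

Keying on InChIKey rather than SMILES avoids treating differently written SMILES strings for the same structure as distinct compounds.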

📦 Installation

Maven

<dependency>
  <groupId>org.beilstein</groupId>
  <artifactId>bchemxtract</artifactId>
  <version>1.1</version>
</dependency>

Gradle

implementation("org.beilstein:bchemxtract:1.1")

Standalone fat jar

Download the *-jar-with-dependencies.jar from the Releases page or from Maven Central.

📚 Documentation

| Doc | Purpose |
| --- | --- |
| doc/CONCEPTS.md | The architecture and parsing model behind BChemXtract |
| doc/HOWTO.md | Code recipes: read a CDX, extract structures, render PNGs |
| CONTRIBUTING.md | Local dev setup, Conventional Commits policy, CI overview |
| RELEASING.md | End-to-end release pipeline (release-please → Maven Central) |
| CHANGELOG.md | Release notes (auto-generated by release-please from v1.2.0 onward) |

Reference: ChemDraw file specification

The implementation follows the historical ChemDraw file specification originally hosted at cambridgesoft.com. The site is gone, but the spec lives on:

🛠️ Building from source

Prerequisites: JDK 17+ and Maven 3.9+.

git clone https://github.com/Beilstein-Institut/BChemXtract.git
cd BChemXtract
mvn -B clean package

Two artifacts land in target/:

| Artifact | Use |
| --- | --- |
| bchemxtract-X.Y.Z.jar | Slim jar: bundle with your own application |
| bchemxtract-X.Y.Z-jar-with-dependencies.jar | Fat jar: run standalone |

For the full quality sweep (Checkstyle, PMD, SpotBugs, OWASP Dependency-Check):

mvn -B -Pquality verify

⚙️ Continuous integration & release pipeline

| Workflow | Triggers | Purpose |
| --- | --- | --- |
| lint.yml | PR + push to main | Spotless · Checkstyle · PMD · SpotBugs · actionlint |
| test.yml | PR + push to main | mvn verify on JDK 17 + 21, JaCoCo → Codecov |
| security.yml | PR + push + weekly | Gitleaks · Trivy · OWASP Dependency-Check |
| codeql.yml | PR + push + weekly | GitHub CodeQL Java SAST |
| release-please.yml | push to main | Drafts release PRs from Conventional Commits |
| publish.yml | GitHub Release published | GPG-signs and deploys to Maven Central |
| format.yml | manual | Applies Spotless and commits the diff |

The release flow is automated end to end; the only manual step is the maintainer's review and merge:

Conventional Commits on main
        ↓
release-please opens "Release vX.Y.Z" PR
        ↓
maintainer reviews + merges
        ↓
GitHub Release tagged → publish.yml fires → signs → ships to Maven Central

See RELEASING.md for the complete pipeline, including the active GPG signing key fingerprint.
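
The first step of the flow, turning Conventional Commits into a version number, follows the usual convention: `fix` bumps the patch level, `feat` the minor level, and a `!` after the type or scope the major level. The sketch below illustrates that mapping only. It is not release-please's code, `NextVersion` is a hypothetical name, it expects a three-part `X.Y.Z` version, and it ignores `BREAKING CHANGE` footers and any repository-specific release-please configuration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified illustration of Conventional Commits to SemVer; not release-please's actual code. */
final class NextVersion {

    // type, optional "(scope)", optional "!" marking a breaking change, then ": description"
    private static final Pattern SUBJECT = Pattern.compile("^(\\w+)(\\([^)]*\\))?(!)?: .+");

    /** Returns the next X.Y.Z version, or the current one if no commit warrants a release. */
    static String bump(String current, List<String> commitSubjects) {
        int level = 0; // 0 = none, 1 = patch, 2 = minor, 3 = major
        for (String subject : commitSubjects) {
            Matcher m = SUBJECT.matcher(subject);
            if (!m.matches()) {
                continue; // ignore subjects that are not Conventional Commits
            }
            if (m.group(3) != null) {
                level = 3; // e.g. "feat!:" or "fix(api)!:"
            } else if (m.group(1).equals("feat")) {
                level = Math.max(level, 2);
            } else if (m.group(1).equals("fix")) {
                level = Math.max(level, 1);
            }
        }
        String[] parts = current.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int patch = Integer.parseInt(parts[2]);
        return switch (level) {
            case 3 -> (major + 1) + ".0.0";
            case 2 -> major + "." + (minor + 1) + ".0";
            case 1 -> major + "." + minor + "." + (patch + 1);
            default -> current;
        };
    }
}
```

For example, `NextVersion.bump("1.1.0", List.of("fix: handle empty fragment", "feat(cdxml): parse arrows"))` returns `"1.2.0"`, because the highest-ranking commit type decides the bump.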

🤝 Contributing

We welcome contributions! Please read CONTRIBUTING.md for local setup, coding standards, and the Conventional Commits policy that drives our changelog.

In short:

  1. Fork the repo
  2. Create a feature branch off main
  3. Add tests; run mvn -B verify locally
  4. Open a pull request with a Conventional Commit title

📖 Citing BChemXtract

If you use BChemXtract in published work, please cite it using the metadata in CITATION.cff, or the BibTeX entry below:

@software{bchemxtract,
  author       = {Bänsch, Felix and Rajan, Kohulan and Nietfeld, Markus},
  title        = {{BChemXtract: a pure-Java extractor of ChemDraw structures}},
  organization = {Beilstein-Institut},
  year         = {2026},
  url          = {https://github.com/Beilstein-Institut/BChemXtract}
}

📜 License

Released under the MIT License. Use it freely in research, software, education, or commercial applications.

🙏 Acknowledgments

This project is grounded in the open-science values of the cheminformatics community and aligns with FAIR data principles.

BChemXtract makes extensive use of the Chemistry Development Kit (CDK) (LGPL-2.1). Please cite the CDK papers below if you use BChemXtract:

Willighagen et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform. 2017, 9, 33. doi:10.1186/s13321-017-0220-4

May & Steinbeck. Efficient ring perception for the Chemistry Development Kit. J. Cheminform. 2014, 6, 3. doi:10.1186/1758-2946-6-3

Steinbeck et al. Recent Developments of the Chemistry Development Kit (CDK) — An Open-Source Java Library for Chemo- and Bioinformatics. Curr. Pharm. Des. 2006, 12(17), 2111–2120. doi:10.2174/138161206777585274

Steinbeck et al. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci. 2003, 43(2), 493–500. doi:10.1021/ci025584y

💬 Feedback

Open a GitHub issue or reach out to the maintainers at open-source@beilstein-institut.de.
