SMSD Pro

Substructure & MCS Search for Chemical Graphs

SMSD Pro is an open-source toolkit for exact substructure search and maximum common substructure (MCS) finding in chemical graphs. It is available for Java, C++ (header-only), and Python, with GPU acceleration (CUDA and Apple Metal). It builds on established algorithms from the graph-matching and ring-perception literature (VF2++, McSplit, McGregor, Horton, Vismara).

Install

Java (Maven)

<dependency>
  <groupId>com.bioinceptionlabs</groupId>
  <artifactId>smsd</artifactId>
  <version>6.7.0</version>
</dependency>

Java (Download JAR)

curl -LO https://github.com/asad/SMSD/releases/download/v6.7.0/smsd-6.7.0-jar-with-dependencies.jar

java -jar smsd-6.7.0-jar-with-dependencies.jar \
  --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
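
For scripted use, the same call can be wrapped from Python. This is a minimal sketch, assuming the JAR from the download step sits in the working directory and that `--json -` writes the JSON result to standard output. `build_smsd_command` and `run_smsd` are hypothetical helpers, not part of SMSD, and nothing is assumed about the JSON schema.

```python
import json
import subprocess

def build_smsd_command(jar, query, target):
    """Assemble the argument list for the CLI call shown above (hypothetical helper)."""
    return [
        "java", "-jar", jar,
        "--Q", "SMI", "--q", query,
        "--T", "SMI", "--t", target,
        "--json", "-",
    ]

def run_smsd(jar, query, target):
    """Run the CLI and parse stdout as JSON (assumes `--json -` prints to stdout)."""
    cmd = build_smsd_command(jar, query, target)
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

cmd = build_smsd_command("smsd-6.7.0-jar-with-dependencies.jar", "c1ccccc1", "c1ccc(O)cc1")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with SMILES characters such as parentheses and `=`.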

Python (pip)

pip install smsd

import smsd

result = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
mcs    = smsd.mcs("c1ccccc1", "c1ccc2ccccc2c1")

# Tautomer-aware MCS
mcs    = smsd.mcs("CC(=O)C", "CC(O)=C", tautomer_aware=True)

# Prefer rare heteroatoms (S, P, Se) for reaction mapping
mcs    = smsd.mcs("C[S+](C)CCC(N)C(=O)O", "SCCC(N)C(=O)O",
                   prefer_rare_heteroatoms=True)

# Reaction-aware atom mapping
aam    = smsd.map_reaction_aware("CC(=O)O", "CCO")

# Similarity upper bound (fast pre-filter)
sim    = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")

# MCS fingerprint
fp     = smsd.fingerprint("c1ccccc1", kind="mcs")

# Circular fingerprint (ECFP4 equivalent, tautomer-aware)
ecfp4 = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)
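
To make the `radius` and `fp_size` parameters above concrete, here is a toy Morgan-style fingerprint in plain Python. It is a simplified sketch of the general circular-fingerprint idea, not SMSD's implementation: it ignores bond orders, aromaticity, tautomers, and duplicate-environment removal, and `toy_circular_fingerprint` is a hypothetical name.

```python
import zlib

def toy_circular_fingerprint(atoms, bonds, radius=2, fp_size=2048):
    """Return the set of on-bits for a toy Morgan-style fingerprint.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def stable_hash(text):
        # crc32 is deterministic across runs, unlike Python's built-in str hash
        return zlib.crc32(text.encode("utf-8"))

    # Iteration 0: each atom's identifier depends on its element and degree
    ids = [stable_hash(f"{atoms[i]}:{len(neighbours[i])}") for i in range(len(atoms))]
    features = set(ids)

    # Each further iteration folds in the sorted identifiers of the neighbours
    for _ in range(radius):
        ids = [
            stable_hash(
                f"{ids[i]}|" + ",".join(str(x) for x in sorted(ids[j] for j in neighbours[i]))
            )
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the identifiers into a fixed-length bit vector
    return {f % fp_size for f in features}

ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = toy_circular_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
print(len(ethanol), len(propanol))
```

With `radius=2` each atom's environment reaches two bonds outward, which is why this setting is commonly called ECFP4 (diameter 4).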

Java API

import com.bioinception.smsd.core.*;

SMSD smsd = new SMSD(mol1, mol2, new ChemOptions());
boolean isSub = smsd.isSubstructure();
var mcs = smsd.findMCS();

// Reaction-aware with bond-change scoring
SearchEngine.McsOptions opts = new SearchEngine.McsOptions();
opts.reactionAware = true;
opts.bondChangeAware = true;  // penalise implausible bond transformations
var rxnMcs = SearchEngine.reactionAwareMCS(g1, g2, new ChemOptions(), opts);

// CIP stereo assignment (Rules 1-5, including pseudoasymmetric r/s)
Map<Integer, Character> stereo = CipAssigner.assignRS(g);
Map<Long, Character> ez = CipAssigner.assignEZ(g);

// Batch MCS with non-overlap constraints
var mappings = SearchEngine.batchMcsConstrained(queries, targets, new ChemOptions(), 10_000);

Python — Advanced Features

import smsd

# --- Reaction-Aware MCS ---
# Prefer heteroatom-containing mappings for reaction center identification
mapping = smsd.map_reaction_aware(
    "C[S+](CCC(N)C(=O)O)CC1OC(n2cnc3c(N)ncnc32)C(O)C1O",  # SAM
    "S(CCC(N)C(=O)O)CC1OC(n2cnc3c(N)ncnc32)C(O)C1O"       # SAH (thioether, not an ester)
)

# --- Structured MCS Result ---
result = smsd.mcs_result("c1ccccc1", "c1ccc(O)cc1")
print(result.size)          # 6
print(result.tanimoto)      # 0.857
print(result.mcs_smiles)    # "c1ccccc1"
print(result.mapping)       # {0: 0, 1: 1, ...}

# --- Works with any input type ---
# SMILES strings
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1")

# MolGraph objects (pre-parsed, fastest for batch)
g1 = smsd.parse_smiles("c1ccccc1")
g2 = smsd.parse_smiles("c1ccc(O)cc1")
mcs = smsd.mcs(g1, g2)

# Native Mol objects (auto-detected, indices returned in native ordering)
# from rdkit import Chem
# mcs = smsd.mcs(Chem.MolFromSmiles("c1ccccc1"), Chem.MolFromSmiles("c1ccc(O)cc1"))

# --- Fingerprints ---
ecfp4  = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)
fcfp4  = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048, mode="fcfp")
counts = smsd.ecfp_counts("c1ccccc1", radius=2, fp_size=2048)
torsion = smsd.topological_torsion("c1ccccc1", fp_size=2048)
tan    = smsd.tanimoto(ecfp4, ecfp4)

# --- 2D Layout ---
g = smsd.parse_smiles("c1ccc2c(c1)cc1ccccc1c2")  # anthracene (linearly fused)
coords = smsd.force_directed_layout(g, max_iter=500, target_bond_length=1.5)
coords = smsd.stress_majorisation(g, max_iter=300)
crossings = smsd.reduce_crossings(g, coords, max_iter=2000)
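
The `tanimoto` value of 0.857 printed in the structured-result example above is consistent with the usual MCS-based Tanimoto over heavy-atom counts, size / (n_query + n_target - size). A minimal sketch under that assumption (the formula SMSD uses internally is not stated here, and `mcs_tanimoto` is a hypothetical helper):

```python
def mcs_tanimoto(mcs_size, n_query, n_target):
    """MCS-based Tanimoto: shared atoms over the union of both atom sets."""
    union = n_query + n_target - mcs_size
    return mcs_size / union if union else 0.0

# Benzene (6 heavy atoms) vs phenol (7 heavy atoms), MCS of 6 atoms
print(round(mcs_tanimoto(6, 6, 7), 3))  # 0.857
```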

Python — MCS Variants & Batch Operations

import smsd

# --- All MCS variants ---
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1")                     # Connected MCS (default)
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", connected_only=False) # Disconnected MCS
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", induced=True)         # Induced MCS
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", maximize_bonds=True)  # Edge MCS (MCES)

# Find top-N distinct MCS solutions
all_mcs = smsd.find_all_mcs("c1ccccc1", "c1ccc(O)cc1", max_results=5)

# SMARTS-based MCS
mcs = smsd.find_mcs_smarts("[#6]~[#7]", "c1ccc(N)cc1")

# Scaffold MCS (Murcko framework)
scaffold = smsd.find_scaffold_mcs("CC(=O)Oc1ccccc1C(=O)O", "Oc1ccccc1C(=O)O")

# R-group decomposition
rgroups = smsd.decompose_rgroups("c1ccc(O)cc1", "c1ccc(N)cc1")

# --- Substructure Search ---
hit = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
all_matches = smsd.find_all_substructures("c1ccccc1", "c1ccc(O)cc1", max_matches=10)

# SMARTS pattern matching
matches = smsd.smarts_search("[OH]", "c1ccc(O)cc1")

# --- Similarity & Screening ---
sim = smsd.tanimoto(
    smsd.circular_fingerprint("CCO", radius=2),
    smsd.circular_fingerprint("CCCO", radius=2)
)
dice = smsd.dice_similarity(
    smsd.ecfp_counts("CCO", radius=2),
    smsd.ecfp_counts("CCCO", radius=2)
)

# --- Chemistry Options ---
# Tautomer-aware with solvent and pH
mcs = smsd.mcs("CC(=O)C", "CC(O)=C",
               tautomer_aware=True, solvent="DMSO", pH=5.0)

# Loose bond matching (FMCS-style)
mcs = smsd.mcs("c1ccccc1", "C1CCCCC1", bond_order_mode="loose")

# --- Canonical SMILES ---
smi = smsd.canonical_smiles("OC(=O)c1ccccc1")   # deterministic canonical form
mcs_smi = smsd.mcs_to_smiles(g1, mapping)        # extract MCS as SMILES

# --- CIP Stereo Assignment ---
g = smsd.parse_smiles("N[C@@H](C)C(=O)O")  # L-alanine
stereo = smsd.assign_rs(g)                   # {1: 'S'}
ez = smsd.assign_ez(smsd.parse_smiles("C/C=C/C"))  # E-2-butene

# --- MolGraph I/O ---
g = smsd.parse_smiles("c1ccccc1")
g = smsd.parse_smarts("[#6]~[#7]")
g = smsd.read_molfile("molecule.mol")
smsd.export_sdf([g1, g2], "output.sdf")
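
For reference, the two metrics used in the Similarity & Screening calls above can be written out in a few lines. This is a sketch of the standard definitions on plain Python containers, not SMSD's code. Fingerprints are modelled as a set of on-bits (binary) or a feature-to-count mapping (count-based), which is an assumption about representation only.

```python
from collections import Counter

def tanimoto_bits(a, b):
    """Binary Tanimoto: shared on-bits over the union of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # empty vs empty treated as 0 here

def dice_counts(a, b):
    """Count-based Dice: twice the shared counts over the total counts."""
    a, b = Counter(a), Counter(b)
    shared = sum((a & b).values())  # Counter & keeps the minimum count per feature
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0

print(tanimoto_bits({1, 5, 9}, {1, 5, 20}))                        # 0.5
print(round(dice_counts({1: 2, 5: 1}, {1: 1, 5: 1, 7: 1}), 3))     # 0.667
```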

Export & Depiction

import smsd

# Depict MCS with highlighted atoms (works in Jupyter)
img = smsd.depict_mcs("c1ccccc1", "c1ccc(O)cc1")
img.save("mcs.png")

# Depict substructure match
img = smsd.depict_substructure("c1ccccc1", "c1ccc(O)cc1")

# Generate SVG
svg = smsd.to_svg("c1ccccc1")

# Export to SDF file
mols = [smsd.parse_smiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
smsd.export_sdf(mols, "output.sdf")

C++ (Header-Only)

git clone https://github.com/asad/SMSD.git
# Add SMSD/cpp/include to your include path — no other dependencies needed

#include "smsd/smsd.hpp"

auto mol1 = smsd::parseSMILES("c1ccccc1");
auto mol2 = smsd::parseSMILES("c1ccc(O)cc1");

bool isSub = smsd::isSubstructure(mol1, mol2, smsd::ChemOptions{});
auto mcs   = smsd::findMCS(mol1, mol2, smsd::ChemOptions{}, smsd::McsOptions{});

// Bond-change-aware MCS for reaction mapping
auto opts = smsd::McsOptions{};
opts.reactionAware = true;
opts.bondChangeAware = true;
auto rxnMcs = smsd::reactionAwareMCS(mol1, mol2, smsd::ChemOptions{}, opts);

// Batch MCS with non-overlap constraints (multi-fragment reactions)
auto mappings = smsd::batchMcsConstrained(queries, targets, smsd::ChemOptions{});

Build from Source

git clone https://github.com/asad/SMSD.git
cd SMSD

# Java
mvn -U clean package

# C++
mkdir cpp/build && cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Python (run from the repository root, not from cpp/build)
cd python && pip install -e .

Docker

docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Benchmarks

MCS Performance (Python)

Same machine, same Python process, best of 5 runs. Full data: benchmarks/results_python.tsv

Pair	Category	SMSD (ms)	MCS Size
Cubane (self)	Cage	0.003	8
Coronene (self)	PAH	0.006	24
NAD / NADH	Cofactor	0.012	44
Caffeine / Theophylline	Drug pair	0.016	13
Morphine / Codeine	Alkaloid	0.049	20
Ibuprofen / Naproxen	NSAID	0.069	15
ATP / ADP	Nucleotide	0.085	27
PEG-12 / PEG-16	Polymer	1.6	40
Paclitaxel / Docetaxel	Taxane	2,405	56
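
A "best of 5 runs" figure like those in the table can be taken with a small timing loop. The sketch below is illustrative only: the actual harness lives in benchmarks/benchmark_python.py and is not reproduced here, and `best_of` and `stand_in_workload` are hypothetical names.

```python
import time

def best_of(func, runs=5):
    """Return the fastest wall-clock time in milliseconds over `runs` calls."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        func()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best

def stand_in_workload():
    # Placeholder for an MCS call on one of the pairs in the table above
    return sum(i * i for i in range(10_000))

print(f"{best_of(stand_in_workload):.3f} ms")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and warm-up effects on very short timings.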

Substructure Performance (Java)

28/28 pairs correct. Cached speedup: 2x-16x faster across all pairs.

Run python benchmarks/benchmark_python.py to reproduce.

Algorithms

MCS Pipeline (11-level funnel)

Level	Algorithm	Based on
L0	Label-frequency upper bound	Degree-aware coverage-driven termination
L0.25	Chain fast-path	O(n*m) DP for linear polymers (PEG, lipids)
L0.5	Tree fast-path	Kilpelainen-Mannila DP for branched polymers (dendrimers, glycogen)
L0.75	Greedy probe	O(N) fast path for near-identical molecules
L1	Substructure containment	VF2++ check if smaller molecule is subgraph
L1.25	Augmenting path extension	Forced-extension bond growth from substructure seed
L1.5	Seed-and-extend	Bond-growth from rare-label seeds
L2	McSplit + RRSplit	Partition refinement (McCreesh 2017) with maximality pruning
L3	Bron-Kerbosch	Product-graph clique with Tomita pivoting + k-core + orbit pruning
L4	McGregor extension	Forced-assignment bond-grow frontier (McGregor 1982)
L5	Extra seeds	Ring skeleton, heavy-atom core, label-degree anchor seeds
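
The idea behind L0 and the funnel's early exit can be sketched in plain Python. This is a simplified illustration, not SMSD's code: labels are element symbols only, the degree-aware part of the bound is omitted, and the stages are stand-ins that return fixed sizes. `label_upper_bound` and `run_funnel` are hypothetical names.

```python
from collections import Counter

def label_upper_bound(labels_a, labels_b):
    """Upper bound on MCS size: atoms can only map onto atoms with the same label."""
    a, b = Counter(labels_a), Counter(labels_b)
    return sum((a & b).values())

def run_funnel(stages, upper_bound):
    """Try stages in order, keep the best size, stop once the bound is reached."""
    best = 0
    for name, stage in stages:
        best = max(best, stage())
        if best >= upper_bound:
            return best, name  # nothing larger is possible, skip remaining levels
    return best, None

benzene = ["C"] * 6
phenol = ["C"] * 6 + ["O"]
bound = label_upper_bound(benzene, phenol)

# Stand-in stages returning fixed sizes; real levels would run actual searches
stages = [("greedy probe", lambda: 4),
          ("substructure containment", lambda: 6),
          ("McSplit", lambda: 6)]
print(bound, run_funnel(stages, bound))  # 6 (6, 'substructure containment')
```

Because the cheap levels often reach the bound, the expensive exact levels further down the table only run for hard pairs.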

MCS Variants

Variant	Flag
MCIS (induced)	`induced=true`
MCCS (connected)	default
MCES (edge subgraph)	`maximizeBonds=true`
dMCS (disconnected)	`disconnectedMCS=true`
N-MCS (multi-molecule)	`findNMCS()`
Weighted MCS	`atomWeights`
Scaffold MCS	`findScaffoldMCS()`
Tautomer-aware MCS	`ChemOptions.tautomerProfile()`

Substructure Search (VF2++)

VF2++ (Juttner & Madarasi 2018) with FASTiso/VF3-Light matching order, 3-level NLF pruning, bit-parallel candidate domains, and GPU-accelerated domain initialization (CUDA + Metal).
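
Neighbour-label-frequency (NLF) pruning can be illustrated with a one-level filter: a target atom stays in a query atom's candidate domain only if it has the same label and at least as many neighbours of each label as the query atom requires. The sketch below shows that single level on element labels. It is an assumption-laden illustration, not the 3-level, bit-parallel version described above, and the function names are hypothetical.

```python
from collections import Counter

def neighbour_label_counts(atoms, bonds):
    """For each atom, count the labels of its direct neighbours."""
    counts = [Counter() for _ in atoms]
    for i, j in bonds:
        counts[i][atoms[j]] += 1
        counts[j][atoms[i]] += 1
    return counts

def nlf_candidates(q_atoms, q_bonds, t_atoms, t_bonds):
    """Candidate target atoms per query atom after label and 1-level NLF filtering."""
    q_nlf = neighbour_label_counts(q_atoms, q_bonds)
    t_nlf = neighbour_label_counts(t_atoms, t_bonds)
    domains = {}
    for q, q_label in enumerate(q_atoms):
        domains[q] = [
            t for t, t_label in enumerate(t_atoms)
            if t_label == q_label
            # every neighbour label the query atom needs must be available on t
            and all(t_nlf[t][lab] >= n for lab, n in q_nlf[q].items())
        ]
    return domains

# Query C-O against target C-C-O: only the carbon bonded to oxygen survives
print(nlf_candidates(["C", "O"], [(0, 1)], ["C", "C", "O"], [(0, 1), (1, 2)]))
# {0: [1], 1: [2]}
```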

Ring Perception

Horton's candidate generation + 2-phase GF(2) elimination (Vismara 1997) for relevant cycles, orbit-based grouping for Unique Ring Families (URFs).

Output	Description
SSSR / MCB	Smallest Set of Smallest Rings (minimum cycle basis)
RCB	Relevant Cycle Basis
URF	Unique Ring Families (automorphism orbit grouping)
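
The number of rings in any SSSR or minimum cycle basis is fixed by the graph alone: bonds - atoms + connected components. A short sketch that computes this count with a union-find (an illustration of the counting rule only; it does not perceive the rings themselves, and `cycle_basis_size` is a hypothetical name):

```python
def cycle_basis_size(n_atoms, bonds):
    """Number of rings in a minimum cycle basis: bonds - atoms + components."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n_atoms
    for i, j in bonds:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    return len(bonds) - n_atoms + components

benzene = [(i, (i + 1) % 6) for i in range(6)]
# Cubane: two 4-membered faces (atoms 0-3 and 4-7) joined by four vertical bonds
cubane = ([(i, (i + 1) % 4) for i in range(4)]
          + [(4 + i, 4 + (i + 1) % 4) for i in range(4)]
          + [(i, i + 4) for i in range(4)])
print(cycle_basis_size(6, benzene), cycle_basis_size(8, cubane))  # 1 5
```

Cubane shows why the other outputs exist: it has six four-membered faces, yet any minimum cycle basis contains only five of them.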

Chemistry Options

Option	Values
Chirality	R/S tetrahedral, E/Z double bond
Isotope	`matchIsotope=true`
Tautomers	15 transforms with pKa-informed weights (Sitzmann 2010)
Solvent	AQUEOUS, DMSO, METHANOL, CHLOROFORM, ACETONITRILE, DIETHYL_ETHER
Ring fusion	IGNORE / PERMISSIVE / STRICT
Bond order	STRICT / LOOSE / ANY
Aromaticity	STRICT / FLEXIBLE
Lenient SMILES	`ParseOptions{.lenient=true}` (C++) / `ChemOptions.lenientSmiles` (Java)

Preset profiles: ChemOptions() (default), .tautomerProfile(), .fmcsProfile()

Solvent-aware tautomers (Tier 2 pKa): opts.withSolvent(Solvent.DMSO) adjusts tautomer equilibrium weights for non-aqueous environments.

Platform & GPU Support

Platform	CPU	GPU
macOS (Apple Silicon)	OpenMP	Metal (zero-copy unified memory)
Linux	OpenMP	CUDA
Windows	OpenMP	CUDA
Any (no GPU)	OpenMP	Automatic CPU fallback

GPU acceleration covers RASCAL batch screening and domain initialization. Recursive backtracking (VF2++, BK, McSplit) runs on CPU. Dispatch: CUDA -> Metal -> OpenMP -> sequential.
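
The dispatch order above amounts to picking the first available backend from a fixed preference list. A minimal sketch of that fallback logic, with hypothetical backend names and a caller-supplied availability set (how SMSD actually probes for CUDA, Metal, or OpenMP is not shown here):

```python
def pick_backend(available, order=("cuda", "metal", "openmp", "sequential")):
    """Return the first backend in `order` that is present in `available`."""
    for backend in order:
        if backend in available:
            return backend
    return "sequential"  # always possible as a last resort

print(pick_backend({"metal", "openmp"}))  # metal
print(pick_backend({"openmp"}))           # openmp
print(pick_backend(set()))                # sequential
```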

Performance Caching

SMSD employs multi-level caching to eliminate redundant computation in batch and reaction workloads:

Cache	Target	Benefit
MolGraph identity cache	CDK molecule conversion	Same molecule reused across 6-18 calls per reaction pair
Domain space cache	VF2++ atom compatibility matrix	Avoids O(Nq*Nt) rebuild on repeated queries
ECFP/FCFP fingerprint cache	Default-parameter fingerprints	337x speedup on repeated fingerprint calls
Pharmacophore features cache	FCFP atom invariants	Eliminates O(n*degree^2) per FCFP call
C++ GraphBuilder compat matrix	Seed-extend/McSplit/BK stages	Pre-computed once, shared across algorithms

Call SearchEngine.clearMolGraphCache() (Java) or reuse MolGraph instances (C++/Python) between batches.
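
The effect of a default-parameter fingerprint cache can be imitated with `functools.lru_cache`. The sketch below uses a stand-in computation and a hypothetical function name. It shows the memoisation pattern only, not SMSD's cache or its speedup figures.

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_fingerprint(smiles, radius=2, fp_size=2048):
    """Stand-in for an expensive fingerprint call, memoised on its arguments."""
    # Placeholder computation; a real call would parse the SMILES and hash environments
    return frozenset((ord(ch) * 31 + radius) % fp_size for ch in smiles)

for smi in ["CCO", "c1ccccc1", "CCO", "CCO"]:
    cached_fingerprint(smi)

info = cached_fingerprint.cache_info()
print(info.hits, info.misses)  # 2 2

# Analogous to clearing the cache between batches
cached_fingerprint.cache_clear()
```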

Additional Tools

Tool	Description
CIP R/S/E/Z assignment	Full digraph-based stereo descriptors (IUPAC 2013 Rules 1-5) including like/unlike pairing and pseudoasymmetric r/s
Circular fingerprint (ECFP/FCFP)	Tautomer-aware Morgan/ECFP with configurable radius (-1 = whole molecule)
Count-based ECFP/FCFP	`ecfpCounts()` / `fcfpCounts()` — superior to binary for ML
Topological Torsion fingerprint	4-atom path with atom typing (SOTA on peptide benchmarks)
Path fingerprint	Graph-aware, tautomer-invariant path enumeration
MCS fingerprint	MCS-aware, auto-sized
Similarity metrics	Tanimoto, Dice, Cosine, Soergel (binary + count-vector)
Fingerprint formats	`toBitSet()`, `toHex()`, `toBinaryString()`, `fromBitSet()`, `fromHex()`
MCS SMILES extraction	`findMcsSmiles()` — extract MCS as canonical SMILES
findAllMCS	Top-N MCS enumeration with canonical SMILES dedup
SMARTS-based MCS	`findMcsSmarts()` — largest substructure matching a SMARTS pattern
R-group decomposition	`decomposeRGroups()`
MatchResult	Structured result: size, mapping, tanimoto, query/target atom counts
RASCAL screening	O(V+E) similarity upper bound
Canonical SMILES / SMARTS	deterministic, toolkit-independent (including `X` total connectivity)
Reaction atom mapping	`mapReaction()`
2D depiction	SVG rendering with atom highlighting
Lenient SMILES parser	Best-effort recovery from malformed SMILES
N-MCS	Multi-molecule MCS with provenance tracking
Tautomer validation	`validateTautomerConsistency()` — proton conservation check
30 tautomer transforms	pKa-informed weights, 6 solvents, pH-sensitive, ring-chain tautomerism
Force-directed layout	`forceDirectedLayout()` for bond-crossing minimisation
SMACOF stress majorisation	`stressMajorisation()` for optimal 2D embedding
Scaffold templates	`matchTemplate()` for 10 pre-computed common scaffolds
Reaction-aware MCS	`reactionAwareMCS()` post-filter for reaction mapping
Bond-change-aware MCS	`BondChangeScorer` re-ranks candidates by bond transformation plausibility (C-C breaks=3.0, heteroatom=0.5)
Batch constrained MCS	`batchMcsConstrained()` multi-pair MCS with non-overlap atom exclusion for multi-fragment reactions
Two-phase crossing reduction	`reduceCrossings()` Phase 1: system-level flipping, Phase 2: individual ring flipping with fusion-atom pivots
computeSSSR / layoutSSSR	Clean SSSR APIs: minimum cycle basis and layout-ordered ring perception
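
The bond-change re-ranking listed in the table can be pictured as a cost function over the bonds a candidate mapping would break or form. The sketch below uses the two weights quoted above (3.0 for C-C, 0.5 for heteroatom bonds). Everything else is an assumption: the candidate representation, the rule that any bond involving a non-carbon atom counts as a heteroatom bond, and the names `bond_change_cost` and `rerank`.

```python
def bond_change_cost(changed_bonds, cc_cost=3.0, hetero_cost=0.5):
    """Sum a penalty per changed bond; C-C changes cost more than heteroatom ones."""
    total = 0.0
    for a, b in changed_bonds:
        total += cc_cost if a == "C" and b == "C" else hetero_cost
    return total

def rerank(candidates):
    """Order candidate mappings by ascending bond-change cost."""
    return sorted(candidates, key=lambda c: bond_change_cost(c["changed_bonds"]))

candidates = [
    {"name": "breaks C-C", "changed_bonds": [("C", "C")]},
    {"name": "breaks C-O and C-S", "changed_bonds": [("C", "O"), ("C", "S")]},
]
print([c["name"] for c in rerank(candidates)])
# ['breaks C-O and C-S', 'breaks C-C']
```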

File Formats

Format	Read	Write
SMILES	Java, C++	Java, C++
SMARTS	Java, C++	C++
MOL V2000	Java, C++	C++
SDF	Java, C++	—
Mol2, PDB, CML	Java	—

Release Downloads

Every release includes all platforms:

Download	Description
`SMSD.Pro-6.7.0.dmg`	macOS installer (Apple Silicon) — drag to Applications
`SMSD.Pro-6.7.0.msi`	Windows installer — next, next, finish
`smsd-pro_6.7.0_amd64.deb`	Linux installer — `sudo dpkg -i`
`smsd-6.7.0.jar`	Pure library JAR (Maven/Gradle dependency)
`smsd-6.7.0-jar-with-dependencies.jar`	Standalone CLI (just `java -jar`)
`smsd-cpp-6.7.0-headers.tar.gz`	C++ header-only library (unpack, `#include "smsd/smsd.hpp"`)
`pip install smsd`	Python package (PyPI)

# Native installer — download .dmg / .msi / .deb, double-click, done

# CLI
java -jar smsd-6.7.0-jar-with-dependencies.jar --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# Docker CLI
docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# Python
pip install smsd

Tests

1,181 Java tests (7 consolidated suites) — heterocycles, reactions, drug pairs, tautomers, stereochemistry, ring perception, URF families, hydrogen handling, adversarial edge cases, fast-path validation, solvent corrections
170 C++ tests (3 suites) — 63 core + 91 parser (including SMARTS X primitive) + 16 batch/GPU
1,003 diverse molecules — all parse correctly in C++ SMILES parser
AddressSanitizer — zero memory errors
Python tests — full API coverage including hydrogen handling and charged species

Documentation

Document	Description
WHITEPAPER	Algorithms & design (11-level MCS, VF2++, ring perception)
HOWTO-INSTALL	Build from source guide
NOTICE	Attribution, trademark, and novel algorithm terms

Citation

If you use SMSD Pro in your research, please cite:

Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics, 1:12, 2009. DOI: 10.1186/1758-2946-1-12

GitHub renders a "Cite this repository" button from CITATION.cff.

Author

Syed Asad Rahman — BioInception PVT LTD

License

Apache License 2.0 — see LICENSE and NOTICE
