Substructure & MCS Search for Chemical Graphs
SMSD Pro is an open-source toolkit for exact substructure search and maximum common substructure (MCS) finding in chemical graphs. It runs on Java, C++ (header-only), and Python, with GPU acceleration (CUDA and Apple Metal). It builds on established algorithms from the graph-isomorphism literature (VF2++, McSplit, McGregor, Horton, Vismara).
Copyright (c) 2018-2026 Syed Asad Rahman — BioInception PVT LTD
<dependency>
<groupId>com.bioinceptionlabs</groupId>
<artifactId>smsd</artifactId>
<version>6.7.0</version>
</dependency>
curl -LO https://github.com/asad/SMSD/releases/download/v6.7.0/smsd-6.7.0-jar-with-dependencies.jar
java -jar smsd-6.7.0-jar-with-dependencies.jar \
--Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
pip install smsd
import smsd
result = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
mcs = smsd.mcs("c1ccccc1", "c1ccc2ccccc2c1")
# Tautomer-aware MCS
mcs = smsd.mcs("CC(=O)C", "CC(O)=C", tautomer_aware=True)
# Prefer rare heteroatoms (S, P, Se) for reaction mapping
mcs = smsd.mcs("C[S+](C)CCC(N)C(=O)O", "SCCC(N)C(=O)O",
prefer_rare_heteroatoms=True)
# Reaction-aware atom mapping
aam = smsd.map_reaction_aware("CC(=O)O", "CCO")
# Similarity upper bound (fast pre-filter)
sim = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")
fp = smsd.fingerprint("c1ccccc1", kind="mcs")
# Circular fingerprint (ECFP4 equivalent, tautomer-aware)
ecfp4 = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)
import com.bioinception.smsd.core.*;
SMSD smsd = new SMSD(mol1, mol2, new ChemOptions());
boolean isSub = smsd.isSubstructure();
var mcs = smsd.findMCS();
// Reaction-aware with bond-change scoring
SearchEngine.McsOptions opts = new SearchEngine.McsOptions();
opts.reactionAware = true;
opts.bondChangeAware = true; // penalise implausible bond transformations
var rxnMcs = SearchEngine.reactionAwareMCS(g1, g2, new ChemOptions(), opts);
// CIP stereo assignment (Rules 1-5, including pseudoasymmetric r/s)
Map<Integer, Character> stereo = CipAssigner.assignRS(g);
Map<Long, Character> ez = CipAssigner.assignEZ(g);
// Batch MCS with non-overlap constraints
var mappings = SearchEngine.batchMcsConstrained(queries, targets, new ChemOptions(), 10_000);
import smsd
# --- Reaction-Aware MCS ---
# Prefer heteroatom-containing mappings for reaction center identification
mapping = smsd.map_reaction_aware(
"C[S+](CCC(N)C(=O)O)CC1OC(n2cnc3c(N)ncnc32)C(O)C1O", # SAM
"S(CCC(N)C(=O)O)CC1OC(n2cnc3c(N)ncnc32)C(O)C1O" # SAH (thioether; the previous string encoded an ester)
)
# --- Structured MCS Result ---
result = smsd.mcs_result("c1ccccc1", "c1ccc(O)cc1")
print(result.size) # 6
print(result.tanimoto) # 0.857
print(result.mcs_smiles) # "c1ccccc1"
print(result.mapping) # {0: 0, 1: 1, ...}
# --- Works with any input type ---
# SMILES strings
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1")
# MolGraph objects (pre-parsed, fastest for batch)
g1 = smsd.parse_smiles("c1ccccc1")
g2 = smsd.parse_smiles("c1ccc(O)cc1")
mcs = smsd.mcs(g1, g2)
# Native Mol objects (auto-detected, indices returned in native ordering)
# from rdkit import Chem
# mcs = smsd.mcs(Chem.MolFromSmiles("c1ccccc1"), Chem.MolFromSmiles("c1ccc(O)cc1"))
# --- Fingerprints ---
ecfp4 = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)
fcfp4 = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048, mode="fcfp")
counts = smsd.ecfp_counts("c1ccccc1", radius=2, fp_size=2048)
torsion = smsd.topological_torsion("c1ccccc1", fp_size=2048)
tan = smsd.tanimoto(ecfp4, ecfp4)
# --- 2D Layout ---
g = smsd.parse_smiles("c1ccc2c(c1)cc1ccccc1c2") # anthracene (linearly fused)
coords = smsd.force_directed_layout(g, max_iter=500, target_bond_length=1.5)
coords = smsd.stress_majorisation(g, max_iter=300)
crossings = smsd.reduce_crossings(g, coords, max_iter=2000)
import smsd
# --- All MCS variants ---
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1") # Connected MCS (default)
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", connected_only=False) # Disconnected MCS
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", induced=True) # Induced MCS
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", maximize_bonds=True) # Edge MCS (MCES)
# Find top-N distinct MCS solutions
all_mcs = smsd.find_all_mcs("c1ccccc1", "c1ccc(O)cc1", max_results=5)
# SMARTS-based MCS
mcs = smsd.find_mcs_smarts("[#6]~[#7]", "c1ccc(N)cc1")
# Scaffold MCS (Murcko framework)
scaffold = smsd.find_scaffold_mcs("CC(=O)Oc1ccccc1C(=O)O", "Oc1ccccc1C(=O)O")
# R-group decomposition
rgroups = smsd.decompose_rgroups("c1ccc(O)cc1", "c1ccc(N)cc1")
# --- Substructure Search ---
hit = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
all_matches = smsd.find_all_substructures("c1ccccc1", "c1ccc(O)cc1", max_matches=10)
# SMARTS pattern matching
matches = smsd.smarts_search("[OH]", "c1ccc(O)cc1")
# --- Similarity & Screening ---
sim = smsd.tanimoto(
smsd.circular_fingerprint("CCO", radius=2),
smsd.circular_fingerprint("CCCO", radius=2)
)
dice = smsd.dice_similarity(
smsd.ecfp_counts("CCO", radius=2),
smsd.ecfp_counts("CCCO", radius=2)
)
# --- Chemistry Options ---
# Tautomer-aware with solvent and pH
mcs = smsd.mcs("CC(=O)C", "CC(O)=C",
tautomer_aware=True, solvent="DMSO", pH=5.0)
# Loose bond matching (FMCS-style)
mcs = smsd.mcs("c1ccccc1", "C1CCCCC1", bond_order_mode="loose")
# --- Canonical SMILES ---
smi = smsd.canonical_smiles("OC(=O)c1ccccc1") # deterministic canonical form
mcs_smi = smsd.mcs_to_smiles(g1, mapping) # extract MCS as SMILES
# --- CIP Stereo Assignment ---
g = smsd.parse_smiles("N[C@@H](C)C(=O)O") # L-alanine
stereo = smsd.assign_rs(g) # {1: 'S'}
ez = smsd.assign_ez(smsd.parse_smiles("C/C=C/C")) # E-2-butene
# --- MolGraph I/O ---
g = smsd.parse_smiles("c1ccccc1")
g = smsd.parse_smarts("[#6]~[#7]")
g = smsd.read_molfile("molecule.mol")
smsd.export_sdf([g1, g2], "output.sdf")
import smsd
# Depict MCS with highlighted atoms (works in Jupyter)
img = smsd.depict_mcs("c1ccccc1", "c1ccc(O)cc1")
img.save("mcs.png")
# Depict substructure match
img = smsd.depict_substructure("c1ccccc1", "c1ccc(O)cc1")
# Generate SVG
svg = smsd.to_svg("c1ccccc1")
# Export to SDF file
mols = [smsd.parse_smiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
smsd.export_sdf(mols, "output.sdf")
git clone https://github.com/asad/SMSD.git
# Add SMSD/cpp/include to your include path; no other dependencies are needed
#include "smsd/smsd.hpp"
auto mol1 = smsd::parseSMILES("c1ccccc1");
auto mol2 = smsd::parseSMILES("c1ccc(O)cc1");
bool isSub = smsd::isSubstructure(mol1, mol2, smsd::ChemOptions{});
auto mcs = smsd::findMCS(mol1, mol2, smsd::ChemOptions{}, smsd::McsOptions{});
// Bond-change-aware MCS for reaction mapping
auto opts = smsd::McsOptions{};
opts.reactionAware = true;
opts.bondChangeAware = true;
auto rxnMcs = smsd::reactionAwareMCS(mol1, mol2, smsd::ChemOptions{}, opts);
// Batch MCS with non-overlap constraints (multi-fragment reactions)
auto mappings = smsd::batchMcsConstrained(queries, targets, smsd::ChemOptions{});
git clone https://github.com/asad/SMSD.git
cd SMSD
# Java
mvn -U clean package
# C++
mkdir cpp/build && cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Python (run from the repository root)
cd python && pip install -e .
docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
Same machine, same Python process, best of 5 runs.
Full data: benchmarks/results_python.tsv
| Pair | Category | SMSD (ms) | MCS Size |
|---|---|---|---|
| Cubane (self) | Cage | 0.003 | 8 |
| Coronene (self) | PAH | 0.006 | 24 |
| NAD / NADH | Cofactor | 0.012 | 44 |
| Caffeine / Theophylline | Drug pair | 0.016 | 13 |
| Morphine / Codeine | Alkaloid | 0.049 | 20 |
| Ibuprofen / Naproxen | NSAID | 0.069 | 15 |
| ATP / ADP | Nucleotide | 0.085 | 27 |
| PEG-12 / PEG-16 | Polymer | 1.6 | 40 |
| Paclitaxel / Docetaxel | Taxane | 2,405 | 56 |
28/28 pairs correct. Cached speedup: 2x-16x faster across all pairs.
Run python benchmarks/benchmark_python.py to reproduce.
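
The "best of 5 runs" protocol can be approximated with the standard library alone. This is a minimal sketch and not the project's benchmark script: `best_of` and `workload` are hypothetical names, and the workload is a stand-in for whichever MCS call is being timed.

```python
import time

def best_of(fn, runs=5):
    """Return the fastest wall-clock time of fn() in milliseconds over `runs` calls."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best

# Stand-in workload (assumption); replace with the call under test.
def workload():
    return sum(i * i for i in range(1000))

elapsed = best_of(workload, runs=5)
print(f"best of 5: {elapsed:.3f} ms")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and warm-up effects, which matters for timings in the microsecond range.
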
| Level | Algorithm | Based on |
|---|---|---|
| L0 | Label-frequency upper bound | Degree-aware coverage-driven termination |
| L0.25 | Chain fast-path | O(n*m) DP for linear polymers (PEG, lipids) |
| L0.5 | Tree fast-path | Kilpelainen-Mannila DP for branched polymers (dendrimers, glycogen) |
| L0.75 | Greedy probe | O(N) fast path for near-identical molecules |
| L1 | Substructure containment | VF2++ check if smaller molecule is subgraph |
| L1.25 | Augmenting path extension | Forced-extension bond growth from substructure seed |
| L1.5 | Seed-and-extend | Bond-growth from rare-label seeds |
| L2 | McSplit + RRSplit | Partition refinement (McCreesh 2017) with maximality pruning |
| L3 | Bron-Kerbosch | Product-graph clique with Tomita pivoting + k-core + orbit pruning |
| L4 | McGregor extension | Forced-assignment bond-grow frontier (McGregor 1982) |
| L5 | Extra seeds | Ring skeleton, heavy-atom core, label-degree anchor seeds |
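
The idea behind the L0 label-frequency bound can be illustrated without the library. The sketch below is an assumption about the general technique, not SMSD's implementation: it ignores the degree-aware refinement named in the table, and `label_frequency_upper_bound` is a hypothetical helper that works on plain element labels.

```python
from collections import Counter

def label_frequency_upper_bound(labels_a, labels_b):
    """Upper bound on the number of atoms any common substructure can contain.

    An atom can only be mapped to an atom with the same label, so each label
    contributes at most min(count in A, count in B) matched atoms.
    """
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Counter returns 0 for labels that are absent from B.
    return sum(min(n, count_b[label]) for label, n in count_a.items())

benzene = ["C"] * 6
phenol = ["C"] * 6 + ["O"]
bound = label_frequency_upper_bound(benzene, phenol)
print(bound)  # 6
```

If a later stage finds a mapping whose size equals this bound, no larger MCS can exist, so a search may stop at that point.
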
| Variant | Flag |
|---|---|
| MCIS (induced) | induced=true |
| MCCS (connected) | default |
| MCES (edge subgraph) | maximizeBonds=true |
| dMCS (disconnected) | disconnectedMCS=true |
| N-MCS (multi-molecule) | findNMCS() |
| Weighted MCS | atomWeights |
| Scaffold MCS | findScaffoldMCS() |
| Tautomer-aware MCS | ChemOptions.tautomerProfile() |
VF2++ (Juttner & Madarasi 2018) with FASTiso/VF3-Light matching order, 3-level NLF pruning, bit-parallel candidate domains, and GPU-accelerated domain initialization (CUDA + Metal).
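
Bit-parallel candidate domains can be pictured as one integer bitmask per query atom. The following is a simplified sketch under stated assumptions: atoms are reduced to hypothetical `(label, degree)` tuples, and only a label and degree filter is applied, which is far weaker than the 3-level NLF pruning described above.

```python
def candidate_domains(query, target):
    """For each query atom, build a bitmask of compatible target atoms.

    `query` and `target` are lists of (label, degree) tuples. Bit j of
    domains[i] is set when target atom j may host query atom i.
    """
    domains = []
    for q_label, q_degree in query:
        mask = 0
        for j, (t_label, t_degree) in enumerate(target):
            # Same element, and the target atom has at least as many neighbours.
            if q_label == t_label and t_degree >= q_degree:
                mask |= 1 << j
        domains.append(mask)
    return domains

# Query: a C-O fragment. Target: ethanol, atoms ordered C, C, O.
query = [("C", 1), ("O", 1)]
target = [("C", 1), ("C", 2), ("O", 1)]
domains = candidate_domains(query, target)
print([bin(d) for d in domains])  # ['0b11', '0b100']

# An empty domain proves that no substructure match exists.
feasible = all(domains)
```

Storing domains as bitmasks lets a matcher intersect candidate sets with a single bitwise AND instead of looping over atoms.
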
Horton's candidate generation + 2-phase GF(2) elimination (Vismara 1997) for relevant cycles, orbit-based grouping for Unique Ring Families (URFs).
| Output | Description |
|---|---|
| SSSR / MCB | Smallest Set of Smallest Rings |
| RCB | Relevant Cycle Basis |
| URF | Unique Ring Families (automorphism orbit grouping) |
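
The GF(2) elimination used to test cycle independence can be sketched with integer bitmasks. This is an illustration of the general technique only: it assumes candidate cycles are already generated and sorted by length, encodes each cycle as a bitmask over edge indices, and does not reproduce the two-phase scheme or the relevant-cycle logic. `gf2_independent_cycles` is a hypothetical name.

```python
def gf2_independent_cycles(cycles):
    """Keep the cycles whose edge vectors are linearly independent over GF(2).

    Each cycle is an int whose bit k is set when edge k belongs to the cycle.
    Passing cycles sorted by length keeps the shortest independent ones.
    """
    basis = {}  # highest set bit -> reduced vector with that pivot
    kept = []
    for cycle in cycles:
        v = cycle
        while v:
            pivot = v.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = v
                kept.append(cycle)
                break
            v ^= basis[pivot]  # XOR is addition over GF(2)
        # If v reached 0, the cycle is a sum of earlier cycles and is dropped.
    return kept

# Naphthalene-like system: two six-rings sharing edge 5, plus the outer envelope.
ring_a = sum(1 << k for k in range(0, 6))    # edges 0..5
ring_b = sum(1 << k for k in range(5, 11))   # edges 5..10
envelope = ring_a ^ ring_b                   # ten-membered outer cycle

candidates = sorted([envelope, ring_a, ring_b], key=lambda c: bin(c).count("1"))
kept = gf2_independent_cycles(candidates)
print(len(kept))  # 2
```

The envelope is rejected because it equals the XOR of the two six-membered rings, so the basis for this system has two cycles.
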
| Option | Values |
|---|---|
| Chirality | R/S tetrahedral, E/Z double bond |
| Isotope | matchIsotope=true |
| Tautomers | 15 transforms with pKa-informed weights (Sitzmann 2010) |
| Solvent | AQUEOUS, DMSO, METHANOL, CHLOROFORM, ACETONITRILE, DIETHYL_ETHER |
| Ring fusion | IGNORE / PERMISSIVE / STRICT |
| Bond order | STRICT / LOOSE / ANY |
| Aromaticity | STRICT / FLEXIBLE |
| Lenient SMILES | ParseOptions{.lenient=true} (C++) / ChemOptions.lenientSmiles (Java) |
Preset profiles: ChemOptions() (default), .tautomerProfile(), .fmcsProfile()
Solvent-aware tautomers (Tier 2 pKa): opts.withSolvent(Solvent.DMSO) adjusts tautomer equilibrium weights for non-aqueous environments.
| Platform | CPU | GPU |
|---|---|---|
| macOS (Apple Silicon) | OpenMP | Metal (zero-copy unified memory) |
| Linux | OpenMP | CUDA |
| Windows | OpenMP | CUDA |
| Any (no GPU) | OpenMP | Automatic CPU fallback |
GPU acceleration covers RASCAL batch screening and domain initialization. Recursive backtracking (VF2++, BK, McSplit) runs on CPU. Dispatch: CUDA -> Metal -> OpenMP -> sequential.
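
The dispatch order above amounts to a first-available fallback chain. The sketch below only illustrates that selection logic; the backend names are hypothetical strings, and how SMSD actually probes for CUDA, Metal, or OpenMP is not shown here.

```python
def pick_backend(available, order=("cuda", "metal", "openmp", "sequential")):
    """Return the first backend in `order` that is present in `available`."""
    for name in order:
        if name in available:
            return name
    # Sequential execution needs no special support, so it is always valid.
    return "sequential"

print(pick_backend({"openmp", "metal"}))  # metal
print(pick_backend({"cuda", "openmp"}))   # cuda
print(pick_backend(set()))                # sequential
```
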
SMSD employs multi-level caching to eliminate redundant computation in batch and reaction workloads:
| Cache | Target | Benefit |
|---|---|---|
| MolGraph identity cache | CDK molecule conversion | Same molecule reused across 6-18 calls per reaction pair |
| Domain space cache | VF2++ atom compatibility matrix | Avoids O(Nq*Nt) rebuild on repeated queries |
| ECFP/FCFP fingerprint cache | Default-parameter fingerprints | 337x speedup on repeated fingerprint calls |
| Pharmacophore features cache | FCFP atom invariants | Eliminates O(n*degree^2) per FCFP call |
| C++ GraphBuilder compat matrix | Seed-extend/McSplit/BK stages | Pre-computed once, shared across algorithms |
Call SearchEngine.clearMolGraphCache() (Java) or reuse MolGraph instances (C++/Python) between batches.
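
The effect of a conversion cache, and of clearing it between batches, can be shown with `functools.lru_cache`. This is a generic memoisation sketch, not SMSD's cache: `to_graph` is a hypothetical stand-in for an expensive molecule-to-graph conversion, keyed here by SMILES string.

```python
from functools import lru_cache

conversions = 0  # counts how often the expensive step really runs

@lru_cache(maxsize=1024)
def to_graph(smiles):
    """Stand-in for an expensive molecule-to-graph conversion."""
    global conversions
    conversions += 1
    return tuple(ch for ch in smiles if ch.isalpha())

for _ in range(6):       # the same molecule requested six times
    to_graph("CCO")
print(conversions)       # 1

to_graph.cache_clear()   # release cached graphs between batches
to_graph("CCO")
print(conversions)       # 2
```

Clearing between batches bounds memory use at the cost of one fresh conversion per molecule in the next batch.
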
| Tool | Description |
|---|---|
| CIP R/S/E/Z assignment | Full digraph-based stereo descriptors (IUPAC 2013 Rules 1-5) including like/unlike pairing and pseudoasymmetric r/s |
| Circular fingerprint (ECFP/FCFP) | Tautomer-aware Morgan/ECFP with configurable radius (-1 = whole molecule) |
| Count-based ECFP/FCFP | ecfpCounts() / fcfpCounts(): count vectors keep substructure multiplicity that binary fingerprints discard, which often helps ML models |
| Topological Torsion fingerprint | 4-atom path with atom typing (SOTA on peptide benchmarks) |
| Path fingerprint | Graph-aware, tautomer-invariant path enumeration |
| MCS fingerprint | MCS-aware, auto-sized |
| Similarity metrics | Tanimoto, Dice, Cosine, Soergel (binary + count-vector) |
| Fingerprint formats | toBitSet(), toHex(), toBinaryString(), fromBitSet(), fromHex() |
| MCS SMILES extraction | findMcsSmiles() — extract MCS as canonical SMILES |
| findAllMCS | Top-N MCS enumeration with canonical SMILES dedup |
| SMARTS-based MCS | findMcsSmarts() — largest substructure matching a SMARTS pattern |
| R-group decomposition | decomposeRGroups() |
| MatchResult | Structured result: size, mapping, tanimoto, query/target atom counts |
| RASCAL screening | O(V+E) similarity upper bound |
| Canonical SMILES / SMARTS | deterministic, toolkit-independent (including X total connectivity) |
| Reaction atom mapping | mapReaction() |
| 2D depiction | SVG rendering with atom highlighting |
| Lenient SMILES parser | Best-effort recovery from malformed SMILES |
| N-MCS | Multi-molecule MCS with provenance tracking |
| Tautomer validation | validateTautomerConsistency() — proton conservation check |
| 30 tautomer transforms | pKa-informed weights, 6 solvents, pH-sensitive, ring-chain tautomerism |
| Force-directed layout | forceDirectedLayout() for bond-crossing minimisation |
| SMACOF stress majorisation | stressMajorisation() for optimal 2D embedding |
| Scaffold templates | matchTemplate() for 10 pre-computed common scaffolds |
| Reaction-aware MCS | reactionAwareMCS() post-filter for reaction mapping |
| Bond-change-aware MCS | BondChangeScorer re-ranks candidates by bond transformation plausibility (C-C breaks=3.0, heteroatom=0.5) |
| Batch constrained MCS | batchMcsConstrained() multi-pair MCS with non-overlap atom exclusion for multi-fragment reactions |
| Two-phase crossing reduction | reduceCrossings() Phase 1: system-level flipping, Phase 2: individual ring flipping with fusion-atom pivots |
| computeSSSR / layoutSSSR | Clean SSSR APIs: minimum cycle basis and layout-ordered ring perception |
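
For reference, the four binary similarity metrics listed in the table have short set-based definitions. The sketch below uses plain Python sets of on-bit indices rather than SMSD fingerprint objects. Returning 1.0 for two empty fingerprints is an assumed convention, not documented library behaviour.

```python
import math

def tanimoto(a, b):
    """|A & B| / |A | B| on sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def dice(a, b):
    """2 * |A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

def cosine(a, b):
    """|A & B| / sqrt(|A| * |B|)."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 1.0

def soergel(a, b):
    """Soergel distance; for binary fingerprints this is 1 - Tanimoto."""
    return 1.0 - tanimoto(a, b)

fp1 = {1, 5, 9, 12}
fp2 = {1, 5, 9, 20, 31}
print(round(tanimoto(fp1, fp2), 3))  # 0.5
print(round(dice(fp1, fp2), 3))      # 0.667
print(round(soergel(fp1, fp2), 3))   # 0.5
```

Applied to atom sets instead of bits, the Tanimoto formula with 6 shared atoms between a 6-atom and a 7-atom molecule gives 6/7 ≈ 0.857, which agrees with the benzene/phenol value shown earlier; whether the library computes it this way is an assumption.
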
| Format | Read | Write |
|---|---|---|
| SMILES | Java, C++ | Java, C++ |
| SMARTS | Java, C++ | C++ |
| MOL V2000 | Java, C++ | C++ |
| SDF | Java, C++ | — |
| Mol2, PDB, CML | Java | — |
Every release includes all platforms:
| Download | Description |
|---|---|
| SMSD.Pro-6.7.0.dmg | macOS installer (Apple Silicon): drag to Applications |
| SMSD.Pro-6.7.0.msi | Windows installer: next, next, finish |
| smsd-pro_6.7.0_amd64.deb | Linux installer: sudo dpkg -i |
| smsd-6.7.0.jar | Pure library JAR (Maven/Gradle dependency) |
| smsd-6.7.0-jar-with-dependencies.jar | Standalone CLI (just java -jar) |
| smsd-cpp-6.7.0-headers.tar.gz | C++ header-only library (unpack, #include "smsd/smsd.hpp") |
| pip install smsd | Python package (PyPI) |
# Native installer — download .dmg / .msi / .deb, double-click, done
# CLI
java -jar smsd-6.7.0-jar-with-dependencies.jar --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
# Docker CLI
docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
# Python
pip install smsd
- 1,181 Java tests (7 consolidated suites): heterocycles, reactions, drug pairs, tautomers, stereochemistry, ring perception, URF families, hydrogen handling, adversarial edge cases, fast-path validation, solvent corrections
- 170 C++ tests (3 suites): 63 core + 91 parser (including SMARTS X primitive) + 16 batch/GPU
- 1,003 diverse molecules: all parse correctly in the C++ SMILES parser
- AddressSanitizer: zero memory errors
- Python tests: full API coverage including hydrogen handling and charged species
| Document | Description |
|---|---|
| WHITEPAPER | Algorithms & design (11-level MCS, VF2++, ring perception) |
| HOWTO-INSTALL | Build from source guide |
| NOTICE | Attribution, trademark, and novel algorithm terms |
If you use SMSD Pro in your research, please cite:
Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics, 1:12, 2009. DOI: 10.1186/1758-2946-1-12
GitHub renders a "Cite this repository" button from CITATION.cff.
Syed Asad Rahman — BioInception PVT LTD
Copyright (c) 2018-2026 BioInception PVT LTD. Algorithm Copyright (c) 2009-2026 Syed Asad Rahman.