SMSD Pro

Substructure & MCS Search for Chemical Graphs


SMSD Pro is an open-source toolkit for exact substructure search and maximum common substructure (MCS) finding in chemical graphs. It runs on Java, C++ (header-only), and Python, with GPU acceleration (CUDA and Apple Metal). It builds on established algorithms from the graph-isomorphism literature (VF2++, McSplit, McGregor, Horton, Vismara).

Copyright (c) 2018-2026 Syed Asad Rahman — BioInception PVT LTD


Install

Java (Maven)

<dependency>
  <groupId>com.bioinceptionlabs</groupId>
  <artifactId>smsd</artifactId>
  <version>6.7.0</version>
</dependency>

Java (Download JAR)

curl -LO https://github.com/asad/SMSD/releases/download/v6.7.0/smsd-6.7.0-jar-with-dependencies.jar

java -jar smsd-6.7.0-jar-with-dependencies.jar \
  --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Python (pip)

pip install smsd

import smsd

result = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
mcs    = smsd.mcs("c1ccccc1", "c1ccc2ccccc2c1")

# Tautomer-aware MCS
mcs    = smsd.mcs("CC(=O)C", "CC(O)=C", tautomer_aware=True)

# Prefer rare heteroatoms (S, P, Se) for reaction mapping
mcs    = smsd.mcs("C[S+](C)CCC(N)C(=O)O", "SCCC(N)C(=O)O",
                   prefer_rare_heteroatoms=True)

# Reaction-aware atom mapping
aam    = smsd.map_reaction_aware("CC(=O)O", "CCO")

# Similarity upper bound (fast pre-filter)
sim    = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")

fp     = smsd.fingerprint("c1ccccc1", kind="mcs")

# Circular fingerprint (ECFP4 equivalent, tautomer-aware)
ecfp4 = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)

Java API

import com.bioinception.smsd.core.*;

SMSD smsd = new SMSD(mol1, mol2, new ChemOptions());
boolean isSub = smsd.isSubstructure();
var mcs = smsd.findMCS();

// Reaction-aware with bond-change scoring
SearchEngine.McsOptions opts = new SearchEngine.McsOptions();
opts.reactionAware = true;
opts.bondChangeAware = true;  // penalise implausible bond transformations
var rxnMcs = SearchEngine.reactionAwareMCS(g1, g2, new ChemOptions(), opts);

// CIP stereo assignment (Rules 1-5, including pseudoasymmetric r/s)
Map<Integer, Character> stereo = CipAssigner.assignRS(g);
Map<Long, Character> ez = CipAssigner.assignEZ(g);

// Batch MCS with non-overlap constraints
var mappings = SearchEngine.batchMcsConstrained(queries, targets, new ChemOptions(), 10_000);

Python — Advanced Features

import smsd

# --- Reaction-Aware MCS ---
# Prefer heteroatom-containing mappings for reaction center identification
mapping = smsd.map_reaction_aware(
    "C[S+](CCC(N)C(=O)O)CC1OC(n2cnc3c(N)ncnc32)C(O)C1O",  # SAM
    "S(CCC(N)C(=O)O)CC1OC(n2cnc3c(N)ncnc32)C(O)C1O"        # SAH (thioether: S bonded to both chains)
)

# --- Structured MCS Result ---
result = smsd.mcs_result("c1ccccc1", "c1ccc(O)cc1")
print(result.size)          # 6
print(result.tanimoto)      # 0.857
print(result.mcs_smiles)    # "c1ccccc1"
print(result.mapping)       # {0: 0, 1: 1, ...}

# --- Works with any input type ---
# SMILES strings
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1")

# MolGraph objects (pre-parsed, fastest for batch)
g1 = smsd.parse_smiles("c1ccccc1")
g2 = smsd.parse_smiles("c1ccc(O)cc1")
mcs = smsd.mcs(g1, g2)

# Native Mol objects (auto-detected, indices returned in native ordering)
# from rdkit import Chem
# mcs = smsd.mcs(Chem.MolFromSmiles("c1ccccc1"), Chem.MolFromSmiles("c1ccc(O)cc1"))

# --- Fingerprints ---
ecfp4  = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048)
fcfp4  = smsd.circular_fingerprint("c1ccccc1", radius=2, fp_size=2048, mode="fcfp")
counts = smsd.ecfp_counts("c1ccccc1", radius=2, fp_size=2048)
torsion = smsd.topological_torsion("c1ccccc1", fp_size=2048)
tan    = smsd.tanimoto(ecfp4, ecfp4)

# --- 2D Layout ---
g = smsd.parse_smiles("c1ccc2c(c1)cc1ccccc1c2")  # anthracene (linearly fused)
coords = smsd.force_directed_layout(g, max_iter=500, target_bond_length=1.5)
coords = smsd.stress_majorisation(g, max_iter=300)
crossings = smsd.reduce_crossings(g, coords, max_iter=2000)
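
The `result.tanimoto` value of 0.857 shown above is consistent with the usual atom-count Tanimoto for an MCS, |MCS| / (|A| + |B| - |MCS|): benzene has 6 heavy atoms, phenol has 7, and the MCS has 6, which gives 6/7. The sketch below reproduces that arithmetic. It assumes this is the definition in use (the README does not state the formula), and `mcs_tanimoto` is a hypothetical helper, not part of the smsd API.

```python
def mcs_tanimoto(mcs_size: int, n_query: int, n_target: int) -> float:
    """Atom-count Tanimoto: shared atoms over the union of both atom sets."""
    union = n_query + n_target - mcs_size
    return mcs_size / union if union else 0.0


# Benzene (6 heavy atoms) vs phenol (7 heavy atoms), MCS of 6 atoms
print(round(mcs_tanimoto(6, 6, 7), 3))  # 0.857
```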

Python — MCS Variants & Batch Operations

import smsd

# --- All MCS variants ---
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1")                     # Connected MCS (default)
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", connected_only=False) # Disconnected MCS
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", induced=True)         # Induced MCS
mcs = smsd.mcs("c1ccccc1", "c1ccc(O)cc1", maximize_bonds=True)  # Edge MCS (MCES)

# Find top-N distinct MCS solutions
all_mcs = smsd.find_all_mcs("c1ccccc1", "c1ccc(O)cc1", max_results=5)

# SMARTS-based MCS
mcs = smsd.find_mcs_smarts("[#6]~[#7]", "c1ccc(N)cc1")

# Scaffold MCS (Murcko framework)
scaffold = smsd.find_scaffold_mcs("CC(=O)Oc1ccccc1C(=O)O", "Oc1ccccc1C(=O)O")

# R-group decomposition
rgroups = smsd.decompose_rgroups("c1ccc(O)cc1", "c1ccc(N)cc1")

# --- Substructure Search ---
hit = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
all_matches = smsd.find_all_substructures("c1ccccc1", "c1ccc(O)cc1", max_matches=10)

# SMARTS pattern matching
matches = smsd.smarts_search("[OH]", "c1ccc(O)cc1")

# --- Similarity & Screening ---
sim = smsd.tanimoto(
    smsd.circular_fingerprint("CCO", radius=2),
    smsd.circular_fingerprint("CCCO", radius=2)
)
dice = smsd.dice_similarity(
    smsd.ecfp_counts("CCO", radius=2),
    smsd.ecfp_counts("CCCO", radius=2)
)

# --- Chemistry Options ---
# Tautomer-aware with solvent and pH
mcs = smsd.mcs("CC(=O)C", "CC(O)=C",
               tautomer_aware=True, solvent="DMSO", pH=5.0)

# Loose bond matching (FMCS-style)
mcs = smsd.mcs("c1ccccc1", "C1CCCCC1", bond_order_mode="loose")

# --- Canonical SMILES ---
smi = smsd.canonical_smiles("OC(=O)c1ccccc1")   # deterministic canonical form
mcs_smi = smsd.mcs_to_smiles(g1, mapping)        # extract MCS as SMILES

# --- CIP Stereo Assignment ---
g = smsd.parse_smiles("N[C@@H](C)C(=O)O")  # L-alanine
stereo = smsd.assign_rs(g)                   # {1: 'S'}
ez = smsd.assign_ez(smsd.parse_smiles("C/C=C/C"))  # E-2-butene

# --- MolGraph I/O ---
g = smsd.parse_smiles("c1ccccc1")
g = smsd.parse_smarts("[#6]~[#7]")
g = smsd.read_molfile("molecule.mol")
smsd.export_sdf([g1, g2], "output.sdf")

Export & Depiction

import smsd

# Depict MCS with highlighted atoms (works in Jupyter)
img = smsd.depict_mcs("c1ccccc1", "c1ccc(O)cc1")
img.save("mcs.png")

# Depict substructure match
img = smsd.depict_substructure("c1ccccc1", "c1ccc(O)cc1")

# Generate SVG
svg = smsd.to_svg("c1ccccc1")

# Export to SDF file
mols = [smsd.parse_smiles(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
smsd.export_sdf(mols, "output.sdf")

C++ (Header-Only)

git clone https://github.com/asad/SMSD.git
# Add SMSD/cpp/include to your include path; no other dependencies are needed

#include "smsd/smsd.hpp"

auto mol1 = smsd::parseSMILES("c1ccccc1");
auto mol2 = smsd::parseSMILES("c1ccc(O)cc1");

bool isSub = smsd::isSubstructure(mol1, mol2, smsd::ChemOptions{});
auto mcs   = smsd::findMCS(mol1, mol2, smsd::ChemOptions{}, smsd::McsOptions{});

// Bond-change-aware MCS for reaction mapping
auto opts = smsd::McsOptions{};
opts.reactionAware = true;
opts.bondChangeAware = true;
auto rxnMcs = smsd::reactionAwareMCS(mol1, mol2, smsd::ChemOptions{}, opts);

// Batch MCS with non-overlap constraints (multi-fragment reactions)
auto mappings = smsd::batchMcsConstrained(queries, targets, smsd::ChemOptions{});

Build from Source

git clone https://github.com/asad/SMSD.git
cd SMSD

# Java
mvn -U clean package

# C++ (from the repository root)
mkdir -p cpp/build && cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Python (from the repository root)
cd python && pip install -e .

Docker

docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Benchmarks

MCS Performance (Python)

Same machine, same Python process, best of 5 runs. Full data: benchmarks/results_python.tsv

| Pair | Category | SMSD (ms) | MCS Size |
|---|---|---|---|
| Cubane (self) | Cage | 0.003 | 8 |
| Coronene (self) | PAH | 0.006 | 24 |
| NAD / NADH | Cofactor | 0.012 | 44 |
| Caffeine / Theophylline | Drug pair | 0.016 | 13 |
| Morphine / Codeine | Alkaloid | 0.049 | 20 |
| Ibuprofen / Naproxen | NSAID | 0.069 | 15 |
| ATP / ADP | Nucleotide | 0.085 | 27 |
| PEG-12 / PEG-16 | Polymer | 1.6 | 40 |
| Paclitaxel / Docetaxel | Taxane | 2,405 | 56 |

Substructure Performance (Java)

28/28 pairs correct. Cached speedup: 2x-16x faster across all pairs.

Run python benchmarks/benchmark_python.py to reproduce.


Algorithms

MCS Pipeline (11-level funnel)

| Level | Algorithm | Based on |
|---|---|---|
| L0 | Label-frequency upper bound | Degree-aware coverage-driven termination |
| L0.25 | Chain fast-path | O(n*m) DP for linear polymers (PEG, lipids) |
| L0.5 | Tree fast-path | Kilpelainen-Mannila DP for branched polymers (dendrimers, glycogen) |
| L0.75 | Greedy probe | O(N) fast path for near-identical molecules |
| L1 | Substructure containment | VF2++ check if smaller molecule is subgraph |
| L1.25 | Augmenting path extension | Forced-extension bond growth from substructure seed |
| L1.5 | Seed-and-extend | Bond-growth from rare-label seeds |
| L2 | McSplit + RRSplit | Partition refinement (McCreesh 2017) with maximality pruning |
| L3 | Bron-Kerbosch | Product-graph clique with Tomita pivoting + k-core + orbit pruning |
| L4 | McGregor extension | Forced-assignment bond-grow frontier (McGregor 1982) |
| L5 | Extra seeds | Ring skeleton, heavy-atom core, label-degree anchor seeds |
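
The idea behind the L0 bound can be sketched in a few lines: an atom can only map to an atom with the same label, so the MCS can never be larger than the sum of per-label minimum counts. This is a simplified illustration that uses plain atom symbols as labels. The degree-aware refinement mentioned in the table is not modelled, and `label_frequency_upper_bound` is a hypothetical name, not an SMSD function.

```python
from collections import Counter


def label_frequency_upper_bound(labels_a, labels_b):
    """Upper bound on MCS size from per-label minimum counts."""
    shared = Counter(labels_a) & Counter(labels_b)  # multiset intersection
    return sum(shared.values())


benzene = ["c"] * 6
phenol = ["c"] * 6 + ["O"]
pyridine = ["c"] * 5 + ["n"]

print(label_frequency_upper_bound(benzene, phenol))    # 6
print(label_frequency_upper_bound(benzene, pyridine))  # 5
```

A pipeline can use such a bound as a stopping rule: once any later stage finds a mapping of that size, no larger one exists.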

MCS Variants

| Variant | Flag |
|---|---|
| MCIS (induced) | induced=true |
| MCCS (connected) | default |
| MCES (edge subgraph) | maximizeBonds=true |
| dMCS (disconnected) | disconnectedMCS=true |
| N-MCS (multi-molecule) | findNMCS() |
| Weighted MCS | atomWeights |
| Scaffold MCS | findScaffoldMCS() |
| Tautomer-aware MCS | ChemOptions.tautomerProfile() |

Substructure Search (VF2++)

VF2++ (Juttner & Madarasi 2018) with FASTiso/VF3-Light matching order, 3-level NLF pruning, bit-parallel candidate domains, and GPU-accelerated domain initialization (CUDA + Metal).
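
Neighbour-label-frequency (NLF) pruning rejects a candidate target atom before any backtracking if it cannot supply the neighbours the query atom needs. The sketch below shows the one-hop version of that test on plain label lists. It is an illustration of the general technique, not SMSD's implementation; the "3-level" variant above presumably extends the same check to wider neighbourhoods, and `nlf_compatible` is a hypothetical name.

```python
from collections import Counter


def nlf_compatible(query_neighbour_labels, target_neighbour_labels):
    """True if the target atom has at least as many neighbours of each
    label as the query atom requires."""
    need = Counter(query_neighbour_labels)
    have = Counter(target_neighbour_labels)
    return all(have[label] >= count for label, count in need.items())


# Query atom: bonded to one C and one O
print(nlf_compatible(["C", "O"], ["C", "C", "O"]))  # True
print(nlf_compatible(["C", "O"], ["C", "C", "N"]))  # False
```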

Ring Perception

Horton's candidate generation + 2-phase GF(2) elimination (Vismara 1997) for relevant cycles, orbit-based grouping for Unique Ring Families (URFs).

| Output | Description |
|---|---|
| SSSR / MCB | Smallest Set of Smallest Rings (minimum cycle basis) |
| RCB | Relevant Cycle Basis |
| URF | Unique Ring Families (automorphism orbit grouping) |
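
The number of rings in any SSSR / minimum cycle basis is fixed by the graph alone: it equals the cyclomatic number E - V + C (bonds, atoms, connected components). The sketch below computes it with a small union-find. `cyclomatic_number` is a hypothetical helper for illustration and does not use SMSD's ring-perception code.

```python
def cyclomatic_number(n_atoms, bonds):
    """Size of a minimum cycle basis: E - V + (connected components)."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n_atoms
    for a, b in bonds:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    return len(bonds) - n_atoms + components


# Naphthalene: 10 atoms, 11 bonds
naphthalene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
               (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
# Cubane: 8 atoms, 12 bonds (the edges of a cube)
cubane = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5), (5, 6), (6, 7), (7, 4),
          (0, 4), (1, 5), (2, 6), (3, 7)]

print(cyclomatic_number(10, naphthalene))  # 2
print(cyclomatic_number(8, cubane))        # 5
```

Cubane shows why the relevant-cycle output differs from the SSSR: the basis holds 5 rings, yet all 6 square faces are equally short, so no single choice of 5 is canonical.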

Chemistry Options

| Option | Values |
|---|---|
| Chirality | R/S tetrahedral, E/Z double bond |
| Isotope | matchIsotope=true |
| Tautomers | 15 transforms with pKa-informed weights (Sitzmann 2010) |
| Solvent | AQUEOUS, DMSO, METHANOL, CHLOROFORM, ACETONITRILE, DIETHYL_ETHER |
| Ring fusion | IGNORE / PERMISSIVE / STRICT |
| Bond order | STRICT / LOOSE / ANY |
| Aromaticity | STRICT / FLEXIBLE |
| Lenient SMILES | ParseOptions{.lenient=true} (C++) / ChemOptions.lenientSmiles (Java) |

Preset profiles: ChemOptions() (default), .tautomerProfile(), .fmcsProfile()

Solvent-aware tautomers (Tier 2 pKa): opts.withSolvent(Solvent.DMSO) adjusts tautomer equilibrium weights for non-aqueous environments.


Platform & GPU Support

| Platform | CPU | GPU |
|---|---|---|
| macOS (Apple Silicon) | OpenMP | Metal (zero-copy unified memory) |
| Linux | OpenMP | CUDA |
| Windows | OpenMP | CUDA |
| Any (no GPU) | OpenMP | Automatic CPU fallback |

GPU acceleration covers RASCAL batch screening and domain initialization. Recursive backtracking (VF2++, BK, McSplit) runs on CPU. Dispatch: CUDA -> Metal -> OpenMP -> sequential.
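
The dispatch order above amounts to a priority list with a guaranteed last resort. A minimal sketch of that selection logic follows. The backend names and the `pick_backend` function are hypothetical and only illustrate the ordering; how SMSD actually detects each backend is not described here.

```python
def pick_backend(available, order=("cuda", "metal", "openmp", "sequential")):
    """Return the first backend in priority order that is available."""
    for name in order:
        if name in available:
            return name
    return "sequential"  # always-available last resort


print(pick_backend({"metal", "openmp"}))  # metal
print(pick_backend({"openmp"}))           # openmp
print(pick_backend(set()))                # sequential
```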

Performance Caching

SMSD employs multi-level caching to eliminate redundant computation in batch and reaction workloads:

| Cache | Target | Benefit |
|---|---|---|
| MolGraph identity cache | CDK molecule conversion | Same molecule reused across 6-18 calls per reaction pair |
| Domain space cache | VF2++ atom compatibility matrix | Avoids O(Nq*Nt) rebuild on repeated queries |
| ECFP/FCFP fingerprint cache | Default-parameter fingerprints | 337x speedup on repeated fingerprint calls |
| Pharmacophore features cache | FCFP atom invariants | Eliminates O(n*degree^2) per FCFP call |
| C++ GraphBuilder compat matrix | Seed-extend/McSplit/BK stages | Pre-computed once, shared across algorithms |

Call SearchEngine.clearMolGraphCache() (Java) or reuse MolGraph instances (C++/Python) between batches.


Additional Tools

| Tool | Description |
|---|---|
| CIP R/S/E/Z assignment | Full digraph-based stereo descriptors (IUPAC 2013 Rules 1-5) including like/unlike pairing and pseudoasymmetric r/s |
| Circular fingerprint (ECFP/FCFP) | Tautomer-aware Morgan/ECFP with configurable radius (-1 = whole molecule) |
| Count-based ECFP/FCFP | ecfpCounts() / fcfpCounts(); superior to binary for ML |
| Topological Torsion fingerprint | 4-atom path with atom typing (SOTA on peptide benchmarks) |
| Path fingerprint | Graph-aware, tautomer-invariant path enumeration |
| MCS fingerprint | MCS-aware, auto-sized |
| Similarity metrics | Tanimoto, Dice, Cosine, Soergel (binary + count-vector) |
| Fingerprint formats | toBitSet(), toHex(), toBinaryString(), fromBitSet(), fromHex() |
| MCS SMILES extraction | findMcsSmiles(): extract MCS as canonical SMILES |
| findAllMCS | Top-N MCS enumeration with canonical SMILES dedup |
| SMARTS-based MCS | findMcsSmarts(): largest substructure matching a SMARTS pattern |
| R-group decomposition | decomposeRGroups() |
| MatchResult | Structured result: size, mapping, tanimoto, query/target atom counts |
| RASCAL screening | O(V+E) similarity upper bound |
| Canonical SMILES / SMARTS | Deterministic, toolkit-independent (including X total connectivity) |
| Reaction atom mapping | mapReaction() |
| 2D depiction | SVG rendering with atom highlighting |
| Lenient SMILES parser | Best-effort recovery from malformed SMILES |
| N-MCS | Multi-molecule MCS with provenance tracking |
| Tautomer validation | validateTautomerConsistency(): proton conservation check |
| 30 tautomer transforms | pKa-informed weights, 6 solvents, pH-sensitive, ring-chain tautomerism |
| Force-directed layout | forceDirectedLayout() for bond-crossing minimisation |
| SMACOF stress majorisation | stressMajorisation() for optimal 2D embedding |
| Scaffold templates | matchTemplate() for 10 pre-computed common scaffolds |
| Reaction-aware MCS | reactionAwareMCS() post-filter for reaction mapping |
| Bond-change-aware MCS | BondChangeScorer re-ranks candidates by bond transformation plausibility (C-C breaks=3.0, heteroatom=0.5) |
| Batch constrained MCS | batchMcsConstrained() multi-pair MCS with non-overlap atom exclusion for multi-fragment reactions |
| Two-phase crossing reduction | reduceCrossings(). Phase 1: system-level flipping. Phase 2: individual ring flipping with fusion-atom pivots |
| computeSSSR / layoutSSSR | Clean SSSR APIs: minimum cycle basis and layout-ordered ring perception |
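
For readers who want the definitions behind the four similarity metrics listed above, the sketch below gives the standard binary forms, with a fingerprint represented as the set of its on-bit indices. These are the textbook formulas, not SMSD's code. The function names are illustrative, and the values returned for two empty fingerprints are a convention chosen here.

```python
import math


def tanimoto(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def dice(a: set, b: set) -> float:
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(a: set, b: set) -> float:
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def soergel(a: set, b: set) -> float:
    """Soergel distance is the complement of Tanimoto similarity."""
    return 1.0 - tanimoto(a, b)


fp1 = {1, 5, 9, 12}
fp2 = {1, 5, 7, 12, 20}  # 3 shared bits, 6 bits in the union

print(round(tanimoto(fp1, fp2), 3))  # 0.5
print(round(dice(fp1, fp2), 3))      # 0.667
print(round(cosine(fp1, fp2), 3))    # 0.671
print(round(soergel(fp1, fp2), 3))   # 0.5
```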

File Formats

| Format | Read | Write |
|---|---|---|
| SMILES | Java, C++ | Java, C++ |
| SMARTS | Java, C++ | C++ |
| MOL V2000 | Java, C++ | C++ |
| SDF | Java, C++ | |
| Mol2, PDB, CML | Java | |

Release Downloads

Every release includes all platforms:

| Download | Description |
|---|---|
| SMSD.Pro-6.7.0.dmg | macOS installer (Apple Silicon): drag to Applications |
| SMSD.Pro-6.7.0.msi | Windows installer: next, next, finish |
| smsd-pro_6.7.0_amd64.deb | Linux installer: sudo dpkg -i |
| smsd-6.7.0.jar | Pure library JAR (Maven/Gradle dependency) |
| smsd-6.7.0-jar-with-dependencies.jar | Standalone CLI (just java -jar) |
| smsd-cpp-6.7.0-headers.tar.gz | C++ header-only library (unpack, #include "smsd/smsd.hpp") |
| pip install smsd | Python package (PyPI) |

# Native installer: download .dmg / .msi / .deb, double-click, done

# CLI
java -jar smsd-6.7.0-jar-with-dependencies.jar --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# Docker CLI
docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# Python
pip install smsd

Tests

  • 1,181 Java tests (7 consolidated suites) — heterocycles, reactions, drug pairs, tautomers, stereochemistry, ring perception, URF families, hydrogen handling, adversarial edge cases, fast-path validation, solvent corrections
  • 170 C++ tests (3 suites) — 63 core + 91 parser (including SMARTS X primitive) + 16 batch/GPU
  • 1,003 diverse molecules — all parse correctly in C++ SMILES parser
  • AddressSanitizer — zero memory errors
  • Python tests — full API coverage including hydrogen handling and charged species

Documentation

| Document | Description |
|---|---|
| WHITEPAPER | Algorithms & design (11-level MCS, VF2++, ring perception) |
| HOWTO-INSTALL | Build from source guide |
| NOTICE | Attribution, trademark, and novel algorithm terms |

Citation

If you use SMSD Pro in your research, please cite:

Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics, 1:12, 2009. DOI: 10.1186/1758-2946-1-12

GitHub renders a "Cite this repository" button from CITATION.cff.


Author

Syed Asad Rahman
BioInception PVT LTD

Copyright (c) 2018-2026 BioInception PVT LTD. Algorithm Copyright (c) 2009-2026 Syed Asad Rahman.

License

Apache License 2.0 — see LICENSE and NOTICE