mofsyncondition is a Python module for automatically extracting synthesis conditions of metal–organic frameworks (MOFs) from scientific journal articles.
The module reads HTML files or PDF-derived text files, uses machine learning models to identify paragraphs describing synthetic protocols and then extracts relevant synthesis conditions. In its current state, the extraction of synthesis conditions is primarily performed using intelligent regular expressions. The resulting dataset is being used to fine-tune a large language model (LLM) for MOFs.
Extracting synthesis conditions from MOF literature is a key challenge in data-driven materials discovery.
mofsyncondition addresses this problem by:
- Reading journal articles in HTML, PDF, or XML format
- Identifying synthesis-related paragraphs using ML-based classification
- Extracting structured synthesis conditions from unstructured text
- Generating datasets suitable for machine learning and LLM training
- Support for HTML and PDF-derived text inputs
- ML-based identification of synthesis protocols
- Regex-driven extraction of synthesis conditions
- Modular and extensible Python design
- Scalable for large literature datasets
The module aims to extract synthesis parameters such as:
- Metal precursors
- Organic linkers
- Solvents
- Additives / modulators
- Reaction temperature
- Reaction time
- pH (when available)
- Synthetic methods (e.g. solvothermal, hydrothermal)
- Pressure and humidity (when available)
- MOF name or formula (when provided)
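
As a minimal sketch of what regex-driven extraction can look like for two of the parameters above (reaction temperature and reaction time): the patterns `TEMPERATURE` and `DURATION` and the helper `extract_conditions` are hypothetical and written for illustration only; they are not the expressions shipped with mofsyncondition.

```python
import re

# Hypothetical patterns for illustration only; not the ones used by mofsyncondition.
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|K)\b")
DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(h|hours?|min|minutes?|days?)\b")


def extract_conditions(paragraph):
    """Return every temperature and duration mentioned in a paragraph."""
    return {
        "temperature": [(float(v), u) for v, u in TEMPERATURE.findall(paragraph)],
        "time": [(float(v), u) for v, u in DURATION.findall(paragraph)],
    }


text = (
    "Zn(NO3)2·6H2O and H2BDC were dissolved in DMF, "
    "heated at 120 °C for 24 h and then cooled to room temperature."
)
conditions = extract_conditions(text)
print(conditions)
```

Real synthesis paragraphs use many more notations (ranges, "overnight", "room temperature"), so production patterns need to be considerably broader than this.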
In addition to intelligent regular expressions, mofsyncondition uses a trained spaCy Named Entity Recognizer (NER) to identify chemical reagents and synthesis-related entities directly from raw text and paragraph inputs.
The model, en_mof_chem_ner, is specialized for MOF literature and recognizes the following domain-specific entity types:
| Component | Labels |
|---|---|
| ner | ATMOSPHERE, METAL_SALT, MODULATOR, MOF, ORGANIC_LIGAND, SOLVENT, SYNTH_METHOD |
This NER layer enables reliable extraction of:
- Metal precursors and salts
- Organic ligands / linkers
- Solvents and modulators
- Synthetic methods (e.g., solvothermal, hydrothermal)
- Reaction atmosphere (e.g., air, nitrogen, argon)
- MOF names (when explicitly stated)
These structured entities are then combined with regex-based extraction to produce high-quality synthesis-condition datasets for machine learning and LLM fine-tuning.
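
A rough sketch of how the two sources might be merged into a single record is given below. The `(text, label)` pairs stand in for what spaCy exposes as `ent.text` and `ent.label_` on `doc.ents`; the helper `merge_conditions` and the record layout are assumptions made for illustration, not the package's actual output format.

```python
from collections import defaultdict


def merge_conditions(entities, regex_fields):
    """Group NER entities by label, then add the regex-derived fields."""
    record = defaultdict(list)
    for text, label in entities:
        if text not in record[label]:  # drop repeated mentions of the same entity
            record[label].append(text)
    merged = dict(record)
    merged.update(regex_fields)
    return merged


# Stand-ins for (ent.text, ent.label_) pairs taken from a spaCy doc.
entities = [
    ("Zn(NO3)2·6H2O", "METAL_SALT"),
    ("H2BDC", "ORGANIC_LIGAND"),
    ("DMF", "SOLVENT"),
    ("DMF", "SOLVENT"),
    ("solvothermal", "SYNTH_METHOD"),
]
# Stand-ins for values found by regular expressions.
regex_fields = {"temperature": "120 °C", "time": "24 h"}

record = merge_conditions(entities, regex_fields)
print(record)
```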
Overall evaluation scores on held-out data:
| Metric | Score |
|---|---|
| ENTS_F | 91.66 |
| ENTS_P | 92.78 |
| ENTS_R | 90.56 |
| TOK2VEC_LOSS | 26365.16 |
| NER_LOSS | 78555.25 |

Per-entity scores:

| Entity Type | Precision (P) | Recall (R) | F1-score (F) |
|---|---|---|---|
| METAL_SALT | 0.9292 | 0.9082 | 0.9186 |
| ORGANIC_LIGAND | 0.7600 | 0.7157 | 0.7372 |
| SOLVENT | 0.9815 | 0.9900 | 0.9857 |
| MODULATOR | 0.9722 | 0.9560 | 0.9640 |
| ATMOSPHERE | 0.9715 | 0.9662 | 0.9689 |
| SYNTH_METHOD | 0.9970 | 0.9941 | 0.9955 |
| MOF | 0.6797 | 0.4973 | 0.5744 |
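
The F1-score in each row is the harmonic mean of precision and recall, `F = 2PR / (P + R)`. The short check below uses values copied from the per-entity table to reproduce the last column; the helper name `f1_score` is only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the per-entity table.
scores = {
    "METAL_SALT": (0.9292, 0.9082),
    "ORGANIC_LIGAND": (0.7600, 0.7157),
    "MOF": (0.6797, 0.4973),
}

f1_by_label = {label: f1_score(p, r) for label, (p, r) in scores.items()}
for label, f1 in f1_by_label.items():
    print(f"{label}: {f1:.4f}")
```

The low recall for the MOF label pulls its F1-score well below its precision, which is why MOF names are the least dependable field in the extracted records.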
Clone the repository and install the package locally:
git clone https://github.com/bafgreat/mofsyncondition.git
cd mofsyncondition
pip install .

The module can also be installed from PyPI:
pip install mofsyncondition

If you have files in different formats and want to extract the list of paragraphs that describe synthesis, run the following code:
from mofsyncondition.synthesis_conditions import extractor
# filepaths
pdf_file_path = '../filename.pdf'
html_file_path = '../filename.html'
xml_file_path = '../filename.xml'
# declare extractor class
text_extractor = extractor.MOFSynConditionExtractor()
# PDF extraction
list_of_paragraphs = text_extractor.read_file(pdf_file_path)
synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)
# html extraction
list_of_paragraphs = text_extractor.read_file(html_file_path)
synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)
# xml extraction
list_of_paragraphs = text_extractor.read_file(xml_file_path)
synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)

By default, the paragraph classification model is NN_tfv. The available models are listed below.
| Rank | Model | Avg Accuracy | Avg Precision | Notes |
|---|---|---|---|---|
| 1 | SVM_tfv | 0.9905 | 0.8163 | Default model |
| 2 | NN_tfv | 0.9903 | 0.8143 | |
| 3 | RF_tfv | 0.9904 | 0.7730 | High accuracy, lower precision |
| 4 | RF_CV | 0.9902 | 0.7692 | Stable but conservative |
| 5 | NN_CV | 0.9889 | 0.8240 | High precision |
| 6 | LR_tfv | 0.9895 | 0.7853 | Fast baseline |
| 7 | LR_CV | 0.9885 | 0.8040 | Balanced baseline |
| 8 | SVM_CV | 0.9885 | 0.8124 | Robust alternative |
| 9 | DT_CV | 0.9865 | 0.7795 | Interpretable |
| 10 | DT_tfv | 0.9851 | 0.7692 | Simple model |
| 11 | NB_CV | 0.9837 | 0.8337 | Highest precision |
| 12 | NB_tfv | 0.9657 | 0.0232 | Not recommended |
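
Because the accuracies in the table are nearly identical, precision may be the more useful column when choosing a model. The sketch below uses a few rows copied from the table and picks the most precise model among those above an accuracy floor. The helper `pick_model` and the 0.985 threshold are assumptions made for illustration.

```python
# (model name, average accuracy, average precision), copied from the table.
MODELS = [
    ("SVM_tfv", 0.9905, 0.8163),
    ("NN_tfv", 0.9903, 0.8143),
    ("NN_CV", 0.9889, 0.8240),
    ("NB_CV", 0.9837, 0.8337),
    ("NB_tfv", 0.9657, 0.0232),
]


def pick_model(models, min_accuracy):
    """Return the name of the most precise model meeting the accuracy floor."""
    eligible = [m for m in models if m[1] >= min_accuracy]
    if not eligible:
        raise ValueError("no model meets the accuracy threshold")
    return max(eligible, key=lambda m: m[2])[0]


best = pick_model(MODELS, min_accuracy=0.985)
print(best)
```

Lowering the floor trades a little accuracy for precision, so the right threshold depends on how costly false positives are for the dataset being built.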
To use a different model, pass its name through the model argument, for example:
list_of_paragraphs = text_extractor.read_file(xml_file_path)
synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs, model="NN_CV")

Suppose you have a document (PDF, HTML, or XML) and wish to extract all synthesis conditions. The following lines of code are the fastest way to do so. This approach is faster than using transformer models, handles large documents, and can parse thousands of files.
import spacy
from mofsyncondition.synthesis_conditions.mof_synthesis_conditions import MOFSynConditionExtractor
from mofsyncondition.io import filetyper
data_extractor = MOFSynConditionExtractor()
transformer_dataset = []
standard_dataset = []
file_path = "./data_test/Test2.pdf"
all_files = ["./data_test/Test2.pdf", "./data_test/ABAFUH.xml", "./data_test/Test3.html"]
# Extract conditions from every document and collect them in both output styles
for file_path in all_files:
    syn_data = data_extractor.syn_data_from_document(file_path)
    for paragraph, data_style_1, data_style_2 in syn_data:
        transformer_dataset.append({'paragraph': paragraph, "condition": data_style_1})
        standard_dataset.append({'paragraph': paragraph, "condition": data_style_2})
# Write both datasets to disk once all files have been processed
filetyper.list_2_json(transformer_dataset, 'transformer_dataset.jsonl')
filetyper.list_2_json(standard_dataset, 'standard_dataset.json')

MIT license