mofsyncondition

mofsyncondition is a Python module for automatically extracting synthesis conditions of metal–organic frameworks (MOFs) from scientific journal articles.

The module reads HTML files or PDF-derived text files, uses machine learning models to identify paragraphs describing synthetic protocols and then extracts relevant synthesis conditions. In its current state, the extraction of synthesis conditions is primarily performed using intelligent regular expressions. The resulting dataset is being used to fine-tune a large language model (LLM) for MOFs.

Overview

Extracting synthesis conditions from MOF literature is a key challenge in data-driven materials discovery. mofsyncondition addresses this problem by:

Reading journal articles in HTML, PDF, or XML format
Identifying synthesis-related paragraphs using ML-based classification
Extracting structured synthesis conditions from unstructured text
Generating datasets suitable for machine learning and LLM training

Key Features

Support for HTML, XML, and PDF-derived text inputs
ML-based identification of synthesis protocols
Regex-driven extraction of synthesis conditions
Modular and extensible Python design
Scalable for large literature datasets

Extracted Synthesis Information

The module aims to extract synthesis parameters such as:

Metal precursors
Organic linkers
Solvents
Additives / modulators
Reaction temperature
Reaction time
pH (when available)
Synthetic methods (e.g. solvothermal, hydrothermal)
Pressure and humidity (when available)
MOF name or formula (when provided)
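
As a rough illustration of what regex-driven extraction of two of these parameters can look like, here is a minimal sketch. The patterns and the names `TEMPERATURE`, `DURATION`, and `extract_conditions` are hypothetical and written for this example only; they are not the expressions mofsyncondition itself uses, which are likely to cover many more cases.

```python
import re

# Hypothetical patterns for illustration; not taken from mofsyncondition.
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|K)\b")
DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(h|hours?|days?|min|minutes?)\b")


def extract_conditions(paragraph):
    """Return the temperatures and durations found in a paragraph as (value, unit) pairs."""
    temperatures = [(float(v), u) for v, u in TEMPERATURE.findall(paragraph)]
    durations = [(float(v), u) for v, u in DURATION.findall(paragraph)]
    return {"temperature": temperatures, "time": durations}


text = "The mixture was heated at 120 °C for 48 h, then cooled to room temperature."
print(extract_conditions(text))
```

Simple patterns like these miss ranges ("100-120 °C") and spelled-out numbers, which is one reason such rules are usually combined with other signals.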

Named Entity Recognition for Chemical Reagents

In addition to intelligent regular expressions, mofsyncondition uses a trained spaCy Named Entity Recognizer (NER) to identify chemical reagents and synthesis-related entities directly from raw text and paragraph inputs.

The model, en_mof_chem_ner, is specialized for MOF literature and recognizes the following domain-specific entity types:

Component	Labels
`ner`	`ATMOSPHERE`, `METAL_SALT`, `MODULATOR`, `MOF`, `ORGANIC_LIGAND`, `SOLVENT`, `SYNTH_METHOD`

This NER layer enables reliable extraction of:

Metal precursors and salts
Organic ligands / linkers
Solvents and modulators
Synthetic methods (e.g., solvothermal, hydrothermal)
Reaction atmosphere (e.g., air, nitrogen, argon)
MOF names (when explicitly stated)

These structured entities are then combined with regex-based extraction to produce high-quality synthesis-condition datasets for machine learning and LLM fine-tuning.
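
The sketch below shows one way recognized entities could be grouped by label before being merged with regex results. The helper `group_entities` is hypothetical and not part of mofsyncondition. The commented lines assume the model is installed and loadable with `spacy.load("en_mof_chem_ner")`, which this README does not confirm; hand-written pairs stand in for model output so the example runs without the model.

```python
from collections import defaultdict

# Entity labels listed for en_mof_chem_ner in the table above.
LABELS = ("ATMOSPHERE", "METAL_SALT", "MODULATOR", "MOF",
          "ORGANIC_LIGAND", "SOLVENT", "SYNTH_METHOD")


def group_entities(pairs):
    """Group (text, label) pairs by label, dropping duplicates but keeping order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if label in LABELS and text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# With spaCy, the pairs would come from a processed document, e.g.
#   nlp = spacy.load("en_mof_chem_ner")  # assumption: model installed under this name
#   pairs = [(ent.text, ent.label_) for ent in nlp(paragraph).ents]
# Hand-written pairs stand in for model output here.
pairs = [
    ("Zn(NO3)2·6H2O", "METAL_SALT"),
    ("terephthalic acid", "ORGANIC_LIGAND"),
    ("DMF", "SOLVENT"),
    ("DMF", "SOLVENT"),
    ("solvothermal", "SYNTH_METHOD"),
]
print(group_entities(pairs))
```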

NER Model Performance

Overall evaluation scores on held-out data:

Metric	Score
`ENTS_F`	91.66
`ENTS_P`	92.78
`ENTS_R`	90.56
`TOK2VEC_LOSS`	26365.16
`NER_LOSS`	78555.25

Per-Entity Performance

Entity Type	Precision (P)	Recall (R)	F1-score (F)
METAL_SALT	0.9292	0.9082	0.9186
ORGANIC_LIGAND	0.7600	0.7157	0.7372
SOLVENT	0.9815	0.9900	0.9857
MODULATOR	0.9722	0.9560	0.9640
ATMOSPHERE	0.9715	0.9662	0.9689
SYNTH_METHOD	0.9970	0.9941	0.9955
MOF	0.6797	0.4973	0.5744
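
The F1-score column is the harmonic mean of precision and recall. The short check below recomputes it for two rows of the table; `f1_score` is a helper written for this example, not a function of the package.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the per-entity table above.
print(round(f1_score(0.9292, 0.9082), 4))  # METAL_SALT row
print(round(f1_score(0.6797, 0.4973), 4))  # MOF row
```

The gap between precision and recall for MOF explains its low F1: the model misses roughly half of the MOF names in the evaluation data.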

Installation

Clone the repository and install the package locally:

git clone https://github.com/bafgreat/mofsyncondition.git
cd mofsyncondition
pip install .

PyPI

The module can also be installed from PyPI:

    pip install mofsyncondition

Usage

1. Extract synthetic paragraph from file

If you have files in different formats and want a list of the paragraphs that describe a synthesis, run the following code.

    from mofsyncondition.synthesis_conditions import extractor

    # filepaths
    pdf_file_path = '../filename.pdf'
    html_file_path = '../filename.html'
    xml_file_path = '../filename.xml'

    # declare extractor class
    text_extractor = extractor.MOFSynConditionExtractor()

    # PDF extraction

    list_of_paragraphs = text_extractor.read_file(pdf_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # html extraction

    list_of_paragraphs = text_extractor.read_file(html_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # xml extraction

    list_of_paragraphs = text_extractor.read_file(xml_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)

By default, the paragraph classifier uses NN_tfv. The available models are listed below.

ML Model Performance (5-Fold Cross-Validation Averages)

Rank	Model	Avg Accuracy	Avg Precision	Notes
1	SVM_tfv	0.9905	0.8163	Default model
2	NN_tfv	0.9903	0.8143
3	RF_tfv	0.9904	0.7730	High accuracy, lower precision
4	RF_CV	0.9902	0.7692	Stable but conservative
5	NN_CV	0.9889	0.8240	High precision
6	LR_tfv	0.9895	0.7853	Fast baseline
7	LR_CV	0.9885	0.8040	Balanced baseline
8	SVM_CV	0.9885	0.8124	Robust alternative
9	DT_CV	0.9865	0.7795	Interpretable
10	DT_tfv	0.9851	0.7692	Simple model
11	NB_CV	0.9837	0.8337	Highest precision
12	NB_tfv	0.9657	0.0232	Not recommended

To use a specific model, pass its name through the model argument, e.g.

    list_of_paragraphs = text_extractor.read_file(xml_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs, model="NN_CV")
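
When several files need the same treatment, the two calls can be wrapped in a small loop. This is a minimal sketch: `collect_synthesis_paragraphs` is a hypothetical helper, not part of the package, and it relies only on the `read_file` and `get_synthetic_paragraph` methods used in this section.

```python
def collect_synthesis_paragraphs(extractor, paths, model=None):
    """Map each file path to the synthesis paragraphs found in that file."""
    results = {}
    for path in paths:
        paragraphs = extractor.read_file(path)
        if model is None:
            # Fall back to the extractor's default classifier
            results[path] = extractor.get_synthetic_paragraph(paragraphs)
        else:
            results[path] = extractor.get_synthetic_paragraph(paragraphs, model=model)
    return results


# Intended use (requires mofsyncondition to be installed):
#   from mofsyncondition.synthesis_conditions import extractor
#   text_extractor = extractor.MOFSynConditionExtractor()
#   results = collect_synthesis_paragraphs(
#       text_extractor, ["../filename.pdf", "../filename.html"], model="NN_CV")
```

Passing the extractor in as an argument means the object is created once and reused for every file.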

2. Extract paragraph-level synthesis conditions from file

Suppose you have a document (PDF, HTML, or XML) and wish to extract all synthesis conditions. The code below is the fastest way to do so. It is faster than using transformer models, handles large documents, and can parse thousands of files.

    from mofsyncondition.synthesis_conditions.mof_synthesis_conditions import MOFSynConditionExtractor
    from mofsyncondition.io import filetyper

    data_extractor = MOFSynConditionExtractor()

    # One list per output style returned by syn_data_from_document
    transformer_dataset = []
    standard_dataset = []

    all_files = ["./data_test/Test2.pdf", "./data_test/ABAFUH.xml", "./data_test/Test3.html"]
    for file_path in all_files:
        # Each item holds a paragraph and its conditions in two styles
        syn_data = data_extractor.syn_data_from_document(file_path)
        for paragraph, data_style_1, data_style_2 in syn_data:
            transformer_dataset.append({"paragraph": paragraph, "condition": data_style_1})
            standard_dataset.append({"paragraph": paragraph, "condition": data_style_2})

    # Write both datasets to disk
    filetyper.list_2_json(transformer_dataset, "transformer_dataset.jsonl")
    filetyper.list_2_json(standard_dataset, "standard_dataset.json")
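
To inspect the result later, the records can be read back with the standard library. The sketch below assumes the `.jsonl` file holds one JSON object per line, as the extension suggests; this README does not state how `list_2_json` lays out its output, so if the file turns out to be a single JSON array, `json.load` on the whole file would be the right call instead. `read_json_lines` is a helper written for this example.

```python
import json


def read_json_lines(path):
    """Yield one record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Intended use, assuming one {"paragraph": ..., "condition": ...} object per line:
#   for record in read_json_lines("transformer_dataset.jsonl"):
#       print(record["paragraph"][:80], record["condition"])
```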

LICENSE

MIT license
