bafgreat/mofsyncondition

mofsyncondition

mofsyncondition is a Python module for automatically extracting synthesis conditions of metal–organic frameworks (MOFs) from scientific journal articles.

The module reads HTML, XML, or PDF-derived text files, uses machine learning models to identify paragraphs that describe synthetic protocols, and then extracts the relevant synthesis conditions. At present, the extraction of synthesis conditions relies primarily on regular expressions. The resulting dataset is being used to fine-tune a large language model (LLM) for MOFs.


Overview

Extracting synthesis conditions from MOF literature is a key challenge in data-driven materials discovery. mofsyncondition addresses this problem by:

  • Reading journal articles in HTML, PDF, or XML format
  • Identifying synthesis-related paragraphs using ML-based classification
  • Extracting structured synthesis conditions from unstructured text
  • Generating datasets suitable for machine learning and LLM training

Key Features

  • Support for HTML, XML, and PDF-derived text inputs
  • ML-based identification of synthesis protocols
  • Regex-driven extraction of synthesis conditions
  • Modular and extensible Python design
  • Scalable for large literature datasets

Extracted Synthesis Information

The module aims to extract synthesis parameters such as:

  • Metal precursors
  • Organic linkers
  • Solvents
  • Additives / modulators
  • Reaction temperature
  • Reaction time
  • pH (when available)
  • Synthetic methods (e.g. solvothermal, hydrothermal)
  • Pressure and humidity (when available)
  • MOF name or formula (when provided)
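
As a rough illustration of regex-driven extraction, the sketch below pulls temperature and time mentions out of a synthesis paragraph. The patterns and the function name `extract_conditions` are hypothetical and far simpler than the expressions used inside mofsyncondition; they only show the general idea.

```python
import re

# Hypothetical, simplified patterns; not the expressions used by mofsyncondition.
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*(°\s?C|K)\b")
TIME = re.compile(r"(\d+(?:\.\d+)?)\s*(h|hours?|min|minutes?|days?)\b", re.IGNORECASE)


def extract_conditions(paragraph):
    """Return temperature and time mentions found in a paragraph."""
    temperatures = [
        {"value": float(value), "unit": unit.replace(" ", "")}
        for value, unit in TEMPERATURE.findall(paragraph)
    ]
    times = [
        {"value": float(value), "unit": unit.lower()}
        for value, unit in TIME.findall(paragraph)
    ]
    return {"temperature": temperatures, "time": times}


example = (
    "Zn(NO3)2·6H2O (0.5 mmol) and H2BDC (0.5 mmol) were dissolved in "
    "DMF (10 mL) and heated at 120 °C for 24 h."
)
print(extract_conditions(example))
```

The trailing word boundary keeps the time pattern from matching the "6H" inside a hydrate formula such as Zn(NO3)2·6H2O, which is the kind of false positive that chemical text tends to produce.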

Named Entity Recognition for Chemical Reagents

In addition to intelligent regular expressions, mofsyncondition uses a trained spaCy Named Entity Recognizer (NER) to identify chemical reagents and synthesis-related entities directly from raw text and paragraph inputs.

The model, en_mof_chem_ner, is specialized for MOF literature and recognizes the following domain-specific entity types:

| Component | Labels |
| --- | --- |
| ner | ATMOSPHERE, METAL_SALT, MODULATOR, MOF, ORGANIC_LIGAND, SOLVENT, SYNTH_METHOD |

This NER layer enables reliable extraction of:

  • Metal precursors and salts
  • Organic ligands / linkers
  • Solvents and modulators
  • Synthetic methods (e.g., solvothermal, hydrothermal)
  • Reaction atmosphere (e.g., air, nitrogen, argon)
  • MOF names (when explicitly stated)

These structured entities are then combined with regex-based extraction to produce high-quality synthesis-condition datasets for machine learning and LLM fine-tuning.
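
A minimal sketch of how such a model could be queried with the standard spaCy API is shown below. It assumes spaCy is installed and that `en_mof_chem_ner` can be loaded by name with `spacy.load`; how the model is distributed is not stated here. The helper names `group_entities` and `extract_entities` are hypothetical and not part of mofsyncondition.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, label) pairs into {label: [unique texts in first-seen order]}."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def extract_entities(paragraph, model_name="en_mof_chem_ner"):
    """Run a spaCy pipeline on a paragraph and group its entities by label.

    Assumption: `model_name` is loadable through spacy.load in your environment.
    """
    import spacy  # imported lazily so group_entities works without spaCy

    nlp = spacy.load(model_name)
    doc = nlp(paragraph)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)
```

Calling `extract_entities(paragraph)` on a synthesis paragraph would then return a dictionary keyed by labels such as SOLVENT or METAL_SALT, which is convenient to merge with regex-derived values.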


NER Model Performance

Overall evaluation scores on held-out data:

| Metric | Score |
| --- | --- |
| ENTS_F | 91.66 |
| ENTS_P | 92.78 |
| ENTS_R | 90.56 |
| TOK2VEC_LOSS | 26365.16 |
| NER_LOSS | 78555.25 |

Per-Entity Performance

| Entity Type | Precision (P) | Recall (R) | F1-score (F) |
| --- | --- | --- | --- |
| METAL_SALT | 0.9292 | 0.9082 | 0.9186 |
| ORGANIC_LIGAND | 0.7600 | 0.7157 | 0.7372 |
| SOLVENT | 0.9815 | 0.9900 | 0.9857 |
| MODULATOR | 0.9722 | 0.9560 | 0.9640 |
| ATMOSPHERE | 0.9715 | 0.9662 | 0.9689 |
| SYNTH_METHOD | 0.9970 | 0.9941 | 0.9955 |
| MOF | 0.6797 | 0.4973 | 0.5744 |
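
The F1-score is the harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below recomputes it for a few rows of the per-entity table as a sanity check; the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the per-entity table
reported = {
    "METAL_SALT": (0.9292, 0.9082, 0.9186),
    "ORGANIC_LIGAND": (0.7600, 0.7157, 0.7372),
    "MOF": (0.6797, 0.4973, 0.5744),
}

for label, (p, r, f) in reported.items():
    print(f"{label}: computed {f1_score(p, r):.4f}, reported {f:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall for MOF names weighs heavily on that row's F1-score.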

Installation

Clone the repository and install the package locally:

git clone https://github.com/bafgreat/mofsyncondition.git
cd mofsyncondition
pip install .

PyPI

The module can also be installed from PyPI:

   pip install mofsyncondition
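
To confirm that the installation succeeded, one option is to query the installed version through the standard library. This sketch assumes the distribution is registered under the name `mofsyncondition`, as in the pip command; the helper `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package="mofsyncondition"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version())
```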

Usage

1. Extract synthesis paragraphs from a file

If you have files in different formats and want the list of paragraphs that describe a synthesis, run the following code.

    from mofsyncondition.synthesis_conditions import extractor

    # filepaths
    pdf_file_path = '../filename.pdf'
    html_file_path = '../filename.html'
    xml_file_path = '../filename.xml'

    # declare extractor class
    text_extractor = extractor.MOFSynConditionExtractor()

    # PDF extraction

    list_of_paragraphs = text_extractor.read_file(pdf_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # html extraction

    list_of_paragraphs = text_extractor.read_file(html_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # xml extraction

    list_of_paragraphs = text_extractor.read_file(xml_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)
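
For a whole folder of articles, the same two calls can be wrapped in a loop. The sketch below is one possible way to do this; `find_articles` and `synthesis_paragraphs_by_file` are hypothetical helpers, and the only mofsyncondition calls used are `read_file` and `get_synthetic_paragraph` as shown in the example above.

```python
from pathlib import Path

# File types handled in the example above
SUPPORTED_SUFFIXES = {".pdf", ".html", ".xml"}


def find_articles(folder):
    """Return supported article files under `folder`, sorted by path."""
    return sorted(
        path for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


def synthesis_paragraphs_by_file(folder, text_extractor):
    """Map each article path to the paragraphs flagged as synthesis text.

    `text_extractor` is any object exposing read_file and
    get_synthetic_paragraph.
    """
    results = {}
    for path in find_articles(folder):
        paragraphs = text_extractor.read_file(str(path))
        results[str(path)] = text_extractor.get_synthetic_paragraph(paragraphs)
    return results
```

Passing the `text_extractor` instance created earlier together with a folder path would return a dictionary keyed by file path.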

By default, the paragraph classification model is NN_tfv. The other available models are listed below.

ML Model Performance (5-Fold Cross-Validation Averages)

| Rank | Model | Avg Accuracy | Avg Precision | Notes |
| --- | --- | --- | --- | --- |
| 1 | SVM_tfv | 0.9905 | 0.8163 | Default model |
| 2 | NN_tfv | 0.9903 | 0.8143 | |
| 3 | RF_tfv | 0.9904 | 0.7730 | High accuracy, lower precision |
| 4 | RF_CV | 0.9902 | 0.7692 | Stable but conservative |
| 5 | NN_CV | 0.9889 | 0.8240 | High precision |
| 6 | LR_tfv | 0.9895 | 0.7853 | Fast baseline |
| 7 | LR_CV | 0.9885 | 0.8040 | Balanced baseline |
| 8 | SVM_CV | 0.9885 | 0.8124 | Robust alternative |
| 9 | DT_CV | 0.9865 | 0.7795 | Interpretable |
| 10 | DT_tfv | 0.9851 | 0.7692 | Simple model |
| 11 | NB_CV | 0.9837 | 0.8337 | Highest precision |
| 12 | NB_tfv | 0.9657 | 0.0232 | Not recommended |
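
When choosing among these models, it can help to filter by accuracy first and then compare precision. The sketch below does this with the numbers from the table; the function `best_model` and the threshold are illustrative choices, not part of mofsyncondition.

```python
# (avg accuracy, avg precision) copied from the cross-validation table
MODEL_SCORES = {
    "SVM_tfv": (0.9905, 0.8163),
    "NN_tfv": (0.9903, 0.8143),
    "RF_tfv": (0.9904, 0.7730),
    "RF_CV": (0.9902, 0.7692),
    "NN_CV": (0.9889, 0.8240),
    "LR_tfv": (0.9895, 0.7853),
    "LR_CV": (0.9885, 0.8040),
    "SVM_CV": (0.9885, 0.8124),
    "DT_CV": (0.9865, 0.7795),
    "DT_tfv": (0.9851, 0.7692),
    "NB_CV": (0.9837, 0.8337),
    "NB_tfv": (0.9657, 0.0232),
}


def best_model(scores, min_accuracy=0.0):
    """Return the most precise model among those meeting `min_accuracy`."""
    eligible = {name: s for name, s in scores.items() if s[0] >= min_accuracy}
    if not eligible:
        raise ValueError("no model meets the accuracy threshold")
    return max(eligible, key=lambda name: eligible[name][1])


print(best_model(MODEL_SCORES))
print(best_model(MODEL_SCORES, min_accuracy=0.99))
```

The returned name can be passed as the `model` argument of `get_synthetic_paragraph`.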

To use a different model, pass its name through the model argument of the function, e.g.

   list_of_paragraphs = text_extractor.read_file(xml_file_path)
   synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs, model="NN_CV")

2. Extract paragraph-level synthesis conditions from a file

Suppose you have a document (PDF, HTML, or XML) and wish to extract all synthesis conditions. The code below is the fastest way to do so: it is faster than using transformer models, handles large documents, and can parse thousands of files.

    from mofsyncondition.synthesis_conditions.mof_synthesis_conditions import MOFSynConditionExtractor
    from mofsyncondition.io import filetyper

    data_extractor = MOFSynConditionExtractor()

    transformer_dataset = []
    standard_dataset = []

    # Documents to process; PDF, XML and HTML inputs can be mixed
    all_files = ["./data_test/Test2.pdf", "./data_test/ABAFUH.xml", "./data_test/Test3.html"]
    for file_path in all_files:
        # Each item pairs a synthesis paragraph with its conditions in two output styles
        syn_data = data_extractor.syn_data_from_document(file_path)
        for paragraph, data_style_1, data_style_2 in syn_data:
            transformer_dataset.append({"paragraph": paragraph, "condition": data_style_1})
            standard_dataset.append({"paragraph": paragraph, "condition": data_style_2})

    # Save both datasets to disk
    filetyper.list_2_json(transformer_dataset, "transformer_dataset.jsonl")
    filetyper.list_2_json(standard_dataset, "standard_dataset.json")
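
The exact file layout written by `filetyper.list_2_json` is not described here. If you prefer to control the on-disk format yourself, a minimal standard-library sketch for storing paragraph/condition records as JSON Lines could look like this. The helpers `write_jsonl` and `read_jsonl` are hypothetical and not part of mofsyncondition.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps characters such as the degree sign readable
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

One object per line makes it possible to stream or append records without loading the whole dataset, which is useful when parsing thousands of articles.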

LICENSE

MIT license
