Species Thermodynamics Dataset Generation #12

Open
craabreu wants to merge 2 commits into QuantumPioneer:main from craabreu:thermo

Species Thermodynamics Dataset Generation

Summary

This PR adds a notebook (scripts/thermo/species_thermodynamics.ipynb) that processes quantum mechanical calculation results to generate a species thermodynamics dataset. The notebook extracts, validates, and consolidates thermodynamic properties computed from DFT-optimized geometries with DLPNO single-point energies.

Dataset Output

File: data/thermo/quantumpioneer_species_thermo_dataset.csv

The generated dataset contains thermodynamic properties for unique chemical species, including:

  • Basic Properties: SMILES representation, H298, S298, Cp300
  • Wilhoit Parameters: Cp0, CpInf, a0, a1, a2, a3, H0, S0, B
  • Quantum Chemical Data: DLPNO single-point energies, scaled DFT zero-point energies (ZPE)
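For readers unfamiliar with the Wilhoit columns, the sketch below shows how they could be turned back into a heat capacity at a given temperature. It assumes the RMG-style Wilhoit form, Cp(T) = Cp0 + (CpInf - Cp0) y^2 [1 + (y - 1)(a0 + a1 y + a2 y^2 + a3 y^3)] with y = T / (T + B); the PR itself does not state the functional form, so treat this as an assumption. The function name is hypothetical and is not part of the notebook. H0 and S0 are not needed for Cp and are left out.

```python
def wilhoit_heat_capacity(T, Cp0, CpInf, a0, a1, a2, a3, B):
    """Evaluate Cp(T) from Wilhoit parameters (assumed RMG-style convention).

    T and B share one temperature unit; the result has the units of Cp0/CpInf.
    """
    y = T / (T + B)  # scaled temperature, runs from 0 (T = 0) towards 1
    poly = a0 + a1 * y + a2 * y**2 + a3 * y**3
    return Cp0 + (CpInf - Cp0) * y**2 * (1.0 + (y - 1.0) * poly)


# Placeholder parameters, chosen only to make the arithmetic easy to follow.
cp_at_B = wilhoit_heat_capacity(500.0, Cp0=32.0, CpInf=160.0,
                                a0=0.0, a1=0.0, a2=0.0, a3=0.0, B=500.0)
```

With all a-coefficients set to zero and T equal to B, y is 0.5 and the expression reduces to Cp0 + 0.25 (CpInf - Cp0).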

Processing Workflow

  1. Data Loading

    • Imports raw thermodynamic data from quantum_green_species_data_24august14_dft_opted_dlpno_sp_thermos.csv
    • Loads species reference data from quantum_green_species_data_24march12b.pkl
  2. Data Filtering

    • Filters out species with missing thermodynamic data (p_thermo field)
    • Canonicalizes SMILES representations using RDKit for consistent molecular identification
  3. Wilhoit Parameter Extraction

    • Implements custom parser to extract Wilhoit polynomial parameters from string representations
    • Regex pattern: (\w+)=\(?([-+]?\d+\.?\d*) handles both param=(value,'unit') and param=value formats
    • Extracts 9 Wilhoit parameters per species (Cp0, CpInf, a0-a3, H0, S0, B)
  4. Dataset Assembly

    • Combines molecular identifiers, thermodynamic properties, and Wilhoit parameters
    • Includes DLPNO single-point energies (Hartree)
    • Adds scaled DFT zero-point energies (scaling factor: 0.972387)
  5. Validation

    • Cross-validates against reaction dataset (quantum_green_ts_data_24july2c.pkl)
    • Verifies consistency of DLPNO energies and ZPE values between species and reaction datasets
    • Calculates maximum absolute differences to ensure data integrity
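The validation step can be pictured as follows. This is a minimal sketch, assuming both datasets can be reduced to mappings from canonical SMILES to a scalar (a DLPNO energy or a scaled ZPE); the function name and the numbers are placeholders, not values or code from the notebook.

```python
def max_abs_difference(species_values, reaction_values):
    """Largest |species - reaction| over the SMILES present in both mappings."""
    shared = species_values.keys() & reaction_values.keys()
    if not shared:
        raise ValueError("the two datasets share no species")
    return max(abs(species_values[s] - reaction_values[s]) for s in shared)


# Placeholder numbers for illustration only (not real energies).
species_energy = {"CO": -115.5, "C": -40.25}
reaction_energy = {"CO": -115.5, "C": -40.5, "O": -76.0}

worst = max_abs_difference(species_energy, reaction_energy)
```

Species found in only one dataset ("O" here) are ignored by this sketch; a stricter check could report them separately.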

Technical Details

  • Canonical SMILES: All molecular structures are converted to canonical SMILES with atom map numbers removed, so that each species has a single identifier
  • Data Deduplication: Ensures only unique species are included based on canonical SMILES
  • Progress Tracking: Uses swifter for parallelized pandas operations with progress bars

Key Functions

  • canonical_smiles(): Converts SMILES to canonical form with memoization
  • parse_wilhoit_string(): Extracts numerical parameters from Wilhoit polynomial string representations
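To illustrate the parsing approach, here is a rough sketch of a parser built on the regex quoted in the workflow section. It is not the notebook's parse_wilhoit_string(); the function name, the key filter, and the example string (including its values) are assumptions made for illustration.

```python
import re

# The nine parameters listed in the PR description.
WILHOIT_KEYS = ("Cp0", "CpInf", "a0", "a1", "a2", "a3", "H0", "S0", "B")

# Pattern as quoted in the PR: name, '=', optional '(', then a decimal number.
PATTERN = re.compile(r"(\w+)=\(?([-+]?\d+\.?\d*)")


def parse_wilhoit_sketch(text):
    """Return {parameter: float} for the Wilhoit keys found in `text`."""
    found = {name: float(value) for name, value in PATTERN.findall(text)}
    # Keep only the expected keys, in a fixed order; drop anything else matched.
    return {key: found[key] for key in WILHOIT_KEYS if key in found}


# Hypothetical input mixing the param=(value,'unit') and param=value formats.
example = ("Wilhoit(Cp0=(33.26,'J/(mol*K)'), CpInf=(178.76,'J/(mol*K)'), "
           "a0=-1.5, a1=2.0, a2=-3.25, a3=1.0, H0=(-25.1,'kJ/mol'), "
           "S0=(-150.5,'J/(mol*K)'), B=(500.0,'K'))")
params = parse_wilhoit_sketch(example)
```

One caveat with the pattern as quoted: it has no exponent part, so a value written in scientific notation (for example 1.5e-08) would be captured only up to the "e". If the source strings can contain such values, the pattern would need an optional exponent group.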

Dependencies

  • pandas, swifter (data manipulation)
  • RDKit (molecular structure handling)
  • pathlib, re (file operations and parsing)
