RFC: Refactor DPGEN2 with a new design #185

@link89
Hi community,

This RFC proposes refactoring the DPGEN2 workflow with a new design based on DFlow.

A typical DPGEN2 configuration looks like the following:
https://github.com/deepmodeling/dpgen2/blob/master/examples/chno/input.json

IMHO there are some issues with this configuration:

  1. The context configuration (executor, container, etc.) is mixed with the algorithm configuration.
  2. Such a configuration is hard to validate with a tool like pydantic, which makes it error-prone.
  3. Data files cannot carry their own configuration, which makes it hard to train on different systems at the same time.

A suggested pseudo configuration is shown below; it borrows some ideas from the ai2-kit project.
This design is intended to be more formal and easier to maintain.

# executor configuration
executor:
  bohrium: ...

# dflow configuration for each software
dflow:
  python:
    container: ai2-kit/0.12.10
    python_cmd: python3
  deepmd:
    container: deepmd/2.7.1
    dp_cmd: dp
  lammps:
    container: deepmd/2.7.1
    lammps_cmd: lmp
  cp2k:
    container: cp2k/2023.1
    cp2k_cmd: mpirun cp2k.psmp

# declare file resources as datasets before using them,
# so that we can assign extra attributes to them
datasets:
  dpdata-Ni13Pd12:
    url: /path/to/data
    format: deepmd/npy

  sys-Ni13Pd12:
    url: /path/to/data
    includes: POSCAR*
    format: vasp
    attrs:
      # allow users to define system-wise configuration,
      # so that we can explore multiple types of systems in one iteration
      lammps:
        plumed_config: !load_text plumed.inp # use custom YAML tags to embed data from other files (see the sketch after the configuration)
      cp2k:
        input_template: !load_text cp2k.inp

workflow:
  general:
    type_map: [C, O, H]
    mass_map: [12, 16, 1]
    max_iters: 5

  train:
    deepmd:
      init_dataset: [dpdata-Ni13Pd12]
      input_template: !load_yaml deepmd.json  # embed parsed data from another file via a custom YAML tag

  explore:
    # instead of using `type: lammps` to select different software,
    # use a dedicated entry for each software of the same stage,
    # so that each configuration item can be validated with pydantic,
    # leading to a better code structure:
    # https://github.com/chenggroup/ai2-kit/blob/main/ai2_kit/workflow/cll_mlp.py#L163-L293
    lammps:
      nsteps: 10
      systems: [ sys-Ni13Pd12 ]  # reference dataset via key
      # support different variable combination strategies to avoid combinatorial explosion:
      # vars defined in `explore_vars` combine with the system files via Cartesian product,
      # while vars defined in `broadcast_vars` just broadcast to the system files
      # (useful when there are many files; see the sketch after the configuration)
      explore_vars:
        TEMP: [330, 430, 530]
      broadcast_vars:
        LAMBDA_f: [0.0, 0.25, 0.5, 0.75, 1.0]
      template_vars:
        POST_INIT:  |
          neighbor bin 2.0
      plumed_config: !load_text plumed.inp

  # the select stage is isolated from explore so that more complex structure selection algorithms can be implemented
  select:
    model_devi:
      decent_f: [0.12, 0.18]
    limit: 50

  label:
    cp2k:
      input_template: !load_text cp2k.inp

next:
  # specify configuration overrides for the next iteration;
  # they will be merged with the current configuration to form the configuration for the next round
  config: !load_yaml iter-001.yml
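
As a side note, the `!load_text` / `!load_yaml` tags used above are straightforward to support with PyYAML custom constructors. The sketch below is only an illustration of the idea, not part of the proposal; the function names are hypothetical, and resolving relative paths against the including file's directory is an assumption, not a decided behavior.

import os
import yaml

def make_loader(base_dir):
    # a SafeLoader subclass with the custom tags registered
    class ConfigLoader(yaml.SafeLoader):
        pass

    def load_text(loader, node):
        # embed the raw content of another file as a string
        path = os.path.join(base_dir, loader.construct_scalar(node))
        with open(path) as f:
            return f.read()

    def load_yaml(loader, node):
        # embed the parsed content of another YAML (or JSON) file
        path = os.path.join(base_dir, loader.construct_scalar(node))
        with open(path) as f:
            return yaml.load(f, Loader=make_loader(os.path.dirname(path)))

    ConfigLoader.add_constructor('!load_text', load_text)
    ConfigLoader.add_constructor('!load_yaml', load_yaml)
    return ConfigLoader

def load_config(path):
    with open(path) as f:
        return yaml.load(f, Loader=make_loader(os.path.dirname(path)))

Similarly, the `explore_vars` / `broadcast_vars` semantics described in the explore section could be implemented with itertools. The exact broadcast rule is an open detail; the sketch below assumes round-robin assignment of broadcast values over the generated tasks.

import itertools

def expand_tasks(systems, explore_vars, broadcast_vars):
    # explore_vars combine with the systems via Cartesian product
    names = list(explore_vars)
    tasks = [
        {'system': system, **dict(zip(names, values))}
        for system in systems
        for values in itertools.product(*explore_vars.values())
    ]
    # broadcast_vars do not multiply the task count: the i-th task
    # gets the (i mod len(values))-th value (assumed rule)
    for name, values in broadcast_vars.items():
        for i, task in enumerate(tasks):
            task[name] = values[i % len(values)]
    return tasks

# 1 system x 3 temperatures -> 3 tasks, instead of 3 x 5 = 15
tasks = expand_tasks(
    systems=['sys-Ni13Pd12'],
    explore_vars={'TEMP': [330, 430, 530]},
    broadcast_vars={'LAMBDA_f': [0.0, 0.25, 0.5, 0.75, 1.0]},
)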

The above configuration is easy to validate with pydantic, for example:
https://github.com/chenggroup/ai2-kit/blob/main/ai2_kit/workflow/cll_mlp.py#L32-L111
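
A minimal sketch of what such a model could look like, using field names from the pseudo configuration above (illustrative only, assuming pydantic; the real schema would cover all stages):

from typing import Dict, List, Optional
from pydantic import BaseModel

class LammpsExploreConfig(BaseModel):
    nsteps: int
    systems: List[str]                    # dataset keys declared under `datasets`
    explore_vars: Dict[str, list] = {}    # combined via Cartesian product
    broadcast_vars: Dict[str, list] = {}  # broadcast, no product
    template_vars: Dict[str, str] = {}
    plumed_config: Optional[str] = None   # already inlined by !load_text

class ExploreConfig(BaseModel):
    # each software gets its own typed entry instead of a generic `type` switch
    lammps: Optional[LammpsExploreConfig] = None

class GeneralConfig(BaseModel):
    type_map: List[str]
    mass_map: List[float]
    max_iters: int

class WorkflowConfig(BaseModel):
    general: GeneralConfig
    explore: ExploreConfig
    # train / select / label models omitted for brevity

With such models, a missing required field or a value of the wrong type fails fast when the configuration is loaded, e.g. via WorkflowConfig(**config['workflow']), instead of surfacing in the middle of an iteration.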

I believe a better configuration design will lead to a better software design.
I am posting my thoughts here for the community to review, and any feedback would be appreciated.
