RFC: Refactor DPGEN2 with a new design #185

@link89
Hi community,

This RFC proposes refactoring the DPGEN2 workflow with a new design based on DFlow.

A typical DPGEN2 configuration looks like the following:
https://github.com/deepmodeling/dpgen2/blob/master/examples/chno/input.json

IMHO there are some issues with this configuration:

  1. The context configuration (executor, container, etc.) is mixed with the algorithm configuration.
  2. Such a configuration is hard to validate with a tool like pydantic, which makes it error-prone.
  3. Data files cannot carry their own configuration, which makes it hard to train on different systems at the same time.

A suggested pseudo configuration is shown below; it borrows some ideas from the ai2-kit project.
This design is intended to be more formal and easier to maintain.

# executor configuration
executor:
  bohrium: ...

# dflow configuration for each software
dflow:
  python:
    container: ai2-kit/0.12.10
    python_cmd: python3
  deepmd:
    container: deepmd/2.7.1
    dp_cmd: dp
  lammps:
    container: deepmd/2.7.1
    lammps_cmd: lmp
  cp2k:
    container: cp2k/2023.1
    cp2k_cmd: mpirun cp2k.psmp

# declare file resources as datasets before using them,
# so that we can assign extra attributes to them
datasets:
  dpdata-Ni13Pd12:
    url: /path/to/data
    format: deepmd/npy

  sys-Ni13Pd12:
    url: /path/to/data
    includes: POSCAR*
    format: vasp
    attrs:
      # allow users to define system-wise configuration,
      # so that we can explore multiple types of systems in one iteration
      lammps:
        plumed_config: !load_text plumed.inp # use custom YAML tags to embed data from other files (see the sketch after the configuration)
      cp2k:
        input_template: !load_text cp2k.inp

workflow:
  general:
    type_map: [C, O, H]
    mass_map: [12, 16, 1]
    max_iters: 5

  train:
    deepmd:
      init_dataset: [dpdata-Ni13Pd12]
      input_template: !load_yaml deepmd.json  # embed parsed data from another file via a custom YAML tag

  explore:
    # instead of using `type: lammps` to select different software,
    # use a dedicated entry for each software of the same stage,
    # so that each configuration item can be validated with pydantic,
    # leading to a better code structure:
    # https://github.com/chenggroup/ai2-kit/blob/main/ai2_kit/workflow/cll_mlp.py#L163-L293
    lammps:
      nsteps: 10
      systems: [ sys-Ni13Pd12 ]  # reference dataset via key
      # support different variable combination strategies to avoid combinatorial explosion:
      # vars defined in `explore_vars` combine with the system files via Cartesian product,
      # while vars defined in `broadcast_vars` just broadcast to the system files
      # (useful when there are many files; see the sketch after the configuration)
      explore_vars:
        TEMP: [330, 430, 530]
      broadcast_vars:
        LAMBDA_f: [0.0, 0.25, 0.5, 0.75, 1.0]
      template_vars:
        POST_INIT:  |
          neighbor bin 2.0
      plumed_config: !load_text plumed.inp

  # the select stage is isolated from explore so that more complex structure selection algorithms can be implemented
  select:
    model_devi:
      decent_f: [0.12, 0.18]
    limit: 50

  label:
    cp2k:
      input_template: !load_text cp2k.inp

next:
  # specify configuration overrides for the next iteration;
  # they will be merged with the current configuration to form the configuration for the next round
  config: !load_yaml iter-001.yml
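
As a side note, the `!load_text` / `!load_yaml` tags used above are straightforward to support with PyYAML custom constructors. The sketch below is only an illustration of the idea, not part of the proposal; the function names are hypothetical, and resolving relative paths against the including file's directory is an assumption, not a decided behavior.

import os
import yaml

def make_loader(base_dir):
    # a SafeLoader subclass with the custom tags registered
    class ConfigLoader(yaml.SafeLoader):
        pass

    def load_text(loader, node):
        # embed the raw content of another file as a string
        path = os.path.join(base_dir, loader.construct_scalar(node))
        with open(path) as f:
            return f.read()

    def load_yaml(loader, node):
        # embed the parsed content of another YAML (or JSON) file
        path = os.path.join(base_dir, loader.construct_scalar(node))
        with open(path) as f:
            return yaml.load(f, Loader=make_loader(os.path.dirname(path)))

    ConfigLoader.add_constructor('!load_text', load_text)
    ConfigLoader.add_constructor('!load_yaml', load_yaml)
    return ConfigLoader

def load_config(path):
    with open(path) as f:
        return yaml.load(f, Loader=make_loader(os.path.dirname(path)))

Similarly, the `explore_vars` / `broadcast_vars` semantics described in the explore section could be implemented with itertools. The exact broadcast rule is an open detail; the sketch below assumes round-robin assignment of broadcast values over the generated tasks.

import itertools

def expand_tasks(systems, explore_vars, broadcast_vars):
    # explore_vars combine with the systems via Cartesian product
    names = list(explore_vars)
    tasks = [
        {'system': system, **dict(zip(names, values))}
        for system in systems
        for values in itertools.product(*explore_vars.values())
    ]
    # broadcast_vars do not multiply the task count: the i-th task
    # gets the (i mod len(values))-th value (assumed rule)
    for name, values in broadcast_vars.items():
        for i, task in enumerate(tasks):
            task[name] = values[i % len(values)]
    return tasks

# 1 system x 3 temperatures -> 3 tasks, instead of 3 x 5 = 15
tasks = expand_tasks(
    systems=['sys-Ni13Pd12'],
    explore_vars={'TEMP': [330, 430, 530]},
    broadcast_vars={'LAMBDA_f': [0.0, 0.25, 0.5, 0.75, 1.0]},
)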

The above configuration is easy to validate with pydantic, for example:
https://github.com/chenggroup/ai2-kit/blob/main/ai2_kit/workflow/cll_mlp.py#L32-L111
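
A minimal sketch of what such a model could look like, using field names from the pseudo configuration above (illustrative only, assuming pydantic; the real schema would cover all stages):

from typing import Dict, List, Optional
from pydantic import BaseModel

class LammpsExploreConfig(BaseModel):
    nsteps: int
    systems: List[str]                    # dataset keys declared under `datasets`
    explore_vars: Dict[str, list] = {}    # combined via Cartesian product
    broadcast_vars: Dict[str, list] = {}  # broadcast, no product
    template_vars: Dict[str, str] = {}
    plumed_config: Optional[str] = None   # already inlined by !load_text

class ExploreConfig(BaseModel):
    # each software gets its own typed entry instead of a generic `type` switch
    lammps: Optional[LammpsExploreConfig] = None

class GeneralConfig(BaseModel):
    type_map: List[str]
    mass_map: List[float]
    max_iters: int

class WorkflowConfig(BaseModel):
    general: GeneralConfig
    explore: ExploreConfig
    # train / select / label models omitted for brevity

With such models, a missing required field or a value of the wrong type fails fast when the configuration is loaded, e.g. via WorkflowConfig(**config['workflow']), instead of surfacing in the middle of an iteration.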

I believe a better configuration design will lead to a better software design.
I am posting my thoughts here for the community to review, and any feedback would be appreciated.
