
Help for deepmd for MD simulation #1684

@Luckaswww

Description

Summary

I'm running molecular dynamics simulations with a frozen model; the work consists of NEB calculations and plotting the resulting energy curves. Call my initial state A, the intermediate state B, and the final state C. I performed the A→B and A→C NEB calculations on cluster M, then moved everything to cluster N because my permissions on M expired. As a consistency check, I repeated the A→B and A→C calculations on cluster N. The results show a slight difference between the energy curves, with a 0.2 eV difference at the transition state. Is this caused by the cluster? On both cluster M and cluster N, DeePMD-kit is v2.2.6-1-g174f204a.
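To make the cross-cluster comparison concrete, one can reference each curve to its own initial image (so constant energy offsets cancel) and locate the image with the largest disagreement. A minimal sketch; the function name and the numbers below are illustrative, not taken from the attached calculation:

```python
import numpy as np

def max_curve_difference(e_a, e_b):
    """Return (max abs difference in eV, index of the worst image) between
    two NEB energy curves for the same path, after shifting each curve so
    its initial image sits at zero (constant offsets cancel)."""
    e_a = np.asarray(e_a, dtype=float)
    e_b = np.asarray(e_b, dtype=float)
    d = (e_a - e_a[0]) - (e_b - e_b[0])
    i = int(np.argmax(np.abs(d)))
    return abs(float(d[i])), i

# Illustrative per-image energies (eV) from clusters M and N:
e_cluster_m = [0.00, 0.85, 1.50, 0.90, 0.10]
e_cluster_n = [0.00, 0.85, 1.70, 0.90, 0.10]
diff, idx = max_curve_difference(e_cluster_m, e_cluster_n)
print(diff, idx)  # here the worst image is the transition state (index 2)
```

If the worst disagreement is confined to one image, re-evaluating that single geometry with the same frozen model on both clusters would show whether the model inference itself differs or the NEB optimization simply converged to slightly different paths.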

In addition, I hit MPI errors when I tried to run the B→C calculation on cluster N. I consulted the cluster's administrator, and his testing suggests the problem may lie with DeePMD-kit itself. I really don't know what is causing it or how to fix it, so I am opening this issue to ask for help. The error output follows, and an example is attached.

DP-GEN Version

dpgen==0.12.1

Platform, Python Version, etc

python==3.12.4
dpdata==0.2.19
dpdispatcher==0.6.5
deepmd-kit==2.2.6

Details

Error (1)

WARNING: There was an error initializing an OpenFabrics device.

Local host: cn9
Local device: hfi1_0


MPI_ABORT was invoked on rank 12 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[cn9:47344] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[cn9:47344] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn9:47344] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
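The Error (1) messages come from Open MPI failing to initialize the OpenFabrics (openib) transport on cn9, which usually points to an MPI/fabric configuration problem rather than a DeePMD-kit bug. A common workaround is to exclude that transport before launching; this is a sketch under the assumption that your job uses Open MPI, not a confirmed fix (note also that `hfi1_0` is an Omni-Path device, which normally uses PSM2 rather than openib, so please confirm with your cluster admin):

```shell
# Tell Open MPI not to try the openib byte-transfer layer on nodes
# without usable OpenFabrics hardware/drivers (a workaround, not a fix).
export OMPI_MCA_btl='^openib'
# Then relaunch the job as before, e.g. (launcher and input names assumed):
# mpirun -np 16 lmp -in in.neb
echo "OMPI_MCA_btl=$OMPI_MCA_btl"
```

If the job then runs cleanly, the 0.2 eV question and the MPI aborts can be investigated separately.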

Error (2)

2024-12-05 18:51:06.726189: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.726267: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.725967: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5

Error (3)

2024-12-05 15:11:25.911162: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 15:11:25.912789: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffd3910cfd0, scount=4, MPI_DOUBLE, rbuf=0x55dc221066c0, rcount=4, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(216): Failure during collective
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffcb0654ed0, scount=1, MPI_DOUBLE, rbuf=0x56528cc45a80, rcount=1, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(108):
MPIC_Sendrecv(340)..........................:
MPIDI_CH3U_Request_unpack_uebuf(516)........: Message truncated; 128 bytes received but buffer size is 32
MPIR_Allgather_intra_recursive_doubling(108):
MPIDI_CH3U_Receive_data_found(131)..........: Message from rank 2 and tag 7 truncated; 256 bytes received but buffer size is 128
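One thing worth checking before blaming DeePMD-kit: Error (1) is Open MPI output, while the `MPIR_`/`MPIDI_CH3` frames in Error (3) look like MPICH-family internals, so the failing runs may be mixing two MPI implementations (e.g. a binary built against one MPI but launched with another's `mpirun`). A quick diagnostic sketch; the binary name `lmp` is an assumption, replace it with your actual executable:

```shell
# Which MPI launcher is first on PATH?
command -v mpirun >/dev/null && mpirun --version | head -n 1
# Which MPI library does the simulation binary actually load?
BIN="$(command -v lmp || true)"   # 'lmp' is assumed; use your real executable name
[ -n "$BIN" ] && ldd "$BIN" | grep -i mpi
status=done
echo "$status"
```

If the launcher and the library come from different MPI implementations, collectives like `MPI_Allgather` can fail in exactly this "message truncated" fashion; rebuilding or re-running with one consistent MPI stack would rule that out.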
