Description
Summary
I am running dynamics simulations with a frozen model; the workflow consists of NEB calculations and energy-curve plotting. Let the initial state be A, an intermediate state B, and the final state C. I performed the A-->B and A-->C NEB calculations on cluster M. Because my permissions on that cluster expired, I moved everything to cluster N and, to check for errors, repeated the A-->B and A-->C NEB calculations there. The resulting energy curves differ slightly, with an energy difference of 0.2 eV at the transition state. Is this difference caused by the cluster? On cluster M, deepmd is v2.2.6-1-g174f204a; on cluster N, deepmd is also v2.2.6-1-g174f204a.
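For reference, this is how I compare the barriers from the two runs. The image energies below are placeholder numbers chosen only to illustrate a 0.2 eV transition-state difference, not my actual NEB results:

```python
import numpy as np

# Hypothetical NEB image energies (eV) along the same path, one array per
# cluster; real values would be read from the two NEB outputs.
energies_cluster_m = np.array([0.00, 0.35, 0.80, 1.10, 0.90, 0.40])
energies_cluster_n = np.array([0.00, 0.34, 0.78, 1.30, 0.95, 0.42])

# Barrier = transition-state (maximum) energy relative to the initial image.
barrier_m = energies_cluster_m.max() - energies_cluster_m[0]
barrier_n = energies_cluster_n.max() - energies_cluster_n[0]

print(f"barrier on M: {barrier_m:.2f} eV")
print(f"barrier on N: {barrier_n:.2f} eV")
print(f"difference:   {abs(barrier_m - barrier_n):.2f} eV")
```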
In addition, an MPI error occurs when I try to run a B-->C calculation on cluster N. I consulted the cluster administrator, and his tests suggest the problem may lie in deepmd itself. I do not know what is causing it or how to fix it, so I am opening this issue to ask for help. The error output is below, and an example is attached.
DP-GEN Version
dpgen==0.12.1
Platform, Python Version, etc
python=3.12.4
dpdata==0.2.19
dpdispatcher==0.6.5
deepmd-kit==2.2.6
Details
Error (1):
WARNING: There was an error initializing an OpenFabrics device.
Local host: cn9
Local device: hfi1_0
MPI_ABORT was invoked on rank 12 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
[cn9:47344] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[cn9:47344] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn9:47344] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
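For context, a commonly suggested workaround for this OpenFabrics initialization warning is to tell Open MPI to skip the openib BTL; I have not verified that this fixes the abort on my cluster, and the launch command below is only a placeholder, not my actual command:

```shell
# OMPI_MCA_btl is a standard Open MPI MCA environment variable;
# "^openib" excludes the openib component so Open MPI falls back
# to other transports (e.g. TCP). Whether this helps depends on
# the cluster's interconnect.
export OMPI_MCA_btl="^openib"
mpirun -np 16 lmp_mpi -in in.neb   # hypothetical launch command
```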
Error (2):
2024-12-05 18:51:06.726189: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.726267: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 18:51:06.725967: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
Error (3):
2024-12-05 15:11:25.911162: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2024-12-05 15:11:25.912789: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
Fatal error in PMPI_Allgather: Other MPI error, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffd3910cfd0, scount=4, MPI_DOUBLE, rbuf=0x55dc221066c0, rcount=4, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(216): Failure during collective
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(421).........................: MPI_Allgather(sbuf=0x7ffcb0654ed0, scount=1, MPI_DOUBLE, rbuf=0x56528cc45a80, rcount=1, datatype=MPI_DOUBLE, comm=comm=0x84000006) failed
MPIR_Allgather_impl(239)....................:
MPIR_Allgather_intra_auto(145)..............: Failure during collective
MPIR_Allgather_intra_auto(141)..............:
MPIR_Allgather_intra_recursive_doubling(108):
MPIC_Sendrecv(340)..........................:
MPIDI_CH3U_Request_unpack_uebuf(516)........: Message truncated; 128 bytes received but buffer size is 32
MPIR_Allgather_intra_recursive_doubling(108):
MPIDI_CH3U_Receive_data_found(131)..........: Message from rank 2 and tag 7 truncated; 256 bytes received but buffer size is 128