Open
Description
ValueError: processor in session meta is not valid: <ErSessionMeta(id=202410250415447511850_nn_0_0_guest_10000, name=, status=KILLED, tag=, processors=[***, len=4], options=[{'eggroll.rollpair.inmemory_output': 'True', 'python.path': '/data/projects/fate/fate/python:/data/projects/fate/fate/python:/data/projects/fate/fateflow/python:/data/projects/fate/eggroll/python', 'eggroll.session.deploy.mode': 'cluster', 'eggroll.session.processors.per.node': '4', 'python.venv': '/data/projects/fate/common/python/venv'}]) at 0x7f14a43997c0>
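Environment: FATE 1.11.3. Running a custom model hits this error most of the time.
Running flow test toy -gid 10000 -hid 10000 hits the same error, but only very rarely.
Training sometimes completes successfully.
Error in clustermanager.jvm.err.log: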
[ERROR][2124508][2024-10-25 04:10:46,885][grpc-server-4670-24,pid:3120,tid:113][c.w.e.c.e.h.DefaultLoggingErrorHandler:144] -
java.lang.reflect.InvocationTargetException: null
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_345]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_345]
at com.webank.eggroll.core.command.CommandRouter$$anonfun$register$3.apply(CommandRouter.scala:130) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandRouter$$anonfun$register$3.apply(CommandRouter.scala:124) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandService.com$webank$eggroll$core$command$CommandService$$run$body$1(CommandService.scala:47) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandService$$anonfun$1.run(CommandService.scala:41) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43) [eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41) [eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:257) [eggroll-core-2.5.2.jar:?]
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) [grpc-stub-1.55.1.jar:1.55.1]
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) [grpc-core-1.55.1.jar:1.55.1]
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) [grpc-core-1.55.1.jar:1.55.1]
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.55.1.jar:1.55.1]
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [grpc-core-1.55.1.jar:1.55.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_345]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_345]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_345]
Caused by: com.webank.eggroll.core.error.ErSessionException: unable to start all processors for session id: '202410250359237753070_eval_0_0_host_10000'. Please check corresponding bootstrap logs at '/data/logs/fate/eggroll/202410250359237753070_eval_0_0_host_10000' to check the reasons. Details:
=================
total processors: 4,
started count: 0,
not started count: 4,
current active processors per node: Map(192.168.71.121 -> 0),
not started processors and their nodes: Map(218 -> 192.168.71.121, 220 -> 192.168.71.121, 217 -> 192.168.71.121, 219 -> 192.168.71.121)
at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSessionOld(SessionManager.scala:493) ~[eggroll-core-2.5.2.jar:?]
at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSession(SessionManager.scala:342) ~[eggroll-core-2.5.2.jar:?]
... 19 more
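The exception above points at the per-session bootstrap log directory. As a way to gather what those logs say, here is a minimal sketch that prints the last lines of every file under that directory. It assumes the logs are plain text files; the helper name tail_bootstrap_logs and the max_lines parameter are hypothetical, not part of FATE or eggroll, and the file names inside the directory were not checked.

```python
from pathlib import Path


def tail_bootstrap_logs(log_dir, max_lines=20):
    """Return {file path: last `max_lines` lines} for every regular file under log_dir.

    Hypothetical helper for illustration; it only reads files and does not
    call any FATE or eggroll API.
    """
    root = Path(log_dir)
    if not root.is_dir():
        # Directory missing (e.g. the processors never got far enough to create it).
        return {}
    tails = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # errors="replace" keeps going if a log contains undecodable bytes.
            text = path.read_text(errors="replace")
            tails[str(path)] = text.splitlines()[-max_lines:]
    return tails


if __name__ == "__main__":
    # Path copied from the ErSessionException message above.
    session_dir = "/data/logs/fate/eggroll/202410250359237753070_eval_0_0_host_10000"
    for name, lines in tail_bootstrap_logs(session_dir).items():
        print(f"==> {name} <==")
        print("\n".join(lines))
```

This would need to be run on the node listed in the exception (192.168.71.121), since that is where the four processors failed to start.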
Is this a resource problem or a network problem?