processor in session meta is not valid #731

@sealofyou

ValueError: processor in session meta is not valid: <ErSessionMeta(id=202410250415447511850_nn_0_0_guest_10000, name=, status=KILLED, tag=, processors=[***, len=4], options=[{'eggroll.rollpair.inmemory_output': 'True', 'python.path': '/data/projects/fate/fate/python:/data/projects/fate/fate/python:/data/projects/fate/fateflow/python:/data/projects/fate/eggroll/python', 'eggroll.session.deploy.mode': 'cluster', 'eggroll.session.processors.per.node': '4', 'python.venv': '/data/projects/fate/common/python/venv'}]) at 0x7f14a43997c0>

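FATE 1.11.3. With a custom model, this error occurs with high probability.
With flow test toy -gid 10000 -hid 10000, this error occurs only very rarely.
Sometimes training succeeds.
Error from clustermanager.jvm.err.log: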

[ERROR][2124508][2024-10-25 04:10:46,885][grpc-server-4670-24,pid:3120,tid:113][c.w.e.c.e.h.DefaultLoggingErrorHandler:144] -
java.lang.reflect.InvocationTargetException: null
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_345]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_345]
        at com.webank.eggroll.core.command.CommandRouter$$anonfun$register$3.apply(CommandRouter.scala:130) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandRouter$$anonfun$register$3.apply(CommandRouter.scala:124) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:139) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandService.com$webank$eggroll$core$command$CommandService$$run$body$1(CommandService.scala:47) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandService$$anonfun$1.run(CommandService.scala:41) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43) [eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41) [eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:257) [eggroll-core-2.5.2.jar:?]
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) [grpc-stub-1.55.1.jar:1.55.1]
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) [grpc-core-1.55.1.jar:1.55.1]
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) [grpc-core-1.55.1.jar:1.55.1]
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.55.1.jar:1.55.1]
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [grpc-core-1.55.1.jar:1.55.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_345]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_345]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_345]
Caused by: com.webank.eggroll.core.error.ErSessionException: unable to start all processors for session id: '202410250359237753070_eval_0_0_host_10000'. Please check corresponding bootstrap logs at '/data/logs/fate/eggroll/202410250359237753070_eval_0_0_host_10000' to check the reasons. Details:
=================
total processors: 4,
started count: 0,
not started count: 4,
current active processors per node: Map(192.168.71.121 -> 0),
not started processors and their nodes: Map(218 -> 192.168.71.121, 220 -> 192.168.71.121, 217 -> 192.168.71.121, 219 -> 192.168.71.121)
        at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSessionOld(SessionManager.scala:493) ~[eggroll-core-2.5.2.jar:?]
        at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSession(SessionManager.scala:342) ~[eggroll-core-2.5.2.jar:?]
        ... 19 more

Is this a resource problem or a network problem?
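
For triage, the "Details" block of the ErSessionException above can be condensed into a small summary: how many processors were requested, how many started, which node each unstarted processor was assigned to, and which bootstrap log directory the message points at. The following is a minimal sketch, not an EggRoll tool. The function name `summarize_session_failure` and the `SAMPLE` string are hypothetical, and the parsing assumes the message keeps the exact wording shown in the log above. It does not decide between a resource and a network cause; it only makes the numbers easier to compare across runs.

```python
import re

# Excerpt copied from the clustermanager.jvm.err.log output quoted in this issue.
SAMPLE = """Caused by: com.webank.eggroll.core.error.ErSessionException: unable to start all processors for session id: '202410250359237753070_eval_0_0_host_10000'. Please check corresponding bootstrap logs at '/data/logs/fate/eggroll/202410250359237753070_eval_0_0_host_10000' to check the reasons. Details:
=================
total processors: 4,
started count: 0,
not started count: 4,
current active processors per node: Map(192.168.71.121 -> 0),
not started processors and their nodes: Map(218 -> 192.168.71.121, 220 -> 192.168.71.121, 217 -> 192.168.71.121, 219 -> 192.168.71.121)
"""


def summarize_session_failure(log_text):
    """Summarize the 'Details' block of an ErSessionException message.

    Hypothetical helper: it assumes the field labels appear at the start
    of a line, exactly as in the log quoted in this issue.
    """

    def grab_int(label):
        # Anchor at line start so "started count" does not match
        # inside "not started count".
        m = re.search(rf"^\s*{re.escape(label)}:\s*(\d+)", log_text, re.MULTILINE)
        return int(m.group(1)) if m else None

    summary = {
        "total": grab_int("total processors"),
        "started": grab_int("started count"),
        "not_started": grab_int("not started count"),
    }

    # Group unstarted processor ids by the node they were assigned to.
    by_node = {}
    m = re.search(r"not started processors and their nodes:\s*Map\((.*?)\)", log_text)
    if m:
        for proc_id, host in re.findall(r"(\d+)\s*->\s*([\d.]+)", m.group(1)):
            by_node.setdefault(host, []).append(int(proc_id))
    summary["not_started_by_node"] = by_node

    # Directory the exception message says to inspect for bootstrap logs.
    m = re.search(r"bootstrap logs at '([^']+)'", log_text)
    summary["bootstrap_log_dir"] = m.group(1) if m else None
    return summary


if __name__ == "__main__":
    for key, value in summarize_session_failure(SAMPLE).items():
        print(f"{key}: {value}")
```

For the log in this issue, the summary shows that none of the 4 requested processors started and that all of them were assigned to the single node 192.168.71.121. The bootstrap log directory named in the message is the next place to look.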