
multi-node support #12

@astooke


Starting a new issue in reference to the question in #11 (comment).

I have not experimented with running Synkhronos multi-node. Currently it's only built for single-node. Running multi-node would require another layer to coordinate and communicate among nodes. That certainly sounds possible, with a separate instance of the current Synkhronos running on each node. I haven't put a lot of thought into this yet, because my current research is well-suited to running single-node.
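To make the "another layer" idea concrete, here is a rough sketch of a two-stage reduction: sum gradients within each node first, then combine the per-node sums across nodes, then hand the result back to every GPU. This is plain Python over lists, not Synkhronos code; the function names and the nested-list layout are assumptions made up for illustration, and a real version would replace each stage with actual intra-node and inter-node communication.

```python
def elementwise_sum(vectors):
    """Sum a list of equal-length vectors element by element."""
    return [sum(vals) for vals in zip(*vectors)]


def hierarchical_allreduce_mean(nodes):
    """Average gradients over all GPUs in two stages (hypothetical helper).

    nodes[n][g] is the gradient vector (list of floats) held by GPU g on node n.
    Returns the same nested layout, with every GPU holding the global mean.
    """
    # Stage 1: each node reduces over its own GPUs (single-node step).
    node_sums = [elementwise_sum(gpus) for gpus in nodes]

    # Stage 2: the extra coordination layer reduces the per-node sums.
    total = elementwise_sum(node_sums)
    num_gpus = sum(len(gpus) for gpus in nodes)
    mean = [v / num_gpus for v in total]

    # Stage 3: broadcast the result back to every GPU on every node.
    return [[list(mean) for _ in gpus] for gpus in nodes]


if __name__ == "__main__":
    # Two nodes with two GPUs each, one 2-element gradient per GPU.
    nodes = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]],
    ]
    print(hierarchical_allreduce_mean(nodes))
```

The point of splitting it this way is that stage 1 is what a single-node instance already does, so only stages 2 and 3 would be new.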

Apparently the new version of NCCL, version 2, supports inter-node communication. I have not tried it yet (Synkhronos is currently built on version 1). Synkhronos uses NCCL through libgpuarray and pygpu, and I'm not sure what the compatibility status is through that chain.

Note that a key to scaling well to 256 GPUs in the large minibatch ResNet paper is to start communicating gradients as soon as they are computed, layer by layer, so that communication overlaps with the rest of the backpropagation.
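The overlap pattern can be sketched roughly as follows. This is a toy simulation using a background thread and a queue, not the paper's implementation and not anything Synkhronos does today: `allreduce_mean` is a stand-in for a real collective, and the gradients are passed in precomputed rather than produced by an actual backward pass. All names here are hypothetical.

```python
import queue
import threading


def allreduce_mean(grads_per_worker):
    """Stand-in for a collective: element-wise mean across workers."""
    n = len(grads_per_worker)
    return [sum(vals) / n for vals in zip(*grads_per_worker)]


def backward_with_overlap(layer_grads_per_worker):
    """Reduce each layer's gradient while later backprop steps continue.

    layer_grads_per_worker[w][l] is worker w's gradient for layer l.
    Returns a list with one averaged gradient per layer, in layer order.
    """
    num_layers = len(layer_grads_per_worker[0])
    pending = queue.Queue()
    reduced = {}

    def communicator():
        # Runs concurrently with the loop below; None is the stop signal.
        while True:
            item = pending.get()
            if item is None:
                break
            layer, grads = item
            reduced[layer] = allreduce_mean(grads)

    comm = threading.Thread(target=communicator)
    comm.start()

    # Backprop visits layers from last to first. Each gradient is handed
    # off as soon as it is "ready", and the loop moves on without waiting.
    for layer in reversed(range(num_layers)):
        grads = [worker[layer] for worker in layer_grads_per_worker]
        pending.put((layer, grads))

    pending.put(None)
    comm.join()  # all reductions are finished once the thread exits
    return [reduced[layer] for layer in range(num_layers)]


if __name__ == "__main__":
    # Two workers, two layers each.
    workers = [
        [[1.0, 2.0], [3.0]],
        [[3.0, 4.0], [5.0]],
    ]
    print(backward_with_overlap(workers))
```

In a real setup the hand-off would launch an asynchronous collective on the GPU rather than a Python thread, but the structure is the same: the last layers' gradients are already in flight while the earlier layers are still being differentiated.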

I'd be curious to hear if you try anything!

Have you tried any other packages or libraries for running multi-GPU, e.g. TensorFlow, PyTorch, or Chainer? If so, how does using them compare to Synkhronos?
