Multi-GPU multi-node GNN training modes

I have a question about DGL's distributed training modes. I want to run my DGL code on a cluster of multiple nodes to perform strong-scaling experiments.

DistGraph has standalone and distributed modes, and the documentation associates standalone mode with a single node (machine) and distributed mode with multiple nodes. For standalone mode, it further clarifies that the entire graph (i.e., a single partition) is stored in one server process, which the trainers (clients) access. In distributed mode, multiple partitions of the graph are spread across the designated server processes (one process per node, with #partitions == #server-processes?).
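To make my mental model concrete, here is a minimal sketch of how I believe the two modes are selected, based on the DGL 0.8.x examples. The `DGL_DIST_MODE` environment variable, the graph name, and the file paths are assumptions on my part:

```python
import os
import dgl

# Standalone mode (my understanding): server and trainers run inside a single
# process on one machine, and the whole graph is one partition.
# DGL_DIST_MODE and the paths below are assumptions on my part.
os.environ["DGL_DIST_MODE"] = "standalone"
dgl.distributed.initialize(ip_config="ip_config.txt")
g = dgl.distributed.DistGraph("my_graph", part_config="data/my_graph.json")

# Distributed mode: DGL's launch tool would set DGL_DIST_MODE=distributed and
# start one or more server processes per machine, each holding one partition;
# trainers then connect via the addresses listed in ip_config.txt.
```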

Questions:

  1. Is my understanding correct?

  2. In standalone mode, the clients (trainers) fetch data from the default DGL server process, which holds the entire graph in a single process. Can the clients span more than one node?

  3. In distributed mode, there are partitions of the graph, but I wanted to clarify how one would launch a distributed training job on a cluster. The instructions here (Distributed Node Classification — DGL 0.8.2post1 documentation) assume a distributed system is already set up, but I was looking for specific instructions for running on a cluster; a sketch of my current understanding follows this list.
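For reference, here is what I have pieced together from the DGL distributed examples: partition the graph offline first, then use DGL's `tools/launch.py` to start servers and trainers on every machine listed in an IP config file. The graph, paths, and process counts below are placeholders I made up for illustration:

```python
import dgl
import torch as th

# A toy graph stands in for the real dataset (assumption for illustration).
g = dgl.rand_graph(1000, 5000)
g.ndata["feat"] = th.randn(1000, 16)

# Step 1: partition the graph offline, once, on a single machine.
dgl.distributed.partition_graph(
    g,
    graph_name="my_graph",
    num_parts=2,      # one partition per machine in a 2-node cluster
    out_path="data",  # writes data/my_graph.json plus one folder per partition
)

# Step 2 (shell, not Python): start servers and trainers on every machine in
# ip_config.txt with DGL's launch script, e.g.:
#
#   python tools/launch.py \
#     --workspace /path/to/workspace \
#     --num_trainers 4 \
#     --num_samplers 1 \
#     --num_servers 1 \
#     --part_config data/my_graph.json \
#     --ip_config ip_config.txt \
#     "python train_dist.py --graph_name my_graph --ip_config ip_config.txt"
#
# ip_config.txt lists one IP address per machine. On a batch-scheduled cluster
# (e.g., Slurm), I assume one would generate ip_config.txt from the allocated
# node list before invoking launch.py.
```

Is this roughly the right workflow for a cluster, and how does it map onto a batch scheduler?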

Hi, for questions regarding distributed training, please check out GraphStorm (GitHub - awslabs/graphstorm: Enterprise graph machine learning framework for billion-scale graphs for ML scientists and data scientists), which takes care of many of the chores of DistDGL. You could post your questions there.
