I have a question about DGL's distributed training modes. I want to run my DGL code on a multi-node cluster for strong-scaling experiments.
DistGraph has a standalone mode and a distributed mode, and the documentation associates standalone with a single node (machine) and distributed with multiple nodes. For standalone mode, it further clarifies that the entire graph (i.e., a single partition) is stored in a server process, from which trainers (clients) access the graph. In distributed mode, multiple partitions of the graph are distributed across the designated server processes (one server process per node, with #partitions == #server-processes?).
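For concreteness, here is a minimal sketch of the offline partitioning step as I understand it to precede distributed mode (the toy graph, the graph name 'toy', and the output path are placeholders I made up, not from the docs):

```python
# Minimal sketch of the offline partitioning step; all names/paths are placeholders.
import dgl
import torch

g = dgl.rand_graph(100, 1000)            # stand-in for the real dataset
g.ndata['feat'] = torch.randn(100, 16)   # node features travel with the partitions

# One partition per machine in the cluster: with 4 machines, num_parts=4,
# which is where my "#partitions == #server-processes" question comes from.
dgl.distributed.partition_graph(
    g,
    graph_name='toy',
    num_parts=4,
    out_path='4part_data',     # writes 4part_data/toy.json plus per-partition dirs
    part_method='metis',       # METIS edge-cut partitioning (the default)
)
```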
Questions:
- Is my understanding correct?
- In standalone mode, the clients (trainers) fetch data from the default DGL server process, which holds the entire graph in a single process. Can the clients span beyond a single node? (A sketch of how I currently run standalone mode follows this list.)
- In distributed mode, the graph is split into partitions, but I wanted to clarify how one would launch a distributed training job on a cluster. The instructions here (Distributed Node Classification — DGL 0.8.2post1 documentation) assume a distributed setup, but I am looking for specific instructions for running on a cluster. (My best guess at the launch command is sketched below.)
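For reference, this is roughly how I run standalone mode today; a minimal sketch assuming a single-partition dataset named 'toy' (all paths and names are placeholders):

```python
# Standalone-mode sketch: the script is run directly (not via a launch tool),
# so DGL falls back to standalone mode and no separate server process is spawned;
# this one trainer process holds the entire graph.
import dgl

dgl.distributed.initialize('ip_config.txt')   # required argument even in standalone mode

# part_config points at a single-partition dataset (num_parts=1 during partitioning)
g = dgl.distributed.DistGraph('toy', part_config='1part_data/toy.json')
print(g.num_nodes())   # the full graph is visible to this single client
```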
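And this is my best guess at the cluster launch, pieced together from the tutorial: launch.py is run on one node and, as far as I can tell, SSHes into every host listed in ip_config.txt to spawn the server and trainer processes. The workspace path, the script name train_dist.py, and the graph name 'toy' are all assumptions on my part:

```python
# My best guess at the cluster launch, adapted from the tutorial's launch tool.
# Paths and the training script name below are placeholders, not from the docs.
import subprocess

subprocess.run(
    [
        "python3", "dgl/tools/launch.py",
        "--workspace", "/home/user/workspace",   # directory visible on every node
        "--num_trainers", "1",                   # trainer processes per machine
        "--num_samplers", "0",                   # sampler processes per trainer
        "--num_servers", "1",                    # server processes per machine
        "--part_config", "4part_data/toy.json",  # from the partitioning step above
        "--ip_config", "ip_config.txt",          # one host IP per line, one per machine
        # command that launch.py starts on every machine:
        "python3 train_dist.py --graph_name toy --ip_config ip_config.txt",
    ],
    check=True,
)
```

Is this the intended way to launch on a cluster, or is there a recommended path for clusters with a scheduler rather than passwordless SSH?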