How to get a computation graph for distributed GNNs training script in DGL?

Hello everyone!

I have a question regarding the computation graph in a distributed script. Is it possible to obtain the computation graph for the entire process, starting from the data loader and mini-batch generation, all the way to gradient aggregation? I’m particularly interested in understanding the flow of operations and dependencies throughout the entire distributed training process, not just the computation (forward/backward) part in Graph Neural Networks (GNNs).

My goal is to accelerate GNN training time by implementing task placement and online scheduling.

Thank you!

Hi @tariqaf, which script do you use to run distributed training?

Sorry for the late response, I was sleeping.

I use the following script

In short, distributed training applies torch.nn.parallel.DistributedDataParallel to the model, just as in standard PyTorch, even though the graph is partitioned into several parts. What DGL additionally provides is splitting the graph and its associated feature data across multiple machines, plus support for accessing them concurrently. So obtaining the computation graph for distributed training should be quite similar to obtaining it for ordinary training with DDP. Do you have any experience with that?
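If it helps, below is a minimal sketch (not taken from any DGL script) of one way to capture the per-step operation trace of a DDP-wrapped model with `torch.profiler`. The trace would include data preparation, forward, backward, and the gradient all-reduce that DDP inserts. The `nn.Linear` model, random tensors, port number, and log directory are placeholders I made up; in a real job they would be the GNN model, the DGL `DistDataLoader` mini-batches, and your own paths, and the job would be launched with the usual distributed launcher rather than a single-process `gloo` group.

```python
# Hedged sketch: profile one training loop of a DDP-wrapped model.
# Placeholders (assumptions, not DGL APIs): the Linear "model", random data,
# the TCP port, and the "./prof_logs" directory.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler


def main():
    # Single-process "gloo" group so the example runs standalone;
    # a real distributed job would be launched with torchrun / launch.py.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
    )
    model = DDP(nn.Linear(16, 2))      # placeholder for the real GNN model
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    with profile(
        activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=tensorboard_trace_handler("./prof_logs"),
        record_shapes=True,
        with_stack=True,
    ) as prof:
        for step in range(8):
            x = torch.randn(32, 16)          # stands in for a sampled mini-batch
            y = torch.randint(0, 2, (32,))
            loss = loss_fn(model(x), y)      # forward
            opt.zero_grad()
            loss.backward()                  # backward + DDP gradient all-reduce
            opt.step()
            prof.step()                      # advance the profiler schedule

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The resulting trace can be inspected in TensorBoard or exported as a Chrome trace to see operator dependencies and timing. For just the autograd (forward/backward) graph of a single iteration, `torchviz.make_dot(loss)` is another option, though it does not cover sampling, data loading, or gradient synchronization.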


Thank you so much for your valuable answer. I don't have any experience with DDP training, but now I can look into whether it is the same. Thanks!
