How to get a computation graph for distributed GNNs training script in DGL?

Hello everyone!

I have a question regarding the computation graph in a distributed script. Is it possible to obtain the computation graph for the entire process, starting from the data loader and mini-batch generation, all the way to gradient aggregation? I’m particularly interested in understanding the flow of operations and dependencies throughout the entire distributed training process, not just the computation (forward/backward) part in Graph Neural Networks (GNNs).

My goal is to accelerate GNN training time by implementing task placement and online scheduling.

Thank you!

Hi @tariqaf, which script do you use to run distributed training?

Sorry for the late response, I was sleeping.

I use the following script

In short, distributed training applies torch.nn.parallel.DistributedDataParallel to the model, just as in standard PyTorch, even though the graph is partitioned into several parts. What DGL additionally provides is splitting the graph and its associated feature data across multiple machines, plus support for accessing them concurrently. So obtaining the computation graph for distributed training should be quite similar to obtaining it for ordinary training with DDP. Do you have any experience with that?
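If it helps, below is a minimal sketch (not taken from any DGL script) of one way to capture the per-step operation trace of a DDP-wrapped model with `torch.profiler`. The trace would include data preparation, forward, backward, and the gradient all-reduce that DDP inserts. The `nn.Linear` model, random tensors, port number, and log directory are placeholders I made up; in a real job they would be the GNN model, the DGL `DistDataLoader` mini-batches, and your own paths, and the job would be launched with the usual distributed launcher rather than a single-process `gloo` group.

```python
# Hedged sketch: profile one training loop of a DDP-wrapped model.
# Placeholders (assumptions, not DGL APIs): the Linear "model", random data,
# the TCP port, and the "./prof_logs" directory.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler


def main():
    # Single-process "gloo" group so the example runs standalone;
    # a real distributed job would be launched with torchrun / launch.py.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
    )
    model = DDP(nn.Linear(16, 2))      # placeholder for the real GNN model
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    with profile(
        activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=tensorboard_trace_handler("./prof_logs"),
        record_shapes=True,
        with_stack=True,
    ) as prof:
        for step in range(8):
            x = torch.randn(32, 16)          # stands in for a sampled mini-batch
            y = torch.randint(0, 2, (32,))
            loss = loss_fn(model(x), y)      # forward
            opt.zero_grad()
            loss.backward()                  # backward + DDP gradient all-reduce
            opt.step()
            prof.step()                      # advance the profiler schedule

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The resulting trace can be inspected in TensorBoard or exported as a Chrome trace to see operator dependencies and timing. For just the autograd (forward/backward) graph of a single iteration, `torchviz.make_dot(loss)` is another option, though it does not cover sampling, data loading, or gradient synchronization.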


Thank you so much for your valuable answer. I don't have any experience with DDP training, but now I can look into whether it is the same. Thanks!
