Hello everyone!
I have a question regarding the computation graph in a distributed script. Is it possible to obtain the computation graph for the entire process, starting from the data loader and mini-batch generation, all the way to gradient aggregation? I’m particularly interested in understanding the flow of operations and dependencies throughout the entire distributed training process, not just the computation (forward/backward) part in Graph Neural Networks (GNNs).
My goal is to accelerate GNN training by implementing task placement and online scheduling.
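For context, here is a rough sketch of the kind of end-to-end task graph I have in mind. This is plain Python with placeholder stage names of my own (not an actual framework API), just to illustrate modeling the full pipeline as a DAG and deriving an execution order for scheduling:

```python
# Hypothetical sketch: the end-to-end training pipeline as a task DAG.
# Stage names and dependencies are illustrative placeholders, not a real API.
from collections import defaultdict, deque

# Each edge (u, v) means task v depends on task u.
edges = [
    ("load_batch", "sample_subgraph"),   # data loading -> neighbor sampling
    ("sample_subgraph", "forward"),      # mini-batch forward pass
    ("forward", "backward"),             # backward pass
    ("backward", "grad_allreduce"),      # gradient aggregation across workers
    ("grad_allreduce", "optimizer_step"),
]

def topo_order(edges):
    """Kahn's algorithm: one valid execution order of the task graph."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return order

print(topo_order(edges))
```

A scheduler could then assign each node in that order to a device or worker, which is roughly the placement decision I want to optimize.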
Thank you!