DGL classic GCN Computation

Hi,
I tried to train the GCN example from the GitHub repository (“dgl\examples\pytorch\gat”) and it works fine. Now I tried launching the training several times in parallel (2 or 3 times) and, unlike my Torch experiments, the execution time of the 3 runs is about 3 times longer than expected. Why is that?

Single training on a Quadro P1000:

  • 250/250 [00:03<00:00, 76.79it/s]

Launched 2 times in parallel:

  • 250/250 [00:05<00:00, 43.31it/s]
  • 250/250 [00:05<00:00, 43.95it/s]

The computation time is twice that of the single training, but when they run in parallel on a single GPU, shouldn’t it be about the same?

Thanks for your help,

  • dgl\examples\pytorch\gat is an example for Graph Attention Network (GAT).
  • What do you mean by “Torch experiments”?
  • By “2 or 3 times”, did you mean multi-GPU training?
  • What’s the source of the expected execution time?
  • How did you get the time?

Hi Mufeili, thanks for your answer.

Yes, I meant dgl\examples\pytorch\gcn (I also tried gat and made the same observation).

  • By “torch experiments”, I mean a toy GCN model extracted from a PyTorch example on GitHub, which I launched once on one GPU (gpu:0) and then twice in parallel on the same GPU (gpu:0).

  • What’s the source of the expected execution time?
    I compare the execution times of:

    • A single training executed on GPU (gpu:0): 250/250 [00:03<00:00, 76.79it/s]
    • Two trainings executed in parallel on GPU (gpu:0): 250/250 [00:05<00:00, 43.31it/s]; 250/250 [00:05<00:00, 43.95it/s]
  • By “2 or 3 times”, did you mean multi-GPU training?
    When I said 2 or 3 times, I meant that I ran “python train.py --n-epoch 250” 3 times on the same GPU (gpu:0), so the models are independent and execute in parallel on the same GPU (gpu:0). I measure the time with tqdm in the epoch loop (see the sketch after this list).
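
For context, here is a minimal sketch of the timing setup described above, assuming a standard DGL-style training loop. The names `model`, `g`, `features`, `labels`, and `train_mask` are placeholders, not the exact variables from the example script:

```python
# Hedged sketch of timing the epoch loop with tqdm, as described above.
# `model(g, features)` assumes a DGL-style forward signature; adjust to the actual script.
import torch
import torch.nn.functional as F
from tqdm import tqdm

def train(model, g, features, labels, train_mask, n_epochs=250, lr=1e-2):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # tqdm wraps the epoch loop, which produces the "250/250 [00:03<00:00, 76.79it/s]"
    # readouts quoted in this thread.
    for _ in tqdm(range(n_epochs)):
        model.train()
        logits = model(g, features)
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```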

So my problem is: why, when I execute 2 GCN trainings with the DGL library on the same GPU, is the execution time twice as long as for a single one, while with the same model (GCN) in pure PyTorch the execution time is the same whether I launch 1 or 2 trainings in parallel on the same GPU?

Maybe it is something related to CUDA stream properties? I read that these are not currently handled by DGL?
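
For reference, this is what explicitly placing work on a non-default CUDA stream looks like in plain PyTorch; whether DGL’s kernels honour a user-selected stream is exactly the question raised above, so treat this as an illustration rather than a workaround:

```python
# Illustration of CUDA streams in plain PyTorch (not DGL-specific).
import torch

assert torch.cuda.is_available()
stream = torch.cuda.Stream()                 # a non-default stream on the current device
x = torch.randn(4096, 4096, device="cuda")

with torch.cuda.stream(stream):              # kernels launched here are enqueued on `stream`
    y = x @ x

torch.cuda.synchronize()                     # wait for all streams before reading `y`
print(y.shape)
```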

Thanks a lot

One possibility is that DGL achieves higher GPU utilization, and as a result the two experiments compete with each other when sharing the same GPU. Do you have the numbers for the pure PyTorch experiments?
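
If it helps, one way to check this hypothesis is to poll GPU utilization while a single DGL training runs: if one run already keeps the GPU near 100%, a second run can only time-slice with it. A diagnostic sketch using the `pynvml` package (my own assumption, not part of the example scripts):

```python
# Diagnostic sketch: sample the utilization of gpu:0 while a training runs in another process.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # gpu:0, as in the experiments above
for _ in range(30):                             # sample roughly once per second for 30 s
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU: {util.gpu}%  memory: {util.memory}%")
    time.sleep(1)
pynvml.nvmlShutdown()
```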

Hi,

Yes, it’s approximately the same: ~3 s, ~80 it/s. And in pure PyTorch it stays the same (~3 s, ~80 it/s) when I run 3 parallel trainings on the same GPU.
Is there a way to partition GPU utilization with DGL so that the trainings do not compete with each other on the same GPU?

Thanks,

Updating the thread: this bug still occurs in the latest version.
It seems to come from the CUDA integration in the back-end, where the competition mentioned previously occurs. Since last December, I have not found any solution other than not running two DGL trainings on the same machine.

Thanks,
