Will multi-GPU training bring efficiency improvements to MAG240M's baseline code?

In my environment (the machine has 6 Tesla T4 GPUs and about 300 GB of main memory), one epoch of the baseline code takes about 30 minutes on a single card.
I also noticed this issue on GitHub, which says the biggest bottleneck is numpy's memmap data loading.
So I want to know: in this case, will multi-GPU training bring a performance improvement?
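For context, the feature access pattern the issue refers to looks roughly like this; the path and the `gather_minibatch_feats` helper are just placeholders for illustration, not the exact code from the baseline script:

```python
import numpy as np

FEAT_PATH = "full_feat.npy"  # hypothetical path to the prepared feature file

# Memory-map the full node-feature matrix (~244M rows x 768 float16 dims);
# nothing is read into RAM until rows are actually indexed.
feats = np.load(FEAT_PATH, mmap_mode="r")

def gather_minibatch_feats(node_ids):
    # Fancy indexing pulls each selected row from disk (or the OS page
    # cache). With the random node IDs produced by neighbor sampling, these
    # scattered reads dominate the epoch time, no matter how fast the GPU is.
    return np.asarray(feats[node_ids], dtype=np.float32)
```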

Another question:
When I run a multi-GPU training demo like this, memory usage grows roughly linearly with the number of cards used. I assume the reason is that the process attached to each GPU reads its own copy of the data (the full graph), but I am not sure about this, so please correct me if I'm wrong. Is there a way to share a single full graph across all processes?
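To make the concern concrete, this is roughly the per-process loading pattern I have in mind (paths and structure are hypothetical, not taken from the actual demo):

```python
import dgl
import numpy as np
import torch.multiprocessing as mp

GRAPH_PATH = "graph.dgl"     # hypothetical paths, for illustration only
FEAT_PATH = "full_feat.npy"

def worker(rank, world_size):
    # Loading the graph structure here gives every process its own copy,
    # which is what would make host memory grow with the number of GPUs.
    (g,), _ = dgl.load_graphs(GRAPH_PATH)

    # A read-only memory map, by contrast, is backed by the OS page cache,
    # so the mapped feature pages are shared across all worker processes.
    feats = np.load(FEAT_PATH, mmap_mode="r")

    # ... per-rank sampler, model and training loop would go here ...

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```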

In my environment, the answer to question 1 should be yes, because with Tesla T4s most of the time is spent in computation (here is the timing test done by BarclayII: Running time difference in "DGL Baseline Code for MAG240M" · Issue #2823 · dmlc/dgl · GitHub).
I used 4 GPUs for training, and the per-epoch time dropped from about 27 minutes to about 11 minutes.
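For reference, the setup is conceptually like the sketch below: one DDP process per GPU, each training on its own shard of the seed nodes. API names are as in DGL 0.6.x; `build_model`, the fan-outs, and the `feat`/`label` fields are placeholders, and this is a simplified sketch rather than the exact code I ran:

```python
import dgl
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size, g, train_nids, build_model):
    # One process per GPU; NCCL handles the gradient all-reduce.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            world_size=world_size, rank=rank)
    dev = torch.device(f"cuda:{rank}")
    torch.cuda.set_device(dev)

    # Give each rank a disjoint shard of the training seed nodes.
    shard = train_nids[rank::world_size]
    sampler = dgl.dataloading.MultiLayerNeighborSampler([25, 15])
    loader = dgl.dataloading.NodeDataLoader(
        g, shard, sampler, batch_size=1024, shuffle=True,
        drop_last=False, num_workers=4)

    model = DDP(build_model().to(dev), device_ids=[dev])
    opt = torch.optim.Adam(model.parameters(), lr=0.001)

    for input_nodes, seeds, blocks in loader:
        blocks = [b.to(dev) for b in blocks]
        x = blocks[0].srcdata["feat"]    # assumes features are stored on the graph
        y = blocks[-1].dstdata["label"]
        loss = F.cross_entropy(model(blocks, x), y)
        opt.zero_grad()
        loss.backward()                  # gradients are averaged across the GPUs here
        opt.step()
```

Each rank would be launched with `torch.multiprocessing.spawn`; the numbers above come from the actual example code, not from this sketch.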

For the second question:
When I use multi-GPU training, memory usage is about 130 GB and does not vary much with the number of GPUs used. In other words, it seems that each process does not hold its own copy of the data.
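One way to sanity-check that the data is not duplicated is to compare RSS with PSS per worker; here is a minimal sketch using psutil (the `report_host_memory` helper is just illustrative):

```python
import os
import psutil

def report_host_memory(rank):
    # RSS counts shared, file-backed pages in every process that touches
    # them, so summing RSS over 4 workers overstates the real footprint.
    # PSS (Linux-only) splits shared pages among the processes sharing them
    # and is the fairer number to compare against the ~130 GB total.
    info = psutil.Process(os.getpid()).memory_full_info()
    print(f"[rank {rank}] rss={info.rss / 1e9:.1f} GB  pss={info.pss / 1e9:.1f} GB")
```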

That’s fantastic. Would you mind making a pull request with your multi-GPU training code?

@BarclayII, of course not. Here is the PR: [example] multi gpus training for mag240m by maqy1995 · Pull Request #2835 · dmlc/dgl · GitHub

