In my environment (the machine has 6 Tesla T4 GPUs and about 300 GB of main memory), one epoch of the baseline code takes about 30 minutes on a single GPU.
I also noticed this issue on GitHub, which said the biggest bottleneck is numpy's memmap data loading.
So I want to know: in this case, will multi-GPU training bring a performance improvement?
Another question:
When I run the multi-GPU training demo like this, memory usage increases linearly with the number of GPUs used. I believe the reason is that the process for each GPU reads its own copy of the data (the full graph), but I am not sure about this now; please correct me if I'm wrong. Is there a way to keep only one copy of the full graph shared by all processes? A rough sketch of what I have in mind is below.
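This is just a minimal sketch of the kind of sharing I'm asking about, not taken from the baseline code: load the large array once, put it in shared memory, and hand the same handle to every per-GPU worker spawned with `torch.multiprocessing`. The names `run_worker` and the random feature tensor are hypothetical placeholders for however the baseline actually loads the full graph.

```python
import torch
import torch.multiprocessing as mp


def run_worker(rank, world_size, shared_feats):
    # Each spawned process receives a handle to the same shared-memory
    # tensor, so no per-process copy of the full feature matrix is made.
    torch.cuda.set_device(rank)
    print(f"rank {rank}: ptr={shared_feats.data_ptr()}, shape={tuple(shared_feats.shape)}")
    # ... set up DDP and train on this rank's shard ...


if __name__ == "__main__":
    world_size = 6  # one process per T4 in my setup
    # Placeholder for loading the full-graph features once in the parent.
    feats = torch.randn(1_000_000, 128)
    feats.share_memory_()  # move the storage into shared memory once
    mp.spawn(run_worker, args=(world_size, feats), nprocs=world_size)
```

Would something along these lines work here, or does the framework already provide a supported way to do this?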