PyTorch DataParallel with GAT

I want to use the DataParallel API for multi-GPU training. I only use the GAT model for knowledge augmentation, so there are many other computations in my model besides GAT.
To parallelize GAT, I simply wrote this in PyTorch: gat = DataParallel(gat)

It then hits what looks like a first-dimension (batch dimension) mismatch:
dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 400 and 800 instead.

My batch size is 4, and the number of nodes per graph is 200.

I think this is because dgl.batch merges the small graphs in a batch into one big graph, and PyTorch DataParallel treats that big graph as having batch size 1. The big graph has 800 nodes and is not split across GPUs for parallelism. However, the corresponding features are scattered across the 2 GPUs, so each GPU gets batch size 2, i.e. 400 node features.
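Here is a minimal sketch of what I think is happening (the edge count and feature dimension are made up, and the DataParallel call is only shown in comments):

```python
import torch
import dgl

# 4 graphs of 200 nodes each, matching my setup
graphs = [dgl.rand_graph(200, 1000) for _ in range(4)]
bg = dgl.batch(graphs)            # one merged graph with 4 * 200 = 800 nodes
feats = torch.randn(4, 200, 64)   # node features, first dim = batch size 4

print(bg.num_nodes())             # 800
# gat = torch.nn.DataParallel(gat); out = gat(bg, feats)
# DataParallel replicates bg (it is not a tensor, so it is not split) to both GPUs,
# but scatters feats along dim 0, so each replica sees 2 * 200 = 400 node features
# against a graph that still has 800 nodes -> the DGLError above.
```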
How can I fix this problem? Is there another way to batch graphs?

I used 2 GPUs for training.

Hi,

DGL graphs cannot work with DataParallel. DGL works with DistributedDataParallel instead (see https://pytorch.org/tutorials/intermediate/dist_tuto.html). Essentially, you need to partition the dataset yourself and spawn one process per GPU, each working on its own partition.
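A rough sketch of that setup, assuming your GAT's forward takes (graph, feats) and returns one prediction per graph, and that the dataset is a list of (graph, feats, label) samples (build_gat, my_dataset, and the simple batching loop below are placeholders, not part of your code):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
import dgl

def run(rank, world_size, dataset, build_gat):
    # One process per GPU; `build_gat` is a factory that constructs your GAT module.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}")
    torch.cuda.set_device(device)

    model = DDP(build_gat().to(device), device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters())

    # Manual partition: this process only ever sees samples rank, rank + world_size, ...
    part = dataset[rank::world_size]
    batch_size = 4
    for i in range(0, len(part), batch_size):
        chunk = part[i:i + batch_size]
        bg = dgl.batch([g for g, _, _ in chunk]).to(device)   # batch graphs inside this process
        feats = torch.cat([f for _, f, _ in chunk]).to(device)
        labels = torch.stack([y for _, _, y in chunk]).to(device)

        logits = model(bg, feats)   # assumes forward(graph, feats) -> per-graph logits
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()             # DDP averages gradients across the processes
        optimizer.step()

    dist.destroy_process_group()

# Launch one process per GPU, e.g. for your 2-GPU case:
# mp.spawn(run, args=(2, my_dataset, build_gat), nprocs=2)
```

Because each process batches its own graphs and runs the full forward/backward pass locally, the batched graph and its features always live on the same GPU, so the node/feature count mismatch you saw with DataParallel does not occur.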