Hi,
I'm trying to do mixed-precision training of a GNN over a large number of small graphs in a distributed setting, using torch.distributed on two Tesla V100 GPUs in a single node with NVIDIA's Apex (https://github.com/NVIDIA/apex), but I'm running into an error in the DGL backend. The same code runs successfully in a distributed setting with Horovod (https://github.com/horovod/horovod), but fails with torch.distributed and Apex. The full stack trace is below. Apologies if this isn't the appropriate place to ask.
```
Traceback (most recent call last):
  File "train_apex.py", line 475, in <module>
    main()
  File "train_apex.py", line 256, in main
    train_loader, model, criterion, optimizer, epoch, evaluation)
  File "train_apex.py", line 336, in train
    output = model(g)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/parallel/distributed.py", line 476, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 203, in new_fwd
    output = old_fwd(*applier(args, input_caster),
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 48, in applier
    return type(value)(applier(v, fn) for v in value)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 48, in <genexpr>
    return type(value)(applier(v, fn) for v in value)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 44, in applier
    return fn(value)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 32, in to_type
    return t.to(dtype)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/dgl/graph.py", line 3346, in to
    self.ndata[k] = F.copy_to(self.ndata[k], ctx)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/dgl/backend/pytorch/tensor.py", line 81, in copy_to
    if ctx.device.type == 'cpu':
AttributeError: 'torch.dtype' object has no attribute 'device'
```
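One workaround I'm considering, in case that diagnosis is right, is to hide the graph from the caster by passing it inside a plain container that has no `.to` method and unwrapping it in `forward`. A hypothetical sketch (the `GraphContainer` name is mine, not a DGL or Apex API, and I haven't verified this against Apex):

```python
class GraphContainer:
    """Plain holder with no `.to` method, so a caster that checks
    hasattr(value, 'to') should leave the wrapped graph untouched.
    Hypothetical sketch; not part of DGL or Apex."""
    def __init__(self, graph):
        self.graph = graph

# In the model, unwrap before message passing, e.g.:
# def forward(self, container):
#     g = container.graph  # graph keeps its original device/precision
#     ...
```

Would something like this be the recommended approach, or is there a supported way to exclude the graph argument from Apex's casting?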
Thanks