Mixed precision training of GNNs using apex

skrsna · June 27, 2019, 7:03pm

Hi,
I’m trying to do mixed precision training of GNN with a large number of small graphs in distributed setting using torch.distributed on two tesla V100 GPUs in single node using NVIDIA’s Apex (https://github.com/NVIDIA/apex) but I’m running into error on dgl backend. The code successfully runs in distributed setting when using horovod (https://github.com/horovod/horovod) but fails in torch.distributed and apex. The full stack trace is below. Sorry, if this is not the appropriate place to ask this.

Traceback (most recent call last):
  File "train_apex.py", line 475, in <module>
    main()
  File "train_apex.py", line 256, in main
    train_loader, model, criterion, optimizer, epoch, evaluation)
  File "train_apex.py", line 336, in train
    output = model(g)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/parallel/distributed.py", line 476, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 203, in new_fwd
    output = old_fwd(*applier(args, input_caster),
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 48, in applier
    return type(value)(applier(v, fn) for v in value)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 48, in <genexpr>
    return type(value)(applier(v, fn) for v in value)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 44, in applier
    return fn(value)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/apex/amp/_initialize.py", line 32, in to_type
    return t.to(dtype)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/dgl/graph.py", line 3346, in to
    self.ndata[k] = F.copy_to(self.ndata[k], ctx)
  File "/home/sirumalla.s/anaconda3/envs/ddgl/lib/python3.6/site-packages/dgl/backend/pytorch/tensor.py", line 81, in copy_to
    if ctx.device.type == 'cpu':
AttributeError: 'torch.dtype' object has no attribute 'device'

Thanks

VoVAllen · June 28, 2019, 7:41am

It seems apex will convert all variable passed into forward function to certain mixed precisio. But it expect all variable are pytorch tensors, and seems you passed a DGLGraph into the model. And here apex tried to call DGLGraph.to(_some_mixed_precision_type), but we only support DGLGraph.to(device).
I’m not familiar to apex. You can extend our .to operation to convert all the ndata and edata to certain type. However, there might be further problems at somewhere else, since this is not a scenario expected by apex.

minjie · June 28, 2019, 11:05pm

One side note: our message passing kernels currently do not support half precision. It should be fine if your mixed precision training does not involve that.