Hi DGL Developers!
I am a new user of the Deep Graph Library. Thanks for the great Python package! I ran into a problem that I would like to clarify/raise:
I am using a DGL Heterogeneous Graph with a PyTorch back-end. I want to train a model on a heterogeneous graph whose custom message passing functions are essentially neural networks. Since the networks are heavy, I have to train on multiple GPUs with PyTorch’s multiprocessing wrapper. I would like to have one global graph network and several local graph networks (one local network in each process) that update the global network (aka Hogwild), with all the graph networks defined on GPU. I have tried the following two variants:
- For a simpler setup (exactly as I described above), I used a DGL Homogeneous Graph and it is working great. I have 2 local networks each training on 3 GPUs and the global network also on a GPU.
- If I train DGL Heterogeneous Graphs (completely) on multiple CPU cores, it works. I use the default ‘fork’ start method for multiprocessing.
But I identified that DGL Heterogeneous Graphs are failing to build when defined inside a torch.multiprocessing.Process (graph = dgl.heterograph(graph_struct_info_dict).to('cuda:0')) when I’m trying to use GPUs, with the error: ValueError: bad value(s) in fds_to_keep.
I read in this release note on heterogeneous graphs that Knowledge Graph Models currently only support multiprocessing on CPUs. Is this also true for DGL Heterogeneous Graphs? Moreover, training on GPUs with PyTorch’s multiprocessing wrapper requires the ‘spawn’ start method.
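For reference, here is a minimal sketch of what selecting the ‘spawn’ start method looks like. It uses Python's standard multiprocessing module so that it runs without PyTorch, DGL, or a GPU; torch.multiprocessing is documented as a drop-in replacement with the same interface. The names worker and run_workers are hypothetical, and the graph construction is only indicated in a comment, so this is an illustration of the start-method mechanics rather than a reproduction of (or a fix for) the error above.

```python
import multiprocessing as mp  # torch.multiprocessing exposes the same interface


def worker(rank, queue):
    """Hypothetical per-process entry point.

    In the real setup the graph would be built here, inside the child, e.g.
        graph = dgl.heterograph(graph_struct_info_dict).to('cuda:0')
    This sketch only reports the rank so it runs without DGL or a GPU.
    """
    queue.put(rank)


def run_workers(num_workers=2):
    """Start num_workers processes with the 'spawn' start method."""
    # A context object selects the start method locally, without changing
    # the global default (which is 'fork' on Linux for Python 3.7).
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(rank, queue))
        for rank in range(num_workers)
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so the children can exit cleanly.
    ranks = sorted(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return ranks
```

With ‘spawn’, each child re-imports the main module, so run_workers() should be called from a script under an `if __name__ == "__main__":` guard, and worker must be defined at module top level so it can be pickled.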
If this is currently not supported, could support for training on GPUs with multiprocessing kindly be added to DGL Heterogeneous Graphs soon? For reference, here are my current version details:
- Python Version - 3.7.4
- DGL - 0.4.3post2
- PyTorch - 1.4.0
- CUDA - 10.1
- Nvidia Driver Version - 430.64
Thanks a lot in advance!