Training DGL Heterogeneous Graphs on GPUs with Multiprocessing

Hi DGL Developers!

I am a new user of the Deep Graph Library. Thanks for the great Python package! I have a problem that I would like to clarify and raise:

I am using a DGL Heterogeneous Graph with a PyTorch back-end. I want to train a model on a heterogeneous graph whose custom message passing functions are essentially neural networks. Since the networks are heavy, I have to train on multiple GPUs with PyTorch’s multiprocessing wrapper. I would like to have a global graph network and several local graph networks (one local network in each process) that update the global network (aka Hogwild), with all the graph networks defined on GPU. I have tried the following two variants:

  1. For a simpler setup (exactly as I described above), I used a DGL Homogeneous Graph and it is working great. I have 2 local networks each training on 3 GPUs and the global network also on a GPU.
  2. If I train DGL Heterogeneous Graphs (completely) on multiple CPU cores, it works. I use the default ‘fork’ start method for multiprocessing.

However, DGL Heterogeneous Graphs fail to build when they are created inside a torch.multiprocessing.Process and moved to a GPU (graph = dgl.heterograph(graph_struct_info_dict).to('cuda:0')). The error is: ValueError: bad value(s) in fds_to_keep.

I read in this release note on heterogeneous graphs that Knowledge Graph Models currently only support multiprocessing on CPUs. Is this also true for DGL Heterogeneous Graphs? Note that for training on GPUs with PyTorch’s multiprocessing wrapper, the ‘spawn’ start method is required.

If this is currently not supported, could support for training on GPUs with multiprocessing be added to DGL Heterogeneous Graphs soon? For reference, here are my current version details:

  1. Python Version - 3.7.4
  2. DGL - 0.4.3post2
  3. PyTorch - 1.4.0
  4. CUDA - 10.1
  5. Nvidia Driver Version - 430.64

Thanks a lot in advance!


May I ask why you asked the question through a personal message instead of a public question? Is there any sensitive information in this post?

So in your case, do you have multiple graphs? Are they dynamically created? Can you try sending the graph on CPU and calling to("cuda") in the subprocess? We have examples that use multiprocessing, though with sampling, which you can find at

Hi @VoVAllen,

Thanks for your reply! There is no particular reason why I asked the question privately. I am new to these discussion forums and did not notice that I was posting privately.

I tried what you suggested, but it is still throwing the exact same error.

Also, it works perfectly fine when I train on a CPU with the default fork start method. It throws the error, though, when I use the spawn start method, regardless of CPU or GPU. As you know, the spawn start method is not necessary for training on CPUs but is required for training on GPUs, so calling mp.set_start_method('spawn') in the master/main process cannot be avoided when using GPUs.
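One way to check whether the start method alone is responsible is to run the same worker under a local multiprocessing context instead of a global mp.set_start_method('spawn') call. The sketch below is only an illustration using the standard library; choose_start_method, worker, and run_workers are hypothetical names, and the worker body is a placeholder rather than the actual training code.

```python
import multiprocessing as mp
import os


def choose_start_method(use_gpu):
    """Hypothetical helper: 'spawn' for GPU runs, otherwise the platform default."""
    if use_gpu:
        return "spawn"
    # get_all_start_methods() lists the platform default first.
    return mp.get_start_method(allow_none=True) or mp.get_all_start_methods()[0]


def worker(rank, queue):
    # Placeholder for the real training loop: report the rank and the child's PID.
    queue.put((rank, os.getpid()))


def run_workers(method, num_workers=2):
    # A local context leaves the global start method untouched.
    ctx = mp.get_context(method)
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(rank, queue))
             for rank in range(num_workers)]
    for p in procs:
        p.start()
    # The timeout keeps the parent from hanging if a child fails to start.
    results = sorted(queue.get(timeout=30) for _ in procs)
    for p in procs:
        p.join()
    return [rank for rank, _ in results]
```

With spawn, the call has to sit under an `if __name__ == "__main__":` guard in a script file, for example `run_workers(choose_start_method(use_gpu=True))`. Running the same worker with "fork" and with "spawn" helps show whether a failure follows the start method or the device.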

Is this error caused by the ‘spawn’ start method? If not, what else could be causing it?


I’ve filed an issue on GitHub. We may need some time to investigate this problem.

I also just made this post public, so that other people with a similar problem can find it.


Passing CUDA tensors between processes is tricky. We have one viable example at

Current workarounds:

  • Use fork instead of spawn.
  • Do not pass CUDA tensors between processes.
  • Initialize only one GPU per process, which you can do with th.cuda.set_device(dev_id). This avoids the CUDA context reinitialization error.

For more details, please refer to the example above. We will keep investigating how to make this work with spawn. Feel free to ask if you still have any questions.
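A minimal sketch of the three workarounds together, not the linked example itself. It assumes the parent process never initializes CUDA, that graph_struct_info_dict is plain CPU-side data, and that fork is available on the platform. gpu_for_worker, worker, and launch are hypothetical names, and the standard-library multiprocessing module stands in for torch.multiprocessing here.

```python
import multiprocessing as mp


def gpu_for_worker(rank, num_gpus):
    """Hypothetical helper: one GPU index per worker, round-robin; None means CPU."""
    if num_gpus <= 0:
        return None
    return rank % num_gpus


def worker(rank, num_gpus, graph_struct_info_dict):
    # Imported in the child so this sketch keeps CUDA initialization out of the parent.
    import torch as th
    import dgl

    dev_id = gpu_for_worker(rank, num_gpus)
    if dev_id is None:
        device = th.device("cpu")
    else:
        th.cuda.set_device(dev_id)  # bind this process to a single GPU
        device = th.device("cuda", dev_id)
    # The graph is built inside the child and moved there,
    # so no CUDA tensor crosses a process boundary.
    graph = dgl.heterograph(graph_struct_info_dict).to(device)
    # ... build the model on `device` and train here ...
    return graph


def launch(num_workers, num_gpus, graph_struct_info_dict):
    ctx = mp.get_context("fork")  # assumption: fork exists (not on Windows)
    procs = [ctx.Process(target=worker,
                         args=(rank, num_gpus, graph_struct_info_dict))
             for rank in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

Call launch(...) from under an `if __name__ == "__main__":` guard. The returned exit codes make it easy to see which worker failed, since an exception in a child does not propagate to the parent.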

Hi @VoVAllen,

Thank you for formally raising an issue on GitHub about this, and thanks for providing the workarounds! I have some questions and comments about them, if you could answer:

  1. PyTorch does not allow the fork start method when using CUDA with torch.multiprocessing. Only the spawn or forkserver start methods are supported, and spawn is recommended. Hence I won’t be able to use fork. Is there currently a workaround (if possible) with spawn?

  2. Can I pass CUDA tensors between the main process and a subprocess? I will, however, try not to pass CUDA tensors between two subprocesses.

  3. I will look into the links you have provided. Just a question, though: with my current setup with DGL Homogeneous Graphs (as described in my very first post), to train faster, I am initialising 3 processes per GPU. In the case of a single GPU, I initialise the global network and two local networks on that GPU. If I get more GPUs, I increase the number of subprocesses and load each extra GPU with 3 more local networks. This setup is working great, though. Is it safe to keep using it?
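The layout in point 3 can be written down as a small helper, shown here only to make the counting explicit. plan_networks and procs_per_gpu are hypothetical names, and the sketch assumes the global network always sits on GPU 0.

```python
def plan_networks(num_gpus, procs_per_gpu=3):
    """Return (role, gpu_id) pairs for the layout described above.

    GPU 0 hosts the global network plus local networks in its remaining
    slots; every additional GPU hosts only local networks.
    """
    plan = []
    for gpu_id in range(num_gpus):
        for slot in range(procs_per_gpu):
            role = "global" if (gpu_id == 0 and slot == 0) else "local"
            plan.append((role, gpu_id))
    return plan
```

For one GPU this gives one global and two local networks; for two GPUs it gives one global and five local networks, three of them on the second GPU.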