Simultaneous multi-process training on CPU and GPU

I need to run inference and explanation online. My approach is to run this step each time a certain number of batches has been trained. To avoid GPU contention, I copy the model to the CPU for this. The problem is that when I try to use multiple processes, the program often does not run as expected.
My program follows one of DGL's official distributed node classification examples. It starts from Python's main entry point, which parses the arguments and then calls a function named main. That function handles initialization, such as initializing DGL's distributed module and the graph. The run function is then called to perform the training.

At first, I call the following function during the training process:

if step % args.sampling_update_period == 0 and step != 0:
    mp.spawn(...)

This reports an error:

AttributeError: module 'dgl.multiprocessing' has no attribute 'spawn'

So I changed it to:

mp.Process(...).start()

This raises:

RuntimeError: Unable to handle autograd's threading in combination with fork-based multiprocessing. See Autograd and Fork · pytorch/pytorch Wiki · GitHub

When I use the method described in that link:

ctx = mp.get_context("spawn")
with ctx.Pool(1) as p:
    p.map(...)

AssertionError: Distributed module is not initialized. Please call dgl.distributed.initialize.

I also tried launching the training function run in its own process in the same way, so that both training and inference would run as child processes, but the result was similar.
The only approach that runs is starting a process for the function named main from Python's main entry point. Unfortunately, I need the DistGraph created in main to do some initial setup in the other process. Is there a better way to do multi-process operations? Or, when a new DistGraph object is created and the graph is already loaded in memory, can a second load be avoided?

Your description is a bit messy. Could you provide more context and ideally a minimal script to understand your setting?
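
In the meantime, here is one possible reading of the last AssertionError. A process started with the spawn method begins in a fresh interpreter, so state that was set up at runtime in the parent (for example by dgl.distributed.initialize) would not be present in the child. The sketch below illustrates this with the standard library only. The names initialize, is_initialized, child_without_init, child_with_init and run_in_spawned_child are hypothetical stand-ins, not DGL APIs, and the sketch does not show whether DGL supports initializing again inside such a child.

```python
import multiprocessing as mp
import sys

# Hypothetical stand-in for per-process library state, such as whatever
# dgl.distributed.initialize records internally. This is not DGL's API.
_initialized = False


def initialize():
    """Mark the current process as initialized (stand-in for real setup)."""
    global _initialized
    _initialized = True


def is_initialized():
    """Report whether initialize() has run in the current process."""
    return _initialized


def child_without_init():
    # Relies on state set up in the parent.
    # Exits with 0 if the state is visible here, 1 otherwise.
    sys.exit(0 if is_initialized() else 1)


def child_with_init():
    # Repeats the setup inside the child before relying on the state.
    initialize()
    sys.exit(0 if is_initialized() else 1)


def run_in_spawned_child(target):
    """Run target in a freshly spawned process and return its exit code."""
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=target)
    proc.start()
    proc.join()
    return proc.exitcode


if __name__ == "__main__":
    initialize()  # initialization happens in the parent only
    print("parent initialized:", is_initialized())
    print("child without init, exit code:",
          run_in_spawned_child(child_without_init))
    print("child with init, exit code:",
          run_in_spawned_child(child_with_init))
```

Run as a script, the first child should exit with code 1 because the parent's flag is not carried over, and the second should exit with code 0 because it repeats the setup itself. If your real setup behaves the same way, the function you hand to the spawned process would need to do its own initialization before touching anything that depends on it.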
