Greetings,
I am trying to do single-machine multi-GPU minibatch training on a large graph (1+ billion edges, like OGB MAG240M). I followed the tutorial and example code here, and below is a snippet of my code that starts the multi-GPU training:
print("Preparing training data...")
# Create a large DGL graph object
train_graph = prepare_train_data(create_formats=True)
print("Starting MultiGPU processes...")
mp.start_processes(multigpu_start, args=(args, train_graph), nprocs=len(args.gpu), start_method="spawn")
However, I noticed that this actually creates multiple copies of the `train_graph` object (one for each spawned process), which causes OOM even for a small `nprocs` (like `nprocs=3`); pickling and unpickling the graph object for the subprocesses also takes quite a lot of time.
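For reference, the worker entry point looks roughly like the sketch below (heavily simplified; the real body sets up the dataloader and training loop). As far as I understand, with `start_method="spawn"` every argument is pickled in the parent and unpickled in each child, so each process ends up holding its own full copy of the graph in host memory:

```python
import torch

def multigpu_start(proc_id, args, train_graph):
    # start_processes passes the process index as the first positional argument.
    # Because start_method="spawn" pickles all arguments, this worker receives
    # and holds its own full copy of train_graph in host memory.
    torch.cuda.set_device(args.gpu[proc_id])
    print(f"[rank {proc_id}] received a graph copy with {train_graph.num_edges()} edges")
    # ... per-GPU minibatch training would go here ...
```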
My question is: are there any approaches for putting the graph object into shared memory so that all subprocesses can access it without creating additional copies? Read-only access would be sufficient, but copy-on-write (CoW) access would be even better. I have tried changing the start method to `"fork"`, which does give CoW access to the graph, but unfortunately that does not work with `torch.cuda` (CUDA cannot be re-initialized in a forked subprocess). I also noticed that there is a `share_memory` member function on the DGL graph object, but it does not seem to be documented and I am not sure how it is supposed to work.
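To make the question concrete, here is a rough sketch of the kind of usage I was hoping for. The exact names and signatures (`shared_memory` on the graph, `dgl.hetero_from_shared_memory` on the worker side) are only my guesses at the undocumented API, so please correct me if this is not how it is meant to be used:

```python
import dgl
import torch
import torch.multiprocessing as mp

def multigpu_start(proc_id, args, shm_name):
    torch.cuda.set_device(args.gpu[proc_id])
    # Re-attach to the graph structure that the parent placed in shared memory,
    # instead of receiving a pickled copy through spawn.
    # (Guessing at the function name here.)
    train_graph = dgl.hetero_from_shared_memory(shm_name)
    print(f"[rank {proc_id}] attached graph with {train_graph.num_edges()} edges")

if __name__ == "__main__":
    train_graph = prepare_train_data(create_formats=True)
    # Copy the graph into shared memory under a well-known name (I assume this
    # covers the structure only, not node/edge features) and keep the returned
    # handle alive in the parent for the whole run.
    shm_graph = train_graph.shared_memory("train_graph")
    mp.start_processes(multigpu_start, args=(args, "train_graph"),
                       nprocs=len(args.gpu), start_method="spawn")
```

Is something along these lines the intended workflow, and would it avoid both the per-process copies and the pickling overhead?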
I would appreciate it if someone could provide some help or suggestions!