Using Shared Memory for Graph Access in multi-GPU training

Greetings,

I am trying to do single-machine multi-GPU minibatch training on a large graph (with 1+ billion edges, like OGB MAG240M). I followed the tutorial and example code here, and below is the snippet of my code which starts the multigpu training:

print("Preparing training data...")
# Create a large DGL graph object
train_graph = prepare_train_data(create_formats=True) 
print("Starting MultiGPU processes...")
mp.start_processes(multigpu_start, args=(args, train_graph), nprocs=len(args.gpu), start_method="spawn")

However, I noticed that this actually creates multiple copies of the train_graph object (one for each started process), which causes OOM even for a small nprocs (like nprocs=3); pickling and unpickling the graph object for the subprocesses also takes quite a lot of time.
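To make the copying cost concrete, here is a minimal standard-library sketch. It assumes that, under the spawn start method, each worker's arguments are pickled in the parent and unpickled into a fresh object in the child. The helpers spawn_copy_cost and roundtrip are hypothetical, a plain Python list stands in for the graph, and the pickled size is only a rough proxy for resident memory.

```python
import pickle


def spawn_copy_cost(obj, nprocs):
    """Return (bytes per pickled copy, total bytes across nprocs workers)."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(payload), len(payload) * nprocs


def roundtrip(obj):
    """Mimic what a spawned worker receives: an independent, unpickled copy."""
    return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


if __name__ == "__main__":
    fake_graph = list(range(1_000_000))  # stand-in for a large graph object
    per_copy, total = spawn_copy_cost(fake_graph, nprocs=3)
    copy = roundtrip(fake_graph)
    print(f"one copy: {per_copy} bytes, three workers: {total} bytes")
    print("same object in worker?", copy is fake_graph)  # separate copy
```

The total grows linearly with nprocs, which matches the behaviour described above: one full copy per started process.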

My question is: are there any approaches that put the graph object into shared memory so that all the subprocesses can access it without creating additional copies? Read-only access would be sufficient, but it would be even better if CoW (copy-on-write) access were possible. I have tried changing the start method to “fork”, which allows CoW access to the graph, but unfortunately that does not work with torch.cuda. I also noticed that there is a shared_memory member function for the DGL graph object, but it does not seem to be documented and I am not sure how it is supposed to work.

I would appreciate it if someone could provide some help and suggestions!

I think you could try DistDGL: dgl/examples/pytorch/graphsage/dist at master · dmlc/dgl · GitHub. For your case, just partition the graph into 1 partition, as you have only 1 machine. In DistDGL, the graph and graph data are mapped to shared memory. If you still hit OOM, you need more machines and must split the original graph into more partitions.

As for the example you’re referring to, graph.shared_memory() cannot be used directly: ndata/edata also need to be mapped to shared memory. I think it’s doable, though several modifications are required. For now, graph.shared_memory() is used only in DistDGL.

Have you tried with start_method=forkserver?

How did you confirm that the graph is copied explicitly in each process and that this is what runs out of RAM? Does nprocs=1 work well while nprocs=3 fails due to OOM?

Did you call graph.create_formats_() before spawning the processes? If not, the subprocesses may construct the CSR/CSC representations of the graph on their own, leading to multiple copies.

Thank you all for the suggestions! I checked out the docs for DistDGL; I feel it could solve the problem (in an indirect way), but it seems like overkill to me and would require quite a lot of modification to my current pipeline to make things work.

I took the approach of leveraging graph.shared_memory(), and it seems to provide the shared-memory access I needed. What I am doing now in the main function is the following:

# if __name__ == '__main__':

print("Preparing training data...")
# Create a large DGL graph object
train_graph = prepare_train_data(create_formats=True)

# Move DGL graph object to shared memory with name as "train_graph"
shared_graph = train_graph.shared_memory("train_graph") 
  
print("Starting MultiGPU processes...")
# Do NOT pass the shared DGL graph object in the args here
# Note the trailing comma: args must be a tuple, and (args) alone is not one
mp.start_processes(multigpu_start, args=(args,), nprocs=len(args.gpu), start_method="spawn")

and in the multigpu_start function, I call train_graph = dgl.hetero_from_shared_memory("train_graph") to access the DGL graph object using the name set in the main function. This way, all processes use the graph object in shared memory instead of creating copies of their own. The shared graph object is written as files to /dev/shm, which is essentially a tmpfs in RAM; if the process is interrupted abnormally, the files in /dev/shm need to be cleaned up manually to free the RAM again. When using DGL node or edge dataloaders with ddp=True, CollateWrapper also needs to be modified accordingly to load the DGL graph from shared memory.
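For readers unfamiliar with the create-by-name, attach-by-name pattern, here is a minimal analogy using Python's standard multiprocessing.shared_memory module. This is not how DGL implements shared_memory(); a short byte string stands in for the graph, and publish, read_by_name, demo and the segment name are all hypothetical. On Linux such named segments typically show up under /dev/shm, which is why an explicit unlink (or manual cleanup after a crash) is needed.

```python
import os
from multiprocessing import shared_memory


def publish(name, payload):
    """Parent side: create a named segment and copy the payload into it once."""
    seg = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    seg.buf[:len(payload)] = payload
    return seg


def read_by_name(name, nbytes):
    """Worker side: attach to the existing segment by name and read from it."""
    seg = shared_memory.SharedMemory(name=name)
    try:
        return bytes(seg.buf[:nbytes])
    finally:
        seg.close()  # detach only; the segment itself stays alive


def demo():
    name = f"demo_graph_{os.getpid()}"  # hypothetical segment name
    payload = b"edge-list-bytes"
    owner = publish(name, payload)
    try:
        return read_by_name(name, len(payload))
    finally:
        owner.close()
        owner.unlink()  # without this the segment outlives the process


if __name__ == "__main__":
    print(demo())
```

Only the name crosses the process boundary in this pattern, so nothing large has to be pickled when the workers are started.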

I hope this helps people looking into similar issues!

In the previous approach I did call graph.create_formats_() before spawning the subprocesses (with start method spawn or forkserver), but I think the graph (along with all created formats) is still pickled and then unpickled in each of the subprocesses when the shared_memory function is not used.

I have also tried with start_method=forkserver but it behaves the same as spawn.

The first line of my multigpu_start function is a print statement. In the previous approach, after mp.start_processes was called, that print statement was not executed while memory usage grew rapidly, indicating that the DGL graph object was going through the pickling process. I also observed that nprocs=1 works while nprocs=3 fails due to OOM. Together these confirm that the graph is copied explicitly into each process when the previous approach is used.

In the tutorial you were referring to, the graph is created outside of the __name__ == '__main__' guard, so that code runs again in every subprocess if spawn is used. Could you confirm whether such graph creation operations are guarded by the main check in your code? Here’s a good example for reference: dgl/multi_gpu_node_classification.py at 9a16a5e03d9f653702fa98001e1549692c4b8394 · dmlc/dgl · GitHub
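The effect of the guard can be illustrated without starting any processes. The sketch below assumes CPython's behaviour of re-running the main script in a spawned child under the module name __mp_main__, and imitates that with runpy. The script contents and the run_script_as helper are hypothetical.

```python
import os
import runpy
import tempfile
import textwrap

# Hypothetical training script: one unguarded statement, one guarded statement.
SCRIPT = textwrap.dedent(
    """
    events = ["module-level code ran"]
    if __name__ == "__main__":
        events.append("guarded code ran")
    """
)


def run_script_as(run_name):
    """Execute SCRIPT under the given module name and return what it ran."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        return runpy.run_path(path, run_name=run_name)["events"]


if __name__ == "__main__":
    print("parent:", run_script_as("__main__"))
    print("spawned child:", run_script_as("__mp_main__"))
```

Anything placed at module level, such as graph construction, would be repeated in every child, while code under the guard runs only in the parent.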