Single machine with many GPUs

onepiecewiley · July 16, 2024, 8:28pm

I am new to distributed graph training and am not sure whether DGL can do what I need. I have a computer with 4 GPUs. I want to partition a graph dataset with DGL, for example using METIS. After obtaining the partitions, I want each GPU to load one partition and hold a complete copy of the model, with all GPUs training simultaneously. During training, a GPU may need to fetch neighbor node features from another GPU. After each epoch, gradients should be synchronized, giving single-machine multi-GPU training.

onepiecewiley · July 16, 2024, 8:29pm

I’d be grateful if someone could provide source code for the above scenario.

mfbalin · July 16, 2024, 8:34pm

Have you tried this example? It is not quite what you described, but it can train on multiple GPUs without replicating either the features or the graph.

onepiecewiley · July 16, 2024, 8:39pm

In fact, this example divides the graph dataset rather than applying a graph partitioning algorithm. I need to use a partitioning algorithm because my current research area is optimising the efficiency of distributed graph training with graph partitioning algorithms.

onepiecewiley · July 16, 2024, 8:40pm

I checked and found that the distributed sampler here divides the dataset based on the number of GPUs, but that’s not graph partitioning.

onepiecewiley · July 16, 2024, 8:41pm

I’m not sure whether I’m misunderstanding something, because the official documentation doesn’t cover this scenario.

mfbalin · July 16, 2024, 8:45pm

The distributed sampler partitions the training set across the GPUs, but there is no graph partitioning or feature partitioning performed. All GPUs access the same graph and features in memory when you use the “pinned-cuda” mode. For “cuda-cuda” mode, the graph and features are replicated across GPUs.
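
To make the distinction concrete, here is a minimal pure-Python sketch (no DGL involved). The toy graph, the helper names `shard_training_set` and `edge_cut`, and the hand-written node assignments are all hypothetical and only illustrate the idea: sharding the training set gives each rank different seed nodes while every rank still sees the whole graph, whereas graph partitioning assigns each node to one part and tries to keep the number of cut edges small.

```python
# Hypothetical toy graph: two 4-node rings joined by a single edge (3, 4).
edges = [(0, 1), (1, 2), (2, 3), (3, 0),
         (4, 5), (5, 6), (6, 7), (7, 4),
         (3, 4)]
train_ids = list(range(8))
num_gpus = 2


def shard_training_set(ids, rank, world_size):
    """Strided split of seed nodes across ranks.

    Similar in spirit to a distributed sampler, but without shuffling
    or padding. The graph itself is not divided.
    """
    return ids[rank::world_size]


def edge_cut(edges, assignment):
    """Count edges whose endpoints are owned by different parts."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# 1) Training-set sharding: each rank trains on different seed nodes.
shards = [shard_training_set(train_ids, r, num_gpus) for r in range(num_gpus)]

# 2) Graph partitioning: each node is owned by exactly one part.
# "clustered" is a hand-written stand-in for what a partitioner such as
# METIS aims for; "strided" assigns nodes by ID and ignores the structure.
clustered = {n: 0 if n < 4 else 1 for n in range(8)}
strided = {n: n % num_gpus for n in range(8)}

print(shards)                      # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(edge_cut(edges, clustered))  # 1
print(edge_cut(edges, strided))    # 9
```

Each cut edge is a place where one part would have to fetch a neighbor's features from another part, which is why the choice of partitioning matters for the scenario in the original question.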

onepiecewiley · July 16, 2024, 8:49pm

So could you please tell me whether the scenario in my question is possible with DGL? I don’t quite understand, because the example in the official docs is multi-machine distributed training with one partition loaded per machine. I want to do the same on a single machine with multiple GPUs, with each GPU loading one graph partition. Thank you.

mfbalin · July 16, 2024, 8:53pm

DistDGL has METIS partitioning capabilities. You might want to look into the distributed examples. I am sure you can run them even on a single machine.
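
As a rough sketch of the partitioning step only, assuming DGL is installed and `g` is an existing `DGLGraph`: `dgl.distributed.partition_graph` can split a graph with METIS and `dgl.distributed.load_partition` can read one part back. The wrapper names below (`partition_config_path`, `partition_for_gpus`, `load_one_partition`) are hypothetical, and using one partition per GPU rank is an assumption about the setup in the question, not something the distributed examples prescribe.

```python
import os


def partition_config_path(out_dir, graph_name):
    """Path of the JSON metadata file that partition_graph writes."""
    return os.path.join(out_dir, graph_name + ".json")


def partition_for_gpus(g, graph_name, num_gpus, out_dir):
    """Split graph `g` into `num_gpus` METIS partitions saved under `out_dir`."""
    # Imported lazily so the helpers can be read without DGL installed.
    import dgl.distributed

    dgl.distributed.partition_graph(
        g, graph_name, num_gpus, out_dir, part_method="metis"
    )
    return partition_config_path(out_dir, graph_name)


def load_one_partition(config_path, rank):
    """Load the partition with ID `rank` (assumed here: one per GPU).

    Returns a tuple whose first element is the local partition graph;
    check the DGL documentation for the remaining elements, since the
    layout depends on the DGL version.
    """
    import dgl.distributed

    return dgl.distributed.load_partition(config_path, rank)
```

For a 4-GPU machine this might be called as `cfg = partition_for_gpus(g, "mygraph", 4, "parts")` once, followed by `load_one_partition(cfg, rank)` in each worker process. Fetching neighbor features from other partitions and synchronizing gradients are not covered by this sketch.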

mfbalin · July 16, 2024, 10:02pm

But I think the GraphBolt example above will be much more efficient than the distributed examples.
