Problem description:
I have several RTX 4090s. Whenever I use dgl.nn.GATv2Conv or GATConv, calling model.to('cuda:1') raises a GPU-related error, while model.to('cuda:0') works fine.
Stepping through with a debugger, I found that every time I use model.to('cuda:1'), the forward pass of the dgl.nn.GATv2Conv layer also leaves a fixed memory footprint of about 300 MB on cuda:0. model.to('cuda:2') shows the same problem, so the forward pass seems to always create a fixed allocation on the first GPU.
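For reference, here is a minimal sketch of the kind of setup that triggers it for me (the graph size, feature dimension, and head count below are arbitrary placeholders, not my real model):

```python
import torch
import dgl
from dgl.nn import GATv2Conv

# Small random graph, with self-loops added so GATv2Conv sees no
# zero-in-degree nodes; everything is placed on the second GPU.
g = dgl.add_self_loop(dgl.rand_graph(100, 500)).to('cuda:1')
feat = torch.randn(100, 16, device='cuda:1')

model = GATv2Conv(16, 8, num_heads=4).to('cuda:1')
out = model(g, feat)  # after this forward pass, nvidia-smi shows ~300 MB on cuda:0

# If memory_allocated(0) reports zero here while nvidia-smi still shows
# the ~300 MB, the extra occupancy is likely an implicitly created CUDA
# context on cuda:0 rather than a tensor allocation.
print(torch.cuda.memory_allocated(0), torch.cuda.memory_allocated(1))
```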
dgl.nn.TAGConv has the same problem; the only difference is that it can still train without raising an error.
DGL version: 1.1.3+cu12
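A workaround I am considering (untested; just the standard way to keep an implicit device-0 CUDA context off the first card) is to mask the visible devices before anything initializes CUDA:

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # expose only physical GPU 1; set before any CUDA call

import torch
import dgl
from dgl.nn import GATv2Conv

# 'cuda:0' inside this process now maps to physical GPU 1, so even a
# stray "device 0" allocation lands on the intended card.
model = GATv2Conv(16, 8, num_heads=4).to('cuda:0')
```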