Multi-GPU problem

Problem description:
I have several RTX 4090s. Whenever I use dgl.nn.GATv2Conv or GATConv, calling model.to('cuda:1') raises a GPU-related error, while model.to('cuda:0') works fine.

Stepping through with a debugger, I found that every time I use model.to('cuda:1'), the forward pass of the dgl.nn.GATv2Conv layer also leaves a fixed allocation of about 300 MB on cuda:0. model.to('cuda:2') has the same problem. It seems the forward pass always creates a fixed allocation on the first GPU.
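
For reference, a minimal reproduction sketch of what I was testing (the graph and feature sizes here are arbitrary placeholders, not the ones from my actual code):

```python
import dgl
import torch
from dgl.nn import GATv2Conv

device = torch.device('cuda:1')

g = dgl.rand_graph(1000, 5000).to(device)       # random graph placed on cuda:1
feat = torch.randn(1000, 16, device=device)     # node features on cuda:1
conv = GATv2Conv(in_feats=16, out_feats=8, num_heads=4).to(device)

out = conv(g, feat)                             # forward pass, entirely on cuda:1

# Torch-side allocations should be zero on cuda:0; a bare CUDA context
# (the ~300 MB footprint) shows up in nvidia-smi rather than here.
print(torch.cuda.memory_allocated(0), torch.cuda.memory_allocated(1))
```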

dgl.nn.TAGConv shows the same behavior; the only difference is that it can still train without raising an error.

DGL version: 1.1.3 (CUDA 12)

Problem solved; it was not caused by DGL.
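
For anyone who lands here with the same symptom: one common cause of a phantom allocation on cuda:0 is some code path initializing a CUDA context on device 0 (e.g. a stray tensor or a CUDA call without an explicit device). A hedged workaround sketch, not necessarily the fix used here, is to restrict the visible devices before importing torch/dgl, so that cuda:0 inside the process maps to the card you actually intend to use:

```python
import os
# Expose only physical GPU 1 to this process; must run before torch/dgl import.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch
import dgl

# Inside this process, cuda:0 now maps to physical GPU 1, so any library
# code that implicitly touches device 0 still lands on the intended card.
device = torch.device('cuda:0')
```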
