Problem description:
I have several RTX 4090s. Whenever I use dgl.nn.GATv2Conv or GATConv, calling model.to('cuda:1') raises a GPU-related error, while model.to('cuda:0') works fine.
Stepping through with a debugger, I found that every time I use model.to('cuda:1'), the forward pass of the dgl.nn.GATv2Conv layer also leaves a fixed memory footprint of about 300 MB on cuda:0. model.to('cuda:2') shows the same problem, so the forward pass seems to always create a fixed allocation on the first GPU.
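For reference, here is a minimal sketch of the kind of setup that triggers it for me (the graph size, feature dimension, and head count below are arbitrary placeholders, not my real model):

```python
import torch
import dgl
from dgl.nn import GATv2Conv

# Small random graph, with self-loops added so GATv2Conv sees no
# zero-in-degree nodes; everything is placed on the second GPU.
g = dgl.add_self_loop(dgl.rand_graph(100, 500)).to('cuda:1')
feat = torch.randn(100, 16, device='cuda:1')

model = GATv2Conv(16, 8, num_heads=4).to('cuda:1')
out = model(g, feat)  # after this forward pass, nvidia-smi shows ~300 MB on cuda:0

# If memory_allocated(0) reports zero here while nvidia-smi still shows
# the ~300 MB, the extra occupancy is likely an implicitly created CUDA
# context on cuda:0 rather than a tensor allocation.
print(torch.cuda.memory_allocated(0), torch.cuda.memory_allocated(1))
```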
dgl.nn.TAGConv has the same problem; the only difference is that it can still train without raising an error.
DGL version: 1.1.3+cu12
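A workaround I am considering (untested; just the standard way to keep an implicit device-0 CUDA context off the first card) is to mask the visible devices before anything initializes CUDA:

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # expose only physical GPU 1; set before any CUDA call

import torch
import dgl
from dgl.nn import GATv2Conv

# 'cuda:0' inside this process now maps to physical GPU 1, so even a
# stray "device 0" allocation lands on the intended card.
model = GATv2Conv(16, 8, num_heads=4).to('cuda:0')
```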