CUDA out of memory issue

Hi, I’m running into out-of-memory errors when training a GAT-style model on a GPU. Here are the details:

The batched graph contains 3 types of edges (“parent”, “child”, “sibling”), with about 2.1 million edges and 16,800 nodes in total. Each node has a 300-dimensional feature “h”. Here’s the code:

import dgl
import torch as th
import torch.nn as nn
import torch.nn.functional as F

class GNN_HETER(nn.Module):

    def __init__(self, n_in, dropout, device, n_out = 1):
        super(GNN_HETER, self).__init__()
        self.A = nn.Parameter(th.Tensor(n_in+1, n_in+1)) # +1 for bias

    # the reduce function, 3 edge types included
    def reduce_func(self, nodes, edge_type):
        alpha = F.softmax(nodes.mailbox["attention_" + edge_type], dim = 1)
        reduce_result = th.sum(nodes.mailbox["m_" + edge_type] * alpha, dim = 1)
        return {("sum_" + edge_type): reduce_result}

    def call_parent_reduce(self, nodes):
        return self.reduce_func(nodes, "parent")

    def call_child_reduce(self, nodes):
        return self.reduce_func(nodes, "child")

    def call_sibling_reduce(self, nodes):
        return self.reduce_func(nodes, "sibling")

    def parent_message(self, edges):
        return {"m_parent": edges.src['h'], "attention_parent": edges.data["attention"]}
    def child_message(self, edges):
        return {"m_child": edges.src['h'], "attention_child": edges.data["attention"]}
    def sibling_message(self, edges):
        return {"m_sibling": edges.src['h'], "attention_sibling": edges.data["attention"]}

    # simulate one gnn layer
    def gnn_proceed_one_layer(self, g):
        g.apply_edges(func = self.edge_attention, etype = "parent")
        g["parent"].update_all(self.parent_message, self.call_parent_reduce)
        g.apply_edges(func = self.edge_attention, etype = "child")
        g["child"].update_all(self.child_message, self.call_child_reduce)
        g.apply_edges(func = self.edge_attention, etype = "sibling")
        g["sibling"].update_all(self.sibling_message, self.call_sibling_reduce)
        return g

    def edge_attention(self, edges): # hi^T * A * hj + hi^T * b1 + hj^T * b2
        hi = edges.src['h']
        hj = edges.dst['h']
        hi = th.cat((hi, th.ones_like(hi[..., :1])), -1)
        hj = th.cat((hj, th.ones_like(hj[..., :1])), -1)
        attn = th.einsum('ab,bc,ac->a', hi, self.A, hj).unsqueeze(1)
        return {("attention"): attn}
        
    def forward(self, graphs, num_layers = 1):
        graphs = dgl.batch_hetero(graphs, node_attrs = {"node": ['h', '_ID']}, edge_attrs = None)
        for i in range(num_layers):
            self.gnn_proceed_one_layer(graphs)

Here’s the error:

  File "/root/try/bishe/parser/modules/gnn_heter.py", line 198, in forward
    self.gnn_proceed_one_layer(graphs)
  File "/root/try/bishe/parser/modules/gnn_heter.py", line 147, in gnn_proceed_one_layer
    g["sibling"].update_all(self.sibling_message, self.call_sibling_reduce)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/heterograph.py", line 3196, in update_all
    Runtime.run(prog)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/runtime/runtime.py", line 11, in run
    exe.run()
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/runtime/ir/executor.py", line 132, in run
    udf_ret = fn_data(node_data, mail_data)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/runtime/degree_bucketing.py", line 153, in _rfunc_wrapper
    return reduce_udf(nbatch)
  File "/root/try/bishe/parser/modules/gnn_heter.py", line 122, in call_sibling_reduce
    return self.reduce_func(nodes, "sibling")
  File "/root/try/bishe/parser/modules/gnn_heter.py", line 112, in reduce_func
    reduce_result = th.sum(nodes.mailbox["m_" + edge_type] * alpha, dim = 1)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/utils.py", line 285, in __getitem__
    return self._fn(key)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/runtime/degree_bucketing.py", line 148, in _reshaped_getter
    msg = mail_data[key]
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/utils.py", line 285, in __getitem__
    return self._fn(key)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/frame.py", line 655, in <lambda>
    return utils.LazyDict(lambda key: self._frame[key][rows], keys=self.keys())
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/frame.py", line 97, in __getitem__
    return F.gather_row(self.data, user_idx)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 152, in gather_row
    return th.index_select(data, 0, row_index)
RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.17 GiB total capacity; 10.31 GiB already allocated; 112.75 MiB free; 10.75 GiB reserved in total by PyTorch)

I noticed that GPU memory usage increases drastically as gnn_proceed_one_layer calls apply_edges and update_all.

Can someone help me with this? Any suggestions or tips would be appreciated.

Hi @Daryl, could you please try replacing the message functions/reduce functions/apply-edges functions you defined with a combination of built-in functions? We have optimized the speed and GPU memory usage of the built-in functions. In your case, g.apply_edges(func = self.edge_attention, etype = "parent") copies node features onto the edges, and since your graph is not small (about 2.1 million edges), that operation is not memory efficient: roughly 2.1M × 300 × 4 bytes ≈ 2.3 GiB for a single copied 300-dimensional float32 feature. But if you use something like

import dgl.function as fn
g.dstdata['ah'] = torch.matmul(g.dstdata['h'], A.t())  # A @ hj, precomputed per destination node
g.apply_edges(func=fn.u_dot_v('h', 'ah', 'hah'), etype="parent")
# and other codes to handle hi^T * b1 + hj^T * b2

Then DGL would not copy node data onto the edges, which reduces the memory footprint.
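Following the same idea, here is a minimal sketch of how the remaining bias terms hi^T * b1 + hj^T * b2 from the comment above could also be expressed with built-in functions. It assumes b1 and b2 are learnable vectors of length n_in, and the field names ('score_b1', 'score_b2', 'bias') are illustrative, not from the original code:

import torch
import dgl.function as fn

# Per-node scalar terms hi^T b1 (source side) and hj^T b2 (destination side),
# precomputed once per node so the per-edge work is only an addition.
g.srcdata['score_b1'] = torch.matmul(g.srcdata['h'], b1).unsqueeze(-1)  # (N, 1)
g.dstdata['score_b2'] = torch.matmul(g.dstdata['h'], b2).unsqueeze(-1)  # (N, 1)

# Bilinear term hi^T A hj via a precomputed destination-side projection.
g.dstdata['ah'] = torch.matmul(g.dstdata['h'], A.t())
g.apply_edges(fn.u_dot_v('h', 'ah', 'hah'), etype="parent")

# Combine the two bias scalars on each edge, then add the bilinear term.
g.apply_edges(fn.u_add_v('score_b1', 'score_b2', 'bias'), etype="parent")
g.edges['parent'].data['attention'] = g.edges['parent'].data['hah'] + g.edges['parent'].data['bias']

Nothing of shape (num_edges, 300) is ever materialized here; the only per-edge tensors are the (num_edges, 1) scores.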

Thank you for replying. I tried your method and it really worked! BTW, I have created another reduce function; here’s the code:

    def reduce_func(self, nodes, edge_type):
        message = nodes.mailbox["m_" + edge_type]
        node_h = nodes.data['h']
        # compute "alpha" from "message" and "node_h"
        alpha = ...
        return {("sum_" + edge_type): th.sum(message * alpha, dim = 1)}

I noticed that, after executing these two lines of code, the GPU allocates some memory to hold the data in message and node_h. However, after reduce_func returned, this allocated memory was not freed (I didn’t directly return these two variables). In my view, message and node_h are temporary variables and should be freed once the function returns. May I know what is happening?

Hi, the allocated memory is actually freed after the reduce function has been called. For example, if you call dgl.send once and dgl.recv twice, the second dgl.recv call will raise an error because the messages have already been freed.

As for your observation, how do you measure the allocated GPU memory? If you are looking at torch.cuda.max_memory_cached, you can call torch.cuda.empty_cache() to clear the cache explicitly. Calling torch.cuda.memory_stats will give you a more detailed memory profiling result.
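Here is a minimal sketch of how the different counters can be separated when profiling, assuming a reasonably recent PyTorch (where memory_reserved and memory_stats are available); model(graphs) is just a placeholder for whatever call runs your reduce functions:

import torch

torch.cuda.reset_peak_memory_stats()

out = model(graphs)  # placeholder for the call that triggers the reduce functions

# Memory held by live tensors vs. memory kept in PyTorch's caching allocator.
print("allocated:", torch.cuda.memory_allocated() / 2**20, "MiB")
print("reserved :", torch.cuda.memory_reserved() / 2**20, "MiB")

# empty_cache() returns cached blocks to the driver; allocated memory is unaffected.
torch.cuda.empty_cache()
print("reserved after empty_cache:", torch.cuda.memory_reserved() / 2**20, "MiB")

# Detailed breakdown, including the peak allocation during the call above.
stats = torch.cuda.memory_stats()
print(stats["allocated_bytes.all.peak"] / 2**20, "MiB peak allocated")

The temporary tensors created inside reduce_func do become unreachable when the function returns, but the caching allocator keeps the freed blocks reserved, so tools that report cached/reserved memory will still show them until empty_cache() is called.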