I have a network which works with only about 6GB graphics memory.

And then I add a RGCN layer with n_rel=92, n_base=3, n_note=6760, n_dim=125(both in and out), average degree of each node is around 4 (the graph structure is just an extension of a tree).

The code of RGCN is almost the same as the demo. (only different in initializations and some trivial places that will not interfere tensor computing).

It appears a “CUDA out of memory” on 1080ti which have 11GB graphics memory in total.

And the Error infor is : “RuntimeError: CUDA out of memory. Tried to allocate 2.35 GiB (GPU 0; 10.92 GiB total capacity; 5.75 GiB already allocated; 2.19 GiB free; 368.29 MiB cached)”

I think this may not be in our original networks but in the RGCN layer. And I think this may due to the model SPARSITY. Did I use DGL in a wrong way?

# Is there any ERROR in the model SPARSITY?

**Erutan-pku**#1

**Erutan-pku**#3

Codes are shown as below~

Compared with RGCN demo, I just change some places so that no error occurs when using bias=True.

```
self.rgcnLayer = RGCNLayer(125, 125, 92,
activation=F.relu, bias=True, num_bases=5)
class RGCNLayer(nn.Module):
def __init__(self, in_feat, out_feat, num_rels, num_bases=-1, bias=None,
activation=None, is_input_layer=False):
super(RGCNLayer, self).__init__()
self.in_feat = in_feat
self.out_feat = out_feat
self.num_rels = num_rels
self.num_bases = num_bases
self.bias = bias
self.activation = activation
self.is_input_layer = is_input_layer
# sanity check
if self.num_bases <= 0 or self.num_bases > self.num_rels:
self.num_bases = self.num_rels
# weight bases in equation (3)
self.weight = nn.Parameter(torch.Tensor(self.num_bases, self.in_feat,
self.out_feat))
if self.num_bases < self.num_rels:
# linear combination coefficients in equation (3)
self.w_comp = nn.Parameter(torch.Tensor(self.num_rels, self.num_bases))
# add bias
if not self.bias is None:
self.bias = nn.Parameter(torch.Tensor(out_feat))
# init trainable parameters
nn.init.xavier_uniform_(self.weight,
gain=nn.init.calculate_gain('relu'))
if self.num_bases < self.num_rels:
nn.init.xavier_uniform_(self.w_comp,
gain=nn.init.calculate_gain('relu'))
if not self.bias is None:
nn.init.constant_(self.bias, val=0)
def forward(self, g):
if self.num_bases < self.num_rels:
# generate all weights from bases (equation (3))
weight = self.weight.view(self.in_feat, self.num_bases, self.out_feat)
weight = torch.matmul(self.w_comp, weight).view(self.num_rels,
self.in_feat, self.out_feat)
else:
weight = self.weight
if self.is_input_layer:
def message_func(edges):
# for input layer, matrix multiply can be converted to be
# an embedding lookup using source node id
embed = weight.view(-1, self.out_feat)
index = edges.data['rel_type'] * self.in_feat + edges.src['id']
return {'msg': embed[index] * edges.data['norm']}
else:
def message_func(edges):
w = weight[edges.data['rel_type']]
msg = torch.bmm(edges.src['h'].unsqueeze(1), w).squeeze()
msg = msg * edges.data['norm']
return {'msg': msg}
def apply_func(nodes):
h = nodes.data['h']
if not self.bias is None:
h = h + self.bias
if self.activation:
h = self.activation(h)
return {'h': h}
g.update_all(message_func, fn.sum(msg='msg', out='h'), apply_func)
```

**Erutan-pku**#4

I also found that the n_rel does not matter. I change n_rel and n_base to 1 and the “CUDA out of memory” occurs with a larger graph with 6864 nodes …

**lingfan**#6

You are right! The memory consumption is not related to n_rel or n_base. These two parameters control how large the model size (i.e. number of parameters) of RGCNLayer. But that’s not the memory bottleneck.

For RGCNLayer, the memory consumption should be proportional to number of edges in the graph. Because, in the message function (which operates on edge space), we have:

```
def message_func(edges):
w = weight[edges.data['rel_type']]
msg = torch.bmm(edges.src['h'].unsqueeze(1), w).squeeze()
return {'msg': msg}
```

And the first line duplicates weight matrix onto edges based on relation type of the edge.

Let’s assume, your graph has 10x more edges than nodes, then you have about 6760 x 10 ~ 70k edges. After `w = weight[edges.data['rel_type']]`

, each edge gets a weight matrix of size 125 x 125. So this single line would cost 70k x 125 x 125 x 4bytes/float ~ 4GB memory.

So how should we solve this problem? I can think of several things you could give a try:

- reduce the weight matrix size by
- reduce n_dim, if you check out author’s implementation, for basis decomposition, n_dim 16 is used
- use block decomposition to reduce matrix size to 1/num_bases

- do not duplicate weight matrix onto edges
- you can try creating subgraphs of the original graph, where each subgraph only contains one edge type, and then perform message-passing on each subgraph separately and merge the results back. By doing so, on each subgraph, the weight matrix can be shared and do not need to be duplicated onto edges.
- the bummer is: if you graph has hundreds or more relations, it might not be efficient.

**minjie**#7

@Erutan-pku Could you share how many nodes, edges and relation types are there in your graph?

**Erutan-pku**#8

It’s a batched task, so the graph size is different in different batches. And when the OOM Error occurred, there are about 6k nodes. relation types are 92 (well, most of them are useless, I am going to merge some of them). About 3 edges per node (have considered bi-directed edges) since the graph is a extension of tree.

**Erutan-pku**#9

This sounds reasonable and interesting. I will try your suggests as well as pruning the graph size according to the task.

Thank you very much~

**Erutan-pku**#10

I find a better implement of RGCN, which is more friendly to graphics memory.

Here, we do not copy the weight matrix in RGCN for each edge (which size is |E|*d_in*d_out，and the following calculation is [|E|*d_in*d_out]*[|E|*d_in]=[|E| d_out]), but copy the weight vector (size=|E|n_base) and calculate the base transform matrix for each front node ([n_based_ind_out]*[|E|*d_in]=[|E|

*n_base*d_out]) and finally get the output ([|E|

*n_base]*[|E|

*n_base*d_out]=[|E|*d_out]).

Since n_base is (at least) several times smaller than d_in and d_out, this will save graphics memory for several times.