Why is graph.send much slower when not using all edges?

Hello,
I am working with a 3D grid graph of moderate size (100 x 215 x 40). I would like to change the GCN model so that it propagates only along one axis at a time. To do this, I just changed the 'send' call of the GCN and replaced edges=ALL with my list of edges.
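
Schematically, the change looks like this (DGL 0.x API, simplified; my real code is further down the thread):

# Standard GCN: a single update_all over every edge.
g.update_all(gcn_msg, gcn_reduce, node_apply)

# Anisotropic variant: send along one axis' edge list at a time,
# then one recv that reduces all pending messages.
for dim in range(num_axes):
    g.send(edge_lists[dim], anisotropic_gcn_msg)
g.recv(reduce_func=gcn_reduce, apply_node_func=node_apply)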

Unfortunately, my model runs around 50 times slower than the standard GCN. In theory, the same number of messages is passed, so I am looking for a way to close this gap.

I first believed that the problem was that utils.toindex(edges) is called on every forward pass, so that a new index is built at each step. I tried changing the library to store this index so that it is only computed once. It improved the speed, but only marginally.
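
The caching I tried looked roughly like this (a simplified, user-side sketch of the idea, not my exact patch; cached_toindex is a name I made up):

from dgl import utils

# Build the utils.toindex result once per edge list and reuse it,
# instead of rebuilding the index on every send.
_index_cache = {}

def cached_toindex(edges):
    key = id(edges)  # the same list object is reused at every step
    if key not in _index_cache:
        _index_cache[key] = utils.toindex(edges)
    return _index_cache[key]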

In fact, what seems to take the longest is the graph.recv function, even though it is the same in both cases.

Does anyone know the precise reason why it is so much slower not to propagate along all edges? What would you suggest to mitigate this problem?

Thanks for your help!
Clément

Hi,

This is not expected. The update_all function also calls send and recv in its underlying execution. One possible explanation would be the scale of the graph: in your case it has about 860,000 nodes (100 x 215 x 40), so there is extra cost in scattering data out of the full node-data tensor at each iteration. However, that doesn't seem to be the case here.

Could you provide more information on the following points:

  1. The exact time of send and update_all (because 1 s vs 50 s might be a different story compared to 1 ms vs 50 ms)
  2. How heavy is the computation in your send and recv functions?

Another suggestion is to uncomment this line to see what actually happens, so you can profile each operation's time cost. DGL uses an IR system that translates each operation into a set of instructions and executes them with an executor. That line prints out each instruction, and the next line executes it. We can then see whether there are more IR instructions, or whether each instruction takes longer to execute.

Hope this helps!

It’s hard to see the problem without code. Some guesses:

  • How is the edge list generated? It could be quite slow if a native Python routine or for loop is involved. Try to use methods from the tensor framework (such as torch.arange).
  • Do you have edge data? When edges=ALL, the edge data can be fetched in one shot, but when an edge list is provided, the edge data first needs to be looked up by a tensor operator. However, I doubt that would lead to a 50x slowdown.
  • Did you observe a significant time increase in the reduce function? Could you tell us which reduce function is used? To me, a 50x slowdown looks like accidentally switching from optimized execution (e.g. SpMV) to normal execution; a sketch of the optimized path follows this list.
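
For reference, the optimized SpMV path is normally triggered by DGL's built-in functions rather than Python UDFs. A minimal sketch of what that looks like for a GCN layer (assuming the usual symmetric normalization; the details of your model may differ):

import dgl.function as fn

# Express message + reduce with built-ins so the scheduler can fuse
# them into one SpMV-style kernel instead of running Python UDFs.
g.ndata['h'] = g.ndata['h'] * g.ndata['norm']   # fold in the source norm
g.update_all(fn.copy_src(src='h', out='m'),     # message: copy source feature
             fn.sum(msg='m', out='h'))          # reduce: sum the mailbox
g.ndata['h'] = g.ndata['h'] * g.ndata['norm']   # fold in the destination norm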

Hello, thank you for your interest! I will try to give you all the elements that you need to understand the situation.
My timing experiments give the following results:

  • A GCNLayer.forward takes 0.7 s on average, against 47 s for the AnisotropicGCN
  • The functions gcn_msg, gcn_reduce, NodeApplyModule.forward and pop (to retrieve the results) are all very fast
  • Of the 47 s of AnisotropicGCN.forward, 45 s are spent in the recv function.

Other observations:

  • I tried two different methods to provide the list of edges: giving a Python list of pairs of integers, and converting this list to a 1-D torch.LongTensor of edge IDs using the graph.edgeid method. I did not see a big difference in runtime either way (a tensor-only way to build such a list is sketched right after this list).
  • I don’t have edge data.
  • Although my model is on the GPU, the GPU load is very light; almost everything happens on the CPU.
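
For concreteness, a per-axis edge list for the grid can be built with tensor operations only, as suggested above (a sketch; axis_edges_z is a placeholder name, and the edge orientation depends on how the grid graph was constructed):

import torch

# Source/destination node IDs for the edges along the last axis of an
# (X, Y, Z) grid, with nodes numbered n = x*Y*Z + y*Z + z.
def axis_edges_z(X, Y, Z):
    n = torch.arange(X * Y * Z).view(X, Y, Z)
    src = n[:, :, :-1].reshape(-1)   # every node except the last z-slice
    dst = n[:, :, 1:].reshape(-1)    # its neighbour one step along z
    return src, dst

src, dst = axis_edges_z(100, 215, 40)
# eids = g.edge_ids(src, dst)       # 1-D LongTensor usable in g.send(eids, ...)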

Here is the file defining my layers; the anisotropic layer is a very light modification of the standard GCN, which I copied:

# Adapted from https://github.com/dmlc/dgl/blob/master/examples/pytorch/gcn/gcn.py
import torch
import torch.nn as nn
import math
import dgl
import time

def anisotropic_gcn_msg(edge):
    # Message along one axis: the per-axis transformed feature 'temp',
    # scaled by the source node's normalization constant.
    msg = edge.src['temp'] * edge.src['norm']
    return {'m': msg}


def gcn_msg(edge):
    # Standard GCN message: the source feature scaled by the source norm.
    msg = edge.src['h'] * edge.src['norm']
    return {'m': msg}


def gcn_reduce(node):
    # Sum the incoming messages and scale by the destination norm.
    accum = torch.sum(node.mailbox['m'], 1) * node.data['norm']
    return {'h': accum}


class NodeApplyModule(nn.Module):
    def __init__(self, out_feats, activation=None, bias=True):
        super(NodeApplyModule, self).__init__()
        if bias:
            self.bias = nn.Parameter(torch.Tensor(out_feats))
        else:
            self.bias = None
        self.activation = activation
        self.reset_parameters()

    def reset_parameters(self):
        if self.bias is not None:
            stdv = 1. / math.sqrt(self.bias.size(0))
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, nodes):
        h = nodes.data['h']
        if self.bias is not None:
            h = h + self.bias
        if self.activation:
            h = self.activation(h)
        return {'h': h}


class GCNLayer(nn.Module):
    def __init__(self, g, in_feats, out_feats, activation, bias=True):
        super().__init__()
        self.g = g
        self.weight = nn.Parameter(torch.Tensor(in_feats, out_feats))
        self.node_update = NodeApplyModule(out_feats, activation, bias)
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)

    def forward(self, h):
        self.g.ndata['h'] = torch.mm(h, self.weight)
        self.g.update_all(gcn_msg, gcn_reduce, self.node_update)
        h = self.g.ndata.pop('h')
        return h


class AnisotropicGCNLayer(nn.Module):
    def __init__(self, g, edge_lists: list, dimensions: int, in_feats: int, out_feats: int,
                 activation, bias=True):
        """ g (DGL directed graph)
            in_feats: number of input channels
            out_feats: number of output units
            activation: usually F.relu or None
            dimensions: number of dimensions in the cartesian product
            bias: if True, add bias to the linear layer"""
        super().__init__()
        self.g = g
        self.dimensions = dimensions
        self.list_edges = edge_lists
        self.weight = nn.Parameter(torch.Tensor(in_feats, out_feats, dimensions))
        self.node_update = NodeApplyModule(out_feats, activation, bias)
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)

    def forward(self, h):
        # Send messages separately along each axis (each with its own
        # weight slice), then do a single recv that reduces all pending
        # messages and applies the node update.
        for dim in range(self.dimensions):
            self.g.ndata['temp'] = torch.mm(h, self.weight[:, :, dim])
            self.g.send(self.list_edges[dim], anisotropic_gcn_msg)
        self.g.recv(reduce_func=gcn_reduce, apply_node_func=self.node_update)
        self.g.ndata.pop('temp')
        h = self.g.ndata.pop('h')
        return h

I tried uncommenting the line in runtime.py, but I don’t know how to interpret the result. Here is what I obtain for one call to AnisotropicGCN.forward():
FeatDict _z6 = READ_ROW(_z0, _z3)
FeatDict _z7 = READ_ROW(_z0, _z4)
FeatDict _z8 = READ_ROW(_z1, _z5)
FeatDict _z10 = EDGE_UDF(_z9, _z6, _z8, _z7)
APPEND_ROW(_z2, _z10)
Send took 0.485 s
FeatDict _z6 = READ_ROW(_z0, _z3)
FeatDict _z7 = READ_ROW(_z0, _z4)
FeatDict _z8 = READ_ROW(_z1, _z5)
FeatDict _z10 = EDGE_UDF(_z9, _z6, _z8, _z7)
APPEND_ROW(_z2, _z10)
Send took 0.332 s
FeatDict _z6 = READ_ROW(_z0, _z3)
FeatDict _z7 = READ_ROW(_z0, _z4)
FeatDict _z8 = READ_ROW(_z1, _z5)
FeatDict _z10 = EDGE_UDF(_z9, _z6, _z8, _z7)
APPEND_ROW(_z2, _z10)
Send took 0.361 s
FeatDict _z4 = READ_ROW(nf, _z1)
FeatDict _z5 = READ_ROW(msg, _z2)
FeatDict _z4 = NODE_UDF(_z3, _z4, _z5)
FeatDict _z9 = READ_ROW(nf, _z6)
FeatDict _z10 = READ_ROW(msg, _z7)
FeatDict _z9 = NODE_UDF(_z8, _z9, _z10)
FeatDict _z14 = READ_ROW(nf, _z11)
FeatDict _z15 = READ_ROW(msg, _z12)
FeatDict _z14 = NODE_UDF(_z13, _z14, _z15)
FeatDict _z19 = READ_ROW(nf, _z16)
FeatDict _z20 = READ_ROW(msg, _z17)
FeatDict _z19 = NODE_UDF(_z18, _z19, _z20)
FeatDict _z22 = MERGE_ROW(_z21, _z4, _z9, _z14, _z19)
WRITE_DICT(_z0, _z22)
FeatDict _z23 = READ_ROW(nf, recv_nodes)
FeatDict _z24 = UPDATE_DICT(_z23, _z0)
FeatDict _z26 = NODE_UDF(_z25, _z24)
FeatDict _z27 = UPDATE_DICT(_z0, _z26)
WRITE_ROW(nf, recv_nodes, _z27)
Recv took 45.900 s
Pop took 0.000 s
AnisotropicGCN took 47.081 s

Do these elements help you?

Hi,

It seems most of the time is spent on the READ_ROW operation, which is slow due to gathering certain rows out of a big tensor.

To verify this, could you also print out the time of each IR instruction? This can be done by modifying this line: uncomment it, and time the exe.run() call for each operation by changing it to import time; time_start = time.time(); exe.run(); print(time.time() - time_start).
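
Concretely, the timing change would look something like this (a rough sketch, assuming the execution loop in dgl/runtime/runtime.py looks like the following; your local copy may differ slightly):

import time

class Runtime(object):
    @staticmethod
    def run(prog):
        for exe in prog.execs:
            # (the uncommented line printing the IR instruction goes here)
            time_start = time.time()
            exe.run()  # execute one IR instruction
            print(time.time() - time_start)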

Thanks a lot! This would help us locate the problems.

Hello VoVAllen,
These are the results I get for one training epoch:

I cannot find anything that takes a really long time …

Thanks. It seems the time-consuming part is the IR generation. I’ll take a closer look this week.