Timing of built-in functions

Hi,

I’m trying to port an existing GNN that I’ve implemented manually over to DGL, and I can’t seem to match its performance when using built-in functions. I’m passing messages in a heterogeneous bipartite graph with two node types, one for each partition, and I’ve managed to partially reduce the issue to the following short piece of code, which is just supposed to sum neighbor embeddings:

import dgl.function as fn
# sum neighbor 'emb' features along l2c edges into 'h' on the clause nodes
G['l2c'].update_all(fn.copy_src('emb', 'm'), fn.sum('m', 'h'))
result = G.nodes['clause'].data['h']
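
For reference, here is a minimal sketch of how G is constructed (the edge list and sizes here are made up; my real graphs are much larger):

import dgl
import torch

# hypothetical toy instance: 4 literals, 2 clauses
edges = [(0, 0), (1, 0), (2, 1), (3, 1)]  # (literal, clause) pairs
G = dgl.heterograph({('literal', 'l2c', 'clause'): edges})
G.nodes['literal'].data['emb'] = torch.randn(4, 16)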

Since I’m using built-in functions, I would have expected this to be translated into something like:

result = torch.mm(G.adjacency_matrix(etype='l2c'), G.nodes['literal'].data['emb'])

But the update_all version seems to take about 5-10 times longer than doing the sparse-dense multiplication manually, even though they return the same result. Am I missing something about how to do this correctly?
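
For what it’s worth, this is roughly how I’m timing the two variants (a simplified sketch; my actual benchmark does warm-up runs and more iterations):

import time
import torch
import dgl.function as fn

def avg_time(f, n=100):
    # crude wall-clock average over n runs
    t0 = time.perf_counter()
    for _ in range(n):
        f()
    return (time.perf_counter() - t0) / n

adj = G.adjacency_matrix(etype='l2c')
emb = G.nodes['literal'].data['emb']
t_dgl = avg_time(lambda: G['l2c'].update_all(fn.copy_src('emb', 'm'), fn.sum('m', 'h')))
t_mm = avg_time(lambda: torch.mm(adj, emb))
print(t_dgl / t_mm)  # consistently in the 5-10 range for me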

Thanks

Hi, could you please provide more details:

  1. What version of DGL are you using?
  2. Is the code running on CPU or GPU?
  3. What is your OS (Linux/Mac/Windows)? By default, we disable OpenMP on Mac, which makes DGL’s built-in functions slower on CPU.

I’m using DGL 0.4 on Ubuntu, and running on CPU.

Got it. The DGL CPU side is not carefully optimized yet; we have verified there is a lot of room for further speedup, and you can expect some improvement in the master branch by the end of this month.

In your case, by running

result = torch.mm(G.adjacency_matrix(etype='l2c'), G.nodes['literal'].data['emb'])

what you actually did is a dense (not sparse) by dense matrix multiplication. By default, PyTorch dispatches this to a GEMM (General Matrix Multiply) implementation in Intel MKL, OpenBLAS, or something similar, which is highly optimized, and it’s very likely the DGL implementation is not as fast as those on dense graphs.
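
You can check which path you are hitting by inspecting the layout of the adjacency tensor, for example (a quick sketch):

adj = G.adjacency_matrix(etype='l2c')
print(adj.is_sparse)  # True for a torch.sparse COO tensor
print(adj.layout)     # torch.sparse_coo vs. torch.strided (dense)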

We encourage users to use dense operators when graph density is high (https://docs.dgl.ai/api/python/nn.pytorch.html#dense-conv-layers), and we will keep tuning our sparse operators. Thanks for your feedback.
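
For example, something along these lines (an illustrative sketch only, with parameter names taken from the linked docs page; note that DenseGraphConv also applies a learned linear transform, so it is not a drop-in replacement for a plain neighbor sum):

import torch
from dgl.nn.pytorch import DenseGraphConv

# hypothetical feature sizes; norm='none' because symmetric normalization
# assumes a square adjacency, which a bipartite l2c matrix is not
adj = G.adjacency_matrix(etype='l2c').to_dense()
conv = DenseGraphConv(in_feats=16, out_feats=16, norm='none')
h = conv(adj, G.nodes['literal'].data['emb'])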

Hi, thanks for the reply!

I don’t fully understand it, though: my graphs are not particularly dense, about 1-3%. I was under the impression that torch.mm knows to do a sparse-dense multiplication when given a sparse matrix, but just in case, I replaced torch.mm with torch.spmm and I get the same results: the DGL code consistently takes about 10x as long as torch.spmm or torch.mm, with either the sparse adjacency matrix from the DGLGraph or its dense representation (via to_dense).
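
Concretely, all three of these produce the same values (sketch):

import torch
import dgl.function as fn

adj = G.adjacency_matrix(etype='l2c')  # torch sparse COO
emb = G.nodes['literal'].data['emb']

G['l2c'].update_all(fn.copy_src('emb', 'm'), fn.sum('m', 'h'))
h_dgl = G.nodes['clause'].data['h']
h_spmm = torch.spmm(adj, emb)            # sparse-dense
h_dense = torch.mm(adj.to_dense(), emb)  # dense-dense
assert torch.allclose(h_dgl, h_spmm) and torch.allclose(h_dgl, h_dense)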

Am I missing something here? Is this 10x factor going to disappear by the end of the month? As it is, it makes training the DGL version in my setting impractically slow…

That’s interesting. I’ll run a benchmark and see why that is the case.