Timing of builtin functions

Hi,

I’m trying to convert an existing GNN that I’ve implemented manually to DGL, and I can’t seem to achieve the same performance when using built-in functions. I’m passing messages in a heterogeneous bipartite graph with two types of nodes, one for each partition, and I’ve managed to partially reduce the issue to the following short piece of code, which is supposed to just sum neighbors’ embeddings:

G['l2c'].update_all(fn.copy_src('emb', 'm'), fn.sum('m', 'h'))
result = G.nodes['clause'].data['h']

Since I’m using built-in functions, I would have expected this to be translated to something like:

result = torch.mm(G.adjacency_matrix(etype='l2c'),G.nodes['literal'].data['emb'])

But the update_all version seems to take about 5-10 times longer than manually doing the sparse-dense multiplication, even though they return the same result. Am I missing something about how to do this correctly?
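For reference, here is roughly the kind of toy setup I’m comparing (the sizes below are made up, but the structure matches my graphs):

import torch
import dgl
import dgl.function as fn

# Toy sizes, roughly matching the 1-3% density of my real graphs.
num_literals, num_clauses, num_edges = 2000, 500, 6000
src = torch.randint(0, num_literals, (num_edges,)).tolist()
dst = torch.randint(0, num_clauses, (num_edges,)).tolist()

# Bipartite heterograph with node types 'literal' and 'clause' and edge type 'l2c'.
G = dgl.heterograph({('literal', 'l2c', 'clause'): list(zip(src, dst))})
G.nodes['literal'].data['emb'] = torch.randn(G.number_of_nodes('literal'), 64)

# Path 1: built-in message passing (sum of neighbor embeddings).
G['l2c'].update_all(fn.copy_src('emb', 'm'), fn.sum('m', 'h'))
h_builtin = G.nodes['clause'].data['h']

# Path 2: explicit adjacency matrix times the dense embedding matrix.
A = G.adjacency_matrix(etype='l2c')        # rows are 'clause' nodes here
h_spmm = torch.spmm(A, G.nodes['literal'].data['emb'])

print(torch.allclose(h_builtin, h_spmm))   # both paths should agree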

Thanks

Hi, could you please provide more details:

  1. What version of DGL are you using?
  2. Is the code running on CPU or GPU?
  3. What is your OS (Linux/Mac/Windows)? By default, we disable OpenMP on Mac, which makes DGL built-in functions on CPU slower.

I’m using DGL 0.4 on Ubuntu, and running on CPU.

Got it. The DGL CPU side is not carefully optimized yet; we have verified that there is a lot of room for further speedup, and you can expect some improvement in the master branch by the end of this month.

In your case, by running

result = torch.mm(G.adjacency_matrix(etype='l2c'),G.nodes['literal'].data['emb'])

what you actually did is a dense (not sparse)-dense matrix multiplication. By default, PyTorch calls a GEMM (General Matrix Multiply) implementation from Intel MKL, OpenBLAS, or something similar, which is highly optimized, and it’s very likely that the DGL implementation is not as fast on dense graphs.
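One quick way to check which case applies is to look at the layout of the tensor you get back (sketch):

A = G.adjacency_matrix(etype='l2c')
print(A.is_sparse, A.layout)   # e.g. True / torch.sparse_coo for a sparse tensor
# If A is dense, torch.mm dispatches to the highly optimized GEMM path;
# if it is sparse, torch.mm / torch.spmm does a sparse-dense multiply instead.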

We encourage users to use dense operators when the graph density is high (https://docs.dgl.ai/api/python/nn.pytorch.html#dense-conv-layers), and we will keep tuning our sparse operators. Thanks for your feedback.
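For a homogeneous graph, using a dense layer looks roughly like this (a sketch with arbitrary sizes; DenseGraphConv is one of the layers behind that link):

import torch
from dgl.nn.pytorch import DenseGraphConv

# Sketch: a dense N x N adjacency matrix plus node features,
# fed through a dense graph convolution layer.
N, in_feats, out_feats = 1000, 64, 32
adj = (torch.rand(N, N) < 0.05).float()   # arbitrary dense adjacency
feat = torch.randn(N, in_feats)

conv = DenseGraphConv(in_feats, out_feats)
out = conv(adj, feat)                      # shape (N, out_feats)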

Hi, thanks for the reply!

I don’t fully understand it, though: my graphs are not particularly dense, about 1-3%. I was under the impression that torch.mm knows to do a sparse-dense multiplication when given a sparse matrix, but just in case, I replaced torch.mm with torch.spmm and got the same results: the DGL code consistently takes about 10x as long as torch.spmm and torch.mm, with either the sparse adjacency matrix from the DGLGraph or its dense representation (via to_dense).
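For what it’s worth, I’m timing each version with something like this (simplified sketch; G here is my actual graph):

import time
import torch
import dgl.function as fn

emb = G.nodes['literal'].data['emb']
A = G.adjacency_matrix(etype='l2c')   # sparse adjacency from the DGLGraph

t0 = time.perf_counter()
G['l2c'].update_all(fn.copy_src('emb', 'm'), fn.sum('m', 'h'))
t1 = time.perf_counter()
h_spmm = torch.spmm(A, emb)
t2 = time.perf_counter()

print('update_all: %.3f ms' % ((t1 - t0) * 1e3))
print('torch.spmm: %.3f ms' % ((t2 - t1) * 1e3))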

Am I missing something here? Is this 10x factor going to disappear by the end of the month? As it stands, it makes training with the DGL version impractically slow in my setting…

That’s interesting, I’ll do a benchmark and see why that is the case.

Hi,

Any updates about that? Were you able to reproduce?

I’m seeing another issue, which may be related (I’m running 0.4.1 now), with memory consumption. For context, here’s my forward method:

def forward(self, G, feat_dict):
    G.nodes['literal'].data['Wh_l2c'] = feat_dict['literal']
    G['l2c'].update_all(fn.copy_src('Wh_l2c', 'm'), fn.sum('m', 'h'))
    cembs = self.activation(self.weight['l2c'](G.nodes['clause'].data['h']))  # cembs now holds the half-round embedding
    G.nodes['clause'].data['Wh_c2l'] = torch.cat([cembs, feat_dict['clause']], dim=1)
    G['c2l'].update_all(fn.copy_src('Wh_c2l', 'm'), fn.sum('m', 'h'))
    return self.activation(self.weight['c2l'](G.nodes['literal'].data['h']))

Besides the update_all part taking a long time, when running at scale it starts accumulating memory. If I replace fn.sum with fn.mean, it doesn’t (but it’s still just as slow).

Thanks

For the “accumulating memory” problem, you can try forcing a Python garbage collection to see if memory usage goes down.
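For example, between training iterations:

import gc

# Force a collection pass to rule out objects that are merely
# waiting to be garbage collected.
gc.collect()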

Also, we earlier fixed a memory leak with the PyTorch backend in 0.4.1. Could you try the latest nightly build and see if that issue persists?

pip install --pre dgl

Hi @lederg, I’ve written a benchmark script to test the speed of the basic sparse operator (SpMM): https://github.com/dglai/dgl-benchmark/tree/master/kernel/spmm_prof.py, and found that there is no performance gap on CPU of the kind you mentioned.

I think you only measured the speed of DGL for the first iteration. DGL has some extra work to do in the first pass (e.g. building the CSR representation, converting the CSR from 64-bit to 32-bit, etc.), but this information is cached and reused in later iterations. The correct way to benchmark is to warm up for a few steps so these one-time costs are paid, and then measure the average time per iteration after the warm-up period.
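The structure of the measurement is roughly the following (a simplified sketch, not a verbatim excerpt of the linked script):

import time

def bench(fn, warmup=5, iters=50):
    # Warm-up passes: one-time costs (building the CSR, 64-bit to 32-bit
    # conversion, etc.) are paid here and excluded from the measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3   # ms per iteration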

The following is the average time per iteration of torch.mm, torch.spmm, and DGL kernel fusion on a random graph with 10000 nodes and 0.01 sparsity:

torch mm average speed 0.17029092046949598 ms
torch spmm average speed 0.08437883589002822 ms
dgl kernel fusion average speed 0.022988221380445692 ms