Random walk sampling on a giant graph

nlj0011 · December 6, 2019, 3:28am

Hi,
Awesome library, thank you for the good work!

I have a question about working with a giant graph. My heterogeneous graph has about ten billion nodes, and I want to embed them with a metapath-based method. However, before calling metapath_random_walk, I have to build the graph with hetero_from_relations, which consumes a large amount of memory. Could you give me some suggestions for solving this problem? Thank you very much!

Best Regards!

mufeili · December 6, 2019, 6:49am

Currently we might not have very good built-in support for your scenario. Here is a possible solution:

1. Sample subgraphs from each relation graph.
2. Perform hetero_from_relations on the sampled relation subgraphs.
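
To make the first step concrete, here is a minimal sketch in plain Python, not the DGL API. It assumes each relation is available as a list of (src, dst) edge pairs and that the metapath is a list of relation names; the function name sample_relation_subgraphs, the fanout cap, and the toy "writes"/"written_by" relations are all hypothetical names chosen for illustration. The idea is to start from a few seed nodes and, for each relation on the metapath, keep only the edges leaving the current frontier.

```python
import random


def sample_relation_subgraphs(relations, metapath, seeds, fanout, rng):
    """Sample a small edge list per relation by following a metapath.

    relations: dict mapping relation name -> list of (src, dst) pairs.
    metapath:  list of relation names to follow in order.
    seeds:     starting node IDs (source type of the first relation).
    fanout:    cap on kept edges per frontier node, on average.
    rng:       a random.Random instance, for reproducibility.
    """
    frontier = set(seeds)
    sampled = {}
    for etype in metapath:
        # Edges of this relation that leave the current frontier.
        candidates = [(u, v) for u, v in relations[etype] if u in frontier]
        # Cap the number of kept edges so memory stays bounded.
        limit = fanout * len(frontier)
        if len(candidates) > limit:
            candidates = rng.sample(candidates, limit)
        # A metapath may visit the same relation twice, so accumulate.
        sampled.setdefault(etype, []).extend(candidates)
        # The next relation starts from the nodes we just reached.
        frontier = {v for _, v in candidates}
    return sampled


# Toy data (hypothetical): authors 0..2, papers 10..12.
relations = {
    "writes": [(0, 10), (0, 11), (1, 11), (2, 12)],
    "written_by": [(10, 0), (11, 0), (11, 1), (12, 2)],
}
sub = sample_relation_subgraphs(
    relations, ["writes", "written_by"], seeds=[0], fanout=2,
    rng=random.Random(0),
)
```

Each sampled edge list is small enough to turn into a relation graph, and those relation graphs are what you would pass to hetero_from_relations in the second step. On a graph of your size you would scan the edges from disk or from a sharded store rather than keep full Python lists in memory.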

nlj0011 · December 6, 2019, 10:34am

Is there an example of sampling subgraphs? Or does the library support sampling subgraphs with multiple workers? Thank you very much!

mufeili · December 6, 2019, 11:00pm

We have built-in support for sampling subgraphs and constructing data structures called NodeFlow. Both the sampling and the construction are quite efficient. Unfortunately, NodeFlow has a different API from DGLGraph, which makes things a bit more complicated. You may find this tutorial helpful.
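
For intuition only, here is a rough sketch of the layered structure that this kind of neighbor sampling produces: nodes are grouped into layers, and edges only connect adjacent layers. This is plain Python and does not use or reproduce the NodeFlow API; the function name sample_layers, its parameters, and the toy in_neighbors dict are assumptions made for illustration.

```python
import random


def sample_layers(in_neighbors, seeds, num_hops, fanout, rng):
    """Layered neighbor sampling, working backwards from the seed nodes.

    in_neighbors: dict mapping node -> list of its in-neighbors.
    Returns (layers, blocks): layers[-1] holds the seeds, and blocks[i]
    holds the sampled (src, dst) edges from layers[i] to layers[i + 1].
    """
    layers = [sorted(set(seeds))]
    blocks = []
    for _ in range(num_hops):
        edges = []
        for v in layers[0]:
            nbrs = in_neighbors.get(v, [])
            # Keep at most `fanout` in-neighbors per node.
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            edges.extend((u, v) for u in picked)
        # Prepend, so layers[0] is always the layer farthest from the seeds.
        layers.insert(0, sorted({u for u, _ in edges}))
        blocks.insert(0, edges)
    return layers, blocks


# Toy graph (hypothetical): node -> in-neighbors.
in_neighbors = {0: [1, 2], 1: [3], 2: [3, 4]}
layers, blocks = sample_layers(
    in_neighbors, seeds=[0], num_hops=2, fanout=2, rng=random.Random(0)
)
```

Only the sampled layers and blocks need to be in memory at once, which is why this style of sampling scales better than materializing the whole graph.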

mctt90 · December 23, 2019, 2:53am

Hi, I think ten billion nodes needs a distributed solution in DGL, and we are working on it. Please watch for our release news!

BarclayII · December 30, 2019, 2:29am

Out of curiosity, what is your hardware configuration (memory, # CPUs, disk space, etc.)? Do you expect a single-large-machine solution, or a distributed solution, or a single-machine on-disk solution, etc.?