Low-level graph representations (C++ stuffs)

qzshadow · March 8, 2019, 9:47am

Any of authors could briefly introduce the low-level graph representations in DGL, something like the storage mechanism of edges (adjacency list or adjacency matrix), data structure it used, and how to effectively fetch nodes / edges from low-level storage, is multithread or multiprocessing techniques used?

zihao · March 10, 2019, 6:09pm

Hi,

graph.h defines the interface of data structure in standard DGLGraph, immutable_graph.h defines the interface of data structure in read-only DGLGraph(for read-only graphs, we could maintain csr/csc representations and other things that are extremely useful for some applications, especially on large graphs).
graph.cc and immutable_graph.cc implements the functions defined in the headers mentioned above.
nodeflow.cc defines a paradigm that is useful for training on subgraphs, there would be a blog post explaining this recently.
sampler.cc this implements frequently-used graph samplers, which is useful for training gcn on large graphs.

multi-threading and multi-processing are not used for most cases; However, for distributed training, These are required because it seems sampling graph is a bottleneck, and we create multiply threading/process to sample subgraphs in this scenario.

aksnzhy is working on this, if you are interested in our progress on this or you would like to contribute, feel free to contact him.

hbsun2113 · March 11, 2019, 3:05am

I also want to learn the C++ source code, but i find it hard to understand the variable meaning, could you add more annotations? For example:

zihao · March 11, 2019, 3:22am

indptr, indices and edge_ids defines a sparse matrix(this format is also known as csr representation), our variable names are the same as scipy, please see wiki: sparse matrix, csr_matrix in scipy and this blog for more details.

hbsun2113 · March 11, 2019, 3:41am

Thx! I think it is better to reveal these materials in the source code as the form of annotations.