Distributed heterogeneous graphs stored as homogeneous graphs

In the documentation (https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#construct-node-edge-features-for-a-heterogeneous-graph), I see this paragraph:

dgl.DGLGraph output by convert_partition.py stores a heterogeneous graph partition as a homogeneous graph. Its node data contains a field called orig_id to store the node IDs of a specific node type in the original heterogeneous graph and a field of NTYPE to store the node type.

Notably, each heterogeneous graph partition is stored as a homogeneous DGL graph, but with special node data fields that tell you each node's type (and its node-type-specific ID).
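To make the storage scheme concrete, here is a conceptual sketch in plain Python (no DGL) of how a heterogeneous graph can be flattened into one homogeneous ID space while keeping per-node metadata. The field names `NTYPE` and `orig_id` mirror the ones the documentation mentions; the node counts and type IDs are illustrative.

```python
# Conceptual sketch (plain Python, no DGL): flatten a heterogeneous graph's
# nodes into a single homogeneous ID space, recording each node's type and
# its type-specific original ID. All data here is illustrative.

# Two node types: "user" (3 nodes) and "item" (2 nodes).
num_nodes_per_type = {"user": 3, "item": 2}
ntype_ids = {"user": 0, "item": 1}

# Assign contiguous homogeneous IDs, type by type.
NTYPE = []    # node type of each homogeneous node
orig_id = []  # type-specific ID of each homogeneous node
for ntype, n in num_nodes_per_type.items():
    for i in range(n):
        NTYPE.append(ntype_ids[ntype])
        orig_id.append(i)

print(NTYPE)    # [0, 0, 0, 1, 1]
print(orig_id)  # [0, 1, 2, 0, 1]
```

Given these two arrays, any homogeneous node ID can be translated back to a (node type, type-specific ID) pair.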

Are there any implications/limitations I should be aware of with storing heterogeneous graphs as a homogeneous graph? For instance, I can imagine that current DGL metapath-based random walks wouldn’t work on this homogeneous graph.

Related: is it possible to take this homogeneous graph, and convert it to its equivalent heterogeneous graph, so that we can work with it in its “natural” way? Or are there limitations/considerations that I’m overlooking?
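For the conversion back, DGL does provide `dgl.to_heterogeneous`, which reassembles a heterograph from per-node and per-edge type fields; whether it interacts cleanly with partition-specific fields like `inner_node` is something I have not verified. The inverse mapping itself is just a group-by on the type field, sketched here in plain Python (no DGL, illustrative data):

```python
# Conceptual inverse mapping (plain Python, no DGL): recover per-type node
# groups from the flattened NTYPE / orig_id arrays. Field names mirror the
# documentation; the data below is illustrative.
NTYPE = [0, 0, 0, 1, 1]      # node type of each homogeneous node
orig_id = [0, 1, 2, 0, 1]    # type-specific ID of each homogeneous node
type_names = {0: "user", 1: "item"}

# Group nodes by type: type name -> list of (type-specific ID, homogeneous ID)
per_type = {}
for homo_id, (t, oid) in enumerate(zip(NTYPE, orig_id)):
    per_type.setdefault(type_names[t], []).append((oid, homo_id))

print(per_type)  # {'user': [(0, 0), (1, 1), (2, 2)], 'item': [(0, 3), (1, 4)]}
```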

Are the stored homogeneous graphs intended for distributed training?
If yes, I think following the graph structure in compact_g2 (dgl/convert_partition.py at 3b34a5a7ec2f996e1e287abcac8697c4658ab318 · dmlc/dgl · GitHub) should be fine. Fields such as inner_node, NID, and so on are required when instantiating DistGraphServer.
If not, it mainly depends on how the homogeneous graphs will be used, I think.
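As I understand the partition bookkeeping (worth double-checking against the DGL source), `inner_node` marks the nodes a partition actually owns, the remaining local nodes are halo copies of neighbors owned by other partitions, and `NID` maps local node IDs back to global ones. A toy illustration in plain Python, with all values made up:

```python
# Toy illustration (plain Python, no DGL) of the partition bookkeeping fields.
# A partition stores its own ("inner") nodes plus halo copies of remote
# neighbors; NID maps each local node back to its global ID.
NID = [10, 11, 12, 40, 41]      # global IDs of the 5 local nodes
inner_node = [1, 1, 1, 0, 0]    # 1 = owned by this partition, 0 = halo copy

owned_global_ids = [g for g, inner in zip(NID, inner_node) if inner]
print(owned_global_ids)  # [10, 11, 12]
```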

I think it’s possible to convert the homogeneous graph to a heterogeneous one, but the limitations/considerations depend on how the heterogeneous graph will be used. Since the homogeneous graphs are partitions of an original heterograph, are there any connections between the converted heterogeneous partitions?

Are the stored homogeneous graphs intended for distributed training?
I think it’s possible to convert the homogeneous graph to a heterogeneous one, but the limitations/considerations depend on how the heterogeneous graph will be used.

I’m looking to support training heterogeneous (e.g., metapath-based) graph DNN models (like GraphSage/PinSage) on large-scale heterogeneous graphs (e.g., 10B+ nodes and edges). Since graphs at that scale are too large to fit on a reasonable machine, I’ll need to shard the graph (i.e., graph partitioning).

As methods like GraphSage/PinSage need to perform (metapath-based) random walks on this distributed heterogeneous graph, I was a little surprised to see that DistDGL represents each graph partition as a homogeneous graph, rather than fully-fledged heterogeneous graphs. To me, the most natural thing to do would be to have DistDGL represent each graph partition as a heterogeneous graph, to enable the full heterogeneous graph API.

Unfortunately, the current implementation of DistDGL is tightly coupled with the existing homogeneous partitions. The "most natural thing" you describe is interesting; please let me discuss it with my team.


We can store a heterogeneous graph with one CSR (homogeneous graph format) or multiple CSRs (heterogeneous graph format). We can potentially implement metapath-based random walk on the one-CSR format.
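A metapath-based walk on the one-CSR (homogeneous) format would, in effect, filter each step's neighbor candidates by the `NTYPE` field. Here is a toy sketch of that idea in plain Python: adjacency lists stand in for a real CSR, the first eligible neighbor is taken instead of a random sample to keep it deterministic, and all data is illustrative.

```python
# Toy sketch: metapath-guided walk on a homogeneous representation, using a
# per-node NTYPE array to constrain each step to the required node type.
NTYPE = [0, 1, 1, 0]                 # 0 = "user", 1 = "item"
adj = {0: [1, 2], 1: [0, 3], 2: [3], 3: [1, 2]}

def metapath_walk(start, metapath):
    """Walk from `start`, requiring step i to land on a node of type metapath[i]."""
    trace = [start]
    cur = start
    for want_type in metapath:
        nexts = [v for v in adj[cur] if NTYPE[v] == want_type]
        if not nexts:
            break  # dead end: no neighbor of the required type
        cur = nexts[0]  # a real implementation would sample randomly here
        trace.append(cur)
    return trace

print(metapath_walk(0, [1, 0, 1]))  # [0, 1, 0, 1]: user -> item -> user -> item
```

For the heterogeneous format, DGL's `dgl.sampling.random_walk` already accepts a `metapath` argument; the sketch above only shows why the same logic remains feasible on the flattened representation.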


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.