Distributed heterogeneous graphs stored as homogeneous graphs

In the documentation (https://docs.dgl.ai/en/latest/guide/distributed-preprocessing.html#construct-node-edge-features-for-a-heterogeneous-graph), I see this paragraph:

dgl.DGLGraph output by convert_partition.py stores a heterogeneous graph partition as a homogeneous graph. Its node data contains a field called orig_id to store the node IDs of a specific node type in the original heterogeneous graph and a field of NTYPE to store the node type.

Notably, each heterogeneous graph partition is stored as a homogeneous DGL graph, but with special node data fields that tell you each node's type (and its node-type-specific ID).
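To make the storage scheme concrete, here is a conceptual sketch in plain Python (no DGL) of how a heterogeneous graph can be flattened into one homogeneous ID space while keeping per-node metadata. The field names `NTYPE` and `orig_id` mirror the ones the documentation mentions; the node counts and type IDs are illustrative.

```python
# Conceptual sketch (plain Python, no DGL): flatten a heterogeneous graph's
# nodes into a single homogeneous ID space, recording each node's type and
# its type-specific original ID. All data here is illustrative.

# Two node types: "user" (3 nodes) and "item" (2 nodes).
num_nodes_per_type = {"user": 3, "item": 2}
ntype_ids = {"user": 0, "item": 1}

# Assign contiguous homogeneous IDs, type by type.
NTYPE = []    # node type of each homogeneous node
orig_id = []  # type-specific ID of each homogeneous node
for ntype, n in num_nodes_per_type.items():
    for i in range(n):
        NTYPE.append(ntype_ids[ntype])
        orig_id.append(i)

print(NTYPE)    # [0, 0, 0, 1, 1]
print(orig_id)  # [0, 1, 2, 0, 1]
```

Given these two arrays, any homogeneous node ID can be translated back to a (node type, type-specific ID) pair.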

Are there any implications/limitations I should be aware of with storing heterogeneous graphs as a homogeneous graph? For instance, I can imagine that current DGL metapath-based random walks wouldn’t work on this homogeneous graph.

Related: is it possible to take this homogeneous graph, and convert it to its equivalent heterogeneous graph, so that we can work with it in its “natural” way? Or are there limitations/considerations that I’m overlooking?
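For the conversion back, DGL does provide `dgl.to_heterogeneous`, which reassembles a heterograph from per-node and per-edge type fields; whether it interacts cleanly with partition-specific fields like `inner_node` is something I have not verified. The inverse mapping itself is just a group-by on the type field, sketched here in plain Python (no DGL, illustrative data):

```python
# Conceptual inverse mapping (plain Python, no DGL): recover per-type node
# groups from the flattened NTYPE / orig_id arrays. Field names mirror the
# documentation; the data below is illustrative.
NTYPE = [0, 0, 0, 1, 1]      # node type of each homogeneous node
orig_id = [0, 1, 2, 0, 1]    # type-specific ID of each homogeneous node
type_names = {0: "user", 1: "item"}

# Group nodes by type: type name -> list of (type-specific ID, homogeneous ID)
per_type = {}
for homo_id, (t, oid) in enumerate(zip(NTYPE, orig_id)):
    per_type.setdefault(type_names[t], []).append((oid, homo_id))

print(per_type)  # {'user': [(0, 0), (1, 1), (2, 2)], 'item': [(0, 3), (1, 4)]}
```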

Are the stored homogeneous graphs intended for distributed training?
If yes, I think following the graph structure in compact_g2 (dgl/convert_partition.py at 3b34a5a7ec2f996e1e287abcac8697c4658ab318 · dmlc/dgl · GitHub) should be fine. Fields such as inner_node, NID, and so on are required when instantiating DistGraphServer.
If not, it mainly depends on how the homogeneous graphs will be used, I think.
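As I understand the partition bookkeeping (worth double-checking against the DGL source), `inner_node` marks the nodes a partition actually owns, the remaining local nodes are halo copies of neighbors owned by other partitions, and `NID` maps local node IDs back to global ones. A toy illustration in plain Python, with all values made up:

```python
# Toy illustration (plain Python, no DGL) of the partition bookkeeping fields.
# A partition stores its own ("inner") nodes plus halo copies of remote
# neighbors; NID maps each local node back to its global ID.
NID = [10, 11, 12, 40, 41]      # global IDs of the 5 local nodes
inner_node = [1, 1, 1, 0, 0]    # 1 = owned by this partition, 0 = halo copy

owned_global_ids = [g for g, inner in zip(NID, inner_node) if inner]
print(owned_global_ids)  # [10, 11, 12]
```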

I think it’s possible to convert the homogeneous graph to a heterogeneous one, but the limitations/considerations depend on how the heterogeneous graph will be used. Since the homogeneous graphs are partitions of an original heterograph, are there any connections between the converted heterogeneous partitions?

Are the stored homogeneous graphs intended for distributed training?
I think it’s possible to convert the homogeneous graph to a heterogeneous one, but the limitations/considerations depend on how the heterogeneous graph will be used.

I’m looking to support training heterogeneous (e.g., metapath-based) graph DNN models (like GraphSage/PinSage) on large-scale heterogeneous graphs (e.g., 10B+ nodes and edges). Since graphs at that scale are too large to fit on a reasonable machine, I’ll need to shard the graph (i.e., graph partitioning).

As methods like GraphSage/PinSage need to perform (metapath-based) random walks on this distributed heterogeneous graph, I was a little surprised to see that DistDGL represents each graph partition as a homogeneous graph, rather than fully-fledged heterogeneous graphs. To me, the most natural thing to do would be to have DistDGL represent each graph partition as a heterogeneous graph, to enable the full heterogeneous graph API.

Unfortunately, the current implementation of DistDGL is tightly coupled with the existing homogeneous partitions. The "most natural thing" you describe is interesting; please let me discuss it with my team.


We can store a heterogeneous graph with one CSR (homogeneous graph format) or multiple CSRs (heterogeneous graph format). We can potentially implement metapath-based random walk on the one-CSR format.
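A metapath-based walk on the one-CSR (homogeneous) format would, in effect, filter each step's neighbor candidates by the `NTYPE` field. Here is a toy sketch of that idea in plain Python: adjacency lists stand in for a real CSR, the first eligible neighbor is taken instead of a random sample to keep it deterministic, and all data is illustrative.

```python
# Toy sketch: metapath-guided walk on a homogeneous representation, using a
# per-node NTYPE array to constrain each step to the required node type.
NTYPE = [0, 1, 1, 0]                 # 0 = "user", 1 = "item"
adj = {0: [1, 2], 1: [0, 3], 2: [3], 3: [1, 2]}

def metapath_walk(start, metapath):
    """Walk from `start`, requiring step i to land on a node of type metapath[i]."""
    trace = [start]
    cur = start
    for want_type in metapath:
        nexts = [v for v in adj[cur] if NTYPE[v] == want_type]
        if not nexts:
            break  # dead end: no neighbor of the required type
        cur = nexts[0]  # a real implementation would sample randomly here
        trace.append(cur)
    return trace

print(metapath_walk(0, [1, 0, 1]))  # [0, 1, 0, 1]: user -> item -> user -> item
```

For the heterogeneous format, DGL's `dgl.sampling.random_walk` already accepts a `metapath` argument; the sketch above only shows why the same logic remains feasible on the flattened representation.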


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.