Reducing Runtime in Heterograph Construction for Large-Scale Applications

Hi everyone,
I’m currently working with a large dataset (~160,000 graphs) in which each graph is a heterograph with 3 edge types and 5 node types. Constructing these graphs one by one on a single CPU core is very time-consuming.

To speed this up, I wrote a custom function to replace the heterograph() function provided in dgl/convert.py. The main difference lies in how the rel_graphs are constructed: I build each relation’s graph index directly through the C API (the same kind of object that a relation graph exposes as _graph) instead of creating full graph instances. Here’s my code:

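# Imports assumed from the names used below; these are DGL internals
# and may differ across versions.
from dgl import backend as F, heterograph_index, utils
from dgl.heterograph import DGLGraph


def toDGLGraph(edge, nums_dict):
    # edge:      {(srctype, etype, dsttype): (u, v)}, u and v being backend tensors of node IDs
    # nums_dict: {ntype: number of nodes of that type}
    (
        metagraph,
        ntypes,
        etypes,
        relations,
    ) = heterograph_index.create_metagraph_index(
        nums_dict.keys(), edge.keys()
    )
    num_nodes_per_type = utils.toindex(
        [nums_dict[ntype] for ntype in ntypes], "int64"
    )
    rel_graphs = [None] * len(relations)

    for i, (srctype, etype, dsttype) in enumerate(relations):
        # 2 node types for a bipartite relation, 1 when source and destination types match.
        num_ntypes = int(srctype != dsttype) + 1
        (u, v) = edge[(srctype, etype, dsttype)]
        # Build the unit graph index directly through the C API instead of
        # creating a full graph instance for each relation.
        rel_graphs[i] = heterograph_index._CAPI_DGLHeteroCreateUnitGraphFromCOO(
            int(num_ntypes),
            int(nums_dict[srctype]),
            int(nums_dict[dsttype]),
            F.to_dgl_nd(u),
            F.to_dgl_nd(v),
            ["coo", "csr", "csc"],  # allowed sparse formats
            False,  # row_sorted
            False,  # col_sorted
        )
    hgidx = heterograph_index.create_heterograph_from_relations(
        metagraph, rel_graphs, num_nodes_per_type
    )
    graph = DGLGraph(hgidx, ntypes, etypes)

    return graph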
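
For anyone who wants to reproduce the runtime comparison, a timing harness along these lines should be enough. This is only a sketch: time_builder and dummy_builder are hypothetical names, and the stand-in builder is there so the snippet runs without DGL installed.

```python
import time


def time_builder(build_fn, inputs, repeats=3):
    """Best wall-clock time in seconds over `repeats` full passes.

    build_fn: callable taking (edge, nums_dict), e.g. a graph builder
    inputs:   list of (edge, nums_dict) pairs
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for edge, nums_dict in inputs:
            build_fn(edge, nums_dict)
        best = min(best, time.perf_counter() - start)
    return best


def dummy_builder(edge, nums_dict):
    # Stand-in for a real builder: just counts the edges.
    return sum(len(u) for u, _ in edge.values())


# Toy inputs; plain lists stand in for tensors.
sample_inputs = [({("a", "r", "b"): ([0, 1], [1, 0])}, {"a": 2, "b": 2})] * 100
elapsed = time_builder(dummy_builder, sample_inputs)
print(f"best of 3 passes: {elapsed:.6f} s")
```

Passing toDGLGraph and dgl.heterograph (which also takes the edge dict followed by the node-count dict) as build_fn on the same inputs would give the two numbers to compare.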

This approach has halved the runtime of the graph construction process. However, I have two questions:

  1. Are there additional strategies to further improve the efficiency of constructing such a large number of graphs?
  2. Will there be any differences in functionality or behavior between my custom function and the heterograph() function, especially concerning correctness or compatibility?
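
On the second question: since the function hands edge and nums_dict straight to the C API, I assume it skips whatever input checks heterograph() performs. A pre-check like the sketch below could be run during development. It reflects my guess at what is worth checking, not DGL's own validation; validate_inputs is a hypothetical helper, and plain lists stand in for tensors.

```python
def validate_inputs(edge, nums_dict):
    """Hypothetical pre-check for the arguments passed to toDGLGraph.

    edge:      {(srctype, etype, dsttype): (u, v)}; u and v are sequences of node IDs
    nums_dict: {ntype: number of nodes of that type}
    Raises ValueError on the first problem found.
    """
    for (srctype, etype, dsttype), (u, v) in edge.items():
        for ntype in (srctype, dsttype):
            if ntype not in nums_dict:
                raise ValueError(f"unknown node type {ntype!r} in relation {etype!r}")
        if len(u) != len(v):
            raise ValueError(f"{etype!r}: u and v differ in length")
        for ids, ntype in ((u, srctype), (v, dsttype)):
            # Node IDs must lie in [0, number of nodes of that type).
            if len(ids) and (min(ids) < 0 or max(ids) >= nums_dict[ntype]):
                raise ValueError(f"{etype!r}: node ID out of range for {ntype!r}")


# Toy example with two node types and two relations.
nums_dict = {"user": 3, "item": 2}
edge = {
    ("user", "buys", "item"): ([0, 1, 2], [0, 1, 1]),
    ("user", "follows", "user"): ([0, 1], [1, 2]),
}
validate_inputs(edge, nums_dict)  # passes silently
```

With real tensors, the min/max checks would be better done with the backend's own reductions rather than Python's built-ins, and the check could be turned off once the data pipeline is trusted.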

Thank you for your insights!