Hi everyone,
I’m currently working with a large dataset (~160,000 graphs), where each graph is a heterograph with 3 edge types and 5 node types. Constructing these graphs one by one on a single CPU core is highly time-consuming.
To optimize this, I wrote a custom function to replace the `heterograph()` function provided in `dgl/convert.py`. The main difference lies in how the `rel_graphs` are constructed: I build their underlying `_graph` index objects directly instead of creating full graph instances. Here’s my code:
```python
from dgl import DGLGraph, heterograph_index, utils
from dgl import backend as F


def toDGLGraph(edge, nums_dict):
    # edge: {(srctype, etype, dsttype): (u, v)} tensors; nums_dict: {ntype: num_nodes}
    (
        metagraph,
        ntypes,
        etypes,
        relations,
    ) = heterograph_index.create_metagraph_index(
        nums_dict.keys(), edge.keys()
    )
    num_nodes_per_type = utils.toindex(
        [nums_dict[ntype] for ntype in ntypes], "int64"
    )
    rel_graphs = [None] * len(relations)
    for i, (srctype, etype, dsttype) in enumerate(relations):
        # A unit graph is bipartite (2 node types) unless srctype == dsttype.
        num_ntypes = int(srctype != dsttype) + 1
        u, v = edge[(srctype, etype, dsttype)]
        rel_graphs[i] = heterograph_index._CAPI_DGLHeteroCreateUnitGraphFromCOO(
            int(num_ntypes),
            int(nums_dict[srctype]),
            int(nums_dict[dsttype]),
            F.to_dgl_nd(u),
            F.to_dgl_nd(v),
            ["coo", "csr", "csc"],
            False,  # row_sorted
            False,  # col_sorted
        )
    hgidx = heterograph_index.create_heterograph_from_relations(
        metagraph, rel_graphs, num_nodes_per_type
    )
    return DGLGraph(hgidx, ntypes, etypes)
```
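For reference, the function expects `edge` keyed by canonical edge-type triples and `nums_dict` keyed by node type. A minimal sketch of the expected input shapes (hypothetical node/edge type names, with plain Python lists standing in for the framework tensors the real code would pass):

```python
# Hypothetical inputs for a 2-node-type, 2-relation graph; in practice the
# edge endpoint arrays would be framework tensors (e.g. torch.LongTensor).
nums_dict = {"user": 3, "item": 4}
edge = {
    ("user", "buys", "item"): ([0, 1, 2], [1, 2, 3]),  # (src ids, dst ids)
    ("user", "follows", "user"): ([0, 2], [1, 0]),
}

# Each relation becomes one unit graph: bipartite (2 node types) when the
# endpoint types differ, homogeneous (1 node type) when they coincide.
num_ntypes = {rel: int(rel[0] != rel[2]) + 1 for rel in edge}
print(num_ntypes[("user", "buys", "item")])     # 2 (bipartite unit graph)
print(num_ntypes[("user", "follows", "user")])  # 1 (homogeneous unit graph)
```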
This approach has halved the runtime of the graph construction process. However, I have two questions:
- Are there additional strategies to further improve the efficiency of constructing such a large number of graphs?
- Will there be any differences in functionality or behavior between my custom function and the built-in `heterograph()`, especially regarding correctness or compatibility?
Thank you for your insights!