Double RAM problem of DGLGraph.add_nodes() and add_edges()

I need to load a huge graph, stored in a pandas DataFrame or NumPy array, into an existing empty DGLGraph instance (or add nodes and edges in batches). I found that, because of the Frame append operations in DGLGraph.add_nodes() (see dgl/heterograph.py at commit 7b766393f8923f4a171fc1262aa5455d48996ace in dmlc/dgl on GitHub), the node data/attributes end up in memory twice: once in my loaded pandas DataFrame and once in the DGL node Frame. This wastes gigabytes of system RAM.

My code looks like this, where df is the loaded pandas DataFrame:

graph = dgl.DGLGraph()
graph.add_nodes(num, data={"something": torch.tensor(df.values)})

Is there any way to add new nodes and edges without this doubled RAM usage, for example another dgl.DGLGraph interface or a customized function or class?

DGL's add_nodes is out-of-place, meaning that a new graph object is created. This will indeed double the memory consumption. Instead, could you try creating the graph with all the nodes up front, including the nodes you plan to add later?
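A minimal sketch of the pre-allocation idea, using only NumPy so that it runs without DGL: reserve the feature buffer for the final node count once, then fill it batch by batch, so that no append has to copy the rows already loaded. The helper names fill_preallocated and make_batches, and the sizes, are hypothetical. Handing the filled buffer to DGL is left out; torch.from_numpy shares memory with the array, but whether assigning the result to a graph's node data stays copy-free in a given DGL version is an assumption to verify.

```python
import numpy as np


def fill_preallocated(batches, total_rows, feat_dim, dtype=np.float32):
    """Write streamed feature batches into one buffer allocated up front."""
    buf = np.empty((total_rows, feat_dim), dtype=dtype)  # single allocation
    offset = 0
    for batch in batches:
        n = len(batch)
        if offset + n > total_rows:
            raise ValueError("more rows streamed than were pre-allocated")
        buf[offset:offset + n] = batch  # in-place write; only the batch is copied
        offset += n
    if offset != total_rows:
        raise ValueError("fewer rows streamed than were pre-allocated")
    return buf


def make_batches(num_batches, rows, feat_dim):
    # Hypothetical stand-in for chunks read from storage.
    for i in range(num_batches):
        yield np.full((rows, feat_dim), i, dtype=np.float32)


features = fill_preallocated(make_batches(3, 4, 2), total_rows=12, feat_dim=2)
```

Peak memory here is the full buffer plus one batch, rather than two full copies of the features.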

That would be very disappointing, because our server needs to support 10 billion nodes and has to stream raw data from Hadoop storage. The doubled RAM usage is therefore critical for us, and we would like a customized or optimized DGLGraph that avoids it.

I guess the problem is how to support large-scale streaming graphs with ever-growing nodes and edges. Since it is usually not possible to load the entire graph into memory anyway, one way I can think of is to do training and inference with subgraph sampling. Basically, you build a data pipeline outside DGL that produces subgraphs from the stream of new nodes and edges (e.g. using a graph database; the results could be stored on HDFS as well). The downstream DGL model only consumes those subgraphs during training and testing. I think this is what we actually deploy in one of our real-world solutions.
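As a rough illustration of such a pipeline, the sketch below groups a stream of edges into fixed-size chunks and relabels each chunk's node IDs to a compact local range. This is simple chunking, not neighbor sampling, and it is plain Python. The function name stream_subgraphs, the chunk size, and the in-memory edge list standing in for a graph database or HDFS reader are all assumptions, not part of DGL.

```python
from itertools import islice


def stream_subgraphs(edge_iter, chunk_size):
    """Yield (src, dst, global_ids) for consecutive chunks of an edge stream.

    src and dst use local IDs 0..k-1; global_ids[i] is the original ID of
    local node i.
    """
    edge_iter = iter(edge_iter)
    while True:
        chunk = list(islice(edge_iter, chunk_size))
        if not chunk:
            return
        local = {}  # global ID -> local ID, in order of first appearance
        src, dst = [], []
        for u, v in chunk:
            src.append(local.setdefault(u, len(local)))
            dst.append(local.setdefault(v, len(local)))
        yield src, dst, list(local)


# Hypothetical edge stream with large, sparse global node IDs.
edges = [(100, 200), (200, 300), (300, 100), (400, 100)]
subgraphs = list(stream_subgraphs(edges, chunk_size=2))
```

Only one chunk is held in memory at a time. Each yielded triple could then be turned into a small graph, for example with dgl.graph((src, dst), num_nodes=len(global_ids)), with features looked up by global_ids.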

We have a similar issue. How hard would it be to change DGLGraph so that it can load a new set of nodes and edges without creating a new data frame inside DGLGraph?

I would be happy to make the change if the effort required is reasonable, but I am not sure where to start.

@HuangLED @OrdinaryCrazy Can we follow up offline to understand more about your use cases? See your inbox if you missed the message.
