I have one question about an efficient implementation of graph partition, can you give me some advice or DGL can directly help me? My scene is below:
I want to split a graph into many small graphs using an attribute of edges. For example, split a graph with time information into many subgraph of single day and I have many graph to split parallel. For example, I have 100 day’s item browsing history of 100000 users, which store in 100000 big graphs. I want to split each graph into 100 small graph parallel and efficiently. The time cost is very import for me. Dose DGL can help to do this ??
I am confused about this for a few days. Thanks very much for any help!
Currently we have no support for graph partition with
DGLGraph. How frequently do you partition graphs? E.g. partition only once at the data-preprocessing stage or dynamically partition during model computation?
One possibility is to use an existing library metis to partition graphs based on edge weights using NetworkX graphs. The contributed implementation of ClusterGCN partitions NetworkX graphs with metis in the stage of data preprocessing.
For more complex stuff you may need to manually filter edges based on edge feature and then create
I don’t quite understand your scenario. Let’s say you have 100 days, and X items with 100000 users. Do you need to transform into 100 graphs, one graph for each day? What’s the current storage structure?
And what do you mean by time costing? Is it a streaming system or the data processing procedure could be offline?
@VoVAllen @mufeili Dear Thanks very much for your timely reply!!! Sorry for my poor english. I think it is not graph partition and do not need graph cluster because i just want to split the graph into small graphs according to the attribution of edge. @VoVAllen You are right. I just want to split the whole graph to small graph of each day. My current storage struct is a list of quadruples (user, relation, item, time_list). where time_list is the list of times that (user, relation, item) happens. I generate many DGL graphs regarding the time_list and relation as two attributions of the edges, where each DGL_graph corresponding to one user. Take User1 for example, graph = [(user1, click, item1, [t1, t2, t3]), (user1, buy, item1, [t2]), (user1, buy, item2, [t1, t3])]. I want to generate three graphs: G_t1:[(user1, click, item1), (user1, buy, item2)], G_t2:[(user1, click, item1 ), (use1, buy, item2)], G_t3:[(user1, click , item1), (user1, buy, item2)]
I have stored many graphs of different users, and want to parallel do the same thing to each users in a batch. It is a streaming system and can not be preprocessed.
I don’t think dgl has any function can help this. I suggest constructing graph from quadruples with multiprocessing library instead of transform other graphs. How much time does it cost now?
Finally, I only have to preprocess the data with multiprocess and save all subgraphs as file. But it seems DGL.subgraph do not support pickle (NotImplementedError: SubgraphIndex pickling is not supported yet.)…How to transform a
DGL.subgraph object to a new
dgl.DGLGraph objects?? Thanks very much for your help!
Sorry for the inconvenience. I just added it to our 0.5 plan at https://github.com/dmlc/dgl/issues/930#issuecomment-550985840.
I’m not familiar with the code at this part. Let me ask our team member to answer this. @minjie