Keep original pointers to tensors in node data of `BatchedDGLGraph`?

Since I am constructing a covariance matrix between all graphs in my dataset, I have to re-combine the nodes in every graph-graph pair, keep the node data, and construct new in-between edges. For that I am merging two (or more) graphs with this function:

import itertools

import dgl
import torch


def merge_graphs(graphs, keep_edges=False, create_new_inbetween_edges=True):
    """
    Merge two or more graphs into a single graph.

    Arguments
    ---------
    graphs : list of dgl.DGLGraph objects
        The graphs which will be merged into one graph.
    keep_edges : bool
        If True, keep the original edges and their data; otherwise all
        original edges are removed from the merged graph.
    create_new_inbetween_edges : bool
        If True, add an edge between every pair of nodes coming from
        two different input graphs.

    Returns
    -------
    dgl.DGLGraph
        A merged DGLGraph with the same node data as the original graphs.

    Author
    ------
    Maximillian F. Vording

    Inspiration
    -----------
    njchoma
    url: https://discuss.dgl.ai/t/best-way-to-send-batched-graphs-to-gpu/171/6
    """
    g_merged = dgl.DGLGraph(graph_data=dgl.batch(graphs))

    # nodes: copy the node data over from the original graphs
    labels = graphs[0].node_attr_schemes()
    for l in labels.keys():
        g_merged.ndata[l] = torch.cat([g.ndata[l] for g in graphs], 0)

    # edges: either copy the edge data over or drop all original edges
    if keep_edges:
        labels = graphs[0].edge_attr_schemes()
        for l in labels.keys():
            g_merged.edata[l] = torch.cat([g.edata[l] for g in graphs], 0)
    else:
        g_merged.remove_edges(list(range(g_merged.number_of_edges())))

    if create_new_inbetween_edges:
        # node indices of each original graph inside the merged graph
        num_nodes = [g.number_of_nodes() for g in graphs]
        offsets = [sum(num_nodes[:i]) for i in range(len(num_nodes))]
        new_node_inds = [
            list(range(offsets[i], offsets[i] + num_nodes[i]))
            for i in range(len(num_nodes))
        ]

        # fully connect every pair of distinct input graphs
        for i, j in itertools.combinations(range(len(graphs)), 2):
            src, dst = zip(*itertools.product(new_node_inds[i], new_node_inds[j]))
            g_merged.add_edges(src, dst)

    return g_merged
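
For context, this is roughly how I use it (a minimal sketch with random features and the old DGLGraph construction API):

import dgl
import torch

g1 = dgl.DGLGraph()
g1.add_nodes(3)
g1.ndata['h'] = torch.randn(3, 4)

g2 = dgl.DGLGraph()
g2.add_nodes(2)
g2.ndata['h'] = torch.randn(2, 4)

merged = merge_graphs([g1, g2])
print(merged.number_of_nodes())  # 5
print(merged.number_of_edges())  # 6: one edge per (node in g1, node in g2) pair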

I run into problems with dgl.batch() not preserving the references to the original nodes and their tensors, so I have to reconstruct the merged graphs every time the tensors on the original graphs are updated in each epoch. I also want to make sure that updates are consistent and shared between my BatchedDGLGraph and the original graphs in my dataset object, without having to set them explicitly as you suggest under BatchedDGLGraph/Update attributes, since doing so removes the common reference to the tensors that the merged graphs have.
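
Here is a minimal example of the staleness I am seeing:

import dgl
import torch

g = dgl.DGLGraph()
g.add_nodes(2)
g.ndata['h'] = torch.zeros(2, 4)

bg = dgl.batch([g, g])

# updating the original graph is not reflected in the batched graph,
# because batching concatenated (i.e. copied) the feature tensors
g.ndata['h'] += 1.0
print(g.ndata['h'].sum().item())   # 8.0
print(bg.ndata['h'].sum().item())  # 0.0: the batched copy went stale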

How can I make sure that the node data refers back to the tensors in the original graphs, without having to set it explicitly for each update?

I considered using dgl.DGLSubGraph instead, but since it does not support sharing of node/edge features for now, I’m not sure how to make that work either. When will sharing be supported?

I hope my question makes sense; if not, I can elaborate with more code and explanations.

Thanks in advance :slight_smile:

If you take a look at our implementation of DGLGraph, you will see that we store the node and edge features as _node_frame and _edge_frame. Therefore, you might need to hack the _node_frame and _edge_frame of BatchedDGLGraph. The source code of frame can be found here.
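
For instance (a minimal inspection sketch; these are private attributes of the old DGLGraph internals, not a stable public API):

import dgl
import torch

g = dgl.DGLGraph()
g.add_nodes(2)
g.ndata['h'] = torch.zeros(2, 4)

bg = dgl.batch([g, g])

# the feature storage referred to above; private internals
print(type(bg._node_frame))  # node feature storage
print(type(bg._edge_frame))  # edge feature storage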

Thanks for a quick pointer towards a possible hack.

It seems as if I would need a hack in PyTorch to make torch.cat preserve the original pointers. The new batched DGLGraph object is constructed with new Frame(cols) objects, with cols being a dict of concatenated tensors. Depending on the backend used, at least torch.cat instantiates a new torch.Tensor, which does not preserve the pointers to the previously used memory. And I can't see how to avoid concatenating the tensors in the BatchedDGLGraph without losing functionality. Given this, I am not quite sure how to share memory among nodes appearing in more than one graph.

Since slicing a torch.Tensor keeps the original pointers, I can better see how DGLSubGraph could support this feature.
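
To illustrate with plain PyTorch (independent of DGL):

import torch

a = torch.zeros(3, 2)
b = torch.ones(2, 2)

# torch.cat allocates fresh memory, so the result is a copy
c = torch.cat([a, b], 0)
a[0, 0] = 42.0
print(c[0, 0].item())  # 0.0: the concatenated tensor did not see the update

# slicing returns a view on the same underlying storage
v = a[0:2]
v[1, 1] = 7.0
print(a[1, 1].item())                # 7.0: visible through the original
print(v.data_ptr() == a.data_ptr())  # True: shared memory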

Correct me if I am wrong in my judgement of how hackable the BatchedDGLGraph is for sharing edge and node features. Maybe you can see another way around concatenating the tensors, or a way to do it while preserving the pointers.

Thanks :slight_smile:

With Frame(cols) you are creating a new frame, so the memory does not get shared. I'd suggest you directly use g1._node_frame._frame, which is an instance of the Frame class. You may read the source code there and see if you can directly hack the frame and its schemes.

Alternatively, how about directly calling dgl.unbatch(bg)? It will be slower, but probably not too bad, and it avoids the hacking effort.
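
Something along these lines (sync_node_data is a hypothetical helper, not part of DGL):

import dgl

def sync_node_data(bg, graphs):
    """Copy (possibly updated) node data from a batched graph back onto
    the original graphs after each epoch."""
    for g_orig, g_part in zip(graphs, dgl.unbatch(bg)):
        for key in g_part.node_attr_schemes().keys():
            g_orig.ndata[key] = g_part.ndata[key]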