How does g_sampling in EdgeDataLoader work?

s4chin · March 12, 2021, 9:59am

My use case: unsupervised node representation learning on an unlabelled homogeneous graph using GraphSAGE.

Graph description: the nodes are split into train, val, and test. The subgraph induced by the train nodes is a k-NN graph, while the val and test nodes are connected only to their nearest train nodes.

I’m building off this example: dgl/train_sampling_unsupervised.py at master · dmlc/dgl · GitHub

My train_dataloader will use the subgraph g.subgraph(train_nid).
How do I calculate validation loss? g.subgraph(val_nid) won’t have any edges, so I cannot use it for my val_dataloader. I need to iterate over the edges connected to val_nid while the sampled neighbors come from my train subgraph.
Q1. How do I use the g_sampling parameter from EdgeDataLoader for this to work? Or is there another way?

Q2. Does using one subgraph in EdgeDataLoader and another subgraph in g_sampling cause problems, given that creating a subgraph changes all the node and edge IDs?

After training on the train subgraph, I plan to use the model to run inference on the entire graph, as in this function, and use the node embeddings for my downstream task.
Q3. Does this seem correct?

Thanks for the help!

BarclayII · March 15, 2021, 4:59am

g_sampling is intended for the case where the graph for neighbor sampling is different from the graph for edge iteration. In your case, I think you can just use val_g = g.subgraph(torch.cat([train_nid, val_nid])) for validation, so that both the training nodes and the validation nodes are included. You don’t have to use g_sampling.

For Q2: you will need to ensure that g_sampling and g have the same set of nodes. The edges can be different, though.
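To illustrate the "same nodes, different edges" idea, here is a minimal sketch in plain Python rather than DGL. The node IDs, the edge list, and the helpers `make_sampling_edges` and `in_neighbors` are all made up for illustration. It assumes that neighbors are gathered along incoming edges, and it builds a sampling edge set in which validation nodes can never be picked as neighbors while every node is still present.

```python
# Toy graph; all IDs and edges below are hypothetical.
train_nodes = {0, 1, 2}
val_nodes = {3, 4}
all_nodes = train_nodes | val_nodes

# Directed edges (src, dst); messages flow from src to dst.
edges = [
    (0, 1), (1, 0), (1, 2), (2, 1),  # train <-> train (the k-NN part)
    (0, 3), (3, 0),                  # val node 3 <-> its nearest train node 0
    (2, 4), (4, 2),                  # val node 4 <-> its nearest train node 2
]


def make_sampling_edges(edges, val_nodes):
    """Drop every edge whose source is a validation node.

    The node set is untouched; only edges are removed, so a validation
    node can still receive messages but is never sampled as a neighbor.
    """
    return [(u, v) for (u, v) in edges if u not in val_nodes]


def in_neighbors(edges, node):
    """Sources of all edges pointing at `node`, sorted for readability."""
    return sorted(u for (u, v) in edges if v == node)


sampling_edges = make_sampling_edges(edges, val_nodes)

# Train node 0 sees val node 3 in the full graph but not in the sampling graph.
full_nbrs_of_0 = in_neighbors(edges, 0)
sampled_nbrs_of_0 = in_neighbors(sampling_edges, 0)
# Val node 3 still sees its train neighbor.
sampled_nbrs_of_3 = in_neighbors(sampling_edges, 3)
```

In DGL terms, the filtered edge list would correspond to a graph built over the same node IDs as g and passed as g_sampling; how that graph is constructed depends on the DGL version, so treat this only as a picture of the constraint.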

For Q3: that makes sense to me.

s4chin · March 16, 2021, 6:57am

Thanks a lot for the answers!

I thought of doing this, but since I have far more training edges than validation edges (edges from val nodes to train nodes), my “validation” pass would mostly run on the training edges themselves.

What I ended up doing was passing val_seeds = val_g.out_edges(val_nid, form='eid') as the eids parameter of EdgeDataLoader with g = val_g (where val_g is as you defined above), which means only my val edges will be used. This looks like a better option than using all edges.

The problem is that even in this case, the MultiLayerNeighborSampler will still sample nodes from my validation set as neighbors, which is not what I want. How do I fix this? Essentially, I want to iterate on val edges (which I’m doing using the eids parameter) and also sample neighbors only from the train set.

BarclayII · March 16, 2021, 7:12am

There are two things in EdgeDataLoader: one is which edges you iterate in minibatches (i.e. to compute scores), and the other is which edges you would like to sample neighbors from. So you can use EdgeDataLoader like:

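import torch
from dgl.dataloading import EdgeDataLoader  # location in the DGL releases of this thread's era

# Include both the training nodes and validation nodes in the graph...
val_g = g.subgraph(torch.cat([train_nid, val_nid]))
# subgraph() relabels nodes in the order given, so in val_g the validation
# nodes have IDs len(train_nid) .. len(train_nid) + len(val_nid) - 1.
val_node_ids = torch.arange(len(train_nid), len(train_nid) + len(val_nid))
dl = EdgeDataLoader(
    val_g,
    # ...but only iterate on the out-edges of the validation nodes
    val_g.out_edges(val_node_ids, form='eid'),
    sampler,
    # all the other arguments...
)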

EDIT: changed the argument of out_edges since subgraph would relabel the nodes.
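The relabeling mentioned in the EDIT can be sketched without DGL. The snippet below is only an illustration: `relabel` is a hypothetical helper, the IDs are made up, and it assumes (as the code above does) that a subgraph numbers its nodes by their position in the node list passed in.

```python
def relabel(node_ids):
    """Map each original node ID to its position in the list given to subgraph."""
    return {orig: new for new, orig in enumerate(node_ids)}


# Hypothetical original node IDs in the full graph.
train_nid = [10, 42, 7]
val_nid = [99, 3]

# Same ordering as torch.cat([train_nid, val_nid]): train first, then val.
mapping = relabel(train_nid + val_nid)

# IDs of the validation nodes inside the subgraph.
val_ids_in_subgraph = [mapping[n] for n in val_nid]

# Equivalent closed form, matching the torch.arange(...) call above.
expected = list(range(len(train_nid), len(train_nid) + len(val_nid)))
```

Under this assumption the original val_nid values (99 and 3 here) are no longer valid node IDs in the subgraph, which is why the arange over the tail positions is used. In DGL, the original IDs of a subgraph's nodes are also stored in val_g.ndata[dgl.NID], which can be used to check the mapping.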
