EdgeDataLoader Information Leakage

I have a heterogeneous graph with 2 node types and 4 edge types, on which I perform link prediction. The aim is to avoid any possible information leakage, so I am using EdgeDataLoader with the exclude option and reverse_eids.
Is EdgeDataLoader the best-suited dataloader for this purpose? If so, do I need to include every relation type in each reverse_eids list to avoid information leakage? If not, how should this be done?

Is your graph bidirectional (i.e. does each edge type have a reverse edge type)? If so, then you could just use the reverse_types exclude option and supply the mapping between each edge type and its reverse instead.

It is bidirectional, and with reverse_types the reverse of each sampled edge is already being excluded.
My concern is that some data leakage could still happen through other edge types.
After much thought I am not even sure this is possible, but just to make sure:

Is there any chance that, while training with EdgeDataLoader, an edge outside the train split could be reached through a different edge type than the one being trained on?
Does this produce an inductive setting?

Thank you for your help.

The exclude option ensures that the edges sampled in the minibatch (as well as their reverse edges; in DGL 0.8 you can also define your own exclusion rule) will never appear in the sampled neighborhood.
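
For illustration, a minimal sketch of such a custom rule (this assumes the DGL 0.8 callable form of exclude, where the callable takes the minibatch's seed edge IDs and returns the edge IDs to exclude; I am assuming both are {edge type: ID tensor} dicts, and that reverse edges follow the same convention as reverse_etypes, i.e. edge i of a type and edge i of its reverse type form a pair):

        import torch

        # Sketch: a custom exclusion rule reproducing the 'reverse_types'
        # behaviour. Assumes reverse edges share the IDs of their forward
        # counterparts.
        reverse_map = {'rel1': 'rel2', 'rel2': 'rel1'}

        def exclude_fn(seed_edges):
            excluded = {}
            for etype, eids in seed_edges.items():
                # Exclude the sampled edges themselves ...
                excluded.setdefault(etype, []).append(eids)
                # ... and their reverse counterparts (same IDs by convention).
                if etype in reverse_map:
                    excluded.setdefault(reverse_map[etype], []).append(eids)
            return {etype: torch.cat(ts) for etype, ts in excluded.items()}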

Great, thank you.

I have one more doubt. Is it possible to make EdgeDataLoader split along a certain edge type?
I would like to split only one edge type and keep the rest intact, since I consider them part of the graph structure.
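
For example, something like this is what I have in mind (a sketch; 'rel1' stands for the one edge type I want to predict, and the mask is the split label I describe below):

        # Sketch: seed EdgeDataLoader with edges of a single type only;
        # the other edge types stay un-split and act purely as graph
        # structure during neighbor sampling.
        target_etype = 'rel1'
        train_eid_dict = {
            target_etype: (g.edges[target_etype].data['mask'] == 0).nonzero(as_tuple=True)[0]
        }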

Right now I split the dataset this way:

        # Assign each edge a random split label: 0 = train, 1 = validation, 2 = test.
        for etype in g.etypes:
            num = g.num_edges(etype)
            mask = np.random.choice(3, num, p=[0.8, 0.1, 0.1])
            g.edges[etype].data['mask'] = torch.tensor(mask, device=device)

I get the edge ids like this:

        train_eid_dict = {etype: (g.edges[etype].data['mask'] == 0).nonzero(as_tuple=True)[0]
                          for etype in g.etypes}

And call EdgeDataLoader and the batches this way:

        sampler = dgl.dataloading.MultiLayerNeighborSampler([2, 2])
        neg_sampler = dgl.dataloading.negative_sampler.Uniform(3)

        dataloader = dgl.dataloading.EdgeDataLoader(
            g, train_eid_dict, sampler, exclude='reverse_types',
            reverse_etypes={'rel1': 'rel2', 'rel2': 'rel1',
                            'rel3': 'rel4', 'rel4': 'rel3'},
            negative_sampler=neg_sampler, batch_size=g.number_of_edges(),
            shuffle=True, drop_last=False, device=device)

        for input_nodes, pos_pair_graph, neg_pair_graph, blocks in dataloader:
            ...

But this outputs rather small graphs. Ideally I would like to train on bigger graphs; how can I fix this?

How large are your graph and train_eid_dict? The sampled graphs may be small simply because a graph of that size is all that is needed to compute your two-layer GNN.
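
If they do turn out to be small and you want bigger sampled graphs, the usual knob is the sampler fanout, e.g. (a sketch, not specific to your code):

        # Sketch: larger fanouts yield larger sampled blocks ...
        sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 10])
        # ... or take every neighbor at both layers:
        sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)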

My graph has the following edge counts:
rel1: 52000
rel2: 52000
rel3: 314000
rel4: 314000

And train_eid_dict is:
rel1: 42000
rel2: 42000
rel3: 251000
rel4: 251000

Now I can see it is not that small compared to the original graph. However, after a few changes to the way the masks are assigned, the model's performance dropped.
Thank you for your time.

Just to make sure: did you confirm that the output graphs from the dataloader are not that small?
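
For example, something along these lines would print the sizes per batch (a sketch reusing the names from your loop above):

        # Sketch: inspect the sizes of what the dataloader yields.
        for input_nodes, pos_pair_graph, neg_pair_graph, blocks in dataloader:
            print('input nodes:', {nt: t.shape[0] for nt, t in input_nodes.items()})
            for cet in pos_pair_graph.canonical_etypes:
                print(cet, 'positive pairs:', pos_pair_graph.num_edges(cet))
            for i, block in enumerate(blocks):
                print('block', i, ':', block.num_src_nodes(), '->', block.num_dst_nodes())
            break  # one batch is enough for a sanity check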

Yes, I did make sure the output graphs are not that small compared to the original one, thank you.

Lately I have been trying to make some improvements to the model and I would like to do the following:

  • Apply the negative sampler to only one specific edge type, so that it generates negative edges of that type alone.

Is this possible, or does it generate negative samples for all the edge types in the graph?
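
In other words, something like this is what I am after (a hypothetical sketch; I am assuming that seeding the dataloader with rel1 edges only would also restrict the Uniform negative sampler to rel1 negatives, but I have not confirmed this):

        # Hypothetical sketch: seed only rel1 edges, hoping negatives are
        # then generated for rel1 pairs only.
        dataloader = dgl.dataloading.EdgeDataLoader(
            g, {'rel1': train_eid_dict['rel1']}, sampler,
            exclude='reverse_types',
            reverse_etypes={'rel1': 'rel2', 'rel2': 'rel1'},
            negative_sampler=dgl.dataloading.negative_sampler.Uniform(3),
            batch_size=1024, shuffle=True, drop_last=False, device=device)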
