Exclude eids in EdgeDataLoader

jedbl · November 25, 2020, 8:54pm

Hi,
I am building a recommender system, using link prediction on a HeteroGraph with 8 types of relations and 3 types of nodes. To do so, I am using EdgeDataLoader.

I would like to understand better how edges are removed or not from the computation graph. In the docs (link), it is written that “the sampled edges as well as their reverse edges are removed from computation dependencies of the incident nodes. This is a common trick to avoid information leakage.”

Does this mean that with basic parameters, sampled edges are removed from the computation graph? If I were to create this EdgeDataLoader, would it remove the sampled edges?

edgeloader_train = dgl.dataloading.EdgeDataLoader(
    train_graph,
    train_eids_dict,
    sampler,
    negative_sampler=sampler_n,
    batch_size=edge_batch_size,
    shuffle=True,
    drop_last=False,
    pin_memory=True,
    num_workers=num_workers,
)

Or do I need to specify the “exclude” argument? And if I set “exclude” to “reverse_types” and pass the “reverse_etypes” mapping, do I also need to provide the edge ids?

edgeloader_train = dgl.dataloading.EdgeDataLoader(
    train_graph,
    train_eids_dict,
    sampler,
    exclude='reverse_types',   # If I use this, do I need to specify the edge ids?
    reverse_etypes={'buys': 'bought-by', 'bought-by': 'buys',
                    'clicks': 'clicked-by', 'clicked-by': 'clicks'},
    negative_sampler=sampler_n,
    batch_size=train_params.edge_batch_size,
    shuffle=True,
    drop_last=False,  # Keep the last batch even if it is not full
    pin_memory=True,  # Helps the transfer to GPU
    num_workers=num_workers,
)

Also, if I understood correctly, this means that the edges are removed from the computation (to prevent the model from, e.g., learning to predict high ratings for pairs of nodes just because they are connected in the graph), but the edges are still in the positive_graph generated by the dataloader.

I would like to make sure that the sampled edges are not in the computation graph; that way, I assume the model will generalize better to unseen data.

Thanks a lot in advance!

BarclayII · November 26, 2020, 6:56am

No, it would not. With the default parameters the sampled edges stay in the computation graph. You will need to specify either exclude='reverse_id' or exclude='reverse_types'.
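For the ID-based option, the loader needs to know which edge ID is the reverse of which. Here is a minimal sketch of building such a mapping in plain Python. It assumes a layout where edge IDs 0..E-1 are the forward edges and E..2E-1 are their reverses in the same order; the function name build_reverse_eids is hypothetical, and in practice you would probably hold the result in a tensor.

```python
def build_reverse_eids(num_forward_edges):
    """Map every edge ID to the ID of its reverse edge.

    Assumed layout (not something the library enforces): IDs [0, E) are
    forward edges, IDs [E, 2E) are their reverses in the same order.
    """
    E = num_forward_edges
    # Forward edge i maps to E + i; reverse edge E + i maps back to i.
    return list(range(E, 2 * E)) + list(range(E))


reverse_eids = build_reverse_eids(3)
# Pairs under the assumed layout: 0 <-> 3, 1 <-> 4, 2 <-> 5
```

If your edges are stored in a different order, the mapping has to be built from the actual (src, dst) pairs instead.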

You don’t have to provide the edge IDs when using exclude='reverse_types'. However, you need to make sure that edges with the same ID in a pair of reverse types (e.g. buys and bought-by) are reverse edges of each other.
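A quick way to sanity-check that requirement is to compare the two edge lists position by position. The sketch below uses plain Python lists of (src, dst) pairs indexed by edge ID; the helper name is_reverse_aligned and the toy data are made up for illustration. With a real graph you would first pull the endpoint lists out for each edge type.

```python
def is_reverse_aligned(forward_edges, reverse_edges):
    """Return True if reverse_edges[i] is the reverse of forward_edges[i] for every ID i.

    Both arguments are lists of (src, dst) pairs, indexed by edge ID.
    """
    if len(forward_edges) != len(reverse_edges):
        return False
    return all((v, u) == rev for (u, v), rev in zip(forward_edges, reverse_edges))


# Toy data: user -> item for 'buys', item -> user for 'bought-by'.
buys = [(0, 10), (0, 11), (1, 10)]
bought_by = [(10, 0), (11, 0), (10, 1)]      # same IDs, endpoints swapped
shuffled = [(11, 0), (10, 0), (10, 1)]       # same edges, but IDs do not line up

aligned = is_reverse_aligned(buys, bought_by)
misaligned = is_reverse_aligned(buys, shuffled)
```

In the shuffled case the reverse edges all exist, but excluding by matching IDs would drop the wrong ones, which is exactly the situation to avoid.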

The sampled edges are removed from the sampled blocks, meaning that they are not involved in neighbor aggregation. However, they still exist in the positive graph because you need them to compute the predictions on those edges.
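To make the distinction concrete, here is a toy sketch in plain Python. It is not how the library implements it (the real loader applies the exclusion while sampling neighborhoods, not over the whole graph), and the name split_minibatch is hypothetical. It only shows the idea: the sampled edges and their reverses are dropped from message passing, while the sampled edges themselves remain as prediction targets.

```python
def split_minibatch(all_edges, reverse_of, batch_eids):
    """Separate edges used for aggregation from edges used as positive targets.

    all_edges:  list of (src, dst) pairs, indexed by edge ID
    reverse_of: list mapping each edge ID to the ID of its reverse edge
    batch_eids: edge IDs sampled for this minibatch
    """
    # Exclude the sampled edges and their reverses from neighbor aggregation.
    excluded = set(batch_eids) | {reverse_of[e] for e in batch_eids}
    message_passing_eids = [e for e in range(len(all_edges)) if e not in excluded]
    # The sampled edges are still the pairs we score (the "positive graph").
    positive_pairs = [all_edges[e] for e in batch_eids]
    return message_passing_eids, positive_pairs


all_edges = [(0, 1), (0, 2), (1, 2), (1, 0), (2, 0), (2, 1)]
reverse_of = [3, 4, 5, 0, 1, 2]

mp_eids, positives = split_minibatch(all_edges, reverse_of, batch_eids=[0])
# Edge 0 = (0, 1) and its reverse, edge 3 = (1, 0), are left out of aggregation,
# but (0, 1) is still scored as a positive example.
```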

Please feel free to follow up. Thanks!

jedbl · November 26, 2020, 10:06pm

Thanks @BarclayII for your clear answer.

Just to make sure I understand properly: if nothing is specified in the “exclude” argument, neither the ‘original edge’ nor the ‘reverse edge’ would be removed from the computation graph. However, if I set “exclude” to “reverse_types” and provide the “reverse_etypes” mapping, then both the original edge and the reverse edge would be removed from the computation graph. Is that correct? Thanks!

BarclayII · November 30, 2020, 7:12am

Correct.
