@BarclayII For now, the code below is a hack that works for me. What do you think about it?
- This union might add overhead on every iteration.
- I might need to move the tensors back to the CPU to compute this union.
The ideal way to handle this would be inside the DataLoader itself. That said, do you think we can optimize it in any other way?
import numpy as np
import torch
import dgl
from dgl.dataloading import BlockSampler
class CustomNeighborSampler(BlockSampler):
    """Multi-layer neighbor sampler that always excludes a fixed set of
    edges (e.g. negative-sampled edges), merged with whatever per-batch
    ``exclude_eids`` the DataLoader passes in.

    Parameters
    ----------
    fanouts : list
        Number of neighbors to sample per layer (one entry per layer);
        forwarded to ``g.sample_neighbors``.
    exclude_neg_eids : dict or None
        Edge IDs to always exclude, keyed by edge type.  Values are
        assumed to be NumPy arrays or CPU tensors — TODO confirm with the
        caller.  ``None`` disables the static exclusion entirely.
    edge_dir : str
        Sampling direction; forwarded to ``g.sample_neighbors``.
    prob : str or None
        Name of the edge feature used as sampling probability; forwarded
        to ``g.sample_neighbors``.
    replace : bool
        Whether to sample with replacement.
    prefetch_node_feats, prefetch_labels, prefetch_edge_feats, output_device
        Forwarded unchanged to :class:`BlockSampler`.
    """

    def __init__(self, fanouts, exclude_neg_eids=None, edge_dir='in', prob=None,
                 replace=False, prefetch_node_feats=None, prefetch_labels=None,
                 prefetch_edge_feats=None, output_device=None):
        super().__init__(prefetch_node_feats=prefetch_node_feats,
                         prefetch_labels=prefetch_labels,
                         prefetch_edge_feats=prefetch_edge_feats,
                         output_device=output_device)
        self.fanouts = fanouts
        self.exclude_neg_eids = exclude_neg_eids
        self.edge_dir = edge_dir
        self.prob = prob
        self.replace = replace

    def _merge_exclude_eids(self, g, exclude_eids):
        """Union the static ``exclude_neg_eids`` with the per-batch
        ``exclude_eids``.  Returns the merged dict, or the batch dict
        unchanged when there is no static set.
        """
        if self.exclude_neg_eids is None:
            # Nothing static to merge; pass the batch exclusions through
            # (possibly None) instead of crashing on `None.keys()`.
            return exclude_eids
        if exclude_eids is None:
            # DataLoader may legitimately pass no per-batch exclusions;
            # the original code raised TypeError subscripting None here.
            exclude_eids = {}
        # Place results where sampling happens rather than on a hard-coded
        # 'cuda' device, so CPU-only runs work too.
        device = self.output_device if self.output_device is not None else g.device
        merged = {}
        # Union over BOTH key sets: iterating only self.exclude_neg_eids
        # (as the original did) silently dropped batch exclusions for edge
        # types absent from the static set.
        for etype in set(self.exclude_neg_eids) | set(exclude_eids):
            static = np.asarray(self.exclude_neg_eids.get(etype, []), dtype=np.int64)
            batch = exclude_eids.get(etype)
            batch = (np.asarray([], dtype=np.int64) if batch is None
                     else batch.cpu().numpy())
            union = torch.from_numpy(np.union1d(static, batch))
            # Keep the graph's own ID dtype: casting to int32 (as the
            # original did) truncates edge IDs on int64 graphs.
            merged[etype] = union.to(g.idtype).to(device)
        return merged

    def sample_blocks(self, g, seed_nodes, exclude_eids=None):
        """Sample one block per fanout, excluding the merged edge set.

        Returns ``(input_nodes, output_nodes, blocks)`` with ``blocks``
        ordered input-layer-first.
        """
        exclude_eids = self._merge_exclude_eids(g, exclude_eids)
        output_nodes = seed_nodes
        blocks = []
        # Build from the output layer inward; insert at the front so the
        # returned list is ordered input-layer-first.
        for fanout in reversed(self.fanouts):
            frontier = g.sample_neighbors(
                seed_nodes, fanout, edge_dir=self.edge_dir, prob=self.prob,
                replace=self.replace, output_device=self.output_device,
                exclude_edges=exclude_eids)
            eid = frontier.edata[dgl.EID]
            block = dgl.to_block(frontier, seed_nodes)
            # Preserve the original edge IDs on the block (to_block
            # overwrites EID with frontier-local IDs).
            block.edata[dgl.EID] = eid
            seed_nodes = block.srcdata[dgl.NID]
            blocks.insert(0, block)
        return seed_nodes, output_nodes, blocks