Dataloading.as_edge_prediction_sampler does not exclude edges

logan · March 31, 2023, 3:33pm

Hi! I’m trying to use exclude='reverse_types' in dataloading.as_edge_prediction_sampler. My understanding is that the message flow graphs (MFGs) should then contain neither the edges of the positive graph nor their reverse edges. But after executing:

import dgl
import torch

torch.manual_seed(0)
user = torch.randint(10, (30,))
item = torch.randint(10, (30,))

g = dgl.heterograph({
    ('user', 'click', 'item'): (user, item),
    ('item', 'clicked-by', 'user'): (item, user)})


neg_sampler = dgl.dataloading.negative_sampler.Uniform(5)
sampler = dgl.dataloading.as_edge_prediction_sampler(
    dgl.dataloading.NeighborSampler([2,2,2]),
    exclude='reverse_types',
    reverse_etypes={'click': 'clicked-by', 'clicked-by': 'click'},
    negative_sampler=neg_sampler)

dataloader = dgl.dataloading.DataLoader(
    g,
    # seed edges: every edge ID of every edge type
    {etype: torch.arange(g.num_edges(etype)) for etype in g.etypes},
    sampler,
    batch_size=2, shuffle=True, drop_last=False, num_workers=1)
input_nodes, pos_graph, neg_graph, mfgs = next(iter(dataloader))



print(pos_graph["click"].edges())
print(pos_graph["clicked-by"].edges())

print(mfgs[2]["click"].edges())
print(mfgs[2]["clicked-by"].edges())

I get

(tensor([0, 1]), tensor([0, 1]))
(tensor([], dtype=torch.int64), tensor([], dtype=torch.int64))
(tensor([0, 0, 2, 3, 4, 1, 5, 4, 6, 6, 7, 6, 5]), tensor([0, 0, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]))
(tensor([0, 0, 3]), tensor([0, 0, 1]))

This means my positive edge (0,0) of type (‘user’, ‘click’, ‘item’) is not excluded from the message flow graph, and its reverse edge (0,0) of type (‘item’, ‘clicked-by’, ‘user’) is not excluded either.

Can someone explain where’s the mistake?

peizhou001 · April 4, 2023, 5:04am

Hi @logan, this result is possible when your graph contains multi-edges, which is likely here because your randint has a small range.
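
To make the multi-edge point concrete, here is a minimal sketch in plain Python with made-up data (no DGL calls; `edges` and `seed_eid` are hypothetical names). It assumes that exclusion works by edge ID: a second copy of the same (user, item) pair then has a different ID and survives.

```python
from collections import Counter

# Hypothetical edge list: edge ID -> (user, item). Edges 0 and 2 are parallel.
edges = {0: (3, 7), 1: (4, 7), 2: (3, 7)}

# Count how often each (user, item) pair occurs to detect multi-edges.
pair_counts = Counter(edges.values())
multi_edges = {pair for pair, n in pair_counts.items() if n > 1}

# Excluding the seed edge by ID removes only that one copy.
seed_eid = 0
remaining = {eid: pair for eid, pair in edges.items() if eid != seed_eid}
still_present = edges[seed_eid] in remaining.values()

print(multi_edges)    # {(3, 7)}
print(still_present)  # True
```

If `multi_edges` is empty for your data, parallel edges cannot explain the output.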

logan · April 4, 2023, 11:38am

@peizhou001 I modified the code to filter out multi edges and observe the same behaviour.
Here is how I filtered out duplicates:

def remove_duplicates(u, v):
    # keep one copy of each (u, v) pair; a set does not preserve order
    u, v = list(zip(*list(set(zip(u.tolist(), v.tolist())))))
    return torch.tensor(u), torch.tensor(v)

torch.manual_seed(0)
user = torch.randint(10, (30,))
item = torch.randint(10, (30,))
user, item = remove_duplicates(user, item)

peizhou001 · April 6, 2023, 1:06am

The root cause should be ID remapping in the sampling process: the same IDs in pair_graph and in the MFG don’t necessarily refer to the same edge.
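
As an illustration of what such remapping can look like, here is a plain-Python sketch with made-up data. It is not DGL's implementation, and `compact` and `to_original` are hypothetical helpers. Two edge lists are relabeled independently to local IDs, so the local pair (0, 0) refers to different original edges in each.

```python
def compact(edge_list):
    """Relabel user and item IDs to local IDs in order of first appearance.

    Returns the local edges plus two lists mapping local ID -> original ID.
    """
    user_ids, item_ids = {}, {}
    local_edges = []
    for u, i in edge_list:
        lu = user_ids.setdefault(u, len(user_ids))
        li = item_ids.setdefault(i, len(item_ids))
        local_edges.append((lu, li))
    return local_edges, list(user_ids), list(item_ids)


def to_original(edge, users, items):
    """Map a local (user, item) pair back to original IDs."""
    return users[edge[0]], items[edge[1]]


# Hypothetical positive edge and sampled edges, in original IDs.
pos_local, pos_users, pos_items = compact([(8, 5)])
mfg_local, mfg_users, mfg_items = compact([(2, 9), (8, 5)])

print(pos_local[0], mfg_local[0])                     # (0, 0) (0, 0)
print(to_original(pos_local[0], pos_users, pos_items))  # (8, 5)
print(to_original(mfg_local[0], mfg_users, mfg_items))  # (2, 9)
```

The comparison is only meaningful after mapping back to original IDs. In DGL, the original IDs are, to my knowledge, stored on sampled graphs under dgl.NID and dgl.EID, so the same check can be done on pos_graph and the MFGs.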

logan · April 9, 2023, 9:20pm

Is it a bug? I think sampling is pretty common and this behaviour is unexpected. Do you have any suggestions on how to get the expected behaviour?

peizhou001 · April 10, 2023, 5:20am

After some testing, we confirmed it is a bug. Thanks for reporting it! We are fixing it in this PR.

system · May 10, 2023, 5:21am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.