Link Prediction Connected Edges end up More Dissimilar

Hello,

I have successfully trained a model using link prediction on a Heterograph. However, when I compute the embeddings and compare the cosine similarity between them, I find something odd which is that nodes that have nodes in the graph that are connected have lower cosine similarity than nodes that are not connected by edges! This is exactly the opposite of what is supposed to happen, so I’m wondering if I made a mistake.

My graph is defined like so:

data_dict = {('source', 'has_follower', 'user'): (torch.tensor([0, 0]), torch.tensor([0, 0])), ('user', 'follows', 'source'): (torch.tensor([0, 0]), torch.tensor([0, 0])) }
dgl_graph = dgl.heterograph(data_dict)

Here is how I train the model:

class TestRGCN(nn.Module):
    def __init__(self, in_feats, hid_feats, out_feats, canonical_etypes):
        super(TestRGCN, self).__init__()

        self.conv1 = dglnn.HeteroGraphConv({
                etype : dglnn.GraphConv(in_feats[utype], hid_feats, norm='right')
                for utype, etype, vtype in canonical_etypes
                })
        self.conv2 = dglnn.HeteroGraphConv({
                etype : dglnn.GraphConv(hid_feats, out_feats, norm='right')
                for _, etype, _ in canonical_etypes
                })

    def forward(self, blocks, inputs):
        x = self.conv1(blocks[0], inputs)
        x = self.conv2(blocks[1], x)

        return x

    def inference(self, curr_g, x, batch_size):
        
        nodes = torch.arange(curr_g.number_of_nodes())
        curr_g = curr_g.to('cpu')

        for l, layer in enumerate([self.conv1, self.conv2]):
            y = {k: torch.zeros(curr_g.number_of_nodes(k), self.hid_feats if l != self.n_layers - 1 else self.out_feats) for k in curr_g.ntypes}

            sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
            dataloader = dgl.dataloading.NodeDataLoader(
                curr_g, {k: torch.arange(curr_g.number_of_nodes(k)) for k in curr_g.ntypes}, sampler, batch_size=batch_size, shuffle=True, drop_last=False, num_workers=self.num_workers)

            for input_nodes, output_nodes, blocks in tqdm(dataloader):
                block = blocks[0].to(torch.device('cuda'))

                h = {k: x[k][input_nodes[k].type(torch.LongTensor)].to(torch.device('cuda')) for k in input_nodes.keys()}
                
                h = layer(block, h)

                for k in h.keys():
                    y[k][output_nodes[k].type(torch.LongTensor)] = h[k].cpu()

            x = y
            
        return y

class HeteroScorePredictor(nn.Module):
    def forward(self, edge_subgraph, x):
        with edge_subgraph.local_scope():
            edge_subgraph.ndata['h'] = x
            for etype in edge_subgraph.canonical_etypes:
                edge_subgraph.apply_edges(dgl.function.u_dot_v('h', 'h', 'score'), etype=etype)
                # edge_subgraph.apply_edges(self.apply_edges, etype=etype)
            return edge_subgraph.edata['score']

class TestModel(nn.Module):
    # here we have a model that first computes the representation and then predicts the scores for the edges
    def __init__(self, in_features, hidden_features, out_features, canonical_etypes):
        super().__init__()
        self.sage = TestRGCN(in_features, hidden_features, out_features, canonical_etypes)
        self.pred = HeteroScorePredictor()
    def forward(self, g, neg_g, blocks, x):
        x = self.sage(blocks, x)
        pos_score = self.pred(g, x)
        neg_score = self.pred(neg_g, x)
        return pos_score, neg_score 

def compute_loss(pos_score, neg_score, canonical_etypes):
    # Margin loss
    all_losses = []
    for given_type in canonical_etypes:
        n_edges = pos_score[given_type].shape[0]
        if n_edges == 0:
            continue
        all_losses.append((1 - neg_score[given_type].view(n_edges, -1) + pos_score[given_type].unsqueeze(1)).clamp(min=0).mean())
    return torch.stack(all_losses, dim=0).mean()

model = TestModel(in_features={'source':700, 'user':800}, hidden_features=512, out_features=256, canonical_etypes=g.canonical_etypes)

train_eid_dict = {('source', 'has_follower', 'user'): torch.arange(g.num_edges('has_follower')), ('user', 'follows', 'source'): torch.arange(g.num_edges('has_follower'))}
dataloader = dgl.dataloading.EdgeDataLoader(g, train_eid_dict, sampler, negative_sampler=dgl.dataloading.negative_sampler.Uniform(5), batch_size=args.batch_size, shuffle=True, drop_last=False, pin_memory=True, num_workers=args.num_workers)
...
for epoch in range(args.n_epochs):
        for input_nodes, positive_graph, negative_graph, blocks in dataloader:
            model.train()
            blocks = [b.to(torch.device('cuda')) for b in blocks]

            positive_graph = positive_graph.to(torch.device('cuda'))
            negative_graph = negative_graph.to(torch.device('cuda'))

            node_features = {'source': blocks[0].srcdata['source_embedding']['source'], 'user': blocks[0].srcdata['user_embedding']['user']}
            pos_score, neg_score = model(positive_graph, negative_graph, blocks, node_features)
            loss = compute_loss(pos_score, neg_score, g.canonical_etypes)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def compute_loss(pos_score, neg_score, canonical_etypes):
    # Margin loss
    all_losses = []
    for given_type in canonical_etypes:
        if given_type not in pos_score:
            continue
        n_edges = pos_score[given_type].shape[0]
        if n_edges == 0:
            continue
        all_losses.append((1 - neg_score[given_type].view(n_edges, -1) + pos_score[given_type].unsqueeze(1)).clamp(min=0).mean())
    return torch.stack(all_losses, dim=0).mean()

And then, I compute the inference, embeddings, and cosine similarity between them like so:

source_feats = g.nodes['source'].data['source_embedding'].to(torch.device('cuda'))
user_feats = g.nodes['user'].data['user_embedding'].to(torch.device('cuda'))
node_features_for_inference = {'source': source_feats, 'user': user_feats}
model.eval()
with torch.no_grad():
    # single gpu
    # if isinstance(model, FakeNewsModel):
    pred = model.inference(g, node_features_for_inference, args.batch_size)

embedding_1 = pred['source'][given_source_id-1]
embedding_2 = pred['user'][given_user_id-1]
embedding_3 = pred['user'][given_user_id_2-1]
print(F.cosine_similarity(embedding_1, embedding_2, dim=0).item())
# the nodes of embedding1 and embedding_3 are not connected but this number below is higher!
print(F.cosine_similarity(embedding_1, embedding_3, dim=0).item())

For some reason, the cosine similarity between nodes that are connected ends up being negative (like -0.4) and the one between nodes that aren’t connected ends up being positive (like 0.15)

all_losses.append((1 - neg_score[given_type].view(n_edges, -1) + pos_score[given_type].unsqueeze(1)).clamp(min=0).mean())

Loss is something you want to minimize while score is something you want to maximize. You need to take the negation of the loss above.

1 Like

@mufeili

So you mean I should do this? Are the docs here (https://docs.dgl.ai/guide/training-link.html) wrong then?

all_losses.append(1 - (1 - neg_score[given_type].view(n_edges, -1) + pos_score[given_type].unsqueeze(1)).clamp(min=0).mean())

I think that’s not correct.

@mufeili

Is the code I posted correct? Can you please help me with what it should be?

I tried it and the loss became negative and continued to decrease, seems odd.

Epoch 00000 | Time(s) nan | Loss -1803300736.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -135698120704.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -867152297984.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -3556228202496.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -12037722734592.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -31378796183552.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -70680867504128.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -139235818471424.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -308490421665792.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -416320575242240.0000 | ETputs(KTEPS) nan
Epoch 00000 | Time(s) nan | Loss -875629411041280.0000 | ETputs(KTEPS) nan
all_losses.append(1 - (1 - neg_score[given_type].view(n_edges, -1) + pos_score[given_type].unsqueeze(1)).clamp(min=0).mean())

You need to replace

(1 - neg_score[given_type].view(n_edges, -1) + pos_score[given_type].unsqueeze(1)).clamp(min=0).mean()

by

(1 + neg_score[given_type].view(n_edges, -1) - pos_score[given_type].unsqueeze(1)).clamp(min=0).mean()

For a reference, see 4.8, Graph Representation Learning Book.

@mufeili

I changed my loss to this:

def compute_loss(pos_score, neg_score, canonical_etypes):
    all_losses = []
    for given_type in canonical_etypes:
        if given_type not in pos_score:
            continue
        n_edges = pos_score[given_type].shape[0]
        if n_edges == 0:
            continue
        all_losses.append((1 + neg_score[given_type].view(n_edges, -1) - pos_score[given_type].unsqueeze(1)).clamp(min=0).mean())
    return torch.stack(all_losses, dim=0).mean()

I now find that my initial problem of edges ending up more dissimilar is solved. However, when training my model, I find that my loss doesn’t really drop, and I am getting loss values like so. Do you have any idea what I could look into to help with this?

Epoch 00009 | Time(s) 0.5093 | Loss 11385.8516 | ETputs(KTEPS) 1800.68
Epoch 00009 | Time(s) 0.5093 | Loss 17931.8340 | ETputs(KTEPS) 1800.65
Epoch 00009 | Time(s) 0.5093 | Loss 19779.6719 | ETputs(KTEPS) 1800.79
Epoch 00009 | Time(s) 0.5093 | Loss 16351.8076 | ETputs(KTEPS) 1800.87
Epoch 00009 | Time(s) 0.5093 | Loss 17668.3652 | ETputs(KTEPS) 1800.86
Epoch 00009 | Time(s) 0.5093 | Loss 12159.8066 | ETputs(KTEPS) 1800.75
Epoch 00009 | Time(s) 0.5093 | Loss 18360.0137 | ETputs(KTEPS) 1800.67
Epoch 00009 | Time(s) 0.5094 | Loss 22985.2695 | ETputs(KTEPS) 1800.55
Epoch 00009 | Time(s) 0.5094 | Loss 23384.4844 | ETputs(KTEPS) 1800.54
Epoch 00009 | Time(s) 0.5094 | Loss 8138.6538 | ETputs(KTEPS) 1800.59
Epoch 00009 | Time(s) 0.5093 | Loss 1536908.0000 | ETputs(KTEPS) 1800.71
Epoch 00009 | Time(s) 0.5094 | Loss 559241.1250 | ETputs(KTEPS) 1800.27

Also, just to be sure, the negative_sampler in DGL (https://docs.dgl.ai/_modules/dgl/dataloading/negative_sampler.html#Uniform), works specific to edge types right. As in if Node A is connected to B with type c, but not connected to Node D, and Node A is connected to Node E with type F, then a positive pair could be (A, B) and a negative for that would be (A, D) - since they aren’t connected with the same type. The documentation of that confuses me a bit because it says The resulting edges will also have type (srctype, etype, dsttype)., but I thought in link prediction the negative samples are between nodes that are not connected (at least by the same type)

I now find that my initial problem of edges ending up more dissimilar is solved. However, when training my model, I find that my loss doesn’t really drop, and I am getting loss values like so. Do you have any idea what I could look into to help with this?

How many edge types do you have? If you have too many edge types, then maybe you need to also average the loss over edge types.

Also, just to be sure, the negative_sampler in DGL (https://docs.dgl.ai/_modules/dgl/dataloading/negative_sampler.html#Uniform), works specific to edge types right. As in if Node A is connected to B with type c, but not connected to Node D, and Node A is connected to Node E with type F, then a positive pair could be (A, B) and a negative for that would be (A, D) - since they aren’t connected with the same type. The documentation of that confuses me a bit because it says The resulting edges will also have type (srctype, etype, dsttype). , but I thought in link prediction the negative samples are between nodes that are not connected (at least by the same type)

The documentation there might be a bit confusing. Basically it just says that the sampled negative edges are type-specific.

@mufeili

I have quite a few edge types, like greater than 10 easily. I return torch.stack(all_losses, dim=0).mean(), is this not averaging over edge types, or do I need to do something else?

@mufeili

I also had a question on how edges are chosen by sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2) and EdgeDataLoader as I have used them above. Are edges/nodes just chosen randomly, or does the system make sure that each edge/node relation is used as a positive example at least once?

Then you should have already averaged the loss over edge types.

sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2) basically samples all 2-hop neighbors across all edge types.

@mufeili

But does it make sure that ever pair of nodes that are connected with a given edge type are chosen as a positive pair in every epoch or is it just random?

EdgeDataLoader should guarantee that.

@mufeili

Thanks for all the help so far. I’m also trying to train node classification, and I was just wondering if I was doing it correctly because the training happens really fast. Specifically, I’m wondering if I’m setting ‘train_nid’ correctly.

When I train for link prediction, I use a train_eid with EdgeDataLoader like this:

train_eid_dict = {('source', 'has_follower', 'user'): torch.arange(g.num_edges('has_follower')), ('user', 'follows', 'source'): torch.arange(g.num_edges('has_follower')), ('source2', 'links', 'source'): torch.arange(g.num_edges('links')), ('source', 'has_2', 'source2'): torch.arange(g.num_edges('has_good'))}

At the bottom is how I am now training for Node Classification. As the train_nid dict, I use train_idx, which is just an array with the indices of the sources that I am training on for node classification. I don’t include any of the other edges/node types in my train_idx dict, like I did with EdgeDataLoader. However, the blocks generated by the NodeDataLoader do have those nodes. Even still, I’m confused if they are actually being used, because the model is tor run ~60K user nodes with each have embedding of size ~800 in almost 0 seconds on a standard 12GB GPU. So I’m wondering if I made a mistake somewhere and all the features of my graph aren’t being used, or it being that fast is expected.

train_mask = np.zeros(num_sources)
for given_source, source_id in all_sources_list.items():
    if given_source in train_set:
        train_mask[given_source_id-1] = 1

train_idx = torch.nonzero(train_mask_tensor).squeeze()

sampler = dgl.dataloading.MultiLayerFullNeighborSampler(3)
dataloader_new = dgl.dataloading.NodeDataLoader(curr_g, {'source': train_idx}, sampler, batch_size=args.batch_size, shuffle=True, drop_last=False, num_workers=args.num_workers)

loss_fcn = nn.CrossEntropyLoss()
loss_fcn = loss_fcn.to(torch.device('cuda'))

for epoch in range(args.n_epochs):
    model.train()
    optimizer.zero_grad()
    for iteration, (input_nodes, output_nodes, blocks) in enumerate(dataloader_new):

        blocks = [b.to(torch.device('cuda')) for b in blocks]
        input_features = blocks[0].srcdata     # returns a dict
        output_labels = blocks[-1].dstdata['source_label']['source']    # returns a dict
        output_labels = (output_labels - 1).long()
        output_labels = torch.squeeze(output_labels)

        node_features = {'source': blocks[0].srcdata['source_embedding']['source'].to(torch.device('cuda')), 'user': blocks[0].srcdata['user_embedding']['user'].to(torch.device('cuda')), 'good_source': blocks[0].srcdata['source_2_embedding']['source_2'].to(torch.device('cuda'))

        output_predictions = model(blocks, node_features)['source']

        loss = loss_fcn(output_predictions, output_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # and then evaluate
train_eid_dict = {('source', 'has_follower', 'user'): torch.arange(g.num_edges('has_follower')), ('user', 'follows', 'source'): torch.arange(g.num_edges('has_follower')), ('source2', 'links', 'source'): torch.arange(g.num_edges('links')), ('source', 'has_2', 'source2'): torch.arange(g.num_edges('has_good'))}

By doing so, your model is asked to perform link prediction on all 4 types of edges. Is this what you want?

At the bottom is how I am now training for Node Classification. As the train_nid dict, I use train_idx , which is just an array with the indices of the sources that I am training on for node classification. I don’t include any of the other edges/node types in my train_idx dict, like I did with EdgeDataLoader. However, the blocks generated by the NodeDataLoader do have those nodes. Even still, I’m confused if they are actually being used, because the model is tor run ~60K user nodes with each have embedding of size ~800 in almost 0 seconds on a standard 12GB GPU. So I’m wondering if I made a mistake somewhere and all the features of my graph aren’t being used, or it being that fast is expected.

If you only want to classify training nodes that are all source nodes, then the way you use dgl.dataloading.NodeDataLoader seems to be correct. nids simply specifies the nodes for model output, but their neighbors can still involve nodes of other types.

How does your learning curve look like?

@mufeili

Yea, I want to capture the similarity in the graph between all 4 edge types.

But how does my model train in less than 1 second with 40K user nodes that each have embedding of size 800? I only have 400 source nodes, but they are connected to the user ones (I can see that 40K user nodes with their embeddings are being given to the model as features). It just seems too quick to run 3 CONV operations. What optimizations does DGL make because it is only batching the source nodes, but not the user ones? As far as the training accuracy, it doesn’t seem to overfit the model but the loss does seem to decrease.

Also, when I am training it this for node classification way, will the representation of the user nodes change as well?

As far as the learning curve, you can see the performance of the model here. I use a batch size > 400 and I only have 400 source nodes, so it’s able to run the entire train set in one batch.

Epoch 00000 | Loss 2102.2668 | Acc 0.2873
Epoch 0 Test Accuracy classification: 0.64 Loss: 16628.7109375

Epoch 00001 | Loss 17348.5469 | Acc 0.5745
Epoch 1 Test Accuracy classification: 0.6533333333333333 Loss: 16696.720703125

Epoch 00002 | Loss 17018.3555 | Acc 0.5854
Epoch 2 Test Accuracy classification: 0.64 Loss: 16345.33203125

Epoch 00003 | Loss 16688.9004 | Acc 0.5745
Epoch 3 Test Accuracy classification: 0.6533333333333333 Loss: 14424.1630859375

Epoch 00004 | Loss 14788.6865 | Acc 0.5772
Epoch 4 Test Accuracy classification: 0.68 Loss: 9910.41015625

Epoch 00005 | Loss 10168.4570 | Acc 0.6179
Epoch 5 Test Accuracy classification: 0.5866666666666667 Loss: 5502.78271484375

Epoch 00006 | Loss 6254.3848 | Acc 0.4986
Epoch 6 Test Accuracy classification: 0.6533333333333333 Loss: 826.9302368164062

Epoch 00007 | Loss 1024.8438 | Acc 0.6260
Epoch 7 Test Accuracy classification: 0.25333333333333335 Loss: 7492.93408203125

Epoch 00008 | Loss 8407.9414 | Acc 0.2900
Epoch 8 Test Accuracy classification: 0.25333333333333335 Loss: 8983.81640625

Epoch 00009 | Loss 10173.7852 | Acc 0.2900
Epoch 9 Test Accuracy classification: 0.41333333333333333 Loss: 5174.62060546875

Epoch 00010 | Loss 6559.0908 | Acc 0.4363
Epoch 10 Test Accuracy classification: 0.72 Loss: 4223.3046875

Epoch 00011 | Loss 5400.2095 | Acc 0.6098
Epoch 11 Test Accuracy classification: 0.64 Loss: 5967.22607421875

Epoch 00012 | Loss 7225.6133 | Acc 0.5827
Epoch 12 Test Accuracy classification: 0.64 Loss: 5831.576171875

Epoch 00013 | Loss 6878.9229 | Acc 0.5799
Epoch 13 Test Accuracy classification: 0.64 Loss: 3769.7841796875

Epoch 00014 | Loss 4497.2041 | Acc 0.5935
Epoch 14 Test Accuracy classification: 0.68 Loss: 1260.674560546875

Epoch 00015 | Loss 1738.2714 | Acc 0.6260
Epoch 15 Test Accuracy classification: 0.38666666666666666 Loss: 2636.67529296875

Epoch 00016 | Loss 3459.1741 | Acc 0.3550
Epoch 16 Test Accuracy classification: 0.24 Loss: 3289.32666015625

Epoch 00017 | Loss 4002.0024 | Acc 0.2575
Epoch 17 Test Accuracy classification: 0.41333333333333333 Loss: 2885.09814453125

Epoch 00018 | Loss 3884.2263 | Acc 0.4228
Epoch 18 Test Accuracy classification: 0.48 Loss: 2611.711669921875

Epoch 00019 | Loss 3794.7185 | Acc 0.4715
Epoch 19 Test Accuracy classification: 0.68 Loss: 1708.960205078125

Epoch 00020 | Loss 2711.4412 | Acc 0.5718
Epoch 20 Test Accuracy classification: 0.72 Loss: 1636.6671142578125

Epoch 00021 | Loss 1969.6420 | Acc 0.6477
Epoch 21 Test Accuracy classification: 0.68 Loss: 2416.555419921875

Epoch 00022 | Loss 2442.2800 | Acc 0.6341
Epoch 22 Test Accuracy classification: 0.68 Loss: 3196.93212890625

Epoch 00023 | Loss 3193.8635 | Acc 0.6341
Epoch 23 Test Accuracy classification: 0.68 Loss: 3288.953857421875

Epoch 00024 | Loss 3298.9946 | Acc 0.6287
Epoch 24 Test Accuracy classification: 0.6666666666666666 Loss: 2801.912109375

Epoch 00025 | Loss 2888.8699 | Acc 0.6341
Epoch 25 Test Accuracy classification: 0.68 Loss: 1970.5631103515625

Epoch 00026 | Loss 2202.9199 | Acc 0.6450
Epoch 26 Test Accuracy classification: 0.6933333333333334 Loss: 1015.4242553710938

Epoch 00027 | Loss 1406.2284 | Acc 0.6477
Epoch 27 Test Accuracy classification: 0.6933333333333334 Loss: 1024.30615234375

Epoch 00028 | Loss 1750.1619 | Acc 0.5881
Epoch 28 Test Accuracy classification: 0.56 Loss: 1522.7830810546875

Epoch 00029 | Loss 2367.7964 | Acc 0.4932
Epoch 29 Test Accuracy classification: 0.6 Loss: 1256.1552734375

Epoch 00030 | Loss 2004.7062 | Acc 0.5122
Epoch 30 Test Accuracy classification: 0.6533333333333333 Loss: 898.8438720703125

Epoch 00031 | Loss 1278.9680 | Acc 0.6260
Epoch 31 Test Accuracy classification: 0.6933333333333334 Loss: 1112.3299560546875

Epoch 00032 | Loss 1376.6039 | Acc 0.6314
Epoch 32 Test Accuracy classification: 0.6933333333333334 Loss: 1320.6636962890625

Epoch 00033 | Loss 1618.3029 | Acc 0.6314
Epoch 33 Test Accuracy classification: 0.68 Loss: 1280.290283203125

Epoch 00034 | Loss 1613.1984 | Acc 0.6369
Epoch 34 Test Accuracy classification: 0.6933333333333334 Loss: 947.4419555664062

But how does my model train in less than 1 second with 40K user nodes that each have embedding of size 800? I only have 400 source nodes, but they are connected to the user ones (I can see that 40K user nodes with their embeddings are being given to the model as features). It just seems too quick to run 3 CONV operations.

If you use a batch size >400 then 400 source nodes takes a single iteration. How many edges are there in your graph from user nodes to the source nodes?

What optimizations does DGL make because it is only batching the source nodes, but not the user ones?

The sampler should sample the 3-hop neighborhood for the source nodes and the optimization is performed on the model output for the source nodes.

Also, when I am training it this for node classification way, will the representation of the user nodes change as well?

It sounds like you are learning embeddings from scratch for the user nodes. If you do not break the gradient for them, then they should change as well. You can also perform a sanity check to see if a subset of it actually changes across iterations.

Your loss seems to be too large and you may want to perform gradient clipping.