Link Prediction: Connected Edges End Up More Dissimilar

What do you mean by this? I currently create feature vectors for users from their Twitter information (followers/likes/an embedding of their bio/etc.)

0.001, 0.0001, 0.00001

nn.CrossEntropyLoss()

What do you mean by this? I currently create feature vectors for users from their Twitter information (followers/likes/an embedding of their bio/etc.)

Can you show me the code for doing this?

Another thing that I am confused about is that a smaller loss does not necessarily suggest a higher accuracy in your case. E.g. at epoch 00004 the loss is 3000 while the accuracy is 0.6; at epoch 71 the loss is 835 while the accuracy is 0.29. Meanwhile, the test accuracy numbers are significantly different from the training accuracy.

  1. Are you using the correct loss function for the metric you are interested in?
  2. What is the size of the dataset?
  3. How did you split the dataset into training, validation and test subsets?

The code is rather long, but basically I create a 768-dimensional vector consisting of the BERT embedding of a user's bio, plus some numerical features representing follower count and following count, and some binary features representing things such as whether a user is verified.
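(For illustration, such a feature vector might be assembled roughly like this. This is a minimal sketch, not the thread's actual code; the function name, the log scaling of the counts, and the exact dimensions are assumptions.)

import torch

def build_user_features(bio_embedding, followers, following, verified):
    # bio_embedding: BERT embedding of the user's bio (e.g. a 768-dim tensor)
    # log-scale the raw counts so they are on a range comparable to the embedding
    numeric = torch.log1p(torch.tensor([followers, following], dtype=torch.float32))
    binary = torch.tensor([float(verified)])
    # concatenate everything into one feature vector per user
    return torch.cat([bio_embedding, numeric, binary])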

Yea, this is what I am confused about as well. Maybe I made a mistake somewhere, but here is what I’m doing.

When I build the graph, I store the label for each node as source_label.

train_mask = np.zeros(overall_graph._g[0].number_of_nodes(ntype='source'))
test_mask = np.zeros(overall_graph._g[0].number_of_nodes(ntype='source'))
train_nids = []
test_nids = []
# sources_mapping_dict is built when the graph is created and maps each source to its node ID
# note: these stored IDs are one higher than the corresponding DGL node IDs, hence the -1 offsets below
for given_source_identifier_combo, given_source_id in sources_mapping_dict.items():
    given_source = given_source_identifier_combo.replace(overall_graph.source_name_identifier, '')

    # if it's a training source, update the mask
    if given_source in curr_data_split['train']:
        test_mask[given_source_id-1] = 0
        train_mask[given_source_id-1] = 1
        train_nids.append(given_source_id-1)
    elif given_source in curr_data_split['test']:
        # if it's a test source update the mask
        test_mask[given_source_id-1] = 1
        train_mask[given_source_id-1] = 0
        test_nids.append(given_source_id-1)

train_mask_tensor = torch.from_numpy(train_mask)
train_idx = torch.nonzero(train_mask_tensor).squeeze()
test_mask_tensor = torch.from_numpy(test_mask)
test_idx = torch.nonzero(test_mask_tensor).squeeze()


sampler = dgl.dataloading.MultiLayerFullNeighborSampler(args.n_layers)
# select the train/test sources based on the masks
train_dataloader = dgl.dataloading.NodeDataLoader(curr_g, {'source': train_idx}, sampler, batch_size=args.batch_size, shuffle=True, drop_last=False, num_workers=args.num_workers)
test_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(args.n_layers)
dataloader_test = dgl.dataloading.NodeDataLoader(curr_g, {'source': test_idx}, test_sampler, batch_size=args.batch_size, shuffle=True, drop_last=False, num_workers=args.num_workers)

loss_fcn = nn.CrossEntropyLoss()
loss_fcn = loss_fcn.to(torch.device('cuda'))

best_train_acc = 0.0
best_acc = 0.0  # best test accuracy, updated after each evaluation below

for epoch in range(args.n_epochs):
    model.train()
    optimizer.zero_grad()
    for iteration, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):

        blocks = [b.to(torch.device('cuda')) for b in blocks]
        # dstdata['source_label'] returns a dict keyed by node type
        output_labels = blocks[-1].dstdata['source_label']['source']
        # labels are stored 1-indexed; shift them to 0-indexed for CrossEntropyLoss
        output_labels = (output_labels - 1).long()
        output_labels = torch.squeeze(output_labels)

        node_features = get_features_given_blocks(curr_g, blocks, graph_style, adding_advice=adding_advice)

        output_predictions = model(blocks, node_features)['source']
        loss = loss_fcn(output_predictions, output_labels)
        loss.backward()
        clip_grad_norm_(model.parameters(), 0.25)
        optimizer.step()
        optimizer.zero_grad()

        acc = (torch.sum(output_predictions.argmax(dim=1) == output_labels.long()).item()) / len(output_predictions)

        if acc > best_train_acc:
            best_train_acc = acc

        if iteration % 10 == 0:
            print("Epoch {:05d} | Loss {:.4f} | Acc {:.4f}".format(epoch, loss.item(), acc))
            print("Best train accuracy: " + str(best_train_acc))

    # evaluate on the test set
    model.eval()
    test_acc, test_loss = do_evaluation(model, ...)
    print("Epoch " + str(epoch) + " Test Accuracy classification: " + str(test_acc) + " Loss: " + str(test_loss))
    if test_acc > best_acc:
        best_acc = test_acc
    print("Best accuracy classification: " + str(best_acc))
    print("")

The dataset has ~400 source nodes and 60K user nodes.

This was pre-done by the people who released the dataset.

Here is another learning curve:

Epoch 00002 | Loss 150.3613 | Acc 0.2981
Best train accuracy: 0.5718157181571816
Epoch 2 Test Accuracy classification: 0.32 Loss: 147.93052673339844
Best accuracy classification: 0.6

Epoch 00003 | Loss 134.7081 | Acc 0.3062
Best train accuracy: 0.5718157181571816
Epoch 3 Test Accuracy classification: 0.5733333333333334 Loss: 83.4209976196289
Best accuracy classification: 0.6

Epoch 00004 | Loss 60.8860 | Acc 0.4526
Best train accuracy: 0.5718157181571816
Epoch 4 Test Accuracy classification: 0.6 Loss: 132.33424377441406
Best accuracy classification: 0.6

Epoch 00005 | Loss 93.1154 | Acc 0.5908
Best train accuracy: 0.5907859078590786
Epoch 5 Test Accuracy classification: 0.48 Loss: 120.34642028808594
Best accuracy classification: 0.6

Epoch 00006 | Loss 80.5504 | Acc 0.4282
Best train accuracy: 0.5907859078590786
Epoch 6 Test Accuracy classification: 0.18666666666666668 Loss: 140.0550994873047
Best accuracy classification: 0.6

Epoch 00007 | Loss 102.2918 | Acc 0.1707
Best train accuracy: 0.5907859078590786
Epoch 7 Test Accuracy classification: 0.21333333333333335 Loss: 93.22665405273438
Best accuracy classification: 0.6

Epoch 00008 | Loss 67.4277 | Acc 0.2304
Best train accuracy: 0.5907859078590786
Epoch 8 Test Accuracy classification: 0.5866666666666667 Loss: 90.25814819335938
Best accuracy classification: 0.6

Epoch 00009 | Loss 73.3415 | Acc 0.5881
Best train accuracy: 0.5907859078590786
Epoch 9 Test Accuracy classification: 0.5866666666666667 Loss: 55.55716323852539
Best accuracy classification: 0.6

Epoch 00010 | Loss 50.5161 | Acc 0.5935
Best train accuracy: 0.5934959349593496
Epoch 10 Test Accuracy classification: 0.32 Loss: 94.17658233642578
Best accuracy classification: 0.6

Epoch 00011 | Loss 89.7341 | Acc 0.2927
Best train accuracy: 0.5934959349593496
Epoch 11 Test Accuracy classification: 0.29333333333333333 Loss: 138.47430419921875
Best accuracy classification: 0.6

...

Epoch 00081 | Loss 25.8510 | Acc 0.3767
Best train accuracy: 0.6395663956639567
Epoch 81 Test Accuracy classification: 0.6 Loss: 44.935489654541016
Best accuracy classification: 0.6266666666666667

Epoch 00082 | Loss 31.8864 | Acc 0.5935
Best train accuracy: 0.6395663956639567
Epoch 82 Test Accuracy classification: 0.6 Loss: 65.07081604003906
Best accuracy classification: 0.6266666666666667

Epoch 00083 | Loss 45.0529 | Acc 0.5908
Best train accuracy: 0.6395663956639567
Epoch 83 Test Accuracy classification: 0.24 Loss: 86.86858367919922
Best accuracy classification: 0.6266666666666667

Epoch 00084 | Loss 61.2752 | Acc 0.1978
Best train accuracy: 0.6395663956639567
Epoch 84 Test Accuracy classification: 0.21333333333333335 Loss: 89.41399383544922
Best accuracy classification: 0.6266666666666667

The code is rather long, but basically I create a 768-dimensional vector consisting of the BERT embedding of a user's bio, plus some numerical features representing follower count and following count, and some binary features representing things such as whether a user is verified.

Are you also updating the BERT?

The dataset has ~400 source nodes and 60K user nodes.

If you are performing node classification on source nodes, there might be too few data points for modeling.

Your latest learning curve appears more reasonable in terms of loss scale.

No, I just update the embeddings.

Yea, in fact if I add another node type to the graph - for example tweets - and there are about 100K of those, then I get learning curves worse than the first one I posted. Are there any tricks I can use to make this better? Even link prediction struggles on these larger graphs; could I maybe change the way I sample the negative examples there?

No, I just update the embeddings.

Have you tried disabling the update for the embeddings?
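(A minimal sketch of what that could look like, assuming the learnable bio embeddings live in an nn.Embedding module called user_embedding; the name is hypothetical:)

# freeze the embedding weights so the optimizer no longer updates them
for param in user_embedding.parameters():
    param.requires_grad = False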

Yea, in fact if I add another node type to the graph - for example tweets - and there are about 100K of those, then I get learning curves worse than the first one I posted. Are there any tricks I can use to make this better? Even link prediction struggles on these larger graphs; could I maybe change the way I sample the negative examples there?

  1. I guess this will not work very well for node classification anyway. 400 nodes is just not enough for a large heterogeneous graph. One possibility is to find a small subgraph that yields the best performance.
  2. For link prediction, you may try different loss functions and different ways of predicting links from the updated node representations (see the sketch below).
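(For illustration, here is a sketch of two standard link prediction losses over the positive/negative edge scores produced by a score predictor like the one later in this thread; these are common options, not a specific recommendation from the thread:)

import torch
import torch.nn.functional as F

def bce_link_loss(pos_score, neg_score):
    # binary cross-entropy: positive edges are labeled 1, negative edges 0
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones_like(pos_score), torch.zeros_like(neg_score)])
    return F.binary_cross_entropy_with_logits(scores, labels)

def margin_link_loss(pos_score, neg_score, margin=1.0):
    # margin ranking: push positive scores above negative ones by at least `margin`
    # (assumes pos_score and neg_score have matching shapes)
    return (margin - pos_score + neg_score).clamp(min=0).mean()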

@mufeili

Hmm, ok I will try some of those things. Can you also help me with getting the edge scores for all the nodes in the graph? EdgeDataLoader generates a positive and negative graph, so I’m not sure if that’s the right way to do it. I’m able to do inference using NodeDataLoader like below, but this gets me the embeddings and I can’t figure out how to get the edge scores between a pair of nodes from this:

for l, layer in enumerate(self.layers):
    y = {k: torch.zeros(curr_g.number_of_nodes(k), self.hid_feats if l != self.n_layers - 1 else self.out_feats) for k in curr_g.ntypes}

    sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
    dataloader = dgl.dataloading.NodeDataLoader(
        curr_g, {k: torch.arange(curr_g.number_of_nodes(k)) for k in curr_g.ntypes}, sampler, batch_size=batch_size, shuffle=True, drop_last=False, num_workers=self.num_workers)

    for input_nodes, output_nodes, blocks in tqdm(dataloader):
        block = blocks[0].to(torch.device('cuda'))

        h = {k: x[k][input_nodes[k].type(torch.LongTensor)].to(torch.device('cuda')) for k in input_nodes.keys()}
        
        h = layer(block, h)

        for k in h.keys():
            y[k][output_nodes[k].type(torch.LongTensor)] = h[k].cpu()

    # use this layer's outputs as the inputs to the next layer
    x = y

Are there any loss types you can recommend? Also, since my output nodes are not class-balanced, are there any tricks you can recommend for link prediction with unbalanced classes?

Given a positive graph pos_g and a negative graph neg_g from an iteration of EdgeDataLoader, the positive edges are simply pos_g.edges() and the negative edges are simply neg_g.edges(). What do you mean by “the edge scores for all the nodes in the graph?”?
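(As a sketch, a mini-batch from an EdgeDataLoader constructed with a negative sampler unpacks like this; edge_dataloader and the 'follows' edge type are hypothetical names:)

for input_nodes, pos_g, neg_g, blocks in edge_dataloader:
    # endpoints of the sampled positive (real) and negative (corrupted) edges
    pos_src, pos_dst = pos_g.edges(etype='follows')
    neg_src, neg_dst = neg_g.edges(etype='follows')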

Are there any loss types you can recommend? Also, since my output nodes are not class-balanced, are there any tricks you can recommend for link prediction with unbalanced classes?

What do you mean by not class-balanced? For link prediction, are you performing edge classification?

When doing link prediction, we compare the scores between nodes connected by an edge against the scores between an arbitrary pair of nodes. I just want to get the scores between all pairs of nodes in my graph that are connected at test time, as inference. So if node A connects to node B, I want the score between A and B. I could technically do this with EdgeDataLoader, but that generates a pos_g and a neg_g as you said, and in this case I only want the positive graph, or some other way to get the edge score between each pair of connected nodes.

No, I’m not doing edge classification. But even when I train link prediction, at the end of the day I make the prediction on the source nodes by embedding the entire graph and training a classifier. And the source nodes have multiple classes which are not balanced. So I was wondering if there were some adjustments I could make to handle this during the link prediction phase.

Why would I want to do this and how would I do it? You mean when doing node classification don’t change the feature representation for the nodes?

I just want to get the scores between all pairs of nodes in my graph that are connected at test time, as inference. So if node A connects to node B, I want the score between A and B. I could technically do this with EdgeDataLoader, but that generates a pos_g and a neg_g as you said, and in this case I only want the positive graph, or some other way to get the edge score between each pair of connected nodes.

Let g be the graph consisting of only training and validation edges. You need to:

  1. Compute the representations of all nodes using g.
  2. Get the source and destination nodes for all edges in the test set.
  3. Get the representations of the source and destination nodes for the test edges.
  4. You can then compute the scores for the test edges.

To compute the node representations, you can try following the example here.
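(A rough sketch of steps 2-4, assuming step 1 produced a dict h mapping each node type to its representation matrix, and test_g holds only the test edges; all names and the canonical edge type are illustrative:)

# endpoints of all test edges for one (hypothetical) relation
src, dst = test_g.edges(etype=('user', 'follows', 'source'))
src_h = h['user'][src]      # representations of the source endpoints
dst_h = h['source'][dst]    # representations of the destination endpoints
# dot-product score per test edge, matching the u_dot_v scoring used in training
scores = (src_h * dst_h).sum(dim=1)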

No, I’m not doing edge classification. But even when I train link prediction, at the end of the day I make the prediction on the source nodes by embedding the entire graph and training a classifier. And the source nodes have multiple classes which are not balanced. So I was wondering if there were some adjustments I could make to handle this during the link prediction phase.

I’m not sure about that.

Why would I want to do this and how would I do it? You mean when doing node classification don’t change the feature representation for the nodes?

Never mind. I guess I misunderstood that.

I believe I have done all of this except for the last step. Can you explain how to compute the scores?

How did you compute the loss for link prediction? Computing the scores here should be the same as that.

I pass the subgraph with the positive edges and the features into the ScorePredictor to get the positive score. But how should I get the positive graph? NodeDataLoader would only get me nodes, right? Should I use EdgeDataLoader again and just compute the scores for that? Would that give me the score for each edge just once, in a batched way?

Here is the Score Predictor:

class HeteroScorePredictor(nn.Module):
    def forward(self, edge_subgraph, x):
        with edge_subgraph.local_scope():
            edge_subgraph.ndata['h'] = x
            for etype in edge_subgraph.canonical_etypes:
                if edge_subgraph.num_edges(etype) <= 0:
                    continue
                # score each edge as the dot product of its endpoints' 'h' features
                edge_subgraph.apply_edges(dgl.function.u_dot_v('h', 'h', 'score'), etype=etype)
            return edge_subgraph.edata['score']

You can construct a giant graph consisting of all edges in the training and validation sets, probably just the graph you passed to NodeDataLoader and EdgeDataLoader. This giant graph is for computing the representations of all nodes. Once you have those, you can compute the scores for the edges you want to predict with another round of mini-batching over edges. For that you don't really need to call apply_edges: say h is the output representation for all nodes in the graph, and src and dst are the source and destination nodes of the edges you want to predict. You can simply compute the dot product of h[src] and h[dst].
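(A minimal sketch of that mini-batched scoring, assuming h is the representation matrix for one node type and src/dst are the endpoint IDs of the edges to score; names are illustrative:)

import torch

def score_edges(h, src, dst, batch_size=1024):
    scores = []
    for start in range(0, len(src), batch_size):
        s = src[start:start + batch_size]
        d = dst[start:start + batch_size]
        # dot product between endpoint representations, one score per edge
        scores.append((h[s] * h[d]).sum(dim=1))
    return torch.cat(scores)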