Exporting embeddings for current PinSAGE implementation

Hi, I am referring to the following implementation of PinSAGE: https://github.com/dmlc/dgl/tree/master/examples/pytorch/pinsage

I had a few questions regarding it -

  1. Is it using the user features? If yes, where? I can see item features being used, but can’t understand where the user features are used.
  2. The README in the repo says that it learns an embedding for each node instead of computing it as a function of node features, but I can’t understand how this is done, or how to export the node embeddings.

Wouldn’t something simple like :

with torch.no_grad():
    h_item_batches = []
    for blocks in dataloader_test:
        # move every sampled block to the target device
        blocks = [block.to(device) for block in blocks]
        h_item_batches.append(model.get_repr(blocks))
    h_item = torch.cat(h_item_batches, 0)
    torch.save(h_item, "embeddings.pth")

work to save the embeddings even if the graph g is different from what we trained on?

Please let me know if I didn’t explain some parts of the questions clearly, or if more details are needed.

Also, are there any tips on improving the training speed? Right now the dataloader is very slow and the GPU is mostly sitting idle: one dataloader step takes more than 100 s, while the GPU finishes a step in under 1 s. I have increased the number of workers to 15, but even that doesn’t help with such long sampling times.

Is it using the user features? If yes, where? I can see item features being used, but can’t understand where the user features are used.

PinSAGE is not using user features. It only learns representations of items, and the training loss also only compares whether two items are relevant.

The README in the repo says that it learns an embedding for each node instead of computing it as a function of node features, but I can’t understand how this is done, or how to export the node embeddings.

Essentially, each item has its own learnable vector as a parameter. This is reflected by the fact that the item ID itself is used as a categorical feature in https://github.com/dmlc/dgl/blob/master/examples/pytorch/pinsage/model.py#L48-L49 (or by the item embedding matrix created in https://github.com/dmlc/dgl/blob/master/examples/pytorch/pinsage/model_sparse.py#L88 if you are looking for sparse embedding updates).

That being said, your demo code is correct for saving the embeddings, because the output of the PinSAGE model is indeed the representation we later use for nearest-neighbor lookup etc. However, the PinSAGE model provided here is a transductive model, meaning it cannot be migrated to a completely different graph. This is because the item embedding matrix is a learnable parameter. The reason for having it is that assigning a learnable embedding to each item performs better than using the item features alone on the public datasets we tested (MovieLens & Nowplaying-RS).
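As a minimal sketch (hypothetical sizes, not the repo’s exact code), the ID-as-feature trick boils down to a learnable embedding table indexed by item ID, which is exactly what makes the model transductive:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
NUM_ITEMS, HIDDEN = 1000, 16

# One learnable vector per item ID, trained jointly with the rest of the model.
id_embedding = nn.Embedding(NUM_ITEMS, HIDDEN)

item_ids = torch.tensor([0, 5, 999])
h = id_embedding(item_ids)
print(h.shape)  # torch.Size([3, 16])
# An item that never appeared during training has no learned row,
# so the table cannot be reused on a different graph as-is.
```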

Also, are there any tips on improving the training speed? Because right now the dataloader is very slow, and GPU is mostly sitting idle.

This is a good question. The computation of the PinSAGE model is so light that sampling takes over as the major bottleneck (and the sampling algorithm is indeed quite complex as well). The same is observed in most GNN models trained with neighbor sampling.

However, 100 seconds for one training step sounds strange to me. What are the arguments to your PinSAGE sampler? And what is the size (number of users/items/interactions) of your graph?

Thanks.

Hi, thanks for the detailed reply.
My graph has around 150M edges and 1M nodes.

I had run it for
layers=2, hidden_dims=256, num_neighbours=3, num_random_walks=10, random_walk_length=2, batch_size=500000.

But when using a smaller batch size like 5000, I observed a better overall estimated training time.
I am using num_workers = (number of cores on my machine) - 1.
Is there any other way to reduce the time further? Currently it is quite large and prevents training for any significant number of epochs.

Also, on timing various parts of sampling, I found that within sample_blocks in sampler.py, the part where we get the edge IDs
(i.e. eids = frontier.edge_ids(torch.cat([heads, heads]), torch.cat([tails, neg_tails]), return_uv=True)[2])
may be one of the reasons for the high sampling time. Am I right?
Can we do this part in an alternative, faster way? Or can we just drop this edge-removal part?
Do you have any other suggestions for improving the training time?

Also, about the inductive part: for now I am talking about model.py. If I just drop the node ID as a feature, then it should work for getting embeddings on other graphs and new nodes, right?

Also, on timing various parts of sampling, I found that within sample_blocks in sampler.py, the part where we get the edge IDs
(i.e. eids = frontier.edge_ids(torch.cat([heads, heads]), torch.cat([tails, neg_tails]), return_uv=True)[2])
may be one of the reasons for the high sampling time. Am I right?

Correct. We are currently optimizing the implementation of edge_ids. For now, you could try removing this operation. However, you may then need to bump up the regularization a bit, because not removing the edges can cause data leakage: since we are predicting whether the co-interaction exists, we should not include it in message passing. That being said, this was only observed in GCMC training, and I never quantitatively measured the impact of removing the edges in PinSAGE.
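For intuition, the removal step just filters the minibatch’s own (head, tail) pairs out of the sampled frontier. A library-free sketch of the idea (hypothetical names, not the actual DGL API):

```python
def remove_training_edges(frontier_edges, minibatch_pairs):
    """Drop the minibatch's own (head, tail) pairs from the sampled
    neighborhood, so the interaction being predicted never participates
    in message passing (the data-leakage concern above)."""
    banned = set(minibatch_pairs)
    return [edge for edge in frontier_edges if edge not in banned]

frontier = [(0, 1), (0, 2), (1, 2), (2, 3)]
batch_pairs = [(0, 1), (2, 3)]  # positive pairs in the current minibatch
print(remove_training_edges(frontier, batch_pairs))  # [(0, 2), (1, 2)]
```

Skipping this filter is what leaks the label into the computation graph; dropping the expensive edge_ids lookup means accepting that leakage.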

Also, about the inductive part: for now I am talking about model.py. If I just drop the node ID as a feature, then it should work for getting embeddings on other graphs and new nodes, right?

If you were talking about training on one graph, saving the embeddings, and using those embeddings on another graph, then this idea does not necessarily work. For one, the two graphs may have different sets of users/items/features, whose parameters were never learned at all.

For the inductive part, I actually meant training on one graph and saving the model (so we are also saving its linear projection layers). Then, for the new graph, use the code fragment in the first post to get the embeddings.

But if my set of features remains the same, then it would, right? That is, if every item in both graphs has the same number of features with the same dimensions, and for each categorical feature I know the domain (all categories that can possibly appear) and one-hot encode accordingly, and I am not using any text features, then what could go wrong?
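A fixed, known category domain is indeed the key point here: if the one-hot vector length is determined by the agreed-upon domain rather than by whatever categories happen to appear in one graph, the feature space stays identical across graphs. A minimal sketch (hypothetical genre domain):

```python
def one_hot(value, domain):
    """Fixed-domain one-hot encoding: the vector length depends only on
    the known domain, so it is the same for every graph that uses it."""
    vec = [0] * len(domain)
    vec[domain.index(value)] = 1
    return vec

GENRES = ["action", "comedy", "drama"]  # hypothetical, fixed up front
print(one_hot("comedy", GENRES))  # [0, 1, 0]
```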

And about edge removal: yes, we should ideally not include those edges in message passing, but I will try removing that step and check the performance.

Edit: In layers.py, lines 23 and 24,

    if column == dgl.NID:
        continue

are we skipping the IDs? What are we doing here? Also consider line 95. It looks like we are skipping the node-ID feature.

After adding print statements, I found that the feature being skipped was not “id” but _ID, so we are not skipping the “id” feature.
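For reference, dgl.NID is the reserved key "_ID" under which DGL stores the original node IDs of a sampled block, so the skip in layers.py only filters that bookkeeping entry, not a user feature named "id". A library-free sketch of the assumed behavior:

```python
NID = "_ID"  # the value of dgl.NID in DGL

def copy_user_features(ndata):
    """Copy every user-provided feature, skipping DGL's internal _ID
    bookkeeping entry (what the `if column == dgl.NID: continue` check
    in layers.py does)."""
    return {key: val for key, val in ndata.items() if key != NID}

feats = copy_user_features({"_ID": [7, 8], "id": [0, 1], "genre": [2, 3]})
print(sorted(feats))  # ['genre', 'id'] -- "id" survives, only "_ID" is dropped
```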

If I don’t add the “id” feature (i.e. remove lines 48-49 here: https://github.com/dmlc/dgl/blob/master/examples/pytorch/pinsage/model.py#L48-L49), then it could be used as an inductive model, right?

But if my set of features remain same, then it would, right?

Yes. If you are only testing on a graph of the same nature (same features, same generative model, etc.), then it should work fine.

If I don’t add the “id” feature (i.e. remove lines 48-49 here: https://github.com/dmlc/dgl/blob/master/examples/pytorch/pinsage/model.py#L48-L49), then it could be used as an inductive model, right?

Yes.