Concrete PinSage deviations from paper

Hello,
As mentioned in the README of the PinSage implementation, there are several deviations between the pseudocode of the original paper and the implementation. Could you confirm whether the differences I list below are correct?
For simplicity, I am posting the message passing implementation code here (so that I can refer to lines):

1  def forward(self, g, h, weights):
2      h_src, h_dst = h
3      with g.local_scope():
4          g.srcdata['n'] = self.act(self.Q(self.dropout(h_src)))
5          g.edata['w'] = weights.float()
6          g.update_all(fn.u_mul_e('n', 'w', 'm'), fn.sum('m', 'n'))
7          g.update_all(fn.copy_e('w', 'm'), fn.sum('m', 'ws'))
8          n = g.dstdata['n']
9          ws = g.dstdata['ws'].unsqueeze(1).clamp(min=1)
10         z = self.act(self.W(self.dropout(torch.cat([n / ws, h_dst], 1))))
11         z_norm = z.norm(2, 1, keepdim=True)
12         z_norm = torch.where(z_norm == 0, torch.tensor(1.).to(z_norm), z_norm)
13         z = z / z_norm
14         return z
  • Dropout: the original paper does not mention dropout, while the implementation applies it (lines 4 and 10).

  • Linear projection: the original paper appears to use pooling rather than a linear projection to obtain the hidden representations of neighbourhood nodes.

  • Normalization by edge sums: in the implementation, line 10 normalizes the neighbourhood representation by the sum of the incoming edge weights; this does not seem to be mentioned in the original paper.

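To make sure I am reading lines 6–10 correctly, here is my understanding of the aggregation written in plain PyTorch, without DGL (a sketch only; the function name and argument layout are my own, not from the implementation):

```python
import torch

def weighted_neighbor_mean(n_src, weights, dst_of_edge, num_dst):
    # n_src: (E, d) projected source features, one row per edge
    #        (n_src[e] is the 'n' feature of edge e's source node)
    # weights: (E,) edge weights
    # dst_of_edge: (E,) destination node id of each edge
    d = n_src.size(1)
    n = torch.zeros(num_dst, d)
    ws = torch.zeros(num_dst)
    # fn.u_mul_e('n', 'w', 'm') + fn.sum('m', 'n'): weighted sum per dst node
    n.index_add_(0, dst_of_edge, n_src * weights.unsqueeze(1))
    # fn.copy_e('w', 'm') + fn.sum('m', 'ws'): sum of edge weights per dst node
    ws.index_add_(0, dst_of_edge, weights)
    # clamp(min=1) avoids division by zero for nodes with no incoming edges
    ws = ws.unsqueeze(1).clamp(min=1)
    return n / ws
```

So, as far as I can tell, the result is a weighted mean of the projected neighbour features.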
Please let me know if I have missed something here or if I am somehow incorrect in my understanding.

Thank you in advance!

Dropout: yes, this is our addition.

Linear projection: I assume by “pooling” you mean averaging the neighboring representations? This is done with update_all.

Normalization by edge sums: I would say this is specific to weighted graphs. If the graph is unweighted, the weights will all be 1 and the result will be identical to averaging.
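A toy check of that last point, with the weights set to 1 (my own example, not from the implementation):

```python
import torch

# With all edge weights equal to 1, the weighted aggregation
# (sum of w * n) / (sum of w) reduces to a plain mean over neighbors.
neigh = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])  # features of 3 neighbors
w = torch.ones(3)
weighted = (neigh * w.unsqueeze(1)).sum(0) / w.sum().clamp(min=1)
assert torch.allclose(weighted, neigh.mean(0))
```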

Thank you for your response! I have a few follow-ups 🙂

I assume by “pooling” you mean averaging the neighboring representations? This is done with update_all.

Not exactly. In the pseudocode of the original paper (see image below), Line 1 applies \gamma, which, as I understood them, the authors describe as a pooling layer. So it seems that once I aggregate the neighbours, I am expected to apply some sort of pooling layer to the output. This suggests to me that Line 4 should look something like
g.srcdata['n'] = self.pool(self.act(self.Q(self.dropout(h_src))))

Another question I had was related to the overall structure of storing the raw features and “lazily” projecting them onto node representations whenever the corresponding node is called. From my general understanding, if I wish to make this model work in an inductive setting, I would have to store the node representations on the nodes themselves (say, for example, in data['emb']). Then, after each forward pass of the model I would have to replace the old node representations with new ones via blocks[-1].dstdata['emb'] = z. Is this correct? And, if so, I have a very naive follow-up question: can I store tensors with gradients in the graph, or do I need to somehow detach them before I do so?
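Concretely, I have something like the following in mind (a sketch only; store_embeddings and the 'emb' key are my own names, and the dict stands in for blocks[-1].dstdata):

```python
import torch

def store_embeddings(block_dstdata, z):
    # Hypothetical pattern: write the updated representations back onto the
    # graph after a forward pass. Detaching prevents the stored tensors from
    # keeping the whole autograd graph alive across batches, assuming no
    # gradients need to flow through the stored copy.
    block_dstdata['emb'] = z.detach()

z = torch.randn(4, 8, requires_grad=True) * 2  # pretend model output
store = {}  # stand-in for blocks[-1].dstdata
store_embeddings(store, z)
assert store['emb'].requires_grad is False
```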

In Section 3.2 they use “importance pooling” as their “pooling layer”, which is essentially a weighted sum. That is done by update_all.

Yes.