Hi,

I am building a recommender system with DGL, using a link-prediction approach.

To train the model, I use negative sampling. The model needs to predict that a positive pair of nodes has a higher cosine similarity than a negative pair of nodes.
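To make the training objective concrete, here is a minimal sketch of the kind of pairwise max-margin loss I have in mind (the `margin` value and function name are my own, for illustration):

```
import torch

def margin_loss(pos_score, neg_score, margin=1.0):
    # Positive pairs should score at least `margin` higher than negative pairs;
    # anything already satisfying the margin contributes zero loss.
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()
```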

To compute this cosine similarity, I implemented a custom function:

```
import torch.nn as nn

def udf_u_cos_v(edges):
    cos = nn.CosineSimilarity(dim=1, eps=1e-6)
    return {'cos': cos(edges.src['h'], edges.dst['h'])}

class CosinePrediction(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, graph, h):
        """
        graph : graph with edges connecting pairs of nodes
        h : hidden state of every node
        """
        with graph.local_scope():
            graph.ndata['h'] = h
            graph.apply_edges(udf_u_cos_v)
            ratings = graph.edata['cos']
        return ratings
```

However, this implementation seems to consume a lot of memory.

In a DGL discussion post, it was explained that “[when] using a custom message function and a builtin-reduce function, […] DGL will use degree bucketing to parallize the computation, which is not most efficient in many cases.”

Would you have any recommendation on the most efficient way to compute cosine similarity between pairs of nodes in a graph?

Thanks in advance!