Counting or groupby readout


I have an integer-valued node feature, where the integer represents a category. I’d like to find, for each graph within a batched graph, the most popular category (aka the mode of the integer feature). Does anyone know how to do this?



You can convert the integer label into a one-hot label and use dgl.sum_nodes (docs) to sum the one-hot labels for each graph, which gives you per-category counts. Then take the argmax over each row of the sum to get the desired result (a plain max would return the count, not the category).
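A minimal sketch of this idea in plain PyTorch, in case it helps. The tensors `labels` and `graph_ids` and the sizes are made up for illustration, and `index_add_` stands in for the per-graph sum so the example runs without DGL. On an actual batched graph you would presumably store the one-hot tensor as a node feature and call dgl.sum_nodes on it instead.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy data: 7 nodes spread over 2 graphs, 4 possible categories.
labels = torch.tensor([0, 2, 2, 1, 1, 1, 3])      # integer category per node
graph_ids = torch.tensor([0, 0, 0, 1, 1, 1, 1])   # which graph each node belongs to
num_graphs = 2
num_categories = 4

# One-hot encode the labels: shape (num_nodes, num_categories).
one_hot = F.one_hot(labels, num_classes=num_categories)

# Sum the one-hot rows per graph. This plays the role of the per-graph
# sum readout; the result has shape (num_graphs, num_categories).
counts = torch.zeros(num_graphs, num_categories, dtype=one_hot.dtype)
counts.index_add_(0, graph_ids, one_hot)

# argmax over each row gives the most frequent category per graph.
modes = counts.argmax(dim=1)
```

Note that the one-hot tensor has num_nodes x num_categories entries, so this only makes sense when the number of categories is small.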



Thanks for your reply. This is a good idea, but unfortunately it will not work for me: in my case the number of labels grows linearly with the number of nodes, so the one-hot matrix would need memory quadratic in the number of nodes.

For now I am just using pandas with groupby and hoping it will be quick enough. Would love to see dgl support for mode, though perhaps this is not a common use case.



I don’t know the exact answer, but I would think about this question in two steps:

  1. Given one int tensor of categories, how do you do groupby/count in your case? (This looks like the trickiest part.)
  2. Once step 1 is solved, you could at least unbatch the graph, apply it to each graph in a for loop, and then batch the results again.
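The two steps above could be sketched roughly as follows, using only the Python standard library. The function name `mode_per_graph` and the inputs are hypothetical. The sketch assumes you have already pulled the node labels out as a flat list and know how many nodes each graph has (in DGL this would presumably come from something like the batched graph's batch_num_nodes), and that every graph has at least one node.

```python
from collections import Counter

def mode_per_graph(labels, nodes_per_graph):
    """Return the most frequent label for each graph in a batch.

    labels: flat list of integer categories, one per node, with the
            nodes of each graph stored contiguously.
    nodes_per_graph: number of nodes in each graph, in batch order.
    """
    modes = []
    start = 0
    for n in nodes_per_graph:
        # Step 1: groupby/count on the labels of a single graph.
        counts = Counter(labels[start:start + n])
        # Pick the label with the highest count; ties go to the
        # smallest label so the result is deterministic.
        best = min(counts, key=lambda c: (-counts[c], c))
        modes.append(best)
        start += n
    # Step 2: the per-graph results, collected back in batch order.
    return modes

# Hypothetical toy batch: three graphs with 3, 4 and 2 nodes.
labels = [3, 3, 7, 1, 1, 1, 9, 4, 4]
nodes_per_graph = [3, 4, 2]
result = mode_per_graph(labels, nodes_per_graph)
```

Memory here is linear in the number of nodes, since each Counter only stores the labels that actually occur in its graph. The cost is a Python-level loop over graphs, which may be slow for very large batches.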