Hello!

Thank you for the great product.

When I print inside the reduce function:

```python
import torch

def gcn_reduce(nodes):
    # The argument is a batch of nodes.
    # This computes the new 'h' features by summing the received 'msg'
    # in each node's mailbox.
    print(nodes)
    print(nodes.batch_size())
    print(nodes.nodes())
    return {'h': torch.sum(nodes.mailbox['msg'], dim=1)}
```

I get the following output:

```
<dgl.udf.NodeBatch object at 0x7fd348f4f450>
1
tensor([1])
<dgl.udf.NodeBatch object at 0x7fd343b16c90>
1
tensor([2])
<dgl.udf.NodeBatch object at 0x7fd348f4f650>
1
tensor([0])
<dgl.udf.NodeBatch object at 0x7fd352143390>
3
tensor([23, 8, 13])
<dgl.udf.NodeBatch object at 0x7fd349038410>
11
tensor([26, 9, 12, 14, 15, 16, 17, 18, 20, 21, 22])
<dgl.udf.NodeBatch object at 0x7fd349038a10>
6
tensor([28, 25, 24, 10, 4, 19])
<dgl.udf.NodeBatch object at 0x7fd3521435d0>
6
tensor([30, 29, 27, 7, 6, 5])
<dgl.udf.NodeBatch object at 0x7fd349038450>
1
tensor([11])
<dgl.udf.NodeBatch object at 0x7fd349038890>
1
tensor([32])
<dgl.udf.NodeBatch object at 0x7fd348c581d0>
2
tensor([31, 3])
<dgl.udf.NodeBatch object at 0x7fd349038750>
1
tensor([33])
```

*How is the size of a NodeBatch calculated?*

According to the Internal Message Passing Implementation write-up by @zihao:

> Note that not all nodes have the same in-degrees, the parallel of all nodes’ reduce functions is not trivial. Currently DGL uses degree bucketing policy, which means auto-batching the reduce function of nodes with the same in-degrees, this is a good solution for uniform degree graph (subgraph produced by some GCN sampling algorithms) or trees (e.g. TreeLSTM) where degree variance is small.

Therefore, a batch is formed from nodes that share the same in-degree.

For example, nodes 1 and 5 each have 3 incoming edges, so DGL would group them into a single NodeBatch with `nodes.batch_size()` equal to 2.
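To check my understanding, here is a minimal sketch of how such degree bucketing could work, in plain Python with a toy edge list (this is my own illustration, not DGL's actual implementation):

```python
from collections import defaultdict

# Toy edge list of (src, dst) pairs; a hypothetical graph, not the one
# from the output above.
edges = [(0, 1), (2, 1), (4, 1), (0, 5), (2, 5), (3, 5), (6, 7)]

# Compute each destination node's in-degree.
in_degree = defaultdict(int)
for _, dst in edges:
    in_degree[dst] += 1

# Bucket nodes by in-degree: each bucket would become one NodeBatch,
# so the reduce function runs once per distinct in-degree.
buckets = defaultdict(list)
for node, deg in in_degree.items():
    buckets[deg].append(node)

for deg, batch in sorted(buckets.items()):
    print(f"in-degree {deg}: batch_size {len(batch)} -> nodes {batch}")
```

With this toy graph, nodes 1 and 5 both have in-degree 3 and land in one bucket of size 2, which matches my reading of the degree bucketing policy.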

Is this statement correct?

Do you have an article or benchmark showing that this approach is efficient?

Andrei