Reddit dataset versions


I have been experimenting with the Reddit dataset provided in However, looking at other papers such as Cluster-GCN, I see that they report the number of edges in the Reddit dataset as 11M, not 110M. Why is there a difference? How can I use the 11M version so that I can compare my accuracy numbers with those reported in other papers?

The Reddit dataset we are using is the same one used in the GraphSAGE and SGC papers; it has 114M edges.

11M looks more like a typo to me. I’ll take a look at their paper and code to verify that.

Most other papers report the edge count of the Reddit dataset they use as either 11M or 23M. You can take a look at GraphSAINT or VR-GCN too.
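
One way to check whether two reported edge counts describe different data or just different counting conventions is to count the same edge list several ways. The snippet below is only a minimal sketch in plain Python. The function name `edge_count_summary` and the toy edge list are made up for illustration and do not come from any of the libraries or papers mentioned in this thread.

```python
def edge_count_summary(edges):
    """Count the edges of an edge list under several common conventions.

    `edges` is a list of (source, destination) node-id pairs.
    This helper is hypothetical, written only for illustration.
    """
    directed = set()    # unique (u, v) pairs, direction kept
    undirected = set()  # unique {u, v} pairs, direction ignored
    self_loops = set()  # nodes that have an edge to themselves
    for u, v in edges:
        if u == v:
            self_loops.add(u)
            continue
        directed.add((u, v))
        undirected.add((min(u, v), max(u, v)))
    return {
        "raw_entries": len(edges),
        "unique_directed": len(directed),
        "unique_undirected": len(undirected),
        "self_loops": len(self_loops),
    }


# Toy graph: edge 0-1 stored in both directions, edge 1->2 stored twice,
# and one self-loop on node 3.
toy_edges = [(0, 1), (1, 0), (1, 2), (1, 2), (3, 3)]
summary = edge_count_summary(toy_edges)
print(summary)
```

Storing every undirected edge in both directions doubles the count, so that convention alone would not account for a tenfold gap such as 114M versus 11M. Running a check like this on the edge list each paper actually released would show whether the underlying graphs differ.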