Reddit dataset versions


I have been experimenting with the provided Reddit dataset. However, looking at other papers such as Cluster-GCN, I have seen that they report the number of edges in the Reddit dataset as 11M, not 110M. Why is there a difference? How can I use the 11M version so that I can compare my accuracy numbers with those in other papers?

The Reddit dataset we are using is the same one used in the GraphSAGE and SGC papers; it has 114M edges.

11M looks like a typo to me. I'll take a look at their paper and code to verify that.

Most other papers report the edge count of the Reddit dataset they use as either 11M or 23M. You can take a look at GraphSAINT or VR-GCN too.

SGC also used the 11.6M version. Is there any way to get the 11.6M dataset?

Sorry about the confusion. There are actually two Reddit datasets.

You can download their 11.6M dataset from the link provided in the GitHub issue "Training Accuracy much lower than Validation Accuracy" (Issue #9, matenure/FastGCN).
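Once you have a copy, a quick way to check which version you loaded is to count its edges. Below is a minimal sketch, assuming the edges are available as (src, dst) pairs; the helper name `edge_counts` and the toy edge list are hypothetical stand-ins, not the Reddit data or any library's API. It reports the count both ways, since an undirected edge may be counted once or twice depending on the convention. Whether that accounts for any of the differences between reported numbers is an assumption, not something confirmed in this thread.

```python
def edge_counts(edges):
    """Return (directed, undirected) edge counts for an iterable of (src, dst) pairs.

    directed:   number of distinct ordered pairs (u, v)
    undirected: number of distinct unordered pairs, so (u, v) and (v, u) count once
    """
    directed = set()
    undirected = set()
    for u, v in edges:
        directed.add((u, v))
        undirected.add((min(u, v), max(u, v)))
    return len(directed), len(undirected)


# Toy stand-in graph (NOT the Reddit data): a triangle with every
# undirected edge stored in both directions.
toy_edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]

directed, undirected = edge_counts(toy_edges)
print(f"directed: {directed}, undirected: {undirected}")
```

Comparing both numbers against the edge count stated in the paper you want to match is a simple sanity check before comparing accuracy.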