Reddit dataset versions

mfbalin · February 16, 2020, 11:17pm

Hello,

I have been experimenting with the Reddit dataset provided in dgl.data. However, other papers such as Cluster-GCN report the number of edges in the Reddit dataset as 11M, not 110M. Why the difference? How can I use the 11M version so that I can compare my accuracy numbers with those in other papers?

zihao · February 17, 2020, 8:07am

The Reddit dataset we are using is the same one used in the GraphSAGE and SGC papers; it has 114M edges.

11M looks more like a typo to me. I’ll take a look at their paper and code to verify that.

mfbalin · February 17, 2020, 2:00pm

Most other papers report the edge count of the Reddit dataset they use as either 11M or 23M. You can take a look at GraphSAINT or VR-GCN too.

xixi-baba · January 13, 2021, 4:47pm

SGC also used the 11.6M version. Is there any way to get the 11.6M dataset?

zihao · January 14, 2021, 4:06am

Sorry about the confusion; there are actually two Reddit datasets.

You can download their 11.6M dataset from the link provided in the GitHub issue "Training Accuracy much lower than Validation Accuracy" (Issue #9, matenure/FastGCN).
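
When checking which version of the dataset you have loaded, it can help to count edges under more than one convention, since a paper may report stored directed entries or unique undirected pairs. The sketch below is only an illustration in plain Python: `edge_counts` is a hypothetical helper, not part of DGL, and the toy edge list stands in for the real graph. That the counting convention accounts for some of the reported gap (for example 11M versus 23M) is an assumption, not something confirmed in this thread. It would not explain the 114M figure, which belongs to a different dataset.

```python
def edge_counts(src, dst):
    """Summarise an edge list under several counting conventions.

    src, dst: equal-length sequences of integer node IDs, one entry per
    stored (directed) edge. Hypothetical helper for illustration only.
    """
    directed = set(zip(src, dst))  # duplicate entries collapse here
    self_loops = sum(1 for u, v in directed if u == v)
    # Treat (u, v) and (v, u) as the same edge; self-loops are excluded.
    undirected = {(min(u, v), max(u, v)) for u, v in directed if u != v}
    return {
        "stored_entries": len(src),
        "unique_directed": len(directed),
        "unique_undirected": len(undirected),
        "self_loops": self_loops,
    }


# Toy graph: edges 0-1 and 1-2 stored in both directions, plus a self-loop on 2.
src = [0, 1, 1, 2, 2]
dst = [1, 0, 2, 1, 2]
counts = edge_counts(src, dst)
print(counts)
```

For a graph loaded through DGL, the source and destination IDs can be obtained from `g.edges()` and converted to Python lists or NumPy arrays first. At Reddit scale a vectorised version would be a better fit than Python sets, which use a lot of memory for tens of millions of edges.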