Inconsistent Dataset Statistics

ycjing · June 9, 2020, 10:57am

Hi,

Thank you for developing such a nice framework. When I ran the GAT demo code at https://github.com/dmlc/dgl/tree/master/examples/pytorch/gat, I found that the dataset statistics differ from those reported in the literature. For example, in [1], the reported number of edges for Citeseer is 4732. However, the output of DGL is:

Finished data loading and preprocessing.
  NumNodes: 3327
  NumEdges: 9228
  NumFeats: 3703
  NumClasses: 6
  NumTrainingSamples: 120
  NumValidationSamples: 500
  NumTestSamples: 1000

I would greatly appreciate any help. Thank you.

[1] Yang et al. Revisiting Semi-Supervised Learning with Graph Embeddings.

mufeili · June 13, 2020, 1:31pm

If you check the official GCN code, you will find that there are 9228 nonzero entries in the adjacency matrix for Citeseer. So I guess the dataset statistics reported in the literature are not correct, and later work simply followed the same practice.
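
To illustrate how the counting convention alone can change the reported number, here is a minimal sketch on a toy graph (not Citeseer itself). The function name `count_edges` and the toy edge list are hypothetical and are not part of DGL or the GCN code. It assumes the adjacency matrix is symmetric, so each undirected edge contributes two nonzero entries, while duplicate pairs and self-loops in a raw edge list are counted differently depending on the convention.

```python
def count_edges(raw_edges):
    """Count the edges of a toy graph under three conventions.

    Returns (raw_count, undirected_count, nonzero_count):
      raw_count        - entries in the raw edge list, duplicates included
      undirected_count - unique undirected edges (a self-loop counts once)
      nonzero_count    - nonzero entries of the symmetric adjacency matrix
    """
    raw_count = len(raw_edges)

    # Unique undirected edges: (u, v) and (v, u) collapse into one set.
    undirected = {frozenset((u, v)) for u, v in raw_edges}

    # Symmetric adjacency matrix stored as a set of (row, col) positions.
    nonzero = set()
    for u, v in raw_edges:
        nonzero.add((u, v))
        nonzero.add((v, u))

    return raw_count, len(undirected), len(nonzero)


# Hypothetical toy data: one duplicated pair (0, 1)/(1, 0) and one self-loop (3, 3).
toy_edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 3)]
raw, undirected, nonzero = count_edges(toy_edges)
print("raw list entries:", raw)
print("unique undirected edges:", undirected)
print("nonzero adjacency entries:", nonzero)
```

The same graph yields three different "edge counts" here. This does not by itself reproduce the exact Citeseer figures (9228 is not simply twice 4732), so treat it only as an illustration of why the numbers need not agree.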

ycjing · June 13, 2020, 1:44pm

Hi @mufeili

Thank you for your very kind and helpful response! I really appreciate it.

Best,
Yongcheng

mufeili · June 13, 2020, 1:55pm

No worries. You are welcome.