How exactly does DGL split the datasets when using train_mask, valid_mask and test_mask?

I am using the Cora, Citeseer and PubMed datasets for a knowledge graph project with a Graph Attention Network model.
The DGL library provides the train, validation and test masks in the ndata dictionary of these datasets.
I printed the sizes of the train, validation and test masks, and the train mask turned out to be smaller than both the validation and test masks.
For example, the original sizes for Cora were (a sketch of how I checked them follows this list):
"train_samples": 140,
"valid_samples": 500,
"test_samples": 1000,
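Here is a minimal sketch of how I checked the sizes, assuming a recent DGL where CoraGraphDataset stores the split as boolean node masks under the keys train_mask, val_mask and test_mask:

```python
from dgl.data import CoraGraphDataset

# Load Cora; the first (and only) graph carries the split masks in ndata.
g = CoraGraphDataset()[0]

# Each mask is a boolean vector over all nodes; its sum is the split size.
for key in ("train_mask", "val_mask", "test_mask"):
    print(key, int(g.ndata[key].sum()))
# Prints: train_mask 140, val_mask 500, test_mask 1000
```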

After swapping the training, validation and test masks (a sketch of the swap follows the list):
"train_samples": 1000,
"valid_samples": 140,
"test_samples": 500,
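The swap itself was roughly the following (same assumptions about the mask keys as above): rotate the masks so that the old test mask becomes the training mask, the old train mask becomes the validation mask, and the old validation mask becomes the test mask.

```python
from dgl.data import CoraGraphDataset

g = CoraGraphDataset()[0]

# Rotate the masks: old test -> train, old train -> validation,
# old validation -> test. The right-hand tuple is built before any
# assignment happens, so the original tensors are not clobbered.
g.ndata["train_mask"], g.ndata["val_mask"], g.ndata["test_mask"] = (
    g.ndata["test_mask"],
    g.ndata["train_mask"],
    g.ndata["val_mask"],
)

for key in ("train_mask", "val_mask", "test_mask"):
    print(key, int(g.ndata[key].sum()))
# Now: train_mask 1000, val_mask 140, test_mask 500
```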

After the swap, the model's accuracy increased noticeably (by roughly 4-9%, depending on the dataset).

In deep learning and data science, the training set is generally larger than the validation and test sets.

So how does DGL split these datasets, and what is the purpose behind splitting them in this unusual manner?

DGL does not split the dataset; the split comes from the dataset's authors, I think. Refer to Cora Dataset | Papers With Code.

Yeah, but that doesn't quite explain it. The totals do not add up to 2708 for Cora, so many nodes are left unused.

DGL just follows this paper/repo: https://github.com/tkipf/gcn/tree/master/gcn/data. Please find more details there. One possible reason is that the dataset is meant for semi-supervised learning, i.e. training a model that relies on only a small portion of labeled data.
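As a quick sanity check (a sketch under the same mask-key assumptions as above), you can count the nodes that fall outside all three masks; with the Planetoid split of Cora that is 2708 - (140 + 500 + 1000) = 1068 nodes:

```python
from dgl.data import CoraGraphDataset

g = CoraGraphDataset()[0]

# Union of the three split masks versus the whole graph. Nodes outside
# every mask still take part in message passing; they just contribute
# no supervision signal and are never evaluated.
covered = g.ndata["train_mask"] | g.ndata["val_mask"] | g.ndata["test_mask"]
print("total nodes:  ", g.num_nodes())          # 2708
print("in some split:", int(covered.sum()))     # 1640
print("in no split:  ", int((~covered).sum()))  # 1068
```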
