I am using the Cora, Citeseer, and PubMed datasets for my knowledge-graph project with a Graph Attention Network (GAT) model.
The DGL library provides the train, validation, and test masks in the `ndata` dictionary of these datasets.
I printed the sizes of the train, validation, and test masks and noticed that the train mask is smaller than both the validation and test masks.
For example, for Cora the original sizes were:
"train_samples": 140,
"valid_samples": 500,
"test_samples": 1000,
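For reference, this is roughly how I checked the mask sizes. The masks here are stand-in boolean lists with Cora's node count; in actual DGL code they are boolean tensors accessed via `g.ndata["train_mask"]`, `g.ndata["val_mask"]`, and `g.ndata["test_mask"]` (the exact index ranges below are illustrative, not DGL's real node ordering):

```python
# Stand-in for the masks DGL stores on the graph's nodes.
# Cora has 2708 nodes; the planted split uses 140 / 500 / 1000 of them.
num_nodes = 2708
train_mask = [i < 140 for i in range(num_nodes)]         # first 140 nodes
val_mask = [140 <= i < 640 for i in range(num_nodes)]    # next 500 nodes
test_mask = [i >= 1708 for i in range(num_nodes)]        # last 1000 nodes

# Counting True entries gives the split sizes I reported above.
print("train_samples:", sum(train_mask))  # 140
print("valid_samples:", sum(val_mask))    # 500
print("test_samples:", sum(test_mask))    # 1000
```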
After swapping the train, validation, and test masks, the sizes became:
"train_samples": 1000,
"valid_samples": 140,
"test_samples": 500,
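The swap itself was just a reassignment of the three masks, sketched below with the same stand-in boolean lists (in real DGL code these would be the tensors in `g.ndata`):

```python
# Stand-in masks matching Cora's original 140 / 500 / 1000 split sizes.
num_nodes = 2708
train_mask = [i < 140 for i in range(num_nodes)]
val_mask = [140 <= i < 640 for i in range(num_nodes)]
test_mask = [i >= 1708 for i in range(num_nodes)]

# Swap: train on the 1000-node mask, validate on 140, test on 500.
train_mask, val_mask, test_mask = test_mask, train_mask, val_mask

print(sum(train_mask), sum(val_mask), sum(test_mask))  # 1000 140 500
```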
After the swap, the model's accuracy increased sharply (by roughly 4-9%, depending on the dataset).
In deep learning and data science, the training set is generally larger than the validation and test sets.
So how does DGL split these datasets, and what is the purpose of splitting them in this unusual way?