I see that many datasets that ship with DGL (e.g. Cora) contain a mask that is used for training networks to classify nodes. I assume that this is a pre-defined, static mask in the sense that the same nodes are always masked.
Yes, particularly when we have only one graph for semi-supervised node classification. The masks then indicate which nodes belong to the training, validation and test sets, and they are typically taken from prior work.
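For reference, here is a minimal sketch of how those pre-defined masks can be inspected, assuming a recent DGL version (0.5 or later) where Cora is exposed as `CoraGraphDataset` and the masks are stored as boolean node features:

```python
from dgl.data import CoraGraphDataset

# Load Cora; the graph ships with fixed boolean masks stored as node features.
dataset = CoraGraphDataset()
g = dataset[0]

train_mask = g.ndata['train_mask']  # True for nodes in the training set
val_mask = g.ndata['val_mask']      # True for nodes in the validation set
test_mask = g.ndata['test_mask']    # True for nodes in the test set

# The same nodes are masked every time the dataset is loaded.
print(train_mask.sum().item(), val_mask.sum().item(), test_mask.sum().item())
```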
What if one has a large number of smaller graphs? Would it help the model to generalise if the masks were created at training time, e.g. if a certain percentage of nodes were always masked?
I think there are a few different issues here.
- A large number of smaller graphs.
From what I’ve seen, most of the time we are performing graph-level prediction in this setting (e.g. predicting the properties of molecular graphs is a graph-level regression/classification problem). For this kind of problem, we can simply treat the graphs as samples in a usual dataset and split at the graph level, as in the sketch below.
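A minimal sketch of that idea, using a hypothetical toy dataset of random graphs with binary labels (in practice you would plug in your own graphs); `GraphDataLoader` is assumed available, as in DGL 0.5+:

```python
import torch
import dgl
from dgl.dataloading import GraphDataLoader

# Hypothetical toy dataset: (graph, label) pairs standing in for e.g. molecules.
graphs = [dgl.rand_graph(num_nodes=10, num_edges=30) for _ in range(100)]
labels = torch.randint(0, 2, (100,))
dataset = list(zip(graphs, labels))

# Split whole graphs (not nodes) into train/val/test, like any other dataset.
n_train, n_val = 70, 15
train_set, val_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, n_val, len(dataset) - n_train - n_val])

# GraphDataLoader merges each minibatch of graphs into one batched graph.
train_loader = GraphDataLoader(train_set, batch_size=16, shuffle=True)
for batched_graph, batch_labels in train_loader:
    pass  # the forward pass of a graph-level model would go here
```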
- Create masks at training time.
I assume you mean randomly splitting the dataset into training/validation and test sets. The main reason for using a fixed split is that many published works use the same fixed split, and we want to compare our training results against theirs. People have become aware that this can lead to overfitting to a particular split, see Pitfalls of Graph Neural Network Evaluation. One common thing we can do is to train the model with several different splits and report the mean/std of the results, as in the sketch below.
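A minimal sketch of that procedure, with assumed train/val fractions of 0.6/0.2 and a hypothetical `train_and_evaluate` helper standing in for your actual training loop:

```python
import torch

def random_masks(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly assign every node to train/val/test (fractions are assumptions)."""
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=gen)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    masks = [torch.zeros(num_nodes, dtype=torch.bool) for _ in range(3)]
    masks[0][perm[:n_train]] = True                 # training nodes
    masks[1][perm[n_train:n_train + n_val]] = True  # validation nodes
    masks[2][perm[n_train + n_val:]] = True         # test nodes
    return masks

# Repeat training over several random splits and aggregate the results.
test_accs = []
for seed in range(5):
    train_mask, val_mask, test_mask = random_masks(2708, seed=seed)  # 2708 = #nodes in Cora
    # test_accs.append(train_and_evaluate(train_mask, val_mask, test_mask))  # hypothetical helper
if test_accs:
    accs = torch.tensor(test_accs)
    print(f'test accuracy: {accs.mean():.3f} ± {accs.std():.3f}')
```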