When dealing with edge classification or regression on a single graph, how should I split my dataset into training, validation, and test sets?
Should I split the edges and remove the test edges from the graph during training, or should I keep the entire graph and simply not include the test edges in the loss computation? Or is there another way to do it?
Both ways are reasonable depending on your setting:
- (Transductive) All edges are in the graph, so training will see validation and test edges. Note that each mini-batch needs to exclude its seed edges from the sampled subgraph; see Link Prediction — DGL 2.2.1 documentation and the first sketch after this list.
- (Inductive) All edges are in the graph, but you use a mask array to distinguish train, validation, and test edges. In this case, besides excluding the seed edges, the masked edges also need to be filtered out. To achieve that, store the mask as an edge feature and use the `prob_name` argument (NeighborSampler — DGL 2.2.1 documentation) in the neighbor sampler so that only training edges are sampled; see the second sketch after this list.