Should model be defined inside/outside cross validation loop?

I have a general question on dgl model evaluation using cross validation (CV).

I noticed that model performance differs depending on whether I define the model inside or outside the CV loop.

My question is: should I define my model inside or outside the CV loop?

Here is a pseudo-example of how I use k-fold splits.

model = GCNClassifier(settings)  # should I define it outside?

for train_idx, test_idx in k_folds(n_splits=num_folds):
    dataset_train = Dataset[train_idx]
    dataset_test = Dataset[test_idx]
    train_loader = DataLoader(dataset=dataset_train)
    test_loader = DataLoader(dataset=dataset_test)

    model = GCNClassifier(settings)  # or should I define it inside?

    for epoch in range(1, n + 1):
        train(model, optimizer, epoch, train_loader)
        test(model, test_loader)

If you want to brush up on cross validation, you may find the scikit-learn article on it helpful. Generally, cross-validation is an approach for hyperparameter selection rather than parameter selection. If you simply want to find the best hyperparameters, you should start each fold from randomly initialized weights, that is, define the model inside the loop. Also note that you may want to train multiple models per fold to account for the randomness introduced by random weight initialization.
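To make the "inside the loop" advice concrete, here is a minimal sketch in plain Python (no DGL or scikit-learn), where `k_fold_indices` is a hypothetical stand-in for whatever splitter you use and the fresh random number drawn per fold stands in for `GCNClassifier(settings)` being constructed anew:

```python
import random

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        test_idx = idx[k * fold_size:(k + 1) * fold_size]
        test_set = set(test_idx)
        train_idx = [i for i in idx if i not in test_set]
        yield train_idx, test_idx

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5)):
    # A fresh model per fold: constructing it here means no trained
    # parameters leak from one fold into the next.
    model_weight = random.random()  # stand-in for GCNClassifier(settings)
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```

The key point is only the placement of the model construction: anything created before the loop is shared across folds, so later folds would start from weights already fitted to earlier folds' data.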

Thanks for your explanation! That is very useful!
I would like to know whether I can set `random_state` or use other settings to make the results with random weight initialization reproducible, or whether the weight initialization should be left completely random?

It depends on what you want to achieve. For pure cross validation, you should not introduce additional bias by fixing the random seed. Alternatively, you can run multiple experiments with differently initialized weights to account for the variance due to initialization.
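One way to act on that advice is to repeat each fold with several seeds and report the mean and spread of the scores, rather than a single number. A minimal sketch, where `run_fold` is a hypothetical placeholder for a full train/evaluate cycle whose seed controls only the weight initialization:

```python
import random
import statistics

def run_fold(seed):
    """Hypothetical placeholder for one train/test run on a fixed fold.

    The seed controls the (simulated) weight initialization; in real code
    this would build the model, train it, and return a test metric.
    """
    rng = random.Random(seed)
    return 0.8 + 0.05 * (rng.random() - 0.5)  # simulated accuracy

seeds = [0, 1, 2, 3, 4]
scores = [run_fold(s) for s in seeds]
print(f"mean={statistics.mean(scores):.3f} std={statistics.stdev(scores):.3f}")
```

Reporting mean and standard deviation across seeds makes it visible how much of the score difference between hyperparameter settings is real and how much is initialization noise.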


Thanks a lot for your explanation! It is very clear!