Reproducibility, DataLoader: shuffle=True, using seeds

Hi!
I want my results to be reproducible, but it seems I can't make it happen.
Is there a way to use seeds with shuffle=True and keep reproducibility in the DataLoader?
I use dgl.dataloading.DataLoader.

I have already tried fixing the numpy seed, torch seed, and random seed, but it does not seem to work.
I also tried the approach from this question: Reproducibility, DataLoader: shuffle=True, using seeds - data - PyTorch Forums. But again, it does not seem to work with the DataLoader of DGL (though the code provided by ptrblck actually works for his example).
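For reference, the pattern from that answer looks roughly like this (a sketch of the standard PyTorch recipe; dataset here stands in for any map-style dataset):

import random
import numpy as np
import torch

def seed_worker(worker_id):
    # Derive a per-worker seed from the DataLoader's base seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    worker_init_fn=seed_worker,  # re-seeds numpy/random inside each worker
    generator=g,                 # fixes the shuffle order
)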

Could anyone help me with this issue?

Thanks a lot :slight_smile:

In fact, it worked; I just needed to restart my kernel to get the same results.

Hi!
Actually, there are small changes in the last batch even when I fix the seed, so my results are not reproducible. I don't understand why, since I have fixed all of my seeds.

import os
import random

import numpy as np
import torch
import dgl

def fix_seed(seed):
    random.seed(seed)                          # Python RNG
    os.environ['PYTHONHASHSEED'] = str(seed)   # hash-based operations
    np.random.seed(seed)                       # NumPy RNG
    torch.manual_seed(seed)                    # PyTorch CPU RNG
    torch.cuda.manual_seed(seed)               # PyTorch GPU RNG
    torch.cuda.manual_seed_all(seed)           # all GPUs
    dgl.seed(seed)                             # DGL RNG
    torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False

And I use this function like this:

fix_seed(seed)
for batch in tqdm.tqdm(dataloader_train):
    input_nodes, output_nodes, block = batch
    ...

Could someone please help me?
Thank you

For those of you that have the same problem:
For me, the issue came from the dataloader: it did not sample the same neighborhood each time. Disabling multithreading was the solution to get reproducible results. I call the fix_seed function below before creating each dataloader and sampler.

import os
import random

import numpy as np
import torch
import dgl

def fix_seed(seed):
    '''
    Fix all the seeds to get reproducible results.

    Args:
        seed: the seed to use
    '''
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    dgl.seed(seed)
    dgl.random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    # Note: with CUDA, torch.use_deterministic_algorithms(True) may also
    # require setting the CUBLAS_WORKSPACE_CONFIG environment variable
    # (see the PyTorch reproducibility docs).
    torch.use_deterministic_algorithms(True)
    # Disable multithreading: this is what makes the neighbor sampling
    # deterministic in my case.
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    torch.set_num_threads(1)

fix_seed(10)
sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10])
dataloader_train = dgl.dataloading.DataLoader(
    graph=graph,           # the graph
    indices=train_mask,    # the node IDs to iterate over in minibatches
    graph_sampler=sampler, # the neighbor sampler: how the neighborhood of train_mask is sampled
    batch_size=batch_size, # size of each batch
    shuffle=True,          # whether to shuffle the node order at each epoch
    drop_last=False,       # whether to keep or drop the last incomplete batch
)
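To check that it really is reproducible, you can compare two runs built from the same seed (a sketch; run_epoch is a hypothetical helper wrapping the construction above):

def run_epoch(seed):
    fix_seed(seed)
    sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10])
    loader = dgl.dataloading.DataLoader(
        graph, train_mask, sampler,
        batch_size=batch_size, shuffle=True, drop_last=False,
    )
    # Record both the batch order and the sampled neighborhoods
    return [(inp.tolist(), out.tolist()) for inp, out, _ in loader]

# Two epochs built from the same seed should match exactly
assert run_epoch(10) == run_epoch(10)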

Hi @aure_bnp, thanks for this! We’ve confirmed that DGL can’t currently fix the seed under a multi-threaded environment, but we’re working on fixing this soon.

