Dataloader reproducibility issue


I have an issue regarding the dataloader of DGL. First of all, I want to mention that I tried to fix all the seeds with:

import os
import random
import numpy as np
import torch
import dgl

def fix_seed(seed):
    """Fix all the seeds (seed: value of the seed) to get reproducible results."""
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    dgl.seed(seed)
    torch.backends.cudnn.benchmark = False

But it does not work.
My issue is that when I restart my kernel, the dataloader I create does not sample the same blocks. The first time, I get the following block:

And then, I restart my kernel and I have the following output:

You can see that the first block is different. This change does not appear when I re-run without killing the kernel first. It also seems that, sometimes when I restart my kernel, I fall back on a previous result, but it is not stable.

Could you please help me?

PS: Here is my full code:

seed = 10
sampled_neigh = [15, 5]
batch_size = 256
sampler = dgl.dataloading.MultiLayerNeighborSampler(sampled_neigh)
dataloader = dgl.dataloading.DataLoader(
    graph=g,                # the graph
    indices=train_mask,     # the node IDs to iterate over in minibatches
    graph_sampler=sampler,  # the neighbor sampler: how train_mask neighborhoods are sampled
    batch_size=batch_size,  # size of each minibatch
    shuffle=True,           # whether to shuffle the node IDs at each epoch
    drop_last=False,        # whether to drop the last incomplete batch
    num_workers=0,
)
for batch in dataloader:
    ...
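One way to make the shuffle order itself deterministic across kernel restarts is to pass a dedicated, seeded torch.Generator to the dataloader; DGL's DataLoader forwards extra keyword arguments to torch.utils.data.DataLoader, so the same idea should apply there. The sketch below demonstrates the recipe on a plain torch DataLoader with a toy dataset (the names epoch_order and the dataset of ten integers are illustrative, not from the thread):

```python
import torch
from torch.utils.data import DataLoader

def epoch_order(seed):
    # A dedicated generator makes the shuffle order depend only on
    # this seed, not on any other RNG use earlier in the process.
    gen = torch.Generator()
    gen.manual_seed(seed)
    loader = DataLoader(list(range(10)), batch_size=4, shuffle=True, generator=gen)
    return [batch.tolist() for batch in loader]

# Two "fresh kernels" with the same seed see identical batch orders.
assert epoch_order(10) == epoch_order(10)
```

Note that this only pins the shuffle order; neighbor sampling inside DGL may draw from other RNGs, which is why dgl.seed and the thread-count environment variables discussed below also matter.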

Which version of DGL are you using?

Also, in the latest version, does using GraphBolt solve your problem?

Hello @BarclayII,

I am using DGL version 2.1.0. Do you have any idea where this lack of reproducibility comes from?

I tried really quickly with GraphBolt, but I stopped because I hit an issue that I did not investigate further:

TypeError: hasattr(): attribute name must be string
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=IterableWrapperIterDataPipe, fn=functools.partial(<function minibatcher_default at 0x7f54b8892ca0>, names=(None,)), input_col=None, output_col=None)
and my code was :

import dgl.graphbolt as gb

datapipe = gb.ItemSampler(train_mask, batch_size=1024, shuffle=True, drop_last=False)
datapipe = datapipe.sample_neighbor(gb.from_dglgraph(g), [4, 4])
datapipe = datapipe.fetch_feature(g.ndata['features'], node_feature_keys=["features"])
train_dataloader = gb.DataLoader(datapipe, num_workers=0)

for minibatch in train_dataloader:
    ...

The issue appeared while iterating over the dataloader.
But this difference really bothers me. I don't understand why using dgl.dataloading.DataLoader with all the seeds fixed does not work; it does not make any sense.

BTW, thanks for the help !

The fetch_feature line is wrong. In GraphBolt, we use a FeatureStore. Please refer to the examples for details, or check the documentation: FeatureFetcher — DGL 2.2.1 documentation.

Hello @Rhett-Ying,

Thanks for the indications !

I resolved my previous issue and I'll look further into GraphBolt!

My initial issue was the non-reproducible results. The answer comes from:
Lack of reproducibility in PinSAGE sampler - Questions - Deep Graph Library (which found the answer in python - Limit number of threads in numpy - Stack Overflow)

The root cause is that the lines:

import os
os.environ["MKL_NUM_THREADS"] = "1" 
os.environ["NUMEXPR_NUM_THREADS"] = "1" 
os.environ["OMP_NUM_THREADS"] = "1" 

must be placed before import numpy. Apparently numpy only checks these variables at import time.
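To make the ordering concrete, a script applying this fix would look like the sketch below: the environment variables are set first, and only then is numpy imported, since numpy's threading backends read them once at import time. The seed value 10 mirrors the one used earlier in the thread:

```python
import os

# Must happen before the first `import numpy` anywhere in the process.
for var in ("MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS", "OMP_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # noqa: E402  (deliberately after the env setup)

# With single-threaded backends, seeded draws are stable across runs.
a = np.random.default_rng(10).integers(0, 100, 3)
b = np.random.default_rng(10).integers(0, 100, 3)
assert (a == b).all()
```

If numpy (or a library that imports it, such as torch or dgl) is imported earlier in the file, the variables are silently ignored, which is exactly the kernel-restart inconsistency described above.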


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.