MultiLayerNeighborSampler generating first batch several orders of magnitude slower than generating second batch

Hi experts,

I’m using the MultiLayerNeighborSampler to generate batches for the Reddit dataset. Loading the first batch takes around 30,000 times longer than loading the second batch. See the following code:

import time
import dgl
import numpy as np

dataset = dgl.data.RedditDataset()
graph = dataset[0]

adj_sparse = graph.adj(scipy_fmt='coo') 
train_ids = np.arange(adj_sparse.shape[0])[graph.ndata['train_mask']]
graph = dgl.graph((adj_sparse.row, adj_sparse.col))  
dataset = None

sampler = dgl.dataloading.MultiLayerNeighborSampler([5, 5])
dataloader = dgl.dataloading.DataLoader(
    graph, train_ids, sampler,
    batch_size=64,
    shuffle=False,
    drop_last=False,
    num_workers=0)
dataloader_iter = iter(dataloader)

first_sample_start = time.perf_counter()
input_nodes, output_nodes, mfgs = next(dataloader_iter) 
first_sample_end = time.perf_counter()

print(f"Finished first sample; time elapsed: {first_sample_end - first_sample_start}")  # 174.90 s on my machine

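# Start a fresh timer so the print above is not counted in the second measurement.
second_sample_start = time.perf_counter()
input_nodes, output_nodes, mfgs = next(dataloader_iter)
second_sample_end = time.perf_counter()
print(f"Finished second sample; time elapsed: {second_sample_end - second_sample_start}")  # 0.005 s on my machine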

Generating the first batch also seems to be memory intensive: my whole system starts lagging while it runs. Could I get some more details on what happens when the first batch is generated? Is some intermediate data structure being built? If so, how much memory does it take to build?

Before sampling the first batch, DGL converts the entire graph from COO to CSR format so that later sampling is faster. I think that conversion accounts for most of the time.
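
To make the conversion concrete, here is a minimal pure-Python sketch of what a COO-to-CSR conversion does in general. It is not DGL's actual implementation, and the names `coo_to_csr` and `neighbors` are made up for illustration. It only shows why the work is done once over the whole graph and why neighbor lookups are cheap afterwards.

```python
from typing import List, Tuple


def coo_to_csr(num_nodes: int, rows: List[int], cols: List[int]) -> Tuple[List[int], List[int]]:
    """Group COO edges (rows[i] -> cols[i]) by source node into CSR arrays.

    Hypothetical helper for illustration only; not a DGL function.
    """
    # Count how many edges start at each node.
    indptr = [0] * (num_nodes + 1)
    for r in rows:
        indptr[r + 1] += 1
    # Prefix sum: indptr[v] becomes the offset where node v's neighbors start.
    for v in range(num_nodes):
        indptr[v + 1] += indptr[v]
    # Scatter every edge's destination into its source node's slice.
    indices = [0] * len(rows)
    next_slot = indptr[:-1].copy()
    for r, c in zip(rows, cols):
        indices[next_slot[r]] = c
        next_slot[r] += 1
    return indptr, indices


def neighbors(indptr: List[int], indices: List[int], node: int) -> List[int]:
    """Return a node's neighbors as one contiguous slice of the CSR arrays."""
    return indices[indptr[node]:indptr[node + 1]]


# Tiny example graph with 3 nodes and 5 edges in COO form.
rows = [0, 2, 0, 1, 2]
cols = [1, 0, 2, 2, 1]
indptr, indices = coo_to_csr(3, rows, cols)
print(indptr)                          # [0, 2, 3, 5]
print(indices)                         # [1, 2, 2, 0, 1]
print(neighbors(indptr, indices, 2))   # [0, 1]
```

In this sketch the conversion touches every edge and allocates `num_nodes + 1` offsets plus one entry per edge, so its cost scales with the size of the whole graph rather than with the batch. Once it is done, looking up a node's neighbors is a single slice, which is consistent with the second batch being so much faster than the first.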
