Hi experts,
I’m using the MultiLayerNeighborSampler to generate batches for the Reddit dataset. I’m finding that loading the first batch takes around 30,000 times longer than loading the second batch. See the following code:
import time
import dgl
import numpy as np
dataset = dgl.data.RedditDataset()
graph = dataset[0]
adj_sparse = graph.adj(scipy_fmt='coo')
# Training node IDs taken from the original graph's train mask
train_ids = np.arange(adj_sparse.shape[0])[graph.ndata['train_mask']]
# Rebuild the graph from the COO adjacency (this drops the node features)
graph = dgl.graph((adj_sparse.row, adj_sparse.col))
dataset = None
# Two-layer neighbor sampler with a fanout of 5 per layer
sampler = dgl.dataloading.MultiLayerNeighborSampler([5, 5])
dataloader = dgl.dataloading.DataLoader(
    graph, train_ids, sampler,
    batch_size=64,
    shuffle=False,
    drop_last=False,
    num_workers=0)
dataloader_iter = iter(dataloader)
first_sample_start = time.perf_counter()
input_nodes, output_nodes, mfgs = next(dataloader_iter)
first_sample_end = time.perf_counter()
print(f"Finished first sample; time elapsed: {first_sample_end - first_sample_start}") # 174.90 on my machine
input_nodes, output_nodes, mfgs = next(dataloader_iter)
second_sample_end = time.perf_counter()
print(f"Finished second sample; time elapsed: {second_sample_end - first_sample_end}") # 0.005 on my machine
It seems that whatever happens while generating the first batch is rather memory intensive, as my whole system starts lagging during that step. Could I get some more details on what happens when the first batch is generated? Is there some intermediate data structure being built? If so, roughly how much memory does building it require?
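For what it’s worth, my current guess is that the first next() call triggers a one-off conversion of the graph’s sparse format: the graph is constructed in COO, and I believe neighbor sampling needs a CSC/CSR representation. Below is a small check I was planning to run in a fresh process. It assumes that DGLGraph.create_formats_() eagerly materializes all sparse formats, so if that conversion is indeed the intermediate structure, the cost should move out of the first batch and into the create_formats_() call:

import time
import dgl
import numpy as np

dataset = dgl.data.RedditDataset()
graph = dataset[0]
adj_sparse = graph.adj(scipy_fmt='coo')
train_ids = np.arange(adj_sparse.shape[0])[graph.ndata['train_mask']]
graph = dgl.graph((adj_sparse.row, adj_sparse.col))
dataset = None

# Eagerly build every sparse format before any sampling happens,
# so any format-conversion cost is paid here instead of in the first batch.
convert_start = time.perf_counter()
graph.create_formats_()
convert_end = time.perf_counter()
print(f"create_formats_ time: {convert_end - convert_start}")

sampler = dgl.dataloading.MultiLayerNeighborSampler([5, 5])
dataloader = dgl.dataloading.DataLoader(
    graph, train_ids, sampler,
    batch_size=64, shuffle=False, drop_last=False, num_workers=0)

dataloader_iter = iter(dataloader)
first_sample_start = time.perf_counter()
input_nodes, output_nodes, mfgs = next(dataloader_iter)
print(f"First sample after create_formats_: {time.perf_counter() - first_sample_start}")

Is this the right mental model, or is something else going on during the first batch?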