Overlap data loading and computation with DistDGL

I want to overlap data loading and computation in DistDGL. I use a ThreadPoolExecutor to fetch the next mini-batch while the current mini-batch is being trained on. My code looks like this:

train_dataloader = dgl.dataloading.DistNodeDataLoader(
    g,
    train_nids,
    sampler,
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=False,
)
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
...
dataload_iter = train_dataloader.__iter__()
batch_idx = 0
while True:
    if batch_idx == 0:
        # fetch the very first mini-batch synchronously
        batch = next(dataload_iter)
    else:
        # pick up the mini-batch prefetched during the previous iteration
        batch = future.result()

    # start fetching the next mini-batch in the background
    future = executor.submit(next, dataload_iter)

    # model train on `batch`
    batch_idx += 1

However, the code above has exactly the same execution time as the plain version below.

for batch_idx, (input_nodes, output_nodes,
                blocks) in enumerate(train_dataloader):
    # model train

This really confuses me. My guess is that the overlap cannot happen because of the GIL. Is there any way to overlap these two stages? Would multiprocessing help? Any advice would be appreciated. Thanks in advance.
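
For reference, this is roughly the multiprocessing-based variant I had in mind: a producer process iterates the dataloader and pushes mini-batches into a multiprocessing.Queue while the main process trains. It is only a sketch (prefetch_worker is a made-up helper, and I assume Linux fork semantics); I am not sure a DistNodeDataLoader can even be iterated from a child process because of DistDGL's RPC state.

import multiprocessing as mp

def prefetch_worker(dataloader, queue):
    # producer: fetch mini-batches and hand them to the trainer
    for minibatch in dataloader:
        queue.put(minibatch)
    queue.put(None)  # sentinel: no more batches

queue = mp.Queue(maxsize=2)
producer = mp.Process(target=prefetch_worker, args=(train_dataloader, queue))
producer.start()

while True:
    minibatch = queue.get()
    if minibatch is None:
        break
    input_nodes, output_nodes, blocks = minibatch
    # model train on `blocks`

producer.join()

Would something like this work with DistDGL, or is there a recommended built-in way to prefetch?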