How to measure the CPU neighborhood sampling time?

I think the sampling step happens inside the dataloader, as the code below shows. How can I measure the sampling time decoupled from the dataloader? Looking forward to your reply :slight_smile:

sampler = dgl.dataloading.NeighborSampler(
    [4, 4],                          # Fanouts, one per layer
    prefetch_node_feats=['feat'],
    prefetch_labels=['label'],
)
train_dataloader = dgl.dataloading.DataLoader(
    # The following arguments are specific to DGL's DataLoader.
    graph,              # The graph to sample from (kept on CPU)
    train_nids,         # The node IDs to iterate over in minibatches
    sampler,            # The neighbor sampler
    device=device,      # Put the sampled MFGs on CPU or GPU
    use_ddp=True,       # Make it work with distributed data parallel
    # The following arguments are inherited from PyTorch DataLoader.
    batch_size=args.batch_size,    # Per-device batch size.
                        # The effective batch size is this number times the number of GPUs.
    shuffle=True,       # Whether to shuffle the nodes for every epoch
    drop_last=False,    # Whether to drop the last incomplete batch
    num_workers=args.num_worker       # Number of sampler processes
)
for step, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):
    ...

By “decoupled”, do you mean you don’t want to include the multiprocessing and the overlap with computation? If so, you could probably just benchmark the sampler.sample method directly.
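
For example, something like the rough sketch below might work. It assumes a recent DGL (0.8+), where NeighborSampler.sample(g, seed_nodes) returns (input_nodes, output_nodes, blocks), that graph, sampler, train_nids, and args.batch_size are the same objects as in your snippet, and that train_nids is a CPU torch.LongTensor:

import time
import torch

# Time sampler.sample() on minibatches of seed nodes, outside the DataLoader,
# so worker processes, feature prefetching, and overlap with training are excluded.
perm = torch.randperm(train_nids.shape[0])
total = 0.0
for start in range(0, train_nids.shape[0], args.batch_size):
    seeds = train_nids[perm[start:start + args.batch_size]]
    tic = time.perf_counter()
    input_nodes, output_nodes, blocks = sampler.sample(graph, seeds)
    total += time.perf_counter() - tic
print(f"pure CPU sampling time for one pass over train_nids: {total:.3f}s")

This measures only the graph sampling itself; note that with prefetch_node_feats/prefetch_labels set, the actual feature fetching is still done by the DataLoader, so it is not included here.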