Dataloading efficiency bottleneck at to_block

I benchmarked NeighborSampler and was surprised to find that the bottleneck is converting the frontier to a block (to_block), which takes up to 80% of the sampling time. Moving the frontier to the GPU before calling to_block makes it much faster, but this consumes a lot of GPU memory. Is there any other workaround?

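You could probably try UVA sampling by setting use_uva=True in the DataLoader. With UVA the graph stays in pinned CPU memory while the sampling itself runs on the GPU, so the whole graph does not have to be copied into GPU memory.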
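
For reference, a minimal sketch of what that setup might look like. The helper names (uva_loader_kwargs, build_uva_loader), the fanouts, and the batch size are placeholders I made up for illustration, not anything from your code. As far as I know, UVA mode expects the graph to be on the CPU, the seed node IDs to be on the same GPU as device, and num_workers=0.

```python
def uva_loader_kwargs(device, batch_size=1024):
    """Keyword arguments for dgl.dataloading.DataLoader in UVA mode.

    Hypothetical helper: it only collects the settings discussed above.
    """
    return {
        "device": device,        # GPU that runs sampling and receives the blocks
        "use_uva": True,         # graph stays in pinned CPU memory
        "num_workers": 0,        # UVA sampling does not use worker processes
        "batch_size": batch_size,
        "shuffle": True,
        "drop_last": False,
    }


def build_uva_loader(graph, train_nids, fanouts=(15, 10),
                     device="cuda:0", batch_size=1024):
    """Build a neighbor-sampling DataLoader that samples via UVA.

    Assumes `graph` is a DGLGraph living on the CPU and `train_nids`
    is a tensor (or list) of seed node IDs.
    """
    # Imported lazily so the helper above can be used without DGL installed.
    import torch
    import dgl

    sampler = dgl.dataloading.NeighborSampler(list(fanouts))
    # The seed node IDs must be on the same device as `device` in UVA mode.
    seeds = torch.as_tensor(train_nids).to(device)
    return dgl.dataloading.DataLoader(
        graph, seeds, sampler, **uva_loader_kwargs(device, batch_size)
    )
```

You would then iterate it as usual, e.g. for input_nodes, output_nodes, blocks in build_uva_loader(g, train_nids): ..., and the blocks should already be on the GPU. It is worth re-running your benchmark afterwards to check whether to_block is still the dominant cost.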