How does the graph-store example for large-graph training split the input graph?

The DGL documentation provides an example of using the graph store for multi-GPU training on MXNet with a large graph. (dgl doc)
1. Many examples of multi-GPU training on a single machine with MXNet show that the input data should be split into parts using “gutils.split_and_load”, but the example code in the DGL doc does not seem to split the input data at all (a split_and_load sketch follows the sampler loop below).
2. Some examples of distributed training on MXNet also show that the training data has to be split across the workers first, but I don't see that in this code either. I guess it may be done by NeighborSampler? (I'm a little confused, though, because it looks as if every trainer samples over the whole graph.)
3. So what exactly is parallelized when using multiple GPUs? For example, suppose I have a graph g with n nodes, I set seed_nodes to all n nodes, and args.batch_size = b, so the for loop below produces n/b NodeFlows. With multi-GPU training, does each GPU process one NodeFlow at a time (so that all GPUs work through the loop together), or does each GPU receive a subset of the original g as its input? (My guess about splitting the seed nodes is sketched after the loop.)

for nf in dgl.contrib.sampling.NeighborSampler(g, args.batch_size,
                                               args.num_neighbors,
                                               neighbor_type='in',
                                               num_workers=32,
                                               shuffle=True,
                                               num_hops=n_layers,
                                               seed_nodes=train_nid):
    # one training step per sampled NodeFlow (body omitted in the doc's example)
    pass
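
For context on question 1, here is a minimal sketch (not from the DGL example) of how single-machine multi-GPU MXNet examples usually split one batch across devices; “gutils” in the question is presumably an alias for mxnet.gluon.utils, and the batch and device list below are dummy placeholders.

import mxnet as mx
from mxnet import gluon

ctx_list = [mx.gpu(0), mx.gpu(1)]                # assumed: two GPUs
batch = mx.nd.random.uniform(shape=(64, 16))     # dummy batch of node features

# split_and_load slices the batch along batch_axis and copies each slice
# to one device, so every GPU works on a different part of the batch.
parts = gluon.utils.split_and_load(batch, ctx_list, batch_axis=0)
for data in parts:
    # forward/backward would run here on each device's slice;
    # gradients are aggregated when the trainer steps (omitted).
    pass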
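
To make the guess in questions 2 and 3 concrete, here is a hedged sketch of the usual data-parallel pattern: every trainer keeps the whole graph g but samples only from its own shard of the seed nodes, so the n/b NodeFlows are divided among the GPUs instead of each GPU repeating the full loop. It reuses g, args, train_nid, and n_layers from the snippet above, assumes train_nid is a NumPy array of node ids, and uses num_gpus and gpu_rank as illustrative names that are not part of the DGL example.

import numpy as np
import dgl

num_gpus = 4        # assumed number of trainer processes / GPUs
gpu_rank = 0        # rank of this particular trainer
my_seed_nodes = np.array_split(train_nid, num_gpus)[gpu_rank]

for nf in dgl.contrib.sampling.NeighborSampler(g, args.batch_size,
                                               args.num_neighbors,
                                               neighbor_type='in',
                                               num_workers=32,
                                               shuffle=True,
                                               num_hops=n_layers,
                                               seed_nodes=my_seed_nodes):
    # each trainer now sees only about (n / num_gpus) / b NodeFlows per epoch
    pass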
4. Can this multi-GPU training be used with PyTorch?

1. Hi, the document link you put here is the tutorial for large-scale training on a CPU NUMA machine, not for multi-GPU training.

2. To use the distributed sampler, you need a sampler machine and a trainer machine, and both types of machines need the whole graph data (see the rough sketch after this list).

3. We will release the multi-GPU tutorial in the near future. Thanks!
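
A rough sketch of the sampler-machine / trainer-machine setup described in point 2, assuming the SamplerSender / SamplerReceiver classes in dgl.contrib.sampling of that release behave roughly as below; the address, namebook, and constructor arguments are illustrative, so check the distributed-sampler docs for the exact signatures.

import dgl

# --- sampler machine: holds the whole graph g and produces NodeFlows ---
namebook = {0: '127.0.0.1:50051'}       # assumed address of trainer 0
sender = dgl.contrib.sampling.SamplerSender(namebook)
for nf in dgl.contrib.sampling.NeighborSampler(g, args.batch_size,
                                               args.num_neighbors,
                                               neighbor_type='in',
                                               num_hops=n_layers,
                                               seed_nodes=train_nid):
    sender.send(nf, 0)                  # push each NodeFlow to trainer 0
sender.signal(0)                        # tell trainer 0 the epoch is finished

# --- trainer machine: also holds the whole graph g and consumes NodeFlows ---
receiver = dgl.contrib.sampling.SamplerReceiver(g, '127.0.0.1:50051', 1)
for nf in receiver:
    pass                                # one training step per received NodeFlow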