Hi, I am currently trying to use DGL's distributed training. First, following the guidance in 7.1 Data Preprocessing — DGL 2.1.0 documentation, I partitioned my graph and obtained a distributed graph that can now be loaded correctly.
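For context, the partitioning step I ran looks roughly like this (a minimal sketch; the graph name, number of partitions, and output path are placeholders for my actual setup):

```python
import dgl

# Partition the full graph for distributed training, following the
# data-preprocessing chapter of the DGL docs. The names below are
# placeholders for my actual dataset.
dgl.distributed.partition_graph(
    g,                        # the full DGLGraph built from my raw data
    graph_name='my_graph',    # the same name later passed to DistGraph
    num_parts=4,              # one partition per machine
    out_path='partitioned',   # where the partition files and JSON config are written
)
```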
```python
import dgl
import torch

# Distributed setup (done earlier in my script): connect to the DGL servers
# and initialize the PyTorch process group used by DistributedDataParallel.
dgl.distributed.initialize(args.ip_config)            # args comes from my argument parser
torch.distributed.init_process_group(backend='gloo')

# Load data and build graph
g = dgl.distributed.DistGraph(args.graph_name)
pb = g.get_partition_book()

# Model
model = GCL(fea_dict, model_dict)                      # my own model class
model = torch.nn.parallel.DistributedDataParallel(model)

# Neighbor random sampling
neighbor_sampler = dgl.dataloading.NeighborSampler(
    train_dict.get('n_neighbors'), dgl.distributed.sample_neighbors)

# Data sampling
nids = {k: g.get_ntype_id(k) for k in g.ntypes}
dataloader = dgl.dataloading.DistNodeDataLoader(
    g, nids, neighbor_sampler,
    batch_size=train_dict.get('batch_size'), shuffle=True, drop_last=False)

# train
for epoch in range(train_dict.get('n_epochs')):
    loss = 0.
    with model.join():
        for it, (input_nodes, output_nodes, blocks) in enumerate(dataloader):
            # forward pass
            h = model(blocks)
```
However, I encountered an issue when using DistNodeDataLoader: the returned blocks have their srcdata and dstdata as empty defaultdicts. I have not checked edata, but I suspect the situation is the same, since in a distributed graph the features are stored separately.
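Here is roughly how I checked it (a small sketch that just inspects the first mini-batch):

```python
# Look at the blocks of a single mini-batch from DistNodeDataLoader.
input_nodes, output_nodes, blocks = next(iter(dataloader))
for i, block in enumerate(blocks):
    # Both print as empty defaultdicts in my runs, even though the
    # DistGraph itself has node features attached.
    print(f"block {i} srcdata:", block.srcdata)
    print(f"block {i} dstdata:", block.dstdata)
```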
Is there any way to directly distribute the related ndata and edata of the DistGraph into the blocks' srcdata, dstdata, and edata? Thank you!
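For completeness, the workaround I can see (following the distributed GraphSAGE example) is to pull the features from the DistGraph manually inside the training loop, roughly as below. This is only a sketch: it assumes a homogeneous graph with 'feat' and 'label' keys, whereas for my heterogeneous graph input_nodes is a dict keyed by node type, so the lookup would have to be done per type. I would still prefer the loader to place the features on the blocks directly:

```python
# Manually fetch features/labels from the DistGraph for each mini-batch;
# 'feat' and 'label' are placeholder feature names for my data.
for input_nodes, output_nodes, blocks in dataloader:
    batch_feats = g.ndata['feat'][input_nodes]      # features of all sampled source nodes
    batch_labels = g.ndata['label'][output_nodes]   # labels of the seed (output) nodes
    # batch_feats / batch_labels would then be passed to the model
    # together with the blocks instead of relying on blocks.srcdata.
```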