Is DGL compatible with DDP (Distributed Data Parallel)?

Hi,

I am new to using GNNs. I already have a working code base with DDP and was hoping I could re-use it. I was wondering

  1. whether DGL is compatible with PyTorch’s DDP (Distributed Data Parallel);
  2. whether it is better to use DGL’s native distributed API instead (e.g. is there something subtle I should know before mixing PyTorch’s DDP with DGL, or a good reason to prefer DGL’s own distributed code?);
  3. whether there is any example code for DGL + DDP.

Thanks for the feedback!

This code from Chapter 7: Distributed Training — DGL 0.6.1 documentation uses DDP:

import dgl
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# NeighborSampler, SAGE, and load_subtensor are helpers defined in DGL's
# distributed example script (examples/pytorch/graphsage/train_dist.py);
# device, batch_size, in_feats, etc. are assumed to be set elsewhere.

# Connect to the DGL servers and join the PyTorch process group.
dgl.distributed.initialize('ip_config.txt')
th.distributed.init_process_group(backend='gloo')

# Load the partitioned graph and pick out the locally owned training nodes.
g = dgl.distributed.DistGraph('graph_name', part_config='part_config.json')
pb = g.get_partition_book()
train_nid = dgl.distributed.node_split(g.ndata['train_mask'], pb, force_even=True)

# Create the sampler, which issues distributed neighbor-sampling requests.
sampler = NeighborSampler(g, [10, 25],
                          dgl.distributed.sample_neighbors,
                          device)

dataloader = dgl.distributed.DistDataLoader(
    dataset=train_nid.numpy(),
    batch_size=batch_size,
    collate_fn=sampler.sample_blocks,
    shuffle=True,
    drop_last=False)

# Define the model and wrap it in DDP so gradients are averaged across trainers.
model = SAGE(in_feats, num_hidden, n_classes, num_layers, F.relu, dropout)
model = th.nn.parallel.DistributedDataParallel(model)
loss_fcn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=args.lr)

# Training loop: fetch input features for the first block and labels for the
# seed nodes of the last block, then do a standard DDP optimization step.
for epoch in range(args.num_epochs):
    for step, blocks in enumerate(dataloader):
        batch_inputs, batch_labels = load_subtensor(g, blocks[0].srcdata[dgl.NID],
                                                    blocks[-1].dstdata[dgl.NID])
        batch_pred = model(blocks, batch_inputs)
        loss = loss_fcn(batch_pred, batch_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

(Code from the linked documentation page.)
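
One detail the snippet assumes: ip_config.txt simply lists the machines in the cluster, one per line (depending on the DGL version, an optional port may follow each IP). The addresses below are placeholders:

172.31.0.1
172.31.0.2
172.31.0.3
172.31.0.4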

As far as I know, the answer to Q1 is yes. Here is a single-node multi-GPU demo: dgl/train_sampling_multi_gpu.py at master · dmlc/dgl · GitHub, and it is easy to extend to multiple nodes and multiple GPUs. A minimal sketch of the pattern follows below.
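
Here is a minimal, self-contained sketch of that pattern (CPU and the gloo backend for simplicity; the SAGE class, the port, and all hyperparameters below are illustrative choices, not DGL requirements):

import dgl
import dgl.nn as dglnn
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import torch.multiprocessing as mp

class SAGE(nn.Module):
    """Two-layer GraphSAGE that consumes a list of sampled blocks."""
    def __init__(self, in_feats, n_hidden, n_classes):
        super().__init__()
        self.layers = nn.ModuleList([
            dglnn.SAGEConv(in_feats, n_hidden, 'mean'),
            dglnn.SAGEConv(n_hidden, n_classes, 'mean'),
        ])

    def forward(self, blocks, x):
        h = x
        for i, (layer, block) in enumerate(zip(self.layers, blocks)):
            h = layer(block, h)
            if i != len(self.layers) - 1:
                h = F.relu(h)
        return h

def run(rank, world_size, g, num_classes):
    # Every process holds the full in-memory graph; DDP only
    # synchronizes model gradients across processes.
    th.distributed.init_process_group(
        'gloo', init_method='tcp://127.0.0.1:12345',
        world_size=world_size, rank=rank)

    # Shard the training node IDs so each process trains on its own slice.
    train_nids = g.ndata['train_mask'].nonzero(as_tuple=True)[0]
    train_nids = th.chunk(train_nids, world_size)[rank]

    sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
    dataloader = dgl.dataloading.NodeDataLoader(
        g, train_nids, sampler,
        batch_size=32, shuffle=True, drop_last=False)

    model = SAGE(g.ndata['feat'].shape[1], 64, num_classes)
    model = th.nn.parallel.DistributedDataParallel(model)
    loss_fcn = nn.CrossEntropyLoss()
    optimizer = th.optim.Adam(model.parameters(), lr=0.003)

    for epoch in range(10):
        for input_nodes, output_nodes, blocks in dataloader:
            batch_pred = model(blocks, g.ndata['feat'][input_nodes])
            loss = loss_fcn(batch_pred, g.ndata['label'][output_nodes])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

if __name__ == '__main__':
    dataset = dgl.data.CoraGraphDataset()  # small stand-in dataset
    g = dataset[0]
    g.create_formats_()  # build sparse formats once, before forking workers
    world_size = 4
    mp.spawn(run, args=(world_size, g, dataset.num_classes),
             nprocs=world_size)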

But there is one problem with that example: every node has to read the full graph into memory. DistDGL mainly solves this problem, so that each node only needs to process a part of the full graph, obtained by graph partitioning (the details are in the relevant documentation and the DistDGL paper). A sketch of the partitioning step follows below.
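
For reference, partitioning is a one-time preprocessing step. A sketch, with placeholder graph name and output path (Cora is used here only as a stand-in dataset):

import dgl

# Load any in-memory graph with a 'train_mask' node feature.
dataset = dgl.data.CoraGraphDataset()
g = dataset[0]

# Split into 4 parts with METIS (the default partitioning method);
# balance_ntypes spreads the training nodes evenly across partitions.
# This writes part_config.json plus one data folder per partition.
dgl.distributed.partition_graph(
    g, graph_name='graph_name', num_parts=4,
    out_path='4part_data',
    balance_ntypes=g.ndata['train_mask'])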

Therefore, the answer to Q2 should be: if your data is so large that it cannot fit on one node, you need DistDGL. If your data is not particularly large, training with the full graph loaded on each node may be faster, because there is no need to pull data from other nodes over the network.

