PyTorch examples for “Run multi-processing training” in the “Large-Scale Training of Graph Neural Networks” tutorial

It seems that the tutorial “Large-Scale Training of Graph Neural Networks” runs only on MXNet. Is there any PyTorch code for “run_store_server.py”?

Moreover, I did not find the script `…/incubator-mxnet/tools/launch.py` in the repo, which is used by the “Run multi-processing training” section of https://github.com/dmlc/dgl/tree/master/examples/mxnet/sampling. Also, there is no PyTorch version of “Run multi-processing training”.

There are several examples in the DGL repo with multi-GPU support.
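
For reference, these multi-GPU examples generally follow the standard PyTorch pattern: spawn one process per GPU and wrap the model in DistributedDataParallel. Below is a minimal sketch of that pattern, not the actual example code; `ToyModel` and the training loop are placeholders.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

class ToyModel(nn.Module):
    """Placeholder for a GNN model such as GraphSAGE."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 2)

    def forward(self, x):
        return self.linear(x)

def run(rank, world_size):
    # One process per GPU; NCCL is the usual backend for GPU training.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    model = DistributedDataParallel(ToyModel().cuda(rank), device_ids=[rank])
    # ... per-process mini-batch sampling and training loop would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```

Each process trains on its own mini-batches, and DistributedDataParallel averages gradients across processes during the backward pass.
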
Hello @classicsong,

Do these examples work in multi-node settings?
Since they use DistributedDataParallel, only minor changes to the process-group coordination should be needed, but I’m not sure about the data itself.
Do I need to do further data processing?
For example, our testbed has 8 nodes, each with one GPU.
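
To illustrate the coordination change mentioned above: with one GPU per node, the main difference from the single-node setup is that the global rank comes from the node rank and all processes must rendezvous at a reachable master node. A hedged sketch, assuming a launcher (or manual export) sets the usual MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables; the helper name `init_multi_node` is hypothetical:

```python
import torch
import torch.distributed as dist

def init_multi_node():
    """Hypothetical helper: initialize DDP across 8 nodes with 1 GPU each."""
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK (0..7 here), and
    # WORLD_SIZE (8 here) from the environment set by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(0)  # one GPU per node, so the local device is always 0
```

This only covers coordination; how the graph data itself is partitioned or shared across nodes is the open question, which the distributed-training tools mentioned in the next reply are meant to address.
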

Distributed training in DGL is under development. We will release tools for distributed training in the 0.5 release.

@classicsong, what is the expected timeline for the 0.5 release?

The 0.5 release will be in early August.