Reddit Dataset does not work with DataParallel


I have just started using DGL and the main motivation was that it simplifies working with graph data a lot.

However, I have encountered the following issue: whenever I try to test my model on the Reddit dataset, it does not work.

If I use torch.nn.DataParallel, it throws the following error: dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 58242 and 232965 instead.
This is probably because the Reddit dataset is not exactly compatible with the PyTorch dataloader, from what I can see in the code.
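For what it's worth, the two numbers in the error are consistent with DataParallel scattering the node-feature tensor along dimension 0 while the graph itself keeps all of its nodes. A small sketch, assuming 4 visible GPUs (the GPU count is my guess from the numbers, not stated in the thread):

```python
import math

total_nodes = 232965  # nodes in the Reddit graph (from the error message)
num_gpus = 4          # assumption: 4 visible GPUs

# torch.nn.DataParallel splits input tensors along dim 0, so each replica
# receives roughly total_nodes / num_gpus feature rows, while the DGL graph
# passed alongside still expects features for all 232965 nodes.
rows_per_replica = math.ceil(total_nodes / num_gpus)
print(rows_per_replica)  # 58242, matching the error message
```

So the mismatch is not a bug in your loading code; it is DataParallel chunking a tensor that must stay whole for full-graph message passing.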

I also tried using torch.nn.parallel.DistributedDataParallel (with the one-process, multiple-GPUs approach); however, it freezes when I try to wrap the model in torch.nn.parallel.DistributedDataParallel.

Also, because the graph is quite large, it does not fit into the memory of any single machine I have available.

Is there anything I can do about that? Thanks.



It seems your data is not split in the right way. Do you use multiprocessing workers or any collate function?


Thank you for your reply.
Not really, I am not using anything like that. I just load the data and store it in variables. My code looks as follows:

    import torch
    from dgl.data import register_data_args, load_data

    def get_dataset(args, device):
        data = load_data(args)
        features = torch.DoubleTensor(data.features).to(device)
        labels = torch.LongTensor(data.labels).to(device)
        train_mask = torch.ByteTensor(data.train_mask.astype(int)).to(device)
        val_mask = torch.ByteTensor(data.val_mask.astype(int)).to(device)
        test_mask = torch.ByteTensor(data.test_mask.astype(int)).to(device)
        in_feats = features.shape[1]
        n_classes = data.num_labels
        n_edges = data.graph.number_of_edges()

        return (
            features, labels, train_mask, val_mask, test_mask,
            in_feats, n_classes, n_edges,
        )

After that, I use the data by passing it to the model directly, similar to the DGL tutorial.
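Concretely, the full-graph training pattern from the DGL tutorials looks roughly like this; dummy tensors and a plain Linear layer stand in for the real Reddit data and the GNN model, and the sizes here are illustrative, not the real dataset's:

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins for the tensors returned by get_dataset().
num_nodes, in_feats, n_classes = 100, 16, 7
features = torch.randn(num_nodes, in_feats)
labels = torch.randint(0, n_classes, (num_nodes,))
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[:60] = True  # first 60 nodes are training nodes

# A plain linear layer stands in for the GNN model.
model = torch.nn.Linear(in_feats, n_classes)

# Forward pass over ALL nodes at once (full-graph training),
# then compute the loss only on the masked training nodes.
logits = model(features)
loss = F.cross_entropy(logits[train_mask], labels[train_mask])
print(loss.item())
```

Note that the forward pass always takes the full feature matrix; only the loss is restricted by the mask, which is why splitting the feature tensor across devices breaks things.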



The Reddit dataset has only one graph, right? How would you parallelize it across multiple GPUs?