Dataloader with WeightedRandomSampler

Hello,

I am training an MLP with minibatching on an unbalanced dataset, and I was hoping to oversample the minority class and downsample the majority class.

For the dataloader I have defined a graph sampler, and I was hoping to pass a WeightedRandomSampler as one of the keyword arguments that get forwarded to the parent PyTorch DataLoader.

From the error I am seeing, I understand this isn’t possible because the DGL dataloader generates an iterable dataset. Are there other ways to oversample the minority class in DGL?

ValueError: DataLoader with IterableDataset: expected unspecified sampler option, but got sampler=<torch.utils.data.sampler.WeightedRandomSampler object at 0x7fe4fe719d90>
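
For context, the same ValueError can be reproduced with plain PyTorch, without DGL. This is only a minimal sketch, assuming torch is installed; the Stream class is a hypothetical stand-in for whatever iterable dataset the DGL dataloader builds internally:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, WeightedRandomSampler


class Stream(IterableDataset):
    """Hypothetical stand-in for an iterable-style dataset."""

    def __iter__(self):
        return iter(range(4))


# Any sampler will do; the weights here are uniform and purely illustrative.
sampler = WeightedRandomSampler(
    torch.ones(4, dtype=torch.double), num_samples=4, replacement=True)

try:
    # PyTorch refuses a sampler for iterable-style datasets at construction time.
    DataLoader(Stream(), sampler=sampler)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

So the check seems to live in the PyTorch DataLoader constructor rather than in the graph sampler itself.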

    import torch
    from dgl.dataloading import DataLoader, NeighborSampler, as_edge_prediction_sampler

    # Graph sampler
    graph_sampler = NeighborSampler([15, 10, 5], prefetch_node_feats=['h'])
    graph_sampler = as_edge_prediction_sampler(graph_sampler)

    # Sampler for the parent PyTorch DataLoader; sampler_weight holds each sample's probability
    sampler = torch.utils.data.sampler.WeightedRandomSampler(
        sampler_weight.double(), len(sampler_weight), replacement=True)
    use_uva = (args.mode == 'mixed')

    # DGL DataLoader with graph_sampler and sampler (passing sampler= raises the ValueError above)
    dataloader = DataLoader(
        g, train_eids, graph_sampler,
        device=device, batch_size=8, shuffle=True,
        drop_last=False, num_workers=0, use_uva=use_uva, sampler=sampler)

Thanks!

Here’s a ticket for WeightedRandomSampler support: "[DataLoader] Support weighted seed node/edge sampling in DGL DataLoader" (Issue #3431, dmlc/dgl on GitHub).

One thing I’d like to confirm though: if the weighted sampler yields a minibatch with duplicated nodes (i.e. the nodes are not unique in the same minibatch), how would you like to do sampling? Would you like to sample different neighbors for the duplicated nodes?

@BarclayII The weighted sampler will produce duplicate nodes, because otherwise it cannot meet the probability of the minority class, right? If so, yes, I would prefer to sample different neighbors for the duplicated nodes.

Thanks @Rhett-Ying. I looked at "[DataLoader] Support weighted seed node/edge sampling in DGL DataLoader" (Issue #3431, dmlc/dgl on GitHub), but even with replacement=False it still doesn’t work. It looks like it’s still an open feature request.
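
Until that lands, one possible workaround is to do the weighted draw yourself and hand the resampled IDs to the dataloader in place of train_eids. The sketch below is only an illustration under stated assumptions: resample_seed_ids is a hypothetical helper (not a DGL or PyTorch function), the labels and weights are toy values, and I have not verified how the DGL DataLoader treats duplicate IDs inside one minibatch, which is the open question raised above.

```python
import torch


def resample_seed_ids(seed_ids, weights, num_samples=None, generator=None):
    """Hypothetical helper: draw seed IDs with replacement, proportionally to weights."""
    if num_samples is None:
        num_samples = len(seed_ids)
    # torch.multinomial returns positions into `weights`; map them back to IDs.
    picks = torch.multinomial(
        weights.double(), num_samples, replacement=True, generator=generator)
    return seed_ids[picks]


# Toy data: 8 majority-class edges (label 0) and 2 minority-class edges (label 1).
train_eids = torch.arange(10)
labels = torch.tensor([0] * 8 + [1] * 2)

# Inverse class frequency, so both classes carry the same total weight.
class_counts = torch.bincount(labels)
sampler_weight = 1.0 / class_counts[labels].double()

rng = torch.Generator().manual_seed(0)
epoch_eids = resample_seed_ids(train_eids, sampler_weight, generator=rng)

# Assumption: epoch_eids would replace train_eids in the DataLoader call,
# and would be redrawn (and the dataloader rebuilt) at the start of each epoch.
print(epoch_eids.tolist())
```

Because the draw is with replacement, the same ID can show up more than once in epoch_eids, so with shuffle=True duplicates may still end up in the same minibatch.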