DataLoader Worker Exited Unexpectedly When Customizing Neighborhood Sampler on Windows

I’m customizing a neighborhood sampler with the DGL library. However, the sampler fails to work when I set the parameter `num_workers` to a non-zero value when initializing the `NodeDataLoader`. For instance:

my_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
my_dataloader = dgl.dataloading.NodeDataLoader(
                g=G,
                nids=[4],
                block_sampler=my_sampler,
                device='cpu',
                batch_size=1,
                shuffle=True,
                drop_last=False,
                num_workers=1
            )
for step, (input_nodes, seeds, blocks) in enumerate(my_dataloader):
    ...

The code above, which uses the built-in `MultiLayerFullNeighborSampler`, works well. Note that it works whether `num_workers` is set to zero or non-zero.

class TestSampler(dgl.dataloading.MultiLayerNeighborSampler):
    def __init__(self, n_layers, return_eids=False):
        super().__init__([None] * n_layers, return_eids=return_eids)

my_sampler = TestSampler(2)
my_dataloader = dgl.dataloading.NodeDataLoader(
                g=G,
                nids=[4],
                block_sampler=my_sampler,
                device='cpu',
                batch_size=1,
                shuffle=True,
                drop_last=False,
                num_workers=0             # here
            )
for step, (input_nodes, seeds, blocks) in enumerate(my_dataloader):
    ...

The code above also works well. The implementation of the class `TestSampler` is copied verbatim from the implementation of `MultiLayerFullNeighborSampler`. However, when I set `num_workers` to a non-zero value:

my_sampler = TestSampler(2)
#my_sampler = dgl.dataloading.MultiLayerNeighborSampler([None] * 2)
#my_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
my_dataloader = dgl.dataloading.NodeDataLoader(
                g=G,
                nids=[4],
                block_sampler=my_sampler,
                device='cpu',
                batch_size=1,
                shuffle=True,
                drop_last=False,
                num_workers=1             # here
            )
for step, (input_nodes, seeds, blocks) in enumerate(my_dataloader):
    ...

The code above fails during enumeration. The error message is about multiprocessing:

---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_data(self, timeout)
    989         try:
--> 990             data = self._data_queue.get(timeout=timeout)
    991             return (True, data)

~\anaconda3\lib\multiprocessing\queues.py in get(self, block, timeout)
    107                     if not self._poll(timeout):
--> 108                         raise Empty
    109                 elif not self._poll():

Empty: 

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-15-6cf22217dc6e> in <module>
     13                 num_workers=1
     14             )
---> 15 for step, (input_nodes, seeds, blocks) in enumerate(my_dataloader):
     16     print('fuck')
     17     print(input_nodes)

~\anaconda3\lib\site-packages\dgl\dataloading\pytorch\dataloader.py in __next__(self)
    320     def __next__(self):
    321         # input_nodes, output_nodes, blocks
--> 322         result_ = next(self.iter_)
    323         _restore_blocks_storage(result_[-1], self.node_dataloader.collator.g)
    324 

~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
   1184 
   1185             assert not self._shutdown and self._tasks_outstanding > 0
-> 1186             idx, data = self._get_data()
   1187             self._tasks_outstanding -= 1
   1188             if self._dataset_kind == _DatasetKind.Iterable:

~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _get_data(self)
   1150         else:
   1151             while True:
-> 1152                 success, data = self._try_get_data()
   1153                 if success:
   1154                     return data

~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_data(self, timeout)
   1001             if len(failed_workers) > 0:
   1002                 pids_str = ', '.join(str(w.pid) for w in failed_workers)
-> 1003                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
   1004             if isinstance(e, queue.Empty):
   1005                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 3032) exited unexpectedly

After searching for information about this problem, I think it might be related to multiprocessing on the Windows system. However, that does not explain why the `MultiLayerFullNeighborSampler` class runs fine. I wonder how to fix it correctly.
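If I understand Windows multiprocessing correctly, one possible difference (this is my guess, not confirmed): on Windows, DataLoader workers are started with the spawn method, so each worker must unpickle the sampler by re-importing the module that defines its class. The built-in samplers live inside the installed `dgl` package, which a worker can import, while my `TestSampler` was defined interactively in `__main__`, which a fresh worker process cannot re-import. A minimal stdlib-only sketch of that failure mode (the module name `scratch_samplers` and the simplified sampler class are made-up stand-ins):

```python
import pickle
import sys
import types

class TestSampler:  # simplified stand-in for the custom DGL sampler
    def __init__(self, n_layers):
        self.fanouts = [None] * n_layers

# Pretend the class lives in a module, pickle an instance, then make the
# module unimportable -- mimicking a freshly spawned worker process that
# cannot re-import the interpreter's interactive namespace.
mod = types.ModuleType('scratch_samplers')  # hypothetical module name
mod.TestSampler = TestSampler
TestSampler.__module__ = 'scratch_samplers'
sys.modules['scratch_samplers'] = mod

payload = pickle.dumps(TestSampler(2))  # stores only module + class name, not the code
del sys.modules['scratch_samplers']     # the spawned worker has no such module

try:
    pickle.loads(payload)
    err = None
except ModuleNotFoundError as e:
    err = type(e).__name__
print(err)  # ModuleNotFoundError
```

The pickle payload never contains the class body, only a reference to it, which is why the worker dies as soon as it tries to reconstruct the sampler.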

Software Versions:

Python: 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32

PyTorch: 1.9.1 py3.8_cuda10.2_cudnn7_0

dgl-cuda10.2: 0.7.1

Do you have the same problem when using DGL’s built-in sampler?

The first piece of code uses DGL’s built-in `MultiLayerFullNeighborSampler`, and it works well.

I see. Could you try changing `None` to `-1` in `TestSampler` to see whether that works?

It still doesn’t work, while
my_sampler = dgl.dataloading.MultiLayerNeighborSampler([-1] * 2)
and
my_sampler = dgl.dataloading.MultiLayerNeighborSampler([None] * 2)
both work.

Could you then try `MultiLayerFullNeighborSampler`? It should be very similar to your `TestSampler`, since it also subclasses `MultiLayerNeighborSampler`. If `MultiLayerFullNeighborSampler` works, then I think you can copy its code and try again with your own algorithm.
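Also, if your `TestSampler` was defined in a notebook cell or interactive session, a common Windows-side fix for “worker exited unexpectedly” is to move the class into its own importable `.py` file and import it from there. A stand-alone sketch of that pattern (stdlib only; the file name `my_samplers.py` and the simplified sampler are hypothetical stand-ins, since DGL itself is not needed to show the mechanics):

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module file: the custom sampler class lives at the top
# level of an importable .py file, so spawned worker processes can
# re-import it when they unpickle the sampler.
module_src = (
    "class TestSampler:  # stand-in; in real code, subclass the DGL sampler here\n"
    "    def __init__(self, n_layers):\n"
    "        self.fanouts = [None] * n_layers\n"
)

tmpdir = tempfile.mkdtemp()
Path(tmpdir, 'my_samplers.py').write_text(module_src)
sys.path.insert(0, tmpdir)

my_samplers = importlib.import_module('my_samplers')

sampler = my_samplers.TestSampler(2)
clone = pickle.loads(pickle.dumps(sampler))  # round-trips: the class is importable
print(clone.fanouts)  # [None, None]
```

In a real project you would simply keep `my_samplers.py` next to your training script and write `from my_samplers import TestSampler`; on Windows, also keep the dataloader loop under an `if __name__ == '__main__':` guard.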

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.