I’m customizing a neighborhood sampler with DGL library. However, the sampler failed to work when I set the parameter `num_workers’ to a non-zero number when initializing the NodeDataLoader. For instance:
my_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
my_dataloader = dgl.dataloading.NodeDataLoader(
g=G,
nids=[4],
block_sampler=my_sampler,
device='cpu',
batch_size=1,
shuffle=True,
drop_last=False,
num_workers=1
)
for step, (input_nodes, seeds, blocks) in enumerate(my_dataloader):
...
The code above works well, where I use the MultiLayerFullNeighborSampler. Notice that it works whether the parameter ‘num_workers’ is set to non-zero or not.
class TestSampler(dgl.dataloading.MultiLayerNeighborSampler):
def __init__(self, n_layers, return_eids=False):
super().__init__([None] * n_layers, return_eids=return_eids)
my_sampler = TestSampler(2)
my_dataloader = dgl.dataloading.NodeDataLoader(
g=G,
nids=[4],
block_sampler=my_sampler,
device='cpu',
batch_size=1,
shuffle=True,
drop_last=False,
num_workers=0 # here
)
for step, (input_nodes, seeds, blocks) in enumerate(my_dataloader):
...
The code above also works well. The implementation of the class TestSampler is completely copied from the implementation of MultiLayerFullNeighborSampler. However, when I set the parameter ‘num_workers’ to non-zero:
my_sampler = TestSampler(2)
#my_sampler = dgl.dataloading.MultiLayerNeighborSampler([None] * 2)
#my_sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
my_dataloader = dgl.dataloading.NodeDataLoader(
g=G,
nids=[4],
block_sampler=my_sampler,
device='cpu',
batch_size=1,
shuffle=True,
drop_last=False,
num_workers=1 # here
)
for step, (input_nodes, seeds, blocks) in enumerate(my_dataloader):
...
The code above failed when enumerating. The error message is about multiprocessing:
---------------------------------------------------------------------------
Empty Traceback (most recent call last)
~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_data(self, timeout)
989 try:
--> 990 data = self._data_queue.get(timeout=timeout)
991 return (True, data)
~\anaconda3\lib\multiprocessing\queues.py in get(self, block, timeout)
107 if not self._poll(timeout):
--> 108 raise Empty
109 elif not self._poll():
Empty:
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-15-6cf22217dc6e> in <module>
13 num_workers=1
14 )
---> 15 for step, (input_nodes, seeds, blocks) in enumerate(my_dataloader):
16 print('fuck')
17 print(input_nodes)
~\anaconda3\lib\site-packages\dgl\dataloading\pytorch\dataloader.py in __next__(self)
320 def __next__(self):
321 # input_nodes, output_nodes, blocks
--> 322 result_ = next(self.iter_)
323 _restore_blocks_storage(result_[-1], self.node_dataloader.collator.g)
324
~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
519 if self._sampler_iter is None:
520 self._reset()
--> 521 data = self._next_data()
522 self._num_yielded += 1
523 if self._dataset_kind == _DatasetKind.Iterable and \
~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
1184
1185 assert not self._shutdown and self._tasks_outstanding > 0
-> 1186 idx, data = self._get_data()
1187 self._tasks_outstanding -= 1
1188 if self._dataset_kind == _DatasetKind.Iterable:
~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _get_data(self)
1150 else:
1151 while True:
-> 1152 success, data = self._try_get_data()
1153 if success:
1154 return data
~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_data(self, timeout)
1001 if len(failed_workers) > 0:
1002 pids_str = ', '.join(str(w.pid) for w in failed_workers)
-> 1003 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
1004 if isinstance(e, queue.Empty):
1005 return (False, None)
RuntimeError: DataLoader worker (pid(s) 3032) exited unexpectedly
After searching information for the problem, I think it might be related to the multiprocessing on the Windows system. However, it can not explain why the MultiLayerFullNeighborSampler class can run well. I wonder how to fix it correctly.
Software Versions:
Python: Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] :: Ana
conda, Inc. on win32
Pytorch: 1.9.1 py3.8_cuda10.2_cudnn7_0
dgl-cuda10.2: 0.7.1