Error while using DataLoader and GPU

Hello,
For the last few days I have been struggling with an error that I cannot make sense of. It only occurs when running on the GPU; it does not appear on the CPU. I run the code in a conda environment. The error is as follows:

Traceback (most recent call last):
  File "main_COLLAB_edge_classification.py", line 585, in <module>
    main()
  File "main_COLLAB_edge_classification.py", line 580, in main
    train_val_pipeline(MODEL_NAME, dataset, params, net_params, dirs)
  File "main_COLLAB_edge_classification.py", line 315, in train_val_pipeline
    epoch_train_loss, optimizer, train_loader, val_loader, test_loader = train_epoch(model, optimizer, device, graph, train_edges, params['batch_size'], epoch, dataset, 4, monet_pseudo)
  File "E:\link-prediction-V2\benchmarking\train\train_COLLAB_drnl_edge_classification.py", line 62, in train_epoch_sparse
    for subgs, _ in train_loader:
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\dataloading\dataloader.py", line 512, in __next__
    self._next_non_threaded() if not self.use_thread else self._next_threaded()
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\dataloading\dataloader.py", line 507, in _next_threaded
    exception.reraise()
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\utils\exception.py", line 57, in reraise
    raise exception
dgl._ffi.base.DGLError: Caught DGLError in prefetcher.
Original Traceback (most recent call last):
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\dataloading\dataloader.py", line 380, in _prefetcher_entry
    batch, feats, stream_event = _prefetch(batch, dataloader, stream)
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\dataloading\dataloader.py", line 338, in _prefetch
    batch = recursive_apply(batch, _record_stream, current_stream)
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\utils\internal.py", line 1038, in recursive_apply
    return [recursive_apply(v, fn, *args, **kwargs) for v in data]
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\utils\internal.py", line 1038, in <listcomp>
    return [recursive_apply(v, fn, *args, **kwargs) for v in data]
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\utils\internal.py", line 1040, in recursive_apply
    return fn(data, *args, **kwargs)
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\dataloading\dataloader.py", line 307, in _record_stream
    x.record_stream(stream)
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\heterograph.py", line 5605, in record_stream
    self._graph.record_stream(stream)
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\heterograph_index.py", line 290, in record_stream
    return _CAPI_DGLHeteroRecordStream(self, to_dgl_stream_handle(stream))
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\_ffi\_ctypes\function.py", line 188, in __call__
    check_call(_LIB.DGLFuncCall(
  File "F:\Aga\conda\envs\bench\lib\site-packages\dgl\_ffi\base.py", line 65, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [12:52:54] C:\Users\Administrator\dgl-0.5\src\runtime\ndarray.cc:284: Check failed: td->IsAvailable(): RecordStream only works when TensorAdaptor is available.

Could you provide a minimal code snippet to reproduce the error?

Sure. The code is quite long, so I am pasting a link to it instead. There is only one difference: I changed the default value of “num_workers” (line 421) from 8 to 1 or 2 (I checked both).
seal_ogbl
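For illustration only, the change amounts to something along these lines (a hypothetical excerpt; the linked script may define the argument differently):

import argparse

parser = argparse.ArgumentParser()
# Only modification relative to the linked script: lower the default number of
# DataLoader workers from 8 to 1 (2 behaves the same for me).
parser.add_argument('--num_workers', type=int, default=1)
args = parser.parse_args()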
I should also mention that the error occurs with both CUDA 11.3 and CUDA 11.6 (I tested both). I am currently using the following CUDA-related packages:

cudatoolkit               11.3.1              h280eb24_10    conda-forge
cudnn                     8.1.0.77             h3e0f4f4_0    conda-forge
dgl-cuda11.3              0.9.1                    py38_0    dglteam
pytorch                   1.12.1          py3.8_cuda11.3_cudnn8_0    pytorch
torch                     1.13.1                   pypi_0    pypi

and also TensorFlow 2.4.0, which is compatible with CUDA 11.3.

Hi @szauniema. Does the code fail at the very first step of training, or only after a few steps? I ran the code for a few training steps, and changing “num_workers” to 1 didn’t affect it.

Besides, could you try reinstalling DGL? According to your error message, TensorAdaptor is not working properly.
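If reinstalling does not help, you could also try ruling out the stream-recording path that appears in your traceback. This is only a sketch: it assumes your loader is ultimately a dgl.dataloading.DataLoader (or forwards keyword arguments to one) and that your DGL 0.9.x build exposes the use_alternate_streams flag; the graph, sampler, and IDs below are toy stand-ins for whatever the script actually builds.

import torch
import dgl

# Toy stand-ins; only the last keyword argument matters here.
graph = dgl.rand_graph(100, 500)
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
train_nids = torch.arange(100)

train_loader = dgl.dataloading.DataLoader(
    graph,
    train_nids,
    sampler,
    device=torch.device('cuda'),
    batch_size=32,
    shuffle=True,
    num_workers=1,
    use_alternate_streams=False,  # skips record_stream(), which requires TensorAdaptor
)

If the error disappears with use_alternate_streams=False, that would suggest the TensorAdaptor path is the culprit rather than your training code.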

Separately, it seems that you have both PyTorch 1.12.1 (from the pytorch conda channel) and PyTorch 1.13.1 (from pip) installed. Could you try uninstalling one of them and see which PyTorch version causes the problem? A mismatched PyTorch build could also explain why TensorAdaptor fails to load.
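Before uninstalling anything, a quick check of which build actually wins at import time (nothing here is specific to your script):

import torch
import dgl

print('torch', torch.__version__, 'imported from', torch.__file__)
print('built with CUDA', torch.version.cuda, '| CUDA available:', torch.cuda.is_available())
print('dgl', dgl.__version__, 'imported from', dgl.__file__)

Whichever paths and versions this prints are the ones DGL sees at runtime, regardless of what conda or pip list.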
