Issue with Running DGL Using Disk-Based Feature GraphBolt

Yuhang · November 14, 2024, 9:03pm

I am trying to run the disk-based feature GraphBolt example in examples/graphbolt/disk_based_feature/node_classification.py. I executed the command $ python3 examples/graphbolt/disk_based_feature/node_classification.py --cpu-cache-size-in-gigabytes 1, but it fails with an AssertionError. Part of the log output is shown below:

  File "/home/yuhangs/miniconda3/envs/dgl-dev-gpu-124/lib/python3.12/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 180, in wrap_generator
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/home/yuhangs/miniconda3/envs/dgl-dev-gpu-124/lib/python3.12/site-packages/torch/utils/data/datapipes/iter/callable.py", line 126, in __iter__
    yield self._apply_fn(data)
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/yuhangs/miniconda3/envs/dgl-dev-gpu-124/lib/python3.12/site-packages/torch/utils/data/datapipes/iter/callable.py", line 91, in _apply_fn
    return self.fn(data)
           ^^^^^^^^^^^^^
  File "/home/yuhangs/miniconda3/envs/dgl-dev-gpu-124/lib/python3.12/site-packages/dgl-2.5-py3.12-linux-x86_64.egg/dgl/graphbolt/minibatch_transformer.py", line 38, in _transformer
    minibatch = self.transformer(minibatch)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yuhangs/miniconda3/envs/dgl-dev-gpu-124/lib/python3.12/site-packages/dgl-2.5-py3.12-linux-x86_64.egg/dgl/graphbolt/feature_fetcher.py", line 147, in _execute_stage
    value = next(handle)
            ^^^^^^^^^^^^
  File "/home/yuhangs/miniconda3/envs/dgl-dev-gpu-124/lib/python3.12/site-packages/dgl-2.5-py3.12-linux-x86_64.egg/dgl/graphbolt/impl/cpu_cached_feature.py", line 149, in read_async
    missing_values_future = next(fallback_reader, None)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yuhangs/miniconda3/envs/dgl-dev-gpu-124/lib/python3.12/site-packages/dgl-2.5-py3.12-linux-x86_64.egg/dgl/graphbolt/impl/torch_based_feature_store.py", line 451, in read_async
    assert torch.ops.graphbolt.detect_io_uring()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 
This exception is thrown by __iter__ of MiniBatchTransformer(datapipe=Bufferer, transformer=functools.partial(<function FeatureFetcher._execute_stage at 0x7f78260a87c0>, 3))

I am using Ubuntu 20.04, with Python 3.12, PyTorch 2.4, CUDA 12.4, and the latest version of DGL installed. I also tried building and installing DGL from source, but the same problem persists.

Any idea what might be causing this issue? Could it be related to the system-wide liburing I installed? I updated system-wide liburing to match the version that DGL uses, but the problem is still there.

mfbalin · November 17, 2024, 4:39am

You need to have a fairly recent Linux kernel. I think around version 5.6 and above should work.
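A quick way to compare the running kernel against that threshold is sketched below. The 5.6 cutoff is only the approximate figure mentioned above, not a documented GraphBolt requirement, and the helper names `parse_kernel_release` and `kernel_meets_minimum` are made up for illustration.

```python
import platform
import re

# Assumption: 5.6 is the approximate minimum mentioned in this thread,
# not an officially documented requirement.
MIN_KERNEL = (5, 6)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.4.0-198-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release string: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_meets_minimum(release=None, minimum=MIN_KERNEL):
    """Return True if the given (or currently running) kernel is at least `minimum`."""
    if release is None:
        # On Linux this is the same string that `uname -r` prints.
        release = platform.release()
    return parse_kernel_release(release) >= minimum
```

On a Linux machine, calling `kernel_meets_minimum()` with no arguments checks the running kernel. Passing a release string explicitly makes it easy to test a value copied from `uname -r` on another host.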

Yuhang · November 18, 2024, 2:05am

Thank you for your reply. I checked my Linux kernel version, and it is 5.4.0-198-generic. However, I can run other programs that require liburing, so it is probably not a kernel version problem.

mfbalin · November 18, 2024, 3:40am

It is possible that the detection mechanism is checking features available only on newer kernels, so I wouldn’t eliminate this as a cause so quickly.

(I wrote that code, and I may not have written the check in the most general way.)

Update your kernel and try again? Or try disabling that particular assertion and see if it works.

Yuhang · November 20, 2024, 9:16pm

Thank you.

I figured out why the assertion failed. The line at dgl/graphbolt/src/io_uring.cc at 88f109f17338d7905d6f5618f0b2b3afc689fd54 · dmlc/dgl · GitHub calls io_uring_get_probe(), which returns nullptr on kernels below 5.6. (See: Using `io_uring_get_probe` give nullptr back on kernel 5.5? · Issue #526 · axboe/liburing · GitHub)
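For anyone who wants to reproduce the probe result outside of DGL, here is a rough sketch that calls `io_uring_get_probe()` through `ctypes`. It assumes a system-wide liburing shared library is installed and discoverable. It is not how GraphBolt performs the check, and the function name `probe_io_uring` is made up for illustration.

```python
import ctypes
import ctypes.util


def probe_io_uring():
    """Call liburing's io_uring_get_probe() directly.

    Returns True if a probe was obtained, False if the call returned NULL
    (as reported above for kernels below 5.6), and None if liburing or the
    symbol could not be found on this system.
    """
    path = ctypes.util.find_library("uring")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        get_probe = lib.io_uring_get_probe
    except (OSError, AttributeError):
        return None

    # struct io_uring_probe *io_uring_get_probe(void);
    get_probe.restype = ctypes.c_void_p
    get_probe.argtypes = []

    probe = get_probe()
    if not probe:
        return False

    # Release the probe if this liburing version exports io_uring_free_probe().
    free_probe = getattr(lib, "io_uring_free_probe", None)
    if free_probe is not None:
        free_probe.restype = None
        free_probe.argtypes = [ctypes.c_void_p]
        free_probe(probe)
    return True
```

A `False` result here would match the failing assertion, while `None` only means this sketch could not locate the library.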

I also tried skipping this assertion, but it still does not work. I assume GraphBolt also uses other liburing features that are only available on newer kernels. So for now, it seems my only option is to upgrade the kernel.
