DistDGL v2.4 built from source: training error

I followed the official guide to build DGL from source for distributed training on GPUs. Previously, this worked fine with DGL v1.2. Now I am trying the latest DGL v2.4 source code on Ubuntu 22.04.

Numpy Version:   1.26.4
PyTorch Version: 2.3.1+cu118

I had to downgrade NumPy to the latest 1.x release because DGL raises a lot of errors with NumPy 2.x. The source code compiles successfully. When I launch distributed training, DGL initialization and graph loading succeed, but after the "Number of classes" output I get the following errors:

......
Number of classes: 47
[rank3]: Traceback (most recent call last):
[rank3]:   File "dgl/examples/distributed/graphsage/node_classification.py", line 467, in <module>
[rank3]:     main(args)
[rank3]:   File "dgl/examples/distributed/graphsage/node_classification.py", line 408, in main
[rank3]:     epoch_time, test_acc = run(args, device, data)
[rank3]:   File "dgl/examples/distributed/graphsage/node_classification.py", line 235, in run
[rank3]:     model = th.nn.parallel.DistributedDataParallel(
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 873, in __init__
[rank3]:     optimize_ddp = torch._dynamo.config._get_optimize_ddp_mode()
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank3]:     return importlib.import_module(f".{name}", __name__)
[rank3]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank3]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank3]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank3]:   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank3]:   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank3]:   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank3]:   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank3]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank3]:     torch.manual_seed = disable(torch.manual_seed)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank3]:     return DisableContext()(fn)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank3]:     (filename is None or trace_rules.check(fn))
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank3]: Traceback (most recent call last):
[rank3]:   File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
[rank3]:     obj = _ForkingPickler.dumps(obj)
[rank3]:   File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
[rank3]:     cls(buf, protocol).dump(obj)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 295, in reduce_tensor
[rank3]:     from torch.nested._internal.nested_tensor import NestedTensor
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 416, in <module>
[rank3]:     _nt_view_dummy = NestedTensor(
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 112, in __init__
[rank3]:     torch._dynamo.mark_dynamic(self, self._ragged_idx)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank3]:     return importlib.import_module(f".{name}", __name__)
[rank3]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank3]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank3]:     torch.manual_seed = disable(torch.manual_seed)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank3]:     return DisableContext()(fn)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank3]:     (filename is None or trace_rules.check(fn))
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank3]:     return check_verbose(obj, is_inlined_call).skipped
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank3]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank3]: AttributeError: partially initialized module 'torch._dynamo' has no attribute 'trace_rules' (most likely due to a circular import)
[rank3]:     return check_verbose(obj, is_inlined_call).skipped
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank3]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3442, in lookup_inner
[rank3]:     rule = get_torch_obj_rule_map().get(obj, None)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2782, in get_torch_obj_rule_map
[rank3]:     obj = load_object(k)
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2811, in load_object
[rank3]:     val = _load_obj_from_str(x[0])
[rank3]:   File "[PYTHON PATH]/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2795, in _load_obj_from_str
[rank3]:     return getattr(importlib.import_module(module), obj_name)
[rank3]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank3]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank3]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank3]:   File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
[rank3]:   File "<frozen importlib._bootstrap>", line 171, in __enter__
[rank3]:   File "<frozen importlib._bootstrap>", line 116, in acquire
[rank3]: _frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('torch.nested._internal.nested_tensor') at 140178908033184

I replaced the long python environment path with [PYTHON PATH] for readability.

With PyTorch 2.3.1, I get the following error:

AttributeError: module 'torch.library' has no attribute 'register_fake'

But I am not using GraphBolt (yet) in distributed training, so I applied a dirty fix: I commented out the offending lines in GraphBolt to get rid of that error. After that, training starts but fails with the errors above.

Distributed training still works fine for DGL v1.2 from source.

Also, I didn’t partition the graph using DGL v2.4; I reused the partitions I had created with DGL v1.2. I don’t think partitioning is the issue, is it?

I get the same errors with the latest prebuilt DGL (not built from source):

pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html

@Rhett-Ying and community, any help will be highly appreciated.

Could you also open an issue in the repository? The register_fake error is a bug that should be addressed for sure.

I will open a PR to fix the register_fake issue. It looks like we mistakenly assumed register_fake was introduced in 2.3.1, while it was actually introduced in 2.4.0a0.

The fix is here: [GraphBolt] Fix issue on torch 2.3.1 by mfbalin · Pull Request #7521 · dmlc/dgl · GitHub
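For readers hitting the same AttributeError on an older build, a version-independent guard can be sketched as follows. This is only an illustration of feature detection and is not necessarily what the PR does. The helper name `pick_fake_registrar` is hypothetical, and the fallback to `impl_abstract` assumes that this is the older name for the same mechanism in torch 2.2/2.3.

```python
from types import SimpleNamespace


def pick_fake_registrar(library):
    """Return (name, callable) for the first fake-kernel registrar found.

    `library` is expected to be `torch.library`. The newer name is tried
    first; `impl_abstract` is assumed to be the pre-2.4 equivalent.
    """
    for name in ("register_fake", "impl_abstract"):
        registrar = getattr(library, name, None)
        if registrar is not None:
            return name, registrar
    return None, None


# Stand-ins for torch.library so the sketch runs without torch installed.
# With a real install one would call: pick_fake_registrar(torch.library)
newer_library = SimpleNamespace(register_fake=print, impl_abstract=print)
older_library = SimpleNamespace(impl_abstract=print)

print(pick_fake_registrar(newer_library)[0])
print(pick_fake_registrar(older_library)[0])
```

Checking for the attribute rather than comparing version strings avoids the off-by-one-release mistake described above.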

@mfbalin Thank you for addressing the register_fake issue. Could you also help me with the other errors I am getting in the distributed training? Whether I use the latest prebuilt DGL or compile the latest repo and build from source, I get the same errors in distributed training.

@Rhett-Ying maintains DistDGL while I maintain GraphBolt. We will need his input on this issue. I don’t know what the solution might look like at all.

@mfbalin Thank you.

@Rhett-Ying your assistance will be highly appreciated.

@pubu I’ve merged the fix @mfbalin filed for the register_fake error. Here are my suggestions:

  1. Build DGL from the latest master branch yourself. torch 2.3.1 + cu118 should be OK, but keep numpy < 2.0.
  2. Partition your graph with the newly built DGL, as DGL 1.2 is too old and we’ve made several changes to DistDGL.
  3. Are you running dgl/examples/distributed/graphsage/node_classification.py without any changes? If you still hit an issue, please share the error log and the command as well, and I will look into it.
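A quick way to confirm the numpy constraint from step 1 before launching a job is sketched below. The helper names (`release_tuple`, `numpy_is_supported`, `installed_version`) are made up for illustration. The only rule encoded is the "numpy < 2.0" limit mentioned above; no torch bound is checked.

```python
import re
from importlib import metadata


def release_tuple(version):
    """Turn '2.3.1+cu118' into (2, 3), ignoring any local build suffix."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:2])


def numpy_is_supported(numpy_version):
    """True when the major version is below 2, as recommended above."""
    return release_tuple(numpy_version)[0] < 2


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Versions reported at the top of this thread.
print(numpy_is_supported("1.26.4"), release_tuple("2.3.1+cu118"))
```

On a real machine, `installed_version("numpy")` and `installed_version("torch")` can feed these checks; both return None when the package is not installed.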

Hi,

I am also seeing the same error for each rank - [rank4]: _frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('torch.nested._internal.nested_tensor') at 139970463404416. I have repartitioned the graph with the latest dgl and am also running the latest node_classification.py. I am using torch 2.3.1+cu121 and built dgl from source. My system cuda is 12.2.

Here’s the full log:

Rank of nid005393: 1
Rank of nid005445: 4
Rank of nid005445: 5
Rank of nid005393: 0
Rank of nid005393: 3
Rank of nid005393: 2
Rank of nid005445: 7
Rank of nid005445: 6
part 6, train: 24577 (local: 24577), val: 4915 (local: 4915), test: 276636 (local: 276636)
part 7, train: 24576 (local: 24576), val: 4915 (local: 4915), test: 276636 (local: 276636)
part 5, train: 24577 (local: 24577), val: 4915 (local: 4915), test: 276636 (local: 276636)
part 4, train: 24577 (local: 24577), val: 4916 (local: 4916), test: 276637 (local: 276637)
part 3, train: 24577 (local: 23499), val: 4915 (local: 4495), test: 276636 (local: 271897)
part 0, train: 24577 (local: 23479), val: 4916 (local: 4497), test: 276637 (local: 271999)
part 2, train: 24577 (local: 23527), val: 4915 (local: 4534), test: 276636 (local: 271866)
part 1, train: 24577 (local: 23508), val: 4916 (local: 4527), test: 276637 (local: 272027)
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Client[0] in group[0] is exiting...
Client[7] in group[0] is exiting...
Client[14] in group[0] is exiting...
Client[9] in group[0] is exiting...
Client[21] in group[0] is exiting...
Client[31] in group[0] is exiting...
Client[25] in group[0] is exiting...
Client[34] in group[0] is exiting...
Arguments: Namespace(graph_name='ogbn-products', ip_config='/global/cfs/cdirs/m4626/Distributed_DGL/dgl_ex/experiments/logs/ogbn-products/dgl_cuda121/rpc_baseline/logs_perlmutter_cpu_gloo/sage/ip_config/ip_config_baseline_sage_ogbn-products_metis_n2_samp4_trainer4_28795115.txt', part_config=None, n_classes=0, backend='gloo', num_gpus=0, num_epochs=100, num_hidden=16, num_layers=2, fan_out='10,25', batch_size=2000, batch_size_eval=100000, log_every=20, eval_every=5, lr=0.003, dropout=0.5, local_rank=None, pad_data=False, use_graphbolt=False)
nid005393: Initializing DistDGL.
Initialize the distributed services with graphbolt: False
load ogbn-products
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats: ['csc']
start graph service on server 0 for part 0
Server is waiting for connections on [10.249.9.42:30050]...
Server (0) shutdown.
Server is exiting...
Arguments: Namespace(graph_name='ogbn-products', ip_config='/global/cfs/cdirs/m4626/Distributed_DGL/dgl_ex/experiments/logs/ogbn-products/dgl_cuda121/rpc_baseline/logs_perlmutter_cpu_gloo/sage/ip_config/ip_config_baseline_sage_ogbn-products_metis_n2_samp4_trainer4_28795115.txt', part_config=None, n_classes=0, backend='gloo', num_gpus=0, num_epochs=100, num_hidden=16, num_layers=2, fan_out='10,25', batch_size=2000, batch_size_eval=100000, log_every=20, eval_every=5, lr=0.003, dropout=0.5, local_rank=None, pad_data=False, use_graphbolt=False)
nid005445: Initializing DistDGL.
Initialize the distributed services with graphbolt: False
load ogbn-products
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats: ['csc']
start graph service on server 1 for part 1
Server is waiting for connections on [10.249.9.132:30050]...
Server (1) shutdown.
Server is exiting...
Client [319393] waits on 10.249.9.132:39013
Machine (1) group (0) client (4) connect to server successfuly!
Client[4] in group[0] is exiting...
  warnings.warn(
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
[rank6]: Traceback (most recent call last):
[rank6]:   File "/global/u1/s/sark777/Distributed_DGL/src/prefetch/baseline/node_classification.py", line 485, in <module>
[rank6]:     main(args)
[rank6]:   File "/global/u1/s/sark777/Distributed_DGL/src/prefetch/baseline/node_classification.py", line 331, in main
[rank6]:     sample_time, eval_time, data_copy, absolute_total_time) = run(args, device, data)
[rank6]:   File "/global/u1/s/sark777/Distributed_DGL/src/prefetch/baseline/node_classification.py", line 92, in run
[rank6]:     model = th.nn.parallel.DistributedDataParallel(model)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 873, in __init__
[rank6]:     optimize_ddp = torch._dynamo.config._get_optimize_ddp_mode()
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank6]:     return importlib.import_module(f".{name}", __name__)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank6]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank6]:   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank6]:   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank6]:   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank6]:   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank6]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank6]:     torch.manual_seed = disable(torch.manual_seed)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank6]:     return DisableContext()(fn)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank6]:     (filename is None or trace_rules.check(fn))
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank6]:     return check_verbose(obj, is_inlined_call).skipped
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank6]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3442, in lookup_inner
[rank6]: Traceback (most recent call last):
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
[rank6]:     obj = _ForkingPickler.dumps(obj)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
[rank6]:     cls(buf, protocol).dump(obj)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 295, in reduce_tensor
[rank6]:     from torch.nested._internal.nested_tensor import NestedTensor
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 416, in <module>
[rank6]:     _nt_view_dummy = NestedTensor(
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 112, in __init__
[rank6]:     torch._dynamo.mark_dynamic(self, self._ragged_idx)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank6]:     return importlib.import_module(f".{name}", __name__)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank6]:     torch.manual_seed = disable(torch.manual_seed)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank6]:     return DisableContext()(fn)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank6]:     (filename is None or trace_rules.check(fn))
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank6]:     return check_verbose(obj, is_inlined_call).skipped
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank6]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank6]: AttributeError: partially initialized module 'torch._dynamo' has no attribute 'trace_rules' (most likely due to a circular import)
[rank6]: Traceback (most recent call last):
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
[rank6]:     obj = _ForkingPickler.dumps(obj)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
[rank6]:     cls(buf, protocol).dump(obj)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 295, in reduce_tensor
[rank6]:     from torch.nested._internal.nested_tensor import NestedTensor
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 416, in <module>
[rank6]:     _nt_view_dummy = NestedTensor(
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 112, in __init__
[rank6]:     torch._dynamo.mark_dynamic(self, self._ragged_idx)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank6]:     return importlib.import_module(f".{name}", __name__)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank6]:     torch.manual_seed = disable(torch.manual_seed)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank6]:     return DisableContext()(fn)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank6]:     (filename is None or trace_rules.check(fn))
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank6]:     return check_verbose(obj, is_inlined_call).skipped
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank6]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank6]: AttributeError: partially initialized module 'torch._dynamo' has no attribute 'trace_rules' (most likely due to a circular import)
[rank6]: Traceback (most recent call last):
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
[rank6]:     obj = _ForkingPickler.dumps(obj)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
[rank6]:     cls(buf, protocol).dump(obj)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 295, in reduce_tensor
[rank6]:     from torch.nested._internal.nested_tensor import NestedTensor
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 416, in <module>
[rank6]:     _nt_view_dummy = NestedTensor(
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 112, in __init__
[rank6]:     torch._dynamo.mark_dynamic(self, self._ragged_idx)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank6]:     return importlib.import_module(f".{name}", __name__)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank6]:     torch.manual_seed = disable(torch.manual_seed)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank6]:     return DisableContext()(fn)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank6]:     (filename is None or trace_rules.check(fn))
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank6]:     return check_verbose(obj, is_inlined_call).skipped
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank6]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank6]: AttributeError: partially initialized module 'torch._dynamo' has no attribute 'trace_rules' (most likely due to a circular import)
[rank6]: Traceback (most recent call last):
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
[rank6]:     obj = _ForkingPickler.dumps(obj)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
[rank6]:     cls(buf, protocol).dump(obj)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 295, in reduce_tensor
[rank6]:     from torch.nested._internal.nested_tensor import NestedTensor
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 416, in <module>
[rank6]:     _nt_view_dummy = NestedTensor(
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 112, in __init__
[rank6]:     torch._dynamo.mark_dynamic(self, self._ragged_idx)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank6]:     return importlib.import_module(f".{name}", __name__)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank6]:     torch.manual_seed = disable(torch.manual_seed)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank6]:     return DisableContext()(fn)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank6]:     (filename is None or trace_rules.check(fn))
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank6]:     return check_verbose(obj, is_inlined_call).skipped
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank6]:     rule = torch._dynamo.trace_rules.lookup_inner(
[rank6]: AttributeError: partially initialized module 'torch._dynamo' has no attribute 'trace_rules' (most likely due to a circular import)
[rank6]:     rule = get_torch_obj_rule_map().get(obj, None)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2782, in get_torch_obj_rule_map
[rank6]:     obj = load_object(k)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2811, in load_object
[rank6]:     val = _load_obj_from_str(x[0])
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2795, in _load_obj_from_str
[rank6]:     return getattr(importlib.import_module(module), obj_name)
[rank6]:   File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank6]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank6]:   File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
[rank6]:   File "<frozen importlib._bootstrap>", line 171, in __enter__
[rank6]:   File "<frozen importlib._bootstrap>", line 116, in acquire
[rank6]: _frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('torch.nested._internal.nested_tensor') at 140481820107792

I see exactly the same error in the GPU version, just without the NVML warning. I would really appreciate some help. I am running with num_samplers = 4 and num_servers = 1.

Thanks!

  1. Try installing CUDA 12.1 to exactly match DGL and torch.
  2. Try with num_samplers=0.

I tried using CUDA 12.1, and the configuration with num_samplers=0 worked correctly. However, when I set num_samplers=2, it failed again with the same error. Are multiple sampler processes not working in the latest version?
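One plausible reading of the tracebacks earlier in the thread: the main thread starts a lazy import of torch._dynamo inside the DistributedDataParallel constructor, while a multiprocessing queue feeder thread (present when num_samplers > 0) imports torch.nested._internal.nested_tensor, which needs torch._dynamo as well, so the two imports wait on each other. A workaround sometimes tried for this kind of lazy-import race is to import the modules eagerly in the main thread before any sampler or DDP code runs. The sketch below is untested against DistDGL and is not a confirmed fix. The `preimport` helper is hypothetical, and the module names are copied from the traceback.

```python
import importlib

# Module names taken from the traceback. Whether importing them early avoids
# the deadlock on a given install is an assumption, not a confirmed fix.
LAZY_MODULES = ("torch._dynamo", "torch.nested._internal.nested_tensor")


def preimport(names=LAZY_MODULES):
    """Import each module in the calling thread and report which ones loaded."""
    loaded = {}
    for name in names:
        try:
            importlib.import_module(name)
            loaded[name] = True
        except ImportError:
            # Covers ModuleNotFoundError too, e.g. when torch is not installed.
            loaded[name] = False
    return loaded


# Intended use: call preimport() at the very top of the training script,
# before dgl.distributed.initialize() or any DataLoader is created.
```

If the error disappears with this in place, that would support the import-race interpretation; if not, the detailed version information is still needed to reproduce the problem.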

num_samplers>0 is supposed to work well.

Could you share the exact DGL/PyTorch/CUDA versions you’re using, and the training command as well? Let me try to reproduce it on my side.

Thanks for your help! I’m using torch v2.3.1 and CUDA v12.1 with Python v3.10.14. I am running this example on 2 machines. The output of conda list is also attached below. DGL v2.4 was built from commit.

$PYTHON_PATH $PROJ_PATH/launch.py \
    --workspace $PROJ_PATH \
    --num_trainers 4 \
    --num_samplers 2 \
    --num_servers 1 \
    --part_config $PARTITION_DIR \
    --ip_config  $IP_CONFIG_FILE \
    --num_omp_threads 16 \
    "$PYTHON_PATH node_classification.py --graph_name $DATASET_NAME \
    --ip_config $IP_CONFIG_FILE --num_epochs 100 --batch_size 2000"

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
aiohappyeyeballs          2.3.5                    pypi_0    pypi
aiohttp                   3.10.3                   pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
alabaster                 0.7.16                   pypi_0    pypi
annotated-types           0.7.0                    pypi_0    pypi
anyio                     4.4.0                    pypi_0    pypi
argon2-cffi               23.1.0                   pypi_0    pypi
argon2-cffi-bindings      21.2.0                   pypi_0    pypi
arrow                     1.3.0                    pypi_0    pypi
astroid                   3.2.4                    pypi_0    pypi
asttokens                 2.4.1                    pypi_0    pypi
async-lru                 2.0.4                    pypi_0    pypi
async-timeout             4.0.3                    pypi_0    pypi
atk-1.0                   2.38.0               h04ea711_2    conda-forge
attrs                     24.2.0                   pypi_0    pypi
babel                     2.16.0                   pypi_0    pypi
beautifulsoup4            4.12.3                   pypi_0    pypi
black                     24.8.0                   pypi_0    pypi
bleach                    6.1.0                    pypi_0    pypi
boto3                     1.34.161                 pypi_0    pypi
botocore                  1.34.161                 pypi_0    pypi
bzip2                     1.0.8                h4bc722e_7    conda-forge
ca-certificates           2024.7.4             hbcca054_0    conda-forge
cairo                     1.18.0               hebfffa5_3    conda-forge
certifi                   2024.7.4                 pypi_0    pypi
cffi                      1.17.0                   pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
clang-format              18.1.8                   pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
cmake                     3.30.2                   pypi_0    pypi
comm                      0.2.2                    pypi_0    pypi
contourpy                 1.2.1                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
cython                    3.0.11                   pypi_0    pypi
debugpy                   1.8.5                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
defusedxml                0.7.1                    pypi_0    pypi
dgl                       2.4                      pypi_0    pypi
dill                      0.3.8                    pypi_0    pypi
docutils                  0.20.1                   pypi_0    pypi
exceptiongroup            1.2.2                    pypi_0    pypi
executing                 2.0.1                    pypi_0    pypi
expat                     2.6.2                h59595ed_0    conda-forge
expecttest                0.2.1                    pypi_0    pypi
fastjsonschema            2.20.0                   pypi_0    pypi
filelock                  3.15.4                   pypi_0    pypi
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
font-ttf-inconsolata      3.000                h77eed37_0    conda-forge
font-ttf-source-code-pro  2.038                h77eed37_0    conda-forge
font-ttf-ubuntu           0.83                 h77eed37_2    conda-forge
fontconfig                2.14.2               h14ed4e7_0    conda-forge
fonts-conda-ecosystem     1                             0    conda-forge
fonts-conda-forge         1                             0    conda-forge
fonttools                 4.53.1                   pypi_0    pypi
fqdn                      1.5.1                    pypi_0    pypi
freetype                  2.12.1               h267a509_2    conda-forge
fribidi                   1.0.10               h36c2ea0_0    conda-forge
frozenlist                1.4.1                    pypi_0    pypi
fsspec                    2024.6.1                 pypi_0    pypi
gdk-pixbuf                2.42.12              hb9ae30d_0    conda-forge
graphite2                 1.3.13            h59595ed_1003    conda-forge
graphviz                  12.0.0               hba01fac_0    conda-forge
gtk2                      2.24.33              h6470451_5    conda-forge
gts                       0.7.6                h977cf35_4    conda-forge
h11                       0.14.0                   pypi_0    pypi
harfbuzz                  9.0.0                hda332d3_1    conda-forge
httpcore                  1.0.5                    pypi_0    pypi
httpx                     0.27.0                   pypi_0    pypi
icu                       75.1                 he02047a_0    conda-forge
idna                      3.7                      pypi_0    pypi
imagesize                 1.4.1                    pypi_0    pypi
iniconfig                 2.0.0                    pypi_0    pypi
ipykernel                 6.29.5                   pypi_0    pypi
ipython                   8.26.0                   pypi_0    pypi
ipywidgets                8.1.3                    pypi_0    pypi
isodate                   0.6.1                    pypi_0    pypi
isoduration               20.11.0                  pypi_0    pypi
isort                     5.13.2                   pypi_0    pypi
jedi                      0.19.1                   pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
jmespath                  1.0.1                    pypi_0    pypi
joblib                    1.4.2                    pypi_0    pypi
json5                     0.9.25                   pypi_0    pypi
jsonpointer               3.0.0                    pypi_0    pypi
jsonschema                4.23.0                   pypi_0    pypi
jsonschema-specifications 2023.12.1                pypi_0    pypi
jupyter-client            8.6.2                    pypi_0    pypi
jupyter-core              5.7.2                    pypi_0    pypi
jupyter-events            0.10.0                   pypi_0    pypi
jupyter-http-over-ws      0.0.8                    pypi_0    pypi
jupyter-lsp               2.2.5                    pypi_0    pypi
jupyter-server            2.14.2                   pypi_0    pypi
jupyter-server-terminals  0.5.3                    pypi_0    pypi
jupyterlab                4.2.4                    pypi_0    pypi
jupyterlab-pygments       0.3.0                    pypi_0    pypi
jupyterlab-server         2.27.3                   pypi_0    pypi
jupyterlab-widgets        3.0.11                   pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libcst                    1.4.0                    pypi_0    pypi
libdeflate                1.21                 h4bc722e_0    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 14.1.0               h77fa898_0    conda-forge
libgd                     2.3.3               hd3e95f3_10    conda-forge
libglib                   2.80.3               h315aac3_2    conda-forge
libgomp                   14.1.0               h77fa898_0    conda-forge
libiconv                  1.17                 hd590300_2    conda-forge
libjpeg-turbo             3.0.0                hd590300_1    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libpng                    1.6.43               h2797004_0    conda-forge
librsvg                   2.58.2               h9564881_1    conda-forge
libsqlite                 3.46.0               hde9e2c9_0    conda-forge
libstdcxx-ng              14.1.0               hc0a3c3a_0    conda-forge
libtiff                   4.6.0                h46a8edc_4    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libwebp-base              1.4.0                hd590300_0    conda-forge
libxcb                    1.16                 hd590300_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libxml2                   2.12.7               he7c6b58_4    conda-forge
libzlib                   1.3.1                h4ab18f5_1    conda-forge
lightning-utilities       0.11.6                   pypi_0    pypi
lintrunner                0.12.5                   pypi_0    pypi
littleutils               0.2.4                    pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.9.2                    pypi_0    pypi
matplotlib-inline         0.1.7                    pypi_0    pypi
mccabe                    0.7.0                    pypi_0    pypi
mistune                   3.0.2                    pypi_0    pypi
moreorless                0.4.0                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
multidict                 6.0.5                    pypi_0    pypi
mypy-extensions           1.0.0                    pypi_0    pypi
nbclient                  0.10.0                   pypi_0    pypi
nbconvert                 7.16.4                   pypi_0    pypi
nbformat                  5.10.4                   pypi_0    pypi
nbsphinx                  0.9.5                    pypi_0    pypi
nbsphinx-link             1.3.0                    pypi_0    pypi
ncurses                   6.5                  h59595ed_0    conda-forge
nest-asyncio              1.6.0                    pypi_0    pypi
networkx                  3.3                      pypi_0    pypi
nltk                      3.8.1                    pypi_0    pypi
nose                      1.3.7                    pypi_0    pypi
notebook                  7.2.1                    pypi_0    pypi
notebook-shim             0.2.4                    pypi_0    pypi
numpy                     2.0.1                    pypi_0    pypi
nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12         8.9.2.26                 pypi_0    pypi
nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.6.20                  pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
ogb                       1.3.6                    pypi_0    pypi
openssl                   3.3.1                h4bc722e_2    conda-forge
outdated                  0.2.2                    pypi_0    pypi
overrides                 7.7.0                    pypi_0    pypi
packaging                 24.1                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
pandoc                    3.3                  ha770c72_0    conda-forge
pandocfilters             1.5.1                    pypi_0    pypi
pango                     1.54.0               h4c5309f_1    conda-forge
parso                     0.8.4                    pypi_0    pypi
pathspec                  0.12.1                   pypi_0    pypi
pcre2                     10.44                hba22ea6_2    conda-forge
pexpect                   4.9.0                    pypi_0    pypi
pillow                    10.4.0                   pypi_0    pypi
pip                       24.2               pyhd8ed1ab_0    conda-forge
pixman                    0.43.2               h59595ed_0    conda-forge
platformdirs              4.2.2                    pypi_0    pypi
pluggy                    1.5.0                    pypi_0    pypi
prometheus-client         0.20.0                   pypi_0    pypi
prompt-toolkit            3.0.47                   pypi_0    pypi
psutil                    6.0.0                    pypi_0    pypi
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.3                    pypi_0    pypi
pyarrow                   17.0.0                   pypi_0    pypi
pycparser                 2.22                     pypi_0    pypi
pydantic                  2.8.2                    pypi_0    pypi
pydantic-core             2.20.1                   pypi_0    pypi
pygments                  2.18.0                   pypi_0    pypi
pygraphviz                1.13            py310h0ca91bb_2    conda-forge
pylint                    3.2.6                    pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
pytest                    8.3.2                    pypi_0    pypi
python                    3.10.14         hd12c33a_0_cpython    conda-forge
python-dateutil           2.9.0.post0              pypi_0    pypi
python-json-logger        2.0.7                    pypi_0    pypi
python_abi                3.10                    4_cp310    conda-forge
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.2                    pypi_0    pypi
pyzmq                     26.1.0                   pypi_0    pypi
rdflib                    7.0.0                    pypi_0    pypi
readline                  8.2                  h8228510_1    conda-forge
referencing               0.35.1                   pypi_0    pypi
regex                     2024.7.24                pypi_0    pypi
requests                  2.32.3                   pypi_0    pypi
rfc3339-validator         0.1.4                    pypi_0    pypi
rfc3986-validator         0.1.1                    pypi_0    pypi
rpds-py                   0.20.0                   pypi_0    pypi
s3transfer                0.10.2                   pypi_0    pypi
scikit-learn              1.5.1                    pypi_0    pypi
scipy                     1.14.0                   pypi_0    pypi
seaborn                   0.13.2                   pypi_0    pypi
send2trash                1.8.3                    pypi_0    pypi
setuptools                72.1.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0                   pypi_0    pypi
sniffio                   1.3.1                    pypi_0    pypi
snowballstemmer           2.2.0                    pypi_0    pypi
soupsieve                 2.6                      pypi_0    pypi
sphinx                    7.4.7                    pypi_0    pypi
sphinx-copybutton         0.5.2                    pypi_0    pypi
sphinx-gallery            0.17.1                   pypi_0    pypi
sphinx-rtd-theme          2.0.0                    pypi_0    pypi
sphinxcontrib-applehelp   2.0.0                    pypi_0    pypi
sphinxcontrib-devhelp     2.0.0                    pypi_0    pypi
sphinxcontrib-htmlhelp    2.1.0                    pypi_0    pypi
sphinxcontrib-jquery      4.1                      pypi_0    pypi
sphinxcontrib-jsmath      1.0.1                    pypi_0    pypi
sphinxcontrib-qthelp      2.0.0                    pypi_0    pypi
sphinxcontrib-serializinghtml 2.0.0                    pypi_0    pypi
sphinxemoji               0.3.1                    pypi_0    pypi
stack-data                0.6.3                    pypi_0    pypi
stdlibs                   2024.5.15                pypi_0    pypi
sympy                     1.13.2                   pypi_0    pypi
terminado                 0.18.1                   pypi_0    pypi
threadpoolctl             3.5.0                    pypi_0    pypi
tinycss2                  1.3.0                    pypi_0    pypi
tk                        8.6.13          noxft_h4845f30_101    conda-forge
toml                      0.10.2                   pypi_0    pypi
tomli                     2.0.1                    pypi_0    pypi
tomlkit                   0.13.2                   pypi_0    pypi
torch                     2.3.1+cu121              pypi_0    pypi
torch-geometric           2.5.3                    pypi_0    pypi
torchdata                 0.8.0                    pypi_0    pypi
torcheval                 0.0.7                    pypi_0    pypi
torchmetrics              1.4.1                    pypi_0    pypi
tornado                   6.4.1                    pypi_0    pypi
tqdm                      4.66.5                   pypi_0    pypi
trailrunner               1.4.0                    pypi_0    pypi
traitlets                 5.14.3                   pypi_0    pypi
triton                    2.3.1                    pypi_0    pypi
types-python-dateutil     2.9.0.20240316           pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
ufmt                      2.7.0                    pypi_0    pypi
uri-template              1.3.0                    pypi_0    pypi
urllib3                   2.2.2                    pypi_0    pypi
usort                     1.0.8.post1              pypi_0    pypi
wcwidth                   0.2.13                   pypi_0    pypi
webcolors                 24.8.0                   pypi_0    pypi
webencodings              0.5.1                    pypi_0    pypi
websocket-client          1.8.0                    pypi_0    pypi
wheel                     0.44.0             pyhd8ed1ab_0    conda-forge
widgetsnbextension        4.0.11                   pypi_0    pypi
xorg-kbproto              1.0.7             h7f98852_1002    conda-forge
xorg-libice               1.1.1                hd590300_0    conda-forge
xorg-libsm                1.2.4                h7391055_0    conda-forge
xorg-libx11               1.8.9                hb711507_1    conda-forge
xorg-libxau               1.0.11               hd590300_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xorg-libxext              1.3.4                h0b41bf4_2    conda-forge
xorg-libxrender           0.9.11               hd590300_0    conda-forge
xorg-renderproto          0.11.1            h7f98852_1002    conda-forge
xorg-xextproto            7.3.0             h0b41bf4_1003    conda-forge
xorg-xproto               7.0.31            h7f98852_1007    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
yarl                      1.9.4                    pypi_0    pypi
zlib                      1.3.1                h4ab18f5_1    conda-forge
zstd                      1.5.6                ha6fb4c9_0    conda-forge

Thanks for sharing. We’ll try to reproduce with ogbn-products.

@sark777 Hi, we’ve reproduced the issue. Here’s the issue tracker entry: DistDGL v2.4 Training Error when num_samplers>0 · Issue #7753 · dmlc/dgl · GitHub
