Hi,
I am also seeing the same error on every rank:
[rank4]: _frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('torch.nested._internal.nested_tensor') at 139970463404416
I have repartitioned the graph with the latest DGL and am also running the latest node_classification.py. I am using torch 2.3.1+cu121 with DGL built from source; my system CUDA is 12.2.
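For reference, this is roughly how I repartitioned the graph (a simplified sketch; the output path and mask setup are placeholders, not my exact script):

```python
# Simplified sketch of the repartitioning step (not my exact script):
# 2 METIS partitions of ogbn-products, one per machine.
import dgl
from ogb.nodeproppred import DglNodePropPredDataset

data = DglNodePropPredDataset(name="ogbn-products")
g, labels = data[0]
g.ndata["labels"] = labels[:, 0]
# train/val/test masks are set from the OGB split before partitioning (omitted here)

dgl.distributed.partition_graph(
    g,
    graph_name="ogbn-products",
    num_parts=2,
    out_path="ogbn-products-2parts",  # placeholder output directory
    part_method="metis",
)
```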
Here’s the full log:
Rank of nid005393: 1
Rank of nid005445: 4
Rank of nid005445: 5
Rank of nid005393: 0
Rank of nid005393: 3
Rank of nid005393: 2
Rank of nid005445: 7
Rank of nid005445: 6
part 6, train: 24577 (local: 24577), val: 4915 (local: 4915), test: 276636 (local: 276636)
part 7, train: 24576 (local: 24576), val: 4915 (local: 4915), test: 276636 (local: 276636)
part 5, train: 24577 (local: 24577), val: 4915 (local: 4915), test: 276636 (local: 276636)
part 4, train: 24577 (local: 24577), val: 4916 (local: 4916), test: 276637 (local: 276637)
part 3, train: 24577 (local: 23499), val: 4915 (local: 4495), test: 276636 (local: 271897)
part 0, train: 24577 (local: 23479), val: 4916 (local: 4497), test: 276637 (local: 271999)
part 2, train: 24577 (local: 23527), val: 4915 (local: 4534), test: 276636 (local: 271866)
part 1, train: 24577 (local: 23508), val: 4916 (local: 4527), test: 276637 (local: 272027)
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Number of classes: 47
Client[0] in group[0] is exiting...
Client[7] in group[0] is exiting...
Client[14] in group[0] is exiting...
Client[9] in group[0] is exiting...
Client[21] in group[0] is exiting...
Client[31] in group[0] is exiting...
Client[25] in group[0] is exiting...
Client[34] in group[0] is exiting...
Arguments: Namespace(graph_name='ogbn-products', ip_config='/global/cfs/cdirs/m4626/Distributed_DGL/dgl_ex/experiments/logs/ogbn-products/dgl_cuda121/rpc_baseline/logs_perlmutter_cpu_gloo/sage/ip_config/ip_config_baseline_sage_ogbn-products_metis_n2_samp4_trainer4_28795115.txt', part_config=None, n_classes=0, backend='gloo', num_gpus=0, num_epochs=100, num_hidden=16, num_layers=2, fan_out='10,25', batch_size=2000, batch_size_eval=100000, log_every=20, eval_every=5, lr=0.003, dropout=0.5, local_rank=None, pad_data=False, use_graphbolt=False)
nid005393: Initializing DistDGL.
Initialize the distributed services with graphbolt: False
load ogbn-products
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats: ['csc']
start graph service on server 0 for part 0
Server is waiting for connections on [10.249.9.42:30050]...
Server (0) shutdown.
Server is exiting...
Arguments: Namespace(graph_name='ogbn-products', ip_config='/global/cfs/cdirs/m4626/Distributed_DGL/dgl_ex/experiments/logs/ogbn-products/dgl_cuda121/rpc_baseline/logs_perlmutter_cpu_gloo/sage/ip_config/ip_config_baseline_sage_ogbn-products_metis_n2_samp4_trainer4_28795115.txt', part_config=None, n_classes=0, backend='gloo', num_gpus=0, num_epochs=100, num_hidden=16, num_layers=2, fan_out='10,25', batch_size=2000, batch_size_eval=100000, log_every=20, eval_every=5, lr=0.003, dropout=0.5, local_rank=None, pad_data=False, use_graphbolt=False)
nid005445: Initializing DistDGL.
Initialize the distributed services with graphbolt: False
load ogbn-products
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats: ['csc']
start graph service on server 1 for part 1
Server is waiting for connections on [10.249.9.132:30050]...
Server (1) shutdown.
Server is exiting...
Client [319393] waits on 10.249.9.132:39013
Machine (1) group (0) client (4) connect to server successfuly!
Client[4] in group[0] is exiting...
warnings.warn(
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
[rank6]: Traceback (most recent call last):
[rank6]: File "/global/u1/s/sark777/Distributed_DGL/src/prefetch/baseline/node_classification.py", line 485, in <module>
[rank6]: main(args)
[rank6]: File "/global/u1/s/sark777/Distributed_DGL/src/prefetch/baseline/node_classification.py", line 331, in main
[rank6]: sample_time, eval_time, data_copy, absolute_total_time) = run(args, device, data)
[rank6]: File "/global/u1/s/sark777/Distributed_DGL/src/prefetch/baseline/node_classification.py", line 92, in run
[rank6]: model = th.nn.parallel.DistributedDataParallel(model)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 873, in __init__
[rank6]: optimize_ddp = torch._dynamo.config._get_optimize_ddp_mode()
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank6]: return importlib.import_module(f".{name}", __name__)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]: return _bootstrap._gcd_import(name[level:], package, level)
[rank6]: File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank6]: File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank6]: File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank6]: File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank6]: File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank6]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank6]: torch.manual_seed = disable(torch.manual_seed)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank6]: return DisableContext()(fn)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank6]: (filename is None or trace_rules.check(fn))
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank6]: return check_verbose(obj, is_inlined_call).skipped
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank6]: rule = torch._dynamo.trace_rules.lookup_inner(
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3442, in lookup_inner
[rank6]: Traceback (most recent call last):
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
[rank6]: obj = _ForkingPickler.dumps(obj)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
[rank6]: cls(buf, protocol).dump(obj)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 295, in reduce_tensor
[rank6]: from torch.nested._internal.nested_tensor import NestedTensor
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 416, in <module>
[rank6]: _nt_view_dummy = NestedTensor(
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/nested/_internal/nested_tensor.py", line 112, in __init__
[rank6]: torch._dynamo.mark_dynamic(self, self._ragged_idx)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/__init__.py", line 2003, in __getattr__
[rank6]: return importlib.import_module(f".{name}", __name__)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]: return _bootstrap._gcd_import(name[level:], package, level)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/__init__.py", line 64, in <module>
[rank6]: torch.manual_seed = disable(torch.manual_seed)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/decorators.py", line 50, in disable
[rank6]: return DisableContext()(fn)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 410, in __call__
[rank6]: (filename is None or trace_rules.check(fn))
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3378, in check
[rank6]: return check_verbose(obj, is_inlined_call).skipped
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 3361, in check_verbose
[rank6]: rule = torch._dynamo.trace_rules.lookup_inner(
[rank6]: AttributeError: partially initialized module 'torch._dynamo' has no attribute 'trace_rules' (most likely due to a circular import)
[rank6]: rule = get_torch_obj_rule_map().get(obj, None)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2782, in get_torch_obj_rule_map
[rank6]: obj = load_object(k)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2811, in load_object
[rank6]: val = _load_obj_from_str(x[0])
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py", line 2795, in _load_obj_from_str
[rank6]: return getattr(importlib.import_module(module), obj_name)
[rank6]: File "/global/homes/s/sark777/.conda/envs/dgl-dev-gpu-121/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank6]: return _bootstrap._gcd_import(name[level:], package, level)
[rank6]: File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank6]: File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
[rank6]: File "<frozen importlib._bootstrap>", line 171, in __enter__
[rank6]: File "<frozen importlib._bootstrap>", line 116, in acquire
[rank6]: _frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('torch.nested._internal.nested_tensor') at 140481820107792
I am seeing exactly the same error in the GPU version, just without the NVML warning. I would really appreciate some help. I am running with num_samplers = 4 and num_servers = 1.
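The only workaround I have thought of so far (untested, just a guess that the deadlock comes from threads lazily importing torch._dynamo inside the sampler subprocesses) is to force these imports up front at the top of node_classification.py, before any workers are started:

```python
# Untested workaround idea: eagerly import the modules whose lazy import
# appears to deadlock in the sampler subprocesses, so they are already in
# sys.modules before DGL spawns/forks any workers.
import torch
import torch._dynamo  # noqa: F401  # fully initialize torch._dynamo up front
import torch.nested._internal.nested_tensor  # noqa: F401  # module the pickler imports lazily
```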
Thanks!