Hi Rhett, thanks for suggesting the pipeline. I tried it and got a timeout error when dispatching data. The client machine gave:

RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 60000ms for recv operation to complete
Exception in thread Thread-2:

and the server machines gave:

RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.138.0.4]:51850
Exception in thread Thread-1:

I guess the timeout error on the client machine is causing the connection-closed error on the server side. Do you mind having a look at it? Thank you!
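In case it is relevant: the 60000 ms in the client error is the Gloo recv timeout, and as far as I understand it can be raised via the `timeout` argument of `torch.distributed.init_process_group`. A rough sketch of what I mean (the helper name and values are mine; I have not verified this fixes the pipeline):

```python
from datetime import timedelta

import torch.distributed as dist

def init_with_longer_timeout(rank, world_size, master_addr, master_port,
                             timeout=timedelta(minutes=30)):
    """Sketch only: init the Gloo process group with a longer collective-op
    timeout, so dist.scatter in gloo_wrapper.py gets more than the 60 s
    seen in the traceback. `init_with_longer_timeout` is a made-up name."""
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{master_addr}:{master_port}",
        rank=rank,
        world_size=world_size,
        timeout=timeout,  # applies to collectives such as scatter/all_to_all
    )
```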
Here is the complete error message:
/usr/bin/python3 /home/luyc/workspace/dist_part/tools/distgraphlaunch.py --num_proc_per_machine 1 --ip_config ip_config.txt --master_port 12345 "/usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned "
(export DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 RANK=0 MASTER_ADDR=10.138.0.2 MASTER_PORT=12345; /usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned )
(export DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 RANK=1 MASTER_ADDR=10.138.0.2 MASTER_PORT=12345; /usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned )
cleanupu process runs
[ddgltest INFO 2022-08-26 19:00:49,954 PID:6958] Added key: store_based_barrier_key:1 to store for rank: 0
[ddistdgltest INFO 2022-08-26 19:00:49,954 PID:12907] Added key: store_based_barrier_key:1 to store for rank: 1
[ddgltest INFO 2022-08-26 19:00:49,954 PID:6958] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[ddgltest INFO 2022-08-26 19:00:49,954 PID:6958] [Rank: 0] Done with process group initialization...
[ddistdgltest INFO 2022-08-26 19:00:49,954 PID:12907] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[ddgltest INFO 2022-08-26 19:00:49,954 PID:6958] [Rank: 0] Starting distributed data processing pipeline...
[ddistdgltest INFO 2022-08-26 19:00:49,955 PID:12907] [Rank: 1] Done with process group initialization...
[ddistdgltest INFO 2022-08-26 19:00:49,955 PID:12907] [Rank: 1] Starting distributed data processing pipeline...
[ddistdgltest INFO 2022-08-26 19:00:51,020 PID:12907] [Rank: 1] Initialized metis partitions and node_types map...
[ddistdgltest INFO 2022-08-26 19:00:51,021 PID:12907] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/feat-1.npy
[ddgltest INFO 2022-08-26 19:00:51,060 PID:6958] [Rank: 0] Initialized metis partitions and node_types map...
[ddgltest INFO 2022-08-26 19:00:51,060 PID:6958] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/feat-0.npy
[ddistdgltest INFO 2022-08-26 19:01:02,821 PID:12907] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/year-1.npy
[ddistdgltest INFO 2022-08-26 19:01:03,011 PID:12907] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/label-1.npy
[ddistdgltest INFO 2022-08-26 19:01:03,105 PID:12907] [Rank: 1] node feature name: _N/feat, feature data shape: torch.Size([55529978, 128])
[ddistdgltest INFO 2022-08-26 19:01:03,106 PID:12907] [Rank: 1] node feature name: _N/year, feature data shape: torch.Size([55529978, 1])
[ddistdgltest INFO 2022-08-26 19:01:03,106 PID:12907] [Rank: 1] node feature name: _N/label, feature data shape: torch.Size([55529978, 1])
[ddistdgltest INFO 2022-08-26 19:01:03,106 PID:12907] Reading csv files from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/edge_index/_N:_E:_N1.txt
[ddistdgltest INFO 2022-08-26 19:01:41,842 PID:12907] [Rank: 1] Done reading edge_file: 4, (807842936,)
[ddistdgltest INFO 2022-08-26 19:01:41,887 PID:12907] [Rank: 1] Done reading dataset deom /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data
[ddgltest INFO 2022-08-26 19:01:44,331 PID:6958] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/year-0.npy
[ddgltest INFO 2022-08-26 19:01:44,730 PID:6958] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/label-0.npy
[ddgltest INFO 2022-08-26 19:01:44,843 PID:6958] [Rank: 0] node feature name: _N/feat, feature data shape: torch.Size([55529978, 128])
[ddgltest INFO 2022-08-26 19:01:44,843 PID:6958] [Rank: 0] node feature name: _N/year, feature data shape: torch.Size([55529978, 1])
[ddgltest INFO 2022-08-26 19:01:44,843 PID:6958] [Rank: 0] node feature name: _N/label, feature data shape: torch.Size([55529978, 1])
[ddgltest INFO 2022-08-26 19:01:44,843 PID:6958] Reading csv files from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/edge_index/_N:_E:_N0.txt
[ddgltest INFO 2022-08-26 19:02:32,985 PID:6958] [Rank: 0] Done reading edge_file: 4, (807842936,)
[ddgltest INFO 2022-08-26 19:02:33,032 PID:6958] [Rank: 0] Done reading dataset deom /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data
[Rank: 1 ] Reading file: /home/luyc/workspace/dataset/ogbn_papers100M/2parts/_N.txt
Traceback (most recent call last):
File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py", line 52, in <module>
multi_machine_run(params)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 690, in multi_machine_run
gen_dist_partitions(rank, params.world_size, params)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 544, in gen_dist_partitions
read_dataset(rank, world_size, id_lookup, params, schema_map)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 394, in read_dataset
edge_data = augment_edge_data(edge_data, id_lookup, edge_tids, rank, world_size)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/utils.py", line 243, in augment_edge_data
edge_data[constants.OWNER_PROCESS] = lookup_service.get_partition_ids(edge_data[constants.GLOBAL_DST_ID])
File "/home/luyc/workspace/dist_part/tools/distpartitioning/dist_lookup.py", line 200, in get_partition_ids
owner_resp_list = alltoallv_cpu(self.rank, self.world_size, out_list)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/gloo_wrapper.py", line 122, in alltoallv_cpu
__alltoall_cpu(rank, world_size, recv_counts, send_counts)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/gloo_wrapper.py", line 60, in __alltoall_cpu
dist.scatter(output_tensor_list[i], input_tensor_list if i == rank else [], src=i)
File "/home/luyc/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2359, in scatter
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 60000ms for recv operation to complete
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/luyc/workspace/dist_part/tools/distgraphlaunch.py", line 111, in run
subprocess.check_call(ssh_cmd, shell=True)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 10.138.0.4 '(export DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 RANK=1 MASTER_ADDR=10.138.0.2 MASTER_PORT=12345; /usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned )'' returned non-zero exit status 1.
[Rank: 0 ] Reading file: /home/luyc/workspace/dataset/ogbn_papers100M/2parts/_N.txt
Traceback (most recent call last):
File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py", line 52, in <module>
multi_machine_run(params)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 690, in multi_machine_run
gen_dist_partitions(rank, params.world_size, params)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 544, in gen_dist_partitions
read_dataset(rank, world_size, id_lookup, params, schema_map)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 394, in read_dataset
edge_data = augment_edge_data(edge_data, id_lookup, edge_tids, rank, world_size)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/utils.py", line 243, in augment_edge_data
edge_data[constants.OWNER_PROCESS] = lookup_service.get_partition_ids(edge_data[constants.GLOBAL_DST_ID])
File "/home/luyc/workspace/dist_part/tools/distpartitioning/dist_lookup.py", line 200, in get_partition_ids
owner_resp_list = alltoallv_cpu(self.rank, self.world_size, out_list)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/gloo_wrapper.py", line 122, in alltoallv_cpu
__alltoall_cpu(rank, world_size, recv_counts, send_counts)
File "/home/luyc/workspace/dist_part/tools/distpartitioning/gloo_wrapper.py", line 60, in __alltoall_cpu
dist.scatter(output_tensor_list[i], input_tensor_list if i == rank else [], src=i)
File "/home/luyc/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2359, in scatter
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.138.0.4]:51850
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/luyc/workspace/dist_part/tools/distgraphlaunch.py", line 111, in run
subprocess.check_call(ssh_cmd, shell=True)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 10.138.0.2 '(export DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 RANK=0 MASTER_ADDR=10.138.0.2 MASTER_PORT=12345; /usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned )'' returned non-zero exit status 1.
By the way, for the previous step, random_partition.py probably needs to be in the same directory as utils.py to avoid "cannot import name 'setdir' from 'utils'", and the arguments specified in the tutorial are not consistent with the arguments the script actually accepts.
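For what it's worth, instead of moving the file, putting the directory that contains the right utils.py at the front of sys.path before the import might also work. Just a sketch under my assumptions (the example path is hypothetical), not something I've verified against the repo:

```python
import os
import sys

def prepend_to_path(directory):
    """Put `directory` first on sys.path so its utils.py is the module
    resolved by `from utils import setdir` (sketch; path is hypothetical)."""
    directory = os.path.abspath(directory)
    if directory in sys.path:
        sys.path.remove(directory)
    sys.path.insert(0, directory)
    return sys.path[0]

# e.g. prepend_to_path("/home/luyc/workspace/dist_part/tools/distpartitioning")
```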