Hello.
For some reason, I am unable to run ogbn-papers100M with distributed GraphSAGE. I have tried on 2
different supercomputing clusters. Here is the error I get on one of them:
Traceback (most recent call last):
  File "train_dist.py", line 331, in <module>
    main(args)
  File "train_dist.py", line 294, in main
    run(args, device, data)
  File "train_dist.py", line 259, in run
    g.ndata['labels'], val_nid, test_nid, args.batch_size_eval, device)
  File "train_dist.py", line 155, in evaluate
    pred = model.inference(g, inputs, batch_size, device)
  File "train_dist.py", line 117, in inference
    for blocks in tqdm.tqdm(dataloader):
  File "/jet/home/lhoang/.local/lib/python3.6/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/jet/home/lhoang/.local/lib/python3.6/site-packages/dgl/distributed/dist_dataloader.py", line 163, in __next__
    result = self.queue.get(timeout=1800)
  File "<string>", line 2, in get
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 757, in _callmethod
    kind, result = conn.recv()
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 254, in recv
    buf = self._recv_bytes()
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 411, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 387, in _recv
    raise EOFError
EOFError
The error I get on the other machine is different. In this case, I copied the partitions I made on the cluster above to the other machine, because the other machine does not have enough memory to run METIS partitioning on the graph. I updated the data paths and related settings as necessary after the copy:
Traceback (most recent call last):
  File "train_dist.py", line 337, in <module>
    main(args)
  File "train_dist.py", line 273, in main
    dgl.distributed.initialize(args.ip_config, args.num_servers, num_workers=args.num_workers)
  File "/home1/03372/lhoang/.local/lib/python3.6/site-packages/dgl/distributed/dist_context.py", line 100, in initialize
    os.environ.get('DGL_CONF_PATH'))
  File "/home1/03372/lhoang/.local/lib/python3.6/site-packages/dgl/distributed/dist_graph.py", line 263, in __init__
    self.gpb, graph_name = load_partition_book(part_config, self.part_id)
  File "/home1/03372/lhoang/.local/lib/python3.6/site-packages/dgl/distributed/partition.py", line 103, in load_partition_book
    node_map = part_metadata['node_map'] if is_range_part else np.load(part_metadata['node_map'])
  File "/home1/03372/lhoang/.local/lib/python3.6/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
TypeError: expected str, bytes or os.PathLike object, not dict
As far as I can tell, both errors happen while the partitions are being read. This is a 4-host partition.
Both systems are running the same DistDGL Python code. The only difference I can think of is that the machine where the dataset was downloaded and partitioned has a more up-to-date version of OGB, but that shouldn't matter, since OGB isn't called at this point in execution (the failure occurs during DistDGL file loading).
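One check that might help narrow down the second error: the traceback shows that `part_metadata['node_map']` is a dict at the point where `np.load` expects a file path. The sketch below simply reports how `node_map` and `edge_map` are stored in the partition config. It assumes the partition config is a JSON file; the function name `describe_partition_maps`, the `edge_map` key, and the example path are illustrative and not taken from the traceback.

```python
import json


def describe_partition_maps(part_config_path):
    """Report the Python type of 'node_map' and 'edge_map' in a partition config.

    Assumption: the partition config is a JSON file. Only the 'node_map' key
    appears in the traceback; 'edge_map' is checked here as a guess.
    """
    with open(part_config_path) as f:
        part_metadata = json.load(f)

    report = {}
    for key in ("node_map", "edge_map"):
        # A missing key is reported as 'NoneType'.
        report[key] = type(part_metadata.get(key)).__name__
    return report


# Example usage (hypothetical path, replace with the real config file):
# print(describe_partition_maps("data/ogb-paper100M.json"))
```

If this reports `dict` for `node_map` on the machine that raises the `TypeError`, one possible explanation (not confirmed here) is that the partitions were written by a different DGL version than the one loading them, so comparing `dgl.__version__` on both machines may be worthwhile.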
Any insight would be appreciated.
Thank you,
Loc Hoang