Training stuck with multi-GPU on MAG dataset

Hi everyone,
I’m trying to train on the MAG dataset with multiple GPUs, following the example here: dgl/train_multi_gpus.py at master · dmlc/dgl · GitHub. My code always gets stuck at mp.spawn on line 285 and never enters the train function. Has anyone encountered a similar problem, and if so, how did you solve it? :blush:

Thanks.

Could you share the call stacks from when it gets stuck? Is train() never called? Could you also share the command you’re using?
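
In case it helps with collecting those call stacks: one possible way to get them from a hung Python process is the standard-library faulthandler module. This is only a minimal sketch; the helper name enable_hang_diagnostics and the 60-second default are my own assumptions, not part of the DGL example.

```python
import faulthandler
import sys


def enable_hang_diagnostics(timeout_s=60.0, file=sys.stderr):
    """Hypothetical helper: dump the stacks of all threads to `file`
    every `timeout_s` seconds for as long as the process keeps running."""
    # Also print a traceback on fatal signals such as SIGSEGV.
    faulthandler.enable(file=file)
    # A watchdog thread writes the Python stacks after the timeout,
    # and keeps repeating so a long hang is captured more than once.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
```

Calling it near the top of the script should show where the parent process is waiting. Since mp.spawn starts fresh interpreter processes, it would probably also need to be called at the start of the worker function to see the workers' stacks. faulthandler.cancel_dump_traceback_later() turns the periodic dump off again.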

@Rhett-Ying When trying to reproduce @test's issue, I got this error on the 0.9.x branch.
Command: python train_multi_gpus.py

# python train_multi_gpus.py
Loading graph
Traceback (most recent call last):
  File "train_multi_gpus.py", line 276, in <module>
    (g,), _ = dgl.load_graphs(args.graph_path)
  File "/opt/conda/lib/python3.8/site-packages/dgl-0.9.0-py3.8-linux-x86_64.egg/dgl/data/graph_serialize.py", line 174, in load_graphs
    check_local_file_exists(filename)
  File "/opt/conda/lib/python3.8/site-packages/dgl-0.9.0-py3.8-linux-x86_64.egg/dgl/data/graph_serialize.py", line 36, in check_local_file_exists
    raise DGLError("File {} does not exist.".format(filename))
dgl._ffi.base.DGLError: File ./graph.dgl does not exist.

Did you run the pre-processing before training? Follow this: dgl/examples/pytorch/ogb_lsc/MAG240M at master · dmlc/dgl · GitHub
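
To make a missing pre-processing step easier to spot, a small check could be placed before the dgl.load_graphs(args.graph_path) call shown in the traceback. This is just a sketch; require_graph_file is a hypothetical helper, not something the DGL example provides.

```python
import os


def require_graph_file(graph_path):
    """Hypothetical helper: fail early, with a hint, if the
    pre-processed graph file is not present at `graph_path`."""
    if not os.path.isfile(graph_path):
        raise FileNotFoundError(
            f"Graph file {graph_path!r} not found. Run the dataset "
            "pre-processing step first, or pass the correct path."
        )
    return graph_path
```

It would be used as require_graph_file(args.graph_path) right before loading the graph, so the script stops with a clear message before any worker processes are spawned.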

No, I did not. I will retry after running the pre-processing.
