I have trouble with multi-GPU training of GraphSAGE (the DGL example). I follow the graphsage-unsupervised example and replace the Reddit dataset with my custom dataset, which is constructed with `dgl.convert.from_scipy`; before getting into the `mp.Process` calls, I use a neural network to construct the graph and initialize its node features. The CUDA re-initialization error below occurs:
```
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
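I set the start method roughly like this (a minimal sketch of the change, not my exact code):

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    # must be called once in the main module, before any worker process is started
    mp.set_start_method("spawn")
```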
After I set the start method to 'spawn', another error occurs, shown below:
```
  File "/home/xxx/examples/base_train_kmeans.py", line 568, in attr_graph
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
_pickle.PicklingError: Can't pickle <function run at 0x7f7970a157b8>: it's not the same object as __main__.run
```
What I have tried so far:
- converting the GPU tensors to CPU/NumPy arrays before starting the processes (doesn't work; a rough sketch of this attempt follows)
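The attempt looked roughly like this (the graph and variable names here are placeholders, not my exact code):

```python
import scipy.sparse as sp
import torch
import dgl

# placeholder graph standing in for my custom dataset built via from_scipy
g = dgl.from_scipy(sp.rand(100, 100, density=0.05, format='coo'))

# features produced on the GPU (in my case, by the neural network)
feats = torch.randn(g.num_nodes(), 16, device='cuda')

# move them to CPU/NumPy before creating the worker processes
g.ndata['features'] = torch.from_numpy(feats.detach().cpu().numpy())
```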
I have the following questions:
- Should I take into account the CUDA resources that the neural network occupies? I hit the CUDA re-initialization error when `model = model.to(device)` was executed.
- The second suggestion of @BarclayII is to construct the graph inside the `run` function rather than passing the graph as an input argument to `run` (see the first sketch after this list). Will this change affect the results of the `run` function in multi-processing mode?
- I am confused about the start methods of torch multiprocessing. The torch official post suggests using 'spawn' or 'forkserver', but the dgl-graphsage example uses the default 'fork' mode; my launch code, which follows that example, is sketched in the second snippet after this list. How do I use these start methods correctly, and what is the main difference between them? Any suggestions are welcome.
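As I understand @BarclayII's suggestion, the change would look roughly like this (a sketch; `load_scipy_matrix` is a hypothetical stand-in for my actual loading code):

```python
import dgl

def run(rank, world_size, data_path):
    # build the graph inside each worker, so the graph object itself
    # never has to be pickled and shipped to the subprocess
    spmat = load_scipy_matrix(data_path)  # hypothetical loader for my data
    g = dgl.from_scipy(spmat)
    # ... rest of the training loop, as in the graphsage-unsupervised example
```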
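And for context on the last question, this is roughly how I currently launch the workers, following the fork-based DGL example (simplified):

```python
import torch.multiprocessing as mp

def run(rank, world_size):
    print(f"worker {rank}/{world_size}")  # stand-in for the real training loop

if __name__ == "__main__":
    n_gpus = 4  # placeholder GPU count
    procs = []
    for rank in range(n_gpus):
        p = mp.Process(target=run, args=(rank, n_gpus))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
```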
Thanks in advance!