I have trouble with multi-GPU training of GraphSAGE (the DGL example). I follow the graphsage-unsupervised example and replace the Reddit dataset with my custom dataset, which is constructed with `dgl.convert.from_scipy`; before getting into the `mp.Process` calls, I use a neural network to construct the graph and initialize its node features. The CUDA re-initialization error below occurs:
```
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
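I set the start method roughly like this (a minimal sketch of the change, not my exact code):

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    # must be called once in the main module, before any worker process is started
    mp.set_start_method("spawn")
```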
After I set the start method to 'spawn', another error occurs, shown below:
```
  File "/home/xxx/examples/base_train_kmeans.py", line 568, in attr_graph
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
_pickle.PicklingError: Can't pickle <function run at 0x7f7970a157b8>: it's not the same object as __main__.run
```
What I have tried so far:
- converting the GPU tensors to CPU/NumPy arrays before starting the processes (doesn't work; a rough sketch of this attempt follows)
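The attempt looked roughly like this (the graph and variable names here are placeholders, not my exact code):

```python
import scipy.sparse as sp
import torch
import dgl

# placeholder graph standing in for my custom dataset built via from_scipy
g = dgl.from_scipy(sp.rand(100, 100, density=0.05, format='coo'))

# features produced on the GPU (in my case, by the neural network)
feats = torch.randn(g.num_nodes(), 16, device='cuda')

# move them to CPU/NumPy before creating the worker processes
g.ndata['features'] = torch.from_numpy(feats.detach().cpu().numpy())
```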
I have the following questions:
- Should I take into account the CUDA resources that the neural network occupies? I hit the CUDA re-initialization error when `model = model.to(device)` was executed.
- The second suggestion of @BarclayII is to construct the graph inside the `run` function rather than passing the graph as an input argument to `run` (see the first sketch after this list). Will this change affect the results of the `run` function in multi-processing mode?
- I am confused about the start methods of torch multiprocessing. The torch official post suggests using 'spawn' or 'forkserver', but the dgl-graphsage example uses the default 'fork' mode; my launch code, which follows that example, is sketched in the second snippet after this list. How do I use these start methods correctly, and what is the main difference between them? Any suggestions are welcome.
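As I understand @BarclayII's suggestion, the change would look roughly like this (a sketch; `load_scipy_matrix` is a hypothetical stand-in for my actual loading code):

```python
import dgl

def run(rank, world_size, data_path):
    # build the graph inside each worker, so the graph object itself
    # never has to be pickled and shipped to the subprocess
    spmat = load_scipy_matrix(data_path)  # hypothetical loader for my data
    g = dgl.from_scipy(spmat)
    # ... rest of the training loop, as in the graphsage-unsupervised example
```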
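And for context on the last question, this is roughly how I currently launch the workers, following the fork-based DGL example (simplified):

```python
import torch.multiprocessing as mp

def run(rank, world_size):
    print(f"worker {rank}/{world_size}")  # stand-in for the real training loop

if __name__ == "__main__":
    n_gpus = 4  # placeholder GPU count
    procs = []
    for rank in range(n_gpus):
        p = mp.Process(target=run, args=(rank, n_gpus))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
```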
Thanks in advance!