Issues with train_cv_multi_gpu.py in graphSage

Hi,
I am trying to run train_cv_multi_gpu.py in the pytorch/graphsage example, but I am getting the following error:

File "/home/abasak/miniconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 148, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I thought the thread_wrapped_func decorator was supposed to be a workaround for this issue, but I am still getting the error. Could you provide some suggestions on how to resolve this?

Hi,

Could you try changing L396
from
p = mp.Process(target=run, args=(proc_id, n_gpus, args, devices, data))

to

ctx = mp.get_context("spawn")
p = ctx.Process(target=run, args=(proc_id, n_gpus, args, devices, data))

and see whether this works?
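For context, here is a minimal self-contained sketch of that change. The run signature and the worker body are stand-ins, not the actual train_cv_multi_gpu.py code; the point is only where the "spawn" context comes in:

```python
import multiprocessing as mp

def run(proc_id, n_gpus, queue):
    # Stand-in for the real training worker: in train_cv_multi_gpu.py this
    # is where the model would be moved to cuda:proc_id and trained. Because
    # the process was started with "spawn", CUDA can be initialized here.
    queue.put((proc_id, n_gpus))

def main(n_gpus=2):
    # Use a "spawn" context instead of the default "fork" on Linux, so the
    # children do not inherit CUDA state already initialized in the parent.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = []
    for proc_id in range(n_gpus):
        p = ctx.Process(target=run, args=(proc_id, n_gpus, queue))
        p.start()
        procs.append(p)
    results = [queue.get() for _ in range(n_gpus)]
    for p in procs:
        p.join()
    return sorted(results)

if __name__ == "__main__":
    print(main())
```

Note that run must stay a module-level function for this to work, because "spawn" pickles the target before launching each child.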

Did you make any changes to the code? Were you creating something on GPU before starting the subprocesses?

Hi BarclayII,
I am getting the same error as abasak, since I added training code for another neural network (NN) before the GraphSage training. Specifically, I use the NN to extract features, which are then used to construct the graph for GraphSage.

Changing the multiprocessing start method from 'fork' to 'spawn' doesn't fix the problem; another error occurs, shown below.

Traceback (most recent call last):
  File "/home/xxx/examples/base_train_kmeans.py", line 624, in <module>
    main(args)
  File "/home/xxx/examples/base_train_kmeans.py", line 152, in main
    embs = attr_graph(dict_f, labels, cams)
  File "/home/xxx/examples/base_train_kmeans.py", line 568, in attr_graph
    p.start()
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/robot/anaconda3/envs/cycada/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function run at 0x7f7970a157b8>: it's not the same object as __main__.run

@BarclayII wrote:

Did you make any changes to the code? Were you creating something on GPU before starting the subprocesses?

Any suggestions about this issue? Thanks in advance!
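For what it's worth, the "it's not the same object as __main__.run" part of that error comes from how 'spawn' launches workers: the target function is pickled by its module-qualified name, so it must be a module-level function, and that name must still be bound to the very same object when the pickle happens. A minimal reproduction (thread_wrapped_func here is a hypothetical stand-in for a wrapper like DGL's decorator, not its actual code):

```python
import pickle

def run(proc_id):
    # A plain module-level function: "spawn" can pickle this by name.
    return proc_id

def thread_wrapped_func(fn):
    # Hypothetical stand-in for a wrapper such as DGL's thread_wrapped_func.
    def wrapped(*args, **kwargs):
        return fn(*args, **kwargs)
    wrapped.__name__ = fn.__name__
    wrapped.__qualname__ = fn.__qualname__
    return wrapped

# Wrapping run WITHOUT rebinding the module-level name: pickle looks up
# "run" in the module, finds the original function instead of the wrapper,
# and fails with "... it's not the same object as ...run".
wrapped_run = thread_wrapped_func(run)
try:
    pickle.dumps(wrapped_run)
except pickle.PicklingError as exc:
    print(exc)

# Applying the wrapper as a decorator (i.e. rebinding the name) keeps the
# module-level name and the object in sync, so pickling works again.
run = thread_wrapped_func(run)
assert pickle.loads(pickle.dumps(run)) is run
```

So if run was wrapped, redefined, or defined inside another function when switching to 'spawn', that alone can trigger this PicklingError.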

Hi VoVAllen, thanks for your post. I ran into the same error as @abasak. Following your suggestion, another error occurs, shown below.

(The traceback is the same PicklingError as in my previous post.)

Thanks for your time!

So you were using a neural network to construct the graph before getting into the mp.Process calls? Were you using CUDA there? If so, I'm afraid you need to either (1) construct the graph on CPU instead, or (2) move the graph-construction code into the run function (which essentially lets each GPU construct its own graph).
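Option (2) might look like the following sketch. Here build_graph is a placeholder (with DGL it would be roughly dgl.graph(...) plus g.ndata['feature'] = features, but those calls are not shown to keep the sketch self-contained); the point is that each worker receives only CPU objects and builds its own graph inside run:

```python
import multiprocessing as mp

def build_graph(edge_list, features):
    # Placeholder for the real construction; the important part is that it
    # executes inside the subprocess, not in the parent.
    return {"edges": list(edge_list), "features": list(features)}

def run(proc_id, edge_list, features, queue):
    # Each worker gets CPU-only inputs (lists / NumPy arrays / SciPy sparse
    # matrices) and builds its own graph, so the parent never touches CUDA
    # before the processes start.
    g = build_graph(edge_list, features)
    # ... here the real script would pick device cuda:proc_id and train ...
    queue.put((proc_id, len(g["edges"])))

def main(n_gpus=2):
    edge_list = [(0, 1), (1, 2)]   # stand-in for the SciPy edge matrix
    features = [0.1, 0.2, 0.3]     # stand-in for CPU node features
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=run, args=(i, edge_list, features, queue))
             for i in range(n_gpus)]
    for p in procs:
        p.start()
    out = dict(queue.get() for _ in range(n_gpus))
    for p in procs:
        p.join()
    return out

if __name__ == "__main__":
    print(main())
```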

Thank you for your prompt reply; it helps a lot.

  1. Yes, CUDA is used for training the NN, but the features have already been moved to CPU (their device attribute is cpu). I will try converting them to NumPy arrays.
     Should I also account for the CUDA resources the NN holds? I hit the CUDA re-initialization error when model = model.to(device) was executed.

  2. I follow the graphsage-unsupervised example and replace the Reddit dataset with my custom dataset, which is constructed by dgl.convert.from_scipy. I initialize the node features of the graph with graph.ndata['feature'] = features.
     Following your second suggestion, should I replace the data argument of the run function with the features (NumPy array / CPU torch.Tensor) and the edge matrix (SciPy sparse matrix)? Will this change affect the results of run in multiprocessing mode?

  3. I am confused about the start methods in torch multiprocessing. The official PyTorch documentation suggests using 'spawn' or 'forkserver', but the DGL GraphSAGE example uses the default 'fork' mode. How should these start methods be used, and what is the main difference between them? Any suggestions are welcome.

Thanks again for your time!

If CUDA is initialized before getting into mp.Process, the program will crash even if you convert the features to CPU afterward.

Yes.

Basically, fork uses the Unix forking mechanism to create new processes. This usually saves memory because the graph is shared directly with the children via copy-on-write. spawn starts a fresh process, and all data is passed to the subprocess via serialization; it is the default on Windows but does not save memory.
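The difference is directly observable with a small stdlib-only experiment (a self-contained sketch, nothing DGL-specific): mutate module-level state after import, then check whether a child started by each method can see the mutation.

```python
import multiprocessing as mp
import sys

DATA = {"built": False}  # stand-in for a large graph built in the parent

def check(queue):
    # Reports whether this child sees the parent's runtime mutation.
    queue.put(DATA["built"])

def child_sees_mutation(method):
    ctx = mp.get_context(method)
    q = ctx.Queue()
    p = ctx.Process(target=check, args=(q,))
    p.start()
    seen = q.get()
    p.join()
    return seen

if __name__ == "__main__":
    DATA["built"] = True  # mutate after import, before starting children
    # fork: the child inherits the parent's memory (copy-on-write),
    #       so it sees the mutated DATA.
    # spawn: the child re-imports the module from scratch and only
    #        receives what is pickled to it, so it sees the fresh DATA.
    if sys.platform != "win32":
        print("fork :", child_sees_mutation("fork"))
    print("spawn:", child_sees_mutation("spawn"))
```

This is also why anything built on the GPU before forking is dangerous: the forked child inherits the parent's CUDA state, which CUDA does not support, hence the "Cannot re-initialize CUDA in forked subprocess" error.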