Unable to run the update_all function in a subprocess

Hi all,

I have an urgent question about the DGLGraph.update_all function.

Context:
The main process runs DistDGL, with the context of:

dgl.distributed.initialize(args.ip_config)
g = dgl.distributed.DistGraph(args.graph_name, part_config=args.part_config)
model = DistSAGE(in_feats, args.num_hidden, n_classes, args.num_layers, F.relu, args.dropout)

Then I use Python's multiprocessing.Process to start a subprocess that runs the function explain:

Process(target=explain, args=(args, model,))

I construct a local DGLGraph object in the subprocess and want to run a forward pass of the model on the newly constructed graph, but the code gets stuck at the graph.update_all call.

graph.ndata['h'] = n_feats
graph.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h'))

However, I can run the explain function successfully either with threading.Thread or directly in the main process. Therefore, I suspect the root cause is that update_all relies on some runtime objects that were initialized in the main process but cannot be accessed from the subprocess.
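To illustrate the thread-based launch that does work, here is a minimal sketch (explain here is a stand-in stub, not my real function):

```python
import threading

results = []

def explain(args, model):
    # Stand-in for the real explain function. A thread shares the main
    # process's address space, so runtime state initialized there (e.g.
    # by dgl.distributed.initialize) remains visible to it.
    results.append((args, model))

t = threading.Thread(target=explain, args=("args", "model"))
t.start()
t.join()
```

A forked subprocess, by contrast, gets a copy-on-write snapshot of that state, which is where the hang appears.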

When I searched for the runtime object, I found it was already deprecated in DGL 0.6.x, and I'm using the dgl-cu101==0.6.1 package.

I also added import dgl inside the explain function, but it still doesn't work.

Could anyone give me some advice? How can I run my program using multiprocessing?

Thanks a lot for your help!

Is it a DistDGLGraph? Is it on CPU or GPU?

Thanks for your reply!

In the explain function I construct a local DGLGraph, not a DistDGLGraph. The graph is stored on CPU.

The model is originally on GPU, so I create a new DistSAGE on CPU in the main process and copy model.state_dict() into it. See below:

xmodel = DistSAGE(in_feats, args.num_hidden, n_classes, args.num_layers,
                  F.relu, args.dropout)
xmodel.load_state_dict(model.module.state_dict())
torch.save(xmodel.state_dict(), args.xmodel_store)

And then in subprocess:

model = DistSAGE(feat_dim, args.num_hidden, n_classes, args.num_layers,
                 F.relu, args.dropout)
model.load_state_dict(torch.load(args.xmodel_store, map_location=torch.device('cpu')))

to copy the model to CPU.

Could you try adding the decorator to your subprocess function, as in the example at dgl/deepwalk.py at 0.6.x · dmlc/dgl · GitHub (decorator at dgl/utils.py at 0.6.x · dmlc/dgl · GitHub), to see whether it still gets stuck?
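For context, the decorator runs the wrapped function on a fresh thread inside the child process and relays its result or exception back; a rough sketch from memory (not the exact DGL source):

```python
import traceback
from functools import wraps
from queue import Queue
from _thread import start_new_thread

def thread_wrapped_func(func):
    """Run ``func`` on a new thread and relay its result or exception.

    Executing the body on a fresh thread inside the forked child avoids
    deadlocks caused by OpenMP state inherited from the parent process.
    """
    @wraps(func)
    def decorated(*args, **kwargs):
        queue = Queue()

        def _target():
            result = exception = trace = None
            try:
                result = func(*args, **kwargs)
            except Exception as e:  # capture here, re-raise in the caller
                exception, trace = e, traceback.format_exc()
            queue.put((result, exception, trace))

        start_new_thread(_target, ())
        result, exception, trace = queue.get()
        if exception is not None:
            raise exception.__class__(trace)
        return result
    return decorated

@thread_wrapped_func
def explain_stub(x):
    # Hypothetical stand-in for the subprocess target.
    return x * 2
```

Applying it to your real explain function before passing it to Process should be enough to test the hypothesis.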

The reason is that we found that when forking a subprocess, there can be a namespace collision for OpenMP, which can hang the subprocess. On the nightly version, you can try dgl.multiprocessing instead.

Another solution is to spawn the subprocess instead of forking it, but this makes an extra copy of the graph when transferring it between processes.
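A minimal sketch of the spawn-based alternative (the explain body here is a placeholder):

```python
import multiprocessing as mp

def explain(conn):
    # Placeholder body: a spawned child starts from a fresh interpreter,
    # so it inherits no OpenMP state from the parent; anything it needs
    # must be picklable or rebuilt inside the child.
    conn.send("done")
    conn.close()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")      # fresh interpreter, no fork inheritance
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=explain, args=(child_conn,))
    p.start()
    print(parent_conn.recv())          # prints "done"
    p.join()
```

The extra copy mentioned above comes from this pickling step: with spawn, the graph has to be serialized into the child rather than shared via fork's copy-on-write pages.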

Sure, I’ll try the decorator later.

Also, how can I install the DGL nightly version? pip3 install --pre dgl-cu101==0.6.1 doesn't seem to work.

Yes, we’ve moved a self hosted s3 because our binary size exceeds pypi limits. Now you need pip install --pre dgl-cu102 -f https://data.dgl.ai/wheels-test/repo.html to install latest nightly build

Thanks a lot @VoVAllen! The decorator approach works well.

In the nightly builds and the upcoming release, you can use the Process class from dgl.multiprocessing, which wraps the decorator for you.

@BarclayII Thanks! The answer helps me a lot!