Unable to run the update_all function in a subprocess

Hi all,

I have an urgent question about the DGLGraph.update_all function.

Context:
The main process runs DistDGL, with the context of:

dgl.distributed.initialize(args.ip_config)
g = dgl.distributed.DistGraph(args.graph_name, part_config=args.part_config)
model = DistSAGE(in_feats, args.num_hidden, n_classes, args.num_layers, F.relu, args.dropout)

Then I use Python's multiprocessing.Process to start a subprocess that runs the function explain:

Process(target=explain, args=(args, model,))

I construct a local DGLGraph object in the subprocess and want to run a forward pass of the model on the newly constructed graph, but the code gets stuck at the graph.update_all call.

graph.ndata['h'] = n_feats
graph.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h'))

However, I can run the explain function successfully either with threading.Thread or directly in the main process. Therefore, I suspect the root cause is that update_all relies on some runtime objects that were initialized in the main process but cannot be accessed from the subprocess.
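To illustrate the thread-based launch that does work, here is a minimal sketch (explain here is a stand-in stub, not my real function):

```python
import threading

results = []

def explain(args, model):
    # Stand-in for the real explain function. A thread shares the main
    # process's address space, so runtime state initialized there (e.g.
    # by dgl.distributed.initialize) remains visible to it.
    results.append((args, model))

t = threading.Thread(target=explain, args=("args", "model"))
t.start()
t.join()
```

A forked subprocess, by contrast, gets a copy-on-write snapshot of that state, which is where the hang appears.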

When I searched for the runtime object, I found it was already deprecated in DGL 0.6.x, and I'm using the dgl-cu101==0.6.1 package.

I also added import dgl inside the explain function, but it still doesn't work.

Could anyone give me some advice? How can I run my program using multiprocessing?

Thanks a lot for your help!

Is it a DistDGLGraph? Is it on CPU or GPU?

Thanks for your reply!

In the explain function I construct a local DGLGraph, not a DistDGLGraph. The graph is stored on CPU.

The model is originally on GPU, so I create a new DistSAGE on CPU in the main process and copy model.state_dict() into it. See below:

xmodel = DistSAGE(in_feats, args.num_hidden, n_classes, args.num_layers,
                  F.relu, args.dropout)
xmodel.load_state_dict(model.module.state_dict())
torch.save(xmodel.state_dict(), args.xmodel_store)

And then in subprocess:

model = DistSAGE(feat_dim, args.num_hidden, n_classes, args.num_layers,
                 F.relu, args.dropout)
model.load_state_dict(torch.load(args.xmodel_store, map_location=torch.device('cpu')))

to copy the model to CPU.

Could you try adding the decorator to your subprocess function, as in the example at dgl/deepwalk.py at 0.6.x · dmlc/dgl · GitHub (decorator at dgl/utils.py at 0.6.x · dmlc/dgl · GitHub), to see whether it still gets stuck?
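For context, the decorator runs the wrapped function on a fresh thread inside the child process and relays its result or exception back; a rough sketch from memory (not the exact DGL source):

```python
import traceback
from functools import wraps
from queue import Queue
from _thread import start_new_thread

def thread_wrapped_func(func):
    """Run ``func`` on a new thread and relay its result or exception.

    Executing the body on a fresh thread inside the forked child avoids
    deadlocks caused by OpenMP state inherited from the parent process.
    """
    @wraps(func)
    def decorated(*args, **kwargs):
        queue = Queue()

        def _target():
            result = exception = trace = None
            try:
                result = func(*args, **kwargs)
            except Exception as e:  # capture here, re-raise in the caller
                exception, trace = e, traceback.format_exc()
            queue.put((result, exception, trace))

        start_new_thread(_target, ())
        result, exception, trace = queue.get()
        if exception is not None:
            raise exception.__class__(trace)
        return result
    return decorated

@thread_wrapped_func
def explain_stub(x):
    # Hypothetical stand-in for the subprocess target.
    return x * 2
```

Applying it to your real explain function before passing it to Process should be enough to test the hypothesis.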

The reason is that we found that when forking a subprocess, there can be a namespace collision for OpenMP, which can hang the subprocess. On the nightly version, you can try dgl.multiprocessing instead.

Another solution is to spawn the subprocess instead of forking it, but this makes an extra copy of the graph when transferring it between processes.
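A minimal sketch of the spawn-based alternative (the explain body here is a placeholder):

```python
import multiprocessing as mp

def explain(conn):
    # Placeholder body: a spawned child starts from a fresh interpreter,
    # so it inherits no OpenMP state from the parent; anything it needs
    # must be picklable or rebuilt inside the child.
    conn.send("done")
    conn.close()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")      # fresh interpreter, no fork inheritance
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=explain, args=(child_conn,))
    p.start()
    print(parent_conn.recv())          # prints "done"
    p.join()
```

The extra copy mentioned above comes from this pickling step: with spawn, the graph has to be serialized into the child rather than shared via fork's copy-on-write pages.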

Sure, I’ll try the decorator later.

Also, how can I install the DGL nightly version? pip3 install --pre dgl-cu101==0.6.1 doesn't seem to work.

Yes, we’ve moved a self hosted s3 because our binary size exceeds pypi limits. Now you need pip install --pre dgl-cu102 -f https://data.dgl.ai/wheels-test/repo.html to install latest nightly build

Thanks a lot @VoVAllen! The decorator approach works well.

In the nightly builds and the upcoming release, you can use the Process class from dgl.multiprocessing, which wraps the decorator for you.

@BarclayII Thanks! The answer helps me a lot!