Message passing fails only on P100

Hi,
I am running into a strange problem where DGL (version 0.4) seems to fail when I push the graph to the GPU. It works on a K80 and on my local machine's GPU. Has anyone seen this before?

Run on local machine: it works (GTX1660)
Run on cluster’s interactive window: it works (K80)
Submit job to cluster: fails (P100)

One difference I've noticed between these setups is the GPU.

import dgl
import networkx as nx
import matplotlib.pyplot as plt
import torch as th
import dgl.function as fn

def main():
    use_cuda = th.cuda.is_available()
    # use_cuda = False
    print(use_cuda)
    device = th.device("cuda:0" if use_cuda else "cpu")

    # Build a small 6-node graph with a scalar feature per node.
    g = dgl.DGLGraph()
    g.add_nodes(6)
    g.ndata['h'] = th.tensor([0., 1., 2., 3., 4., 5.]).cuda()
    src = (1, 1, 2, 3)
    dst = (4, 5, 4, 5)
    g.add_edges(src, dst)
    g.to(th.device("cuda:0"))

    nx.draw(g.to_networkx(), node_size=50, node_color=[[.5, .5, .5]])
    # plt.show()

    # Message passing: copy 'h' along each edge and sum at the destination nodes.
    g.update_all(message_func=fn.copy_src(src='h', out='m'),
                 reduce_func=fn.sum(msg='m', out='h'))
    print(g.ndata['h'])

if __name__ == '__main__':
    main()


Could you try installing from source on the P100 machine? I suspect this might be due to SM architecture version compatibility. @BarclayII

It shouldn't be, since the binaries are compiled for all SM architectures (30, 35, 50, 60, 70).
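
One quick sanity check (not something run in this thread, just a suggestion using standard torch.cuda calls) is to print what PyTorch actually detects on each machine:

import torch as th

# Print the detected GPU and its SM (compute capability) version.
# A P100 should report (6, 0), a K80 (3, 7), and a GTX 1660 (7, 5).
if th.cuda.is_available():
    print(th.cuda.get_device_name(0))
    print(th.cuda.get_device_capability(0))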

We'll try to find a P100 machine and reproduce it.

Your P100 GPU shows "E. Process" (exclusive process) compute mode. Is this caused by DGL? Can you try resetting it to the default?

This is very likely.

@willfang Could you try resetting compute mode to default using the following:

nvidia-smi -c DEFAULT
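
If it helps, the current mode can be confirmed before and after the reset with a standard nvidia-smi query (not something posted in this thread):

nvidia-smi --query-gpu=name,compute_mode --format=csv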

Sorry, I had to coordinate with my admin to do this. We tested it and unfortunately got the same result.

It's installed from source using gcc 6.4.0 and cmake 3.16.2.

I am not sure if this is a general problem; it works for me on a P100.

Would you have any tips on how to debug this for my specific set-up?

Another thing to note: if I move the graph to the CPU before calling update_all(), it works.
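
For reference, a minimal sketch of that workaround (assuming the same toy graph as in the original script and the DGL 0.4 API; not verified on the P100 itself): keep the features on the CPU for update_all() and move the result to the GPU afterwards.

# Hypothetical sketch of the CPU workaround described above (DGL 0.4 API).
import dgl
import dgl.function as fn
import torch as th

g = dgl.DGLGraph()
g.add_nodes(6)
g.add_edges((1, 1, 2, 3), (4, 5, 4, 5))
g.ndata['h'] = th.tensor([0., 1., 2., 3., 4., 5.])  # features stay on CPU

# Message passing on CPU: copy 'h' along each edge, sum at destination nodes.
g.update_all(message_func=fn.copy_src(src='h', out='m'),
             reduce_func=fn.sum(msg='m', out='h'))

# Move the result to the GPU only after update_all() has run.
h_gpu = g.ndata['h'].cuda() if th.cuda.is_available() else g.ndata['h']
print(h_gpu)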