Error still occurs when training DistSAGE with OGB-Product

Sorry @VoVAllen, I found that I still have this issue: Error when train DistSAGE with OGB-Product

I wonder what the DGL version is in: [Bugfix] Fix CUDA 11.1 crashing when number of edges is larger than number of node pairs by BarclayII · Pull Request #3265 · dmlc/dgl · GitHub

Maybe I can try that version to see whether it works.

0.7.2 should include that fix


@VoVAllen Thanks! Upgrading to DGL v0.7.2 works well.
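
A minimal sanity check (not from the original thread) to confirm the interpreter is actually importing the upgraded package:

```python
# Sanity check: confirm the upgraded DGL is the version being imported.
import dgl
print(dgl.__version__)  # expect '0.7.2' or later after the upgrade
```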

@VoVAllen I found that in some cases (most likely related to the graph partitioning results), there is still an error when training ogb-products in the distributed setting.

Do you mean the same error? That's weird. Could you try re-partitioning the graph to see whether the error still exists?

Yes, I'm fairly sure this issue is due to the partition results. When I hit this bug, I re-partition the graph, and sometimes it works and sometimes it doesn't.
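
For reference, a minimal sketch of re-partitioning ogbn-products with DGL's `dgl.distributed.partition_graph` API, assuming DGL 0.7.x and the `ogb` package; the number of parts, output path, and label assignment below are placeholders, not the exact commands used in this thread:

```python
# Sketch: re-partition ogbn-products for distributed training.
# num_parts and out_path are placeholders for illustration.
import dgl
from ogb.nodeproppred import DglNodePropPredDataset

data = DglNodePropPredDataset(name='ogbn-products')
g, labels = data[0]
g.ndata['labels'] = labels[:, 0]  # attach labels so they are split along with the partitions

dgl.distributed.partition_graph(
    g,
    graph_name='ogbn-products',
    num_parts=4,                      # placeholder: e.g. one partition per machine
    out_path='ogbn-products-4parts',
    part_method='metis',              # 'random' is another option if METIS misbehaves
    balance_edges=True)               # try to balance edge counts across partitions
```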

Besides, sometimes the pad_data function seems to trigger this issue as well.
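
For context, a rough illustration of what that kind of padding does: each trainer's local seed nodes are padded to a common length so every worker runs the same number of mini-batches. The function name and logic below are hypothetical and do not reproduce the example's pad_data or the fix in PR #3687:

```python
# Hypothetical illustration of padding local seed-node IDs to a common length
# so that all trainers iterate over the same number of mini-batches.
import torch

def pad_local_ids(local_ids, target_len):
    if len(local_ids) == 0:
        raise ValueError("cannot pad an empty ID tensor")
    if len(local_ids) >= target_len:
        return local_ids[:target_len]
    # Fill the shortfall by sampling existing IDs with replacement.
    extra = local_ids[torch.randint(0, len(local_ids), (target_len - len(local_ids),))]
    return torch.cat([local_ids, extra])
```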

I realized what the problem is and made the fix in Fix dist example padding problem by VoVAllen · Pull Request #3687 · dmlc/dgl · GitHub. Sorry for the inconvenience.

