Error in distributed training

dlpack does not work: when I read dist tensor data after I have already written it, I get the old values.

I use it as follows:

1. I build a dist tensor initialized with a th.ones() tensor.
2. I update some parts of this dist tensor with a random tensor.
3. Then I find that the parts I updated on my current machine cannot be read by other machines; they return the old data (th.ones()). (See the sketch after this list.)
4. I think this is a bug where dlpack does not work, so I commented out some code in dgl, as follows:

1. In kvstore.py, the push method's local UDF update.
2. In kvstore.py, the pull method's read from local dlpack shared memory.

I use sockets instead to fetch the data, and it works.
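Roughly, here is a minimal sketch of steps 1-3 (the config path, graph name, tensor name, and embedding size are placeholders of mine, and I'm assuming the standard dgl.distributed.DistTensor API):

```python
import torch as th
import dgl

dgl.distributed.initialize('ip_config.txt')        # placeholder config path
g = dgl.distributed.DistGraph('graph_name')        # graph served by the graph server

def init_ones(shape, dtype):                       # step 1: dist tensor filled with ones
    return th.ones(shape, dtype=dtype)

emb = dgl.distributed.DistTensor((g.num_nodes(), 16), th.float32,
                                 name='demo_emb', init_func=init_ones)

idx = th.arange(0, 100)                            # step 2: overwrite part of it
emb[idx] = th.rand(100, 16)

print(emb[idx])                                    # step 3: read it back
# on this machine the new random values come back; on another
# machine the same read still returns th.ones()
```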

So, is this a bug or am I using it wrong? I need help.

In the picture, I mean that pull@1 works and pull@2 does not. I think the new data torch2 wrote is not in the same memory that the node2 graph server reads.

Can you provide code to demonstrate what you did?
It looks weird. If you pull the same data from different machines, one pull reads the data from shared memory on the local machine and the other reads the data over network communication; both should get the same data. I want to see what you did and why you get inconsistent results on different machines.
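To check, something like the following run on each machine should print the same value. This is a sketch under the assumption that the dist tensor was created with a fixed name (here the placeholder 'demo_emb') and that constructing a DistTensor with an existing name attaches to the existing data:

```python
import torch as th
import dgl

dgl.distributed.initialize('ip_config.txt')        # placeholder config path
g = dgl.distributed.DistGraph('graph_name')        # placeholder graph name

# attach to the existing dist tensor by name (assumed behavior)
emb = dgl.distributed.DistTensor((g.num_nodes(), 16), th.float32, name='demo_emb')

idx = th.arange(0, 100)
vals = emb[idx]     # local machine: read from shared memory
                    # remote machine: pulled from the server over the network
print(vals.sum())   # should match across machines
```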

Thanks for the reply. Here is the list of actions I perform before training:

First, I initialize a dist tensor with a zero torch tensor.
Then I see that all the dist tensor values are 0.0f; that's fine.

Second, I update the dist tensor values with a tensor loaded via torch.load('emb.file') instead.
Then I query the dist tensor on the same machine; because of shared memory across processes, the values are correct. That's fine.

Third, I query the dist tensor on another machine. Its TCP connection goes to the graph server (PS), and the returned values are zero (I made sure these values were updated on the first machine). A sketch of these three steps is below.
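For reference, a sketch of the three steps (same placeholder names as in my earlier sketch; 'emb.file' is my saved embedding file, and I'm assuming its shape and dtype match the dist tensor):

```python
import torch as th
import dgl

dgl.distributed.initialize('ip_config.txt')        # placeholder config path
g = dgl.distributed.DistGraph('graph_name')        # placeholder graph name

def init_zeros(shape, dtype):                      # first: zero-initialized dist tensor
    return th.zeros(shape, dtype=dtype)

emb = dgl.distributed.DistTensor((g.num_nodes(), 16), th.float32,
                                 name='demo_emb', init_func=init_zeros)

# second: overwrite with saved values (assumed float32, 16 columns)
pretrained = th.load('emb.file')
idx = th.arange(pretrained.shape[0])
emb[idx] = pretrained
print(emb[idx])        # same machine: correct values via shared memory

# third: running print(emb[idx]) on another machine goes through its
# TCP connection to the graph server and returns all zeros
```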

So I was very surprised. I also read the code but could not find the problem, so I removed the code mentioned in my earlier reply,

and forced the values to be read through sockets. That works, but I don't know why.

Sorry for the late reply. Could you provide us with your code to reproduce the problem?
