Hi,
I'm encountering this error when trying to use a larger dataset:
Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
The code that triggers it:
import dgl

# Build a bidirectional heterogeneous graph from the two edge lists
hetero_graph = dgl.heterograph({
    ('user', 'do', 'sth'): (src, dst),
    ('sth', 'done-by', 'user'): (dst, src)
})
I wonder if the graph is simply too large to be loaded onto a single GPU (src and dst each have over 80 million entries). I am using the TensorFlow backend, my dataset is stored in several CSV files, and my server has two GPUs.
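For reference, this is roughly how I build src and dst from the CSV files (the file paths and column names below are just placeholders):

import glob

import numpy as np
import pandas as pd

# Read all edge CSV files and concatenate them into one table
# ("data/edges_*.csv", "user_id", "item_id" are placeholder names)
frames = [pd.read_csv(path) for path in glob.glob("data/edges_*.csv")]
edges = pd.concat(frames, ignore_index=True)

src = edges["user_id"].to_numpy()  # 80M+ source node IDs
dst = edges["item_id"].to_numpy()  # 80M+ destination node IDs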
I am new to multi-GPU / distributed training. Can someone give me a general approach to solving this problem? Would distributed partitioning help?