Construct and train large graph


I get this error when I try to use a larger dataset:

Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

The code generating this:

hetero_graph = dgl.heterograph({
    ('user', 'do', 'sth'): (src, dst),
    ('sth', 'done-by', 'user'): (dst, src)})

I wonder if the graph is too large to be loaded onto one GPU (src and dst each have over 80 million entries). I am using the TensorFlow backend, and my dataset is stored in several CSV files. I have two GPUs on my server.

I am new to multi-GPU and distributed training. Can someone give me a general approach to solving the problem? Does distributed partitioning help?

80M is not very large. Are you trying to construct the graph on the GPU directly? What is the device of src and dst? Could you make sure src/dst are valid tensors? Or try trimming src/dst?
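For anyone who wants to run the checks suggested above, here is a minimal sketch. It assumes src and dst are still NumPy arrays (for example, right after loading from CSV and before conversion to tensors). The helper names `check_edge_arrays` and `trim_edges` are made up for illustration and are not part of DGL.

```python
import numpy as np

def check_edge_arrays(src, dst):
    """Return a list of problems found in the edge ID arrays (empty if none).

    Hypothetical helper, not a DGL function.
    """
    src = np.asarray(src)
    dst = np.asarray(dst)
    problems = []
    if src.ndim != 1 or dst.ndim != 1:
        problems.append("src and dst must be 1-D")
    if src.shape != dst.shape:
        problems.append(f"length mismatch: {src.shape} vs {dst.shape}")
    for name, arr in (("src", src), ("dst", dst)):
        if not np.issubdtype(arr.dtype, np.integer):
            problems.append(f"{name} has non-integer dtype {arr.dtype}")
        elif arr.size and arr.min() < 0:
            problems.append(f"{name} contains negative IDs")
    return problems

def trim_edges(src, dst, max_edges):
    """Keep only the first max_edges edges, to test on a smaller graph."""
    return src[:max_edges], dst[:max_edges]

# Toy data standing in for the real 80M-edge arrays.
src = np.array([0, 1, 2, 3], dtype=np.int64)
dst = np.array([1, 2, 3, 0], dtype=np.int64)
print(check_edge_arrays(src, dst))
small_src, small_dst = trim_edges(src, dst, 2)
print(len(small_src), len(small_dst))
```

If the trimmed arrays build a graph without error and the full ones do not, memory is the more likely culprit than malformed input.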

Thanks for the reply. I found that I was using /gpu:0, which was running out of memory. Switching to /gpu:1 solved the problem. The device of src and dst is CPU.

However, I will need to use a larger dataset with about four times as many edges as the previous one. That will probably exceed the memory of a single GPU. What should I do then?

Is the CPU RAM large enough to hold the entire graph? Why not try training with multiple GPUs? Just refer to this doc: Single Machine Multi-GPU Minibatch Node Classification — DGL 0.9 documentation
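To get a rough sense of whether the graph fits in CPU RAM or on one GPU, a back-of-the-envelope estimate of the raw edge arrays can help. The sketch below assumes 8-byte (int64) or 4-byte (int32) node IDs and the two relations from the snippet in the question. It counts only the src/dst ID arrays, so the real footprint will be higher once node features and any extra sparse formats are included. `edge_list_bytes` is a hypothetical helper, not a DGL function.

```python
def edge_list_bytes(num_edges, bytes_per_id=8, num_relations=2):
    """Bytes needed for the raw (src, dst) ID arrays of all relations.

    Assumption: each relation stores two ID arrays of length num_edges.
    """
    return num_edges * 2 * bytes_per_id * num_relations

GIB = 1024 ** 3
cases = [
    ("current dataset (80M edges)", 80_000_000),
    ("4x larger dataset (320M edges)", 320_000_000),
]
for label, n in cases:
    as_int64 = edge_list_bytes(n, bytes_per_id=8) / GIB
    as_int32 = edge_list_bytes(n, bytes_per_id=4) / GIB
    print(f"{label}: {as_int64:.2f} GiB as int64, {as_int32:.2f} GiB as int32")
```

Under these assumptions the 80M-edge graph needs about 2.4 GiB for the ID arrays alone and the 4x dataset about 9.5 GiB, which is the figure to compare against the free memory on each GPU and the host.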