Construct and train large graph

JitongZ · July 12, 2022, 8:59am

Hi,

I encounter this error when I am trying to use a larger dataset:

Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

The code generating this:

import dgl
# src and dst are CPU tensors of edge endpoints (over 80 million entries each)
hetero_graph = dgl.heterograph({
    ('user', 'do', 'sth'): (src, dst),
    ('sth', 'done-by', 'user'): (dst, src)})

I wonder if the graph is too large to be loaded onto one GPU (src and dst each have over 80 million entries). I am using the TensorFlow backend, and my dataset is stored in several CSV files. I have two GPUs on my server.

I am new to using multiple GPUs / distributed training. Can someone give me a general approach to solving the problem? Does distributed partitioning help?

Rhett-Ying · July 13, 2022, 12:38am

80M is not very large. Are you trying to construct the graph on the GPU directly? What's the device of src and dst? Could you make sure src/dst are valid tensors? Or try trimming src/dst?
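
For anyone hitting the same error, the checks suggested above can be sketched roughly as follows. This is only an illustration: check_edges and trim_edges are hypothetical helper names, not DGL functions, and the built-in min/max are used for simplicity. On real tensors with tens of millions of entries, the framework's own reductions would be much faster.

```python
def check_edges(src, dst, num_src_nodes=None, num_dst_nodes=None):
    """Return a list of human-readable problems found in an edge list."""
    problems = []
    if len(src) != len(dst):
        problems.append(f"length mismatch: {len(src)} src vs {len(dst)} dst")
    for name, ids, limit in (("src", src, num_src_nodes),
                             ("dst", dst, num_dst_nodes)):
        if len(ids) == 0:
            problems.append(f"{name} is empty")
            continue
        lo, hi = min(ids), max(ids)
        if lo < 0:
            problems.append(f"{name} has a negative ID: {lo}")
        if limit is not None and hi >= limit:
            problems.append(f"{name} has ID {hi} but only {limit} nodes")
    return problems


def trim_edges(src, dst, n):
    """Keep only the first n edges, to test whether size is the issue."""
    return src[:n], dst[:n]


# Tiny stand-in for the real edge list.
src, dst = [0, 1, 2], [1, 2, 0]
print(check_edges(src, dst))                 # an empty list means no problems found
print(getattr(src, "device", "no device"))   # eager TensorFlow tensors expose .device
```

If the trimmed edge list builds fine and the full one does not, memory is the likely culprit rather than invalid input.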

JitongZ · July 13, 2022, 2:48am

Thanks for the reply. I found that I was using /gpu:0, which was running out of memory. Switching to /gpu:1 solved the problem. The device of src and dst is CPU.
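
A rough sketch of how one might pick the less busy GPU automatically instead of hard-coding /gpu:1. This is not what was done in this thread. It swaps in a different technique: query nvidia-smi for free memory, then restrict the process to one GPU through the CUDA_VISIBLE_DEVICES environment variable. The helper names and the sample readings are made up for illustration.

```python
import os
import subprocess


def parse_free_memory(text):
    """Parse 'index, free_MiB' lines into a {gpu_index: free_MiB} dict."""
    free = {}
    for line in text.strip().splitlines():
        index, mib = (field.strip() for field in line.split(","))
        free[int(index)] = int(mib)
    return free


def query_free_memory():
    """Ask nvidia-smi for each GPU's free memory (needs NVIDIA drivers)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_memory(out)


def pick_gpu(free):
    """Return the index of the GPU with the most free memory."""
    return max(free, key=free.get)


# Hypothetical readings: GPU 0 is nearly full, GPU 1 is mostly idle.
# On a real machine, call query_free_memory() instead of using this sample.
sample = "0, 512\n1, 15109\n"
best = pick_gpu(parse_free_memory(sample))

# Set this before TensorFlow initializes the GPUs (simplest: before importing
# it). The process then sees only the chosen GPU, which shows up as /gpu:0.
os.environ["CUDA_VISIBLE_DEVICES"] = str(best)
```

Because only one GPU stays visible to the process, the rest of the code can keep using the default device.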

However, I will need to use a larger dataset with about four times as many edges as the previous one. That will probably exceed the memory of a single GPU. What should I do then?
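
A back-of-the-envelope estimate of the edge storage alone, as a sketch under stated assumptions: 64-bit node IDs, two edge types as in the snippet above, and exactly 80 million edges (the real count is "over 80 million", so treat this as a lower bound). Node and edge features, any extra sparse formats the library builds internally, and framework overhead are not counted.

```python
def edge_memory_gib(num_edges, num_edge_types=2, bytes_per_id=8):
    """Memory for the (src, dst) ID arrays of every edge type, in GiB."""
    arrays = 2 * num_edge_types  # one src and one dst array per edge type
    return num_edges * arrays * bytes_per_id / 2**30


current = edge_memory_gib(80_000_000)
larger = edge_memory_gib(4 * 80_000_000)
print(f"current: {current:.2f} GiB, 4x larger: {larger:.2f} GiB")
```

Under these assumptions the current graph needs roughly 2.4 GiB for edges and the larger one roughly 9.5 GiB. Storing IDs as 32-bit integers (bytes_per_id=4) would halve both figures.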

Rhett-Ying · July 13, 2022, 3:08am

Is CPU RAM large enough to hold the entire graph? Why not try training with multiple GPUs? See this doc: Single Machine Multi-GPU Minibatch Node Classification — DGL 0.9 documentation
