About DistDGL memory footprint

I followed the README.md in the GraphSage directory to get the data and run DistDGL on a 4-node CPU cluster. With the default parameters from README.md, training performance is not very good. When I increase the number of trainers and samplers, the memory footprint becomes very high.

With 16 trainers, 8 samplers and 1 server, memory usage on a single machine reaches close to 80 GB.
With 16 trainers, 0 samplers and 1 server, memory usage on a single machine reaches close to 30 GB.
However, the partitioned data stored on disk occupies less than 2 GB.

I am confused by such a high memory footprint.

Thanks for reporting the memory issue. Do you run 4 trainers on each machine? And how do you calculate memory consumption?
DistDGL uses shared memory to share data between trainers, samplers and servers. If your partition data is 2 GB, each trainer, sampler and server can access the partition data directly via shared memory. However, only one physical copy of the data is kept in memory.
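To illustrate the idea of one physical copy that several processes map, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module. It is not DistDGL's own implementation; the function name `demo_shared_block` and the "server"/"trainer" roles are only illustrative, and both handles live in one process to keep the example short.

```python
from multiprocessing import shared_memory


def demo_shared_block(size_bytes: int = 1024) -> bytes:
    """Write through one handle and read the same bytes through another."""
    # "Server" side: allocate one named block of shared memory.
    owner = shared_memory.SharedMemory(create=True, size=size_bytes)
    try:
        owner.buf[:5] = b"graph"
        # "Trainer" side: attach to the existing block by name.
        # This maps the same pages; the data is not copied.
        viewer = shared_memory.SharedMemory(name=owner.name)
        try:
            seen = bytes(viewer.buf[:5])
        finally:
            viewer.close()
        return seen
    finally:
        owner.close()
        owner.unlink()  # release the block once nobody needs it


if __name__ == "__main__":
    print(demo_shared_block())
```

Per-process tools that report resident memory may count such shared pages once for every process that maps them, so summing per-process numbers can overstate the real usage.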

The trainers and samplers communicate through a shared memory queue. After a sampler generates a mini-batch, it places the mini-batch in the queue. A queue may contain many mini-batches, which consumes a lot of memory. I think you can change the queue size in DistDataLoader.
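A rough way to reason about the queue's contribution is an upper bound: number of queues times queue capacity times the size of one mini-batch. The sketch below assumes one queue per trainer, which may not match your setup, and the numbers in the usage line are hypothetical placeholders rather than measurements.

```python
def queued_batch_mib(num_trainers: int, queue_size: int, batch_mib: float) -> float:
    """Upper bound, in MiB, on memory held by queued mini-batches.

    Assumes one queue per trainer and that every queue is full.
    """
    return num_trainers * queue_size * batch_mib


if __name__ == "__main__":
    # Hypothetical values: 4 trainers, queue_size=4, 50 MiB per mini-batch.
    print(queued_batch_mib(4, 4, 50.0))  # 800.0
```

If the bound is far below the memory you observe, the queues are probably not the main consumer and the cause lies elsewhere.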

Do you run 8 samplers per trainer? Maybe you should reduce the number of samplers per trainer.

Please let me know how well it works.

Thanks for answering my question.

Perhaps the parameters I set in the test above were a little too large (16 trainers per machine), so I ran some tests again.

I’m still testing on a 4-node cluster. I confirmed that the data stored on disk for each partition is about 1.5 GB. I estimate the memory footprint from the difference in memory reported by the free command before and after the program runs.
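On Linux, `free` takes its numbers from `/proc/meminfo`, so the before/after comparison can be made repeatable with a small script. This is a minimal sketch; the helper names are illustrative, and the `Shmem` field is included so that shared memory can be seen separately from the rest.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value_in_kib}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live memory counters (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


def memory_delta_kib(before: dict, after: dict,
                     keys=("MemAvailable", "Shmem")) -> dict:
    """Return after - before for each requested field present in both."""
    return {k: after[k] - before[k] for k in keys if k in before and k in after}
```

Call `read_meminfo()` once before launching the job and once while it is running, then pass both results to `memory_delta_kib`. A large increase in `Shmem` would point at shared-memory data such as the graph partition or queued mini-batches rather than private per-process memory.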

I set the queue_size parameter of DistDataLoader in graphsage/train_dist_unsupervised.py to 4.
In my tests, with 4 trainers, 0 samplers and 1 server on each machine, memory usage on a single machine is about 11 GB. This is still weird.
With 1 trainer, 0 samplers and 1 server on each machine, memory usage on a single machine is about 2.4 GB, which seems reasonable.