I followed the README.md in the GraphSage directory to prepare the data and run DistDGL on a 4-node CPU cluster. With the default parameters from README.md, training performance is poor, and when I increase the number of trainers and samplers, the memory footprint becomes very high:
- With 16 trainers, 8 samplers, and 1 server, memory usage on a single machine reaches close to 80 GB.
- With 16 trainers, 0 samplers, and 1 server, memory usage on a single machine reaches close to 30 GB.
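For context, I launch the job with DGL's `tools/launch.py` roughly as follows. The workspace path, partition config, graph name, and training script are from my setup following the README, so yours may differ; the trainer/sampler/server counts are the ones I vary:

```bash
# Launch DistDGL across the 4-node cluster (nodes listed in ip_config.txt).
# Paths, graph name, and script name below follow the README in my checkout
# and are illustrative; only the process counts are the point here.
python3 ~/dgl/tools/launch.py \
  --workspace ~/dgl/examples/pytorch/graphsage/dist/ \
  --num_trainers 16 \
  --num_samplers 8 \
  --num_servers 1 \
  --part_config data/ogbn-products.json \
  --ip_config ip_config.txt \
  "python3 train_dist.py --graph_name ogbn-products --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
```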
However, the partitioned data stored on disk occupies less than 2 GB.
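For reference, this is how I checked the on-disk size; the `data/` directory name is the partition output location in my setup:

```bash
# Total size of the partitioned graph on disk ("data/" is the partition
# output directory in my setup); it reports under 2G in total.
du -sh data/
```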
I am confused by such a high memory footprint given the small size of the data on disk. Is this expected, and where does the extra memory come from?