When I train GCN and GraphSAGE on the Reddit dataset in a distributed setting, the accuracy barely improves and stays close to its value after the first epoch. What could be the reason?
Please share more details, such as the command line, arguments, and terminal output. Ideally we should be able to reproduce the issue. Did you modify the example code in any way?
OK. First, let me describe how I train distributed GCN and what output I get. I modified the script dgl/examples/pytorch/graphsage/experimental/train_dist.py, replacing the basic layer with GraphConv so that it trains a GCN. To make the accuracy easier to track, I also changed how it is computed: I record the number of correctly predicted vertices in each training batch and, at the end of each epoch, output the training accuracy for that epoch (the number of correctly predicted vertices in this epoch / the total number of vertices in the dataset).
The data I use is the Reddit dataset built into DGL, and the environment is four Alibaba Cloud servers. Here is my run command:
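For readers following along, the bookkeeping described above can be sketched in plain Python. This is only an illustration and not the poster's actual modification: the helper name `epoch_accuracy` and the toy labels are hypothetical. It shows that the same count of correct vertices gives very different numbers depending on the denominator.

```python
def epoch_accuracy(batches, denominator):
    """Sum correct predictions over all batches, then divide once per epoch.

    batches: iterable of (predicted_labels, true_labels) pairs.
    denominator: the vertex count the correct total is divided by.
    """
    correct = 0
    for predicted, truth in batches:
        correct += sum(int(p == t) for p, t in zip(predicted, truth))
    return correct / denominator


# Hypothetical toy data: this machine sees 6 training vertices in 2 batches,
# while the whole dataset has 20 vertices.
batches = [([1, 0, 2], [1, 0, 1]), ([2, 2, 0], [2, 2, 0])]

# Dividing by the size of the whole dataset, as described in the post.
share_of_dataset = epoch_accuracy(batches, denominator=20)
# Dividing by the number of vertices this machine actually trained on.
local_accuracy = epoch_accuracy(batches, denominator=6)

print(share_of_dataset, local_accuracy)
```

With the dataset-wide denominator, each machine's figure is its share of a global total, so the per-machine values are meant to be summed rather than read as standalone accuracies.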
python3 ~/dgl/tools/launch.py --workspace ~/redditgcn/ --num_trainers 1 --num_samplers 1 --num_servers 1 --part_config reddit/reddit.json --ip_config ip_config.txt "python3 train_dist.py --graph_name reddit --ip_config ip_config.txt --num_servers 1 --num_epochs 20 --num_hidden 256 --num_workers 1 --num_gpus -1 --n_classes 41 --lr 0.01 --num_layers 1 --fan_out 10,25"
After the first epoch, the training accuracies reported by the four machines were 0.13, 0.1388, 0.1297 and 0.1102 respectively, for a total training accuracy of 0.5087.
After the 100th epoch, the training accuracies of the four machines were 0.154, 0.1476, 0.1404 and 0.1549 respectively, for a total training accuracy of 0.5969.
So the accuracy improved very little.
Second, the distributed training of GraphSAGE. The training script is basically unchanged, except that the accuracy is computed in the same modified way as for GCN, and num_hidden is set to 128. After the first epoch, the training accuracies of the four machines were 0.1389, 0.1096, 0.1260 and 0.1308 respectively, for a total of 0.5053. After 100 epochs, they were 0.1597, 0.1564, 0.1594 and 0.1537 respectively, for a total of 0.6292. The training accuracy increased by about 0.12. Is this increase correct?
In the paper "A Comprehensive Survey on Graph Neural Networks", the Reddit dataset has 11606919 edges, so I replaced the graph data accordingly while keeping the split into training, validation and test sets unchanged. The paper reports that GraphSAGE can reach 95.4% accuracy, but when I trained with DGL (still in a distributed setting), the accuracies of the four machines after the first epoch were 0.0641, 0.0649, 0.0656 and 0.0656 respectively, for a total of 0.2602. After 100 epochs they were 0.1077, 0.1074, 0.1080 and 0.1083, for a total of 0.4314. After 200 epochs they were 0.1086, 0.1078, 0.1076 and 0.1078, for a total of 0.4318. I think such a small increase in accuracy is unreasonable.
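One thing that may be worth checking before comparing against the paper is the arithmetic of the reported figures. The sketch below copies the numbers from the posts above and checks that the per-machine values add up to each total. It then estimates the largest value the metric could reach if only training vertices are counted as correct while the denominator is every vertex in the dataset. The split sizes (153,431 training vertices out of 232,965) are an assumption taken from the DGL RedditDataset documentation and were not stated in the thread, so treat the ceiling as a rough guide rather than a diagnosis.

```python
# Figures copied from the posts above: (per-machine accuracies, reported total).
reported = {
    "GCN, epoch 1": ([0.13, 0.1388, 0.1297, 0.1102], 0.5087),
    "GCN, epoch 100": ([0.154, 0.1476, 0.1404, 0.1549], 0.5969),
    "GraphSAGE, epoch 1": ([0.1389, 0.1096, 0.1260, 0.1308], 0.5053),
    "GraphSAGE, epoch 100": ([0.1597, 0.1564, 0.1594, 0.1537], 0.6292),
}


def sums_match(per_machine, total, tol=5e-4):
    """True if the per-machine values add up to the reported total."""
    return abs(sum(per_machine) - total) < tol


all_match = all(sums_match(m, t) for m, t in reported.values())

# ASSUMPTION: split sizes from the DGL RedditDataset documentation,
# not from the thread itself.
TRAIN_NODES = 153_431
TOTAL_NODES = 232_965
ceiling = TRAIN_NODES / TOTAL_NODES

# Reported totals expressed as a fraction of that ceiling.
relative = {name: total / ceiling for name, (_, total) in reported.items()}

print("per-machine values sum to totals:", all_match)
print("ceiling under the assumed split:", round(ceiling, 4))
for name, value in relative.items():
    print(name, round(value, 4))
```

If the assumption holds, a total near the ceiling would already mean most training vertices are classified correctly, so dividing by the number of training vertices instead may give figures that are easier to compare with published accuracies.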
Sorry for the delay, and thanks for the details. I have several questions:
- how do you partition the graph? use the
- have you tried, or could you try, training without partitioning, or with a single partition? I want to find out whether the issue you reported is related to distributed training or to multi-partition training.
- did you try tuning parameters such as dropout? Did it make any difference?
OK, thank you for your questions. I'll answer them one by one. For graph partitioning, I used partition_graph.py. I'm sorry, I haven't tried training the model on a single machine; the dataset was always partitioned and trained in a distributed setting. I also didn't tune any parameters other than setting the learning rate to 0.01.
Hi, could you try training on a single machine with a single partition, or even without distributed training at all? Let's try to narrow down the suspects first.
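A single-partition copy of the graph for this check could be prepared roughly as follows. This is a minimal sketch, not the thread's own script: the function name `make_single_partition` and the output directory `reddit_1part` are hypothetical, and it assumes a DGL version that provides `dgl.distributed.partition_graph` and `dgl.data.RedditDataset`. DGL is imported inside the function so the definition can be loaded on a machine without it.

```python
def make_single_partition(out_dir="reddit_1part", graph_name="reddit"):
    """Write the Reddit graph as one partition and return the config path.

    out_dir and graph_name are illustrative defaults, not values from the thread.
    """
    # Imported lazily so this sketch can be loaded without DGL installed.
    from dgl.data import RedditDataset
    from dgl.distributed import partition_graph

    g = RedditDataset()[0]  # the built-in Reddit graph
    # num_parts=1 keeps the whole graph in a single partition.
    partition_graph(g, graph_name, 1, out_dir)
    # partition_graph writes <out_dir>/<graph_name>.json as the partition config.
    return f"{out_dir}/{graph_name}.json"
```

The returned path would then be passed as `--part_config` to the launch command, with an ip_config.txt that lists only one machine. If the accuracy behaves the same way with one partition, partitioning is less likely to be the cause.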