The following is how I run it. The very large batch size forces every epoch to run as a single batch (i.e., a hacky way to disable minibatching), and pwd works for --workspace because this is an NFS filesystem. I am running on a supercomputing cluster, with the IP addresses of the reserved machines listed manually in the ip_config file; this run uses 2 machines.
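For reference, ip_config.txt here just lists the two reserved machines, one address per line, roughly like the following (placeholder addresses — depending on the DGL version, each line may also include a port):

10.0.0.11
10.0.0.12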
export EP=5
python3 ../../../../tools/launch.py \
--workspace `pwd` \
--num_trainers 2 \
--num_samplers 1 \
--num_servers 2 \
--ip_config ip_config.txt \
--part_config data/reddit.json \
"python3 train_dist.py --graph_name reddit --ip_config ip_config.txt --num_servers 2 --num_epochs ${EP} --batch_size 100000000000 --batch_size_eval 10000000000000 --log_every 1 --eval_every ${EP} --num_workers 1"
Training goes through as expected, but the run hits a problem when it calls evaluate on the val/test set.
...
Part 3 | Epoch 00004 | Step 00000 | Loss 4.5915 | Train Acc 0.0318 | Speed (samples/sec) 8171.7384 | GPU 0.0 MB | time 4.200 s
Part 3, Epoch Time(s): 7.8217, sample+data_copy: 3.5985, forward: 3.6194, backward: 0.5795, update: 0.0013, #seeds: 38357, #inputs: 181420
Total training time is 43.70542502403259
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
0it [00:00, ?it/s]
1it [00:53, 53.43s/it]
1it [00:54, 54.48s/it]
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
0it [00:00, ?it/s]
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
1it [00:52, 52.83s/it]
1it [00:54, 54.20s/it]
0it [00:00, ?it/s]
1it [00:08, 8.28s/it]
1it [00:08, 8.47s/it]
1it [00:08, 8.90s/it]
1it [00:08, 8.95s/it]
val tensor(324.) 5957
val tensor(1072.) 5958
val tensor(1068.) 5958
test tensor(2449.) 13926
Part 1, Val Acc 0.1793, Test Acc 0.1759, time: 63.5644
Machine (1) client (6) connect to server successfuly!
Using backend: pytorch
Machine (1) client (4) connect to server successfuly!
Using backend: pytorch
Machine (0) client (1) connect to server successfuly!
Using backend: pytorch
Machine (0) client (0) connect to server successfuly!
Using backend: pytorch
As you can see, it gets stuck on the val/test evaluate call. One partition (Part 1) makes it through, but the others appear to be either stuck or running for a very long time.
Any help resolving this would be appreciated; as it stands, I cannot get the final numbers for the val/test set.
Note that this does not seem to be an issue if I run on a single host machine.
EDIT: The single-machine case only works with 1 worker: running 2 workers on a single machine results in the same problem.
EDIT2: Never mind, it hangs even with 1 host and 1 worker.