The following is how I run it. The very large batch size forces every epoch to run as a single batch (i.e., a hacky way to disable minibatching), and pwd works for --workspace because this is an NFS filesystem. I am running on a supercomputing cluster, with the IP addresses of the reserved machines listed manually in the ip_config file; this run uses 2 machines.
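For reference, ip_config.txt here just lists the two reserved machines, one address per line, roughly like the following (placeholder addresses — depending on the DGL version, each line may also include a port):

10.0.0.11
10.0.0.12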
export EP=5
python3 ../../../../tools/launch.py \
--workspace `pwd` \
--num_trainers 2 \
--num_samplers 1 \
--num_servers 2 \
--ip_config ip_config.txt \
--part_config data/reddit.json \
"python3 train_dist.py --graph_name reddit --ip_config ip_config.txt --num_servers 2 --num_epochs ${EP} --batch_size 100000000000 --batch_size_eval 10000000000000 --log_every 1 --eval_every ${EP} --num_workers 1"
Training goes through as expected, but the run hits a problem when it calls evaluate on the val/test set.
...
Part 3 | Epoch 00004 | Step 00000 | Loss 4.5915 | Train Acc 0.0318 | Speed (samples/sec) 8171.7384 | GPU 0.0 MB | time 4.200 s
Part 3, Epoch Time(s): 7.8217, sample+data_copy: 3.5985, forward: 3.6194, backward: 0.5795, update: 0.0013, #seeds: 38357, #inputs: 181420
Total training time is 43.70542502403259
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
0it [00:00, ?it/s]
1it [00:53, 53.43s/it]
1it [00:54, 54.48s/it]
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
0it [00:00, ?it/s]
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
1it [00:52, 52.83s/it]
1it [00:54, 54.20s/it]
0it [00:00, ?it/s]
1it [00:08, 8.28s/it]
1it [00:08, 8.47s/it]
1it [00:08, 8.90s/it]
1it [00:08, 8.95s/it]
val tensor(324.) 5957
val tensor(1072.) 5958
val tensor(1068.) 5958
test tensor(2449.) 13926
Part 1, Val Acc 0.1793, Test Acc 0.1759, time: 63.5644
Machine (1) client (6) connect to server successfuly!
Using backend: pytorch
Machine (1) client (4) connect to server successfuly!
Using backend: pytorch
Machine (0) client (1) connect to server successfuly!
Using backend: pytorch
Machine (0) client (0) connect to server successfuly!
Using backend: pytorch
As you can see, it gets stuck on the val/test evaluate call. One partition (Part 1) makes it through, but the others appear to be either stuck or running for a very long time.
Any help resolving this would be appreciated; as it stands, I cannot get the final numbers for the val/test set.
Note that this does not seem to be an issue if I run on a single host machine.
EDIT: The single-machine case only works with 1 worker: running 2 workers on a single machine results in the same problem.
EDIT2: Never mind, it hangs even with 1 host and 1 worker.