Problem with Dist GraphSAGE val/test evaluation after training

Here is how I run it. The very large batch size forces every epoch to run as a single batch (i.e., a hacky way to disable minibatching). Using `pwd` as the workspace works because this is an NFS filesystem. I am running on a supercomputing cluster, with the IP addresses of my reserved machines specified manually in the ip_config file. This run uses 2 machines.

export EP=5
python3 ../../../../tools/ \
--workspace `pwd` \
--num_trainers 2 \
--num_samplers 1 \
--num_servers 2 \
--ip_config ip_config.txt \
--part_config data/reddit.json \
"python3 --graph_name reddit --ip_config ip_config.txt --num_servers 2 --num_epochs ${EP} --batch_size 100000000000 --batch_size_eval 10000000000000 --log_every 1 --eval_every ${EP} --num_workers 1"
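To make the batch-size trick concrete: assuming the loader keeps the last partial batch, the number of batches per epoch is the seed count divided by the batch size, rounded up, so any batch size at or above the seed count yields exactly one batch. A minimal sketch in plain Python (the helper name is hypothetical and not part of DGL; 38357 is the per-partition #seeds value from the training log in this post):

```python
def batches_per_epoch(num_seeds, batch_size):
    """Ceiling division: how many batches cover num_seeds (hypothetical helper)."""
    return (num_seeds + batch_size - 1) // batch_size

# Seed count taken from the log in this post; batch sizes from the command above.
one_batch = batches_per_epoch(38357, 100000000000)
# A more typical batch size, chosen only for comparison.
many_batches = batches_per_epoch(38357, 1000)
print(one_batch, many_batches)
```

This matches the log, where each epoch only ever reports Step 00000.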

The training goes through as expected, but the execution runs into a problem when it calls evaluate on the val/test set.

Part 3 | Epoch 00004 | Step 00000 | Loss 4.5915 | Train Acc 0.0318 | Speed (samples/sec) 8171.7384 | GPU 0.0 MB | time 4.200 s
Part 3, Epoch Time(s): 7.8217, sample+data_copy: 3.5985, forward: 3.6194, backward: 0.5795, update: 0.0013, #seeds: 38357, #inputs: 181420
Total training time is 43.70542502403259
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
0it [00:00, ?it/s]
1it [00:53, 53.43s/it]
1it [00:54, 54.48s/it]
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
0it [00:00, ?it/s]
|V|=232965, eval batch size: 10000000000000
|V|=232965, eval batch size: 10000000000000
1it [00:52, 52.83s/it]
1it [00:54, 54.20s/it]
0it [00:00, ?it/s]
1it [00:08,  8.28s/it]
1it [00:08,  8.47s/it]
1it [00:08,  8.90s/it]
1it [00:08,  8.95s/it]
val tensor(324.) 5957
val tensor(1072.) 5958
val tensor(1068.) 5958
test tensor(2449.) 13926
Part 1, Val Acc 0.1793, Test Acc 0.1759, time: 63.5644
Machine (1) client (6) connect to server successfuly!
Using backend: pytorch
Machine (1) client (4) connect to server successfuly!
Using backend: pytorch
Machine (0) client (1) connect to server successfuly!
Using backend: pytorch
Machine (0) client (0) connect to server successfuly!
Using backend: pytorch

As you can see, it gets stuck on the validate/test evaluate call. It seems one partition is able to make it through, but the others are either stuck or running for a very long time.
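As a sanity check on the one partition that did finish, its reported accuracies can be recomputed from the raw counts in the log. This is a small sketch that assumes each "val tensor(x) n" or "test tensor(x) n" line prints (correct predictions, evaluated nodes); that reading is my assumption, not something the script output states:

```python
def accuracy(correct, total):
    """Fraction of correct predictions, formatted like the log (4 decimals)."""
    return f"{correct / total:.4f}"

# Counts copied from the log above; the pairing with Part 1 is inferred.
val_acc = accuracy(1068, 5958)
test_acc = accuracy(2449, 13926)
print(val_acc, test_acc)
```

Both values agree with the "Part 1, Val Acc 0.1793, Test Acc 0.1759" line, so the partition that completed appears to have evaluated correctly.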

Any help resolving this would be appreciated; as it stands, I cannot get the final numbers for the val/test set.

Note that this does not seem to be an issue if I run on a single host machine.
EDIT: It’s not an issue only if there is 1 worker: doing 2 workers on a single machine seems to result in the same problem.
EDIT2: Never mind, it hangs even for 1 host 1 worker.

Maybe some lock in multiprocessing?

How long did you wait? Evaluation usually takes much longer than training, because evaluation has to be done on the whole graph.
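A rough way to see why: sampled training reads at most a fixed number of neighbors per node, while full-graph evaluation reads every neighbor. The sketch below counts neighbor reads for one aggregation layer in plain Python. The degrees and the fanout of 10 are made-up illustrative values, not numbers from this run, and the helper is hypothetical rather than a DGL function:

```python
def neighbor_reads(degrees, fanout=None):
    """Neighbor reads for one layer; fanout=None means use all neighbors."""
    if fanout is None:
        return sum(degrees)
    return sum(min(d, fanout) for d in degrees)

degrees = [3, 50, 400, 1200]  # hypothetical in-degrees of four nodes
sampled_reads = neighbor_reads(degrees, fanout=10)  # training-style sampling
full_reads = neighbor_reads(degrees)                # evaluation-style, no sampling
print(sampled_reads, full_reads)
```

On a graph with many high-degree nodes the gap grows quickly, which is one reason a single evaluation pass can cost more than a training epoch. It does not by itself explain a run that never finishes.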

From what I recall, I have waited at least an hour, until the supercomputer I'm running on killed the job due to a timeout.

Does evaluating on the whole graph really cause runtime to go up that significantly? The single evaluation forward pass seems to take longer than the entire 200+ epochs of training.

Bumping this post again for visibility.

What is your hardware configuration (memory size, shared memory size, how you started the cluster, etc.)?

Also bringing in @zhengda1936 for this issue.

Is this the first time you have run evaluation?
We have noticed a bug in the distributed data loader when running it with newer versions of Python. This issue might be related: "Hanging in Distributed GNN training", Issue #2315 in dmlc/dgl on GitHub.
Can you try using 0 sampler workers and see how it goes?

I will try that. What are the implications of setting it to 0, though? Wouldn't that get rid of all samplers?

Using 0 samplers works. Thanks for the information.
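For anyone who lands here with the same hang, the change amounts to setting the sampler count to 0 in the launch command from the first post. The sketch below only rewrites the flag values in a token list; the helper is hypothetical, and changing --num_workers alongside --num_samplers is my assumption (the original command keeps the two equal), so please check it against your DGL version:

```python
def set_flag(tokens, flag, value):
    """Return a copy of tokens with the value following `flag` replaced."""
    out = list(tokens)
    idx = out.index(flag)  # raises ValueError if the flag is missing
    out[idx + 1] = str(value)
    return out

# Flags copied from the original command; paths and script names omitted.
launcher = ["--num_trainers", "2", "--num_samplers", "1", "--num_servers", "2"]
trainer = ["--num_servers", "2", "--num_workers", "1"]

launcher_fixed = set_flag(launcher, "--num_samplers", 0)
trainer_fixed = set_flag(trainer, "--num_workers", 0)  # assumed to mirror num_samplers
print(" ".join(launcher_fixed))
print(" ".join(trainer_fixed))
```

As far as I understand, 0 samplers does not remove sampling; it means no separate sampler processes are started and sampling runs inside each trainer process.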

While I have your attention, is there a way to disable sampling in distributed GraphSAGE? The test evaluation does not do sampling, so theoretically it should be doable for training as well, right?
