Does DistDGL rebalance the partitions before training?

The number of nodes in each partition after partitioning does not match the number of nodes we see during training.

Which API implements this rebalancing, and is it possible not to use this feature?
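
For context, this is roughly how I partitioned the graph (paths are placeholders). My understanding is that the rebalancing is controlled by the balance_ntypes and balance_edges arguments of dgl.distributed.partition_graph, so leaving them at their defaults should skip it, but I am not sure:

```python
import torch
import dgl
from ogb.nodeproppred import DglNodePropPredDataset

# Load ogbn-products and attach a boolean train mask (placeholder setup).
data = DglNodePropPredDataset(name='ogbn-products')
g, _ = data[0]
split = data.get_idx_split()
train_mask = torch.zeros(g.num_nodes(), dtype=torch.bool)
train_mask[split['train']] = True
g.ndata['train_mask'] = train_mask

# Passing the train mask as balance_ntypes asks METIS to balance the
# number of training nodes across partitions; omitting the argument
# should leave the partitions unbalanced in that respect.
dgl.distributed.partition_graph(
    g, graph_name='ogb-product', num_parts=2, out_path='data',
    balance_ntypes=g.ndata['train_mask'],  # drop this line to skip balancing
    balance_edges=True)
```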

Here is the relevant output. First, splitting the ogb-product dataset into two partitions:

part 0 has 1488953 nodes and 1198163 are inside the partition
part 0 has 62170974 edges and 60484815 are inside the partition
part 1 has 1514106 nodes and 1250866 are inside the partition
part 1 has 64919624 edges and 63233465 are inside the partition

Then we start training and see:

ubuntu, part 1, train: 98307 (local: 98307), val: 19661 (local: 19621), test: 1106545 (local: 1106545)
ubuntu2, part 0, train: 98308 (local: 96028), val: 19662 (local: 19662), test: 1106546 (local: 1082433)

After rebalancing, the training, validation, and test set sizes on the two workers are very close, differing by only one node in each split. Yet the sums do not match the partition sizes:

part 1:
  local: 98307 + 19621 + 1106545 = 1224473 != 1250866 (nodes inside the partition)
  total: 98307 + 19661 + 1106545 = 1224513 != 1514106 (all nodes in part 1)
part 0:
  local: 96028 + 19662 + 1082433 = 1198123 != 1198163 (nodes inside the partition)
  total: 98308 + 19662 + 1106546 = 1224516 != 1488953 (all nodes in part 0)
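
A quick script that re-checks these sums against the counts reported by the partitioner (all numbers copied from the output above):

```python
# Sanity check: local and total train+val+test sums vs. partition counts.
parts = {
    'part 0': dict(local=(96028, 19662, 1082433), total=(98308, 19662, 1106546),
                   inner=1198163, all_nodes=1488953),
    'part 1': dict(local=(98307, 19621, 1106545), total=(98307, 19661, 1106545),
                   inner=1250866, all_nodes=1514106),
}
for name, p in parts.items():
    print(name, 'local sum', sum(p['local']), 'vs inner nodes', p['inner'])
    print(name, 'total sum', sum(p['total']), 'vs all nodes', p['all_nodes'])
```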

Can you post the code you are using to calculate the train, test, and val nodes? DGL's METIS implementation ensures that training nodes are local to a subgraph: every training node assigned to a subgraph is an inner node of that subgraph. So, with 98307 and 96028 local training nodes, the total number of training nodes in the complete graph is 98307 + 96028 = 194335. What I think is happening is that you are including the HALO nodes in your count of train, test, and val nodes. A HALO node in subgraph 1 may be a training node in subgraph 2, but it is not considered a training node in subgraph 1.
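
To make the distinction concrete, here is a sketch of counting only the inner (non-HALO) nodes of one partition. This assumes the on-disk layout written by partition_graph; the exact tuple returned by dgl.distributed.load_partition and the feature key names vary across DGL versions:

```python
import dgl

# Load one partition from the JSON config written by partition_graph.
# Newer DGL versions return extra items, so unpack defensively.
res = dgl.distributed.load_partition('data/ogb-product.json', part_id=1)
part_g, node_feats = res[0], res[1]

# 'inner_node' marks nodes owned by this partition; the rest are HALO nodes.
inner = part_g.ndata['inner_node'].bool()
print('total nodes:', part_g.num_nodes())   # inner + HALO
print('inner nodes:', int(inner.sum()))     # owned by this partition
print('HALO nodes:', int((~inner).sum()))

# Node features (including the train/val/test masks) are stored only for
# inner nodes, so counting a mask here cannot include HALO nodes. The key
# may be 'train_mask' or '_N/train_mask' depending on the DGL version.
key = 'train_mask' if 'train_mask' in node_feats else '_N/train_mask'
print('local training nodes:', int(node_feats[key].sum()))
```

If your count instead iterates over all nodes of the partition graph and indexes a full-graph mask with the global IDs in part_g.ndata[dgl.NID], the HALO nodes get counted too, which would explain the mismatch.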
