I am using the distributed training Python script mentioned here. My understanding is that the line
train_nid = dgl.distributed.node_split(g.ndata['train_mask'])
gives the node IDs of all training nodes in the local partition. However, these IDs do not match the node IDs of the original graph. I found this out when, during training, I concatenated the train IDs from each worker machine and compared them with the original train IDs created in the partitioning script shown below:
import dgl
import torch as th
from ogb.nodeproppred import DglNodePropPredDataset
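
# Load ogbn-arxiv and attach the labels as a node feature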
data = DglNodePropPredDataset(name='ogbn-arxiv')
graph, labels = data[0]
labels = labels[:, 0]
graph.ndata['labels'] = labels
splitted_idx = data.get_idx_split()
train_nid, val_nid, test_nid = splitted_idx['train'], splitted_idx['valid'], splitted_idx['test']
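
# Convert the split indices into boolean masks over all nodes and attach them to the graph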
train_mask = th.zeros((graph.number_of_nodes(),), dtype=th.bool)
train_mask[train_nid] = True
val_mask = th.zeros((graph.number_of_nodes(),), dtype=th.bool)
val_mask[val_nid] = True
test_mask = th.zeros((graph.number_of_nodes(),), dtype=th.bool)
test_mask[test_nid] = True
graph.ndata['train_mask'] = train_mask
graph.ndata['val_mask'] = val_mask
graph.ndata['test_mask'] = test_mask
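
# Partition into 4 parts; balancing on the train mask spreads training nodes evenly across partitions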
dgl.distributed.partition_graph(graph, graph_name='ogbn-arxiv', num_parts=4,
                                out_path='arxiv-4',
                                balance_ntypes=graph.ndata['train_mask'],
                                balance_edges=True)
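For context, the per-worker check I ran looks roughly like the sketch below. It only illustrates the idea: the DistGraph setup, the ip_config/part_config paths, and the torch.distributed process group all come from the standard DGL launch script, so those details are assumed here rather than shown.

import dgl
import torch as th
import torch.distributed as dist

# Sketch of the per-worker comparison; assumes dgl.distributed.initialize()
# and dist.init_process_group() have already been called by the launch script.
g = dgl.distributed.DistGraph('ogbn-arxiv')
train_nid = dgl.distributed.node_split(g.ndata['train_mask'])

# Gather every worker's local train IDs; all_gather_object handles the fact
# that each worker may own a different number of training nodes.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, train_nid)

if dist.get_rank() == 0:
    all_train = th.cat([th.as_tensor(ids) for ids in gathered])
    # Sorted and concatenated, these IDs do not match the train IDs returned
    # by get_idx_split() in the partitioning script above.
    print(th.sort(all_train).values)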
Is there a way to obtain the original train IDs during distributed training as well? My end goal is to verify that the train IDs on the worker machines add up to the original train ID tensor.
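Concretely, the check I want to end up with is something like the following, where original_train_ids is a placeholder for the gathered per-worker train IDs mapped back into the original graph's ID space (however that mapping is obtained):

# original_train_ids is hypothetical: the concatenated per-worker train IDs,
# translated back to original ogbn-arxiv node IDs.
assert th.equal(th.sort(original_train_ids).values,
                th.sort(train_nid).values)  # train_nid from get_idx_split()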