Get Original Node IDs from subgraph

I am using the distributed training Python script as mentioned here. My understanding is that the line

train_nid = dgl.distributed.node_split(g.ndata['train_mask'])

gives the node IDs of all training nodes in the local partition (subgraph). However, these IDs are not the same as the node IDs of the original graph. I found this out by concatenating the train IDs from each worker machine during training and comparing them to the original train IDs created at partitioning time. The partitioning script is shown below:

import dgl
import torch as th
from ogb.nodeproppred import DglNodePropPredDataset

# Load ogbn-arxiv and attach the labels as a node feature.
data = DglNodePropPredDataset(name='ogbn-arxiv')
graph, labels = data[0]
labels = labels[:, 0]
graph.ndata['labels'] = labels

# Build boolean train/val/test masks from the standard OGB split.
splitted_idx = data.get_idx_split()
train_nid, val_nid, test_nid = splitted_idx['train'], splitted_idx['valid'], splitted_idx['test']
train_mask = th.zeros((graph.number_of_nodes(),), dtype=th.bool)
train_mask[train_nid] = True
val_mask = th.zeros((graph.number_of_nodes(),), dtype=th.bool)
val_mask[val_nid] = True
test_mask = th.zeros((graph.number_of_nodes(),), dtype=th.bool)
test_mask[test_nid] = True
graph.ndata['train_mask'] = train_mask
graph.ndata['val_mask'] = val_mask
graph.ndata['test_mask'] = test_mask

# Partition into 4 parts, balancing training nodes and edges across partitions.
dgl.distributed.partition_graph(graph, graph_name='ogbn-arxiv', num_parts=4,
                                out_path='arxiv-4',
                                balance_ntypes=graph.ndata['train_mask'],
                                balance_edges=True)
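
For reference, the check I ran looks roughly like the sketch below. The per-worker file names and the step that saves each worker's local train_nid are hypothetical; the point is just to collect the IDs from all workers and compare them against the original split from get_idx_split().

import torch as th

# Hypothetical: each worker saved its local train_nid during training,
# e.g. th.save(train_nid, f'train_nid_part{rank}.pt').
num_parts = 4
per_worker = [th.load(f'train_nid_part{i}.pt') for i in range(num_parts)]
collected = th.cat(per_worker)

# Original train IDs from get_idx_split() above.
orig = th.sort(train_nid).values
got = th.sort(collected).values
print(th.equal(orig, got))  # prints False, the per-worker IDs do not match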

Is there a way to also get the original train IDs during distributed training? My end goal is to check whether the train IDs from all the worker machines together add up to the original train ID tensor.

part_g.ndata[dgl.NID] stores the original node IDs from before partitioning.
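
If it helps, a saved partition can be inspected offline to see what dgl.NID holds. A minimal sketch, assuming the arxiv-4 output directory from the script above; load_partition returns several values (the partition graph, its features, the partition book, ...), and the exact tuple has changed across DGL versions, so only the first element is taken here.

import dgl

# Partition config written by partition_graph() above.
part_config = 'arxiv-4/ogbn-arxiv.json'

# Load partition 0; the first returned value is the local partition graph.
part_g = dgl.distributed.load_partition(part_config, 0)[0]

# Per-node ID mapping stored on the partition (see the follow-up below on
# whether these are the original or the reshuffled IDs).
print(part_g.ndata[dgl.NID])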

Doesn’t this give only the reshuffled IDs?

Yes, it only gives the reshuffled IDs.
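
If you need to translate the reshuffled IDs back to the original ones, one option is to keep the mapping produced at partitioning time. This is only a sketch and assumes your DGL version supports the return_mapping argument of partition_graph (I am not sure in which release it was added); worker_train_nid is a hypothetical tensor of train IDs collected from one worker.

import dgl
import torch as th

# Re-run the partitioning, additionally asking for the ID mapping.
# orig_nids[new_id] is the original node ID for reshuffled node ID new_id.
orig_nids, orig_eids = dgl.distributed.partition_graph(
    graph, graph_name='ogbn-arxiv', num_parts=4,
    out_path='arxiv-4',
    balance_ntypes=graph.ndata['train_mask'],
    balance_edges=True,
    return_mapping=True)

# Hypothetical: map one worker's reshuffled train IDs back to original IDs
# before comparing them with the original split.
worker_train_nid = th.load('train_nid_part0.pt')
original_train_ids = orig_nids[worker_train_nid]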
