According to 6.1 Training GNN for Node Classification with Neighborhood Sampling — DGL 0.8 documentation, “The node features can be stored in either memory or external storage. Note that we only need to load the input nodes’ features, as opposed to load the features of all nodes as in full graph training.”
I’m still unsure how best to do this. From what I understand, we need to load and assign the features via
blocks.srcdata. A naive way of doing this could be as follows: if we have, say, 3,000,000 nodes, each with a unique feature, we store 3,000,000 files, one file per node, named by a unique ID. We store this unique ID as a node feature that we can access, e.g.
g.ndata["node_feature_id"]. Then during training, for each neighbourhood sample, we load the relevant features according to the node feature ID.
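To make the naive scheme concrete, here is a minimal sketch of what I mean, using one .npy file per node. The directory layout and helper names are my own assumptions, not DGL API; in the actual training loop the IDs passed to the loader would come from the block's source nodes, e.g. blocks[0].srcdata[dgl.NID].

```python
import os
import tempfile

import numpy as np

def save_node_features(feature_dir, features):
    # Write features[i] to <feature_dir>/<i>.npy, one small file per node
    # (this is the part that seems inefficient for millions of nodes).
    for nid, feat in enumerate(features):
        np.save(os.path.join(feature_dir, f"{nid}.npy"), feat)

def load_batch_features(feature_dir, input_node_ids):
    # Load only the sampled input nodes' features, not the whole feature
    # matrix. With DGL, input_node_ids would be the original node IDs of
    # the first block's source nodes, e.g. blocks[0].srcdata[dgl.NID].
    return np.stack(
        [np.load(os.path.join(feature_dir, f"{int(nid)}.npy"))
         for nid in input_node_ids]
    )

# Tiny demo with 5 nodes and 4-dimensional features.
with tempfile.TemporaryDirectory() as d:
    all_feats = np.arange(20, dtype=np.float32).reshape(5, 4)
    save_node_features(d, all_feats)
    batch = load_batch_features(d, [3, 0])  # a hypothetical sampled batch
    assert np.allclose(batch, all_feats[[3, 0]])
```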
However, is this the only way of doing it? It does not seem very efficient to store so many small files, and it is also inconvenient to keep track of each node ID and make sure it corresponds to the correct graph. With my data, it would be much more maintainable to store the node IDs as strings, but we cannot store strings in ndata.
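For the string-ID issue, the workaround I can think of is to keep the string IDs outside the graph, in a plain list (or a saved JSON/.npy file) indexed by the integer node ID that DGL uses, so the integer ID doubles as the feature-file key. A small sketch with made-up names:

```python
# Hypothetical string node IDs; ndata cannot hold these, so keep them in a
# side structure indexed by the integer node ID.
string_ids = ["protein_A", "protein_B", "protein_C"]

# integer node ID -> string ID (just list indexing)
name = string_ids[1]  # "protein_B"

# string ID -> integer node ID, built once during preprocessing
id_of = {s: i for i, s in enumerate(string_ids)}
assert id_of["protein_C"] == 2
```

This still leaves the bookkeeping problem of keeping the side structure in sync with the graph, which is exactly what I would like to avoid.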