Best practices for loading external features during stochastic training

According to 6.1 Training GNN for Node Classification with Neighborhood Sampling — DGL 0.8 documentation, “The node features can be stored in either memory or external storage. Note that we only need to load the input nodes’ features, as opposed to load the features of all nodes as in full graph training.”

I’m still unsure how best to do this. From what I understand, we need to load and assign the features via blocks[0].srcdata. A naive approach could be as follows: if we have, say, 3,000,000 nodes, each with a unique feature, we store 3,000,000 files, one per unique ID. We store this unique ID as a node feature so we can access it, e.g. g.ndata["node_feature_id"]. Then during training, for each neighbourhood sample, we load the relevant features according to the node feature IDs.
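To make the idea concrete, here is a minimal sketch of that per-file scheme (the "node_feature_id" name, the directory layout, and the small sizes are my own assumptions for illustration, not anything DGL prescribes):

```python
# Hedged sketch of the naive one-file-per-node scheme described above.
import os
import tempfile
import numpy as np

feat_dim = 8
num_nodes = 10  # stand-in for the 3,000,000 nodes
root = tempfile.mkdtemp()

# One small file per node, named by its integer feature ID.
for nid in range(num_nodes):
    np.save(os.path.join(root, f"{nid}.npy"),
            np.random.rand(feat_dim).astype("float32"))

def load_batch_features(feature_ids):
    """Load and stack the features for one mini-batch's input nodes."""
    return np.stack([np.load(os.path.join(root, f"{int(i)}.npy"))
                     for i in feature_ids])

# During training you would take blocks[0].srcdata["node_feature_id"],
# pass it through this loader, and assign the result to blocks[0].srcdata["feat"].
input_ids = [3, 0, 7]          # e.g. the sampled input nodes' feature IDs
feats = load_batch_features(input_ids)
print(feats.shape)             # (3, 8)
```

This works, but as noted below, the cost of opening millions of tiny files per epoch is exactly what makes it unattractive.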

However, is this the only way? Storing that many small files does not seem very efficient, and it is inconvenient to keep track of every node ID and make sure each one corresponds to the correct graph. With my data it would be much more maintainable to use strings as node IDs, but ndata cannot store strings.

Or you can store the features as a single tensor with 3,000,000 rows in one file, and at load time read only the rows you need. HDF5 (or Zarr, if you want parallel access) does that well.
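A minimal sketch of that with h5py (the file path and the dataset name "feat" are assumptions; note that h5py's fancy indexing requires indices in increasing order, so the batch order has to be restored afterwards):

```python
# Partial loading of mini-batch rows from one HDF5 file with h5py.
import os
import tempfile
import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "features.h5")
num_nodes, feat_dim = 100, 16   # stand-in for 3,000,000 x feat_dim

# One-time preprocessing: write all node features into a single dataset.
with h5py.File(path, "w") as f:
    f.create_dataset("feat",
                     data=np.random.rand(num_nodes, feat_dim).astype("float32"))

def load_rows(path, node_ids):
    """Read only the requested rows from disk."""
    node_ids = np.asarray(node_ids)
    order = np.argsort(node_ids)
    with h5py.File(path, "r") as f:
        rows = f["feat"][node_ids[order]]   # h5py wants increasing indices
    out = np.empty_like(rows)
    out[order] = rows                       # restore the original batch order
    return out

batch = load_rows(path, [42, 7, 99])        # e.g. the sampled input node IDs
print(batch.shape)                          # (3, 16)
```

The result can then be assigned to blocks[0].srcdata as in the per-file version, but with a single open file instead of millions of small ones.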

For efficiency reasons we would recommend indexing node features by consecutive integers 0, 1, …, N-1. If you have string IDs, you can additionally keep a map from those integers to the strings.
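The bookkeeping can be as simple as a pair of dicts built once during preprocessing (the string IDs below are made up for illustration):

```python
# Consecutive integers 0..N-1 index the feature tensor and can live in ndata;
# these two dicts translate between them and the external string IDs.
string_ids = ["user/alice", "user/bob", "item/42"]

str_to_int = {s: i for i, s in enumerate(string_ids)}
int_to_str = {i: s for s, i in str_to_int.items()}

print(str_to_int["user/bob"])   # 1
print(int_to_str[2])            # item/42
```

The integer IDs then double as row indices into the single feature file, so no per-node filename bookkeeping is needed.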