Unsupervised GraphSAGE explanation

Hello! I am new to DGL and DL.
I am trying to get embeddings with Unsupervised GraphSAGE example.

The code, in general, seems clear to me, but I am running into some difficulties.
First of all, my inputs: I have a graph of social connections. For some people in the graph I have “features”, and for some I don’t, but I do know their connections. I picked GraphSAGE because it gives me the possibility to do inference on new, unseen nodes (added after training), which is important to me.
My graph overall has nearly 27 million nodes with an average degree of roughly 100-150 edges. But the number of nodes that I have features for is quite limited: fewer than 500,000.
Q1: What to do in a situation when you don’t have any input features for a node? Fill them with some basic values?

I picked a subgraph from the big one with 206k nodes and 3.1 million edges. Most nodes have meaningful features and “labels” for binary classification (0, 1). The ones that don’t have features I filled with one and the same value across the whole graph.
I ran “train_sampling_unsupervised.py”, but I am confused by the result: while the loss is decreasing, the LogReg results become worse and worse with each epoch. The script itself runs pretty slowly: nearly 2 hours per epoch (3.1 million edges / batch_size = 512) on a GPU.
Q2: How to speed up the training process?

I tried to use the GPU, but it seems that batch preparation (the enumerate(dataloader) step) is done on the CPU. Is that true? It seems that source node determination + batch sampling is not a fast process compared to the other parts of the script (excluding the evaluation/inference parts).
Q3: How to speed up batch generation / negative sampling?

Any advice will be appreciated!

Q1: What to do in a situation when you don’t have any input features for a node? Fill them with some basic values?

Good question. I saw you filled the node features with the same value across the whole graph. I think this makes the job of GNNs more difficult. Ideally, we would associate a D-dimensional learnable vector with each node, but the number of parameters could be large in that case. A possible alternative would be assigning D-dimensional random features to nodes and, before message passing, projecting the features with a D×D affine transformation. Other approaches include using constant features but with another architecture that is able to distinguish nodes at different topological locations, such as Position-aware Graph Neural Networks.
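
Something like the following (a minimal sketch, assuming a PyTorch backend; the random graph and the dimension D = 128 are placeholders, not your data or the example script's code):

```python
# Minimal sketch: random node features plus a learnable D x D affine map
# applied before message passing. The graph below is only a stand-in.
import dgl
import torch
import torch.nn as nn

D = 128
g = dgl.rand_graph(1000, 5000)                    # placeholder DGLGraph

# Fixed D-dimensional random features for nodes that have no real features
g.ndata['feat'] = torch.randn(g.num_nodes(), D)

# Learnable D x D affine transformation applied before message passing
project = nn.Linear(D, D)
h = project(g.ndata['feat'])                      # h then goes into the GraphSAGE layers
```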

Q2: How to speed up the training process?

I think assigning the same input feature to every node may be why your model did not work so well. You could try assigning random features. You could also try fiddling with the number of layers, the learning rate, the number of negative examples, etc.
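
The number of layers and the hidden size, for instance, are just constructor arguments if you build the model with dgl.nn.SAGEConv yourself (a rough sketch; the values here are placeholders, not recommendations, and the learning rate / number of negatives live in the optimizer and the negative sampler respectively):

```python
# Rough sketch of a configurable GraphSAGE encoder built from dgl.nn.SAGEConv.
import torch.nn as nn
import torch.nn.functional as F
import dgl.nn as dglnn

class SAGE(nn.Module):
    def __init__(self, in_feats, hidden_feats=128, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([dglnn.SAGEConv(in_feats, hidden_feats, 'mean')])
        for _ in range(num_layers - 1):
            self.layers.append(dglnn.SAGEConv(hidden_feats, hidden_feats, 'mean'))

    def forward(self, blocks, x):
        # blocks: the message flow graphs produced by the neighbor sampler
        h = x
        for i, (layer, block) in enumerate(zip(self.layers, blocks)):
            h = layer(block, h)
            if i != len(self.layers) - 1:
                h = F.relu(h)
        return h
```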

Q3: How to speed up batch generation / negative sampling?

I would say that it is not that batch generation and negative sampling are slow. Rather, the computation of the GNN itself is very light, so other operations take a much larger proportion of the run time. In my experiments with node classification, I observed that the bottleneck is instead data transfer from CPU to GPU (yes, even multiprocess neighbor sampling is not the bottleneck, but please correct me if this is not your case). Although there are possible system-level optimization opportunities (e.g. caching frequently accessed nodes on the GPU), I’m not sure if we will incorporate them in the near future since they are system research efforts that all need extensive benchmarking.

For speeding up the batch generation alone, you can try multiprocessing by increasing the number of workers.
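
A hedged sketch with the dgl.dataloading API (an EdgeDataLoader with a uniform negative sampler, as in the unsupervised example; the graph, edge IDs, fanouts and worker count below are placeholders, and the exact class names may differ between DGL versions):

```python
# Multi-process edge sampling: num_workers moves sampling out of the main process.
import dgl
import torch

g = dgl.rand_graph(1000, 5000)                     # stand-in for your graph
train_eids = torch.arange(g.num_edges())

sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.dataloading.EdgeDataLoader(
    g, train_eids, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(1),
    batch_size=512,
    shuffle=True,
    num_workers=4,          # sample in 4 worker processes instead of the main process
)

for input_nodes, pos_graph, neg_graph, blocks in dataloader:
    pass                    # forward/backward pass goes here
```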

Hope that addresses your questions and please feel free to follow up!

“A possible alternative would be assigning D-dimensional random features to nodes and, before message passing, projecting the features with a D×D affine transformation. Other approaches include using constant features but with another architecture that is able to distinguish nodes at different topological locations, such as Position-aware Graph Neural Networks.”

@BarclayII Could you please provide any reference / justification for this ( on using random / constant values as node features)?

Random features are particularly useful in kernel methods. I’m not sure whether they are applicable to GNN training, though. Existing literature that deals with featureless nodes includes Position-aware Graph Neural Networks and Inductive Matrix Completion Based on Graph Neural Networks (https://arxiv.org/abs/1904.12058), which essentially use topology-based or motif-based features.

@Pleonix In the case of unsupervised GraphSAGE, I recommend trying node degrees. See the original GraphSAGE paper (https://arxiv.org/abs/1706.02216), page 2, the “Present work” section:

our approach can also make use of structural features that are present in all graphs (e.g., node degrees). Thus, our algorithm can also be applied to graphs without node features.

Node degrees can easily be calculated with the NetworkX library for Python by accessing the .degree property of a NetworkX graph object.
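
For example (a small sketch; the karate club graph is only a stand-in for your own data):

```python
# Use node degrees as a 1-dimensional input feature.
import networkx as nx
import torch

nxg = nx.karate_club_graph()                               # stand-in NetworkX graph
degrees = torch.tensor([d for _, d in nxg.degree()], dtype=torch.float32)
feat = degrees.unsqueeze(1)                                # (num_nodes, 1) feature matrix

# With a DGLGraph you can get the same thing directly:
# feat = g.in_degrees().float().unsqueeze(1)
```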
