Hello! I am new to DGL and DL.
I am trying to get embeddings with the unsupervised GraphSAGE example.
The code is generally clear to me, but I have run into some difficulties.
First, my inputs: I have a graph of social connections. For some people in the graph I have “features” and for others I don’t, but I do know their connections. I picked GraphSAGE because it lets me run inference on new, unseen nodes (added after training), which is important to me.
The full graph has nearly 27 million nodes with an average degree of ~100-150 edges. However, I have features for fewer than 500,000 of those nodes.
Q1: What should I do when a node has no input features at all? Fill it with some default values?
I extracted a subgraph of 206k nodes and 3.1 million edges from the big graph. Almost all of its nodes have meaningful features and “labels” for binary classification (0, 1). For the nodes without features, I filled the feature vector with the same constant value across the whole graph.
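To make the constant fill concrete, here is a minimal sketch in plain Python. The function name `fill_missing_features`, the toy data, and the extra "has features" flag column are all assumptions for illustration; the flag is one possible variation that lets a model tell real features from filled ones, not something the example script does.

```python
def fill_missing_features(features, num_nodes, dim, fill_value=0.0):
    """Build one feature row per node.

    `features` maps node id -> list of `dim` floats for nodes with known
    features. Nodes that are absent get `fill_value` in every position.
    A trailing flag (1.0 = real features, 0.0 = filled) is appended to each row.
    """
    rows = []
    for node in range(num_nodes):
        if node in features:
            rows.append(list(features[node]) + [1.0])
        else:
            rows.append([fill_value] * dim + [0.0])
    return rows


# Hypothetical toy input: 4 nodes, only nodes 0 and 2 have features.
known = {0: [0.5, 1.2], 2: [0.1, 0.3]}
x = fill_missing_features(known, num_nodes=4, dim=2)
for row in x:
    print(row)
```

The same idea carries over to a tensor-based feature matrix: allocate a matrix filled with the constant, write the known rows into it, and append the flag as an extra column.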
I ran “train_sampling_unsupervised.py”, but I am confused by the result: while the loss decreases, the LogReg results get worse with every epoch. The script is also quite slow: nearly 2 hours per epoch (3.1 million edges, batch_size = 512) on a GPU.
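For reference, the numbers above translate into a rough per-batch cost. This is a back-of-the-envelope sketch that assumes one batch per 512 edges and exactly 2 hours per epoch; the variable names are only for illustration.

```python
import math

num_edges = 3_100_000      # edges in the subgraph
batch_size = 512           # edges per batch
epoch_seconds = 2 * 3600   # "nearly 2 hours", taken as exactly 2 hours here

batches_per_epoch = math.ceil(num_edges / batch_size)
seconds_per_batch = epoch_seconds / batches_per_epoch

print(batches_per_epoch)            # 6055
print(round(seconds_per_batch, 2))  # 1.19
```

So each batch takes a bit over one second on average, which covers both preparing the batch and the forward/backward pass.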
Q2: How can I speed up the training process?
I tried using the GPU, but it seems that batch preparation (the enumerate(dataloader) step) is done on the CPU. Is that true? Selecting the source nodes and sampling the batch seems slow compared to the other parts of the script (excluding the evaluation/inference parts).
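One way to check where the time goes is to time the wait for each batch separately from the work done on it. The sketch below is generic Python, not DGL code: `timed_batches` and `slow_loader` are hypothetical helpers, and `slow_loader` only sleeps to stand in for CPU-side sampling.

```python
import time


def timed_batches(loader, stats):
    """Yield batches from `loader`, adding the time spent waiting for each
    batch to stats["load_s"] and counting batches in stats["batches"]."""
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        stats["load_s"] += time.perf_counter() - start
        stats["batches"] += 1
        yield batch


def slow_loader(n_batches, delay_s):
    """Stand-in for a real dataloader: sleeps to mimic batch preparation."""
    for i in range(n_batches):
        time.sleep(delay_s)
        yield i


stats = {"load_s": 0.0, "compute_s": 0.0, "batches": 0}
for batch in timed_batches(slow_loader(5, 0.02), stats):
    start = time.perf_counter()
    _ = sum(v * v for v in range(1000))  # stand-in for forward/backward pass
    # With a real model on a GPU, call torch.cuda.synchronize() before reading
    # the clock here, because CUDA kernels run asynchronously.
    stats["compute_s"] += time.perf_counter() - start

print(stats)
```

In the real script the same wrapper would go around the dataloader in the training loop; if `load_s` dominates `compute_s`, batch preparation is the bottleneck rather than the model.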
Q3: How can I speed up batch generation and negative sampling?
Any advice would be appreciated!