What's the best practice for training node embeddings on a big graph?

For a big graph, the node embeddings are too big to be fed to the GPU as model parameters. Instead, I set requires_grad of the node embeddings to True in the forward procedure of examples/pytorch/sampling/gcn_ns_sc.py. However, this way the PyTorch optimizer cannot capture these trainable parameters, and the node embeddings stay unchanged. A feasible alternative is to use a torch.nn.Embedding layer and treat the node features as indices into the embedding table (I sketch this idea below, after my forward code), but such an embedding table is again too big to fit into GPU memory. So I want to know if there is a simple way to solve this problem. Thanks!

def forward(self, nf):
    # Use the input node features of the NodeFlow as the first activations.
    nf.layers[0].data['activation'] = nf.layers[0].data['features']
    # My change: mark the input embeddings as trainable.
    nf.layers[0].data['activation'].requires_grad = True

    for i, layer in enumerate(self.layers):
        h = nf.layers[i].data.pop('activation')
        if self.dropout:
            h = self.dropout(h)
        nf.layers[i].data['h'] = h
        # Mean-aggregate neighbor messages, then apply the i-th layer.
        nf.block_compute(i,
                         fn.copy_src(src='h', out='m'),
                         lambda node: {'h': node.mailbox['m'].mean(dim=1)},
                         layer)

    h = nf.layers[-1].data.pop('activation')
    return h
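
For reference, the torch.nn.Embedding idea I mean looks roughly like this (just a sketch; num_nodes and embed_dim are placeholders). The whole table becomes a module parameter that would have to be moved to the GPU:

import torch.nn as nn

class NodeEmbedLookup(nn.Module):
    # Sketch: learnable node embeddings addressed by node ID.
    def __init__(self, num_nodes, embed_dim):
        super().__init__()
        # One row per node; for a graph with many millions of nodes this
        # single table already exceeds GPU memory once the module is moved there.
        self.embed = nn.Embedding(num_nodes, embed_dim)

    def forward(self, nids):
        return self.embed(nids)   # (batch,) -> (batch, embed_dim)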

How about using mini-batch training (e.g. with NodeFlow) and only moving the node embeddings to the GPU when a mini-batch is sampled for training?
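
Something like the following rough sketch (num_nodes, embed_dim, sampler and model are placeholders; the NodeFlow calls assume the neighbor-sampling API):

import torch

# The full table stays in CPU memory; only a slice moves per mini-batch.
all_embed = torch.randn(num_nodes, embed_dim)   # placeholder CPU table

for nf in sampler:                        # NodeFlow mini-batches from a neighbor sampler
    input_nids = nf.layer_parent_nid(0)   # parent-graph IDs of the input layer
    batch_embed = all_embed[input_nids].to('cuda')   # only these rows go to the GPU
    nf.layers[0].data['features'] = batch_embed
    logits = model(nf)
    # ... compute loss, backprop, step ...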

Thanks for your reply. It seems that mini-batch training alone cannot solve this problem when the node embeddings themselves need to be updated. With mini-batch training, the optimizer cannot update the node embeddings correctly, because many optimization algorithms (e.g. Adam) need the history of a tensor's gradients as well as the tensor itself. The core of the problem is that the graph store only stores the node embeddings; it cannot update them the way a parameter server does.
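
To make this concrete, here is a rough sketch with placeholder names (store, nids, model) of what happens if I only move a slice to the GPU per batch:

import torch

store = torch.randn(num_nodes, dim)               # graph store: plain CPU tensor, no optimizer
batch = store[nids].to('cuda').requires_grad_()   # detached GPU copy of the sampled rows
opt = torch.optim.Adam([batch], lr=0.01)          # Adam state is tied to this temporary copy

loss = model(batch).sum()
loss.backward()
opt.step()                                        # updates only the GPU copy
store[nids] = batch.detach().cpu()                # has to be written back by hand, and Adam's
                                                  # momentum/variance is thrown away next batch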

Hi, please see our example in apps/kg. For now, we have a mixed GPU-CPU implementation for big graphs, where we store the node embeddings on CPU and calculate the gradients on GPU. We also have a distributed kvstore, which can train very large graphs in a distributed manner across machines. We will release the distributed code example in the 0.5 version. Thanks!
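
Roughly, the mixed GPU-CPU pattern looks like the sketch below (not the actual apps/kg code; the class and names are made up for illustration): keep the table on CPU, pull the rows a mini-batch needs onto the GPU, then push the gradients back and apply a sparse update on CPU.

import torch

class CPUEmbedding:
    # Sketch of a CPU-resident embedding table with sparse SGD updates.
    def __init__(self, num_nodes, dim, lr=0.01):
        self.weight = torch.empty(num_nodes, dim).uniform_(-0.05, 0.05)
        self.lr = lr

    def pull(self, nids, device):
        # Gather the rows needed by this mini-batch and move them to the GPU.
        emb = self.weight[nids].to(device)
        emb.requires_grad_()
        return emb

    def push(self, nids, emb):
        # Bring the gradients back to CPU and apply a sparse update in place.
        grad = emb.grad.cpu()
        self.weight.index_add_(0, nids, -self.lr * grad)

# Usage inside the training loop (nids: parent-graph IDs of the batch):
# emb = table.pull(nids, 'cuda'); loss = f(emb); loss.backward(); table.push(nids, emb)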


Thanks! I’ll have a look at the example in apps/kg. Hoping to see the 0.5 release soon.