Hello, how do I do mini-batch training on a single graph?

About mini-batch training:
I would like to ask how to do mini-batch training. If I compute the embeddings of all nodes on the whole graph and then compute the loss in batches using only part of the nodes, is that correct? It feels a little unreasonable to me.
The pseudo-code is as follows. The main call is the model forward over the whole graph:

for epoch in range(MAX_EPOCHS):
    for mini_batch_id in batches:
        # get the embeddings of all nodes on the whole graph
        all_embedding = model(whole_graph, all_node_id, all_edge_type, all_edge_norm)

        # compute the loss of this batch on a subset of nodes only
        loss_every_batch = model.calc_loss(all_embedding, mini_batch_id)
        optimizer.zero_grad()
        loss_every_batch.backward()
        optimizer.step()

Thank you!!!


Mini-batching on graphs can be done using the method dgl.batch. The following tutorial is a good starting point for understanding how it works.
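
For reference, a minimal sketch of what dgl.batch does; the two toy graphs here are made up, and this assumes a DGL version that has the dgl.graph constructor:

import dgl
import torch

g1 = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])))        # 3-node graph
g2 = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 3])))  # 4-node graph

bg = dgl.batch([g1, g2])         # one graph holding both as disconnected components
print(bg.batch_size)             # 2
print(bg.number_of_nodes())      # 7

Since this merges several graphs into one batched graph, it mainly helps when you train on many small graphs at once.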


Thank you for your reply!
I just have one graph, not many graphs.

Okay,
Do you mean that the graph structure is fixed but the feature matrix is changing?

No, just like the code I posted: compute the embedding representations of only some nodes in each batch, as if sampling those nodes. But I see that some people still compute the embeddings of all nodes in each batch, which puzzles me, and I am not sure whether the code above is correct.

Bumping this thread :pray:

This depends on whether you are performing full-graph training or mini-batch training. In full-graph training, you update the representations of all nodes by performing message passing over the full graph simultaneously. In mini-batch training, you update the representations of nodes by performing message passing on a subgraph only, and the loss is also computed only on a subset of nodes.
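
A minimal, self-contained sketch of the difference; the toy graph, the features, and the use of SAGEConv here are my own assumptions, just to make the two regimes concrete:

import dgl
import torch
import torch.nn.functional as F
from dgl.nn import SAGEConv

g = dgl.rand_graph(100, 500)          # random toy graph: 100 nodes, 500 edges
feat = torch.randn(100, 16)
labels = torch.randint(0, 3, (100,))
conv = SAGEConv(16, 3, 'mean')

# Full-graph training: message passing over every edge, loss over all nodes.
full_loss = F.cross_entropy(conv(g, feat), labels)

# Mini-batch training: message passing only on a sampled subgraph around the
# seed nodes, loss only on those seed nodes.
seeds = torch.tensor([0, 1, 2, 3])
sub_g = dgl.sampling.sample_neighbors(g, seeds, 10)   # node IDs are preserved
mini_loss = F.cross_entropy(conv(sub_g, feat)[seeds], labels[seeds])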


In most cases we don't compute the loss over the entire node-embedding tensor in mini-batch training. Below is pseudo-code showing how to achieve this: it first samples a subgraph and then extracts the embeddings needed for the loss computation.

for epoch in range(MAX_EPOCHS):
    for batch_id in range(NUM_BATCHES):
        batch_node = all_node_id[batch_id * BATCH_SIZE : (batch_id + 1) * BATCH_SIZE]
        # sample a subgraph from the whole graph around the batch nodes
        mini_batch = sample(whole_graph, batch_node)
        # get only the embeddings needed by this batch
        batch_embedding = extract(whole_graph, all_embedding, mini_batch)

        # compute the loss on this batch only
        loss_every_batch = model.calc_loss(batch_embedding)
        optimizer.zero_grad()
        loss_every_batch.backward()
        optimizer.step()
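
For concreteness, one possible way to fill in the sample and extract placeholders is with dgl.sampling.sample_neighbors; the names whole_graph, all_node_id, model, and calc_loss are carried over from the pseudo-code above, and node_feat is an assumed input-feature tensor, so this is a sketch rather than the exact code used in the repo examples:

import dgl

for epoch in range(MAX_EPOCHS):
    for batch_id in range(NUM_BATCHES):
        batch_node = all_node_id[batch_id * BATCH_SIZE : (batch_id + 1) * BATCH_SIZE]

        # "sample": keep only a bounded number of in-edges around the batch nodes
        mini_batch = dgl.sampling.sample_neighbors(whole_graph, batch_node, 10)

        # "extract": run the model on the sampled subgraph and take the rows of
        # the batch nodes (sample_neighbors preserves the original node IDs)
        all_embedding = model(mini_batch, node_feat)
        batch_embedding = all_embedding[batch_node]

        loss_every_batch = model.calc_loss(batch_embedding)
        optimizer.zero_grad()
        loss_every_batch.backward()
        optimizer.step()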

See complete examples in our repo. They all follow the above training methodology.


The code seems to use some APIs that do not have documentation to learn from yet.
Can all of the existing nn modules be used directly in mini-batch training? They do not seem to be designed for mini-batches.

The training data in the code above is indeed taken in batches. However, in each batch the embeddings of all nodes are computed, and only a subset of the nodes is used to compute the loss for that batch.
In other words, in each batch the aggregation operation is performed on the entire graph, and only a subset of the nodes is used when computing the loss.

Isn’t that okay?
In this case, the parameters of all nodes are updated when backpropagating.

The API doc will be online soon. Sorry for the delay. Here is a hands-on tutorial we are preparing for WWW'20. It covers the concepts and usage of the new user experience for mini-batch sampling: https://github.com/dglai/WWW20-Hands-on-Tutorial/blob/master/large_graphs/large_graphs.ipynb

They can! In fact, this is one of the goals of the whole new sampler API design. You can see here that we directly use the dgl.nn.SAGEConv module on sampled graphs.
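
For illustration, a hedged sketch of that pattern; the toy graph and feature sizes are made up, and it assumes the sampling/block APIs dgl.sampling.sample_neighbors and dgl.to_block:

import dgl
import torch
from dgl.nn import SAGEConv

g = dgl.rand_graph(50, 200)
feat = torch.randn(50, 16)
conv = SAGEConv(16, 8, 'mean')          # an ordinary dgl.nn module

seeds = torch.tensor([0, 1, 2])
frontier = dgl.sampling.sample_neighbors(g, seeds, 5)   # sampled in-edges of the seeds
block = dgl.to_block(frontier, seeds)                   # bipartite block for one layer

h_src = feat[block.srcdata[dgl.NID]]    # features of sampled neighbors (plus the seeds)
h_dst = feat[block.dstdata[dgl.NID]]    # features of the seeds themselves
h = conv(block, (h_src, h_dst))         # same module, applied to the sampled graph

The same conv module could also be called on the full graph as conv(g, feat), which is the point of being able to reuse it.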


@mufeili Although the embedding representations of the entire graph are computed, only a part of the nodes are used. It seems that more parameters will be updated, which would be slower than sampling only the nodes that are needed?

That means you have some unnecessary computation, which will be slower.

But that's not wrong, right?
I have been struggling with mini-batch training; this is a compromise.

Did you check the tutorial @minjie posted?


I have checked that tutorial and have a general understanding of how mini-batching is used. There are some APIs, such as in_subgraph, that I do not understand well.
I am working on a complex recommendation-system task, and sampling is not easy to implement for it.

@mufeili


I don't fully get your question. Are you asking whether the entire embedding is being updated during backpropagation? The answer depends on how the backend framework (such as PyTorch) implements the gradient operation of an embedding lookup. If implemented correctly, only the node embeddings that are used during forward propagation are updated. I need to check whether this is the case for torch.nn.Embedding. If it unfortunately is not, you will need to implement the gradient update manually.
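
For reference, here is a quick way to check this behaviour for torch.nn.Embedding: rows that are not looked up in the forward pass receive zero gradient (whether the optimizer then leaves them untouched additionally depends on settings such as weight decay or momentum):

import torch

emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)
idx = torch.tensor([1, 3])        # only rows 1 and 3 are looked up

emb(idx).sum().backward()
print(emb.weight.grad)            # rows 0, 2 and 4 are all zeros

# With sparse=True, emb.weight.grad is a sparse tensor touching only rows 1 and 3,
# which optimizers such as torch.optim.SparseAdam can use to update only those rows.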

Thank you for your patient explanation. The issue you mention is indeed one aspect of it.
My main question is whether, in each batch, I can perform several layers of aggregation over all nodes to obtain representations for every node (ideally I should sample according to the neighbors each node depends on, but sampling is awkward for my task), and then, when computing the loss for each batch, use only the representations of the nodes involved in that batch for the loss and backpropagation.
I am mainly unsure whether this is correct, because I have recently seen people doing it this way.
It would be ideal if the update did not touch nodes that nothing depends on; in that case, computing over the entire graph actually does not seem to be much of a problem.

That is technically doable and correct, but I want to point out that it is equivalent to performing aggregation on the full neighborhood, which is exactly what g.in_subgraph is for.
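
For example, a minimal sketch of that pattern using the function form dgl.in_subgraph; the toy graph, features, and use of SAGEConv are made up for illustration:

import dgl
import torch
from dgl.nn import SAGEConv

g = dgl.rand_graph(100, 500)
feat = torch.randn(100, 16)
conv = SAGEConv(16, 8, 'mean')

seeds = torch.tensor([0, 5, 7])
# keep all in-edges of the seed nodes: aggregation on this subgraph equals
# one layer of aggregation over their full neighborhood in the original graph
sub_g = dgl.in_subgraph(g, seeds)

h = conv(sub_g, feat)          # node IDs are preserved, so rows line up
batch_embedding = h[seeds]     # only these rows would enter the loss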
