The inference process is synchronous when using PyTorch as the backend

In PyTorch, I run inference with y = model(x), where x is a PyTorch Tensor and model is a DNN model. The code is as follows.

import time

import torch
import torch.nn.functional as F

class FCModel(torch.nn.Module):
    def __init__(self, in_dim, n_hidden_1, n_hidden_2, out_dim):
        super(FCModel, self).__init__()
        self.l1 = torch.nn.Linear(in_dim, n_hidden_1)
        self.l2 = torch.nn.Linear(n_hidden_1, n_hidden_2)
        self.l3 = torch.nn.Linear(n_hidden_2, out_dim)

    def forward(self, x):
        out = F.relu(self.l1(x))
        out = F.relu(self.l2(out))
        out = self.l3(out)
        return out

model = FCModel(4096, 2048, 1024, 22)
model.cuda()
...
x, label = next(train_loader)
x = x.cuda()  # Tensor.cuda() is not in-place; the result must be reassigned
t0 = time.time()
y = model(x)
#torch.cuda.synchronize()
print(time.time()-t0)

Here the inference is asynchronous: when torch.cuda.synchronize() is commented out, the printed time is very small because model(x) only queues the CUDA kernels, which keep running in the background after time.time() returns.
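For reference, here is a minimal sketch of how the same forward pass could be timed with CUDA events instead of time.time(), so the measurement covers the actual GPU execution rather than just the kernel launch (this reuses model and x from the snippet above and the standard torch.cuda.Event API):

# Sketch: event-based timing measures GPU execution, not just launch overhead.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = model(x)
end.record()

# Block until all queued kernels have finished before reading the timing.
torch.cuda.synchronize()
print(start.elapsed_time(end))  # elapsed time in milliseconds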

When I use DGL with PyTorch as the backend, the code is as follows.

g = dgl.contrib.graph_store.create_graph_from_store(
        args.dataset, "shared_mem")
train_loader = iter(dgl.contrib.sampling.NeighborSampler(g, args.batch_size,
                                                         args.num_neighbors,
                                                         neighbor_type='in',
                                                         shuffle=True,
                                                         num_workers=16,
                                                         num_hops=args.n_layers+1,
                                                         seed_nodes=train_nid,
                                                         prefetch=False))
model = GCNSampling(in_feats,
                    args.n_hidden,
                    n_classes,
                    args.n_layers,
                    F.relu,
                    args.dropout)
model.cuda()
...
nf, label = next(train_loader)  # type(nf) is NodeFlow
nf.copy_from_parent()
t0 = time.time()
y = model(nf)
#torch.cuda.synchronize()
print(time.time()-t0)

Here the printed time stays the same whether or not torch.cuda.synchronize() is commented out, which suggests the inference is synchronous, whereas I would like it to be asynchronous as well.
Is this because DGL adds its own synchronization step during inference?
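In case it helps with the diagnosis, here is a minimal sketch of how the forward pass could be profiled to see where time is actually spent on the GPU (this reuses model and nf from the snippet above and the standard torch.autograd.profiler API; it is only a debugging aid, not a fix):

# Sketch: profile one forward pass; use_cuda=True also records kernel times,
# which can help reveal whether some op forces a CPU/GPU synchronization.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    y = model(nf)
print(prof.key_averages().table(sort_by="cuda_time_total"))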

Hi, I cannot find your code. Could you please post it again?

Thank you for your reply. The code has been updated.