CPU sys too high

Hi, I'm training a new GCN model with one V100 card and 64 CPU cores. The top command shows that system CPU time (sy) takes about 50% of the CPU cores:

    top - 13:39:20 up 22:57,  5 users,  load average: 61.43, 61.34, 62.37
    Tasks:  32 total,   2 running,  30 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 50.3 us, 47.2 sy,  0.0 ni,  2.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

The strace command's summary output is:
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- -----------------
     49.98    2.286199        1899      1204         1 futex
     47.46    2.170858          35     62204           clock_gettime
      2.56    0.116924          10     12146           sched_yield
      0.01    0.000419           3       166           sched_setaffinity
      0.00    0.000123           1        83           sched_getaffinity
    ------ ----------- ----------- --------- --------- -----------------
    100.00    4.574523                 75803         1 total

The gdb stack shows that both clock_gettime and futex are called from at::native::to, and the pystack log is:

    File "train_sampler.py", line 198, in <module>
        main(dataset)
    File "train_sampler.py", line 145, in main
        loss = model.train_region(nf, batch_nid)
    File "model_sampler.py", line 106, in train_region
        h = self.gcn(nf)
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
    File "model_sampler.py", line 66, in forward
        nf = self.layers[i](nf, i)
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
    File "model_sampler.py", line 43, in forward
        nf.block_compute(i_layer, self.message_func, self.reduce_func)
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/nodeflow.py", line 935, in block_compute
        Runtime.run(prog)
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/runtime/runtime.py", line 11, in run
        exe.run()
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/runtime/ir/executor.py", line 132, in run
        udf_ret = fn_data(node_data, mail_data)
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/runtime/degree_bucketing.py", line 157, in _rfunc_wrapper
        return reduce_udf(nbatch)
    File "model_sampler.py", line 30, in reduce_func
        h = nodes.data['h']
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/utils.py", line 285, in __getitem__
        return self._fn(key)
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/frame.py", line 655, in <lambda>
        return utils.LazyDict(lambda key: self._frame[key][rows], keys=self.keys())
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/frame.py", line 96, in __getitem__
        user_idx = idx.tousertensor(F.context(self.data))
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/utils.py", line 105, in tousertensor
        self._user_tensor_data[ctx] = F.copy_to(data, ctx)
    File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 95, in copy_to
        return input.cuda()
    File "", line 1, in
    File "", line 1, in

So, any suggestions that could help me investigate this problem? Thanks!

Based on the strace output and your pystack log, it seems the overhead comes from copying tensors from CPU to GPU. But it's still weird that a CPU-to-GPU data copy consumes so much CPU; it's hard for me to think of a reason why there would be such high contention on the CPU. Is it right that you are using one process?
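
One quick way to narrow this down is to run a single training step under the autograd profiler and see which operators dominate CPU time. A minimal sketch, assuming the model / nf / batch_nid names from your stack trace:

    # Minimal sketch (not from the original post): profile one training step and
    # sort the operator summary by total CPU time.
    import torch

    with torch.autograd.profiler.profile() as prof:
        loss = model.train_region(nf, batch_nid)   # one step, names as in the stack trace
    print(prof.key_averages().table(sort_by="cpu_time_total"))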

Yes, I'm using one process.

The key code snippet of train_sampler.py is:

    sampler = dgl.contrib.sampling.NeighborSampler(
        g,
        batch_size * 3,
        5,
        2,
        seed_nodes=seed_nodes,
        num_workers=32)
    count = 0
    for batch, nf in zip(batches, sampler):
        nf.copy_from_parent(ctx=device)
        batch_nid = nf.map_from_parent_nid(-1, batch.reshape(-1), True)
        batch_nid = batch_nid.reshape(-1, 3).cuda()
        loss = model.train_region(nf, batch_nid)

The key code snippet of model_sampler.py is:

def reduce_func(self, nodes):
    h = nodes.data['h']
    # h = torch.matmul(h, )
    # h = self.W(h)
    m = nodes.mailbox['msg']
    m = m.sum(dim=1, keepdim=True)
    m = m / m.norm(dim=2, keepdim=True).clamp(min=1e-6)
    h = h.unsqueeze(1)
    h_new = torch.cat((m, h), 1).sum(dim=1)
    if self.activation:
        h_new = self.activation(h_new)
    return {'h': h_new / h_new.norm(dim=1, keepdim=True)}

def forward(self, nf, i_layer):
    nf.block_compute(i_layer, self.message_func, self.reduce_func)
    return nf

After some investigation, I'm wondering: is it because copy_from_parent is lazy, so that when the forward pass is called, it copies the node data to the GPU concurrently?
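
One way to check where the copy time actually lands is to time the copy and the forward pass separately with explicit synchronization. A rough sketch (not from my original code), reusing the variable names from the snippet above:

    # Rough diagnostic sketch: if copy_from_parent defers the actual host-to-device
    # transfer, the first timing stays tiny and the transfer cost shows up inside
    # train_region instead.
    import time
    import torch

    t0 = time.time()
    nf.copy_from_parent(ctx=device)
    torch.cuda.synchronize()          # make sure any launched transfers have finished
    t1 = time.time()
    loss = model.train_region(nf, batch_nid)
    torch.cuda.synchronize()
    t2 = time.time()
    print("copy_from_parent: %.3fs, train_region: %.3fs" % (t1 - t0, t2 - t1))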

Is my guess correct? If so, how can I avoid the concurrent copies?

Thanks!

Resolved this problem by calling torch.set_num_threads(8).
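
For anyone hitting the same symptom, the call just needs to happen once, before the training loop starts. A minimal sketch (8 is simply the value that worked here; the right number depends on the machine and workload):

    # Cap PyTorch's intra-op CPU thread pool before any training work starts.
    # With 64 cores the default pool seems to oversubscribe the CPU and the worker
    # threads end up spinning in futex/sched_yield; 8 threads removed that contention here.
    import torch

    torch.set_num_threads(8)   # tune this value for your own machine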


Glad you’ve resolved the issue. One thing to note is that the NodeFlow API will be deprecated in the upcoming 0.5 release. The new APIs can cover more sampling algorithms and are easier to use. Check out the examples here. API doc and user guide are on the way.
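
For a quick preview, here is a rough sketch of the 0.5-style sampling loop that replaces NodeFlow (class names such as MultiLayerNeighborSampler and NodeDataLoader come from the new dgl.dataloading module; see the examples and upcoming docs for the exact usage):

    # Rough sketch of the DGL 0.5 sampling API that replaces NodeFlow.
    # `g` is the full graph and `train_nids` the seed node IDs (seed_nodes in the
    # original script); both are assumed to exist already.
    import dgl
    import torch

    sampler = dgl.dataloading.MultiLayerNeighborSampler([5, 5])      # fanout per layer
    dataloader = dgl.dataloading.NodeDataLoader(
        g, train_nids, sampler,
        batch_size=1024, shuffle=True, drop_last=False, num_workers=4)

    device = torch.device('cuda')
    for input_nodes, output_nodes, blocks in dataloader:
        blocks = [b.to(device) for b in blocks]   # move the sampled blocks to GPU
        # forward/backward on `blocks` goes here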
