Hi, I’m training a new gcn model with 1 v100 card and 64 CPU cores, the TOP command shows that the CPU sys consumes about 50% CPU cores:
top - 13:39:20 up 22:57, 5 users, load average: 61.43, 61.34, 62.37
Tasks: 32 total, 2 running, 30 sleeping, 0 stopped, 0 zombie
%Cpu(s): 50.3 us, 47.2 sy, 0.0 ni, 2.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
the strace command’s output is:
% time seconds usecs/call calls errors syscall
49.98 2.286199 1899 1204 1 futex
47.46 2.170858 35 62204 clock_gettime
2.56 0.116924 10 12146 sched_yield
0.01 0.000419 3 166 sched_setaffinity
0.00 0.000123 1 83 sched_getaffinity
100.00 4.574523 75803 1 total
and the gdb stack shows that both clock_gettime and futex are called by at::native::to,
and the pystack log is:
File “train_sampler.py”, line 198, in
main(dataset)
File “train_sampler.py”, line 145, in main
loss = model.train_region(nf, batch_nid)
File “model_sampler.py”, line 106, in train_region
h = self.gcn(nf)
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py”, line 532, in call
result = self.forward(*input, **kwargs)
File “model_sampler.py”, line 66, in forward
nf = self.layers[i](nf, i)
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py”, line 532, in call
result = self.forward(*input, **kwargs)
File “model_sampler.py”, line 43, in forward
nf.block_compute(i_layer, self.message_func, self.reduce_func)
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/nodeflow.py”, line 935, in block_compute
Runtime.run(prog)
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/runtime/runtime.py”, line 11, in run
exe.run()
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/runtime/ir/executor.py”, line 132, in run
udf_ret = fn_data(node_data, mail_data)
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/runtime/degree_bucketing.py”, line 157, in _rfunc_wrapper
return reduce_udf(nbatch)
File “model_sampler.py”, line 30, in reduce_func
h = nodes.data['h']
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/utils.py”, line 285, in getitem
return self._fn(key)
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/frame.py”, line 655, in
return utils.LazyDict(lambda key: self._frame[key][rows], keys=self.keys())
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/frame.py”, line 96, in getitem
user_idx = idx.tousertensor(F.context(self.data))
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/utils.py”, line 105, in tousertensor
self._user_tensor_data[ctx] = F.copy_to(data, ctx)
File “/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py”, line 95, in copy_to
return input.cuda()
File “”, line 1, in
File “”, line 1, in
So, any suggestions that can help me to investigate this problem? Thanks!