Hi there,
When I run DistDGL on mag240m on either 2 nodes or 4 nodes, with 2 trainers per node in both cases, I get the following error, which indicates that a message is larger than the queue. Any ideas or suggestions on how to fix this? Thank you.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:50:13] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:50:13] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:52:22] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:52:22] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:53:45] /opt/dgl/src/rpc/network/msg_queue.cc:28: Message is larger than the queue.
Traceback (most recent call last):
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/u/kunwu2/scratch/IGB-Datasets/benchmark/do_graphsage_node_classification.py", line 696, in <module>
    main(args)
  File "/u/kunwu2/scratch/IGB-Datasets/benchmark/do_graphsage_node_classification.py", line 498, in main
    dgl.distributed.initialize(args.ip_config)
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_context.py", line 278, in initialize
    serv.start()
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_graph.py", line 471, in start
    start_server(
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc_server.py", line 173, in start_server
    rpc.send_response(client_id, res, group_id)
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc.py", line 752, in send_response
    send_rpc_message(msg, get_client(client_id, group_id))
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc.py", line 1077, in send_rpc_message
    _CAPI_DGLRPCSendRPCMessage(msg, int(target))
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [08:53:45] /opt/dgl/src/rpc/network/socket_communicator.cc:123: Check failed: Send(ndarray_data_msg, recv_id) == 3400 (3401 vs. 3400) :
Stack trace:
[bt] (0) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7f890c914ab5]
[bt] (1) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dgl::network::SocketSender::Send(dgl::rpc::RPCMessage const&, int)+0x659) [0x7f890ce12419]
[bt] (2) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dgl::rpc::SendRPCMessage(dgl::rpc::RPCMessage const&, int)+0x1f) [0x7f890ce1d18f]
[bt] (3) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(+0x8a1145) [0x7f890ce23145]
[bt] (4) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f890cc903f8]
[bt] (5) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-x86_64-linux-gnu.so(+0x1a45f) [0x7f88d07b345f]
[bt] (6) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-x86_64-linux-gnu.so(+0x1acbf) [0x7f88d07b3cbf]
[bt] (7) python(_PyObject_MakeTpCall+0x2ec) [0x4f073c]
[bt] (8) python(_PyEval_EvalFrameDefault+0x4b5a) [0x4ec58a]