DistDGL Message Too Large Error When Running Large-Scale Graphs

Hi there,

When I run DistDGL on mag240m, either on 2 nodes with 2 trainers each or on 4 nodes with 2 trainers each, I get the following error indicating that a message is too large. Any idea or suggestion on how to fix this? Thank you.

[08:48:21] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:50:13] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:50:13] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:52:22] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:52:22] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:53:45] /opt/dgl/src/rpc/network/msg_queue.cc:28: Message is larger than the queue.
Traceback (most recent call last):
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/u/kunwu2/scratch/IGB-Datasets/benchmark/do_graphsage_node_classification.py", line 696, in <module>
  File "/u/kunwu2/scratch/IGB-Datasets/benchmark/do_graphsage_node_classification.py", line 498, in main
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_context.py", line 278, in initialize
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_graph.py", line 471, in start
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc_server.py", line 173, in start_server
    rpc.send_response(client_id, res, group_id)
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc.py", line 752, in send_response
    send_rpc_message(msg, get_client(client_id, group_id))
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc.py", line 1077, in send_rpc_message
    _CAPI_DGLRPCSendRPCMessage(msg, int(target))
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [08:53:45] /opt/dgl/src/rpc/network/socket_communicator.cc:123: Check failed: Send(ndarray_data_msg, recv_id) == 3400 (3401 vs. 3400) : 
Stack trace:
  [bt] (0) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7f890c914ab5]
  [bt] (1) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dgl::network::SocketSender::Send(dgl::rpc::RPCMessage const&, int)+0x659) [0x7f890ce12419]
  [bt] (2) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dgl::rpc::SendRPCMessage(dgl::rpc::RPCMessage const&, int)+0x1f) [0x7f890ce1d18f]
  [bt] (3) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(+0x8a1145) [0x7f890ce23145]
  [bt] (4) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f890cc903f8]
  [bt] (5) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-x86_64-linux-gnu.so(+0x1a45f) [0x7f88d07b345f]
  [bt] (6) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-x86_64-linux-gnu.so(+0x1acbf) [0x7f88d07b3cbf]
  [bt] (7) python(_PyObject_MakeTpCall+0x2ec) [0x4f073c]
  [bt] (8) python(_PyEval_EvalFrameDefault+0x4b5a) [0x4ec58a]
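For context, the sketch below illustrates the kind of invariant behind the "Message is larger than the queue." log line from `msg_queue.cc`. This is not DGL's actual implementation, just a minimal Python mimic of a bounded byte-budget queue: a message whose payload exceeds the queue's total capacity can never be enqueued, no matter how empty the queue is, so the sender fails immediately.

```python
class MessageQueue:
    """Illustrative bounded byte-budget queue (not DGL's real MessageQueue)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes  # fixed at construction time
        self.used = 0
        self.messages = []

    def add(self, payload: bytes):
        # A payload larger than the whole queue can never fit, even when the
        # queue is empty; this mirrors the fatal check behind the log line
        # "Message is larger than the queue."
        if len(payload) > self.capacity:
            raise ValueError("Message is larger than the queue.")
        # (A real implementation would block here until enough space frees up.)
        self.used += len(payload)
        self.messages.append(payload)


queue = MessageQueue(capacity_bytes=8)
queue.add(b"ok")  # fits: 2 bytes within an 8-byte budget
try:
    queue.add(b"x" * 16)  # larger than the entire queue: rejected outright
except ValueError as err:
    print(err)
```

Under this reading, the failure is not about the queue being momentarily full but about a single RPC message exceeding the queue's fixed capacity, which is why retrying does not help.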

Just to clarify.

  • Do you run DistDGL on 2 machines or 4 machines?
  • When you say “each with 2 trainers”, do you mean each machine has 2 trainers? Are they using different GPUs?

Also, could you provide steps to reproduce the error?

Hi Minjie,

Thank you for your reply!

I have tried two configurations. In the first configuration, I am running DistDGL on 2 machines. Each machine has 2 trainers. Each trainer corresponds to 1 A100 GPU.

After realizing this configuration failed, I changed the number of machines to 4. In this second configuration, each machine still has 2 trainers, and each trainer corresponds to 1 A100 GPU.

The script I used is at https://github.com/K-Wu/IGB-Datasets/blob/main/benchmark/heterogeneous_version/do_dist_training_mag240m_2_2.slurm and https://github.com/K-Wu/IGB-Datasets/blob/main/benchmark/heterogeneous_version/do_dist_training_mag240m_4_2.slurm.

If you want to reproduce this issue on a local cluster instead of a slurm-managed cluster, I can work with you to produce a working reproduction script.

Thank you again.

Best Regards,

Does the problem still exist if you switch to DGL’s default launcher?

Hi Minjie,

Thank you for checking my code. The only change we made in our custom launcher is replacing the ssh <ip_address> command with ssh <node_name>: that is the way that works to log in and execute remote commands on the slurm-managed cluster. Because this is the only change, it seems unlikely to cause the “message too large” error. Unfortunately, we don’t own a cluster where we could run the experiment with DGL’s default launcher.
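To make the modification concrete, here is a hypothetical sketch of the kind of one-line change described above. The helper name, the IP-to-node-name mapping, and the exact ssh flags are illustrative assumptions, not DGL's actual launcher code; the idea is simply that the remote command is wrapped in ssh targeting the slurm node name instead of the raw IP address.

```python
def build_remote_command(target: str, remote_cmd: str) -> str:
    """Wrap a command so it runs on a remote host via ssh (simplified sketch)."""
    return f"ssh -o StrictHostKeyChecking=no {target} '{remote_cmd}'"


# Assumed mapping from the IPs in ip_config.txt to slurm node names,
# purely for illustration.
ip_to_node_name = {"10.0.0.1": "node001"}

ip = "10.0.0.1"
# Original launcher behavior: ssh <ip_address>
original = build_remote_command(ip, "python3 train.py")
# Custom launcher behavior: ssh <node_name>
modified = build_remote_command(ip_to_node_name[ip], "python3 train.py")

print(original)
print(modified)
```

Since only the ssh target string changes and the command executed on the remote host is identical, this change should be orthogonal to the RPC message size.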

Please let us know if there is anything I can do to help locate the issue. Feel free to chat with me during your working hours; my availability is at go.kunwu.me/calendar.

Thanks again.

Best Regards,
