DistDGL Message Too Large Error When Running Large-Scale Graphs

K-Wu · February 28, 2024, 3:11pm

Hi there,

When I run DistDGL on mag240m on either 2 nodes or 4 nodes, each with 2 trainers, I get the following error indicating that the message is too large. Any ideas or suggestions on how to fix this? Thank you.

[08:48:21] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:48:21] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:50:13] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:50:13] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:52:22] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:52:22] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[08:53:45] /opt/dgl/src/rpc/network/msg_queue.cc:28: Message is larger than the queue.
Traceback (most recent call last):
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/u/kunwu2/scratch/IGB-Datasets/benchmark/do_graphsage_node_classification.py", line 696, in <module>
    main(args)
  File "/u/kunwu2/scratch/IGB-Datasets/benchmark/do_graphsage_node_classification.py", line 498, in main
    dgl.distributed.initialize(args.ip_config)
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_context.py", line 278, in initialize
    serv.start()
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_graph.py", line 471, in start
    start_server(
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc_server.py", line 173, in start_server
    rpc.send_response(client_id, res, group_id)
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc.py", line 752, in send_response
    send_rpc_message(msg, get_client(client_id, group_id))
  File "/u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/rpc.py", line 1077, in send_rpc_message
    _CAPI_DGLRPCSendRPCMessage(msg, int(target))
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [08:53:45] /opt/dgl/src/rpc/network/socket_communicator.cc:123: Check failed: Send(ndarray_data_msg, recv_id) == 3400 (3401 vs. 3400) : 
Stack trace:
  [bt] (0) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7f890c914ab5]
  [bt] (1) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dgl::network::SocketSender::Send(dgl::rpc::RPCMessage const&, int)+0x659) [0x7f890ce12419]
  [bt] (2) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(dgl::rpc::SendRPCMessage(dgl::rpc::RPCMessage const&, int)+0x1f) [0x7f890ce1d18f]
  [bt] (3) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(+0x8a1145) [0x7f890ce23145]
  [bt] (4) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f890cc903f8]
  [bt] (5) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-x86_64-linux-gnu.so(+0x1a45f) [0x7f88d07b345f]
  [bt] (6) /u/kunwu2/scratch/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-x86_64-linux-gnu.so(+0x1acbf) [0x7f88d07b3cbf]
  [bt] (7) python(_PyObject_MakeTpCall+0x2ec) [0x4f073c]
  [bt] (8) python(_PyEval_EvalFrameDefault+0x4b5a) [0x4ec58a]

minjie · February 29, 2024, 1:34am

Just to clarify.

Do you run DistDGL on 2 machines or 4 machines?
When you say “each with 2 trainers”, do you mean each machine has 2 trainers? Are they using different GPUs?

Also, could you provide steps to reproduce the error?

K-Wu · February 29, 2024, 3:01am

Hi Minjie,

Thank you for your reply!

I have tried two configurations. In the first configuration, I am running DistDGL on 2 machines. Each machine has 2 trainers. Each trainer corresponds to 1 A100 GPU.

After realizing this didn’t work, I changed the number of machines to 4. In this second configuration, each machine still has 2 trainers, and each trainer corresponds to 1 A100 GPU.

The scripts I used for the two configurations are at https://github.com/K-Wu/IGB-Datasets/blob/main/benchmark/heterogeneous_version/do_dist_training_mag240m_2_2.slurm and https://github.com/K-Wu/IGB-Datasets/blob/main/benchmark/heterogeneous_version/do_dist_training_mag240m_4_2.slurm.

If you want to reproduce this issue on a local cluster instead of a Slurm-managed cluster, I can work with you to put together a working reproduction script.

Thank you again.

Best Regards,
Kun

minjie · March 7, 2024, 7:18am

Does the problem still exist if you switch to DGL’s default launcher?

K-Wu · March 9, 2024, 5:18am

Hi Minjie,

Thank you for checking my code. The only change in our custom launcher is replacing the ssh <ip_address> command with ssh <node_name>, because that is what works for logging in to run remote commands on the Slurm-managed cluster. Since this is the only change, it seems unlikely to cause the “message too large” error. We don’t have a cluster of our own on which to run the experiment with DGL’s default launcher.
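
A minimal sketch of the ssh change described above, for illustration only. The helper names (resolve_target, build_ssh_command), the IP addresses, and the node names are all made up here; this is not the actual launcher code, and a real launcher would likely pass more ssh options.

```python
from typing import Dict, List


def resolve_target(ip: str, ip_to_node: Dict[str, str]) -> str:
    """Return the Slurm node name for an IP, or the IP itself if unknown.

    Hypothetical helper: the mapping has to come from somewhere outside
    this sketch (for example, the cluster's node list).
    """
    return ip_to_node.get(ip, ip)


def build_ssh_command(target: str, remote_cmd: str) -> List[str]:
    """Build the argv for running remote_cmd on target over ssh.

    ssh accepts either an IP address or a host name as the target, so
    only the target string differs between the two launcher variants.
    """
    return ["ssh", target, remote_cmd]


# Made-up example mapping; real IPs and node names will differ.
ip_to_node = {"10.0.0.1": "node001", "10.0.0.2": "node002"}

for ip in ip_to_node:
    argv = build_ssh_command(resolve_target(ip, ip_to_node), "hostname")
    print(" ".join(argv))
```

The point of the sketch is that only the ssh target string changes; the remote command itself, and therefore the RPC setup on each node, stays the same.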

Please let us know if there is anything I can do to help locate the issue. Feel free to chat with me during your working hours. My availability is at go.kunwu.me/calendar

Thanks again.

Best Regards,
Kun

system · April 8, 2024, 5:19am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.