Two-Node Job Failure During Network Configurations

Hi there,

I ran the distributed DGL (GitHub - dmlc/dgl: Python package built to ease deep learning on graph, on top of existing DL frameworks.) on a slurm-managed cluster today, but the job failed with the error message “dgl._ffi.base.DGLError: [13:16:40] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to 172.24.192.105:30050”. I requested two nodes in the slurm script. I used the ethernet interface in the experiment. Stdout and stderr are at stderr · GitHub.

I also tried the Infiniband interface later but got a Bus error. Stdout and stderr are at stderr.ib0 · GitHub.

Essentially, DGL sets up servers and clients at the beginning of the job and binds the processes to ports for socket communication. And I got the error at this stage.

To run DistDGL on slurm cluster, I modified /tool/launch.py to use srun to create new processes instead of ssh. The modified script is at https://github.com/K-Wu/IGB-Datasets/blob/main/benchmark/slurm_launcher.py.

I have checked with the supercomputing center and they report that no IP communication restriction were set. Any ideas or suggestions on how to fix this? Thank you.

Best Regards,

Kun

Indeed, recently, while attempting pseudo-distributed training with multiple Docker containers on a super server, I encountered the same issue (bus error). However, later, when I rented multiple servers on a cloud platform for real distributed training, it appeared to avoid this problem. In fact, I am also curious about the root cause of the issue.

[13:16:40] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on 172.24.192.105:30050 , error: Cannot assign requested address

slurm-managed cluster and docker containers on same server are not verified when releasing DistDGL.

I suggest to verify the distributed server and clients could be instantiated successfully. For example, try to set up client&server on target machines via socket module from python. This should be simple. And socket module is probably more robust and powerful to handle network connections than DistDGL(as we set up via raw socket programming in C language). If socket works well, then we could further dive deep into the gap.

A simple client/server from ChatGPT for reference

import socket

def start_dummy_server(host='127.0.0.1', port=12345):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, port))
        s.listen()
        print(f"Listening on {host}:{port}...")
        conn, addr = s.accept()
        with conn:
            print(f"Connected by {addr}")
            conn.sendall(b'Hello, client')

if __name__ == '__main__':
    start_dummy_server()

import socket

def connect_to_dummy_server(host='127.0.0.1', port=12345):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((host, port))
        print(f"Connected to {host}:{port}")
        data = s.recv(1024)
        print(f"Received: {data.decode()}")

if __name__ == '__main__':
    connect_to_dummy_server()

1 Like

And please start with single server(single line in ip_config.txt), then extend to multiple servers(nodes/machines).

As for the launch command in DistDGL, please also start with num_servers=1, num_clients=1.

I am sharing details about the Docker environment I utilized along with some testing information. I hope this can be helpful.

Env: dgl-0.9, cuda-11.3, torch-1.12

Testing Steps:

  1. Conducted socket testing for the C/S program, and the process completed successfully.
  2. Used DGL’s launch.py script, calling torchrun from torch (modified ‘python3’ in the script to ‘torchrun’). Ran a custom program (only modified the model for DDP training), and successfully obtained results.
  3. Used DGL’s launch.py script (without changes) on the same env to run a 4-partition training for ogb-products (code sourced from the example in the master branch). Encountered the error message in the link."

what is the customized programming? is it running torch.distributed APIs only without DistDGL RPC APIs? Namely, it proves that torch.distributed cannot run well on 4 nodes?

I only used torch.distributed APIs and skipped the DistDGL RPC APIs because this method only communicates model parameters. Here is a part of the code link.

then it demonstrated torch.distributed works well on the docker env.

In the docker env, you just start several containers to mimic nodes(machines) and all of them are running on a single machine?

Yes, all containers are running on a single machine.

Hi Rhett,

Thank you! This is very helpful! I got a connection refused error when running the pair in a slurm job. I am checking with the supercomputing center to see if there is anything they can do to fix it. I am also wondering if there is any easy way to replace the socket communication in DistDGL with alternatives. I will give you an update once I get some progress.

Best Regards,
Kun

Unfortunately, there’s no easy way to replace the socket communication in DistDGL.

Hi @Rhett-Ying ,

Thank you for your help! You previous messages helped a lot for my debugging. I have brought up DistDGL on slurm cluster finally. I changed the ssh scheme in DistDGL to ssh as the cluster at NCSA on our campus will reject ssh login to compute nodes.

I do have a further question: When I run large graph, the client got None when retrieving the graph and gpb in shared memory. The following is a stack trace. I checked the server side has created the partition book, but for some reason, when the client tried to read the shared memory, it got None. Is there any idea on how to fix this? For reference, I am running the IGB (https://github.com/IllinoisGraphBenchmark/IGB-Datasets/tree/main/igb) homogeneous medium dataset. I partition the graph into 2 node, each with 1 trainer. Each partition has 2GB of graph.dgl and 20GB of node_feat.dgl.

Thank you again.

Best Regards,
Kun

File “/u/kunwu2/projects/IGB-datasets/benchmark/do_graphsage_node_classification.py”, line 351, in main
g = dgl.distributed.DistGraph(args.graph_name, part_config=args.part_config)
File “/projects/bbzc/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_graph.py”, line 596, in init
g = dgl.distributed.DistGraph(args.graph_name, part_config=args.part_config)
File “/projects/bbzc/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_graph.py”, line 596, in init
self._init(gpb)
File “/projects/bbzc/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/dist_graph.py”, line 630, in _init
self._client.map_shared_data(self._gpb)
File “/projects/bbzc/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/kvstore.py”, line 1289, in map_shared_data
for ntype in partition_book.ntypes:
AttributeError: ‘NoneType’ object has no attribute ‘ntypes’

ssh to ssh? could you elaborate and share more about how you achieve this, then other users may benefit from your efforts?

please make sure servers are already ready which indicates gpb and graph are already copied to shared memory. Sometimes the log error are delayed so we may miss the first error.

Thank you for your quick reply! Sorry for the corrupted message. I just realized I need to escape < and >: What I meant is I changed ssh <IP address> to ssh <node name> in slurm clusters. NCSA reject ssh <Infiniband/ethernet interface IP address> login to compute nodes.

I will try what you said in the morning (US central time). So far I only confirmed that the server has instantied gpb but I will try to figure out how to make sure the copy has done. Is there any way to wait until the copy has finished and then launch the client?

Thank you again.

Best Regards,
Kun

In theory, DistGraph is instantiated after all servers are ready(especially the main severs on each machine). In your env, do you have 2 nodes(machines)? You could add log here: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/distributed/dist_graph.py#L419.

1 Like

Thank you. For the line you mentioned, I have located that yesterday. Yes, I have 2 nodes.

It turns out the issue is that I passed the wrong graph_name: It should be igb240m_medium instead of igbh240m_medium. The whole command line after correction is as follows.

python -m benchmark.slurm_launcher --num_trainers=1 --num_samplers=0 --num_servers=1 --ip_config=/tmp/node_lists.$SLURM_JOB_ID.out --nodename_config=/tmp/node_name_lists.$SLURM_JOB_ID.out --workspace=$MY_TOOLKIT_PATH/../workspace --part_config=$MY_TOOLKIT_PATH/../out_dataset/igb240m_medium_2_1/igb240m_medium.json "python -m benchmark.do_graphsage_node_classification --ip_config=/tmp/node_lists.$SLURM_JOB_ID.out --num_gpus=1 --part_config=$MY_TOOLKIT_PATH/../out_dataset/igb240m_medium_2_1/igb240m_medium.json --graph_name=igb240m_medium"

When the graph_name is wrongly specified, the client cannot find the shared memory file object storing the graph and the graph partition book, thus returning None for both graph and gpb. However, it is interesting that the server starts up without issues. I suspect some logic in DistDGL uses the graph_name specified in the .json part config, and some other logic rely on the graph_name specified by the user in the command line.

Best Regards,
Kun

1 Like

Yes. DistGraphServer relies on part_config to load graph partitions while DistGraph uses graph_name to load partition from shared memory.

1 Like