ogb-paper100M unable to run with distributed GraphSAGE

Hello.

For some reason, I am unable to run ogb-paper100M with distributed GraphSAGE. I have tried on two
different supercomputing clusters. Here is the error I get on one of them:

Traceback (most recent call last):
  File "train_dist.py", line 331, in <module>
    main(args)
  File "train_dist.py", line 294, in main
    run(args, device, data)
  File "train_dist.py", line 259, in run
    g.ndata['labels'], val_nid, test_nid, args.batch_size_eval, device)
  File "train_dist.py", line 155, in evaluate
    pred = model.inference(g, inputs, batch_size, device)
  File "train_dist.py", line 117, in inference
    for blocks in tqdm.tqdm(dataloader):
  File "/jet/home/lhoang/.local/lib/python3.6/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/jet/home/lhoang/.local/lib/python3.6/site-packages/dgl/distributed/dist_dataloader.py", line 163, in __next__
    result = self.queue.get(timeout=1800)
  File "<string>", line 2, in get
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 757, in _callmethod
    kind, result = conn.recv()
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 254, in recv
    buf = self._recv_bytes()
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 411, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 387, in _recv
    raise EOFError
EOFError

The error I get on the other machine is different. In this case, I copied the partitions I made on the cluster above to the other machine (the other machine does not have enough memory to run METIS partitioning on the graph; I updated the data paths and such as necessary after the copy):

Traceback (most recent call last):
  File "train_dist.py", line 337, in <module>
    main(args)
  File "train_dist.py", line 273, in main
    dgl.distributed.initialize(args.ip_config, args.num_servers, num_workers=args.num_workers)
  File "/home1/03372/lhoang/.local/lib/python3.6/site-packages/dgl/distributed/dist_context.py", line 100, in initialize
    os.environ.get('DGL_CONF_PATH'))
  File "/home1/03372/lhoang/.local/lib/python3.6/site-packages/dgl/distributed/dist_graph.py", line 263, in __init__
    self.gpb, graph_name = load_partition_book(part_config, self.part_id)
  File "/home1/03372/lhoang/.local/lib/python3.6/site-packages/dgl/distributed/partition.py", line 103, in load_partition_book
    node_map = part_metadata['node_map'] if is_range_part else np.load(part_metadata['node_map'])
  File "/home1/03372/lhoang/.local/lib/python3.6/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
TypeError: expected str, bytes or os.PathLike object, not dict

Both errors happen while reading the partitions, from what I can tell, and this is a 4-host partition.
Both systems are running the same DistDGL Python code. The only difference I can think of is that the OGB package on the machine where the dataset was downloaded and partitioned is a more up-to-date version, but that shouldn't matter here, since OGB isn't called at this point in execution (this happens during DistDGL file loading).

Any insight would be appreciated.

Thank you,
Loc Hoang

For the first error, it seems the training code runs but fails once it reaches the inference code. If so, I believe this is a bug we observed previously: DistDGL isn't compatible with Python 3.8. Could you try running without sampler workers? That should fix the problem.

python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-product.json \
--ip_config ip_config.txt \
"python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 0"

For the second error, could you show me your JSON file? It seems you are not using range partitioning when you partition the graph.
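Going by the `load_partition_book` line in the traceback, a quick way to see which branch the loader will take is to check the type of `node_map` in the partition metadata. Here is a minimal sketch using an inlined, trimmed copy of the metadata (in practice you would `json.load` the real file, e.g. `data/ogb-paper100M.json`; the dict below is illustrative):

```python
# Trimmed, inlined copy of the partition metadata (illustrative subset of
# the real JSON file; values taken from the metadata shown in this thread).
meta = {
    "node_map": {"_N": [[0, 28574557], [28574557, 111059956]]},
    "part_method": "metis",
}

# DGL 0.5's load_partition_book calls np.load(meta["node_map"]) for a METIS
# partition, so a dict here (the newer per-ntype layout) raises the
# "expected str, bytes or os.PathLike object, not dict" TypeError above.
node_map_is_dict = isinstance(meta["node_map"], dict)
print(node_map_is_dict)  # -> True
```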

Hello.

Thanks for the replies. Both of these Python versions are 3.6 from what I can tell, so it shouldn't be a 3.8 issue.

Re: the sampler setting, yes, I was made aware of that bug in another post and will look into it here as well.

Re: the second setting, here is the JSON.

{
    "edge_map": {
        "_E": [
            [
                0,
                417790259
            ],
            [
                417790259,
                800057474
            ],
            [
                800057474,
                1198604697
            ],
            [
                1198604697,
                1615685872
            ]
        ]
    },
    "etypes": {
        "_E": 0
    },
    "graph_name": "ogb-paper100M",
    "halo_hops": 1,
    "node_map": {
        "_N": [
            [
                0,
                28574557
            ],
            [
                28574557,
                55648781
            ],
            [
                55648781,
                82355578
            ],
            [
                82355578,
                111059956
            ]
        ]
    },
    "ntypes": {
        "_N": 0
    },
    "num_edges": 1615685872,
    "num_nodes": 111059956,
    "num_parts": 4,
    "part-0": {
        "edge_feats": "data/part0/edge_feat.dgl",
        "node_feats": "data/part0/node_feat.dgl",
        "part_graph": "data/part0/graph.dgl"
    },
    "part-1": {
        "edge_feats": "data/part1/edge_feat.dgl",
        "node_feats": "data/part1/node_feat.dgl",
        "part_graph": "data/part1/graph.dgl"
    },
    "part-2": {
        "edge_feats": "data/part2/edge_feat.dgl",
        "node_feats": "data/part2/node_feat.dgl",
        "part_graph": "data/part2/graph.dgl"
    },
    "part-3": {
        "edge_feats": "data/part3/edge_feat.dgl",
        "node_feats": "data/part3/node_feat.dgl",
        "part_graph": "data/part3/graph.dgl"
    },
    "part_method": "metis"
}

It's worth noting that this JSON works in the first setting (aside from updating the "data" paths to the correct locations).

Got ogb-paper100M running in the first setting by setting the number of samplers to 0.
Thanks.

The second setting is the one I'm more interested in, however; I would appreciate any help in getting it running.

Just to answer the question: DGL 0.6 uses a new JSON format. Even though DGL 0.6 supports the old format used by DGL 0.5, DGL 0.5 doesn't support the new format. To fix the issue above, you just need to upgrade DGL to 0.6.
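Based on the `np.load(part_metadata['node_map'])` branch visible in the traceback, here is a rough sketch of the two metadata layouts (the array values and the `/tmp` path are made up for illustration):

```python
import numpy as np

# Old (pre-0.6, non-range) layout: "node_map" is a path to a .npy array
# mapping each global node ID to its partition ID.
np.save("/tmp/node_map.npy", np.array([0, 0, 1, 1, 2, 3]))
meta_old = {"node_map": "/tmp/node_map.npy", "part_method": "metis"}

# New (0.6) layout: "node_map" is a dict of per-node-type ID ranges, as in
# the JSON shown earlier in this thread.
meta_new = {"node_map": {"_N": [[0, 2], [2, 4], [4, 5], [5, 6]]},
            "part_method": "metis"}

# The 0.5 loader accepts the dict form only for range partitioning; for a
# METIS partition it calls np.load on the value, which works for the old
# path-style layout but fails with a TypeError on the new dict layout.
print(np.load(meta_old["node_map"]).tolist())  # -> [0, 0, 1, 1, 2, 3]
```

So the metadata copied from the machine with the newer stack is in the 0.6 layout, and the 0.5 install on the second cluster cannot parse it.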