In DistDGL, How Can I Partition the Graph without Halos?

K-Wu · February 27, 2024, 8:08am

Is there a way to partition the graph for DistDGL without involving halos? We want to do this because halos take up space, making it hard to fit mag240m on four 256 GB nodes. I tried to partition the graph for DistDGL without halos by setting num_hops=0 when calling dgl.distributed.partition_graph. However, I got the following error (a ValueError, shown in the traceback below).

Any suggestions or ideas on this matter? Thanks in advance.

(gids_osdi24) kunwu2@bafs-01:/data/kunwu2/IGB-Datasets$ python -m benchmark.heterogeneous_version.partition_graph --num_parts=4 --num_trainers_per_machine=2 --dataset=mag240m
Constructing graph_data
Constructed graph_data
Created heterograph
load mag240m takes 168.082 seconds
|V|=244160499, |E|=1728364232
train: 1112392, valid: 138949, test: 146818
Converting to homogeneous graph takes 31.949s, peak mem: 575.951 GB
Reshuffle nodes and edges: 1759.076 seconds
Split the graph: 491.900 seconds
Construct subgraphs: 48.162 seconds
Splitting the graph into partitions takes 2300.576s, peak mem: 626.451 GB
Traceback (most recent call last):
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/kunwu2/IGB-Datasets/benchmark/heterogeneous_version/partition_graph.py", line 188, in
    dgl.distributed.partition_graph(
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/partition.py", line 964, in partition_graph
    typed_eids
ValueError: operands could not be broadcast together with shapes (2672668,) (10702358,)
[1]+ Killed python -m benchmark.heterogeneous_version.partition_graph --num_parts=4 --num_trainers_per_machine=2 --dataset=mag240m

minjie · February 29, 2024, 1:37am

DistDGL currently does not support partitioning without halo nodes. This is because of the way distributed neighbor sampling is implemented, which requires at least the one-hop neighborhood to be co-located on the same machine. If you have further questions regarding distributed training, you could also contact the GraphStorm team for more comprehensive support.
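
To make the one-hop requirement concrete, here is a minimal pure-Python sketch of what a one-hop halo set looks like on a toy graph. It does not use DGL and is not DistDGL's implementation: the helper one_hop_halos, the toy edge list, and the assumption that each edge lives on the partition owning its destination node are all illustrative.

```python
from collections import defaultdict

def one_hop_halos(edges, part_of):
    """Return {partition: set of halo nodes} for a directed edge list.

    edges   : iterable of (src, dst) pairs
    part_of : dict mapping node id -> partition id (the node's owner)

    Assumption for this sketch: an edge is stored on the partition that
    owns its destination node.
    """
    halos = defaultdict(set)
    for src, dst in edges:
        # If the source is owned by another partition, the destination's
        # partition needs a local copy of it (a halo node) so that
        # one-hop neighbor sampling can run without leaving the machine.
        if part_of[src] != part_of[dst]:
            halos[part_of[dst]].add(src)
    return dict(halos)

# Toy graph: nodes 0-2 are owned by partition 0, nodes 3-5 by partition 1.
part_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

halos = one_hop_halos(edges, part_of)
for part, nodes in sorted(halos.items()):
    print(f"partition {part}: halo nodes {sorted(nodes)}")
```

In this toy case each partition keeps one extra node copy. On a graph with many cross-partition edges the halo sets grow accordingly, which is where the extra memory in the question comes from.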

K-Wu · February 29, 2024, 2:57am

Got it. Thanks for the reply, Minjie!
Best Regards,
Kun
