In DistDGL, How Can I Partition the Graph without Halos?

Is there a way to partition a graph for DistDGL without involving halos? We want to do this because the halo nodes take up space, making it hard to fit mag240m onto four 256 GB nodes. I tried to partition the graph without halos by setting num_hops=0 when calling dgl.distributed.partition_graph, but I got the error below.
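Roughly, the call looks like this (a simplified sketch: `g` stands in for the mag240m graph my script loads, and the graph name / output path are placeholders, not the actual values from partition_graph.py):

```python
import dgl

# Sketch of the attempted call; `g` is the loaded mag240m DGLGraph,
# graph_name and out_path are placeholders.
dgl.distributed.partition_graph(
    g,
    graph_name="mag240m",
    num_parts=4,
    out_path="partitions/mag240m",
    num_hops=0,                    # 0 instead of the default 1, to drop halos
    part_method="metis",
    num_trainers_per_machine=2,
)
```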

Any suggestions or ideas on this matter? Thanks in advance.

(gids_osdi24) kunwu2@bafs-01:/data/kunwu2/IGB-Datasets$ python -m benchmark.heterogeneous_version.partition_graph --num_parts=4 --num_trainers_per_machine=2 --dataset=mag240m
Constructing graph_data
Constructed graph_data
Created heterograph
load mag240m takes 168.082 seconds
|V|=244160499, |E|=1728364232
train: 1112392, valid: 138949, test: 146818
Converting to homogeneous graph takes 31.949s, peak mem: 575.951 GB
Reshuffle nodes and edges: 1759.076 seconds
Split the graph: 491.900 seconds
Construct subgraphs: 48.162 seconds
Splitting the graph into partitions takes 2300.576s, peak mem: 626.451 GB
Traceback (most recent call last):
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/kunwu2/IGB-Datasets/benchmark/heterogeneous_version/partition_graph.py", line 188, in
    dgl.distributed.partition_graph(
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/partition.py", line 964, in partition_graph
    typed_eids
ValueError: operands could not be broadcast together with shapes (2672668,) (10702358,)
[1]+ Killed python -m benchmark.heterogeneous_version.partition_graph --num_parts=4 --num_trainers_per_machine=2 --dataset=mag240m

DistDGL currently does not support partitioning without halo nodes. This is because of the way distributed neighbor sampling is implemented, which requires at least the one-hop neighborhood to be co-located on the same machine. If you have further questions regarding distributed training, you could also contact the GraphStorm team for more comprehensive support.
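If the concern is how much memory the halos consume, one way to quantify it is to load each partition and count nodes marked by the inner_node flag. A minimal sketch, assuming a recent DGL and that the partition config JSON written by partition_graph sits at a placeholder path:

```python
import dgl

# Count owned vs. halo nodes in each partition.
# The config path below is a placeholder; point it at the JSON metadata
# written by dgl.distributed.partition_graph.
part_config = "partitions/mag240m/mag240m.json"

for part_id in range(4):
    # load_partition returns (graph, node_feats, edge_feats, gpb, ...);
    # only the local graph structure is needed here.
    local_g = dgl.distributed.load_partition(part_config, part_id, load_feats=False)[0]
    owned = int(local_g.ndata["inner_node"].bool().sum())
    halo = local_g.num_nodes() - owned
    print(f"partition {part_id}: {owned} owned nodes, {halo} halo nodes")
```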

Got it. Thanks for the reply, Minjie!
Best Regards,
Kun
