Is there a way to partition a graph for DistDGL without halo nodes? We want this because the halos take up space, which makes it hard to fit mag240m on four 256 GB nodes. I tried to partition the graph without halos by setting num_hops=0 when calling dgl.distributed.partition_graph. However, I got the following error.
Any suggestions or ideas on this matter? Thanks in advance.
(gids_osdi24) kunwu2@bafs-01:/data/kunwu2/IGB-Datasets$ python -m benchmark.heterogeneous_version.partition_graph --num_parts=4 --num_trainers_per_machine=2 --dataset=mag240m
Constructing graph_data
Constructed graph_data
Created heterograph
load mag240m takes 168.082 seconds
|V|=244160499, |E|=1728364232
train: 1112392, valid: 138949, test: 146818
Converting to homogeneous graph takes 31.949s, peak mem: 575.951 GB
Reshuffle nodes and edges: 1759.076 seconds
Split the graph: 491.900 seconds
Construct subgraphs: 48.162 seconds
Splitting the graph into partitions takes 2300.576s, peak mem: 626.451 GB
Traceback (most recent call last):
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/kunwu2/IGB-Datasets/benchmark/heterogeneous_version/partition_graph.py", line 188, in <module>
    dgl.distributed.partition_graph(
  File "/home/kunwu2/anaconda3/envs/gids_osdi24/lib/python3.9/site-packages/dgl/distributed/partition.py", line 964, in partition_graph
    typed_eids
ValueError: operands could not be broadcast together with shapes (2672668,) (10702358,)
[1]+ Killed python -m benchmark.heterogeneous_version.partition_graph --num_parts=4 --num_trainers_per_machine=2 --dataset=mag240m
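
To make concrete what I mean by halos, here is a toy pure-Python sketch. It is not DGL's implementation: the function name partition_with_halo, the four-node graph, and the choice to follow in-edges are all assumptions made for illustration. It only shows the idea that each partition stores, in addition to the nodes it owns, copies of nodes owned by other partitions that lie within num_hops hops, and that num_hops=0 would leave no such copies.

```python
def partition_with_halo(edges, node_part, num_hops):
    """Toy model: return {part_id: (inner_nodes, halo_nodes)}.

    Halo nodes are nodes owned by another partition that are reachable
    within `num_hops` hops by walking in-edges from the inner nodes.
    This is an illustrative assumption, not DGL's actual algorithm.
    """
    # Map each destination node to the set of its source (in-)neighbors.
    in_nbrs = {}
    for src, dst in edges:
        in_nbrs.setdefault(dst, set()).add(src)

    # Group nodes by the partition that owns them.
    inner = {}
    for node, part in node_part.items():
        inner.setdefault(part, set()).add(node)

    result = {}
    for part, nodes in inner.items():
        seen = set(nodes)
        frontier = set(nodes)
        # Expand one hop at a time, collecting newly reached nodes.
        for _ in range(num_hops):
            reached = set()
            for n in frontier:
                reached |= in_nbrs.get(n, set())
            frontier = reached - seen
            seen |= frontier
        # Everything reached that the partition does not own is halo.
        result[part] = (set(nodes), seen - nodes)
    return result


# Hypothetical toy graph: a directed 4-cycle split into two partitions.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
node_part = {0: 0, 1: 0, 2: 1, 3: 1}

for hops in (0, 1, 2):
    parts = partition_with_halo(edges, node_part, hops)
    halo_counts = {p: len(halo) for p, (_, halo) in parts.items()}
    print(f"num_hops={hops}: halo nodes per partition = {halo_counts}")
```

In this toy model the extra storage grows with num_hops, which is the overhead we would like to avoid for mag240m.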