Failed on writing metis and killed for DistDGL on papers100M

Hi,

I’m trying to run papers100M on two machines but got stuck at the partition step. How much CPU memory is needed to run partition_graph.py on ogbn-papers100M? My machine currently has 312 GB. After running python3 partition_graph.py --dataset ogb-paper100M --num_parts 2 --balance_train --balance_edges, it printed "Failed on writing metis..." and was then killed (OOM). The complete error message is below:

load ogbn-papers100M
This will download 56.17GB. Will you proceed? (y/N)
y
Downloading http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip
Downloaded 56.17 GB: 100%|█████████████████████████████████████| 57519/57519 [18:25<00:00, 52.04it/s]
Extracting dataset/papers100M-bin.zip
Loading necessary files...
This might take a while.
Processing graphs...
100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21399.51it/s]
Converting graphs into DGL objects...
100%|██████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.80s/it]
Saving...
finish loading ogbn-papers100M
finish constructing ogbn-papers100M
load ogb-paper100M takes 2013.006 seconds
|V|=111059956, |E|=1615685872
train: 1207179, valid: 125265, test: 214338
Converting to homogeneous graph takes 22.111s, peak mem: 190.091 GB
Convert a graph into a bidirected graph: 548.671 seconds, peak memory: 233.830 GB
Construct multi-constraint weights: 4.294 seconds, peak memory: 233.830 GB
Failed on writing metis2905.6
Failed on writing metis2905.7
Failed on writing metis2905.9
Failed on writing metis2905.11
Failed on writing metis2905.12
Failed on writing metis2905.13
Killed

Thanks in advance!

I tried on my side; see details below. Besides increasing your RAM, you could try ParMETIS, which may require less RAM and runs faster: dgl/examples/pytorch/rgcn/experimental at master · dmlc/dgl · GitHub

load ogbn-papers100M
finish loading ogbn-papers100M
finish constructing ogbn-papers100M
load ogb-paper100M takes 319.756 seconds
|V|=111059956, |E|=1615685872
train: 1207179, valid: 125265, test: 214338
Converting to homogeneous graph takes 22.279s, peak mem: 227.535 GB
Convert a graph into a bidirected graph: 467.348 seconds, peak memory: 271.274 GB
Construct multi-constraint weights: 4.047 seconds, peak memory: 271.274 GB
[03:12:53] /opt/dgl/src/graph/transform/metis_partition_hetero.cc:87: Partition a graph with 111059956 nodes and 3228124712 edges into 2 parts and get 89205390 edge cuts
Metis partitioning: 4079.545 seconds, peak memory: 367.365 GB
Assigning nodes to METIS partitions takes 4553.460s, peak mem: 367.365 GB
Reshuffle nodes and edges: 233.161 seconds
Split the graph: 1178.080 seconds
Construct subgraphs: 261.475 seconds
Splitting the graph into partitions takes 1676.012s, peak mem: 409.879 GB
part 0 has 69828010 nodes and 55645679 are inside the partition
part 0 has 815801925 edges and 773852264 are inside the partition
part 1 has 67651329 nodes and 55414277 are inside the partition
part 1 has 889126971 edges and 841833608 are inside the partition
Save partitions: 535.586 seconds, peak memory: 409.879 GB
There are 1615685872 edges in the graph and 0 edge cuts for 2 partitions.

I modified write_mag.py for papers100M by changing line 8 to ‘ogbn-papers100M’ (I still need to change the saved file names and the name of the JSON read at the end, but the error occurred before those lines, so I have not addressed them yet). When I try to run it, it fails at line 16. I’m wondering if this is because some parts of the script are specific to MAG. Also, papers100M seems to already be in homogeneous format. Should I delete lines 19-25, skipping the step that converts the heterogeneous format to the homogeneous one?

Here are the details:

This will download 56.17GB. Will you proceed? (y/N)
y
Downloading http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip
Downloaded 56.17 GB: 100%|████████████████████████████████████| 57519/57519 [19:20<00:00, 49.56it/s]
Extracting dataset/papers100M-bin.zip
Loading necessary files...
This might take a while.
Processing graphs...
100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20068.44it/s]
Converting graphs into DGL objects...
100%|█████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.95s/it]
Saving...
Traceback (most recent call last):
  File "write_pa.py", line 16, in <module>
    hg.nodes['paper'].data['feat'] = hg_orig.nodes['paper'].data['feat']
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/view.py", line 38, in __getitem__
    ntid = self._typeid_getter(ntype)
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/heterograph.py", line 1189, in get_ntype_id
    raise DGLError('Node type "{}" does not exist.'.format(ntype))
dgl._ffi.base.DGLError: Node type "paper" does not exist.

The link I shared with you is dedicated to the MAG dataset, so you need to modify write_mag.py and the other scripts accordingly. And you need to install ParMETIS as well.

BTW, you could specify the root of DglNodePropPredDataset to avoid downloading the dataset on every run.
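For example (the root path below is only an illustration):

from ogb.nodeproppred import DglNodePropPredDataset

# Pointing `root` at a persistent directory lets later runs reuse the
# ~56 GB archive and the extracted files instead of re-downloading them.
dataset = DglNodePropPredDataset(name='ogbn-papers100M', root='/path/to/dataset')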

Do you mind giving some hints on how to modify it? Since papers100M is in homogeneous format, many lines in write_mag.py seem unnecessary for it. After loading the dataset, I directly set g to dataset[0] and skipped to line 21, but running that version raises an error saying that a tuple does not have ‘ndata’. Then I tried to treat it as a heterogeneous graph like MAG, simply deleting

Then it went OOM when trying to remove self-loops. Really appreciated!

Please go through this tutorial first: 7.1 Preprocessing for Distributed Training — DGL 0.9.0 documentation. This doc will give you the basic idea of what write_mag.py does. I believe you’ll be able to write write_papers.py on your own.

write_mag.py mainly aims to generate the inputs for ParMETIS: xxx_nodes.txt and xxx_edges.txt. When you treat papers100M as heterogeneous directly, you hit OOM. Some keys are probably unnecessary, such as g.ndata['orig_id']. As RAM is very tight on your machine, you need to free any unnecessary objects while generating the inputs for ParMETIS and in all following operations. For example, after dumping node_data into xxx_nodes.txt, you’d better delete those objects. The node features also occupy a lot of RAM, so you could free them once they are processed.
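A minimal sketch of that pattern (the file name and the use of numpy for persisting features are my assumptions here, not what write_mag.py actually does):

import gc
import numpy as np
from ogb.nodeproppred import DglNodePropPredDataset

# dataset[0] is a (graph, labels) tuple, which also explains the earlier
# "tuple does not have 'ndata'" error when using g = dataset[0] directly.
dataset = DglNodePropPredDataset(name='ogbn-papers100M', root='dataset')
g, labels = dataset[0]

# The 128-dim float node features dominate RAM: persist them to disk and
# drop them from the graph before generating the ParMETIS input files.
feat = g.ndata.pop('feat')
np.save('pa_node_feat.npy', feat.numpy())
del feat
gc.collect()

# Apply the same idea to every large intermediate: once a tensor has been
# written into xxx_nodes.txt / xxx_edges.txt, delete it and run
# gc.collect() before moving on to the next step.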


@lwwlwwl Hi, I think the case you’re facing (namely, OOM on a single machine while you have 2 machines available) could benefit from DGL’s upcoming distributed partition pipeline. I am working on it and will let you know once it’s ready. At that point, I hope you’ll be able to partition papers100M very conveniently.

Thanks for letting me know :slight_smile:

I guess I did not write write_pa.py correctly, or something else went wrong: when I run the ParMETIS command mpirun -np 2 pm_dglpart pa 1, the result files it produces are missing p<part_id>-pa_nodes.txt, p<part_id>-pa_edges.txt and p<part_id>-pa_stats.txt.

In my terminal output for write_pa, it shows "There are 1615685872 edges, remove 0 self-loops and 0 duplicated edges", and pa_removed_edges.txt is also empty. I am not sure which part went wrong here. This is the script I use.

EDIT:
The missing files above may have been caused by the terminal getting interrupted. I reran everything from the beginning today, and this time the terminal showed "failed on writing parmetis" instead, as follows:

[000] gnvtxs: 111059956, gnedges: 1615685872, ncon: 1
[001] gnvtxs: 111059956, gnedges: 1615685872, ncon: 1

-------------------------------------------------------
[001] proc/self/stat/VmPeak:     120813.86 MB
[000] proc/self/stat/VmPeak:     120655.54 MB
-------------------------------------------------------

DistDGL partitioning, ncon: 1, nparts: 2 [i64, r32, MPI 3.1]
[ 111059956 3228124712   55529978   55529978] [0.50] [50] [ 0.00] [ 0.00]
Failed on writing parmetis0.5630.1
Failed on writing parmetis1.5631.1
[  58443227 2311333690   29131890   29311337] [0.49] [50] [ 0.00] [ 0.00]
Failed on writing parmetis1.5631.2
Failed on writing parmetis0.5630.2
[  30511872 1611154500   15231914   15279958] [0.49] [50] [ 0.00] [ 0.00]
Failed on writing parmetis1.5631.3
Failed on writing parmetis0.5630.3
[  15785823 1097603430    7878841    7906982] [0.49] [50] [ 0.00] [ 0.00]
Failed on writing parmetis1.5631.4
Failed on writing parmetis0.5630.4
[   8110318  726369398    4045593    4064725] [0.49] [50] [ 0.00] [ 0.00]
[   4145560  461349276    2072179    2073381] [0.49] [50] [ 0.00] [ 0.00]
Failed on writing parmetis0.5630.6
Failed on writing parmetis1.5631.6
[   2111094  278940646    1051943    1059151] [0.49] [50] [ 0.00] [ 0.00]
Failed on writing parmetis1.5631.7
Failed on writing parmetis0.5630.7
[   1072278  164877198     535732     536546] [0.49] [50] [ 0.00] [ 0.00]
Failed on writing parmetis0.5630.8
Failed on writing parmetis1.5631.8
[    543799   96848940     270687     273112] [0.49] [50] [ 0.00] [ 0.00]
Failed on writing parmetis0.5630.9
Failed on writing parmetis1.5631.9
[    275507   54625068     137643     137864] [0.50] [50] [ 0.00] [ 0.00]
[    139554   30057652      69379      70175] [0.50] [50] [ 0.00] [ 0.00]
Failed on writing parmetis0.5630.11
Failed on writing parmetis1.5631.11
[     70710   16251574      35338      35372] [0.50] [50] [ 0.00] [ 0.00]
[     35897    8708418      17801      18096] [0.50] [50] [ 0.00] [ 0.00]
[     18266    4650466       9122       9144] [0.50] [50] [ 0.00] [ 0.00]
[      9323    2481258       4600       4723] [0.50] [50] [ 0.00] [ 0.00]
[      4772    1326040       2375       2397] [0.50] [50] [ 0.00] [ 0.00]
[      2463     710536       1204       1259] [0.50] [50] [ 0.00] [ 0.00]
[      1280     379134        628        652] [0.50] [50] [ 0.00] [ 0.00]
[       683     181440        325        358] [0.50] [50] [ 0.00] [ 0.00]
[       370      74992        173        197] [0.50] [50] [ 0.00] [ 0.00]
[       205      25106         94        111] [0.49] [50] [ 0.00] [ 0.01]
[       117       7318         53         64] [0.49] [50] [ 0.00] [ 0.02]
[        92       5886         41         51] [0.49] [50] [ 0.00] [ 0.02]
nvtxs:         92, cut: 23174894, balance: 1.021 
nvtxs:        117, cut: 21134598, balance: 1.016 
nvtxs:        205, cut: 21115576, balance: 1.006 
nvtxs:        370, cut: 20281883, balance: 1.015 
nvtxs:        683, cut: 18822389, balance: 1.017 
nvtxs:       1280, cut: 16046888, balance: 1.019 
nvtxs:       2463, cut: 13824493, balance: 1.020 
nvtxs:       4772, cut: 12804927, balance: 1.020 
nvtxs:       9323, cut: 12443345, balance: 1.020 
nvtxs:      18266, cut: 12205984, balance: 1.019 
nvtxs:      35897, cut: 12290415, balance: 1.018 
nvtxs:      70710, cut: 12627187, balance: 1.014 
nvtxs:     139554, cut: 13207305, balance: 1.010 
nvtxs:     275507, cut: 14028116, balance: 1.006 
nvtxs:     543799, cut: 15333179, balance: 1.006 
nvtxs:    1072278, cut: 16907936, balance: 1.003 
nvtxs:    2111094, cut: 18473122, balance: 1.001 
nvtxs:    4145560, cut: 20942908, balance: 1.001 
nvtxs:    8110318, cut: 24159595, balance: 1.004 
nvtxs:   15785823, cut: 27900878, balance: 1.012 
nvtxs:   30511872, cut: 32165456, balance: 0.964 
nvtxs:   58443227, cut: 36717864, balance: 0.919 
nvtxs:  111059956, cut: 41316715, balance: 1.000 
      Setup: Max: 917.158, Sum: 1834.315, Balance:   1.000
   Matching: Max: 1901.147, Sum: 3802.295, Balance:   1.000
Contraction: Max: 1982.308, Sum: 3964.616, Balance:   1.000
   InitPart: Max:   0.004, Sum:   0.008, Balance:   1.000
    Project: Max:   8.647, Sum:  16.678, Balance:   1.037
 Initialize: Max: 135.476, Sum: 270.173, Balance:   1.003
      K-way: Max: 172.257, Sum: 344.513, Balance:   1.000
      Remap: Max:   0.153, Sum:   0.307, Balance:   1.000
      Total: Max: 5203.944, Sum: 10407.888, Balance:   1.000
       Aux1: Max: 539.555, Sum: 1079.110, Balance:   1.000
       Aux2: Max: 595.049, Sum: 1190.098, Balance:   1.000
Final   2-way Cut: 41316715     Balance: 0.996 

-------------------------------------------------------
[001] proc/self/stat/VmPeak:     120813.86 MB
[000] proc/self/stat/VmPeak:     120655.54 MB
-------------------------------------------------------
Fatal error in PMPI_Irecv: Invalid count, error stack:
PMPI_Irecv(156): MPI_Irecv(buf=0x7f46a3b8f010, count=-1774416661, MPI_LONG_LONG_INT, src=0, tag=1, comm=0x84000007, request=0x55aff05ba3c0) failed
PMPI_Irecv(98).: Negative count, value is -1774416661
Fatal error in PMPI_Irecv: Invalid count, error stack:
PMPI_Irecv(156): MPI_Irecv(buf=0x7f1c393a5010, count=-21988628, MPI_LONG_LONG_INT, src=0, tag=1, comm=0x84000004, request=0x55bac1820510) failed
PMPI_Irecv(98).: Negative count, value is -21988628

Another run gave a segmentation fault:

...
-------------------------------------------------------
[001] proc/self/stat/VmPeak:     120813.86 MB
[000] proc/self/stat/VmPeak:     120655.54 MB
-------------------------------------------------------
[   1] size: 7116079104 != 37406043710
[   0] size: 7673339904 != 37352735191
[   1] size: 0 != 444239832
[   1] size: 0 != 1166103961
[   0] size: 0 != 444239832
[   0] size: 0 != 1166103961

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 6468 RUNNING AT ddgltest
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

I have no idea what caused this. Appreciate any help!

Have you tried partitioning MAG via ParMETIS using the scripts under //examples, just to make sure pm_dglpart works as expected?

Just tested; I think pm_dglpart works as expected.

Terminal output for mpirun -np 2 pm_dglpart mag 1 is

[000] gnvtxs: 1939743, gnedges: 42182144, ncon: 4
[001] gnvtxs: 1939743, gnedges: 42182144, ncon: 4
[000] proc/self/stat/VmPeak:     3052.07 MB
-------------------------------------------------------

-------------------------------------------------------
[001] proc/self/stat/VmPeak:     2899.10 MB

DistDGL partitioning, ncon: 4, nparts: 2 [i64, r32, MPI 3.1]
[   1939743   42182144     969871     969872] [0.50] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[   1001196   29835490     499095     502101] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[    515339   19957822     257281     258058] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[    263596   12926044     131582     132014] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[    134158    8062656      67044      67114] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[     68085    4822858      34027      34058] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[     34499    2821286      17217      17282] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[     17441    1623364       8720       8721] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[      8810     906588       4394       4416] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[      4449     496624       2223       2226] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[      2242     267724       1120       1122] [0.49] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[      1129     143434        564        565] [0.50] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[       570      76628        284        286] [0.50] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[       357      48816        176        181] [0.50] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
[       354      48788        175        179] [0.50] [200] [ 0.00 0.00 0.00 0.00] [ 0.00 0.00 0.00 0.00]
nvtxs:        354, cut:   893967, balance: 1.020 1.018 1.027 1.030 
nvtxs:        357, cut:   892166, balance: 1.014 1.011 1.021 1.025 
nvtxs:        570, cut:   847991, balance: 1.008 1.018 1.001 1.007 
nvtxs:       1129, cut:   853856, balance: 1.002 1.001 1.017 1.019 
nvtxs:       2242, cut:   852461, balance: 1.019 1.019 1.007 1.004 
nvtxs:       4449, cut:   897701, balance: 1.019 1.006 1.005 1.005 
nvtxs:       8810, cut:   971707, balance: 1.006 1.018 1.017 1.006 
nvtxs:      17441, cut:  1065821, balance: 1.006 1.020 1.013 1.007 
nvtxs:      34499, cut:  1157456, balance: 1.020 1.004 1.005 1.009 
nvtxs:      68085, cut:  1234529, balance: 1.010 1.020 1.016 1.006 
nvtxs:     134158, cut:  1325119, balance: 1.020 1.012 1.011 1.020 
nvtxs:     263596, cut:  1488813, balance: 1.016 1.016 1.015 1.020 
nvtxs:     515339, cut:  1679233, balance: 1.017 1.020 1.020 1.018 
nvtxs:    1001196, cut:  1916268, balance: 1.019 1.014 1.016 1.020 
nvtxs:    1939743, cut:  2209083, balance: 1.001 1.020 1.020 1.020 
      Setup: Max:   7.256, Sum:  14.513, Balance:   1.000
   Matching: Max:   6.763, Sum:  13.526, Balance:   1.000
Contraction: Max:  10.586, Sum:  21.172, Balance:   1.000
   InitPart: Max:   0.006, Sum:   0.013, Balance:   1.000
    Project: Max:   0.113, Sum:   0.176, Balance:   1.282
 Initialize: Max:   0.734, Sum:   1.458, Balance:   1.006
      K-way: Max:   2.276, Sum:   4.552, Balance:   1.000
      Remap: Max:   0.002, Sum:   0.003, Balance:   1.001
      Total: Max:  25.524, Sum:  51.048, Balance:   1.000
       Aux1: Max:   4.682, Sum:   9.364, Balance:   1.000
       Aux2: Max:   3.533, Sum:   7.067, Balance:   1.000
Final   2-way Cut: 2209083      Balance: 1.001 1.020 1.020 1.020 

-------------------------------------------------------
[001] proc/self/stat/VmPeak:     2899.10 MB
[000] proc/self/stat/VmPeak:     3052.07 MB
-------------------------------------------------------

-------------------------------------------------------
[001] proc/self/stat/VmPeak:     3408.94 MB
[000] proc/self/stat/VmPeak:     3190.36 MB
-------------------------------------------------------

The terminal output for running python3 verify_mag_partitions.py is:

test part 0
test part 1

Somehow (I tried on a new VM with the exact same setup) I got past the "failed on writing parmetis" error, and only the Invalid count error remains.

Fatal error in PMPI_Irecv: Invalid count, error stack:
PMPI_Irecv(156): MPI_Irecv(buf=0x7f046b98c010, count=-1774416661, MPI_LONG_LONG_INT, src=0, tag=1, comm=0x84000007, request=0x56080ccd08f0) failed
PMPI_Irecv(98).: Negative count, value is -1774416661
Fatal error in PMPI_Irecv: Invalid count, error stack:
PMPI_Irecv(156): MPI_Irecv(buf=0x7f1f79b0d010, count=-21988628, MPI_LONG_LONG_INT, src=0, tag=1, comm=0x84000004, request=0x5582a4ee9760) failed
PMPI_Irecv(98).: Negative count, value is -21988628

There’s probably something wrong when generating the inputs for ParMETIS.

One way is to look deeper into the code you used to generate the inputs.

The other way is to use the new distributed partition pipeline I mentioned last week. This will be the recommended way going forward. Would you like to try it even though it’s not fully delivered yet? I think your case is a perfect fit for this pipeline. Just follow the tutorial: 7.1 Data Preprocessing — DGL 0.9 documentation. In order to use it, you need to clone DGL (GitHub - dmlc/dgl: Python package built to ease deep learning on graph, on top of existing DL frameworks.) and use the dist_part branch. All the related scripts are under //tools/.


Hi Rhett, thanks for suggesting the pipeline. I tried it and got a timeout error when dispatching data. The client machine gave RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 60000ms for recv operation to complete, followed by Exception in thread Thread-2:, and the server machines gave RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.138.0.4]:51850, followed by Exception in thread Thread-1:. I guess the timeout on the client machine is what causes the connection-closed error on the server side. Do you mind having a look at it? Thank you

Here is the complete error message:

/usr/bin/python3 /home/luyc/workspace/dist_part/tools/distgraphlaunch.py --num_proc_per_machine 1  --ip_config ip_config.txt  --master_port 12345 "/usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned "
(export DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1  RANK=0 MASTER_ADDR=10.138.0.2 MASTER_PORT=12345; /usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned )
(export DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1  RANK=1 MASTER_ADDR=10.138.0.2 MASTER_PORT=12345; /usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned )
cleanupu process runs
[ddgltest INFO 2022-08-26 19:00:49,954 PID:6958] Added key: store_based_barrier_key:1 to store for rank: 0
[ddistdgltest INFO 2022-08-26 19:00:49,954 PID:12907] Added key: store_based_barrier_key:1 to store for rank: 1
[ddgltest INFO 2022-08-26 19:00:49,954 PID:6958] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[ddgltest INFO 2022-08-26 19:00:49,954 PID:6958] [Rank: 0] Done with process group initialization...
[ddistdgltest INFO 2022-08-26 19:00:49,954 PID:12907] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[ddgltest INFO 2022-08-26 19:00:49,954 PID:6958] [Rank: 0] Starting distributed data processing pipeline...
[ddistdgltest INFO 2022-08-26 19:00:49,955 PID:12907] [Rank: 1] Done with process group initialization...
[ddistdgltest INFO 2022-08-26 19:00:49,955 PID:12907] [Rank: 1] Starting distributed data processing pipeline...
[ddistdgltest INFO 2022-08-26 19:00:51,020 PID:12907] [Rank: 1] Initialized metis partitions and node_types map...
[ddistdgltest INFO 2022-08-26 19:00:51,021 PID:12907] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/feat-1.npy
[ddgltest INFO 2022-08-26 19:00:51,060 PID:6958] [Rank: 0] Initialized metis partitions and node_types map...
[ddgltest INFO 2022-08-26 19:00:51,060 PID:6958] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/feat-0.npy
[ddistdgltest INFO 2022-08-26 19:01:02,821 PID:12907] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/year-1.npy
[ddistdgltest INFO 2022-08-26 19:01:03,011 PID:12907] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/label-1.npy
[ddistdgltest INFO 2022-08-26 19:01:03,105 PID:12907] [Rank: 1] node feature name: _N/feat, feature data shape: torch.Size([55529978, 128])
[ddistdgltest INFO 2022-08-26 19:01:03,106 PID:12907] [Rank: 1] node feature name: _N/year, feature data shape: torch.Size([55529978, 1])
[ddistdgltest INFO 2022-08-26 19:01:03,106 PID:12907] [Rank: 1] node feature name: _N/label, feature data shape: torch.Size([55529978, 1])
[ddistdgltest INFO 2022-08-26 19:01:03,106 PID:12907] Reading csv files from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/edge_index/_N:_E:_N1.txt
[ddistdgltest INFO 2022-08-26 19:01:41,842 PID:12907] [Rank: 1] Done reading edge_file: 4, (807842936,)
[ddistdgltest INFO 2022-08-26 19:01:41,887 PID:12907] [Rank: 1] Done reading dataset deom /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data
[ddgltest INFO 2022-08-26 19:01:44,331 PID:6958] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/year-0.npy
[ddgltest INFO 2022-08-26 19:01:44,730 PID:6958] Loading numpy from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/node_data/_N/label-0.npy
[ddgltest INFO 2022-08-26 19:01:44,843 PID:6958] [Rank: 0] node feature name: _N/feat, feature data shape: torch.Size([55529978, 128])
[ddgltest INFO 2022-08-26 19:01:44,843 PID:6958] [Rank: 0] node feature name: _N/year, feature data shape: torch.Size([55529978, 1])
[ddgltest INFO 2022-08-26 19:01:44,843 PID:6958] [Rank: 0] node feature name: _N/label, feature data shape: torch.Size([55529978, 1])
[ddgltest INFO 2022-08-26 19:01:44,843 PID:6958] Reading csv files from /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data/edge_index/_N:_E:_N0.txt
[ddgltest INFO 2022-08-26 19:02:32,985 PID:6958] [Rank: 0] Done reading edge_file: 4, (807842936,)
[ddgltest INFO 2022-08-26 19:02:33,032 PID:6958] [Rank: 0] Done reading dataset deom /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data
[Rank:  1 ] Reading file:  /home/luyc/workspace/dataset/ogbn_papers100M/2parts/_N.txt
Traceback (most recent call last):
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py", line 52, in <module>
    multi_machine_run(params)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 690, in multi_machine_run
    gen_dist_partitions(rank, params.world_size, params)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 544, in gen_dist_partitions
    read_dataset(rank, world_size, id_lookup, params, schema_map)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 394, in read_dataset
    edge_data = augment_edge_data(edge_data, id_lookup, edge_tids, rank, world_size)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/utils.py", line 243, in augment_edge_data
    edge_data[constants.OWNER_PROCESS] = lookup_service.get_partition_ids(edge_data[constants.GLOBAL_DST_ID])
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/dist_lookup.py", line 200, in get_partition_ids
    owner_resp_list = alltoallv_cpu(self.rank, self.world_size, out_list)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/gloo_wrapper.py", line 122, in alltoallv_cpu
    __alltoall_cpu(rank, world_size, recv_counts, send_counts)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/gloo_wrapper.py", line 60, in __alltoall_cpu
    dist.scatter(output_tensor_list[i], input_tensor_list if i == rank else [], src=i)
  File "/home/luyc/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2359, in scatter
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 60000ms for recv operation to complete
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/luyc/workspace/dist_part/tools/distgraphlaunch.py", line 111, in run
    subprocess.check_call(ssh_cmd, shell=True)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 10.138.0.4 '(export DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1  RANK=1 MASTER_ADDR=10.138.0.2 MASTER_PORT=12345; /usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned )'' returned non-zero exit status 1.
[Rank:  0 ] Reading file:  /home/luyc/workspace/dataset/ogbn_papers100M/2parts/_N.txt
Traceback (most recent call last):
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py", line 52, in <module>
    multi_machine_run(params)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 690, in multi_machine_run
    gen_dist_partitions(rank, params.world_size, params)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 544, in gen_dist_partitions
    read_dataset(rank, world_size, id_lookup, params, schema_map)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/data_shuffle.py", line 394, in read_dataset
    edge_data = augment_edge_data(edge_data, id_lookup, edge_tids, rank, world_size)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/utils.py", line 243, in augment_edge_data
    edge_data[constants.OWNER_PROCESS] = lookup_service.get_partition_ids(edge_data[constants.GLOBAL_DST_ID])
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/dist_lookup.py", line 200, in get_partition_ids
    owner_resp_list = alltoallv_cpu(self.rank, self.world_size, out_list)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/gloo_wrapper.py", line 122, in alltoallv_cpu
    __alltoall_cpu(rank, world_size, recv_counts, send_counts)
  File "/home/luyc/workspace/dist_part/tools/distpartitioning/gloo_wrapper.py", line 60, in __alltoall_cpu
    dist.scatter(output_tensor_list[i], input_tensor_list if i == rank else [], src=i)
  File "/home/luyc/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2359, in scatter
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.138.0.4]:51850
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/luyc/workspace/dist_part/tools/distgraphlaunch.py", line 111, in run
    subprocess.check_call(ssh_cmd, shell=True)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 10.138.0.2 '(export DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1  RANK=0 MASTER_ADDR=10.138.0.2 MASTER_PORT=12345; /usr/bin/python3 /home/luyc/workspace/dist_part/tools/distpartitioning/data_proc_pipeline.py --world-size 2 --partitions-dir /home/luyc/workspace/dataset/ogbn_papers100M/2parts --input-dir /home/luyc/workspace/dataset/ogbn_papers100M/chunked-data --graph-name papers100M --schema metadata.json --num-parts 2 --output /home/luyc/workspace/dataset/ogbn_papers100M/partitioned )'' returned non-zero exit status 1.

BTW, for the previous step, random_partition.py probably needs to be in the same directory as utils to avoid "cannot import name 'setdir' from 'utils'", and the args specified in the tutorial are not consistent with the args defined in the script.

This is a known issue (I just hit it too :rofl:) and I will fix it. Could you try increasing the value below to 1800 instead of 60: dgl/constants.py at 7e2ed9f8a84e63b8f647c3b779181d5489499d46 · dmlc/dgl · GitHub. This works for me.
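For context only, and assuming that constant ends up as the gloo process-group timeout (the thread does not confirm this): in plain torch.distributed the equivalent knob looks like this:

from datetime import timedelta
import torch.distributed as dist

# Illustration only: raise the gloo collective timeout to 30 minutes so
# large alltoallv exchanges between ranks do not abort after 60 s.
# Relies on MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE being set.
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=timedelta(seconds=1800),
)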

As for the issue you mentioned in the previous step, I ran with the command below: PYTHONPATH=/dgl/tools:$PYTHONPATH python3 /dgl/tools/partition_algo/random_partition.py .... How did you run it? Could you elaborate more?

Sorry for the late reply. Using 1800 instead of 60 does help with the issue. Thanks!

As for the import issue, I simply moved the file out to dgl/tools, the same directory where utils is.

