Failed on writing metis and killed for DistDGL on papers100M

Hi,

I’m trying to run papers100M on two machines but got stuck at the partition step. I’m wondering how much CPU memory is needed to run partition_graph.py on ogbn-papers100M. Currently, my CPU memory is 312 GB. After running python3 partition_graph.py --dataset ogb-paper100M --num_parts 2 --balance_train --balance_edges, it printed "Failed on writing metis..." and was then killed for OOM. The complete error message is below:

load ogbn-papers100M
This will download 56.17GB. Will you proceed? (y/N)
y
Downloading http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip
Downloaded 56.17 GB: 100%|█████████████████████████████████████| 57519/57519 [18:25<00:00, 52.04it/s]
Extracting dataset/papers100M-bin.zip
Loading necessary files...
This might take a while.
Processing graphs...
100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21399.51it/s]
Converting graphs into DGL objects...
100%|██████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.80s/it]
Saving...
finish loading ogbn-papers100M
finish constructing ogbn-papers100M
load ogb-paper100M takes 2013.006 seconds
|V|=111059956, |E|=1615685872
train: 1207179, valid: 125265, test: 214338
Converting to homogeneous graph takes 22.111s, peak mem: 190.091 GB
Convert a graph into a bidirected graph: 548.671 seconds, peak memory: 233.830 GB
Construct multi-constraint weights: 4.294 seconds, peak memory: 233.830 GB
Failed on writing metis2905.6
Failed on writing metis2905.7
Failed on writing metis2905.9
Failed on writing metis2905.11
Failed on writing metis2905.12
Failed on writing metis2905.13
Killed

Thanks in advance!

I tried it on my side; see the details below. Besides increasing your RAM, you could try ParMETIS, which may require less RAM and runs faster: dgl/examples/pytorch/rgcn/experimental at master · dmlc/dgl · GitHub

load ogbn-papers100M
finish loading ogbn-papers100M
finish constructing ogbn-papers100M
load ogb-paper100M takes 319.756 seconds
|V|=111059956, |E|=1615685872
train: 1207179, valid: 125265, test: 214338
Converting to homogeneous graph takes 22.279s, peak mem: 227.535 GB
Convert a graph into a bidirected graph: 467.348 seconds, peak memory: 271.274 GB
Construct multi-constraint weights: 4.047 seconds, peak memory: 271.274 GB
[03:12:53] /opt/dgl/src/graph/transform/metis_partition_hetero.cc:87: Partition a graph with 111059956 nodes and 3228124712 edges into 2 parts and get 89205390 edge cuts
Metis partitioning: 4079.545 seconds, peak memory: 367.365 GB
Assigning nodes to METIS partitions takes 4553.460s, peak mem: 367.365 GB
Reshuffle nodes and edges: 233.161 seconds
Split the graph: 1178.080 seconds
Construct subgraphs: 261.475 seconds
Splitting the graph into partitions takes 1676.012s, peak mem: 409.879 GB
part 0 has 69828010 nodes and 55645679 are inside the partition
part 0 has 815801925 edges and 773852264 are inside the partition
part 1 has 67651329 nodes and 55414277 are inside the partition
part 1 has 889126971 edges and 841833608 are inside the partition
Save partitions: 535.586 seconds, peak memory: 409.879 GB
There are 1615685872 edges in the graph and 0 edge cuts for 2 partitions.

I modified write_mag.py for papers100M by changing line 8 to ‘ogbn-papers100M’ (I also need to change the saved file names and the name of the JSON read at the end, but the error came before those lines, so I haven’t addressed them yet). When I try to run it, it raises an error on line 16. I’m wondering whether this is because some parts of the script are specific to mag. Also, papers100M seems to be in the homogeneous format already. Should I delete lines 19-25, skipping the step that converts the heterogeneous format to homogeneous?

Here are the details:

This will download 56.17GB. Will you proceed? (y/N)
y
Downloading http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip
Downloaded 56.17 GB: 100%|████████████████████████████████████| 57519/57519 [19:20<00:00, 49.56it/s]
Extracting dataset/papers100M-bin.zip
Loading necessary files...
This might take a while.
Processing graphs...
100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20068.44it/s]
Converting graphs into DGL objects...
100%|█████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.95s/it]
Saving...
Traceback (most recent call last):
  File "write_pa.py", line 16, in <module>
    hg.nodes['paper'].data['feat'] = hg_orig.nodes['paper'].data['feat']
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/view.py", line 38, in __getitem__
    ntid = self._typeid_getter(ntype)
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/heterograph.py", line 1189, in get_ntype_id
    raise DGLError('Node type "{}" does not exist.'.format(ntype))
dgl._ffi.base.DGLError: Node type "paper" does not exist.

The link I shared with you is dedicated to the mag dataset, so you need to modify write_mag.py and the other scripts accordingly. And you need to install ParMETIS as well.

BTW, you could specify the root of DglNodePropPredDataset to avoid downloading the dataset on every run.
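
For example, something like this reuses an already-downloaded copy (the path below is just a placeholder):

```python
from ogb.nodeproppred import DglNodePropPredDataset

# Point `root` at a directory where papers100M-bin is already downloaded
# and extracted; later runs will reuse it instead of fetching the ~56 GB
# archive again. The path is only an example.
dataset = DglNodePropPredDataset(name='ogbn-papers100M', root='/data/ogb')
g, labels = dataset[0]
```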

Do you mind giving some hints on how to modify it? It seems papers100M is in homogeneous format, so many lines in write_mag.py will be unnecessary for papers100M. After loading the dataset, I directly set g to dataset[0] and skipped to line 21. However, running this version raises an error saying that a tuple does not have ‘ndata’. Then, I tried to treat it as a heterogeneous graph like mag, simply deleting

Then it went OOM when trying to remove self-loops. Really appreciate it!

Please go through this tutorial first: 7.1 Preprocessing for Distributed Training — DGL 0.9.0 documentation. This doc will give you the basic idea of what write_mag.py does. I believe you’ll be able to write write_papers.py on your own.

write_mag.py mainly aims to generate the inputs for ParMETIS: xxx_nodes.txt and xxx_edges.txt. When you treat papers100M as heterogeneous directly, you hit OOM. I think some keys are not necessary, such as g.ndata['orig_id']. As RAM is very tight on your machine, you need to free any unnecessary objects when generating the inputs for ParMETIS and in any following operations. For example, after dumping node_data into xxx_nodes.txt, you’d better delete those objects. And since the node features occupy a lot of RAM, you could free them once they are processed.
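
Here is a rough sketch of what a write_papers.py could look like, just to show the overall flow and the memory-freeing idea. The file names and the exact columns written to the nodes/edges files below are placeholders; copy the real layout from write_mag.py and the 7.1 tutorial, since ParMETIS expects a specific format. papers100M has a single node and edge type, so the per-ntype handling in write_mag.py can be dropped.

```python
import gc

import numpy as np
import torch as th
from ogb.nodeproppred import DglNodePropPredDataset

# dataset[0] is a (graph, labels) tuple, which is why calling .ndata on it
# directly raises "'tuple' object has no attribute 'ndata'".
dataset = DglNodePropPredDataset(name='ogbn-papers100M', root='/data/ogb')
g, labels = dataset[0]
split_idx = dataset.get_idx_split()
del dataset

# Dump node features and labels to disk and drop them from the graph so
# they don't sit in RAM while the ParMETIS inputs are being written.
np.save('paper_feat.npy', g.ndata.pop('feat').numpy())
th.save(labels, 'paper_labels.pt')
del labels
gc.collect()

# papers100M is already homogeneous (one node type, one edge type), so the
# mag-specific dgl.to_homogeneous() / per-ntype bookkeeping is not needed.
num_nodes = g.num_nodes()
train_mask = np.zeros(num_nodes, dtype=np.int64)
train_mask[split_idx['train'].numpy()] = 1

# Placeholder node file: node-type id, a train-membership weight and the
# node id per line. Replace with the exact columns write_mag.py produces.
node_data = np.stack([np.zeros(num_nodes, dtype=np.int64),
                      train_mask,
                      np.arange(num_nodes, dtype=np.int64)], axis=1)
np.savetxt('papers100M_nodes.txt', node_data, fmt='%d')
del node_data, train_mask
gc.collect()

# Placeholder edge file: "src dst" per line. With 1.6B edges you will want
# to write this in chunks rather than materialising one big array.
src, dst = g.edges()
np.savetxt('papers100M_edges.txt',
           np.stack([src.numpy(), dst.numpy()], axis=1), fmt='%d')
```

The main point is popping the features out of the graph and deleting each intermediate array right after it is written; those are what push the peak RAM well beyond what the partitioning itself needs.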