How to create Massive Heterograph in DGL?

Hi all, I have some credit card transaction data that I’m trying to create a heterograph out of. This heterograph will have 3 kinds of nodes (cardholders, merchants and transactions), each with 80-dimensional node features. The eventual goal is to train a multi-GPU R-GCN on this heterograph to do node classification. My graph has 1B+ nodes and 10B+ edges, so I’m wondering what would be the best way to create the heterograph and then perform multi-GPU R-GCN training. I know there’s the dgl.heterograph function in DGL, but its data_dict argument requires PyTorch tensors loaded into the memory of a single machine, right? That doesn’t seem feasible for the size of my graph. Any workarounds or alternative ways of creating a large heterograph at this scale?

Hi,

Could you save the graph without features using an in-memory process? You may need to save the features separately for now. We’ll consider supporting this feature in the future.

Do you want to use multiple machines, or just a single machine with multiple GPUs?

For a start I’d like to be able to use an NVIDIA DGX A100, which has eight A100 GPUs (either 40GB or 80GB each), but because of the graph size it may require multiple DGX A100s (multi-node).

I don’t understand. It seems you were trying to avoid using one machine in your first post, but you’d like to start with one machine?

Ideally I’d like to be able to create the graph on 2 (or more) DGX A100s and then be able to train a GNN on 16 (or more) A100 GPUs.

Hi, can you please give an example of how I would do that? Currently, this is how I create the heterograph:

import dgl

data_dict = {('cardholder', 'transaction', 'merchant'): (card_nodes_th, merch_nodes_th)}
g = dgl.heterograph(data_dict)

Here card_nodes_th and merch_nodes_th are PyTorch tensors of my cardholder and merchant node IDs, and I have to load them all from disk into CPU memory.

And then I add cardholder node features like this:

g.nodes['cardholder'].data['features'] = card_feats_th 

where card_feats_th is the PyTorch tensor of cardholder features.

Can you save the graph directly after g = dgl.heterograph(data_dict), instead of after g.ndata[...] = ..., and store those features separately with torch.save?
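
Roughly like this, as a minimal sketch (the tensors and file names here are just placeholders, and I’m assuming the structure itself fits in CPU memory):

import dgl
import torch

# Toy stand-ins for the real edge tensors, just to make the sketch runnable.
card_nodes_th = torch.randint(0, 1000, (5000,))
merch_nodes_th = torch.randint(0, 200, (5000,))
data_dict = {('cardholder', 'transaction', 'merchant'): (card_nodes_th, merch_nodes_th)}
g = dgl.heterograph(data_dict)

# Save the structure-only graph before attaching any node features ...
dgl.save_graphs('graph_structure.bin', [g])

# ... and keep the big feature tensors in separate files.
card_feats_th = torch.randn(g.num_nodes('cardholder'), 80)
torch.save(card_feats_th, 'cardholder_features.pt')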

I see. So use neighbor sampling for training, and during neighbor sampling load the features for the input nodes (i.e. the source nodes) from disk? This way only the graph structure built from the data_dict needs to be created in memory, and when we load features from disk it’s only for a subset of the nodes, i.e. the source nodes of the MFG, right? Does DGL support reading node features directly from disk into GPU memory through something like GPUDirect Storage? I also found this related paper https://arxiv.org/pdf/2101.07956.pdf and was wondering if it’s on the DGL team’s radar.
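
Concretely, the loop I’m imagining looks something like this (just a sketch; I’m assuming the dgl.dataloading.MultiLayerNeighborSampler / NodeDataLoader API, whose names may differ across DGL versions, and the random tensors are stand-ins for the real data):

import dgl
import torch

# Toy stand-in for the real graph, only to make the sketch runnable.
src = torch.randint(0, 1000, (5000,))
dst = torch.randint(0, 200, (5000,))
g = dgl.heterograph({('cardholder', 'transaction', 'merchant'): (src, dst)})

# Stand-in for the on-disk feature store (in reality a memmap or similar).
card_feats = torch.randn(g.num_nodes('cardholder'), 80)

sampler = dgl.dataloading.MultiLayerNeighborSampler([10])
dataloader = dgl.dataloading.NodeDataLoader(
    g, {'merchant': torch.arange(g.num_nodes('merchant'))}, sampler,
    batch_size=256, shuffle=True)

for input_nodes, output_nodes, blocks in dataloader:
    # Only the MFG's source cardholder nodes need their features loaded.
    batch_card_feats = card_feats[input_nodes['cardholder']]
    # ... move blocks and features to GPU and run the forward/backward pass ...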

Hi,

We don’t support GPUDirect Storage right now, but you can store the features on disk and use memmap to read them, which saves memory. We have an example of this at dgl/train.py at master · dmlc/dgl · GitHub
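
In case it helps, the memmap idea is roughly this (a minimal sketch; the file name, sizes and node IDs are placeholders):

import numpy as np
import torch

num_cardholders, feat_dim = 100_000, 80  # placeholder sizes

# One-off preprocessing: dump the features into a flat binary file.
feats = np.memmap('cardholder_features.dat', dtype=np.float32,
                  mode='w+', shape=(num_cardholders, feat_dim))
feats[:] = np.random.rand(num_cardholders, feat_dim)
feats.flush()

# At training time: open read-only; rows are only paged in when indexed.
feats = np.memmap('cardholder_features.dat', dtype=np.float32,
                  mode='r', shape=(num_cardholders, feat_dim))
input_nodes = torch.tensor([3, 42, 99_999])  # e.g. the MFG's source node IDs
batch_feats = torch.from_numpy(np.asarray(feats[input_nodes.numpy()]))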

Also, regarding the PyTorch-Direct paper: it accelerates the scatter-gather from CPU to GPU, not from disk to GPU. We are also collaborating with the PyTorch-Direct team to bring this feature to DGL, so please stay tuned.
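
By scatter-gather I mean the baseline path below (a sketch of the ordinary PyTorch route, not of PyTorch-Direct itself); PyTorch-Direct instead lets the GPU fetch those rows directly from pinned host memory:

import torch

# Baseline: gather the needed rows out of a big CPU feature tensor,
# then copy just that slice to the GPU for the minibatch.
feats_cpu = torch.randn(100_000, 80)
if torch.cuda.is_available():
    feats_cpu = feats_cpu.pin_memory()  # pinned memory speeds up host-to-device copies
input_nodes = torch.randint(0, 100_000, (4096,))

gathered = feats_cpu[input_nodes]  # gather happens on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
batch = gathered.to(device, non_blocking=True)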
