Using DGL to save and load really large graphs

Hi,

I’m using DGL and PyTorch on a protein dataset with 100K+ structures (each of which is converted to a graph, so 100K+ graphs). The dataset does not fit in memory, so ideally I’d like to save parts of it and load them back later. I’m using DGL’s dataloader for that.

Is there a way I can progressively save parts of my dataset or should I just manually produce batches and load them at training time?

thanks in advance.

It’s not supported right now. Could you try saving the graphs into separate files, e.g. 100 files with 1K graphs each?

Sure, I could do that, but it means I would need to load a different dataset with the PyTorch dataloader each time. There is no other way to do this, right?

I’ve seen that the core saving functions are not written in Python; otherwise I would have done it myself. Thanks for your help, though.

You can define your own PyTorch dataset class and encapsulate the split logic in it, following the PyTorch tutorial Writing Custom Datasets, DataLoaders and Transforms. For example:

import dgl
from torch.utils.data.dataset import Dataset

class CustomDataset(Dataset):
    def __init__(self, num_graphs, graphs_per_file=100):
        self.num_graphs = num_graphs
        self.graphs_per_file = graphs_per_file

    def __len__(self):
        return self.num_graphs

    def __getitem__(self, index):
        filename = f"graph_{index // self.graphs_per_file}.bin"
        # load_graphs takes a list of indices and returns (graph_list, labels_dict)
        graphs, _ = dgl.load_graphs(filename, [index % self.graphs_per_file])
        return graphs[0]

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.