Dealing with large Graph datasets

Hi,

I have transformed my raw data into graphs. As a result, I have 300 000 graphs for training and validation, and another 50 000 for testing.

Loading the whole dataset at once for training is not feasible: I am running on CPU and it maxes it out at 100 %. I used the following code to generate batches of data.

from torch.utils.data import Dataset, IterableDataset, DataLoader
from itertools import cycle, islice

class IterableM(IterableDataset):
    """Wraps an in-memory list of graph samples as an infinite stream."""

    def __init__(self, data):
        self.data = data

    def process_data(self, data):
        # Yield the samples one by one.
        for graph in data:
            yield graph

    def get_stream(self, data):
        # cycle() restarts the generator once it is exhausted,
        # so iteration never stops on its own.
        return cycle(self.process_data(data))

    def __iter__(self):
        return self.get_stream(self.data)
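
Because of cycle, the stream never terminates on its own, which is why I cap the loader with islice further down. A tiny illustration of what an IterableM instance yields, using dummy integers instead of graphs:

stream = iter(IterableM([1, 2, 3]))
print([next(stream) for _ in range(7)])  # [1, 2, 3, 1, 2, 3, 1] -- wraps around indefinitely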

Data and model initialization

trainset = MyData("./data.bin")  # list of tuples containing 300k graphs and their labels

data = IterableM(trainset)  # the class that I assume divides my data into batches



import torch.nn as nn
import torch.optim as optim

model = Classifier(1, 64, 64, 2)  # model initialization
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()

loader = DataLoader(data, batch_size=10000, num_workers=0, collate_fn=collate)  # collate is my custom collate_fn for graphs

Training

num = 0
epoch_losses = []
epoch_loss = 0
count = 0
for (bg, label) in islice(loader, 30):   # take 30 batches of 10 000 from the cycling stream
    prediction = model(bg)
    loss = loss_func(prediction, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    epoch_loss += loss.detach().item()
    num += 1
    count += 1
    epoch_loss /= num                    # divide the accumulated value by the number of batches seen so far
    print('Epoch {}, loss {:.4f}'.format(count, epoch_loss))
    epoch_losses.append(epoch_loss)

What I assumed is that the training above, i.e. 30 batches of 10 000 samples, covers the whole 300 000-sample training set once.
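
A rough way to sanity-check that assumption on dummy integers (the default collate_fn is enough for plain ints): when the number of batches times the batch size equals the dataset size, every element should be visited exactly once.

toy = IterableM(list(range(12)))            # 12 fake "graphs"
toy_loader = DataLoader(toy, batch_size=4)  # default collate_fn handles plain ints
seen = [x.item() for batch in islice(toy_loader, 3) for x in batch]
assert sorted(seen) == list(range(12))      # 3 batches of 4 cover all 12 samples exactly once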

My resulting losses came as follows:

Epoch 1, loss 0.2717
Epoch 2, loss 0.2711
Epoch 3, loss 0.1805
Epoch 4, loss 0.1127
Epoch 5, loss 0.0767
Epoch 6, loss 0.0576
Epoch 7, loss 0.0468
Epoch 8, loss 0.0394
Epoch 9, loss 0.0341
Epoch 10, loss 0.0302
Epoch 11, loss 0.0272
Epoch 12, loss 0.0246
Epoch 13, loss 0.0224
Epoch 14, loss 0.0206
Epoch 15, loss 0.0191
Epoch 16, loss 0.0178
Epoch 17, loss 0.0167
Epoch 18, loss 0.0155
Epoch 19, loss 0.0147
Epoch 20, loss 0.0139
Epoch 21, loss 0.0132
Epoch 22, loss 0.0126
Epoch 23, loss 0.0121
Epoch 24, loss 0.0114
Epoch 25, loss 0.0109
Epoch 26, loss 0.0105
Epoch 27, loss 0.0101
Epoch 28, loss 0.0097
Epoch 29, loss 0.0094
Epoch 30, loss 0.0090

And the confusion matrix for my binary classification problem was

[[11712, 6598],
 [ 3399, 8291]]

(not perfect, but this is just a first attempt).

I know that my approach is a bit extreme, especially with 10 000 samples per batch. I would like your opinion on whether this code correctly covers the entire dataset of 300 000 samples. Would you suggest a better alternative, for example training for hundreds of epochs where each epoch draws around 5 000 new graphs and trains on smaller batches (256 or 512) until all 300 000 samples are covered (similar to a Keras generator), or is my code feasible as it is?
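
For concreteness, here is a minimal sketch of the alternative I have in mind, assuming a plain map-style Dataset so the DataLoader can shuffle and batch it, and reusing the same model, optimizer, loss and collate function as above. The 5 000-graph chunks and the batch size of 512 are just placeholders, and the outer loop would be repeated (with reshuffling) for further passes over the data:

from torch.utils.data import Dataset, DataLoader, Subset
import random

class GraphDataset(Dataset):  # hypothetical map-style wrapper around the (graph, label) tuples
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

full = GraphDataset(trainset)
indices = list(range(len(full)))
random.shuffle(indices)                               # shuffle once per pass over the data

chunk = 5000                                          # graphs drawn per "epoch"
for epoch, start in enumerate(range(0, len(indices), chunk)):
    subset = Subset(full, indices[start:start + chunk])
    epoch_loader = DataLoader(subset, batch_size=512, shuffle=True, collate_fn=collate)
    epoch_loss = 0
    for bg, label in epoch_loader:
        prediction = model(bg)
        loss = loss_func(prediction, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    print('Epoch {}, loss {:.4f}'.format(epoch + 1, epoch_loss / len(epoch_loader)))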

Thanks in advance !


Actually, I didn't really get what your exact question is. Your pipeline looks good to me. Batch size needs to be tuned, and there is no absolute rule that larger is better or worse.

Do you think it is too slow or consuming too many resources? Or are you looking for training advice to get better performance?