Hi,
I have transformed my raw data into graphs. As a result, I have 300 000 graphs for training and validation and another 50 000 for testing.
Loading the whole dataset for training at once is not recommended: I am running on CPU, and doing so consumes 100 % of it. I used the following code to generate batches of data:
from torch.utils.data import Dataset, IterableDataset, DataLoader
from itertools import cycle, islice

class IterableM(IterableDataset):
    def __init__(self, data):
        self.data = data

    def process_data(self, data):
        for graph in data:
            yield graph

    def get_stream(self, data):
        return cycle(self.process_data(data))

    def __iter__(self):
        return self.get_stream(self.data)
Data and model initialization
import torch.nn as nn
import torch.optim as optim

# MyData, Classifier and collate are defined elsewhere (not shown)
trainset = MyData("./data.bin")  # List of tuples containing 300k graphs and their labels
data = IterableM(trainset)  # The class that I assume divides my data into batches
model = Classifier(1, 64, 64, 2)  # Model initialization
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()
loader = DataLoader(data, batch_size=10000, num_workers=0, collate_fn=collate)
Training
num = 0
epoch_losses = []
epoch_loss = 0
count = 0
for (bg, label) in islice(loader, 30):
    prediction = model(bg)
    loss = loss_func(prediction, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    epoch_loss += loss.detach().item()
    num += 1
    count += 1
    epoch_loss /= (num)
    print('Epoch {}, loss {:.4f}'.format(count, epoch_loss))
    epoch_losses.append(epoch_loss)
I assumed that the training above amounts to 30 epochs, each on one batch of 10 000 graphs, and hence covers the whole training set (30 × 10 000 = 300 000).
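To make that assumption concrete, here is a minimal sketch of the same cycle/islice pattern in plain Python, without PyTorch. The helper toy_batches and the scaled-down sizes are made up for illustration only; the sketch just counts how often each sample is visited, assuming the loader groups the stream into consecutive batches.

```python
from collections import Counter
from itertools import cycle, islice

def toy_batches(stream, batch_size):
    """Group an endless stream into lists of batch_size items (hypothetical helper)."""
    while True:
        yield list(islice(stream, batch_size))

# Scaled-down stand-ins for 300 000 samples, batch size 10 000, 30 batches
n_samples, batch_size, n_batches = 300, 10, 30

stream = cycle(range(n_samples))  # same idea as get_stream: repeat the data endlessly
visits = Counter()                # how many times each sample index was seen
n_steps = 0                       # how many batches (optimizer steps) were taken
for batch in islice(toy_batches(stream, batch_size), n_batches):
    visits.update(batch)
    n_steps += 1

print(n_steps, len(visits), set(visits.values()))  # prints: 30 300 {1}
```

In this toy run, 30 batches of 10 visit each of the 300 samples exactly once. If the real loader behaves the same way, the loop would correspond to a single pass over the data with 30 optimizer steps, rather than 30 epochs in the usual sense.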
The resulting losses were as follows:
Epoch 1, loss 0.2717
Epoch 2, loss 0.2711
Epoch 3, loss 0.1805
Epoch 4, loss 0.1127
Epoch 5, loss 0.0767
Epoch 6, loss 0.0576
Epoch 7, loss 0.0468
Epoch 8, loss 0.0394
Epoch 9, loss 0.0341
Epoch 10, loss 0.0302
Epoch 11, loss 0.0272
Epoch 12, loss 0.0246
Epoch 13, loss 0.0224
Epoch 14, loss 0.0206
Epoch 15, loss 0.0191
Epoch 16, loss 0.0178
Epoch 17, loss 0.0167
Epoch 18, loss 0.0155
Epoch 19, loss 0.0147
Epoch 20, loss 0.0139
Epoch 21, loss 0.0132
Epoch 22, loss 0.0126
Epoch 23, loss 0.0121
Epoch 24, loss 0.0114
Epoch 25, loss 0.0109
Epoch 26, loss 0.0105
Epoch 27, loss 0.0101
Epoch 28, loss 0.0097
Epoch 29, loss 0.0094
Epoch 30, loss 0.0090
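One caveat about these numbers: in the loop above, epoch_loss is divided by num at every step and the divided value is carried into the next step, so each printed value is (previous printed value + current batch loss) / step, not a plain average. A rough sketch of undoing that, using the first six printed values (the function name recover_batch_losses is mine, and the result is only approximate because the log is rounded to four decimals):

```python
# First six values copied from the log above
printed = [0.2717, 0.2711, 0.1805, 0.1127, 0.0767, 0.0576]

def recover_batch_losses(printed_values):
    """Invert p_k = (p_{k-1} + loss_k) / k, starting from p_0 = 0."""
    losses, prev = [], 0.0
    for k, p in enumerate(printed_values, start=1):
        losses.append(k * p - prev)  # loss_k = k * p_k - p_{k-1}
        prev = p
    return losses

batch_losses = recover_batch_losses(printed)
print([round(x, 4) for x in batch_losses])
```

The recovered per-batch losses all stay close to 0.27, so the steady drop in the log seems to come mostly from the repeated division rather than from the model improving.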
And the confusion matrix for my binary classification problem was:
[[11712, 6598],
 [3399, 8291]]
(not perfect, but good enough as a starting point).
I know that my approach is a bit extreme, especially with 10 000 samples per batch. I would like your opinion on whether this code correctly covers the entire dataset of 300 000 samples. Is there a better alternative you would suggest for training over hundreds of epochs, where each epoch takes around 5 000 new graphs and trains on smaller batches (256 or 512) until all 300 000 samples are covered (like a Keras generator)? Or is my code workable as it is?
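To make the alternative concrete, here is a rough sketch of the index bookkeeping I have in mind, in plain Python. The function chunked_epochs and its parameters are hypothetical names of my own, not part of PyTorch; it only produces lists of sample indices and does no training.

```python
import random

def chunked_epochs(n_samples, chunk_size, batch_size, seed=0):
    """Yield (epoch, batch_indices) pairs (hypothetical helper).

    Each "epoch" is one fresh chunk of the shuffled data, split into mini-batches.
    """
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # shuffle once so every chunk is a random subset
    for epoch, start in enumerate(range(0, n_samples, chunk_size), start=1):
        chunk = order[start:start + chunk_size]  # the new graphs for this epoch
        for b in range(0, len(chunk), batch_size):
            yield epoch, chunk[b:b + batch_size]  # last batch of a chunk may be smaller

seen = set()
n_epochs = 0
for epoch, batch in chunked_epochs(n_samples=300_000, chunk_size=5_000, batch_size=256):
    seen.update(batch)  # a real loop would fetch these graphs and run one optimizer step
    n_epochs = epoch

print(n_epochs, len(seen))  # prints: 60 300000
```

With 300 000 samples and 5 000 per epoch, this gives 60 such epochs per full pass, so reaching hundreds of epochs would mean reshuffling and starting over. In PyTorch, I assume each list of indices could be used with torch.utils.data.Subset or passed through the batch_sampler argument of DataLoader on a map-style dataset, but I have not tested that.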
Thanks in advance!