GAT learns only the mean of the node attributes

Hi there,

I am trying to use GAT to learn node attributes, but the MSE loss does not decrease even after training for many epochs.

To be more specific, I have a network of about 130 nodes and 190 edges, and each node has three features. Given the node features, I can run a physics-based simulation to predict an attribute for each node. I then randomly perturb the node features 100 times and rerun the simulation for each perturbation, which yields 100 different sets of attributes.

So I have one graph and 100 sets of [features, attribute] pairs, and my goal is to learn the node attributes.
My code snippet is attached below.

Can someone point out what I am doing wrong? Thanks

    import numpy as np
    import torch
    import torch.nn.functional as F

    # gd (data generator) and GAT (model class) are defined elsewhere in my project

    # Data generator settings
    network = 'Tnet3'
    n_sims = 100

    # GAT parameters
    device = torch.device("cpu")
    batch_size = 1     # batch size used for training, validation and test
    patience = 10      # used for early stopping
    best_score = -1
    best_loss = 10000
    num_heads = 4      # number of hidden attention heads
    num_layers = 2     # number of hidden layers
    num_out_heads = 1  # number of output attention heads
    num_hidden = 1     # number of hidden units
    in_drop = 0        # input feature dropout
    attn_drop = 0      # attention dropout
    alpha = 0.2        # negative slope of the leaky ReLU
    residual = True    # use residual connection
    lr = 1e-4          # learning rate
    weight_decay = 0   # weight decay
    epochs = 1000

    # define the loss function
    loss_fcn = torch.nn.MSELoss()
    # create the dataset: one graph plus n_sims sets of (features, attribute)
    graph, dataset = gd.OneStepSimulation(network, n_sims)
    # 80/10/10 train/validation/test split
    train_dataset = dataset[:int(n_sims * 0.8)]
    valid_dataset = dataset[int(n_sims * 0.8):int(n_sims * 0.9)]
    test_dataset = dataset[int(n_sims * 0.9):]

    num_feats = 3
    n_classes = 1
    g = graph
    heads = ([num_heads] * num_layers) + [num_out_heads]
    # define the model
    model = GAT(g,
                num_layers,
                num_feats,
                num_hidden,
                n_classes,
                heads,
                F.elu,
                in_drop,
                attn_drop,
                alpha,
                residual)
    # define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    model = model.to(device)
    cur_step = 0
    for epoch in range(epochs):
        # train the model
        model.train()
        loss_list = []
        for batch, data in enumerate(train_dataset):
            feats, labels = data
            feats = feats.to(device)
            labels = labels.to(device)
            # every sample shares the same graph
            model.g = graph
            for layer in model.gat_layers:
                layer.g = graph
            outputs = model(feats.float())
            loss = loss_fcn(outputs, labels)
            loss_list.append(loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        loss_data = np.array(loss_list).mean()
        print("Epoch {:05d} | Loss: {:.4f}".format(epoch + 1, loss_data))

Your code looks fine to me. However, the question is whether a model can really be trained in this setting: you have some ground-truth node features, and you want to recover them from randomly corrupted features. Does your dataset still contain enough information to make the prediction?
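
One quick sanity check, given the thread title, is to compare the trained GAT against a constant baseline that always predicts the per-node mean attribute computed over the training simulations; if the GAT's validation MSE is not clearly below the baseline's, the model really has learned nothing beyond the mean. A minimal sketch, assuming train_dataset and valid_dataset are lists of (feats, labels) tensors as in your snippet:

    import torch
    import torch.nn.functional as F

    # per-node mean attribute over the training simulations
    train_labels = torch.stack([labels.float() for _, labels in train_dataset])
    mean_prediction = train_labels.mean(dim=0)

    # MSE of the constant-mean predictor on the validation simulations
    baseline_mse = torch.stack([
        F.mse_loss(mean_prediction, labels.float())
        for _, labels in valid_dataset
    ]).mean()
    print("Constant-mean baseline MSE: {:.4f}".format(baseline_mse.item()))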

Yes, I think you are right. I only feed node features into the model; however, the outputs also depend on the edge attributes. I am not sure how to incorporate the edge attributes into the model, though.
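
One direction that might work is to put the edge attributes into graph.edata and build each message from both the source-node features and the edge attributes, using a custom message function in DGL. Below is a rough sketch only, not the GAT implementation used above; EdgeAwareLayer and the feature names are placeholders:

    import torch
    import torch.nn as nn

    class EdgeAwareLayer(nn.Module):
        """Toy message-passing layer that mixes edge attributes into the messages."""

        def __init__(self, in_node_feats, in_edge_feats, out_feats):
            super().__init__()
            self.msg_fc = nn.Linear(in_node_feats + in_edge_feats, out_feats)

        def forward(self, graph, node_feats, edge_feats):
            with graph.local_scope():
                graph.ndata['h'] = node_feats
                graph.edata['w'] = edge_feats

                def message(edges):
                    # concatenate source-node features with the edge attributes
                    m = torch.cat([edges.src['h'], edges.data['w']], dim=-1)
                    return {'m': self.msg_fc(m)}

                def reduce(nodes):
                    # average the incoming messages
                    return {'h_new': nodes.mailbox['m'].mean(dim=1)}

                graph.update_all(message, reduce)
                return graph.ndata['h_new']

Recent DGL releases also appear to include attention layers that accept edge features directly (e.g. EGATConv), which could be a more direct replacement for the plain GAT layers.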