Train GCN with many graphs instead of one batch (for RL)

I’m trying to train a GCN with 1 graph at a time. (I need it for RL, where observations come in a sequence)

When I train it with one batched graph, it works fine:

Epoch 00195 | Time(s) 0.0674 | Loss 0.0003 | ETputs(KTEPS) 519.32
Epoch 00196 | Time(s) 0.0674 | Loss 0.0056 | ETputs(KTEPS) 519.42
Epoch 00197 | Time(s) 0.0674 | Loss 0.0024 | ETputs(KTEPS) 519.47
Epoch 00198 | Time(s) 0.0674 | Loss 0.0181 | ETputs(KTEPS) 519.51
Epoch 00199 | Time(s) 0.0674 | Loss 0.0019 | ETputs(KTEPS) 519.57

When I try to use one graph at a time, it doesn’t:

for g in gs:
    logits = model(g, features)
    loss = loss_fcn(logits[train_mask], labels[train_mask])
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Output:

Epoch 00196 | Time(s) 0.1517 | Loss 9.9473 | ETputs(KTEPS) 4.62
Epoch 00197 | Time(s) 0.1516 | Loss 10.5100 | ETputs(KTEPS) 4.62
Epoch 00198 | Time(s) 0.1515 | Loss 10.6043 | ETputs(KTEPS) 4.62
Epoch 00199 | Time(s) 0.1514 | Loss 10.0472 | ETputs(KTEPS) 4.62

I don’t know what I’m doing wrong. It looks like it’s resetting the weights with each new graph, but I’m not sure. I’m sure this has come up in the past, but I couldn’t find any solutions.

Runnable code:

If you’re training with one graph at a time, as in `for g in gs`, I think you’re supposed to evaluate in the same way as well. Namely, evaluate on each graph one by one, like you trained, not on the whole batched graph directly: Regression with GCN. Run with `--many-graphs` to train it one graph at a time, default is one batched graph, many epoch. · GitHub.

Could you give this a try?

Since the task is regression, I’m just looking at the training loss, which is tracked per graph. Should’ve removed the evaluate line.

Also, if you run it with --many-graphs and print logits and labels, you’ll notice the logits are very close together and have nothing to do with the labels.

Any ideas? This seems like a pretty big deal

Sorry for the delay, as most of us are on vacation. I have no ideas on this, but I will ask someone else to take a look.

If you try replacing

for g in gs:
    features = g.ndata["feat"]
    labels = g.ndata["label"]
    train_mask = g.ndata["train_mask"]
    val_mask = g.ndata["val_mask"]
    test_mask = g.ndata["test_mask"]
    in_feats = features.shape[1]
    # n_classes = data.num_labels
    n_classes = n_classes
    n_edges = g.num_edges()

    logits = model(g, features)
    loss = loss_fcn(logits[train_mask], labels[train_mask])
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with

all_logits = []
all_labels = []
all_train_masks = []
for g in gs:
    features = g.ndata["feat"]
    labels = g.ndata["label"]
    train_mask = g.ndata["train_mask"]
    val_mask = g.ndata["val_mask"]
    test_mask = g.ndata["test_mask"]
    in_feats = features.shape[1]
    # n_classes = data.num_labels
    n_classes = n_classes
    n_edges = g.num_edges()

    logits = model(g, features)
    all_logits.append(logits)
    all_labels.append(labels)
    all_train_masks.append(train_mask)
    # loss = loss_fcn(logits[train_mask], labels[train_mask])
    # losses.append(loss.item())
    # optimizer.zero_grad()
    # loss.backward()
    # optimizer.step()
logits = torch.cat(all_logits)
labels = torch.cat(all_labels)
train_masks = torch.cat(all_train_masks)
loss = loss_fcn(logits[train_masks], labels[train_masks])
losses.append(loss.item())
optimizer.zero_grad()
loss.backward()
optimizer.step()

You will see a similar result to the batched case. That said, I suspect it’s simply because your data is randomly generated in a noisy way, and a larger batch size helps a lot with fast convergence.
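One way to see why the two loops behave differently: the gradient of a single loss over all concatenated nodes is the node-count-weighted average of the per-graph gradients, so it is smoother than any one of them. Below is a minimal sketch of that relationship. It assumes a toy single-weight linear model standing in for the GCN and made-up node data; it is not the code from the gist.

```python
# Toy stand-in for the GCN: a single weight w predicting y = w * x per node.
# Each "graph" is just a list of (x, y) node pairs; all values are made up.

def mse_grad(w, nodes):
    """Gradient of mean((w*x - y)^2) with respect to w over the given nodes."""
    return sum(2 * (w * x - y) * x for x, y in nodes) / len(nodes)

graphs = [
    [(1.0, 0.6), (2.0, 0.0), (3.0, 0.7)],
    [(1.0, 0.0), (4.0, 0.5)],
    [(2.0, 0.0), (2.0, 0.9), (5.0, 0.0), (1.0, 0.8)],
]
w = 0.25

# Gradient of one loss computed over all nodes concatenated together.
all_nodes = [node for g in graphs for node in g]
concat_grad = mse_grad(w, all_nodes)

# Per-graph gradients: the directions followed when stepping after every graph.
per_graph_grads = [mse_grad(w, g) for g in graphs]

# Node-count-weighted average of the per-graph gradients.
total = len(all_nodes)
weighted_avg = sum(len(g) * gr for g, gr in zip(graphs, per_graph_grads)) / total

print("per-graph grads:  ", per_graph_grads)
print("concatenated grad:", concat_grad)
print("weighted average: ", weighted_avg)
```

The per-graph gradients can differ a lot from each other when the graphs are small and noisy, while the concatenated gradient averages that noise out. Whether this fully explains the gap seen here is a guess, not something this sketch proves.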

I was wondering:

Since the labels are just the node’s own features “degree” (int) and “strat” (either 1 or 0) multiplied together, graph convolution would actually confuse this information. However, should it be impossible to represent this function? Because that’s what I’m finding.

logits tensor([0.2616, 0.2639, 0.2456, 0.2272, 0.2615, 0.2333, 0.2315, 0.2304, 0.2486,
        0.2493], grad_fn=<SliceBackward>)
label tensor([0.6000, 0.0000, 0.7000, 0.0000, 0.0000, 0.0000, 0.5000, 0.0000, 0.0000,
        0.0000])
label mean: 0.3467000126838684
batched loss 0.11877474188804626
many gs loss 0.11885928802921626

If so, would Graph Attention be able to? Since it learns edge weights it could learn to only care about self-loops.
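To make the concern about convolution mixing a node’s own features with its neighbors’ concrete, here is a small sketch. It assumes plain mean aggregation over each neighborhood (self-loop included), with no learned weights and none of the normalization a real GCN layer may apply; the graph and feature values are made up.

```python
# Simplified mean aggregation over each node's neighborhood (self-loop included).
# Illustrative stand-in for one graph-convolution step, without learned weights
# or the symmetric normalization a real GCN layer may use.

# Undirected path graph 0 - 1 - 2 - 3, stored as neighbor lists with self-loops.
neighbors = {
    0: [0, 1],
    1: [1, 0, 2],
    2: [2, 1, 3],
    3: [3, 2],
}

# Per-node features [degree, strat]; the values are made up for the example.
features = {
    0: [1.0, 1.0],
    1: [2.0, 0.0],
    2: [2.0, 1.0],
    3: [1.0, 0.0],
}

def mean_aggregate(neighbors, features):
    """Replace each node's features with the mean over its neighborhood."""
    out = {}
    for node, nbrs in neighbors.items():
        dim = len(features[node])
        out[node] = [
            sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
        ]
    return out

aggregated = mean_aggregate(neighbors, features)

for node in sorted(features):
    degree, strat = features[node]
    print(node, "label:", degree * strat, "own:", features[node],
          "aggregated:", aggregated[node])
```

Nodes 1 and 3 have strat == 0 (so label 0), but after aggregation their strat component is no longer 0, which is the blurring described above. A layer that put all of its weight on the self-loop would pass the node’s own features through unchanged; whether attention actually learns that here is the open question.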

UPDATE: I tried GAT with default parameters and am seeing very similar results: it predicts around the label mean, with no sensitivity to when the label is 0 (which should be easy, since label==0 when strat==0). Edge weights don’t seem to settle on the self-loops even after dozens of epochs.

[Image: edges weighted by attention; color is hotter based on node label (black=0, white=9)]

Here’s the gist of my current code:

UPDATE 2:

Since the regression outputs are between 0 and 1, I tried using GAT with L1 loss instead of L2 and it seems to actually learn something!

logits tensor([0.3955, 0.5527, 0.0365, 0.5493, 0.0103, 0.0182, 0.0364, 0.0364, 0.1838,
        0.0342], grad_fn=<SliceBackward>)
label tensor([0.8000, 0.9000, 0.0000, 0.9000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000])
label mean: 0.3655099868774414
batched loss 0.017586950212717056
many gs loss 0.07839850544929504

It’s still not perfect, and I wonder why, since it’s such an easy task.

UPDATE 3

Confusingly, RMSE (root of the mean squared error) should do even better than L1 on small values, but it makes the results behave the same way as MSE.
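A possible explanation, sketched below: by the chain rule, the gradient of RMSE is the gradient of MSE multiplied by a single positive scalar, 1 / (2 * RMSE). The direction and the relative weighting of each node’s error are therefore the same as for MSE, whereas L1 gives every node a gradient of equal magnitude. The predictions and labels are made up, and the gradients are taken with respect to the predictions rather than model weights.

```python
import math

# Toy predictions and labels in [0, 1]; values are made up.
preds = [0.26, 0.26, 0.25, 0.23]
labels = [0.60, 0.00, 0.70, 0.00]
n = len(preds)

mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / n
rmse = math.sqrt(mse)

# Analytic gradients with respect to each prediction.
mse_grad = [2 * (p - y) / n for p, y in zip(preds, labels)]
rmse_grad = [g / (2 * rmse) for g in mse_grad]  # chain rule on sqrt(mse)
l1_grad = [math.copysign(1.0, p - y) / n for p, y in zip(preds, labels)]

# Check the chain-rule result against a finite-difference estimate.
def rmse_of(ps):
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(ps, labels)) / n)

eps = 1e-6
numeric = []
for i in range(n):
    bumped = list(preds)
    bumped[i] += eps
    numeric.append((rmse_of(bumped) - rmse) / eps)

print("MSE grad: ", mse_grad)
print("RMSE grad:", rmse_grad)
print("numeric:  ", numeric)
print("L1 grad:  ", l1_grad)
```

If the optimizer is an adaptive one such as Adam (an assumption about the setup), a roughly uniform rescaling of the gradient changes little, which would be consistent with RMSE behaving like MSE here.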

GCN instead of GAT doesn’t work great.

Have you tried increasing the number of randomly generated graphs?