Torch graphconv fails with expected scalar type Double but found Float

saguinag · July 1, 2021, 10:13pm

I have a toy custom-built dataset:
Number of graphs: 1, Number of features: 9, Number of classes: 2
The graph has normalized values for the edge weights.

Training a simple GCN fails when I call train(graph, model) after setting things up like this:

graph = dgl.add_self_loop(graph)
model = GCN(graph.ndata['feat'].shape[1], 16, dataset.num_classes)

The error is:

    424                 rst = graph.dstdata['h']
    425                 if weight is not None:
--> 426                     rst = th.matmul(rst, weight)
    427 
    428             if self._norm != 'none':

RuntimeError: expected scalar type Double but found Float

Should I increase the size of my graph?

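import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        # Two graph convolution layers: input -> hidden -> class logits
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h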

BarclayII · July 5, 2021, 6:47am

The error indicates a dtype mismatch. PyTorch uses float32 by default, while your graph’s node features appear to be float64. Could you check the dtype of graph.ndata['feat']?
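A minimal sketch of that check, using plain PyTorch rather than DGL. The tensor `feat` is a hypothetical stand-in for graph.ndata['feat'], and `nn.Linear` stands in for the float32 weight inside GraphConv; neither comes from the original code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for graph.ndata['feat']: 4 nodes, 9 features, float64
feat = torch.rand(4, 9, dtype=torch.float64)

# Stand-in for the layer weight; parameters are created as float32 by default
layer = nn.Linear(9, 16)

print(feat.dtype)          # torch.float64
print(layer.weight.dtype)  # torch.float32

# Casting the features to float32 makes both operands of the matmul agree
feat32 = feat.float()
out = layer(feat32)
```

In the dataset class, the same cast would presumably be applied once when the node features are stored, so every later layer sees float32.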

saguinag · July 7, 2021, 12:37pm

I was able to fix the issue. Great suggestion. Now, further down in the train function from this link, the line with F.cross_entropy(logits[train_mask], labels[train_mask]) raises this error: RuntimeError: expected scalar type Long but found Float.

My edge weights start out as an array of decimal or integer values ranging from 1 to 5339. Should I normalize the edge weights?
Inside my Dataset class I have:

edge_features = torch.from_numpy(edges_data['shared_member_count'].to_numpy()).float()
self.graph.edata['weight'] = edge_features

Should I try changing this ‘float’ to ‘Long’?

conda_env/gcns37/lib/python3.7/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
   2466     if size_average is not None or reduce is not None:
   2467         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2468     return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
   2469 
   2470 
conda_env/gcns37/lib/python3.7/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2262                          .format(input.size(0), target.size(0)))
   2263     if dim == 2:
-> 2264         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

saguinag · July 7, 2021, 12:44pm

Got it fixed and working. I found a suggestion here on SO and changed:

self.graph.ndata['label'] = node_labels

to → self.graph.ndata['label'] = node_labels.type(torch.LongTensor)
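For anyone hitting the same error, here is a small sketch of why the cast helps. It assumes class-index labels that were loaded as floats; the tensors are made up for illustration. F.cross_entropy expects class indices as int64 (Long), so the labels need casting, not the edge weights.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for 5 nodes and 2 classes
logits = torch.randn(5, 2)

# Hypothetical labels that ended up as floats, e.g. after loading from a CSV
labels_float = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])

# .long() gives int64 class indices; on CPU this matches .type(torch.LongTensor)
labels = labels_float.long()

loss = F.cross_entropy(logits, labels)
print(labels.dtype)  # torch.int64
```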

Results

Two questions:

Should I ensure that my graph is defined as a bidirectional graph?
Does the line below indicate that something is wrong with the loss?

...
In epoch 245, loss: nan, val acc: 0.733 (best 0.733), test acc: 0.758 (best 0.758)

BarclayII · July 12, 2021, 1:48am

Making the graph bidirectional is usually a better idea.
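As a rough sketch of what making a graph bidirectional means, the snippet below adds a reverse edge for every edge using plain tensors. The edge lists and weights are invented for illustration. In DGL itself, dgl.add_reverse_edges or dgl.to_bidirected are the functions to look at for this.

```python
import torch

# Hypothetical directed edges 0->1, 1->2, 2->3 with one weight per edge
src = torch.tensor([0, 1, 2])
dst = torch.tensor([1, 2, 3])
weight = torch.tensor([0.5, 0.2, 0.9])

# Append the reversed copy of every edge, reusing the same weight
bi_src = torch.cat([src, dst])
bi_dst = torch.cat([dst, src])
bi_weight = torch.cat([weight, weight])
```

With both directions present, message passing in a GCN layer can carry information from either endpoint of an edge to the other.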

Your loss is NaN so your gradient is likely NaN as well. This is usually a bug.
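One way to start narrowing this down is to check the inputs for NaN or Inf before training. The helper below is a hypothetical sketch, not part of DGL or PyTorch, and the tensors are made-up examples.

```python
import torch

def report_non_finite(name, tensor):
    """Print a message and return True if the tensor contains NaN or Inf."""
    n_bad = int((~torch.isfinite(tensor)).sum())
    if n_bad:
        print(f"{name}: {n_bad} non-finite value(s)")
    return n_bad > 0

# Hypothetical node features with one NaN, and clean edge weights
feat = torch.tensor([[1.0, 2.0], [float("nan"), 4.0]])
weights = torch.tensor([1.0, 5339.0])

has_bad_feat = report_non_finite("feat", feat)
has_bad_weight = report_non_finite("weight", weights)
```

If the inputs are clean, torch.autograd.set_detect_anomaly(True) can help locate the operation that first produces a NaN during the backward pass.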

saguinag · August 4, 2021, 2:52pm

Agreed! Got issues fixed!