Size mismatch when running GAT with manual entry

Hello all, good day. I started exploring DGL last year and am picking it up again after leaving it for 7 months. I am now working on it again and struggling with GAT. With very little knowledge/experience in programming I am aiming to learn on the go, but it is a really steep curve. For the last two weeks I have tried my best to work on the simplest model I can imagine, but so far with no result.

I tried to do it manually with only 3 nodes and 3 edges (including the train_mask tensor etc.), replacing only the data from citation_graph in the GAT example.

I define the node data as a 5x5 tensor per node, and the edge data for message passing as a 1x3 tensor per edge. No matter what I run, the sizes mismatch:

(e.g : RuntimeError: size mismatch, m1: [30 x 8], m2: [16 x 1] at C:\cb\pytorch_1000000000000\work\aten\src\TH/generic/THTensorMath.cpp:41)

I figure that might be caused by an improper definition of the train_mask tensor etc., even though I don't really believe it.

I tried to figure out what generate_mask_tensor does in the utils, but I just got lost in the many functions. I do get the gist from reading the explanation of the four fundamental formulas, but the coding is bothering me a lot. I admit my shortcomings, but I really want to get this done for now.

Here is the code used for defining the features etc.

from dgl import DGLGraph
#from dgl.data import citation_graph as citegrh
import networkx as nx
import pandas as pd
import numpy as np
import torch
import dgl
import torch as th
import time

#Let's begin with only 3 nodes and 3 edges as the simplest example possible.
#The definition stated in the tutorial did not work for me, so let me use the GCN example
#def load_cora_data():
    #data = citegrh.load_cora()
    #features = torch.FloatTensor(data.features)
    #labels = torch.LongTensor(data.labels)
    #mask = torch.BoolTensor(data.train_mask) --> train_mask = g.ndata['train_mask']; this is for the built-in dataset, but how do I make one myself?
    #g = DGLGraph(data.graph)
    #return g, features, labels, mask
    
#How to address and determine non uniformity of layer?
u, v = th.tensor([0, 1, 0]), th.tensor([1, 2, 2])
g = dgl.graph((u, v))
#only 2 m depth interval
n_weights_1 = th.tensor([[x, y, z, aa, ab],
                         [.., .., .., .., ..],
                         [.., .., .., .., ..],
                         [.., .., .., .., ..],
                         [.., .., .., .., ..]])
n_weights_2 = ...
n_weights_3 = ...
n_weights = th.stack([n_weights_1, n_weights_2, n_weights_3])
g.ndata['z'][th.tensor([0,1,2])] = n_weights

#(time_lapse(hr), dx_1, dx_2)

e_weights_1 = th.tensor([0.4, 3, 3])
e_weights_2 = th.tensor([0.08, -3, 0]) 
e_weights_3 = th.tensor([0.75, -3, -3])
e_weights = th.stack([e_weights_1, e_weights_2, e_weights_3])

features = n_weights
#mask = g.ndata['z'][th.tensor([1, 0, 0])] #Masking matrix (at least the dimension) need to be revised, in the example the dim is 'N' nodes (rows) x 1 column
mask = th.tensor([1, 1, 0])
#labels = g.ndata['z'][th.tensor([0, 1, 1])]
labels = th.tensor([0, 1, 1])
#From https://docs.dgl.ai/en/0.5.x/_modules/dgl/data/citation_graph.html
#    @property
#    def labels(self):
#        deprecate_property('dataset.label', 'g.ndata[\'label\']')
#        return F.asnumpy(self._g.ndata['label'])


g.edata['e'][th.tensor([0,1,2])] = e_weights

# create the model (1 attention head, hidden size 3; the original tutorial used 2 heads with hidden size 8)
net = GAT(g,
          in_dim=features.size()[1],
          hidden_dim=3,
          out_dim=7,
          num_heads=1)

# create optimizer
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

# main loop
dur = []
for epoch in range(50):
    if epoch >= 3:
        t0 = time.time()

    logits = net(features) #how should I modify 'features'? Why is only 'features' needed?
    logp = F.log_softmax(logits, 1)
    loss = F.nll_loss(logp[mask], labels[mask]) #how should I modify the 'message'?

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch >= 3:
        dur.append(time.time() - t0)

    print("Epoch {:05d} | Loss {:.4f} | Time(s) {:.4f}".format(
        epoch, loss.item(), np.mean(dur)))

Is the wrong train_mask splitting (due to the manual approach) the problem, or is there another problem I should know about? I can't find how to split the train_mask and the others easily either, and looking at this post it doesn't get any clearer.
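For reference, my understanding of what a manual mask should look like for the 3-node graph above (only a sketch; the values are illustrative): it is simply a boolean tensor with one entry per node, used to select which rows enter the loss.

import torch as th

# True = node is used for the training loss, False = held out
train_mask = th.tensor([True, True, False])
# F.nll_loss(logp[train_mask], labels[train_mask]) then only sees nodes 0 and 1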

Best Regards

  1. You do not need to do g.ndata['z'][th.tensor([0,1,2])]. g.ndata['z'] is enough, and the same goes for g.edata (see the short sketch after this list).
  2. Can you post the GAT model you defined?
  3. Why did you assign the edata? It was not used anywhere.
  4. I don’t think the issue is related to how you defined the mask. Most likely it is how you defined the model.
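For illustration, point 1 amounts to assigning the stacked tensors directly (a short sketch reusing the variable names from your code):

# assign features for all nodes / edges at once; no index tensor is needed
g.ndata['z'] = n_weights   # shape (3, 5, 5): one entry per node
g.edata['e'] = e_weights   # shape (3, 3): one entry per edge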

Thank you for the response @mufeili

The rest of the classes (GATLayer, MultiHeadGATLayer and GAT) are exactly the same as in the tutorial.
Having previously looked at the form of the tensors in the citation_graph features, labels and mask, I see that they are first-order tensors, while my features are second-order, and I don't really know how to transform them appropriately (if that is even needed).
I apply edge data because I think those are the necessary features for message passing, defining the relative spatio-temporal relation of each node. I don't think I need a different UDF because the equations in GAT are already clear.

The code is below. I just noticed that the in_features in m2 is always two times the out_features in m1, no matter how I change the attributes in nn.Linear and elsewhere.

class GATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim):
        super(GATLayer, self).__init__()
        self.g = g
        # equation (1)
        self.fc = nn.Linear(in_dim, out_dim, bias=False)
        # equation (2)
        self.attn_fc = nn.Linear(2 * out_dim, 1, bias=False)  # seems it needs no modification; the '2' probably exists due to the concatenation
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_normal_(self.fc.weight, gain=gain)
        nn.init.xavier_normal_(self.attn_fc.weight, gain=gain)

    def edge_attention(self, edges):       #no need to be modified
        # edge UDF for equation (2)
        z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1) #dim changed from 1 to 2
        a = self.attn_fc(z2)
        return {'e': F.leaky_relu(a)}
    
    #Why do the message & reduce funcs have this form?

    def message_func(self, edges):
        # message UDF for equation (3) & (4)
        return {'z': edges.src['z'], 'e': edges.data['e']} #Dictionary format key:value #The original; maybe I need to modify it with something like

    def reduce_func(self, nodes):                              #no need to be modified
        # reduce UDF for equation (3) & (4)
        # equation (3)
        alpha = F.softmax(nodes.mailbox['e'], dim=1)
        # equation (4)
        h = torch.sum(alpha * nodes.mailbox['z'], dim=1)
        return {'h': h}

    def forward(self, h):
        # equation (1)
        z = self.fc(h)
        self.g.ndata['z'] = z
        # equation (2)
        self.g.apply_edges(self.edge_attention)
        # equation (3) & (4)
        self.g.update_all(self.message_func, self.reduce_func)
        return self.g.ndata.pop('h')

class MultiHeadGATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim, num_heads, merge='cat'):
        super(MultiHeadGATLayer, self).__init__()
        self.heads = nn.ModuleList()
        for i in range(num_heads):
            self.heads.append(GATLayer(g, in_dim, out_dim))
        self.merge = merge

    def forward(self, h):
        head_outs = [attn_head(h) for attn_head in self.heads]
        if self.merge == 'cat':
            # concat on the output feature dimension (dim=2) #Change dim from 1 to 2
            return torch.cat(head_outs, dim=1)
        else:
            # merge using average
            return torch.mean(torch.stack(head_outs))

class GAT(nn.Module):
    def __init__(self, g, in_dim, hidden_dim, out_dim, num_heads):
        super(GAT, self).__init__()
        self.layer1 = MultiHeadGATLayer(g, in_dim, hidden_dim, num_heads)
        # Be aware that the input dimension is hidden_dim*num_heads since
        # multiple head outputs are concatenated together. Also, only
        # one attention head in the output layer.
        self.layer2 = MultiHeadGATLayer(g, hidden_dim * num_heads, out_dim, 1)

    def forward(self, h):
        h = self.layer1(h)
        h = F.elu(h)
        h = self.layer2(h)
        return h

Best Regards

Also, I don't actually need any labels, but I haven't excluded them from the code yet.

Best Regards

  1. The GAT implementation you used is quite outdated. It's now recommended to use the implementation in the GAT example here instead (a rough sketch of that API follows after this list). I'm sorry for the inconvenience; we should probably deprecate or update that. @minjie .
  2. As for using second order tensors or third order tensors, do you know the semantic meanings of the last two dimensions in the third order tensors?
  3. As for spatio-temporal relevance, what are the semantic meanings of the edge features? Copy @BarclayII
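For point 1, a rough sketch of what a two-layer model on top of dgl.nn's GATConv could look like (only an illustration of the newer API, not the full official example; the dimension names are placeholders):

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import GATConv

class GAT(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_heads):
        super(GAT, self).__init__()
        self.layer1 = GATConv(in_dim, hidden_dim, num_heads)
        # the second layer consumes the concatenated heads and uses a single head
        self.layer2 = GATConv(hidden_dim * num_heads, out_dim, 1)

    def forward(self, g, h):
        h = self.layer1(g, h)        # (N, num_heads, hidden_dim)
        h = F.elu(h.flatten(1))      # concatenate heads -> (N, num_heads * hidden_dim)
        h = self.layer2(g, h)        # (N, 1, out_dim)
        return h.squeeze(1)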

OK, I am trying to learn from those source examples, thank you.

I am sorry, what do you mean by semantic meanings? I don't think I have seen that term in the DGL tutorial.
2. If you mean whether I could explain them in a sentence: the graph evolves in its number of nodes over time, but each node itself will actually stay fixed at certain coordinates. The features of each node represent data gained along another dimension, like its heat over time.
3. Assuming as in (2), the edge features only contain the relative spatial-temporal difference data for each pair of nodes.

I hope that suffices.
Best Regards

what do you mean by semantic meanings?

By semantic meanings, I was just interested in what kind of information was encoded.

For example, for second-order tensors of shape (B, D), B means the batch size and D means the feature size.

Rather than using third order input features, is it possible to augment the graph structure and then use second order input features?

  1. assuming like in (2), the edge features only contain the relative spatial-temporal difference data for each nodes

Then how do you plan to use them in your model?
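As one concrete sketch of the "augment the graph structure" idea above (purely illustrative: it assumes the 5x5 per-node feature is a length-5 time sequence of 5-dimensional vectors, and a model net that takes the graph plus second-order features):

# n_weights: (num_nodes, L, D) = (3, 5, 5)
for t in range(n_weights.shape[1]):
    feat_t = n_weights[:, t, :]      # second-order slice: (num_nodes, D) = (3, 5)
    logits_t = net(g, feat_t)        # run the model once per timestep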

I am sorry for the late reply.

Excuse me, I am not very sure that I got it right.
I think maybe you assumed I was planning to use 3D tensors because I would be doing some batching from 2D.
But I had no intention of that until you mentioned it; I only heard of batching after you brought it up.
I only thought about it because I saw the dataloader module description (in which, from what I can see, GraphDataLoader is now deprecated in the GraphCollator class).

I just want to run the GAT on each 'step' separately, and I plan to look at the result by watching the evolution of the attention 'manually'. Forgive my naiveness; I don't intend to do batching unless that is necessary for the evolution of the graphs.

Regarding what I plan to use the edge features for: I think they are necessary as an input for message passing, nothing more than that. Is that it? I really got nervous about this question though :sweat_smile:

So which module could I use? I saw the link you shared previously; based on what I need, it should be enough to use gat.py, train.py and the early-stopping utility. I just have to modify the main definition in the train module, right?

I am just dying to see any result at this point, however weak it is :joy:

Best Regards.

But maybe for a rough picture: the batch takes around 20 nodes, with the node features being 2D tensors (let's assume just a 5 x 5 matrix for now, or generally 5 x n). The edge features are only a simple 1D tensor (a 1 x 3 matrix).

Before answering your question, let me rephrase your scenario so I don't miss anything. You have a single graph with N nodes. Each node has an L-timestep sequence of D-dimensional features (hence the 2D tensor). The edges also have a K-dimensional feature. You wish to see how the attention values vary at each timestep. Did I get your intention right?

@BarclayII yes, that is pretty much what I picture.

  • Each node is in its own timestep sequence; t-1 for node-1, t-2 for node-2, etc., with L equal to N, but I would rather work with t-1 for graph-1, t-2 for graph-2.
  • Each node feature is a 2D tensor because it carries information recorded over a certain time. A better phrasing for me is: "each node has D-dimensional features, and the nodes exist according to their own L-timestep sequence". (I have a problem with dangling participles, haha.)
  • Edges have K-dimensional features, which contain information on the relative spatio-temporality between nodes; in my case it is only a 3-dimensional vector.
  • I want to see how the attention values evolve, expecting that the entropy of the attention distribution of the graph at timestep 'L+1' is generally higher (for each node) than at timestep 'L', thereby paving a way to prove the connectivity in my hypothesis, albeit rather weakly (a rough sketch of measuring this follows after this list).
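For that last point, a rough sketch of how I imagine measuring it (assuming a DGL version whose GATConv forward accepts get_attention=True, so that something like h, attn = gat_layer(g, feat, get_attention=True) gives attn of shape (num_edges, num_heads, 1); the helper below is hypothetical):

def attention_entropy_per_node(g, attn):
    scores = attn.squeeze(-1).mean(dim=1)          # average over heads -> (num_edges,)
    entropies = {}
    for v in range(g.num_nodes()):
        eids = g.in_edges(v, form='eid')           # edges arriving at node v
        if len(eids) == 0:
            continue
        p = scores[eids]
        p = p / p.sum()                            # normalise to a distribution over in-edges
        entropies[v] = float(-(p * p.clamp_min(1e-12).log()).sum())
    return entropies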

I hope I can just find a way, the right module and libraries, as simple as possible :grinning:
Above all, I hope my intention is clear.

Best Regards

What I mean by "recorded along a certain time" is that in the real world the data was recorded by a machine over time; it has nothing to do with how it is processed in GAT, and the features won't change at all during the computation. Sorry.

I just tried to install argparse as this link suggests, but all three options fail; two of them request Python 2.6. Is there any suggestion on how to install it?

Best Regards

Sorry, I don't know whether argparse even needs to be installed, but I cannot find it for win-64 anywhere.
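(For reference, argparse ships with the Python standard library on Python 2.7+ and 3.2+, so it normally needs no separate install; a quick check from any interpreter:)

import argparse
print(argparse.__file__)   # prints the path of the bundled module if it is available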
