Node feature update?

As far as I have observed, the examples in the DGL documentation are all about taking a graph and updating node features within that graph through iteration.

However, what I want to do is as below.

I have a look-up table (for NLP word embeddings) and use the embeddings corresponding to each node as its features.
However, as there are a lot of sentences, I am not building ‘one’ graph, but a lot of ‘graphs’.

Is it okay to assign node features to each graph every epoch, after I get the embeddings from the look-up table?

I need to update the embeddings in the look-up table, but I have to re-initialize the node features at every step. I am using GAT, and the accuracy is so low that I doubt whether using the DGL library is even justified.
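For concreteness, this is roughly the pattern I have in mind each epoch (the sizes, the token ids, and the chain-shaped edges are just placeholders for my real data and graph construction):

import dgl
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 16                    # toy sizes, not my real ones
embedding = nn.Embedding(vocab_size, embed_dim)    # the look-up table being trained

# placeholder tokenized sentences (tensors of token ids)
sentences = [torch.tensor([1, 5, 7]), torch.tensor([2, 3, 9, 4])]

graphs = []
for token_ids in sentences:
    n = len(token_ids)
    # placeholder edges: a simple chain over the tokens (my real edges differ)
    g = dgl.graph((torch.arange(n - 1), torch.arange(1, n)), num_nodes=n)
    # re-assign node features from the look-up table; this is redone every epoch
    g.ndata['x'] = embedding(token_ids)
    graphs.append(g)

batched = dgl.batch(graphs)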

I need help!!

Your proposal sounds fine to me and DGL should support that. DGL is a library for graph neural networks (GNNs), and the question here may be more about whether GNNs are useful for your task. Could you provide more details on the experiment, the performance numbers, and the baselines you are comparing against?

I am comparing it with a vanilla Transformer.
For a fair comparison, I connected all the nodes to each other in the GAT graph, roughly as below.
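This is a sketch of that construction for one sentence of n tokens (only the number is a placeholder):

import dgl
import torch

n = 5                                        # number of tokens in one sentence
src = torch.arange(n).repeat_interleave(n)   # 0,0,...,0,1,1,...,1,...
dst = torch.arange(n).repeat(n)              # 0,1,...,n-1,0,1,...,n-1,...
g = dgl.graph((src, dst), num_nodes=n)       # every node connected to every node, self-loops included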

For token classification, the vanilla Transformer achieves about 93% accuracy,
but GAT stays at around 19%.

Since I am updating the features of every instance each epoch
(in my case, ‘updating’ means doing it in place),
I suspect the problem might be there.

Of course, the input is ordered, and the order is the same for both the Transformer and GAT.

GAT is still different from vanilla Transformer due to:

  • a different attention computation mechanism
  • a lack of positional encoding (see the sketch after this list)
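Regarding the second point, one thing you could try is adding standard sinusoidal positional encodings to the node features before the GAT layers. A minimal sketch, assuming feat is the (num_tokens, dim) feature matrix of a single sentence graph and dim is even:

import math
import torch

def add_positional_encoding(feat):
    # feat: (num_tokens, dim) node features of one sentence, in token order
    n, dim = feat.shape
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)                  # (n, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                            # (dim/2,)
    pe = torch.zeros(n, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return feat + pe

For a batched graph you would need to compute the positions per sentence, i.e. apply this before dgl.batch.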

Meanwhile, as you mentioned, the feature update may not be done properly. Could you post a minimal runnable example for us to take a deeper look?

This is part of the GAT code.

def forward(self, g):
    # g: a batched DGLGraph whose node features are stored in ndata['x']
    feat = g.ndata.pop('x')

    # all but the last GAT layer: concatenate the attention heads
    for l in range(self.num_layers):
        feat = self.gat_layers[l](g, feat).flatten(1)
    # last GAT layer: average over the heads to get per-node logits
    feat = self.gat_layers[-1](g, feat).mean(1)

    g.ndata['logit'] = feat

    # split the batched graph back into per-sentence graphs
    net_output = dgl.unbatch(g)

    return net_output

and the input of the forward function is built as below:

# The preceding code builds each graph object with
# dgl.graph(...) and assigns the node features to ndata['x'].

graphs = dgl.batch(graphs)

return graphs

graphs passed to dgl.batch is a list of graph objects
that I build every epoch with the updated word embeddings as node features.

Thank you, mufeili.

Are you also updating the word embeddings? If so, can you check whether feat has a gradient after loss.backward()?
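For example, one simple check on the look-up table itself (this assumes the table is an nn.Embedding stored as model.embedding; adjust the names to your code):

loss.backward()

grad = model.embedding.weight.grad
print(grad)                       # should not be None
if grad is not None:
    print(grad.abs().sum())       # should be > 0 if gradients are flowing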

Yes, I am, and I also verified that the gradient was actually flowing. As far as I know, GAT does not use Q/K/V attention, but the attention mechanism GAT does use is still relatively powerful. Is it possible for it to perform so much worse than a Transformer? I suspect my code, but I cannot find what is wrong.

Do you have any idea what might be causing the degradation?

Thank you, mufeili.

You can use randomly initialized word embeddings without updating them and see if the performance gap is still huge.

If so, then the difference in the attention mechanism and the lack of positional encoding likely play a critical role.
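For example (vocab_size and embed_dim stand in for your actual sizes):

import torch.nn as nn

vocab_size, embed_dim = 30000, 300                 # placeholders
embedding = nn.Embedding(vocab_size, embed_dim)    # randomly initialized
embedding.weight.requires_grad_(False)             # frozen: not updated during training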

Oh, that is good advice. Thank you, mufeili. I’m going to try it!
