DGL with TagPPI feature/node error

igwill · April 24, 2023, 5:06pm

Hi, I am getting the ‘Expect number of features to match number of nodes (len(u))’ error mentioned in other questions. I am getting this specifically in the context of training a TagPPI model (which works to predict protein interactions with protein npz embeddings produced by SeqVec and Alphafold).

The TagPPI Github is pretty quiet - so I am asking here in case anyone has fixed this specific problem?

I have tried environments as fresh as py-3.10/pytorch-2.0/cuda-11.8/dgl-1.0.2 and as old as py-3.6/pytorch-1.10/cuda-10.2/dgl-0.9.1. Each time it is the same problem, which seems to track back to TagPPI’s graph_cmap_loader.py script. The error can pop up while processing either protein G1 or G2; sometimes features > nodes and other times nodes > features. The specific feature/node mismatch values change on every attempt, but I believe that is due to the dropout/shuffling used during training.

The error:

Running EPOCH 1
Traceback (most recent call last):
  File "/lustre/fs0/home/iwill/TAGPPI/TAGPPI-main/my_main.py", line 25, in <module>
    main()
  File "/lustre/fs0/home/iwill/TAGPPI/TAGPPI-main/my_main.py", line 22, in main
    train(trainArgs)
  File "/lustre/fs0/home/iwill/TAGPPI/TAGPPI-main/my_train_and_validation.py", line 54, in train
    for batch_idx,(G1,dmap1,G2,dmap2,y) in enumerate(train_loader):
  File "/home/iwill/my-envs/tagppi_6/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/iwill/my-envs/tagppi_6/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/iwill/my-envs/tagppi_6/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/iwill/my-envs/tagppi_6/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/lustre/fs0/home/iwill/TAGPPI/TAGPPI-main/graph_cmap_loader.py", line 74, in __getitem__
    G2,embed2 = self.loader(cmaproot+p2+'.npz',p2)
  File "/lustre/fs0/home/iwill/TAGPPI/TAGPPI-main/graph_cmap_loader.py", line 42, in default_loader
    G.ndata['feat'] = g_embed
  File "/home/iwill/my-envs/tagppi_6/lib/python3.10/site-packages/dgl/view.py", line 99, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key: val})
  File "/home/iwill/my-envs/tagppi_6/lib/python3.10/site-packages/dgl/heterograph.py", line 4032, in _set_n_repr
    raise DGLError('Expect number of features to match number of nodes (len(u)).'
dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 295 and 292 instead.

I am using the default data provided by the authors, and only made one modification to this script to get it to work on GPU (added generator=torch.Generator(device='cuda') to the dataloader).

Thanks for the help! I am showing the full graph_cmap_loader.py code below:

import torch
import dgl
import scipy.sparse as spp
from seq2tensor import s2t
import os
import numpy as np
import re
import sys
from torch.utils.data import DataLoader,Dataset
import sys
from my_main import *

if len(sys.argv) > 1:
    datasetname, rst_file, pkl_path, batchsize = sys.argv[1:]
    batchsize = int(batchsize)
else:
    datasetname = 'yeast'
    rst_file = './results/yeast_pipr.tsv'
    pkl_path = './model_pkl/GAT'
    batchsize = 64

device = torch.device('cuda')

def collate(samples):

    graphs1,dmaps1,graphs2,dmaps2,labels = map(list, zip(*samples))
    return graphs1,dmaps1,graphs2,dmaps2,torch.tensor(labels)

cmaproot = './data/'+datasetname+'/real_cmap/'
embed_data = np.load("./data/"+datasetname+"/dictionary/protein_embeddings.npz")

def default_loader(cpath,pid):

    cmap_data = np.load(cpath)
    nodenum = len(str(cmap_data['seq']))
    cmap = cmap_data['contact']
    g_embed = torch.tensor(embed_data[pid][:nodenum]).float().to(device)

    adj = spp.coo_matrix(cmap)
    G = dgl.DGLGraph(adj).to(device)
    G = G.to(torch.device('cuda'))
    G.ndata['feat'] = g_embed

    if nodenum > 1000:
        textembed = embed_data[pid][:1000]
    elif nodenum < 1000:
        textembed = np.concatenate((embed_data[pid], np.zeros((1000 - nodenum, 1024))))

    textembed = torch.tensor(textembed).float().to(device)
    return G,textembed


class MyDataset(Dataset):

    def __init__(self,type,transform=None,target_transform=None, loader=default_loader):

        super(MyDataset,self).__init__()
        pns=[]
        with open('./data/'+datasetname+'/actions/'+type+'_cmap.actions.tsv', 'r') as fh:
            for line in fh:
                line = line.strip('\n')
                line = line.rstrip('\n')
                words = re.split('  |\t',line)
                pns.append((words[0],words[1],int(words[2])))

        self.pns = pns
        self.transform = transform
        self.target_transform = target_transform
        self.loader = loader

    def __getitem__(self, index):
        p1,p2, label = self.pns[index]
        G1,embed1 = self.loader(cmaproot+p1+'.npz',p1)
        G2,embed2 = self.loader(cmaproot+p2+'.npz',p2)
        return G1,embed1,G2,embed2,label


    def __len__(self):
        return len(self.pns)

def pad_sequences(vectorized_seqs, seq_lengths, contactMaps, contact_sizes, properties):
    seq_tensor = torch.zeros((len(vectorized_seqs), seq_lengths.max())).long()
    for idx, (seq, seq_len) in enumerate(zip(vectorized_seqs, seq_lengths)):
        seq_tensor[idx, :seq_len] = torch.LongTensor(seq)

    contactMaps_tensor = torch.zeros((len(contactMaps), contact_sizes.max(), contact_sizes.max())).float()
    # contactMaps_tensor = torch.ones((len(contactMaps), contact_sizes.max(), contact_sizes.max())).float()*(-1.0)

    for idx, (con, con_size) in enumerate(zip(contactMaps, contact_sizes)):
        contactMaps_tensor[idx, :con_size, :con_size] = torch.FloatTensor(con)

    seq_lengths, perm_idx = seq_lengths.sort(0, descending=True)
    seq_tensor = seq_tensor[perm_idx]
    contactMaps_tensor = contactMaps_tensor[perm_idx]
    contact_sizes = contact_sizes[perm_idx]

    target = properties.double()
    if len(properties):
        target = target[perm_idx]

    contactMaps_tensor = contactMaps_tensor.unsqueeze(1)  # [batchsize,1,max_length,max_length]
    return seq_tensor, seq_lengths, contactMaps_tensor, contact_sizes, target

def pad_dmap(dmaplist):

    pad_dmap_tensors = torch.zeros((len(dmaplist), 1000, 1024)).float()
    for idx, d in enumerate(dmaplist):
        d = d.float().cpu()
        pad_dmap_tensors[idx] = torch.FloatTensor(d)
    pad_dmap_tensors = pad_dmap_tensors.unsqueeze(1).cuda()
    return pad_dmap_tensors

train_dataset = MyDataset(type = 'train')
train_loader = DataLoader(dataset = train_dataset, batch_size = batchsize, shuffle=True,drop_last = True,collate_fn=collate, generator=torch.Generator(device='cuda')) # added generator=torch.Generator(device='cuda')
test_dataset = MyDataset(type = 'test')
test_loader = DataLoader(dataset = test_dataset, batch_size = batchsize , shuffle=True,drop_last = True,collate_fn=collate)

czkkkkkk · April 25, 2023, 3:21am

Hi @igwill. It seems that the size of g_embed does not match the number of nodes of G in Line 42. Can you check the g_embed size before feeding it to G?
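
One way to run this check is a small guard placed just before the feature assignment, so that the failing protein ID is reported together with both counts. This is only a minimal sketch: `check_node_features` is a hypothetical helper, not part of TagPPI or DGL, and the commented call site assumes the variable names `pid`, `G`, and `g_embed` from the `default_loader` posted above.

```python
def check_node_features(pid, num_nodes, num_feature_rows):
    """Raise a descriptive error when feature rows and graph nodes disagree.

    Hypothetical helper for debugging; it only compares two integers.
    """
    if num_feature_rows != num_nodes:
        raise ValueError(
            f"{pid}: {num_feature_rows} feature rows but {num_nodes} graph nodes"
        )


# Assumed call site inside default_loader, just before G.ndata['feat'] = g_embed:
#     check_node_features(pid, G.num_nodes(), g_embed.shape[0])
```

Raising with the protein ID in the message makes it possible to open that protein's input files directly instead of reading the mismatch off the DGL traceback.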

igwill · April 25, 2023, 6:34pm

Thank you for the quick reply. Indeed that size mismatch is what’s going on - but it is not for all proteins. Seven proteins match just fine before getting to this problem child (sizes 295 vs 292).
Here are the last outputs from this step in the default_loader function:

nodenum is 854
g_embed length is 854
g_embed is tensor([[-0.3362, -0.0877, -0.0850,  ..., -0.2904,  0.2652,  0.2911],
        [ 0.0257,  0.4789,  0.0761,  ..., -0.0162,  0.1996,  0.2956],
        [-0.0155, -0.1703,  0.0737,  ...,  0.0434,  0.1103, -0.1769],
        ...,
        [-0.2248, -0.2561,  0.0084,  ..., -0.2275,  0.1083, -0.1278],
        [-0.2543, -0.0071,  0.0740,  ..., -0.0933,  0.1440, -0.0912],
        [-0.2027,  0.1386, -0.1564,  ...,  0.1625, -0.1969, -0.1225]])
G isGraph(num_nodes=854, num_edges=23954,
      ndata_schemes={}
      edata_schemes={})

nodenum is 59
g_embed length is 59
g_embed is tensor([[-0.3556, -0.1379, -0.0917,  ..., -0.2526, -0.0242,  0.3453],
        [-0.2281,  0.2757, -0.1602,  ..., -0.2342, -0.0326,  0.1073],
        [-0.0355, -0.1601,  0.0321,  ..., -0.2930,  0.1983, -0.1177],
        ...,
        [-0.0265,  0.0224, -0.3109,  ..., -0.2147, -0.0951, -0.0090],
        [ 0.2207,  0.0904,  0.0844,  ...,  0.0323, -0.0259, -0.1708],
        [-0.1693,  0.3720, -0.1931,  ...,  0.0684, -0.2738, -0.1622]])
G isGraph(num_nodes=59, num_edges=891,
      ndata_schemes={}
      edata_schemes={})

nodenum is 295
g_embed length is 295
g_embed is tensor([[-0.3399, -0.0940, -0.0824,  ..., -0.0480,  0.1147,  0.1128],
        [ 0.2022,  0.2139,  0.1005,  ...,  0.2424,  0.0260, -0.1995],
        [ 0.1819, -0.0007, -0.1988,  ..., -0.0266,  0.0773, -0.1405],
        ...,
        [-0.3939, -0.2824, -0.1768,  ...,  0.0194, -0.0387, -0.0707],
        [-0.4689, -0.4695, -0.0849,  ..., -0.2983,  0.2909, -0.1734],
        [-0.5129, -0.2805, -0.4166,  ...,  0.1559, -0.1940, -0.1230]])
G isGraph(num_nodes=292, num_edges=9518,
      ndata_schemes={}
      edata_schemes={})
Traceback (most recent call last):
..............
  File "/lustre/fs0/home/iwill/TAGPPI/TAGPPI-main/graph_cmap_loader.py", line 78, in __getitem__
    G2,embed2 = self.loader(cmaproot+p2+'.npz',p2)
  File "/lustre/fs0/home/iwill/TAGPPI/TAGPPI-main/graph_cmap_loader.py", line 46, in default_loader
    G.ndata['feat'] = g_embed
  File "/home/iwill/my-envs/tagppi_6/lib/python3.10/site-packages/dgl/view.py", line 99, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key: val})
  File "/home/iwill/my-envs/tagppi_6/lib/python3.10/site-packages/dgl/heterograph.py", line 4032, in _set_n_repr
    raise DGLError('Expect number of features to match number of nodes (len(u)).'
dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 295 and 292 instead.

As I understand it, “nodenum” should come directly from cmap (AlphaFold) data. And g_embed integrates the SeqVec embeddings. G is the graph.
I ran SeqVec myself and downloaded the author-provided AlphaFold cmap files. At this point I’m not sure if my best bet to solve this has to do with the script itself or the embedding inputs. For some proteins to work and others not seems odd to me, but maybe more indicative of something in the input files?

Editing the training data to remove all protein pairs that involve the offending protein allows the script to continue, but it eventually hits a new protein with a similar problem (the exact difference between nodes and features varies).
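
Rather than removing offending proteins one at a time, the input files could be scanned once up front. The sketch below is a diagnostic under stated assumptions, not a fix: `summarize_sizes` and `scan_dataset` are hypothetical helpers, and it assumes each contact-map `.npz` holds a `seq` string and a 2-D `contact` array, and that the embeddings `.npz` is keyed by protein ID, as `default_loader` implies. It compares the sequence length (which sets `nodenum`), the contact-map size (which presumably sets the graph's node count), and the number of embedding rows.

```python
import os


def summarize_sizes(seq_len, contact_shape, embed_rows):
    """Return a list of problems for one protein (empty if sizes are consistent)."""
    problems = []
    rows, cols = contact_shape  # assumes a 2-D contact map
    if rows != cols:
        problems.append(f"contact map is not square: {rows}x{cols}")
    if seq_len != rows:
        problems.append(f"sequence length {seq_len} != contact map size {rows}")
    if embed_rows < rows:
        problems.append(f"embedding has {embed_rows} rows, fewer than {rows} nodes")
    return problems


def scan_dataset(cmap_dir, embed_path):
    """Check every <pid>.npz in cmap_dir against the embeddings file."""
    import numpy as np  # imported lazily so summarize_sizes has no dependencies

    embed_data = np.load(embed_path)
    report = {}
    for name in sorted(os.listdir(cmap_dir)):
        if not name.endswith(".npz"):
            continue
        pid = name[:-4]
        cmap_data = np.load(os.path.join(cmap_dir, name))
        seq_len = len(str(cmap_data["seq"]))  # same expression as default_loader
        contact_shape = cmap_data["contact"].shape
        # A protein missing from the embeddings file is reported as 0 rows.
        embed_rows = embed_data[pid].shape[0] if pid in embed_data.files else 0
        problems = summarize_sizes(seq_len, contact_shape, embed_rows)
        if problems:
            report[pid] = problems
    return report
```

With the default paths from the script, `scan_dataset('./data/yeast/real_cmap/', './data/yeast/dictionary/protein_embeddings.npz')` would return a dict mapping each inconsistent protein ID to its list of problems. For the failing protein above (nodenum 295, graph with 292 nodes), a report of "sequence length 295 != contact map size 292" would suggest that the `seq` and `contact` entries of the same cmap file disagree, which points at the input files rather than at the DGL version.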

Thanks

czkkkkkk · April 27, 2023, 4:11am

Hi @igwill,

I feel I cannot offer you more detailed suggestions. You may need to ask the authors of this repository for help.

igwill · April 27, 2023, 2:06pm

That’s totally fair, thank you for the responses.
