dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 42 and 145 instead

Hello, I’d like some help with this error:

dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 42 and 145 instead.

I’ve already read the previous posts on similar errors, and they did not resolve my question.

I am currently working on a graph classification task. I have a list of 150 elements (IDs 0-149). My dataset consists of graphs whose nodes are drawn from that 150-element list. Let’s say elements 0, 9, and 143 are present as nodes in graph 1, and graph 2 consists of nodes 1, 7, 9, and 148. Every node has a feature vector of size 1024. It is very common for the same node to have different feature vectors in different graphs, so node ID 9 in graph 1 has a feature vector that is different from node ID 9’s feature vector in graph 2.

Now that the context is clear, the error I am facing is a mismatch between the number of nodes and the number of features. For some reason, DGL says that the number of nodes in graph 1 is 143 instead of 3. I assume it’s because the highest node ID in graph 1 is 143, so it thinks that there are 143 nodes in that graph, when in reality 143 is just a node ID (since DGL doesn’t allow string names for nodes) and the total number of nodes in that graph is 3. So I am unable to create the dataset class correctly. Is there anything obvious that I am doing wrong here? Here’s the script I am currently using:

import dgl
import urllib
import pandas as pd
import torch
from dgl.data import DGLDataset
from ast import literal_eval

edges = pd.read_csv('ade20k/val_edges_int.csv')
properties = pd.read_csv('ade20k/val_graph_int_properties.csv')

edges.head()

properties.head()

class ADE20kDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='synthetic')

    def process(self):
        edges = pd.read_csv('ade20k/val_edges_int.csv', converters={'feature': literal_eval})
        properties = pd.read_csv('ade20k/val_graph_int_properties.csv')
        self.graphs = []
        self.labels = []

        # Create a graph for each graph ID from the edges table.
        # First process the properties table into two dictionaries with graph IDs as keys.
        # The label and number of nodes are values.
        label_dict = {}
        # num_nodes_dict = {}
        for _, row in properties.iterrows():
            label_dict[row['graph_id']] = row['label']
            # num_nodes_dict[row['graph_id']] = row['num_nodes']

        # For the edges, first group the table by graph IDs.
        edges_group = edges.groupby('image_id')

        # For each graph ID...
        for graph_id in edges_group.groups:
            # Find the edges as well as the number of nodes and its label.
            edges_of_id = edges_group.get_group(graph_id)
            src = edges_of_id['src'].to_numpy()
            dst = edges_of_id['dst'].to_numpy()
            feature = edges_of_id['feature'].to_numpy()
            print(graph_id, len(feature[0]))
            print("The {}th graph has {} nodes and {} edges.".format(graph_id, src, dst))
            # num_nodes = num_nodes_dict[graph_id]
            label = label_dict[graph_id]

            # Create a graph and add it to the list of graphs and labels.
            g = dgl.graph((src, dst))
            g.ndata['feat'] = feature
            self.graphs.append(g)
            self.labels.append(label)

        # Convert the label list to tensor for saving.
        self.labels = torch.LongTensor(self.labels)

    def __getitem__(self, i):
        return self.graphs[i], self.labels[i]

    def __len__(self):
        return len(self.graphs)

dataset = ADE20kDataset()
graph, label = dataset[0]
print(graph, label)
print("The length of dataset is",len(dataset))

Any suggestions as to how to solve this error?

How about specifying num_nodes explicitly when creating the graph: g = dgl.graph((src, dst), num_nodes=xxx)? Otherwise, g.num_nodes() may not be what you expect, because it is inferred automatically from the src/dst IDs.

Hello @Rhett-Ying, thank you for replying. I’ve actually done that on my first try. That’s when I got the following error:

dgl._ffi.base.DGLError: The num_nodes argument must be larger than the max ID in the data, but got 11 and 144.

Because of this error, I commented out the num_nodes argument and let DGL calculate it on its own.
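
For context on the two error messages: when num_nodes is omitted, dgl.graph takes the node count to be the largest node ID plus one, not the number of distinct IDs. The NumPy-only sketch below reproduces that arithmetic on a made-up edge list (the IDs are illustrative, not taken from the CSV files) and does not call DGL itself.

```python
import numpy as np

# Hypothetical edge list for one graph: three distinct node IDs, as in the
# "graph 1" example from the post (0, 9, 143), fully connected.
src = np.array([0, 0, 9, 9, 143, 143])
dst = np.array([9, 143, 0, 143, 0, 9])

# What gets inferred when num_nodes is omitted: largest ID + 1.
inferred_num_nodes = int(max(src.max(), dst.max())) + 1

# What the graph actually contains: the number of distinct IDs.
actual_num_nodes = len(np.unique(np.concatenate((src, dst))))

print(inferred_num_nodes)  # 144
print(actual_num_nodes)    # 3
```

With a largest ID of 144, the inferred count is 145, which matches the 145 in the first error; the 42 is presumably the number of rows in that graph's slice of the edge table, since feature is read from there. The same rule explains why num_nodes=11 was rejected: 11 is not larger than the maximum ID, 144.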

Is feature an edge feature? feature = edges_of_id['feature'].to_numpy()
But you assign it as a node feature in the graph: g.ndata['feat'] = feature?

No, it is indeed a node feature. Since the custom dataset tutorial for graph classification in DGL does not cover node features, I looked at how the node classification code includes them and tried to replicate that. Apologies if that isn’t the correct way of including node features. And correct me if I am wrong, but since edges_of_id['feature'].to_numpy() contains all the features grouped under the same graph ID (the set of features belonging to one graph), I assumed that’s how I include features in the graph as well. In the node classification task the features are stored in the CSV along with the node labels, which does not carry over to graph classification, so I added the feature column to the edges CSV that contains the src and dst nodes. You can find my version of the CSV files here (they’re huge, so it’ll be easier to open them using pandas): Updated - Google Drive

Could you please let me know how to include node features in a custom graph classification dataset?

DGL has a utility class for creating a dataset from CSV files. Please take a look: 4.6 Loading data from CSV files — DGL 0.8.1 documentation

As for assigning node features to the nodes in each graph, it’s straightforward: just make sure the number of feature rows equals num_nodes. How did you get num_nodes explicitly? Does the maximum ID in src/dst equal num_nodes - 1, i.e. are the IDs consecutive starting from 0? If not, you need to map the src/dst IDs to consecutive ones before creating the graph.

Thanks a lot for the utility class link. I completely missed it. It does seem to have a section to load data for multiple graphs with node features. I’ll try using that.

Regarding how I got num_nodes explicitly: since in my case all the nodes in a graph are connected to each other, it was simply the total number of nodes. For example, for each graph I have a list of IDs that are to be the nodes of that graph, so the length of that list is the num_nodes for that graph, isn’t it?

And no, the maximum ID in src/dst is not equal to the number of nodes in the graph. Could you please share an article explaining what you mean by converting the src/dst IDs into continuous ones before creating the graph?

In your case, does max(src/dst ID) + 1 equal the length of that list? Namely, are the src/dst IDs labelled from 0 to num_nodes-1? If not, you need to re-label the src/dst IDs like this:

import numpy as np
ids = np.unique(np.concatenate((src,dst), 0))
mapping = {index : i for i, index in enumerate(ids)}
# Use mapping to convert the original src/dst into new src/dst that are labelled from 0.
...
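
One way to complete the relabeling step, as a sketch: np.unique with return_inverse=True returns both the sorted distinct IDs and each original ID's position in that sorted array, which is the 0-based relabeling described above. The example IDs and the feat_by_id dictionary are hypothetical; how features are tied to nodes in the actual CSV files is not shown in the thread.

```python
import numpy as np

# Hypothetical edges for one graph, using original (non-consecutive) IDs.
src = np.array([1, 1, 7, 9])
dst = np.array([7, 148, 9, 148])

# ids: sorted distinct original IDs; inverse: new 0-based ID of each entry.
ids, inverse = np.unique(np.concatenate((src, dst)), return_inverse=True)
new_src, new_dst = inverse[:len(src)], inverse[len(src):]
num_nodes = len(ids)

# Hypothetical per-node features keyed by original ID
# (length 4 here for brevity; 1024 in the original post).
feat_by_id = {i: np.full(4, float(i)) for i in (1, 7, 9, 148)}

# Row k holds the feature of the node whose new ID is k.
feat = np.stack([feat_by_id[int(i)] for i in ids])

print(new_src, new_dst, num_nodes, feat.shape)
```

The relabelled arrays can then be passed to dgl.graph((new_src, new_dst), num_nodes=num_nodes), and feat, converted to a tensor, has exactly one row per node, so assigning it to g.ndata['feat'] should satisfy the length check. Keeping ids around makes it possible to translate new IDs back to the original ones.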