Datasets from multiple CSV

Hi everyone.

I ran into a problem while trying to create multiple datasets from multiple CSVs. I replaced the hard-coded CSV file names with variables, so that when I create a dataset I can pass different CSVs as arguments and get different datasets. The start of my code is:

class Dataset(DGLDataset):
    def __init__(self, nodes, relations):
        super().__init__(name='i2b2_dataset')
        self.nodes = nodes
        self.relations = relations

    def process(self):
        nodes_data = pd.read_csv(nodes)
        edges_data = pd.read_csv(relations)

So when I instantiate the class, I can create different datasets with it:

dataset_1 = Dataset(nodes='nodes_1.csv', relations='relations_2.csv')
dataset_2 = Dataset(nodes='nodes_2.csv', relations='relations_2.csv')

The problem is that the "nodes" and "relations" variables inside the "process()" function are not defined, and I get: "NameError: name 'nodes' is not defined"

How and where should I declare them?

Thank you so much.

I did not get it. It seems that the code snippet you posted is incomplete and inconsistent with your description. E.g., where did you use self.nodes and self.relations?

Sorry. This is the full code (sorry about the indentation):

import dgl
import pandas as pd
import torch
from dgl.data import DGLDataset

class Dataset(DGLDataset):
    def __init__(self):
        super().__init__(name='i2b2_dataset')
        #self.nodes = nodes
        #self.relations = relations

    def process(self):
        nodes_data = pd.read_csv("nodes.csv")
        edges_data = pd.read_csv("relations.csv")

        # Encode the categorical 'Type' columns as integer codes
        node_features = torch.from_numpy(nodes_data['Type'].astype('category').cat.codes.to_numpy())
        edge_features = torch.from_numpy(edges_data['Type'].astype('category').cat.codes.to_numpy())

        edges_src = torch.from_numpy(edges_data['Start'].to_numpy())
        edges_dst = torch.from_numpy(edges_data['End'].to_numpy())

        self.graph = dgl.graph((edges_src, edges_dst), num_nodes=nodes_data.shape[0])
        self.graph.ndata['Type'] = node_features
        self.graph.edata['Type'] = edge_features
        self.graph.ndata['Embeddings'] = torch.rand(self.graph.num_nodes(), 768)

        # 60/20/20 train/validation/test split over the nodes, in file order
        n_nodes = nodes_data.shape[0]
        n_train = int(n_nodes * 0.6)
        n_val = int(n_nodes * 0.2)
        train_mask = torch.zeros(n_nodes, dtype=torch.bool)
        val_mask = torch.zeros(n_nodes, dtype=torch.bool)
        test_mask = torch.zeros(n_nodes, dtype=torch.bool)
        train_mask[:n_train] = True
        val_mask[n_train:n_train + n_val] = True
        test_mask[n_train + n_val:] = True
        self.graph.ndata['train_mask'] = train_mask
        self.graph.ndata['val_mask'] = val_mask
        self.graph.ndata['test_mask'] = test_mask

    def __getitem__(self, i):
        return self.graph

    def __len__(self):
        return 1

dataset_1 = Dataset(nodes='nodes_1.csv', relations='relations_2.csv')
dataset_2 = Dataset(nodes='nodes_2.csv', relations='relations_2.csv')

The thing is that I intend to create multiple datasets, so I replaced the hard-coded nodes and relations file names from the original Karate Club code (the "Make Your Own Dataset" page of the DGL 0.6.1 documentation) with two variables, so that I can use this class to create more than one dataset object. I hope this helps.

Thanks again.

I guess what you are looking for is something like the following?

class Dataset(DGLDataset):
    def __init__(self, nodes, relations, ...):
        self.nodes = nodes
        self.relations = relations

    def process(self, ...):
        nodes_data = pd.read_csv(self.nodes)
        edges_data = pd.read_csv(self.relations)
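
One thing to watch for with this pattern: as far as I know, DGLDataset.__init__ runs process() during construction, so self.nodes and self.relations need to be assigned before super().__init__() is called, not after. Here is a minimal sketch of that ordering. To keep it runnable without DGL installed, it uses a hypothetical stand-in base class (BaseDataset, which is not part of DGL) and the standard-library csv module instead of pandas; the class name CsvDataset, the helper functions and the demo file contents are also made up for illustration.

```python
import csv
import os
import tempfile


class BaseDataset:
    """Hypothetical stand-in for DGLDataset: it calls process() from __init__."""

    def __init__(self, name):
        self.name = name
        self.process()

    def process(self):
        raise NotImplementedError


class CsvDataset(BaseDataset):
    def __init__(self, nodes, relations):
        # Store the paths first: the base __init__ calls process() right away.
        self.nodes = nodes
        self.relations = relations
        super().__init__(name="i2b2_dataset")

    def process(self):
        # Constructor arguments are only reachable through self here;
        # a bare `nodes` would raise NameError.
        with open(self.nodes, newline="") as f:
            self.nodes_data = list(csv.DictReader(f))
        with open(self.relations, newline="") as f:
            self.edges_data = list(csv.DictReader(f))


def write_csv(path, header, rows):
    """Demo helper only: write a small CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def build_demo_datasets():
    """Create two tiny CSV pairs in a temp dir and load one dataset from each."""
    with tempfile.TemporaryDirectory() as tmp:
        datasets = []
        for i, n_nodes in enumerate((2, 3), start=1):
            nodes_path = os.path.join(tmp, f"nodes_{i}.csv")
            rels_path = os.path.join(tmp, f"relations_{i}.csv")
            write_csv(nodes_path, ["Id", "Type"], [[k, "A"] for k in range(n_nodes)])
            write_csv(rels_path, ["Start", "End", "Type"], [[0, 1, "r"]])
            datasets.append(CsvDataset(nodes=nodes_path, relations=rels_path))
        return datasets


dataset_1, dataset_2 = build_demo_datasets()
print(len(dataset_1.nodes_data), len(dataset_2.nodes_data))
```

With the real DGLDataset the same ordering should apply: keep the two assignments above the super().__init__(name=...) call and read the files in process() with pd.read_csv(self.nodes) and pd.read_csv(self.relations).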