Datasets from multiple CSV

ogggcar · June 16, 2021, 11:18am

Hi everyone.

I ran into a problem while trying to create multiple datasets from multiple CSVs. I replaced the hard-coded CSV file names with variables, so that when I create a dataset I can pass different CSVs as arguments in order to create different datasets. The start of my code is:

class Dataset(DGLDataset):
    def __init__(self, nodes, relations):
        super().__init__(name='i2b2_dataset')
        self.nodes = nodes
        self.relations = relations

    def process(self):
        nodes_data = pd.read_csv(nodes)
        edges_data = pd.read_csv(relations)

So when I call the class, I can create different datasets with it:

dataset_1 = Dataset(nodes='nodes_1.csv', relations='relations_2.csv')
dataset_2 = Dataset(nodes='nodes_2.csv', relations='relations_2.csv')

The problem is that the “nodes” and “relations” variables used inside the “process()” function are not defined there, and I get: “NameError: name 'nodes' is not defined”
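The same error can be reproduced without DGL or pandas. Below is a minimal sketch; the class name `Example` is hypothetical and only stands in for the dataset class above:

```python
class Example:  # hypothetical class, not part of DGL
    def __init__(self, nodes):
        self.nodes = nodes  # the argument is stored on the instance here

    def process(self):
        # `nodes` was a parameter of __init__ only; the bare name does not
        # exist in this method, so looking it up raises NameError.
        return nodes


try:
    Example("nodes_1.csv").process()
except NameError as err:
    message = str(err)
    print(message)
```

So the value passed to the constructor is stored, but the bare name is not visible from the other method.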

How and where should I declare them?

Thank you so much.

mufeili · June 17, 2021, 2:30am

I did not get it. It seems that the code snippet you posted is incomplete and inconsistent with your description. E.g., where did you use self.nodes and self.relations?

ogggcar · June 17, 2021, 5:25am

Sorry. This is the full code (sorry about the indentation):

class Dataset(DGLDataset):
    def __init__(self):
        super().__init__(name='i2b2_dataset')
        #self.nodes = nodes
        #self.relations = relations

    def process(self):
        nodes_data = pd.read_csv("nodes.csv")
        edges_data = pd.read_csv("relations.csv")

        node_features = torch.from_numpy(nodes_data['Type'].astype('category').cat.codes.to_numpy())

        edge_features = torch.from_numpy(edges_data['Type'].astype('category').cat.codes.to_numpy())

        edges_src = torch.from_numpy(edges_data['Start'].to_numpy())
        edges_dst = torch.from_numpy(edges_data['End'].to_numpy())

        self.graph = dgl.graph((edges_src, edges_dst), num_nodes=nodes_data.shape[0])
        self.graph.ndata['Type'] = node_features
        self.graph.edata['Type'] = edge_features
        self.graph.ndata['Embeddings'] = torch.rand(self.graph.num_nodes(), 768)

        n_nodes = nodes_data.shape[0]
        n_train = int(n_nodes * 0.6)
        n_val = int(n_nodes * 0.2)
        train_mask = torch.zeros(n_nodes, dtype=torch.bool)
        val_mask = torch.zeros(n_nodes, dtype=torch.bool)
        test_mask = torch.zeros(n_nodes, dtype=torch.bool)
        train_mask[:n_train] = True
        val_mask[n_train:n_train + n_val] = True
        test_mask[n_train + n_val:] = True
        self.graph.ndata['train_mask'] = train_mask
        self.graph.ndata['val_mask'] = val_mask
        self.graph.ndata['test_mask'] = test_mask

    def __getitem__(self, i):
        return self.graph

    def __len__(self):
        return 1

dataset_1 = Dataset(nodes='nodes_1.csv', relations='relations_2.csv')
dataset_2 = Dataset(nodes='nodes_2.csv', relations='relations_2.csv')

The thing is, I intend to create multiple datasets, so I replaced the hard-coded nodes and relations file names in the original Karate Club code (Make Your Own Dataset — DGL 0.6.1 documentation) with two variables, so that I can use this class to create more than one dataset object. Hope that helps.

Thanks again.

mufeili · June 21, 2021, 6:56am

I guess what you are looking for is something like the following?

class Dataset(DGLDataset):
    def __init__(self, nodes, relations, ...):
        self.nodes = nodes
        self.relations = relations

    def process(self, ...):
        nodes_data = pd.read_csv(self.nodes)
        edges_data = pd.read_csv(self.relations)
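
One detail worth checking when filling this in: as far as I can tell, the DGLDataset constructor itself triggers process(), so the attributes need to be assigned before super().__init__() is called. The sketch below illustrates that ordering without DGL installed. `BaseDataset` is a hypothetical stand-in that only mimics this assumed behavior, `CsvDataset` and `write_csv` are illustrative names, and the standard `csv` module is swapped in for pandas; in the real class the reads would be pd.read_csv(self.nodes) and pd.read_csv(self.relations).

```python
import csv
import os
import tempfile


class BaseDataset:
    """Hypothetical stand-in for DGLDataset (not the real class)."""

    def __init__(self, name):
        self.name = name
        # Assumption: like the real base class, the constructor runs process().
        self.process()

    def process(self):
        raise NotImplementedError


class CsvDataset(BaseDataset):
    def __init__(self, nodes, relations):
        # Store the paths first, because process() runs inside super().__init__().
        self.nodes = nodes
        self.relations = relations
        super().__init__(name="i2b2_dataset")

    def process(self):
        # Read through self.<attribute>; the bare names only exist in __init__.
        with open(self.nodes, newline="") as f:
            self.nodes_data = list(csv.DictReader(f))
        with open(self.relations, newline="") as f:
            self.edges_data = list(csv.DictReader(f))


def write_csv(path, header, rows):
    """Helper for this demo only: write a small CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Build two tiny, made-up CSV pairs so that the example is self-contained.
tmp = tempfile.mkdtemp()
paths = {}
for i, n_nodes in ((1, 2), (2, 3)):
    paths[i] = (
        os.path.join(tmp, f"nodes_{i}.csv"),
        os.path.join(tmp, f"relations_{i}.csv"),
    )
    write_csv(paths[i][0], ["Id", "Type"], [[k, "A"] for k in range(n_nodes)])
    write_csv(paths[i][1], ["Start", "End", "Type"], [[0, 1, "rel"]])

dataset_1 = CsvDataset(nodes=paths[1][0], relations=paths[1][1])
dataset_2 = CsvDataset(nodes=paths[2][0], relations=paths[2][1])
print(len(dataset_1.nodes_data), len(dataset_2.nodes_data))
```

With this ordering, each object reads its own pair of files, so the same class can back any number of datasets.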
