Loading data from CSV files

Hello. I want to load data from my CSV files to do graph classification. In the example, the ‘feat’ column is written like “0.736833152378035,0.10522806046048205,0.9418796835016118”. My node, edge, and graph features are floats, so I converted them to str, concatenated them with ‘+’, and put the result in the ‘feat’ column. Is that right? I’m not sure about this.
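
For reference, here is a minimal sketch of one way to build such a ‘feat’ cell in plain Python. It assumes the format is what the quoted example suggests: the values of one row joined by commas with no spaces. The helper names encode_feat and decode_feat are made up for illustration and are not part of DGL. Concatenating with ‘+’ only gives the same result if a ‘,’ is added between the values.

```python
def encode_feat(values):
    """Join numeric features into one comma-separated string for a 'feat' cell."""
    # repr() of a float round-trips exactly; no spaces around the commas.
    return ",".join(repr(float(v)) for v in values)


def decode_feat(cell):
    """Parse a comma-separated 'feat' cell back into a list of floats."""
    return [float(part) for part in cell.split(",")]


# Example values taken from the rows described in this thread.
node_feat = encode_feat([4, 0])
edge_feat = encode_feat([14.914122846632385, 1.0])

print(node_feat)               # 4.0,0.0
print(decode_feat(edge_feat))  # same floats that went in
```

Decoding the cell again is a cheap way to confirm that nothing was lost or glued together when the string was built.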

I made meta.yaml, nodes.csv, edges.csv, and graphs.csv in the same folder, but when I try to load the data, a KeyError is raised after a few minutes. Here is a rough copy of the error. (I’m writing the code on a separate internal-network computer, so I can’t copy and paste it.)

KeyError                                  Traceback (most recent call last)
---> dataset = dgl.data.CSVDataset('./GC')

in CSVDataset.__init__()
---> super().__init__(
         ds_name,
         raw_dir=os.path.dirname(meta_yaml_path),
         force_reload=force_reload,
         verbose=verbose,
         transform=transform,
     )

in DGLDataset.__init__()
     else:
         self._save_dir = save_dir
---> self._load()

in DGLDataset._load(self)
     if not load_flag:
         self._download()
--->     self.process()
         self.save()
         if self.verbose:

in CSVDataset.process(self)
         graph_data = GraphData.load_from_csv(
             meta_graph,
             base_dir=base_dir,
             separator=meta_yaml.separator,
             data_parser=data_parser,
         )
     # construct graphs
---> self.graphs, self.data = DGLGraphConstructor.construct_graphs(
         node_data, edge_data, graph_data
     )
     if len(self.data) == 1:
         self.labels = list(self.data.values())[0]

in DGLGraphConstructor.construct_graphs(node_data, edge_data, graph_data)
         edge_data = [edge_data]
     node_dict = NodeData.to_dict(node_data)
---> edge_dict = EdgeData.to_dict(edge_data, node_dict)
     graph_dict = DGLGraphConstructor._construct_graphs(node_dict, edge_dict)
     if graph_data is None:

in EdgeData.to_dict(edge_data, node_dict)
     orig_src_ids = e_data.src[idx].astype(
         node_dict[graph_id][src_type]['dtype']
     )
     orig_dst_ids = e_data.dst[idx].astype(
         node_dict[graph_id][dst_type]['dtype']
     )
---> src_ids = [src_mapping[index] for index in orig_src_ids]
     dst_ids = [dst_mapping[index] for index in orig_dst_ids]
     if graph_id not in edge_dict:

in <listcomp>(.0)
     orig_src_ids = e_data.src[idx].astype(
         node_dict[graph_id][src_type]['dtype']
     )
     orig_dst_ids = e_data.dst[idx].astype(
         node_dict[graph_id][dst_type]['dtype']
     )
---> src_ids = [src_mapping[index] for index in orig_src_ids]
     dst_ids = [dst_mapping[index] for index in orig_dst_ids]
     if graph_id not in edge_dict:

KeyError: 975083

I’m wondering why this error happened and how to fix it.
I checked that every node and graph ID appears in ‘node_id’ and ‘graph_id’, with no duplicates.
I would really appreciate a reply to my question.
Thank you.

Have you checked this user guide? It may also be helpful if you can provide some fake data files and a script to reproduce the errors.
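
A minimal sketch of what such fake data files could look like, written with the Python standard library only. Everything in it is hypothetical toy data: the values, the dataset name toy_gc, and the helper names write_csv and make_toy_dataset are invented for illustration. The column names and the meta.yaml keys follow the layout described in the CSVDataset user guide, as far as I understand it.

```python
import csv
import os
import tempfile

# Hypothetical toy dataset: two graphs, node IDs restart at 0 in each graph.
NODES = [  # graph_id, node_id, feat
    (0, 0, "4.0,0.0"),
    (0, 1, "2.0,1.0"),
    (1, 0, "3.0,0.0"),
    (1, 1, "1.0,1.0"),
]
EDGES = [  # graph_id, src_id, dst_id, feat
    (0, 0, 1, "14.9,1.0"),
    (1, 1, 0, "7.5,0.0"),
]
GRAPHS = [  # graph_id, label, feat
    (0, 0, "5790.0"),
    (1, 1, "120.0"),
]

# meta.yaml tells the loader which CSV file holds which kind of data.
META_YAML = """\
dataset_name: toy_gc
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
graph_data:
  file_name: graphs.csv
"""


def write_csv(path, header, rows):
    """Write a header plus rows; csv quotes cells that contain commas."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def make_toy_dataset(folder):
    """Create nodes.csv, edges.csv, graphs.csv and meta.yaml in `folder`."""
    write_csv(os.path.join(folder, "nodes.csv"),
              ["graph_id", "node_id", "feat"], NODES)
    write_csv(os.path.join(folder, "edges.csv"),
              ["graph_id", "src_id", "dst_id", "feat"], EDGES)
    write_csv(os.path.join(folder, "graphs.csv"),
              ["graph_id", "label", "feat"], GRAPHS)
    with open(os.path.join(folder, "meta.yaml"), "w") as f:
        f.write(META_YAML)
    return folder


folder = make_toy_dataset(tempfile.mkdtemp())
print(sorted(os.listdir(folder)))

# With DGL installed, the folder can then be passed to the loader:
#   import dgl
#   dataset = dgl.data.CSVDataset(folder)
```

Shrinking the real files down to a handful of rows like this, while keeping the rows that trigger the error, is usually enough for others to reproduce the problem.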

Yes, I followed the example on that page.
My dataset looks like this:

  • nodes.csv (2301029 rows x 3 columns):
    graph_id (numpy.int64)
    node_id (numpy.int64)
    feat (str)
    EX) graph_id 9967 / node_id 0 / feat 4,0
  • edges.csv (2466770 rows x 4 columns):
    graph_id (numpy.int64)
    src_id (numpy.int64)
    dst_id (numpy.int64)
    feat (str)
    EX) graph_id 1549 / src_id 461295 / dst_id 185738 / feat 14.914122846632385, 1.0
  • graphs.csv (14513 rows x 3 columns):
    graph_id (numpy.int64)
    label (numpy.int64)
    feat (str)
    EX) graph_id 0 / label 0 / feat 5790.0

What I did was just:
import dgl
dataset = dgl.data.CSVDataset('./folder_name')
print(len(dataset))

Many things can go wrong in the data parsing process. Without access to your data, or at least a sample that reproduces the issue, it’s hard to give further suggestions.

It seems the problem was that some nodes belong to several graph IDs. My goal is to classify ‘the 2-step graph around one node’, so I now create those nodes independently for each graph. Thanks for your reply, mufeili.
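
For anyone hitting the same KeyError, here is a rough sketch of a check that can be run before loading. It rests on an assumption drawn from the traceback, which indexes node_dict[graph_id]: node IDs appear to be resolved per graph, so every src_id and dst_id of an edge has to be listed in nodes.csv under the same graph_id as that edge. The function name find_dangling_edges and the sample rows are made up for illustration.

```python
from collections import defaultdict


def find_dangling_edges(node_rows, edge_rows):
    """Return edges whose src or dst is not a node of the same graph.

    node_rows: iterable of (graph_id, node_id)
    edge_rows: iterable of (graph_id, src_id, dst_id)
    """
    nodes_by_graph = defaultdict(set)
    for graph_id, node_id in node_rows:
        nodes_by_graph[graph_id].add(node_id)

    dangling = []
    for graph_id, src_id, dst_id in edge_rows:
        known = nodes_by_graph.get(graph_id, set())
        if src_id not in known or dst_id not in known:
            dangling.append((graph_id, src_id, dst_id))
    return dangling


# Hypothetical rows: node 10 is listed only under graph 0,
# but the second edge uses it inside graph 1.
node_rows = [(0, 10), (0, 11), (1, 11), (1, 12)]
edge_rows = [(0, 10, 11), (1, 10, 12)]

print(find_dangling_edges(node_rows, edge_rows))  # [(1, 10, 12)]
```

If the check returns any edges, one fix is the one described above: list the shared node again in nodes.csv under each graph_id that uses it.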
