What happens in this code when the dataset is changed?

I have two datasets, Cora and CiteSeer. In the Cora dataset, the paper_id values are numeric, such as

123, 345, 675

whereas in the CiteSeer dataset some of them are strings:

435, der, 456

In addition, the citation data is a relation between paper_id values. In the Cora dataset it takes the form

123  345
675  123
  .
  .

but in the CiteSeer dataset it takes the form

435  der
der  456

The following code runs on the Cora dataset but does not run on the CiteSeer dataset:

```
import os

import pandas as pd

# data_dir is the folder holding cora.cites and cora.content (defined elsewhere).
citations = pd.read_csv(
    os.path.join(data_dir, "cora.cites"),
    sep="\t",
    header=None,
    names=["target", "source"],
)
print("Citations shape:", citations.shape)

column_names = ["paper_id"] + [f"term_{idx}" for idx in range(1433)] + ["subject"]
papers = pd.read_csv(
    os.path.join(data_dir, "cora.content"),
    sep="\t",
    header=None,
    names=column_names,
)
print("Papers shape:", papers.shape)

# Map class names and paper IDs to zero-based integer indices.
class_values = sorted(papers["subject"].unique())
class_idx = {name: idx for idx, name in enumerate(class_values)}
paper_idx = {name: idx for idx, name in enumerate(sorted(papers["paper_id"].unique()))}

papers["paper_id"] = papers["paper_id"].apply(lambda name: paper_idx[name])
citations["source"] = citations["source"].apply(lambda name: paper_idx[name])
citations["target"] = citations["target"].apply(lambda name: paper_idx[name])
papers["subject"] = papers["subject"].apply(lambda value: class_idx[value])
```

On the CiteSeer dataset, the line that maps the citations raises a KeyError: 'ghani01hypertext'

Are you trying to parse the Cora and CiteSeer datasets on your own? Are they newer than, or different from, the ones DGL provides? If not, why not use dgl.data.CoraGraphDataset and dgl.data.CiteseerGraphDataset? DGL highly recommends processing graph data into a dgl.data.DGLDataset subclass. Please refer to "Make Your Own Dataset" in the DGL 0.8 documentation.

As for the KeyError in your case, are you trying to parse both datasets with the same code? You may need to debug on CiteSeer and change the code where required.
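One possible way to debug this, as a minimal sketch rather than a definitive fix: the KeyError means that 'ghani01hypertext' appears in the citations file but is not a key of paper_idx, which is built only from the paper_id column of the content file. The example below reads the IDs as strings, reports the citation rows whose endpoints are unknown, and drops them before mapping. The inline rows are made-up stand-ins for citeseer.content and citeseer.cites with only two term columns, so column_names has to be adapted to the real file, and dropping the unmatched rows is an assumption about what you want.

```python
import io

import pandas as pd

# Hypothetical stand-ins for citeseer.content and citeseer.cites.
content_tsv = "435\t1\t0\tAI\nder\t0\t1\tML\n456\t1\t1\tAI\n"
cites_tsv = "435\tder\nder\t456\nghani01hypertext\t435\n"

column_names = ["paper_id", "term_0", "term_1", "subject"]

# Read IDs as strings so that "435" in one file matches "435" in the other.
papers = pd.read_csv(
    io.StringIO(content_tsv),
    sep="\t",
    header=None,
    names=column_names,
    dtype={"paper_id": str},
)
citations = pd.read_csv(
    io.StringIO(cites_tsv),
    sep="\t",
    header=None,
    names=["target", "source"],
    dtype=str,
)

# Find citation rows that refer to a paper with no row in the content file.
known_ids = set(papers["paper_id"])
is_known = citations["source"].isin(known_ids) & citations["target"].isin(known_ids)
missing = citations[~is_known]
print("Citations with unknown papers:", len(missing))

# Keep only citations whose endpoints both exist (an assumed policy).
citations = citations[is_known].copy()

# Same index mapping as in the question, now safe from KeyError.
paper_idx = {name: idx for idx, name in enumerate(sorted(papers["paper_id"].unique()))}
papers["paper_id"] = papers["paper_id"].map(paper_idx)
citations["source"] = citations["source"].map(paper_idx)
citations["target"] = citations["target"].map(paper_idx)
```

Inspecting missing first shows whether the problem is a handful of dangling citations or a systematic mismatch, such as a wrong column count or separator, before any rows are discarded.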
