Question about details of the Reddit data preparation procedure

Hi all,
I just ran the GraphSage unsupervised demo with the Reddit npz files.
Now I want to use my own dataset. While debugging the code, I found some details about the data structure in the npz files, but not enough to reconstruct my own npz files.
Is there any way to find the scripts that generate the npz files, or at least some clue about which CSV on the Reddit GitHub is used to generate the graph and the features?

Thanks.

Hi,

You don’t have to exactly follow what RedditDataset does to use your own dataset: RedditDataset is only responsible for generating the following objects as in https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling_unsupervised.py#L349

    data = train_mask, val_mask, test_mask, in_feats, labels, n_classes, g

So essentially you only need to prepare these objects from your CSV.

  • train_mask, val_mask, test_mask: numpy boolean arrays indicating whether the node belongs to training/validation/test set.
  • in_feats: input feature size.
  • labels: the ground-truth class of each node, as an integer.
  • n_classes: number of possible classes.
  • g: the graph. Node features are stored in g.ndata['features'] as a matrix (i.e. one float vector per node).
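
As a rough illustration of the list above, here is a minimal sketch that builds everything except the graph from plain NumPy arrays. The function name prepare_data, the toy arrays, and the 80/10/10 random split are all my own assumptions for the example, not what RedditDataset does; in practice the arrays would come from your CSV (e.g. via np.loadtxt). The graph step is only shown as a comment and assumes a recent DGL release where dgl.graph is available.

```python
import numpy as np

def prepare_data(features, labels, train_frac=0.8, val_frac=0.1, seed=0):
    """Hypothetical helper: build the non-graph objects from plain arrays.

    features: (num_nodes, in_feats) float array, one row per node
    labels:   (num_nodes,) integer class ids
    """
    features = np.asarray(features, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.int64)
    num_nodes = features.shape[0]

    # Random split; the fractions are illustrative only.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)

    # One boolean mask per split, each of length num_nodes.
    train_mask = np.zeros(num_nodes, dtype=bool)
    val_mask = np.zeros(num_nodes, dtype=bool)
    test_mask = np.zeros(num_nodes, dtype=bool)
    train_mask[perm[:n_train]] = True
    val_mask[perm[n_train:n_train + n_val]] = True
    test_mask[perm[n_train + n_val:]] = True

    in_feats = features.shape[1]
    # Assumes labels are encoded as 0 .. n_classes - 1.
    n_classes = int(labels.max()) + 1
    return train_mask, val_mask, test_mask, in_feats, labels, n_classes

# Toy stand-in for data loaded from a CSV: 10 nodes, 4 features, 3 classes.
features = np.arange(40, dtype=np.float32).reshape(10, 4)
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
src = np.array([0, 1, 2, 3])  # edge source node ids
dst = np.array([1, 2, 3, 4])  # edge destination node ids

train_mask, val_mask, test_mask, in_feats, labels, n_classes = prepare_data(
    features, labels
)

# Assumption: with a recent DGL release the graph could then be built as
#   import dgl, torch
#   g = dgl.graph((torch.tensor(src), torch.tensor(dst)), num_nodes=len(features))
#   g.ndata['features'] = torch.tensor(features)
#   data = train_mask, val_mask, test_mask, in_feats, labels, n_classes, g
```

Depending on your DGL version, the training script may expect the masks and labels as tensors rather than NumPy arrays, so check how data is unpacked in the script you are running.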

Feel free to follow up.

Thank you very much. Let me give it a try.