Creating a Dataset for Node Classification with Multiple Graphs

Hi,
I have some questions about loading data.
As shown in the diagram below, my model performs node classification and assigns each node to one of two classes.
My dataset contains multiple graphs, each with a different structure and about ten thousand nodes. Only the nodes have features; the edges and graphs do not.

In the DGL 2.1.x user guide, chapter 4.6, I saw that data can be loaded from CSV files. Which of these two approaches should I use?

  1. ‘Dataset of a single graph with features and labels’: combine all graphs into a single graph
  2. ‘Dataset of multiple graphs’

Does anyone have better suggestions?

In your case, you want to train a node classifier using the node labels on the graphs in the training set, and then predict on unseen test graphs. You should use the ‘Dataset of multiple graphs’ option.

However, your setup is a little different from the one in the user guide. Since your labels are per node rather than per graph, graphs.csv doesn’t need to contain labels; provide them in nodes.csv instead. You should also include a graph_id column in both nodes.csv and edges.csv so that the nodes and edges of different graphs can be told apart.
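To make the layout concrete, here is a minimal sketch that writes such a dataset folder using only the Python standard library. The two tiny graphs, the folder name multi_graph_dataset, and the helper write_dataset are made up for illustration. The column names (graph_id, node_id, src_id, dst_id, label, feat) and the meta.yaml keys are the ones I recall from the user guide's CSV chapter, so please check them against the guide for your DGL version.

```python
import csv
import tempfile
from pathlib import Path

# Two tiny hypothetical graphs; real ones would have ~10,000 nodes each.
# Node rows: (graph_id, node_id, label, feature vector)
nodes = [
    (0, 0, 0, [0.1, 0.2]),
    (0, 1, 1, [0.3, 0.4]),
    (0, 2, 0, [0.5, 0.6]),
    (1, 0, 1, [0.7, 0.8]),
    (1, 1, 0, [0.9, 1.0]),
]
# Edge rows: (graph_id, src_id, dst_id); node IDs are local to each graph.
edges = [(0, 0, 1), (0, 1, 2), (1, 0, 1)]

# Assumed meta.yaml layout (verify against the user guide).
META_YAML = """dataset_name: multi_graph_dataset
edge_data:
- file_name: edges.csv
node_data:
- file_name: nodes.csv
graph_data:
  file_name: graphs.csv
"""


def write_dataset(root):
    """Write nodes.csv, edges.csv, graphs.csv and meta.yaml under root."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)

    with open(root / "nodes.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["graph_id", "node_id", "label", "feat"])
        for gid, nid, label, feat in nodes:
            # A feature vector is stored as one comma-separated string cell.
            writer.writerow([gid, nid, label, ",".join(str(x) for x in feat)])

    with open(root / "edges.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["graph_id", "src_id", "dst_id"])
        writer.writerows(edges)

    # No graph-level labels or features here: only the graph IDs.
    with open(root / "graphs.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["graph_id"])
        for gid in sorted({row[0] for row in nodes}):
            writer.writerow([gid])

    (root / "meta.yaml").write_text(META_YAML)
    return root


dataset_dir = write_dataset(Path(tempfile.mkdtemp()) / "multi_graph_dataset")
```

The resulting folder should then be loadable with dgl.data.CSVDataset by passing it the folder path, assuming the column names above match what your DGL version expects.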

During training, you will load mini-batches of graphs from the training set and combine each mini-batch into a single batched graph for learning.
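In case it helps to see what "combine into a single graph" means, here is a plain-Python sketch of the idea. It is not DGL's implementation; the function batch_graphs and the tuple layout (num_nodes, edges, labels) are hypothetical. The graphs are merged as disjoint components by shifting each graph's node IDs by the number of nodes that came before it, so no edges are created between graphs.

```python
def batch_graphs(graphs):
    """Merge graphs into one graph made of disjoint components.

    Each input graph is a tuple (num_nodes, edges, labels), where edges is
    a list of (src, dst) pairs using node IDs local to that graph.
    """
    offset = 0
    all_edges, all_labels, graph_ids = [], [], []
    for gid, (num_nodes, edges, labels) in enumerate(graphs):
        # Shift local node IDs so they stay unique in the merged graph.
        all_edges.extend((src + offset, dst + offset) for src, dst in edges)
        all_labels.extend(labels)
        # Remember which original graph each merged node came from.
        graph_ids.extend([gid] * num_nodes)
        offset += num_nodes
    return offset, all_edges, all_labels, graph_ids


# Two small hypothetical graphs with binary node labels.
g0 = (3, [(0, 1), (1, 2)], [0, 1, 0])
g1 = (2, [(0, 1)], [1, 0])

num_nodes, edges, labels, graph_ids = batch_graphs([g0, g1])
print(num_nodes)  # 5
print(edges)      # [(0, 1), (1, 2), (3, 4)]
print(labels)     # [0, 1, 0, 1, 0]
print(graph_ids)  # [0, 0, 0, 1, 1]
```

In DGL itself you would not write this by hand: dgl.batch merges a list of graphs in this way, and the node features and labels of the batched graph are the concatenation of those of the individual graphs, so a node classification loss can be computed over all nodes in the batch at once.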