How to efficiently load data from local disk

For example, I downloaded the Reddit dataset to my local disk. How can I efficiently change the
class RedditDataset(DGLBuiltinDataset) so that it doesn't need to download the files again? Thanks!

Hi, DGL’s built-in RedditDataset class automatically downloads the files and avoids redownloading them if they already exist. Please see the API doc for usage: https://docs.dgl.ai/api/python/dgl.data.html#reddit-dataset
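For controlling where those files land, DGL reads its download directory from an environment variable. A minimal stdlib sketch of that lookup (the variable name DGL_DOWNLOAD_DIR and the ~/.dgl default are assumptions based on DGL's docs, not verified here):

```python
import os

def get_download_dir():
    """Default download directory: DGL_DOWNLOAD_DIR if set, else ~/.dgl."""
    default = os.path.join(os.path.expanduser('~'), '.dgl')
    return os.environ.get('DGL_DOWNLOAD_DIR', default)
```

Setting the environment variable before constructing the dataset should redirect all downloads in one place.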

Thank you @minjie. Please correct me if I am wrong:

I see that the has_cache() function checks whether a cache exists, but it only checks for the dgl_graph.bin format. Is there any configuration that lets me set the download file path and load another format such as .npz? Thanks!

def has_cache(self):
    graph_path = os.path.join(self.save_path, 'dgl_graph.bin')
    return os.path.exists(graph_path)
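A more flexible cache check can take the expected filenames as a parameter, so a cache of .npz files is verified exactly the same way as dgl_graph.bin. A minimal sketch (the filenames parameter is a hypothetical addition, not part of DGL's API):

```python
import os

def has_cache(save_path, filenames=('dgl_graph.bin',)):
    """Return True only when every cached artifact exists under save_path.

    Pass e.g. filenames=('reddit_graph.npz', 'reddit_data.npz') to treat
    locally stored .npz files as a valid cache.
    """
    return all(os.path.exists(os.path.join(save_path, f)) for f in filenames)
```

An overridden has_cache() in a dataset subclass could delegate to this with whatever file list the dataset actually saves.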

Currently I have rewritten some functions to load the local data:

import os
import numpy as np
import scipy.sparse as sp
from dgl import from_scipy
from dgl import backend as F
from dgl.data.utils import generate_mask_tensor

def process(raw_path):
    # raw_path is the directory holding the downloaded .npz files
    # graph
    coo_adj = sp.load_npz(os.path.join(raw_path, "reddit_graph.npz"))
    reddit_graph = from_scipy(coo_adj)
    # features and labels
    reddit_data = np.load(os.path.join(raw_path, "reddit_data.npz"))
    features = reddit_data["feature"]
    labels = reddit_data["label"]
    # train/val/test indices
    node_types = reddit_data["node_types"]
    train_mask = (node_types == 1)
    val_mask = (node_types == 2)
    test_mask = (node_types == 3)
    reddit_graph.ndata['train_mask'] = generate_mask_tensor(train_mask)
    reddit_graph.ndata['val_mask'] = generate_mask_tensor(val_mask)
    reddit_graph.ndata['test_mask'] = generate_mask_tensor(test_mask)
    reddit_graph.ndata['feat'] = F.tensor(features, dtype=F.data_type_dict['float32'])
    reddit_graph.ndata['label'] = F.tensor(labels, dtype=F.data_type_dict['int64'])
    return reddit_graph
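To clarify what those masks represent: node_types encodes the train/val/test split per node, and each comparison yields a boolean mask over all nodes (generate_mask_tensor then just converts such an array into a framework tensor, as I understand it). A small NumPy illustration with made-up data:

```python
import numpy as np

# node_types assigns each node to a split: 1=train, 2=val, 3=test
# (encoding assumed from the reddit_data.npz layout above; data is made up).
node_types = np.array([1, 1, 2, 3, 0, 2])

train_mask = node_types == 1   # [True, True, False, False, False, False]
val_mask = node_types == 2     # [False, False, True, False, False, True]
test_mask = node_types == 3    # [False, False, False, True, False, False]

# The three splits are disjoint; a type-0 node belongs to none of them.
assert not np.any(train_mask & val_mask)
assert not np.any(train_mask & test_mask)
```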

Check out this download utility function. There is a path argument for specifying where to store the files.
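The caching idea behind that utility can be sketched without any network access: check for the file first and fetch only on a miss. Here fetch is an injected stand-in so the sketch stays offline; DGL's real dgl.data.utils.download performs the HTTP request itself:

```python
import os

def download(url, path, fetch):
    """Write url's contents to path, skipping the fetch if path exists.

    fetch(url) -> bytes is a stand-in for the actual HTTP transfer.
    """
    if os.path.exists(path):
        return path  # cache hit: nothing to do
    with open(path, 'wb') as f:
        f.write(fetch(url))
    return path
```

Calling this twice with the same path should trigger only one fetch, which is the same skip-redownload behavior described above.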

Just curious. Why do you mask nodes of certain types for validation? I’m guessing you only want to validate that your trained model can generalize to nodes of type 2? I’m trying to understand the generate_mask_tensor function.

Duplicate question here: Generate_mask_tensor documentation