CPU-GPU Data Transfer

I am very late to the party, but I am trying to understand single-GPU training for GNNs using DGL, particularly when CPU-GPU data transfer happens. I am doing this to better understand which step exhausts GPU memory.
To this end, I have written a simple script that breaks down each step, along with my understanding of what data gets transferred to the GPU in that step. I monitor GPU memory usage with nvitop while the script runs (hence the multiple time.sleep statements). I would appreciate it if someone could look at the script and confirm whether my understanding is correct (Steps 1-4).

Step 5 is where I am very confused. Here is my understanding:

  1. The sampler samples a computation graph from the graph structure. The graph structure is currently available in both CPU and GPU memory, and the device parameter of train_dataloader should specify whether sampling takes place on the CPU or the GPU. However, setting device to CPU or GPU in train_dataloader has the same effect on GPU memory, which blows up in this step (significantly more than in Step 4).
  2. Is the significant increase in GPU memory usage in Step 5 because the GPU now stores the sampled computation graph? That should be relatively small (not gigabytes, which is what I observe), given that the features are already cached in GPU memory (Step 3) and I haven't even fetched them in Step 5.
  3. Assuming the sampled computation graphs stored for each minibatch are causing the memory blow-up, shouldn't the computation graph for each minibatch be freed once the batch is done?

I apologize for such a long post, but hope to learn about this from you all.

from dgl.data import AsNodePredDataset
from dgl.dataloading import (
    DataLoader,
    MultiLayerFullNeighborSampler,
    NeighborSampler,
)
import argparse
from ogb.nodeproppred import DglNodePropPredDataset
import dgl
import dgl.nn as dglnn
import torch
import torch.nn as nn
import torch.nn.functional as F
import time


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset",
        type=str,
        default="ogbn-arxiv"
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=11
    )
    args = parser.parse_args()
    if not torch.cuda.is_available():
        device = "cpu"
    else:
        device = "cuda"
    print(f"Training in {device} mode.")

    print("Loading data")
    dataset = AsNodePredDataset(DglNodePropPredDataset(args.dataset, root='../../dataset'))
    g = dataset[0]  # STEP 1: Graph and the features are loaded on CPU memory
    print("Dataset loaded into memory. Will shift to device in 10 seconds...")
    time.sleep(10)
    g = g.to(device) # STEP 2: Only the graph (not the features) are copied on to GPU memory
    print(f"Dataset moved to {g.device}. Will move the features in 10 seconds...")
    time.sleep(10)
    x = g.ndata['feat'] # STEP 3: Node features are copied on to GPU memory
    print(f"Dataset feat moved to {x.device}. Creating sampler in 10 seconds...")
    time.sleep(10)
    print("Moving training indices to GPU...")
    train_idx = dataset.train_idx.to(device) # STEP 4: Training indices are copied on to GPU memory

    print("Creating sampler...")
    sampler = MultiLayerFullNeighborSampler(3)

    print("Creating dataloader...")
    train_dataloader = DataLoader(
        g,
        train_idx,
        sampler,
        device=device,
        batch_size=32,
        shuffle=True,
        drop_last=False,
    )
    print("Dataloader created. Starting in 10 seconds...")
    time.sleep(10)
    for it, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):  # STEP 5: 
        print("Minibatch - ", it)
        pass
    print("Training done. Terminating in 10 seconds...")
    time.sleep(10)

I would recommend looking into dgl.graphbolt.DataLoader, which is an overhaul and redesign of the old dataloader. More information is available in the DGL 2.1 release blog.

Hey @mfbalin
I will look into it. However, the newer DGL versions are not yet supported on my shared cluster, so I am still on version 1.1.3. Could you help me with my questions for that version?

Your Step 2 copies the features too, since they are stored in the graph object. Step 3 therefore only reads a tensor that is already on the GPU.

The memory increase in Step 5 is probably because the sampling operation needs to allocate some scratch space to do its work.
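
One way to check the scratch-space explanation is to compare what PyTorch reports as live tensor memory with what its caching allocator has reserved. The sketch below is not part of the original script; the helper name cuda_memory_report is hypothetical, and it assumes that DGL's GPU allocations go through PyTorch's allocator, which may not hold for every build. Memory that was used temporarily and then freed typically stays in the reserved pool, so a tool such as nvitop keeps showing it as used.

```python
import torch


def cuda_memory_report(tag: str) -> dict:
    """Return PyTorch's view of GPU memory in MiB (all zeros without CUDA).

    Hypothetical helper for illustration; not part of DGL or PyTorch.
    """
    if not torch.cuda.is_available():
        return {
            "tag": tag,
            "allocated_mib": 0.0,
            "reserved_mib": 0.0,
            "peak_allocated_mib": 0.0,
        }
    torch.cuda.synchronize()  # wait for pending kernels before reading counters
    mib = 1024 ** 2
    return {
        "tag": tag,
        # Memory held by tensors that are still alive.
        "allocated_mib": torch.cuda.memory_allocated() / mib,
        # Memory held by the caching allocator, including freed blocks.
        "reserved_mib": torch.cuda.memory_reserved() / mib,
        # Highest value of allocated memory seen so far.
        "peak_allocated_mib": torch.cuda.max_memory_allocated() / mib,
    }


if __name__ == "__main__":
    print(cuda_memory_report("before"))
    if torch.cuda.is_available():
        # Stand-in for temporary scratch space (64 MiB).
        scratch = torch.empty(64 * 1024 * 1024, dtype=torch.uint8, device="cuda")
        print(cuda_memory_report("scratch alive"))
        del scratch
        print(cuda_memory_report("scratch freed"))
        torch.cuda.empty_cache()  # hand cached blocks back to the driver
        print(cuda_memory_report("after empty_cache"))
```

Calling the helper before the loop in Step 5 and after each minibatch would show whether the growth is in live tensors or only in the reserved pool. If only the reserved figure grows, the minibatch graphs are being freed and the memory is simply being kept by the allocator for reuse.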

What is the conflict between your shared cluster and the newer DGL versions?

The GLIBC version seems to be the issue. The shared cluster has a much older GLIBC, version 2.17.