I am very late to the party, but I am trying to understand single-GPU training for GNNs using DGL, particularly when CPU-to-GPU data transfer happens. I am doing this to better understand which step exhausts GPU memory.
To this end, I have written a simple script that breaks down each step, along with my understanding of what data gets transferred to the GPU in that step. I monitor the GPU memory usage with nvitop while the script runs (hence the multiple time.sleep statements). I would appreciate it if someone could look at the script and confirm whether my understanding of Steps 1-4 is correct.
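As a complement to watching nvitop, I figured I could also log PyTorch's allocator counters from inside the script at each step; here is a minimal sketch of what I have in mind (the report_mem helper name is just illustrative):

import torch

def report_mem(tag):
    # Report PyTorch's CUDA caching-allocator counters (in MiB) at a labelled point.
    # Note: this only covers memory managed by PyTorch's allocator, not allocations
    # made outside it, so the numbers can differ from what nvitop shows.
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

# e.g. report_mem("after g.to(device)") after Step 2, report_mem("after loop") after Step 5, ...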
Step 5 is where I am very confused. Here is my understanding:
- The sampler samples a computation graph from the graph structure. At this point the graph structure is available in both CPU and GPU memory, and the device parameter in train_dataloader should specify whether the sampling takes place on the CPU or the GPU. However, setting device to CPU or GPU in the train_dataloader has the same effect: GPU memory blows up in this step (significantly more than in Step 4).
- Is the significant increase in GPU memory utilization in Step 5 because the GPU now stores the sampled computation graphs? These should be relatively small (not the gigabytes I observe empirically), given that the features are already cached in GPU memory (Step 3) and I have not even fetched them in Step 5.
- Assuming it is the stored computation graphs for each minibatch that cause the memory blow-up, shouldn't the computation graph for a minibatch be cleared once that batch is done? (A diagnostic sketch for this follows the list.)
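To check whether the sampled blocks themselves account for the growth, this is the kind of per-iteration diagnostic I was thinking of adding to the Step 5 loop (just a sketch, reusing train_dataloader from the script below; I have not verified the numbers):

import torch

for it, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):
    # Size and placement of each sampled block (one block per sampler layer).
    sizes = [(b.num_src_nodes(), b.num_dst_nodes(), b.num_edges()) for b in blocks]
    allocated = torch.cuda.memory_allocated() / 2**20
    print(f"Minibatch {it}: block device = {blocks[0].device}, "
          f"(src, dst, edges) per layer = {sizes}, allocated = {allocated:.1f} MiB")
    # Drop the references to this minibatch's graphs and (assuming they were allocated
    # through PyTorch's caching allocator) hand the freed blocks back to the driver,
    # so that tools like nvitop can see the memory being released.
    del input_nodes, output_nodes, blocks
    torch.cuda.empty_cache()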
I apologize for such a long post, but hope to learn about this from you all.
from dgl.data import AsNodePredDataset
from dgl.dataloading import (
    DataLoader,
    MultiLayerFullNeighborSampler,
    NeighborSampler,
)
import argparse
from ogb.nodeproppred import DglNodePropPredDataset
import dgl
import dgl.nn as dglnn
import torch
import torch.nn as nn
import torch.nn.functional as F
import time

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset",
        type=str,
        default="ogbn-arxiv",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=11,
    )
    args = parser.parse_args()

    if not torch.cuda.is_available():
        device = "cpu"
    else:
        device = "cuda"
    print(f"Training in {device} mode.")

    print("Loading data")
    dataset = AsNodePredDataset(DglNodePropPredDataset(args.dataset, root='../../dataset'))
    g = dataset[0]  # STEP 1: Graph and the features are loaded into CPU memory
    print("Dataset loaded into memory. Will shift to device in 10 seconds...")
    time.sleep(10)

    g = g.to(device)  # STEP 2: Only the graph (not the features) is copied to GPU memory
    print(f"Dataset moved to {g.device}. Will move the features in 10 seconds...")
    time.sleep(10)

    x = g.ndata['feat']  # STEP 3: Node features are copied to GPU memory
    print(f"Dataset feat moved to {x.device}. Creating sampler in 10 seconds...")
    time.sleep(10)

    print("Moving training indices to GPU...")
    train_idx = dataset.train_idx.to(device)  # STEP 4: Training indices are copied to GPU memory

    print("Creating sampler...")
    sampler = MultiLayerFullNeighborSampler(3)

    print("Creating dataloader...")
    train_dataloader = DataLoader(
        g,
        train_idx,
        sampler,
        device=device,
        batch_size=32,
        shuffle=True,
        drop_last=False,
    )
    print("Dataloader created. Starting in 10 seconds...")
    time.sleep(10)

    for it, (input_nodes, output_nodes, blocks) in enumerate(train_dataloader):  # STEP 5: iterate over the minibatches (sampling happens here)
        print("Minibatch - ", it)

    print("Training done. Terminating in 10 seconds...")
    time.sleep(10)
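For completeness, this is the extra check I was planning to drop in right after Step 2 (g = g.to(device)) to corroborate Steps 2-3, reusing g and device from the script above:

print("graph device:   ", g.device)                # where the graph structure lives after .to(device)
print("feature device: ", g.ndata['feat'].device)  # where the node features live at the same point
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")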