Hi @mufeili! I am trying to run this script with CUDA, but it raises errors, so I have been making some modifications to the code.
Since the `--gpu` argument only takes a list of ints, `device = devices[proc_id]` in the `run` method yields an int for `device`. I manually changed it to `device = 'cuda'` (I only have one CUDA device) so that all the `.to(device)` calls actually move the variables onto the GPU.
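Concretely, the workaround looks like this (a sketch of my change; `devices` and `proc_id` are names from the original script, not shown here):

```python
import torch

# Original: device = devices[proc_id], which yields an int such as 0.
# My workaround, since I only have a single GPU:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

x = torch.zeros(3)
x = x.to(device)  # every .to(device) call now targets the GPU when available
```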
Then errors appeared about the graph being on CPU instead of GPU, so I used `graph.to('cuda')` to move it there. This works for the most part, except that I got stuck at this line in the `run` method:
```python
batch_pred = model(blocks, batch_inputs)
```

with this error:

```
The size of tensor a (239) must match the size of tensor b (238) at non-singleton dimension 0
```
so I looked at the sizes of `blocks` and `batch_inputs`:
`blocks`:

```
[Block(num_src_nodes=239, num_dst_nodes=239, num_edges=2209),
 Block(num_src_nodes=239, num_dst_nodes=238, num_edges=4877)]
```

`batch_inputs`:

```
tensor([[ 0.1176,  0.0267,  0.0682, ..., -0.0811,  0.1327, -0.0654],
        [ 0.0055, -0.0862,  0.0992, ...,  0.1190,  0.0212, -0.0026],
        [-0.1085,  0.1328,  0.0735, ..., -0.1280,  0.0788, -0.0433],
        ...,
        [ 0.1032, -0.0583,  0.1064, ...,  0.0943,  0.0396,  0.0045],
        [ 0.1174,  0.1042, -0.0406, ...,  0.0431,  0.1444, -0.0271],
        [-0.1012, -0.0311, -0.0439, ..., -0.1308,  0.1054, -0.0702]],
       device='cuda:0')
```

length of `batch_inputs`: 239
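For reference, the same message comes from any elementwise op on tensors whose first dimensions differ; a plain-PyTorch reproduction (nothing DGL-specific):

```python
import torch

a = torch.randn(239, 16)
b = torch.randn(238, 16)
msg = None
try:
    a + b  # elementwise ops need matching (or broadcastable) sizes
except RuntimeError as e:
    msg = str(e)
print(msg)  # The size of tensor a (239) must match the size of tensor b (238) at non-singleton dimension 0
```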
Is the second block causing the error? I imagine so, since it has 239 src nodes but only 238 dst nodes, but I am not sure what I did wrong, because the dataloader is what created the blocks.
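My understanding of how feature sizes should flow through the two blocks, sketched with plain tensors (`run_block` below is my own toy stand-in for a layer, not a DGL call; it only uses the convention that a block's dst nodes are a prefix of its src nodes):

```python
import torch

def run_block(h_src, num_dst, weight):
    # In a block, the dst nodes are the first num_dst src nodes, so a
    # layer reads len(h_src) inputs and emits num_dst outputs.
    h_self = h_src[:num_dst]   # toy layer: ignore neighbour aggregation
    return h_self @ weight

h = torch.randn(239, 16)       # batch_inputs: one row per src node of block 1
w = torch.randn(16, 16)
h = run_block(h, 239, w)       # block 1: 239 src -> 239 dst
h = run_block(h, 238, w)       # block 2: 239 src -> 238 dst
print(h.shape)                 # torch.Size([238, 16])
```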
The dataloader is initialized with:

```python
dataloader = dgl.dataloading.EdgeDataLoader(
    g, train_seeds, sampler,
    negative_sampler=NegativeSampler(g, args.num_negs),
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=False,
    pin_memory=True,
    num_workers=args.num_workers)
```
Thank you!