PinSAGE error when training

I’m using the PinSAGE model with custom data that forms a bipartite graph. It is a heterogeneous graph with the following structure (a rough construction sketch follows the list):

  • Image nodes with 2048-dimensional features
  • Text nodes with 500-dimensional features
  • Edges between image and text nodes in both directions, all with edge weight 1
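
For reference, the graph was constructed roughly like this (a simplified sketch with placeholder edge lists; the real edges come from my dataset):

import dgl
import torch

# Placeholder edge lists; the real ones come from the dataset.
img_ids = torch.tensor([0, 1, 1])
txt_ids = torch.tensor([0, 0, 1])

graph_data = {
    ('image', 'has', 'tag'): (img_ids, txt_ids),   # image -> text edges
    ('tag', 'in', 'image'): (txt_ids, img_ids),    # text -> image edges
}
g = dgl.heterograph(graph_data)

My training setup is the following: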
g = dataset['train-graph']

item_texts = dataset['item-texts']

user_ntype = 'tag'
item_ntype = 'image'
user_to_item_etype = dataset['user-to-item-type']

device = torch.device(args.device)

# Assign the precomputed image/text features to the item (image) and user (tag) nodes
img_features = torch.from_numpy(np.load('f200k/data/graph/img_features.npy'))
txt_features = torch.from_numpy(np.load('f200k/data/graph/txt_features.npy'))
g.nodes[user_ntype].data['id'] = txt_features
g.nodes[item_ntype].data['id'] = img_features
g.edges['has'].data['weights'] = torch.ones(g.edges(etype='has')[0].shape[0], dtype=torch.int64)
g.edges['in'].data['weights'] = torch.ones(g.edges(etype='in')[0].shape[0], dtype=torch.int64)

# Sampler
batch_sampler = sampler_module.ItemToItemBatchSampler(
    g, user_ntype, item_ntype, args.batch_size)
neighbor_sampler = sampler_module.NeighborSampler(
    g, user_ntype, item_ntype, args.random_walk_length,
    args.random_walk_restart_prob, args.num_random_walks, args.num_neighbors,
    args.num_layers)
collator = sampler_module.PinSAGECollator(neighbor_sampler, g, user_ntype, item_ntype)
dataloader = DataLoader(
    batch_sampler,
    collate_fn=collator.collate_train,
    num_workers=args.num_workers)
dataloader_test = DataLoader(
    torch.arange(g.number_of_nodes(item_ntype)),
    batch_size=args.batch_size,
    collate_fn=collator.collate_test,
    num_workers=args.num_workers)
dataloader_it = iter(dataloader)

# Model
model = PinSAGEModel(g, item_ntype, user_ntype, args.hidden_dims, args.num_layers).to(device)
print(model)
# Optimizer
opt = torch.optim.Adam(model.parameters(), lr=args.lr)

# For each batch of head-tail-negative triplets...
for epoch_id in range(args.num_epochs):
    model.train()
    for batch_id in tqdm.trange(args.batches_per_epoch):
        pos_graph, neg_graph, blocks = next(dataloader_it)

I get the following error

Traceback (most recent call last):
  File "model.py", line 149, in <module>
    train(dataset, args)
  File "model.py", line 101, in train
    pos_graph, neg_graph, blocks = next(dataloader_it)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
dgl._ffi.base.DGLError: Caught DGLError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 35, in fetch
    return self.collate_fn(data)
  File "/home/ubuntu/dl_workspace/dgl/examples/pytorch/pinsage/sampler.py", line 144, in collate_train
    pos_graph, neg_graph, blocks = self.sampler.sample_from_item_pairs(heads, tails, neg_tails)
  File "/home/ubuntu/dl_workspace/dgl/examples/pytorch/pinsage/sampler.py", line 77, in sample_from_item_pairs
    blocks = self.sample_blocks(seeds, heads, tails, neg_tails)
  File "/home/ubuntu/dl_workspace/dgl/examples/pytorch/pinsage/sampler.py", line 51, in sample_blocks
    frontier = sampler(seeds)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/sampling/pinsage.py", line 115, in __call__
    neighbor_graph = select_topk(neighbor_graph, self.num_neighbors, self.weight_column)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/sampling/neighbor.py", line 162, in select_topk
    g._graph, nodes_all_types, k_array, edge_dir, weight_arrays, bool(ascending))
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [19:48:49] /opt/dgl/src/array/array.cc:620: Check failed: (weight->dtype).code == kDLFloat (

@BarclayII

Hi, this should be fixed by PR https://github.com/dmlc/dgl/pull/1565. Could you please install the nightly build and try again?
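
If you want to stay on the current release in the meantime, one possible workaround (an untested sketch, going by the dtype check in the traceback) is to store the edge weights as floats rather than int64:

g.edges['has'].data['weights'] = torch.ones(g.number_of_edges('has'), dtype=torch.float32)
g.edges['in'].data['weights'] = torch.ones(g.number_of_edges('in'), dtype=torch.float32)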

Thanks.

@BarclayII That worked, thanks.
I have a question regarding the PinSAGE model. I tried it with the MovieLens data, and the following is the model that gets created.

PinSAGEModel(
  (proj): LinearProjector(
    (inputs): ModuleDict(
      (year): Embedding(82, 64, padding_idx=81)
      (genre): Linear(in_features=18, out_features=64, bias=True)
      (id): Embedding(3707, 64, padding_idx=3706)
      (title): BagOfWords(
        (emb): Embedding(4946, 64, padding_idx=1)
      )
    )
  )
  (sage): SAGENet(
    (convs): ModuleList(
      (0): WeightedSAGEConv(
        (Q): Linear(in_features=64, out_features=64, bias=True)
        (W): Linear(in_features=128, out_features=64, bias=True)
        (dropout): Dropout(p=0.5, inplace=False)
      )
      (1): WeightedSAGEConv(
        (Q): Linear(in_features=64, out_features=64, bias=True)
        (W): Linear(in_features=128, out_features=64, bias=True)
        (dropout): Dropout(p=0.5, inplace=False)
      )
    )
  )
  (scorer): ItemToItemScorer()
)

However, I don’t see the user features being used here. In preprocess_movielens.py, features are assigned to the user nodes, but I don’t see them being used in the model.

g.nodes['user'].data['gender'] = torch.LongTensor(users['gender'].cat.codes.values)
g.nodes['user'].data['age'] = torch.LongTensor(users['age'].cat.codes.values)
g.nodes['user'].data['occupation'] = torch.LongTensor(users['occupation'].cat.codes.values)
g.nodes['user'].data['zip'] = torch.LongTensor(users['zip'].cat.codes.values)

Can you please help me understand?

What I understood from the PinSAGE paper is that they also do not use user features. The reference for the random walk algorithm used in their neighbor sampling is actually Pixie, which is a metapath-based random walk going from an item to a user and then to another item. Therefore the sampled neighborhood of an item always consists of other items, and the computation hence does not need user features.

Please correct me if I’m wrong.
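
For concreteness, here is a rough sketch of such an item → user → item metapath walk in DGL; the toy graph and the edge-type names are made up for illustration:

import dgl
import torch

# Toy bipartite graph: 'image' items and 'tag' users (illustrative only).
g = dgl.heterograph({
    ('image', 'has', 'tag'): (torch.tensor([0, 1, 1]), torch.tensor([0, 0, 1])),
    ('tag', 'in', 'image'): (torch.tensor([0, 0, 1]), torch.tensor([0, 1, 1])),
})

# One metapath traversal: start at an item, hop to a user, hop back to an item,
# so the endpoint of the walk is always another item.
traces, types = dgl.sampling.random_walk(
    g, torch.tensor([0, 1]), metapath=['has', 'in'])
print(traces)  # each row is [starting image, visited tag, ending image]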

Hi, I understand that their random walk algorithm for neighbor sampling works on just the item nodes, but I was wondering whether it is possible to also include the user nodes in the sampled neighborhood of an item node.
I’d greatly appreciate any pointers for such an implementation.

Thanks.

Currently there is no ready-to-use implementation with both user and item nodes as neighbors of a user/item node. However, you can treat the user-item graph as a homogeneous graph and replace PinSAGESampler with RandomWalkNeighborSampler to do what you want.
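
Here is a minimal sketch of what I mean, using a made-up toy graph (depending on your DGL version, the conversion call is dgl.to_homogeneous or dgl.to_homo):

import dgl
import torch

# Toy bipartite user-item graph (illustrative only).
hg = dgl.heterograph({
    ('user', 'watches', 'item'): (torch.tensor([0, 0, 1]), torch.tensor([0, 1, 1])),
    ('item', 'watched-by', 'user'): (torch.tensor([0, 1, 1]), torch.tensor([0, 0, 1])),
})

# Treat the user-item graph as a homogeneous graph so that random walks can
# stop on either node type.
g = dgl.to_homogeneous(hg)

# Random walks with restarts; the most frequently visited nodes (of either
# original type) become the sampled neighbors.
sampler = dgl.sampling.RandomWalkNeighborSampler(
    g, num_traversals=2, termination_prob=0.5,
    num_random_walks=10, num_neighbors=3)

seeds = torch.tensor([0, 1])   # node IDs in the homogeneous graph
frontier = sampler(seeds)      # a graph connecting the sampled neighbors to the seeds

You would then map the homogeneous node IDs back to the original user/item IDs (dgl.to_homogeneous keeps them in the NID/NTYPE node data) when looking up features.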

Please feel free to follow up. Thanks.

@BarclayII How will the dimensionality work if the user features and the item features have different dimensionalities?
We could add a projection layer to map them into a common feature space, but I’m not sure how the sampling would work. It would be great if you could share a minimal example.
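
To make the question concrete, something like the following per-node-type projection is what I have in mind; the module and its name are made up, with my feature sizes plugged in:

import torch
import torch.nn as nn

# Hypothetical projection module: one linear layer per node type, mapping the
# raw features into a shared hidden space.
class NodeTypeProjector(nn.Module):
    def __init__(self, hidden_dims=64):
        super().__init__()
        self.proj = nn.ModuleDict({
            'image': nn.Linear(2048, hidden_dims),  # item (image) features
            'tag': nn.Linear(500, hidden_dims),     # user (tag/text) features
        })

    def forward(self, feats):
        # feats: dict mapping node type -> raw feature tensor
        return {ntype: self.proj[ntype](x) for ntype, x in feats.items()}

projector = NodeTypeProjector()
h = projector({'image': torch.randn(8, 2048), 'tag': torch.randn(8, 500)})
# h['image'] and h['tag'] are both (8, 64), so they can be aggregated together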

I’m training the PinSAGE model using the setup described above. I get the following error when training is at 95%. It is an out-of-memory exception, but I’m wondering why it only shows up at 95%.

95%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍        | 13754/14498 [3:25:52<12:44,  1.03s/it]Traceback (most recent call last):
  File "model.py", line 183, in <module>
    train(dataset, args)
  File "model.py", line 134, in train
    loss = model(pos_graph, neg_graph, blocks).mean()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "model.py", line 46, in forward
    h_item = self.get_repr(blocks)
  File "model.py", line 54, in get_repr
    return h_item_dst + self.sage(blocks, h_item)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/dl_workspace/dgl/examples/pytorch/pinsage/layers.py", line 167, in forward
    h = layer(block, (h, h_dst), block.edata['weights'])
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/dl_workspace/dgl/examples/pytorch/pinsage/layers.py", line 139, in forward
    g.update_all(fn.u_mul_e('n', 'w', 'm'), fn.sum('m', 'n'))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/heterograph.py", line 3636, in update_all
    Runtime.run(prog)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/runtime/runtime.py", line 11, in run
    exe.run()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/runtime/ir/executor.py", line 1066, in run
    graph = self.graph.data(ctx)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/heterograph.py", line 4736, in get_immutable_gidx
    return self.graph._graph.get_unitgraph(self.etid, ctx)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/utils.py", line 481, in wrapper
    dic[key] = func(self, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/heterograph_index.py", line 903, in get_unitgraph
    return g.asbits(self.bits_needed(etype or 0)).copy_to(ctx)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/heterograph_index.py", line 227, in copy_to
    return _CAPI_DGLHeteroCopyTo(self, ctx.device_type, ctx.device_id)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [23:42:31] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:97: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: out of memory
Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::AllocDataSpace(DLContext, unsigned long, unsigned long, DLDataType)+0xdc6) [0x7f80ba2395e6]
  [bt] (1) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Empty(std::vector<long, std::allocator<long> >, DLDataType, DLContext)+0x289) [0x7f80ba106cf9]
  [bt] (2) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::CopyTo(DLContext const&) const+0xb7) [0x7f80ba16e837]
  [bt] (3) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/libdgl.so(dgl::UnitGraph::CSR::CopyTo(DLContext const&) const+0x49) [0x7f80ba20ed69]
  [bt] (4) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/libdgl.so(dgl::UnitGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&)+0x126) [0x7f80ba2049f6]
  [bt] (5) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/libdgl.so(dgl::HeteroGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&)+0x11c) [0x7f80ba14cdcc]
  [bt] (6) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/libdgl.so(+0xd40008) [0x7f80ba15d008]
  [bt] (7) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/dgl/libdgl.so(DGLFuncCall+0x52) [0x7f80ba0e84f2]
  [bt] (8) /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f8112798ec0]