dgl.sampling.random_walk segfaults when starting from some nodes

bean · November 11, 2020, 4:04pm

When I use random_walk with a metapath, some start nodes cause a segmentation fault. I tried filtering out the start nodes whose type differs from the node type of metapath[0], but it still segfaults. What could be the reason?

I suspect the randomness of the walk plays a part, because the same node sometimes does not produce the error.

I use the following code to construct the graph

import dgl
import networkx as nx
import torch

def convert_to_hg(nxg):
    g = dgl.from_networkx(nxg)
    g.ndata[dgl.NTYPE] = torch.LongTensor(list(nx.get_node_attributes(nxg, 'node_type').values()))  # assign node types from nxg
    g.edata[dgl.ETYPE] = torch.LongTensor(list(nx.get_edge_attributes(nxg, 'edge_type').values()))  # assign edge types from nxg
    ntypes = [str(i) for i in torch.unique(g.ndata[dgl.NTYPE]).tolist()]  # name of each node type ID must be str
    etypes = [str(i) for i in torch.unique(g.edata[dgl.ETYPE]).tolist()]  # name of each edge type ID
    hg = dgl.to_heterogeneous(g, ntypes, etypes)
    edge_features = list(nx.get_edge_attributes(nxg, 'edge_weight').values())
    for etype in hg.canonical_etypes:
        # edge IDs in the original homogeneous graph (and the NetworkX graph)
        nxg_edge_ids = hg.edges[etype].data[dgl.EID]
        hg.edges[etype].data['edge_weight'] = torch.as_tensor([edge_features[i.item()] for i in nxg_edge_ids], dtype=torch.float32)
    return hg

and then run random_walk

    type_map = graph.nodes(data='node_type')
    for metapath in metapaths:
        print(metapath)
        metapath = metapath * k
        select_starters = []
        for i in all_nodes:
            si = str(i.item())
            if str(type_map[si]) == metapath[0][0]:
                select_starters.append(i.item())
        select_starters = torch.as_tensor(select_starters)
        traces, types = dgl.sampling.random_walk(g, nodes=select_starters, metapath=metapath)

BarclayII · November 13, 2020, 12:13pm

This is likely a bug and I need a concrete case for reproducing this issue. Could you provide an example graph?

If you think your graph is too big, then you could try subgraphing it. For instance, if you can observe segfaults for a single node ID on your graph, then it’s likely that the same segfault will still happen on a subgraph centered on that node.
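
The subgraphing idea can be sketched without any graph library. This is only an illustration on a plain edge list, not DGL code: `k_hop_nodes` and `induced_edges` are hypothetical helper names, and the toy edge list is made up. With a NetworkX graph, `nx.ego_graph` should give a similar neighborhood.

```python
from collections import deque

def k_hop_nodes(edges, center, k):
    """Return the set of nodes within k hops of `center`, ignoring edge direction."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen

def induced_edges(edges, keep):
    """Keep only the edges whose two endpoints are both in `keep`."""
    return [(u, v) for u, v in edges if u in keep and v in keep]

# Toy edge list (made up for illustration).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6)]
keep = k_hop_nodes(edges, center=2, k=1)
sub_edges = induced_edges(edges, keep)
```

If the crash still reproduces on the small neighborhood, that smaller graph is much easier to share.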

bean · November 16, 2020, 3:18am

I use the PubMed dataset from here.

bean · November 17, 2020, 2:42am

I also dumped the graph with torch.save.

I run it with start node [63106] and metapath ('1', '2', '1') * n

BarclayII · November 18, 2020, 8:18am

Could you grant access to this file? I sent a request.

bean · November 18, 2020, 8:36am

Sorry, I have updated the permissions.

BarclayII · November 18, 2020, 11:27am

Worked for me with DGL 0.5.2:

In [1]: import torch

In [2]: import dgl
Using backend: pytorch
INFO:rdflib:RDFLib Version: 4.2.2

In [3]: g = torch.load('samplegraph.pkl')

In [4]: dgl.sampling.random_walk(g, [63106], metapath=[('1', '2', '1')] * 10)
Out[4]:
(tensor([[63106,    -1,    -1,    -1,    -1,    -1,    -1,    -1,    -1,    -1,
             -1]]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))

In [5]: dgl.__version__
Out[5]: '0.5.2'

What are your PyTorch, DGL, and CUDA versions, and which operating system are you on?

bean · November 18, 2020, 2:00pm

I use the CPU version of PyTorch 1.7 and DGL 0.5.2.
Quite strange. The following code also raises an error:

dgl.sampling.random_walk(g, range(g.number_of_nodes()), metapath=[('1', '2', '1')] * 10)

I get an error like this:

/opt/dgl/include/dgl/random.h:78: Check failed: lower < upper (0 vs. -19018) :

BarclayII · November 18, 2020, 3:01pm

That’s because your seed node IDs exceed the number of nodes with type '1'. g.number_of_nodes() returns the number of nodes across all types (63109), while the number of nodes with type '1' is only 20163.

This code works fine for me:

dgl.sampling.random_walk(g, range(g.number_of_nodes('1')), metapath=[('1', '2', '1')] * 10)

That being said, the error message looks pretty bad. We should add a better sanity check.
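
Until such a check exists in the library, a caller-side guard along these lines might catch the problem before the walk starts. This is a minimal sketch in plain Python: `check_seeds` is a hypothetical helper, not part of DGL, and the node count is assumed to come from a call like g.number_of_nodes('1').

```python
def check_seeds(seeds, num_nodes_of_type, ntype):
    """Raise ValueError if any seed is not a valid type-specific ID for `ntype`."""
    seeds = list(seeds)
    bad = [s for s in seeds if not 0 <= s < num_nodes_of_type]
    if bad:
        raise ValueError(
            f"{len(bad)} seed ID(s) out of range for node type {ntype!r} "
            f"(valid range is 0..{num_nodes_of_type - 1}); first offender: {bad[0]}"
        )
    return seeds

# Count taken from the thread: 20163 nodes of type '1'.
num_type_1 = 20163
ok = check_seeds(range(num_type_1), num_type_1, "1")

try:
    check_seeds([63106], num_type_1, "1")
    message = None
except ValueError as err:
    message = str(err)
```

The seed 63106 from earlier in the thread fails this check, since it is larger than the highest valid ID for type '1'.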

EDIT: my PyTorch version is also 1.7.0+cpu, and my DGL version is 0.5.2. I was using Ubuntu 18.04.

bean · November 18, 2020, 3:53pm

So a heterogeneous graph indexes the nodes of every type starting from 0? Then why can random_walk return a node with a global ID (for example, an ID as large as g.number_of_nodes() - 1)?

I guess random_walk also returns type-specific IDs? If so, how can I get a global node ID?

BarclayII · November 19, 2020, 2:51pm

Yeah, I just noticed that. random_walk works with type-specific IDs, but it does not check that the seed IDs are valid. If you give it invalid seed IDs, it could do all sorts of weird stuff.
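
To make the type-specific indexing concrete, here is a toy metapath walk in plain Python. It is an illustration only and does not reproduce DGL's implementation: `metapath_walk`, the edge-type names, and the adjacency data are all made up. Each step follows one edge type, IDs are local to the node type at that step, and the trace is padded with -1 once the walk cannot continue, similar to the padded trace shown earlier in the thread.

```python
import random

def metapath_walk(adj, seed, metapath, rng=random):
    """Walk from `seed`, following one edge type per step; pad with -1 on dead ends.

    adj maps an edge-type name to {source_local_id: [destination_local_ids]}.
    """
    trace = [seed]
    current = seed
    for etype in metapath:
        # A seed that is unknown for this edge type simply has no successors here.
        successors = adj[etype].get(current, []) if current != -1 else []
        current = rng.choice(successors) if successors else -1
        trace.append(current)
    return trace

# Hypothetical two-type graph: authors 0..1 and papers 0..1, each numbered from 0.
adj = {
    "writes": {0: [0], 1: [0, 1]},      # author -> paper
    "written_by": {0: [0, 1], 1: [1]},  # paper -> author
}
metapath = ["writes", "written_by"] * 2

valid = metapath_walk(adj, 0, metapath, random.Random(0))
invalid = metapath_walk(adj, 99, metapath, random.Random(0))
```

In this toy an out-of-range seed just produces a trace of -1 values. The real library is not guaranteed to fail that gently, as the crash in this thread shows.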

Getting global node IDs depends on how you define the mapping between type-specific node IDs and global node IDs. You will have to map them yourself for now. For instance, you can take a type-specific ID and add an offset to it.
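
The offset approach could look roughly like the sketch below. It assumes a layout in which global IDs are assigned in contiguous blocks, one block per node type, in a fixed type order; that layout is an assumption and may not match how a particular graph was built. The helper names and the toy counts are hypothetical. For a graph created with dgl.to_heterogeneous, the original homogeneous IDs should also be available under dgl.NID in each node type's data, which may be the more reliable source.

```python
from bisect import bisect_right
from itertools import accumulate

def build_offsets(counts):
    """counts: dict mapping node type -> number of nodes, in a fixed order."""
    ntypes = list(counts)
    starts = [0] + list(accumulate(counts[t] for t in ntypes))
    return ntypes, starts

def to_global(ntypes, starts, ntype, local_id):
    """Map a type-specific ID to a global ID by adding the type's offset."""
    i = ntypes.index(ntype)
    if not 0 <= local_id < starts[i + 1] - starts[i]:
        raise ValueError(f"local ID {local_id} out of range for type {ntype!r}")
    return starts[i] + local_id

def to_local(ntypes, starts, global_id):
    """Map a global ID back to (node type, type-specific ID)."""
    if not 0 <= global_id < starts[-1]:
        raise ValueError(f"global ID {global_id} out of range")
    i = bisect_right(starts, global_id) - 1
    return ntypes[i], global_id - starts[i]

# Toy counts, made up for illustration.
ntypes, starts = build_offsets({"0": 3, "1": 4, "2": 2})
g_id = to_global(ntypes, starts, "1", 2)
back = to_local(ntypes, starts, g_id)
```

Converting in both directions and checking that the round trip returns the original pair is a cheap way to confirm the mapping is consistent.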

bean · November 19, 2020, 3:38pm

Yeah, it is quite confusing.