Sampler of metapath2vec can't cover all authors

My question only focuses on the metapath sampler of the metapath2vec example.

I use dgl.sampling.random_walk() to genetate metapath from AMiner dataset for metapath2vec, as sampler.py does. However, I find the result traces can’t cover all “author” nodes (1693216/1693531), while the original author’s code (py4genMetaPaths.py) generates correct result.

Both codes share the same parameters: num_walks=1000, walk_length=100. In theory, each trace contains 100 “author” nodes, and 3.88M (3883 conf * 1000 walks/node) traces are generated. As a result, totally 388M “author” nodes are sampled, which should be enough to cover all 1.7M “author” nodes.

After looking into the codes, I find that the sampling strategy used by dgl.sampling.random_walk() is different from original author’s code, resulting in different sampling probability.

For example, we have a simplified AMiner dataset as follows:

g = dgl.heterograph({
    ('author', 'ap', 'paper'): ([0, 0, 1, 1, 2], [0, 1, 1, 2, 2]),
    ('paper', 'pa', 'author'): ([0, 1, 1, 2, 2], [0, 0, 1, 1, 2]),
    ('paper', 'pc', 'conf'): ([0, 1, 2], [0, 0, 1]),
    ('conf', 'cp', 'paper'): ([0, 0, 1], [0, 1, 2])
})
a0 - p0
   \    \
a1 - p1 - c0
   \
a2 - p2 - c1

(1) the original author’s code first computes connections between “author” and “conf” as two dicts:

conf_author = {c0: [a0, a0, a1], c1: [a1, a2]}
author_conf = {a0: [c0, c0], a1: [c0, c1], a2: [c1]}

(note that a0 appears twice in conf_author[c0] because there are two paths c0-p0-a0 and c0-p1-a0 connecting them)
Then it uniformly chooses the next nodes directly FROM THE LISTS in each step, thus totally ignoring “paper” nodes. If we start from c0, the probability of a0 and a1 being sampled is 2/3 and 1/3 respectively.

(2) dgl.sampling.random_walk() chooses the next nodes FROM NEIGHBOURS in each step. It has to sample a “paper” node before “author” node. If we start from c0, the probability of a0 and a1 being sampled will be 3/4 and 1/4 respectively.
c0 -> 1/2p0 + 1/2p1 -> 1/2a0 + 1/4a0 + 1/4a1 = 3/4a0 + 1/4a1

I think this is why the traces generated by dgl.sampling.random_walk() can’t cover all “author” nodes, and proved my guess by counting sampled times of a0 and a1 in one step:

>>> traces, _ = dgl.sampling.random_walk(g, [0] * 1000, metapath=['cp', 'pa'])
>>> authors = traces[:, 2]
>>> authors.unique(return_counts=True)
(tensor([0, 1]), tensor([754, 246]))

As you can see, the ratio of sampled times of a0 and a1 is 3:1, not 2:1.

To generate the same result as original author’s code, one possible solution is to compute CA adjacency matrix and use it as the prob parameter:

>>> cp = g.adj(etype='cp', scipy_fmt='csr')
>>> pa = g.adj(etype='pa', scipy_fmt='csr')
>>> ca = cp * pa
>>> ca.todense()
matrix([[2, 1, 0],
        [0, 1, 1]], dtype=int64)
>>> cag = dgl.heterograph({('conf', 'ca', 'author'): ca.nonzero()})
>>> cag.edata['p'] = torch.from_numpy(ca.data).float()
>>> traces, _ = dgl.sampling.random_walk(cag, [0] * 1000, metapath=['ca'], prob='p')
>>> traces[:, 1].unique(return_counts=True)
(tensor([0, 1]), tensor([670, 330]))

That’s a very valuable observation and example!

How does that impact performance? From what I can imagine, enumerating all possible path, followed by uniformly sampling from the paths, will be quite costly especially for online sampling.

My method precomputes the conf-author bipartie graph, which contains about 5.4M edges. Performing random walk on it takes about 1 hour, while the original author’s code takes about 40 min on my computer.

The result traces generated by my method can cover all 1693531 “author” nodes, and the “author” node that appears the least time appears 15~16 times, which is the same as the result generated by original author’s code.

This is my finished code: pytorch-tutorial/random_walk.py at master · ZZy979/pytorch-tutorial · GitHub

I guess what @BarclayII means is do you observe any model accuracy improvement due to the discrepancy of the two random walk algorithms?

If random walk can’t cover all “author” nodes, you can’t get the embeddings of missing nodes, thus unable to perform downstream task such as node classification or clustering.

@jermainewang I think a correct solution would be to have a dedicated metapath2vec random walk function in this case.

Although multiplying the adjacency matrix works as a temporary solution, it will become impractical if the graph becomes large.

Until we have implement the dedicated function for metapath2vec random walks, a possible workaround is to
(1) Count the number of possible transition paths from a single node that has the same metapath, and
(2) Assign a transition probability on each edge so that the overall sampling probability of every path becomes uniform.

(1) can be achieved with dynamic programming, or message passing on the entire graph, starting from the destination node of the metapath and walking backwards.
(2) can be achieved by assigning the edge probabilities as the quotient between the destination node’s count and the source node’s count obtained from (1).

That being said, dgl.sampling.random_walk should support specifying list of transition probability tensors, or otherwise it cannot handle the case where the same edge type is repeated in the metapath. This might not be a problem for the case of ACM though.