My question focuses only on the metapath sampler of the metapath2vec example.
I use dgl.sampling.random_walk() to generate metapaths from the AMiner dataset for metapath2vec, as sampler.py does. However, I find that the resulting traces can’t cover all “author” nodes (1693216/1693531), while the original author’s code (py4genMetaPaths.py) generates correct results.
Both codes share the same parameters: num_walks=1000, walk_length=100. In theory, each trace contains 100 “author” nodes, and 3.88M traces are generated (3883 confs * 1000 walks per conf node), so 388M “author” node samples are drawn in total, which should be more than enough to cover all 1.7M “author” nodes.
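As a rough back-of-envelope check on “should be enough”: if those 388M samples were drawn uniformly over authors (which they are not), the expected number of authors never sampled would be effectively zero, so the 315 missing authors point at the sampling distribution rather than the sample size.

```python
import math

n_authors = 1_693_531
n_samples = 3_883 * 1_000 * 100   # confs * walks per conf * authors per trace
# Expected number of never-sampled authors under a uniform-sampling idealization:
print(n_authors * math.exp(-n_samples / n_authors))   # ~4e-94, effectively zero
```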
After looking into the code, I find that the sampling strategy used by dgl.sampling.random_walk() is different from the original author’s code, resulting in different sampling probabilities.
For example, we have a simplified AMiner dataset as follows:
import dgl

g = dgl.heterograph({
    ('author', 'ap', 'paper'): ([0, 0, 1, 1, 2], [0, 1, 1, 2, 2]),
    ('paper', 'pa', 'author'): ([0, 1, 1, 2, 2], [0, 0, 1, 1, 2]),
    ('paper', 'pc', 'conf'): ([0, 1, 2], [0, 0, 1]),
    ('conf', 'cp', 'paper'): ([0, 0, 1], [0, 1, 2]),
})
a0 - p0
  \     \
a1 - p1 - c0
  \
a2 - p2 - c1
(1) The original author’s code first computes the connections between “author” and “conf” nodes as two dicts:
conf_author = {c0: [a0, a0, a1], c1: [a1, a2]}
author_conf = {a0: [c0, c0], a1: [c0, c1], a2: [c1]}
(Note that a0 appears twice in conf_author[c0] because there are two paths, c0-p0-a0 and c0-p1-a0, connecting them.)
Then it uniformly chooses the next node directly FROM THE LISTS at each step, ignoring “paper” nodes entirely. If we start from c0, the probabilities of a0 and a1 being sampled are 2/3 and 1/3 respectively.
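For reference, here is a minimal sketch of that list-based strategy on the toy graph above. This is my own re-implementation, not the original py4genMetaPaths.py; the paper_author and conf_paper dicts are hard-coded from the toy graph.

```python
import random
from collections import defaultdict

# Hard-coded from the toy graph above.
paper_author = {'p0': ['a0'], 'p1': ['a0', 'a1'], 'p2': ['a1', 'a2']}
conf_paper = {'c0': ['p0', 'p1'], 'c1': ['p2']}

# Expand every conf-paper-author path, keeping multiplicities
# (so 'a0' appears twice under 'c0').
conf_author = defaultdict(list)
author_conf = defaultdict(list)
for c, papers in conf_paper.items():
    for p in papers:
        for a in paper_author[p]:
            conf_author[c].append(a)
            author_conf[a].append(c)

def walk(start_conf, length):
    """Alternate conf -> author -> conf, picking uniformly FROM THE LISTS."""
    trace, conf = [start_conf], start_conf
    for _ in range(length):
        author = random.choice(conf_author[conf])
        conf = random.choice(author_conf[author])
        trace += [author, conf]
    return trace
```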
(2) dgl.sampling.random_walk() chooses the next node FROM THE NEIGHBOURS at each step, so it has to sample a “paper” node before an “author” node. If we start from c0, the probabilities of a0 and a1 being sampled are 3/4 and 1/4 respectively:
c0 -> 1/2 p0 + 1/2 p1 -> 1/2 a0 + 1/4 a0 + 1/4 a1 = 3/4 a0 + 1/4 a1
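The same numbers fall out if you view the two strategies as transition matrices: dgl.sampling.random_walk() normalizes each hop and then composes them, while the original code composes the raw adjacency first and normalizes once. A quick numpy check (matrices hard-coded from the toy graph above, not computed by either library):

```python
import numpy as np

CP = np.array([[1, 1, 0],    # conf -> paper adjacency (c0-p0, c0-p1, c1-p2)
               [0, 0, 1]])
PA = np.array([[1, 0, 0],    # paper -> author adjacency
               [1, 1, 0],
               [0, 1, 1]])

def row_normalize(m):
    return m / m.sum(axis=1, keepdims=True)

# DGL: normalize each hop, then compose -> [[0.75, 0.25, 0.], [0., 0.5, 0.5]]
print(row_normalize(CP) @ row_normalize(PA))
# Original code: compose first, then normalize -> [[0.667, 0.333, 0.], [0., 0.5, 0.5]]
print(row_normalize(CP @ PA))
```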
I think this is why the traces generated by dgl.sampling.random_walk() can’t cover all “author” nodes. I verified this guess by counting how many times a0 and a1 are sampled in a single step:
>>> traces, _ = dgl.sampling.random_walk(g, [0] * 1000, metapath=['cp', 'pa'])
>>> authors = traces[:, 2]
>>> authors.unique(return_counts=True)
(tensor([0, 1]), tensor([754, 246]))
As you can see, the ratio of the sampled counts of a0 to a1 is 3:1, not 2:1.
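For comparison, repeating one list-based step from c0 1000 times gives roughly the expected 2:1 split (a self-contained check; the candidate list is just conf_author['c0'] from the toy graph):

```python
import random
from collections import Counter

# conf_author['c0'] from the toy graph, with 'a0' listed twice.
candidates = ['a0', 'a0', 'a1']
counts = Counter(random.choice(candidates) for _ in range(1000))
print(counts)   # roughly 2:1, i.e. around 667 vs 333
```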
To generate the same result as the original author’s code, one possible solution is to compute the conf-author (CA) adjacency matrix and use it as the prob parameter:
>>> import torch
>>> cp = g.adj(etype='cp', scipy_fmt='csr')
>>> pa = g.adj(etype='pa', scipy_fmt='csr')
>>> ca = cp * pa
>>> ca.todense()
matrix([[2, 1, 0],
        [0, 1, 1]], dtype=int64)
>>> cag = dgl.heterograph({('conf', 'ca', 'author'): ca.nonzero()})
>>> cag.edata['p'] = torch.from_numpy(ca.data).float()
>>> traces, _ = dgl.sampling.random_walk(cag, [0] * 1000, metapath=['ca'], prob='p')
>>> traces[:, 1].unique(return_counts=True)
(tensor([0, 1]), tensor([670, 330]))
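This matches the expected 2:1 ratio. To generate full walks this way, one possible extension (my own sketch, not from the example code; it reuses the ca matrix from above and mirrors the num_walks=1000 / walk_length=100 setting) is to add the transposed author-to-conf direction and walk on the collapsed weighted graph:

```python
import dgl
import torch

# `ca` is the weighted conf->author matrix computed above; `ac` is its transpose,
# so a full c-a-c-a-... walk can be sampled with the original probabilities.
ac = ca.T.tocsr()
cag = dgl.heterograph({
    ('conf', 'ca', 'author'): ca.nonzero(),
    ('author', 'ac', 'conf'): ac.nonzero(),
})
cag.edges['ca'].data['p'] = torch.from_numpy(ca.data).float()
cag.edges['ac'].data['p'] = torch.from_numpy(ac.data).float()

num_walks, walk_length = 1000, 100   # same parameters as above
starts = list(range(cag.num_nodes('conf'))) * num_walks
traces, _ = dgl.sampling.random_walk(
    cag, starts, metapath=['ca', 'ac'] * walk_length, prob='p')
```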