My question focuses only on the metapath sampler of the metapath2vec example.

I use `dgl.sampling.random_walk()` to generate metapaths from the AMiner dataset for metapath2vec, as sampler.py does. However, I find that the resulting traces can’t cover all “author” nodes (1693216/1693531), while the original author’s code (py4genMetaPaths.py) gives the correct result.

Both codes share the same parameters: `num_walks=1000, walk_length=100`. In theory, each trace contains 100 “author” nodes, and 3.88M traces are generated (3883 confs × 1000 walks/node), so 388M “author” nodes are sampled in total, which should be enough to cover all 1.7M “author” nodes.
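For reference, here is roughly how I count the covered nodes. This is only a sketch, not sampler.py’s exact code; the graph `hg` and the exact metapath are my assumptions:

```
import dgl

# Hypothetical coverage check; `hg` is assumed to be the full AMiner
# heterograph with the same edge types as the toy graph below.
num_walks, walk_length = 1000, 100
metapath = ['cp', 'pa', 'ap', 'pc'] * walk_length  # conf-paper-author-paper-conf

seeds = hg.nodes('conf').repeat_interleave(num_walks)
traces, _ = dgl.sampling.random_walk(hg, seeds, metapath=metapath)

authors = traces[:, 2::4]                 # "author" positions in each trace
covered = authors[authors >= 0].unique()  # -1 pads early-terminated walks
print(f'covered {covered.numel()} / {hg.num_nodes("author")} author nodes')
```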

After looking into the code, I find that the sampling strategy used by `dgl.sampling.random_walk()` is different from the original author’s code, resulting in different sampling probabilities.

For example, suppose we have a simplified AMiner dataset as follows:

```
import dgl

# Toy graph: 3 authors, 3 papers, 2 confs (pictured below)
g = dgl.heterograph({
    ('author', 'ap', 'paper'): ([0, 0, 1, 1, 2], [0, 1, 1, 2, 2]),
    ('paper', 'pa', 'author'): ([0, 1, 1, 2, 2], [0, 0, 1, 1, 2]),
    ('paper', 'pc', 'conf'): ([0, 1, 2], [0, 0, 1]),
    ('conf', 'cp', 'paper'): ([0, 0, 1], [0, 1, 2])
})
```

```
a0 - p0
  \     \
a1 - p1 - c0
  \
a2 - p2 - c1
```

(1) The original author’s code first computes the connections between “author” and “conf” as two dicts:

```
conf_author = {c0: [a0, a0, a1], c1: [a1, a2]}
author_conf = {a0: [c0, c0], a1: [c0, c1], a2: [c1]}
```

(Note that `a0` appears twice in `conf_author[c0]` because there are two paths, `c0-p0-a0` and `c0-p1-a0`, connecting them.)

Then it uniformly chooses the next node directly FROM THE LISTS in each step, thus totally ignoring “paper” nodes. If we start from `c0`, the probabilities of `a0` and `a1` being sampled are 2/3 and 1/3 respectively.
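In code, the list-based strategy is roughly the following. This is a simplified sketch of what py4genMetaPaths.py does, not its exact code:

```
import random

# The two connection dicts from above; each step samples uniformly
# from a list, so a0 (listed twice under c0) is picked with prob 2/3.
conf_author = {'c0': ['a0', 'a0', 'a1'], 'c1': ['a1', 'a2']}
author_conf = {'a0': ['c0', 'c0'], 'a1': ['c0', 'c1'], 'a2': ['c1']}

def list_based_walk(start_conf, num_steps):
    trace, conf = [start_conf], start_conf
    for _ in range(num_steps):
        author = random.choice(conf_author[conf])  # "paper" nodes never appear
        conf = random.choice(author_conf[author])
        trace += [author, conf]
    return trace

print(list_based_walk('c0', 3))  # e.g. ['c0', 'a0', 'c0', 'a1', 'c1', 'a2', 'c1']
```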

(2) `dgl.sampling.random_walk()` chooses the next node FROM NEIGHBOURS in each step, so it has to sample a “paper” node before an “author” node. If we start from `c0`, the probabilities of `a0` and `a1` being sampled become 3/4 and 1/4 respectively:

`c0 -> 1/2p0 + 1/2p1 -> 1/2a0 + 1/4a0 + 1/4a1 = 3/4a0 + 1/4a1`
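The same numbers fall out of the row-normalized one-step transition matrices; a quick numpy check on the toy graph:

```
import numpy as np

# Unnormalized adjacency matrices of the toy graph.
cp = np.array([[1, 1, 0],   # c0 -> p0, p1
               [0, 0, 1]])  # c1 -> p2
pa = np.array([[1, 0, 0],   # p0 -> a0
               [1, 1, 0],   # p1 -> a0, a1
               [0, 1, 1]])  # p2 -> a1, a2

# Row-normalize to get uniform-neighbour transition probabilities,
# then chain the two steps: conf -> paper -> author.
cp_t = cp / cp.sum(axis=1, keepdims=True)
pa_t = pa / pa.sum(axis=1, keepdims=True)
print((cp_t @ pa_t)[0])  # [0.75 0.25 0.  ] = P(a0), P(a1), P(a2) from c0
```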

I think this is why the traces generated by `dgl.sampling.random_walk()` can’t cover all “author” nodes, and I verified my guess by counting how many times `a0` and `a1` are sampled in one step:

```
>>> traces, _ = dgl.sampling.random_walk(g, [0] * 1000, metapath=['cp', 'pa'])
>>> authors = traces[:, 2]
>>> authors.unique(return_counts=True)
(tensor([0, 1]), tensor([754, 246]))
```

As you can see, the ratio of the sampled counts of `a0` and `a1` is about 3:1, not 2:1.

To generate the same result as the original author’s code, one possible solution is to compute the conf-author (CA) adjacency matrix and use its entries as edge weights through the `prob` parameter:

```
>>> import torch
>>> cp = g.adj(etype='cp', scipy_fmt='csr')
>>> pa = g.adj(etype='pa', scipy_fmt='csr')
>>> ca = cp * pa  # conf-author matrix; entries count the c-p-a paths
>>> ca.todense()
matrix([[2, 1, 0],
        [0, 1, 1]], dtype=int64)
>>> cag = dgl.heterograph({('conf', 'ca', 'author'): ca.nonzero()})
>>> cag.edata['p'] = torch.from_numpy(ca.data).float()
>>> traces, _ = dgl.sampling.random_walk(cag, [0] * 1000, metapath=['ca'], prob='p')
>>> traces[:, 1].unique(return_counts=True)
(tensor([0, 1]), tensor([670, 330]))
```