PinSAGESampler with unexpected results

lucaslrolim · January 21, 2021, 2:26am

I generated a bipartite graph using the stochastic block model generator from NetworkX. In this graph, users of type A have an 80% probability of having an edge to an item of type X and 20% to an item of type Y; users of type B have a 20% probability of having an edge to an item of type X and 80% to an item of type Y.

IDs are sequential: 0-199 (users A), 200-399 (users B), 400-429 (items X), and 430-479 (items Y).

import dgl
import networkx as nx

n_user_groups = 2
sizes = [200, 200, 30, 50]  # users A, users B, items X, items Y
# Block-to-block edge probabilities (rows and columns follow the order of `sizes`)
probs = [
    [0, 0, 0.8, 0.2],
    [0, 0, 0.2, 0.8],
    [0.8, 0.2, 0, 0],
    [0.2, 0.8, 0, 0],
]
g = nx.stochastic_block_model(sizes, probs, seed=0)

# Shift item IDs so they start at 0 (400-479 becomes 0-79)
max_user_id = sum(sizes[:n_user_groups])
normalize = lambda edge: (edge[0], edge[1] - max_user_id)
user_edges = list(map(normalize, g.edges()))

# Reverse every edge to get the item -> user relation
swap = lambda edge: (edge[1], edge[0])
item_edges = list(map(swap, user_edges))

graph = dgl.heterograph({
    ('user', 'watched', 'item'): user_edges,
    ('item', 'watched-by', 'user'): item_edges,
})

Using PinSAGESampler on this graph, I noticed that the selected neighbors are not what I expected. Take node 0 as an example: I expected around 80% of the neighbors chosen by the sampler to be in user group A (IDs 0-199), but that is not what happened. Across many empirical tests, this percentage was around 30-50%.

import torch

sampler = dgl.sampling.PinSAGESampler(
    graph,
    'user',  # target node type
    'item',  # auxiliary node type
    3,       # num_traversals (max random walk length / 2)
    0.1,     # restart (termination) probability
    100,     # number of random walks
    100,     # number of neighbors
)
seeds = torch.LongTensor([0])
frontier = sampler(seeds)
print(frontier.all_edges(form='uv'))
# Count sampled neighbors with ID above 200
# (group B is 200-399, so `>= 200` would also count node 200)
sum(frontier.all_edges(form='uv')[0] > 200)

Output:

(tensor([234,   0,  96, 106,  66, 165,  49, 221, 224, 311, 313, 390,   7, 369,
        132, 160, 152, 149, 143, 141, 139, 392, 125, 119, 114, 394, 101, 100,
          1,  92,  90, 239, 312, 346, 297, 285, 269, 264, 263, 252, 164, 235,
        350, 233, 377, 213, 200, 173, 322,  37,  30,  42,  31,  21,  78,  80,
         82,  84,   9,  26, 341, 286, 287, 288, 292, 296,  29, 338,  25,  22,
        280, 327, 335, 330, 284, 283, 281,  20, 279, 278, 270,  32,  33, 262,
        261, 257, 256,  34, 250, 385, 373, 375, 376,  16, 378, 379, 382, 384,
        371, 386]), tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0]))
tensor(56)

Does anyone know why this is happening?

BarclayII · January 25, 2021, 7:04am

PinSAGE takes neighbors by doing multi-hop random walks with restarts and taking the most frequently visited nodes. The probability distribution obtained this way will no longer be 0.8/0.2, but more complicated than that.
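To see the effect without DGL, here is a minimal pure-Python sketch of that procedure. It is not the library's implementation: `build_bipartite` and `sample_neighbors` are hypothetical helpers, and the stopping rule (a walk may terminate before every traversal except the first) is an assumption. It samples a graph with the block probabilities from the question, runs the walks from node 0, and compares the share of distinct neighbors and the share of visits that fall in group A.

```python
import random
from collections import Counter

def build_bipartite(rng, n_a=200, n_b=200, n_x=30, n_y=50, p_in=0.8, p_out=0.2):
    """Sample user/item adjacency lists with the block probabilities from the question."""
    n_users, n_items = n_a + n_b, n_x + n_y
    user_items = [[] for _ in range(n_users)]
    item_users = [[] for _ in range(n_items)]
    for u in range(n_users):
        for i in range(n_items):
            # A pairs with X and B pairs with Y at p_in; the other pairs use p_out.
            same_block = (u < n_a) == (i < n_x)
            if rng.random() < (p_in if same_block else p_out):
                user_items[u].append(i)
                item_users[i].append(u)
    return user_items, item_users

def sample_neighbors(seed, user_items, item_users, rng,
                     num_traversals=3, termination_prob=0.1,
                     num_random_walks=100, num_neighbors=100):
    """Count the user reached after each user->item->user traversal; keep the most visited."""
    visits = Counter()
    for _ in range(num_random_walks):
        node = seed
        for step in range(num_traversals):
            # Assumed rule: the walk may stop before every traversal except the first.
            if step > 0 and rng.random() < termination_prob:
                break
            if not user_items[node]:
                break  # isolated user, nowhere to go
            item = rng.choice(user_items[node])
            node = rng.choice(item_users[item])
            visits[node] += 1
    return visits.most_common(num_neighbors)

rng = random.Random(0)
user_items, item_users = build_bipartite(rng)
top = sample_neighbors(0, user_items, item_users, rng)

total_visits = sum(count for _, count in top)
frac_nodes_a = sum(node < 200 for node, _ in top) / len(top)
frac_visits_a = sum(count for node, count in top if node < 200) / total_visits
print(f"neighbors returned: {len(top)}")
print(f"share of distinct neighbors in group A: {frac_nodes_a:.2f}")
print(f"share of visits in group A: {frac_visits_a:.2f}")
```

Counting distinct neighbors, as in the question, ignores how often each one was visited, which is one more reason the observed share can differ from a simple per-edge probability.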

lucaslrolim · January 27, 2021, 3:57pm

Yes. The expected value is around 0.64.
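As a rough check on that figure, the group of the walker after each user-item-user traversal can be treated as a two-state Markov chain. The sketch below is a back-of-the-envelope estimate, not something the sampler computes: it assumes every node has exactly its expected degree, and `to_group_a` is a hypothetical helper. Under that assumption the probability of being in group A is roughly 0.62 after one traversal and drifts down to about 0.49 and 0.45 after the second and third, because group B users have more edges on average (the 50 items of type Y outnumber the 30 of type X).

```python
# Block sizes and edge probabilities from the original post.
n_a, n_b = 200, 200      # users in group A and group B
n_x, n_y = 30, 50        # items of type X and type Y
p_in, p_out = 0.8, 0.2

def to_group_a(p_x, p_y):
    """P(one user->item->user traversal ends in group A), using expected degrees.

    p_x and p_y are the current user's edge probabilities to X and Y items.
    """
    # Step 1: pick an item uniformly among the user's expected neighbors.
    to_x = n_x * p_x / (n_x * p_x + n_y * p_y)
    # Step 2: pick a user uniformly among that item's expected neighbors.
    x_to_a = n_a * p_in / (n_a * p_in + n_b * p_out)
    y_to_a = n_a * p_out / (n_a * p_out + n_b * p_in)
    return to_x * x_to_a + (1 - to_x) * y_to_a

a_to_a = to_group_a(p_in, p_out)   # walker currently in group A
b_to_a = to_group_a(p_out, p_in)   # walker currently in group B

prob_a = 1.0  # the seed, node 0, belongs to group A
for k in range(1, 4):
    prob_a = prob_a * a_to_a + (1 - prob_a) * b_to_a
    print(f"after {k} traversal(s): P(group A) = {prob_a:.3f}")
```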

I found that the main cause of my problem was that the sampler returns a different number of neighbors depending on the "num_traversals" argument.

I think this happens when, with the maximum number of traversals given in the arguments, the random walks struggle (or are unable) to reach the "num_neighbors" value also given in the arguments.
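One way to make this concrete is to count how many visits the walks can record at all, since the sampler cannot return more distinct neighbors than it visited. The sketch below assumes that each walk records one visit per completed traversal and may terminate with the restart probability before every traversal after the first; `expected_visits` is a hypothetical helper, not a DGL function.

```python
def expected_visits(num_random_walks, num_traversals, termination_prob):
    """Expected number of recorded visits over all walks from one seed.

    Assumes a walk survives to its k-th extra traversal with
    probability (1 - termination_prob) ** k.
    """
    per_walk = sum((1 - termination_prob) ** k for k in range(num_traversals))
    return num_random_walks * per_walk

for num_traversals in (1, 2, 3):
    budget = expected_visits(100, num_traversals, 0.1)
    print(f"num_traversals={num_traversals}: about {budget:.0f} visits "
          f"for 100 requested neighbors")
```

With a single traversal, 100 walks record at most 100 visits, so any repeated node leaves fewer than 100 distinct neighbors. More traversals raise the budget and make the requested count easier to reach.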
