Neighbor Sampling in a Heterogeneous Graph

iamjiang · January 6, 2020, 9:41pm

Suppose I have a heterogeneous graph with the following canonical edge types:

("Member", "Parent", "Member"), ("Member", "Spouse", "Member"), ("Member", "Buy", "Product")

The edge and node counts are:
Parent: 1000
Spouse: 500
Buy: 300

Member: 200
Product: 100

How can I use dgl.contrib.sampling.NeighborSampler to create a subgraph that reflects the distribution of edge and node types in this heterogeneous graph? In other words, I am not sure how to set the transition_prob argument in the NeighborSampler call below so that the sampled subgraph is representative of the original graph.

Sampling a heterogeneous graph is essential for mini-batch training when the graph is very large.

Thanks

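sampler = dgl.contrib.sampling.NeighborSampler(
    g,
    batch_size,
    5,  # expand_factor: neighbors sampled per node
    1,  # num_hops
    transition_prob=?,  # unknown: this is the value I am asking about
    seed_nodes=torch.arange(g.number_of_nodes()),
    prefetch=True,
    add_self_loop=True,
    shuffle=False,
    num_workers=4
)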
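One way to make "representative" concrete is to compare the share of each edge type in a sampled subgraph against its share in the full graph. The sketch below is plain Python and does not use any DGL API. The edge counts come from the post above; the sampled edge list and the helper names `relative_frequencies` and `total_variation_distance` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Edge counts per relation, taken from the post above.
full_graph_counts = {"Parent": 1000, "Spouse": 500, "Buy": 300}


def relative_frequencies(counts):
    """Return each key's share of the total count."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over edge types."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical edge-type labels observed in one sampled subgraph.
sampled_edge_types = ["Parent"] * 50 + ["Spouse"] * 30 + ["Buy"] * 20

target = relative_frequencies(full_graph_counts)
observed = relative_frequencies(Counter(sampled_edge_types))

# 0.0 means the sample matches the full graph's edge-type mix exactly;
# larger values mean the sample is less representative.
gap = total_variation_distance(target, observed)
```

A check like this only measures how close a sample is to the target mix. It does not say how to configure the sampler to reach that mix, which is the open question in this thread.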

VoVAllen · January 7, 2020, 1:17pm

We’re still working on this feature and don’t have a solution for now. @BarclayII may write an RFC for this later this month.

iamjiang · January 7, 2020, 5:19pm

Thanks for your prompt response.
I am wondering how to train on a large heterograph using mini-batch sampling. Since the graph is big, I can’t train the model using all of the training data at once.

Can I treat the heterograph as a homogeneous graph and run neighbor sampling while ignoring the different node and edge types?

If the canonical edge types of my heterograph are as follows:

("Member", "Parent", "Member"), ("Member", "Spouse", "Member"), ("Member", "Buy", "Product")

how should I set the seed_nodes of NeighborSampler for the different node types, each of which has its own labeling space? Any guidance will be greatly appreciated.

batch_size = 1024
sampler = dgl.contrib.sampling.NeighborSampler(
    g,
    batch_size,
    5,
    1,
    # One seed per node across both node types
    seed_nodes=torch.arange(
        g.number_of_nodes(ntype="Member") + g.number_of_nodes(ntype="Product")
    ),
    prefetch=True,
    add_self_loop=True,
    shuffle=False,
    num_workers=4
)
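To illustrate what a shared seed space could look like, here is a minimal sketch in plain Python, assuming "labeling space" refers to per-type node IDs that each start at 0. Every node type gets a contiguous block of global IDs, so a seed drawn from the combined range can be mapped back to its type. The helpers `build_offsets`, `to_global`, and `to_local` are hypothetical names and not part of DGL; whether NeighborSampler accepts a graph laid out this way is not confirmed in this thread.

```python
# Node counts per type, taken from the first post in this thread.
NUM_NODES = {"Member": 200, "Product": 100}


def build_offsets(num_nodes_per_type):
    """Assign each node type a contiguous block of global IDs."""
    offsets = {}
    next_start = 0
    for ntype, count in num_nodes_per_type.items():
        offsets[ntype] = next_start
        next_start += count
    return offsets, next_start


def to_global(ntype, local_id, offsets):
    """Map a per-type node ID to its global ID."""
    return offsets[ntype] + local_id


def to_local(global_id, offsets, num_nodes_per_type):
    """Map a global ID back to (node type, per-type ID)."""
    for ntype, start in offsets.items():
        if start <= global_id < start + num_nodes_per_type[ntype]:
            return ntype, global_id - start
    raise ValueError(f"global ID {global_id} is out of range")


offsets, total_nodes = build_offsets(NUM_NODES)

# Global seed IDs restricted to a single node type, here only Member nodes.
member_start = offsets["Member"]
member_seeds = list(range(member_start, member_start + NUM_NODES["Member"]))
```

With this layout, a range over `total_nodes` covers the same 300 IDs as the `torch.arange(...)` call in the snippet above, while `member_seeds` shows how seeds could be limited to one node type instead.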