Need some help with graphbolt and HeteroGraphConv for link prediction

First of all, thank you for creating such a great project! I am just coming up to speed on GNNs and facing a learning curve, both with GNNs in general and with DGL's bits and pieces in particular, so please forgive me if my question is naive and has an obvious answer staring at me somewhere in the docs or examples.

I am trying to implement a recommender system via link prediction on a heterogeneous graph and starting out with a trivial subset of my data in hopes of adding more node and edge types as well as related embeddings down the road. For the time being my graph has the following node types:

  • users
  • posts
  • media

Users like posts.
Posts can contain one or more media items (images or videos).
Media have embeddings based on their visual content.

The goal is to learn embeddings for posts and users, so that I can predict which other posts users might like based on their previous likes.

Even with this limited set of node types and edges, my dataset is pretty large, so I'm using graphbolt for batching. I've been trying to follow the documentation for stochastic training on large heterogeneous graphs, but I'm stuck on the basic step of getting post and user embeddings after running a batch through 2 layers of HeteroGraphConv. The following code demonstrates my issue on a toy dataset:

import dgl
import dgl.graphbolt as gb
import dgl.nn as dglnn

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

n_users = 50
n_posts = 100
n_media = 200

n_likes = 200
n_contains = 200

input_feature_dim = 32
hidden_feature_dim = 16
output_feature_dim = 8

like_src = np.random.randint(0, n_users, n_likes)
like_dst = np.random.randint(0, n_posts, n_likes)

contains_src = np.random.randint(0, n_posts, n_contains)
contains_dst = np.random.randint(0, n_media, n_contains)


hetero_graph = dgl.heterograph(
    {
        ("post", "liked-by", "user"): (like_dst, like_src),
        ("media", "is-in", "post"): (contains_dst, contains_src),
    }
)

train_mask = torch.zeros(n_likes, dtype=torch.bool).bernoulli(0.6)
media_feature = torch.randn(n_media, input_feature_dim)
post_feature = torch.randn(n_posts, input_feature_dim)
user_feature = torch.randn(n_users, input_feature_dim)


feature = gb.BasicFeatureStore(
    {
        ("node", "media", "feat"): gb.TorchBasedFeature(media_feature),
        ("node", "post", "feat"): gb.TorchBasedFeature(post_feature),
        ("node", "user", "feat"): gb.TorchBasedFeature(user_feature),
    }
)

sampling_graph = gb.from_dglgraph(hetero_graph)
posts, users = hetero_graph.edges(etype="liked-by")
# One (post, user) pair per row, restricted to the training edges.
seed_edges = torch.stack([posts, users], dim=1)[train_mask]

seeds = {
    "post:liked-by:user": gb.ItemSet((seed_edges), names=("seeds")),
}

train_set = gb.HeteroItemSet(seeds)

datapipe = gb.ItemSampler(train_set, batch_size=2, shuffle=True)
datapipe = datapipe.sample_uniform_negative(sampling_graph, 2)
datapipe = datapipe.sample_layer_neighbor(sampling_graph, [5, 5])
datapipe = datapipe.fetch_feature(feature, node_feature_keys={"media": ["feat"]})
train_dataloader = gb.DataLoader(datapipe)


class ActivityGCN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()

        self.hidden_dim = hidden_dim
        self.conv1 = dglnn.HeteroGraphConv(
            {
                "liked-by": dglnn.SAGEConv(input_dim, hidden_dim, "mean"),
                "is-in": dglnn.SAGEConv(input_dim, hidden_dim, "mean"),
            },
            aggregate="mean",
        )
        self.conv2 = dglnn.HeteroGraphConv(
            {
                "liked-by": dglnn.SAGEConv(hidden_dim, output_dim, "mean"),
                "is-in": dglnn.SAGEConv(hidden_dim, output_dim, "mean"),
            },
            aggregate="mean",
        )

        self.layers = nn.ModuleList(modules=[self.conv1, self.conv2])

    def forward(self, blocks, x):
        hidden_x = x

        for layer_idx, (layer, block) in enumerate(zip(self.layers, blocks)):
            print(f"layer {layer_idx} input {hidden_x.keys()}")
            for k, v in hidden_x.items():
                print(
                    f"layer {layer_idx} input features for node type '{k}' shape {v.shape}"
                )

            hidden_x = layer(block, hidden_x)
            is_last_layer = layer_idx == len(self.layers) - 1
            if not is_last_layer:
                hidden_x = {k: F.relu(v) for k, v in hidden_x.items()}

        print(f"returning embeddings for node types {hidden_x.keys()}")
        return hidden_x


model = ActivityGCN(input_feature_dim, hidden_feature_dim, output_feature_dim)
batch_iterator = iter(train_dataloader)

batch = next(batch_iterator)

input_features = {
    nt: v for (nt, f_name), v in batch.node_features.items() if f_name == "feat"
}

rst = model(batch.blocks, input_features)

print(f"forward pass result {rst}")

Only media nodes have actual features, so the edges are reversed to point from media toward users:
"media:is-in:post"
"post:liked-by:user"

Running the code, I get the following output:

layer 0 input dict_keys(['media'])
layer 0 input features for node type 'media' shape torch.Size([49, 32])
layer 1 input dict_keys([])
returning embeddings for node types dict_keys([])
forward pass result {}

If I update feature fetching as follows:

datapipe = datapipe.fetch_feature(
    feature, node_feature_keys={"media": ["feat"], "post": ["feat"], "user": ["feat"]}
)

I get the following:

layer 0 input dict_keys(['post', 'media'])
layer 0 input features for node type 'post' shape torch.Size([17, 32])
layer 0 input features for node type 'media' shape torch.Size([30, 32])
layer 1 input dict_keys(['post'])
layer 1 input features for node type 'post' shape torch.Size([17, 16])
returning embeddings for node types dict_keys([])
forward pass result {}

Now I get computed post features as input to layer 1, but the end result is still the same. Also, posts and users are not supposed to have any intrinsic features in my current setup, so feeding in random data for them does not seem right.
I'm using DGL v2.3.0 compiled from source on macOS.
I was referencing the following docs:
Stochastic Training on Large Graphs 6.3
HeteroGraphConv
Stochastic Training of GNNs with GraphBolt (link prediction)

I suspect that part of my issue is related to this bug I filed:
https://github.com/dmlc/dgl/issues/7687
But I also think that I’m missing some key concepts in my current understanding. Any help or guidance will be highly appreciated.


Hi, HeteroGraphConv currently expects features for every node type to be provided in the forward pass. In your case, I would recommend adding learned embeddings for posts and users, and adding reverse edges (e.g., user:like:post). This will fix the forward problem.
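To make the two suggestions concrete, here is a minimal sketch in plain PyTorch. The helper names (add_reverse_edges, LearnedNodeFeatures) and the reverse relation name "contains" are made up for illustration and are not part of DGL. The sketch also assumes that the node IDs passed in are a dict from node type to an ID tensor, as batch.input_nodes should be for a heterogeneous graph.

```python
import torch
import torch.nn as nn


def add_reverse_edges(edges, reverse_names):
    """Return a copy of `edges` with a reversed relation for each listed edge type.

    edges: {(src_type, etype, dst_type): (src_ids, dst_ids)}
    reverse_names: {etype: name_of_reverse_etype}  (hypothetical naming scheme)
    """
    out = dict(edges)
    for (src_type, etype, dst_type), (src, dst) in edges.items():
        if etype in reverse_names:
            out[(dst_type, reverse_names[etype], src_type)] = (dst, src)
    return out


class LearnedNodeFeatures(nn.Module):
    """One trainable embedding table per node type that has no intrinsic features."""

    def __init__(self, num_nodes, dim):
        super().__init__()
        self.tables = nn.ModuleDict(
            {ntype: nn.Embedding(n, dim) for ntype, n in num_nodes.items()}
        )

    def forward(self, input_nodes, fetched):
        # Start from the features returned by the feature store (e.g. media),
        # then look up learned embeddings for the remaining node types.
        feats = dict(fetched)
        for ntype, ids in input_nodes.items():
            if ntype in self.tables:
                feats[ntype] = self.tables[ntype](ids)
        return feats


# Toy usage with made-up IDs.
post_ids = torch.tensor([0, 1, 2])
user_ids = torch.tensor([1, 0, 1])
media_ids = torch.tensor([3, 4, 5])

edges = add_reverse_edges(
    {
        ("post", "liked-by", "user"): (post_ids, user_ids),
        ("media", "is-in", "post"): (media_ids, post_ids),
    },
    {"liked-by": "like", "is-in": "contains"},
)

embed = LearnedNodeFeatures({"post": 100, "user": 50}, dim=32)
media_feat = torch.randn(3, 32)
feats = embed(
    {"post": post_ids, "user": user_ids, "media": media_ids},
    {"media": media_feat},
)
```

In the script from the question, the resulting edge dict would be passed to dgl.heterograph, each HeteroGraphConv would need an entry for the two new relations, and the embedding module's parameters would have to be handed to the optimizer together with the model's. The result of embed(batch.input_nodes, input_features) would then replace input_features in the call to the model.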