Heterogeneous graph with multiple user and edge types

nillbahrami · August 12, 2023, 1:53pm

Hello, I am attempting to construct a heterogeneous graph consisting of node types such as user, item, and category, alongside edge types like view, buy, add to cart, and item belongs to category. The issue I am facing is that, when I introduce various types of edges, the user nodes end up being replicated multiple times. For instance, let’s consider users 1 and 2, along with products 1, 2, and 3. If user 1 purchases items 1 and 2, and views item 3, the graph erroneously adds three separate nodes for user 1 instead of just one.
Can someone please help me?

pubu · August 12, 2023, 2:39pm

Can you post your example code and output?

I can’t reproduce the problem you describe with this test code:

import dgl
import torch

graph_data = {
    ('user', 'purchases', 'item'): (torch.tensor([0, 0]), torch.tensor([0, 1])),
    ('user', 'views', 'item'): (torch.tensor([0]), torch.tensor([2]))
}

g = dgl.heterograph(graph_data)
print("ntypes:", g.ntypes)
print("etypes:", g.etypes)
print("canonical_etypes:", g.canonical_etypes)
print(g)

Which produces:

ntypes: ['item', 'user']
etypes: ['purchases', 'views']
canonical_etypes: [('user', 'purchases', 'item'), ('user', 'views', 'item')]
Graph(num_nodes={'item': 3, 'user': 1},
      num_edges={('user', 'purchases', 'item'): 2, ('user', 'views', 'item'): 1},
      metagraph=[('user', 'item', 'purchases'), ('user', 'item', 'views')])
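
Note that the node counts above come from the IDs themselves. As far as I understand, when `num_nodes_dict` is not passed, `dgl.heterograph` sizes each node type as the largest ID it sees for that type plus one, so gaps in the IDs show up as extra nodes. Here is a rough plain-Python sketch of that rule (no DGL involved; the helper name and the sample IDs are made up for illustration):

```python
def inferred_vs_distinct(ids_by_type):
    """For each node type, compare 'largest ID + 1' with the number of distinct IDs.

    'largest ID + 1' mimics how the node count is assumed to be inferred
    when no explicit node counts are supplied.
    """
    report = {}
    for ntype, ids in ids_by_type.items():
        ids = list(ids)
        report[ntype] = {
            "max_id_plus_one": max(ids) + 1,
            "distinct": len(set(ids)),
        }
    return report


# Hypothetical IDs: three distinct users whose raw IDs are 0, 5 and 9.
report = inferred_vs_distinct({"user": [0, 0, 5, 9], "item": [0, 1, 2, 2]})
print(report)
```

If the two numbers differ for a node type, the IDs for that type are not a contiguous range starting at 0.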

nillbahrami · August 13, 2023, 9:33am

Thanks, here’s my code:

import dgl
import pandas as pd
import torch as th

df_categories_item = pd.read_csv('/content/product-categories.csv', sep = ';')

df_products = pd.read_csv('/content/products.csv', on_bad_lines = 'skip', sep = ';')

df_view = pd.read_csv('/content/train-item-views.csv')

df_purchase = pd.read_csv('/content/train-purchases.csv', sep = ';')

data_dict = {
    ('user', 'view', 'item'): (th.tensor(df_view['userId'].values.astype('int64')),
                               th.tensor(df_view['itemId'].values.astype('int64'))),
    ('user', 'purchase', 'item'): (th.tensor(df_purchase['userId'].values.astype('int64')),
                                   th.tensor(df_purchase['itemId'].values.astype('int64'))),
    ('item', 'is_from', 'category'): (th.tensor(df_categories_item['itemId'].values.astype('int64')),
                                      th.tensor(df_categories_item['categoryId'].values.astype('int64')))
}
g = dgl.heterograph(data_dict)

I also tried a small dataset, and it generates far too many users as well:

from dgl.data import DGLDataset

class CustomGraphDataset(DGLDataset):

    def __init__(self):
        super().__init__(name = 'hetera_graph')


    def process(self):

        node_types_df = pd.read_csv("/content/test_node_types.csv")
        nodes_df = pd.read_csv("/content/test_nodes.csv")
        edges_df = pd.read_csv("/content/test_edges.csv")


        buys_edges = edges_df.loc[edges_df["edge_type"] == "buys"]
        viewa_edges = edges_df.loc[edges_df["edge_type"] == "view"]
        belongs_edges = edges_df.loc[edges_df["edge_type"] == "belongs"]

        addToC_edges = edges_df.loc[edges_df["edge_type"] == "add_to_cart"]


        type_id_dict = dict(zip(node_types_df['node_type'],
                                     node_types_df['type_id']))


        nodes = nodes_df['node_id'].tolist()
        node_types = nodes_df['node_type'].map(type_id_dict).tolist()


        edge_weights = th.from_numpy(edges_df["weight"].to_numpy())



        data_dict = {
          ('user', 'view', 'item'): (th.tensor(viewa_edges['source'].values.astype('int64')),
                               th.tensor(viewa_edges['target'].values.astype('int64'))),
          ('user', 'buys', 'item'): (th.tensor(buys_edges['source'].values.astype('int64')),
                                   th.tensor(buys_edges['target'].values.astype('int64'))),
          ('item', 'belongs', 'category'): (th.tensor(belongs_edges['source'].values.astype('int64')),
                                      th.tensor(belongs_edges['target'].values.astype('int64'))),
          ('user', 'add_to_cart', 'item'): (th.tensor(addToC_edges['source'].values.astype('int64')),
                                   th.tensor(addToC_edges['target'].values.astype('int64')))
          }

        self.graph = dgl.heterograph(data_dict)
        print(self.graph)


        for e_t in self.graph.etypes:
            self.graph.edges[e_t].data["weight"] = th.from_numpy(
                edges_df[edges_df["edge_type"] == e_t]['weight'].to_numpy())

    def __getitem__(self, idx):
        return self.graph

    def __len__(self):
        return 1

# DGLDataset.__init__ already calls process(), so no explicit call is needed.
dataset = CustomGraphDataset()
dataset[0]

Output:

Graph(num_nodes={'category': 6, 'item': 9, 'user': 10},
      num_edges={('item', 'belongs', 'category'): 6, ('user', 'add_to_cart', 'item'): 2, ('user', 'buys', 'item'): 4, ('user', 'view', 'item'): 9},
      metagraph=[('item', 'category', 'belongs'), ('user', 'item', 'add_to_cart'), ('user', 'item', 'buys'), ('user', 'item', 'view')])

(The dataset contains only 3 users, but the graph reports 10.)

pubu · August 15, 2023, 8:14am

This is strange. Did you check your dataset (csv files)? Is it possible to post a sample of the data you are using (the small dataset)?
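
One thing worth checking in the CSV files is whether the `source` and `target` IDs for each node type form a contiguous range starting at 0. I'm only guessing without seeing the data, but if the user IDs are, say, 3, 7 and 9, the graph would report 10 users even though only 3 exist. If that turns out to be the case, a possible fix is to relabel the IDs per node type before building the graph. A minimal sketch in plain Python, where `build_id_map` and the sample IDs are hypothetical:

```python
def build_id_map(*id_lists):
    """Assign contiguous IDs 0..n-1 to raw IDs, in order of first appearance.

    All edge lists that refer to the same node type must share one map,
    so that a given user gets the same new ID in every edge type.
    """
    mapping = {}
    for ids in id_lists:
        for raw in ids:
            mapping.setdefault(raw, len(mapping))
    return mapping


# Hypothetical raw user IDs taken from two edge types.
view_src = [3, 7, 9]
buys_src = [3, 9]

user_map = build_id_map(view_src, buys_src)
view_src_new = [user_map[u] for u in view_src]
buys_src_new = [user_map[u] for u in buys_src]

print(user_map)       # {3: 0, 7: 1, 9: 2}
print(view_src_new)   # [0, 1, 2]
print(buys_src_new)   # [0, 2]
```

The relabeled lists can then be wrapped in tensors and passed to `dgl.heterograph` the same way as in your code. Keep `user_map` around if you need to translate node IDs back to the original ones.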
