Dear all members of the DGL community,
I have worked on this the whole day but haven't tested it at all; I am a little scared of what will come out of it, haha. I am asking all the gurus here for answers to several questions that I really can't figure out after many hours of browsing. I hope it doesn't stray too far off topic. Thanks in advance.
This post is a continuation of my earlier one.
My current goal is to get solid link prediction code, which I think is the simplest model and the most compatible with what I am after. I hope to get preliminary results from it before I move on to another model, and to compare results across several types of loss (cross-entropy, BPR, margin, etc.).
First of all, I set my node data differently from what is written in the guide: each node's data is loaded from a tensor file on my computer, and each one is a rank-2 tensor.
What I want to ask is the following.
- Overall, do the modifications I made make sense?
- Several sub-questions follow:
2.1. For the simplest mechanics I could think of, I define interaction strength by taking the difference between two nodes as the input to the edge features, and then taking a dot product that is proportional to the distance (I put the distance in G.nodes[].data['linkdata']). Will it be refined iteratively during the computation? I still have some doubts about it.
2.2. The node data really differ in dimension, since each is loaded from a different tensor file. I hope broadcasting will work; I notice I don't do any masking or padding as some torch references suggest.
2.3. I really hope to find a better way to write the
2.4. A question about links that get severed as the graph evolves (e.g. if a node appears in between two others). This just popped into my mind this noon. I don't know whether the code handles it automatically without my defining too much; I did it one by one with only 9 nodes, and it is very exhausting. I hope there is a simpler way to do this.
2.5. I don't know how to build a (maybe dictionary) of nodes and their types that is more compact and callable. I made the graph heterogeneous because I think the notation is more convenient than for a homogeneous graph (even though it is in fact homogeneous).
2.6. Related to Part 2: I don't know what to do with the SAGE class, along with RGCN, user_feats and item_feats. I changed anything I thought necessary. The last two definitions somewhat baffle me, even though I think all of the inputs are complete (I changed 'hetero_graph' into 'G' to fit the Part 1 code).
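On question 2.2, tensors of different sizes do not broadcast against each other just because they live in the same graph, and DGL stores the features of one node type as a single tensor whose leading dimension is the number of nodes. One common workaround is padding to a shared shape and keeping a mask of the real entries. The sketch below is only an illustration in plain PyTorch; the helper name `pad_and_stack` and the toy tensors are hypothetical.

```python
import torch

def pad_and_stack(tensors):
    """Zero-pad rank-2 tensors to a common (rows, cols) shape and stack them.

    Returns the stacked batch and a boolean mask marking the original entries.
    """
    rows = max(t.shape[0] for t in tensors)
    cols = max(t.shape[1] for t in tensors)
    batch = torch.zeros(len(tensors), rows, cols, dtype=tensors[0].dtype)
    mask = torch.zeros(len(tensors), rows, cols, dtype=torch.bool)
    for i, t in enumerate(tensors):
        r, c = t.shape
        batch[i, :r, :c] = t      # copy the real values into the top-left corner
        mask[i, :r, :c] = True    # remember which positions are real
    return batch, mask

# Two hypothetical areas with different shapes
area_a = torch.ones(2, 3)
area_b = torch.full((1, 2), 2.0)
batch, mask = pad_and_stack([area_a, area_b])
print(batch.shape)  # one row per area, padded to the largest shape
```

Whether zero is a sensible padding value depends on what the entries mean; the mask makes it possible to ignore padded positions later, for example when averaging.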
That's all. Thank you very much in advance.
Pardon the swear words; please pay them no heed, they are all addressed to myself only.
So here is the code. I simply split it into two parts in a single Jupyter notebook file: one for developing the dataset, the other an attempt at adapting the link prediction code from the guide.
This one is the introduction part, containing commentary and my lines of thought
#0. UNDERLYING HYPOTHESIS
#1. DEVELOP THE DATASET
#2. ADAPT THE LINK PREDICTION FOR HETEROGENEOUS GRAPHS FIRST.
#0. UNDERLYING HYPOTHESIS
#This is a preliminary phase before exploring another model.
#The primary functions selected are fn.v_sub_u (the difference x) and a dot product with the edge feature (DGL has no fn.e_dot_x; the closest built-ins are fn.e_dot_u and fn.e_dot_v). This is a simple yet strong definition of spatial interaction strength (I think)
#This may be useful later: https://docs.dgl.ai/api/python/dgl.data.html#edge-prediction-datasets
This one is for dataset development (PART ONE)
#1. DEVELOP THE DATASET
#TAKEN FROM 'DATA PREPARATION INTO TENSOR FORM' Line 5.
#IT JUST OCCURRED TO ME: HOW TO ADDRESS SEVERED LINKS AS THE GRAPH EVOLVES? CAN WE JUST SET THEM TO UPDATED LINK FEATURES?
#Two areas are assumed to be connected only if they are at most one row/column apart (skipping at most 1); updates take that into account. Edges outside that assumption are not considered,
#therefore there are only 16 edge types.
#Can we assume the full configuration is set while the graph is added to one by one? Very tedious, too many assumptions. CAN WE MAKE IT TAKE ALL OF IT AT ONCE?
#A reversed return value from this assumption will be accepted
import torch
import pandas as pd
import dgl
#For this case I restricted my base data to the border of the layout in Area X, under 2 conditions (all on the border):
#Adjacent, but not immediate :
#14-Nov : (Area-A1, Area-A2)
#19-Nov : (Area-B1, Area-B2)
#21-Nov : (Area-C1, Area-C2, Area-C3, Area-C4)
#23-Nov : (Area-D5)
#Adjacent, and relatively immediate :
#18-Dec : (Area-E1, Area-E2, Area-E3, Area-E4, Area-E5, Area-E6)
#20-Dec : (Area-F1, Area-F2, Area-F3, Area-F4)
#Modified, pd.read_csv --> pd.read_excel https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
#If we skip the centre (it will be positioned last), how do we account for it later? Too much of the periphery would be overlooked. Actually, this is a good case for studying how the centre affects the others.
#The procedure described here (https://docs.dgl.ai/en/latest/api/python/dgl.dataloading.html) doesn't simplify loading from disk
#FOR TYPE 1
print("Loading xlsx...")
AreaA1 = pd.read_excel('C:/Users/Acer/DGLCONDA05/Data Type 1/Area-A1.xlsx')
AreaA2 = pd.read_excel('C:/Users/Acer/DGLCONDA05/Data Type 1/Area-A2.xlsx')
…
…
#Is it possible for the data type to be float32, as the guide suggests?
print("Converting to Tensor...")
Area_A1 = torch.tensor(AreaA1.values, dtype=torch.int64)
Area_A2 = torch.tensor(AreaA2.values, dtype=torch.int64)
…
…
torch.save(Area_A1, 'C:/Users/Acer/DGLCONDA05/Data Type 1/Area-A1.pt')
torch.save(Area_A2, 'C:/Users/Acer/DGLCONDA05/FDC Data Type 1/Area-A2.pt')
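The load/convert steps repeat the same three lines per area, so they can be driven by a dictionary instead of one variable per file. The sketch below is a minimal illustration, not the notebook's actual code: the helper `frames_to_tensors` is hypothetical, and two small in-memory DataFrames stand in for the results of `pd.read_excel`. It also shows that `torch.tensor` accepts `dtype=torch.float32` just as it accepts `torch.int64`; floating-point features are what learnable layers generally expect.

```python
import pandas as pd
import torch

def frames_to_tensors(frames, dtype=torch.float32):
    """Convert a dict of name -> DataFrame into a dict of name -> rank-2 tensor."""
    return {name: torch.tensor(df.values, dtype=dtype) for name, df in frames.items()}

# Stand-ins for the spreadsheets; with real files this could be
# {name: pd.read_excel(path) for name, path in paths.items()}
frames = {
    "Area_A1": pd.DataFrame([[1, 2], [3, 4]]),
    "Area_A2": pd.DataFrame([[5, 6, 7]]),
}
tensors = frames_to_tensors(frames)
for name, t in tensors.items():
    print(name, tuple(t.shape), t.dtype)
```

A dictionary like `tensors` can be written to disk with a single `torch.save(tensors, path)` call, which avoids maintaining one `.pt` path per area.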
#IS THERE A SIMPLER WAY TO DEFINE A HETEROGRAPH THAN THE ONE BELOW???
#Build a dictionary of node objects from node types and edge types, in a generalized form
#Node IDs are local to each node type, so a type that holds a single area uses ID 0
graph_data_type1 = {
    ('Area_A1', '0dx-2y0', 'Area_A2'): (torch.tensor([0]), torch.tensor([0])),
    ('Area_A2', '5dx1y1', 'Area_B1'): (torch.tensor([0]), torch.tensor([0])),
    …
}
#The graph has to exist before any feature can be attached to it
G = dgl.heterograph(graph_data_type1)
#How to define edge data more conveniently?
#Each edge type here holds one edge, so its feature tensor has shape (1, 3)
G.edges['0dx-2y0'].data['linkdata'] = torch.tensor([[0, -2, 0]])
G.edges['5dx1y1'].data['linkdata'] = torch.tensor([[5, 1, 1]])
…
…
#How to define nodes more conveniently? Does dict-ing like this work?
#Features are set per node type; the leading dimension is the number of nodes of that type
G.nodes['Area_A1'].data['gabungan'] = Area_A1.unsqueeze(0)
G.nodes['Area_A2'].data['gabungan'] = Area_A2.unsqueeze(0)
…
…
#How to define 'etype' here? FORTUNATELY everything is defined automatically, as in the 'Set/get Features for All Edges of a Single Edge Type' part of https://docs.dgl.ai/en/latest/generated/dgl.DGLGraph.edges.html
This part is for incorporating the link prediction code (PART TWO)
#2. ADAPT THE LINK PREDICTION FOR HETEROGENEOUS GRAPHS FIRST.
# h contains the node representations for each node type computed from
# the GNN defined in the previous section (Section 5.1).
# Maybe iterate over rows for the particular features of the data? No? It is a matrix, so no need; the function already accommodates it.
#BAHH, HOW TO LOAD THE DATA AGAIN???
#DOES apply_edges REALLY ITERATE OVER NODES?
#for sub, see https://docs.dgl.ai/generated/dgl.function.v_sub_u.html#dgl.function.v_sub_u
#for dot, see https://docs.dgl.ai/generated/dgl.function.u_dot_e.html#dgl.function.u_dot_e
import dgl
import torch
import dgl.function as fn
import pandas
import numpy
import networkx
import dgl.nn as dglnn
import torch.nn as nn
import torch.nn.functional as F
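class HeteroDotProductPredictor(nn.Module):
    def forward(self, G, h, etype):
        # h maps each node type to its representation tensor
        with G.local_scope():
            G.ndata['h'] = h
            # Edge-wise difference between destination and source representations
            G.apply_edges(fn.v_sub_u('h', 'h', 'diff'), etype=etype)
            edata = G.edges[etype].data
            # Dot product of that difference with the stored edge feature.
            # This assumes 'linkdata' exists on G and has the same width as h;
            # negative graphs carry no 'linkdata' unless it is attached first.
            return (edata['diff'] * edata['linkdata'].float()).sum(dim=-1, keepdim=True)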
def construct_negative_graph(G, k, etype):
    # etype is a canonical (source type, edge type, destination type) triple
    utype, _, vtype = etype
    src, dst = G.edges(etype=etype)
    # Pair every source node with k randomly drawn destination nodes
    neg_src = src.repeat_interleave(k)
    neg_dst = torch.randint(0, G.num_nodes(vtype), (len(src) * k,))
    return dgl.heterograph(
        {etype: (neg_src, neg_dst)},
        num_nodes_dict={ntype: G.num_nodes(ntype) for ntype in G.ntypes})
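The sampling step inside construct_negative_graph can be tried in isolation. The sketch below uses plain PyTorch; the helper name `corrupt_destinations` and the toy sizes are hypothetical, and the seeded generator is only there to make the draw repeatable.

```python
import torch

def corrupt_destinations(src, num_dst_nodes, k, generator=None):
    """Repeat each source k times and draw a random destination for each copy."""
    neg_src = src.repeat_interleave(k)
    neg_dst = torch.randint(0, num_dst_nodes, (len(src) * k,), generator=generator)
    return neg_src, neg_dst

src = torch.tensor([0, 1, 2])
gen = torch.Generator().manual_seed(0)
neg_src, neg_dst = corrupt_destinations(src, num_dst_nodes=4, k=3, generator=gen)
print(neg_src)        # each source appears k times in a row
print(neg_dst.shape)  # one random destination per repeated source
```

Two caveats follow from the arithmetic. If a destination node type contains a single node, `torch.randint(0, 1, ...)` can only return 0, so every "negative" edge coincides with the positive one. And with few nodes, a random destination can land on a true edge by chance.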
#THE PART BELOW IS NOT YET CLEAR TO ME... PLEASE SHED SOME LIGHT ON IT... See the explanation in the homogeneous graph part.
#https://docs.dgl.ai/en/0.5.x/guide/training-node.html
#Define the SAGE class first, as per https://docs.dgl.ai/en/0.5.x/guide/training-node.html
#Construct a two-layer GNN model
class SAGE(nn.Module):
    def __init__(self, in_feats, hid_feats, out_feats):
        super().__init__()
        self.conv1 = dglnn.SAGEConv(
            in_feats=in_feats, out_feats=hid_feats, aggregator_type='mean')
        self.conv2 = dglnn.SAGEConv(
            in_feats=hid_feats, out_feats=out_feats, aggregator_type='mean')

    def forward(self, graph, inputs):
        # inputs are features of nodes; use the graph passed in, not the global G
        h = self.conv1(graph, inputs)
        h = F.relu(h)
        h = self.conv2(graph, h)
        return h
#Need to explore RGCN more at https://github.com/dmlc/dgl/blob/master/examples/pytorch/rgcn-hetero/entity_classify.py
#But it seems irrelevant.
#All 'hetero_graph' are changed into 'G'
#RGCN is the heterogeneous model class from the guide's node classification chapter; it must be defined before Model
class Model(nn.Module):
    def __init__(self, in_features, hidden_features, out_features, rel_names):
        super().__init__()
        self.sage = RGCN(in_features, hidden_features, out_features, rel_names)
        self.pred = HeteroDotProductPredictor()

    def forward(self, G, neg_g, j, etype):
        # j is the dict of input features keyed by node type
        h = self.sage(G, j)
        return self.pred(G, h, etype), self.pred(neg_g, h, etype)
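def compute_loss(pos_score, neg_score):
    # Margin loss: penalize a negative edge that scores within 1 of its positive edge
    n_edges = pos_score.shape[0]
    # One positive score per row is broadcast against its k negative scores
    return (1 - pos_score.view(n_edges, 1) + neg_score.view(n_edges, -1)).clamp(min=0).mean()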
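Since the goal stated at the top is to compare several losses, here is a sketch of margin, BPR, and cross-entropy losses written against the same inputs. The function names and the toy scores are hypothetical; `pos` is assumed to hold one score per positive edge and `neg` to hold k negative scores per positive edge.

```python
import torch
import torch.nn.functional as F

def margin_loss(pos, neg, margin=1.0):
    # Hinge on (margin - positive + negative), averaged over all pairs
    return (margin - pos.unsqueeze(1) + neg).clamp(min=0).mean()

def bpr_loss(pos, neg):
    # Bayesian personalized ranking: -log sigmoid(positive - negative)
    return -F.logsigmoid(pos.unsqueeze(1) - neg).mean()

def bce_loss(pos, neg):
    # Binary cross-entropy with label 1 for positive edges and 0 for negatives
    scores = torch.cat([pos, neg.reshape(-1)])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg.reshape(-1))])
    return F.binary_cross_entropy_with_logits(scores, labels)

pos = torch.tensor([2.0, 0.5])             # shape (E,)
neg = torch.tensor([[0.0, 1.0, 3.0],
                    [0.5, 0.5, 0.5]])      # shape (E, k)
print(margin_loss(pos, neg).item())
print(bpr_loss(pos, neg).item())
print(bce_loss(pos, neg).item())
```

All three push positive scores above negative ones; they differ in how strongly they keep pushing once a pair is already ranked correctly. The margin loss stops at the margin, while BPR and cross-entropy keep a small gradient.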
k = 3 #I hope this is reasonable
model = Model(5, 5, 5, G.etypes) #I don't know how to adjust these sizes; is this reasonable???
#'feats' means feature size; I will replace user and item with source and destination
#WTF do user and item stand for? I just play along and change them into source and destination, though it still makes no sense to me.
#Features are keyed by node type; this assumes every type stores a float 'gabungan' tensor of width 5
node_features = {ntype: G.nodes[ntype].data['gabungan'] for ntype in G.ntypes}
opt = torch.optim.Adam(model.parameters())
#https://docs.dgl.ai/en/0.4.x/generated/dgl.DGLGraph.edges.html, ":" means all, right?
#":" cannot stand in for an edge type, so one canonical edge type is picked explicitly (the first one, as a placeholder)
etype = G.canonical_etypes[0]
for epoch in range(10):
    negative_graph = construct_negative_graph(G, k, etype)
    pos_score, neg_score = model(G, negative_graph, node_features, etype)
    loss = compute_loss(pos_score, neg_score)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(loss.item())