Loading a custom dataset with DataLoader results in an error with no error message

BearBiscuit05 · October 11, 2023, 1:13am

I am using UK2006 to simulate GraphSage training, but I encounter a ‘Segmentation fault (core dumped)’ error when I try to load the data using dataloader. I’m not sure what’s causing this issue, and I would appreciate some assistance. Additionally, I’m running the program on a machine with an A100 GPU that has 500GB of memory, so I think it’s not a memory issue.

import numpy as np
import torch
import dgl
from dgl.dataloading import DataLoader, NeighborSampler

path = "/home/xxx/workspace/data"
dataset = "uk-2006-05"
graphbin = "%s/%s/graph.bin" % (path,dataset)
labelbin = "%s/%s/labels.bin" % (path,dataset)
featsbin = "%s/%s/feats_%d.bin" % (path,dataset,100)
# graph.bin stores the edges as interleaved int32 (src, dst) pairs
edges = np.fromfile(graphbin,dtype=np.int32)
srcs = torch.tensor(edges[::2])
dsts = torch.tensor(edges[1::2])
g = dgl.graph((srcs,dsts))
# 100-dimensional float32 features and int64 labels, truncated to 77741023 nodes
feats = np.fromfile(featsbin,dtype=np.float32).reshape(-1,100)
feats_tmp = feats[:77741023]
label = np.fromfile(labelbin,dtype=np.int64)
label = label[:77741023]
g.ndata['feat'] = torch.tensor(feats_tmp)
g.ndata['label'] = torch.tensor(label)
# use the first 1% of node IDs as training seeds
trainnum = int(77741023 * 0.01)
train_idx = np.arange(trainnum,dtype=np.int32)
sampler = NeighborSampler([10,10,10])
use_uva = True
print("flag...")
train_dataloader = DataLoader(g, train_idx, sampler, device='cuda',
                                  batch_size=16, shuffle=True,
                                  drop_last=False, num_workers=0,
                                  use_uva=use_uva)

error msg:

flag...
Segmentation fault (core dumped)

BarclayII · October 12, 2023, 1:04am

Segfaults are indeed hard to debug. Could you try the following one by one to help us locate the problem?

Change 1: Replace g with a random graph (e.g. with dgl.rand_graph). The reason is that I saw srcs and dsts are strided arrays and I’m not sure if strided arrays are properly handled.
Change 2: Replace train_idx with an int64 torch tensor. The reason is that I’m not sure if a numpy int32 array is properly supported in DGL.
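
For reference, a related way to rule out both suspects without swapping the graph is to make contiguous int64 copies before handing anything to DGL. This is only a sketch with NumPy: the tiny `edges` array is a made-up stand-in for the contents of graph.bin, and whether strided or int32 inputs are actually the cause here is an assumption still to be tested.

```python
import numpy as np

# Hypothetical stand-in for np.fromfile(graphbin, dtype=np.int32):
# a flat array of interleaved (src, dst) pairs.
edges = np.array([0, 1, 1, 2, 2, 0, 3, 1], dtype=np.int32)

# Slicing with a step returns a strided view, not a contiguous array.
srcs_view = edges[::2]
dsts_view = edges[1::2]
print("view contiguous:", srcs_view.flags["C_CONTIGUOUS"])

# Make contiguous int64 copies of the endpoints.
srcs = np.ascontiguousarray(srcs_view, dtype=np.int64)
dsts = np.ascontiguousarray(dsts_view, dtype=np.int64)
print("copy contiguous:", srcs.flags["C_CONTIGUOUS"], srcs.dtype)

# Build the seed node IDs as int64 instead of int32 (about 1% of the nodes).
num_nodes = int(max(srcs.max(), dsts.max())) + 1
train_idx = np.arange(max(1, num_nodes // 100), dtype=np.int64)
```

The resulting arrays could then be wrapped with torch.from_numpy before being passed to dgl.graph and DataLoader.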

BearBiscuit05 · October 12, 2023, 3:06am

Hello, thank you very much for your response. I have made the following attempts based on your suggestions: First, I changed train_idx to the int64 type, but after rerunning, the same error still occurred. Then, I replaced the graph g with the Twitter graph using the same method, and processed srcs and dsts in the same way as with the UK graph, with no changes to other operations like feat and label. After running, the program did not throw any errors. In fact, I performed the same processing steps for all five datasets, but only the UK graph resulted in a segmentation fault. I hope this feedback can be of some help. By the way, both the UK2006 and UK2007 datasets encounter this issue.

BarclayII · October 12, 2023, 3:50am

So the problem is in the UK graphs themselves. Could you check the value of g.num_nodes()? I saw the train indices can go as large as 77741023, but g.num_nodes() might be lower than that.
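
A rough sketch of that kind of check, done on the raw edge array before any graph is built. The helper name `check_edge_ids` and the toy array are made up for illustration. It relies on dgl.graph inferring the node count as the largest node ID plus one when num_nodes is not given.

```python
import numpy as np

def check_edge_ids(edges, expected_num_nodes):
    """Report basic problems in a flat [src0, dst0, src1, dst1, ...] array."""
    if edges.size == 0:
        return ["edge array is empty"]
    problems = []
    if edges.size % 2 != 0:
        problems.append("odd number of entries; the last edge is incomplete")
    min_id = int(edges.min())
    max_id = int(edges.max())
    if min_id < 0:
        # Negative IDs can come from reading the file with the wrong dtype.
        problems.append(f"negative node ID {min_id}")
    if max_id >= expected_num_nodes:
        problems.append(
            f"node ID {max_id} is out of range for {expected_num_nodes} nodes")
    elif max_id + 1 < expected_num_nodes:
        problems.append(
            f"edges imply only {max_id + 1} nodes, "
            f"fewer than the expected {expected_num_nodes}")
    return problems

# Toy data; for the real graph this would be the array read from graph.bin
# and expected_num_nodes would be 77741023.
toy = np.array([0, 1, 1, 2, 2, 0], dtype=np.int32)
print(check_edge_ids(toy, expected_num_nodes=3))
print(check_edge_ids(toy, expected_num_nodes=5))
```

An empty list means the IDs are at least consistent with the expected node count; it does not prove the file is otherwise well-formed.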

BearBiscuit05 · October 12, 2023, 4:15am

These are the output messages related to the dataloader. The training IDs were randomly selected and make up 1% of the graph's nodes.

BearBiscuit05 · October 13, 2023, 7:52am

Today, I reconstructed the UK graph again, and this time there were no errors, confirming that the issue is related to the UK dataset. However, the lack of error messages made it difficult to pinpoint the exact cause of the problem. If I have the time, I will continue to investigate and provide feedback.
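
One thing that may help with the missing error message is Python's standard faulthandler module, which prints the Python traceback when the process receives a fatal signal such as SIGSEGV. A minimal sketch follows; it only shows which Python line was executing, not the native cause inside the library.

```python
import faulthandler

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL that
# dump the Python traceback (to sys.stderr by default) before the crash.
faulthandler.enable()

# ... build the graph and construct the DataLoader after this point ...
print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be enabled without editing the script by running it with `python -X faulthandler`.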
