SIGSEGV: Segmentation fault during GraphBolt training

Hey team, I am running into a segmentation fault with GraphBolt and I'm unsure how to debug it, since my graph is very small:

Graph(num_nodes={'account': 98, 'imperative': 27},
      num_edges={('account', 'has_imperative', 'imperative'): 264},
      metagraph=[('account', 'imperative', 'has_imperative')])

The sampling code I wrote is below:

import torch
from functools import partial

import dgl.graphbolt as gb

# Create the sampling graph
sampling_graph = gb.from_dglgraph(dgl_graph)

# Create the itemset of training edge pairs
src, dst = dgl_graph.edges(etype="has_imperative")
node_pairs = list(zip(src.tolist(), dst.tolist()))
train_itemset = gb.ItemSet(
    items=(torch.tensor(node_pairs),),
    names=('node_pairs',),
)

# Create the data loader
datapipe = gb.ItemSampler(train_itemset, batch_size=1, shuffle=True)
datapipe = datapipe.copy_to(device)
datapipe = datapipe.sample_uniform_negative(sampling_graph, 5)
datapipe = datapipe.sample_neighbor(sampling_graph, [5, 5])
datapipe = datapipe.transform(partial(gb.exclude_seed_edges, include_reverse_edges=True))
datapipe = datapipe.fetch_feature(feature_store, node_feature_keys=["feat"])
data_loader = gb.DataLoader(datapipe)
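
For completeness, the segfault is raised while iterating the loader; the consuming loop is roughly this (simplified, with the model and optimizer omitted):

for step, mini_batch in enumerate(data_loader):
    # The SIGSEGV fires somewhere during iteration
    print(step, mini_batch)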

Not sure if this is useful, but here is how I created the feature_store using TorchBasedFeatureStore:

from dgl.graphbolt import TorchBasedFeature, TorchBasedFeatureStore

# TODO: Replace with actual feature matrices
account_features = generate_random_features(98, 10)
imperative_features = generate_random_features(27, 10)

# Create the feature store
# (note: these two in-memory wrappers end up unused below; the store
# is built from the on-disk metadata instead)
account_feature = TorchBasedFeature(account_features)
imperative_feature = TorchBasedFeature(imperative_features)
torch.save(account_features, "/tmp/account_features.pt")
torch.save(imperative_features, "/tmp/imperative_features.pt")

# Create the OnDiskFeatureData metadata
feat_data = [
    gb.OnDiskFeatureData(domain="node", type="account", name="feat",
                         format="torch", path="/tmp/account_features.pt",
                         in_memory=True),
    gb.OnDiskFeatureData(domain="node", type="imperative", name="feat",
                         format="torch", path="/tmp/imperative_features.pt",
                         in_memory=True),
]
feature_store = TorchBasedFeatureStore(feat_data)
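
In case it helps narrow things down, reading a few rows back from the store works as a quick sanity check (a sketch; I am assuming the read(domain, type_name, feature_name, ids) signature that TorchBasedFeatureStore inherits from BasicFeatureStore):

# Read a few rows back to check that the store itself loads correctly
ids = torch.tensor([0, 1, 2])
account_rows = feature_store.read("node", "account", "feat", ids)
print(account_rows.shape)  # expected: torch.Size([3, 10])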

The error itself is uninformative, so any guidance on how to debug this would be appreciated. Here is some further information on my versions and infrastructure:

Torch version: 2.1.2+cu121, DGL version: 2.1.0+cu121
DGL installed by running: !pip install dgl -f https://data.dgl.ai/wheels/cu121/repo.html
Compute: g5.4xlarge (64 GB memory, 24 GB GPU memory, 16 vCPUs) running on Databricks

Could you post the error? Even if it seems uninformative to you, we might be able to catch something.
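
If the process dies before Python prints anything, enabling the standard library's faulthandler at the very top of your script dumps the Python traceback when SIGSEGV arrives, which usually shows which pipeline stage crashed:

import faulthandler
faulthandler.enable()  # print the Python traceback on SIGSEGV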

I think the problem is that you are not making the graph and the features GPU accessible, via either pinned memory or the CUDA device. You have copy_to right after the item sampler in your data pipeline, which means every stage after it runs on the GPU (when device is a CUDA device), while the sampling graph and feature store those stages read from still live in ordinary CPU memory. To fix this, make both of them GPU accessible in one of two ways: pin their memory or move them to the GPU; a combined sketch follows after the options below.

You can resolve this issue with either of the following for the graph:

  1. sampling_graph = gb.from_dglgraph(dgl_graph).to(device)
  2. sampling_graph = gb.from_dglgraph(dgl_graph).pin_memory_()

And with either of the following for the features:

  1. feature_store = TorchBasedFeatureStore(feat_data).to(device)
  2. feature_store = TorchBasedFeatureStore(feat_data).pin_memory_()
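
Putting it together with the pinning variant, the relevant part of your script would look roughly like this (a sketch; I am assuming device is a CUDA device such as torch.device("cuda")):

device = torch.device("cuda")  # assumption: training on a CUDA device

# Make the graph and features GPU accessible by pinning their memory
# (alternatively, move both to the GPU with .to(device))
sampling_graph = gb.from_dglgraph(dgl_graph).pin_memory_()
feature_store = TorchBasedFeatureStore(feat_data).pin_memory_()

datapipe = gb.ItemSampler(train_itemset, batch_size=1, shuffle=True)
datapipe = datapipe.copy_to(device)  # stages after this run on the GPU
datapipe = datapipe.sample_uniform_negative(sampling_graph, 5)
datapipe = datapipe.sample_neighbor(sampling_graph, [5, 5])
datapipe = datapipe.transform(
    partial(gb.exclude_seed_edges, include_reverse_edges=True))
datapipe = datapipe.fetch_feature(feature_store, node_feature_keys=["feat"])
data_loader = gb.DataLoader(datapipe)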