Hello!
After upgrading to DGL 2.4.0, I am encountering a serialization error when using GraphBolt's OnDiskNpyArray in distributed training. The error occurs when the dataset objects are serialized so they can be passed to multiple GPU worker processes.
Pseudo code sample:
import dgl.graphbolt as gb
from pyspark.ml.torch.distributor import TorchDistributor

def load_data(graph_path):
    dataset = gb.OnDiskDataset(graph_path).load(tasks="link_prediction")
    graph = dataset.graph
    features = dataset.feature
    train_set = dataset.tasks[0].train_set
    validation_set = dataset.tasks[0].validation_set
    test_set = dataset.tasks[0].test_set
    return graph, features, train_set, validation_set, test_set

graph, features, train_set, validation_set, test_set = load_data(GRAPH_PATH)

# The error happens when TorchDistributor tries to serialize these objects
# to pass them to run_instance.
distributor = TorchDistributor(
    num_processes=world_size,
    local_mode=False,
    use_gpu=True,
)
distributor.run(
    run_instance,
    -1,
    world_size,
    graph,  # these objects trigger the serialization error
    features,
    train_set,
    validation_set,
    test_set,
)
This gives the following error:
RuntimeError: Tried to serialize object __torch__.torch.classes.graphbolt.OnDiskNpyArray which does not have a __getstate__ method defined!
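The failure seems to reduce to plain pickling rather than anything TorchDistributor-specific. If my read is right that features is the object backed by OnDiskNpyArray, this minimal check (my own sketch, not from the DGL docs) reproduces it without Spark involved:

import pickle

# Minimal reproduction attempt. Assumption: `features` is the feature store
# returned by load_data above, backed by on-disk .npy files.
pickle.dumps(features)
# -> RuntimeError: Tried to serialize object
#    __torch__.torch.classes.graphbolt.OnDiskNpyArray which does not have a
#    __getstate__ method defined!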
My environment:
- DGL Version: 2.4.0
- Backend Library & Version: torch==2.3.1 (CUDA 12.1)
- OS: Linux
- How you installed DGL (conda, pip, source): pip (via Databricks wheel)
- Build command you used (if compiling from source): N/A
- Python version: 3.11
- CUDA/cuDNN version: 12.1
- GPU models and configuration: AWS g5.48xlarge (A10G)
- Any other relevant information: a possible workaround is sketched below.
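The direction I am considering is to pass only the dataset path through TorchDistributor and call load_data inside the worker entry point, so the OnDiskNpyArray-backed objects are created per process and never pickled. This is an untested sketch; run_instance is my placeholder entry point, not an API from DGL or PySpark:

def run_instance(run_id, world_size, graph_path):
    # Loading inside each worker means the OnDiskNpyArray-backed objects are
    # constructed locally and never cross a process boundary via pickle.
    graph, features, train_set, validation_set, test_set = load_data(graph_path)
    # ... training loop as before ...

distributor.run(run_instance, -1, world_size, GRAPH_PATH)

Is loading inside each worker the intended pattern for GraphBolt on-disk datasets, or should these objects be picklable in 2.4.0?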
Thanks in advance!