OnDiskNpyArray RuntimeError after upgrading DGL 2.3.0 -> 2.4.0

Hello!
After upgrading to DGL 2.4.0, I am hitting a serialization error when using GraphBolt's OnDiskNpyArray in a distributed training setup. The error occurs when the loaded dataset objects are serialized so they can be passed to the per-GPU worker processes.

Simplified code sample:

import dgl.graphbolt as gb
from pyspark.ml.torch.distributor import TorchDistributor


def load_data(graph_path):
    # Load the preprocessed on-disk dataset and extract the pieces needed
    # for link prediction training.
    dataset = gb.OnDiskDataset(graph_path).load(tasks="link_prediction")
    graph = dataset.graph
    features = dataset.feature
    train_set = dataset.tasks[0].train_set
    validation_set = dataset.tasks[0].validation_set
    test_set = dataset.tasks[0].test_set
    return graph, features, train_set, validation_set, test_set


graph, features, train_set, validation_set, test_set = load_data(GRAPH_PATH)

# The error happens when TorchDistributor serializes these objects
# to ship them to run_instance on each worker process.
distributor = TorchDistributor(
    num_processes=world_size,
    local_mode=False,
    use_gpu=True,
)

distributor.run(
    run_instance,
    -1,
    world_size,
    graph,  # these GraphBolt objects trigger the serialization error
    features,
    train_set,
    validation_set,
    test_set,
)

This gives the following error:
RuntimeError: Tried to serialize object __torch__.torch.classes.graphbolt.OnDiskNpyArray which does not have a __getstate__ method defined!
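
From the message it looks like the TorchScript class backing OnDiskNpyArray has no __getstate__, so it cannot be pickled when TorchDistributor ships the arguments to the workers. The only workaround I can think of is to pass just the dataset path and call load_data inside the training function on every worker, roughly as sketched below (argument names are illustrative; the first two arguments mirror the original call). Is per-worker loading the recommended pattern with GraphBolt, or should these objects still be serializable in 2.4.0 as they apparently were in 2.3.0?

# Workaround sketch (untested): serialize only the path, load on each worker.
def run_instance(local_rank, world_size, graph_path):
    # Each worker opens the on-disk dataset itself, so no GraphBolt
    # objects ever cross the process boundary.
    graph, features, train_set, validation_set, test_set = load_data(graph_path)
    # ... training loop ...

distributor.run(
    run_instance,
    -1,
    world_size,
    GRAPH_PATH,  # a plain string, which pickles fine
)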

My environment:

  • DGL version: 2.4.0
  • Backend library & version: torch==2.3.1, CUDA 12.1
  • OS: Linux
  • How DGL was installed: pip (via Databricks wheel)
  • Build command (if compiling from source): N/A
  • Python version: 3.11
  • CUDA/cuDNN version: 12.1
  • GPU models and configuration: AWS g5.48xlarge (A10G)

Thanks in advance!