Our use case involves reading many Parquet files on GCS and creating a custom DGLDataset that a GraphDataLoader consumes. I’d like to implement this pipeline in Apache Beam (and have done so, for one Parquet file).
The flow looks something like this:
Read Parquet files → Convert to graph[] → load into DGLDataset → create GraphDataLoader w/ that DGLDataset → incrementally train graph classification in batches
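For the single-file version, the read-and-convert stage looks roughly like the sketch below. The column names (`src`, `dst`, `num_nodes`, `label`) and the GCS path are placeholders, not our real schema:

```python
# Rough sketch of the read-and-convert stage, assuming each Parquet row holds the
# edge lists and label for one graph; column names and the GCS path are placeholders.
import apache_beam as beam
import dgl
import torch


def row_to_graph(row):
    # Build one DGLGraph plus its label from a single Parquet record.
    g = dgl.graph(
        (torch.tensor(row["src"]), torch.tensor(row["dst"])),
        num_nodes=row["num_nodes"],
    )
    return g, row["label"]


with beam.Pipeline() as pipeline:
    graphs = (
        pipeline
        | "ReadParquet" >> beam.io.ReadFromParquet("gs://my-bucket/graphs/*.parquet")
        | "ToGraph" >> beam.Map(row_to_graph)
        # In the one-file case, the resulting graphs are pulled back into the
        # driver process to build the DGLDataset.
    )
```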
With multiple files (84 GB in total), this process naturally runs out of memory (OOM).
I’m wondering whether anyone has written a good Beam pipeline to do this, or whether there’s a much better approach to scaling across multiple workers. I realize DGL distributed training might be a next step, but for now we’re focused on being able to create the DGLDataset using Beam, if possible.
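For reference, the in-memory DGLDataset / GraphDataLoader piece is essentially the sketch below (class name and details are placeholders; the dummy graphs at the end are just to show usage):

```python
# Minimal in-memory dataset wrapping already-built graphs; names are placeholders.
import dgl
import torch
from dgl.data import DGLDataset
from dgl.dataloading import GraphDataLoader


class ParquetGraphDataset(DGLDataset):
    def __init__(self, graphs, labels):
        self._graphs = graphs
        self._labels = labels
        super().__init__(name="parquet_graphs")

    def process(self):
        # Graphs are built upstream (e.g. by the Beam stage), so nothing to do here.
        pass

    def __getitem__(self, idx):
        return self._graphs[idx], self._labels[idx]

    def __len__(self):
        return len(self._graphs)


# Tiny usage example with dummy data.
graphs = [dgl.rand_graph(num_nodes=5, num_edges=10) for _ in range(8)]
labels = torch.randint(0, 2, (8,))
dataset = ParquetGraphDataset(graphs, labels)
loader = GraphDataLoader(dataset, batch_size=4, shuffle=True)
for batched_graph, batch_labels in loader:
    pass  # one training step per mini-batch would go here
```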