Our use case involves reading many Parquet files on GCS and creating a custom DGLDataset that a GraphDataLoader consumes. I’d like to implement this pipeline in Apache Beam (and have done so, for one Parquet file).
The flow looks something like this:
Read Parquet files → Convert to graph[] → load into DGLDataset → create GraphDataLoader w/ that DGLDataset → incrementally train graph classification in batches
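For the single-file version, the read-and-convert stage looks roughly like the sketch below. The column names (`src`, `dst`, `num_nodes`, `label`) and the GCS path are placeholders, not our real schema:

```python
# Rough sketch of the read-and-convert stage, assuming each Parquet row holds the
# edge lists and label for one graph; column names and the GCS path are placeholders.
import apache_beam as beam
import dgl
import torch


def row_to_graph(row):
    # Build one DGLGraph plus its label from a single Parquet record.
    g = dgl.graph(
        (torch.tensor(row["src"]), torch.tensor(row["dst"])),
        num_nodes=row["num_nodes"],
    )
    return g, row["label"]


with beam.Pipeline() as pipeline:
    graphs = (
        pipeline
        | "ReadParquet" >> beam.io.ReadFromParquet("gs://my-bucket/graphs/*.parquet")
        | "ToGraph" >> beam.Map(row_to_graph)
        # In the one-file case, the resulting graphs are pulled back into the
        # driver process to build the DGLDataset.
    )
```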
With multiple files (84 GB in total), this process naturally runs out of memory (OOM).
I’m wondering whether anyone has written a good Beam pipeline to do this, or whether there’s a much better approach to scaling across multiple workers. I realize DGL distributed training might be a next step, but for now we’re focused on being able to create the DGLDataset using Beam, if possible.
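For reference, the in-memory DGLDataset / GraphDataLoader piece is essentially the sketch below (class name and details are placeholders; the dummy graphs at the end are just to show usage):

```python
# Minimal in-memory dataset wrapping already-built graphs; names are placeholders.
import dgl
import torch
from dgl.data import DGLDataset
from dgl.dataloading import GraphDataLoader


class ParquetGraphDataset(DGLDataset):
    def __init__(self, graphs, labels):
        self._graphs = graphs
        self._labels = labels
        super().__init__(name="parquet_graphs")

    def process(self):
        # Graphs are built upstream (e.g. by the Beam stage), so nothing to do here.
        pass

    def __getitem__(self, idx):
        return self._graphs[idx], self._labels[idx]

    def __len__(self):
        return len(self._graphs)


# Tiny usage example with dummy data.
graphs = [dgl.rand_graph(num_nodes=5, num_edges=10) for _ in range(8)]
labels = torch.randint(0, 2, (8,))
dataset = ParquetGraphDataset(graphs, labels)
loader = GraphDataLoader(dataset, batch_size=4, shuffle=True)
for batched_graph, batch_labels in loader:
    pass  # one training step per mini-batch would go here
```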