NodeDataLoader for very large graphs

We’ve implemented a NodeDataLoader per the DGL example, using a small Parquet file instead of CSV.

In reality, we have about 280 Parquet files, with a total of over 280 million nodes. These nodes already have node ids that are unique to their data source (Neo4j). The NodeDataLoader appears to require an input graph whose source/destination node ids are indexed starting from zero. There is no way for us to 'renumber' our node ids without reading in our entire 280 million-node Neo4j dataset and creating some Neo4j-to-DGL node id remapping.

Is there a way for us to incrementally load data to NodeDataLoader, using our own DGLDataset, that doesn’t require us to renumber all nodes to be 0-based, as DGL appears to require?
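To make the problem concrete, here is a sketch (pure Python, with illustrative names only) of the kind of incremental renumbering we're trying to avoid having to do up front:

```python
def remap_ids(edge_batches):
    """Incrementally assign 0-based ids to external (e.g. Neo4j) node ids.

    edge_batches: iterable of (src_ids, dst_ids) pairs, one per Parquet file.
    Returns (mapping, edges) where mapping is {external_id: dgl_id} and
    edges is a list of (src, dst) pairs in the 0-based id space that a
    DGL graph expects.
    """
    mapping = {}
    edges = []
    for src_ids, dst_ids in edge_batches:
        for s, d in zip(src_ids, dst_ids):
            for ext in (s, d):
                if ext not in mapping:
                    mapping[ext] = len(mapping)  # next free 0-based id
            edges.append((mapping[s], mapping[d]))
    return mapping, edges

# Example: two "files" with non-contiguous Neo4j ids
batches = [([1001, 1005], [1005, 1042]), ([1042], [1001])]
mapping, edges = remap_ids(batches)
# mapping: {1001: 0, 1005: 1, 1042: 2}; edges: [(0, 1), (1, 2), (2, 0)]
```

At 280 million nodes, even this dict alone is a sizable in-memory object, which is why we were hoping DGL could consume the original ids directly.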


What do you mean by incrementally loading data to NodeDataLoader? Do you want to load/sample from a dynamic graph (always changing, with more nodes being added from Parquet files) with NodeDataLoader?

You said you have already created your own DGLDataset? So you have a final graph that contains all the nodes from the 280 Parquet files?

By ‘incrementally load’, I mean build up the DGL graph by reading successive batches of Parquet files with a NodeDataLoader. Since DGL graphs have 0-based source[], dest[] node ids, we can’t use the input (Neo4j) node ids as the DGL source[], dest[] ids. We’d have to have some mapping between Neo4j and DGL node ids, which we could do with extra steps in our pipeline (e.g., read all PQ files, create a Neo4j-to-DGL node id mapping, then use those ids in the NodeDataLoader). I’m trying to determine whether this is necessary.
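That extra-steps pipeline might look something like this sketch (using pandas; the column names and the use of `factorize` are illustrative assumptions, and in practice each frame would come from `pd.read_parquet(path)`):

```python
import pandas as pd

def build_id_mapping(edge_frames):
    """Pass over all edge DataFrames and renumber Neo4j ids to 0-based ids.

    edge_frames: list of DataFrames with 'src' and 'dst' columns holding
    Neo4j node ids. Returns (mapping, edges): a Series mapping Neo4j id ->
    0-based DGL id, and the concatenated edges with remapped id columns.
    """
    all_edges = pd.concat(edge_frames, ignore_index=True)
    # factorize assigns a dense 0-based code to each unique id, in order seen
    codes, uniques = pd.factorize(
        pd.concat([all_edges["src"], all_edges["dst"]], ignore_index=True)
    )
    n = len(all_edges)
    all_edges["src_dgl"] = codes[:n]
    all_edges["dst_dgl"] = codes[n:]
    mapping = pd.Series(range(len(uniques)), index=uniques, name="dgl_id")
    return mapping, all_edges

# Two toy "Parquet files" with non-contiguous Neo4j ids
frames = [pd.DataFrame({"src": [1001, 1005], "dst": [1005, 1042]}),
          pd.DataFrame({"src": [1042], "dst": [1001]})]
mapping, edges = build_id_mapping(frames)
```

The `src_dgl`/`dst_dgl` columns could then be fed to `dgl.graph(...)`, and the `mapping` Series persisted to translate predictions back to Neo4j ids.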

For example, if we use distributed DGL training, it seems we need to create the entire DGL graph before it can be partitioned for distributed training. We haven’t yet got a DGL graph with all the nodes from the 280 Parquet files; I imagine we may need to further sample it before creating the DGL graph, unless the DGL graphs used in node classification algorithms can handle 280M nodes?

What do your Parquet files contain? I think you should have some Parquet files representing the node data, and others representing the edges?

I can think of several options:

  1. You could try ParMETIS to partition your graph in a distributed manner.
  2. Or you can partition your graph in Neo4j or with another tool, and treat each partition as a single graph. You could then adopt whatever strategy you like, e.g. train partition by partition, or train each partition on a single machine, etc.
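A toy sketch of option 2 (a naive hash partitioner by source node id; a real partitioner such as METIS/ParMETIS would minimize edge cuts instead, but the downstream idea is the same):

```python
def hash_partition(edges, num_parts):
    """Toy edge partitioner: assign each edge to the partition that
    'owns' its source node (ownership = node_id % num_parts).

    Each resulting partition is a smaller edge list that can be turned
    into its own graph and trained on a single machine.
    """
    parts = [[] for _ in range(num_parts)]
    for src, dst in edges:
        parts[src % num_parts].append((src, dst))
    return parts

edges = [(0, 1), (1, 2), (2, 0), (3, 1)]
parts = hash_partition(edges, 2)
# parts[0] holds the even-src edges, parts[1] the odd-src edges
```

Note that with any edge-cut scheme, edges crossing partitions need halo/boundary nodes duplicated into both partitions, which DGL's distributed partitioning handles for you.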

Thanks for your suggestions, BarclayII. I’ll give them some more thought.