Inclusion of the AirfRANS dataset in DGL

FlorentExtrality · February 2, 2024, 1:01pm

Hello,
First, thank you for all the work you have done to democratize Geometric Deep Learning!

In 2022, I generated a dataset of 1000 RANS simulations of airflow around airfoils in a subsonic flight regime. This led to a publication in the Datasets and Benchmarks Track of NeurIPS 2022 (AirfRANS: High Fidelity Computational Fluid Dynamics Dataset for Approximating Reynolds-Averaged Navier–Stokes Solutions | OpenReview).

As it is a point cloud based dataset, I proposed to include it in PyTorch Geometric in 2023, which has been done since (torch_geometric.datasets.AirfRANS — pytorch_geometric documentation). Now, I wanted to know if it would be possible to make it available directly in DGL too?

Best,
Florent Bonnet

Rhett-Ying · February 4, 2024, 1:27am

We’re glad to make it available in DGL. We’ve shipped GraphBolt since DGL 2.0, so could you follow the documentation to compose the dataset into an OnDiskDataset? Once that’s done, we can host the dataset on our server. Please feel free to contact us if you hit any issues during dataset composition.
When the dataset is ready, could you please provide an end-to-end example that works with it, or modify an existing example to work with it?

FlorentExtrality · February 12, 2024, 5:13pm

Hi!
Thanks for accepting it! I am currently working on it and learning how GraphBolt works.
As I understand it, I only have to save the data in NumPy format along with the metadata.yaml file, right? Will the data then be handled automatically by the dgl.graphbolt.BuiltinDataset class? And does this not require any pull request from my side?

In addition, AirfRANS is a point-cloud-based dataset: no edges are provided with the raw data. You can create some if you want, for example with a radius graph, but nothing is set by default. I do not know whether the GraphBolt framework handles this case.

Finally, is there no longer any need to code a DGLDataset class? Or should I propose both?

Best,
Florent

FlorentExtrality · February 16, 2024, 3:11pm

Hi!
I proposed an implementation for the AirfRANS dataset via the DGLDataset class here.
I also have another question concerning the GraphBolt implementation: I do not understand whether GraphBolt supports multiple graphs/point clouds. As I understand it, the metadata.yaml file has a graph field designed to handle only one graph. How should I write this file to handle multiple graphs?

Best,
Florent

Rhett-Ying · February 19, 2024, 1:35am

Only a single graph is supported for now.
There is no need to code a DGLDataset anymore.
As a good starting point, you could download one of the BuiltinDataset datasets for reference, such as ogbn-mag.
Once the dataset is ready, you need to test it with an end-to-end example. Adding support to an existing example is preferable.
Then you could pass me the dataset, and we’ll upload it to our public repo and file the required pull request.

FlorentExtrality · February 19, 2024, 5:01pm

Hi!
Thanks for your answers. When I’m creating the OnDiskDataset it looks like the “edges” part of the metadata.yaml is mandatory. I have tried to fill it with an empty .csv but it does not seem to work.
Is there a canonical way of implementing point clouds that have no edges in GraphBolt?

By the way, as GraphBolt handles only single graphs, I have concatenated all the simulations into one array and added an index feature so that the individual simulations can be recovered.
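
For illustration, the concatenation trick described here can be sketched with NumPy roughly as follows. This is only a minimal sketch, not the actual AirfRANS preprocessing: the helper names `concat_simulations` and `split_simulations`, the `sim_id` index array, and the toy array shapes are all assumptions made for the example.

```python
import numpy as np

def concat_simulations(simulations):
    """Stack per-simulation (n_i, d) arrays into one array plus a simulation index."""
    data = np.concatenate(simulations, axis=0)
    # sim_id[j] is the index of the simulation that row j came from.
    sim_id = np.concatenate(
        [np.full(len(s), i, dtype=np.int64) for i, s in enumerate(simulations)]
    )
    return data, sim_id

def split_simulations(data, sim_id):
    """Recover the list of per-simulation arrays from the concatenated form."""
    # Rows of one simulation are contiguous, so split wherever the index changes.
    boundaries = np.flatnonzero(np.diff(sim_id)) + 1
    return np.split(data, boundaries)

# Toy stand-in for the real data: 3 "simulations" with different point counts.
rng = np.random.default_rng(0)
sims = [rng.normal(size=(n, 2)) for n in (5, 3, 4)]
data, sim_id = concat_simulations(sims)
recovered = split_simulations(data, sim_id)
```

Storing the index as one extra per-node column keeps everything in a single array, at the cost of relying on each simulation's rows staying contiguous and in order.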

Rhett-Ying · February 20, 2024, 4:37am

No. If no edges exist, how would the graph (all nodes are isolated) be used for training and inference? Let’s see whether GraphBolt should support such a scenario.

What is a simulation here? Is it the training/val/test set? Could you elaborate on this?

FlorentExtrality · February 20, 2024, 11:04am

A simulation is the result of a Computational Fluid Dynamics (CFD) solver for a given airfoil and boundary conditions. It is a point cloud with the different target fields (velocity, pressure and turbulent viscosity) attached to each of its nodes. So it is a data point in the dataset: there are 1000 simulations in AirfRANS, and each is composed of a point cloud of roughly 180000 nodes. To transform it into a single “graph”, I concatenated all 1000 simulations into a single array and added an index as a feature so that each simulation can be reconstructed.

For the global task of building a surrogate model to predict the airflow around an airfoil, there is no canonical graph associated with the data. Each simulation needs a mesh to run the CFD computation, but this mesh does not make sense in the context of Machine Learning. So we leave the construction of a graph (if needed) to the user. This can be done via a radius graph, for example.
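
As a rough illustration of what building a radius graph from a point cloud means, here is a brute-force NumPy sketch. The function name `radius_graph`, the toy coordinates, and the radius value are assumptions for the example; this is not the routine used by AirfRANS, PyTorch Geometric, or DGL, and an O(n²) distance matrix would not scale to clouds of roughly 180000 nodes, where a spatial index or a library routine would be needed.

```python
import numpy as np

def radius_graph(pos, r):
    """Return (src, dst) index arrays for all ordered pairs of points within distance r.

    Brute-force O(n^2) version, for illustration on small point clouds only.
    """
    diff = pos[:, None, :] - pos[None, :, :]   # (n, n, d) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)       # (n, n) Euclidean distances
    mask = dist <= r
    np.fill_diagonal(mask, False)              # drop self-loops
    src, dst = np.nonzero(mask)
    return src, dst

# Four toy 2D points: two close pairs that are far from each other.
pos = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0], [2.0, 0.4]])
src, dst = radius_graph(pos, r=1.0)
edges = sorted(zip(src.tolist(), dst.tolist()))
# edges == [(0, 1), (1, 0), (2, 3), (3, 2)]
```

Because the radius is a free parameter, the resulting connectivity is a modelling choice rather than a property of the data, which is why no default graph ships with the dataset.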

Rhett-Ying · February 21, 2024, 1:09am

I checked one point cloud example in the DGL repo: https://github.com/dmlc/dgl/blob/master/examples/pytorch/pointcloud/pointnet/ModelNetDataLoader.py. The customized dataloader returns a point set during iteration.

So in order to support your dataset, all we need to do is make OnDiskDataset support an empty edges.csv? Does the existing data loading work for this case?

Rhett-Ying · February 22, 2024, 2:55am

We discussed this request and found that it’s better to compose the graph dataset outside DGL (GraphBolt). Namely, host the dataset somewhere else and generate a graph from it via some tools/algorithms. Once the graph is ready, users are free to train with DGL (GraphBolt). Could you look into it? Maybe https://pytorch3d.org/ would help.

FlorentExtrality · February 22, 2024, 11:23am

Hi!
This has already been done: it is available in a ready-to-use version in PyTorch Geometric and also in the airfrans library. The dataset is hosted on the Sorbonne Université servers.
What about the legacy DGLDataset class, which allows such empty graphs? The code is ready; it would allow users to load and process the dataset directly via DGL and then use GraphBolt if wanted. What do you think?

Rhett-Ying · February 22, 2024, 12:14pm

But how and when is the graph structure generated, even with DGLDataset? Both DGL and GraphBolt require a real graph with edges for training (whether full-graph or mini-batch training).

FlorentExtrality · February 22, 2024, 2:21pm

In the code I proposed, the DGLDataset for airfrans is a wrapper around a list of dgl.graph objects with no edges:

g = dgl.graph(([], []), num_nodes=self._positions[k].shape[0])
g.ndata["pos"] = F.tensor(
    self._positions[k], dtype=F.data_type_dict["float32"]
)
g.ndata["feat"] = F.tensor(
    self._feats[k], dtype=F.data_type_dict["float32"]
)
g.ndata["label"] = F.tensor(
    self._labels[k], dtype=F.data_type_dict["float32"]
)
g.ndata["surf"] = F.tensor(
    self._surfaces[k], dtype=F.data_type_dict["float32"]
)
self.graphs.append(g)

You can then either use only the node features as tensors (PyTorch, for example) with the PyTorch dataloader if you do not want any graph, or, if you do want one, build a graph for each simulation (for example with dgl.radius_graph and the node position feature) and then use the DGL dataloader or GraphBolt.

Would that make sense?

Rhett-Ying · February 26, 2024, 2:02am

@minjie what do you think about it?

Rhett-Ying · February 29, 2024, 2:17am

It makes sense. Let’s compose a DGLDataset instead of a graphbolt.BuiltinDataset. Let’s work on Inclusion of the AirfRANS dataset by FlorentExtrality · Pull Request #7119 · dmlc/dgl · GitHub.