`DistDGL` store and access data in shared memory

pubu · July 28, 2023, 10:46am

I have a TensorDict storing some extra node IDs and their features. I want to store this data in the shared memory of kvstore under the name ext_feat when DistGraph is loaded (at the end of the __init__ function of DistGraph). This data will be local to each partition. I then want to read it from shared memory in the pull function of KVClient, during feature access in distributed training.

@Rhett-Ying @zhengda1936 Can you guide me please?

Rhett-Ying · August 1, 2023, 3:24am

Why not use DistTensor directly instead of hacking into DistGraph, if the tensor’s first dimension equals the number of nodes/edges? Like this: https://github.com/dmlc/dgl/blob/8c213ef12273f2fa1eaaecf3840079938754e1d5/tests/distributed/test_dist_graph_store.py#L87.

pubu · August 1, 2023, 12:48pm

@Rhett-Ying Thank you for your reply.

As far as I understand, accessing the elements of a DistTensor causes a remote request if the data is not present on the local machine. This is what I want to avoid.

I have features and attributes for some nodes that are not in the original graph. Also, the node IDs are neither sequential nor contiguous, e.g. nids={1, 7, 5678, 98476, ...}. I have separate data for each partition, which needs to be stored on each machine. I cannot store this data in the graph structure during partitioning, as that would defeat the objective I am trying to achieve.

I could store this data in a separate class and access it that way, but that involves the extra overhead of a memory copy, which I want to avoid. Also, since the nodes for which I have extra data are not in sequence, I chose a dictionary structure where the key is the node ID and the value is the feature vector and other attributes. That way, I can access them by key. As a workaround, I could maintain two tensors, one storing the nids and the other storing the features and attributes, then look up the indices of the required nodes in the first tensor and read the data at those indices from the second tensor. However, that seems like a lot of overhead, especially for large graphs such as ogbn-papers100M. That is the reason for choosing the dictionary structure.
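For reference, the two-tensor workaround described above can be sketched as follows. This is only an illustration under assumptions: it uses NumPy arrays rather than DGL or PyTorch tensors, and the names `build_store` and `lookup` are hypothetical helpers, not part of any DGL API. The ID array is kept sorted so that a batch of queries can be resolved with a binary search.

```python
import numpy as np


def build_store(ext):
    """Flatten {node_id: feature_vector} into a sorted ID array and a feature matrix."""
    ids = np.array(sorted(ext), dtype=np.int64)
    # Row k of `feats` holds the features of node ids[k].
    feats = np.stack([np.asarray(ext[i], dtype=np.float32) for i in ids.tolist()])
    return ids, feats


def lookup(ids, feats, query):
    """Return (mask, rows): which query IDs are stored, and their feature rows."""
    query = np.asarray(query, dtype=np.int64)
    pos = np.searchsorted(ids, query)      # binary search for each query ID
    pos = np.clip(pos, 0, len(ids) - 1)    # IDs above the maximum map to len(ids)
    mask = ids[pos] == query               # True only where the ID is really stored
    return mask, feats[pos[mask]]


# Hypothetical data using the non-contiguous IDs from the post.
ids, feats = build_store({1: [0.1, 0.2], 7: [0.3, 0.4], 5678: [0.5, 0.6], 98476: [0.7, 0.8]})
mask, rows = lookup(ids, feats, [7, 3, 98476])
```

With this layout a batch lookup costs one binary search per queried ID, and both arrays are plain contiguous buffers, which makes them easier to place in shared memory than a Python dictionary.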

I want to store this dictionary in the shared memory local to each machine to avoid the extra overhead of memory consumption/copy/transfer as well as remote requests. That’s why I cannot use DistTensor.

Is there a way to store this dictionary in the shared memory of each machine with a name e.g. ext_data and then access it during training?
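As a rough illustration of what named, machine-local shared memory looks like outside of DGL, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module (Python 3.8+). It is not DGL's kvstore mechanism, and `publish` and `attach` are hypothetical helper names. A dictionary cannot be placed in such a block directly, so the sketch assumes the data has already been flattened into a fixed-shape array.

```python
import os
import numpy as np
from multiprocessing import shared_memory


def publish(name, array):
    """Copy `array` into a new named shared-memory block and return its handle."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=array.nbytes)
    view = np.ndarray(array.shape, dtype=array.dtype, buffer=shm.buf)
    view[:] = array   # one-time copy into shared memory
    del view          # drop the view so the handle can be closed later
    return shm


def attach(name, shape, dtype):
    """Map an existing block by name; returns (handle, zero-copy array view)."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


# The name is an assumption; the PID suffix only avoids clashes in this demo.
name = f"ext_data_{os.getpid()}"
feats = np.arange(12, dtype=np.float32).reshape(4, 3)
owner = publish(name, feats)
try:
    reader, shared_feats = attach(name, feats.shape, feats.dtype)
    row = shared_feats[2].copy()  # read one row without copying the whole block
    del shared_feats              # views must be released before close()
    reader.close()
finally:
    owner.close()
    owner.unlink()                # remove the block once nobody needs it
```

In a real setup the reader would be a different process on the same machine, attaching with the same name, shape, and dtype. A block created this way has a fixed size, so a store whose number of nodes changes during training would need spare capacity or would have to be re-created.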

If I store that data in g.ndata as you suggested, will it be stored in the local shared memory of each machine? And how would I then access it in the pull function, where the graph object is not available? My objective is to intermingle this data with the node features that are pulled in the pull function for training.

P.S. The set of nodes for which I have data is not fixed: the number of nodes in the dictionary grows and shrinks depending on the logic during training.

Thank you for taking the time to read this lengthy question.

Rhett-Ying · August 3, 2023, 1:07am

Yes. Data is loaded by the primary servers (DistGraphServer) on each machine and moved to shared memory, where it can be accessed by clients (DistGraph).

I don’t quite understand why you need to fetch data in pull explicitly. In user code, we usually don’t need to do it this way. Could you share some demo code to illustrate how you want to use this special data?

As you said, if we use DistTensor, we do need to fetch data from a remote server if the node ID we’re fetching is not local. The overhead should be fine unless the feature data is very large. But if you hack the server to load such data into shared memory on each machine, it may cause sync issues if the data changes during training.
