Transformer implemented by dgl

#1
  1. How can I change the max_length=50 that is fixed in the TranslationDataset? It is quite short. Does a larger max length severely impact speed?
  2. Does the dgl-transformer have any speedup over the raw transformer?
#2

Hi,

  1. No, a larger length does not impact speed severely. One advantage of the DGL transformer is that we do not use padding, so compute scales with the actual sentence lengths rather than the maximum length.
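To make the padding point concrete, here is a small illustrative sketch (plain Python, not DGL internals) counting how many attention scores get computed for a batch of variable-length sentences. With dense padded attention every sequence pays for the longest one; with a graph edge list each sentence contributes only its own token-pair edges. The batch lengths are made up for illustration.

```python
# Hypothetical batch of variable-length sentences (illustrative numbers).
lengths = [50, 120, 7, 300]

# Padded dense self-attention: every sequence is padded to the batch
# maximum, so each one computes max_len * max_len attention scores.
max_len = max(lengths)
padded_slots = len(lengths) * max_len * max_len

# Graph (edge-list) self-attention: each sentence contributes only
# len * len real token-pair edges, with no padding entries at all.
graph_edges = sum(n * n for n in lengths)

print(padded_slots)  # 4 * 300 * 300 = 360000
print(graph_edges)   # 2500 + 14400 + 49 + 90000 = 106949
```

The gap grows with length variance inside a batch, which is why removing padding keeps larger max lengths from hurting speed as much as one might expect.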
  2. We haven’t ported the custom GPU kernel to the DGL master branch yet, so the transformer code in the DGL examples is much slower than the raw transformer. You will see updates on this in DGL 0.4.
    Currently, with the custom kernel adopted, it takes around 2000 s/epoch to train a dgl-transformer on the WMT-14 en-de dataset with 8 NVIDIA V100 GPUs, and it achieves a BLEU score of 28 on the test set.
    The major advantage of using DGL to write a transformer is its flexibility in handling any kind of graph (the vanilla transformer corresponds to two fully connected graphs, but it does not have to be that way). If you are interested in this insight, please stay tuned for our new paper at the ICLR 2019 workshop: Representation Learning on Graphs and Manifolds.
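As a rough sketch of that last point (plain Python, not the DGL API; the function names are illustrative), self-attention can be written as an explicit edge list over tokens. The vanilla transformer's fully connected pattern is just one choice of edge set; a sparser pattern, such as a local sliding window, is a drop-in replacement in the graph view.

```python
def full_attention_edges(n):
    """Vanilla self-attention: every token attends to every token (n*n edges)."""
    return [(i, j) for i in range(n) for j in range(n)]

def window_attention_edges(n, w):
    """Illustrative alternative: each token attends only to tokens within distance w."""
    return [(i, j) for i in range(n) for j in range(n) if abs(i - j) <= w]

n = 6
full = full_attention_edges(n)
local = window_attention_edges(n, w=1)
print(len(full))   # 36 edges: the fully connected graph
print(len(local))  # 16 edges: a sparse attention graph over the same tokens
```

In the graph formulation, the attention computation itself is unchanged; only the edge set that messages flow along differs.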