BatchNorm and Normalization with binary and continuous variables


I am working with a graph that has different types of node features. Some features are continuous variables (e.g. age: 20yo, 24yo, etc.) and some are binary (e.g. is married: 1 = yes, 0 = no). I have three questions:

  1. There is an issue when using nn.BatchNorm1d to standardize data from a graph that contains both binary and continuous variables, because it does not make sense to normalize binary variables (at least I think so, because you lose the interpretability of the variable). Do you agree?
  2. In the GraphSAGE paper, the authors normalize the embeddings at the end of each layer (by their L2 norm). But the features are not on the same scale, so it does not seem to make sense to apply this when the graph features mix binary and continuous variables. Again, do you agree?
  3. Suppose you have binary and continuous variables, and one of the continuous variables has huge values compared to all the others. The model will tend to give a high weight to this variable. Is there a way to minimize this impact?

Thanks a lot.

Hi, the answers to your questions depend on the specific case, so I can only give some generic suggestions:

One way to model binary variables is to treat them as categorical values with learnable embeddings. Applying normalization to embeddings is generally okay. Also, I think even normalizing the binary variables directly is typically fine.
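As a rough illustration of the embedding idea, here is a minimal PyTorch sketch. The class name `MixedFeatureEncoder`, the embedding size, and the toy age/married values are assumptions made up for this example, not something from the original question. It applies nn.BatchNorm1d only to the continuous column and looks up a learnable vector for the binary one.

```python
import torch
import torch.nn as nn


class MixedFeatureEncoder(nn.Module):
    """Hypothetical encoder: batch-normalizes continuous columns and
    embeds a single binary column as a 2-category feature."""

    def __init__(self, num_continuous: int, embed_dim: int = 4):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_continuous)
        # Two rows in the table: one learnable vector for 0, one for 1.
        self.binary_embedding = nn.Embedding(num_embeddings=2, embedding_dim=embed_dim)

    def forward(self, x_cont: torch.Tensor, x_bin: torch.Tensor) -> torch.Tensor:
        cont = self.bn(x_cont)                # shape: (num_nodes, num_continuous)
        emb = self.binary_embedding(x_bin)    # shape: (num_nodes, embed_dim)
        return torch.cat([cont, emb], dim=1)  # shape: (num_nodes, num_continuous + embed_dim)


torch.manual_seed(0)
x_cont = torch.tensor([[20.0], [24.0], [31.0], [45.0]])  # toy "age" column
x_bin = torch.tensor([1, 0, 0, 1])                       # toy "is married" column (int64 indices)

encoder = MixedFeatureEncoder(num_continuous=1, embed_dim=4)
out = encoder(x_cont, x_bin)
print(out.shape)
```

The concatenated output can then be fed to the first GNN layer as the node feature matrix. Nodes with the same binary value share the same embedding vector, so the 0/1 meaning is kept while the model is free to learn how to scale it.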

For features with extreme values, you typically want to transform them to a better-behaved distribution before modeling. The sklearn package provides utilities for that.
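One possible way to do this with sklearn is sketched below. The toy data, the hypothetical "income" column, and the choice of a log transform followed by standardization are assumptions for illustration only; other transformers from sklearn.preprocessing may suit a given dataset better. The binary column is passed through unchanged.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy node features: [age, income (hypothetical heavy-tailed column), is_married]
X = np.array([
    [20.0,   1_200.0, 1.0],
    [24.0,   3_500.0, 0.0],
    [31.0,   2_800.0, 0.0],
    [45.0, 950_000.0, 1.0],
    [38.0,   4_100.0, 1.0],
])

preprocess = ColumnTransformer(
    transformers=[
        # Plain standardization for a well-behaved continuous column.
        ("age", StandardScaler(), [0]),
        # Compress the long tail with log(1 + x), then standardize.
        ("income", make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), [1]),
    ],
    remainder="passthrough",  # the binary column is appended unchanged
)

X_scaled = preprocess.fit_transform(X)
print(X_scaled.round(2))
```

After this step both continuous columns have zero mean and unit variance, so the large-valued one no longer dominates purely because of its scale. In practice the transformer should be fit on the training nodes only and then applied to the rest.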
