Last time we talked about how our surprising observation that BN bimodalizes the distribution led to some hypotheses about its effect. In particular, we observed a higher peak on the negative side and another, lower peak on the positive side. What happens if we manually create such bimodal peaks and replace the BN layer with a transformation that does so?
Fixed symmetric/asymmetric transformations
An interesting initial experiment, then, is to try some static/fixed transformations that are expected to bimodalize the distribution. For such a function \( f \), we require \( f'(x) \geq 0 \) (to preserve the order of the original data) and, in addition, the following properties:
- \( f'(x) \) is relatively large for \( x \in (-\varepsilon, \varepsilon) \)
- \( f'(x) \) is small (or even \( \longrightarrow 0 \)) as \( x \rightarrow \pm \infty \)
This should be easy to understand. We want the slope to be steep near 0 because we want to drive the data values “away” from 0. But we don’t want to drive them too far, so as \( x \) gets farther from the origin, the slope flattens out, which forms one peak on the positive side and one on the negative side. Here are some examples of such functions:
where
\[\begin{aligned} f(x) &= \left\{ \begin{array}{ll} \displaystyle \frac{-1+\sqrt{1+4x}}{2} & \ x \geq 0 \\ \\ \displaystyle \frac{1-\sqrt{1-4x}}{2} & \ x < 0 \end{array} \right. \\ g(x) &= \left\{ \begin{array}{ll} \displaystyle \log(x+1) & \ x \geq 0 \\ \displaystyle -\log(-x+1) & \ x < 0 \end{array} \right. \\ h(x) &= \frac{2}{1+e^{-x}} - 1 + \frac{1}{2} \cdot \mathbf{sign}(x)\log(|x|+1) \\ &= 2 \varphi(x) - 1 + \frac{1}{2} \cdot g(x) \\ \mathrm{BN}(x) &= \frac{x-\mu}{\sigma} = \frac{x-1}{3} \\ \mathrm{mystery}(x) &= \left\{ \begin{array}{ll} \displaystyle \frac{-1+\sqrt{1+4(x-2)}}{2} + 1 & \ x > 2 \\ \\ \displaystyle \frac{x^2}{4} & \ 0 \leq x \leq 2 \\ \\ \displaystyle -\frac{x^2}{4} & \ -2 < x < 0 \\ \\ \displaystyle \frac{1-\sqrt{1+4(-x-2)}}{2} - 1 & \ x < -2 \end{array} \right. \end{aligned}\](Note: \( \varphi \) denotes the sigmoid function. For \( \mathrm{BN} \), I simply took some sample values of \( \mu \) and \( \sigma \); in practice they depend on the dataset.)
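For concreteness, here is a minimal NumPy sketch of these fixed transformations as element-wise functions (the function names, the use of NumPy, and the hand-picked defaults in `mock_bn` are my choices for illustration, not code from the experiments):

```python
import numpy as np

def f(x):
    """Square-root-based transform: steep near 0, flattens as |x| grows."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * (-1 + np.sqrt(1 + 4 * np.abs(x))) / 2

def g(x):
    """Log-shift: sign(x) * log(|x| + 1)."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log(np.abs(x) + 1)

def h(x):
    """Scaled sigmoid plus half a log-shift (note 2*sigmoid(x) - 1 == tanh(x/2))."""
    x = np.asarray(x, dtype=np.float64)
    return np.tanh(x / 2) + 0.5 * g(x)

def mock_bn(x, mu=1.0, sigma=3.0):
    """'BN' with hand-picked, fixed statistics instead of batch statistics."""
    return (np.asarray(x, dtype=np.float64) - mu) / sigma

def mystery(x):
    """Quadratic near the origin, square-root tails, glued together at |x| = 2."""
    x = np.asarray(x, dtype=np.float64)
    tails = np.sign(x) * ((-1 + np.sqrt(1 + 4 * np.maximum(np.abs(x) - 2, 0))) / 2 + 1)
    middle = np.sign(x) * x ** 2 / 4
    return np.where(np.abs(x) > 2, tails, middle)
```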
So how’s the performance? It turns out that, with the same setting as before but with BN replaced by the mock BN, the network performs at about the same level as the usual NN in which no batch normalization is used. This is bad. Here are the distribution plots:
The bimodal shape is observed, but the performance is bad. Also, if we compare closely, we can see that the bimodal shape created here does not exactly resemble the shape of BN’s distribution: here we have a larger and higher peak on the positive side, which indicates that we have used the wrong “center” of the distribution (we should have emphasized the negative portion more). Here is a list of problems with this approach, which may account for the result:
- According to the BN paper, BN not only normalizes the data (i.e., the transformation depends on the data) but also applies two learnable parameters, \( \gamma \) and \( \beta \), so that eventually \( \hat{x}_j = \gamma \cdot \frac{x_j - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \) (where \( \mu_B \) and \( \sigma_B^2 \) are the mean and variance of the current batch). So for each batch the normalization is both data-dependent and learnable. Here, by proposing transformations that are fixed and data-independent, we are essentially saying: “For whatever data I am given, I will transform it in the same way.” This alone could make the fixed transformations inferior (see the sketch below this list).
- BN has different training and testing behaviors (at test time, in particular, it uses population statistics accumulated from the training set). But this shouldn’t be the main reason, because convergence on the training set is also slowed when we replace BN with the mock BN.
- The mock BN didn’t emphasize anything about the zero-center and unit-variance properties; it just tried to force a shape. Since ReLU clips values at \( x = 0 \), perhaps the location of the center does matter.
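To make the first point above concrete, here is a minimal sketch (function names are mine, and this is not the actual training code) contrasting what a BN layer computes on each batch with a fixed, data-independent transformation:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time BN on a batch x of shape (batch, features): normalize
    with *this batch's* statistics, then apply the learnable per-feature
    scale gamma and shift beta (as in the BN paper). The real layer also
    tracks running statistics for use at test time."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero-center, unit-variance
    return gamma * x_hat + beta            # learnable affine transform

def fixed_transform(x):
    """The mock: the same log-shift for every batch, no statistics, no learning."""
    return np.sign(x) * np.log(np.abs(x) + 1)
```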
Given these thoughts, we moved on to a slightly more generic analysis.
Learnable symmetric/asymmetric transformations
Since BN is learnable, let’s add some learnable factors to our mock transformations as well! As we concluded in the section on fixed transformations above, BN’s flexibility may be an important factor behind the performance gap between BN and the fixed transformations.
Therefore, we can think about how to add these learnable factors. We proposed two functions (transformations) to attach them to.
Log-shift transformation
The log function \( \log(x) \) has half of the shape we desire: a steep slope around the origin that flattens out gradually. Therefore, we propose the following log-shift function:
\[g(x) = 1_{x \geq 0} \cdot \log(x+1) + 1_{x < 0} \cdot (- \log(-x+1))\]where \( 1_{(\cdot)} \) is the indicator function. Note that this function is differentiable, even at 0 where the piecewise definition changes: the derivative is \( \frac{1}{x+1} \) for \( x \geq 0 \) and \( \frac{1}{1-x} \) for \( x < 0 \), and both one-sided derivatives equal 1 at the origin. As we shall see next, such differentiability is still preserved even after we include the learnable factors. We now define:
\[\begin{eqnarray} g(x,a,b,c) = \left\{ \begin{array}{ll} a \cdot \log((bx)^c+1) & \ x \geq 0 \\ -a \cdot \log((-bx)^c+1) & \ x < 0 \end{array} \right. \end{eqnarray}\]so \( a,b,c \) are three parameters that can be updated via backpropagation (SGD, RMSProp, Adam, etc.). Different parameters yield different shapes of the function \( g \), as is shown in the figure on the left below:
Such flexibility makes our approach closer (in terms of learnable parameters) to batch normalization, while remaining more generic, because it does not depend on the data. However, there is one thing we need to pay attention to now: we have to make sure \( c > 0 \). What if this is violated? Then we will probably end up with something undesired, as shown in the figure on the right above. To enforce this, we re-define the transformation \( g \) by adding an exponential mask to it:
\[\begin{eqnarray} g(x,a,b,c) = \left\{ \begin{array}{ll} a \cdot \log((bx)^{\exp(c)}+1) & \ x \geq 0 \\ -a \cdot \log((-bx)^{\exp(c)}+1) & \ x < 0 \end{array} \right. \end{eqnarray}\]Then the TensorFlow optimizer takes care of all the backpropagation details and the chain rule.
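To show how this could be wired up, here is a minimal TensorFlow sketch of the learnable log-shift as a Keras layer (the class name, the scalar parameters, and the small epsilon are my choices, not the exact code from the journal):

```python
import tensorflow as tf

class LogShift(tf.keras.layers.Layer):
    """Learnable log-shift g(x, a, b, c) with the exp() mask on c."""

    def build(self, input_shape):
        # Scalar parameters; per-feature vectors would be a natural variant.
        self.a = self.add_weight(name="a", shape=(), initializer="ones")
        self.b = self.add_weight(name="b", shape=(), initializer="ones")
        self.c = self.add_weight(name="c", shape=(), initializer="zeros")

    def call(self, x):
        power = tf.exp(self.c)  # the exp() mask keeps the exponent positive
        # |b x| keeps the base of the power nonnegative; the tiny epsilon
        # keeps gradients finite at x = 0.
        base = tf.pow(tf.abs(self.b * x) + 1e-12, power)
        # sign(x) applies the two symmetric branches of the definition.
        return tf.sign(x) * self.a * tf.math.log(base + 1.0)

# Example usage: drop it in where the BN layer used to be.
mock_bn_layer = LogShift()
y = mock_bn_layer(tf.constant([[-2.0, 0.0, 3.0]]))
```

With the initial values \( a = b = 1, c = 0 \), the layer starts out as the plain log-shift \( g(x) \) and then adapts its shape during training.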
So how are the results? While the details can be found in my research journal #6 (the 10/18 one), I can show the most important parts here. First, let’s look at the pre-mock-BN vs. post-mock-BN input distribution, still at layer 4 of the network:
This is different from the fixed transformation: we now correctly have a slightly larger negative peak and another, smaller positive peak after the mock BN. However, the ranges of the two peaks are similar, which marks a difference from real BN. And here is the performance comparison, on the test set, of BN vs. mock BN (log-shift) vs. the usual NN:
Bravo! So we’ve got some improvement over the normal NN. However, the log-shift still performs a bit worse than batch normalization. In my research journal I also compared the loss convergence, which shows similar improvements.
Sqrt-shift transformation
We now try another learnable transformation. Note that we want to try different types of transformations because, even though the parameters are learnable, the shapes of these transformations still essentially depend on the underlying functions that serve as their “base” (e.g. \( \log (x) \) is the base function of the log-shift transformation above). The next function I had in mind was \( f(x)=\sqrt{x+\frac{1}{4}}-\frac{1}{2} \), which I named the sqrt-shift function. Adding learnable parameters to it using the same exponential-mask trick as before, we eventually have the transformation function:
\[\begin{eqnarray} h(x,a,b) = \left\{ \begin{array}{ll} \sqrt{(\exp(a) \cdot x)^{\exp(b)}+\frac{1}{4}}-\frac{1}{2} & \ x \geq 0 \\ -\sqrt{(-\exp(a) \cdot x)^{\exp(b)}+\frac{1}{4}}+\frac{1}{2} & \ x < 0 \end{array} \right. \end{eqnarray}\](At first glance it is probably not obvious why we need to exponentiate \( a \) as well; after some experiments, it turned out that wrapping it in \( \exp() \) helped stabilize convergence.)
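Under the same assumptions as the log-shift sketch above, the sqrt-shift could look like this (again a sketch, not the journal’s exact code):

```python
import tensorflow as tf

class SqrtShift(tf.keras.layers.Layer):
    """Learnable sqrt-shift h(x, a, b); both parameters pass through exp()."""

    def build(self, input_shape):
        # exp(0) = 1, so the layer starts out as sign(x)*(sqrt(|x| + 1/4) - 1/2).
        self.a = self.add_weight(name="a", shape=(), initializer="zeros")
        self.b = self.add_weight(name="b", shape=(), initializer="zeros")

    def call(self, x):
        scale = tf.exp(self.a)  # the exp() around a helps stabilize convergence
        power = tf.exp(self.b)  # the exp() keeps the exponent positive
        # |x| plus a tiny epsilon keeps the power well-defined and gradients finite at 0.
        base = tf.pow(scale * tf.abs(x) + 1e-12, power)
        return tf.sign(x) * (tf.sqrt(base + 0.25) - 0.5)
```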
The sqrt-shift transformation bears a great deal of resemblance to the log-shift above, in terms of the general shape. Just like log-shift, the shape of its curve differs as we adjust the values of parameters \( a \) and \( b \):
However, as you might have noticed, as \( x \) gets large the flattening is not as strong as in the log case (no surprise, since the derivative now decays only like a reciprocal square root rather than a reciprocal). This, as we shall see in the eventual post-mock-BN, pre-ReLU distribution, suggests a significant difference between this simulation and real batch normalization.
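To make the “square root” remark precise, compare the tail slopes of the two base functions (ignoring the learnable parameters):

\[\frac{d}{dx}\log(x+1) = \frac{1}{x+1} = O\!\left(\frac{1}{x}\right), \qquad \frac{d}{dx}\left(\sqrt{x+\tfrac{1}{4}}-\tfrac{1}{2}\right) = \frac{1}{2\sqrt{x+\tfrac{1}{4}}} = O\!\left(\frac{1}{\sqrt{x}}\right),\]

so the sqrt-shift’s slope decays much more slowly: instead of piling large activations into a tight peak, it keeps spreading them outward.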