In my 10/15 post, I briefly introduced what I am working on in my senior thesis: The Effect of Pre-ReLU Input Distribution on Deep Neural Network Performance.
One interesting aspect that we've been looking into is batch-normalization, which works particularly well when combined with ReLU: it gives much faster convergence and better overall predictive power. This eventually made us wonder: what distribution does batch-normalization forge the pre-ReLU input into, so that it appears so "affable" to the ReLU activation?
Intuitively, the answer seems straightforward. Since we are shifting each dimension of the data by its mean and scaling it by a factor of \( \frac{1}{\sigma} \), where \( \sigma \) is the batch standard deviation, the standard normal distribution \( \mathcal{N}(0,1) \) seems to be the most reasonable target that batch-normalization is shaping the input into. But on more careful reasoning, this may not be true: having a mean of 0 and a standard deviation of 1 by no means implies that the distribution must be \( \mathcal{N}(0,1) \). For example, the distribution \( \text{Laplace}\big(0, \frac{1}{\sqrt{2}}\big) \) also has mean 0 and variance 1.
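To make this concrete, here is a minimal NumPy sketch (purely illustrative, not part of the thesis code) showing that a \( \text{Laplace}\big(0, \frac{1}{\sqrt{2}}\big) \) sample matches \( \mathcal{N}(0,1) \) in its first two moments while still having noticeably heavier tails:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Laplace(0, 1/sqrt(2)) has mean 0 and variance 1, just like N(0, 1).
laplace = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2), size=n)
normal = rng.normal(loc=0.0, scale=1.0, size=n)

for name, x in [("laplace", laplace), ("normal", normal)]:
    # Excess kurtosis is ~0 for a Gaussian and ~3 for a Laplace distribution,
    # even though the first two moments agree.
    kurt = np.mean(((x - x.mean()) / x.std()) ** 4) - 3
    print(f"{name:7s}  mean={x.mean():+.3f}  std={x.std():.3f}  excess kurtosis={kurt:+.2f}")
```

The two samples have essentially identical means and standard deviations, but the Laplace sample's excess kurtosis reveals that it is not Gaussian, so matching the first two moments alone tells us little about the actual shape.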
Therefore, we need to run some experiments to see what influence BN exerts on the input to the ReLU activation in each layer. The primary tool we use to visualize these input distributions is TensorBoard, the visualization framework that ships with TensorFlow. With it, we can monitor how the shape of each layer's input distribution changes across the training epochs. Here is one of the distribution plots I made, for a 5-layer neural network where the 3rd and 4th layers are batch-normalized:
You can view more plots like this, along with my analysis, in my research journals. It turns out, after repeated experiments, that BN forces a bimodal shape onto the input distribution: a higher, narrower, closer-to-zero peak on the negative side, and a lower, wider, farther-from-zero peak on the positive side. In comparison, in layers 1 and 2, where we did not apply BN, there is no such phenomenon (and, correspondingly, when we apply BN to layers 1 and 2 instead of 3 and 4, the bimodalization occurs in layers 1 and 2). Also, comparing the panels of the figure above, we found that in a BN network the bimodal tendency is even more obvious (e.g., the panel on the left versus the one in the center). And if we compare the distribution of the input right before BN with the one right after BN (center panel vs. right panel), BN seems to be emphasizing this bimodal shape. Interesting.
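For reference, here is a minimal sketch of how such pre-activation histograms can be logged to TensorBoard (written against the TF2-style summary API for illustration; the log directory, tag, and variable names are hypothetical, and the actual thesis code may differ):

```python
import tensorflow as tf

# One writer for the whole run; TensorBoard reads events from this directory.
writer = tf.summary.create_file_writer("logs/pre_relu_distributions")

def log_pre_relu(tag, pre_activation, step):
    """Record a histogram of the pre-ReLU values so TensorBoard can
    show how the distribution evolves over training."""
    with writer.as_default():
        tf.summary.histogram(tag, pre_activation, step=step)

# Example: inside the training loop, after computing the (optionally
# batch-normalized) linear output of layer 3 but before applying ReLU:
# log_pre_relu("layer3/pre_relu", z3, step=epoch)
```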
This brought us to the question of whether ReLU prefers such bimodalization. There must be a reason that this kind of transformation by BN accelerates training convergence. However, we should note that since ReLU flattens the negative part to zero anyway, the high peak on the left shouldn't matter on its own. Still, the shape we found gives us a very good indicator of the direction we can head in when mimicking BN's behavior.
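To see why the negative peak is inert under ReLU, here is a small NumPy sketch (a toy illustration, not from the thesis) that pushes a bimodal sample through ReLU: all the mass in the narrow negative peak collapses onto zero, and only the wide positive peak survives.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy bimodal pre-activation: a narrow peak just below zero and a
# wider peak farther out on the positive side.
negative_peak = rng.normal(loc=-0.3, scale=0.1, size=n // 2)
positive_peak = rng.normal(loc=1.0, scale=0.5, size=n // 2)
pre_relu = np.concatenate([negative_peak, positive_peak])

post_relu = np.maximum(pre_relu, 0.0)  # ReLU

frac_zeroed = np.mean(post_relu == 0.0)
print(f"fraction of samples mapped exactly to 0: {frac_zeroed:.2f}")
print(f"mean of the surviving positive part: {post_relu[post_relu > 0].mean():.2f}")
```

Roughly half of the samples, essentially the entire negative peak, end up exactly at zero, which is why only the positive peak can carry information forward.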
Why don't we observe three peaks or more? We believe this comes down to the ReLU activation itself: since ReLU only distinguishes a value by its sign, there is no reason for it to produce another peak. The current peaks, one on the negative side and one on the positive side, exactly reflect the effect of ReLU. (Of course, this is a qualitative hypothesis that I am proposing; there may be other explanations as well.)
So this asymmetric bimodal distribution leads to an important piece of understanding about batch normalization. In the next post on my thesis, I will talk about our results.