Happy New Year! I’ve been a bit busy lately with my grad school application and my thesis research experiments, so I haven’t been updating the website as frequently.

Sqrt-shift learnable transformation, continued

Last time we introduced the sqrt-shift function, which can adjust itself through backpropagation to simulate BN’s behavior. So how well did it perform? Surprisingly, this “simulation” (why the scare quotes? I’ll explain later) worked almost as well as BN! This is a good sign :-) Here is a graph comparing a plain NN, a log-shift NN, a sqrt-shift NN, and BN:
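As a quick reminder, here is one plausible form of such a layer (purely hypothetical; the exact parametrization was given in the last post): an element-wise signed square root with a learnable shift and scale, both updated by backpropagation.

```python
import numpy as np

def sqrt_shift(x, shift, scale):
    """Hypothetical sqrt-shift transform: a signed square root with a
    learnable shift and scale. In the real network, `shift` and `scale`
    are parameters updated by backpropagation."""
    y = x - shift
    return scale * np.sign(y) * np.sqrt(np.abs(y))

out = sqrt_shift(np.array([-4.0, 0.0, 4.0]), shift=0.0, scale=1.0)
# With shift 0 and scale 1, this maps -4 -> -2, 0 -> 0, 4 -> 2.
```

The square root compresses large pre-activations toward zero, which is the kind of re-shaping we hoped would mimic BN.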

Cross-comparison of test-set accuracy convergence for the plain NN (orange), log-shift (blue), sqrt-shift (yellow), and BN (green)

Note that the green line is shorter than the other three not because I used a different number of epochs, but because it went beyond 0.98 and almost coincided with the yellow line. In general, we found that this mock (I call it “mock-4.3” because the result was reported in section 4.3 of my journal) works as well as BN in terms of both convergence speed and test performance. That makes its shape all the more interesting to study. In particular, although we designed the function hoping it would force out a bimodal behavior, the learnable parameters allowed the transformation to change under backpropagation, so the final shape might not be what we expected. Indeed:

Left: Layer 4 input before mock BN (i.e. sqrt-shift in this case) and ReLU; Right: Layer 4 input right after mock BN, but before ReLU

The shape looks pretty good, for both of them. Recall that the major motivation behind batch normalization is to center the distribution and normalize it. While the graph on the left is pre-BN, it also reflects, in a certain sense, the effect of our mock-BN layer in layer 3: it is roughly centered with a single peak. What interests us more, though, is the graph on the right, or rather the difference between the two graphs. In particular, if you look at the range, you will find that the mock-normalized data now ranges from -100 to 100! That is certainly not BN at all.

And yet we found this worked almost as well as BN. Why? Well, we don’t know yet. Therefore, we decided to look into this in greater detail by studying the statistical properties of these distributions.

Here is a comparison plot of the similar boost we observed when combining the sqrt-shift mock-BN with other activation functions. It worked well across the board.

For all the activations, adding mock-BN led to a boost in test accuracy convergence.

MLE and KL divergence

The first two concepts we turned to, while seeking theoretical explanations for the effect of the pre-ReLU distributions, were MLE and KL divergence. In particular:

  • MLE: We can regard the transformation as an attempt to re-shape the data so that, after the transformation, the MLE of the parameters of \( \text{transformed data} \sim \mathcal{N}(\mu_0, \sigma_0) \) is \( \mu_0 = 0 \) and \( \sigma_0 = 1 \).
  • KL-divergence: the procedure has two stages — first fit the transformed data, then match it to the standard normal:
    • Find \( \operatorname{argmax}_{\mu_0, \sigma_0} \mathbb{P}\bigg[\frac{z_i - \mu_i}{\sigma_i} \bigg| \mathcal{N}(\mu_0, \sigma_0) \bigg] \), where \( z_i \) is a data column and \( \mu_i, \sigma_i \) are the transformation parameters (note: they are not the mean and stddev).
    • Then we can write the optimal \( \mu_0, \sigma_0 \) as an expression in \( z_i, \mu_i \), and \( \sigma_i \).
    • Compute \( \operatorname{argmin}_{\mu_i, \sigma_i} KL(\mathcal{N}(\mu_0, \sigma_0) \,||\, \mathcal{N}(0,1)) \).
    • This gives the optimal \( \mu_i, \sigma_i \), with which we apply the transformation.
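To make the steps above concrete, here is a small numerical sketch (my own toy example, not from the experiments): under the linear transformation, the first two steps reduce to taking the sample mean and stddev of the transformed column, and the minimization step uses the closed-form KL between \( \mathcal{N}(\mu_0, \sigma_0^2) \) and \( \mathcal{N}(0,1) \), here with plain gradient descent on numerical gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=10_000)  # one data column

def kl_objective(mu_i, sigma_i):
    # Stage 1: the MLE of (mu_0, sigma_0) for the transformed column is
    # simply its sample mean and stddev.
    t = (z - mu_i) / sigma_i
    mu0, s0 = t.mean(), t.std()
    # Closed form: KL(N(mu0, s0^2) || N(0, 1)) = (s0^2 + mu0^2 - 1 - ln s0^2) / 2
    return 0.5 * (s0**2 + mu0**2 - 1.0 - np.log(s0**2))

# Stage 2: argmin over (mu_i, sigma_i) via gradient descent (finite differences).
mu_i, sigma_i, lr, eps = 0.0, 1.0, 0.1, 1e-5
for _ in range(500):
    g_mu = (kl_objective(mu_i + eps, sigma_i) - kl_objective(mu_i - eps, sigma_i)) / (2 * eps)
    g_sd = (kl_objective(mu_i, sigma_i + eps) - kl_objective(mu_i, sigma_i - eps)) / (2 * eps)
    mu_i, sigma_i = mu_i - lr * g_mu, sigma_i - lr * g_sd

# The optimum recovers plain standardization: mu_i ~ mean(z), sigma_i ~ std(z).
```

Note that for this purely linear family the optimum is exactly column-wise standardization, which already hints at why we might need richer transformations.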

With KL divergence, we conducted the experiment by inserting a “KL” layer between the linear transformation \( Wx+b \) and the ReLU, replacing BN:

The mock-BN layer is now truly a KL-divergence optimization layer, where we use GD/Newton's method to find the optimal parameters and apply the transformation
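To show where the layer sits, here is a minimal forward-pass sketch (all shapes and names are made up). Since for the linear transformation \( (z_i - \mu_i)/\sigma_i \) the KL-optimal parameters are just the column mean and stddev, the sketch computes them in closed form rather than running GD/Newton.

```python
import numpy as np

def kl_layer(h):
    """Per-column transform (z - mu_i) / sigma_i with (mu_i, sigma_i)
    chosen to minimize KL(N(mu_0, sigma_0) || N(0, 1)); for this linear
    family the optimum is the column mean and stddev."""
    mu_i = h.mean(axis=0)
    sigma_i = h.std(axis=0) + 1e-8  # guard against zero-variance columns
    return (h - mu_i) / sigma_i

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))              # a mini-batch
W, b = rng.normal(size=(8, 16)), rng.normal(size=16)
h = x @ W + b                             # linear transformation Wx + b
a = np.maximum(kl_layer(h), 0.0)          # KL layer replaces BN, then ReLU
```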

So how did the two methods mentioned above work?

The MLE did not work well. It is easy to see that, once we expand the expression, the optimal solution is simply to squeeze all the data to zero by dividing the data values by \( \infty \). This actually makes sense, since we assume each data entry is independent. The derivation can be found in my journal #7.
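A one-line way to see the degeneracy (a shorthand sketch; the full derivation is in the journal): each transformed point is scored by the standard normal density \( \varphi \), which peaks at zero, so under independence

\[ \prod_j \varphi\!\left(\frac{z_j - \mu_i}{\sigma_i}\right) \le \varphi(0)^n = (2\pi)^{-n/2}, \]

and the bound is approached as \( \sigma_i \to \infty \), since then every \( (z_j - \mu_i)/\sigma_i \to 0 \).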

The KL divergence, on the other hand, has two main disadvantages as well: (1) it is asymmetric; the fact that \( KL(p \,||\, q) \neq KL(q \,||\, p) \) means the method above can be biased. Moreover, (2) we may need to explore higher-order transformations (not just linear ones), which makes the KL analysis very hard. For now we assume the transformation is \( \frac{z_i-\mu_i}{\sigma_i} \), and even that is already troublesome.

As we shall see, we eventually turned our attention to the moment-matching method, which works better than the two methods discussed in this section. While I didn’t cover the experimental results for KL divergence here, you are very welcome to visit my journal page on KL divergence.

Next time, I will talk about our results using moment matching (it has been our focus over the past few weeks)!