The title of my senior thesis is: The Effect of Pre-ReLU Input Distribution on Deep Neural Network Performance.
Well, this is not a short title. But really, the gist of this thesis can’t be more straightforward (though challenging): how can we improve the performance of neural networks by playing with the distribution of inputs to ReLU (an activation function)?
Neural networks, especially deep neural networks, have become a truly prevalent machine learning technique whose variants are used in a wide range of applications thanks to their excellent performance and predictive power. In particular, recent publications on AlexNet, VGG Net, and AlphaGo have tackled demanding problems such as large-scale image classification and the computation-heavy game of Go.
Recall that in artificial neural networks, each neuron (also known as a unit) applies a linear transformation \(Wx+b\) to the input \(x\) it receives and passes the result through an activation function such as the sigmoid or tanh. Many other activation functions have been proposed recently, however, and among them the Rectified Linear Unit (ReLU) has become an increasingly popular choice. At each neuron, ReLU computes
\[z_{i+1} = \max \{0, \phi_i(z_i)\} = \max \{0, f(Wz_i+b)\}\]where \( z_i \) is the input of the \( i^\text{th} \) layer and \(f\) is usually the identity function (though in more advanced settings, say when we use dropout or batch normalization, it is not). This preference for ReLU over other, more classical activation functions is believed to be a result of ReLU’s efficient gradient propagation (no vanishing/exploding gradients, since the negative part is simply flattened to 0 and the positive part is just the identity) [LeCun et al. 1998] and its sparse activation (all negative inputs are mapped to 0) [Glorot et al. 2011]. In Glorot et al.’s paper, in particular, the authors also attempted a biological argument to justify the power of ReLU, by comparing the average percentage of neurons that are in the “active phase”.
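To make the formula concrete, here is a minimal NumPy sketch of a single ReLU layer, under the assumption that \(f\) is the identity (no dropout or batch normalization); the function name and the layer sizes are purely illustrative and not taken from the thesis.

```python
# Minimal sketch of one ReLU layer: z_{i+1} = max{0, f(W z_i + b)} with f = identity.
import numpy as np

rng = np.random.default_rng(0)

def relu_layer(z_i, W, b):
    """One layer's forward pass with a ReLU activation."""
    pre_activation = W @ z_i + b             # phi_i(z_i): the pre-ReLU input
    return np.maximum(0.0, pre_activation)   # ReLU flattens the negative part to 0

# Example: a (made-up) layer mapping 4 inputs to 3 units.
W = rng.standard_normal((3, 4))
b = np.zeros(3)
z_i = rng.standard_normal(4)
print(relu_layer(z_i, W, b))  # non-negative; zeros wherever the pre-activation was negative
```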
In this senior thesis, we focus on how exactly the distribution of the input to ReLU, \( \phi_i(z_i) \), influences the behavior of the neural network. In other words, we want to find out more about the magic behind ReLU and, more importantly, what kind of inputs ReLU prefers to be fed.
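As a rough illustration of the quantity we care about, the sketch below summarizes the pre-ReLU input distribution for random Gaussian inputs and weights. Everything here (sizes, scaling, variable names) is a hypothetical toy setup, not the thesis methodology; in the actual work this distribution would come from real data flowing through a trained network.

```python
# Toy summary of the pre-ReLU input distribution phi_i(z_i) for one layer.
import numpy as np

rng = np.random.default_rng(1)

batch = rng.standard_normal((1000, 64))        # 1000 hypothetical input vectors
W = rng.standard_normal((32, 64)) / np.sqrt(64)  # illustrative weight scaling
b = np.zeros(32)

pre_relu = batch @ W.T + b                     # phi_i(z_i) for every example and unit
active_fraction = np.mean(pre_relu > 0)        # share of units ReLU leaves untouched

print(f"mean={pre_relu.mean():.3f}, std={pre_relu.std():.3f}, "
      f"active fraction={active_fraction:.2%}")
```

Shifting or rescaling this distribution changes how many units end up in the "active phase", which is exactly the kind of effect the thesis sets out to study.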
In the coming posts, I will gradually share the findings of my research! There will also be two presentations on the research, one at the end of the fall 2016 semester and one at the end of the spring 2017 semester. Meanwhile, a thesis paper summarizing our work is expected by next May.