Thank you for being interested in my research experiences! In my years at Carnegie Mellon (especially my junior & senior years), I have been very fortunate to have the opportunity to work with brilliant CMU professors and students on these very exciting research areas.

As I mentioned on the home page, I will join CMU’s Machine Learning Department in the fall of 2017… so I expect to be able to work on a myriad of interesting projects in the future!! :P

Index (updating…)

Senior Thesis

You can find the draft of my thesis here, and the thesis presentation slides here.

Advisor: Professor J. Zico Kolter

Here is a collection of documents (mainly research journals & a poster) related to my undergraduate senior thesis on the effect of the pre-ReLU input distribution on DNN performance. As part of the senior thesis, a thesis paper that reflects on our work is expected in early 2017.

What is pre-ReLU distribution?

In 2015, a paper by Ioffe and Szegedy really amazed the world: the technique they proposed, called “batch normalization” (BN), corrects for internal covariate shift and has proved to be very effective in speeding up neural network convergence. In particular, as they (and many other researchers afterwards) reported, this effectiveness is especially well demonstrated when BN is combined with the ReLU activation function.

The power of the ReLU activation, most believe, comes from two aspects: (1) it helps resolve the vanishing gradient problem, which is present in other activations such as sigmoid and tanh; (2) it induces sparsity by flattening the negative portion entirely to zero. Even though some other methods, such as ELU, have come into the picture nowadays, we are primarily interested in the question of “what kind of distribution does ReLU favor most?”
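To make the term concrete: by “pre-ReLU distribution” I mean the distribution of the values that a layer feeds into ReLU. Here is a minimal Python sketch of that pipeline for a single feature. It is an illustration only, not the thesis code; the names `batch_norm` and `relu`, the toy batch, and the default `gamma`, `beta` and `eps` values are all assumptions made for the example.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

def relu(values):
    """Zero out the negative part, keep the positive part unchanged."""
    return [max(0.0, v) for v in values]

# Toy mini-batch of raw activations for one feature (made-up numbers).
raw = [2.0, 4.0, 6.0, 8.0]

pre_relu = batch_norm(raw)    # this is the "pre-ReLU" input: mean ~0, variance ~1
post_relu = relu(pre_relu)    # negatives are flattened to exactly 0 (sparsity)

print(pre_relu)
print(post_relu)
```

With a symmetric batch like this one, half of the normalized values fall below zero, so ReLU zeroes out half of the outputs. How the shape of `pre_relu` changes what ReLU lets through is exactly the question we study.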

What has been done so far?

Under the guidance of my research advisor Zico, I tried to come up with different models that could mimic the behavior of batch normalization (BN). Because we don’t yet fully understand how BN accelerates training convergence through the shift-and-scale of the distribution, it was a natural example to study. In order to generate a data-independent and more generic distribution hypothesis, we first need to “behave” like BN.

We have approached the problem with several major understandings of BN:

  • Distribution shape. In our experiments with BN in a relatively deep neural net, we found, surprisingly, that batch normalization tries to push the distribution into a bimodal shape (and emphasizes the peaks). This is not observed in a plain neural network without BN.
  • Distance to target distribution. If there is a certain distribution that BN is trying to transform the data into, then looking at the difference/distance between the empirical data distribution and the target distribution may be useful.
  • Moment matching. Statistical moments, \( \mathbb{E}[X^i] \), are great indicators of the shape and center of a distribution. For instance, the 1st moment measures the mean and the 3rd moment relates to the skew. In light of this idea, we can imagine BN as an attempt to match the first two moments of the data with 0 and 1, respectively. So why not match higher moments to desired values as well? Easy as this may sound, a lot of convex optimization techniques were involved!
  • Copula transform. What if we make a perfect transform to a target distribution? One way to do this is by matching the quantiles of the empirical and target distributions. In particular, we can use sorting to achieve this: \( \mathsf{Quantile}(x) = \frac{\mathsf{SortIndex}(x)+0.5}{\text{len}(x)} \), where \( \mathsf{SortIndex} \) is a function that returns, for each element of the vector \(x\), its index in the sorted version of \(x\). For example, if \( x_1 \) is the smallest value in \( x \), then \( \mathsf{SortIndex}(x_1) = 0 \). We add 0.5 to balance the bias that we introduced by choosing 0-indexing.
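The quantile formula and the moments above can be sketched in a few lines of plain Python. This is only an illustration, not the optimized BatchCopula code: the helper names (`empirical_moment`, `quantiles`, `to_target`) are made up for the example, and the standard normal target is an assumption chosen because its inverse CDF ships with Python’s `statistics` module.

```python
from statistics import NormalDist

def empirical_moment(x, i):
    """Empirical i-th raw moment, an estimate of E[X^i]."""
    return sum(v ** i for v in x) / len(x)

def quantiles(x):
    """Quantile(x) = (SortIndex(x) + 0.5) / len(x), using 0-indexed ranks."""
    order = sorted(range(len(x)), key=lambda i: x[i])  # positions, smallest value first
    sort_index = [0] * len(x)
    for rank, i in enumerate(order):
        sort_index[i] = rank  # ties get distinct ranks, in original order
    return [(r + 0.5) / len(x) for r in sort_index]

def to_target(x, target=NormalDist(0.0, 1.0)):
    """Replace each element by the target distribution's value at the same quantile."""
    return [target.inv_cdf(q) for q in quantiles(x)]

x = [3.2, -1.0, 10.0, 0.5]   # made-up activations
q = quantiles(x)             # [0.625, 0.125, 0.875, 0.375]
z = to_target(x)

print(q)
print(z)
print(empirical_moment(z, 1))  # first moment of the transformed values
```

A side effect of the 0.5 offset is that every quantile stays strictly between 0 and 1, so the inverse CDF is always defined. The transform only depends on the ranks of the inputs, which is why the outlier 10.0 ends up no further from the center than -1.0 does on the other side.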

So far, the copula transform approach looks the most promising. On the MNIST & CIFAR-10 datasets, adding Copula to one layer of the network can outperform adding multiple BN layers to the same network (see the last two journals). Note that the performance here is measured by the rate of convergence. We have optimized the copula transform (which we call BatchCopula) so that it supports efficient CPU & GPU versions. More results to come!

Research Documents

Independent Study

Advisor: Professor Scott E. Fahlman

In my junior undergraduate year, I also had the opportunity to work with Professor Scott E. Fahlman on information extraction for the Scone knowledge-base system.

About Scone

Scone is an artificial intelligence knowledge base (KB) system that uses symbolic reasoning and first-order logic to represent knowledge and make inferences dynamically. It was developed mainly by Professor Scott E. Fahlman and a team consisting of PhD, master's and undergraduate students at CMU over the years. With techniques such as the marker-passing algorithm as well as a symbolic representation of the knowledge network, it is able to make relational inferences as well as categorical predictions very rapidly and efficiently.

Scone is primarily written in Common Lisp, and some outer components (such as extractors) have been built in Python and Java. While it is certainly hard to summarize this complicated system in just one paragraph, you are very welcome to learn more on Scott’s website on Scone.

What to extract, and from where?

Scone is a clever AI that goes beyond first-order logic (which is what typical programs rely on). In particular, it uses the marker-passing algorithm along with hierarchical links to manage not only the learned information, but also the relational judgments. Its efficient Q&A and inference make it possible to infer certain facts without being told them directly (for example, “Clyde is an elephant in Harry Potter” implies that “Clyde is in Britain”, based on the contextual information previously inferred about Harry Potter).
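To give a rough feeling for the Clyde example, here is a toy Python sketch of passing a marker along links. This is not Scone’s implementation or API (Scone is written in Common Lisp and is far more sophisticated); the two link tables, the function name `pass_marker`, and the breadth-first traversal are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy knowledge: two kinds of links, each stored as node -> targets.
IS_A = {
    "Clyde": ["elephant"],
    "elephant": ["mammal"],
}
IN_CONTEXT = {
    "Clyde": ["Harry Potter"],
    "Harry Potter": ["Britain"],
}

def pass_marker(start, links):
    """Mark `start`, then keep passing the marker along every outgoing link."""
    marked = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, []):
            if target not in marked:  # each node is marked at most once
                marked.add(target)
                queue.append(target)
    return marked

types = pass_marker("Clyde", IS_A)          # everything Clyde "is"
places = pass_marker("Clyde", IN_CONTEXT)   # every context Clyde is "in"

print("mammal" in types)    # Clyde is a mammal, though nobody said so directly
print("Britain" in places)  # Clyde is in Britain, via the Harry Potter context
```

Nobody told the toy KB that Clyde is in Britain; the fact falls out of following the links, which is the flavor of inference the paragraph above describes.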

But such a clean way of managing the information comes at a cost: we cannot simply let everything into the KB without verifying the information. Also, in some cases, the data come in the form of text, with various irrelevant symbols, words and even phrases in it. So a key question that I explored is the extraction of valuable information from semi-organized sources, where useful information is present but in a disorderly fashion. In some of these sources, the information may not even be directly accessible (and needs further user-source interactions).

We primarily tested on data in the geographical and sports domains.


After several attempts, the method that I found worked best and most efficiently was a two-step analysis:

  • Phantom web engine + general Bayes classifier. Because the information may not be available immediately, the first challenge is to simulate typical human-level behavior and get the raw information. The Bayes classifier then turned out to be an efficient way of filtering out irrelevant parts. Some of the nuances and smaller parts, nevertheless, could not be detected easily.
  • NLP structure tree. To get at the meaning of the sentences and words, NLP parsing of the sentence structure was super useful. While standard NLP techniques are typically used for understanding the meaning of text and speech, the NLP tree worked especially well for Scone AI because its tree structure also implies relational information that could be directly harnessed by Scone.
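For the filtering step in the first bullet, here is a minimal sketch of what a Bayes-style relevance filter can look like. I am assuming a word-count naive Bayes classifier with add-one smoothing, which is one common choice and not necessarily the exact classifier used in the project; the training snippets, labels, and function names are made up for the example.

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (text, label). Returns word counts per label, label counts, vocabulary."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-posterior under naive Bayes."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # add-one (Laplace) smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training snippets: page content vs. page clutter.
model = train([
    ("pittsburgh is a city in pennsylvania", "relevant"),
    ("the steelers play in pittsburgh", "relevant"),
    ("click here to subscribe", "irrelevant"),
    ("share this page login", "irrelevant"),
])

print(classify("the penguins play in pittsburgh", model))
print(classify("click here to login", model))
```

A filter like this works on word statistics alone, which fits the remark above that nuances and smaller parts are hard to catch with it; that is where the structure tree comes in.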

Another very challenging part I would like to highlight is the cross-checking. When the AI absorbs a new piece of information, we need to know at least two things: (1) it does not already have this information (no overlap); and (2) the new information is consistent with what it has previously learned. For example, if Scone previously learned that “Clyde is an elephant” and “Clyde is in Britain”, then a new statement saying “Clyde is flying happily over the Mississippi river” can compromise the sanity of the KB. To automate the sanity-checking process in Scone AI, I built a cross-checking mechanism that takes advantage of the higher-order logical constructs of Scone as well as its relational links. Such a connected knowledge net can quickly pass the query both top-down (subtype check) and bottom-up (negation/cancellation properties; e.g. “{Clyde} is an {elephant} but is {white}” is a cancellation/exception).
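The two checks (overlap and consistency) plus the cancellation idea can be illustrated with a toy Python sketch. Again, this is not the actual mechanism or Scone’s API: the dictionaries, the property names, the “elephants are gray and cannot fly” defaults, and the functions `inherited` and `check` are all hypothetical and exist only for this example.

```python
# Hypothetical toy KB. IS_A maps each node to its single parent type.
IS_A = {"Clyde": "elephant", "elephant": "mammal"}

# Default properties stated at the type level (assumed for this example).
TYPE_PROPERTIES = {"elephant": {"can_fly": False, "color": "gray"}}

# Per-individual exceptions that cancel an inherited default.
EXCEPTIONS = {"Clyde": {"color": "white"}}

def inherited(entity, prop):
    """Look up `prop` for `entity`: exceptions first, then walk up the is-a chain."""
    if prop in EXCEPTIONS.get(entity, {}):
        return EXCEPTIONS[entity][prop]
    node = entity
    while node is not None:
        props = TYPE_PROPERTIES.get(node, {})
        if prop in props:
            return props[prop]  # the nearest type that mentions the property wins
        node = IS_A.get(node)
    return None  # the KB knows nothing about this property

def check(entity, prop, value):
    """Classify a candidate fact as 'new', 'duplicate' (overlap) or 'conflict'."""
    known = inherited(entity, prop)
    if known is None:
        return "new"
    return "duplicate" if known == value else "conflict"

print(check("Clyde", "can_fly", True))    # conflict: elephants cannot fly
print(check("Clyde", "color", "white"))   # duplicate: already stored as an exception
print(check("Clyde", "color", "gray"))    # conflict: the exception cancels the default
print(check("Clyde", "habitat", "zoo"))   # new: nothing known yet
```

Only facts classified as new would be added; duplicates can be dropped, and conflicts need either rejection or an explicit exception like the white-elephant one.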

More information can be found in my CMU poster and on Scone’s website. The code (mainly written in Python + Java) and other documents I have, unfortunately, cannot be released here for now.


The results were mainly in the form of implementations, and the code was written primarily in Common Lisp and Python. While I cannot release the code directly, I am in the process of tidying the notes & reports that I wrote while working on the information extraction. In May 2016, I was also very fortunate to be able to attend the “Meeting of the Minds” research poster presentation at Carnegie Mellon, where I received 3rd place in its Sigma-Xi competition (CS division). The poster can be found here.

Doom DRL Project

This is a recent project (and half research, I would say) that I completed with my teammate. We worked on the very interesting problem of training an agent in the game Doom (which is 3D, partially observable and strategy-demanding) using one of the latest Deep Reinforcement Learning (DRL) techniques. More details can be found in the paper, our presentation poster, or this blog post of mine.

Project Documents