
Geoffrey Hinton's Dark Knowledge of Machine Learning

Recently Geoffrey Hinton gave a presentation about “Dark Knowledge” at TTIC, sharing his insights on ensemble methods in machine learning and deep neural networks. This blog post is a summary of his presentation, written after watching the video and going through the slides.

Model Ensemble

The easiest way to extract a lot of knowledge from the training data is to learn many different models in parallel.
At test time we average the predictions of all the models or of a selected subset of good models that make different errors.

Different models may focus on different parts of the problem, and averaging them gives us more comprehensive knowledge (in this case, overfitting each individual model can even be helpful). As Hinton puts it, “that’s how almost all ML competitions are won”. However, a big ensemble may take a lot of memory and a long time to produce a prediction, which is not desirable in a production environment. Besides, a big ensemble is highly redundant and has very little knowledge per parameter.
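To make the test-time procedure concrete, here is a minimal Python sketch (my own illustration, not from the talk) of combining the softmax outputs of an ensemble for one test example, using both the arithmetic mean and the geometric mean:

```python
import numpy as np

# Minimal sketch: combining ensemble predictions at test time.
# `probs` holds each model's softmax output for one example, shape (n_models, n_classes);
# the values are made up for illustration.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.5, 0.1, 0.4]])

arithmetic_mean = probs.mean(axis=0)

# Geometric mean (renormalized), the combination the slides use for dropout nets.
geometric_mean = np.exp(np.log(probs).mean(axis=0))
geometric_mean /= geometric_mean.sum()

print(arithmetic_mean, geometric_mean)
```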

So, here’s the problem: can we transfer the knowledge of a big ensemble into a single smaller model?

Soft targets: A way to transfer the function

While training a classification model, we usually use “hard targets”: a one-hot category vector with one entry equal to 1 and the rest 0, like [0, 0, 1, 0]. “Soft targets” are instead probability distributions over the categories, like [0.15, 0.2, 0.6, 0.05]. Now, if we have a big ensemble, we can divide its averaged logits z_i by a “temperature” T before applying the softmax to get a very soft distribution:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
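Here is a small numpy sketch of that temperature softmax; the logit values are made up for illustration:

```python
import numpy as np

# Softmax with a temperature T applied to the ensemble's averaged logits.
# Higher T produces a softer (more uniform) distribution.
def softmax_with_temperature(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

avg_logits = np.array([5.0, 2.0, 1.0, -1.0])      # assumed example values
print(softmax_with_temperature(avg_logits, T=1))  # close to a hard target
print(softmax_with_temperature(avg_logits, T=5))  # much softer distribution
```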

Then, instead of using “hard targets”, we can minimize the cross entropy between the softmax prediction and the given “soft target” distributions (which are generated by the big ensemble we have; we’re trying to transfer the knowledge of the big ensemble into our new model). Geoffrey calls this “the dark knowledge”, because such “soft targets” reveal knowledge about the relationships between categories and put more constraints on the model function. In practice, it works better to fit both the hard targets and the soft targets from the ensemble. The cross entropy with the “hard targets” needs to be down-weighted, but even when down-weighted, this term is important for getting the best result.
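As a rough illustration of fitting both kinds of targets, here is a numpy sketch of a combined objective. The temperature and the hard-target weight are assumed values, not taken from the talk:

```python
import numpy as np

# Combined loss: cross entropy with the ensemble's soft targets at temperature T,
# plus a down-weighted cross entropy with the one-hot hard label.
def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, soft_target, hard_label, T=5.0, hard_weight=0.3):
    p_soft = softmax(student_logits, T)     # student prediction at high temperature
    p_hard = softmax(student_logits, 1.0)   # student prediction at T = 1
    soft_ce = -np.sum(soft_target * np.log(p_soft + 1e-12))
    hard_ce = -np.log(p_hard[hard_label] + 1e-12)
    return soft_ce + hard_weight * hard_ce

soft_target = np.array([0.15, 0.2, 0.6, 0.05])    # e.g. produced by the big ensemble
print(distillation_loss(np.array([0.5, 1.0, 2.5, -0.5]), soft_target, hard_label=2))
```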

Training an ensemble of models

How can we train an ensemble efficiently? We can parallelize the training, but it is still expensive.

In the case of neural nets, we can use dropout: each hidden unit has a probability (usually 0.5) of being omitted. Consider a neural net with a single hidden layer containing H hidden units; we’re actually sampling from 2^H different architectures, all of which share weights. In this way each model is strongly regularized.

Then, at test time, we can either sample many different architectures and take the geometric mean of their outputs, or use all the hidden units but multiply their outputs by their associated probabilities. The latter computes the expectation exactly and is much faster. For multi-layer neural nets, we use dropout on each hidden layer and use all hidden units at test time (it’s not exactly the expectation in the multi-layer case, but it’s a pretty good approximation).
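A tiny numpy sketch of that test-time trick, with made-up activations:

```python
import numpy as np

# Dropout on one hidden layer. At training time each hidden unit is kept with
# probability p; at test time we keep every unit and scale its output by p,
# which matches the expected activation under dropout.
rng = np.random.default_rng(0)
p = 0.5                                  # keep probability (usually 0.5)
h = np.array([1.2, -0.3, 0.7, 2.0])      # assumed hidden-unit activations

# Training time: a fresh random binary mask for every example.
mask = rng.random(h.shape) < p
h_train = h * mask

# Test time: use all units, scaled by the keep probability.
h_test = h * p

print(h_train, h_test)
```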

How well does knowledge transfer work? Experiments

MNIST experiment:

  1. Train a 784 -> 800 -> 800 -> 10 neural net with vanilla backprop: 146 test errors
  2. Train a 784 -> 1200 -> 1200 -> 10 net using dropout and weight constraints and jittering the input: 67 errors
  3. Using both the soft targets obtained from the big net and the hard targets, train a 784 -> 800 -> 800 -> 10 net with vanilla backprop: 74 test errors!

More experimental results can be found in the original slides.

This shows that the soft targets do contain a lot of useful knowledge. It suggests that in object recognition, besides transforming the input images (e.g. feature learning), transforming the targets can also do a lot to improve the generalization of models. There are other ways to transform the targets as well, e.g. organizing the labels into a tree.

Training a community of neural nets

If we train ten 784 -> 500 -> 300 -> 10 nets independently on MNIST, they average about 158 test errors, but the geometric ensemble gets 143 errors.
If we let each net also try to match soft targets derived by averaging the opinions of the whole community as it trains, the nets now average 126 errors and the ensemble gets 120 errors!
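A hypothetical PyTorch sketch of this co-training setup might look like the following. The net sizes follow the text; the temperature, the soft-target weight, and the optimizer settings are my own assumptions:

```python
import torch
import torch.nn.functional as F

# Each net in the community fits the hard labels and also tries to match the
# averaged, softened predictions of the whole community.
def make_net():
    return torch.nn.Sequential(
        torch.nn.Linear(784, 500), torch.nn.ReLU(),
        torch.nn.Linear(500, 300), torch.nn.ReLU(),
        torch.nn.Linear(300, 10))

nets = [make_net() for _ in range(10)]
opts = [torch.optim.SGD(n.parameters(), lr=0.1) for n in nets]
T, soft_weight = 3.0, 0.5                # assumed temperature and soft-target weight

def train_step(x, y):
    logits = [net(x) for net in nets]
    with torch.no_grad():                # community opinion: average softened prediction
        community = torch.stack([F.softmax(l / T, dim=1) for l in logits]).mean(0)
    for l, opt in zip(logits, opts):
        hard = F.cross_entropy(l, y)
        soft = F.kl_div(F.log_softmax(l / T, dim=1), community, reduction="batchmean")
        loss = (1 - soft_weight) * hard + soft_weight * (T * T) * soft
        opt.zero_grad()
        loss.backward()
        opt.step()

# Usage: call train_step(batch_images, batch_labels) inside a normal training loop,
# where batch_images has shape (batch, 784) and batch_labels holds class indices.
```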

Mine knowledge more efficiently

We can encourage different members of the ensemble to focus on resolving different confusions.

The idea is to train “specialists” that focus on certain domains of the problem. For example, by feeding examples enriched in mushrooms during training, we can get a model that specializes in distinguishing “mushroom” from “non-mushroom” classes. To decide which classes each specialist should cover, we can use k-means to cluster the soft target vectors we have.
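As a sketch of that clustering step (using scikit-learn’s KMeans, with randomly generated soft targets standing in for the real ones):

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the ensemble's soft target vectors to find groups of confusable classes;
# each cluster then defines the domain of one specialist.
rng = np.random.default_rng(0)
soft_targets = rng.dirichlet(np.ones(10), size=1000)   # placeholder for real soft targets

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_of_example = kmeans.fit_predict(soft_targets)

# Each cluster centre shows which classes tend to be confused together;
# a specialist can then be trained on examples enriched in those classes.
for k, centre in enumerate(kmeans.cluster_centers_):
    print(f"specialist {k}: dominant classes {np.argsort(centre)[::-1][:3]}")
```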

The problem with such specialists is that they overfit easily. One way to reduce this is to have each specialist use a reduced softmax during training, with a single dustbin class covering all the classes it does not specialize in. In addition, “The specialist is initialized with the weights of a previously trained generalist model and uses early stopping to prevent over-fitting”. What’s more, for 50% of the data we fit the hard targets only and for the other 50% we fit the soft targets only, and the soft targets help prevent overfitting.
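Here is a small numpy sketch of how a full target could be reduced to a specialist’s classes plus a dustbin; the specialist’s classes and the example target are made up:

```python
import numpy as np

# Build a "reduced" target for a specialist: keep its own classes and lump
# every other class into a single dustbin class (the last entry).
def reduce_target(full_target, specialist_classes):
    full_target = np.asarray(full_target, dtype=float)
    kept = full_target[specialist_classes]
    dustbin = full_target.sum() - kept.sum()        # mass of all other classes
    return np.append(kept, dustbin)

specialist_classes = [2, 5, 7]                       # classes this specialist covers
soft_target = np.array([0.02, 0.03, 0.40, 0.05, 0.02, 0.25, 0.03, 0.15, 0.03, 0.02])
print(reduce_target(soft_target, specialist_classes))  # [0.40, 0.25, 0.15, 0.20]
```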

Another problem is how to combine the reduced softmax outputs of the specialists at test time. One way is, for each test or transfer case, to run a fast iterative loop that finds the set of logits whose softmax best fits the partial distributions produced by the trained specialists.
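A hypothetical numpy sketch of such a loop, using plain gradient descent on a sum of KL divergences between each specialist’s prediction and the correspondingly reduced softmax of the shared logits (the specialists and their outputs below are invented for illustration):

```python
import numpy as np

# Find full logits q whose softmax, once reduced to each specialist's classes
# plus a dustbin, best matches that specialist's predicted distribution.
def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reduce_dist(p, classes):
    kept = p[classes]
    return np.append(kept, 1.0 - kept.sum())         # last entry = dustbin mass

def kl(t, r):
    return np.sum(t * np.log((t + 1e-12) / (r + 1e-12)))

# Each specialist: (the classes it covers, its predicted reduced distribution).
specialists = [([2, 5, 7], np.array([0.5, 0.2, 0.1, 0.2])),
               ([1, 2],    np.array([0.3, 0.6, 0.1]))]

def loss(q):
    p = softmax(q)
    return sum(kl(t, reduce_dist(p, c)) for c, t in specialists)

q, eps, lr = np.zeros(10), 1e-5, 0.5                  # logits over all 10 classes
for _ in range(500):
    grad = np.array([(loss(q + eps * np.eye(10)[i]) - loss(q)) / eps
                     for i in range(10)])             # simple numerical gradient
    q -= lr * grad

print(softmax(q))    # combined distribution over all classes for this test case
```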

Summary

Basically, the two most important ideas in this presentation are:

  1. Use soft targets together with hard targets
  2. Train specialists to extract more knowledge from the data

Here’re the links to the video and the slide.