ML Paper Challenge Day 2 — Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Day 2: 2020.04.13
Paper: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Goal: “to build accurate (NLP) models which fit a given memory and latency budget”
Scope: “Since an exhaustive search over this space is impractical, we fix the model architecture to bidirectional Transformers, known to be suitable for a wide range of NLP tasks”
- Teacher: “a highly accurate but large model for an end task, that does not meet the resource constraints”. The paper uses BERT-BASE and BERT-LARGE as teachers.
- Student: “compact models that satisfy resource constraints”
- Pre-trained Distillation (PD) :
> Step 1 — Pre-training on Unlabelled language model data: “A compact model is trained with a masked LM objective, capturing linguistic phenomena from a large corpus of natural language texts.”
(Pre-training mitigates the negative effects caused by an imperfect transfer set in step 2.)
> Step 2 — Distillation on Unlabelled transfer data: “This well-read student (after pre-training) is now prepared to take full advantage of the teacher expertise, and is trained on the soft labels (predictive distribution) produced by the teacher.”
> Step 3 — Fine-tuning on Labelled data (Optional) : “makes the model robust to potential mismatches between the distribution of the transfer and labeled sets” (with this step included, the method is denoted PDF instead of PD)
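To make Step 2 more concrete, here is a minimal sketch of training on soft labels: the student is penalised by the cross-entropy between the teacher's predictive distribution and its own. The function names (`softmax`, `soft_label_cross_entropy`) and the toy logits are my own illustration, not code from the paper, and a real setup would compute this inside a deep learning framework over batches.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's soft labels.

    Hypothetical helper for illustration only: the teacher's probabilities act
    as the targets, so no ground-truth label is needed for the transfer data.
    """
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Toy example on one unlabelled input with two classes (made-up logits).
teacher_logits = [2.0, 0.0]
agreeing_student = [2.0, 0.0]
disagreeing_student = [0.0, 2.0]

loss_agree = soft_label_cross_entropy(agreeing_student, teacher_logits)
loss_disagree = soft_label_cross_entropy(disagreeing_student, teacher_logits)
print(loss_agree, loss_disagree)
```

The loss is smallest when the student reproduces the teacher's distribution, which is why the transfer set can be unlabelled: only the teacher's outputs are required.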
- Pre-training students is surprisingly important, even when millions of samples are available for transfer. (Even though the teacher has already been pre-trained and fine-tuned.)
- No shortcut — Full Pre-training is necessary
1. It is not enough to pre-train just the word embeddings.
2. Truncating a deep pre-trained model is worse than fully pre-training the student, especially for shallow students.
3. “As a rule of thumb, prioritize depth over width, especially with pre-trained students.”
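To get a feel for the depth versus width trade-off in point 3, the sketch below estimates the parameter count of a BERT-style encoder from its number of layers and hidden size. This is my own back-of-the-envelope approximation, not a formula from the paper: it assumes a feed-forward size of 4x the hidden size and a vocabulary of 30522 tokens, and it ignores biases, layer norms and position embeddings. The function name `approx_transformer_params` is hypothetical.

```python
def approx_transformer_params(num_layers, hidden_size, vocab_size=30522):
    """Rough parameter count for a BERT-style encoder (assumption-laden estimate).

    Per layer: 4*H^2 for the attention projections (Q, K, V, output)
    plus 2*H*(4H) for the two feed-forward matrices, i.e. about 12*H^2.
    Biases, layer norms and position embeddings are ignored.
    """
    embedding = vocab_size * hidden_size
    attention = 4 * hidden_size * hidden_size
    feed_forward = 2 * hidden_size * (4 * hidden_size)
    return embedding + num_layers * (attention + feed_forward)

# Two hypothetical students: one deeper and narrower, one shallower and wider.
deep_narrow = approx_transformer_params(num_layers=8, hidden_size=256)
shallow_wide = approx_transformer_params(num_layers=2, hidden_size=512)
print(deep_narrow, shallow_wide)
```

Under these assumptions the cost grows linearly with depth but quadratically with width, so adding layers is the cheaper way to grow a student for a fixed budget, which is consistent with the rule of thumb quoted above.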