Day 2: 2020.04.13
Paper: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Category: Model/NLP

Goal: “to build accurate (NLP) models which fit a given memory and latency budget”

Scope: “Since an exhaustive search over this space is impractical, we fix the model architecture to bidirectional Transformers, known to be suitable for a wide range of NLP tasks.”


  • Teacher: “a highly accurate but large model for an end task, that does not meet the resource constraints”. The paper uses BERT-BASE and BERT-LARGE as teachers.
  • Student: “compact models that satisfy resource constraints”
  • Pre-trained Distillation (PD):
    > Step 1: Pre-training on unlabelled language model data: “A compact model is trained with a masked LM objective, capturing linguistic phenomena from a large corpus of natural language texts.”
    (Pre-training mitigates the negative effects caused by an imperfect transfer set in Step 2.)
    > Step 2: Distillation on unlabelled transfer data: “This well-read student (after pre-training) is now prepared to take full advantage of the teacher expertise, and is trained on the soft labels (predictive distribution) produced by the teacher.”
    > Step 3: Fine-tuning on labelled data (optional): “makes the model robust to potential mismatches between the distribution of the transfer and labeled sets” (with this step included, the method is called PDF instead of PD)
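
To make Step 2 concrete, here is a minimal sketch of a soft-label distillation loss for a single unlabelled example. It assumes the loss is the cross-entropy between the teacher's predictive distribution and the student's softmax output, which is the usual way to train on soft labels. The function names (`log_softmax`, `distillation_loss`) and the toy logits are illustrative assumptions, not taken from the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]

def distillation_loss(teacher_probs, student_logits):
    """Cross-entropy between teacher soft labels and the student's prediction.

    Assumption: this is a generic soft-label loss, not the paper's exact code.
    """
    student_log_probs = log_softmax(student_logits)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))

# Hypothetical teacher output for one example from the transfer set.
teacher_probs = [math.exp(lp) for lp in log_softmax([2.0, 0.5, -1.0])]

# A student that agrees with the teacher gets a lower loss than one that disagrees.
aligned = distillation_loss(teacher_probs, [2.0, 0.5, -1.0])
misaligned = distillation_loss(teacher_probs, [-1.0, 0.5, 2.0])
print(aligned < misaligned)
```

No gold label appears anywhere in the loss, which is why Step 2 can run on unlabelled transfer data: the teacher's distribution is the only supervision signal.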


  • Pre-training students is surprisingly important, even when millions of samples are available for transfer. (This holds even though the teacher has already been pre-trained and fine-tuned.)
  • No shortcut: full pre-training is necessary
    1. It is not enough to pre-train just the word embeddings.
    2. Truncating a deep pre-trained model works worse than fully pre-training the student, especially for shallow students.
    3. “As a rule of thumb, prioritize depth over width, especially with pre-trained students.”
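
To see what trading depth against width means for a memory budget, the sketch below gives a rough parameter count for a BERT-style encoder. It is a back-of-the-envelope estimate under stated assumptions: a feed-forward size of 4x the hidden size, a vocabulary of 30522 WordPiece tokens, 512 positions, and no pooler or output head. The function name `encoder_params` and the two example configurations are illustrative and are not figures from the paper.

```python
def encoder_params(num_layers, hidden, vocab_size=30522, max_positions=512):
    """Approximate parameter count of a BERT-style Transformer encoder.

    Assumptions: feed-forward size is 4 * hidden; the pooler, output head and
    embedding layer norm are ignored.
    """
    # Token, position and segment (2 types) embedding tables.
    embeddings = (vocab_size + max_positions + 2) * hidden
    # Query, key, value and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> 4*hidden -> hidden, with biases.
    feed_forward = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    # Two layer norms per layer, each with a gain and a bias vector.
    layer_norms = 2 * 2 * hidden
    return embeddings + num_layers * (attention + feed_forward + layer_norms)

deep_narrow = encoder_params(num_layers=8, hidden=256)
shallow_wide = encoder_params(num_layers=2, hidden=512)
print(f"deep/narrow:  {deep_narrow / 1e6:.1f}M parameters")
print(f"shallow/wide: {shallow_wide / 1e6:.1f}M parameters")
```

Per-layer cost grows roughly with the square of the hidden size but only linearly with the number of layers, so under a fixed budget extra layers are comparatively cheap. Whether the deeper model is also more accurate is the paper's empirical claim; this estimate only covers the size side of the trade-off.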
Chun-kit Ho
