
Day 2: 2020.04.13
Paper: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
Category: Model/NLP

Goal: “to build accurate (NLP) models which fit a given memory and latency budget”.

Scope: “Since an exhaustive search over this space is impractical, we fix the model architecture to bidirectional Transformers, known to be suitable for a wide range of NLP tasks”.

Idea

  • Teacher: “a highly accurate but large model for an end task, that does not meet the resource constraints”. The paper uses BERT-BASE and BERT-LARGE as teachers.
  • Student: “compact models that satisfy resource constraints”.
  • Pre-trained Distillation (PD) — see the sketch after this list:
    > Step 1 — Pre-training on unlabeled LM data: “A compact model is trained with a masked LM” objective on a large corpus of unlabeled text.
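After pre-training, PD distills the compact student on the teacher's predictive distribution. Below is a minimal PyTorch sketch of such a soft-label distillation loss for a classification end task; the function name, tensor shapes, and optional temperature knob are my own illustration of the general technique, not code from the paper.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 1.0) -> torch.Tensor:
    """Cross-entropy of student predictions against the teacher's soft labels.

    Hypothetical helper for illustration: the student is trained on the
    teacher's predictive distribution; the temperature argument is a common
    knowledge-distillation convention, not taken from the paper.
    """
    # Teacher probabilities act as soft targets; no gradient flows to the teacher.
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Toy usage: batch of 4 examples, 3-way classification.
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
loss = soft_label_distillation_loss(student_logits, teacher_logits)
loss.backward()
```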


Written by Chun-kit Ho

Cloud architect @ EY | full-stack software engineer | social innovation | certified professional solutions architect in AWS & GCP
