ML Paper Challenge Day 18 — Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Day 18: 2020.04.29
Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Category: Model/Deep Learning/Speech Recognition
Model Architecture

Input: log-spectrograms of power-normalised audio clips, computed over 20 ms windows
Output: graphemes (characters) of each language's alphabet
Inference: CTC models paired with a language model trained on a larger text corpus
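As a rough illustration of the input featurisation, here is a minimal NumPy sketch of computing log-spectrograms over 20 ms windows. The hop size, FFT windowing, and the exact power-normalisation scheme are my assumptions, not details taken from the paper.

```python
import numpy as np

def log_spectrogram(audio, sample_rate=16000, window_ms=20, hop_ms=10, eps=1e-10):
    """Log power spectrogram over fixed-length windows (assumed parameters)."""
    window = int(sample_rate * window_ms / 1000)  # e.g. 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)
    # Power-normalise the clip (assumption: scale to unit RMS).
    audio = audio / (np.sqrt(np.mean(audio ** 2)) + eps)
    frames = []
    for start in range(0, len(audio) - window + 1, hop):
        frame = audio[start:start + window] * np.hanning(window)
        power = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(np.log(power + eps))
    return np.stack(frames)  # shape: (time, freq_bins)
```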
Batch Normalisation for Deep RNNs
Objective: keep gradient-descent training effective as network size and depth increase
2 ways to apply it:
- Insert a BatchNorm transformation, B(·), immediately before every non-linearity, including the recurrent connections -> not effective
- Batch-normalise only the vertical (feed-forward) connections -> works well
For each hidden unit, compute the mean and variance statistics over all items in the mini-batch and over the length of the sequence (sequence-wise normalisation).
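A minimal PyTorch sketch of this sequence-wise batch normalisation, applied only to the vertical (input-to-hidden) pre-activations. The module name and interface are illustrative, and running statistics for inference are omitted for brevity:

```python
import torch
import torch.nn as nn

class SequenceWiseBatchNorm(nn.Module):
    """Normalise each hidden unit over the mini-batch AND the time dimension."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.eps = eps

    def forward(self, x):
        # x: (batch, time, features) -- pre-activation of a vertical connection.
        # Statistics are computed over both batch and time, per hidden unit.
        mean = x.mean(dim=(0, 1), keepdim=True)
        var = x.var(dim=(0, 1), keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```

The key point is that the recurrent (hidden-to-hidden) connections are left unnormalised; only the feed-forward input to each layer passes through this transform.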
SortaGrad
Objective: make training more stable; it accelerates training and results in better generalisation
Use the length of the utterance as a heuristic for difficulty and train on the shorter (easier) utterances first.
In the first training epoch, iterate through mini-batches in the training set in increasing order of the length of the longest utterance in each mini-batch. After the first epoch, training reverts to a random order over mini-batches.
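A minimal sketch of this curriculum, assuming utterances are records carrying a precomputed length field; the function and field names are my own, not from the paper:

```python
import random

def sortagrad_batches(utterances, batch_size, epoch):
    """Yield mini-batches: epoch 0 ordered by utterance length, later epochs shuffled."""
    if epoch == 0:
        # First epoch: sort so early mini-batches contain short (easy) utterances,
        # which also orders batches by the length of their longest utterance.
        items = sorted(utterances, key=lambda u: u["num_frames"])
    else:
        items = list(utterances)
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if epoch > 0:
        random.shuffle(batches)  # revert to a random order over mini-batches
    return batches
```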