
Day 19: 2020.04.30
Paper: Achieving Human Parity in Conversational Speech Recognition
Category: Model/Deep Learning/Speech Recognition

Model Architecture

Human parity is achieved by combining multiple models!

3 CNNs

  1. VGG architecture
    compared with networks previously used in image recognition, it uses small (3x3) filters, is deeper, and applies up to five convolutional layers before pooling
  2. ResNet architecture
    adds highway connections, i.e. a linear transform of each layer’s input is added to the layer’s output
    The only difference from the standard ResNet is that the authors apply Batch Normalisation before computing the ReLU activations.
  3. LACE (layer-wise context expansion with attention) model
    a TDNN variant in which each higher layer is a weighted sum of nonlinear transformations of a window of lower layer frames
    -> each higher layer exploits broader context than lower…
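
To make the ResNet and LACE ideas above concrete, here is a minimal pure-Python sketch. It is not the paper's implementation. The function names, the scalar "convolution" stand-in, the tanh nonlinearity, and the fixed window weights are all assumptions chosen for illustration, and the batch norm has no learned scale or shift.

```python
import math

def batch_norm(xs, eps=1e-5):
    """Normalise a batch of scalars to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def relu(xs):
    return [max(0.0, x) for x in xs]

def residual_layer(xs, weight, skip_weight=1.0):
    """Toy residual layer: linear skip path plus a BN -> ReLU branch."""
    transformed = [weight * x for x in xs]      # stand-in for a convolution (assumption)
    activated = relu(batch_norm(transformed))   # batch norm is applied before ReLU
    # Highway/skip connection: a linear transform of the input is added to the output.
    return [skip_weight * x + a for x, a in zip(xs, activated)]

def context_expansion(frames, weights):
    """One TDNN-style layer: each output frame is a weighted sum of
    nonlinearly transformed lower-layer frames inside a centred window."""
    half = len(weights) // 2
    out = []
    for t in range(len(frames)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t + k - half
            if 0 <= idx < len(frames):          # frames outside the utterance are skipped
                total += w * math.tanh(frames[idx])  # tanh is an illustrative choice
        out.append(total)
    return out

def context_width(num_layers, window):
    """Input frames visible to one output frame after stacking layers."""
    return 1 + num_layers * (window - 1)
```

With a window of 3 frames, `context_width(1, 3)` is 3 and `context_width(3, 3)` is 7, which is the sense in which each higher layer sees broader context than the one below it. Feeding a single non-zero frame through `context_expansion` twice shows the same growth: the non-zero region spreads from 1 frame to 3, then to 5.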


Written by Chun-kit Ho

cloud architect@ey | full-stack software engineer | social innovation | certified professional solutions architect in aws & gcp
