Key Points

1. The paper's appendix illustrates how experts specialize: for example, one expert is used when the indefinite article "a" introduces the direct object of a verb phrase indicating importance or leadership.

2. It presents the contexts corresponding to a few experts in the WMT'14 En→Fr translation model, i.e., the words surrounding the positions in the input sentences at which those experts are used.

3. Because of peculiarities in the authors' infrastructure at the time, the machine translation experiments used a different gating function that guarantees every expert receives exactly the same batch size.

4. The paper describes the softmax gating function and an alternate formulation that obtains a sparse gating vector by masking and renormalizing it (a sketch follows this list).

5. It describes the implementation of top-K and batchwise masks to assign experts to input examples and the use of a threshold mask at inference time.

6. It explains how inference must be modified when batchwise mask functions are used during training, namely by learning a vector of per-expert threshold values.

7. It provides details about the attention mechanism described in GNMT and its implementation as a feed forward neural network with trainable weight matrices and vectors.

8. For performance reasons, the authors used a slightly different attention function in their models, one that allows attention to be computed simultaneously over multiple source and target time steps.

9. The paper notes little difference in quality between the two attention functions used in their models.
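
As a concrete illustration of points 4 and 5, here is a minimal NumPy sketch of the softmax gating function and the masked, renormalized formulation that yields a sparse gating vector, using a top-k mask. The variable names and shapes are my own assumptions, not the authors' code.

```python
import numpy as np

def softmax_gating(x, w_gate):
    """Dense gate: softmax over expert logits. x: [batch, d], w_gate: [d, n_experts]."""
    logits = x @ w_gate
    logits = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def top_k_mask(gates, k):
    """Keep the k largest gate values per example; zero out the rest."""
    kth_largest = np.sort(gates, axis=-1)[:, -k][:, None]
    return (gates >= kth_largest).astype(gates.dtype)

def sparse_gating(x, w_gate, k):
    """Multiply the dense gate by a sparse mask, then renormalize so the kept weights sum to 1."""
    dense = softmax_gating(x, w_gate)
    masked = dense * top_k_mask(dense, k)
    return masked / masked.sum(axis=-1, keepdims=True)
```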

Summary

Introduction of the Sparsely-Gated Mixture-of-Experts Layer for Conditional Computation
The paper introduces a new technique, the Sparsely-Gated Mixture-of-Experts (MoE) layer, to address challenges in conditional computation. The authors demonstrate significant improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. The MoE layer consists of up to thousands of feed-forward sub-networks (the experts); a trainable gating network determines a sparse combination of these experts to use for each example. The technique is applied to language modeling and machine translation tasks, achieving significantly better results than state-of-the-art models at lower computational cost.
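
The layer's output can be summarized as y = Σ_i G(x)_i · E_i(x), where G(x) is the sparse gating vector and E_i is the i-th expert. Below is a minimal sketch of that combination step, with assumed shapes and names rather than the paper's implementation; only experts with a nonzero gate value are evaluated, which is how capacity grows without a proportional increase in computation per example.

```python
import numpy as np

def moe_layer(x, gates, experts, d_out):
    """x: [batch, d_in]; gates: [batch, n_experts] sparse gate weights;
    experts: list of callables mapping [m, d_in] -> [m, d_out]."""
    y = np.zeros((x.shape[0], d_out))
    for i, expert in enumerate(experts):
        idx = np.nonzero(gates[:, i])[0]   # examples routed to expert i
        if idx.size:                       # experts with no routed examples are never run
            y[idx] += gates[idx, i:i + 1] * expert(x[idx])
    return y
```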

The historical use of the mixture-of-experts approach in research is discussed, with a focus on the potential of using multiple MoEs, each with its own gating network, as parts of a deep model. Additionally, the paper explores various challenges related to modern computing devices, large batch sizes, and network bandwidth, and presents solutions to ensure computational efficiency and balanced expert utilization. Experimental results demonstrate the effectiveness of the technique, with models achieving lower perplexity scores and higher BLEU scores compared to existing methods. Overall, the paper provides insights into the potential of conditional computation and its application in deep neural networks for language modeling and machine translation tasks.

Advancements through the Sparsely-Gated Mixture-of-Experts Layer in Language Modeling and Machine Translation
The paper introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer to tackle challenges in conditional computation specific to language modeling and machine translation tasks. A key result is the demonstration of over 1000x improvements in model capacity while incurring only minor losses in computational efficiency. The paper also reviews the historical use of the mixture-of-experts approach and proposes using multiple MoEs, each with its own gating network, as parts of a deep model, enabling different gating decisions at each position in the text.

Furthermore, the paper discusses the Batchwise Mask, an alternative mask function that ensures each expert receives exactly the same number of examples. It keeps the top m values per expert across the training batch, where m = k|X|/n, so that each example is sent to an average of k experts. For inference, the authors propose a threshold mask, which involves training a vector of per-expert threshold values to approximate the effect of the batchwise mask.
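
The two masks can be sketched as follows (my own naming and shapes, with gate_values being the [batch, n_experts] output of the softmax gate): the batchwise mask selects the top m values per expert column during training, and the threshold mask approximates it at inference time using learned per-expert thresholds.

```python
import numpy as np

def batchwise_mask(gate_values, k):
    """Keep the top m = k * batch_size / n_experts values per expert across the batch."""
    batch_size, n_experts = gate_values.shape
    m = max(1, (k * batch_size) // n_experts)
    mask = np.zeros_like(gate_values)
    for i in range(n_experts):
        kept = np.argsort(gate_values[:, i])[-m:]  # the m examples this expert keeps
        mask[kept, i] = 1.0
    return mask

def threshold_mask(gate_values, thresholds):
    """Inference-time stand-in: keep a value if it exceeds that expert's learned threshold."""
    return (gate_values > thresholds).astype(gate_values.dtype)
```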

Implementation and Evaluation of Attention Mechanism for Conditional Computation
The paper also delves into the attention mechanism described in GNMT and its implementation as a feedforward neural network with a hidden layer of size n. For performance reasons, the authors used a slightly different attention function in their models, allowing for the simultaneous computation of the attention function on multiple source time steps and multiple target time steps using optimized matrix multiplications. They found little difference in quality between the two functions.
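
My reading of the paper's appendix is that the modified function has the multiplicative form A(x_i, y_j) = Σ_d tanh((x_i U)_d) · tanh((y_j W)_d), so the tanh can be applied to each sequence once and all pairwise scores reduce to a single matrix multiplication. The sketch below assumes that form, with shapes and names of my own choosing.

```python
import numpy as np

def attention_scores(x, y, U, W):
    """x: [src_len, d_x] source vectors, y: [tgt_len, d_y] target vectors,
    U: [d_x, n] and W: [d_y, n] trainable projections. Returns scores of shape [tgt_len, src_len]."""
    xu = np.tanh(x @ U)  # project and squash all source time steps at once: [src_len, n]
    yw = np.tanh(y @ W)  # project and squash all target time steps at once: [tgt_len, n]
    return yw @ xu.T     # inner products over the hidden dimension give every (target, source) score
```

By contrast, the additive GNMT-style form applies the tanh to the sum of the two projections for every (source, target) pair, which cannot be folded into a single matrix product in the same way.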

Overall, the paper presents an innovative technique to address challenges in conditional computation, specifically in the context of language modeling and machine translation tasks.

Reference: https://arxiv.org/abs/1701.06538