Implicit Memory Transformer for Computationally Efficient Simultaneous Speech Translation

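Questions

1. What is SimulST?

2. What is the Augmented Memory Transformer?

3. What is the Implicit Memory Transformer?

4. What is the complexity of self-attention and convolution subsampling layers in Augmented Memory Transformer?

5. What is the number of sentences in MuST-C dataset for en-de, en-fr, en-es pairs?

6. How does the Implicit Memory Transformer compare to Augmented Memory Transformers?

7. How does left context size affect forward pass time in Augmented Memory Transformer models?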

Accepted Answer

Simultaneous speech translation (SimulST) is the task of producing an output translation concurrently with incoming source speech. For humans, accurate SimulST is extremely difficult and becomes nearly impossible to sustain over long periods of time. Given the potentially broad applications of SimulST in industry and government, there is a strong need for machine learning models that perform the task at a level beyond human capability. One family of models that has been effective for SimulST is transformers (Vaswani et al., 2017) with block processing, which breaks an input sequence into segments that the encoder processes sequentially and individually (Dong et al., 2019).
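The block processing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `split_into_segments` and `encode_stream` are hypothetical, the segments are non-overlapping, and `encode_segment` stands in for whatever encoder is actually used.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_into_segments(frames: Sequence[T], segment_size: int) -> List[List[T]]:
    """Break an input sequence into consecutive, non-overlapping segments."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        list(frames[start:start + segment_size])
        for start in range(0, len(frames), segment_size)
    ]


def encode_stream(
    frames: Sequence[T],
    segment_size: int,
    encode_segment: Callable[[List[T]], R],
) -> List[R]:
    """Encode segments one at a time, in arrival order (hypothetical encoder)."""
    return [encode_segment(seg) for seg in split_into_segments(frames, segment_size)]


# Ten input frames split into segments of four; the last segment is shorter.
segments = split_into_segments(list(range(10)), 4)
# Stand-in "encoder" that just sums each segment.
encoded = encode_stream(list(range(10)), 4, sum)
```

Because each segment is encoded on its own, the encoder can start working before the full utterance has arrived, which is what makes the approach suitable for simultaneous translation.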

Accepted Answer

The Augmented Memory Transformer is a transformer model for SimulST that breaks the input sequence into segments and incorporates memory banks for long-term memory retention, with decoding following a wait-k policy. Its encoder has two subsampling convolution layers, and the attention output of its summarization queries supplies the memory banks. The encoder processes segments individually, and their outputs are concatenated before decoding. Its performance is evaluated with Average Lagging (a latency metric) and BLEU score (a translation quality metric).
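As a rough illustration of the wait-k policy mentioned above, the sketch below generates the read/write schedule in its commonly described form: read k source units first, then alternate between writing one target token and reading one more source unit until the source is exhausted. The function name `wait_k_actions` and the idea of counting the source in whole units (for example, segments) are assumptions made for this example, not details taken from the paper.

```python
from typing import List


def wait_k_actions(k: int, num_source: int, num_target: int) -> List[str]:
    """Return the READ/WRITE schedule of a wait-k policy.

    The decoder stays k source units behind the reader until the source
    runs out, after which it writes the remaining target tokens.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    actions: List[str] = []
    read, written = 0, 0
    while written < num_target:
        # Keep reading while fewer than (written + k) source units are available.
        if read < min(written + k, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


schedule = wait_k_actions(k=2, num_source=4, num_target=4)
# READ, READ, WRITE, READ, WRITE, READ, WRITE, WRITE
```

A larger k gives the decoder more source context before each output token, which tends to raise both translation quality and latency.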

Accepted Answer

The Implicit Memory Transformer uses a new left context generation method to retain implicit memory of previous segments, removing the need for expensive explicit memory banks. Each encoder layer gets its own implicit memory left context, composed of a portion of the self-attention output for the previous segment's center context. This approach captures the benefits of implicit memory without the additional cost of computing memory banks. The Implicit Memory Transformer also removes the left context from the queries, making the self-attention calculation more efficient than in the Augmented Memory Transformer. Because left context tokens no longer need to be processed, computation is also reduced in the feed-forward networks and the convolution subsampling layers.
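The left context mechanism described above can be sketched for a single encoder layer. This is a simplified illustration, not the authors' implementation: it uses single-head attention with no learned projections, and the names `attention`, `encode_segments`, `center_size`, and `left_size` are assumptions introduced here. Each segment is assumed to hold its center context followed by any right context.

```python
import math
from typing import List

Vector = List[float]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs


def encode_segments(
    segments: List[List[Vector]], center_size: int, left_size: int
) -> List[List[Vector]]:
    """Run one attention layer over segments with an implicit memory left context."""
    left_context: List[Vector] = []
    outputs = []
    for segment in segments:
        # Keys and values include the left context; queries do not.
        keys = left_context + segment
        out = attention(segment, keys, keys)
        outputs.append(out)
        # Next left context: a portion of this segment's center-context output.
        center_out = out[:center_size]
        left_context = center_out[-left_size:] if left_size > 0 else []
    return outputs


segments = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.0]]]
with_left = encode_segments(segments, center_size=2, left_size=1)
without_left = encode_segments(segments, center_size=2, left_size=0)
```

Since the left context is built from attention output that already mixes in earlier left context, information from older segments can persist without any separate memory bank computation. In a multi-layer encoder, each layer would keep its own `left_context`.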

Accepted Answer

In general, a self-attention layer has complexity $O(n^2 \cdot d)$ and a convolution layer has complexity $O(K \cdot n \cdot d^2)$, where $n$ is the input sequence length, $d$ is the hidden size, and $K$ is the kernel size. In the original Augmented Memory Transformer, the self-attention layer has complexity $O((N + l + c + r)(l + c + r) \cdot d)$, where $N$ is the number of memory banks and $l$, $c$, and $r$ are the left, center, and right context sizes. With the new method of calculating the left context, this becomes $O((c + r)(l + c + r) \cdot d)$. Similarly, the complexity of the convolution layers drops from $O(K \cdot (l + c + r) \cdot d^2)$ to $O(K \cdot (c + r) \cdot d^2)$. This decrease in computational complexity with respect to the left context size and memory banks suggests that the left context size could be increased for better translation performance.
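To make the comparison concrete, the expressions above can be evaluated directly. The sketch below computes only the leading-order terms inside the big-O bounds, ignoring constants, so the results are relative operation counts rather than measured costs. The function names and the example sizes are hypothetical.

```python
def attention_cost_augmented(N: int, l: int, c: int, r: int, d: int) -> int:
    """Leading term for Augmented Memory Transformer self-attention."""
    return (N + l + c + r) * (l + c + r) * d


def attention_cost_implicit(l: int, c: int, r: int, d: int) -> int:
    """Leading term once memory banks and left-context queries are removed."""
    return (c + r) * (l + c + r) * d


def conv_cost_augmented(K: int, l: int, c: int, r: int, d: int) -> int:
    """Leading term for convolution over left, center, and right context."""
    return K * (l + c + r) * d * d


def conv_cost_implicit(K: int, c: int, r: int, d: int) -> int:
    """Leading term for convolution over center and right context only."""
    return K * (c + r) * d * d


# Hypothetical sizes chosen only for illustration.
N, l, c, r, d, K = 1, 1, 2, 1, 2, 3
attn_old = attention_cost_augmented(N, l, c, r, d)  # (1+1+2+1) * (1+2+1) * 2 = 40
attn_new = attention_cost_implicit(l, c, r, d)      # (2+1) * (1+2+1) * 2 = 24
conv_old = conv_cost_augmented(K, l, c, r, d)       # 3 * (1+2+1) * 4 = 48
conv_new = conv_cost_implicit(K, c, r, d)           # 3 * (2+1) * 4 = 36
```

In these expressions the attention term for the original model grows quadratically in $l$, while the reduced term grows only linearly in $l$, and the reduced convolution term does not depend on $l$ at all.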

Accepted Answer

The number of sentences in the train, development, and test sets of the MuST-C dataset for each language pair is as follows: en-de: 250,942; en-fr: 275,085; en-es: 265,625 (Cattoni et al., 2021). The dataset is used for training and evaluating translation models for each language pair.

Accepted Answer

For SimulST between English and German, the Implicit Memory Transformer achieves a nearly identical BLEU score to the Augmented Memory Transformer with memory banks, without affecting Average Lagging. Removing memory banks from the Augmented Memory Transformer results in an average decrease of 4.48 BLEU across all wait-k values. Similar results are observed for the English-French and English-Spanish language pairs, where the Implicit Memory Transformer again performs nearly identically to the Augmented Memory Transformer with memory banks. Removing memory banks from the Augmented Memory Transformer leads to average decreases of 6.23 BLEU and 4.47 BLEU for English-French and English-Spanish, respectively. These findings confirm the effectiveness of the attention-generated left context in the Implicit Memory Transformer, which maintains its performance without memory banks.

Accepted Answer

The left context size significantly impacts the forward pass time of Augmented Memory Transformer models. Figure 4 shows that the two Augmented Memory Transformer models exhibit a nonlinear relationship between left context size and forward pass time. As the left context size increases, the separation between the two Augmented Memory Transformer curves becomes more apparent, indicating that the cost of memory banks becomes more noticeable. Larger left context sizes therefore result in increased computational costs for these models. The Implicit Memory Transformer, by contrast, shows a flat relationship between left context size and forward pass time, indicating that it does not incur the same increase in computational cost. Figure 4 also demonstrates that removing the left context from the query in the proposed Implicit Memory Transformer yields a considerable reduction in computation beyond that gained by removing memory banks.
