1. What is SimulST?
Simultaneous speech translation (SimulST) is the task of producing an output translation concurrently with incoming source speech. For humans, accurate SimulST is extremely difficult and becomes nearly impossible to sustain over long periods. Given its potentially broad applications in industry and government, there is a strong need for machine learning models that can perform the task beyond human capability. One family of models that has been effective for SimulST is transformers (Vaswani et al., 2017) with block processing, which breaks an input sequence into segments that the encoder processes sequentially and individually (Dong et al., 2019).
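As a rough illustration of block processing, the sketch below splits a sequence of input frames into fixed-size segments that an encoder could consume one at a time. The function name `split_into_segments` and the `segment_size` parameter are hypothetical, chosen only for this example; real systems typically also attach surrounding context to each segment.

```python
def split_into_segments(frames, segment_size):
    """Split a sequence of frames into consecutive, non-overlapping segments.

    The last segment may be shorter than segment_size.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        list(frames[start:start + segment_size])
        for start in range(0, len(frames), segment_size)
    ]


# Each segment would be passed to the encoder in order, one at a time.
for segment in split_into_segments(range(7), 3):
    print(segment)
```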
2. What is the Augmented Memory Transformer?
The Augmented Memory Transformer is a transformer model that uses a wait-k policy. It breaks input sequences into segments and incorporates memory banks for long-term memory retention. Its encoder has two subsampling convolution layers and uses the attention output for summarization queries. The model processes segments individually and concatenates their outputs before decoding. Its performance is evaluated with Average Lagging, a latency metric, and BLEU score, a translation quality metric.
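The wait-k policy mentioned above can be sketched as a read/write schedule. Under the usual formulation, the model reads k source units before writing the first target token, then alternates one read with one write until the source is exhausted. The function name and the choice of "source unit" (here, an abstract step such as one encoder segment) are assumptions made for illustration.

```python
def wait_k_schedule(num_source, num_target, k):
    """Return the list of READ/WRITE actions under a wait-k policy.

    Before writing target token t (1-indexed), the model must have read
    min(k + t - 1, num_source) source units.
    """
    actions = []
    units_read = 0
    for t in range(1, num_target + 1):
        needed = min(k + t - 1, num_source)
        while units_read < needed:
            actions.append("READ")
            units_read += 1
        actions.append("WRITE")
    return actions


# With 4 source units, 3 target tokens, and k = 2, the model first
# reads two units, then alternates writes and reads.
print(wait_k_schedule(num_source=4, num_target=3, k=2))
```

A larger k lets the model see more source context before committing to each output, which generally trades higher latency for better quality.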
3. What is the Implicit Memory Transformer?
The Implicit Memory Transformer uses a new left context generation method to retain implicit memory of previous segments, removing the need for expensive explicit memory banks. At each encoder layer, it builds a separate implicit memory left context from a portion of the self-attention output for the previous segment's center context. This captures the benefits of implicit memory without the additional cost of computing memory banks. The Implicit Memory Transformer also removes the left context from the queries, which makes its self-attention calculation more efficient than that of the Augmented Memory Transformer. Because left-context tokens no longer need to be processed, the feed-forward networks and convolution subsampling layers also become cheaper.
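The idea can be sketched for a single encoder layer in plain Python. This is a simplified illustration, not the authors' implementation: it uses one attention head with no learned projections, omits any right context, and assumes that the "portion" carried forward is the last `left_size` output vectors of the previous segment. The names `attention`, `encode_segments`, and `left_size` are hypothetical.

```python
import math


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs


def encode_segments(segments, left_size):
    """Process segments in order, carrying an implicit-memory left context."""
    left_context = []  # the first segment has no left context
    all_outputs = []
    for center in segments:
        # The left context appears only in the keys and values, never in
        # the queries, so no output is computed for left-context positions.
        keys = left_context + center
        out = attention(center, keys, keys)
        # Assumption: keep the last left_size attention outputs of this
        # segment as the left context for the next segment.
        left_context = out[max(0, len(out) - left_size):]
        all_outputs.extend(out)
    return all_outputs


segments = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
outputs = encode_segments(segments, left_size=1)
print(len(outputs))  # one output vector per center-context token
```

Because queries cover only the current segment, the number of attention rows computed per segment does not grow with the left context size.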
4. What is the complexity of the self-attention and convolution subsampling layers in the Augmented Memory Transformer?
For an input sequence of length $n$, hidden size $d$, and kernel size $K$, a self-attention layer has complexity $O(n^2 \cdot d)$ and a convolution layer has complexity $O(K \cdot n \cdot d^2)$. In the original Augmented Memory Transformer, with $N$ memory banks and left, center, and right contexts of sizes $l$, $c$, and $r$, the self-attention layer costs $O((N + l + c + r)(l + c + r) \cdot d)$. With the new method of calculating left context, this drops to $O((c + r)(l + c + r) \cdot d)$. Similarly, the cost of the convolution layers drops from $O(K \cdot (l + c + r) \cdot d^2)$ to $O(K \cdot (c + r) \cdot d^2)$. Because the cost of every layer decreases with respect to the left context size and the memory banks, the left context size could potentially be increased for better translation performance.
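To make the comparison concrete, the following sketch evaluates the four cost expressions above with constant factors dropped. The function name `layer_costs` and the example sizes are hypothetical and chosen only for illustration; the results are operation-count proxies, not measured runtimes.

```python
def layer_costs(N, l, c, r, d, K):
    """Evaluate the per-segment cost expressions, ignoring constant factors.

    N: number of memory banks; l, c, r: left, center, right context sizes;
    d: hidden size; K: convolution kernel size.
    """
    return {
        "attention_old": (N + l + c + r) * (l + c + r) * d,
        "attention_new": (c + r) * (l + c + r) * d,
        "conv_old": K * (l + c + r) * d ** 2,
        "conv_new": K * (c + r) * d ** 2,
    }


# Hypothetical sizes, for illustration only.
costs = layer_costs(N=3, l=32, c=64, r=32, d=256, K=3)
print("attention ratio (new / old):", costs["attention_new"] / costs["attention_old"])
print("convolution ratio (new / old):", costs["conv_new"] / costs["conv_old"])
```

In these expressions the convolution cost no longer depends on l at all, and the self-attention cost grows with l only through the key dimension.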