1. What have the authors contributed in "Attention-driven multi-sensor selection"?
The use of attention for sensor selection in multi-sensor setups, and the benefit of such an attention mechanism, are less studied. This work reports on a sensor transformation attention network (STAN) that embeds a sensory attention mechanism to dynamically weigh and combine individual input sensors based on their task-relevant information. The authors demonstrate that the attentional signal correlates with the changing noise level of each sensor, both on the audio-visual GRID dataset with synthetic noise and on CHiME-4, a multi-microphone real-world noisy dataset. In addition, the authors demonstrate that the STAN model is able to deal with sensor removal and addition without retraining, and is invariant to channel order.
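To make "dynamically weigh and combine" concrete, here is a minimal sketch of an attention-weighted merge. It assumes that each sensor already has a feature vector and a scalar relevance score; the names `softmax`, `attention_merge`, `features`, and `scores` are illustrative and not taken from the paper, and the real STAN layers are learned networks rather than fixed numbers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_merge(features, scores):
    """Combine per-sensor feature vectors with softmax attention weights.

    features: list of N equal-length feature vectors (one per sensor)
    scores:   list of N scalar relevance scores (one per sensor)
    """
    weights = softmax(scores)
    dim = len(features[0])
    merged = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return merged, weights

# Hypothetical values: sensor 0 is clean (high score), sensor 1 is noisy.
features = [[1.0, 2.0], [5.0, -3.0]]
scores = [2.0, -1.0]
merged, weights = attention_merge(features, scores)
```

Because the weights are normalized over however many sensors are present, the same function works when a sensor is dropped (`attention_merge(features[:1], scores[:1])`), and the merged vector does not depend on the order in which sensors are listed, up to floating-point rounding.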
2. How long does it take to generate the enhanced output from the five input channels?
To generate the enhanced output from the five input channels on a sample of average length 6 s, the beamforming algorithm takes 3554 ms (CPU), while the attention mechanism of STAN-5CH takes only 195 ms (CPU) or 25 ms (GPU), i.e. 25x to 142x faster (Skylake Xeon CPU with 4.3 GHz, GTX 1080 GPU).
3. What is the purpose of the random walk noise model?
The random walk noise model adds noise with a time-varying noise level, σ(k), to each sensor and is used for both training and testing.
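A time-varying noise level of this kind can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian step distribution, the reflecting bounds `[0, sigma_max]`, and all names and default values (`reflect`, `random_walk_noise_level`, `add_noise`, `step_std`) are hypothetical choices, not the paper's exact formulation.

```python
import random

def reflect(x, lo, hi):
    """Reflect x back into the interval [lo, hi] (requires hi > lo)."""
    while x < lo or x > hi:
        if x < lo:
            x = 2 * lo - x
        if x > hi:
            x = 2 * hi - x
    return x

def random_walk_noise_level(num_steps, sigma_max=3.0, step_std=0.1, seed=None):
    """Return a list sigma(0..num_steps-1) that performs a bounded random walk."""
    rng = random.Random(seed)
    sigma = rng.uniform(0.0, sigma_max)  # assumed: random starting level
    levels = []
    for _ in range(num_steps):
        levels.append(sigma)
        # Assumed Gaussian step; the walk is reflected at both bounds.
        sigma = reflect(sigma + rng.gauss(0.0, step_std), 0.0, sigma_max)
    return levels

def add_noise(signal, levels, seed=None):
    """Add zero-mean Gaussian noise with standard deviation sigma(k) to sample k."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, s) for x, s in zip(signal, levels)]

# One independent noise-level trajectory per sensor (two hypothetical sensors).
signal = [0.0] * 100
sensor_levels = [random_walk_noise_level(len(signal), seed=i) for i in range(2)]
noisy = [add_noise(signal, lv, seed=10 + i) for i, lv in enumerate(sensor_levels)]
```

Drawing a separate trajectory per sensor means the relative quality of the sensors changes over time, which is the situation in which an attention signal can be compared against the noise level.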
4. What is the scoring function for each channel?
Because the input channels are of the same modality, the authors apply the same scoring function $Z$ to each channel $i$, therefore $\theta_{Z_1} = \dots = \theta_{Z_N}$.
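Parameter sharing across channels can be illustrated with a short sketch. The linear scoring function here is a stand-in chosen for brevity (an assumption, not the paper's network), and `make_scoring_function`, `channels`, and the numeric values are hypothetical.

```python
def make_scoring_function(weights, bias):
    """Build one scoring function Z from a single parameter set (weights, bias)."""
    def score(feature_vector):
        return sum(w * x for w, x in zip(weights, feature_vector)) + bias
    return score

# A single parameter set is reused for every channel.
score = make_scoring_function([0.5, -0.25], 0.1)

channels = [[1.0, 2.0], [0.0, 4.0], [3.0, 1.0]]
scores = [score(c) for c in channels]

# Reordering the channels only reorders the scores; no per-channel
# parameters exist that could tie a score to a channel position.
reordered_scores = [score(c) for c in reversed(channels)]
```

Since the function carries no channel-specific parameters, it can be applied to any number of channels in any order, which is consistent with the channel-order invariance and the sensor addition and removal described for STAN.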
![TABLE I: Results of the GRID experiments, averaged over 10 runs. All values are reported in the format mean ± standard deviation. The ATTCORR values are not computed for the hi-lo noise because the correlation function is not defined for constant functions. The lowest WER is printed bold. The ATTCORR and ATTACC values are rescaled to the range [−100, 100] in the interest of readability.](/figures/table-i-results-of-the-grid-experiments-averaged-over-10-6w09583x.png)



![TABLE II: Results for the CHiME-4 multi-channel ASR experiments. The CER [%] is given for the et05_real and dt05_real subsets. The attention weights for STAN-2CH and STAN-5CH are averaged over all frames of the dt05_real subset. The lowest CER and highest attention weight are printed bold. All models are trained and tested on matched channel configurations, and the CONCAT, AVG and STAN-2CH models are additionally tested on new channel configurations without re-training.](/figures/table-ii-results-for-the-chime-4-multi-channel-asr-v4akbapr.png)