1. What have the authors contributed in "CountNet: estimating the number of concurrent speakers using supervised learning"?
The authors propose a unifying probabilistic paradigm, where deep neural network architectures are used to infer output posterior distributions. Designing such architectures often involves two important and complementary aspects that the authors investigate and discuss. First, the authors study how recent advances in deep architectures may be exploited for the task of speaker count estimation. In particular, the authors show that convolutional recurrent neural networks outperform the recurrent networks used in a previous study when adequate input features are used. Second, through comprehensive evaluation, the authors compare the best-performing method to several baselines and assess the influence of gain variations, different datasets, and reverberation. Finally, the authors give insights into the strategy used by their proposed method. Even for short segments of speech mixtures, the authors can estimate up to five speakers, with a significantly lower error than other methods.
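As a rough illustration of inferring a count from an output posterior distribution, the sketch below treats the speaker count as a class label and picks the most probable class. The six classes (0 to 5 speakers), the example scores, and the function names are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import numpy as np

def softmax(logits):
    """Convert raw network scores into a posterior distribution."""
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def estimate_count(logits):
    """Return the most probable count and the full posterior.

    Assumes index i of `logits` corresponds to i concurrent speakers.
    """
    posterior = softmax(np.asarray(logits, dtype=np.float64))
    return int(np.argmax(posterior)), posterior

# Hypothetical network output for one short audio segment (classes 0..5).
logits = [0.1, 0.3, 1.2, 2.5, 0.7, -0.4]
k_hat, posterior = estimate_count(logits)
print(k_hat, posterior.round(3))
```

Taking the arg max is only one possible decision rule; other point estimates can be derived from the same posterior.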
2. What are the future works in "CountNet: estimating the number of concurrent speakers using supervised learning"?
The authors hope their research stimulates future research on data-driven count estimation, a task that currently lacks real-world datasets. Finally, to underpin this hypothesis, the authors showed that the speaking rate has a significant effect on the error of their model.
3. What is the first natural idea to imitate human performance?
Since humans do have two ears that provide spatial diversity, a first natural idea to imitate human performance is to exploit binaural information to proceed to source count estimation.
4. What is the reason to stack a CNN layer?
As the output of a CNN layer is a 3D volume D × F × C and the input of a recurrent layer only takes a 2D sequence, the dimension would need to be reduced.
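One common way to perform this reduction is to merge the frequency and channel axes so that each of the D time steps becomes a single feature vector of length F · C. The sketch below shows that reshape with NumPy; the concrete sizes and the helper name `flatten_for_rnn` are hypothetical, and the paper may use a different reduction.

```python
import numpy as np

def flatten_for_rnn(volume):
    """Reshape a (D, F, C) CNN output into a (D, F*C) sequence.

    D is the number of time steps, F the frequency bins, C the channels.
    Each time step keeps all of its F*C values, only the layout changes.
    """
    d, f, c = volume.shape
    return volume.reshape(d, f * c)

# Illustrative sizes only (assumed, not from the paper).
D, F, C = 4, 3, 2
cnn_out = np.arange(D * F * C, dtype=np.float32).reshape(D, F, C)

seq = flatten_for_rnn(cnn_out)
print(seq.shape)  # 2D sequence: one row per time step
```

A reshape discards no information, unlike pooling over F or C, which is why it is a natural first choice when the recurrent layer can handle the wider feature dimension.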

![Table 4: Averaged MAE results of different methods on several datasets for k = [0 ... 10] with equal power and random gains (up to ±6 dB) as well as reverberation. Bold face indicates the best-performing method.](/figures/table-4-averaged-mae-results-of-different-methods-on-several-2a4w75nk.png)



![Figure 7: Illustration of intermediate outputs from the proposed CRNN for each convolutional layer for a given input with k = 3 speakers. Saliency map shows positive saliency of guided backpropagation [74]. For each convolutional layer the nine most relevant filters were selected based on their loss with respect to the input.](/figures/figure-7-illustration-of-intermediate-outputs-from-the-3g6eb32x.png)