Recent developments in speech signal processing, such as speech recognition and speaker diarization, have inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for the deployment of speech technologies. Speaker diarization and multi-speaker automatic speech recognition (ASR) in meeting scenarios have attracted much attention recently. However, the lack of large public meeting data has been a major obstacle to the advancement of the field. Therefore, we make available the AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as well as near-field data collected by headset microphones. Each meeting session is composed of 2-4 speakers with a different speaker overlap ratio, recorded in meeting rooms of different sizes. Along with the dataset, we launch the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field. In this paper we provide a detailed introduction of the AliMeeting dataset, challenge rules, evaluation methods and baseline systems.

M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge
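
The abstract above mentions that sessions differ in their speaker overlap ratio. As a rough illustration only, the sketch below computes one plausible version of that quantity: the fraction of total speech time during which two or more speakers are active. This definition, the function name `overlap_ratio`, and the `(start, end)` segment format are assumptions made for the example and are not taken from the paper; it also assumes that turns from the same speaker do not overlap each other.

```python
def overlap_ratio(segments):
    """Fraction of speech time with two or more simultaneously active turns.

    segments: list of (start, end) times in seconds, one entry per speaker turn.
    Assumed definition (not from the paper): overlapped time / total speech time.
    """
    events = []
    for start, end in segments:
        events.append((start, 1))   # a turn begins
        events.append((end, -1))    # a turn ends
    # At equal timestamps, -1 sorts before +1, so turns that merely touch
    # are not counted as overlapping.
    events.sort()

    active = 0          # number of turns active since the previous event
    prev_time = None
    speech_time = 0.0
    overlap_time = 0.0
    for time, delta in events:
        if prev_time is not None and active > 0:
            duration = time - prev_time
            speech_time += duration
            if active > 1:
                overlap_time += duration
        active += delta
        prev_time = time
    return overlap_time / speech_time if speech_time > 0 else 0.0
```

For two turns covering 0-10 s and 5-15 s, 5 s of the 15 s of speech are overlapped, so the function returns one third.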

The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and most challenging scenarios for speech technologies. The M2MeT challenge has two tracks: speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real recorded Mandarin meeting speech data with manual annotation, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. We briefly describe the released dataset, track setups and baselines, and summarize the challenge results and the major techniques used in the submissions.

Summary on the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

In this paper we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting. We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks: (1) to segment and separate overlapping speech from two speakers; (2) to estimate the number of speakers when participants may enter or leave the conversation at any time; (3) to provide accurate speaker identification on short text-independent utterances; (4) to track speakers' movement during the conversation; (5) to detect speaker changes in real time. First, a differential directional microphone array-based approach is exploited to capture the target speakers' voices in adverse far-field environments. Second, an online speaker-location joint clustering approach is proposed to keep track of speaker location. Third, an instant speaker number detector is developed to trigger the mechanism that separates overlapped speech. The results suggest that our system effectively incorporates spatial information and achieves significant gains.

A Real-Time Speaker Diarization System Based on Spatial Spectrum

The traditional acoustic-phonetic approach makes use of both spectral and phonetic information when comparing the voices of speakers. While phonetic units are not equally informative, the phonetic context of speech plays an important role in speaker verification (SV). In this paper, we propose a neural acoustic-phonetic approach that learns to dynamically assign differentiated weights to spectral features for SV. Such differentiated weights form a phonetic attention mask (PAM). The neural acoustic-phonetic framework consists of two training pipelines, one for SV and another for speech recognition. Through the PAM, we leverage the phonetic information for SV. We evaluate the proposed neural acoustic-phonetic framework on the RSR2015 database Part III corpus, which consists of random digit strings. We show that the proposed framework with PAM consistently outperforms the baseline, with equal error rate reductions of 13.45% and 10.20% for female and male data, respectively.

Neural Acoustic-Phonetic Approach for Speaker Verification With Phonetic Attention Mask

Recent developments in speech signal processing, such as speech recognition and speaker diarization, have inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for speech technologies. Speaker diarization and multi-speaker automatic speech recognition (ASR) in meeting scenarios have attracted increasing attention. However, the lack of large public real meeting data has been a major obstacle to the advancement of the field. Therefore, we release the AliMeeting corpus, which consists of 120 hours of real recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. Moreover, we will launch the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) as an ICASSP 2022 Signal Processing Grand Challenge. The challenge consists of two tracks, namely speaker diarization and multi-speaker ASR. In this paper we provide a detailed introduction of the dataset, rules, evaluation methods and baseline systems, aiming to further promote reproducible research in this field.

M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

In this paper we propose an end-to-end phonetically-aware coupled network for short duration speaker verification tasks. Phonetic information is shown to be beneficial for identifying short utterances. A coupled network structure is proposed to exploit phonetic information. The coupled convolutional layers allow the network to provide frame-level supervision based on phonetic representations of the corresponding frames. The end-to-end training scheme, which uses a triplet loss function, provides direct comparison of speech content between two utterances and hence enables phonetic-based normalization. Our systems are compared against the current mainstream speaker verification systems on both the NIST SRE and VoxCeleb evaluation datasets. Relative reductions of up to 34% in equal error rate are reported.

Phonetically-Aware Coupled Network For Short Duration Text-Independent Speaker Verification
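
The abstract above names a triplet loss as the training objective. As a minimal sketch of the general technique (not the authors' implementation), the snippet below computes a margin-based triplet loss on plain Python lists standing in for utterance embeddings. The cosine distance, the margin value of 0.2, and the function names are assumptions chosen for illustration; the paper's actual distance measure and margin are not stated here.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive are embeddings from the same speaker,
    negative is an embedding from a different speaker.
    The margin of 0.2 is an illustrative assumption.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the same-speaker pair is closer than the different-speaker pair by at least the margin, so training pushes embeddings of the same speaker together and those of different speakers apart.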

Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings

We propose BeamTransformer, an efficient architecture that leverages the beamformer's edge in spatial filtering and the transformer's capability in context sequence modeling. BeamTransformer seeks to optimize the modeling of sequential relationships among signals from different spatial directions. Overlapping speech detection is one of the tasks where such optimization is favorable. In this paper we apply BeamTransformer to detect overlapping segments. Compared with a single-channel approach, BeamTransformer excels at learning to identify the relationship among different beam sequences and is hence able to make predictions not only from the acoustic signals but also from the localization of the source. The results indicate that a successful incorporation of microphone array signals can lead to remarkable gains. Moreover, BeamTransformer goes one step further, as speech from overlapped speakers has been internally separated into different beams.

BeamTransformer: Microphone Array-based Overlapping Speech Detection

Performance degradation caused by noise has been a long-standing challenge for speaker verification. Previous methods usually involve applying a denoising transformation to speaker embeddings or enhancing input features. Nevertheless, these methods are lossy and inefficient for speaker embedding. In this paper, we propose context-aware masking (CAM), a novel method to extract robust speaker embeddings. CAM enables the speaker embedding network to "focus" on the speaker of interest and "blur" unrelated noise. The masking threshold is dynamically controlled by an auxiliary context embedding that captures speaker and noise characteristics. Moreover, models adopting CAM can be trained in an end-to-end manner without using synthesized noisy-clean speech pairs. Our results show that CAM improves speaker verification performance in the wild by a large margin compared to the baselines.

CAM: Context-Aware Masking for Robust Speaker Verification

Hongbin Suo
