TL;DR: An analysis of the performance of the best pitch detection algorithm with respect to estimated signal-to-noise ratio (SNR) shows that very similar performance is observed on the real noisy data recorded in public places, and on the clean data with addition of babble noise.
Abstract: This paper analyses the performance of a large set of pitch detection algorithms on clean and noisy speech data. Two sets of noisy speech data are considered. The first corresponds to simulated noisy data, and is obtained by adding several types of noise signals at various levels to the clean speech data of the Pitch-Tracking Database from Graz University of Technology (PTDB-TUG). The second one, SPEECON, was recorded in several different acoustic environments. The paper discusses the performance of pitch detection algorithms on the simulated noisy data, and on the real noisy data of the SPEECON corpus. Also, an analysis of the performance of the best pitch detection algorithm with respect to estimated signal-to-noise ratio (SNR) shows that very similar performance is observed on the real noisy data recorded in public places, and on the clean data with addition of babble noise.
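The simulated noisy data described above depends on adding a noise signal to clean speech at a chosen level. The following is a minimal sketch of that mixing step, assuming both signals are plain lists of samples of equal length and that SNR is defined as the ratio of mean signal power to mean noise power. The function names `power` and `mix_at_snr` are hypothetical and do not come from the paper.

```python
import math


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(clean, noise, snr_db):
    """Return clean + gain * noise so that the mixture has the requested SNR (dB).

    Assumption: SNR = 10 * log10(P_clean / P_scaled_noise).
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = power(clean)
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    # Solve 10*log10(p_clean / (gain^2 * p_noise)) = snr_db for the gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]
```

Subtracting the clean signal from the mixture and recomputing the power ratio should give back the requested SNR, which is a convenient sanity check before running any pitch tracker on the result.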
TL;DR: Both objective and subjective evaluation results show that the quality of speech synthesized with the proposed pitch estimation method is much better compared with HMM-based speech synthesis systems developed using the state-of-the-art pitch extraction methods.
Abstract: This letter proposes an efficient method for extracting pitch from speech signals for the hidden Markov model (HMM)-based speech synthesis system (HTS). In the proposed method, voicing detection and pitch estimation are performed using the mean signal obtained from continuous wavelet transform coefficients. The proposed pitch extraction method is integrated in the HMM-based speech synthesis system. The performance of the proposed method is evaluated on the CMU Arctic and Keele databases. Both objective and subjective evaluation results show that the quality of speech synthesized with the proposed pitch estimation method is much better compared with HMM-based speech synthesis systems developed using the state-of-the-art pitch extraction methods, namely, robust algorithm for pitch tracking and speech transformation and representation using adaptive interpolation of weighted spectrum employed in the HTS.
TL;DR: This study investigates the use of robust harmonic features for classification-based pitch estimation using a neural network for modeling the relationship between input harmonic features and output pitch salience for each specific pitch candidate.
Abstract: Pitch estimation in diverse naturalistic audio streams remains a challenge for speech processing and spoken language technology. In this study, we investigate the use of robust harmonic features for classification-based pitch estimation. The proposed pitch estimation algorithm is composed of two stages: pitch candidate generation and target pitch selection. Based on energy intensity and spectral envelope shape, five types of robust harmonic features are proposed to reflect pitch associated harmonic structure. A neural network is adopted for modeling the relationship between input harmonic features and output pitch salience for each specific pitch candidate. In the test stage, each pitch candidate is assessed with an output salience that indicates the potential as a true pitch value, based on its input feature vector processed through the neural network. Finally, according to the temporal continuity of pitch values, pitch contour tracking is performed using a hidden Markov model (HMM), and the Viterbi algorithm is used for HMM decoding. Experimental results show that the proposed algorithm outperforms several state-of-the-art pitch estimation methods in terms of accuracy in both high and low levels of additive noise.
TL;DR: This work focuses on four-part compositions, and evaluations on recordings of Bach Chorales and Barbershop quartets show that the integrated approach achieves an F-measure of over 70% for frame-based multipitch detection and over 45% for four-voice assignment.
Abstract: This paper presents a multi-pitch detection and voice assignment method applied to audio recordings containing a cappella performances with multiple singers. A novel approach combining an acoustic model for multi-pitch detection and a music language model for voice separation and assignment is proposed. The acoustic model is a spectrogram factorization process based on Probabilistic Latent Component Analysis (PLCA), driven by a 6-dimensional dictionary with pre-learned spectral templates. The voice separation component is based on hidden Markov models that use musicological assumptions. By integrating the models, the system can detect multiple concurrent pitches in vocal music and assign each detected pitch to a specific voice corresponding to a voice type such as soprano, alto, tenor or bass (SATB). This work focuses on four-part compositions, and evaluations on recordings of Bach Chorales and Barbershop quartets show that our integrated approach achieves an F-measure of over 70% for frame-based multipitch detection and over 45% for four-voice assignment.
TL;DR: In this article, a method that estimates the fundamental frequency in a real noisy environment when many persons speak at the same time and considers the case of two speakers is presented. But the method is not suitable for the case where multiple speakers are present.
TL;DR: It is demonstrated that rather than encouraging dependency between activations, what is relevant for improving pitch detection is to learn priors that fit the frequency content of the sound events to detect.
Abstract: Automatic music transcription (AMT) aims to infer a latent symbolic representation of a piece of music (piano-roll), given a corresponding observed audio recording. Transcribing polyphonic music (when multiple notes are played simultaneously) is a challenging problem, due to highly structured overlapping between harmonics. We study whether the introduction of physically inspired Gaussian process (GP) priors into audio content analysis models improves the extraction of patterns required for AMT. Audio signals are described as a linear combination of sources. Each source is decomposed into the product of an amplitude envelope and a quasi-periodic component process. We introduce the Matérn spectral mixture (MSM) kernel for describing the frequency content of single notes. We consider two different regression approaches. In the sigmoid model, every pitch activation is independently non-linearly transformed. In the softmax model, several activation GPs are jointly non-linearly transformed. This introduces cross-correlation between activations. We use variational Bayes for approximate inference. We empirically evaluate how these models work in practice when transcribing polyphonic music. We demonstrate that rather than encouraging dependency between activations, what is relevant for improving pitch detection is to learn priors that fit the frequency content of the sound events to detect.
TL;DR: Speech parameterization results are used to segment the speech signal and to isolate the segments with stable spectral characteristics and can be used to generate a digital voice pattern of a person or be applied in the automatic speech recognition.
Abstract: Parameterization of the speech signal using the algorithms of analysis synchronized with the pitch frequency is discussed. Speech parameterization is performed by the average number of zero transitions function and the signal energy function. Parameterization results are used to segment the speech signal and to isolate the segments with stable spectral characteristics. Segmentation results can be used to generate a digital voice pattern of a person or be applied in the automatic speech recognition. Stages needed for continuous speech segmentation are described.
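The two parameterization functions named above, zero transitions and signal energy, can be illustrated with a short sketch. It assumes fixed-length frames with a fixed hop, which is a simplification: the abstract describes analysis synchronized with the pitch frequency, and that synchronization is not reproduced here. The name `frame_features` is hypothetical.

```python
def frame_features(signal, frame_len, hop):
    """Return a list of (zero_crossings, mean_energy) tuples, one per frame.

    Assumption: fixed frames of `frame_len` samples advanced by `hop` samples;
    a sample counts as non-negative when it is >= 0.
    """
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Count sign changes between consecutive samples within the frame.
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        mean_energy = sum(v * v for v in frame) / frame_len
        feats.append((zero_crossings, mean_energy))
    return feats
```

Frames with many zero transitions and low energy tend to be unvoiced or noise-like, while frames with few transitions and high energy tend to be voiced, which is the kind of contrast a segmentation stage could threshold on.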
TL;DR: An audio mosaicing method that converts Pop songs into a specific music style called “chiptune,” or “8-bit music” is proposed, and it is validated through a subjective listening test that the proposed method creates much better 8-bit music than existing nonnegative matrix factorization based methods.
Abstract: In this paper, we propose an audio mosaicing method that converts Pop songs into a specific music style called “chiptune,” or “8-bit music.” The goal is to reproduce Pop songs by using the sound of the chips on the old game consoles in 1980s/1990s. The proposed method goes through a procedure that first analyzes the pitches of an incoming Pop song in the frequency domain, and then synthesizes the song with template waveforms in the time domain to make it sound like 8-bit music. Because a Pop song is usually composed of the vocal melody and the instrumental accompaniment, in the analysis stage we use a singing voice separation algorithm to separate the vocals from the instruments, and then apply different pitch detection algorithms to transcribe the two separated sources. We validate through a subjective listening test that the proposed method creates much better 8-bit music than existing nonnegative matrix factorization based methods can do. Moreover, we find that synthesis in the time domain is important for this task.
TL;DR: It is elucidated that the pitch average, uniformity, rotation angle, and orthogonal angle can be calculated using the PD method and this has been applied to the pitch evaluation of several 2D gratings and lattices, and the results are compared with the results of using the center-of-gravity and Fourier-transform-based method.
Abstract: We have mathematically explicated and experimentally demonstrated how a correlation and convolution filter can dramatically suppress the noise that coexists with the scanned topographic signals of two-dimensional (2D) gratings and lattices with 2D perspectives. To realize pitch evaluation, the true peaks' coordinates have been precisely acquired after detecting the local maxima from the filtered signal, followed by image processing. The combination of 2D filtering, local-maxima detection, and image processing makes up the pitch detection (PD) method. It is elucidated that the pitch average, uniformity, rotation angle, and orthogonal angle can be calculated using the PD method. This has been applied to the pitch evaluation of several 2D gratings and lattices, and the results are compared with the results of using the center-of-gravity (CG) and Fourier-transform-based (FT) methods. The differences between the pitch averages produced using the PD, CG, and FT methods are within 1.5 pixels. Moreover, the PD method has also been applied to detect the dense peaks of the Si (111) 7×7 surface and the highly oriented pyrolytic graphite (HOPG) basal plane.
TL;DR: A new algorithm based on complementary ensemble empirical mode decomposition is developed and the results confirm the robustness of the algorithm in the presence of frequency modulation of the pitch of speech signals.
Abstract: The problem of increasing the precision with which the pitch frequency of speech signals is measured is considered. Existing algorithms for determining this frequency are presented and a new algorithm based on complementary ensemble empirical mode decomposition is developed. The results of the investigations confirm the robustness of the algorithm in the presence of frequency modulation of the pitch of speech signals.
TL;DR: A novel, noise-robust method for determining speech fundamental frequency and pitch segmentation, based on a short-time energy waveform (SEW), defined as a moving average squared signal is introduced.
Abstract: In general, speech is constituted of quasi-repetitive patterns called pitches representing the speech fundamental period and tonal information of the voice. Extraction of pitch information, which is crucial for many speech processing techniques, usually faces a noise problem and interference caused by high-order harmonic components. This paper introduces a novel, noise-robust method for determining speech fundamental frequency and pitch segmentation, based on a short-time energy waveform (SEW), defined as a moving average squared signal. When applying a moving average filter with a window size close to the fundamental period, nearly repetitive patterns, with fewer ripples, synchronizing with actual pitches can clearly be observed in the SEW. The DC component in the SEW is removed using morphological top-hat and bottom-hat transforms. The fundamental frequency is determined as the frequency corresponding to the largest peak of the power spectrum of the DC-removed SEW. Finally, a time-domain window search is performed to locate local extrema associated with pitches. Compared to traditional pitch detection techniques, the proposed technique yields pitch segmentation results with a higher rate of accuracy and greater noise robustness.
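The SEW is defined above as a moving average of the squared signal. A minimal sketch of only that first step follows, assuming a rectangular window and keeping only positions where the window fits entirely inside the signal. The morphological DC removal, the power-spectrum peak search, and the time-domain window search from the abstract are not shown, and the function name is hypothetical.

```python
def short_time_energy_waveform(signal, window):
    """Moving average of the squared signal over `window` samples.

    Only full windows are used, so the output has
    len(signal) - window + 1 samples.
    """
    if window < 1 or window > len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    squared = [v * v for v in signal]
    # Maintain a running sum so each output sample costs O(1).
    running = sum(squared[:window])
    sew = [running / window]
    for i in range(window, len(squared)):
        running += squared[i] - squared[i - window]
        sew.append(running / window)
    return sew
```

The abstract asks for a window close to the fundamental period, not equal to it. One way to see why: averaging a perfectly periodic squared signal over exactly one period gives a constant, which would erase the repetitive pattern the method relies on.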
TL;DR: Experiments show that both proposed models outperform a deep neural network (DNN) based model in most conditions and time-frequency LSTM achieves the best performance at negative SNRs.
Abstract: Pitch tracking in noisy speech is a challenging task as temporal and spectral patterns of the speech signal are both corrupted. This paper proposes long short-term memory (LSTM) based methods for pitch probability estimation. Two architectures are investigated. The first one is conventional LSTM that utilizes recurrent connections to model pitch dynamics. The second one is two-level time-frequency LSTM, with the first level scanning frequency bands and the second level connecting the first level through time. The Viterbi algorithm then takes the probabilistic output from LSTM to generate continuous pitch contours. Experiments show that both proposed models outperform a deep neural network (DNN) based model in most conditions. Time-frequency LSTM achieves the best performance at negative SNRs.
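The last stage described above passes frame-level pitch probabilities to the Viterbi algorithm to obtain a continuous contour. The sketch below shows that decoding step in isolation, assuming the probabilities are given as one list per frame over a fixed set of pitch states. The transition score, a penalty proportional to the jump between state indices, is an illustrative assumption and not the transition model used in the paper. The name `viterbi_pitch_track` is hypothetical.

```python
import math


def viterbi_pitch_track(posteriors, jump_penalty=1.0):
    """Return the most likely pitch-state index for every frame.

    posteriors: list of frames, each a list of probabilities over pitch states.
    Assumed transition log-score: -jump_penalty * |i - j| for a move i -> j.
    """
    n_states = len(posteriors[0])
    eps = 1e-12  # avoids log(0) for states with zero probability
    score = [math.log(p + eps) for p in posteriors[0]]
    back = []
    for frame in posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor for state j under the jump penalty.
            best_i = max(
                range(n_states),
                key=lambda i: score[i] - jump_penalty * abs(i - j),
            )
            pointers.append(best_i)
            new_score.append(
                score[best_i]
                - jump_penalty * abs(best_i - j)
                + math.log(frame[j] + eps)
            )
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With a positive penalty, an isolated frame whose highest probability sits far from its neighbours is pulled back onto the contour. With `jump_penalty=0` the result reduces to a frame-by-frame argmax.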
TL;DR: This method consists of the autocorrelation function of the multi-scale product calculation of the mixture signal, its filtered version by a rectangular improved comb filter and the dynamic programming of the residual signal spectral density to improve the residue pitch estimation.
Abstract: There are many multi-pitch estimation methods, but most of them cannot perform perfectly for intrusion pitch detection. For this reason, a new multi-pitch detection approach is proposed. This method consists of the autocorrelation function of the multi-scale product calculation of the mixture signal, its filtered version by a rectangular improved comb filter and the dynamic programming of the residual signal spectral density. First, we analyze the composite speech. Then, we apply the autocorrelation on the multi-scale product (AMP). We find the first pitch, which represents the dominant one. Then, we apply the rectangular comb filter, which has adaptive amplitude, to remove the resulting signal from the original one. We operate AMP on the residue to obtain a pitch estimation of the intrusion. To improve the residue pitch estimation, we apply dynamic programming to the spectral density of the residual signal to get optimum pitches corresponding also to the intrusion signal. After that, we compare the two resulting pitch residue series to choose the most appropriate. Finally, this method is evaluated using the Cooke database and is compared to other well-known techniques. Experimental results confirm the strength and the performance of the proposed approach.
TL;DR: A pitch detection method may apply the Pseudo Wigner-Ville Transformation (PWVT) as a spectral representation of the speech signal.
Abstract: A pitch detection method. Such a pitch detection method may apply the Pseudo Wigner-Ville Transformation (PWVT) as a spectral representation of the speech signal. Also, the pitch detection method may take the median value of each frame of the speech signal as a threshold for making the voicing decision. Additionally, the pitch detection method may take a moving average of the PWVT as the threshold for the voicing decision.
TL;DR: In this article, a modified version of YIN, called YIN-bird, is presented, which exploits spectrogram properties to automatically set a minimum fundamental frequency parameter for YIN.
Abstract: Pitch or fundamental frequency is an important feature of bird song, from which scientists can learn much about a population. To use pitch as a feature, researchers need confidence in their pitch extraction system. Pitch detection algorithms (PDAs) proven to work on human speech may not be suitable for all types of bird vocalizations. This paper discusses pitch estimation performance on a variety of common bird vocalizations. The presence of multiple partials or tones simultaneously, extended frequency sweeps through multiple octaves, and rapid pitch modulations are just some of the difficulties encountered when estimating the pitch of bird song. Carefully tuned parameters improve pitch tracking with YIN, but optimal parameters can change quickly even within one song. YIN is a PDA which estimates pitch of human speech very well. This paper presents YIN-bird, a modified version of YIN which exploits spectrogram properties to automatically set a minimum fundamental frequency parameter for YIN. Gross pitch e...
TL;DR: This paper proposes a modification to the YIN function using Fourier series approximation to reduce harmonic errors, which gives a better estimate of pitch in the case of quasi-periodic signals like music because the Fourier series coefficients capture the signal better.
Abstract: Several pitch estimation algorithms, in both the time domain and the frequency domain, have been proposed by researchers in the last few decades. Among them, the Square Difference Function (SDF) is one of the popular time-domain pitch detection algorithms. A modified SDF known as YIN is known to give better results. Each pitch detection algorithm has its limitations, and errors in pitch detection are observed across all algorithms. We specifically address one of the prominent pitch detection errors, known as harmonic errors, where the estimated pitch is one of the multiples of the actual pitch. In this paper, we propose a modification to the YIN function using Fourier series approximation to reduce harmonic errors. This method gives a better estimate of pitch in the case of quasi-periodic signals like music because the Fourier series coefficients capture the signal better. We have studied various harmonic signals and analyzed them to find out the cause of such harmonic errors. Then we apply the algorithm on real music signals as well as synthetic signals. Our simulation study shows that the new algorithm leads to an improved pitch estimation method.
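For context on the baseline that the abstract modifies, the sketch below computes the square difference function and the cumulative mean normalized version used in standard YIN, then picks the first dip below a threshold. It does not include the Fourier series modification proposed in the paper. The function names, the default threshold of 0.1, and the requirement that the signal holds at least `window + max_lag` samples are assumptions made for this illustration.

```python
import math


def cmnd(x, window, max_lag):
    """Cumulative mean normalized difference d'(tau) for tau = 0..max_lag.

    Requires len(x) >= window + max_lag.
    """
    diff = [0.0] * (max_lag + 1)
    for tau in range(1, max_lag + 1):
        # Square difference function over a fixed analysis window.
        diff[tau] = sum((x[j] - x[j + tau]) ** 2 for j in range(window))
    out = [1.0] * (max_lag + 1)  # d'(0) is defined as 1
    running = 0.0
    for tau in range(1, max_lag + 1):
        running += diff[tau]
        out[tau] = diff[tau] * tau / running if running > 0 else 1.0
    return out


def estimate_period(x, window, max_lag, threshold=0.1):
    """Return the estimated period in samples."""
    d = cmnd(x, window, max_lag)
    tau = 1
    while tau <= max_lag:
        if d[tau] < threshold:
            # Walk down to the local minimum of this dip.
            while tau + 1 <= max_lag and d[tau + 1] < d[tau]:
                tau += 1
            return tau
        tau += 1
    # No dip below the threshold: fall back to the global minimum.
    return min(range(1, max_lag + 1), key=lambda t: d[t])


fs = 8000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(600)]
period = estimate_period(tone, window=400, max_lag=200)
print(fs / period)
```

Taking the first dip below the threshold, and not the deepest one overall, is what keeps the estimate from landing on a multiple of the true period. That multiple-of-the-period case is the harmonic error the abstract targets.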
TL;DR: An algorithm for signal processing in the frequency domain using wavelets that can be applied to spectral density estimation using the Fourier periodogram and estimation of energy in different frequency bands is suggested.
Abstract: We suggest an algorithm for signal processing in the frequency domain using wavelets. The algorithm can be applied to spectral density estimation using the Fourier periodogram and estimation of energy in different frequency bands. One of the applications of the algorithm is the estimation of vibrational signal parameters using a parallel implementation of the algorithm.
TL;DR: The analysis of the excitation source features for various music components done in this work presents some insightful observations and clues towards effective music component processing.
Abstract: Regular pitch detection algorithms are known to be immensely useful for speech source analysis. Their utility is not as reliable when processing polyphonic acoustic mixtures like music. This is an investigative study of music components like rhythm, accompaniment and lyrical voicing, which is seen as a critical task towards targeted music component identification and processing. Popular music forms like Western and Hindustani Classical are considered for our study dataset. For Western cases, comparative preliminary analysis of spectral characteristics like harmonics and energy is done towards characterization of the music region against that of the lyrics-music mixture. \(F_{0}\) contour analysis for these regions, using autocorrelation and zero frequency filtering, indicates the utility of the latter in lyrical-voicing onset identification. Short-time spectral analysis leads to a distinctive understanding of the harmonic structure according to the music polyphony. Strength of Excitation is found to be insightful towards characterizing sounds like base sounds, prominent in percussion instruments. For the study on Classical music, \(F_{0}\) contour analysis using the raw signal and LP Residual elucidates the characteristic average pitch effect, which comes out to be higher for the Alaap region in the case of female artists and for lyrics composition regions for the male artists, giving cues towards applications like Raaga identification and summarization. The analysis of the excitation source features for various music components done in this work presents some insightful observations and clues towards effective music component processing.
TL;DR: In this article, a cable pitch measuring equipment based on machine vision technique of automatic slotted line cable pitch is provided, including optical imaging ware, light source, assembly, testing platform, cable meter rice device, supporting wheel, and vision pitch detection device.
Abstract: The utility model relates to a check out test set, especially cable pitch measuring equipment based on machine vision technique. The utility model aims to solve the technical problem that a cable pitch measuring equipment based on machine vision technique of automatic slotted line cable pitch is provided, including optical imaging ware, light source, assembly, testing platform, cable meter rice device, supporting wheel, vision pitch detection device, the assembly is installed on testing platform, cable meter rice device, relative just cable meter rice unit bit in the supporting wheel top that sets up of supporting wheel, the light source is located the top of cable meter rice device, the optical imaging ware is located the top of light source, optical imaging ware, light source, cable meter rice device all are connected with vision pitch detection device. The utility model discloses application machine vision technique carries out online automatic measure to pitch and average pitch after the cable transposition to solve the problem that measured deviation is big that exists completely in the measurement of present cable pitch.
TL;DR: A novel F0 estimation algorithm that initially estimates the glottal closure instants (GCIs) or pitch and then computes the corresponding fundamental frequency (F0) is proposed and evaluated.
Abstract: We propose a novel F0 estimation algorithm that initially estimates the glottal closure instants (GCIs) or pitch and then computes the corresponding fundamental frequency (F0). The proposed method eliminates the assumption that F0 is constant over a segment of short duration (i.e., 20–30 ms). We use our previously proposed novel filtering-based approach for GCI estimation. As the proposed method directly operates on the entire speech signal, it does not require setting the window length and thus is free from the problem of spectral leakage. Measuring the effectiveness of the proposed F0 estimation algorithm w.r.t. three state-of-the-art methods, namely, Yet Another Algorithm for Pitch tracking (YAAPT), Speech Transformation and Representation using Adaptive Interpolation of weiGHT spectrum (STRAIGHT) and Pitch Detection Algorithm (PDA), is challenging in the absence of the ground truth. Since accurate estimation of F0 will impact the quality of the converted voice in the Voice Conversion (VC) task, we measure the effectiveness of the F0 estimation algorithm in the VC task. The quality and speaker similarity of the converted voice have been evaluated using two subjective measures, namely, Mean Opinion Scores (MOS) and ABX test, respectively.
TL;DR: This work proposes a technique for predicting the pitch from Mel-frequency cepstral coefficients (MFCC) vectors, using a deep neural network (DNN) based predictor as baseline, and proposes a novel method to estimate the spectrum from MFBE that exploits the sparse nature of the voiced speech spectrum.
Abstract: This work proposes a technique for predicting the pitch from Mel-frequency cepstral coefficients (MFCC) vectors. Previous pitch prediction methods are based on statistical models such as Gaussian mixture models and hidden Markov models. In this paper, we propose a three-step method to estimate pitch from MFCC vectors. First, the Mel-filterbank energies (MFBEs) are estimated from MFCC vectors. Secondly, we propose a novel method to estimate the spectrum from MFBE that exploits the sparse nature of the voiced speech spectrum. Finally, the pitch is estimated from the recovered spectrum. We also explore the effect of different levels of truncation of the discrete cosine transformation (DCT) coefficients in MFCC computation on the pitch prediction error. We use the deep neural network (DNN) based predictor as baseline to predict the pitch from MFCC vectors. The experiments using the CMU-ARCTIC and KEELE databases show that the proposed three-step method generalizes better across databases and genders, resulting in a drop of ∼8Hz and ∼5Hz in average RMSE of predicted pitch with respect to those from DNN when 13-dimensional and 26-dimensional MFCC vectors are used for pitch prediction respectively. We also find that the sparsity constraint performs better in recovering the spectrum at lower pitch values.
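The first of the three steps above, estimating Mel-filterbank energies from MFCC vectors, can be sketched by inverting the DCT. The code assumes the MFCCs are an orthonormal DCT-II of the natural-log filterbank energies with no liftering, and treats truncated coefficients as zero. Real front ends differ in scaling and liftering, so these conventions and the function names are assumptions for illustration and not the paper's exact procedure.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a list of values."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct2(coeffs, n):
    """Inverse of the orthonormal DCT-II, returning n values.

    `coeffs` may hold fewer than n coefficients (truncated MFCCs);
    the missing high-order coefficients are treated as zero.
    """
    out = []
    for i in range(n):
        s = 0.0
        for k, ck in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * ck * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out


def mfbe_from_mfcc(mfcc, n_filters):
    """Approximate Mel-filterbank energies from an MFCC vector."""
    return [math.exp(v) for v in idct2(mfcc, n_filters)]
```

With all coefficients kept, the log energies are recovered exactly. Truncation smooths them across filters, which is presumably why the abstract examines how the level of DCT truncation affects pitch prediction error.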
TL;DR: A pitch detection method may have M-PWVT-TEO algorithm to detect a pitch value from a speech signal, apply a partial auto-correlation to a current signal with the pitch value to compensate the delay of the pitch values as mentioned in this paper.
Abstract: A pitch detection method. Such a pitch detection method may have M-PWVT-TEO algorithm to detect a pitch value from a speech signal, apply a partial auto-correlation to a current signal with the pitch value to compensate the delay of the pitch value. Also, the pitch detection method may apply a full auto-correlation to the speech signal where the pitch value is not detected to recover on-sets of the speech signal.
TL;DR: The target of this thesis paper is to develop digital signal processing algorithms and implement a software environment to detect the pitch of music signals and the Wavelet transform method is used for monophonic pitch detection.
Abstract: The target of this thesis paper is to develop digital signal processing algorithms and implement a software environment to detect the pitch of music signals. Analysis of music signals, detecting the pitch and tracking notes is the main part of the work. Different state-of-the-art pitch detection algorithms were investigated, and their advantages and disadvantages were studied and compared with each other. In this thesis, the Wavelet transform method is used for monophonic pitch detection. The proposed method was developed using MATLAB® and tested on various music tracks, which were produced with multi-track MIDI and audio editing software, and some downloaded from the net. A MATLAB® graphical user interface is used to display the detected pitch as feedback to the musician.
TL;DR: In this paper, the authors investigate the benefits and shortcomings of both the place theory and time theory approaches for pitch perception in psychoacoustics and propose a model consistent with the long standing focus on the frequency domain and then expand to a more modern approach that functions in the time domain.
Abstract: Pitch perception is a phenomenon that has been the subject of much debate within the psychoacoustics community. It is at once a psychological, physiological and mathematical issue that has divided scientists for the last 200 years. My project aims to investigate the benefits and shortcomings of both the place theory and time theory approaches. This is done first by a model consistent with the long standing focus on the frequency domain, and then by expanding to a more modern approach that functions in the time domain.
TL;DR: In this article, the adaptive pitch transposition module modifies the frequency content in a fashion so that the pitch is fixed to an optimal value, which is subsequently called the comfort pitch.