TL;DR: The pitch track onset detection algorithm shows an improvement over the previous best performing algorithm from a recent comparison study of onset detectors, and shows promise as a component of a note event analysis system.
Abstract: A segmentation strategy is explored for monophonic instrumental pitched non-percussive material (PNP) which proceeds from the assertion that human-like event analysis can be founded on a notion of stable pitch percept. A constant-Q pitch detector following the work of Brown and Puckette provides pitch tracks which are post processed in such a way as to identify likely transitions between notes. A core part of this preparation of the pitch detector signal is an algorithm for vibrato suppression. An evaluation task is undertaken on slow attack and high vibrato PNP source files with human annotated onsets, exemplars of a difficult case in monophonic source segmentation. The pitch track onset detection algorithm shows an improvement over the previous best performing algorithm from a recent comparison study of onset detectors. Whilst further timbral cues must play a part in a general solution, the method shows promise as a component of a note event analysis system.
TL;DR: A robust algorithm to detect the pitch of a singing voice in polyphonic audio is proposed and an HMM is employed to integrate the periodicity information across frequency channels and time frames.
Abstract: We propose a robust algorithm to detect the pitch of a singing voice in polyphonic audio. A new channel/peak selection scheme is introduced to exploit the salience of the singing voice and the beating phenomenon in high frequency channels. An HMM is employed to integrate the periodicity information across frequency channels and time frames. Quantitative evaluation shows that the new system performs significantly better than existing algorithms for predominant pitch detection in polyphonic audio.
TL;DR: A pitch synchronous overlap and add (PSOLA) algorithm is used for pitch and duration modifications in the watermark embedding phase and experiments with multiple speech codecs show very good robustness with low data-rate speech coders.
Abstract: We propose a speech watermarking algorithm based on the modification of the pitch (fundamental frequency) and duration of the quasi-periodic speech segments. Natural variability of these speech features allows watermarking modifications to be imperceptible to the human observer. On the other hand, the significance of these features makes the system robust to common signal processing operations and low data-rate source excitation based speech coders. This class of coders is particularly obstructive for conventional audio watermarking algorithms when applied to speech signals. A pitch synchronous overlap and add (PSOLA) algorithm is used for pitch and duration modifications in the watermark embedding phase. Experiments with multiple speech codecs show very good robustness with low data-rate (5-8 kbps) speech coders.
TL;DR: It is shown that the proposed method using joint time and frequency domain cues is able to give a superior accuracy relative to some of the existing methods even at a very low SNR of -10 dB.
Abstract: In this paper, we present a joint time/frequency domain approach for pitch estimation of speech at a very low SNR. The kernel of this approach lies in introducing a new function for detecting the time-domain cue by modifying the circular average magnitude difference function (CAMDF). By using the new function in conjunction with the half-wave rectified version of the autocorrelation function, the pitch-peak can be emphasized and the non-pitch peaks suppressed. To guarantee robust pitch detection in noisy speech, an a priori frequency-domain estimate of the dominant pitch-harmonic is extracted as an additional cue and is utilized to optimally match the pitch-peak in the time domain. The proposed approach is simulated using the Keele reference database. It is shown that the proposed method using joint time and frequency domain cues is able to give a superior accuracy relative to some of the existing methods even at a very low SNR of -10 dB.
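The two time-domain cues named in the abstract above can be illustrated with a minimal sketch. It computes the standard CAMDF and a half-wave rectified autocorrelation; the paper's modified function and its frequency-domain matching step are not specified in the abstract and are not reproduced here. The function names and the test frame are assumptions made for illustration.

```python
import math

def camdf(frame):
    """Standard circular AMDF: the lagged index wraps around the frame."""
    n = len(frame)
    return [
        sum(abs(frame[(i + lag) % n] - frame[i]) for i in range(n))
        for lag in range(n)
    ]

def half_wave_rectified_acf(frame):
    """Autocorrelation with negative values clipped to zero."""
    n = len(frame)
    acf = [sum(frame[i] * frame[i + lag] for i in range(n - lag)) for lag in range(n)]
    return [max(value, 0.0) for value in acf]

def period_candidates(frame, min_lag, max_lag):
    """Return (lag of the CAMDF valley, lag of the rectified-ACF peak)."""
    lags = range(min_lag, max_lag + 1)
    d = camdf(frame)
    r = half_wave_rectified_acf(frame)
    valley = min(lags, key=lambda lag: d[lag])
    peak = max(lags, key=lambda lag: r[lag])
    return valley, peak

# Hypothetical clean test frame: a sinusoid with a period of 40 samples.
frame = [math.sin(2 * math.pi * i / 40) for i in range(200)]
candidates = period_candidates(frame, 20, 60)
```

On a clean periodic frame the valley and the peak fall at the same lag; in noise they can disagree, which is presumably why the abstract combines them with a further cue.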
TL;DR: A generalised IRN algorithm is presented, in which multiple time varying temporal correlations can be defined and the resulting time varying pitches are perceptually very salient.
Abstract: Iterated ripple noise (IRN) is a broadband noise with temporal regularities, which can give rise to a perceptible pitch. Since the perceptual pitch to noise ratio of these stimuli can be altered without substantially altering their spectral content, they have been useful in exploring the role of temporal processing in pitch perception [Yost, W.A., 1996. Pitch strength of iterated rippled noise, J. Acoust. Soc. Am. 100 (5), 3329-3335; Patterson, R.D., Handel, S., Yost, W.A., Datta, A.J., 1996. The relative strength of the tone and noise components in iterated rippled noise, J. Acoust. Soc. Am. 100 (5), 3286-3294]. A generalised IRN algorithm is presented, in which multiple time varying temporal correlations can be defined. The resulting time varying pitches are perceptually very salient. It is also possible to segregate and track multiple simultaneous time varying pitches in these stimuli. Temporal auditory models have previously been shown to account for the perception of IRNs with static delays [Patterson, R.D., Handel, S., Yost, W.A., Datta, A.J., 1996. The relative strength of the tone and noise components in iterated rippled noise, J. Acoust. Soc. Am. 100 (5), 3286-3294]. Here we show that some simple modifications to one such model [Meddis, R., Hewitt, M.J., 1991. Virtual pitch and phase sensitivity of a computer model of the auditory periphery I. Pitch identification, J. Acoust. Soc. Am. 89, 2866-2882] allow it to track moving correlations, and also improve its performance in response to static correlations.
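For readers unfamiliar with IRN, the following is a minimal sketch of the conventional static-delay case only: Gaussian noise is repeatedly delayed and added back to itself. The generalised, time-varying algorithm of the abstract above is not reproduced. The function names, the default gain and the iteration count are assumptions chosen for illustration.

```python
import random

def iterated_rippled_noise(num_samples, delay, gain=1.0, iterations=8, seed=0):
    """Static-delay IRN: each iteration adds a delayed copy of the current signal."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    for _ in range(iterations):
        delayed = [0.0] * delay + signal[:num_samples - delay]
        signal = [s + gain * d for s, d in zip(signal, delayed)]
    return signal

def normalized_autocorrelation(signal, lag):
    """Autocorrelation at `lag`, normalized by the zero-lag energy."""
    num = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
    den = sum(v * v for v in signal)
    return num / den

irn = iterated_rippled_noise(8000, delay=50)
at_delay = normalized_autocorrelation(irn, 50)   # lag equal to the delay
off_delay = normalized_autocorrelation(irn, 57)  # unrelated lag
```

The temporal regularity shows up as a strong autocorrelation at the delay and a weak one at unrelated lags; the pitch heard corresponds to the reciprocal of the delay.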
TL;DR: In this paper, a method for detecting music in a speech signal having a plurality of frames is presented, which consists of obtaining one or more first pitch correlation candidates from a first frame of the plurality of frames, obtaining one or more second pitch correlation candidates from a second frame of the plurality of frames, and selecting a pitch correlation (Rp) from the first and second candidates.
Abstract: A method is provided for detecting music in a speech signal having a plurality of frames. The method comprises obtaining one or more first pitch correlation candidates from a first frame of the plurality of frames; obtaining one or more second pitch correlation candidates from a second frame of the plurality of frames; selecting a pitch correlation (Rp) from the one or more first pitch correlation candidates and the one or more second pitch correlation candidates; and distinguishing music from background noise based on analyzing the pitch correlation (Rp). The method may further comprise filtering the speech signal using a first-order low-pass filter prior to obtaining the one or more first pitch correlation candidates, and downsampling the speech signal by four prior to obtaining the one or more first pitch correlation candidates.
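The optional pre-processing in the abstract above (first-order low-pass filtering, then downsampling by four) and a normalized pitch correlation can be sketched as follows. The filter coefficient `alpha`, the correlation formula and all function names are assumptions; the abstract does not give them, and the candidate selection and music/noise decision are not shown.

```python
import math

def one_pole_lowpass(samples, alpha=0.5):
    """First-order low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]."""
    out, prev = [], 0.0
    for x in samples:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out

def downsample_by_four(samples):
    """Keep every fourth sample."""
    return samples[::4]

def pitch_correlation(frame, lag):
    """Normalized correlation between a frame and its copy shifted by `lag`."""
    a, b = frame[lag:], frame[:-lag]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

# Hypothetical input: a tone with a 160-sample period, i.e. 40 samples after decimation.
raw = [math.sin(2 * math.pi * i / 160) for i in range(800)]
frame = downsample_by_four(one_pole_lowpass(raw))
rp = pitch_correlation(frame, 40)
```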
TL;DR: A new method for pitch detection based on the continuous wavelet transform phase is presented that can serve as an accurate pitch detector, and also can offer an efficient solution to the end-point detection problem.
Abstract: Voice control has long been considered as a natural mechanism to assist powered wheelchair users. However, one implementation difficulty is that a voice input system may fail to recognise a user's voice. Indeed, speech activated interface between human and autonomous/semi-autonomous systems requires accurate detection and recognition. In this area pitch and end-point detection are of vital importance. This paper presents a new method for pitch detection based on the continuous wavelet transform phase. The proposed technique can serve as an accurate pitch detector, and also can offer an efficient solution to the end-point detection problem. The extracted features from a user's speech are then used to train a neural network for speech recognition. Experimental results are provided for the detection of pitch periods and end points and the recognition of a number of commands of male and female users. Laboratory tests are reported for the proposed voice control wheelchair system.
TL;DR: A new spectral representation-based pitch estimation method derived from the Shorttime Harmonic Chirp Transform, which lets the technique perform very well in noisy conditions and extract pitch values with high confidence, even from segments with strong intonations.
Abstract: This paper introduces a new spectral representation-based pitch estimation method. Since pitch is never stationary during real conversations, but often undergoes changes because of intonation, the spectral representation is derived from the Shorttime Harmonic Chirp Transform. This lets our technique perform very well in noisy conditions, and extract pitch values with high confidence, even from segments with strong intonations. The paper discusses a new way of segment-wise pitch extraction and does not deal with continuous pitch tracking, which is a topic of our future work. However, the performance of the proposed method is demonstrated on real recordings and the noise-dependency of its accuracy is numerically analyzed.
TL;DR: In this article, a spectrum extraction unit (104) extracts a pitch-harmonized spectrum from a voice spectrum, and a spectral average calculation unit (106) calculates the average of the power of the pitch-harmonized spectra extracted by the spectrum extraction unit, in a manner to individually correspond to a plurality of pitch frequency candidates.
Abstract: A pitch frequency estimation device is provided that is capable of estimating a pitch frequency precisely while reducing the computational complexity required for the estimation of the pitch frequency. In this device, a spectrum extraction unit (104) extracts a pitch-harmonized spectrum from a voice spectrum. A spectral average calculation unit (106) calculates the average of the power of the pitch-harmonized spectra extracted by the spectrum extraction unit (104), in a manner to individually correspond to a plurality of pitch frequency candidates. An estimation unit estimates the pitch frequency by using the average value calculated by the spectral average calculation unit (106).
TL;DR: The results indicate that CBZ influences pitch detection peripheral of an octave-circular pitch representation, which supports previous evidence for pitch detection in the auditory midbrain and for octave-circular pitch mapping in the auditory thalamus.
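One plausible reading of the scheme in the abstract above is to average the spectral power at the harmonics of each candidate frequency and pick the candidate with the largest average. The sketch below follows that reading; the bin mapping, the selection rule and every name are assumptions, since the abstract does not specify them.

```python
def harmonic_power_average(power_spectrum, bin_hz, f0_hz, max_hz):
    """Average power at the bins nearest to the harmonics of `f0_hz` up to `max_hz`."""
    values = []
    k = 1
    while k * f0_hz <= max_hz:
        index = round(k * f0_hz / bin_hz)
        if index < len(power_spectrum):
            values.append(power_spectrum[index])
        k += 1
    return sum(values) / len(values) if values else 0.0

def estimate_pitch(power_spectrum, bin_hz, candidates_hz, max_hz):
    """Pick the candidate whose harmonic bins carry the highest average power."""
    return max(
        candidates_hz,
        key=lambda f0: harmonic_power_average(power_spectrum, bin_hz, f0, max_hz),
    )

# Hypothetical spectrum: 10 Hz bins, harmonics of 200 Hz with power 1/k.
spectrum = [0.0] * 401
for k in range(1, 21):
    spectrum[20 * k] = 1.0 / k
pitch = estimate_pitch(spectrum, 10.0, [100.0, 150.0, 200.0, 400.0], 4000.0)
```

Averaging rather than summing penalizes the sub-octave candidate, because half of its harmonic bins are empty.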
TL;DR: The proposed algorithm is effective for signals with strong harmonic content, as well as for nearly sinusoidal ones, and as an extension to the presented octave error optimized algorithm, a method of estimating instantaneous pitch is described.
Abstract: The aim of this article is to present an octave error optimized pitch detection algorithm based on spectral analysis. The proposed algorithm is effective for signals with strong harmonic content, as well as for nearly sinusoidal ones. In addition, as an extension to the presented octave error optimized algorithm, a method of estimating instantaneous pitch is described. Experiments and estimation accuracy tests in terms of octave errors were performed on a variety of musical instruments (i.e., 567 sounds played on acoustic instruments with various articulations and dynamics, with fundamental frequencies ranging from 34 Hz up to 1700 Hz, were processed). Fine pitch error tests of the instantaneous pitch estimation algorithm were performed for 4,000 different synthetic signals, with frequencies ranging from 50 Hz to 4000 Hz, including both clean signals and signals contaminated with noise. Results exemplifying the main issues of both engineered algorithms are shown. In addition, a performance compar...
TL;DR: Simulation results have verified that the ALNF is very effective in estimating the pitch period of the given speech signal under background noise environments.
TL;DR: A new method employing music prediction to support pitch determination is introduced in order to overcome the disadvantages of standard pitch detection algorithms.
Abstract: Pitch detection methods are widely used for extracting musical data from digital signals. A review of those methods is presented in the paper. Since musical signals may contain noise and distortion, detection results can be erroneous. In this paper a new method employing music prediction to support pitch determination is introduced. This method was developed in order to overcome the disadvantages of standard pitch detection algorithms. The new approach utilizes signal segmentation and pitch prediction based on musical knowledge extraction employing artificial neural networks. Signal segmentation allows the pitch to be estimated for a single note as a whole, thereby suppressing errors in the transient and decay phases. Pitch prediction helps correct pitch estimation errors by tracking the musical context of the analyzed signal. As shown in the experimental results, pitch estimation errors may be reduced by using both signal segmentation and music prediction techniques.
TL;DR: In this article, a pitch pattern generation method which enables generation of a stable pitch pattern with high naturalness is provided: a pattern selection part 10 selects N pitch patterns 101 and M pitch patterns 103 for each prosody control unit from pitch patterns stored in a pitch pattern storage part 14 based on language attribute information 100 obtained by analyzing a text and phoneme duration 111, and a pattern shape generation part 11 fuses the N selected pitch patterns 101 based on the language attribute information 100 to generate a fused pitch pattern and performs expansion or contraction of the fused pitch pattern in a time axis direction in accordance with the phon
Abstract: A pitch pattern generation method which enables generation of a stable pitch pattern with high naturalness is provided, a pattern selection part 10 selects N pitch patterns 101 and M pitch patterns 103 for each prosody control unit from pitch patterns stored in a pitch pattern storage part 14 based on language attribute information 100 obtained by analyzing a text and phoneme duration 111, a pattern shape generation part 11 fuses the N selected pitch patterns 101 based on the language attribute information 100 to generate a fused pitch pattern and performs expansion or contraction of the fused pitch pattern in a time axis direction in accordance with the phoneme duration 111 to generate a new pitch pattern 102, an offset control part 12 calculates a statistic amount of offset values from the M selected pitch patterns 103 and deforms the pitch pattern 102 in accordance with the statistic amount to output a pitch pattern 104, and a pattern connection part 13 connects the pitch pattern 104 generated for each prosody control unit, performs a process of smoothing so that discontinuity does not occur at a connection boundary portion, and outputs a sentence pattern 121.
TL;DR: A new, relatively accurate, and real-time pitch tracking algorithm is proposed which does not need any extra preprocessing and post-processing and can achieve remarkably good performance for pitch tracking.
Abstract: This paper presents a novel pitch tracking method in the time domain. Based on the difference function used in YIN (referred to hereafter as the sum magnitude difference square function, SMDSF), we propose two modified types of SMDSF, with several methods presented to calculate these SMDSFs efficiently and without bias by using the FFT algorithm. In pitch estimation, every type of SMDSF has its own estimation error characteristics. By analyzing these characteristics, we define a new function which combines the aforesaid two types of SMDSF to prevent estimation errors. A new, relatively accurate, and real-time pitch tracking algorithm is then proposed which does not need any extra preprocessing or post-processing. Experimental results show that the proposed algorithm can achieve remarkably good performance for pitch tracking.
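As background for the abstract above, the YIN-style difference function can be written as d(tau) = sum over j of (x[j] - x[j + tau])^2, which expands into two energy terms minus twice a correlation term. That correlation is the part an FFT can accelerate. The sketch below only checks this identity with plain loops; the paper's two modified SMDSFs and their combination are not reproduced, and the function names are assumptions.

```python
import math

def smdsf_direct(x, window, max_lag):
    """d(lag) = sum_j (x[j] - x[j + lag])^2 over a fixed window."""
    return [
        sum((x[j] - x[j + lag]) ** 2 for j in range(window))
        for lag in range(max_lag + 1)
    ]

def smdsf_decomposed(x, window, max_lag):
    """Same function as energy(0) + energy(lag) - 2 * correlation(lag)."""
    energy_0 = sum(v * v for v in x[:window])
    result = []
    for lag in range(max_lag + 1):
        energy_lag = sum(v * v for v in x[lag:lag + window])
        # In a fast implementation this correlation would come from an FFT.
        correlation = sum(x[j] * x[j + lag] for j in range(window))
        result.append(energy_0 + energy_lag - 2.0 * correlation)
    return result

# Hypothetical test signal with a period of 25 samples.
x = [math.sin(2 * math.pi * i / 25) for i in range(200)]
direct = smdsf_direct(x, window=100, max_lag=60)
decomposed = smdsf_decomposed(x, window=100, max_lag=60)
period = min(range(10, 41), key=lambda lag: direct[lag])
```

Both functions require `len(x) >= window + max_lag`.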
TL;DR: This letter presents a new fast search algorithm for the multitap adaptive codebook used in the G.723.1 standard speech coder that adopts a sequential and restricted approach to determine the parameters.
Abstract: This letter presents a new fast search algorithm for the multitap adaptive codebook used in the G.723.1 standard speech coder. In contrast with the standard method, in which a closed-loop pitch lag and the gains for a fifth-order pitch predictor are searched simultaneously, the proposed algorithm adopts a sequential and restricted approach to determine the parameters. In other words, the proposed scheme first determines a couple of pitch lag candidates using a first-order pitch predictor and then computes the pitch gains of the fifth-order predictor within a restricted search area. Experimental results confirm that the proposed algorithm reduces the total complexity by 30.69% in the encoding process and provides speech quality equivalent to the standard method.
TL;DR: This paper uses lower-lag and higher-lag ranges of the autocorrelation function separately for deriving speech recognition features, and investigates their role in terms of speech recognition performance.
Abstract: It is generally believed that the lower-lag autocorrelation coefficients carry information about the spectral envelope and the higher-lag autocorrelation coefficients are more related to pitch information. In this paper, we use lower-lag and higher-lag ranges of the autocorrelation function separately for deriving speech recognition features, and investigate their role in terms of speech recognition performance. The state-of-the-art MFCC (mel frequency cepstral coefficient) features use the whole autocorrelation function in their computation and are used here as a benchmark in our experiments. Our recognition results from the Aurora II corpus show that the higher-lag autocorrelation coefficients perform as well as the whole autocorrelation function for clean speech, and provide better performance for noisy speech, while lower-lag autocorrelation coefficients are not as effective in this aspect.
TL;DR: In this article, a new event detection pitch detector based on the dyadic wavelet transform was constructed by selecting an optimal scale, which is accurate, robust to noise and computationally simple.
Abstract: In this paper, a new event detection pitch detector based on the dyadic wavelet transform was constructed by selecting an optimal scale. The proposed pitch detector is accurate, robust to noise and computationally simple. Experiments show the superior performance of this event-based pitch detector in comparison with previous event-based pitch detectors and classical pitch detectors that use the autocorrelation and the cepstrum methods to estimate the pitch period.
TL;DR: A novel method for the separation of monaurally recorded speech signals based on pitch inspired by the ability of some auditory neurons to phase lock with the excitation signal, which performs slightly better than the commonly used autocorrelation at lower computational costs.
Abstract: We present a novel method for the separation of monaurally recorded speech signals based on pitch. Our method is inspired by the ability of some auditory neurons to phase lock with the excitation signal. After applying a Gammatone filter-bank to the original signal, we compare the distances between zero crossings of possible harmonics and decide, based on this comparison, whether they share the same fundamental and hence originate from the same sound source. For higher frequencies we use the amplitude modulation property of unresolved harmonics to determine their fundamental frequency. When comparing our method to standard autocorrelation-based methods, we see that the pitch can be tracked more precisely; in particular, this opens the way to also extracting the pitch contour of a second speaker or other sound sources, which can be of importance for the robot's behavior. Tests of our algorithm in sound source separation on a database with several speakers and a large set of intrusions show that it performs slightly better than the commonly used autocorrelation at lower computational costs.
TL;DR: A new method is proposed that can improve the accuracy of cepstrum pitch detection and can reduce the processing time by omitting the bit-reversing process from the FFT and IFFT computation.
Abstract: In this paper, we propose a new method that can improve the accuracy of cepstrum pitch detection and reduce the processing time. We control the phase information of the cepstrum to maximize the pitch peak, so that the exact pitch period can be extracted easily. We shorten the processing time by omitting the bit-reversing process from the FFT and IFFT computation.
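For context, conventional cepstrum pitch detection (the baseline the abstract above improves on) takes the inverse transform of the log magnitude spectrum and looks for a peak at the pitch period. The sketch below shows only that baseline with a naive DFT; the phase control and the omission of bit reversal described in the abstract are not reproduced. The function names and the magnitude floor are assumptions.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(
            values[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n)
        )
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame, floor=1e-9):
    """Inverse DFT of the log magnitude spectrum; `floor` avoids log(0)."""
    log_magnitude = [math.log(abs(v) + floor) for v in dft(frame)]
    return [c.real for c in dft(log_magnitude, inverse=True)]

def cepstral_pitch_period(frame, min_period, max_period):
    """Quefrency (in samples) of the largest cepstral value in the search range."""
    cepstrum = real_cepstrum(frame)
    return max(range(min_period, max_period + 1), key=lambda q: cepstrum[q])

# Hypothetical test frame: an impulse train with a period of 32 samples.
frame = [1.0 if i % 32 == 0 else 0.0 for i in range(256)]
period = cepstral_pitch_period(frame, 20, 50)
```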
TL;DR: In this paper, a digitised audio signal and a digital guide audio signal are supplied to a time alignment process that produces a time-aligned new signal, time-aligned to the guide signal.
Abstract: A digitised audio signal 310, such as an amateur's singing, and a digital guide audio signal 312 are supplied to a time alignment process 320 that produces a time-aligned new signal 330, time-aligned to the guide signal. Pitch along the time-aligned new signal 330 and along the guide signal 312 is measured in processes 340 and 345, which supply these measurements to a pitch adjustment calculator 370, which calculates a pitch correction factor C'(Fps) from these measurements and the nearest octave ratio of the signals. A pitch changing process 380 modulates the pitch of the time-aligned new signal 330 to produce a time-aligned and pitch adjusted new signal 390.
TL;DR: It is shown that the gross pitch error (GPE) of the proposed detection method is significantly decreased in severely noisy speech.
Abstract: In this report, we present a new robust pitch detection method for noisy speech. The conventional method, i.e., AUTOC, is vulnerable to severe noise environments, especially periodic noise. In the case of additive car noise, the detection accuracy deteriorates considerably. A new detection method is proposed by adding to AUTOC a process that implements band-pass filtering on the modulation spectra of the speech sections. The second-power amplitude spectrum of speech in the autocorrelation computation of AUTOC is replaced by the third-power amplitude spectrum. In addition, a band-limitation operation in the frequency domain is carried out, adapted to the pitch features of human speech. An evaluation using 10 Chinese words is undertaken to compare the proposed detection method with AUTOC and a recent method based on the exponentiated band-limited amplitude spectrum. The experiment covers noise levels ranging from 0 dB SNR to 10 dB SNR with white noise, colored noise and car interior noise. It is shown that the gross pitch error (GPE) of the proposed detection method is significantly decreased in severely noisy speech.
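The gross pitch error used as the evaluation measure above is commonly computed as the fraction of voiced frames whose estimate deviates from the reference by more than a relative tolerance. The sketch below uses a 20% tolerance and treats a value of 0 as "unvoiced"; both are common conventions assumed here, since the abstract does not state the exact definition it uses.

```python
def gross_pitch_error(reference_hz, estimated_hz, tolerance=0.2):
    """Fraction of frames, voiced in both tracks, that deviate by more than `tolerance`."""
    voiced = [
        (ref, est)
        for ref, est in zip(reference_hz, estimated_hz)
        if ref > 0 and est > 0  # assumption: 0 marks an unvoiced frame
    ]
    if not voiced:
        return 0.0
    gross = sum(1 for ref, est in voiced if abs(est - ref) > tolerance * ref)
    return gross / len(voiced)

# Hypothetical per-frame pitch tracks in Hz.
reference = [100.0, 100.0, 200.0, 0.0, 150.0]
estimate = [101.0, 200.0, 190.0, 120.0, 0.0]
gpe = gross_pitch_error(reference, estimate)
```

Here three frames are voiced in both tracks and one of them is an octave error, so the measure counts one gross error out of three.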
TL;DR: In this paper, the authors propose a voice processor that converts input voices into choral voices with many people and ensemble sound by a simple configuration, where a pitch detection part 12 detects a pitch Pin of an input voice signal Vin supplied from a voice input part 61.
Abstract: PROBLEM TO BE SOLVED: To provide a voice processor and a program which convert input voices into choral voices with many people and ensemble sound by a simple configuration. SOLUTION: A pitch detection part 12 detects a pitch Pin of an input voice signal Vin supplied from a voice input part 61. An envelope detection part 13 detects a spectrum envelope of the input voice signal Vin. A spectrum acquisition means 30 acquires a frequency spectrum of the voices for conversion including a plurality of voices uttered in parallel. A pitch conversion part 21 varies each peak frequency of the frequency spectrum acquired by the spectrum acquisition means 30 according to the pitches Pin. An envelope adjustment part 22 adjusts the spectrum envelope of the frequency spectrum having been processed by the pitch conversion part 21 so that the envelope almost matches the spectrum envelope detected by the envelope detection part 13. A voice production means 40 produces an output voice signal Vnew from the frequency spectrum having been adjusted by the envelope adjustment part 22. COPYRIGHT: (C)2006,JPO&NCIPI
TL;DR: Experiments show that this new pitch detection method with the combination of morphology filter and wavelet transform can realize pitch period calculation accurately and it has better robustness.
Abstract: This paper proposes a new pitch detection method with the combination of morphology filter and wavelet transform. Before detection, a pre-process based on the morphology filter is performed to remove the noise. Then the pitch is detected by wavelet transform. Experiments show that this method can realize pitch period calculation accurately and that it has better robustness.
TL;DR: An integrated pitch estimation approach for severely colored noise-corrupted speech that provides a superior accuracy relative to some of the existing methods implemented in the presence of colored noise, even at a very low signal-to-noise ratio (SNR) of -15 dB.
Abstract: We present an integrated pitch estimation approach for severely colored noise-corrupted speech. An effective colored noise-whitening process is first applied to the noisy speech. Then, a variable-length average magnitude difference function (VLAMDF) of the pre-filtered noisy speech (PFNS) is proposed, which almost conquers the trend of falling valleys in the conventional AMDF. The amplitude characteristic of the VLAMDF is reshaped by means of a simple linear transformation to reduce the possibility of double-pitch-errors. As the VLAMDF exhibits a valley while the autocorrelation function (ACF) of PFNS provides a peak, the ACF is weighted by the reciprocal of the VLAMDF to emphasize the pitch-candidate as well as to suppress the non-pitch peaks. Moreover, a noise-robust pitch detection in the time-domain is guaranteed by collaboration of this enhanced autocorrelation function with the reshaped version of the VLAMDF. The proposed approach is simulated using the Keele reference database and provides a superior accuracy relative to some of the existing methods implemented in the presence of colored noise, even at a very low signal-to-noise ratio (SNR) of -15 dB.
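The weighting idea in the abstract above (an ACF peak divided by an AMDF valley) can be sketched with the conventional AMDF standing in for the paper's VLAMDF. The noise whitening, the variable-length window and the linear reshaping step are not reproduced; the function names and the small constant `eps` are assumptions.

```python
import math

def amdf(frame, lag):
    """Conventional average magnitude difference at `lag` (stand-in for the VLAMDF)."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def acf(frame, lag):
    """Autocorrelation at `lag`."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

def weighted_acf_period(frame, min_lag, max_lag, eps=1e-6):
    """Lag maximizing ACF / (AMDF + eps); `eps` guards against division by zero."""
    def score(lag):
        return acf(frame, lag) / (amdf(frame, lag) + eps)
    return max(range(min_lag, max_lag + 1), key=score)

# Hypothetical test frame: fundamental plus second harmonic, period of 50 samples.
frame = [
    math.sin(2 * math.pi * i / 50) + 0.5 * math.sin(4 * math.pi * i / 50)
    for i in range(300)
]
period = weighted_acf_period(frame, 20, 80)
```

Because the AMDF is near zero exactly where the ACF peaks, the ratio sharpens the true pitch candidate relative to neighbouring lags.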
TL;DR: The experimental results show that in noisy environments the proposed pre-processing and post-processing can improve the reliability and accuracy of the classical pitch extraction method.
Abstract: In this paper, a new pitch detection method based on the classical autocorrelation function is proposed. The method's distinguishing feature lies in its pre-processing and post-processing. It overcomes a disadvantage of the autocorrelation function, namely the half- and double-frequency errors that often occur in low-SNR environments. It also overcomes the stochastic error problem that arises when the speech signal changes comparatively sharply. The experimental results show that in noisy environments this processing can improve the reliability and accuracy of the classical pitch extraction method.
TL;DR: In this article, the authors explored the fourth order cumulant using autoregressive model (AR(p)) and presented a new algorithm for pitch detection of voiced sounds with and without colored Gaussian noise and showed the superiority of the novel method over the classical methods such as cepstral method.
Abstract: In a number of speech applications, such as coding, synthesis or recognition, it is crucial to make a reliable discrimination between voiced/unvoiced segments and accurately determine the pitch period. The problem of accurate estimation and decision in noisy conditions remains open. Higher-order statistics (H.O.S) have inherent properties that make them well suited to dealing with a mixture of Gaussian and non-Gaussian processes. This paper explores the fourth order cumulant using an autoregressive model (AR(p)), presents a new algorithm for pitch detection of voiced sounds with and without colored Gaussian noise, and shows the superiority of the novel method over classical methods such as the cepstral method.
TL;DR: A real-time ‘Vocal Vibrato Effecter’ running under a Windows PC, which automatically adds a vibrato effect to the vocal input and the key novelty in this work is the combination of pitch detector and pitch shifter.
Abstract: Karaoke is one of the largest commercial industries using audio & video in Asia today. One of the most popular features incorporated into the vocals produced by Karaoke singers has been Echo/Reverb. In addition to Echo/Reverb, if vibrato is added to vocal signals, then the vocal vibrato produced has the potential of making the singer feel more comfortable, confident and professional with regards to their singing. In this paper, we present a real-time ‘Vocal Vibrato Effecter’ running under a Windows PC, which automatically adds a vibrato effect to the vocal input. The proposed technique exploits the vocal energy level and the temporal consistency of the pitch variation. The key novelty in this work is the combination of pitch detector and pitch shifter. This effecter can be applied to consumer/commercial Karaoke systems to enhance a vocal signal.
TL;DR: This CAR model is tested for applications such as pitch detection in a speech signal and detection of QRS complex in an electrocardiogram (ECG) signal.
Abstract: In this paper, we propose a novel autoregressive (AR) model, the constrained autoregressive (CAR) model, for various signal modeling applications. The CAR model is based on constraining one of the model parameters of the autoregressive model. This helps obtain a modified or desired AR spectrum for the signal. Constraining different AR parameters or changing the values of a particular parameter results in a different AR spectrum for the signal. The value of this constrained parameter can be used for externally controlling the gain or improving the spectral resolution between two peaks in the spectrum. In this work, the a0 parameter is constrained and different values are assigned for this coefficient. This changes the spectral gain. This CAR model is tested for applications such as pitch detection in a speech signal and detection of the QRS complex in an electrocardiogram (ECG) signal. A higher gain CAR error filter improves the efficiency of the pitch detection algorithms or QRS complex detection.
TL;DR: The results indicate that the additive white noise Kalman filters provide an audible improvement in output speech quality, and an improved pitch detection, and the feasibility of using the Kalman filter for noise reduction is clearly shown.
Abstract: It can be used to easily change or to maintain the naturalness and intelligibility of quality in speech synthesis and to eliminate the personality for speaker-independence in speech recognition. In this paper, we propose a new pitch detection algorithm. Kalman filters are implemented for filtering speech contaminated by additive white noise or colored noise, and an iterative signal and parameter estimator which can be used for both noise types is presented. The performance was compared with the LPC, cepstrum and ACF methods. The pitch information obtained improved the accuracy of pitch detection, and the gross error rate is reduced in voiced speech regions and in transition regions where the phoneme changes. The results also indicate that the additive white noise Kalman filters provide an audible improvement in output speech quality and improved pitch detection. This paper clearly shows the feasibility of using the Kalman filter for noise reduction.