TL;DR: This chapter shows a comprehensive approximation to the main challenges in voice activity detection, the different solutions that have been reported in a complete review of the state of the art and the evaluation frameworks that are normally used.
Abstract: An important drawback affecting most of the speech processing systems is the environmental noise and its harmful effect on the system performance. Examples of such systems are the new wireless communications voice services or digital hearing aid devices. In speech recognition, there are still technical barriers inhibiting such systems from meeting the demands of modern applications. Numerous noise reduction techniques have been developed to palliate the effect of the noise on the system performance and often require an estimate of the noise statistics obtained by means of a precise voice activity detector (VAD). Speech/non-speech detection is an unsolved problem in speech processing and affects numerous applications including robust speech recognition (Karray and Marting, 2003; Ramirez et al. 2003), discontinuous transmission (ITU, 1996; ETSI, 1999), real-time speech transmission on the Internet (Sangwan et al., 2002) or combined noise reduction and echo cancellation schemes in the context of telephony (Basbug et al., 2004; Gustafsson et al., 2002). The speech/non-speech classification task is not as trivial as it appears, and most of the VAD algorithms fail when the level of background noise increases. During the last decade, numerous researchers have developed different strategies for detecting speech on a noisy signal (Sohn et al., 1999; Cho and Kondoz, 2001; Gazor and Zhang, 2003, Armani et al., 2003) and have evaluated the influence of the VAD effectiveness on the performance of speech processing systems (Bouquin-Jeannes and Faucon, 1995). Most of the approaches have focussed on the development of robust algorithms with special attention being paid to the derivation and study of noise robust features and decision rules (Woo et al., 2000; Li et al., 2002; Marzinzik and Kollmeier, 2002). 
The different VAD methods include those based on energy thresholds (Woo et al., 2000), pitch detection (Chengalvarayan, 1999), spectrum analysis (Marzinzik and Kollmeier, 2002), zero-crossing rate (ITU, 1996), periodicity measure (Tucker, 1992), higher order statistics in the LPC residual domain (Nemer et al., 2001) or combinations of different features (ITU, 1993; ETSI, 1999; Tanyer and Ozer, 2000). This chapter shows a comprehensive approximation to the main challenges in voice activity detection, the different solutions that have been reported in a complete review of the state of the art and the evaluation frameworks that are normally used. The application of VADs for speech coding, speech enhancement and robust speech recognition systems is shown and discussed. Three different VAD methods are described and compared to standardized and
TL;DR: This paper presents a revised version of an implementation of the Momel and INTSINT algorithms for the automatic modelling and symbolic coding of intonation patterns which are seamlessly integrated into the Praat speech manipulation software by means of the recently proposed plugin facility for Praat.
Abstract: This paper presents a revised version of an implementation of the Momel and INTSINT algorithms for the automatic modelling and symbolic coding of intonation patterns. The algorithms are implemented as external functions which are seamlessly integrated into the Praat speech manipulation software by means of the recently proposed plugin facility for Praat. Pitch detection is carried out using a subroutine to calculate optimal values of maximum and minimum F0 automatically. The implementation of the Momel algorithm incorporates an improved treatment of the modelling of pitch contours in the vicinity of onsets and offsets of voicing. The version of the INTSINT algorithm implemented is the two parameter robust version described in recent publications.
TL;DR: This paper is a first initiative to perform an evaluation of widely used PDA algorithms over an extensive and realistic database and proves the good performance of the described algorithm in noisy conditions.
Abstract: A novel algorithm based on classical cepstrum calculation followed by dynamic programming is presented in this paper. The algorithm has been evaluated with a 60-minute database containing 60 speakers and different recording conditions and environments. A second reference database has also been used. In addition, the performance of four popular PDA algorithms has been evaluated with the same databases. The results show the good performance of the described algorithm in noisy conditions. Furthermore, the paper is a first initiative to perform an evaluation of widely used PDA algorithms over an extensive and realistic database.
TL;DR: The results of the experiments show that the method presented leads to a higher accuracy of the estimate of the pitch than state-of-the-art methods.
Abstract: An accurate estimation of the pitch is essential for many speech processing applications, such as speech synthesis, speech coding, and speech enhancement. A widely used assumption in most common pitch estimation methods is that pitch is constant over a segment of short duration. This assumption does not apply in reality and leads to inaccurate pitch estimates. In this paper, we present a method for continuous pitch estimation that is able to track fast changes. In the presented framework, the pitch is modeled by a B-spline expansion and optimized in a multistage procedure for increased robustness. The performance of the continuous optimization procedure is compared to state-of-the-art pitch estimation methods and is evaluated both for artificial speech-like signals with known pitch, and for real speech signals. The results of the experiments show that our method leads to a higher accuracy of the estimate of the pitch than state-of-the-art methods.
TL;DR: A new smoothing algorithm for detected pitch contours is presented and the effectiveness of the proposed method was shown using the HUB4-NE natural speech corpus.
Abstract: Chinese is known as a syllabic and tonal language and tone recognition plays an important role and provides very strong discriminative information for Chinese speech recognition [1]. Usually, the tone classification is based on the F0 (fundamental frequency) contours [2]. It is possible to infer a speaker's gender, age and emotion from his/her pitch, regardless of what is said. Meanwhile, the same sequence of words can convey very different meanings with variations in intonation. However, accurate pitch detection is difficult partly because tracked pitch contours are not ideal smooth curves. In this paper, we present a new smoothing algorithm for detected pitch contours. The effectiveness of the proposed method was shown using the HUB4-NE [3] natural speech corpus.
TL;DR: A hidden Markov model (HMM) based system to detect the pitch of an instrument in polyphonic music using an instrument tone model and a hypothesis selection method to choose pitch hypotheses with sufficiently high salience as pitch candidates is proposed.
Abstract: We propose a hidden Markov model (HMM) based system to detect the pitch of an instrument in polyphonic music using an instrument tone model. Our system calculates at every time frame the salience of a pitch hypothesis based on the magnitudes of harmonics associated with the hypothesis. A hypothesis selection method is introduced to choose pitch hypotheses with sufficiently high salience as pitch candidates. Then the system applies an instrument model to evaluate the likelihood of each candidate. The transition probability between successive pitch points is constructed using the prior knowledge of the musical key of the input. Finally an HMM integrates the instrument likelihood and the pitch transition probability. Quantitative evaluation shows the proposed system performs well for different instruments. We also compare a Gaussian mixture model and kernel density estimation for instrument modeling, and find that kernel density estimation gives better overall performance while the Gaussian mixture model is more robust.
TL;DR: Several established pitch detection algorithms (PDAs) are compared to verify their adequacy; gross pitch error increased for pathological voices, with an especially large increase for the PDA based on nonlinear time series.
Abstract: Robust pitch estimation is important in many areas of speech processing. In voice pathology, diverse statistics extracted from pitch estimation are commonly used to test voice quality. In this study, we compared several established pitch detection algorithms (PDAs) to verify their adequacy. Using a database of 99 pathological voices and 30 normal voices, pitch detection errors were analyzed between pathological and normal voices, and among the types of pathological voices. The pitch errors of all PDAs used in this study showed some change between pathological and normal voices. In particular, gross pitch error increased for pathological voices, with an especially large increase for the PDA based on nonlinear time series. In an analysis of types of pathological voices classified by aperiodicity and the degree of chaos, the more aperiodic and chaotic the voice, the more the pitch errors increased. Consequently, the severity of the tested voice needs to be assessed in order to obtain accurate pitch estimates.
TL;DR: In this paper, pitch is tracked for individual samples, which are taken much more frequently than an analysis frame, and speech is identified based on the tracked pitch and the speech components of the signal are removed with a time-varying filter, leaving only an estimate of a time varying speech signal.
Abstract: Pitch is tracked for individual samples, which are taken much more frequently than an analysis frame. Speech is identified based on the tracked pitch and the speech components of the signal are removed with a time-varying filter, leaving only an estimate of a time-varying speech signal. This estimate is then used to generate a time-varying noise model which, in turn, can be used to enhance speech related systems.
TL;DR: It has been ascertained that the overall algorithm simulated using the Keele reference database is able to outperform some of the existing methods and is well suited for a wide range of signal-to-noise ratios (SNRs) down to -10 dB.
Abstract: In this paper, we present a robust pitch estimation algorithm for noise-degraded speech. We propose a new circular average magnitude sum function (CAMSF) and a pseudo normalized correlation function (PNCF), both of which exhibit periodicity at the pitch period of voiced speech. Exploiting the fact that the CAMSF produces a peak while the PNCF shows a notch, an integrated time-domain function (ITDF) is developed to enhance the pitch-harmonic notches in the presence of noise. Moreover, a frequency-frame relative smoothed noisy spectrum that acts as a harmonic spectral structure enhancer is utilized to accurately acquire a pitch-harmonic (PH) from noisy speech. We argue that, employing the PH, pitch information can be effectively extracted through a variable-period impulse train in conjunction with the proposed ITDF. It has been ascertained that the overall algorithm simulated using the Keele reference database is able to outperform some of the existing methods and is well suited for a wide range of signal-to-noise ratios (SNRs) down to -10 dB.
TL;DR: In this article, a modified phase-opponency (MPO) model is proposed for single-channel speech enhancement when the speech is corrupted by additive noise, which is based on the auditory PO model, proposed for detecting tones in noise.
Abstract: In this paper we present a model called the Modified Phase-Opponency (MPO) model for single-channel speech enhancement when the speech is corrupted by additive noise. The MPO model is based on the auditory PO model, proposed for detection of tones in noise. The PO model includes a physiologically realistic mechanism for processing the information in neural discharge times and exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery by using a cross-auditory-nerve-fiber coincidence detection for extracting temporal cues. The MPO model alters the components of the PO model such that the basic functionality of the PO model is maintained but the properties of the model can be analyzed and modified independently. The MPO-based speech enhancement scheme does not need to estimate the noise characteristics nor does it assume that the noise satisfies any statistical model. The MPO technique leads to the lowest value of the LPC-based objective measures and the highest value of the perceptual evaluation of speech quality measure compared to other methods when the speech signals are corrupted by fluctuating noise. Combining the MPO speech enhancement technique with our aperiodicity, periodicity, and pitch detector further improves its performance.
TL;DR: An algorithm for estimating the fundamental frequency in speech signals using non-negative matrix factorization and statistics of the succession of voiced segments to aggregate partial contours to the final contour of an utterance is presented.
Abstract: We present an algorithm for estimating the fundamental frequency in speech signals. Our approach incorporates models of voiced speech on three levels. First, we estimate the pitch for each time frame based on its harmonic structure using non-negative matrix factorization. The second level utilizes temporal pitch continuity to extract partial pitch contours. Thirdly, we incorporate statistics of the succession of voiced segments to aggregate partial contours to the final contour of an utterance. We evaluate our approach on the Keele database. The experimental results show the robustness of our method for noisy speech, and the good performance for clean speech in comparison with state-of-the-art algorithms.
TL;DR: In this article, a smoothing algorithm is applied to the chord sequence which optimizes the number of chord changes and thus takes into consideration the comparatively stable nature of chords, and the chords themselves are identified by comparing generated and reference pitch class profiles.
Abstract: We introduce an algorithm that analyzes audio signals to extract chord-sequence information. The main goal of our approach lies in incorporating music theoretical knowledge without restricting the input data to a narrow range of musical styles. At the basis of our approach lies pitch detection using enhanced autocorrelation, supported by key detection and beat tracking. The chords themselves are identified by comparing generated and reference pitch class profiles. A smoothing algorithm is applied to the chord sequence which optimizes the number of chord changes and thus takes into consideration the comparatively stable nature of chords. In this paper we present an evaluation performed on a large set of 35 pieces of diverse music showing an average performance of 65% accuracy.
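The chord identification step in this abstract (comparing a generated pitch class profile against reference profiles) can be illustrated with a minimal sketch. The binary major/minor triad templates, the dot-product score, and the names `reference_profiles` and `identify_chord` are assumptions made for illustration; the abstract does not specify the reference profiles or the comparison measure.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Assumed reference shapes: semitone offsets of major and minor triads.
TRIADS = {"major": (0, 4, 7), "minor": (0, 3, 7)}


def reference_profiles():
    """Build 24 binary 12-bin reference profiles (12 roots x 2 qualities)."""
    profiles = {}
    for root in range(12):
        for quality, intervals in TRIADS.items():
            vec = [0.0] * 12
            for interval in intervals:
                vec[(root + interval) % 12] = 1.0
            profiles[f"{NOTE_NAMES[root]} {quality}"] = vec
    return profiles


def identify_chord(pcp):
    """Return the label of the reference profile that best matches `pcp`.

    `pcp` is a 12-element list of pitch class energies (index 0 = C).
    All templates have the same norm, so a plain dot product ranks them
    the same way a cosine similarity would.
    """
    refs = reference_profiles()
    return max(refs, key=lambda label: sum(p * r for p, r in zip(pcp, refs[label])))


# Example: energy concentrated on C, E and G, with a little leakage on A.
pcp = [0.0] * 12
pcp[0], pcp[4], pcp[7], pcp[9] = 1.0, 0.8, 0.9, 0.2
print(identify_chord(pcp))  # C major
```

A frame-wise label sequence produced this way would still be noisy, which is why the abstract follows it with a smoothing stage over the chord sequence.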
TL;DR: An algorithm is introduced that analyzes audio signals to extract chord-sequence information using enhanced autocorrelation; a smoothing algorithm applied to the chord sequence optimizes the number of chord changes and thus takes into consideration the comparatively stable nature of chords.
Abstract: We introduce an algorithm that analyzes audio signals to extract chord-sequence information. The main goal of our approach lies in incorporating music theoretical knowledge without restricting the input data to a narrow range of musical styles. At the basis of our approach lies pitch detection using enhanced autocorrelation, supported by key detection and beat tracking. The chords themselves are identified by comparing generated and reference Pitch Class Profiles. A smoothing algorithm is applied to the chord sequence which optimizes the number of chord changes and thus takes into consideration the comparatively stable nature of chords. In this paper we present an evaluation performed on a large set of 35 pieces of diverse music showing an average performance of 65% accuracy. Index Terms: Acoustic signal analysis, Audio systems, Music
TL;DR: The experimental results show that the EMD based algorithm performs better in pitch estimation of noisy speech.
Abstract: This paper presents a pitch estimation method of noisy speech signal using empirical mode decomposition (EMD). The normalized autocorrelation function (NACF) of the noisy speech signal is decomposed into a finite set of band-limited signals termed as intrinsic mode functions (IMFs) using EMD. The periodicity of one IMF is supposed to be equal to the accurate pitch period. A conventional autocorrelation based pitch period detection method is used to select the IMF with pitch period. The accurate pitch period is obtained from the selected IMF. The pitch estimation performance in term of gross pitch error (GPE) of the proposed algorithm is compared with recently proposed methods. The experimental results show that the EMD based algorithm performs better in pitch estimation of noisy speech. Index Terms: empirical mode decomposition, pitch estimation, normalized autocorrelation.
TL;DR: In this article, a pitch detector was used to extract pitch information from a frame of an input speech signal and a pitch candidate value selector for selecting one or more pitch candidate values from the predicted pitch information according to a predetermined condition.
Abstract: An apparatus and method for extracting pitch information from a speech signal. The apparatus includes a pilot pitch detector for extracting predicted pitch information from a frame of an input speech signal, a pitch candidate value selector for selecting one or more pitch candidate values from the predicted pitch information according to a predetermined condition, a harmonic-noise region decomposer for decomposing a harmonic-noise region using each of the selected pitch candidate values, a harmonic-noise energy ratio calculator for calculating an energy ratio of each of the decomposed harmonic regions to each of the decomposed noise regions, and a pitch information selector for selecting a pitch candidate value of a harmonic-noise region in which the maximum value among the calculated harmonic-noise energy ratio exists as a pitch value of the input frame of the speech signal.
TL;DR: In this article, f0 is calculated for forensic speaker identification by a two-pass method using summation of multiple harmonics in the spectral field; the method is adapted for speech signals of very low quality and has also yielded good results for analogue and digital telephone channels.
Abstract: Analysis of pitch, or rather its measurable substrate fundamental frequency (f0), is a generally accepted component of speaker identification decisions within both automatic and non-automatic speaker recognition. Because the investigation of speech must be comprehensive and because pitch reflects important properties of the human voice, its analysis should also be an obligatory part of the whole investigation. Classic texts on forensic speaker identification use f0 both in automatic and non-automatic speaker identification [5,8,12,13,15,16]. However, the majority of suggestions are for general, relatively simple statistical parameters for pitch curves such as average, range and variation of f0 values [1-4, 6, 9-11]. Our approach is based on many years of experience with f0 data for forensic applications. An algorithm for f0 detection and statistical analysis has been developed and implemented into standard STC software SIS v.6.1.2 or later. F0 is calculated by our two-pass method using summation of multiple harmonics in the spectral field. This method is adapted for speech signals of very low quality and has also yielded good results for analogue and digital telephone channels. For forensic work the analyst should select speech produced with a similar emotional intention. Pitch detection quality is controlled by superposition of the calculated pitch curve on the cepstrogram (i.e. signal periodicity degree function [7]) of the processed signal by means of the SIS software. The analyst can make manual adjustments to the automatically calculated curve in order to remove errors. For speech signals of standard telephone quality, mistakes in pitch detection are infrequent. Values of pitch are transformed to a logarithmic scale, and then statistical pitch features are calculated.
The typical set of statistical parameters measured is: average, maximum, minimum, maximum -3%*, minimum +1%, median, percent of areas with rising f0*, f0 logarithm variation*, f0 logarithm distribution asymmetry*, f0 logarithm distribution excess, average velocity of f0 change, f0 logarithm variation derivative, f0 logarithm derivative distribution asymmetry, f0 logarithm derivative distribution excess, average velocity of f0 rise* and average velocity of f0 fall*. The asterisk indicates the more heavily weighted statistical features. A speaker identification algorithm was developed and trained using the STC corpus [14]. This RUSTEN speech corpus includes analogue telephone speech. The corpus contains dialogues of 126 speakers (67 women and 59 men) in 5 sessions using 5 different analog phone lines, plus about 1000 files of digital phone channel conversations of 130 speakers. The deviation of every statistical parameter was calculated for every file pair from the corpus. On the basis of these results the distributions of the deviations for "same-different" and "same-same" pairs were built, and the false acceptance (FA) and false rejection (FR) functions and the equal error rate (EER) were calculated for every statistical parameter. In order to make general identification decisions, a common metric was constructed as a weighted sum of separate statistical parameters. The weights were selected to minimize the EER for a given speech database. For the common weighted metric and the common identification metric, the deviation distributions for "same-same" and "same-different" pairs, the FR and FA curves, and the EER were calculated. Tables 1 and 2 show the results of the investigation of the method using the STC speech database. In both cases the test database includes about 1600 speech files of 256 speakers using both analog and digital channels.
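The FA/FR/EER computation described above can be sketched as follows. This is only an illustration under stated assumptions: a pair is accepted as "same speaker" when its deviation is at or below a threshold, thresholds are swept over the observed deviations, and the EER is taken where FA and FR are closest. The function name `equal_error_rate` and the toy deviations are hypothetical and do not come from the STC corpus.

```python
def equal_error_rate(same_devs, diff_devs):
    """Estimate the EER from per-pair deviations of one statistical parameter.

    same_devs: deviations for "same-same" pairs (same speaker).
    diff_devs: deviations for "same-different" pairs (different speakers).
    Returns (eer, threshold).
    """
    best = None
    for thr in sorted(set(same_devs) | set(diff_devs)):
        # False rejection: a same-speaker pair whose deviation exceeds thr.
        fr = sum(d > thr for d in same_devs) / len(same_devs)
        # False acceptance: a different-speaker pair at or below thr.
        fa = sum(d <= thr for d in diff_devs) / len(diff_devs)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2, thr)
    return best[1], best[2]


# Toy data, for illustration only.
same = [0.1, 0.2, 0.3, 0.6]
diff = [0.4, 0.5, 0.7, 0.8]
eer, thr = equal_error_rate(same, diff)
print(eer, thr)  # 0.25 0.4
```

The same function could be applied to a weighted sum of several parameter deviations, which is how a common metric of the kind described above would be scored.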
TL;DR: This paper presents an alternative, time-series based framework for modeling the voicing structure of speech called the fine pitch model, which can more accurately account for the content in a voiced speech segment.
Abstract: An accurate model for the structure of speech is essential to many speech processing applications, including speech enhancement, synthesis, recognition, and coding. This paper explores some deficiencies of standard harmonic methods of modeling voiced speech. In particular, they ignore the effect of fundamental frequency changing within an analysis frame, and the fact that the fundamental frequency is not a continuously varying parameter, but a side effect of a series of discrete events. We present an alternative, time-series based framework for modeling the voicing structure of speech called the fine pitch model. By precisely modeling the voicing structure, it can more accurately account for the content in a voiced speech segment.
TL;DR: General rules are formulated to guide the choice of cost functions used in DP-based post-processing for the particular problem of pitch tracking of the singing voice in the presence of Indian percussive interference.
Abstract: The problem of pitch tracking of the singing voice in the presence of Indian percussive interference, specifically the tabla, is considered. To overcome the problems due to this particular type of interference, a pitch tracker is used that applies dynamic programming (DP) based smoothing on pitch estimates obtained from a spectral-domain pitch detection algorithm (PDA) that uses harmonic matching. Experiments on real and simulated signals show the superiority of the spectral domain PDA over a correlation domain PDA in terms of pitch detection accuracy and suitability of the PDA output for post-processing. A new smoothing cost function is proposed and evaluated. The paper formulates general rules guiding the choice of cost functions participating in the DP based post-processing for this particular problem.
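As a rough illustration of DP-based smoothing over per-frame pitch candidates, the sketch below picks the candidate path that minimizes the sum of local costs and pitch-jump penalties. The transition cost used here (absolute jump in octaves, scaled by `jump_weight`) is an assumption chosen for simplicity; it is not the smoothing cost function proposed in the paper, and the name `dp_smooth` is hypothetical.

```python
import math


def dp_smooth(frames, jump_weight=1.0):
    """Choose one pitch per frame by dynamic programming (Viterbi-style).

    frames: list of frames; each frame is a list of (freq_hz, local_cost)
            candidates, where a lower local_cost means a better PDA match.
    Returns the list of selected frequencies, one per frame.
    """
    best = [cost for _, cost in frames[0]]  # best path cost ending at each candidate
    back = []  # back[t-1][k]: best predecessor index for candidate k of frame t
    for t in range(1, len(frames)):
        prev = frames[t - 1]
        cur_best, cur_back = [], []
        for freq, local in frames[t]:
            # Cost of reaching this candidate from each previous candidate.
            costs = [
                best[j] + jump_weight * abs(math.log2(freq / prev_freq))
                for j, (prev_freq, _) in enumerate(prev)
            ]
            j = min(range(len(costs)), key=costs.__getitem__)
            cur_best.append(costs[j] + local)
            cur_back.append(j)
        best = cur_best
        back.append(cur_back)
    # Trace the cheapest path backwards from the last frame.
    i = min(range(len(best)), key=best.__getitem__)
    path = [i]
    for pointers in reversed(back):
        i = pointers[i]
        path.append(i)
    path.reverse()
    return [frames[t][k][0] for t, k in enumerate(path)]


# The middle frame's locally best candidate (100 Hz) is an octave error.
frames = [
    [(200.0, 0.1), (100.0, 0.5)],
    [(100.0, 0.2), (200.0, 0.3)],
    [(200.0, 0.1), (400.0, 0.6)],
]
print(dp_smooth(frames))  # [200.0, 200.0, 200.0]
```

How strongly jumps are penalized relative to the local cost is exactly the kind of choice the paper's rules for cost functions are meant to guide.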
TL;DR: A hybrid method for pitch marking that combines the advantages of the Electroglottograph (EGG) and the speech signals is proposed that provides better performance than PMA based on EGG signal or speech signal.
Abstract: Pitch marking is very significant in speech signal processing. In a text-to-speech (TTS) system based on the Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA) method, robust estimation of pitch marks (PM) is especially important to the modification of the time and pitch scale of a speech signal in order to match it to that of the target speaker. The aim of this paper is to improve the accuracy of automatic Pitch Mark Algorithms (PMA). Therefore, we propose a hybrid method for pitch marking that combines the advantages of the Electroglottograph (EGG) and the speech signals. We evaluate this hybrid algorithm for pitch marking against pitch mark algorithm used by Praat program [1]. The results of the evaluation indicate that the suggested method provides better performance than PMA based on EGG signal or speech signal.
TL;DR: This work focuses on detecting irregular phonation without assuming prior knowledge of voiced regions of speech, and improves the pitch estimation accuracy of a current pitch tracking algorithm in regions of irregular phonations, where most pitch trackers fail to perform well.
Abstract: The problem addressed here is that of detecting irregular phonation during conversational speech. While most published work tackles this problem only by focusing on the voiced regions of speech, we focus on detecting irregular phonation without assuming prior knowledge of voiced regions. In addition, we improve the pitch estimation accuracy of a current pitch tracking algorithm in regions of irregular phonation, where most pitch trackers fail to perform well. The algorithm has been tested on the TIMIT and NIST 98 databases. The detection rate for the TIMIT database is 91.8% (17.42% false detections). The detection rate for the NIST 98 database is 91.5% (12.8% false detections). The pitch detection accuracy increased from 95.4% to 98.3% for the TIMIT database, and from 94.8% to 97.4% for the NIST 98 database.
TL;DR: In this paper, a dominant melody separation method based on spectral clustering of sinusoidal peaks is used for adaptive harmonization and pitch correction in mono polyphonic audio mixtures.
Abstract: There are several well known harmonization and pitch correction techniques that can be applied to monophonic sound sources. They are based on automatic pitch detection and frequency shifting without time stretching. In many applications it is desired to apply such effects on the dominant melodic instrument of a polyphonic audio mixture. However, applying them directly to the mixture results in artifacts, and automatic pitch detection becomes unreliable. In this paper we describe how a dominant melody separation method based on spectral clustering of sinusoidal peaks can be used for adaptive harmonization and pitch correction in mono polyphonic audio mixtures. Motivating examples from a violin tutoring perspective as well as modifying the saxophone melody of an old jazz mono recording are presented.
TL;DR: A pitch wave signal creation method is provided as a preliminary process for efficiently coding a speech wave signal having a fluctuated pitch period, allowing such a signal to be compressed in high quality and high efficiency.
Abstract: A pitch wave signal creation method as a preliminary process for efficiently coding a speech wave signal having a fluctuated pitch period is provided. A speech signal compressing/expanding apparatus and a speech signal synthesizing apparatus using the method, and a signal processing associated therewith are further provided. The pitch wave creation method of the invention is essentially comprised of a method of detecting the instantaneous pitch period of each pitch wave element of the speech wave signal, and a process of converting a corresponding pitch wave element into a normalized pitch wave element having a predetermined fixed time length by expanding and compressing the pitch wave element on a time axis while retaining its wave pattern based on the each detected instantaneous pitch period. The speech signal having a pitch fluctuation can be compressed in high quality and high efficiency by coding or synthesizing the speech wave signal using the pitch wave signal creation method of the invention. Text-to-speech conversion using pitch wave signals.
TL;DR: A tuning device includes an input terminal configured to receive an input electrical signal, a pitch detector to detect the pitch of that signal, a pitch comparator that compares it with a manually or automatically designated standard pitch, and a display device that displays the results of the comparison.
Abstract: A tuning device includes an input terminal configured to receive an input electrical signal and a pitch detector to detect a pitch of the input electrical signal. A manual pitch designator designates a standard pitch from pitches of a scale. An automatic pitch designator designates a standard pitch from a scale that is closest to the pitch of the input electrical signal. A mode selector selects either a manual mode where the standard pitch is designated by the manual pitch designator, or an auto mode where the standard pitch is designated by the automatic pitch designator. A pitch comparator compares the pitch of the input electrical signal and the standard pitch and a display device displays the results of the comparison. The display device is configured such that when the standard pitch is designated by the automatic pitch designator, the standard pitch is not displayed.
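The automatic pitch designator and the pitch comparator described here can be sketched in a few lines. The sketch assumes a 12-tone equal-tempered scale with A4 = 440 Hz and reports the deviation in cents; the abstract does not fix the scale or the unit, and the names `nearest_note` and `NOTE_NAMES` are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_note(freq_hz, a4_hz=440.0):
    """Designate the closest scale pitch and compare the input against it.

    Returns (note_name, octave, cents), where cents is the signed deviation
    of freq_hz from the designated standard pitch (positive means sharp).
    """
    # Fractional note number on the MIDI convention, where 69 is A4.
    note_float = 69 + 12 * math.log2(freq_hz / a4_hz)
    note = round(note_float)              # auto mode: closest standard pitch
    cents = 100 * (note_float - note)     # comparator output
    return NOTE_NAMES[note % 12], note // 12 - 1, cents


name, octave, cents = nearest_note(450.0)
print(name, octave, round(cents, 1))  # A 4 38.9
```

In a manual mode, the rounding step would be replaced by a user-selected note number and only the comparison would be computed.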
TL;DR: A powerful pitch estimation algorithm called SWIPE is shown to outperform existing algorithms on several publicly available speech and musical instrument databases, and a disordered speech database, reducing the gross error rate by 40%, relative to the best competing algorithm.
Abstract: A powerful pitch estimation algorithm called SWIPE has been developed for processing speech and music. SWIPE is shown to outperform existing algorithms on several publicly available speech and musical instrument databases, and a disordered speech database, reducing the gross error rate by 40%, relative to the best competing algorithm. In short, SWIPE estimates the pitch as the fundamental frequency of a sawtooth waveform, whose spectrum best matches the spectrum of the input signal. The short‐time Fourier transform of the sawtooth waveform provides an extension to older frequency‐based, sieve‐type estimation algorithms by providing smooth peaks with decaying amplitudes to correlate with the fundamental frequency (if present) and its harmonics. An improvement on the algorithm is achieved by using only the first and prime harmonics, which significantly reduces subharmonic errors commonly found in other pitch estimation algorithms.
TL;DR: In this paper, a noise reduction approach based on a modified power spectral subtraction scheme is used to enhance the pre-processed speech prior to pitch estimation, for pitch detection of speech severely degraded by white noise.
Abstract: A new method for the pitch detection of speech severely degraded by a white noise is presented in this paper. We intend to incorporate a noise reduction approach based on a modified power spectral subtraction scheme to enhance the pre-processed speech prior to pitch estimation. The de-noised speech is then passed through an inverse filter, whose parameters are derived from the linear prediction (LP) analysis, yielding an output referred to as the LP residual. Since the LP residual is capable of delivering the knowledge of glottal closure events, it is utilized to propose a new average magnitude sum function (AMSF) and an average magnitude difference function (AMDF) both of which exhibit the periodicity at the pitch period. Exploiting the property that the AMDF shows a notch while the AMSF produces a peak, the AMDF is weighted by the reciprocal of the AMSF to reinforce the pitch-harmonic-notches in a heavy noise. Simulation results using the Keele database guarantee a superior pitch detection efficacy of the proposed approach for a white noise-corrupted speech compared to some of the existing methods at a very low signal-to-noise ratio (SNR).
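The weighting idea in this abstract (an AMDF notch reinforced by dividing by the AMSF peak) can be illustrated with a simplified sketch. It is not the paper's method: it works directly on the signal instead of the LP residual, omits the spectral subtraction stage, and uses the plain definitions AMDF(τ) = mean |x[n] - x[n+τ]| and AMSF(τ) = mean |x[n] + x[n+τ]|, which are assumptions. The function name, search range and `eps` guard are also hypothetical.

```python
import math


def weighted_amdf_pitch(x, fs, fmin=60.0, fmax=400.0, eps=1e-12):
    """Estimate pitch (Hz) as the lag minimizing AMDF(lag) / AMSF(lag)."""
    min_lag = int(fs / fmax)
    max_lag = int(fs / fmin)
    n = len(x) - max_lag  # number of sample pairs available at every lag
    if n <= 0:
        raise ValueError("frame too short for the requested fmin")
    best_lag, best_val = min_lag, float("inf")
    for lag in range(min_lag, max_lag + 1):
        amdf = sum(abs(x[i] - x[i + lag]) for i in range(n)) / n
        amsf = sum(abs(x[i] + x[i + lag]) for i in range(n)) / n
        # At the pitch period the AMDF dips and the AMSF peaks, so the
        # ratio deepens the notch; eps guards against division by zero.
        val = amdf / (amsf + eps)
        if val < best_val:
            best_lag, best_val = lag, val
    return fs / best_lag


# Synthetic voiced-like frame: 100 Hz fundamental plus two harmonics.
fs = 8000
x = [
    math.sin(2 * math.pi * 100 * i / fs)
    + 0.5 * math.sin(2 * math.pi * 200 * i / fs)
    + 0.3 * math.sin(2 * math.pi * 300 * i / fs)
    for i in range(400)
]
print(weighted_amdf_pitch(x, fs))  # 100.0
```

On clean periodic input the plain AMDF would find the same lag; the weighting is intended to matter when noise makes the notch shallow.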
TL;DR: This paper combines the LPC-based cepstrum and the harmonic product spectrum (HPS) to detect pitch in telephone speech, where the low-frequency band carrying much of the pitch information has been cut off and the fundamental frequency is strongly attenuated.
Abstract: A novel method of pitch detection, combining the LPC-based cepstrum and the Harmonic Product Spectrum (HPS), is proposed. We use the linear prediction residual to eliminate the vocal tract information and improve accuracy. In the real world, when the speech signal is transmitted through the telephone system, the low-frequency band, which carries much of the pitch information, is cut off, and this can significantly attenuate the fundamental pitch frequency. In this paper, we combine the LPC-based cepstrum and HPS to deal with this problem and with pitch errors. Experiments indicate that this method is very effective and valuable in applications, since it robustly handles noise and pitch errors.
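As an illustration of the HPS half of this combination only, the sketch below multiplies the magnitude spectrum at integer multiples of each candidate bin and picks the candidate with the largest product. The LPC-based cepstrum and the way the paper fuses the two estimates are not shown. The plain DFT, the Hann window, the number of harmonics and all function names are assumptions made for the example.

```python
import math


def dft_magnitudes(x, num_bins):
    """Magnitude of the first num_bins DFT bins (direct, unoptimized DFT)."""
    n = len(x)
    mags = []
    for k in range(num_bins):
        re = sum(x[i] * math.cos(2.0 * math.pi * k * i / n) for i in range(n))
        im = sum(x[i] * math.sin(2.0 * math.pi * k * i / n) for i in range(n))
        mags.append(math.hypot(re, im))
    return mags


def hps_pitch(x, fs, num_harmonics=4, f_min=60.0, f_max=400.0):
    """Return a pitch estimate in Hz using the harmonic product spectrum."""
    n = len(x)
    # Periodic Hann window to limit spectral leakage.
    windowed = [x[i] * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n)) for i in range(n)]
    k_min = max(1, math.ceil(f_min * n / fs))
    # Keep the highest harmonic of every candidate below the Nyquist bin.
    k_max = min(math.floor(f_max * n / fs), (n // 2) // num_harmonics)
    mags = dft_magnitudes(windowed, num_harmonics * k_max + 1)

    best_k, best_product = k_min, -1.0
    for k in range(k_min, k_max + 1):
        product = 1.0
        for r in range(1, num_harmonics + 1):
            product *= mags[r * k]  # spectrum compressed by factor r
        if product > best_product:
            best_k, best_product = k, product
    return best_k * fs / n
```

Because the product includes the fundamental bin itself, a fully removed fundamental drives the product towards zero at the true pitch. The sketch therefore only copes with an attenuated fundamental, which is the situation the abstract describes.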
TL;DR: In this paper, a speech synthesis device includes a pitch cycle correction unit (40) which extracts a fluctuation component of the pitch cycle of the original speech waveform, obtained from the original speech waveform information storage unit (25) in order to generate the synthesized speech, and which corrects, based on the extracted fluctuation component, the pitch cycle of the synthesized speech obtained by analyzing the input text sentence.
Abstract: Even when a pitch cycle has a large fluctuation and the pitch cycle string changes abruptly, it is possible to suppress the effect of the pitch cycle fluctuation and generate high-quality synthesized speech. A speech synthesis device generates synthesized speech corresponding to an input text sentence according to an original speech waveform stored in an original speech waveform information storage unit (25). The speech synthesis device includes a pitch cycle correction unit (40) which extracts a fluctuation component of the pitch cycle of the original speech waveform, obtained from the original speech waveform information storage unit (25) in order to generate the synthesized speech, and which corrects, based on the extracted fluctuation component, the pitch cycle of the synthesized speech obtained by analyzing the input text sentence. The pitch cycle correction unit (40) connects the pitch cycle waveforms of the original speech waveform at the pitch cycle of the corrected synthesized speech.
TL;DR: In this article, the pitch period estimation method used in the system is a hybrid method based on the YIN fundamental frequency estimation algorithm and a method for fundamental frequency detection on the magnitude of the speech signal.
Abstract: Pitch period estimation (also called fundamental frequency estimation) is widely needed in speech processing for many purposes. In our system for prosodic modification of speech, pitch period estimation is used as the basis for frame length detection. The pitch period estimation method used in the system is a hybrid method based on the YIN fundamental frequency estimation algorithm and a method for fundamental frequency detection on the magnitude of the speech signal. The experiments show that the method is useful in the sinusoidal modeling domain as well as in other domains.
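For orientation, the core of the YIN algorithm that the hybrid method builds on can be sketched as follows: a squared difference function, its cumulative mean normalization, and an absolute threshold followed by a walk to the nearest local minimum. This is a simplified sketch, not the authors' hybrid method. The magnitude-based detector is not described in the abstract and is omitted, as is YIN's parabolic interpolation step. The parameter defaults are illustrative assumptions.

```python
def yin_pitch(x, fs, window=400, f_min=40.0, f_max=400.0, threshold=0.1):
    """Return a pitch estimate in Hz using the core steps of YIN."""
    tau_min = int(fs / f_max)
    tau_max = int(fs / f_min)
    if len(x) < window + tau_max:
        raise ValueError("signal too short for the requested window and f_min")

    # Step 1: difference function d(tau) over a fixed integration window.
    d = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        d[tau] = sum((x[j] - x[j + tau]) ** 2 for j in range(window))

    # Step 2: cumulative mean normalized difference, with d'(0) = 1.
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += d[tau]
        cmnd[tau] = d[tau] * tau / running if running > 0.0 else 1.0

    # Step 3: first lag below the threshold, then descend to the local minimum.
    for tau in range(tau_min, tau_max + 1):
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return fs / tau

    # Fallback when no lag passes the threshold: global minimum.
    best = min(range(tau_min, tau_max + 1), key=lambda t: cmnd[t])
    return fs / best
```

In a prosodic modification system of the kind described, the returned period (`fs` divided by the estimate) would set the analysis frame length. That coupling is outside the sketch.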
TL;DR: In this paper, an effective pitch period detection method combining the Teager energy operator (TEO) with a spatial correlation function is proposed, which is more robust and more precise than the classical wavelet-based methods and the autocorrelation function (ACF).
Abstract: An effective pitch period detection method combining the Teager energy operator (TEO) with a spatial correlation function is proposed. First, a voiced region detection (VRD) algorithm based on the lifting wavelet transform and the Teager energy operator is proposed. Then, an algorithm based on the spatial correlation function for estimating the pitch frequency only in voiced regions is presented. Experiments show that the proposed algorithm is more robust and more precise than the classical wavelet-based methods and the autocorrelation function (ACF).
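The discrete Teager energy operator itself is compact enough to show directly. The sketch below uses the standard definition ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The lifting wavelet transform, the voiced region decision and the spatial correlation function of the paper are not reproduced, and the helper `expected_tone_energy` is a hypothetical name added only to state the well-known closed form for a pure tone.

```python
import math


def teager_energy(x):
    """Discrete Teager energy psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output is two samples shorter than the input because the
    operator needs one neighbour on each side.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def expected_tone_energy(amplitude, omega):
    """Teager energy of A*cos(omega*n + phi): A^2 * sin(omega)^2, for any phi."""
    return (amplitude * math.sin(omega)) ** 2
```

Since the output grows with both amplitude and frequency, frame-averaged Teager energy is a plausible feature for separating voiced frames from background noise. The threshold used for that decision would have to come from the paper.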
TL;DR: This Chapter shows three methodologies for VAD: i) statistical likelihood ratio tests (LRTs) formulated in terms of the integrated bispectrum of the noisy signal, ii) hard decision clustering approach where a set of prototypes is used to characterize the noisy channel, and iii) an effective method employing support vector machines.
Abstract: Nowadays, emerging wireless communication applications require increasing levels of performance and speech processing systems that work in adverse noise environments. These systems often benefit from voice activity detectors (VADs), which are used in such application scenarios for different purposes. Speech/non-speech detection is an unsolved problem in speech processing and affects numerous applications including robust speech recognition (Karray & Martin, 2003), (Ramirez et al., 2003), discontinuous transmission (ETSI, 1999), (ITU, 1996), estimation and detection of speech signals (Krasny, 2000), real-time speech transmission on the Internet (Sangwan et al., 2002) or combined noise reduction and echo cancellation schemes in the context of telephony (Basbug et al., 2003). The speech/non-speech classification task is not as trivial as it appears, and most VAD algorithms fail when the level of background noise increases. During the last decade, numerous researchers have developed different strategies for detecting speech in a noisy signal (Sohn et al., 1999), (Cho & Kondoz, 2001) and have evaluated the influence of VAD effectiveness on the performance of speech processing systems (Bouquin-Jeannes & Faucon, 1995) (see also the preceding chapter about VAD). Most of them have focussed on the development of robust algorithms, with special attention to the derivation and study of noise-robust features and decision rules (Woo et al., 2000), (Li et al., 2002), (Marzinzik & Kollmeier, 2002), (Sohn et al., 1999). The different approaches include those based on energy thresholds (Woo et al., 2000), pitch detection (Chengalvarayan, 1999), spectrum analysis (Marzinzik & Kollmeier, 2002), zero-crossing rate (ITU, 1996), periodicity measures (Tucker, 1992) or combinations of different features (ITU, 1996), (ETSI, 1999).
In this Chapter we show three methodologies for VAD: i) statistical likelihood ratio tests (LRTs) formulated in terms of the integrated bispectrum of the noisy signal. The integrated bispectrum is defined as a cross spectrum between the signal and its square, and is therefore a function of a single frequency variable. It inherits the ability of higher order statistics to detect signals in noise, with many additional advantages (Gorriz, 2006a), (Ramirez et al., 2006); ii) a hard decision clustering approach where a set of prototypes is used to characterize the noisy channel. Detecting the presence of speech is enabled by a decision rule formulated in terms of an averaged distance between the observation vector and a cluster-based noise model; and iii) an effective method employing support vector machines.
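The definition of the integrated bispectrum given in i) can be made concrete with a small sketch: a single-block estimate of the cross spectrum between the mean-removed square of the signal and the signal itself. The conjugation convention, the 1/N scaling, the mean removal and the function names are assumptions made for illustration. A practical estimator would average over several blocks, and the likelihood ratio test built on top of it is not shown.

```python
import cmath
import math


def dft(x):
    """Direct DFT of a real sequence (unoptimized, for short blocks)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        for k in range(n)
    ]


def integrated_bispectrum(x):
    """Single-block cross spectrum between y = x^2 - mean(x^2) and x.

    Returns one complex value per frequency bin, so the result is a
    function of a single frequency variable.
    """
    n = len(x)
    mean_x = sum(x) / n
    xc = [v - mean_x for v in x]             # zero-mean signal
    sq = [v * v for v in xc]
    mean_sq = sum(sq) / n
    y = [v - mean_sq for v in sq]            # zero-mean squared signal
    spec_x = dft(xc)
    spec_y = dft(y)
    return [spec_y[k] * spec_x[k].conjugate() / n for k in range(n)]
```

For a single sinusoid the square only contains energy at twice its frequency, so the estimate is close to zero in every bin. For a signal with quadratically coupled components, such as a tone plus its second harmonic, it is clearly non-zero at the fundamental bin.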