TL;DR: An algorithm is presented that estimates a signal from its modified short-time Fourier transform (STFT) by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT; an iterative extension estimates a signal from its modified STFT magnitude.
Abstract: In this paper, we present an algorithm to estimate a signal from its modified short-time Fourier transform (STFT). This algorithm is computationally simple and is obtained by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT. Using this algorithm, we also develop an iterative algorithm to estimate a signal from its modified STFT magnitude. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFT magnitude of the estimated signal and the modified STFT magnitude. The major computation involved in the iterative algorithm is the discrete Fourier transform (DFT) computation, and the algorithm appears to be real-time implementable with current hardware technology. The algorithm developed in this paper has been applied to the time-scale modification of speech. The resulting system generates very high-quality speech, and appears to be better in performance than any existing method.
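A minimal sketch of the iterative magnitude-only estimation described in this abstract, assuming NumPy and a periodic Hann window. The helper names (`stft`, `istft`, `estimate_from_magnitude`), the random initial phase, and the frame layout are illustrative assumptions rather than the paper's own notation. Each pass keeps the target magnitude, takes the phase from the current estimate's STFT, and resynthesizes with a least-squares overlap-add.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed frames of x, one every `hop` samples, transformed with a real FFT."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, win, hop, length):
    """Least-squares overlap-add: sum(win * frame) / sum(win ** 2)."""
    n = len(win)
    frames = np.fft.irfft(spec, n=n, axis=1)
    num = np.zeros(length)
    den = np.zeros(length)
    for j, frame in enumerate(frames):
        i = j * hop
        num[i:i + n] += win * frame
        den[i:i + n] += win ** 2
    # Guard against samples covered only by zero-valued window taps.
    return num / np.maximum(den, 1e-12)

def estimate_from_magnitude(mag, win, hop, length, iters=50, seed=0):
    """Iteratively estimate a signal whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(seed)
    # Assumption: start from random phase; other initializations are possible.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    x = istft(mag * phase, win, hop, length)
    for _ in range(iters):
        spec = stft(x, win, hop)
        phase = np.exp(1j * np.angle(spec))       # keep only the phase
        x = istft(mag * phase, win, hop, length)  # impose the target magnitude
    return x
```

In this sketch `length` must equal `len(win) + hop * (num_frames - 1)` so that analysis and synthesis use the same frame grid. The cost per iteration is dominated by the forward and inverse FFTs, in line with the abstract's remark that the DFT is the major computation.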
TL;DR: A vocoder technique is described in which speech signals are represented by their short-time phase and amplitude spectra, which leads to an economy in transmission bandwidth and to a means for time compression and expansion of speech signals.
Abstract: A vocoder technique is described in which speech signals are represented by their short-time phase and amplitude spectra. A complete transmission system utilizing this approach is simulated on a digital computer. The encoding method leads to an economy in transmission bandwidth and to a means for time compression and expansion of speech signals.
TL;DR: This paper examines the problem of phasiness in the context of time-scale modification and provides new insights into its causes. Two extensions to the standard phase vocoder algorithm are introduced, and the resulting sound quality is shown to be significantly improved.
Abstract: The phase vocoder is a well established tool for time scaling and pitch shifting speech and audio signals via modification of their short-time Fourier transforms (STFTs). In contrast to time-domain time-scaling and pitch-shifting techniques, the phase vocoder is generally considered to yield high quality results, especially for large modification factors and/or polyphonic signals. However, the phase vocoder is also known for introducing a characteristic perceptual artifact, often described as "phasiness", "reverberation", or "loss of presence". This paper examines the problem of phasiness in the context of time-scale modification and provides new insights into its causes. Two extensions to the standard phase vocoder algorithm are introduced, and the resulting sound quality is shown to be significantly improved. Moreover, the modified phase vocoder is shown to provide a factor-of-two decrease in computational cost.
TL;DR: In this paper, the authors developed the theoretical basis for time-scale modification of speech based on short-time Fourier analysis and developed a high quality system for changing the apparent rate of articulation of recorded speech, while at the same time preserving such qualities as naturalness, intelligibility, and speaker-dependent features.
Abstract: This paper develops the theoretical basis for time-scale modification of speech based on short-time Fourier analysis. The goal is the development of a high-quality system for changing the apparent rate of articulation of recorded speech, while at the same time preserving such qualities as naturalness, intelligibility, and speaker-dependent features. The results of the theoretical study were used as the framework for the design of a high-quality speech rate-change system that was simulated on a general-purpose minicomputer.
TL;DR: The fiddle and bonk objects are low tech; the algorithms would be easy to re-code in another language or for other environments from the ones considered here, and the main concern is to get predictable and acceptable behavior using easy-to-understand techniques which won't place an unacceptable computational load on a late-model computer.
Abstract: Two "objects," which run under Max/MSP or Pd, do different kinds of real-time analysis of musical sounds. Fiddle is a monophonic or polyphonic maximum-likelihood pitch detector similar to Rabiner's, which can also be used to obtain a raw list of a signal's sinusoidal components. Bonk does a bounded-Q analysis of an incoming sound to detect onsets of percussion instruments in a way which outperforms the standard envelope following technique. The outputs of both objects appear as Max-style control messages.
1 Tools for real-time audio analysis
The new real-time patchable software synthesizers have finally brought audio signal processing out of the ivory tower and into the homes of working computer musicians. Now audio can be placed at the center of real-time computer music production, and MIDI, which for a decade was the backbone of the electronic music studio, can be relegated to its appropriate role as a low-bandwidth I/O solution for keyboards and other input devices. Many other sources of control "input" can be imagined than are provided by MIDI devices. This paper, for example, explores two possibilities for deriving a control stream from an incoming audio stream. First, the sound might contain quasi-sinusoidal "partials" and we might wish to know their frequencies and amplitudes. In the case that the audio stream comes from a monophonic or polyphonic pitched instrument, we would like to be able to determine the pitch(es) and loudness(es) of the components. It's clear that we'll never have a perfect pitch detector, but the fiddle object described here does fairly well in some cases. For the many sounds which don't lend themselves to sinusoidal decomposition, we can still get useful information from the overall spectral envelope. For instance, rapid changes in the spectral envelope turn out to be a much more reliable indicator of percussive attacks than are changes in the overall power reported by a classical envelope follower.
The bonk object applies a bounded-Q filterbank to an incoming sound and can either output the raw analysis or detect onsets, which can then be compared to a collection of known spectral templates in order to guess which of several possible kinds of attack has occurred. The fiddle and bonk objects are low tech; the algorithms would be easy to re-code in another language or for other environments from the ones considered here. Our main concern is to get predictable and acceptable behavior using easy-to-understand techniques which won't place an unacceptable computational load on a late-model computer. Some effort was taken to make fiddle and bonk available on a variety of platforms. They run under Max/MSP (Macintosh) and Pd (Wintel, SGI, Linux), and fiddle also runs under FTS (available on several platforms). Both are distributed with source code; see http://man104nfs.ucsd.edu/~mpuckett/ for details.
2 Analysis of discrete spectra
Two problems are of interest here: getting the frequencies and amplitudes of the constituent partials of a sound, and then guessing the pitch. Our program follows the ideas of [Noll 69] and [Rabiner 78]. Whereas the earlier pitch~ object reported in [Puckette 95] departs substantially from the earlier approaches, the algorithm used here adheres more closely to them. First we wish to get a list of peaks with their frequencies and amplitudes. The incoming signal is broken into segments of N samples, with N a power of two typically between 256 and 2048. A new analysis is made every N/2 samples. For each analysis the N samples are zero-padded to 2N samples and a rectangular-window DFT is taken. An interesting trick reduces the computation time roughly in half for this setup; see the source code for how this is done.
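A minimal sketch of the framing step just described, assuming NumPy. The function name `analysis_frames` and the default N = 1024 are illustrative choices, and the computation-halving trick mentioned above is not reproduced here.

```python
import numpy as np

def analysis_frames(signal, n=1024):
    """Yield one zero-padded, rectangular-window DFT per hop of n // 2 samples.

    Each segment of n samples is zero-padded to 2n points, so the output
    has 2n bins spaced pi / n radians per sample apart.
    """
    hop = n // 2
    for start in range(0, len(signal) - n + 1, hop):
        segment = signal[start:start + n]  # no taper: rectangular window
        yield np.fft.fft(segment, 2 * n)   # fft zero-pads the input to 2n points
```

Zero-padding by a factor of two interpolates the spectrum: every even-numbered bin of the 2N-point DFT coincides with a bin of the plain N-point DFT of the same segment.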
If we let X[k] denote the zero-padded DFT, we can do a three-point convolution in the frequency domain to get the Hanning-windowed DFT:

$$X_H[k] = \frac{X[k]}{2} - \frac{X[k+2] + X[k-2]}{4}$$

Any of the usual criteria can be applied to identify peaks in this spectrum. We then go back to the nonwindowed spectrum to find the peak frequency using the phase vocoder with hop 1:

$$\omega = \frac{\pi}{N}\left(k + \mathrm{re}\left[\frac{X[k-2] - X[k+2]}{2X[k] - X[k-2] - X[k+2]}\right]\right)$$

This is a special case of a more general formula derived in [Puckette 98]. The amplitude estimate is simply the windowed peak strength at the strongest bin, which because of the zero-padding won't differ by more than about 1 dB from the true peak strength. The phase could be obtained in the same way but we won't bother with that here.
2.1 Guessing fundamental frequencies
Fundamental frequencies are guessed using a scheme somewhat suggestive of the maximum-likelihood estimator. Our "likelihood function" is a non-negative function L(f), where f is frequency. The presence of peaks at or near multiples of f increases L(f) in a way which depends on the peak's amplitude and frequency as shown: