TL;DR: For each of the three clustering techniques, a seven-level Parsons algorithm provided better clustering than the correlation and dynamic time warping algorithms, and was closer to the near-perfect visual categorisations of human judges.
Abstract: Bottlenose dolphins (Tursiops truncatus) produce many vocalisations, including whistles that are unique to the individual producing them. Such “signature whistles” play a role in individual recognition and maintaining group integrity. Previous work has shown that humans can successfully group the spectrographic representations of signature whistles according to the individual dolphins that produced them. However, attempts at using mathematical algorithms to perform a similar task have been less successful. A greater understanding of the encoding of identity information in signature whistles is important for assessing similarity of whistles and thus social influences on the development of these learned calls. We re-examined 400 signature whistles from 20 individual dolphins used in a previous study, and tested the performance of new mathematical algorithms. We compared the measure used in the original study (correlation matrix of evenly sampled frequency measurements) to one used in several previous studies (similarity matrix of time-warped whistles), and to a new algorithm based on the Parsons code, used in music retrieval databases. The Parsons code records the direction of frequency change at each time step, and is effective at capturing human perception of music. We analysed similarity matrices from each of these three techniques, as well as a random control, by unsupervised clustering using three separate techniques: k-means clustering, hierarchical clustering, and an adaptive resonance theory neural network. For each of the three clustering techniques, a seven-level Parsons algorithm provided better clustering than the correlation and dynamic time warping algorithms, and was closer to the near-perfect visual categorisations of human judges. Thus, the Parsons code captures much of the individual identity information present in signature whistles, and may prove useful in studies requiring quantification of whistle similarity.
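As a rough illustration of the idea of recording the direction of frequency change at each time step, the following minimal sketch computes a basic three-symbol Parsons code (up, down, repeat) from an evenly sampled frequency contour. The function name, the `tolerance_hz` parameter, and the example contour are assumptions made for illustration; the abstract does not spell out how the seven levels of the study's algorithm are defined, so they are not reproduced here.

```python
def parsons_code(freqs_hz, tolerance_hz=0.0):
    """Return a contour string: U (up), D (down), R (repeat).

    freqs_hz     -- frequency measurements sampled at even time steps
    tolerance_hz -- hypothetical dead band; smaller changes count as R
    """
    symbols = []
    for prev, curr in zip(freqs_hz, freqs_hz[1:]):
        delta = curr - prev
        if delta > tolerance_hz:
            symbols.append("U")
        elif delta < -tolerance_hz:
            symbols.append("D")
        else:
            symbols.append("R")
    return "".join(symbols)


# Made-up whistle contour (Hz): rises, holds, then falls.
contour = [8000.0, 9500.0, 12000.0, 12000.0, 10500.0, 7000.0]
print(parsons_code(contour))
```

Two whistles could then be compared through their code strings rather than their raw frequencies, which is what makes the representation insensitive to absolute pitch.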
TL;DR: This work presents techniques used to harness such utterances, in addition to whistling, as a means of communication and control for patients who, while unable to speak, are capable of making reproducible utterances.
Abstract: During the course of our work in the National Rehabilitation Hospital in Ireland we have encountered patients who, while unable to speak, are capable of making reproducible utterances. We present techniques used to harness such utterances, in addition to whistling, as a means of communication and control. A simple technique for identifying the phonemes /o/ and /s/ (in single-symbol ARPAbet notation) is presented with applications. The use of pitch variation as a means of controlling a continuously variable parameter is described with two applications: a microcontroller-based light dimmer switch and a computer program which facilitates mouse pointer control. Finally, a technique for the recognition of short note sequences is presented. A program is described which allows arbitrary commands to be executed in response to tunes either sung or whistled by the user. These commands may be used to switch on or off electrical appliances in the home. 1 Simple Phoneme Recognition Phonemic, non-verbal input has previously been used as a means of communication in rehabilitation (Igarashi and Hughes, 2001). Here, phoneme recognition based on the timing of positive-going zero-crossings (PGZCs) in the audio signal is used. The criterion for recognition of the phoneme /o/ requires the time interval between successive PGZCs to remain roughly constant, and to be greater than a threshold value. The criterion for recognition of the phoneme /s/ requires that the average rate of PGZCs be greater than a threshold value, typically set at or above 2000 PGZC/s. Using this technique, a microcontroller-based phoneme recognition device was developed to control any device operable by two switches, and in particular, a reading machine designed for users with severe physical disabilities. The signal from a microphone is amplified and infinitely clipped. Two PIC16F84 microcontrollers, one for each phoneme, take this rectangular wave as input.
Each actuates a relay switch at its output when the appropriate phoneme is detected. A C++ class, called audio_widget, was also developed implementing this phoneme recognition technique. The audio_widget facilitates the integration of phoneme recognition into many programs, including one which provides a graphical menu for a patient, using icons drawn by a therapist, each of which may be associated with an arbitrary command. The user may scroll through menu items by making an /s/ sound and may select a menu item by making an /o/ sound. 2 Continuous Pitch Control A microcontroller-based light dimmer switch was developed which is controlled using whistles (or vocalisations) of varying pitch. The instantaneous pitch of the controlling sound is calculated from the time elapsed between the two most recent PGZCs of the signal. The controlling sound is required to have only a single PGZC per pitch period, a criterion typically satisfied by both whistling and the phoneme /o/. While the instantaneous pitch is varying only gradually (e.g. during a whistle), these changes in pitch are mirrored by changes in the light intensity. The audio_widget features a pitch-tracking mode, which uses a power-spectrum-based pitch estimation algorithm called Harmonic Series Identification (Burke, 2002). Using this, a computer program for controlling movement of the mouse pointer and simulating clicks was developed. Horizontal and vertical mouse movement are controlled alternately using whistles (or vocalisations) of varying pitch in a manner similar to that used in the pitch-controlled dimmer switch described above. Any whistle shorter than a threshold duration is interpreted as a click. 3 Tune Based Control Another computer program incorporating the audio_widget allows a number of user-defined commands to be triggered by whistling or singing the appropriate tune. A subsidiary application, called X10action, facilitates the control of household appliances.
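The zero-crossing criteria of Sections 1 and 2 can be sketched in a few lines. This is only an illustrative Python version, not the PIC16F84 firmware or the audio_widget class: the function names, the `min_o_interval` and `max_jitter` values, and the use of pure tones as stand-ins for /o/ and /s/ are all assumptions. Only the 2000 PGZC/s rate threshold comes from the text.

```python
import math


def pgzc_times(samples, sample_rate):
    """Times (s) of positive-going zero-crossings (PGZCs)."""
    return [i / sample_rate
            for i in range(1, len(samples))
            if samples[i - 1] < 0.0 <= samples[i]]


def classify(samples, sample_rate, min_o_interval=0.002,
             max_jitter=0.1, s_rate_threshold=2000.0):
    """Return 'o', 's' or None. Thresholds other than 2000/s are assumed."""
    times = pgzc_times(samples, sample_rate)
    if len(times) < 3:
        return None
    intervals = [b - a for a, b in zip(times, times[1:])]
    mean = sum(intervals) / len(intervals)
    rate = len(intervals) / (times[-1] - times[0])
    if rate > s_rate_threshold:            # /s/: high average PGZC rate
        return "s"
    steady = all(abs(iv - mean) <= max_jitter * mean for iv in intervals)
    if steady and mean > min_o_interval:   # /o/: steady, long intervals
        return "o"
    return None


def instantaneous_pitch(samples, sample_rate):
    """Pitch (Hz) from the two most recent PGZCs, or None."""
    times = pgzc_times(samples, sample_rate)
    if len(times) < 2:
        return None
    return 1.0 / (times[-1] - times[-2])


def tone(freq_hz, sample_rate, n, phase=0.3):
    """Pure tone used here as a stand-in for a recorded sound."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate + phase)
            for i in range(n)]


print(classify(tone(200.0, 16000, 1600), 16000))
print(classify(tone(3000.0, 16000, 1600), 16000))
```

Because only the sign changes matter, the sketch gives the same result on an infinitely clipped signal as on the original waveform.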
The initial pitch of each new note is appended to an array of recent note pitches maintained by the program. If a specified timeout period elapses after the end of one note, without a new note beginning, a (non-unique) tune signature is generated from the stored sequence of note pitches, before clearing the array. If the signature matches that of one of the user-defined tunes, then the corresponding command is executed. Each signature is a sequence of binary digits, each representing the change of pitch in a pair of successive notes: a ‘1’ represents an increase in pitch and a ‘0’ represents a decrease in pitch. Note transitions involving no significant change in pitch are ignored. This method of encoding note sequences is similar to the Parsons code used by Prechelt and Typke (2001).
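The signature scheme described above can be sketched as follows. This is a minimal Python illustration rather than the authors' program: the `min_ratio` threshold for a "significant" change, the command table, and the example pitches are assumptions.

```python
def tune_signature(note_pitches_hz, min_ratio=1.03):
    """Binary signature: '1' for a rise, '0' for a fall between notes.

    min_ratio is a hypothetical threshold; transitions smaller than
    this ratio are treated as insignificant and ignored.
    """
    bits = []
    for prev, curr in zip(note_pitches_hz, note_pitches_hz[1:]):
        if curr > prev * min_ratio:
            bits.append("1")
        elif curr * min_ratio < prev:
            bits.append("0")
        # otherwise: no significant change, so no digit is emitted
    return "".join(bits)


def command_for(note_pitches_hz, table):
    """Look up the command registered for a tune, or None."""
    return table.get(tune_signature(note_pitches_hz))


# Hypothetical user-defined tunes mapped to commands.
commands = {"110": "lamp on", "001": "lamp off"}

print(command_for([440.0, 494.0, 554.0, 440.0], commands))
print(command_for([262.0, 294.0, 330.0, 262.0], commands))
```

Both example tunes produce the signature `110` even though their notes differ, which is the sense in which a signature is non-unique.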
TL;DR: This paper proposes a note-based query-by-humming system in which the note sequence is extracted from the annotated pitch vectors in the MIR-QBSH dataset and corrected for noise and distortion using total variation denoising.
Abstract: In this work, we propose a note-based Query-by-Humming system. For melody extraction, we use the annotated pitch vectors in the MIR-QBSH dataset to extract the note sequence. The note sequence is corrected for noise and distortion using Total Variation Denoising. It is then converted into Parsons Code, which represents the final melody contour. For candidate melody retrieval, we apply techniques from bioinformatics to find the closest match among the candidate songs. Sequence alignment is a prominent field of research in bioinformatics; we use semi-global alignment because it adapts well to our system requirements. We discuss experimental results of our approach, and enumerate new possibilities of applying bioinformatic algorithms in the music domain.
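To make the retrieval step more concrete, here is a minimal sketch of one common form of semi-global alignment, in which the whole query must be aligned but an unmatched prefix and suffix of the reference cost nothing. The scoring values, the function names, and the toy song table are assumptions; the abstract does not state which variant or scores the system uses.

```python
def semi_global_score(query, reference, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of `query` inside `reference`."""
    m = len(reference)
    prev = [0] * (m + 1)  # row 0: skipping a reference prefix is free
    for i in range(1, len(query) + 1):
        curr = [i * gap] + [0] * m  # column 0: query symbols need gaps
        for j in range(1, m + 1):
            sub = match if query[i - 1] == reference[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # align the two symbols
                          prev[j] + gap,       # gap in the reference
                          curr[j - 1] + gap)   # gap in the query
        prev = curr
    return max(prev)  # skipping a reference suffix is free


def best_match(query, songs):
    """Name of the candidate whose Parsons code aligns best."""
    return max(songs, key=lambda name: semi_global_score(query, songs[name]))


# Hypothetical candidate songs as Parsons code strings.
songs = {"song_a": "RRUDURRDD", "song_b": "DDDDRRDD"}
print(best_match("UDU", songs))
```

Free end gaps on the reference side suit humming queries, since a user usually hums only a fragment from somewhere inside the song.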
TL;DR: The aim of the paper is to present an idea of using the automatic detection and correction of detuned singing as a subsystem in query-by-humming (QBH) applications, and four possible combinations of the fundamental frequency detection and pitch shifting algorithms are reviewed.
Abstract: The aim of the paper is to present an idea of using the automatic detection and correction of detuned singing as a subsystem in query-by-humming (QBH) applications. The common approach to searching for a requested song based on the melody retrieved from a hummed pattern usually employs the so-called Parsons code or melody contour. In such a case information about sound pitch is discarded. We hypothesised that an additional module added to the QBH system, indicating notes which were sung out of tune and correcting them, might be useful. For this purpose two fundamental frequency detection algorithms, i.e. the fast autocorrelation and HPS (Harmonic Product Spectrum), and two pitch shifting algorithms, i.e. the modified phase vocoder and PSOLA (Pitch-Synchronous Overlap-Add), are chosen and examined. Four possible combinations of the algorithms are reviewed in the context of correctness of the fundamental frequency detection and pitch shifting. Based on the results, the subsystem for automatic detection and correction of detuned singing for use with QBH applications is implemented. In addition, listening tests and objective measurements of the obtained pitch correction are performed. Conclusions are drawn and proposals of further improvements are provided.
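For orientation, the detection half of such a subsystem might look like the sketch below. It uses a plain time-domain autocorrelation, not the fast autocorrelation or HPS variants examined in the paper, and it measures detuning as the deviation in cents from the nearest equal-tempered semitone with A4 = 440 Hz. Both choices, along with the function names and search range, are assumptions for illustration.

```python
import math


def autocorr_f0(samples, sample_rate, fmin=80.0, fmax=500.0):
    """Estimate the fundamental (Hz) from the strongest autocorrelation lag."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_val = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(samples[i] * samples[i + lag]
                  for i in range(len(samples) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag


def cents_off_nearest_semitone(f0_hz, a4_hz=440.0):
    """Signed deviation (cents) from the nearest equal-tempered note."""
    semitones = 12.0 * math.log2(f0_hz / a4_hz)
    return 100.0 * (semitones - round(semitones))


# Synthetic 200 Hz tone standing in for a sung note.
sr = 8000
note = [math.sin(2 * math.pi * 200.0 * i / sr) for i in range(1024)]
f0 = autocorr_f0(note, sr)
print(f0, cents_off_nearest_semitone(f0))
```

A note could then be flagged as detuned when the absolute deviation exceeds some chosen number of cents, and the same value would give the shift to hand to a pitch shifting algorithm.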