Robust Machine Perception of Nonverbal Speech - Holger Quast
Don't imagine you know what a computer terminal is. A computer terminal is not some clunky old television with a typewriter in front of it. It is an interface where the mind and body can connect with the universe and move bits of it about. Douglas Adams
And hopefully, some time in the future, computers shall be true interfaces indeed. If you think about it, connecting with a computer is fairly clumsy and unnatural if, like most of us, you have to do it with a keyboard and a mouse, or at best with a couple of fancier input devices like a MIDI keyboard or a graphpad. The most important means of natural communication throughout the evolution of mankind has been vocal communication, namely speech. We use speech to defend our Master's thesis, propose to our loved one, and express happiness about the fact that all the whiskey in a bar is gone except for our favorite brand. If a computer wanted to interact with a user, it seems logical that it would make use of the many channels of communication that speech offers: purely linguistic information, but also information about the speaker's state, whether he is serious or joking, emotionally agitated or calm, etc. Some research even suggests that emotional capabilities of computers are not merely an interface issue, but that machines need emotions or similar concepts to become intelligent and perform tasks like pattern recognition or scheduling (see Picard 1997). Some speech recognition systems exist today, but they make use only of verbal information, the words.
Nonverbal vocal communication strikes me as especially interesting for a number of reasons. Conventional speech recognition (or more precisely, word recognition) has been investigated for quite some time now; improvement of system performance is fairly slow, and big technical efforts are necessary for small further gains in accuracy. What gives our brain's speech recognition the major edge is that we make use of multiple information channels such as verbal/linguistic speech and nonverbal, or para- and extralinguistic, signals like prosody and gestures. We also process these signals on multiple levels: not only the audio input, but also speaker modeling, word correlation, context, and more.
The statistics of word correlations change with the emotional context. There is great potential in adding further dimensions to speech recognition systems; they can make recognition more stable and allow the extraction of more information.
The use of prosody - i.e. how we say something with respect to intonation, rhythm etc. - becomes even more crucial in meaning recognition. In irony, for example, the words are usually supposed to have the opposite effect of their literal meaning, which often becomes clear only when one includes the intonation of the speaker. The linguistic channel, the words alone, is often not enough to convey the whole meaning of what is said, hence the use of :-) smiling and frowning faces in emails, for instance, to assure the receiver that something was said in a joking manner. Natural language understanding and nonverbal speech recognition can benefit substantially from each other. Language parsing is easier when both linguistic information and the proper prosodic cues are included, for example for finding phrase boundaries or foci. In turn, knowing which function or part of a sentence and which phoneme coincided with which prosodic element will allow a more intelligent extraction of parameters for modeling the impression this speaker would generate.
The same tools can be used to help a foreign language student learn both pronunciation and prosody of the language she learns. A lot of people don't like to speak in a foreign language because they feel their accent might make them a target for ridicule. With a program like this, students can learn in the privacy of their own home with the most patient teacher possible.
A lot of people's success in their jobs depends on how well they can present themselves orally. Think about a politician who constantly mumbles as if he's going to sleep in a moment. Or a teacher who sounds as interesting as elevator music in a retirement home. Or an insecure lawyer. Or a depressed comedian. The pattern recognition used here on top of the prosody detection can in principle be trained to model any perceived impression that can be generated by speech, be it confidence, happiness, liveliness, or whatever is desired.
It is also possible to monitor the stress level in speakers, especially with my absolute loudness model for speech (see Quast 2000). This can find use in monitoring air traffic control, or in preventing user panic in human-computer dialogue systems: if the computer is aware of the situation, it can modify its speech synthesizer to calm the person down, say in an autonomous space flight scenario where real-time human supervision from earth is impossible.
The prosody perception tools developed here can easily be adapted for use in speech therapy for people who were born deaf. They usually speak with a very unnatural prosody, and guiding them to a clearer pronunciation is a costly therapy effort. With such software, these patients could work with tireless training aids available at any time in addition to sessions with speech therapists, which would greatly enhance the patients' flexibility and reduce the cost for human resources.
In this project it is investigated how speech data can best be represented to enable the recognition of para- and extralinguistic speech information. Features related to pitch, loudness, spectral and prosodic parameters are extracted from audio files. On the psycholinguistic side, a database of German speech recordings is evaluated according to a semantic differential. A method to create synthetic voice samples that represent extremal values for the semantic differential categories is suggested to allow pretraining pattern recognition systems. This description is a patchwork of stuff from different papers and letters; it mainly consists of a presentation I gave at last year's Joint Symposium on Neural Computation at Caltech (see Quast 1999) and there's been lots of new stuff since then. If you have any questions or ideas, I'd be happy to hear from you, contact me at the address below.
Nonverbal Information in the Speech Signal
The Linguistic, Paralinguistic, and Extralinguistic Channels
The information communicated in spoken language can be categorized as linguistic, paralinguistic, or extralinguistic [Eckert/Laver 1994]. Whereas the verbal content, the actual meaning of the words, is thought of as linguistic information, the extralinguistic channel contains information about the speaker’s basic state, e.g. a big person with a large vocal tract will usually have a lower voice than a child. Some extralinguistic parameters are also determined by the culture of the speaker. Compare for instance Swedish – where the pitch vividly goes up and down – to English in which pitch changes by far less throughout a sentence.
The paralinguistic channel carries information about momentary changes in the usual (extralinguistic) baseline, such as whispering in a situation that calls for silence, or expression of emotions.
Expression vs. Impression
As in linguistic communication, nonverbal vocal information is also transmitted from a sender as expression to a receiver who receives an impression [Scherer 1978], which implies that the message at one end is not necessarily the one understood at the other end. Take for instance the Swedish speaker that generates the impression of a happy extrovert person in a non-Scandinavian listener because of his fundamental frequency’s strong modulation – which is a normal (extralinguistic) expression of his language. Or a person expressing happiness by laughing, gasping for air in tears, erroneously understood as crying and sobbing.
The origin of the expression is yet another level away from the addressee. Push effects, described as externalizations of internal states [Scherer 1988], are portrayed through culture-specific standards (pull effects). The utterance may have an emotional cause, i.e. displaying true inner emotion, or an emotive cause, i.e. consciously bringing affective information across [Marty 1908].
Socially less apt people can have a low sending accuracy, meaning they have problems expressing their feelings.
When attempting to automatically recognize patterns in nonverbal speech, it thus seems advantageous to stay as close as possible to the perception, that is, to model the impression a listener has rather than a speaker’s expression, such as emotions or internal state. This can be easily quantified in evaluation experiments, and one has to worry only about receiving accuracy, i.e., how well a listener can understand emotional content. Staying close to the perception side seems even more important because human listeners have been shown to achieve only about 50% accuracy in classifying speech recordings of actors as happy, angry, sad, disgusted, or afraid [Pittam/Scherer 1993].
Data
Psycholinguistic Representation
The database used in this project contains 146 recordings of the eight-sentence German monologue noted below: 118 of professional actors producing the monologue while picturing themselves in different given situations, and 28 of nonactors asked to speak in a natural register. The text tries to combine a variety of different sentence structures while still being somewhat coherent as a whole. It contains two exclamations, one of which is an interjection (8), the other one a request (5); one question (7); and regular sentences, two of which (4, 6) have the same number of syllables and intonation structure to allow for training on one sentence and testing generalization ability on the other.
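(1) In der Vergangenheit ist schon einiges an guter Vorarbeit geleistet worden. (English: A good deal of solid groundwork has already been done in the past.)
(2) Die Ziele, die wir jetzt verfolgen, sind die gleichen und müssen auch auf die gleiche Weise behandelt werden. (English: The goals we are pursuing now are the same and must be handled in the same way.)
(3) Unsere Aufgabe ist nun, noch einmal die Zeiteinteilung durchzusehen. (English: Our task now is to go through the schedule once more.)
(4) Sie überprüfen dann das Weitere. (English: You will then check the rest.)
(5) Bitte notieren Sie die Punkte, die Sie heraussuchen, und tragen Sie uns diese vor! (English: Please note down the points you pick out and present them to us!)
(6) Wir erledigen alles Andere. (English: We will take care of everything else.)
(7) Glauben Sie, daß Sie das schaffen? (English: Do you think you can manage that?)
(8) Gut! (English: Good!)
The length of one recording averages about 30 seconds.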
Evaluation
The recordings are evaluated by listeners according to a semantic differential approach [Osgood/Snider 1969]. The idea behind the semantic differential scheme is to rate data in categories belonging to four groups:
Evaluation – description of personal appeal
Activity – description of the item/action
Understandability – a meta-category group describing for instance the naturalness of a sample
Potency – an intensity category group
The categories then span the hyperspace that contains the speech samples.
In this case, the bipolar dimensions unpleasant–pleasant, unhappy–happy, shy–confident, passive–agitated, slow–fast, naturalness of recording, gender, and weak–strong were used to describe the recordings.
Signal Representation
The speech samples were recorded with an active microphone worn on the speakers’ heads to keep the mouth-to-microphone distance constant. The signal was amplified by a custom-made preamplifier, recorded on DAT with a sampling rate of 48 kHz, and stored as 16-bit mono linear PCM .raw data files.
The Speech Signal
(See Schroeder 1999.) The voiced part of a speech signal, written here as s(t), can be understood as a convolution of periodic excitation pulses p(t) occurring with fundamental frequency F0, with the impulse response h(t) of the vocal tract:
$$s(t) = p(t) * h(t) \tag{1}$$
The excitation pulse originates as a short puff of air released through the glottis, the opening between the vocal cords, once during part of one fundamental period T0.
The fundamental frequency in speech is in almost all cases equal to the pitch perceived by a listener (except for very rare psychoacoustic anomalies).
Fast Fourier transformation allows examining the signal in the frequency domain, see Fig. 1. Since the excitation pulses are non-sinusoidal, they are represented by the fundamental frequency and its harmonics (overtones) at multiples of F0 that can clearly be seen as parallel ripples from 50 to 4000 Hz in the spectrogram during voiced periods.
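The convolution in (1) now becomes a simple product of the spectra S(ω), P(ω), and H(ω) of the speech signal, the excitation pulses, and the vocal tract impulse response:
$$S(\omega) = P(\omega)\,H(\omega) \tag{2}$$
i.e., the amplitudes of the harmonic series are multiplied by the frequency response H(ω) of the vocal tract, thereby creating frequency bands with low intensity at the zeros of H(ω), and so-called formants (corresponding to the resonances of the vocal tract) at the poles.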
During unvoiced parts of the signal – in this case, during ‘z’, ‘v’, ‘f’, and ‘ss’ – the glottis does not periodically modify the airflow in the vocal tract. These sounds are generated by partially obstructing the airflow in the vocal tract, creating noise-like frequency distributions in higher spectral regions. (Note the presence of both harmonic ripples and noise-like distribution during a voiced 's' in Fig. 1 at t = 2.9 s.)
Fig. 1 Spectrogram of sentence 2 for a female voice.
Features
The speech segments are classified as voiced, unvoiced, or silent.
The following features are candidates for analysis:
Fundamental Frequency:
Contour,
Mean,
Range,
Fluctuation,
Choice of frequencies
Intensity Parameters:
Absolute magnitude, energy, power,
Loudness corresponding to psychoacoustic model
Spectrum:
Width,
Percentile bandwidths,
Distribution of frequencies,
Microtremors, voice stress analysis
Other Prosodic Features:
Speech rate,
Stress & intonation patterns
Vowel intonation length and vocal effort contour per vowel
...
Feature Extraction
Fundamental Frequency
A time window wn of 43 ms (2048 samples) is extracted from the speech recording, analyzed with an autocorrelation and a cepstrum technique, each yielding 4 candidates for a possible F0 value. The best one is picked with respect to system knowledge accumulated from previous values. The time window is then shifted by 17 ms (800 samples), and the process is repeated until the end of the file is reached.
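The windowing scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frames` is hypothetical, the signal is taken to be a plain list of samples, and a trailing window shorter than 2048 samples is simply dropped, which the text does not specify.

```python
SAMPLE_RATE = 48_000   # Hz, the DAT sampling rate given in the text
WINDOW = 2048          # samples, about 43 ms at 48 kHz
HOP = 800              # samples, about 17 ms at 48 kHz


def frames(signal, window=WINDOW, hop=HOP):
    """Yield (start_index, samples) for every full analysis window.

    Assumption: an incomplete window at the end of the file is skipped.
    """
    for start in range(0, len(signal) - window + 1, hop):
        yield start, signal[start:start + window]


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE
    starts = [start for start, _ in frames(one_second)]
    print(len(starts), "windows in one second of audio")
    print("window length in ms:", 1000 * WINDOW / SAMPLE_RATE)
```

Because consecutive windows overlap by 1248 samples, each F0 estimate shares most of its data with its neighbors, which is part of what makes the history-based candidate selection described later plausible.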
Autocorrelation with Centerclipping
The autocorrelation is used here to find periodicity in the time domain as well as for voiced/unvoiced or silent classification (similar to the process suggested in [Rabiner/Schafer 1978]). The scalar product of the extracted window w_n with a time-shifted copy w_{n+t} of the same vector is computed. The (discrete) autocorrelation ac_w(t) is then expressed as a function of the time shift t (lag):
$$ac_w(t) = \sum_n w_n\, w_{n+t}$$
For periodic signals (or quasiperiodic signals like the voiced speech signal), the autocorrelation peaks at lags that are integer multiples of the fundamental period.
Other, unwanted peaks originate from the harmonics of F0. These are subdued by centerclipping the signal vector prior to autocorrelation. In this procedure, all values whose absolute magnitudes are smaller than a given threshold are set to zero; values of greater magnitude are moved toward zero by the threshold, i.e. the threshold is subtracted from positive values and added to negative ones. To compute the cutting level, the largest absolute value in the first third of w is compared to the largest absolute value in the last third of w; the smaller of these two is multiplied by 0.5 and taken as the threshold.
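A minimal Python sketch of centerclipping followed by autocorrelation might look as follows. The function names are hypothetical, and the autocorrelation is written as a direct sum for readability; it is not the FFT-based implementation used in the project.

```python
import math


def clipping_threshold(w, factor=0.5):
    """Smaller of the peak magnitudes in the first and last third, times factor."""
    third = len(w) // 3
    first_peak = max(abs(x) for x in w[:third])
    last_peak = max(abs(x) for x in w[-third:])
    return factor * min(first_peak, last_peak)


def center_clip(w, threshold):
    """Zero small samples and pull the remaining ones toward zero by the threshold."""
    clipped = []
    for x in w:
        if abs(x) < threshold:
            clipped.append(0.0)
        elif x > 0:
            clipped.append(x - threshold)
        else:
            clipped.append(x + threshold)
    return clipped


def autocorrelation(w, max_lag):
    """Direct-sum autocorrelation for lags 0..max_lag (O(N * max_lag))."""
    n = len(w)
    return [sum(w[i] * w[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


if __name__ == "__main__":
    # Synthetic test signal: period of 100 samples plus a strong second harmonic.
    signal = [math.sin(2 * math.pi * n / 100) + 0.5 * math.sin(2 * math.pi * n / 50)
              for n in range(600)]
    clipped = center_clip(signal, clipping_threshold(signal))
    ac = autocorrelation(clipped, 150)
    best_lag = max(range(20, 151), key=lambda lag: ac[lag])
    print("estimated period in samples:", best_lag)
```

The lag search starts at 20 samples only to skip the trivial maximum at lag zero; in practice the lower bound would follow from the highest F0 one expects.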
The highest possible value of the autocorrelation function appears at t=0, where the vector is simply squared. Dividing this value by the window length yields a power value for this time segment, which can thus, with a manually set silence threshold, be used to classify the segment as silent or articulated.
The autocorrelation function for unvoiced signals (essentially noise with a frequency bandwidth of over 10 kHz) approaches zero quickly since random processes are uncorrelated. Therefore, this information can be used to classify the signal as unvoiced if after a short time the autocorrelation does not reach a maximum greater than 30 % of the initial value at t=0. For efficiency reasons, the autocorrelation is implemented with an FFT.
Cepstrum
The cepstrum (coined as an anagram of “spec-trum”) essentially analyzes the signal in the frequency domain (see Schroeder 1999, Hess 1983). Before Fourier-transforming, the data vector is multiplied by a 1 − cos²(x) window to avoid the strong spectral splatter that would result from transforming a signal with harsh edges. As seen in Eq. (2), the speech signal in the frequency domain can be modeled as the product of the excitation pulses’ spectrum P(ω) and the frequency response H(ω) of the vocal tract. With S(ω) denoting the spectrum of the speech signal, taking a logarithm of the power spectrum turns this product into a sum:
$$\log|S(\omega)|^2 = \log|P(\omega)|^2 + \log|H(\omega)|^2$$
A further Fourier transformation returns the system to the time domain,
$$c(q) = \mathcal{F}\left\{\log|S(\omega)|^2\right\}$$
where the new dimension q is called quefrency and is given in units of time. Through this, spectral envelope information, which appears mostly in the 0–3 ms range of the cepstrum c(q), and excitation period information can be separated. The fundamental period manifests itself as a peak in the cepstrum. Geometrically, the cepstrum can be interpreted as finding the periodicity of the harmonic ripples in the logarithmic power spectrum by means of another FFT. The logarithm flattens the contour in the frequency domain and thereby suppresses high-frequency information.
An interesting alternative is given by Schroeder (1999): instead of using the logarithm, spectral flattening can also be achieved by taking the square root of each value in the power spectrum. This eliminates the definition gap at zero power positions and the very negative values for small inputs, returns only positive values, and has no free parameters to worry about like the base of the logarithm. Geometrically, the explanation why this spectral leveling is successful is the same as for the log-cepstrum; analytically, it does not offer any clear explanations besides the intuition that setting all phases equal (by building the power spectrum) and removing high frequencies would yield a sharply spiked signal. The ultimate spectral flattening can be achieved by setting the spectral envelope to a constant value, thereby eliminating all poles (formants) and zeros. This can be done by inverse filtering the signal with information from a linear prediction filter (LPC), as done in the SIFT pitch finder.
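Both flattening variants, the logarithm and the square root, can be compared in a small pure-Python sketch. Everything here is illustrative: the names `dft`, `taper`, `cepstrum`, and `period_from_cepstrum` are hypothetical, the window argument is assumed to run from 0 to π across the window (so that 1 − cos² equals sin²), a small floor is added before the logarithm to avoid log(0), and a naive DFT stands in for the FFT.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform; an FFT would be used in practice."""
    n = len(x)
    twiddle = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    return [sum(x[k] * twiddle[(k * m) % n] for k in range(n)) for m in range(n)]


def taper(x):
    """Apply a 1 - cos^2 (= sin^2) window; the 0..pi argument mapping is assumed."""
    n = len(x)
    return [v * math.sin(math.pi * k / (n - 1)) ** 2 for k, v in enumerate(x)]


def cepstrum(x, flatten="log", floor=1e-12):
    """Magnitude of the transform of the flattened power spectrum."""
    power = [abs(c) ** 2 for c in dft(taper(x))]
    if flatten == "log":
        flat = [math.log(p + floor) for p in power]   # floor avoids log(0)
    elif flatten == "sqrt":
        flat = [math.sqrt(p) for p in power]          # root-spectrum alternative
    else:
        raise ValueError("flatten must be 'log' or 'sqrt'")
    return [abs(c) for c in dft(flat)]


def period_from_cepstrum(ceps, min_q, max_q):
    """Quefrency index (in samples) of the largest peak in the search range."""
    return max(range(min_q, max_q + 1), key=lambda q: ceps[q])


if __name__ == "__main__":
    # Idealized excitation: one pulse every 32 samples in a 256-sample window.
    pulses = [1.0 if n % 32 == 0 else 0.0 for n in range(256)]
    for mode in ("log", "sqrt"):
        ceps = cepstrum(pulses, flatten=mode)
        print(mode, "period estimate:", period_from_cepstrum(ceps, 10, 100))
```

The lower search bound keeps the low-quefrency envelope region out of the peak search, mirroring the separation of envelope and excitation information described above.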
Neither cepstrum nor autocorrelation works perfectly all of the time. Although the pure cepstrum and the autocorrelation are essentially the same (think about the correlation as a convolution "in the other direction", which becomes a product in the frequency domain; this product is nothing else but the power spectrum, since one squares each value), the nonlinear filtering in the autocorrelation as described above lets the two processes display slightly different behaviors. For a very low voice, say 50 Hz, one time window might be too short to do an autocorrelation, and for a very high voice at 500 Hz, the cepstral peak corresponding to the fundamental period may be covered by the envelope quefrencies.
To obtain the best possible solution, both cepstrum and autocorrelation return their 4 most likely values together with their probabilities, which are then weighted with respect to the history of the system. An average with stress on values in the immediate past is kept. Although for low voices the fundamental frequency can halve from one period to the next, in most cases the change rate doesn’t exceed 0.7 % per millisecond [Hess 1983]. Therefore, multiplying by 2 the probabilities of all values that lie within 1 %/ms of the previous fundamental frequency works well. For the first voiced window after a silent or unvoiced one, the value has to be inside a range no larger than 1.2 times the previous range, or its probability is divided by 2. The fundamental frequency contour describes the values of F0 as a function of time, see Fig. 2.
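The history weighting can be sketched as follows. This is a simplified reading of the description rather than the original implementation: `pick_f0` and `update_history` are hypothetical names, the "average with stress on values in the immediate past" is assumed to be an exponentially weighted average with an arbitrary weight of 0.5, and the special rule for the first voiced window after a pause is left out.

```python
HOP_MS = 1000 * 800 / 48_000   # about 16.7 ms between successive windows


def pick_f0(candidates, reference_f0, hop_ms=HOP_MS, max_rate=0.01, boost=2.0):
    """Choose the F0 candidate with the highest history-weighted score.

    candidates: list of (f0_hz, probability) pooled from cepstrum and
    autocorrelation. Candidates within 1 %/ms of the reference value
    (accumulated over one hop) have their probability doubled.
    """
    tolerance = max_rate * hop_ms * reference_f0
    best_f0, best_score = None, -1.0
    for f0, probability in candidates:
        near = abs(f0 - reference_f0) < tolerance
        score = probability * boost if near else probability
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0


def update_history(history, f0, weight=0.5):
    """Running average that stresses recent values (weight is an assumed value)."""
    if history is None:
        return f0
    return weight * f0 + (1.0 - weight) * history


if __name__ == "__main__":
    # An octave error (200 Hz) with a slightly higher raw probability
    # loses against the candidate that is consistent with the history.
    candidates = [(100.0, 0.4), (200.0, 0.5)]
    print(pick_f0(candidates, reference_f0=105.0))
```

With the 17 ms hop, the 1 %/ms criterion corresponds to a tolerance of roughly 17 % of the reference frequency from one window to the next.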
Fig. 2 F0 contour of a recording of sentence 2.
The resulting F0 contour can further be rid of errors and smoothed by median picking in a 3- or 5-value window if necessary, and/or by lowpass filtering.
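Median picking is easy to illustrate. In the sketch below, `median_smooth` is a hypothetical name, the contour is assumed to contain voiced values only, and the first and last values are left unchanged because no full window is available there.

```python
from statistics import median


def median_smooth(contour, width=3):
    """Replace each interior value by the median of a width-value window.

    Assumption: edge values without a complete window are kept as they are.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd, e.g. 3 or 5")
    half = width // 2
    smoothed = list(contour)
    for i in range(half, len(contour) - half):
        smoothed[i] = median(contour[i - half:i + half + 1])
    return smoothed


if __name__ == "__main__":
    # The isolated value of 204 Hz mimics an octave error in the contour.
    print(median_smooth([100, 102, 204, 104, 106], width=3))
```

A 3-value window removes isolated outliers such as single octave jumps; two adjacent errors would need the 5-value window.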
The computation of the mean is straightforward; range and choice of frequencies can best be expressed in a histogram. The fluctuation can be measured as the average absolute value of the derivative |dF0/dt| (or rather the finite difference |ΔF0/Δt|), or as the ratio of the number of F0 extrema to time.
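These contour statistics could be computed along the following lines. The function names and the 10 Hz histogram bin width are assumptions made for illustration; `hop_s` is the time between two contour values in seconds.

```python
def fluctuation(contour, hop_s):
    """Average absolute F0 change per second between neighboring values."""
    diffs = [abs(b - a) for a, b in zip(contour, contour[1:])]
    return sum(diffs) / (len(diffs) * hop_s)


def extrema_rate(contour, hop_s):
    """Number of local F0 extrema divided by the duration of the contour."""
    count = 0
    for i in range(1, len(contour) - 1):
        rising_in = contour[i] - contour[i - 1]
        rising_out = contour[i + 1] - contour[i]
        if rising_in * rising_out < 0:   # slope changes sign at this point
            count += 1
    return count / ((len(contour) - 1) * hop_s)


def f0_histogram(contour, bin_hz=10):
    """Count contour values per frequency bin, keyed by the lower bin edge."""
    counts = {}
    for f0 in contour:
        lower = int(f0 // bin_hz) * bin_hz
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    contour = [100, 110, 105, 125]   # Hz, one value per window
    hop_s = 800 / 48_000             # window shift in seconds
    print(fluctuation(contour, hop_s), "Hz/s")
    print(extrema_rate(contour, hop_s), "extrema/s")
    print(f0_histogram(contour))
```

The histogram captures range and choice of frequencies in one structure: its outermost occupied bins give the range, its shape the preferred frequencies.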
Intensity Parameters
The intensity parameters absolute magnitude, power and energy are readily computed “on the way” during the autocorrelation. Their values have no absolute meaning, however. A more meaningful understanding of loudness is motivated by the psychophysics of hearing. Human loudness perception can be modeled by dividing the audible frequency range in third-octave bands (19 for the first 10 kHz) corresponding to neighboring areas on the inner ear’s basilar membrane. If two simultaneous sounds are closer together than the interval given by these frequency bands, their intensities do not add up, but the stronger sound masks the weaker one [Zwicker/Fastl 1990]. If they are further apart, the overall loudness increases. Also, the sensitivity changes over the hearing range. A useful value is obtained by dividing the spectrum of the sound in each window into critical bands and summing the intensities in them according to their respective sensitivities. This can be normalized with respect to the loudest frequency to obtain an absolute speech loudness value (see Quast 2000).
Spectrum
Common approaches to extract spectral parameters are to define percentile ranges, i.e. the width of a frequency band whose intensity is higher than the given percentage, or divide a suitable frequency range, say, 50-5000 Hz, into subbands (e.g. third-octave bands as pointed out above) and measure the intensity in them for every time window. Histograms are useful to describe the frequency content of an utterance without recording values at each window.
Microtremors, an inaudible modulation of the voice in the 8-12 Hz range, are believed to be an indicator of stress (as an inverse measure: more microtremor, less stress). However, they are hard to automatically extract and to quantify, and their usefulness is doubtful [Horvarth 1982].
Prosodic features
Prosody describes a wide range of speech attributes such as intonation, voice quality (timbre), accentuation, and temporal variation. Intonation and timbre can be expressed through the fundamental frequency contour and the spectral parameters mentioned above. Prosodic features, however, in addition to displaying the state of the speaker, also structure the message and assign focus to important parts, and therefore reveal their full meaning only in conjunction with the related linguistic, verbal information.
Two measures previously used for speech rate are the inverse of the length of articulated and of voiced periods in a text (which limits the use to either working with standard sentences as used by Scherer [Banse/Scherer 1996], averaging over long recordings (T>1 min), or including linguistic information about respective average word lengths). Also the empirically defined thresholds for the voiced/unvoiced and for the silent/articulated classification might erroneously attribute an unvoiced value to a voiced sample or a silent label to a very weak signal in automatic classification, which doesn't harm the F0-contours but results in noticeable error for the speech rate ratios. Another measure for speech rate is the distance between peaks in the absolute magnitude vs. time plot which usually coincide with the syllables of the speech sample and can be computed with higher accuracy and more easily than voiced-period-lengths. Due to the fact that syllable lengths are a lot more regular than lengths of voiced or articulated periods for one speaker, it suffices to average over a time frame as small as 3 seconds.
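The peak-distance measure for speech rate can be sketched as below. The names `envelope_peaks` and `syllable_rate`, the height threshold, and the minimum peak distance are all assumptions for illustration; the envelope is taken to be one absolute-magnitude value per analysis window.

```python
def envelope_peaks(envelope, min_height, min_distance):
    """Indices of local maxima above min_height, at least min_distance apart.

    If two maxima are closer than min_distance, only the higher one is kept.
    """
    peaks = []
    for i in range(1, len(envelope) - 1):
        if envelope[i] < min_height:
            continue
        if envelope[i] > envelope[i - 1] and envelope[i] >= envelope[i + 1]:
            if peaks and i - peaks[-1] < min_distance:
                if envelope[i] > envelope[peaks[-1]]:
                    peaks[-1] = i
            else:
                peaks.append(i)
    return peaks


def syllable_rate(envelope, hop_s, min_height, min_distance=3):
    """Peaks per second, from the mean distance between neighboring peaks."""
    peaks = envelope_peaks(envelope, min_height, min_distance)
    if len(peaks) < 2:
        return 0.0
    mean_gap_s = (peaks[-1] - peaks[0]) / (len(peaks) - 1) * hop_s
    return 1.0 / mean_gap_s


if __name__ == "__main__":
    envelope = [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
    print(syllable_rate(envelope, hop_s=0.05, min_height=0.5), "peaks per second")
```

Since the measure only depends on peak positions, it is not affected by the voiced/unvoiced and silent/articulated thresholds mentioned above.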
Auditory Inspection of Nonverbal Speech Features and Pattern Recognition Considerations
F0
The pitch of an utterance offers a representation that is both meaningful and reasonably reliable to detect. The scheme outlined above yielded F0 values consistent with the perceived pitch. In most analyses, pitch values were used only as a static description of the system, e.g. in the form of average values, deviation, or percentile ranges. The relationships to other variables and the resulting perceived impression can readily be investigated by analysis of variance, correlation, multiple regression, perceptrons, etc. (cf. Banse/Scherer 1996, Scherer 1979). However, these statistical approaches clearly are unable to capture the pitch dynamics of the utterance.
F0 contours, on the other hand, do just this. When the contours were used to modulate the frequency of a sine wave, the resulting signals accurately reproduced the perceived pitch of the speech recording and thereby its intonation, accentuation (as in structuring an utterance through pitch extrema), and temporal variance. Auditory inspection of these waveforms – although it takes a while to get accustomed to the clean glissando sound of a single sine wave as representation for pitch – suggests that the contours hold valuable information, for example about the arousal level of a speaker: how calm or vivid, how relaxed or nervous, how involved or indifferent the speech is perceived, especially when the gender of the speaker is also given. Additionally, the contour seems to hold part of the information necessary to classify the recording according to other impressions, for instance angry or happy. By means of the fundamental frequency contour, a lot of information can be preserved without a lot of data. Whereas the wavefile recording occupies more than 90 kilobytes per second, the 60 values taken for each second represent a fraction of 1/800 of the original data volume. It has even been reported that it suffices to record the position and size of the contour maxima (in addition to information about the onset of the vowel) and the slope of both adjacent parts of the curve, and then in reconstruction to approximate all other points with pieces of sinusoids [IKP 1995].
Loudness
Adding the recorded loudness values (as the wave envelope) structures the contour and noticeably separates the syllables of the recording, thereby adding to the naturalness of the synthesized sound and making it possible to pick discrete values, one vector corresponding to each syllable, for the pattern recognition. As explained above, these volume values carry no absolute meaning: the same sound samples could be recorded with twice the gain or played at one third of the volume. One solution to be considered is to measure the volume of a narrow frequency band around the fundamental frequency and then change the overall volume in that sample so that all recordings share the same average amplitude in this band. Alternatively, the loudest contributing frequency can be picked in each time window rather than the F0 band. After this normalization, loudness values can be assigned according to how wide the frequency band of a voice is, see the psychoacoustic considerations above. In this manner, both loudness and a spectral value are combined in one variable, and together with the F0 contour can preserve the vocal characteristics of the speech sample beyond a fixed number of categories. This, however, still does not take care of loudness changes due to different communication situations. A person at a busy airport will speak louder than the same person in the same internal state at a cozy restaurant. This information could be included either by assigning a situation-baseline volume, or by restricting the data set to recordings from situations which require roughly the same volume.
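A sine wave that follows an F0 contour and carries the loudness values as its envelope can be generated as in the sketch below. It is an illustration under assumptions, not the synthesis code used here: `synthesize` is a hypothetical name, F0 and loudness are held constant within each hop (interpolating between values would sound smoother), and the phase is accumulated sample by sample so that frequency changes do not cause clicks.

```python
import math


def synthesize(f0_contour, loudness, hop, sample_rate=48_000):
    """Sine wave whose frequency follows f0_contour and whose amplitude follows loudness.

    f0_contour and loudness hold one value per analysis window; each value
    is held for `hop` samples (an assumed, piecewise-constant scheme).
    """
    samples = []
    phase = 0.0
    for f0, amplitude in zip(f0_contour, loudness):
        step = 2.0 * math.pi * f0 / sample_rate   # phase advance per sample
        for _ in range(hop):
            samples.append(amplitude * math.sin(phase))
            phase += step
    return samples


if __name__ == "__main__":
    # A short rising glissando from 120 Hz to 180 Hz with a fading envelope.
    contour = [120 + 2 * i for i in range(31)]
    envelope = [1.0 - i / 40 for i in range(31)]
    wave = synthesize(contour, envelope, hop=800)
    print(len(wave), "samples,", len(wave) / 48_000, "seconds")
```

Unvoiced or silent windows can be represented by a loudness value of zero, in which case the corresponding F0 entry has no audible effect.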
Frequency Distribution
Including dynamic spectral properties in the analysis poses a problem because the system does not know how to separate linguistic (phonetic) from para/extralinguistic information if the type of the vowel or consonant is not explicitly given. The timbre or voice quality is determined by the frequency distribution of a speaker, but so are the phonemes. Clearly an ‘ah’ has a different frequency distribution than an ‘oh’ at the same fundamental frequency and speaker state due to the different formant structure, here related to a wide or small mouth opening. The human auditory system performs this separation task extremely fast and with very little data [Ladefoged/Broadbent 1957]. After a few fundamental periods, a model of a speaker's vocal tract is created which maps the speaker-specific formant structures to the phonemes that make up speech. The capabilities of today’s automatic speaker normalization schemes – which in turn would allow finding out about a speaker’s specifics – are still far away from the human brain’s power. Including the frequency histogram of a suitably large segment like one sentence for voiced sounds and describing the bandwidth dynamics indirectly, as outlined above, by an envelope size that reflects the richness in frequency bands, seems to be a good compromise.
Pattern Recognition
Most investigations of the nonverbal vocal communication channels’ content (usually emotions) so far have used only static, statistical descriptions of speech (see Banse/Scherer 1996) and are rarely aimed at automation. In an example of a dynamic approach that tried to work directly with F0 values as a function of time, the contour of an utterance was represented as the best-fit (in a least-square-error sense) sinusoid [Cohn/Katz 1998]. This model yielded good results only when the auditory analysis was combined with information about facial motion from the processing of video images. Ultimately, the least biased solution for pattern recognition would be training a network on pure sound data that has been evaluated according to nonverbal vocal content, to ensure that no acoustic cues are missing in the representation. However, the huge amount of data and the high intrinsic dimensionality obviously render this approach impractical. When reducing the input to the dimensions F0 contour with loudness value taken every 20 to 100 milliseconds and frequency histogram, working with a standard text might be possible if massive numbers of locally connected neurons in a high number of layers are used to achieve shift and warp invariance, as in a large convolutional network [LeCun 1990], but generalizing this framework to recognize sentences of different structures would still be practically impossible. When the input is reduced to the number of syllables (voiced intensity peaks) in a sentence, their positions, the fundamental frequencies at these instances, a loudness value, and a spectral histogram, the dimensionality seems to come within reach of what can successfully be tackled with a large network, especially when hardcoded in VLSI circuitry [Säckinger 1992].
Focussing on discrete values taken at the intensity peaks seems to be furthermore justified by temporal pre- and especially postmasking effects of hearing that render sounds shortly before and after a loud sound harder to perceive [Zwicker/Fastl 1990]. However, the number of variations that can appear due to different sentence structures still creates a problem. Here, auditory inspection of the F0/loudness/spectrum speech contours as described above suggests it is possible to identify a limited number of parameters carrying the crucial information, e.g., location of the sentence foci, intensity values, relation of fundamental frequencies at focus and end of sentence, speech rate, and a spectral histogram. Then, synthetic sound samples modeling sentences using the dimensions time, F0 as contour, loudness as envelope, and frequency distribution can be created and evaluated by listeners at different values for the dimensions that are picked. The knowledge gained from these evaluations can be used either to automatically build an array of synthetic sentence contours as representatives of possible sentence structures and impressions to train a network, a task which seems feasible with a realistic number of syllables, foci, speech rate variations, and frequency distributions, or to build a preprocessor extracting these parameters and performing pattern recognition on them (i.e., a static approach). The system can furthermore be trained (and finally tested) on real speech data. Both schemes are designed for automation and are not limited to discriminating only between standard sentences.
PUBLICATIONS
Quast, H.: Speech Dialogue Systems and Natural Language Processing. In: Schroeder, M. R.: Computer Speech, 2nd ed. (Springer, Berlin Heidelberg New York 2004)
Quast, H., Scheideck, T., Geutner, P., Korthauer, A.: RoBoDiMa: A Dialog-Object-Based Natural Language Speech Dialog System. In: Proceedings of the Eighth Biannual IEEE Workshop on Automatic Speech Recognition and Understanding - ASRU 2003 St Thomas (IEEE 2003)
Quast, H., Schreiner, O., Schroeder, M. R.: Robust Pitch Tracking in the Car Environment. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing - ICASSP 2002 Orlando (IEEE 2002)
Schreiner, O., Quast, H.: Grundfrequenzbestimmung aus dem Modulationsspektrum. In: Fortschritte der Akustik, DAGA 2002 Bochum (DEGA 2002)
Quast, H.: Automatische Erkennung nonverbaler Sprache. In: Fortschritte der Akustik, DAGA 2001 Hamburg-Harburg (DEGA 2001)
ze works:
Quast, H.: Automatic Recognition of Nonverbal Speech: An Approach to Model the Perception of Para- and Extralinguistic Vocal Communication with Neural Networks. Thesis. (University of Göttingen 2001)
Quast, H.: Absolute Perceived Loudness of Speech. In: Proceedings of the 7th Joint Symposium on Neural Computation, USC (INC 2000)
Quast, H.: Recognition of Nonverbal Speech Features. In: Proceedings of the 6th Joint Symposium on Neural Computation, Caltech (INC 1999)
REFERENCES / SUGGESTED READING:
[Banse/Scherer 1996] Banse, R., Scherer, K.R.: Acoustic Profiles in Vocal Emotion Expression. Journal of Personality and Social Psychology 70, No.3, pp. 614–636 (1996)
[Cohn/Katz 1998] Cohn, J.F., Katz, G.S.: Bimodal Expressions of Emotion by Face and Voice. Workshop on Face/Gesture Recognition and their Applications, The Sixth ACM International Multimedia Conference, Bristol, England (1998)
[Eckert/Laver 1994] Eckert, H., Laver, J.: Menschen und Ihre Stimmen. (Beltz Psychologie Verlags Union, Weinheim 1994)
[Hess 1983] Hess, W.: Pitch Determination of Speech Signals. (Springer, Heidelberg New York 1983)
[Horvarth 1982] Horvath, F.: Detecting Deception: The Promise and the Reality of Voice Stress Analysis. Journal of Forensic Sciences, JFSCA 27, No.2, pp. 340–351 (April 1982)
[IKP 1995] Heuft, B., Portele, T., Höfer, T., Krämer, J., Meyer, H., Rauth, M., Sonntag, G.: Parametric Description of F0-Contours in a Prosodic Database. Proc. ICPhS 2, pp. 378–381 (1995)
[Ladefoged/Broadbent 1957] Ladefoged, P., Broadbent, D.E.: Information conveyed by vowels. Journal of the Acoustical Society of America 29, pp. 98–104 (1957)
[LeCun 1990] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Handwritten digit recognition with a back-propagation network. In Touretzky, D.S. (ed.): Advances in Neural Information Processing Systems 2, pp. 396–404 (Morgan Kaufmann, San Mateo 1990)
[Marty 1908] Marty, A.: Untersuchungen zur allgemeinen Grundlegung der Grammatik und Sprachphilosophie. (Niemeyer, Halle/Saale 1908)
[Osgood/Snider 1969] Osgood, C.E., Snider, J.G.: Semantic Differential Technique: A Sourcebook. (Aldine Publishing Co., Chicago 1969)
[Picard 1997] Picard, Rosalind W.: Affective Computing. (MIT Press, Cambridge, Massachusetts, 1997)
[Pittam/Scherer 1993] Pittam, J., Scherer, K.R.: Vocal Expression and Communication of Emotion (1993) [results of two studies, van Bezooijen, 1984, and Scherer et al, 1991] in Lewis, M., Haviland, J.M. (Eds.): Handbook of Emotions. (Guilford Press, New York)
[Quast 1999] Quast, H.: Recognition of Nonverbal Speech Features. In: Proceedings of the 6th Joint Symposium on Neural Computation, Caltech (INC 1999)
[Quast 2000] Quast, H.: Absolute Perceived Loudness of Speech. In: Proceedings of the 7th Joint Symposium on Neural Computation, USC (INC 2000)
[Rabiner/Shafer 1978] Rabiner, L.R., Schafer, R.W.: Digital Processing of Speech Signals. (Prentice-Hall, Englewood Cliffs, New Jersey 1978)
[Säckinger 1992] Säckinger, E., Boser, B.E., Jackel, L.D.: A neurocomputer board based on the ANNA neural network chip. In Moody, J.E., Hanson, S.J., Lippmann, R.P. (eds.): Advances in Neural Information Processing Systems 4, pp. 773–780 (Morgan Kaufmann, San Mateo 1992)
[Scherer 1988] Scherer, K.R.: On the symbolic functions of vocal affect expression. Journal of Language and Social Psychology 7, pp. 79–100 (1988)
[Scherer 1979] Scherer, K.R.: Nonlinguistic vocal indicators of emotion and psychopathology. In Izard, C.E. (Ed.): Emotion in Personality and Psychopathology. (Plenum Press, New York 1979)
[Scherer 1978] Scherer, K.R.: Personality inference from voice quality: the loud voice of extroversion. European Journal of Social Psychology 8, pp. 467–487 (1978)
[Schroeder 1999] Schroeder, M.R.: Computer Speech – Recognition, Compression, Synthesis. (Springer, Heidelberg Berlin New York 1999)
[Zwicker/Fastl 1990] Zwicker, E., Fastl, H.: Psychoacoustics. Facts and Models. (Springer, Heidelberg Berlin New York 1990)
Dr. Holger Quast Drittes Physikalisches Institut, Universität Göttingen Friedrich-Hund-Platz 1 37077 Göttingen Germany holcus@_no_spam_physik3.gwdg.de
My San Diego home:
Machine Perception Lab, Institute for Neural Computation
University of California, San Diego
9500 Gilman Drive Mailcode 0523
La Jolla, CA 92093-0523
last updated March 16, 2005 HQ