Reading Interspeech 2010 Program

Luckily, speech people don't have that many conferences. In the machine learning world things seem to be getting crazy: you can attend a conference every month, and researchers travel more than sales managers. In speech there are ASRU and ICASSP, but they don't really matter. It's enough to track Interspeech. Since Tokyo is too far away, I'm just reading the abstract list from the program. First impressions are:

  • Keynotes are all boring
  • Interesting rise of the subject "automatic error detection in unit selection". At least three (!) papers are presented on the subject, while I hadn't seen any on it before. Looks like the idea appeared in less than a year! Are they spying on each other?
  • RWTH Aachen presented an enormous number of papers; LIUM is also quite fruitful
  • Well, IBM T. J. Watson Research Center is active as well, but that's more of a tradition
  • In one paper I came across: "yields modest gains on the English broadcast news RT-04 task, reducing the word error rate from 14.6% to 14.4%". Was it worth writing an article?
  • Cognitive status assessment from speech is important in dialogs. SRI is doing that
  • Strange that reverberation issues are treated as a separate class of problems and covered so extensively. The problem as a whole looks rather generic: create features that are stable to noise and corruption. Not sure how reverberation is special here
  • WFST is loudly mentioned
  • Andreas Stolcke on SRILM noted that pruning doesn't work with KN-smoothed models! Damn, I was using it
  • Only 2 Russian papers at the whole conference. Well, that's 50% growth over the previous year. And one of them is on speech recognition, that's definitely progress
  • Surprisingly, not much research on confidence measures! Confidence is a REALLY IMPORTANT THING
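
To put the 14.6% to 14.4% result above in perspective, here is a quick back-of-the-envelope sketch in Python (my own helper, nothing from the paper) that turns two absolute word error rates into a relative reduction:

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, given two absolute WERs in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Figures quoted in the bullet above (English broadcast news RT-04 task).
print(round(relative_wer_reduction(14.6, 14.4), 2))  # prints 1.37
```

So that is roughly a 1.4% relative gain. I suspect that on many test sets a difference this small would be hard to tell apart from noise without a significance test.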

Reading the abstracts, I also selected some papers that could be interesting for Nexiwave. You'll probably find this list easier to read than the 200 papers in the original program. Let's hope this list will be useful for me as well. To be honest, I didn't manage to read the papers I selected last year from Interspeech 2009.


Unsupervised Discovery and Training of Maximally Dissimilar Cluster Models
Francoise Beaufays (Google)
Vincent Vanhoucke (Google)
Brian Strope (Google)
One of the difficult problems of acoustic modeling for Automatic Speech Recognition (ASR) is how to adequately model the wide variety of acoustic conditions which may be present in the data. The problem is especially acute for tasks such as Google Search by Voice, where the amount of speech available per transaction is small, and adaptation techniques start showing their limitations. As training data from a very large user population is available however, it is possible to identify and jointly model subsets of the data with similar acoustic qualities. We describe a technique which allows us to perform this modeling at scale on large amounts of data by learning a tree-structured partition of the acoustic space, and we demonstrate that we can significantly improve recognition accuracy in various conditions through unsupervised Maximum Mutual Information (MMI) training. Being fully unsupervised, this technique scales easily to increasing numbers of conditions.

Techniques for topic detection based processing in spoken dialog systems
Rajesh Balchandran (IBM T J Watson Research Center)
Leonid Rachevsky (IBM T J Watson Research Center)
Bhuvana Ramabhadran (IBM T J Watson Research Center)
Miroslav Novak (IBM T J Watson Research Center)
In this paper we explore various techniques for topic detection in the context of conversational spoken dialog systems and also propose variants over known techniques to address the constraints of memory, accuracy and scalability associated with their practical implementation of spoken dialog systems. Tests were carried out on a multiple-topic spoken dialog system to compare and analyze these techniques. Results show benefits and compromises with each approach suggesting that the best choice of technique for topic detection would be dependent on the specific deployment requirements.

A Hybrid Approach to Robust Word Lattice Generation Via Acoustic-Based Word Detection
Icksang Han (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Chiyoun Park (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongmi Cho (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongsu Kim (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
A large-vocabulary continuous speech recognition (LVCSR) system usually utilizes a language model in order to reduce the complexity of the algorithm. However, the constraint also produces side-effects including low accuracy of the out-of-grammar sentences and the error propagation of misrecognized words. In order to compensate for the side-effects of the language model, this paper proposes a novel lattice generation method that adopts the idea from the keyword detection method. By combining the word candidates detected mainly from the acoustic aspect of the signal to the word lattice from the ordinary speech recognizer, a hybrid lattice is constructed. The hybrid lattice shows 33% improvement in terms of the lattice accuracy under the condition where the lattice density is the same. In addition, it is observed that the proposed model shows less sensitivity to the out-of-grammar sentences and to the error propagation due to misrecognized words.

Time Condition Search in Automatic Speech Recognition Reconsidered
David Nolden (RWTH Aachen)
Hermann Ney (RWTH Aachen)
Ralf Schlueter (RWTH Aachen)
In this paper we re-investigate the time conditioned search (TCS) method in comparison to the well known word conditioned search, and analyze its applicability on state-of-the-art large vocabulary continuous speech recognition tasks. In contrast to current standard approaches, time conditioned search offers theoretical advantages particularly in combination with huge vocabularies and huge language models, but it is difficult to combine with across word modelling, which was proven to be an important technique in automatic speech recognition. Our novel contributions for TCS are a pruning step during the recombination called Early Word End Pruning, an additional recombination technique called Context Recombination, the idea of a Startup Interval to reduce the number of started trees, and a mechanism to combine TCS with across word modelling. We show that, with these techniques, TCS can outperform WCS on a current task.

Direct Construction of Compact Context-Dependency Transducers From Data
David Rybach (RWTH Aachen University, Germany)
Michael Riley (Google Inc., USA)
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable.

On the relation of Bayes Risk, Word Error, and Word Posteriors in ASR
Ralf Schlueter (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Markus Nussbaum-Thom (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Hermann Ney (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
In automatic speech recognition, we are faced with a well-known inconsistency: Bayes decision rule is usually used to minimize sentence (word sequence) error, whereas in practice we want to minimize word error, which also is the usual evaluation measure. Recently, a number of speech recognition approaches to approximate Bayes decision rule with word error (Levenshtein/edit distance) cost were proposed. Nevertheless, experiments show that the decisions often remain the same and that the effect on the word error rate is limited, especially at low error rates. In this work, further analytic evidence for these observations is provided. A set of conditions is presented, for which Bayes decision rule with sentence and word error cost function leads to the same decisions. Furthermore, the case of word error cost is investigated and related to word posterior probabilities. The analytic results are verified experimentally on several large vocabulary speech recognition tasks.
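
A toy illustration of the two decision rules discussed in this abstract (my own sketch, not the authors' setup; the N-best lists and their posteriors are made up). `map_decision` minimizes sentence error by taking the most probable hypothesis, while `mbr_decision` minimizes the expected word-level edit distance over the same list:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def map_decision(nbest):
    """Sentence-error (0/1) cost: pick the hypothesis with the highest posterior."""
    return max(nbest, key=lambda item: item[1])[0]


def mbr_decision(nbest):
    """Word-error cost: pick the hypothesis with the lowest expected edit distance."""
    def risk(candidate):
        return sum(p * edit_distance(ref.split(), candidate.split())
                   for ref, p in nbest)
    return min((hyp for hyp, _ in nbest), key=risk)


# Hypothetical N-best lists as (hypothesis, posterior) pairs.
flat = [("x y z", 0.40), ("a b c", 0.35), ("a b d", 0.25)]
peaked = [("x y z", 0.60), ("a b c", 0.25), ("a b d", 0.15)]

print(map_decision(flat), "|", mbr_decision(flat))      # x y z | a b c
print(map_decision(peaked), "|", mbr_decision(peaked))  # x y z | x y z
```

With the flat posteriors the two rules disagree, because two similar hypotheses together outweigh the single best one. Once one hypothesis holds more than half of the probability mass, both rules pick it, which fits the observation that the decisions often stay the same at low error rates.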

Efficient Data Selection for Speech Recognition Based on Prior Confidence Estimation Using Speech and Context Independent Models
Satoshi KOBASHIKAWA (NTT Cyber Space Laboratories, NTT Corporation)
Taichi ASAMI (NTT Cyber Space Laboratories, NTT Corporation)
Yoshikazu YAMAGUCHI (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu MASATAKI (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi TAKAHASHI (NTT Cyber Space Laboratories, NTT Corporation)
This paper proposes an efficient data selection technique to identify well recognized texts in massive volumes of speech data. Conventional confidence measure techniques can be used to obtain this accurate data, but they require speech recognition results to estimate confidence. Without a significant level of confidence, considerable computer resources are wasted since inaccurate recognition results are generated only to be rejected later. The technique proposed herein rapidly estimates the prior confidence based on just an acoustic likelihood calculation by using speech and context independent models before speech recognition processing; it then recognizes data with high confidence selectively. Simulations show that it matches the data selection performance of the conventional posterior confidence measure with less than 2 % of the computation time.

Discovering an Optimal Set of Minimally Contrasting Acoustic Speech Units: A Point of Focus for Whole-Word Pattern Matching
Guillaume Aimetti (University of Sheffield)
Roger Moore (University of Sheffield)
Louis ten Bosch (Radboud University)
This paper presents a computational model that can automatically learn words, made up from emergent sub-word units, with no prior linguistic knowledge. This research is inspired by current cognitive theories of human speech perception, and therefore strives for ecological plausibility with the desire to build more robust speech recognition technology. Firstly, the particulate structure of the raw acoustic speech signal is derived through a novel acoustic segmentation process, the 'acoustic DP-ngram algorithm'. Then, using a cross-modal association learning mechanism, word models are derived as a sequence of the segmented units. An efficient set of sub-word units emerges as a result of a general purpose lossy compression mechanism and the algorithm's sensitivity to discriminate acoustic differences. The results show that the system can automatically derive robust word representations and dynamically build re-usable sub-word acoustic units with no pre-defined language-specific rules.

Modeling pronunciation variation using context-dependent articulatory feature decision trees
Samuel Bowman (Linguistics, The University of Chicago)
Karen Livescu (TTI-Chicago)
We consider the problem of predicting the surface pronunciations of a word in conversational speech, using a feature-based model of pronunciation variation. We build context-dependent decision trees for both phone-based and feature-based models, and compare their perplexities on conversational data from the Switchboard Transcription Project. We find that feature-based decision trees using feature bundles based on articulatory phonology outperform phone-based decision trees, and are much more robust to reductions in training data. We also analyze the usefulness of various context variables.

Accelerating Hierarchical Acoustic Likelihood Computation on Graphics Processors
Pavel Kveton (IBM)
Miroslav Novak (IBM)
The paper presents a method for performance improvements of a speech recognition system by moving a part of the computation - acoustic likelihood computation - onto a Graphics Processor Unit (GPU). In the system, GPU operates as a low cost powerful coprocessor for linear algebra operations. The paper compares GPU implementation of two techniques of acoustic likelihood computation: full Gaussian computation of all components and a significantly faster Gaussian selection method using hierarchical evaluation. The full Gaussian computation is an ideal candidate for GPU implementation because of its matrix multiplication nature. The hierarchical Gaussian computation is a technique commonly used on a CPU since it leads to much better performance by pruning the computation volume. Pruning techniques are generally much harder to implement on GPUs, nevertheless, the paper shows that hierarchical Gaussian computation can be efficiently implemented on GPUs.
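
For the "matrix multiplication nature" of full Gaussian computation mentioned in this abstract, here is a small sketch of the usual trick as I understand it. It assumes diagonal covariances (my assumption, the abstract does not say), and the function names are mine. Each Gaussian is packed into one parameter row and each frame is expanded to [x^2, x, 1], so scoring all Gaussians becomes a matrix-vector product:

```python
import math


def log_gauss_direct(x, mean, var):
    """Reference: log-density of a diagonal-covariance Gaussian, computed directly."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gauss_to_row(mean, var):
    """Pack one Gaussian into a row [-0.5/var, mean/var, const]."""
    const = -0.5 * sum(math.log(2 * math.pi * v) + m * m / v
                       for m, v in zip(mean, var))
    return [-0.5 / v for v in var] + [m / v for m, v in zip(mean, var)] + [const]


def expand_frame(x):
    """Expand a feature vector to [x^2, x, 1] so each score is a dot product."""
    return [xi * xi for xi in x] + list(x) + [1.0]


def log_gauss_all(x, rows):
    """Score every Gaussian for one frame: a matrix-vector product."""
    ext = expand_frame(x)
    return [sum(p * e for p, e in zip(row, ext)) for row in rows]


# Two made-up Gaussians over 2-dimensional features.
gaussians = [([0.0, 1.0], [1.0, 2.0]), ([0.5, -1.0], [0.5, 0.25])]
rows = [gauss_to_row(mean, var) for mean, var in gaussians]
frame = [0.3, -0.7]
scores = log_gauss_all(frame, rows)
```

Stacking many frames gives a plain matrix-matrix product, which is the part that maps well onto a GPU. The hierarchical Gaussian selection from the paper is not shown here.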

The AMIDA 2009 Meeting Transcription System
Thomas Hain (Univ Sheffield)
Lukas Burget (Brno Univ. of Technology)
John Dines (Idiap)
Philip N. Garner (Idiap)
Asmaa El Hannani (Univ. Sheffield)
Marijn Huijbregts (Univ. Twente)
Martin Karafiat (Brno Univ. of Technology)
Mike Lincoln (Univ. of Edinburgh)
Vincent Wan (Univ. of Sheffield)
We present the AMIDA 2009 system for participation in the NIST RT'2009 STT evaluations. Systems for close-talking, far field and speaker attributed STT conditions are described. Improvements to our previous systems are: segmentation and diarisation; stacked bottle-neck posterior feature extraction; fMPE training of acoustic models; adaptation on complete meetings; improvements to WFST decoding; automatic optimisation of decoders and system graphs. Overall these changes gave a 6-13% relative reduction in word error rate while at the same time reducing the real-time factor by a factor of five and using considerably less data for acoustic model training.

A Factorial Sparse Coder Model for Single Channel Source Separation
Robert Peharz (Graz University of Technology)
Michael Stark (Graz University of Technology)
Franz Pernkopf (Graz University of Technology)
Yannis Stylianou (University of Crete)
We propose a probabilistic factorial sparse coder model for single channel source separation in the magnitude spectrogram domain. The mixture spectrogram is assumed to be the sum of the sources, which are assumed to be generated frame-wise as the output of sparse coders plus noise. For dictionary training we use an algorithm which can be described as non-negative matrix factorization with ℓ0 sparseness constraints. In order to infer likely source spectrogram candidates, we approximate the intractable exact inference by maximizing the posterior over a plausible subset of solutions. We compare our system to the factorial-max vector quantization model, where the proposed method shows a superior performance in terms of signal-to-interference ratio. Finally, the low computational requirements of the algorithm allow close to real time applications.

Oriented PCA Method for Blind Speech Separation of Convolutive Mixtures
Yasmina Benabderrahmane (INRS-EMT Telecommunications Canada)
Sid Ahmed Selouani (Université de Moncton Canada)
Douglas O’Shaughnessy (INRS-EMT Telecommunications Canada)
This paper deals with blind speech separation of convolutive mixtures of sources. The separation criterion is based on Oriented Principal Components Analysis (OPCA) in the frequency domain. OPCA is a (second order) extension of standard Principal Component Analysis (PCA) aiming at maximizing the power ratio of a pair of signals. The convolutive mixing is obtained by modeling the Head Related Transfer Function (HRTF). Experimental results show the efficiency of the proposed approach in terms of subjective and objective evaluation, when compared to the Degenerate Unmixing Evaluation Technique (DUET) and the widely used C-FICA (Convolutive Fast-ICA) algorithm.

Speaker Adaptation Based on System Combination Using Speaker-Class Models
Tetsuo Kosaka (Yamagata University)
Takashi Ito (Yamagata University)
Masaharu Kato (Yamagata University)
Masaki Kohda (Yamagata University)
In this paper, we propose a new system combination approach for an LVCSR system using speaker-class (SC) models and a speaker adaptation technique based on these SC models. The basic concept of the SC-based system is to select speakers who are acoustically similar to a target speaker to train acoustic models. One of the major problems regarding the use of the SC model is determining the selection range of the speakers. In other words, it is difficult to determine the number of speakers that should be selected. In order to solve this problem, several SC models, which are trained by a variety of number of speakers are prepared in advance. In the recognition step, acoustically similar models are selected from the above SC models, and the scores obtained from these models are merged using a word graph combination technique. The proposed method was evaluated using the Corpus of Spontaneous Japanese (CSJ), and showed significant improvement in a lecture speech recognition task.

Feature versus Model Based Noise Robustness
Kris Demuynck (Katholieke Universiteit Leuven, dept. ESAT)
Xueru Zhang (Katholieke Universiteit Leuven, dept. ESAT)
Dirk Van Compernolle (Katholieke Universiteit Leuven, dept. ESAT)
Hugo Van hamme (Katholieke Universiteit Leuven, dept. ESAT)
Over the years, the focus in noise robust speech recognition has shifted from noise robust features to model based techniques such as parallel model combination and uncertainty decoding. In this paper, we contrast prime examples of both approaches in the context of large vocabulary recognition systems such as used for automatic audio indexing and transcription. We look at the approximations the techniques require to keep the computational load reasonable, the resulting computational cost, and the accuracy measured on the Aurora4 benchmark. The results show that a well designed feature based scheme is capable of providing recognition accuracies at least as good as the model based approaches at a substantially lower computational cost.

The role of higher-level linguistic features in HMM-based speech synthesis
Oliver Watts (Centre for Speech Technology Research, University of Edinburgh, UK)
Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (Centre for Speech Technology Research, University of Edinburgh, UK)
We analyse the contribution of higher-level elements of the linguistic specification of a data-driven speech synthesiser to the naturalness of the synthetic speech which it generates. The system is trained using various subsets of the full feature-set, in which features relating to syntactic category, intonational phrase boundary, pitch accent and boundary tones are selectively removed. Utterances synthesised by the different configurations of the system are then compared in a subjective evaluation of their naturalness. The work presented forms background analysis for an on-going set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be trivially extracted from text. By building a range of systems, each assuming the availability of a different level of linguistic annotation, we obtain benchmarks for our on-going work.

Latent Perceptual Mapping: A New Acoustic Modeling Framework for Speech Recognition
Shiva Sundaram (Deutsche Telekom Laboratories, Ernst-Reuter-Platz-7, Berlin 10587. Germany)
Jerome Bellegarda (Apple Inc., 3 Infinite Loop, Cupertino, 95014 California. USA.)
While hidden Markov modeling is still the dominant paradigm for speech recognition, in recent years there has been renewed interest in alternative, template-like approaches to acoustic modeling. Such methods sidestep usual HMM limitations as well as inherent issues with parametric statistical distributions, though typically at the expense of large amounts of memory and computing power. This paper introduces a new framework, dubbed latent perceptual mapping, which naturally leverages a reduced dimensionality description of the observations. This allows for a viable parsimonious template-like solution where models are closely aligned with perceived acoustic events. Context-independent phoneme classification experiments conducted on the TIMIT database suggest that latent perceptual mapping achieves results comparable to conventional acoustic modeling but at potentially significant savings in online costs.

State-based labelling for a sparse representation of speech and its application to robust speech recognition
Tuomas Virtanen (Department of Signal Processing, Tampere University of Technology, Finland)
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Antti Hurmalainen (Department of Signal Processing, Tampere University of Technology, Finland)
This paper proposes a state-based labeling for acoustic patterns of speech and a method for using this labelling in noise-robust automatic speech recognition. Acoustic time-frequency segments of speech, exemplars, are obtained from a training database and associated with time-varying state labels using the transcriptions. In the recognition phase, noisy speech is modeled by a sparse linear combination of noise and speech exemplars. The likelihoods of states are obtained by linear combination of the exemplar weights, which can then be used to estimate the most likely state transition path. The proposed method was tested in the connected digit recognition task with noisy speech material from the Aurora-2 database where it is shown to produce better results than the existing histogram-based labeling method.

Single-channel speech enhancement using Kalman filtering in the modulation domain
Stephen So (Signal Processing Laboratory, Griffith University)
Kamil K. Wojcicki (Signal Processing Laboratory, Griffith University)
Kuldip K. Paliwal (Signal Processing Laboratory, Griffith University)
In this paper, we propose the modulation-domain Kalman filter (MDKF) for speech enhancement. In contrast to previous modulation domain-enhancement methods based on bandpass filtering, the MDKF is an adaptive and linear MMSE estimator that uses models of the temporal changes of the magnitude spectrum for both speech and noise. Also, because the Kalman filter is a joint magnitude and phase spectrum estimator, under non-stationarity assumptions, it is highly suited for modulation-domain processing, as modulation phase tends to contain more speech information than acoustic phase. Experimental results from the NOIZEUS corpus show the ideal MDKF (with clean speech parameters) to outperform all the acoustic and time-domain enhancement methods that were evaluated, including the conventional time-domain Kalman filter with clean speech parameters. A practical MDKF that uses the MMSE-STSA method to enhance noisy speech in the acoustic domain prior to LPC analysis was also evaluated and showed promising results.

Metric Subspace Indexing for Fast Spoken Term Detection
Taisuke Kaneko (Toyohashi University of Technology)
Tomoyosi Akiba (Toyohashi University of Technology)
In this paper, we propose a novel indexing method for Spoken Term Detection (STD). The proposed method can be considered as using metric space indexing for the approximate string-matching problem, where the distance between a phoneme and a position in the target spoken document is defined. The proposed method does not require the use of thresholds to limit the output, instead being able to output the results in increasing order of distance. It can also deal easily with the multiple candidates obtained via Automatic Speech Recognition (ASR). The results of preliminary experiments show promise for achieving fast STD.

Discriminative Language Modeling Using Simulated ASR Errors
Preethi Jyothi (Department of Computer Science and Engineering, The Ohio State University, USA)
Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University, USA)
In this paper, we approach the problem of discriminatively training language models using a weighted finite state transducer (WFST) framework that does not require acoustic training data. The phonetic confusions prevalent in the recognizer are modeled using a confusion matrix that takes into account information from the pronunciation model (word-based phone confusion log likelihoods) and information from the acoustic model (distances between the phonetic acoustic models). This confusion matrix, within the WFST framework, is used to generate confusable word graphs that serve as inputs to the averaged perceptron algorithm to train the parameters of the discriminative language model. Experiments on a large vocabulary speech recognition task show significant word error rate reductions when compared to a baseline using a trigram model trained with the maximum likelihood criterion.

Learning a Language Model from Continuous Speech
Graham Neubig (Graduate School of Informatics, Kyoto University)
Masato Mimura (Graduate School of Informatics, Kyoto University)
Shinsuke Mori (Graduate School of Informatics, Kyoto University)
Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)
This paper presents a new approach to language model construction, learning a language model not from text, but directly from continuous speech. A phoneme lattice is created using acoustic model scores, and Bayesian techniques are used to robustly learn a language model from this noisy input. A novel sampling technique is devised that allows for the integrated learning of word boundaries and an n-gram language model with no prior linguistic knowledge. The proposed techniques were used to learn a language model directly from continuous, potentially large-vocabulary speech. This language model was able to significantly reduce the ASR phoneme error rate over a separate set of test data, and the proposed lattice processing and lexical acquisition techniques were found to be important factors in this improvement.

New Insights into Subspace Noise Tracking
Mahdi Triki (Philips Research Laboratories)
Various speech enhancement techniques rely on the knowledge of the clean signal and noise statistics. In practice, however, these statistics are not explicitly available, and the overall enhancement accuracy critically depends on the estimation quality of the unknown statistics. The estimation of noise (and speech) statistics is particularly challenging under non-stationary noise conditions. In this respect, subspace-based approaches have been shown to provide a good tracking vs. final misadjustment tradeoff. Subspace-based techniques hinge critically on both rank-limited and spherical assumptions of the speech and the noise DFT matrices, respectively. The speech rank-limited assumption was previously experimentally tested and validated. In this paper, we will investigate the structure of nuisance sources. We will discuss the validity of the spherical assumption for a variety of nuisance sources (environmental noise, reverberation), and preprocessing (overlapping segmentation).

Acoustic Correlates of Meaning Structure in Conversational Speech
Alexei V. Ivanov (DISI, University of Trento, Italy)
Giuseppe Riccardi (DISI, University of Trento, Italy)
Sucheta Ghosh (DISI, University of Trento, Italy)
Sara Tonelli (FBK-IRST, Trento, Italy)
Evgeny Stepanov (DISI, University of Trento, Italy)
We are interested in the problem of extracting meaning structures from spoken utterances in human communication. In SLU systems, parsing of meaning structures is carried over the word hypotheses generated by the ASR. This approach suffers from high word error rates and ad-hoc conceptual representations. In contrast, in this paper we aim at discovering meaning components from direct measurements of acoustic and non-verbal linguistic features. The meaning structures are taken from the frame semantics model proposed in FrameNet. We give a quantitative analysis of meaning structures in terms of speech features across human--human dialogs from the manually annotated LUNA corpus. We show that the acoustic correlations between pitch, formant trajectories, intensity and harmonicity and meaning features are statistically significant over the whole corpus as well as relevant in classifying the target words evoked by a semantic frame.

Using Harmonic Phase Information to Improve ASR Rate
Ibon Saratxaga (Aholab Signal Processing Laboratory, University of the Basque Country)
Inma Hernáez (Aholab Signal Processing Laboratory, University of the Basque Country)
Igor Odriozola (Aholab Signal Processing Laboratory, University of the Basque Country)
Eva Navas (Aholab Signal Processing Laboratory, University of the Basque Country)
Iker Luengo (Aholab Signal Processing Laboratory, University of the Basque Country)
Daniel Erro (Aholab Signal Processing Laboratory, University of the Basque Country)
Spectral phase information is usually discarded in automatic speech recognition (ASR). The Relative Phase Shift (RPS), a novel representation of the phase information of the speech, has features which seem to be appropriate to improve the ASR recognition rate. In this paper we describe the RPS representation, discuss different ways to parameterize this information in a suitable way for the HMM modelling, and present the results of the evaluation experiments. WER improvements ranging from 12 to 22% open promising perspectives for the use of this information jointly with the classical MFCC parameterization.

Using Dependency Parsing and Machine Learning for Factoid Question Answering on Spoken Documents
Pere R. Comas (TALP Research Center, Technical University of Catalonia (UPC))
Lluís Màrquez (TALP Research Center, Technical University of Catalonia (UPC))
Jordi Turmo (TALP Research Center, Technical University of Catalonia (UPC))
This paper presents our experiments in question answering for speech corpora. These experiments focus on improving the answer extraction step of the QA process. We present two approaches to answer extraction in question answering for speech corpora that apply machine learning to improve the coverage and precision of the extraction. The first one is a reranker that uses only lexical information, the second one uses dependency parsing to score similarity between syntactic structures. Our experimental results show that the proposed learning models improve our previous results using only hand-made ranking rules with small syntactic information. We evaluate the system on manual transcripts of speech from EPPS English corpus and a set of questions transcribed from spontaneous oral questions. This data belongs to the CLEF 2009 evaluation track on QA on speech transcripts (QAst).

A Spoken Term Detection Framework for Recovering Out-of-Vocabulary Words Using the Web
Carolina Parada (Johns Hopkins University)
Abhinav Sethy (IBM TJ Watson Research Center)
Mark Dredze (Johns Hopkins University)
Frederick Jelinek (Johns Hopkins University)
Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into the system output, recovering up to 40% of the OOV terms and resulting in a reduction in system error.

Boosting Systems for LVCSR
George Saon (IBM T.J. Watson Research Center)
Hagen Soltau (IBM T.J. Watson Research Center)
We employ a variant of the popular Adaboost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied with the frame-level statistics for the decision trees and Gaussian mixture components of the next iteration system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups which differ in the language being spoken (English and Arabic) and amounts of training data. Our findings suggest that significant gains can be obtained for small amounts of training data even after feature and model-space discriminative training.
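
The reweighting and combination steps described in this abstract can be sketched roughly like this. It is a toy illustration only, not the authors' implementation: the decay factor `beta`, the renormalization, the function names and the equal combination weights are all assumptions made for the example.

```python
import math

def reweight_frames(weights, correct, beta=0.5):
    """Decrease the weight of frames the current model decoded correctly.

    beta (0 < beta < 1) is a hypothetical decay factor; the abstract only
    says that the weights are decreased for correctly decoded frames.
    """
    new = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(new)
    # Renormalize so the weights still sum to the number of frames (assumption).
    return [w * len(new) / total for w in new]

def combine_log_likelihoods(log_likes, model_weights):
    """Log-linear combination: weighted sum of per-model state log-likelihoods."""
    return sum(lam * ll for lam, ll in zip(model_weights, log_likes))

# Four frames, the first two decoded correctly by the current model.
weights = reweight_frames([1.0, 1.0, 1.0, 1.0], [True, True, False, False])

# Two models scoring the same HMM state for one frame, equal weights.
score = combine_log_likelihoods([math.log(0.2), math.log(0.4)], [0.5, 0.5])
```

In the paper the new weights multiply the frame-level statistics used to build the decision trees and Gaussians of the next model; here they are only computed, not used for training.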

Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families
Vaibhava Goel (IBM T.J. Watson Research Center)
Tara Sainath (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
Peder Olsen (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)
Dimitri Kanevsky (IBM T.J. Watson Research Center)
Sparse representation phone identification features (SPIF) is a recently developed technique to obtain an estimate of phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates in large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights.

Improving ASR-based topic segmentation of TV programs with confidence measures and semantic relations
Camille Guinaudeau (INRIA/IRISA)
Guillaume Gravier (IRISA/CNRS)
Pascale Sébillot (IRISA/INSA)
The increasing quantity of video material requires methods to help users navigate such data, among which topic segmentation techniques. The goal of this article is to improve ASR-based topic segmentation methods to deal with peculiarities of professional-video transcripts (transcription errors and lack of repetitions) while remaining generic enough. To this end, we introduce confidence measures and semantic relations in a segmentation method based on lexical cohesion. We show significant improvements of the F1-measure, +1.7 and +1.9 when integrating confidence measures and semantic relations respectively. Such improvement demonstrates that simple clues can counteract errors in automatic transcripts and lack of repetitions.

A Novel text-independent phonetic segmentation algorithm based on the Microcanonical Multiscale Formalism
Vahid Khanagha (INRIA Bordeaux Sud-Ouest)
Khalid Daoudi (INRIA Bordeaux Sud-Ouest)
Oriol Pont (INRIA Bordeaux Sud-Ouest)
Hussein Yahia (INRIA Bordeaux Sud-Ouest)
We propose a radically novel approach to analyze speech signals from a statistical physics perspective. Our approach is based on a new framework, the Microcanonical Multiscale Formalism (MMF), which is based on the computation of singularity exponents, defined at each point in the signal domain. The latter allows nonlinear analysis of complex dynamics and, particularly, characterizes the intermittent signature. We study the validity of the MMF for the speech signal and show that singularity exponents convey indeed valuable information about its local dynamics. We define an accumulative measure on the exponents which reveals phoneme boundaries as the breaking points of a piecewise linear-like curve. We then develop a simple automatic phonetic segmentation algorithm using piecewise linear curve fitting. We present experiments on the full TIMIT database. The results show that our algorithm yields considerably better accuracy than recently published ones.
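
To make the "breaking points of a piecewise linear curve" idea concrete, here is a minimal sketch that accumulates a sequence of values and finds the single split where two straight-line fits have the lowest total squared error. The brute-force search, the function names and the toy exponent values are my assumptions; the abstract does not describe the actual fitting procedure or how singularity exponents are computed.

```python
def accumulate(values):
    """Running sum of the per-sample values (the 'accumulative measure')."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out

def line_sse(xs, ys):
    """Sum of squared errors of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def best_breakpoint(ys, min_len=2):
    """Index k that best splits ys into two linear pieces ys[:k] and ys[k:]."""
    xs = list(range(len(ys)))
    best_k, best_err = None, float("inf")
    for k in range(min_len, len(ys) - min_len + 1):
        err = line_sse(xs[:k], ys[:k]) + line_sse(xs[k:], ys[k:])
        if err < best_err:
            best_k, best_err = k, err
    return best_k

# Toy data: the local value jumps from 0.1 to 0.9 halfway through.
exponents = [0.1] * 5 + [0.9] * 5
curve = accumulate(exponents)
boundary = best_breakpoint(curve)
```

On this toy curve the kink sits at the fifth sample, so the split lands right next to it. A real segmenter would need to find many breakpoints, not one.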

Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms
Jike Chong (University of California, Berkeley; Parasians, LLC)
Ekaterina Gonina (University of California, Berkeley)
Kisun You (Seoul National University)
Kurt Keutzer (University of California, Berkeley)
The emergence of highly parallel computing platforms is enabling new trade-offs in algorithm design for automatic speech recognition. It naturally motivates the following investigation: Do the most computationally efficient sequential algorithms lead to the most computationally efficient parallel algorithms? In this paper we explore two contending recognition network representations for speech inference engines: the linear lexical model (LLM) and the weighted finite state transducer (WFST). We demonstrate that while an inference engine using the simpler LLM representation evaluates 22x more transitions per second than the advanced WFST representation, the simple structure of the LLM representation allows 4.7-6.4x faster evaluation and 53-65x faster operand-gathering for each state transition. We use the 5k Wall Street Journal Corpus to experiment on the NVIDIA GTX480 (Fermi) and the NVIDIA GTX285 Graphics Processing Units (GPUs), and illustrate that the performance of a speech inference engine based on the LLM representation is competitive with the WFST representation on highly parallel implementation platforms.

Speech Recognizer Optimization under Speed Constraints
Ivan Bulyko (Raytheon BBN Technologies)
We present an efficient algorithm for optimizing parameters of a speech recognizer aimed at obtaining maximum accuracy at a specified decoding speed. This algorithm is not tied to any particular decoding architecture or type of tunable parameter being used. It can also be applied to any performance metric (e.g. WER, keyword search or topic ID accuracy) and thus allows tuning to the target application. We demonstrate the effectiveness of this approach by tuning BBN’s Byblos recognizer to run at 15 times faster than real time while maximizing keyword search accuracy.

The 2010 CMU GALE Speech-to-Text System
Florian Metze (Carnegie Mellon University)
Roger Hsiao (Carnegie Mellon University)
Qin Jin (Carnegie Mellon University)
Udhyakumar Nallasamy (Carnegie Mellon University)
Tanja Schultz (Karlsruhe Institute of Technology)
This paper describes the latest Speech-to-Text system developed for the Global Autonomous Language Exploitation ("GALE") domain by Carnegie Mellon University (CMU). This system uses discriminative training, bottle-neck features and other techniques that were not used in previous versions of our system, and is trained on 1150 hours of data from a variety of Arabic speech sources. In this paper, we show how different lexica, pre-processing, and system combination techniques can be used to improve the final output, and provide analysis of the improvements achieved by the individual techniques.

The RWTH 2009 Quaero ASR Evaluation System for English and German
Markus Nußbaum-Thom (RWTH Aachen University)
Simon Wiesler (RWTH Aachen University)
Martin Sundermeyer (RWTH Aachen University)
Christian Plahl (RWTH Aachen University)
Stefan Hahn (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)
In this work, the RWTH automatic speech recognition systems for English and German for the second Quaero evaluation campaign 2009 are presented. The systems are designed to transcribe web data, European parliament plenary sessions and broadcast news data. Another challenge in the 2009 evaluation is that almost no in-domain training data is provided and the test data contains a large variety of speech types. The RWTH participates for the English and German languages with the best results for German and competitive results for the English. Contributing to the enhancements are the systematic use of hierarchical neural network based posterior features, system combination, speaker adaptation, cross speaker adaptation, domain dependent modeling and the usage of additional training data.

The Impact of ASR on Abstractive vs. Extractive Meeting Summaries
Gabriel Murray (University of British Columbia)
Giuseppe Carenini (University of British Columbia)
Raymond Ng (University of British Columbia)
In this paper we describe a complete abstractive summarizer for meeting conversations, and evaluate the usefulness of the automatically generated abstracts in a browsing task. We contrast these abstracts with extracts for use in a meeting browser and investigate the effects of manual versus ASR transcripts on both summary types.

This is a comparison that I feel is wrong
An Empirical Comparison of the T3, Juicer, HDecode and Sphinx3 Decoders
Josef R. Novak (Tokyo Institute of Technology)
Paul R. Dixon (National Institute of Information and Communications Technology)
Sadaoki Furui (Tokyo Institute of Technology)
In this paper we perform a cross-comparison of the T3 WFST decoder against three different speech recognition decoders on three separate tasks of variable difficulty. We show that the T3 decoder performs favorably against several established veterans in the field, including the Juicer WFST decoder, Sphinx3, and HDecode in terms of RTF versus Word Accuracy. In addition to comparing decoder performance, we evaluate both Sphinx and HTK acoustic models on a common footing inside T3, and show that the speed benefits that typically accompany the WFST approach increase with the size of the vocabulary and other input knowledge sources. In the case of T3, we also show that GPU acceleration can significantly extend these gains.

CRF-based Combination of Contextual Features to Improve A Posteriori Word-level Confidence Measures
Julien Fayolle (IRISA/INRIA Rennes, France)
Fabienne Moreau (University of Rennes 2/IRISA Rennes, France)
Christian Raymond (IRISA/INSA Rennes, France)
Guillaume Gravier (IRISA/CNRS Rennes, France)
Patrick Gros (IRISA/INRIA Rennes, France)
This paper addresses the issue of confidence measure reliability provided by automatic speech recognition systems for use in various spoken language processing applications. We propose a method based on conditional random field to combine contextual features to improve word-level confidence measures. The method consists in combining various knowledge sources (acoustic, lexical, linguistic, phonetic and morphosyntactic) to enhance confidence measures, explicitly exploiting context information. Experiments were conducted on a large French broadcast news corpus from the ESTER benchmark. Results demonstrate the added-value of our method with a significant improvement of the normalized cross entropy and of the equal error rate.
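
For reference, the normalized cross entropy (NCE) mentioned at the end is usually computed as below. The sketch assumes the common definition used in confidence evaluation; the abstract itself does not give the formula, and the function name and toy numbers are mine.

```python
import math

def normalized_cross_entropy(confidences, correct):
    """NCE of word confidence scores against correct/incorrect labels.

    Assumes every confidence lies strictly between 0 and 1 and that the
    labels contain at least one correct and one incorrect word.
    """
    n = len(confidences)
    n_correct = sum(correct)
    p = n_correct / n  # prior probability that a word is correct
    # Entropy of the labels when every word gets the same confidence p.
    h_max = -(n_correct * math.log2(p) + (n - n_correct) * math.log2(1 - p))
    # Cross entropy of the labels under the actual confidence scores.
    h_conf = -sum(
        math.log2(c if ok else 1 - c) for c, ok in zip(confidences, correct)
    )
    return (h_max - h_conf) / h_max

labels = [True, True, True, False]
baseline = normalized_cross_entropy([0.75, 0.75, 0.75, 0.75], labels)
informative = normalized_cross_entropy([0.9, 0.9, 0.9, 0.1], labels)
```

A score of 0 means the confidences tell you nothing beyond the average accuracy, 1 is perfect, and negative values mean the confidences are worse than a constant.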

Interesting, do they mention Voxforge in this paper?
Building transcribed speech corpora quickly and cheaply for many languages
Thad Hughes (Google)
Kaisuke Nakajima (Google)
Linne Ha (Google)
Atul Vasu (Google)
Pedro Moreno (Google)
Mike LeBeau (Google)
We present a system for quickly and cheaply building transcribed speech corpora containing utterances from many speakers in a variety of acoustic conditions. The system consists of a client application running on an Android mobile device with an intermittent Internet connection to a server. The client application collects demographic information about the speaker, fetches textual prompts from the server for the speaker to read, records the speaker’s voice, and uploads the audio and associated metadata to the server. The system has so far been used to collect over 3000 hours of transcribed audio in 17 languages around the world.

On Generating Combilex Pronunciations via Morphological Analysis
Korin Richmond (Centre for Speech Technology Research, Edinburgh University)
Robert Clark (Centre for Speech Technology Research)
Sue Fitt (Centre for Speech Technology Research)
Combilex is a high-quality lexicon that has been developed specifically for speech technology purposes and recently released by CSTR. Combilex benefits from many advanced features. This paper explores one of these: the ability to generate fully-specified transcriptions for morphologically derived words automatically. This functionality was originally implemented to encode the pronunciations of derived words in terms of their constituent morphemes, thus accelerating lexicon development and ensuring a high level of consistency. In this paper, we propose this method of modelling pronunciations can be exploited further by combining it with a morphological parser, thus yielding a method to generate full transcriptions for unknown derived words. Not only could this accelerate adding new derived words to Combilex, but it could also serve as an alternative to conventional letter-to-sound rules. This paper presents preliminary work indicating this is a promising direction.

Content-Based Advertisement Detection
Patrick Cardinal (CRIM)
Vishwa Gupta (CRIM)
Gilles Boulianne (CRIM)
Television advertising is widely used by companies to promote their products among the public but it is hard for an advertiser to know if its advertisements are broadcast as they should be. For this reason, some companies are specialized in the monitoring of audio/video streams for validating that ads are broadcast according to what was requested and paid for by the advertiser. The procedure for searching specific ads in an audio stream is very similar to the copy detection task for which we have developed very efficient algorithms. This work reports results of applying our copy detection algorithms to the advertisement detection task. Compared to commercial software, we detected 18% more advertisements and the system runs at 0.003x of real-time.

Continuous Speech Recognition with a TF-IDF Acoustic Model
Geoffrey Zweig (Microsoft)
Patrick Nguyen (Microsoft)
Jasha Droppo (Microsoft)
Alex Acero (Microsoft)
Information retrieval methods are frequently used for indexing and retrieving spoken documents, and more recently have been proposed for voice-search amongst a pre-defined set of business entries. In this paper, we show that these methods can be used in an even more fundamental way, as the core component in a continuous speech recognizer. Speech is initially processed and represented as a sequence of discrete symbols, specifically phoneme or multi-phone units. Recognition then operates on this sequence. The recognizer is segment-based, and the acoustic score for labeling a segment with a word is based on the TF-IDF similarity between the subword units detected in the segment, and those typically seen in association with the word. We present promising results on both a voice search task and the Wall Street Journal task. The development of this method brings us one step closer to being able to do speech recognition based on the detection of sub-word audio attributes.
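
The core scoring idea, matching the subword units detected in a segment against the units typically seen with each word, can be sketched with plain TF-IDF and cosine similarity. This is only an illustration under my own assumptions: single phonemes as units, a smoothed IDF formula, cosine as the similarity, and made-up word profiles. The paper's actual units, weighting and segment search are not given in the abstract.

```python
import math
from collections import Counter

def idf_table(word_profiles):
    """Smoothed inverse document frequency, treating each word as a document."""
    n = len(word_profiles)
    df = Counter(u for units in word_profiles.values() for u in set(units))
    return {u: math.log((1 + n) / (1 + d)) + 1.0 for u, d in df.items()}

def tfidf_vector(units, idf):
    """Sparse TF-IDF vector; units never seen in any profile get weight 0."""
    return {u: c * idf.get(u, 0.0) for u, c in Counter(units).items()}

def cosine(a, b):
    dot = sum(v * b.get(u, 0.0) for u, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_word(segment_units, word_profiles):
    """Score every word against the segment and return the best match."""
    idf = idf_table(word_profiles)
    seg = tfidf_vector(segment_units, idf)
    scores = {
        w: cosine(seg, tfidf_vector(units, idf))
        for w, units in word_profiles.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical profiles: phoneme units typically detected for each word.
profiles = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "dog": ["d", "ao", "g"],
}
word, scores = best_word(["k", "ae", "t"], profiles)
```

Here the segment matches "cat" exactly, partly overlaps with "cab" and shares nothing with "dog".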

Improved topic classification and keyword discovery using an HMM-based speech recognizer trained without supervision
Man-Hung Siu (Raytheon BBN Technologies)
Herbert Gish (Raytheon BBN Technologies)
Arthur Chan (Raytheon BBN Technologies)
William Belfield (Raytheon BBN Technologies)
In our previous publication, we presented a new approach to HMM training, viz., training without supervision. We used an HMM trained without supervision for transcribing audio into self-organized units (SOUs) for the purpose of topic classification. In this paper we report improvements made to the system, including the use of context dependent acoustic models and lattice based features that together reduce the topic verification equal error rate from 12% to 7%. In addition to discussing the effectiveness of the SOU approach we describe how we analyzed some selected SOU n-grams and found that they were highly correlated with keywords, demonstrating the ability of the SOU technology to discover topic relevant keywords.