Techniques for Noise Robustness in Automatic Speech Recognition

Special-order title | Delivery time: available within 10 working days

175,26 €*

All prices incl. VAT | Free shipping
ISBN-13: 9781119970880
Published: 2012
Publication date: 28.11.2012
Pages: 514
Author: Tuomas Virtanen
Weight: 907 g
Dimensions: 249 x 175 x 28 mm
Language: English
Description:

Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the commonplace environments where these systems are used are noisy, for example users calling up a voice search system from a busy cafeteria or a street. This can result in degraded speech recordings and adversely affect the performance of speech recognition systems. As the use of ASR systems increases, knowledge of the state of the art in techniques to deal with such problems becomes critical to system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of the state of the art in techniques used to improve the robustness of speech recognition systems to these degrading external influences.

Key features:

* Reviews all the main noise-robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing-data techniques and recognition of reverberant speech.
* Acts as a timely exposition of the topic in light of more widespread future use of ASR technology in challenging environments.
* Addresses robustness issues and signal degradation, both key concerns for practitioners of ASR.
* Includes contributions from top ASR researchers at leading research units in the field.
Table of Contents:

List of Contributors xv
Acknowledgments xvii

1 Introduction (Tuomas Virtanen, Rita Singh, Bhiksha Raj) 1
1.1 Scope of the Book 1
1.2 Outline 2
1.3 Notation 4

Part One: FOUNDATIONS

2 The Basics of Automatic Speech Recognition (Rita Singh, Bhiksha Raj, Tuomas Virtanen) 9
2.1 Introduction 9
2.2 Speech Recognition Viewed as Bayes Classification 10
2.3 Hidden Markov Models 11
2.3.1 Computing Probabilities with HMMs 12
2.3.2 Determining the State Sequence 17
2.3.3 Learning HMM Parameters 19
2.3.4 Additional Issues Relating to Speech Recognition Systems 20
2.4 HMM-Based Speech Recognition 24
2.4.1 Representing the Signal 24
2.4.2 The HMM for a Word Sequence 25
2.4.3 Searching through all Word Sequences 26
References 29

3 The Problem of Robustness in Automatic Speech Recognition (Bhiksha Raj, Tuomas Virtanen, Rita Singh) 31
3.1 Errors in Bayes Classification 31
3.1.1 Type 1 Condition: Mismatch Error 33
3.1.2 Type 2 Condition: Increased Bayes Error 34
3.2 Bayes Classification and ASR 35
3.2.1 All We Have is a Model: A Type 1 Condition 35
3.2.2 Intrinsic Interferences--Signal Components that are Unrelated to the Message: A Type 2 Condition 36
3.2.3 External Interferences--The Data are Noisy: Type 1 and Type 2 Conditions 36
3.3 External Influences on Speech Recordings 36
3.3.1 Signal Capture 37
3.3.2 Additive Corruptions 41
3.3.3 Reverberation 42
3.3.4 A Simplified Model of Signal Capture 43
3.4 The Effect of External Influences on Recognition 44
3.5 Improving Recognition under Adverse Conditions 46
3.5.1 Handling the Model Mismatch Error 46
3.5.2 Dealing with Intrinsic Variations in the Data 47
3.5.3 Dealing with Extrinsic Variations 47
References 50

Part Two: SIGNAL ENHANCEMENT

4 Voice Activity Detection, Noise Estimation, and Adaptive Filters for Acoustic Signal Enhancement (Rainer Martin, Dorothea Kolossa) 53
4.1 Introduction 53
4.2 Signal Analysis and Synthesis 55
4.2.1 DFT-Based Analysis Synthesis with Perfect Reconstruction 55
4.2.2 Probability Distributions for Speech and Noise DFT Coefficients 57
4.3 Voice Activity Detection 58
4.3.1 VAD Design Principles 58
4.3.2 Evaluation of VAD Performance 62
4.3.3 Evaluation in the Context of ASR 62
4.4 Noise Power Spectrum Estimation 65
4.4.1 Smoothing Techniques 65
4.4.2 Histogram and GMM Noise Estimation Methods 67
4.4.3 Minimum Statistics Noise Power Estimation 67
4.4.4 MMSE Noise Power Estimation 68
4.4.5 Estimation of the A Priori Signal-to-Noise Ratio 69
4.5 Adaptive Filters for Signal Enhancement 71
4.5.1 Spectral Subtraction 71
4.5.2 Nonlinear Spectral Subtraction 73
4.5.3 Wiener Filtering 74
4.5.4 The ETSI Advanced Front End 75
4.5.5 Nonlinear MMSE Estimators 75
4.6 ASR Performance 80
4.7 Conclusions 81
References 82

5 Extraction of Speech from Mixture Signals (Paris Smaragdis) 87
5.1 The Problem with Mixtures 87
5.2 Multichannel Mixtures 88
5.2.1 Basic Problem Formulation 88
5.2.2 Convolutive Mixtures 92
5.3 Single-Channel Mixtures 98
5.3.1 Problem Formulation 98
5.3.2 Learning Sound Models 100
5.3.3 Separation by Spectrogram Factorization 101
5.3.4 Dealing with Unknown Sounds 105
5.4 Variations and Extensions 107
5.5 Conclusions 107
References 107

6 Microphone Arrays (John McDonough, Kenichi Kumatani) 109
6.1 Speaker Tracking 110
6.2 Conventional Microphone Arrays 113
6.3 Conventional Adaptive Beamforming Algorithms 120
6.3.1 Minimum Variance Distortionless Response Beamformer 120
6.3.2 Noise Field Models 122
6.3.3 Subband Analysis and Synthesis 123
6.3.4 Beamforming Performance Criteria 126
6.3.5 Generalized Sidelobe Canceller Implementation 129
6.3.6 Recursive Implementation of the GSC 130
6.3.7 Other Conventional GSC Beamformers 131
6.3.8 Beamforming based on Higher Order Statistics 132
6.3.9 Online Implementation 136
6.3.10 Speech-Recognition Experiments 140
6.4 Spherical Microphone Arrays 142
6.5 Spherical Adaptive Algorithms 148
6.6 Comparative Studies 149
6.7 Comparison of Linear and Spherical Arrays for DSR 152
6.8 Conclusions and Further Reading 154
References 155

Part Three: FEATURE ENHANCEMENT

7 From Signals to Speech Features by Digital Signal Processing (Matthias Wölfel) 161
7.1 Introduction 161
7.1.1 About this Chapter 162
7.2 The Speech Signal 162
7.3 Spectral Processing 163
7.3.1 Windowing 163
7.3.2 Power Spectrum 165
7.3.3 Spectral Envelopes 166
7.3.4 LP Envelope 166
7.3.5 MVDR Envelope 169
7.3.6 Warping the Frequency Axis 171
7.3.7 Warped LP Envelope 175
7.3.8 Warped MVDR Envelope 176
7.3.9 Comparison of Spectral Estimates 177
7.3.10 The Spectrogram 179
7.4 Cepstral Processing 179
7.4.1 Definition and Calculation of Cepstral Coefficients 180
7.4.2 Characteristics of Cepstral Sequences 181
7.5 Influence of Distortions on Different Speech Features 182
7.5.1 Objective Functions 182
7.5.2 Robustness against Noise 185
7.5.3 Robustness against Echo and Reverberation 187
7.5.4 Robustness against Changes in Fundamental Frequency 189
7.6 Summary and Further Reading 191
References 191

8 Features Based on Auditory Physiology and Perception (Richard M. Stern, Nelson Morgan) 193
8.1 Introduction 193
8.2 Some Attributes of Auditory Physiology and Perception 194
8.2.1 Peripheral Processing 194
8.2.2 Processing at more Central Levels 200
8.2.3 Psychoacoustical Correlates of Physiological Observations 202
8.2.4 The Impact of Auditory Processing on Conventional Feature Extraction 206
8.2.5 Summary 208
8.3 "Classic" Auditory Representations 208
8.4 Current Trends in Auditory Feature Analysis 213
8.5 Summary 221
Acknowledgments 222
References 222

9 Feature Compensation (Jasha Droppo) 229
9.1 Life in an Ideal World 229
9.1.1 Noise Robustness Tasks 229
9.1.2 Probabilistic Feature Enhancement 230
9.1.3 Gaussian Mixture Models 231
9.2 MMSE-SPLICE 232
9.2.1 Parameter Estimation 233
9.2.2 Results 236
9.3 Discriminative SPLICE 237
9.3.1 The MMI Objective Function 238
9.3.2 Training the Front-End Parameters 239
9.3.3 The Rprop Algorithm 240
9.3.4 Results 241
9.4 Model-Based Feature Enhancement 242
9.4.1 The Additive Noise-Mixing Equation 243
9.4.2 The Joint Probability Model 244
9.4.3 Vector Taylor Series Approximation 246
9.4.4 Estimating Clean Speech 247
9.4.5 Results 247
9.5 Switching Linear Dynamic System 248
9.6 Conclusion 249
References 249

10 Reverberant Speech Recognition (Reinhold Haeb-Umbach, Alexander Krueger) 251
10.1 Introduction 251
10.2 The Effect of Reverberation 252
10.2.1 What is Reverberation? 252
10.2.2 The Relationship between Clean and Reverberant Speech Features 254
10.2.3 The Effect of Reverberation on ASR Performance 258
10.3 Approaches to Reverberant Speech Recognition 258
10.3.1 Signal-Based Techniques 259
10.3.2 Front-End Techniques 260
10.3.3 Back-End Techniques 262
10.3.4 Concluding Remarks 265
10.4 Feature Domain Model of the Acoustic Impulse Response 265
10.5 Bayesian Feature Enhancement 267
10.5.1 Basic Approach 268
10.5.2 Measurement Update 269
10.5.3 Time Update 270
10.5.4 Inference 271
10.6 Experimental Results 272
10.6.1 Databases 272
10.6.2 Overview of the Tested Methods 273
10.6.3 Recognition Results on Reverberant Speech 274
10.6.4 Recognition Results on Noisy Reverberant Speech 276
10.7 Conclusions 277
Acknowledgment 278
References 278

Part Four: MODEL ENHANCEMENT

11 Adaptation and Discriminative Training of Acoustic Models (Yannick Estève, Paul Deléglise) 285
11.1 Introduction 285
11.1.1 Acoustic Models 286
11.1.2 Maximum Likelihood Estimation 287
11.2 Acoustic Model Adaptation and Noise Robustness 288
11.2.1 Static (or Offline) Adaptation 289
11.2.2 Dynamic (or Online) Adaptation 289
11.3 Maximum A Posteriori Reestimation 290
11.4 Maximum Likelihood Linear Regression 293
11.4.1 Class Regression Tree 294
11.4.2 Constrained Maximum Likelihood Linear Regression 297
11.4.3 CMLLR Implementation 297
11.4.4 Speaker Adaptive Training 298
11.5 Discriminative Training 299
11.5.1 MMI Discriminative Training Criterion 301
11.5.2 MPE Discriminative Training Criterion 302
11.5.3 I-smoothing 303
11.5.4 MPE Implementation 304
11.6 Conclusion 307
References 308

12 Factorial Models for Noise Robust Speech Recognition (John R. Hershey, Steven J. Rennie, Jonathan Le Roux) 311
12.1 Introduction 311
12.2 The Model-Based Approach 313
12.3 Signal Feature Domains 314
12.4 Interaction Models 317
12.4.1 Exact Interaction Model 318
12.4.2 Max Model 320
12.4.3 Log-Sum Model 321
12.4.4 Mel Interaction Model 321
12.5 Inference Methods 322
12.5.1 Max Model Inference 322
12.5.2 Parallel Model Combination 324
12.5.3 Vector Taylor Series Approaches 326
12.5.4 SNR-Dependent Approaches 331
12.6 Efficient Likelihood Evaluation in Factorial Models 332
12.6.1 Efficient Inference using the Max Model 332
12.6.2 Efficient Vector-Taylor Series Approaches 334
12.6.3 Band Quantization 335
12.7 Current Directions 337
12.7.1 Dynamic Noise Models for Robust ASR 338
12.7.2 Multi-Talker Speech Recognition using Graphical Models 339
12.7.3 Noise Robust ASR using Non-Negative Basis Representations 340
References 341

13 Acoustic Model Training for Robust Speech Recognition (Michael L. Seltzer) 347
13.1 Introduction 347
13.2 Traditional Training Methods for Robust Speech Recognition 348
13.3 A Brief Overview of Speaker Adaptive Training 349
13.4 Feature-Space Noise Adaptive Training 351
13.4.1 Experiments using fNAT 352
13.5 Model-Space Noise Adaptive Training 353
13.6 Noise Adaptive Training using VTS Adaptation 355
13.6.1 Vector Taylor Series HMM Adaptation 355
13.6.2 Updating the Acoustic Model Parameters 357
13.6.3 Updating the Environmental Parameters 360
13.6.4 Implementation Details 360
13.6.5 Experiments using NAT 361
13.7 Discussion 364
13.7.1 Comparison of Training Algorithms 364
13.7.2 Comparison to Speaker Adaptive Training 364
13.7.3 Related Adaptive Training Methods 365
13.8 Conclusion 366
References 366

Part Five: COMPENSATION FOR INFORMATION LOSS

14 Missing-Data Techniques: Recognition with Incomplete Spectrograms (Jon Barker) 371
14.1 Introduction 371
14.2 Classification with Incomplete Data 373
14.2.1 A Simple Missing Data Scenario 374
14.2.2 Missing Data Theory 376
14.2.3 Validity of the MAR Assumption 378
14.2.4 Marginalising Acoustic Models 379
14.3 Energetic Masking 381
14.3.1 The Max Approximation 381
14.3.2 Bounded Marginalisation 382
14.3.3 Missing Data ASR in the Cepstral Domain 384
14.3.4 Missing Data ASR with Dynamic Features 386
14.4 Meta-Missing Data: Dealing with Mask Uncertainty 388
14.4.1 Missing Data with Soft Masks 388
14.4.2 Sub-band Combination Approaches 391
14.4.3 Speech Fragment Decoding 393
14.5 Some Perspectives on Performance 395
References 396

15 Missing-Data Techniques: Feature Reconstruction (Jort Florent Gemmeke, Ulpu Remes) 399
15.1 Introduction 399
15.2 Missing-Data Techniques 401
15.3 Correlation-Based Imputation 402
15.3.1 Fundamentals 402
15.3.2 Implementation 404
15.4 Cluster-Based Imputation 406
15.4.1 Fundamentals 406
15.4.2 Implementation 408
15.4.3 Advances 409
15.5 Class-Conditioned Imputation 411
15.5.1 Fundamentals 411
15.5.2 Implementation 412
15.5.3 Advances 413
15.6 Sparse Imputation 414
15.6.1 Fundamentals 414
15.6.2 Implementation 416
15.6.3 Advances 418
15.7 Other Feature-Reconstruction Methods 420
15.7.1 Parametric Approaches 420
15.7.2 Nonparametric Approaches 421
15.8 Experimental Results 421
15.8.1 Feature-Reconstruction Methods 422
15.8.2 Comparison with Other Methods 424
15.8.3 Advances 426
15.8.4 Combination with Other Methods 427
15.9 Discussion and Conclusion 428
Acknowledgments 429
References 430

16 Computational Auditory Scene Analysis and Automatic Speech Recognition (Arun Narayanan, DeLiang Wang) 433
16.1 Introduction 433
16.2 Auditory Scene Analysis 434
16.3 Computational Auditory Scene Analysis 435
16.3.1 Ideal Binary Mask 435
16.3.2 Typical CASA Architecture 438
16.4 CASA Strategies 440
16.4.1 IBM Estimation Based on Local SNR Estimates 440
16.4.2 IBM Estimation using ASA Cues 442
16.4.3 IBM Estimation as Binary Classification 448
16.4.4 Binaural Mask Estimation Strategies 451
16.5 Integrating CASA with ASR 452
16.5.1 Uncertainty Transform Model 454
16.6 Concluding Remarks 458
Acknowledgment 458
References 458

17 Uncertainty Decoding (Hank Liao) 463
17.1 Introduction 463
17.2 Observation Uncertainty 465
17.3 Uncertainty Decoding 466
17.4 Feature-Based Uncertainty Decoding 468
17.4.1 SPLICE with Uncertainty 470
17.4.2 Front-End Joint Uncertainty Decoding 471
17.4.3 Issues with Feature-Based Uncertainty Decoding 472
17.5 Model-Based Joint Uncertainty Decoding 473
17.5.1 Parameter Estimation 475
17.5.2 Comparisons with Other Methods 476
17.6 Noisy CMLLR 477
17.7 Uncertainty and Adaptive Training 480
17.7.1 Gradient-Based Methods 481
17.7.2 Factor Analysis Approaches 482
17.8 In Combination with Other Techniques 483
17.9 Conclusions 484
References 485

Index 487
