New Era for Robust Speech Recognition (eBook)

Exploiting Deep Learning
eBook Download: PDF
2017 | 1st ed. 2017
XVII, 436 pages
Springer International Publishing (publisher)
978-3-319-64680-0 (ISBN)

171.19 incl. VAT
  • Download available immediately

This book covers the state of the art in deep-neural-network-based methods for noise robustness in distant speech recognition applications. It provides insights into, and detailed descriptions of, new concepts and key technologies in the field, including novel architectures for speech enhancement, microphone arrays, robust features, acoustic-model adaptation, training-data augmentation, and training criteria. The contributed chapters also describe real-world applications and the benchmark tools and datasets widely used in the field.

This book is intended for researchers and practitioners in speech processing and recognition who are interested in the latest deep learning techniques for noise robustness. It will also interest graduate students in electrical engineering and computer science, who will find it a useful guide to this field of research.


Preface 6
Acknowledgments 7
Contents 8
Acronyms 11
Part I Introduction 14
1 Preliminaries 15
1.1 Introduction 15
1.1.1 Motivation 15
1.1.2 Before the Deep Learning Era 16
1.1.2.1 Feature Space Approaches 17
1.1.2.2 Model Space Approaches 18
1.2 Basic Formulation and Notations 18
1.2.1 General Notations (Tables 1.1 and 1.2) 19
1.2.2 Matrix and Vector Operations (Table 1.3) 20
1.2.3 Probability Distribution Functions (Table 1.4) 20
1.2.3.1 Expectation 21
1.2.3.2 Kullback–Leibler Divergence 21
1.2.4 Signal Processing 22
1.2.5 Automatic Speech Recognition 23
1.2.6 Hidden Markov Model 24
1.2.7 Gaussian Mixture Model 25
1.2.8 Neural Network 26
1.3 Book Organization 27
References 28
Part II Approaches to Robust Automatic Speech Recognition 30
2 Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition 31
2.1 Introduction 31
2.1.1 Categories of Speech Enhancement 32
2.1.2 Problem Formulation 32
2.2 Dereverberation 34
2.2.1 Problem Description 34
2.2.2 Overview of Existing Dereverberation Approaches 36
2.2.3 Linear-Prediction-Based Dereverberation 37
2.3 Beamforming 39
2.3.1 Types of Beamformers 40
2.3.1.1 Delay-and-Sum Beamformer 40
2.3.1.2 Minimum Variance Distortionless Response Beamformer 42
2.3.1.3 Max-SNR Beamformer 43
2.3.1.4 Multichannel Wiener Filter 44
2.3.2 Parameter Estimation 45
2.3.2.1 TDOA Estimation 46
2.3.2.2 Steering-Vector Estimation 47
2.3.2.3 Time–Frequency-Masking-Based Spatial Correlation Matrix Estimation 48
2.4 Examples of Robust Front Ends 52
2.4.1 A Reverberation-Robust ASR System 53
2.4.1.1 Experimental Settings 53
2.4.1.2 Experimental Results 53
2.4.2 Robust ASR System for Mobile Devices 55
2.4.2.1 Experimental Settings 55
2.4.2.2 Experimental Results 56
2.5 Concluding Remarks and Discussion 56
References 57
3 Multichannel Spatial Clustering Using Model-Based Source Separation 60
3.1 Introduction 60
3.2 Multichannel Speech Signals 61
3.2.1 Binaural Cues Used by Human Listeners 62
3.2.2 Parameters for More than Two Channels 64
3.3 Spatial-Clustering Approaches 66
3.3.1 Binwise Clustering and Alignment 67
3.3.1.1 Cross-Frequency Source Alignment 68
3.3.2 Fuzzy c-Means Clustering of Direction of Arrival 69
3.3.3 Binaural Model-Based EM Source Separation and Localization (MESSL) 70
3.3.4 Multichannel MESSL 71
3.4 Mask-Smoothing Approaches 73
3.4.1 Fuzzy Clustering with Context Information 73
3.4.2 MESSL in a Markov Random Field 74
3.4.2.1 Pairwise Markov Random Fields 74
3.4.2.2 MESSL-MRF 75
3.5 Driving Beamforming from Spatial Clustering 76
3.6 Automatic Speech Recognition Experiments 78
3.6.1 Results 79
3.6.2 Example Separations 81
3.7 Conclusion 83
References 83
4 Discriminative Beamforming with Phase-Aware Neural Networks for Speech Enhancement and Recognition 87
4.1 Introduction 88
4.2 Beamforming for ASR 88
4.2.1 Geometric Beamforming 89
4.2.2 Statistical Methods 91
4.2.3 Learning-Based Methods 92
4.2.3.1 Maximum Likelihood Approach 92
4.2.3.2 Neural Network Approaches with Multichannel Inputs 93
4.2.3.3 Neural Networks for Better Spatial-Statistics Estimation 94
4.3 Beamforming Networks 95
4.3.1 Motivation 95
4.3.2 System Overview 95
4.3.3 Predicting Beamforming Weights by DNN 97
4.3.3.1 Extraction of GCC Features 98
4.3.3.2 Beamforming Weight Vector 100
4.3.4 Extraction of Log Mel Filterbanks 100
4.3.5 Training Procedure 102
4.4 Experiments 103
4.4.1 Settings 103
4.4.1.1 Corpus 103
4.4.1.2 Network Configurations 104
4.4.2 Beam Patterns 104
4.4.3 Speech Enhancement Results 107
4.4.4 Speech Recognition Results 107
4.5 Summary and Future Directions 109
References 110
5 Raw Multichannel Processing Using Deep Neural Networks 113
5.1 Introduction 114
5.2 Experimental Details 116
5.2.1 Data 116
5.2.2 Baseline Acoustic Model 117
5.3 Multichannel Raw-Waveform Neural Network 118
5.3.1 Motivation 118
5.3.2 Multichannel Filtering in the Time Domain 119
5.3.3 Filterbank Spatial Diversity 120
5.3.4 Comparison to Log Mel 123
5.3.5 Comparison to Oracle Knowledge of Speech TDOA 124
5.3.6 Summary 125
5.4 Factoring Spatial and Spectral Selectivity 125
5.4.1 Architecture 125
5.4.2 Number of Spatial Filters 127
5.4.3 Filter Analysis 127
5.4.4 Results Summary 129
5.5 Adaptive Beamforming 129
5.5.1 NAB Model 129
5.5.1.1 Adaptive Filters 130
5.5.1.2 Gated Feedback 131
5.5.1.3 Regularization with MTL 132
5.5.2 NAB Filter Analysis 132
5.5.3 Results Summary 133
5.6 Filtering in the Frequency Domain 134
5.6.1 Factored Model 134
5.6.1.1 Spatial Filtering 134
5.6.1.2 Spectral Filtering: Complex Linear Projection 134
5.6.2 NAB Model 135
5.6.3 Results: Factored Model 135
5.6.3.1 Performance 135
5.6.3.2 Comparison Between Learning in Time vs. Frequency 136
5.6.4 Results: Adaptive Model 138
5.7 Final Comparison, Rerecorded Data 138
5.8 Conclusions and Future Work 139
References 139
6 Novel Deep Architectures in Speech Processing 142
6.1 Introduction 143
6.1.1 Relationship to the Literature 144
6.2 General Formulation of Deep Unfolding 145
6.3 Unfolding Markov Random Fields 147
6.3.1 Mean-Field Inference 148
6.3.2 Belief Propagation 150
6.4 Deep Nonnegative Matrix Factorization 152
6.5 Multichannel Deep Unfolding 155
6.5.1 Source Separation Using Multichannel Gaussian Mixture Model 156
6.5.2 Unfolding the Multichannel Gaussian Mixture Model 158
6.5.3 MRF Extension of the MCGMM 159
6.5.4 Experiments and Discussion 161
6.6 End-to-End Deep Clustering 163
6.6.1 Deep-Clustering Model 164
6.6.2 Optimizing Signal Reconstruction 165
6.6.3 End-to-End Training 166
6.6.4 Experiments 167
6.6.4.1 ASR Performance 167
6.7 Conclusion 168
References 168
7 Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio 172
7.1 Introduction 172
7.2 Problem Description 173
7.3 Learning-Free Methods 175
7.4 Nonnegative Matrix Factorization 176
7.5 Deep Learning for Source Separation 177
7.5.1 Recurrent and Long Short-Term Memory Networks 178
7.5.2 Mask Versus Signal Prediction 179
7.5.2.1 Ideal Masks and Phase-Sensitive Mask 179
7.5.2.2 Evaluating Ideal Masks 180
7.5.3 Loss Functions and Inputs 181
7.5.4 Phase-Sensitive Approximation Loss Function 182
7.5.5 Inputs to the Network 183
7.5.5.1 Spectral Features 183
7.5.5.2 Speech-State Information 183
7.5.5.3 Enhanced Features 184
7.6 Experiments and Results 185
7.6.1 Neural Network Training 185
7.6.2 Results on CHiME-2 186
7.6.3 Discussion of Results 191
7.7 Conclusion 191
References 191
8 Robust Features in Deep-Learning-Based Speech Recognition 194
8.1 Introduction 195
8.2 Background 197
8.3 Approaches 198
8.3.1 Speech Enhancement 199
8.3.2 Signal-Theoretic Techniques 200
8.3.3 Perceptually Motivated Features 200
8.3.3.1 TempoRAl PatternS (TRAPS) 202
8.3.3.2 Frequency-Domain Linear Prediction (FDLP) 203
8.3.3.3 Power-Normalized Cepstral Coefficients (PNCC) 204
8.3.3.4 Modulation Spectrum Features 204
8.3.3.5 Normalized Modulation Coefficient (NMC) 205
8.3.3.6 Modulation of Medium Duration Speech Amplitudes (MMeDuSA) 207
8.3.3.7 Two Dimensional Modulation Extraction: Gabor Features 209
8.3.3.8 Damped Oscillator Coefficient (DOC) 210
8.3.4 Current Trends 212
8.4 Case Studies 214
8.4.1 Speech Processing for Noise- and Channel-Degraded Audio 214
8.4.2 Speech Processing Under Reverberated Conditions 215
8.5 Conclusion 217
References 218
9 Adaptation of Deep Neural Network Acoustic Models for Robust Automatic Speech Recognition 225
9.1 Introduction 225
9.1.1 DNN Adaptation Strategies 226
9.1.1.1 Test-Time Adaptation 227
9.1.1.2 Attribute-Aware Training 227
9.1.1.3 Adaptive Training 227
9.1.2 Overview of DNN Adaptation Methods 228
9.1.2.1 Constrained Adaptation 228
9.1.2.2 Feature Normalisation 228
9.1.2.3 Feature Augmentation 229
9.1.2.4 Structured DNN Parameterisation 229
9.1.3 Chapter Organisation 229
9.2 Feature Augmentation 230
9.2.1 Speaker-Aware Training 231
9.2.2 Noise-Aware Training 232
9.2.3 Room-Aware Training 233
9.2.4 Multiattribute-Aware Training 234
9.2.5 Refinement of Augmented Features 236
9.3 Structured DNN Parameterisation 237
9.3.1 Structured Bias Vectors 237
9.3.2 Structured Linear Transformation Adaptation 238
9.3.3 Learning Hidden Unit Contribution 239
9.3.4 SVD-Based Structure 239
9.3.5 Factorised Hidden Layer Adaptation 240
9.3.6 Cluster Adaptive Training for DNNs 241
9.4 Summary and Future Directions 243
References 244
10 Training Data Augmentation and Data Selection 250
10.1 Introduction 250
10.1.1 Data Augmentation in the Literature 251
10.1.2 Complementary Approaches 252
10.2 Data Augmentation in Mismatched Environments 253
10.2.1 Data Generation 253
10.2.2 Speech Enhancement 254
10.2.2.1 WPE-Based Dereverberation 254
10.2.2.2 Denoising Autoencoder 255
10.2.3 Results with Speech Enhancement on Test Data 255
10.2.4 Results with Training Data Augmentation 256
10.3 Data Selection 257
10.3.1 Introduction 257
10.3.2 Sequence-Summarizing Neural Network 258
10.3.3 Configuration of the Neural Network 260
10.3.4 Properties of the Extracted Vectors 261
10.3.5 Results with Data Selection 262
10.4 Conclusions 263
References 263
11 Advanced Recurrent Neural Networks for Automatic Speech Recognition 266
11.1 Introduction 266
11.2 Basic Deep Long Short-Term Memory RNNs 267
11.2.1 Long Short-Term Memory RNNs 267
11.2.2 Deep LSTM RNNs 268
11.3 Prediction–Adaptation–Correction Recurrent Neural Networks 268
11.4 Deep Long Short-Term Memory RNN Extensions 270
11.4.1 Highway RNNs 270
11.4.2 Bidirectional Highway LSTM RNNs 272
11.4.3 Latency-Controlled Bidirectional Highway LSTM RNNs 272
11.4.4 Grid LSTM RNNs 274
11.4.5 Residual LSTM RNNs 275
11.5 Experiment Setup 275
11.5.1 Corpus 275
11.5.1.1 IARPA-Babel Corpus 275
11.5.1.2 AMI Meeting Corpus 275
11.5.2 System Description 276
11.6 Evaluation 277
11.6.1 PAC-RNN 277
11.6.1.1 Low-Resource Language 277
11.6.1.2 Distant Speech Recognition 278
11.6.2 Highway LSTMP 279
11.6.2.1 Three-Layer Highway (B)LSTMP 279
11.6.2.2 Highway (B)LSTMP with Dropout 279
11.6.2.3 Deeper Highway LSTMP 280
11.6.2.4 Grid LSTMP 280
11.6.2.5 Residual LSTMP 281
11.6.2.6 Summary of Results 281
11.7 Conclusion 282
References 283
12 Sequence-Discriminative Training of Neural Networks 285
12.1 Introduction 285
12.2 Training Criteria 287
12.2.1 Maximum Mutual Information 287
12.2.2 Boosted Maximum Mutual Information 288
12.2.3 Minimum Phone Error/State-Level Minimum Bayes Risk 289
12.3 Practical Training Strategy 290
12.3.1 Criterion Selection 290
12.3.2 Frame-Smoothing 291
12.3.3 Lattice Generation 292
12.3.3.1 Numerator Lattice 292
12.3.3.2 Denominator Lattice 293
12.4 Two-Forward-Pass Method for Sequence Training 294
12.5 Experiment Setup 295
12.5.1 Corpus 296
12.5.2 System Description 296
12.6 Evaluation 297
12.6.1 Practical Strategy 297
12.6.2 Two-Forward-Pass Method 297
12.6.2.1 Speed 298
12.6.2.2 Performance 298
12.7 Conclusion 299
References 300
13 End-to-End Architectures for Speech Recognition 302
13.1 Introduction 302
13.1.1 Complexity and Suboptimality of the Conventional ASR Pipeline 303
13.1.2 Simplification of the Conventional ASR Pipeline 305
13.1.3 End-to-End Learning 306
13.2 End-to-End ASR Architectures 306
13.2.1 Connectionist Temporal Classification 307
13.2.2 Encoder–Decoder Paradigm 307
13.2.3 Learning the Front End 309
13.2.4 Other Ideas 310
13.3 The EESEN Framework 310
13.3.1 Model Structure 311
13.3.2 Model Training 312
13.3.3 Decoding 314
13.3.3.1 Grammar 315
13.3.3.2 Lexicon 315
13.3.3.3 Token 316
13.3.3.4 Search Graph 316
13.3.4 Experiments and Analysis 317
13.3.4.1 Wall Street Journal 317
13.3.4.2 Switchboard 319
13.3.4.3 HKUST Mandarin Chinese 320
13.4 Summary and Future Directions 321
References 322
Part III Resources 327
14 The CHiME Challenges: Robust Speech Recognition in Everyday Environments 328
14.1 Introduction 328
14.2 The 1st and 2nd CHiME Challenges (CHiME-1 and CHiME-2) 329
14.2.1 Domestic Noise Background 330
14.2.2 The Speech Recognition Task Design 330
14.2.2.1 CHiME-1: Small Vocabulary 331
14.2.2.2 CHiME-2 Track 1: Simulated Motion 331
14.2.2.3 CHiME-2 Track 2: Medium Vocabulary 332
14.2.3 Overview of System Performance 332
14.2.4 Interim Conclusions 333
14.3 The 3rd CHiME Challenge (CHiME-3) 334
14.3.1 The Mobile Tablet Recordings 334
14.3.2 The CHiME-3 Task Design: Real and Simulated Data 335
14.3.3 The CHiME-3 Baseline Systems 336
14.3.3.1 Simulation 336
14.3.3.2 Enhancement 336
14.3.3.3 ASR 337
14.4 The CHiME-3 Evaluations 337
14.4.1 An Overview of CHiME-3 System Performance 338
14.4.2 An Overview of Successful Strategies 338
14.4.2.1 Strategies for Improved Signal Enhancement 339
14.4.2.2 Strategies for Improved Statistical Modelling 339
14.4.2.3 Strategies for Improved System Training 340
14.4.3 Key Findings 340
14.5 Future Directions: CHiME-4 and Beyond 341
References 343
15 The REVERB Challenge: A Benchmark Task for Reverberation-Robust ASR Techniques 346
15.1 Introduction 347
15.2 Challenge Scenarios, Data, and Regulations 348
15.2.1 Scenarios Assumed in the Challenge 348
15.2.2 Data 348
15.2.2.1 Test Data: Dev and Eval Test Sets 348
15.2.2.2 Training Data 350
15.2.3 Regulations 350
15.3 Performance of Baseline and Top-Performing Systems 351
15.3.1 Benchmark Results with GMM-HMM and DNN-HMM Systems 351
15.3.2 Top-Performing 1-ch and 8-ch Systems 352
15.3.3 Current State-of-the-Art Performance 353
15.4 Summary and Remaining Challenges for Reverberant Speech Recognition 354
References 354
16 Distant Speech Recognition Experiments Using the AMI Corpus 356
16.1 Introduction 356
16.2 Meeting Corpora 357
16.3 Baseline Speech Recognition Experiments 359
16.4 Channel Concatenation Experiments 362
16.5 Convolutional Neural Networks 363
16.5.1 SDM Recordings 365
16.5.2 MDM Recordings 365
16.5.3 IHM Recordings 366
16.6 Discussion and Conclusions 367
References 367
17 Toolkits for Robust Speech Processing 370
17.1 Introduction 370
17.2 General Speech Recognition Toolkits 371
17.3 Language Model Toolkits 373
17.4 Speech Enhancement Toolkits 375
17.5 Deep Learning Toolkits 376
17.6 End-to-End Speech Recognition Toolkits 378
17.7 Other Resources for Speech Technology 380
17.8 Conclusion 380
References 381
Part IV Applications 384
18 Speech Research at Google to Enable Universal Speech Interfaces 385
18.1 Early Development 385
18.2 Voice Search 387
18.3 Text to Speech 387
18.4 Dictation/IME/Transcription 388
18.5 Internationalization 389
18.6 Neural-Network-Based Acoustic Modeling 391
18.7 Adaptive Language Modeling 392
18.8 Mobile-Device-Specific Technology 393
18.9 Robustness 395
References 396
19 Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft 400
19.1 Introduction 401
19.2 Effective and Efficient DL Modeling 401
19.2.1 Reducing Run-Time Cost with SVD-Based Training 402
19.2.2 Speaker Adaptation on Small Amount of Parameters 402
19.2.2.1 SVD Bottleneck Adaptation 403
19.2.2.2 DNN Adaptation Through Activation Function 404
19.2.2.3 Low-Rank Plus Diagonal (LRPD) Adaptation 404
19.2.3 Improving the Accuracy of Small-Size DNNs with Teacher–Student Training 405
19.3 Invariance Modeling 406
19.3.1 Improving the Robustness to Accent/Dialect with Model Adaptation 406
19.3.2 Improving the Robustness to Acoustic Environment with Variable-Component DNN Modeling 408
19.3.3 Improving the Time and Frequency Invariance with Time–Frequency Long Short-Term Memory RNNs 409
19.3.4 Exploring the Generalization Capability to Unseen Data with Maximum Margin Sequence Training 409
19.4 Effective Training-Data Usage 411
19.4.1 Use of Unsupervised Data to Improve SR Accuracy 411
19.4.2 Expanded Language Capability by Reusing Speech-Training Material Across Languages 412
19.5 Conclusion 413
References 414
20 Advanced ASR Technologies for Mitsubishi Electric Speech Applications 417
20.1 Introduction 417
20.2 ASR for Car Navigation Systems 418
20.2.1 Introduction 418
20.2.2 ASR and Postprocessing Technologies 418
20.2.2.1 ASR Using Statistical LM 418
20.2.2.2 POI Name Search Using High-Speed Text Search Technique 419
20.2.2.3 Application to Commercial Car Navigation System 420
20.3 Dereverberation for Hands-Free Elevator 420
20.3.1 Introduction 420
20.3.2 A Dereverberation Method Using SS 421
20.3.3 Experiments 422
20.4 Discriminative Methods 423
20.4.1 Introduction 423
20.4.2 Discriminative Training for AMs 424
20.4.3 Discriminative Training for RNN-LM 425
20.5 Conclusion 426
References 427
Index 428

Publication date (per publisher) 30.10.2017
Additional information XVII, 436 p. 76 illus., 26 illus. in color.
Place of publication Cham
Language English
Subject area Mathematics / Computer Science
Engineering Electrical Engineering / Energy Technology
Keywords Acoustic Model Adaptation • Automatic speech recognition (ASR) • Deep learning • Distant Speech • Natural Language Processing (NLP) • neural networks (NNs) • Noise robustness • Signal Processing • Speech processing • Speech Recognition
ISBN-10 3-319-64680-X / 331964680X
ISBN-13 978-3-319-64680-0 / 9783319646800
PDF (watermarked)
Size: 9.1 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized to you. If the eBook is improperly passed on to third parties, it can be traced back to its source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to specialist books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only of limited use on small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: online reading
In addition to downloading this eBook, you can also read it online in your web browser.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfil eBook orders from other countries.
