Automatic Speech Recognition (eBook)

A Deep Learning Approach

Dong Yu, Li Deng (Authors)

eBook Download: PDF
2014 | 2015
XXVI, 321 pages
Springer London (Publisher)
978-1-4471-5779-3 (ISBN)

€149.79 incl. VAT
  • Download available immediately

This book provides a comprehensive overview of recent advances in automatic speech recognition, with a focus on deep learning models including deep neural networks and many of their variants. It is the first automatic speech recognition book dedicated to the deep learning approach. In addition to a rigorous mathematical treatment of the subject, the book presents insights into, and the theoretical foundations of, a series of highly successful deep learning models.

Foreword 7
Preface 9
Contents 12
Acronyms 19
Symbols 22
1 Introduction 24
1.1 Automatic Speech Recognition: A Bridge for Better Communication 24
1.1.1 Human-Human Communication 25
1.1.2 Human-Machine Communication 25
1.2 Basic Architecture of ASR Systems 27
1.3 Book Organization 28
1.3.1 Part I: Conventional Acoustic Models 29
1.3.2 Part II: Deep Neural Networks 29
1.3.3 Part III: DNN-HMM Hybrid Systems for ASR 30
1.3.4 Part IV: Representation Learning in Deep Neural Networks 30
1.3.5 Part V: Advanced Deep Models 30
References 31
Part I: Conventional Acoustic Models 33
2 Gaussian Mixture Models 34
2.1 Random Variables 34
2.2 Gaussian and Gaussian-Mixture Random Variables 35
2.3 Parameter Estimation 38
2.4 Mixture of Gaussians as a Model for the Distribution of Speech Features 39
References 41
3 Hidden Markov Models and the Variants 43
3.1 Introduction 43
3.2 Markov Chains 45
3.3 Hidden Markov Sequences and Models 46
3.3.1 Characterization of a Hidden Markov Model 47
3.3.2 Simulation of a Hidden Markov Model 49
3.3.3 Likelihood Evaluation of a Hidden Markov Model 49
3.3.4 An Algorithm for Efficient Likelihood Evaluation 50
3.3.5 Proofs of the Forward and Backward Recursions 52
3.4 EM Algorithm and Its Application to Learning HMM Parameters 53
3.4.1 Introduction to EM Algorithm 53
3.4.2 Applying EM to Learning the HMM: Baum-Welch Algorithm 55
3.5 Viterbi Algorithm for Decoding HMM State Sequences 59
3.5.1 Dynamic Programming and Viterbi Algorithm 59
3.5.2 Dynamic Programming for Decoding HMM States 60
3.6 The HMM and Variants for Generative Speech Modeling and Recognition 62
3.6.1 GMM-HMMs for Speech Modeling and Recognition 63
3.6.2 Trajectory and Hidden Dynamic Models for Speech Modeling and Recognition 64
3.6.3 The Speech Recognition Problem Using Generative Models of HMM and Its Variants 66
References 68
Part II: Deep Neural Networks 75
4 Deep Neural Networks 76
4.1 The Deep Neural Network Architecture 76
4.2 Parameter Estimation with Error Backpropagation 78
4.2.1 Training Criteria 79
4.2.2 Training Algorithms 80
4.3 Practical Considerations 84
4.3.1 Data Preprocessing 84
4.3.2 Model Initialization 86
4.3.3 Weight Decay 87
4.3.4 Dropout 88
4.3.5 Batch Size Selection 89
4.3.6 Sample Randomization 91
4.3.7 Momentum 92
4.3.8 Learning Rate and Stopping Criterion 92
4.3.9 Network Architecture 94
4.3.10 Reproducibility and Restartability 94
References 95
5 Advanced Model Initialization Techniques 97
5.1 Restricted Boltzmann Machines 97
5.1.1 Properties of RBMs 99
5.1.2 RBM Parameter Learning 101
5.2 Deep Belief Network Pretraining 104
5.3 Pretraining with Denoising Autoencoder 107
5.4 Discriminative Pretraining 109
5.5 Hybrid Pretraining 110
5.6 Dropout Pretraining 111
References 112
Part III: Deep Neural Network-Hidden Markov Model Hybrid Systems for Automatic Speech Recognition 114
6 Deep Neural Network-Hidden Markov Model Hybrid Systems 115
6.1 DNN-HMM Hybrid Systems 115
6.1.1 Architecture 115
6.1.2 Decoding with CD-DNN-HMM 117
6.1.3 Training Procedure for CD-DNN-HMMs 118
6.1.4 Effects of Contextual Window 120
6.2 Key Components in the CD-DNN-HMM and Their Analysis 122
6.2.1 Datasets and Baselines for Comparisons and Analysis 122
6.2.2 Modeling Monophone States or Senones 124
6.2.3 Deeper Is Better 125
6.2.4 Exploit Neighboring Frames 127
6.2.5 Pretraining 127
6.2.6 Better Alignment Helps 128
6.2.7 Tuning Transition Probability 129
6.3 Kullback-Leibler Divergence-Based HMM 129
References 130
7 Training and Decoding Speedup 133
7.1 Training Speedup 133
7.1.1 Pipelined Backpropagation Using Multiple GPUs 134
7.1.2 Asynchronous SGD 137
7.1.3 Augmented Lagrangian Methods and Alternating Direction Method of Multipliers 140
7.1.4 Reduce Model Size 142
7.1.5 Other Approaches 143
7.2 Decoding Speedup 143
7.2.1 Parallel Computation 144
7.2.2 Sparse Network 146
7.2.3 Low-Rank Approximation 148
7.2.4 Teach Small DNN with Large DNN 149
7.2.5 Multiframe DNN 150
References 151
8 Deep Neural Network Sequence-Discriminative Training 153
8.1 Sequence-Discriminative Training Criteria 153
8.1.1 Maximum Mutual Information 153
8.1.2 Boosted MMI 155
8.1.3 MPE/sMBR 156
8.1.4 A Unified Formulation 157
8.2 Practical Considerations 158
8.2.1 Lattice Generation 158
8.2.2 Lattice Compensation 159
8.2.3 Frame Smoothing 161
8.2.4 Learning Rate Adjustment 162
8.2.5 Training Criterion Selection 162
8.2.6 Other Considerations 163
8.3 Noise Contrastive Estimation 163
8.3.1 Casting Probability Density Estimation Problem as a Classifier Design Problem 164
8.3.2 Extension to Unnormalized Models 166
8.3.3 Apply NCE in DNN Training 167
References 169
Part IV: Representation Learning in Deep Neural Networks 170
9 Feature Representation Learning in Deep Neural Networks 171
9.1 Joint Learning of Feature Representation and Classifier 171
9.2 Feature Hierarchy 173
9.3 Flexibility in Using Arbitrary Input Features 176
9.4 Robustness of Features 177
9.4.1 Robust to Speaker Variations 177
9.4.2 Robust to Environment Variations 179
9.5 Robustness Across All Conditions 181
9.5.1 Robustness Across Noise Levels 181
9.5.2 Robustness Across Speaking Rates 183
9.6 Lack of Generalization Over Large Distortions 184
References 187
10 Fuse Deep Neural Network and Gaussian Mixture Model Systems 190
10.1 Use DNN-Derived Features in GMM-HMM Systems 190
10.1.1 GMM-HMM with Tandem and Bottleneck Features 190
10.1.2 DNN-HMM Hybrid System Versus GMM-HMM System with DNN-Derived Features 193
10.2 Fuse Recognition Results 195
10.2.1 ROVER 196
10.2.2 SCARF 197
10.2.3 MBR Lattice Combination 198
10.3 Fuse Frame-Level Acoustic Scores 199
10.4 Multistream Speech Recognition 200
References 202
11 Adaptation of Deep Neural Networks 205
11.1 The Adaptation Problem for Deep Neural Networks 205
11.2 Linear Transformations 206
11.2.1 Linear Input Networks 207
11.2.2 Linear Output Networks 208
11.3 Linear Hidden Networks 210
11.4 Conservative Training 211
11.4.1 L2 Regularization 211
11.4.2 KL-Divergence Regularization 212
11.4.3 Reducing Per-Speaker Footprint 214
11.5 Subspace Methods 216
11.5.1 Subspace Construction Through Principal Component Analysis 216
11.5.2 Noise-Aware, Speaker-Aware, and Device-Aware Training 217
11.5.3 Tensor 221
11.6 Effectiveness of DNN Speaker Adaptation 222
11.6.1 KL-Divergence Regularization Approach 222
11.6.2 Speaker-Aware Training 224
References 225
Part V: Advanced Deep Models 228
12 Representation Sharing and Transfer in Deep Neural Networks 229
12.1 Multitask and Transfer Learning 229
12.1.1 Multitask Learning 229
12.1.2 Transfer Learning 230
12.2 Multilingual and Crosslingual Speech Recognition 231
12.2.1 Tandem/Bottleneck-Based Crosslingual Speech Recognition 232
12.2.2 Shared-Hidden-Layer Multilingual DNN 233
12.2.3 Crosslingual Model Transfer 236
12.3 Multiobjective Training of Deep Neural Networks for Speech Recognition 240
12.3.1 Robust Speech Recognition with Multitask Learning 240
12.3.2 Improved Phone Recognition with Multitask Learning 240
12.3.3 Recognizing Both Phonemes and Graphemes 241
12.4 Robust Speech Recognition Exploiting Audio-Visual Information 242
References 243
13 Recurrent Neural Networks and Related Models 246
13.1 Introduction 246
13.2 State-Space Formulation of the Basic Recurrent Neural Network 248
13.3 The Backpropagation-Through-Time Learning Algorithm 249
13.3.1 Objective Function for Minimization 250
13.3.2 Recursive Computation of Error Terms 250
13.3.3 Update of RNN Weights 251
13.4 A Primal-Dual Technique for Learning Recurrent Neural Networks 253
13.4.1 Difficulties in Learning RNNs 253
13.4.2 Echo-State Property and Its Sufficient Condition 254
13.4.3 Learning RNNs as a Constrained Optimization Problem 254
13.4.4 A Primal-Dual Method for Learning RNNs 255
13.5 Recurrent Neural Networks Incorporating LSTM Cells 258
13.5.1 Motivations and Applications 258
13.5.2 The Architecture of LSTM Cells 259
13.5.3 Training the LSTM-RNN 259
13.6 Analyzing Recurrent Neural Networks: A Contrastive Approach 260
13.6.1 Direction of Information Flow: Top-Down versus Bottom-Up 260
13.6.2 The Nature of Representations: Localist or Distributed 263
13.6.3 Interpretability: Inferring Latent Layers versus End-to-End Learning 264
13.6.4 Parameterization: Parsimonious Conditionals versus Massive Weight Matrices 265
13.6.5 Methods of Model Learning: Variational Inference versus Gradient Descent 267
13.6.6 Recognition Accuracy Comparisons 267
13.7 Discussions 268
References 270
14 Computational Network 276
14.1 Computational Network 276
14.2 Forward Computation 278
14.3 Model Training 280
14.4 Typical Computation Nodes 284
14.4.1 Computation Node Types with No Operand 285
14.4.2 Computation Node Types with One Operand 285
14.4.3 Computation Node Types with Two Operands 290
14.4.4 Computation Node Types for Computing Statistics 296
14.5 Convolutional Neural Network 297
14.6 Recurrent Connections 300
14.6.1 Sample by Sample Processing Only Within Loops 301
14.6.2 Processing Multiple Utterances Simultaneously 302
14.6.3 Building Arbitrary Recurrent Neural Networks 302
References 306
15 Summary and Future Directions 308
15.1 Road Map 308
15.1.1 Debut of DNNs for ASR 308
15.1.2 Speedup of DNN Training and Decoding 311
15.1.3 Sequence Discriminative Training 311
15.1.4 Feature Processing 312
15.1.5 Adaptation 313
15.1.6 Multitask and Transfer Learning 314
15.1.7 Convolutional Neural Networks 314
15.1.8 Recurrent Neural Networks and LSTM 315
15.1.9 Other Deep Models 315
15.2 State of the Art and Future Directions 316
15.2.1 State of the Art: A Brief Analysis 316
15.2.2 Future Directions 317
References 318
Index 325

Publication date (per publisher) 11.11.2014
Series Signals and Communication Technology
Additional information XXVI, 321 p., 62 illus.
Place of publication London
Language English
Subject areas Computer Science / Theory & Studies / Artificial Intelligence & Robotics
Natural Sciences / Physics & Astronomy
Engineering / Electrical Engineering & Energy Technology
Keywords Adaptive Training • Auditory Signal Processing • Automatic speech recognition • Computational Network • Deep Architecture • Deep Generative Model • Deep learning • Deep Neural Network • Distributed Representation • Full-Sequence Training • Hidden Markov Model • LSTM • Machine Learning in Language • Neural Network Language Model • Recurrent Neural Network • Sequential Model • transfer learning
ISBN-10 1-4471-5779-6 / 1447157796
ISBN-13 978-1-4471-5779-3 / 9781447157793
PDF (watermarked)
Size: 7.8 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized to you. If the eBook is improperly passed on to third parties, it can be traced back to its source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only suitable to a limited extent for small screens (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Online reading
In addition to downloading this eBook, you can also read it online in your web browser.

Buying eBooks from abroad
For tax law reasons, we can only sell eBooks within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.
