Speech and Audio Processing for Coding, Enhancement and Recognition (eBook)
X, 345 pages
Springer New York (Publisher)
978-1-4939-1456-2 (ISBN)
This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition, and provides an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization is also presented, along with recent advances and new paradigms in these areas.
Tokunbo Ogunfunmi is an Associate Professor of Electrical Engineering and an Associate Dean for Research and Faculty Development at Santa Clara University.
Roberto Togneri is a professor with the School of Electrical, Electronic and Computer Engineering at The University of Western Australia.
Madihally (Sim) Narasimha is a Senior Director of Technology at Qualcomm Inc.
Preface 6
Contents 10
Part I Overview of Speech and Audio Coding 12
1 From “Harmonic Telegraph” to Cellular Phones 13
1.1 Introduction 13
1.1.1 The Multiple Telegraph “Harmonic Telegraph” 14
1.1.2 Bell's Theory of Transmitting Speech 14
1.2 Early History of the Telephone 15
1.2.1 The Telephone Is Born 15
1.2.2 Birth of the Telephone Company 15
1.2.2.1 Research at Bell Company 16
1.2.2.2 New York to San Francisco Telephone Service in 1915, Nobel Prize, and More 16
1.3 Speech Bandwidth Compression at AT&T
1.3.1 Early Research on “vocoders” 17
1.3.2 Predictive Coding 18
1.3.3 Efficient Encoding of Prediction Error 19
1.3.3.1 Some Comments on the Nature of Prediction Error for Speech 19
1.3.3.2 Information Rate of Gaussian Signals with Specified Fidelity Criterion 20
1.3.3.3 Predictive Coding with Specified Error Spectrum 20
1.3.3.4 Overcoming the Computational Complexity of Predictive Coders 22
1.4 Cellular Telephone Service 24
1.4.1 Digital Cellular Standards 25
1.4.1.1 North American Digital Cellular Standards 25
1.4.1.2 European Digital Cellular Standards 25
1.5 The Future 26
References 26
2 Challenges in Speech Coding Research 28
2.1 Introduction 28
2.2 Speech Coding 29
2.2.1 Speech Coding Methods 30
2.2.1.1 Waveform Coding [2] 30
2.2.1.2 Subband and Transform Methods [2] 31
2.2.1.3 Analysis-by-Synthesis Methods [2, 10] 32
2.2.1.4 Postfiltering [11] 35
2.2.1.5 Voice Activity Detection and Silence Coding 35
2.2.2 Speech Coding Standards 35
2.2.2.1 ITU-T Standards 36
2.2.2.2 Digital Cellular Standards 38
2.2.2.3 VoIP Standards 40
2.3 Audio Coding [25, 26] 40
2.4 Newer Standards 41
2.5 Emerging Topics 45
2.6 Conclusions and Future Research Directions 46
References 46
3 Scalable and Multi-Rate Speech Coding for Voice-over-Internet Protocol (VoIP) Networks 49
3.1 Introduction 49
3.2 VoIP Networks 50
3.2.1 Overview of VoIP Networks 51
3.2.2 Robust Voice Communication 51
3.2.3 Packet Loss Concealment (PLC) 51
3.3 Analysis-by-Synthesis Speech Coding 53
3.3.1 Analysis-by-Synthesis Principles 53
3.3.2 CELP-Based Coders 53
3.3.2.1 Perceptual Error Weighting 56
3.3.2.2 Pitch Estimation 56
3.4 Multi-Rate Speech Coding 57
3.4.1 Basic Principles 57
3.4.2 Adaptive Multi-Rate (AMR) Codec 59
3.5 Scalable Speech Coding 60
3.5.1 Basic Principles 60
3.5.2 Standardized Scalable Speech Codecs 60
3.5.2.1 ITU-T G.729.1 61
3.5.2.2 ITU-T G.718 62
3.6 Packet-Loss Robust Speech Coding 65
3.6.1 Internet Low Bitrate Codec (iLBC) 67
3.6.2 Scalable Multi-Rate Speech Codec 68
3.6.2.1 Narrowband Codec 68
3.6.2.2 Wideband Codec 74
3.7 Conclusions 79
References 79
4 Recent Speech Coding Technologies and Standards 83
4.1 Recent Speech Codec Technologies and Features 84
4.1.1 Active Speech Source-Controlled Variable Bit Rate, Constant Bit Rate Operation and Voice Activity Detectors 84
4.1.1.1 Source-Controlled Variable Bit Rate (SC-VBR) Versus Constant/Fixed Bit Rate (CBR) Vocoders 85
4.1.2 Layered Coding 86
4.1.3 Bandwidth Extension of Speech 87
4.1.3.1 Harmonic Bandwidth Extension Architecture 88
4.1.3.2 Spectral Band Replication (SBR) 89
4.1.4 Blind Bandwidth Extension 90
4.1.4.1 High Band Model and Prediction Methods 91
4.1.4.2 BBE for Speech Coding 91
4.1.4.3 BBE for Bandwidth Increase 92
4.1.4.4 Quality Evaluation 92
4.1.4.5 Encoder Based BBE 93
4.1.5 Packet Loss Concealment 95
4.1.5.1 Code Excited Linear Prediction Coders 96
4.1.5.2 Adaptive Differential Pulse Code Modulation (ADPCM) Based Coders 97
4.1.6 Voice Over Internet Protocol (VoIP) 97
4.1.6.1 Management of Time Varying Delay 98
4.1.6.2 Packet Loss Concealment for VoIP 99
4.2 Recent Speech Coding Standards 102
4.2.1 Advanced Standards in ITU-T 102
4.2.1.1 G.729.1: Scalable Extension of G.729 103
4.2.1.2 G.718: Layered Coder with Interoperable Modes 104
4.2.1.3 Super-Wideband Extensions: G.729.1 Annex E and G.718 Annex B 104
4.2.1.4 G.711.1: Scalable Wideband Extension of G.711 105
4.2.1.5 Super-Wideband and Stereo Extensions of G.711.1 and G.722 105
4.2.1.6 Full-Band Coding in G.719 107
4.2.1.7 G.711.0 Lossless Coding 107
4.2.1.8 Packet Loss Concealment Algorithms for G.711 and G.722 108
4.2.2 IETF Codecs and Transport Protocols 108
4.2.2.1 Opus Codec 108
Audio Bandwidths and Bit Rate Sweet Spots 109
Variable and Constant Bit Rate Modes of Operation 109
Mono and Stereo Coding 110
Packet Loss Resilience 110
Forward Error Correction (Low Bit Rate Redundancy) 110
4.2.2.2 RTP Payload Formats 110
4.2.3 3GPP and the Enhanced Voice Services (EVS) Codec 111
4.2.4 Recent Codec Development in 3GPP2 112
4.2.5 Conversational Codecs in MPEG 113
References 115
Part II Review and Challenges in Speech, Speaker and Emotion Recognition 118
5 Ensemble Learning Approaches in Speech Recognition 119
5.1 Introduction 119
5.2 Background of Ensemble Methods in Machine Learning 120
5.2.1 Ensemble Learning 120
5.2.2 Boosting 121
5.2.3 Bagging 122
5.2.4 Random Forest 122
5.2.5 Classifier Combination 123
5.2.6 Ensemble Error Analyses 124
5.2.6.1 Added Error of an Ensemble Classifier 124
5.2.6.2 Bias–Variance–Covariance Decomposition 124
5.2.6.3 Error-Ambiguity Decomposition 125
5.2.7 Diversity Measures 126
5.2.8 Ensemble Pruning 126
5.2.9 Ensemble Clustering 127
5.3 Background of Speech Recognition 127
5.3.1 State-of-the-Art Speech Recognition System Architecture 127
5.3.2 Front-End Processing 128
5.3.3 Lexicon 129
5.3.4 Acoustic Model 129
5.3.5 Language Model 130
5.3.6 Decoding Search 131
5.4 Generating and Combining Diversity in Speech Recognition 132
5.4.1 System Places for Generating Diversity 132
5.4.1.1 Front End Processing 132
5.4.1.2 Acoustic Model 132
5.4.1.3 Language Model 133
5.4.2 System Levels for Utilizing Diversity 133
5.4.2.1 Utterance Level Combination 134
5.4.2.2 Word Level Combination 134
5.4.2.3 Subword Level Combination 136
5.4.2.4 State Level Combination 136
5.4.2.5 Feature Level Combination 139
5.5 Ensemble Learning Techniques for Acoustic Modeling 139
5.5.1 Explicit Diversity Generation 140
5.5.1.1 Boosting 140
5.5.1.2 Minimum Bayes Risk Leveraging (MBRL) 143
5.5.1.3 Directed Decision Trees 144
5.5.1.4 Deep Stacking Network 144
5.5.2 Implicit Diversity Generation 145
5.5.2.1 Multiple Systems and Multiple Models 145
5.5.2.2 Random Forest 146
5.5.2.3 Data Sampling 147
5.6 Ensemble Learning Techniques for Language Modeling 148
5.7 Performance Enhancing Mechanism of Ensemble Learning 149
5.7.1 Classification Margin 149
5.7.2 Diversity 150
5.7.3 Bias and Variance 151
5.8 Compacting Ensemble Models to Improve Efficiency 152
5.8.1 Model Clustering 153
5.8.2 Density Matching 153
5.9 Conclusion 154
References 155
6 Deep Dynamic Models for Learning Hidden Representations of Speech Features 159
6.1 Introduction 160
6.2 Generative Deep-Structured Speech Dynamics: Model Formulation 161
6.2.1 Generative Learning in Speech Recognition 161
6.2.2 A Hidden Dynamic Model with Nonlinear Observation Equation 166
6.2.3 A Linear Hidden Dynamic Model Amenable to Variational EM Training 167
6.3 Generative Deep-Structured Speech Dynamics: Model Estimation 169
6.3.1 Learning a Hidden Dynamic Model Using the Extended Kalman Filter 169
6.3.1.1 E-Step 169
6.3.1.2 M-Step 170
6.3.2 Learning a Hidden Dynamic Model Using Variational EM 172
6.3.2.1 Model Inference and Learning 172
6.3.2.2 The GMM Posterior 172
6.3.2.3 The HMM Posterior 173
6.4 Discriminative Deep Neural Networks Aided by Generative Pre-training 175
6.4.1 Restricted Boltzmann Machines 176
6.4.2 Stacking Up RBMs to Form a DBN 178
6.4.3 Interfacing the DNN with an HMM to Incorporate Sequential Dynamics 180
6.5 Recurrent Neural Networks for Discriminative Modeling of Speech Dynamics 181
6.5.1 RNNs Expressed in the State-Space Formalism 182
6.5.2 The BPTT Learning Algorithm 183
6.5.3 The EKF Learning Algorithm 186
6.6 Comparing Two Types of Dynamic Models 187
6.6.1 Top-Down Versus Bottom-Up 187
6.6.1.1 Top-Down Generative Hidden Dynamic Modeling 187
6.6.1.2 Bottom-Up Discriminative Recurrent Neural Networks and the "Generative" Counterpart 188
6.6.2 Localist Versus Distributed Representations 190
6.6.3 Latent Explanatory Variables Versus End-to-End Discriminative Learning 191
6.6.4 Parsimonious Versus Massive Parameters 192
6.6.5 Comparing Recognition Accuracy of the Two Types of Models 194
6.7 Summary and Discussions on Future Directions 194
References 196
7 Speech Based Emotion Recognition 202
7.1 Introduction 202
7.1.1 What Are Emotions? 203
7.1.2 Emotion Labels 205
7.1.3 The Emotion Recognition Task 207
7.2 Emotion Classification Systems 208
7.2.1 Short-Term Features 209
7.2.1.1 Pitch 209
7.2.1.2 Loudness/Energy 209
7.2.1.3 Spectral Features 210
7.2.1.4 Cepstral Features 210
7.2.2 High Dimensional Representation 210
7.2.2.1 Functional Approach to a High-Dimensional Representation 211
7.2.2.2 GMM Supervector Approach to High-Dimensional Representation 212
7.2.3 Modelling Emotions 213
7.2.3.1 Emotion Models: Linear Support Vector Machines 214
7.2.3.2 Emotion Models: Nonlinear Support Vector Machines 215
7.2.4 Alternative Emotion Modelling Methodologies 216
7.2.4.1 Supra-Frame Level Feature 217
7.2.4.2 Dynamic Emotion Models 218
7.3 Dealing with Variability 219
7.3.1 Phonetic Variability in Emotion Recognition Systems 219
7.3.2 Speaker Variability 221
7.3.2.1 Speaker Normalisation 222
7.3.2.2 Speaker Adaptation 223
7.4 Comparing Systems 224
7.5 Conclusions 226
References 228
8 Speaker Diarization: An Emerging Research 234
8.1 Overview 234
8.2 Signal Processing 235
8.2.1 Wiener Filtering 236
8.2.2 Acoustic Beamforming 236
8.3 Feature Extraction 237
8.3.1 Acoustic Features 237
8.3.1.1 Short-Term Spectral Features 238
8.3.1.2 Prosodic Features 239
8.3.2 Sound Source Features 239
8.3.3 Feature Normalization Techniques 241
8.3.3.1 RASTA Filtering 241
8.3.3.2 Cepstral Mean Normalization 242
8.3.3.3 Feature Warping 242
8.4 Speech Activity Detection 242
8.4.1 Energy-Based Speech Detection 243
8.4.2 Model Based Speech Detection 243
8.4.3 Hybrid Speech Detection 243
8.4.4 Multi-Channel Speech Detection 245
8.5 Clustering Architecture 245
8.5.1 Speaker Modeling 247
8.5.1.1 Gaussian Mixture Model 247
8.5.1.2 Hidden Markov Model 248
8.5.1.3 Total Factor Vector 249
8.5.1.4 Other Modeling Approaches 250
8.5.2 Distance Measures 251
8.5.2.1 Symmetric Kullback-Leibler Distance 252
8.5.2.2 Divergence Shape Distance 253
8.5.2.3 Arithmetic Harmonic Sphericity 253
8.5.2.4 Generalized Likelihood Ratio 253
8.5.2.5 Bayesian Information Criterion 254
8.5.2.6 Cross Likelihood Ratio 256
8.5.2.7 Normalized Cross Likelihood Ratio 256
8.5.2.8 Other Distance Measures 256
8.5.3 Speaker Segmentation 257
8.5.3.1 Silence Detection Based Methods 257
8.5.3.2 Metric-Based Segmentation 258
8.5.3.3 Hybrid Segmentation 260
8.5.3.4 Segmentation Evaluation 260
8.5.4 Speaker Clustering 261
8.5.4.1 Agglomerative Hierarchical Clustering 261
8.5.4.2 Divisive Hierarchical Clustering 266
8.5.4.3 Other Approaches 267
8.5.4.4 Multiple Systems Combination 268
8.5.5 Online Speaker Clustering 268
Segmentation 268
Novelty Detection 269
Speaker Modeling 270
8.5.5.1 Speaker Clustering Evaluation 270
8.6 Speaker Diarization Evaluation 272
8.7 Databases for Speaker Diarization in Meeting 272
8.8 Related Projects in Meeting Room 273
8.9 NIST Rich Transcription Benchmarks 273
8.10 Summary 274
References 275
Part III Current Trends in Speech Enhancement 283
9 Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement 284
9.1 Introduction 285
9.2 Signal Representation and Modeling for Multichannel Speech Enhancement 287
9.2.1 General Speech Capture Scenario for Multichannel Speech Enhancement 287
9.2.2 Time-Frequency Domain Representation of Signals 289
9.2.3 Generative Model of Desired Signals 290
9.2.4 Generative Model of Interference 292
9.3 Speech Enhancement Based on Maximum Likelihood Spectral Estimation (MLSE) 293
9.3.1 Maximum Likelihood Spectral Estimation (MLSE) 293
9.3.2 Processing Flow of MLSE Based Speech Enhancement 294
9.4 Speech Enhancement Based on Maximum A Posteriori Spectral Estimation (MAPSE) 295
9.4.1 Maximum A Posteriori Spectral Estimation (MAPSE) 296
9.4.2 Log-Spectral Prior of Speech 297
9.4.3 Expectation Maximization (EM) Algorithm 299
9.4.4 Update of n,f Based on Newton–Raphson Method 301
9.4.5 Processing Flow 302
9.5 Application to Blind Source Separation (BSS) 303
9.5.1 MLSE for BSS (ML-BSS) 303
9.5.1.1 Generative Models for ML-BSS 304
9.5.1.2 MLSE Based on EM Algorithm 305
9.5.1.3 Processing Flow of ML-BSS Based on EM Algorithm 307
9.5.2 MAPSE for BSS (MAP-BSS) 308
9.5.2.1 Generative Models for MAP-BSS 308
9.5.2.2 MAPSE Based on EM Algorithm 309
9.5.2.3 Processing Flow of MAP-BSS Based on EM Algorithm 311
9.5.2.4 Initialization of and (or ) 312
9.6 Experiments 313
9.6.1 Evaluation 1 with Aurora-2 Speech Database 313
9.6.2 Evaluation 2 with SiSEC Database 316
9.7 Concluding Remarks 318
References 318
10 Modulation Processing for Speech Enhancement 321
10.1 Introduction 322
10.2 Methods 324
10.2.1 Modulation AMS-Based Framework 324
10.2.2 Modulation Spectral Subtraction 327
10.2.3 MMSE Modulation Magnitude Estimation 330
10.2.3.1 MMSE Modulation Magnitude Estimation with SPU 333
10.2.3.2 MMSE Log-Modulation Magnitude Estimation 333
10.2.3.3 MME Parameters 334
10.3 Speech Quality Assessment 334
10.4 Evaluation of Short-Time Modulation-Domain Based Methods with Respect to Quality 335
10.5 Conclusion 342
References 344
Publication date (per publisher) | 14.10.2014
---|---
Additional information | X, 345 p., 79 illus., 32 illus. in color
Place of publication | New York
Language | English
Subject areas | Mathematics / Computer Science ► Computer Science ► Graphics / Design; Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics; Technology ► Electrical Engineering / Energy Technology
Keywords | Audio Coding • audio processing • Speaker Recognition • Speech coding • Speech Enhancement • Speech Noise Reduction • Speech processing • Speech Recognition
ISBN-10 | 1-4939-1456-1 / 1493914561
ISBN-13 | 978-1-4939-1456-2 / 9781493914562
Size: 6.3 MB
DRM: digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is passed on to third parties without authorization, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables and figures. A PDF can be displayed on almost any device, but is only of limited use on small screens (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a PDF viewer, e.g. the free Adobe Digital Editions app.
Additional feature: online reading
In addition to downloading it, you can also read this eBook online in your web browser.
Buying eBooks from abroad
For tax law reasons we can only sell eBooks within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.