Handbook of Big Data Analytics (eBook)

eBook Download: PDF
2018 | 1st ed. 2018
VIII, 538 pages
Springer International Publishing (publisher)
978-3-319-18284-1 (ISBN)

341.33 incl. VAT
  • Download available immediately
Addressing a broad range of big data analytics in cross-disciplinary applications, this essential handbook focuses on the statistical prospects offered by recent developments in this field. To do so, it covers statistical methods for high-dimensional problems, algorithmic designs, computation tools, analysis flows, and the software-hardware co-designs needed to support insightful discoveries from big data. The book is primarily intended for statisticians, computer experts, engineers, and application developers interested in using big data analytics with statistics. Readers should have a solid background in statistics and computer science.

Wolfgang Karl Härdle is Ladislaus von Bortkievicz Professor of Statistics at the Humboldt University of Berlin and director of C.A.S.E. (Center for Applied Statistics and Economics), director of the Collaborative Research Center 649 'Economic Risk' and also of the IRTG 1792 'High Dimensional Nonstationary Time Series'. He teaches quantitative finance and semiparametric statistics. Professor Härdle's research focuses on dynamic factor models, multivariate statistics in finance and computational statistics. He is an elected member of the International Statistical Institute (ISI) and advisor to the Guanghua School of Management, Peking University, China.

Henry Horng-Shing Lu is Professor at the Institute of Statistics of the National Chiao Tung University, Taiwan and serves as the Vice President of Academic Affairs. He received his Ph.D. in Statistics from Cornell University, NY in 1994. He is an elected member of the International Statistical Institute (ISI). His research interests include statistics, applications and big data analytics. Professor Lu analyzes different types of data by developing statistical methodologies for machine learning with the power of statistical inference and computation algorithms. His findings were published in a wide spectrum of journals and conference papers. He also co-edited the Handbook of Statistical Bioinformatics, published by Springer in 2011.

Xiaotong Shen is John Black Johnston Distinguished Professor at the School of Statistics of the University of Minnesota, MN. He received his Ph.D. in Statistics from the University of Chicago, IL in 1991. He is Fellow of the American Statistical Association (ASA), the Institute of Mathematical Statistics (IMS), and the American Association for the Advancement of Science (AAAS) as well as an elected member of the International Statistical Institute (ISI). Professor Shen's areas of interest include machine learning and data mining, likelihood-based inference, semiparametric and nonparametric models, model selection and averaging. His current research efforts are mainly devoted to the further development of structured learning as well as high-dimensional/high-order analysis. The targeted application areas are biomedical sciences and engineering.


Preface 6
Contents 7
Part I Overview 9
1 Statistics, Statisticians, and the Internet of Things 10
1.1 Introduction 11
1.1.1 The Internet of Things 11
1.1.2 What Is Big Data in an Internet of Things? 11
1.1.3 Building Blocks 12
1.1.4 Ubiquity 13
1.1.5 Consumer Applications 15
1.1.6 The Internets of [Infrastructure] Things 17
1.1.7 Industrial Scenarios 19
1.2 What Kinds of Statistics Are Needed for Big IoT Data? 20
1.2.1 Coping with Complexity 20
1.2.2 Privacy 21
1.2.3 Traditional Statistics Versus the IoT 22
1.2.4 A View of the Future of Statistics in an IoT World 23
1.3 Big Data in the Real World 24
1.3.1 Skills 24
1.3.2 Politics 25
1.3.3 Technique 25
1.3.4 Traditional Databases 26
1.3.5 Cognition 26
1.4 Conclusion 27
2 Cognitive Data Analysis for Big Data 29
2.1 Introduction 30
2.1.1 Big Data 30
2.1.2 Defining Cognitive Data Analysis 31
2.1.3 Stages of CDA 33
2.2 Data Preparation 35
2.2.1 Natural Language Query 36
2.2.2 Data Integration 37
2.2.3 Metadata Discovery 38
2.2.4 Data Quality Verification 39
2.2.5 Data Type Detection 40
2.2.6 Data Lineage 41
2.3 Automated Modeling 42
2.3.1 Descriptive Analytics 42
2.3.2 Predictive Analytics 43
2.3.3 Starting Points 44
2.3.4 System Recommendations 45
2.4 Application of Results 46
2.4.1 Gaining Insights 46
2.4.2 Sharing and Collaborating 47
2.4.3 Deployment 47
2.5 Use Case 48
2.6 Conclusion 52
References 52
Part II Methodology 54
3 Statistical Leveraging Methods in Big Data 55
3.1 Background 55
3.2 Leveraging Approximation for Least Squares Estimator 58
3.2.1 Leveraging for Least Squares Approximation 58
3.2.2 A Matrix Approximation Perspective 60
3.2.3 The Computation of Leveraging Scores 61
3.2.4 An Innovative Proposal: Predictor-Length Method 61
3.2.5 More on Modeling 63
3.2.6 Statistical Leveraging Algorithms in the Literature: A Summary 63
3.3 Statistical Properties of Leveraging Estimator 64
3.3.1 Weighted Leveraging Estimator 64
3.3.2 Unweighted Leveraging Estimator 66
3.4 Simulation Study 68
3.4.1 UNIF and BLEV 68
3.4.2 BLEV and LEVUNW 69
3.4.3 BLEV and SLEV 69
3.4.4 BLEV and PL 70
3.4.5 SLEV and PL 70
3.5 Real Data Analysis 72
3.6 Beyond Linear Regression 74
3.6.1 Logistic Regression 74
3.6.2 Time Series Analysis 75
3.7 Discussion and Conclusion 76
References 76
4 Scattered Data and Aggregated Inference 79
4.1 Introduction 80
4.2 Problem Formulation 84
4.2.1 Notations 84
4.2.2 Review on M-Estimators 86
4.2.3 Simple Averaging Estimator 86
4.2.4 One-Step Estimator 87
4.3 Main Results 88
4.3.1 Assumptions 89
4.3.2 Asymptotic Properties and Mean Squared Errors (MSE) Bounds 90
4.3.3 Under the Presence of Communication Failure 91
4.4 Numerical Examples 92
4.4.1 Logistic Regression 93
4.4.2 Beta Distribution 95
4.4.3 Beta Distribution with Possibility of Losing Information 97
4.4.4 Gaussian Distribution with Unknown Mean and Variance 99
4.5 Discussion on Distributed Statistical Inference 100
4.6 Other Problems 102
4.7 Conclusion 104
References 104
5 Nonparametric Methods for Big Data Analytics 107
5.1 Introduction 107
5.2 Classical Methods for Nonparametric Regression 109
5.2.1 Additive Models 109
5.2.2 Generalized Additive Models (GAM) 111
5.2.3 Smoothing Spline ANOVA (SS-ANOVA) 111
5.3 High Dimensional Additive Models 113
5.3.1 COSSO Method 114
5.3.2 Adaptive COSSO 117
5.3.3 Linear and Nonlinear Discoverer (LAND) 119
5.3.4 Adaptive Group LASSO 122
5.3.5 Sparse Additive Models (SpAM) 123
5.3.6 Sparsity-Smoothness Penalty 124
5.4 Nonparametric Independence Screening (NIS) 125
References 126
6 Finding Patterns in Time Series 129
6.1 Introduction 130
6.1.1 Regime Descriptors: Local Models 130
6.1.2 Changepoints 131
6.1.3 Patterns 131
6.1.4 Clustering, Classification, and Prediction 132
6.1.5 Measures of Similarity/Dissimilarity 132
6.1.6 Outline 132
6.2 Data Reduction and Changepoints 133
6.2.1 Piecewise Constant Models 134
6.2.2 Models with Changing Scales 135
6.2.3 Trends 136
6.3 Model Building 138
6.3.1 Batch Methods 139
6.3.2 Online Methods 139
6.4 Model Building: Alternating Trends Smoothing 139
6.4.1 The Tuning Parameter 141
6.4.2 Modifications and Extensions 144
6.5 Bounding Lines 145
6.6 Patterns 148
6.6.1 Time Scaling and Junk 149
6.6.2 Further Data Reduction: Symbolic Representation 150
6.6.3 Symbolic Trend Patterns (STP) 151
6.6.4 Patterns in Bounding Lines 152
6.6.5 Clustering and Classification of Time Series 153
References 154
7 Variational Bayes for Hierarchical Mixture Models 155
7.1 Introduction 156
7.2 Variational Bayes 158
7.2.1 Overview of the VB Method 158
7.2.2 Practicality 160
7.2.3 Over-Confidence 161
7.2.4 Simple Two-Component Mixture Model 161
7.2.5 Marginal Posterior Approximation 164
7.3 VB for a General Finite Mixture Model 166
7.3.1 Motivation 166
7.3.2 The B-LIMMA Model 167
7.4 Numerical Illustrations 169
7.4.1 Simulation 169
7.4.1.1 The B-LIMMA Model 170
7.4.1.2 A Mixture Model Extended from the LIMMA Model 173
7.4.1.3 A Mixture Model for Count Data 179
7.4.2 Real Data Examples 181
7.4.2.1 APOA1 Data 181
7.4.2.2 Colon Cancer Data 184
7.5 Discussion 185
Appendix: The VB-LEMMA Algorithm 187
The B-LEMMA Model 187
Algorithm 188
The VB-Proteomics Algorithm 193
The Proteomics Model 193
Algorithm 194
References 203
8 Hypothesis Testing for High-Dimensional Data 206
8.1 Introduction 206
8.2 Applications 208
8.2.1 Testing of Covariance Matrices 208
8.2.2 Testing of Independence 209
8.2.3 Analysis of Variance 210
8.3 Tests Based on L∞ Norms 211
8.4 Tests Based on L2 Norms 214
8.5 Asymptotic Theory 216
8.5.1 Preamble: i.i.d. Gaussian Data 217
8.5.2 Rademacher Weighted Differencing 218
8.5.3 Calculating the Power 219
8.5.4 An Algorithm with General Testing Functionals 220
8.6 Numerical Experiments 220
8.6.1 Test of Mean Vectors 220
8.6.2 Test of Covariance Matrices 224
8.6.2.1 Sizes Accuracy 224
8.6.2.2 Power Curve 224
8.6.3 A Real Data Application 225
References 226
9 High-Dimensional Classification 228
9.1 Introduction 228
9.2 LDA, Logistic Regression, and SVMs 230
9.2.1 LDA 230
9.2.2 Logistic Regression 230
9.2.3 The Support Vector Machine 231
9.3 Lasso and Elastic-Net Penalized SVMs 233
9.3.1 The ℓ1 SVM 233
9.3.2 The DrSVM 234
9.4 Lasso and Elastic-Net Penalized Logistic Regression 235
9.5 Huberized SVMs 237
9.6 Concave Penalized Margin-Based Classifiers 243
9.7 Sparse Discriminant Analysis 247
9.7.1 Independent Rules 248
9.7.2 Linear Programming Discriminant Analysis 250
9.7.3 Direct Sparse Discriminant Analysis 251
9.8 Sparse Semiparametric Discriminant Analysis 253
9.9 Sparse Penalized Additive Models for Classification 256
References 262
10 Analysis of High-Dimensional Regression Models Using Orthogonal Greedy Algorithms 265
10.1 Introduction 265
10.2 Convergence Rates of OGA 267
10.2.1 Random Regressors 267
10.2.2 The Fixed Design Case 270
10.3 The Performance of OGA Under General Sparse Conditions 272
10.3.1 Rates of Convergence 272
10.3.2 Comparative Studies 273
10.4 The Performance of OGA in High-Dimensional Time Series Models 276
References 284
11 Semi-supervised Smoothing for Large Data Problems 286
11.1 Introduction 286
11.2 Semi-supervised Local Kernel Regression 287
11.2.1 Supervised Kernel Regression 288
11.2.2 Semi-supervised Kernel Regression with a Latent Response 291
11.2.3 Adaptive Semi-supervised Kernel Regression 294
11.2.4 Computational Issues for Large Data 296
11.3 Optimization Frameworks for Semi-supervised Learning 296
References 299
12 Inverse Modeling: A Strategy to Cope with Non-linearity 301
12.1 Introduction 301
12.2 SDR and Inverse Modeling 303
12.2.1 From SIR to PFC 303
12.2.2 Revisiting SDR from an Inverse Modeling Perspective 305
12.3 Variable Selection 308
12.3.1 Beyond Sufficient Dimension Reduction: The Necessity of Variable Selection 308
12.3.2 SIR as a Transformation-Projection Pursuit Problem 308
12.3.3 COP: Correlation Pursuit 309
12.3.4 From COP to SIRI 312
12.3.5 Simulation Study for Variable Selection and SDR Estimation 315
12.4 Nonparametric Dependence Screening 317
12.5 Conclusion 321
References 322
13 Sufficient Dimension Reduction for Tensor Data 324
13.1 Curse of Dimensionality 324
13.2 Sufficient Dimension Reduction 326
13.3 Tensor Sufficient Dimension Reduction 329
13.3.1 Tensor Sufficient Dimension Reduction Model 329
13.3.2 Estimating a Single Direction 330
13.4 Simulation Studies 332
13.5 Example 334
13.6 Discussion 335
References 336
14 Compressive Sensing and Sparse Coding 338
14.1 Leveraging the Sparsity Assumption for Signal Recovery 338
14.2 From Combinatorial to Convex Optimization 339
14.3 Dealing with Noisy Measurements 339
14.4 Other Common Forms and Variations 340
14.5 The Theory Behind 340
14.5.1 The Restricted Isometry Property 340
14.5.2 Guaranteed Signal Recovery 341
14.5.3 Random Matrix is Good Enough 341
14.6 Compressive Sensing in Practice 342
14.6.1 Solving the Compressive Sensing Problem 342
14.6.2 Sparsifying Basis 342
14.6.3 Sensing Matrix 343
14.7 Sparse Coding Overview 344
14.7.1 Compressive Sensing and Sparse Coding 345
14.7.1.1 Compressed Domain Feature Extraction 346
14.7.1.2 Compressed Domain Classification 346
14.8 Compressive Sensing Extensions 347
14.8.1 Reconstruction with Additional Information 347
14.8.2 Compressive Sensing with Distorted Measurements 347
References 348
15 Bridging Density Functional Theory and Big Data Analytics with Applications 350
15.1 Introduction 351
15.2 Structure of Data Functionals Defined in the DFT Perspectives 353
15.3 Determinations of Number of Data Groups and the Corresponding Data Boundaries 358
15.4 Physical Phenomena of the Mixed Data Groups 362
15.4.1 Physical Structure of the DFT-Based Algorithm 362
15.4.2 Typical Problem of the Data Clustering: The Fisher's Iris 364
15.4.3 Tentative Experiments on Dataset of MRI with Brain Tumors 366
15.5 Conclusion 369
References 370
Part III Software 374
16 Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing 375
16.1 Introduction: From Data to Information 376
16.1.1 Transparency, Collaboration, and Reproducibility 377
16.2 Related Work 378
16.3 Q3-D3 Genesis 378
16.4 Vector Space Representations 382
16.4.1 Text to Vector 382
16.4.2 Weighting Scheme, Similarity, Distance 384
16.4.3 Shakespeare's Tragedies 389
16.4.4 Generalized VSM (GVSM) 391
16.4.4.1 Basic VSM (BVSM) 392
16.4.4.2 GVSM: Term–Term Correlations 392
16.4.4.3 GVSM: Latent Semantic Analysis (LSA) 393
16.4.4.4 Closer Look at the LSA Implementation 394
16.4.4.5 GVSM Applicability for Big Data 395
16.5 Methods 396
16.5.1 Cluster Analysis 396
16.5.1.1 Partitional Clustering 397
16.5.1.2 Hierarchical Clustering 399
16.5.2 Cluster Validation Measures 399
16.5.2.1 Connectivity 400
16.5.2.2 Silhouette 401
16.5.2.3 Dunn Index 402
16.5.3 Visual Cluster Validation 402
16.6 Results 403
16.6.1 Text Preprocessing Results 403
16.6.2 Sparsity Results 404
16.6.3 Three Models, Three Methods, Three Measures 406
16.6.4 LSA Anatomy 411
16.7 Application 411
16.8 Outlook 413
16.8.1 GitHub Mining Infrastructure in R 413
16.8.2 Future Developments 414
Appendix 415
References 420
17 A Tutorial on Libra: R Package for the Linearized Bregman Algorithm in High-Dimensional Statistics 423
17.1 Introduction to Libra 424
17.2 Linear Model 427
17.2.1 Example: Simulation Data 429
17.2.2 Example: Diabetes Data 431
17.3 Logistic Model 432
17.3.1 Binomial Logistic Model 432
17.3.1.1 Example: Publications of COPSS Award Winners 434
17.3.1.2 Example: Journey to the West 435
17.3.2 Multinomial Logistic Model 436
17.4 Graphical Model 438
17.4.1 Gaussian Graphical Model 439
17.4.1.1 Example: Journey to the West 440
17.4.2 Ising Model 442
17.4.2.1 Example: Simulation Data 443
17.4.2.2 Example: Journey to the West 444
17.4.2.3 Example: Dream of the Red Chamber 446
17.4.3 Potts Model 448
17.5 Discussion 450
References 451
Part IV Application 452
18 Functional Data Analysis for Big Data: A Case Study on California Temperature Trends 453
18.1 Introduction 453
18.2 Basic Statistics for Functional Data 455
18.3 Dimension Reduction for Functional Data 456
18.4 Functional Principal Component Analysis 457
18.4.1 Smoothing and Interpolation 459
18.4.2 Sample Size Considerations 462
18.5 Functional Variance Process 463
18.6 Functional Data Analysis for Temperature Trends 465
18.7 Conclusions 475
References 476
19 Bayesian Spatiotemporal Modeling for Detecting Neuronal Activation via Functional Magnetic Resonance Imaging 480
19.1 Introduction 481
19.1.1 Emotion Processing Data 482
19.2 Variable Selection in Bayesian Spatiotemporal Models 483
19.2.1 Bezener et al.'s (2015) Areal Model 484
19.2.1.1 Posterior Distribution and MCMC Algorithm 486
19.2.1.2 Starting Values 487
19.2.1.3 Emotion Processing Data 487
19.2.2 Musgrove et al.'s (2015) Areal Model 488
19.2.2.1 Partitioning the Image 489
19.2.2.2 Spatial Bayesian Variable Selection with Temporal Correlation 489
19.2.2.3 Sparse SGLMM Prior 490
19.2.2.4 Posterior Computation and Inference 491
19.2.2.5 Emotion Processing Data 492
19.2.3 Activation Maps for Emotion Processing Data 493
19.3 Discussion 494
References 494
20 Construction of Tight Frames on Graphs and Application to Denoising 497
20.1 Introduction 497
20.1.1 Motivation 497
20.1.2 Relation to Previous Work 498
20.2 Notation and Basics 499
20.2.1 Setting 499
20.2.2 Frames 500
20.2.3 Neighborhood Graphs 501
20.2.4 Spectral Graph Theory 502
20.3 Construction and Properties 503
20.3.1 Construction of a Tight Graph Frame 503
20.3.2 Spatial Localization 505
20.4 Denoising 508
20.5 Numerical Experiments 511
20.6 Outlook 512
Appendix 514
Proof of Theorem 3 514
References 515
21 Beta-Boosted Ensemble for Big Credit Scoring Data 517
21.1 Introduction 517
21.2 Method Description 519
21.2.1 Beta Binomial Distribution 519
21.2.2 Beta-Boosted Ensemble Model 520
21.2.3 Toy Example 522
21.2.4 Relation to Existing Solutions 525
21.3 Experiments 525
21.4 Conclusion and Future Work 531
References 531

Publication date (per publisher): 20 July 2018
Series: Springer Handbooks of Computational Statistics
Additional info: VIII, 538 p., 147 illus., 109 illus. in color
Place of publication: Cham
Language: English
Subject areas: Mathematics / Computer Science > Computer Science > Databases
Mathematics / Computer Science > Mathematics > Statistics
Mathematics / Computer Science > Mathematics > Probability / Combinatorics
Keywords: Big Data • Computational Statistics • Data Analytics • High-dimensional Data Analysis • Quantlet • Software-hardware Co-designs
ISBN-10 3-319-18284-6 / 3319182846
ISBN-13 978-3-319-18284-1 / 9783319182841
Format: PDF (digital watermark)
Size: 16.4 MB

DRM: Digital watermark
This eBook contains a digital watermark and is thus personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to reference books with columns, tables, and figures. A PDF can be displayed on almost all devices, but is only suitable to a limited extent for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Online reading
In addition to downloading this eBook, you can also read it online in your web browser.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.
