Machine Learning Models and Algorithms for Big Data Classification (eBook)

Thinking with Examples for Effective Learning
eBook Download: PDF
2015 | 1st edition
XIX, 364 pages
Springer US (publisher)
978-1-4899-7641-3 (ISBN)


Machine Learning Models and Algorithms for Big Data Classification - Shan Suthaharan
149.79 incl. VAT

This book presents machine learning models and algorithms for addressing big data classification problems. Existing machine learning techniques such as the decision tree (a hierarchical approach), random forest (an ensemble hierarchical approach), and deep learning (a layered approach) are well suited to systems that handle such problems. This book helps readers, especially students and newcomers to the fields of big data and machine learning, gain a quick understanding of these techniques and technologies; to that end, the theory, examples, and programs (MATLAB and R) presented here have been simplified, hardcoded, repeated, or spaced out to aid comprehension. They provide vehicles for testing and understanding the complicated concepts of various topics in the field. Readers are expected to adopt these programs to experiment with the examples, and then to modify or write their own programs to advance their knowledge and tackle more complex, challenging problems.

The presentation format of this book emphasizes simplicity, readability, and dependability so that undergraduate and graduate students, as well as new researchers, developers, and practitioners in the field, can easily trust and grasp the concepts and learn them effectively. It has been written to reduce mathematical complexity and to help the vast majority of readers understand the topics and become interested in the field. The book consists of four parts, with a total of 14 chapters. The first part focuses mainly on topics needed to analyze and understand data and big data. The second part covers the systems required for processing big data. The third part presents the topics needed to understand and select machine learning techniques for classifying big data. Finally, the fourth part concentrates on scaling up machine learning, an important solution for modern big data problems.
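The description characterizes the decision tree as a hierarchical approach and the random forest as an ensemble of such trees. As a rough illustration of that relationship (the book itself uses MATLAB and R; this sketch is in Python, is not taken from the book, and uses an invented toy data set), a one-level tree (a "stump") and a small bagged ensemble of stumps can be written as:

```python
# A minimal sketch, not from the book: a decision stump (a one-level
# decision tree) and a tiny bagged ensemble of stumps, illustrating the
# idea behind a random forest. The 1-D data set and the seed are invented
# for illustration only.
import random

def train_stump(points, labels):
    """Pick the threshold on a 1-D feature that minimizes training errors."""
    best = None
    for t in sorted(set(points)):
        # Predict class 1 for x >= t, class 0 otherwise.
        errors = sum((1 if x >= t else 0) != y for x, y in zip(points, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

def predict_stump(threshold, x):
    return 1 if x >= threshold else 0

def train_forest(points, labels, n_trees=5, seed=0):
    """Bagging: train each stump on a bootstrap sample of the data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(points)) for _ in range(len(points))]
        stumps.append(train_stump([points[i] for i in idx],
                                  [labels[i] for i in idx]))
    return stumps

def predict_forest(stumps, x):
    """Majority vote over the ensemble."""
    votes = sum(predict_stump(t, x) for t in stumps)
    return 1 if votes * 2 > len(stumps) else 0

# Toy linearly separable 1-D data: class 0 below 5, class 1 above.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
```

A real random forest also randomizes the features considered at each split and uses deeper trees; this sketch only shows the bootstrap-and-vote structure that distinguishes the ensemble from a single tree.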



Shan Suthaharan is a Professor of Computer Science at the University of North Carolina at Greensboro (UNCG), North Carolina, USA, where he also serves as Director of Undergraduate Studies in the Department of Computer Science. He has more than twenty-five years of university teaching and administrative experience and has taught both undergraduate and graduate courses. His aspiration is to educate and train students so that they can prosper in the computing field by understanding current, complex real-world problems and developing efficient techniques and technologies. His current teaching interests include big data analytics and machine learning, cryptography and network security, and computer networking and analysis. He earned his doctorate in Computer Science from Monash University, Australia, and has since been actively disseminating his knowledge and experience through teaching, advising, seminars, research, and publications. Dr. Suthaharan enjoys investigating complex real-world problems and developing and implementing algorithms to solve them using modern technologies. The main theme of his current research is signature discovery and event detection, with the ultimate goal of building a secure and reliable environment using modern and emerging technologies. His research focuses primarily on the characterization and detection of environmental events, the exploration of machine learning techniques, and the development of advanced statistical and computational techniques for discovering key signatures and detecting emerging events in structured and unstructured big data. Dr. Suthaharan has authored or co-authored more than seventy-five research papers in computer science, published in international journals and refereed conference proceedings.
He also invented a key management and encryption technology, which has been patented in Australia, Japan, and Singapore. He has received visiting scholar awards from, and served as a visiting researcher at, the University of Sydney, Australia; the University of Melbourne, Australia; and the University of California, Berkeley, USA. He was a senior member of the Institute of Electrical and Electronics Engineers and twice volunteered as elected chair of its Central North Carolina Section. He is a member of Sigma Xi, the Scientific Research Society, and a Fellow of the Institution of Engineering and Technology.



Preface 8
Acknowledgements 10
About the Author 12
Contents 14
1 Science of Information 21
1.1 Data Science 21
1.1.1 Technological Dilemma 22
1.1.2 Technological Advancement 22
1.2 Big Data Paradigm 23
1.2.1 Facts and Statistics of a System 23
1.2.1.1 Data 23
1.2.1.2 Knowledge 24
1.2.1.3 Physical Operation 24
1.2.1.4 Mathematical Operation 25
1.2.1.5 Logical Operation 25
1.2.2 Big Data Versus Regular Data 25
1.2.2.1 Scenario 25
1.2.2.2 Data Representation 26
1.3 Machine Learning Paradigm 27
1.3.1 Modeling and Algorithms 27
1.3.2 Supervised and Unsupervised 27
1.3.2.1 Classification 28
1.3.2.2 Clustering 29
1.4 Collaborative Activities 30
1.5 A Snapshot 30
1.5.1 The Purpose and Interests 30
1.5.2 The Goal and Objectives 31
1.5.3 The Problems and Challenges 31
Problems 31
References 32
Part I Understanding Big Data 34
2 Big Data Essentials 35
2.1 Big Data Analytics 35
2.1.1 Big Data Controllers 36
2.1.2 Big Data Problems 37
2.1.3 Big Data Challenges 37
2.1.4 Big Data Solutions 38
2.2 Big Data Classification 38
2.2.1 Representation Learning 39
2.2.2 Distributed File Systems 40
2.2.3 Classification Modeling 41
2.2.3.1 Class Characteristics 41
2.2.3.2 Error Characteristics 42
2.2.3.3 Domain Characteristics 43
2.2.4 Classification Algorithms 43
2.2.4.1 Training 44
2.2.4.2 Validation 44
2.2.4.3 Testing 44
2.3 Big Data Scalability 44
2.3.1 High-Dimensional Systems 45
2.3.2 Low-Dimensional Structures 45
Problems 46
References 46
3 Big Data Analytics 48
3.1 Analytics Fundamentals 48
3.1.1 Research Questions 49
3.1.2 Choices of Data Sets 50
3.2 Pattern Detectors 51
3.2.1 Statistical Measures 51
3.2.1.1 Counting 51
3.2.1.2 Mean and Variance 51
3.2.1.3 Covariance and Correlation 54
3.2.2 Graphical Measures 55
3.2.2.1 Histogram 55
3.2.2.2 Skewness 55
3.2.2.3 Scatter Plot 58
3.2.3 Coding Example 58
3.3 Patterns of Big Data 61
3.3.1 Standardization: A Coding Example 64
3.3.2 Evolution of Patterns 66
3.3.3 Data Expansion Modeling 68
3.3.3.1 Orthogonalization: A Coding Example 69
3.3.3.2 No Mean-Shift, Max Weights, Gaussian Increase 72
3.3.3.3 Mean-Shift, Max Weights, Gaussian Increase 72
3.3.3.4 No Mean-Shift, Gaussian Weights, Gaussian Increase 74
3.3.3.5 Mean-Shift, Gaussian Weights, Gaussian Increase 74
3.3.3.6 Coding Example 74
3.3.4 Deformation of Patterns 79
3.3.4.1 Imbalanced Data 80
3.3.4.2 Inaccurate Data 80
3.3.4.3 Incomplete Data 81
3.3.4.4 Coding Example 82
3.3.5 Classification Errors 83
3.3.5.1 Approximation 83
3.3.5.2 Estimation 84
3.3.5.3 Optimization 84
3.4 Low-Dimensional Structures 84
3.4.1 A Toy Example 84
3.4.2 A Real Example 86
3.4.2.1 Relative Scoring 86
3.4.2.2 Coding Example 87
Problems 90
References 91
Part II Understanding Big Data Systems 93
4 Distributed File System 94
4.1 Hadoop Framework 94
4.1.1 Hadoop Distributed File System 95
4.1.2 MapReduce Programming Model 96
4.2 Hadoop System 96
4.2.1 Operating System 97
4.2.2 Distributed System 97
4.2.3 Programming Platform 98
4.3 Hadoop Environment 98
4.3.1 Essential Tools 99
4.3.1.1 Windows 7 (WN) 99
4.3.1.2 VirtualBox (VB) 99
4.3.1.3 Ubuntu Linux (UB) 99
4.3.1.4 Cloudera Hadoop (CH) 100
4.3.1.5 R and RStudio (RR) 100
4.3.2 Installation Guidance 100
4.3.2.1 Internet Resources 101
4.3.2.2 Setting Up a Virtual Machine 102
4.3.2.3 Setting Up a Ubuntu O/S 102
4.3.2.4 Setting Up a Hadoop Distributed File System 103
4.3.2.5 Setting Up an R Environment 104
4.3.2.6 RStudio 107
4.3.3 RStudio Server 108
4.3.3.1 Server Setup 108
4.3.3.2 Client Setup 108
4.4 Testing the Hadoop Environment 109
4.4.1 Standard Example 109
4.4.2 Alternative Example 110
4.5 Multinode Hadoop 110
4.5.1 Virtual Network 111
4.5.2 Hadoop Setup 111
Problems 112
References 112
5 MapReduce Programming Platform 113
5.1 MapReduce Framework 113
5.1.1 Parametrization 114
5.1.2 Parallelization 115
5.2 MapReduce Essentials 116
5.2.1 Mapper Function 116
5.2.2 Reducer Function 117
5.2.3 MapReduce Function 118
5.2.4 A Coding Example 118
5.3 MapReduce Programming 121
5.3.1 Naming Convention 121
5.3.2 Coding Principles 122
5.3.2.1 Input: Initialization 122
5.3.2.2 Input: Fork MapReduce job 123
5.3.2.3 Input: Add Input to dfs 123
5.3.2.4 Processing: Mapper 124
5.3.2.5 Processing: Reducer 124
5.3.2.6 Processing: MapReduce 124
5.3.2.7 Output: Get Output from dfs 124
5.3.3 Application of Coding Principles 124
5.3.3.1 A Coding Example 125
5.3.3.2 Pythagorean Numbers 126
5.3.3.3 Summarization 127
5.4 File Handling in MapReduce 127
5.4.1 Pythagorean Numbers 128
5.4.2 File Split Example 129
5.4.3 File Split Improved 130
Problems 132
References 132
Part III Understanding Machine Learning 134
6 Modeling and Algorithms 135
6.1 Machine Learning 135
6.1.1 A Simple Example 136
6.1.2 Domain Division Perspective 137
6.1.3 Data Domain 140
6.1.4 Domain Division 141
6.2 Learning Models 142
6.2.1 Mathematical Models 144
6.2.2 Hierarchical Models 146
6.2.3 Layered Models 147
6.2.4 Comparison of the Models 147
6.2.4.1 Data Domain Perspective 147
6.2.4.2 Programming Perspective 148
6.3 Learning Algorithms 152
6.3.1 Supervised Learning 152
6.3.2 Types of Learning 153
Problems 154
References 154
7 Supervised Learning Models 156
7.1 Supervised Learning Objectives 156
7.1.1 Parametrization Objectives 157
7.1.1.1 Prediction Point of View 157
7.1.1.2 Classification Point of View 158
7.1.2 Optimization Objectives 159
7.1.2.1 Prediction Point of View 160
7.1.2.2 Classification Point of View 161
7.2 Regression Models 161
7.2.1 Continuous Response 162
7.2.2 Theory of Regression Models 162
7.2.2.1 Standard Regression 162
7.2.2.2 Ridge Regression 165
7.2.2.3 Lasso Regression 167
7.2.2.4 Elastic-Net Regression 169
7.3 Classification Models 171
7.3.1 Discrete Response 171
7.3.2 Mathematical Models 173
7.3.2.1 Logistic Regression 173
7.3.2.2 SVM Family 175
7.4 Hierarchical Models 177
7.4.1 Decision Tree 178
7.4.2 Random Forest 178
7.4.2.1 A Coding Example 180
7.5 Layered Models 181
7.5.1 Shallow Learning 182
7.5.1.1 A Coding Example 182
7.5.2 Deep Learning 188
7.5.2.1 Some Modern Deep Learning Models 190
Problems 190
References 191
8 Supervised Learning Algorithms 193
8.1 Supervised Learning 193
8.1.1 Learning 195
8.1.2 Training 196
8.1.3 Testing 198
8.1.4 Validation 200
8.1.4.1 Testing of Models on Seen Data 201
8.1.4.2 Testing of Models on Unseen Data 201
8.1.4.3 Testing of Models on Partially Seen and Unseen Data 202
8.2 Cross-Validation 202
8.2.1 Tenfold Cross-Validation 203
8.2.2 Leave-One-Out 203
8.2.3 Leave-p-Out 204
8.2.4 Random Subsampling 205
8.2.5 Dividing Data Sets 205
8.2.5.1 Possible Ratios 206
8.2.5.2 Significance 206
8.3 Measures 206
8.3.1 Quantitative Measure 207
8.3.1.1 Distance-Based 207
8.3.1.2 Irregularity-Based 207
8.3.1.3 Probability-Based 208
8.3.2 Qualitative Measure 208
8.3.2.1 Visualization-Based 208
8.3.2.2 Confusion-Based 209
8.3.2.3 Oscillation-Based 211
8.4 A Simple 2D Example 212
Problems 214
References 215
9 Support Vector Machine 217
9.1 Linear Support Vector Machine 217
9.1.1 Linear Classifier: Separable Linearly 218
9.1.1.1 The Learning Model 220
9.1.1.2 A Coding Example: Two Points, Single Line 221
9.1.1.3 A Coding Example: Two Points, Three Lines 222
9.1.1.4 A Coding Example: Five Points, Three Lines 226
9.1.2 Linear Classifier: Nonseparable Linearly 228
9.2 Lagrangian Support Vector Machine 229
9.2.1 Modeling of LSVM 229
9.2.2 Conceptualized Example 229
9.2.3 Algorithm and Coding of LSVM 230
9.3 Nonlinear Support Vector Machine 233
9.3.1 Feature Space 234
9.3.2 Kernel Trick 234
9.3.3 SVM Algorithms on Hadoop 237
9.3.3.1 SVM: Reducer Implementation 237
9.3.3.2 LSVM: Mapper Implementation 240
9.3.4 Real Application 243
Problems 244
References 245
10 Decision Tree Learning 246
10.1 The Decision Tree 246
10.1.1 A Coding Example—Classification Tree 250
10.1.2 A Coding Example—Regression Tree 253
10.2 Types of Decision Trees 254
10.2.1 Classification Tree 255
10.2.2 Regression Tree 256
10.3 Decision Tree Learning Model 257
10.3.1 Parametrization 257
10.3.2 Optimization 258
10.4 Quantitative Measures 259
10.4.1 Entropy and Cross-Entropy 259
10.4.2 Gini Impurity 261
10.4.3 Information Gain 264
10.5 Decision Tree Learning Algorithm 265
10.5.1 Training Algorithm 266
10.5.2 Validation Algorithm 272
10.5.3 Testing Algorithm 272
10.6 Decision Tree and Big Data 275
10.6.1 Toy Example 275
Problems 277
References 278
Part IV Understanding Scaling-Up Machine Learning 279
11 Random Forest Learning 280
11.1 The Random Forest 280
11.1.1 Parallel Structure 281
11.1.2 Model Parameters 282
11.1.3 Gain/Loss Function 283
11.1.4 Bootstrapping and Bagging 283
11.1.4.1 Bootstrapping 283
11.1.4.2 Overlap Thinning 284
11.1.4.3 Bagging 285
11.2 Random Forest Learning Model 285
11.2.1 Parametrization 286
11.2.2 Optimization 286
11.3 Random Forest Learning Algorithm 286
11.3.1 Training Algorithm 287
11.3.1.1 Coding Example 288
11.3.2 Testing Algorithm 290
11.4 Random Forest and Big Data 291
11.4.1 Random Forest Scalability 291
11.4.2 Big Data Classification 291
Problems 294
References 295
12 Deep Learning Models 296
12.1 Introduction 296
12.2 Deep Learning Techniques 298
12.2.1 No-Drop Deep Learning 298
12.2.2 Dropout Deep Learning 298
12.2.3 Dropconnect Deep Learning 299
12.2.4 Gradient Descent 300
12.2.4.1 Conceptualized Example 301
12.2.4.2 Numerical Example 302
12.2.5 A Simple Example 304
12.2.6 MapReduce Implementation 305
12.3 Proposed Framework 308
12.3.1 Motivation 308
12.3.2 Parameters Mapper 308
12.4 Implementation of Deep Learning 310
12.4.1 Analysis of Domain Divisions 310
12.4.2 Analysis of Classification Accuracies 310
12.5 Ensemble Approach 312
Problems 313
References 313
13 Chandelier Decision Tree 315
13.1 Unit Circle Algorithm 315
13.1.1 UCA Classification 316
13.1.2 Improved UCA Classification 317
13.1.3 A Coding Example 318
13.1.4 Drawbacks of UCA 321
13.2 Unit Circle Machine 321
13.2.1 UCM Classification 321
13.2.2 A Coding Example 322
13.2.3 Drawbacks of UCM 324
13.3 Unit Ring Algorithm 324
13.3.1 A Coding Example 325
13.3.2 Unit Ring Machine 327
13.3.3 A Coding Example 327
13.3.4 Drawbacks of URM 329
13.4 Chandelier Decision Tree 329
13.4.1 CDT-Based Classification 330
13.4.2 Extension to Random Chandelier 334
Problems 334
References 334
14 Dimensionality Reduction 335
14.1 Introduction 335
14.2 Feature Hashing Techniques 336
14.2.1 Standard Feature Hashing 337
14.2.2 Flagged Feature Hashing 337
14.3 Proposed Feature Hashing 338
14.3.1 Binning and Mitigation 338
14.3.2 Mitigation Justification 339
14.3.3 Toy Example 339
14.4 Simulation and Results 340
14.4.1 A Matlab Implementation 340
14.4.2 A MapReduce Implementation 343
14.5 Principal Component Analysis 346
14.5.1 Eigenvector 347
14.5.2 Principal Components 349
14.5.3 The Principal Directions 352
14.5.4 A 2D Implementation 354
14.5.5 A 3D Implementation 356
14.5.6 A Generalized Implementation 358
Problems 360
References 360
Index 362

Publication date (per publisher): October 20, 2015
Series: Integrated Series in Information Systems
Additional information: XIX, 359 p., 149 illus., 82 illus. in color
Place of publication: New York
Language: English
Subject areas: Mathematics / Computer Science » Computer Science » Databases; Computer Science » Theory / Studies » Artificial Intelligence / Robotics; Business » Business Administration / Management » Corporate Management
Keywords: Big Data • Classification • Data Visualization • Machine Learning • Supervised Learning • Unit Circle Machine
ISBN-10 1-4899-7641-8 / 1489976418
ISBN-13 978-1-4899-7641-3 / 9781489976413
File format: PDF (digital watermark)
Size: 9.6 MB

