Data Preprocessing in Data Mining - Salvador García, Julián Luengo, Francisco Herrera

Data Preprocessing in Data Mining (eBook)

Salvador García, Julián Luengo, Francisco Herrera (Authors)

eBook Download: PDF

2014 | 1st edition
XV, 327 pages
Springer-Verlag
978-3-319-10247-4 (ISBN)

Data Preprocessing in Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source will likely contain inconsistencies and errors or, most importantly, will not be ready to be used in a data mining process. Furthermore, the increasing amount of data in recent science, industry and business applications calls for more complex tools to analyze it. Thanks to data preprocessing, it is possible to convert the impossible into the possible, adapting the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data reduction techniques, which aim to reduce the complexity of the data by detecting or removing irrelevant and noisy elements.

This book reviews the tasks that fill the gap between data acquisition from the source and the data mining process. It takes a comprehensive look from a practical point of view, covering basic concepts and surveying the techniques proposed in the specialized literature. Each chapter is a stand-alone guide to a particular data preprocessing topic, ranging from basic concepts and detailed descriptions of classical algorithms to an exhaustive catalog of recent developments. The in-depth technical descriptions make this book suitable for technical professionals, researchers, and senior undergraduate and graduate students in data science, computer science and engineering.

Preface
Contents
Acronyms
1 Introduction
1.1 Data Mining and Knowledge Discovery
1.2 Data Mining Methods
1.3 Supervised Learning
1.4 Unsupervised Learning
1.4.1 Pattern Mining
1.4.2 Outlier Detection
1.5 Other Learning Paradigms
1.5.1 Imbalanced Learning
1.5.2 Multi-instance Learning
1.5.3 Multi-label Classification
1.5.4 Semi-supervised Learning
1.5.5 Subgroup Discovery
1.5.6 Transfer Learning
1.5.7 Data Stream Learning
1.6 Introduction to Data Preprocessing
1.6.1 Data Preparation
1.6.2 Data Reduction
References
2 Data Sets and Proper Statistical Analysis of Data Mining Techniques
2.1 Data Sets and Partitions
2.1.1 Data Set Partitioning
2.1.2 Performance Measures
2.2 Using Statistical Tests to Compare Methods
2.2.1 Conditions for the Safe Use of Parametric Tests
2.2.2 Normality Test over the Group of Data Sets and Algorithms
2.2.3 Non-parametric Tests for Comparing Two Algorithms in Multiple Data Set Analysis
2.2.4 Non-parametric Tests for Multiple Comparisons Among More than Two Algorithms
References
3 Data Preparation Basic Models
3.1 Overview
3.2 Data Integration
3.2.1 Finding Redundant Attributes
3.2.2 Detecting Tuple Duplication and Inconsistency
3.3 Data Cleaning
3.4 Data Normalization
3.4.1 Min-Max Normalization
3.4.2 Z-score Normalization
3.4.3 Decimal Scaling Normalization
3.5 Data Transformation
3.5.1 Linear Transformations
3.5.2 Quadratic Transformations
3.5.3 Non-polynomial Approximations of Transformations
3.5.4 Polynomial Approximations of Transformations
3.5.5 Rank Transformations
3.5.6 Box-Cox Transformations
3.5.7 Spreading the Histogram
3.5.8 Nominal to Binary Transformation
3.5.9 Transformations via Data Reduction
References
4 Dealing with Missing Values
4.1 Introduction
4.2 Assumptions and Missing Data Mechanisms
4.3 Simple Approaches to Missing Data
4.4 Maximum Likelihood Imputation Methods
4.4.1 Expectation-Maximization (EM)
4.4.2 Multiple Imputation
4.4.3 Bayesian Principal Component Analysis (BPCA)
4.5 Imputation of Missing Values. Machine Learning Based Methods
4.5.1 Imputation with K-Nearest Neighbor (KNNI)
4.5.2 Weighted Imputation with K-Nearest Neighbour (WKNNI)
4.5.3 K-means Clustering Imputation (KMI)
4.5.4 Imputation with Fuzzy K-means Clustering (FKMI)
4.5.5 Support Vector Machines Imputation (SVMI)
4.5.6 Event Covering (EC)
4.5.7 Singular Value Decomposition Imputation (SVDI)
4.5.8 Local Least Squares Imputation (LLSI)
4.5.9 Recent Machine Learning Approaches to Missing Values Imputation
4.6 Experimental Comparative Analysis
4.6.1 Effect of the Imputation Methods in the Attributes' Relationships
4.6.2 Best Imputation Methods for Classification Methods
4.6.3 Interesting Comments
References
5 Dealing with Noisy Data
5.1 Identifying Noise
5.2 Types of Noise Data: Class Noise and Attribute Noise
5.2.1 Noise Introduction Mechanisms
5.2.2 Simulating the Noise of Real-World Data Sets
5.3 Noise Filtering at Data Level
5.3.1 Ensemble Filter
5.3.2 Cross-Validated Committees Filter
5.3.3 Iterative-Partitioning Filter
5.3.4 More Filtering Methods
5.4 Robust Learners Against Noise
5.4.1 Multiple Classifier Systems for Classification Tasks
5.4.2 Addressing Multi-class Classification Problems by Decomposition
5.5 Empirical Analysis of Noise Filters and Robust Strategies
5.5.1 Noise Introduction
5.5.2 Noise Filters for Class Noise
5.5.3 Noise Filtering Efficacy Prediction by Data Complexity Measures
5.5.4 Multiple Classifier Systems with Noise
5.5.5 Analysis of the OVO Decomposition with Noise
References
6 Data Reduction
6.1 Overview
6.2 The Curse of Dimensionality
6.2.1 Principal Components Analysis
6.2.2 Factor Analysis
6.2.3 Multidimensional Scaling
6.2.4 Locally Linear Embedding
6.3 Data Sampling
6.3.1 Data Condensation
6.3.2 Data Squashing
6.3.3 Data Clustering
6.4 Binning and Reduction of Cardinality
References
7 Feature Selection
7.1 Overview
7.2 Perspectives
7.2.1 The Search of a Subset of Features
7.2.2 Selection Criteria
7.2.3 Filter, Wrapper and Embedded Feature Selection
7.3 Aspects
7.3.1 Output of Feature Selection
7.3.2 Evaluation
7.3.3 Drawbacks
7.3.4 Using Decision Trees for Feature Selection
7.4 Description of the Most Representative Feature Selection Methods
7.4.1 Exhaustive Methods
7.4.2 Heuristic Methods
7.4.3 Nondeterministic Methods
7.4.4 Feature Weighting Methods
7.5 Related and Advanced Topics
7.5.1 Leading and Recent Feature Selection Techniques
7.5.2 Feature Extraction
7.5.3 Feature Construction
7.6 Experimental Comparative Analyses in Feature Selection
References
8 Instance Selection
8.1 Introduction
8.2 Training Set Selection Versus Prototype Selection
8.3 Prototype Selection Taxonomy
8.3.1 Common Properties in Prototype Selection Methods
8.3.2 Prototype Selection Methods
8.3.3 Taxonomy of Prototype Selection Methods
8.4 Description of Methods
8.4.1 Condensation Algorithms
8.4.2 Edition Algorithms
8.4.3 Hybrid Algorithms
8.5 Related and Advanced Topics
8.5.1 Prototype Generation
8.5.2 Distance Metrics, Feature Weighting and Combinations with Feature Selection
8.5.3 Hybridizations with Other Learning Methods and Ensembles
8.5.4 Scaling-Up Approaches
8.5.5 Data Complexity
8.6 Experimental Comparative Analysis in Prototype Selection
8.6.1 Analysis and Empirical Results on Small Size Data Sets
8.6.2 Analysis and Empirical Results on Medium Size Data Sets
8.6.3 Global View of the Obtained Results
8.6.4 Visualization of Data Subsets: A Case Study Based on the Banana Data Set
References
9 Discretization
9.1 Introduction
9.2 Perspectives and Background
9.2.1 Discretization Process
9.2.2 Related and Advanced Work
9.3 Properties and Taxonomy
9.3.1 Common Properties
9.3.2 Methods and Taxonomy
9.3.3 Description of the Most Representative Discretization Methods
9.4 Experimental Comparative Analysis
9.4.1 Experimental Set up
9.4.2 Analysis and Empirical Results
References
10 A Data Mining Software Package Including Data Preparation and Reduction: KEEL
10.1 Data Mining Softwares and Toolboxes
10.2 KEEL: Knowledge Extraction Based on Evolutionary Learning
10.2.1 Main Features
10.2.2 Data Management
10.2.3 Design of Experiments: Off-Line Module
10.2.4 Computer-Based Education: On-Line Module
10.3 KEEL-Dataset
10.3.1 Data Sets Web Pages
10.3.2 Experimental Study Web Pages
10.4 Integration of New Algorithms into the KEEL Tool
10.4.1 Introduction to the KEEL Codification Features
10.5 KEEL Statistical Tests
10.5.1 Case Study
10.6 Summarizing Comments
References
Index

Publication date (per publisher)	30 August 2014
Series	Intelligent Systems Reference Library
Additional info	XV, 320 p. 41 illus.
Place of publication	Cham
Language	English
Subject areas	Mathematics / Computer Science ► Computer Science ► Databases
	Mathematics / Computer Science ► Computer Science ► Graphics / Design
	Technology
Keywords	Data Mining • Data Preparation • Data preprocessing • Data reduction • discretization • Feature Selection • Instance Selection • machine learning • Missing Values • Noisy Data
ISBN-10	3-319-10247-8 / 3319102478
ISBN-13	978-3-319-10247-4 / 9783319102474

PDF (watermark)
Size: 8.2 MB

DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables and figures. A PDF can be displayed on almost all devices, but is of only limited use on small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read with (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

Print edition

Book | Hardcover

213,99 €