
Data Mining and Knowledge Discovery Handbook (eBook)

Oded Maimon, Lior Rokach (Editors)

eBook Download: PDF
2010 | 2nd edition
XX, 1285 pages
Springer US (publisher)
978-0-387-09823-4 (ISBN)
System requirements
394.83 incl. VAT
  • Download available immediately

This book organizes key concepts, theories, standards, methodologies, trends, challenges and applications of data mining and knowledge discovery in databases. It first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently. It also gives in-depth descriptions of data mining applications in various interdisciplinary industries.



Prof. Oded Maimon is the Oracle Chaired Professor at Tel-Aviv University and was previously at MIT. He is a leading expert in the field of data mining and knowledge discovery. Since 2000 he has published many articles on new algorithms and seven award-winning books in the field. He has also developed and implemented successful applications in industry, and he heads an international research group sponsored by European Union awards.

Dr. Lior Rokach is a senior lecturer in the Department of Information Systems Engineering at Ben-Gurion University. He is a recognized expert in intelligent information systems and has held several leading positions in this field. His main areas of interest are Data Mining, Pattern Recognition, and Recommender Systems. Dr. Rokach is the author of over 70 refereed papers in leading journals, conference proceedings and book chapters. In addition, he has authored six books and edited three others.


Knowledge Discovery demonstrates intelligent computing at its best, and is the most desirable and interesting end-product of Information Technology. To be able to discover and to extract knowledge from data is a task that many researchers and practitioners are endeavoring to accomplish. There is a lot of hidden knowledge waiting to be discovered - this is the challenge created by today's abundance of data. Data Mining and Knowledge Discovery Handbook, 2nd Edition organizes the most current concepts, theories, standards, methodologies, trends, challenges and applications of data mining (DM) and knowledge discovery in databases (KDD) into a coherent and unified repository. This handbook first surveys, then provides comprehensive yet concise algorithmic descriptions of methods, including classic methods plus the extensions and novel methods developed recently. The volume concludes with in-depth descriptions of data mining applications in various interdisciplinary industries including finance, marketing, medicine, biology, engineering, telecommunications, software, and security.

Data Mining and Knowledge Discovery Handbook, 2nd Edition is designed as a reference for research scientists, libraries and advanced-level students in computer science and engineering. This handbook is also suitable for professionals in industry, for computing applications, information systems management, and strategic research management.


Preface 6
Contents 8
List of Contributors 13
1 Introduction to Knowledge Discovery and Data Mining 19
1.1 The KDD Process 20
1.2 Taxonomy of Data Mining Methods 23
1.3 Data Mining within the Complete Decision Support System 25
1.4 KDD and DM Research Opportunities and Challenges 26
1.5 KDD & DM Trends
1.6 The Organization of the Handbook 28
1.7 New to This Edition 29
1.7.1 Mining Rich Data Formats 29
1.7.2 New Techniques 30
1.7.3 New Application Domains 30
1.7.4 New Consideration 31
1.7.5 Software 31
1.7.6 Major Updates 31
References 31
Part I Preprocessing Methods 34
2 Data Cleansing: A Prelude to Knowledge Discovery 35
2.1 INTRODUCTION 35
2.2 DATA CLEANSING BACKGROUND 36
2.3 GENERAL METHODS FOR DATA CLEANSING 39
2.4 APPLYING DATA CLEANSING 40
2.4.1 Statistical Outlier Detection 40
2.4.2 Clustering 41
2.4.3 Pattern-based detection 41
2.4.4 Association Rules 42
2.5 CONCLUSIONS 45
References 45
3 Handling Missing Attribute Values 49
3.1 Introduction 49
3.2 Sequential Methods 51
3.2.1 Deleting Cases with Missing Attribute Values 51
3.2.2 The Most Common Value of an Attribute 51
3.2.3 The Most Common Value of an Attribute Restricted to a Concept 52
3.2.4 Assigning All Possible Attribute Values to a Missing Attribute Value 52
3.2.5 Assigning All Possible Attribute Values Restricted to a Concept 54
3.2.6 Replacing Missing Attribute Values by the Attribute Mean 55
3.2.7 Replacing Missing Attribute Values by the Attribute Mean Restricted to a Concept 56
3.2.8 Global Closest Fit 56
3.2.9 Concept Closest Fit 57
3.2.10 Other Methods 58
3.3 Parallel Methods 59
3.3.1 Blocks of Attribute-Value Pairs and Characteristic Sets 60
3.3.2 Lower and Upper Approximations 62
3.3.3 Rule Induction—MLEM2 63
3.3.4 Other Approaches to Missing Attribute Values 64
3.4 Conclusions 64
References 64
4 Geometric Methods for Feature Extraction and Dimensional Reduction - A Guided Tour 68
Introduction 68
4.1 Projective Methods 70
4.1.1 Principal Component Analysis (PCA) 72
4.1.2 Probabilistic PCA (PPCA) 76
4.1.3 Kernel PCA 77
4.1.4 Oriented PCA and Distortion Discriminant Analysis 80
4.2 Manifold Modeling 81
4.2.1 The Nyström method 81
4.2.2 Multidimensional Scaling 84
4.2.3 Isomap 89
4.2.4 Locally Linear Embedding 89
4.2.5 Graphical Methods 91
4.3 Pulling the Threads Together 93
Acknowledgments 95
References 95
5 Dimension Reduction and Feature Selection 98
5.1 Introduction 98
5.2 Feature Selection Techniques 101
5.2.1 Feature Filters 101
5.2.2 Feature Wrappers 106
5.3 Variable Selection 110
5.3.1 Mallows Cp (Mallows, 1973) 110
5.3.2 AIC, BIC and F ratio 111
5.3.3 Principal Component Analysis (PCA) 111
5.3.4 Factor Analysis (FA) 112
5.3.5 Projection Pursuit 112
5.3.6 Advanced Methods for Variable Selection 112
References 112
6 Discretization Methods 116
Introduction 116
6.1 Terminology 117
6.1.1 Qualitative vs. quantitative 117
6.1.2 Levels of measurement scales 117
6.1.3 Summary 118
6.2 Taxonomy 119
6.3 Typical methods 120
6.3.1 Background and terminology 121
6.3.2 Equal-width, equal-frequency and fixed-frequency discretization 121
6.3.3 Multi-interval-entropy-minimization discretization (MIEMD) 122
6.3.4 ChiMerge, StatDisc and InfoMerge discretization 122
6.3.5 Cluster-based discretization 123
6.3.6 ID3 discretization 123
6.3.7 Non-disjoint discretization 123
6.3.8 Lazy discretization 124
6.3.9 Dynamic-qualitative discretization 125
6.3.10 Ordinal discretization 125
6.3.11 Fuzzy discretization 125
6.3.12 Iterative-improvement discretization 126
6.3.13 Summary 126
6.4 Discretization and the learning context 126
6.4.1 Discretization for decision tree learning 127
6.4.2 Discretization for naive-Bayes learning 127
6.5 Summary 128
References 129
7 Outlier Detection 132
7.1 Introduction: Motivation, Definitions and Applications 132
7.2 Taxonomy of Outlier Detection Methods 133
7.3 Univariate Statistical Methods 134
7.3.1 Single-step vs. Sequential Procedures 134
7.3.2 Inward and Outward Procedures 135
7.3.3 Univariate Robust Measures 135
7.3.4 Statistical Process Control (SPC) 136
7.4 Multivariate Outlier Detection 137
7.4.1 Statistical Methods for Multivariate Outlier Detection 138
7.4.2 Multivariate Robust Measures 139
7.4.3 Data-Mining Methods for Outlier Detection 139
7.4.4 Preprocessing Procedures 141
7.5 Comparison of Outlier Detection Methods 141
References 142
Part II Supervised Methods 146
8 Supervised Learning 147
8.1 Introduction 147
8.2 Training Set 148
8.3 Definition of the Classification Problem 148
8.4 Induction Algorithms 149
8.5 Performance Evaluation 150
8.5.1 Generalization Error 150
8.5.2 Theoretical Estimation of Generalization Error 151
8.5.3 Empirical Estimation of Generalization Error 153
8.5.4 Computational Complexity 154
8.5.5 Comprehensibility 154
8.6 Scalability to Large Datasets 155
8.7 The “Curse of Dimensionality” 156
8.8 Classification Problem Extensions 158
References 159
9 Classification Trees 162
9.1 Decision Trees 162
9.2 Algorithmic Framework for Decision Trees 164
9.3 Univariate Splitting Criteria 164
9.3.1 Overview 164
9.3.2 Impurity-based Criteria 166
9.3.3 Information Gain 166
9.3.4 Gini Index 166
9.3.5 Likelihood-Ratio Chi–Squared Statistics 167
9.3.6 DKM Criterion 167
9.3.7 Normalized Impurity Based Criteria 167
9.3.8 Gain Ratio 168
9.3.9 Distance Measure 168
9.3.10 Binary Criteria 168
9.3.11 Twoing Criterion 168
9.3.12 Orthogonal (ORT) Criterion 169
9.3.13 Kolmogorov–Smirnov Criterion 169
9.3.14 AUC–Splitting Criteria 169
9.3.15 Other Univariate Splitting Criteria 169
9.3.16 Comparison of Univariate Splitting Criteria 170
9.4 Multivariate Splitting Criteria 170
9.5 Stopping Criteria 170
9.6 Pruning Methods 171
9.6.1 Overview 171
9.6.2 Cost–Complexity Pruning 171
9.6.3 Reduced Error Pruning 172
9.6.4 Minimum Error Pruning (MEP) 172
9.6.5 Pessimistic Pruning 172
9.6.6 Error–based Pruning (EBP) 173
9.6.7 Optimal Pruning 173
9.6.8 Minimum Description Length (MDL) Pruning 174
9.6.9 Other Pruning Methods 174
9.6.10 Comparison of Pruning Methods 174
9.7 Other Issues 175
9.7.1 Weighting Instances 175
9.7.2 Misclassification costs 175
9.7.3 Handling Missing Values 175
9.8 Decision Trees Inducers 176
9.8.1 ID3 176
9.8.2 C4.5 176
9.8.3 CART 177
9.8.4 CHAID 177
9.8.5 QUEST 178
9.8.6 Reference to Other Algorithms 178
9.9 Advantages and Disadvantages of Decision Trees 178
9.10 Decision Tree Extensions 180
9.10.1 Oblivious Decision Trees 180
9.10.2 Fuzzy Decision Trees 181
9.10.3 Decision Trees Inducers for Large Datasets 182
9.10.4 Incremental Induction 182
References 183
10 Bayesian Networks 188
10.1 Introduction 188
10.2 Representation 189
10.3 Reasoning 192
10.4 Learning 194
10.4.1 Scoring Metrics 194
10.4.2 Model Search 201
10.4.3 Validation 202
10.5 Bayesian Networks in Data Mining 204
10.5.1 Bayesian Networks and Classification 204
10.5.2 Generalized Gamma Networks 206
10.5.3 Bayesian Networks and Dynamic Data 208
10.6 Data Mining Applications 211
10.6.1 Survey Data 211
10.6.2 Customer Profiling 214
10.7 Conclusions and Future Research Directions 216
Acknowledgments 218
References 218
11 Data Mining within a Regression Framework 222
11.1 Introduction 222
11.2 Some Definitions 223
11.3 Regression Splines 224
11.4 Smoothing Splines 227
11.5 Locally Weighted Regression as a Smoother 229
11.6 Smoothers for Multiple Predictors 230
11.6.1 The Generalized Additive Model 231
11.7 Recursive Partitioning 233
11.7.1 Classification and Regression Trees and Extensions 233
11.7.2 Overfitting and Ensemble Methods 239
11.8 Conclusions 242
Acknowledgments 242
References 242
12 Support Vector Machines 244
12.1 Introduction 244
12.2 Hyperplane Classifiers 245
12.2.1 The Linear Classifier 246
12.2.2 The Kernel Trick 248
12.2.3 The Optimal Margin Support Vector Machine 249
12.3 Non-Separable SVM Models 250
12.3.1 Soft Margin Support Vector Classifiers 250
12.3.2 Support Vector Regression 252
12.3.3 SVM-like Models 254
12.4 Implementation Issues with SVM 254
12.4.1 Optimization Techniques 255
12.4.2 Model Selection 256
12.4.3 Multi-Class SVM 256
12.5 Extensions and Application 257
12.6 Conclusion 258
References 258
13 Rule Induction 261
13.1 Introduction 261
13.2 Types of Rules 263
13.3 Rule Induction Algorithms 265
13.3.1 LEM1 Algorithm 265
13.3.2 LEM2 269
13.3.3 AQ 272
13.4 Classification Systems 274
13.5 Validation 275
13.6 Advanced Methodology 276
References 276
Part III Unsupervised Methods 278
14 A Survey of Clustering Algorithms 279
14.1 Introduction 279
14.2 Distance Measures 280
14.2.1 Minkowski: Distance Measures for Numeric Attributes 280
14.2.2 Distance Measures for Binary Attributes 281
14.2.3 Distance Measures for Nominal Attributes 281
14.2.4 Distance Metrics for Ordinal Attributes 281
14.2.5 Distance Metrics for Mixed-Type Attributes 282
14.3 Similarity Functions 282
14.3.1 Cosine Measure 282
14.3.2 Pearson Correlation Measure 283
14.3.3 Extended Jaccard Measure 283
14.3.4 Dice Coefficient Measure 283
14.4 Evaluation Criteria Measures 283
14.4.1 Internal Quality Criteria 283
14.4.2 External Quality Criteria 287
14.5 Clustering Methods 288
14.5.1 Hierarchical Methods 288
14.5.2 Partitioning Methods 290
14.5.3 Density-based Methods 292
14.5.4 Model-based Clustering Methods 293
14.5.5 Grid-based Methods 294
14.5.6 Soft-computing Methods 294
14.5.7 Which Technique To Use? 298
14.6 Clustering Large Data Sets 299
14.6.1 Decomposition Approach 300
14.6.2 Incremental Clustering 300
14.6.3 Parallel Implementation 302
14.7 Determining the Number of Clusters 302
14.7.1 Methods Based on Intra-Cluster Scatter 302
14.7.2 Methods Based on both the Inter- and Intra-Cluster Scatter 303
14.7.3 Criteria Based on Probabilistic 305
References 305
15 Association Rules 309
15.1 Introduction 309
15.1.1 Formal Problem Definition 310
15.2 Association Rule Mining 311
15.2.1 Association Mining Phase 312
15.2.2 Rule Generation Phase 315
15.3 Application to Other Types of Data 317
15.4 Extensions of the Basic Framework 318
15.4.1 Some other Rule Evaluation Measures 319
15.4.2 Interactive or Knowledge-Based Filtering 320
15.4.3 Compressed Representations 321
15.4.4 Additional Constraints for Dense Databases 322
15.4.5 Rules without Minimum Support 324
15.5 Conclusions 326
References 327
16 Frequent Set Mining 330
Introduction 330
16.1 Problem Description 330
16.2 Apriori 333
16.3 Eclat 336
16.4 Optimizations 337
16.4.1 Item reordering 338
16.4.2 Partition 338
16.4.3 Sampling 339
16.4.4 FP-tree 339
16.5 Concise representations 340
16.5.1 Maximal Frequent Sets 340
16.5.2 Closed Frequent Sets 341
16.5.3 Non Derivable Frequent Sets 342
16.6 Theoretical Aspects 342
16.7 Further Reading 343
References 344
17 Constraint-based Data Mining 348
17.1 Motivations 348
17.2 Background and Notations 350
17.3 Solving Anti-Monotonic Constraints 353
17.4 Introducing non Anti-Monotonic Constraints 354
17.4.1 The Seminal Work 355
17.4.2 Generic Algorithms 357
17.4.3 Ad-hoc Strategies 359
17.4.4 Other Directions of Research 359
17.5 Conclusion 360
References 361
18 Link Analysis 364
18.1 Introduction 364
18.2 Social Network Analysis 366
18.3 Search Engines 368
18.4 Viral Marketing 370
18.5 Law Enforcement & Fraud Detection
18.6 Combining with Traditional Methods 374
18.7 Summary 375
References 375
Part IV Soft Computing Methods 378
19 A Review of Evolutionary Algorithms for Data Mining 379
19.1 Introduction 379
19.2 An Overview of Evolutionary Algorithms 380
19.3 Evolutionary Algorithms for Discovering Classification Rules 382
19.3.1 Individual Representation for Classification-Rule Discovery 382
19.3.2 Searching for a Diverse Set of Rules 385
19.3.3 Fitness Evaluation 386
19.4 Evolutionary Algorithms for Clustering 389
19.4.1 Individual Representation for Clustering 389
19.4.2 Fitness Evaluation for Clustering 391
19.5 Evolutionary Algorithms for Data Preprocessing 392
19.5.1 Genetic Algorithms for Attribute Selection 392
19.5.2 Genetic Programming for Attribute Construction 394
19.6 Multi-Objective Optimization with Evolutionary Algorithms 397
19.7 Conclusions 401
References 403
20 A Review of Reinforcement Learning Methods 409
20.1 Introduction 409
20.2 The Reinforcement-Learning Model 410
20.3 Reinforcement-Learning Algorithms 412
20.3.1 Dynamic-Programming 412
20.3.2 Generalization of Dynamic-Programming to Reinforcement-Learning 413
20.4 Extensions to Basic Model and Algorithms 416
20.4.1 Multi-Agent RL 416
20.4.2 Tackling Large Sets of States and Actions 417
20.5 Applications of Reinforcement-Learning 417
20.6 Reinforcement-Learning and Data-Mining 418
20.7 An Instructive Example 419
References 422
21 Neural Networks For Data Mining 426
21.1 Introduction 426
21.2 A Brief History 427
21.3 Neural Network Models 429
21.3.1 Feedforward Neural Networks 429
21.3.2 Hopfield Neural Networks 438
21.3.3 Kohonen’s Self-organizing Maps 440
21.4 Data Mining Applications 443
21.5 Conclusions 445
References 446
22 Granular Computing and Rough Sets - An Incremental Development 452
22.1 Introduction 452
22.2 Naive Model for Problem Solving 453
22.2.1 Information Granulations/Partitions 453
22.2.2 Knowledge Level Processing and Computing with Words 454
22.2.3 Information Integration and Approximation Theory 454
22.3 A Geometric Model of Information Granulations 455
22.4 Information Granulations/Partitions 456
22.4.1 Equivalence Relations (Partitions) 456
22.4.2 Binary Relation (Granulation) - Topological Partitions 457
22.4.3 Fuzzy Binary Granulations (Fuzzy Binary Relations) 457
22.5 Non-partition Application - Chinese Wall Security Policy Model 457
22.5.1 Simple Chinese Wall Security Policy 458
22.6 Knowledge Representations 459
22.6.1 Relational Tables and Partitions 459
22.6.2 Table Representations of Binary Relations 460
22.6.3 New representations of topological relations 463
22.7 Topological Concept Hierarchy Lattices/Trees 464
22.7.1 Granular Lattice 464
22.7.2 Granulated/Quotient Sets 466
22.7.3 Tree of centers 466
22.7.4 Topological tree 467
22.7.5 Table Representation of Fuzzy Binary Relations 468
22.8 Knowledge Processing 469
22.8.1 The Notion of Knowledge 469
22.8.2 Strong, Weak and Knowledge Dependence 470
22.8.3 Knowledge Views of Binary Granulations 470
22.9 Information Integration 471
22.9.1 Extensions 471
22.9.2 Approximations in Rough Set Theory (RST) 471
22.9.3 Binary Neighborhood System Spaces 472
22.10 Conclusions 473
References 473
23 Pattern Clustering Using a Swarm Intelligence Approach 476
23.1 Introduction 476
23.2 An Introduction to Swarm Intelligence 478
23.2.1 The Ant Colony Systems 480
23.3 Data Clustering – An Overview 485
23.3.1 Problem Definition 485
23.3.2 The Classical Clustering Algorithms 486
23.3.3 Relevance of SI Algorithms in Clustering 488
23.4 Clustering with the SI Algorithms 488
23.4.1 The Ant Colony Based Clustering Algorithms 488
23.4.2 The PSO-based Clustering Algorithms 490
23.5 Automatic Kernel-based Clustering with PSO 492
23.5.1 The Kernel Based Similarity Measure 493
23.5.2 Reformulation of CS Measure 494
23.5.3 The Multi-Elitist PSO (MEPSO) Algorithm 495
23.5.4 Particle Representation 496
23.5.5 The Fitness Function 497
23.5.6 Avoiding Erroneous particles with Empty Clusters or Unreasonable Fitness Evaluation 497
23.5.7 Putting It All Together 498
23.5.8 Experimental Results 498
23.6 Conclusion and Future Directions 503
References 507
24 Using Fuzzy Logic in Data Mining 512
24.1 Introduction 512
24.2 Basic Concepts of Fuzzy Set Theory 512
24.2.1 Membership function 513
24.2.2 Fuzzy Set Operations 515
24.3 Fuzzy Supervised Learning 516
24.3.1 Growing Fuzzy Decision Tree 516
24.3.2 Soft Regression 521
24.3.3 Neuro-fuzzy 521
24.4 Fuzzy Clustering 521
24.5 Fuzzy Association Rules 523
24.6 Conclusion 525
References 525
Part V Supporting Methods 528
25 Statistical Methods for Data Mining 529
25.1 Introduction 529
25.2 Statistical Issues in DM 530
25.2.1 Size of the Data and Statistical Theory 530
25.2.2 The Curse of Dimensionality and Approaches to Address It 531
25.2.3 Assessing Uncertainty 532
25.2.4 Automated Analysis 532
25.2.5 Algorithms for Data Analysis in Statistics 533
25.2.6 Visualization 533
25.2.7 Scalability 534
25.2.8 Sampling 534
25.3 Modeling Relationships using Regression Models 535
25.3.1 Linear Regression Analysis 535
25.3.2 Generalized Linear Models 536
25.3.3 Logistic Regression 537
25.3.4 Survival Analysis 538
25.4 False Discovery Rate (FDR) Control in Hypotheses Testing 539
25.5 Model (Variables or Features) Selection using FDR Penalization in GLM 542
25.6 Concluding Remarks 543
References 544
26 Logics for Data Mining 547
Introduction 547
26.1 Generalized quantifiers 548
26.2 Some important classes of quantifiers 550
26.2.1 One-dimensional 550
26.2.2 Two-dimensional 551
26.3 Some comments and conclusion 554
Acknowledgments 555
References 555
27 Wavelet Methods in Data Mining 558
27.1 Introduction 558
27.2 A Framework for Data Mining Process 559
27.3 Wavelet Background 559
27.3.1 Basics of Wavelet in L2(R) 559
27.3.2 Dilation Equation 560
27.3.3 Multiresolution Analysis (MRA) and Fast DWT Algorithm 561
27.3.4 Illustrations of Haar Wavelet Transform 562
27.3.5 Properties of Wavelets 563
27.4 Data Management 564
27.5 Preprocessing 564
27.5.1 Denoising 565
27.5.2 Data Transformation 566
27.5.3 Dimensionality Reduction 566
27.6 Core Mining Process 567
27.6.1 Clustering 567
27.6.2 Classification 568
27.6.3 Regression 568
27.6.4 Distributed Data Mining 569
27.6.5 Similarity Search/Indexing 570
27.6.6 Approximate Query Processing 571
27.6.7 Traffic Modeling 572
27.7 Conclusion 573
References 574
28 Fractal Mining - Self Similarity-based Clustering and its Applications 577
28.1 Introduction 577
28.2 Fractal Dimension 578
28.3 Clustering Using the Fractal Dimension 582
28.3.1 FC Initialization Step 582
28.3.2 Incremental Step 582
28.3.3 Reshaping Clusters in Mid-Flight 584
28.3.4 Complexity of the Algorithm 585
28.3.5 Confidence Bounds 585
28.3.6 Memory Management 586
28.3.7 Experimental Results 587
28.4 Projected Fractal Clustering 589
28.5 Tracking Clusters 590
28.5.1 Experiment on a Real Dataset 590
28.6 Conclusions 591
References 592
29 Visual Analysis of Sequences Using Fractal Geometry 594
29.1 Introduction 594
29.2 Iterated Function System (IFS) 595
29.3 Algorithmic Framework 597
29.3.1 Overview 597
29.3.2 Sequence Representation 598
29.3.3 Sequence Transformation 598
29.3.4 Sequence Pattern Detection 599
29.3.5 Sequence Pattern Detection Algorithm Description: 599
29.3.6 Classifiers Selection 601
29.4 Fault Sequence Detection Application 602
29.5 Conclusions and Future Research 602
References 603
30 Interestingness Measures - On Determining What Is Interesting 605
Introduction 605
30.1 Definitions and Notations 606
30.2 Subjective Interestingness 606
30.2.1 The Expert-Driven Grammatical Approach 607
30.2.2 The Rule-By-Rule Classification Approach 607
30.2.3 Interestingness Via What Is Not Interesting Approach 607
30.3 Objective Interestingness 608
30.3.1 Ranking Patterns 608
30.3.2 Pruning and Application of Constraints 608
30.3.3 Summarization of Patterns 609
30.4 Impartial Interestingness 610
30.5 Concluding Remarks 611
References 611
31 Quality Assessment Approaches in Data Mining 615
Introduction 615
31.1 Data Pre-processing and Quality Assessment 617
31.2 Evaluation of Classification Methods 617
31.2.1 Classification Model Accuracy 617
31.2.2 Evaluating the Accuracy of Classification Algorithms 619
31.2.3 Interestingness Measures of Classification Rules 622
31.3 Association Rules 622
31.3.1 Association Rules Interestingness Measures 623
31.3.2 Other approaches for evaluating association rules 625
31.4 Cluster Validity 626
31.4.1 Fundamental Concepts of Cluster Validity 626
31.4.2 External Criteria 628
31.4.3 Internal Criteria 630
31.4.4 Relative Criteria 631
31.4.5 Fuzzy Clustering 637
31.4.6 Other Approaches for Cluster Validity 639
References 640
32 Data Mining Model Comparison 642
32.1 Data Mining and Statistics 642
32.2 Data Mining Model Comparison 643
32.3 Application to Credit Risk Management 647
32.4 Conclusions 654
References 655
33 Data Mining Query Languages 656
33.1 The Need for Data Mining Query Languages 656
33.2 Supporting Association Rule Mining Processes 657
33.3 A Few Proposals for Association Rule Mining 659
33.3.1 MSQL 659
33.3.2 MINE RULE 659
33.3.3 DMQL 660
33.3.4 OLE DB for DM 661
33.3.5 A Critical Evaluation 662
33.4 Conclusion 663
References 664
Part VI Advanced Methods 666
34 Mining Multi-label Data 667
34.1 Introduction 667
34.2 Learning 667
34.2.1 Problem Transformation 669
34.2.2 Algorithm Adaptation 672
34.3 Dimensionality Reduction 673
34.3.1 Feature Selection 674
34.3.2 Feature Extraction 674
34.4 Exploiting Label Structure 674
34.5 Scaling Up 675
34.6 Statistics and Datasets 676
34.7 Evaluation Measures 677
34.7.1 Bipartitions 677
34.7.2 Ranking 679
34.7.3 Hierarchical 680
34.8 Related Tasks 680
34.9 Multi-Label Data Mining Software 681
References 681
35 Privacy in Data Mining 686
35.1 Introduction 686
35.2 On the Classification of Protection Procedures 687
35.2.1 Computation-Driven Protection Procedures: the Cryptographic Approach 689
35.2.2 Data-driven Protection Procedures 690
35.3 Disclosure Risk Measures 690
35.3.1 A Scenario for Identity Disclosure 691
35.3.2 Measures for Identity Disclosure 692
35.4 Data Protection Procedures 696
35.4.1 Perturbative Methods 697
35.4.2 Non-perturbative Methods 702
35.4.3 Synthetic Data Generators 703
35.4.4 k-Anonymity 704
35.5 Information Loss Measures 705
35.5.1 Generic Information Loss Measures 705
35.5.2 Specific Information Loss Measures 707
35.6 Trade-off and Visualization 707
35.6.1 The Score 707
35.6.2 R-U Maps 709
35.7 Conclusions 709
Acknowledgements 709
References 709
36 Meta-Learning - Concepts and Techniques 716
36.1 Introduction 716
36.2 A Meta-Learning Architecture 717
36.2.1 Knowledge-Acquisition Mode 718
36.2.2 Advisory Mode 718
36.3 Techniques in Meta-Learning 720
36.3.1 Dataset Characterization 720
36.3.2 Mapping Datasets to Predictive Models 722
36.3.3 Learning from Base-Learners 723
36.3.4 Inductive Transfer and Learning to Learn 724
36.3.5 Dynamic-Bias Selection 725
36.4 Tools and Applications 725
36.4.1 METAL DM Assistant 725
36.5 Future Directions and Conclusions 726
References 727
37 Bias vs Variance Decomposition For Regression and Classification 731
37.1 Introduction 731
37.2 Bias/Variance Decompositions 733
37.2.1 Bias/Variance Decomposition of the Squared Loss 733
37.2.2 Bias/variance decompositions of the 0-1 loss 735
37.3 Estimation of Bias and Variance 738
37.4 Experiments and Applications 740
37.4.1 Bias/variance tradeoff 740
37.4.2 Comparison of some learning algorithms 741
37.4.3 Ensemble methods: bagging 742
37.5 Discussion 743
References 743
38 Mining with Rare Cases 745
38.1 Introduction 745
38.2 Why Rare Cases are Problematic 747
38.3 Techniques for Handling Rare Cases 749
38.3.1 Obtain Additional Training Data 749
38.3.2 Use a More Appropriate Inductive Bias 750
38.3.3 Using More Appropriate Metrics 751
38.3.4 Employ Non-Greedy Search Techniques 751
38.3.5 Utilize Knowledge/Human Interaction 752
38.3.6 Employ Boosting 752
38.3.7 Place Rare Cases Into Separate Classes 753
38.4 Conclusion 753
References 754
39 Data Stream Mining 756
39.1 Introduction 756
39.2 Clustering Techniques 759
39.3 Classification Techniques 761
39.4 Frequent Pattern Mining Techniques 769
39.5 Time Series Analysis 770
39.6 Systems and Applications 771
39.7 Taxonomy of Data Stream Mining Approaches 773
39.7.1 Data-based Techniques 773
39.7.2 Task-based Techniques 775
39.8 Related Work 777
39.9 Future Directions 779
39.10 Summary 779
References 780
40 Mining Concept-Drifting Data Streams 785
40.1 Introduction 785
40.2 The Data Expiration Problem 787
40.3 Classifier Ensemble for Drifting Concepts 788
40.3.1 Accuracy-Weighted Ensembles 789
40.4 Experiments 791
40.4.1 Algorithms used in Comparison 791
40.4.2 Streaming Data 791
40.4.3 Experimental Results 792
40.5 Discussion and Related Work 796
References 797
41 Mining High-Dimensional Data 799
41.1 Introduction 799
41.2 Challenges 800
41.3 Frequent Patterns 800
41.4 Clustering 801
41.5 Classification 802
References 803
42 Text Mining and Information Extraction 805
42.1 Introduction 805
42.2 Text Mining vs. Text Retrieval 807
42.3 Task-Oriented Approaches vs. Formal Frameworks 808
42.4 Task-Oriented Approaches 808
42.4.1 Problem Dependent Task - Information Extraction in Text Mining 810
42.5 Formal Frameworks And Algorithm-Based Techniques 812
42.5.1 Text Categorization 812
42.5.2 Probabilistic models for Information Extraction 815
42.6 Hybrid Approaches - TEG 817
42.7 Text Mining – Visualization and Analytics 818
42.7.1 Clear Research 818
42.7.2 Other Visualization and Analytical Approaches 822
References 823
43 Spatial Data Mining 832
43.1 Introduction 832
43.2 Spatial Data 833
43.3 Spatial Outliers 836
43.4 Spatial Co-location Rules 840
43.5 Predictive Models 843
43.6 Spatial Clusters 846
43.7 Summary 847
Acknowledgments 847
References 848
44 Spatio-temporal clustering 850
44.1 Introduction 850
44.2 Spatio-temporal clustering 851
44.2.1 A classification of spatio-temporal data types 851
44.2.2 Clustering Methods for Trajectory Data 854
44.3 Applications 861
44.3.1 Movement data 861
44.3.2 Cellular networks 863
44.3.3 Environmental data 864
44.4 Open Issues 865
44.5 Conclusions 866
References 866
45 Data Mining for Imbalanced Datasets: An Overview 870
45.1 Introduction 870
45.2 Performance Measure 871
45.2.1 ROC Curves 872
45.2.2 Precision and Recall 873
45.2.3 Cost-sensitive Measures 874
45.3 Sampling Strategies 874
45.3.1 Synthetic Minority Oversampling TEchnique: SMOTE 875
45.4 Ensemble-based Methods 876
45.4.1 SMOTEBoost 877
45.5 Discussion 877
Acknowledgements 878
References 878
46 Relational Data Mining 882
46.1 In a Nutshell 882
46.1.1 Relational Data 882
46.1.2 Relational Patterns 883
46.1.3 Relational to propositional 884
46.1.4 Algorithms for relational Data Mining 884
46.1.5 Applications of relational Data Mining 885
46.1.6 What’s in this chapter 886
46.2 Inductive logic programming 886
46.2.1 Logic programs and databases 886
46.2.2 The ILP task of relational rule induction 887
46.2.3 Structuring the space of clauses 889
46.2.4 Searching the space of clauses 890
46.2.5 Transforming ILP problems to propositional form 892
46.2.6 Upgrading propositional approaches 894
46.3 Relational Association Rules 894
46.3.1 Frequent Datalog queries and query extensions 894
46.3.2 Discovering frequent queries: WARMR 896
46.4 Relational Decision Trees 898
46.4.1 Relational Classification, Regression, and Model Trees 899
46.4.2 Induction of Relational Decision Trees 902
46.5 RDM Literature and Internet Resources 903
References 904
47 Web Mining 907
47.1 Introduction 907
47.2 Graph Properties of the Web 908
47.3 Web Search 909
47.4 Text Classification 911
47.5 Hypertext Classification 911
47.6 Information Extraction and Wrapper Induction 913
47.7 The Semantic Web 914
47.8 Web Usage Mining 915
47.9 Collaborative Filtering 915
47.10 Conclusion 916
References 916
48 A Review of Web Document Clustering Approaches 924
48.1 Introduction 924
48.2 Motivation for Document Clustering 925
48.3 Web Document Clustering Approaches 926
48.3.1 Text-based Clustering 927
48.3.2 Link-based Clustering 932
48.3.3 Hybrid Approaches 934
48.4 Comparison 935
48.5 Conclusions and Open Issues 936
References 936
49 Causal Discovery 942
49.1 Introduction 942
49.2 Background Knowledge 943
49.3 Theoretical Foundation 945
49.4 Learning a DAG of CN by FDs 946
49.4.1 Learning an Ordering of Variables from FDs 946
49.4.2 Learning the Markov Boundaries of Undecided Variables 947
49.5 Experimental Results 948
49.6 Conclusion 950
References 950
50 Ensemble Methods in Supervised Learning 952
50.1 Introduction 952
50.2 Sequential Methodology 953
50.2.1 Model-guided Instance Selection 953
50.2.2 Incremental Batch Learning 958
50.3 Concurrent Methodology 958
50.4 Combining Classifiers 959
50.4.1 Simple Combining Methods 959
50.4.2 Meta-combining Methods 962
50.5 Ensemble Diversity 965
50.5.1 Manipulating the Inducer 965
50.5.2 Manipulating the Training Set 966
50.5.3 Measuring the Diversity 966
50.6 Ensemble Size 967
50.6.1 Selecting the Ensemble Size 967
50.6.2 Pruning Ensembles 967
50.7 Cluster Ensemble 968
References 968
51 Data Mining using Decomposition Methods 973
51.1 Introduction 973
51.2 Decomposition Advantages 975
51.2.1 Increasing Classification Performance (Classification Accuracy) 975
51.2.2 Scalability to Large Databases 976
51.2.3 Increasing Comprehensibility 976
51.2.4 Modularity 976
51.2.5 Suitability for Parallel Computation 976
51.2.6 Flexibility in Techniques Selection 976
51.3 The Elementary Decomposition Methodology 977
51.4 The Decomposer’s Characteristics 981
51.4.1 Overview 981
51.4.2 The Structure Acquiring Method 981
51.4.3 The Mutually Exclusive Property 982
51.4.4 The Inducer Usage 983
51.4.5 Exhaustiveness 983
51.4.6 Combiner Usage 984
51.4.7 Sequentially or Concurrently 984
51.5 The Relation to Other Methodologies 985
51.6 Summary 986
References 986
52 Information Fusion - Methods and Aggregation Operators 991
52.1 Introduction 991
52.2 Preprocessing Data 992
52.2.1 Re-identification Algorithms 992
52.2.2 Fusion to Improve the Quality of Data 993
52.3 Building Data Models 994
52.3.1 Data Models Using Aggregation Operators 995
52.3.2 Aggregation Operators to Fuse Data Models 996
52.4 Information Extraction 996
52.4.1 Summarization 996
52.4.2 Knowledge from Aggregation Operators 997
52.5 Conclusions 997
References 998
53 Parallel And Grid-Based Data Mining – Algorithms, Models and Systems for High-Performance KDD 1001
53.1 Introduction 1001
53.2 Parallel Data Mining 1003
53.2.1 Parallelism in Data Mining Techniques 1003
53.2.2 Architectural and Research Issues 1008
53.3 Grid-Based Data Mining 1009
53.3.1 Grid-Based Data Mining Systems 1009
53.4 The Knowledge Grid 1013
53.4.1 Knowledge Grid Components and Tools 1015
53.5 Summary 1018
References 1018
54 Collaborative Data Mining 1021
54.1 Introduction 1021
54.2 Remote Collaboration 1022
54.2.1 E-Collaboration: Motivations and Forms 1022
54.2.2 E-Collaboration Space 1023
54.2.3 Collaborative Data Mining in E-Collaboration Space 1023
54.3 The Data Mining Process 1024
54.4 Collaborative Data Mining Guidelines 1025
54.4.1 Collaboration Principles 1025
54.4.2 Data Mining model evaluation and combination 1026
54.5 Discussion 1028
54.6 Conclusions 1029
References 1029
55 Organizational Data Mining 1032
55.1 Introduction 1032
55.2 Organizational Data Mining 1033
55.3 ODM versus Data Mining 1034
55.3.1 Organizational Theory and ODM 1035
55.4 Ongoing ODM Research 1035
55.5 ODM Advantages 1036
55.6 ODM Evolution 1037
55.6.1 Past 1037
55.6.2 Present 1037
55.6.3 Future 1037
55.7 Summary 1039
References 1039
56 Mining Time Series Data 1040
56.1 Introduction 1040
56.2 Time Series Similarity Measures 1041
56.2.1 Euclidean Distances and Lp Norms 1041
56.2.2 Dynamic Time Warping 1042
56.2.3 Longest Common Subsequence Similarity 1043
56.2.4 Probabilistic methods 1045
56.2.5 General Transformations 1046
56.3 Time Series Data Mining 1046
56.3.1 Classification 1047
56.3.2 Indexing (Query by Content) 1047
56.3.3 Clustering 1050
56.3.4 Prediction (Forecasting) 1051
56.3.5 Summarization 1051
56.3.6 Anomaly Detection 1054
56.3.7 Segmentation 1055
56.4 Time Series Representations 1056
56.4.1 Discrete Fourier Transform 1057
56.4.2 Discrete Wavelet Transform 1058
56.4.3 Singular Value Decomposition 1059
56.4.4 Piecewise Linear Approximation 1059
56.4.5 Piecewise Aggregate Approximation 1060
56.4.6 Adaptive Piecewise Constant Approximation 1060
56.4.7 Symbolic Aggregate Approximation (SAX) 1062
56.5 Summary 1064
References 1064
Part VII Applications 1069
57 Multimedia Data Mining 1070
57.1 Introduction 1070
57.2 A Typical Architecture of a Multimedia Data Mining System 1074
57.3 An Example - Concept Discovery in Imagery Data 1074
57.3.1 Background and Related Work 1075
57.3.2 Region Based Image Representation 1077
57.3.3 Probabilistic Hidden Semantic Model 1083
57.3.4 Posterior Probability Based Image Mining and Retrieval 1086
57.3.5 Approach Analysis 1088
57.3.6 Experimental Results 1089
57.4 Summary 1093
Acknowledgments 1094
References 1094
58 Data Mining in Medicine 1099
58.1 Introduction 1099
58.2 Symbolic Classification Methods 1101
58.2.1 Rule Induction 1101
58.2.2 Learning of Classification and Regression Trees 1105
58.2.3 Inductive Logic Programming 1107
58.2.4 Discovery of Concept Hierarchies and Constructive Induction 1107
58.2.5 Case-Based Reasoning 1109
58.3 Subsymbolic Classification Methods 1110
58.3.1 Instance-Based Learning 1110
58.3.2 Neural Networks 1111
58.3.3 Bayesian Classifier 1113
58.4 Other Methods Supporting Medical Knowledge Discovery 1114
58.5 Conclusions 1116
Acknowledgments 1117
References 1117
59 Learning Information Patterns in Biological Databases - Stochastic Data Mining 1125
59.1 Background 1125
59.2 Learning Stochastic Pattern Models 1127
59.2.1 Assimilating the Pattern Sets 1127
59.2.2 Clustering Biological Patterns 1129
59.2.3 Learning Cluster Models 1131
59.3 Searching for Meta-Patterns 1132
59.3.1 Level I Search: Locating High Pattern Density Region 1134
59.3.2 Level II Search: Meta-Pattern Hypotheses 1137
59.4 Conclusions 1138
References 1139
60 Data Mining for Financial Applications 1141
60.1 Introduction: Financial Tasks 1141
60.2 Specifics of Data Mining in Finance 1143
60.2.1 Time series analysis 1144
60.2.2 Data selection and forecast horizon 1144
60.2.3 Measures of success 1145
60.2.4 QUALITY OF PATTERNS AND HYPOTHESIS EVALUATION 1145
60.3 Aspects of Data Mining Methodology in Finance 1146
60.3.1 Attribute-based and relational methodologies 1147
60.3.2 Attribute-based relational methodologies 1147
60.3.3 Problem ID and method profile 1148
60.3.4 Relational Data Mining in finance 1148
60.4 Data Mining Models and Practice in Finance 1149
60.4.1 Portfolio management and neural networks 1149
60.4.2 Interpretable trading rules and relational Data Mining 1150
60.4.3 Discovering money laundering and attribute-based relational Data Mining 1151
60.5 Conclusion 1153
References 1154
61 Data Mining for Intrusion Detection 1158
61.1 Introduction 1158
61.2 Data Mining Basics 1159
61.3 Data Mining Meets Intrusion Detection 1161
61.3.1 ADAM 1162
61.3.2 MADAM ID 1164
61.3.3 MINDS 1164
61.3.4 Clustering of Unlabeled ID 1165
61.3.5 Alert Correlation 1165
61.4 Conclusions and Future Research Directions 1166
References 1166
62 Data Mining for CRM 1168
62.1 What is CRM? 1168
62.2 Data Mining and Campaign Management 1169
62.3 An Example: Customer Acquisition 1170
62.3.1 How Data Mining and Statistical Modeling Changes Things 1171
62.3.2 Defining Some Key Acquisition Concepts 1171
62.3.3 It All Begins with the Data 1173
62.3.4 Test Campaigns 1174
62.3.5 Building Data Mining Models Using Response Behaviors 1174
63 Data Mining for Target Marketing 1176
63.1 Introduction 1176
63.2 Modeling Process 1177
63.3 Evaluation Metrics 1178
63.3.1 Gains Charts 1178
63.3.2 Prediction Accuracy 1181
63.3.3 Profitability/ROI 1181
63.3.4 Gains Table 1181
63.4 Segmentation Methods 1182
63.4.1 Judgmentally-based RFM/FRAT methods 1182
63.4.2 Clustering 1183
63.4.3 Classification Methods 1185
63.4.4 Decision Making 1186
63.5 Predictive Modeling 1187
63.5.1 Linear Regression 1187
63.5.2 Logistic Regression 1188
63.5.3 Neural Networks 1189
63.5.4 Decision Making 1190
63.6 In-Market Timing 1192
63.6.1 Logistic Regression 1192
63.6.2 Survival Analysis 1193
63.7 Pitfalls of Targeting 1195
63.7.1 Modeling Pitfalls 1196
63.7.2 Data Pitfalls 1200
63.7.3 Implementation Pitfalls 1202
63.8 Conclusions 1205
63.8.1 Multiple Offers 1205
63.8.2 Multiple Products/Services 1205
References 1206
64 NHECD - Nano Health and Environmental Commented Database 1208
64.1 Introduction 1208
64.2 The NHECD Model 1216
64.3 NHECD implementation 1217
64.3.1 Taxonomies 1217
64.3.2 Crawling 1217
64.3.3 Information extraction 1220
64.3.4 NHECD products 1222
64.3.5 Scientific paper rating 1222
64.3.6 NHECD Frontend 1223
64.4 Conclusions 1224
64.5 Further research 1226
References 1227
Part VIII Software 1229
65 Commercial Data Mining Software 1230
65.1 Introduction 1230
65.2 Literature Review 1231
65.3 Data Mining Software 1232
65.3.1 BioDiscovery GeneSight 1233
65.3.2 Megaputer PolyAnalyst 5.0 1233
65.3.3 SAS Enterprise Miner 1234
65.3.4 PASW Modeler (formerly SPSS Clementine) 1236
65.3.5 IBM DB2 Intelligent Miner 1237
65.4 Supercomputing Data Mining Software 1239
65.4.1 Data Visualization using Avizo 1239
65.4.2 Data Visualization using JMP Genomics 1241
65.5 Text Mining Software 1243
65.5.1 SAS Text Miner 1243
65.5.2 Megaputer PolyAnalyst 1245
65.6 Web Mining Software 1246
65.6.1 Megaputer PolyAnalyst 1247
65.6.2 SPSS Clementine 1249
65.7 Conclusion and Future Research 1250
References 1251
66 Weka - A Machine Learning Workbench for Data Mining 1254
66.1 Introduction 1254
Acknowledgments 1261
References 1261
Index 1263

Publication date (per publisher) 10.9.2010
Additional info XX, 1285 p. 40 illus.
Place of publication New York
Language English
Keywords algorithm • Bayesian networks • Data Mining • data mining applications • decision trees • ensemble method • KDD • Knowledge Discovery • large datasets • preprocessing method • soft computing method • statistical method • Text Mining • Web mining
ISBN-10 0-387-09823-2 / 0387098232
ISBN-13 978-0-387-09823-4 / 9780387098234
PDF (watermarked)
Size: 19.3 MB

DRM: Digital watermark
This eBook contains a digital watermark and is therefore personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables and figures. A PDF can be displayed on almost all devices, but it is only of limited suitability for small screens (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read with (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a PDF viewer, e.g. the free Adobe Digital Editions app.

Additional feature: Online reading
In addition to downloading it, you can also read this eBook online in your web browser.

Buying eBooks from abroad
For tax law reasons we can only sell eBooks within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.
