Data Mining and Knowledge Discovery via Logic-Based Methods (eBook)

Theory, Algorithms, and Applications
eBook Download: PDF
2010 | 2010
XXXIV, 350 Seiten
Springer US (Verlag)
978-1-4419-1630-3 (ISBN)

Lese- und Medienproben

Data Mining and Knowledge Discovery via Logic-Based Methods - Evangelos Triantaphyllou
Systemvoraussetzungen
149,79 inkl. MwSt
  • Download sofort lieferbar
  • Zahlungsarten anzeigen
The importance of having ef cient and effective methods for data mining and kn- ledge discovery (DM&KD), to which the present book is devoted, grows every day and numerous such methods have been developed in recent decades. There exists a great variety of different settings for the main problem studied by data mining and knowledge discovery, and it seems that a very popular one is formulated in terms of binary attributes. In this setting, states of nature of the application area under consideration are described by Boolean vectors de ned on some attributes. That is, by data points de ned in the Boolean space of the attributes. It is postulated that there exists a partition of this space into two classes, which should be inferred as patterns on the attributes when only several data points are known, the so-called positive and negative training examples. The main problem in DM&KD is de ned as nding rules for recognizing (cl- sifying) new data points of unknown class, i. e. , deciding which of them are positive and which are negative. In other words, to infer the binary value of one more attribute, called the goal or class attribute. To solve this problem, some methods have been suggested which construct a Boolean function separating the two given sets of positive and negative training data points.
The importance of having ef cient and effective methods for data mining and kn- ledge discovery (DM&KD), to which the present book is devoted, grows every day and numerous such methods have been developed in recent decades. There exists a great variety of different settings for the main problem studied by data mining and knowledge discovery, and it seems that a very popular one is formulated in terms of binary attributes. In this setting, states of nature of the application area under consideration are described by Boolean vectors de ned on some attributes. That is, by data points de ned in the Boolean space of the attributes. It is postulated that there exists a partition of this space into two classes, which should be inferred as patterns on the attributes when only several data points are known, the so-called positive and negative training examples. The main problem in DM&KD is de ned as nding rules for recognizing (cl- sifying) new data points of unknown class, i. e. , deciding which of them are positive and which are negative. In other words, to infer the binary value of one more attribute, called the goal or class attribute. To solve this problem, some methods have been suggested which construct a Boolean function separating the two given sets of positive and negative training data points.

Foreword 7
Preface 11
Acknowledgments 16
List of Figures 25
List of Tables 29
Part I Algorithmic Issues 32
Introduction 33
What Is Data Mining and Knowledge Discovery? 33
Some Potential Application Areas for Data Mining and Knowledge Discovery 34
Applications in Engineering 35
Applications in Medical Sciences 35
Applications in the Basic Sciences 36
Applications in Business 36
Applications in the Political and Social Sciences 37
The Data Mining and Knowledge Discovery Process 37
Problem Definition 37
Collecting the Data 39
Data Preprocessing 40
Application of the Main Data Mining and Knowledge Discovery Algorithms 41
Interpretation of the Results of the Data Mining and Knowledge Discovery Process 42
Four Key Research Challenges in Data Mining and Knowledge Discovery 42
Collecting Observations about the Behavior of the System 43
Identifying Patterns from Collections of Data 44
Which Data to Consider for Evaluation Next? 47
Do Patterns Always Exist in Data? 49
Concluding Remarks 50
Inferring a Boolean Function from Positive and Negative Examples 51
An Introduction 51
Some Background Information 52
Data Binarization 56
Definitions and Terminology 59
Generating Clauses from Negative Examples Only 62
Clause Inference as a Satisfiability Problem 63
An SAT Approach for Inferring CNF Clauses 64
The One Clause At a Time (OCAT) Concept 65
A Branch-and-Bound Approach for Inferring a Single Clause 68
A Heuristic for Problem Preprocessing 75
Some Computational Results 77
Concluding Remarks 80
Appendix 82
A Revised Branch-and-Bound Approach for Inferring a Boolean Function from Examples 87
Some Background Information 87
The Revised Branch-and-Bound Algorithm 87
Generating a Single CNF Clause 88
Generating a Single DNF Clause 92
Some Computational Results 94
Concluding Remarks 99
Some Fast Heuristics for Inferring a Boolean Function from Examples 103
Some Background Information 103
A Fast Heuristic for Inferring a Boolean Function from Complete Data 105
A Fast Heuristic for Inferring a Boolean Function from Incomplete Data 110
Some Computational Results 114
Results for the RA1 Algorithm on the Wisconsin Cancer Data 116
Results for the RA2 Heuristic on the Wisconsin Cancer Data with Some Missing Values 121
Comparison of the RA1 Algorithm and the B& B Method Using Large Random Data Sets
Concluding Remarks 128
An Approach to Guided Learning of Boolean Functions 131
Some Background Information 131
Problem Description 134
The Proposed Approach 135
On the Number of Candidate Solutions 140
An Illustrative Example 141
Some Computational Results 143
Concluding Remarks 152
An Incremental Learning Algorithm for Inferring Boolean Functions 154
Some Background Information 154
Problem Description 155
Some Related Developments 156
The Proposed Incremental Algorithm 159
Repairing a Boolean Function that Incorrectly Rejects a Positive Example 160
Repairing of a Boolean Function that Incorrectly Accepts a Negative Example 162
Computational Complexity of the Algorithms for the ILE Approach 163
Experimental Data 163
Analysis of the Computational Results 164
Results on the Classification Accuracy 165
Results on the Number of Clauses 168
Results on the CPU Times 170
Concluding Remarks 173
A Duality Relationship Between Boolean Functions in CNF and DNF Derivable from the Same Training Examples 175
Introduction 175
Generating Boolean Functions in CNF and DNF Form 175
An Illustrative Example of Deriving Boolean Functions in CNF and DNF 176
Some Computational Results 177
Concluding Remarks 178
The Rejectability Graph of Two Sets of Examples 179
Introduction 179
The Definition of the Rejectability Graph 180
Properties of the Rejectability Graph 181
On the Minimum Clique Cover of the Rejectability Graph 183
Problem Decomposition 184
Connected Components 184
Clique Cover 185
An Example of Using the Rejectability Graph 186
Some Computational Results 188
Concluding Remarks 198
Part II Application Issues 199
The Reliability Issue in Data Mining: The Case of Computer-Aided Breast Cancer Diagnosis 200
Introduction 200
Some Background Information on Computer-Aided Breast Cancer Diagnosis 200
Reliability Criteria 202
The Representation/Narrow Vicinity Hypothesis 205
Some Computational Results 208
Concluding Remarks 210
Appendix I: Definitions of the Key Attributes 212
Appendix II: Technical Procedures 214
The Interactive Approach 214
The Hierarchical Approach 215
The Monotonicity Property 215
Logical Discriminant Functions 216
Data Mining and Knowledge Discovery by Means of Monotone Boolean Functions 218
Introduction 218
Background Information 220
Problem Descriptions 220
Hierarchical Decomposition of Attributes 223
Some Key Properties of Monotone Boolean Functions 224
Existing Approaches to Problem 1 228
An Existing Approach to Problem 2 230
Existing Approaches to Problem 3 231
Stochastic Models for Problem 3 231
Inference Objectives and Methodology 233
The Inference Objective for Problem 1 233
The Inference Objective for Problem 2 234
The Inference Objective for Problem 3 235
Incremental Updates for the Fixed Misclassification Probability Model 235
Selection Criteria for Problem 1 236
Selection Criteria for Problems 2.1, 2.2, and 2.3 237
Selection Criterion for Problem 3 237
Experimental Results 242
Experimental Results for Problem 1 242
Experimental Results for Problem 2 244
Experimental Results for Problem 3 246
Summary and Discussion 250
Summary of the Research Findings 250
Significance of the Research Findings 252
Future Research Directions 253
Concluding Remarks 254
Some Application Issues of Monotone Boolean Functions 255
Some Background Information 255
Expressing Any Boolean Function in Terms of Monotone Ones 255
Formulations of Diagnostic Problems as the Inference of Nested Monotone Boolean Functions 257
An Application to a Reliability Engineering Problem 257
An Application to the Breast Cancer Diagnosis Problem 258
Design Problems 259
Process Diagnosis Problems 260
Three Major Illusions in the Evaluation of the Accuracy of Data Mining Models 260
First Illusion: The Single Index Accuracy Rate 261
Second Illusion: Accurate Diagnosis without Hard Cases 261
Third Illusion: High Accuracy on Random Test Data Only 262
Identification of the Monotonicity Property 262
Concluding Remarks 265
Mining of Association Rules 266
Some Background Information 266
Problem Description 268
Methodology 269
Some Related Algorithmic Developments 269
Alterations to the RA1 Algorithm 270
Computational Experiments 272
Concluding Remarks 280
Data Mining of Text Documents 281
Some Background Information 281
A Brief Description of the Document Clustering Process 283
Using the OACT Approach to Classify Text Documents 284
An Overview of the Vector Space Model 286
A Guided Learning Approach for the Classification of Text Documents 288
Experimental Data 289
Testing Methodology 291
The Leave-One-Out Cross Validation 291
The 30/30 Cross Validation 291
Statistical Performance of Both Algorithms 291
Experimental Setting for the Guided Learning Approach 292
Results for the Leave-One-Out and the 30/30 Cross Validations 293
Results for the Guided Learning Approach 296
Concluding Remarks 299
First Case Study: Predicting Muscle Fatigue from EMG Signals 301
Introduction 301
General Problem Description 301
Experimental Data 303
Analysis of the EMG Data 304
The Effects of Load and Electrode Orientation 304
The Effects of Muscle Condition, Load, and Electrode Orientation 304
A Comparative Analysis of the EMG Data 305
Results by the OCAT/RA1 Approach 306
Results by Fisher's Linear Discriminant Analysis 307
Results by Logistic Regression 308
A Neural Network Approach 309
Concluding Remarks 311
Second Case Study: Inference of Diagnostic Rules for Breast Cancer 312
Introduction 312
Description of the Data Set 312
Description of the Inferred Rules 315
Concluding Remarks 319
A Fuzzy Logic Approach to Attribute Formalization: Analysis of Lobulation for Breast Cancer Diagnosis 320
Introduction 320
Some Background Information on Digital Mammography 320
Some Background Information on Fuzzy Sets 322
Formalization with Fuzzy Logic 323
Degrees of Lobularity and Microlobularity 329
Concluding Remarks 331
Conclusions 332
General Concluding Remarks 332
Twelve Key Areas of Potential Future Research on Data Mining and Knowledge Discovery from Databases 333
Overfitting and Overgeneralization 333
Guided Learning 334
Stochasticity 334
More on Monotonicity 334
Visualization 334
Systems for Distributed Computing Environments 335
Developing Better Exact Algorithms and Heuristics 335
Hybridization and Other Algorithmic Issues 335
Systems with Self-Explanatory Capabilities 336
New Systems for Image Analysis 336
Systems for Web Applications 336
Developing More Applications 337
Epilogue 337
References 339
Subject Index 356
Author Index 366
About the Author 370

Erscheint lt. Verlag 8.6.2010
Reihe/Serie Springer Optimization and Its Applications
Springer Optimization and Its Applications
Zusatzinfo XXXIV, 350 p. 91 illus., 9 illus. in color.
Verlagsort New York
Sprache englisch
Themenwelt Informatik Datenbanken Data Warehouse / Data Mining
Mathematik / Informatik Informatik Programmiersprachen / -werkzeuge
Informatik Theorie / Studium Künstliche Intelligenz / Robotik
Mathematik / Informatik Mathematik Allgemeines / Lexika
Mathematik / Informatik Mathematik Logik / Mengenlehre
Technik
Wirtschaft Betriebswirtschaft / Management Planung / Organisation
Wirtschaft Betriebswirtschaft / Management Unternehmensführung / Management
Schlagworte algorithms • Artifical Intelligence • boolean function • Data Analysis • Data Mining • Decision Making • Intelligent Systems • Knowledge Discovery • Learning Systems • Logic • Mathematical Logic
ISBN-10 1-4419-1630-X / 144191630X
ISBN-13 978-1-4419-1630-3 / 9781441916303
Haben Sie eine Frage zum Produkt?
PDFPDF (Wasserzeichen)
Größe: 6,4 MB

DRM: Digitales Wasserzeichen
Dieses eBook enthält ein digitales Wasser­zeichen und ist damit für Sie persona­lisiert. Bei einer missbräuch­lichen Weiter­gabe des eBooks an Dritte ist eine Rück­ver­folgung an die Quelle möglich.

Dateiformat: PDF (Portable Document Format)
Mit einem festen Seiten­layout eignet sich die PDF besonders für Fach­bücher mit Spalten, Tabellen und Abbild­ungen. Eine PDF kann auf fast allen Geräten ange­zeigt werden, ist aber für kleine Displays (Smart­phone, eReader) nur einge­schränkt geeignet.

Systemvoraussetzungen:
PC/Mac: Mit einem PC oder Mac können Sie dieses eBook lesen. Sie benötigen dafür einen PDF-Viewer - z.B. den Adobe Reader oder Adobe Digital Editions.
eReader: Dieses eBook kann mit (fast) allen eBook-Readern gelesen werden. Mit dem amazon-Kindle ist es aber nicht kompatibel.
Smartphone/Tablet: Egal ob Apple oder Android, dieses eBook können Sie lesen. Sie benötigen dafür einen PDF-Viewer - z.B. die kostenlose Adobe Digital Editions-App.

Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.

Mehr entdecken
aus dem Bereich
A data engineer's guide to building and managing ETL and ELT …

von Dmitry Foshin; Tonya Chernyshova; Dmitry Anoshin …

eBook Download (2024)
Packt Publishing (Verlag)
39,59