Managing and Mining Uncertain Data (eBook)
XXII, 494 Seiten
Springer US (Verlag)
978-0-387-09690-2 (ISBN)
Managing and Mining Uncertain Data, a survey with chapters by a variety of well known researchers in the data mining field, presents the most recent models, algorithms, and applications in the uncertain data mining field in a structured and concise way. This book is organized to make it more accessible to applications-driven practitioners for solving real problems. Also, given the lack of structurally organized information on this topic, Managing and Mining Uncertain Data provides insights which are not easily accessible elsewhere. Managing and Mining Uncertain Data is designed for a professional audience composed of researchers and practitioners in industry. This book is also suitable as a reference book for advanced-level students in computer science and engineering, as well as the ACM, IEEE, SIAM, INFORMS and AAAI Society groups.
Managing and Mining Uncertain Data, a survey with chapters by a variety of well known researchers in the data mining field, presents the most recent models, algorithms, and applications in the uncertain data mining field in a structured and concise way. This book is organized to make it more accessible to applications-driven practitioners for solving real problems. Also, given the lack of structurally organized information on this topic, Managing and Mining Uncertain Data provides insights which are not easily accessible elsewhere. Managing and Mining Uncertain Data is designed for a professional audience composed of researchers and practitioners in industry. This book is also suitable as a reference book for advanced-level students in computer science and engineering, as well as the ACM, IEEE, SIAM, INFORMS and AAAI Society groups.
Preface 6
Contents 8
List of Figures 16
List of Tables 21
Chapter 1 AN INTRODUCTION TO UNCERTAIN DATA ALGORITHMS AND APPLICATIONS 22
1. Introduction 22
2. Algorithms for Uncertain Data 24
3. Conclusions 27
Acknowledgements 28
References 28
Chapter 2 MODELS FOR INCOMPLETE AND PROBABILISTIC INFORMATION 30
1. Introduction 30
2. Incomplete Information and Representation Systems 34
3. RA-Completeness and Finite Completeness 35
4. Closure Under Relational Operations 39
5. Algebraic Completion 40
6. Probabilistic Databases and Representation Systems 42
7. Probabilistic ?-Tables and Probabilistic Or-Set Tables 43
8. Probabilistic c-tables 45
9. Queries on Annotated Relations 46
10. K-Relations 48
11. Polynomials for Provenance 51
12. Query Containment 54
13. RelatedWork 55
14. Conclusion and Further Work 55
Acknowledgments 56
References 57
15. Appendix 61
Chapter 3 RELATIONAL MODELS AND ALGEBRA FOR UNCERTAIN DATA 64
1. Introduction 64
2. Different Probabilistic Data Models 67
2.1 Point-Valued Probability Measures Assigned to Each Tuple 67
2.2 Point-Valued Probability Measures Assigned to Attributes and Attribute Sets 71
2.3 Interval-Valued Probability Measures Assigned to Attribute-Values 73
2.4 Interval-valued Probability Measures Assigned to Tuples 75
3. Probabilistic Relational Algebra 75
3.1 Basic Definitions 75
3.2 Primary and Foreign Keys 77
3.3 Relational Operations 78
3.4 Relational Algebra 81
3.5 Incomplete Distribution and Null Values 82
4. Algebraic Implications of the Different Representations and Associated Assumptions 86
4.1 Models Assigning Point-Valued Probability Measures at the Tuple Level 87
4.2 Models Assigning Point-Valued Probability Measures at the Attribute Level 88
4.3 Models Assigning Interval-Valued Probability Measures at the Attribute Level 89
4.4 Models Assigning Interval-Valued Probability Measures to Tuples 90
4.5 Some Observations on the Independence Assumption Across Tuples 91
5. Concluding Remarks 91
References 93
Chapter 4 GRAPHICAL MODELS FOR UNCERTAIN DATA 95
1. Introduction 96
2. Graphical Models: Overview 98
2.1 Directed Graphical Models: Bayesian Networks 100
2.2 Undirected Graphical Models: Markov Networks 102
2.3 Inference Queries 103
3. Representing Uncertainty using Graphical Models 105
3.1 PossibleWorld Semantics 106
3.2 Shared Factors 109
3.3 Representing Probabilistic Relations 109
4. Query Evaluation over Uncertain Data 110
4.1 Example 112
4.2 Generating Factors during Query Evaluation 114
4.3 Query Evaluation as Inference 117
4.4 Optimizations 118
5. RelatedWork and Discussion 119
5.1 Safe Plans 119
5.2 Representing Uncertainty using Lineage 120
5.3 Probabilistic Relational Models 120
5.4 Lifted Inference 122
5.5 Scalable Inference using a Relational Database 122
6. Conclusions 123
Acknowledgments 124
References 125
Chapter 5 TRIO:ASYSTEMFORDATA,UNCERTAINTY,AND LINEAGE 131
Introduction 131
1. ULDBs: Uncertainty-Lineage Databases 134
1.1 Alternatives 135
1.2 ‘?’ (Maybe) Annotations 135
1.3 Confidences 136
1.4 Lineage 137
1.5 Relational Queries 139
2. TriQL: The Trio Query Language 140
2.1 Operational Semantics 140
2.2 Querying Confidences 142
2.3 Querying Lineage 142
2.4 Duplicate Elimination 143
2.5 Aggregation 144
2.6 Reorganizing Alternatives 145
2.7 Horizontal Subqueries 146
2.8 Query-Defined Result Confidences 146
2.9 Other TriQL Query Constructs 147
3. Data Modifications in Trio 148
3.1 Inserts 148
3.2 Deletes 149
3.3 Updates 149
3.4 Data Modifications and Versioning 151
4. Confidence Computation 151
5. Additional Trio Features 153
6. The Trio System 154
6.1 Encoding ULDB Data 156
6.2 Basic Query Translation Scheme 158
6.3 Duplicate Elimination 159
6.4 Aggregation 159
6.5 Reorganizing Alternatives 160
6.6 Horizontal Subqueries 160
6.7 Built-In Predicates and Functions 161
6.8 Query-Defined Result Confidences 162
6.9 Remaining Constructs 162
References 164
Chapter 6 MAYBMS: A SYSTEM FOR MANAGING LARGE UNCERTAIN AND PROBABILISTIC DATABASES 166
1. Introduction 166
2. Probabilistic Databases 168
3. Query Language Desiderata 169
4. The Algebra 170
5. Representing Probabilistic Data 176
6. Conceptual Query Evaluation, Rewritings, and Asymptotic Efficiency 181
7. The MayBMS Query and Update Language 187
8. The MayBMS System 192
9. Conclusions and Outlook 195
References 197
Chapter 7 UNCERTAINTY IN DATA INTEGRATION 200
1. Introduction 200
2. Overview of the System 202
2.1 Uncertainty in data integration 202
2.2 System architecture 203
2.3 Source of probabilities 204
2.4 Outline of the chapter 205
3. Uncertainty in Mappings 205
3.1 Motivating probabilistic mappings 205
3.2 Definition and Semantics 207
3.3 Query Answering 212
3.4 Creating P-mappings 216
3.5 Broader Classes of Mappings 219
3.6 Other Types of Approximate Schema Mappings 221
4. Uncertainty in Mediated Schema 222
4.1 P-Med-Schema Motivating Example 222
4.2 Probabilistic Mediated Schema 225
4.3 P-med-schema Creation 227
4.4 Consolidation 229
4.5 Other approaches 231
5. Future Directions 232
References 233
Chapter 8 SKETCHING AGGREGATES OVER PROBABILISTIC STREAMS 236
1. Introduction 236
1.1 Aggregates over probabilistic streams 238
1.2 Organization 239
2. The Probabilistic Stream Model 239
2.1 Problem definitions 241
2.2 Frequency Moments and Quantiles 242
3. Overview of techniques and summary of results 244
4. Universal Sampling 248
5. Frequency moments: DISTINCT and REPEAT-RATE 249
5.1 DISTINCT 249
5.2 REPEAT-RATE 251
6. Heavy-Hitters, Quantiles, and MEDIAN 253
7. A Binning Technique for MIN and MAX 254
8. Estimating AVG using generating functions 257
8.1 Generating functions 257
8.2 Estimating AVG 258
8.3 Approximating AVG by SUM/COUNT 261
9. Discussion 265
References 266
Chapter 9 PROBABILISTIC JOIN QUERIES IN UNCERTAIN DATABASES 269
1. Introduction 269
2. Traditional Join Approaches 270
2.1 Simple Nested-Loop Join 271
2.2 Nested-Block-Loop Join 272
2.3 Sort-Merge-Loop Join 272
2.4 Other Join Methods 273
2.5 Spatial Join Algorithms 273
2.6 Spatial Join using a spatial index structure for both relations 274
2.7 Spatial Join using a spatial index structure on one relation 274
2.8 Spatial Join using no Spatial-Index Structure 275
3. Uncertainty Models and Join Predicates 275
3.1 The Continuous Uncertainty Model 276
3.2 The Discrete Uncertainty Model 278
3.3 Join Predicates and Score 283
3.4 Probabilistic Join Query Types 284
3.5 Example 285
3.6 Overview of UncertaintyModels and Probabilistic Join Queries 286
4. Approaches for Efficient Join Processing on Uncertain Data 289
4.1 Confidence-Based Join Methods 289
4.2 Probabilistic Similarity Joins 292
4.3 Probabilistic Spatial Join 300
5. Summary 301
References 304
Chapter 10 INDEXING UNCERTAIN DATA 310
1. Introduction 311
2. Data Models and Query Semantics 313
2.1 Uncertain Attribute types 314
3. Uncertainty Index for Continuous Domains 314
3.1 Probability Threshold Indexing 315
3.2 Special Case: Uniform PDFS 318
3.3 2D mapping of intervals 318
3.4 Join Queries 320
3.5 Multi-dimensional Indexing 322
4. Uncertainty Index for discrete domains 322
4.1 Data Model and Problem Definition 323
4.2 Probabilistic Inverted Index 324
4.3 Probabilistic Distribution R-tree 327
5. Indexing for Nearest Neighbor Queries 329
References 333
Chapter 11 RANGEANDNEARESTNEIGHBORQUERIESON UNCERTAIN SPATIOTEMPORAL DATA 336
1. Introduction 336
2. Range Search 340
2.1 Query Definitions 340
2.2 Filter and Refinement 342
2.3 Nonfuzzy Range Search 343
2.4 Fuzzy Range Search 345
2.5 Indexing 348
3. Nearest Neighbor Retrieval 349
3.1 Query Definition 349
3.2 Query Processing 350
3.3 Variations of Nearest Neighbor Retrieval 352
4. Summary 356
References 357
Chapter 12 PROBABILISTIC XML 360
1. Introduction 360
2. Sources of Uncertainty in XML Data 361
3. Modeling Uncertainty using Tags 362
4. Modeling Uncertainty using Semi-structured Data 365
5. Modeling Uncertainty in XML with Independent or Mutually Exclusive Distribution in Point Probability 367
6. Formal Model of Probabilistic Semi-structured Data with Arbitrary Probabilistic Distributions 370
6.1 Motivating Examples 372
6.2 Probabilistic Semi-structured Data Model 374
6.3 Semantics 379
6.4 PXML Algebra and Comparison with PreviousWork 382
6.5 Probabilistic Aggregate Operations 384
6.6 Modeling Interval Uncertainty in Semi-structured Data 386
7. Summary 389
Acknowledgements 390
References 391
Chapter 13 ON CLUSTERING ALGORITHMS FOR UNCERTAIN DATA 394
1. Introduction 394
2. Density Based Clustering Algorithms 396
3. The UK-means and CK-means Algorithms 399
4. UMicro: Streaming Algorithms for Clustering Uncertain Data 400
4.1 The UMicro Algorithm: Overview 402
4.2 Computing Expected Similarity 403
4.3 Computing the Uncertain Boundary 407
4.4 Further Enhancements 407
5. Approximation Algorithms for Clustering UncertainData 408
6. Conclusions and Summary 408
Acknowledgements 409
References 409
Chapter 14 ON APPLICATIONS OF DENSITY TRANS FORMSFOR UNCERTAIN DATA MINING 412
1. Introduction 412
2. Kernel Density Estimation with Errors 414
2.1 Scalability for Large Data Sets 417
3. Leveraging Density Estimation for Classification 421
4. Application of Density Based Approach to Outlier Detection 424
4.1 Outlier Detection Approach 425
4.2 Subspace Exploration for Outlier Detection 426
5. Conclusions 428
Acknowledgements 428
References 429
Chapter 15 FREQUENT PATTERN MINING ALGORITHMS WITH UNCERTAIN DATA 431
1. Introduction 432
2. Frequent Pattern Mining of Uncertain Data Sets 433
3. Apriori-style Algorithms 434
3.1 Pruning Methods for Apriori-Style Algorithms 435
4. Set-Enumeration Methods 438
5. Pattern Growth based Mining Algorithms 438
5.1 Extending the H-mine algorithm 439
5.2 Extending the FP-growth Algorithm 440
5.3 Another Variation of the FP-growth Algorithm 450
6. A Comparative Study on Challenging Cases 450
6.1 Performance Comparison 454
6.2 Scalability Comparison 456
7. Generalization to the PossibleWorlds Model 458
8. Discussion and Conclusions 459
Acknowledgements 460
References 461
Chapter 16 PROBABILISTIC QUERYING AND MINING OF BIOLOGICAL IMAGES 464
1. Introduction 464
1.1 An Illustrative Example 465
2. RelatedWork 468
3. Probabilistic Image Analyses 469
3.1 Probabilistic Segmentation 469
3.2 Measuring Neurite Thickness 472
3.3 Ganglion Cell Features 473
4. Querying Probabilistic Image Data 474
4.1 Range Queries on Uncertain Data 475
4.2 k-NN Queries on Uncertain Data 475
4.3 Adaptive, Piecewise-Linear Approximations 477
4.4 Indexing the APLA 477
4.5 Experimental Results 478
5. Mining Probabilistic Image Data 479
5.1 Defining Probabilistic Spatial Join (PSJ) 480
5.2 Threshold PSJ Query 481
5.3 Top-k PSJ Query 482
5.4 Experimental Results 483
6. Conclusion 484
Acknowledgements 485
References 486
Index 492
Erscheint lt. Verlag | 8.7.2010 |
---|---|
Reihe/Serie | Advances in Database Systems | Advances in Database Systems |
Zusatzinfo | XXII, 494 p. 60 illus. |
Verlagsort | New York |
Sprache | englisch |
Themenwelt | Informatik ► Datenbanken ► Data Warehouse / Data Mining |
Informatik ► Netzwerke ► Sicherheit / Firewall | |
Informatik ► Theorie / Studium ► Künstliche Intelligenz / Robotik | |
Schlagworte | 2009titlesSusan • Aggarwal • classification • Clustering • currentsmp • Data • Data Mining • Managing • Mining • Uncertain |
ISBN-10 | 0-387-09690-6 / 0387096906 |
ISBN-13 | 978-0-387-09690-2 / 9780387096902 |
Haben Sie eine Frage zum Produkt? |
Größe: 7,2 MB
DRM: Digitales Wasserzeichen
Dieses eBook enthält ein digitales Wasserzeichen und ist damit für Sie personalisiert. Bei einer missbräuchlichen Weitergabe des eBooks an Dritte ist eine Rückverfolgung an die Quelle möglich.
Dateiformat: PDF (Portable Document Format)
Mit einem festen Seitenlayout eignet sich die PDF besonders für Fachbücher mit Spalten, Tabellen und Abbildungen. Eine PDF kann auf fast allen Geräten angezeigt werden, ist aber für kleine Displays (Smartphone, eReader) nur eingeschränkt geeignet.
Systemvoraussetzungen:
PC/Mac: Mit einem PC oder Mac können Sie dieses eBook lesen. Sie benötigen dafür einen PDF-Viewer - z.B. den Adobe Reader oder Adobe Digital Editions.
eReader: Dieses eBook kann mit (fast) allen eBook-Readern gelesen werden. Mit dem amazon-Kindle ist es aber nicht kompatibel.
Smartphone/Tablet: Egal ob Apple oder Android, dieses eBook können Sie lesen. Sie benötigen dafür einen PDF-Viewer - z.B. die kostenlose Adobe Digital Editions-App.
Zusätzliches Feature: Online Lesen
Dieses eBook können Sie zusätzlich zum Download auch online im Webbrowser lesen.
Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.
aus dem Bereich