Data Engineering (eBook)
XVII, 447 pages
Springer US (publisher)
978-1-4419-0176-7 (ISBN)
DATA ENGINEERING: Mining, Information, and Intelligence describes applied research aimed at the task of collecting data and distilling useful information from that data. Most of the work presented emanates from research completed through collaborations between Acxiom Corporation and its academic research partners under the aegis of the Acxiom Laboratory for Applied Research (ALAR). Chapters are roughly ordered to follow the logical sequence of the transformation of data from raw input data streams to refined information. Four discrete sections cover Data Integration and Information Quality; Grid Computing; Data Mining; and Visualization. Additionally, there are exercises at the end of each chapter.
The primary audience for this book is the broad base of anyone interested in data engineering, whether from academia, market research firms, or business-intelligence companies. The volume is ideally suited for researchers, practitioners, and postgraduate students alike. With its focus on problems arising from industry rather than a basic research perspective, combined with its intelligent organization, extensive references, and subject and author indices, it can serve the academic, research, and industrial audiences.
Preface 4
Table of Contents 7
1 Introduction 16
1.1 Common Problem 16
1.2 Data Integration and Data Management 18
1.2.1 Information Quality Overview 18
1.2.2 Customer Data Integration 19
1.2.2.1 Hygiene 20
1.2.2.2 Enhancement 21
1.2.2.3 Entity Resolution 22
1.2.2.4 Aggregation and Selection 22
1.2.3 Data Management 23
1.2.4 Practical Problems to Data Integration and Management 24
1.3 Analytics 25
1.3.1 Model Development 25
1.3.2 Current Modeling and Optimization Techniques 26
1.3.3 Specific Algorithms and Techniques for Improvement 27
1.3.4 Incremental or Evolutionary Updates 28
1.3.5 Visualization 30
1.4 Conclusion 30
1.5 References 31
2 A Declarative Approach to Entity Resolution 32
2.1 Introduction 32
2.2 Background 33
2.2.1 Entity Resolution Definition 33
2.2.2 Entity Resolution Defense 33
2.2.3 Entity Resolution Terminology 34
2.2.3.1 Prospecting 34
2.2.3.2 Blocking 34
2.2.3.3 Closure 34
2.2.3.4 Matching 35
2.2.4 Declarative Languages 35
2.3 The Declarative Taxonomy: The Nouns 35
2.3.1 Attributes 36
2.3.2 References 36
2.3.3 Paths and Match Functions 37
2.3.4 Entities 39
2.3.5 Super Groups 40
2.3.6 Matching Graphs 41
2.4 A Declarative Taxonomy: The Adjectives 42
2.4.1 Attribute Adjectives 42
2.4.2 Reference Adjectives 44
2.5 The Declarative Taxonomy: The Verbs 44
2.5.1 Attribute Verbs 44
2.5.2 Reference Verbs 45
2.5.3 Entity Verbs 47
2.6 A Declarative Representation 48
2.6.1 The XML Schema 49
2.6.2 A Representation for the Operations 51
2.7 Conclusion 52
2.8 Exercises 52
2.9 References 52
3 Transitive Closure of Data Records: Application and Computation 54
3.1 Introduction 54
3.1.1 Motivation 55
3.1.2 Literature Review 57
3.2 Problem Definition 58
3.3 Sequential Algorithms 60
3.3.1 A Breadth First Search Based Algorithm 60
3.3.2 A Sorting and Disjoint Set Based Algorithm 62
3.3.3 Experiment 66
3.4 Parallel and Distributed Algorithms 68
3.4.1 An Overview of a Parallel and Distributed Scheme 68
3.4.2 Generate Matching Pairs 70
3.4.3 Conversion Process 70
3.4.4 Closure Process 71
3.4.5 A MPI Based Parallel and Distributed Algorithm 77
3.4.6 Experiment 79
3.5 Conclusion 85
3.6 Exercises 86
3.7 Acknowledgments 88
3.8 References 89
4 Semantic Data Matching: Principles and Performance 91
4.1 Introduction 91
4.2 Problem Statement: Data Matching for Customer Data Integration 92
4.3 Semantic Data Matching 92
4.3.1 Background on Latent Semantic Analysis 92
4.3.2 Analysis 94
4.4 Effect of Shared Terms 95
4.4.1 Fundamental Limitations on Data Matching 95
4.4.2 Experiments 96
4.5 Results 97
4.6 Conclusion 101
4.7 Exercises 103
4.8 Acknowledgments 103
4.9 References 103
5 Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data 105
5.1 Introduction 105
5.2 Background 106
5.2.1 Techniques used for General Spelling Error Correction 107
5.2.1.1 Minimum edit distance techniques 107
5.2.1.2 Soundex and Phonetic Strategy 108
5.2.1.3 Rule-based techniques 108
5.2.1.4 N-gram-based techniques 108
5.2.1.5 Probabilistic techniques and Neural Nets 109
5.2.2 Domain-Specific Correction 109
5.3 Individual Name Spelling Correction Algorithm: the Personal Name Recognition Strategy (PNRS) 110
5.3.1 Experiment Results 112
5.4 Conclusion 113
5.5 Exercises 113
5.6 References 114
6 A Parallel General-Purpose Synthetic Data Generator 116
6.1 Introduction 116
6.2 SDDL 117
6.2.1 Min/Max Constraints 118
6.2.2 Distribution Constraints 119
6.2.3 Formula Constraints 119
6.2.4 Iterations 119
6.2.5 Query Pools 121
6.3 Pools 121
6.4 Parallel Data Generation 123
6.4.1 Generation Algorithm 1 124
6.4.2 Generation Algorithm 2 125
6.5 Performance and Applications 126
6.6 Conclusion and Future Directions 127
6.7 Exercises 129
6.8 References 130
7 A Grid Operating Environment for CDI 131
7.1 Introduction 131
7.2 Grid-Based Service Deployment 132
7.2.1 Evolution of the Acxiom Grid (A Case Study) 132
7.2.2 Services Grid 134
7.2.3 Grid Management 136
7.3 Grid-Based Batch Processing 139
7.3.1 Workflow Grid 139
7.3.2 I/O Constraints 145
7.3.3 Data Grid 147
7.3.4 Database Grid 149
7.3.5 Data Management 150
7.4 Conclusion 152
7.5 Exercises 153
8 Parallel File Systems 155
8.1 Introduction 155
8.2 Commercial Data and Access Patterns 156
8.2.1 Large File Access Patterns 157
8.2.2 File System Interfaces 158
8.3 Basics of Parallel File Systems 159
8.3.1 Common Storage System Hardware 160
8.4 Design Challenges 161
8.4.1 Performance 162
8.4.2 Consistency Semantics 162
8.4.3 Fault Tolerance 163
8.4.4 Interoperability 164
8.4.5 Management Tools 165
8.4.6 Traditional Design Challenges 166
8.5 Case Studies 166
8.5.1 Multi-Path File System (MPFS) 166
8.5.1.1 Architecture 167
8.5.1.2 File Mapping Protocol 168
8.5.1.3 Caching 168
8.5.1.4 Fault Tolerance 169
8.5.1.5 Similar File Systems 169
8.5.2 Parallel Virtual File System (PVFS) 169
8.5.2.1 Architecture 169
8.5.2.2 Fault Tolerance 170
8.5.2.3 Application Interfaces 171
8.5.2.4 Consistency Semantics 172
8.5.2.5 Similar File Systems 172
8.5.3 The Google File System (GFS) 172
8.5.3.1 Architecture 173
8.5.3.2 Fault Tolerance 173
8.5.3.3 Application Interfaces 174
8.5.3.4 Consistency Semantics 175
8.5.3.5 Similar File Systems 175
8.5.4 pNFS 175
8.5.4.1 Architecture 176
8.5.4.2 Layouts 177
8.5.4.3 Layout Requests 177
8.5.4.4 Implementations 178
8.6 Conclusion 179
8.7 Exercises 179
8.8 References 180
9 Performance Modeling of Enterprise Grids 181
9.1 Introduction and Background 181
9.1.1 Performance Modeling 181
9.1.2 Capacity Planning Tools and Methodology 183
9.2 Measurement Collection and Preliminary Analysis 185
9.3 Workload Characterization 186
9.3.1 K-means Clustering 188
9.3.1.1 Starting Point Selection 191
9.3.1.2 K-means Analysis Example 192
9.3.2 Hierarchical Workload Characterization 193
9.3.3 Other Issues in Workload Characterization 194
9.4 Baseline System Models and Tool Construction 196
9.4.1 Analytic Models 196
9.4.1.1 Queueing Networks 197
9.4.1.2 Petri Nets 201
9.4.2 Simulation Tools for Enterprise Grid Systems 203
9.5 Enterprise Grid Capacity Planning Case Study 204
9.5.1 Data Collection and Preliminary Analysis 206
9.5.2 Workload Characterization 206
9.5.3 Development and Validation of the Baseline Model 207
9.6 Summary 211
9.7 Exercises 211
9.8 References 212
10 Delay Characteristics of Packet Switched Networks 214
10.1 Introduction 214
10.2 High-Speed Packet Switching Systems 215
10.2.1 Packet Switched General Organization 215
10.2.2 Switching Fabric Structures for Packet Switches 216
10.2.3 Queuing Schemes for Packet Switches 217
10.3 Technical Background 218
10.3.1 Packet Scheduling in Packet Switches 218
10.3.2 Introduction to Network Calculus 219
10.4 Delay Characteristics of Output Queuing Switches 221
10.4.1 Output Queuing Switch System 221
10.4.2 OQ Switch Modeling and Analysis 222
10.4.3 Output Queuing Emulation for Delay Guarantee 223
10.5 Delay Characteristics of Buffered Crossbar Switches 223
10.5.1 Buffered Crossbar Switch System 223
10.5.2 Modeling Traffic Control in Buffered Crossbar Switches 225
10.5.3 Delay Analysis for Buffered Crossbar Switches 226
10.5.4 Numerical Examples 227
10.6 Delay Comparison of Output Queuing to Buffered Crossbar 228
10.6.1 Maximum Packet Delay Comparison 228
10.6.2 Bandwidth Allocation for Delay Performance Guarantees 229
10.6.3 Numerical Examples 230
10.7 Summary 232
10.8 Exercises 233
10.9 References 233
11 Knowledge Discovery in Textual Databases: A Concept-Association Mining Approach 235
11.1 Introduction 235
11.2 Method 238
11.2.1 Concept Based Association Rule Mining Approach 238
11.2.2 Concept Extraction 239
11.2.3 Mining Concept Associations 241
11.2.4 Generating a Directed Graph of Concept Associations 241
11.3 Experiments and Results 243
11.3.1 Isolated words vs. multi-word concepts 243
11.3.2 New Metrics vs. the Traditional Support & Confidence
11.3.2.1 Directed Graphs 247
11.4 Conclusions 250
11.5 Examples 251
11.6 Exercises 252
11.7 References 252
12 Mining E-Documents to Uncover Structures 254
12.1 Introduction 254
12.2 Related Research 255
12.3 Discovery of the Physical Structure 256
12.3.1 Paragraph 256
12.3.2 Heading 257
12.3.2.1 Assigning Heading Levels to Informal Headings 257
12.3.3 Table 261
12.3.4 Image 262
12.3.5 Capturing the physical structure of an e-document 263
12.4 Discovery of the Explicit Terms Using Ontology 272
12.4.1 The Stemmer 273
12.4.2 The Ontology 273
12.4.3 Discovery Process 275
12.5 Discovery of the Logical Structure 277
12.5.1 Segmentation 277
12.5.2 Segments’ Relationships 279
12.6 Empirical Results 281
12.7 Conclusions 283
12.8 Exercises 283
12.9 Acknowledgments 285
12.10 References 285
13 Designing a Flexible Framework for a Table Abstraction 288
13.1 Introduction 288
13.2 Analysis of the Table ADT 290
13.3 Formal Design Contracts 292
13.4 Layered Architecture 294
13.5 Client Layer 295
13.5.1 Abstract Predicates for Keys and Records 296
13.5.2 Keys and the Comparable Interface 296
13.5.3 Records and the Keyed Interface 297
13.5.4 Interactions among the Layers 298
13.6 Access Layer 298
13.6.1 Abstract Predicates for Tables 298
13.6.2 Table Interface 298
13.6.3 Interactions among the Layers 300
13.7 Storage Layer 301
13.7.1 Abstract Predicate for Storable Records 301
13.7.2 Bridge Pattern 301
13.7.3 Proxy Pattern 302
13.7.4 RecordStore Interface 303
13.7.5 RecordSlot Interface 304
13.7.6 Interactions among the Layers 306
13.8 Externalization Module 306
13.9 Iterators 308
13.9.1 Table Iterator Methods 309
13.9.2 Input Iterators 310
13.9.3 Filtering Iterators 311
13.9.4 Query Iterator Methods 312
13.10 Evolving Frameworks 314
13.10.1 Three Examples 314
13.10.2 Whitebox Frameworks 315
13.10.3 Component Library 315
13.10.4 Hot Spots 316
13.10.5 Pluggable Objects 317
13.11 Discussion 317
13.12 Conclusion 319
13.13 Exercises 319
13.14 Acknowledgements 321
13.15 References 321
14 Information Quality Framework for Verifiable Intelligence Products 324
14.1 Introduction 324
14.2 Background 326
14.2.1 Production Process of Intelligence Products 326
14.2.2 Current IQ Practices in the IC 328
14.2.3 Relevant Concepts and Methods of IQ Management 330
14.2.3.1 TDQM Framework 330
14.2.3.2 Treating information as Product and IP-Map 331
14.2.3.3 PolyGen 331
14.2.3.4 QER 332
14.3 IQ Challenges within the IC 332
14.3.1 IQ Issues in Intelligence Collection and Analysis 332
14.3.2 Other IQ Problems 333
14.3.3 IQ Dimensions Related to the IC 334
14.4 Towards a Proposed Solution 335
14.4.1 IQ Metrics for Intelligence Products 336
14.4.2 Verifiability of Intelligence Products 337
14.4.3 Objectives and Plan 338
14.5 Conclusion 340
14.6 Exercises 340
14.7 References 340
15 Interactive Visualization of Large High-Dimensional Datasets 343
15.1 Introduction 343
15.1.1 Related work 343
15.1.2 General requirements for a data visualization system 344
15.2 Data Visualization Process 345
15.2.1 Data Rendering Stage 346
15.2.1.1 Choosing visual objects and features 347
15.2.1.2 Non-uniform data distribution problem 347
15.2.2 Backward Transformation Stage 349
15.2.3 Knowledge Extraction Stage 350
15.3 Interactive Visualization Model 351
15.4 Utilizing Summary Icons 352
15.5 A Case Study 354
15.6 Conclusion 358
15.7 Exercises 358
15.8 Acknowledgements 358
15.9 References 358
16 Image Watermarking Based on Pyramid Decomposition with CH Transform 360
16.1 Introduction 360
16.2 Algorithm for multi-layer image watermarking 361
16.2.1 Resistant watermarking 361
16.2.2 Resistant watermark detection 371
16.2.3 Fragile watermarking 376
16.3 Data hiding 377
16.4 Evaluation of the watermarking efficiency 378
16.5 Experimental results 379
16.6 Application areas 386
16.6.1 Resistant watermarks 386
16.6.2 Fragile watermarks 387
16.6.3 Data hiding 387
16.7 Conclusion 387
16.8 Exercises 388
16.9 Acknowledgment 393
16.10 References 393
17 Immersive Visualization of Cellular Structures 395
17.1 Introduction 395
17.2 Light Microscopic Cellular Images and Focus: Basics 396
17.3 Flat-Field Correction 398
17.4 Separation of Transparent Layers using Focus 399
17.5 3D Visualization of Cellular Structures 402
17.5.1 Volume Rendering 402
17.5.2 Immersive Visualization: CAVE Environment 404
17.6 Conclusions 407
17.7 Exercises 407
17.8 References 407
18 Visualization and Ontology of Geospatial Intelligence 409
18.1 Introduction 409
18.1.1 Premises 409
18.1.2 Research Agenda 410
18.2 Semantic Information Representation and Extraction 411
18.3 Markov Random Field 412
18.3.1 Spatial or Contextual Pattern Recognition 413
18.3.2 Image Classification using k-medoid Method 413
18.3.3 Random Field and Spatial Time Series 416
18.3.4 First Persian-Gulf-War Example 418
18.4 Context-driven Visualization 420
18.4.1 Relevant Methodologies 420
18.4.2 Visual Perception and Tracking 421
18.4.3 Visualization 423
18.5 Intelligent Information Fusion 425
18.5.1 Semantic Information Extraction 425
18.5.2 Intelligent Contextual Inference 426
18.5.3 Context-driven Ontology 426
18.6 Metrics for Knowledge Extraction and Discovery 427
18.7 Conclusions and Recommendations 428
18.7.1 Contributions 428
18.7.2 Looking Ahead 429
18.8 Exercises 430
18.9 Acknowledgements 433
18.10 References 434
19 Looking Ahead 436
19.1 Introduction 436
19.2 Data Integration and Information Qual 437
19.3 Grid Computing 439
19.4 Data Mining 440
19.5 Visualization 442
19.6 References 443
Index 445
Publication date (per publisher) | 15 October 2009 |
---|---|
Series | International Series in Operations Research & Management Science |
Additional info | XVII, 447 p. |
Place of publication | New York |
Language | English |
Subject areas | Computer Science ► Databases ► Data Warehouse / Data Mining |
Mathematics / Computer Science ► Computer Science ► Networks | |
Computer Science ► Office Programs ► Outlook | |
Computer Science ► Theory / Studies ► Algorithms | |
Mathematics / Computer Science ► Computer Science ► Web / Internet | |
Mathematics / Computer Science ► Mathematics ► Financial / Business Mathematics | |
Natural Sciences | |
Economics ► Business Administration / Management ► Planning / Organization | |
Economics ► Business Administration / Management ► Business Information Systems | |
Keywords | Analytics • association • classification • Database • Data Quality • entity resolution • grid computing • Performance • Visualization |
ISBN-10 | 1-4419-0176-0 / 1441901760 |
ISBN-13 | 978-1-4419-0176-7 / 9781441901767 |
Size: 12.2 MB
DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices, but it is only of limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.
Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.