High-Performance Parallel Database Processing and Grid Databases

David Taniar, Clement H. C. Leung, Wenny Rahayu, Sushant Goel (Autoren)

Buch | Hardcover

576 Seiten

2008
John Wiley & Sons Inc (Verlag)
978-0-470-10762-1 (ISBN)

Artikel merken

High Performance Parallel Database Processing and Grid Databases targets the theoretical/conceptual details needed to form a base of understanding and then delivers information on development, implementations, and analytical modeling of parallel databases. This includes key information on new developments with grid databases.

The latest techniques and principles of parallel and grid database processing The growth in grid databases, coupled with the utility of parallel query processing, presents an important opportunity to understand and utilize high-performance parallel database processing within a major database management system (DBMS). This important new book provides readers with a fundamental understanding of parallelism in data-intensive applications, and demonstrates how to develop faster capabilities to support them. It presents a balanced treatment of the theoretical and practical aspects of high-performance databases to demonstrate how parallel query is executed in a DBMS, including concepts, algorithms, analytical models, and grid transactions.

High-Performance Parallel Database Processing and Grid Databases serves as a valuable resource for researchers working in parallel databases and for practitioners interested in building a high-performance database. It is also a much-needed, self-contained textbook for database courses at the advanced undergraduate and graduate levels.

David Taniar, PhD, lectures in information technology at Monash University, Australia. Dr. Taniar has published extensively in the field of high- performance parallel databases and is the Editor in Chief of the International Journal of Data Warehousing and Mining. Clement H. C. Leung, PhD, is Foundation Chair in Computer Science at Victoria University, Australia. Dr. Leung previously held the Established Chair in Computer Science at the University of London. Wenny Rahayu, PhD, is Associate Professor at La Trobe University, Australia, and actively works in the areas of database design and implementation, covering object-relational databases and Web databases. Sushant Goel, PhD, is a software consultant and holds a PhD in computer systems engineering from RMIT University, Australia. His research interests are in grid transaction management and software development processes, such as agile computing.

Preface xv

Part I Introduction

1. Introduction 3

1.1. A Brief Overview: Parallel Databases and Grid Databases 4

1.2. Parallel Query Processing: Motivations 5

1.3. Parallel Query Processing: Objectives 7

1.3.1. Speed Up 7

1.3.2. Scale Up 8

1.3.3. Parallel Obstacles 10

1.4. Forms of Parallelism 12

1.4.1. Interquery Parallelism 13

1.4.2. Intraquery Parallelism 14

1.4.3. Intraoperation Parallelism 15

1.4.4. Interoperation Parallelism 15

1.4.5. Mixed Parallelism—A More Practical Solution 18

1.5. Parallel Database Architectures 19

1.5.1. Shared-Memory and Shared-Disk Architectures 20

1.5.2. Shared-Nothing Architecture 22

1.5.3. Shared-Something Architecture 23

1.5.4. Interconnection Networks 24

1.6. Grid Database Architecture 26

1.7. Structure of this Book 29

1.8. Summary 30

1.9. Bibliographical Notes 30

1.10. Exercises 31

2. Analytical Models 33

2.1. Cost Models 33

2.2. Cost Notations 34

2.2.1. Data Parameters 34

2.2.2. Systems Parameters 36

2.2.3. Query Parameters 37

2.2.4. Time Unit Costs 37

2.2.5. Communication Costs 38

2.3. Skew Model 39

2.4. Basic Operations in Parallel Databases 43

2.4.1. Disk Operations 44

2.4.2. Main Memory Operations 45

2.4.3. Data Computation and Data Distribution 45

2.5. Summary 47

2.6. Bibliographical Notes 47

2.7. Exercises 47

Part II Basic Query Parallelism

3. Parallel Search 51

3.1. Search Queries 51

3.1.1. Exact-Match Search 52

3.1.2. Range Search Query 53

3.1.3. Multiattribute Search Query 54

3.2. Data Partitioning 54

3.2.1. Basic Data Partitioning 55

3.2.2. Complex Data Partitioning 60

3.3. Search Algorithms 69

3.3.1. Serial Search Algorithms 69

3.3.2. Parallel Search Algorithms 73

3.4. Summary 74

3.5. Bibliographical Notes 75

3.6. Exercises 75

4. Parallel Sort and GroupBy 77

4.1. Sorting, Duplicate Removal, and Aggregate Queries 78

4.1.1. Sorting and Duplicate Removal 78

4.1.2. Scalar Aggregate 79

4.1.3. GroupBy 80

4.2. Serial External Sorting Method 80

4.3. Algorithms for Parallel External Sort 83

4.3.1. Parallel Merge-All Sort 83

4.3.2. Parallel Binary-Merge Sort 85

4.3.3. Parallel Redistribution Binary-Merge Sort 86

4.3.4. Parallel Redistribution Merge-All Sort 88

4.3.5. Parallel Partitioned Sort 90

4.4. Parallel Algorithms for GroupBy Queries 92

4.4.1. Traditional Methods (Merge-All and Hierarchical Merging) 92

4.4.2. Two-Phase Method 93

4.4.3. Redistribution Method 94

4.5. Cost Models for Parallel Sort 96

4.5.1. Cost Models for Serial External Merge-Sort 96

4.5.2. Cost Models for Parallel Merge-All Sort 98

4.5.3. Cost Models for Parallel Binary-Merge Sort 100

4.5.4. Cost Models for Parallel Redistribution Binary-Merge Sort 101

4.5.5. Cost Models for Parallel Redistribution Merge-All Sort 102

4.5.6. Cost Models for Parallel Partitioned Sort 103

4.6. Cost Models for Parallel GroupBy 104

4.6.1. Cost Models for Parallel Two-Phase Method 104

4.6.2. Cost Models for Parallel Redistribution Method 107

4.7. Summary 109

4.8. Bibliographical Notes 110

4.9. Exercises 110

5. Parallel Join 112

5.1. Join Operations 112

5.2. Serial Join Algorithms 114

5.2.1. Nested-Loop Join Algorithm 114

5.2.2. Sort-Merge Join Algorithm 116

5.2.3. Hash-Based Join Algorithm 117

5.2.4. Comparison 120

5.3. Parallel Join Algorithms 120

5.3.1. Divide and Broadcast-Based Parallel Join Algorithms 121

5.3.2. Disjoint Partitioning-Based Parallel Join Algorithms 124

5.4. Cost Models 128

5.4.1. Cost Models for Divide and Broadcast 128

5.4.2. Cost Models for Disjoint Partitioning 129

5.4.3. Cost Models for Local Join 130

5.5. Parallel Join Optimization 132

5.5.1. Optimizing Main Memory 132

5.5.2. Load Balancing 133

5.6. Summary 134

5.7. Bibliographical Notes 135

5.8. Exercises 136

Part III Advanced Parallel Query Processing

6. Parallel GroupBy-Join 141

6.1. Groupby-Join Queries 141

6.1.1. Groupby Before Join 142

6.1.2. Groupby After Join 142

6.2. Parallel Algorithms for Groupby-Before-Join Query Processing 143

6.2.1. Early Distribution Scheme 143

6.2.2. Early GroupBy with Partitioning Scheme 145

6.2.3. Early GroupBy with Replication Scheme 146

6.3. Parallel Algorithms for Groupby-After-Join Query Processing 148

6.3.1. Join Partitioning Scheme 148

6.3.2. GroupBy Partitioning Scheme 150

6.4. Cost Model Notations 151

6.5. Cost Model for Groupby-Before-Join Query Processing 153

6.5.1. Cost Models for the Early Distribution Scheme 153

6.5.2. Cost Models for the Early GroupBy with Partitioning Scheme 156

6.5.3. Cost Models for the Early GroupBy with Replication Scheme 158

6.6. Cost Model for “Groupby-After-Join” Query Processing 159

6.6.1. Cost Models for the Join Partitioning Scheme 159

6.6.2. Cost Models for the GroupBy Partitioning Scheme 161

6.7. Summary 163

6.8. Bibliographical Notes 164

6.9. Exercises 164

7. Parallel Indexing 167

7.1. Parallel Indexing–an Internal Perspective on Parallel Indexing Structures 168

7.2. Parallel Indexing Structures 169

7.2.1. Nonreplicated Indexing (NRI) Structures 169

7.2.2. Partially Replicated Indexing (PRI) Structures 171

7.2.3. Fully Replicated Indexing (FRI) Structures 178

7.3. Index Maintenance 180

7.3.1. Maintaining a Parallel Nonreplicated Index 182

7.3.2. Maintaining a Parallel Partially Replicated Index 182

7.3.3. Maintaining a Parallel Fully Replicated Index 188

7.3.4. Complexity Degree of Index Maintenance 188

7.4. Index Storage Analysis 188

7.4.1. Storage Cost Models for Uniprocessors 189

7.4.2. Storage Cost Models for Parallel Processors 191

7.5. Parallel Processing of Search Queries using Index 192

7.5.1. Parallel One-Index Search Query Processing 192

7.5.2. Parallel Multi-Index Search Query Processing 195

7.6. Parallel Index Join Algorithms 200

7.6.1. Parallel One-Index Join 200

7.6.2. Parallel Two-Index Join 203

7.7. Comparative Analysis 207

7.7.1. Comparative Analysis of Parallel Search Index 207

7.7.2. Comparative Analysis of Parallel Index Join 213

7.8. Summary 216

7.9. Bibliographical Notes 217

7.10. Exercises 217

8. Parallel Universal Qualification—Collection Join Queries 219

8.1. Universal Quantification and Collection Join 220

8.2. Collection Types and Collection Join Queries 222

8.2.1. Collection-Equi Join Queries 222

8.2.2. Collection–Intersect Join Queries 223

8.2.3. Subcollection Join Queries 224

8.3. Parallel Algorithms for Collection Join Queries 225

8.4. Parallel Collection-Equi Join Algorithms 225

8.4.1. Disjoint Data Partitioning 226

8.4.2. Parallel Double Sort-Merge Collection-Equi Join Algorithm 227

8.4.3. Parallel Sort-Hash Collection-Equi Join Algorithm 228

8.4.4. Parallel Hash Collection-Equi Join Algorithm 232

8.5. Parallel Collection-Intersect Join Algorithms 233

8.5.1. Non-Disjoint Data Partitioning 234

8.5.2. Parallel Sort-Merge Nested-Loop Collection-Intersect Join Algorithm 244

8.5.3. Parallel Sort-Hash Collection-Intersect Join Algorithm 245

8.5.4. Parallel Hash Collection-Intersect Join Algorithm 246

8.6. Parallel Subcollection Join Algorithms 246

8.6.1. Data Partitioning 247

8.6.2. Parallel Sort-Merge Nested-Loop Subcollection Join Algorithm 248

8.6.3. Parallel Sort-Hash Subcollection Join Algorithm 249

8.6.4. Parallel Hash Subcollection Join Algorithm 251

8.7. Summary 252

8.8. Bibliographical Notes 252

8.9. Exercises 254

9. Parallel Query Scheduling and Optimization 256

9.1. Query Execution Plan 257

9.2. Subqueries Execution Scheduling Strategies 259

9.2.1. Serial Execution Among Subqueries 259

9.2.2. Parallel Execution Among Subqueries 261

9.3. Serial vs. Parallel Execution Scheduling 264

9.3.1. Nonskewed Subqueries 264

9.3.2. Skewed Subqueries 265

9.3.3. Skewed and Nonskewed Subqueries 267

9.4. Scheduling Rules 269

9.5. Cluster Query Processing Model 270

9.5.1. Overview of Dynamic Query Processing 271

9.5.2. A Cluster Query Processing Architecture 272

9.5.3. Load Information Exchange 273

9.6. Dynamic Cluster Query Optimization 275

9.6.1. Correction 276

9.6.2. Migration 280

9.6.3. Partition 281

9.7. Other Approaches to Dynamic Query Optimization 284

9.8. Summary 285

9.9. Bibliographical Notes 286

9.10. Exercises 286

Part IV Grid Databases

10. Transactions in Distributed and Grid Databases 291

10.1. Grid Database Challenges 292

10.2. Distributed Database Systems and Multidatabase Systems 293

10.2.1. Distributed Database Systems 293

10.2.2. Multidatabase Systems 297

10.3. Basic Definitions on Transaction Management 299

10.4. Acid Properties of Transactions 301

10.5. Transaction Management in Various Database Systems 303

10.5.1. Transaction Management in Centralized and Homogeneous Distributed Database Systems 303

10.5.2. Transaction Management in Heterogeneous Distributed Database Systems 305

10.6. Requirements in Grid Database Systems 307

10.7. Concurrency Control Protocols 309

10.8. Atomic Commit Protocols 310

10.8.1. Homogeneous Distributed Database Systems 310

10.8.2. Heterogeneous Distributed Database Systems 313

10.9. Replica Synchronization Protocols 314

10.9.1. Network Partitioning 315

10.9.2. Replica Synchronization Protocols 316

10.10. Summary 318

10.11. Bibliographical Notes 318

10.12. Exercises 319

11. Grid Concurrency Control 321

11.1. A Grid Database Environment 321

11.2. An Example 322

11.3. Grid Concurrency Control 324

11.3.1. Basic Functions Required by GCC 324

11.3.2. Grid Serializability Theorem 325

11.3.3. Grid Concurrency Control Protocol 329

11.3.4. Revisiting the Earlier Example 333

11.3.5. Comparison with Traditional Concurrency Control Protocols 334

11.4. Correctness of GCC Protocol 336

11.5. Features of GCC Protocol 338

11.6. Summary 339

11.7. Bibliographical Notes 339

11.8. Exercises 339

12. Grid Transaction Atomicity and Durability 341

12.1. Motivation 342

12.2. Grid Atomic Commit Protocol (Grid-ACP) 343

12.2.1. State Diagram of Grid-ACP 343

12.2.2. Grid-ACP Algorithm 344

12.2.3. Early-Abort Grid-ACP 346

12.2.4. Discussion 348

12.2.5. Message and Time Complexity Comparison Analysis 349

12.2.6. Correctness of Grid-ACP 350

12.3. Handling Failure of Sites with Grid-ACP 351

12.3.1. Model for Storing Log Files at the Originator and Participating Sites 351

12.3.2. Logs Required at the Originator Site 352

12.3.3. Logs Required at the Participant Site 353

12.3.4. Failure Recovery Algorithm for Grid-ACP 353

12.3.5. Comparison of Recovery Protocols 359

12.3.6. Correctness of Recovery Algorithm 361

12.4. Summary 365

12.5. Bibliographical Notes 366

12.6. Exercises 366

13. Replica Management in Grids 367

13.1. Motivation 367

13.2. Replica Architecture 368

13.2.1. High-Level Replica Management Architecture 368

13.2.2. Some Problems 369

13.3. Grid Replica Access Protocol (GRAP) 371

13.3.1. Read Transaction Operation for GRAP 371

13.3.2. Write Transaction Operation for GRAP 372

13.3.3. Revisiting the Example Problem 375

13.3.4. Correctness of GRAP 377

13.4. Handling Multiple Partitioning 378

13.4.1. Contingency GRAP 378

13.4.2. Comparison of Replica Management Protocols 381

13.4.3. Correctness of Contingency GRAP 383

13.5. Summary 384

13.6. Bibliographical Notes 385

13.7. Exercises 385

14. Grid Atomic Commitment in Replicated Data 387

14.1. Motivation 388

14.1.1. Architectural Reasons 388

14.1.2. Motivating Example 388

14.2. Modified Grid Atomic Commitment Protocol 390

14.2.1. Modified Grid-ACP 390

14.2.2. Correctness of Modified Grid-ACP 393

14.3. Transaction Properties in Replicated Environment 395

14.4. Summary 397

14.5. Bibliographical Notes 397

14.6. Exercises 398

Part V Other Data-Intensive Applications

15. Parallel Online Analytic Processing (OLAP) and Business Intelligence 401

15.1. Parallel Multidimensional Analysis 402

15.2. Parallelization of ROLLUP Queries 405

15.2.1. Analysis of Basic Single ROLLUP Queries 405

15.2.2. Analysis of Multiple ROLLUP Queries 409

15.2.3. Analysis of Partial ROLLUP Queries 411

15.2.4. Parallelization Without Using ROLLUP 412

15.3. Parallelization of CUBE Queries 412

15.3.1. Analysis of Basic CUBE Queries 413

15.3.2. Analysis of Partial CUBE Queries 416

15.3.3. Parallelization Without Using CUBE 417

15.4. Parallelization of Top-N and Ranking Queries 418

15.5. Parallelization of Cume_Dist Queries 419

15.6. Parallelization of NTILE and Histogram Queries 420

15.7. Parallelization of Moving Average and Windowing Queries 422

15.8. Summary 424

15.9. Bibliographical Notes 424

15.10. Exercises 425

16. Parallel Data Mining—Association Rules and Sequential Patterns 427

16.1. From Databases To Data Warehousing To Data Mining: A Journey 428

16.2. Data Mining: A Brief Overview 431

16.2.1. Data Mining Tasks 431

16.2.2. Querying vs. Mining 433

16.2.3. Parallelism in Data Mining 436

16.3. Parallel Association Rules 440

16.3.1. Association Rules: Concepts 441

16.3.2. Association Rules: Processes 444

16.3.3. Association Rules: Parallel Processing 448

16.4. Parallel Sequential Patterns 450

16.4.1. Sequential Patterns: Concepts 452

16.4.2. Sequential Patterns: Processes 456

16.4.3. Sequential Patterns: Parallel Processing 459

16.5. Summary 461

16.6. Bibliographical Notes 461

16.7. Exercises 462

17. Parallel Clustering and Classification 464

17.1. Clustering and Classification 464

17.1.1. Clustering 464

17.1.2. Classification 465

17.2. Parallel Clustering 467

17.2.1. Clustering: Concepts 467

17.2.2. k-Means Algorithm 468

17.2.3. Parallel k-Means Clustering 471

17.3. Parallel Classification 477

17.3.1. Decision Tree Classification: Structures 477

17.3.2. Decision Tree Classification: Processes 480

17.3.3. Decision Tree Classification: Parallel Processing 488

17.4. Summary 495

17.5. Bibliographical Notes 498

17.6. Exercises 498

Permissions 501

List of Conferences and Journals 507

Bibliography 511

Index 541

Reihe/Serie	The Wiley Series on Parallel and Distributed Computing
Zusatzinfo	Drawings: 143 B&W, 0 Color
Verlagsort	New York
Sprache	englisch
Maße	158 x 239 mm
Gewicht	907 g
Themenwelt	Mathematik / Informatik ► Informatik ► Theorie / Studium
Themenwelt	Informatik ► Weitere Themen ► Hardware
ISBN-10	0-470-10762-6 / 0470107626
ISBN-13	978-0-470-10762-1 / 9780470107621
Zustand	Neuware