Blick ins Buch

Software Automatic Tuning (eBook)

From Concepts to State-of-the-Art Results

John Cavazos, Ken Naono, Reiji Suda, Keita Teranishi (Herausgeber)

eBook Download: PDF

2010 | 2010
XIV, 377 Seiten
Springer New York (Verlag)
978-1-4419-6935-4 (ISBN)

Lese- und Medienproben

Ebook-Leseprobe (PDF)

Automatic Performance Tuning is a new software paradigm which enables software to be high performance in any computing environment. Its methodologies have been developed over the past decade, and it is now rapidly growing in terms of its scope and applicability, as well as in its scientific knowledge and technological methods. Software developers and researchers in the area of scientific and technical computing, high performance database systems, optimized compilers, high performance systems software, and low-power computing will find this book to be an invaluable reference to this powerful new paradigm.

Software Automatic Tuning 3
Preface 5
Contents 7
Contributors 11
Part I Introduction 15
Chapter1 Software Automatic Tuning: Concepts and State-of-the-Art Results 16
1.1 Software Automatic Tuning: General Concepts 17
1.2 Software Automatic Tuning: State-of-the-Art 22
Part II Achievements in Scientific Computing 29
Chapter2 ATLAS Version 3.9: Overview and Status* 30
2.1 Introduction 30
2.1.1 Basic AEOS Requirements 32
2.1.2 Methods of Software Adaptation 33
2.2 ATLAS Overview 34
2.2.1 Explicit LAPACK Support In ATLAS 35
2.2.2 Dense Level 3 BLAS Support in ATLAS 36
2.2.2.1 Using Assembly in ATLAS 38
2.2.3 Dense Level 2 BLAS Support in ATLAS 39
2.2.4 Dense Level 1 BLAS Support in ATLAS 39
2.3 Status 40
References 40
Chapter3 Autotuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS 44
3.1 Introduction 44
3.2 Background 45
3.2.1 BLAS 46
3.2.2 DL-BLAS 46
3.2.3 Motivation of the present Research 48
3.3 Proposed Methods 49
3.3.1 Parallel Efficiency Analysis 49
3.3.2 Diagonal Search 51
3.3.3 Reductive Search 51
3.3.4 Parameter Selection 53
3.4 Experiments 53
3.4.1 Environments 53
3.4.2 Performance of Diagonal Search 54
3.4.3 Analysis and Discussion of Diagonal Search 55
3.4.4 Performance Evaluation Tests 56
3.5 Conclusion 57
References 59
Chapter4 Automatic Tuning for Parallel FFTs 60
4.1 Introduction 60
4.2 Parallel One-Dimensional FFT 61
4.2.1 A Block Nine-Step FFT Algorithm 61
4.2.2 A Parallel One-Dimensional FFT Algorithm Based on the Block Nine-Step FFT 63
4.3 Parallel Multi-Dimensional FFT 65
4.3.1 Parallel Two-Dimensional FFT 65
4.3.2 Parallel Three-Dimensional FFT 66
4.4 Automatic Tuning of Parallel FFTs 68
4.4.1 Automatic Tuning of All-to-All Communication 68
4.4.2 Selection of the Radices 69
4.4.3 Selection of the Block Size 69
4.5 Performance Results 70
4.5.1 Performance of All-to-All Communication 72
4.5.2 Performance of One-Dimensional FFT 72
4.5.3 Performance of Multi-Dimensional FFTs 75
4.6 Conclusion 76
References 78
Chapter5 Dynamic Programming Approaches to Optimizing the Blocking Strategy for Basic Matrix Decompositions 79
5.1 Introduction 79
5.2 Basics of the Householder QR Algorithm 81
5.2.1 The Householder QR Algorithm 81
5.2.2 Blocking Strategies for the Householder QR Algorithm 81
5.2.2.1 Fixed-size Blocking 82
5.2.2.2 Recursive Blocking 82
5.2.3 Another Blocking Strategy: The TSQR Algorithm 83
5.3 Dynamic Programming Approaches to Optimizing the Blocking Strategy 84
5.3.1 Limitations of Conventional Blocking Strategies 84
5.3.2 Approaches for the Sequential Case 85
5.3.2.1 Variable-size Blocking 85
5.3.2.2 Generalized Recursive Blocking 87
5.3.3 Approaches for the Parallel Case 89
5.3.3.1 Variable-size Blocking for Distributed Memory Machines 89
5.3.3.2 Combination of Variable-size Blocking and TSQR 89
5.4 Future Directions 92
5.4.1 Use of Task Level Parallelism 92
5.4.2 Extension to Heterogeneous Architectures 92
5.4.3 Online Refinement of the Performance Models and the Blocking Strategy 92
5.4.4 Application to Other Linear Algebra Algorithms 93
5.5 Conclusion 93
References 93
Chapter6 Automatic Tuning of the Division Number in the Multiple Division Divide-and-Conquer for Real Symmetric Eigenproblem 96
6.1 Introduction 96
6.2 DCk Algorithm and Deflation 97
6.3 Deflation Occurrence Rate (DOR) 99
6.4 Proposal of Algorithm to Estimate the Optimal Division Number 102
6.5 Numerical Experiment 103
6.6 Summary 109
References 109
Chapter7 Automatically Tuned Mixed-Precision Conjugate Gradient Solver 111
7.1 Introduction 111
7.2 Iterative Refinement 113
7.3 Choosing the Target Residual Reduction for the Inner Solver 115
7.4 Stopping the Iterative Refinement 119
7.4.1 Stopping Based on Threshold 119
7.4.2 Stopping Based on Condition Number Estimation 121
7.4.3 Combined Stopping Method 122
7.5 Condition Number Estimation 123
7.6 Conclusions 126
References 127
Chapter8 Automatically Tuned Sparse Eigensolvers 128
8.1 Introduction 128
8.2 Background: AT Research Trends 129
8.2.1 Definition of AT 129
8.2.2 Six Studies on AT 130
8.2.3 AT Research Trends 131
8.3 Proposal of AT-Restarted-Lanczos 133
8.3.1 Restarted-Lanczos and Its Problem 133
8.3.2 Preliminary Experiment 133
8.3.3 Proposal of the Run-Time Automatic Tuning Algorithm 135
8.3.4 Evaluations 137
8.4 Conclusions 139
References 139
Chapter9 Systematic Performance Evaluation of Linear Solvers Using Quality Control Techniques 141
9.1 Introduction 141
9.2 Systematic Performance Evaluation and Systematic Analyses for Numerical Computation Algorithms 142
9.2.1 Systematic Performance Evaluation of Solution Algorithms 142
9.2.2 Construction of the Performance Information DB on the Solution Algorithms 143
9.3 Quality Control in the Solution Process 145
9.3.1 Quality Control Techniques 145
9.3.2 Systematic Performance Evaluation Using Survey Charts 148
9.4 Analysis of Causes of Poor Solution Quality 152
9.4.1 Preconditioning of Solution Algorithm for Linear Equations 154
9.4.2 Residual Vector and Convergence Criterion for Each Preconditioning System 155
9.5 Concluding Remarks 156
References 158
Chapter 10 Application of Alternating Decision Trees in Selecting Sparse Linear Solvers* 159
10.1 Introduction 159
10.2 Machine Learning Methods 161
10.2.1 Adaboost 162
10.2.2 Alternating Decision Trees 163
10.3 Solver Selection as a Classification Problem 164
10.3.1 Feature Selection and Computation 166
10.4 Applying Machine Learning 167
10.5 Experimental Results 168
10.5.1 Experiment 1: Classification on Linear Systems Generated from PDE-based Simulation Code 169
10.5.2 Experiment 2: Classification on Only Coefficient Matrices 171
10.6 Conclusions and Future Plans 175
References 177
Chapter11 Toward Automatic Performance Tuning for Numerical Simulations in the SILC Matrix Computation Framework 180
11.1 Introduction 180
11.2 The SILC Matrix Computation Framework 182
11.3 Performance Modeling 184
11.4 Experiments 185
11.4.1 Cloth Simulation 186
11.4.2 Fluid Simulation 189
11.4.3 Simple Initial Value Problem 190
11.5 Automatic Performance Tuning 193
11.5.1 Autotuning Mechanism for Numerical Simulations in SILC 193
11.5.2 Related Work and Discussions 195
11.6 Concluding Remarks 195
References 196
Chapter12 Exploring Tuning Strategies for Quantum Chemistry Computations 198
12.1 Introduction 198
12.2 Related Works 200
12.3 Methodology 200
12.4 Performance Results and Observations 203
12.5 Tuning Strategy 208
12.6 Conclusions and Future Work 211
References 211
Chapter13 Automatic Tuning of CUDA Execution Parameters for Stencil Processing 214
13.1 Introduction 214
13.2 Related Work 216
13.3 Performance Tuning for CUDA 217
13.4 Automatic Performance Tuning of CUDA Execution Parameters 219
13.4.1 Parameter Exploration Space 220
13.4.2 Auto-Tuning Mechanism 223
13.5 Performance Evaluation 225
13.5.1 Automatic Tuning on Himeno Benchmark 225
13.5.2 Automatic Tuning on LU Decomposition 229
13.6 Conclusions 232
References 232
Chapter14 Static Task Cluster Size Determination in Homogeneous Distributed Systems 234
14.1 Introduction 234
14.2 Problem Definition and Assumptions 236
14.2.1 Assumed Model 236
14.2.2 Problems in Conventional Cluster Merging 237
14.2.3 Requirements for Reducing Both the Schedule Length and the Number of Processors 238
14.3 Definition of an Indicative Value for Schedule Length 239
14.3.1 Abstract of Worst Schedule Length 239
14.3.2 WSL Definition 239
14.4 Derivation of Cluster Size for Minimizing WSL 240
14.4.1 An Upper Bound of WSL 240
14.4.2 Derivation of dopt 246
14.4.3 Relationship Between WSLs and Schedule Length 247
14.4.4 Requirements for Minimizing WSL in a Task Clustering 248
14.5 Task Clustering 248
14.5.1 Overview of the Algorithm 248
14.5.2 Cluster Selection Policies 249
14.6 Experimental Evaluation 251
14.6.1 Comparison Targets 251
14.6.2 Simulation Setup 251
14.6.3 Comparison of WSLS and Schedule Length in the Fixed Number of Processors 252
14.6.4 Processor Utilization 253
14.6.5 Comparison of WSLS and Schedule Length of Real Application DAGs 254
14.6.6 Discussion 255
14.7 Conclusion and Future Works 256
References 256
Part III Evolution to a General Paradigm 258
Chapter15 Algorithmic Parameter Optimization of the DFO Method with the OPAL Framework 259
15.1 Introduction 259
15.2 Optimization of Algorithmic Parameters 262
15.2.1 Black Box Construction 262
15.2.2 Direct Search Algorithms 264
15.3 The OPAL Package 265
15.3.1 The OPAL Structure 266
15.3.2 Usage of OPAL 266
15.4 Application to Derivative-Free Optimization 268
15.4.1 General Description of DFO 268
15.4.2 Two DFO Parameter Optimization Problems 268
15.4.3 Numerical Results 272
15.5 Discussion 276
References 277
Chapter16 A Bayesian Method of Online Automatic Tuning 279
16.1 Introduction 279
16.2 Automatic Tuning Concepts 280
16.3 Review of the Proposed Methods 281
16.3.1 Online Automatic Tuning 282
16.3.2 Tuning Based on Cost Function Modeling 283
16.3.3 Bayesian Analysis to Treat Uncertainties 284
16.3.4 Bayesian Suboptimal Sequential Experimental Design 286
16.3.5 Desirable Properties of Mathematical Methods for Automatic Tuning 288
16.3.6 Infinite Dilution 291
16.3.7 Pseudocode of the Proposed Method 292
16.4 Evaluation 293
16.4.1 Infinite Dilution with Random Sampling 293
16.4.2 Performance Evaluation 293
16.5 Conclusion 296
References 297
Chapter17 ABCLibScript: A Computer Language for Automatic Performance Tuning 298
17.1 Introduction 298
17.2 The FIBER Framework 299
17.2.1 FIBER Framework Basics 299
17.2.2 The Two Users and Auto-tuning Software Construction Assumption 299
17.2.2.1 Software Developer Phase 300
17.2.2.2 Software User (End-User) Phase 301
17.3 The ABCLibScript 301
17.4 Examples of ABCLibScript Descriptions 304
17.4.1 Software Developer API 304
17.4.2 End-User API: Before Execute-Time Auto-tuning 305
17.4.3 Other Programming Examples for Software Developers Using ABCLibScript 306
17.4.3.1 Install-Time Auto-tuning 306
17.4.3.2 Before Execute-Time Auto-tuning 308
17.4.3.3 Run-Time Auto-tuning 309
17.5 Future Direction of ABCLibScript 310
17.5.1 Evaluation of Compiler Optimization and Hardware Performance 310
17.5.2 Adapting ABCLibScript to Embedded Systems 312
17.5.3 Adapting ABCLibScript to Low-Power Optimization 314
17.6 Concluding Remarks 315
References 316
Chapter18 Automatically Tuning Task-Based Programs for Multicore Processors 317
18.1 Introduction 317
18.2 Example 318
18.2.1 Task Extensions 319
18.2.2 Execution 319
18.2.3 Scheduling for Multicore Processor 320
18.3 Implementation Synthesis 322
18.3.1 Dependence Analysis 323
18.3.2 Profiling 323
18.3.3 Candidate Implementation Synthesis 323
18.3.4 Simulation Stage 325
18.3.5 Directed-Simulated Annealing 326
18.3.5.1 Generation of Optimized Implementation 327
18.3.6 Dynamic Scheduling 328
18.4 Evaluation 328
18.4.1 Benchmarks 329
18.4.2 Accuracy of Scheduling Simulator 330
18.4.3 Optimality of Directed Simulated Annealing 331
18.4.4 Generality of Synthesized Implementation 332
18.4.5 Overhead of Bamboo 333
18.5 Related Work 334
18.6 Conclusion 335
References 335
Chapter19 Efficient Program Compilation Through Machine Learning Techniques 337
19.1 Introduction 337
19.1.1 Motivation 338
19.2 Design and Implementation 339
19.2.1 Preparing for Data Collection 339
19.2.2 Gathering Training Data 342
19.2.3 Learning Process 344
19.2.4 Deployment Process 345
19.3 Experimental Setup 346
19.4 Results and Discussion 346
19.4.1 Quality of Training Data 347
19.4.2 Classifier Evaluation 348
19.4.3 Other Benchmarks 349
19.5 Related Work 350
19.6 Conclusions and Future Work 352
19.7 Acknowledgments and Trademarks 352
References 352
Chapter20 Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology 354
20.1 Introduction 354
20.2 Compiler Technology: Autotuning and Specialization 356
20.2.1 Optimization Strategy 358
20.2.2 Specialization and Transformation Using CHiLL 359
20.2.3 Autotuning: Heuristics for Pruning the Search Space 361
20.2.4 Building the Library 362
20.3 Experiments 363
20.3.1 Experimental Environment 364
20.3.2 Generating the Library 364
20.3.3 Evaluating the Pruning Heuristics 365
20.3.4 Performance Results 367
20.4 Related Work 369
20.5 Conclusion 370
References 370
Index 372

Erscheint lt. Verlag	9.9.2010
Zusatzinfo	XIV, 377 p.
Verlagsort	New York
Sprache	englisch
Themenwelt	Informatik ► Weitere Themen ► CAD-Programme
	Technik ► Elektrotechnik / Energietechnik
	Technik ► Nachrichtentechnik
Schlagworte	Adaptive Algorithms • algorithms • automatically-tuned code generation • autonomic computing and • autonomic computing and context-aware computing • Computer-Aided Design (CAD) • Computing with GPGPU and accelators • Debugging • Hybrid Systems • low-power computing • Multicore Processors • Parallel and Distributed Computing • Performance Debugging • performance tuning • Simulation
ISBN-10	1-4419-6935-7 / 1441969357
ISBN-13	978-1-4419-6935-4 / 9781441969354

Haben Sie eine Frage zum Produkt?

PDF (Wasserzeichen)
Größe: 10,2 MB

DRM: Digitales Wasserzeichen
Dieses eBook enthält ein digitales Wasserzeichen und ist damit für Sie personalisiert. Bei einer missbräuchlichen Weitergabe des eBooks an Dritte ist eine Rückverfolgung an die Quelle möglich.

Dateiformat: PDF (Portable Document Format)
Mit einem festen Seitenlayout eignet sich die PDF besonders für Fachbücher mit Spalten, Tabellen und Abbildungen. Eine PDF kann auf fast allen Geräten angezeigt werden, ist aber für kleine Displays (Smartphone, eReader) nur eingeschränkt geeignet.

Systemvoraussetzungen:
PC/Mac: Mit einem PC oder Mac können Sie dieses eBook lesen. Sie benötigen dafür einen PDF-Viewer - z.B. den Adobe Reader oder Adobe Digital Editions.
eReader: Dieses eBook kann mit (fast) allen eBook-Readern gelesen werden. Mit dem amazon-Kindle ist es aber nicht kompatibel.
Smartphone/Tablet: Egal ob Apple oder Android, dieses eBook können Sie lesen. Sie benötigen dafür einen PDF-Viewer - z.B. die kostenlose Adobe Digital Editions-App.

Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.

Print-Ausgabe

Buch | Softcover

160,49 €