Rob Farber has served as a scientist in Europe at the Irish Centre for High-End Computing as well as at U.S. national labs in Los Alamos, Berkeley, and the Pacific Northwest. He has also been on the external faculty of the Santa Fe Institute, a consultant to Fortune 100 companies, and co-founder of two computational startups that achieved liquidity events. He is the author of "CUDA Application Design and Development" as well as numerous articles and tutorials that have appeared in Dr. Dobb's Journal, Scientific Computing, The Code Project, and elsewhere.
As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with no previous parallel experience, and focuses on issues of immediate importance to working software developers: achieving high performance, maintaining competitiveness, analyzing CUDA benefits versus costs, and determining application lifespan. The book then details the thinking behind CUDA and teaches how to create, analyze, and debug CUDA applications. Throughout, the focus is on software engineering issues: how to use CUDA in the context of existing application code, with existing compilers, languages, software tools, and industry-standard API libraries. Using an approach refined in a series of well-received articles in Dr. Dobb's Journal, author Rob Farber takes the reader step by step from fundamentals to implementation, moving from language theory to practical coding.

- Includes multiple examples building from simple to more complex applications in four key areas: machine learning, visualization, vision recognition, and mobile computing
- Addresses the foundational issues of CUDA development: multi-threaded programming and the different memory hierarchy
- Includes teaching chapters designed to give a full understanding of CUDA tools, techniques, and structure
- Presents CUDA techniques in the context of the hardware they are implemented on, as well as other styles of programming that will help readers bridge into the new material
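For orientation, a minimal sketch of the kind of "first program" the opening chapter works up to might look like the following. This listing is illustrative only and is not taken from the book; the kernel name, array size, and launch configuration are assumptions.

```cuda
// Minimal illustrative sketch: fill an array on the GPU, copy it back,
// and verify the result on the host. Error checking is omitted for brevity.
#include <cstdio>
#include <vector>

__global__ void fillKernel(int *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (tid < n) data[tid] = tid;                      // each thread writes one element
}

int main()
{
    const int n = 1024;
    int *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(int));     // allocate device memory

    fillKernel<<<(n + 255) / 256, 256>>>(d_data, n);   // execution configuration: blocks x threads

    std::vector<int> h_data(n);
    cudaMemcpy(h_data.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    for (int i = 0; i < n; ++i)
        if (h_data[i] != i) { printf("Error at %d\n", i); return 1; }
    printf("Success\n");
    return 0;
}
```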
Front Cover 1
CUDA Application Design and Development 4
Copyright 5
Dedication 6
Table of Contents 8
Foreword 12
Preface 14
1 First Programs and How to Think in CUDA 20
Source Code and Wiki 21
Distinguishing CUDA from Conventional Programming with a Simple Example 21
Choosing a CUDA API 24
Some Basic CUDA Concepts 27
Understanding Our First Runtime Kernel 30
Three Rules of GPGPU Programming 32
Rule 1: Get the Data on the GPU and Keep It There 32
Rule 2: Give the GPGPU Enough Work to Do 33
Rule 3: Focus on Data Reuse within the GPGPU to Avoid Memory Bandwidth Limitations 33
Big-O Considerations and Data Transfers 34
CUDA and Amdahl’s Law 36
Data and Task Parallelism 37
Hybrid Execution: Using Both CPU and GPU Resources 38
Regression Testing and Accuracy 40
Silent Errors 41
Introduction to Debugging 42
UNIX Debugging 43
NVIDIA's cuda-gdb Debugger 43
The CUDA Memory Checker 45
Use cuda-gdb with the UNIX ddd Interface 46
Windows Debugging with Parallel Nsight 48
Summary 49
2 CUDA for Machine Learning and Optimization 52
Modeling and Simulation 53
Fitting Parameterized Models 54
Nelder-Mead Method 55
Levenberg-Marquardt Method 55
Algorithmic Speedups 56
Machine Learning and Neural Networks 57
XOR: An Important Nonlinear Machine-Learning Problem 58
An Example Objective Function 60
A Complete Functor for Multiple GPU Devices and the Host Processors 61
Brief Discussion of a Complete Nelder-Mead Optimization Code 63
Performance Results on XOR 72
Performance Discussion 72
Summary 75
The C++ Nelder-Mead Template 76
3 The CUDA Tool Suite: Profiling a PCA/NLPCA Functor 82
PCA and NLPCA 83
Autoencoders 84
An Example Functor for PCA Analysis 85
An Example Functor for NLPCA Analysis 87
Obtaining Basic Profile Information 90
Gprof: A Common UNIX Profiler 92
The NVIDIA Visual Profiler: Computeprof 93
Parallel Nsight for Microsoft Visual Studio 96
The Nsight Timeline Analysis 96
The NVTX Tracing Library 98
Scaling Behavior of the CUDA API 99
Tuning and Analysis Utilities (TAU) 101
Summary 102
4 The CUDA Execution Model 104
GPU Architecture Overview 105
Thread Scheduling: Orchestrating Performance and Parallelism via the Execution Configuration 106
Relevant computeprof Values for a Warp 109
Warp Divergence 109
Guidelines for Warp Divergence 110
Relevant computeprof Values for Warp Divergence 111
Warp Scheduling and TLP 111
Relevant computeprof Values for Occupancy 113
ILP: Higher Performance at Lower Occupancy 113
ILP Hides Arithmetic Latency 114
ILP Hides Data Latency 117
ILP in the Future 117
Relevant computeprof Values for Instruction Rates 119
Little’s Law 119
CUDA Tools to Identify Limiting Factors 121
The nvcc Compiler 122
Launch Bounds 123
The Disassembler 124
PTX Kernels 125
GPU Emulators 126
Summary 127
5 CUDA Memory 128
The CUDA Memory Hierarchy 128
GPU Memory 130
L2 Cache 131
Relevant computeprof Values for the L2 Cache 132
L1 Cache 133
Relevant computeprof Values for the L1 Cache 134
CUDA Memory Types 135
Registers 135
Local memory 135
Relevant computeprof Values for Local Memory Cache 136
Shared Memory 136
Relevant computeprof Values for Shared Memory 139
Constant Memory 139
Texture Memory 140
Relevant computeprof Values for Texture Memory 143
Global Memory 143
Common Coalescing Use Cases 145
Allocation of Global Memory 146
Limiting Factors in the Design of Global Memory 147
Relevant computeprof Values for Global Memory 149
Summary 150
6 Efficiently Using GPU Memory 152
Reduction 153
The Reduction Template 153
A Test Program for functionReduce.h 159
Results 163
Utilizing Irregular Data Structures 165
Sparse Matrices and the CUSP Library 168
Graph Algorithms 170
SoA, AoS, and Other Structures 173
Tiles and Stencils 173
Summary 174
7 Techniques to Increase Parallelism 176
CUDA Contexts Extend Parallelism 177
Streams and Contexts 178
Multiple GPUs 178
Explicit Synchronization 179
Implicit Synchronization 180
The Unified Virtual Address Space 181
A Simple Example 181
Profiling Results 184
Out-of-Order Execution with Multiple Streams 185
Tip for Concurrent Kernel Execution on the Same GPU 188
Atomic Operations for Implicitly Concurrent Kernels 188
Tying Data to Computation 191
Manually Partitioning Data 191
Mapped Memory 192
How Mapped Memory Works 194
Summary 195
8 CUDA for All GPU and CPU Applications 198
Pathways from CUDA to Multiple Hardware Backends 199
The PGI CUDA x86 Compiler 200
The PGI CUDA x86 Compiler 202
An x86 core as an SM 204
The NVIDIA NVCC Compiler 205
Ocelot 206
Swan 207
MCUDA 207
Accessing CUDA from Other Languages 207
SWIG 208
Copperhead 208
EXCEL 209
MATLAB 209
Libraries 210
CUBLAS 210
CUFFT 210
MAGMA 221
phiGEMM Library 222
CURAND 222
Summary 224
9 Mixing CUDA and Rendering 226
OpenGL 227
GLUT 227
Mapping GPU Memory with OpenGL 228
Using Primitive Restart for 3D Performance 229
Introduction to the Files in the Framework 232
The Demo and Perlin Example Kernels 232
The Demo Kernel 233
The Demo Kernel to Generate a Colored Sinusoidal Surface 233
Perlin Noise 236
Using the Perlin Noise Kernel to Generate Artificial Terrain 238
The simpleGLmain.cpp File 243
The simpleVBO.cpp File 247
The callbacksVBO.cpp File 252
Summary 257
10 CUDA in a Cloud and Cluster Environments 260
The Message Passing Interface (MPI) 261
The MPI Programming Model 261
The MPI Communicator 262
MPI Rank 262
Master-Slave 264
Point-to-Point Basics 264
How MPI Communicates 265
Bandwidth 267
Balance Ratios 268
Considerations for Large MPI Runs 271
Scalability of the Initial Data Load 271
Using MPI to Perform a Calculation 272
Check Scalability 273
Cloud Computing 274
A Code Example 275
Data Generation 275
Summary 283
11 CUDA for Real Problems 284
Working with High-Dimensional Data 285
PCA/NLPCA 286
Multidimensional Scaling 286
K-Means Clustering 287
Expectation-Maximization 287
Support Vector Machines 288
Bayesian Networks 288
Mutual information 289
Force-Directed Graphs 290
Monte Carlo Methods 291
Molecular Modeling 292
Quantum Chemistry 292
Interactive Workflows 293
A Plethora of Projects 293
Summary 294
12 Application Focus on Live Streaming Video 296
Topics in Machine Vision 297
3D Effects 298
Segmentation of Flesh-colored Regions 298
Edge Detection 299
FFmpeg 300
TCP Server 302
Live Stream Application 306
kernelWave(): An Animated Kernel 306
kernelFlat(): Render the Image on a Flat Surface 307
kernelSkin(): Keep Only Flesh-colored Regions 307
kernelSobel(): A Simple Sobel Edge Detection Filter 308
The launch_kernel() Method 309
The simpleVBO.cpp File 310
The callbacksVBO.cpp File 310
Building and Running the Code 314
The Future 314
Machine Learning 314
The Connectome 315
Summary 316
Listing for simpleVBO.cpp 316
Works Cited 322
Index 330
Published (per publisher) | 8 October 2011
---|---
Language | English
Subject area | Mathematics / Computer Science ► Computer Science ► Networks
 | Mathematics / Computer Science ► Computer Science ► Programming Languages / Tools
 | Mathematics / Computer Science ► Computer Science ► Software Development
 | Mathematics / Computer Science ► Computer Science ► Theory / Studies
 | Computer Science ► Further Topics ► Hardware
ISBN-10 | 0-12-388432-2 / 0123884322
ISBN-13 | 978-0-12-388432-9 / 9780123884329
Size: 6.6 MB
Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to prevent misuse of the eBook. The eBook is authorized to your personal Adobe ID at download and can then only be read on devices registered to that Adobe ID.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly well suited to technical books with columns, tables, and figures. A PDF can be displayed on almost all devices but is of only limited use on small displays (smartphone, eReader).
System requirements: readable on PC/Mac and on (almost) all eBook readers, but not compatible with the Amazon Kindle; readable on Apple and Android smartphones and tablets.
Buying eBooks from abroad
For tax law reasons we can only sell eBooks within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.
Size: 3.0 MB
Copy protection: Adobe DRM (as described above)
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and general nonfiction. The text reflows dynamically to the display and font size, which also makes EPUB a good fit for mobile reading devices.