Handbook of Floating-Point Arithmetic (eBook)
XXIV, 572 pages
Birkhäuser Boston (publisher)
978-0-8176-4705-6 (ISBN)
Floating-point arithmetic is the most widely used way of implementing real-number arithmetic on modern computers. However, making such an arithmetic reliable and portable, yet fast, is a very difficult task. As a result, floating-point arithmetic is far from being exploited to its full potential. This handbook aims to provide a complete overview of modern floating-point arithmetic. So that the techniques presented can be put directly into practice in actual coding or design, they are illustrated, whenever possible, by a corresponding program.
The handbook is designed for programmers of numerical applications, compiler designers, programmers of floating-point algorithms, designers of arithmetic operators, and more generally, students and researchers in numerical analysis who wish to better understand a tool used in their daily work and research.
Preface 14
List of Figures 16
List of Tables 19
I Introduction, Basic Definitions, and Standards 22
1 Introduction 23
1.1 Some History 23
1.2 Desirable Properties 26
1.3 Some Strange Behaviors 27
1.3.1 Some famous bugs 27
1.3.2 Difficult problems 28
2 Definitions and Basic Notions 33
2.1 Floating-Point Numbers 33
2.2 Rounding 40
2.2.1 Rounding modes 40
2.2.2 Useful properties 42
2.2.3 Relative error due to rounding 43
2.3 Exceptions 45
2.4 Lost or Preserved Properties of the Arithmetic on the Real Numbers 47
2.5 Note on the Choice of the Radix 49
2.5.1 Representation errors 49
2.5.2 A case for radix 10 50
2.6 Tools for Manipulating Floating-Point Errors 52
2.6.1 The ulp function 52
2.6.2 Errors in ulps and relative errors 57
2.6.3 An example: iterated products 57
2.6.4 Unit roundoff 59
2.7 Note on Radix Conversion 60
2.7.1 Conditions on the formats 60
2.7.2 Conversion algorithms 63
2.8 The Fused Multiply-Add (FMA) Instruction 71
2.9 Interval Arithmetic 71
2.9.1 Intervals with floating-point bounds 72
2.9.2 Optimized rounding 72
3 Floating-Point Formats and Environment 74
3.1 The IEEE 754-1985 Standard 75
3.1.1 Formats specified by IEEE 754-1985 75
3.1.2 Little-endian, big-endian 79
3.1.3 Rounding modes specified by IEEE 754-1985 80
3.1.4 Operations specified by IEEE 754-1985 81
3.1.5 Exceptions specified by IEEE 754-1985 85
3.1.6 Special values 88
3.2 The IEEE 854-1987 Standard 89
3.2.1 Constraints internal to a format 89
3.2.2 Various formats and the constraints between them 90
3.2.3 Conversions between floating-point numbers and decimal strings 91
3.2.4 Rounding 92
3.2.5 Operations 92
3.2.6 Comparisons 93
3.2.7 Exceptions 93
3.3 The Need for a Revision 93
3.3.1 A typical problem: "double rounding" 94
3.3.2 Various ambiguities 96
3.4 The New IEEE 754-2008 Standard 98
3.4.1 Formats specified by the revised standard 99
3.4.2 Binary interchange format encodings 100
3.4.3 Decimal interchange format encodings 101
3.4.4 Larger formats 111
3.4.5 Extended and extendable precisions 111
3.4.6 Attributes 112
3.4.7 Operations specified by the standard 116
3.4.8 Comparisons 118
3.4.9 Conversions 118
3.4.10 Default exception handling 119
3.4.11 Recommended transcendental functions 122
3.5 Floating-Point Hardware in Current Processors 123
3.5.1 The common hardware denominator 123
3.5.2 Fused multiply-add 123
3.5.3 Extended precision 123
3.5.4 Rounding and precision control 124
3.5.5 SIMD instructions 125
3.5.6 Floating-point on x86 processors: SSE2 versus x87 125
3.5.7 Decimal arithmetic 126
3.6 Floating-Point Hardware in Recent Graphics Processing Units 127
3.7 Relations with Programming Languages 128
3.7.1 The Language Independent Arithmetic (LIA) standard 128
3.7.2 Programming languages 129
3.8 Checking the Environment 129
3.8.1 MACHAR 130
3.8.2 Paranoia 130
3.8.3 UCBTest 134
3.8.4 TestFloat 135
3.8.5 IeeeCC754 135
3.8.6 Miscellaneous 135
II Cleverly Using Floating-Point Arithmetic 136
4 Basic Properties and Algorithms 137
4.1 Testing the Computational Environment 137
4.1.1 Computing the radix 137
4.1.2 Computing the precision 139
4.2 Exact Operations 140
4.2.1 Exact addition 140
4.2.2 Exact multiplications and divisions 142
4.3 Accurate Computations of Sums of Two Numbers 143
4.3.1 The Fast2Sum algorithm 144
4.3.2 The 2Sum algorithm 147
4.3.3 If we do not use rounding to nearest 149
4.4 Computation of Products 150
4.4.1 Veltkamp splitting 150
4.4.2 Dekker's multiplication algorithm 153
4.5 Complex numbers 157
4.5.1 Various error bounds 158
4.5.2 Error bound for complex multiplication 159
4.5.3 Complex division 162
4.5.4 Complex square root 167
5 The Fused Multiply-Add Instruction 169
5.1 The 2MultFMA Algorithm 170
5.2 Computation of Residuals of Division and Square Root 171
5.3 Newton–Raphson-Based Division with an FMA 173
5.3.1 Variants of the Newton–Raphson iteration 173
5.3.2 Using the Newton–Raphson iteration for correctly rounded division 178
5.4 Newton–Raphson-Based Square Root with an FMA 185
5.4.1 The basic iterations 185
5.4.2 Using the Newton–Raphson iteration for correctly rounded square roots 186
5.5 Multiplication by an Arbitrary-Precision Constant 189
5.5.1 Checking for a given constant C if Algorithm 5.2 will always work 190
5.6 Evaluation of the Error of an FMA 193
5.7 Evaluation of Integer Powers 195
6 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values 198
6.1 Preliminaries 199
6.1.1 Floating-point arithmetic models 200
6.1.2 Notation for error analysis and classical error estimates 201
6.1.3 Properties for deriving validated running error bounds 204
6.2 Computing Validated Running Error Bounds 205
6.3 Computing Sums More Accurately 207
6.3.1 Reordering the operands, and a bit more 207
6.3.2 Compensated sums 209
6.3.3 Implementing a "long accumulator" 216
6.3.4 On the sum of three floating-point numbers 216
6.4 Compensated Dot Products 218
6.5 Compensated Polynomial Evaluation 220
7 Languages and Compilers 222
7.1 A Play with Many Actors 222
7.1.1 Floating-point evaluation in programming languages 223
7.1.2 Processors, compilers, and operating systems 225
7.1.3 In the hands of the programmer 226
7.2 Floating Point in the C Language 226
7.2.1 Standard C99 headers and IEEE 754-1985 support 226
7.2.2 Types 227
7.2.3 Expression evaluation 230
7.2.4 Code transformations 233
7.2.5 Enabling unsafe optimizations 234
7.2.6 Summary: a few horror stories 235
7.3 Floating-Point Arithmetic in the C++ Language 237
7.3.1 Semantics 237
7.3.2 Numeric limits 238
7.3.3 Overloaded functions 239
7.4 FORTRAN Floating Point in a Nutshell 240
7.4.1 Philosophy 240
7.4.2 IEEE 754 support in FORTRAN 243
7.5 Java Floating Point in a Nutshell 244
7.5.1 Philosophy 244
7.5.2 Types and classes 245
7.5.3 Infinities, NaNs, and signed zeros 247
7.5.4 Missing features 248
7.5.5 Reproducibility 249
7.5.6 The BigDecimal package 250
7.6 Conclusion 251
III Implementing Floating-Point Operators 253
8 Algorithms for the Five Basic Operations 254
8.1 Overview of Basic Operation Implementation 254
8.2 Implementing IEEE 754-2008 Rounding 256
8.2.1 Rounding a nonzero finite value with unbounded exponent range 256
8.2.2 Overflow 258
8.2.3 Underflow and subnormal results 259
8.2.4 The inexact exception 260
8.2.5 Rounding for actual operations 260
8.3 Floating-Point Addition and Subtraction 261
8.3.1 Decimal addition 264
8.3.2 Decimal addition using binary encoding 265
8.3.3 Subnormal inputs and outputs in binary addition 266
8.4 Floating-Point Multiplication 266
8.4.1 Normal case 267
8.4.2 Handling subnormal numbers in binary multiplication 267
8.4.3 Decimal specifics 268
8.5 Floating-Point Fused Multiply-Add 269
8.5.1 Case analysis for normal inputs 269
8.5.2 Handling subnormal inputs 273
8.5.3 Handling decimal cohorts 274
8.5.4 Overview of a binary FMA implementation 274
8.6 Floating-Point Division 277
8.6.1 Overview and special cases 277
8.6.2 Computing the significand quotient 278
8.6.3 Managing subnormal numbers 279
8.6.4 The inexact exception 280
8.6.5 Decimal specifics 280
8.7 Floating-Point Square Root 280
8.7.1 Overview and special cases 280
8.7.2 Computing the significand square root 281
8.7.3 Managing subnormal numbers 282
8.7.4 The inexact exception 282
8.7.5 Decimal specifics 282
9 Hardware Implementation of Floating-Point Arithmetic 283
9.1 Introduction and Context 283
9.1.1 Processor internal formats 283
9.1.2 Hardware handling of subnormal numbers 284
9.1.3 Full-custom VLSI versus reconfigurable circuits 285
9.1.4 Hardware decimal arithmetic 286
9.1.5 Pipelining 287
9.2 The Primitives and Their Cost 288
9.2.1 Integer adders 288
9.2.2 Digit-by-integer multiplication in hardware 294
9.2.3 Using nonstandard representations of numbers 294
9.2.4 Binary integer multiplication 295
9.2.5 Decimal integer multiplication 297
9.2.6 Shifters 298
9.2.7 Leading-zero counters 298
9.2.8 Tables and table-based methods for fixed-point function approximation 300
9.3 Binary Floating-Point Addition 302
9.3.1 Overview 302
9.3.2 A first dual-path architecture 303
9.3.3 Leading-zero anticipation 305
9.3.4 Probing further on floating-point adders 309
9.4 Binary Floating-Point Multiplication 310
9.4.1 Basic architecture 310
9.4.2 FPGA implementation 310
9.4.3 VLSI implementation optimized for delay 312
9.4.4 Managing subnormals 315
9.5 Binary Fused Multiply-Add 316
9.5.1 Classic architecture 317
9.5.2 To probe further 319
9.6 Division 319
9.6.1 Digit-recurrence division 320
9.6.2 Decimal division 323
9.7 Conclusion: Beyond the FPU 323
9.7.1 Optimization in context of standard operators 324
9.7.2 Operation with a constant operand 325
9.7.3 Block floating point 327
9.7.4 Specific architectures for accumulation 327
9.7.5 Coarser-grain operators 331
9.8 Probing Further 334
10 Software Implementation of Floating-Point Arithmetic 335
10.1 Implementation Context 336
10.1.1 Standard encoding of binary floating-point data 336
10.1.2 Available integer operators 337
10.1.3 First examples 340
10.1.4 Design choices and optimizations 342
10.2 Binary Floating-Point Addition 343
10.2.1 Handling special values 344
10.2.2 Computing the sign of the result 346
10.2.3 Swapping the operands and computing the alignment shift 347
10.2.4 Getting the correctly rounded result 349
10.3 Binary Floating-Point Multiplication 355
10.3.1 Handling special values 355
10.3.2 Sign and exponent computation 357
10.3.3 Overflow detection 359
10.3.4 Getting the correctly rounded result 360
10.4 Binary Floating-Point Division 363
10.4.1 Handling special values 364
10.4.2 Sign and exponent computation 365
10.4.3 Overflow detection 368
10.4.4 Getting the correctly rounded result 369
10.5 Binary Floating-Point Square Root 375
10.5.1 Handling special values 376
10.5.2 Exponent computation 378
10.5.3 Getting the correctly rounded result 379
IV Elementary Functions 387
11 Evaluating Floating-Point Elementary Functions 388
11.1 Basic Range Reduction Algorithms 392
11.1.1 Cody and Waite's reduction algorithm 392
11.1.2 Payne and Hanek's algorithm 394
11.2 Bounding the Relative Error of Range Reduction 395
11.3 More Sophisticated Range Reduction Algorithms 397
11.3.1 An example of range reduction for the exponential function 399
11.3.2 An example of range reduction for the logarithm 400
11.4 Polynomial or Rational Approximations 401
11.4.1 L2 case 402
11.4.2 L∞, or minimax case 403
11.4.3 "Truncated" approximations 405
11.5 Evaluating Polynomials 406
11.6 Correct Rounding of Elementary Functions to binary64 407
11.6.1 The Table Maker's Dilemma and Ziv's onion peeling strategy 407
11.6.2 When the TMD is solved 408
11.6.3 Rounding test 409
11.6.4 Accurate second step 413
11.6.5 Error analysis and the accuracy/performance tradeoff 414
11.7 Computing Error Bounds 415
11.7.1 The point with efficient code 415
11.7.2 Example: a "double-double" polynomial evaluation 416
12 Solving the Table Maker's Dilemma 418
12.1 Introduction 418
12.1.1 The Table Maker's Dilemma 419
12.1.2 Brief history of the TMD 423
12.1.3 Organization of the chapter 424
12.2 Preliminary Remarks on the Table Maker's Dilemma 425
12.2.1 Statistical arguments: what can be expected in practice 425
12.2.2 In some domains, there is no need to find worst cases 429
12.2.3 Deducing the worst cases from other functions or domains 432
12.3 The Table Maker's Dilemma for Algebraic Functions 433
12.3.1 Algebraic and transcendental numbers and functions 433
12.3.2 The elementary case of quotients 435
12.3.3 Around Liouville's theorem 437
12.3.4 Generating bad rounding cases for the square root using Hensel 2-adic lifting 438
12.4 Solving the Table Maker's Dilemma for Arbitrary Functions 442
12.4.1 Lindemann's theorem: application to some transcendental functions 442
12.4.2 A theorem of Nesterenko and Waldschmidt 443
12.4.3 A first method: tabulated differences 445
12.4.4 From the TMD to the distance between a grid and a segment 447
12.4.5 Linear approximation: Lefèvre's algorithm 449
12.4.6 The SLZ algorithm 456
12.4.7 Periodic functions on large arguments 461
12.5 Some Results 462
12.5.1 Worst cases for the exponential, logarithmic, trigonometric, and hyperbolic functions 462
12.5.2 A special case: integer powers 471
12.6 Current Limits and Perspectives 471
V Extensions 473
13 Formalisms for Certifying Floating-Point Algorithms 474
13.1 Formalizing Floating-Point Arithmetic 474
13.1.1 Defining floating-point numbers 475
13.1.2 Simplifying the definition 477
13.1.3 Defining rounding operators 478
13.1.4 Extending the set of numbers 481
13.2 Formalisms for Certifying Algorithms by Hand 482
13.2.1 Hardware units 482
13.2.2 Low-level algorithms 483
13.2.3 Advanced algorithms 484
13.3 Automating Proofs 485
13.3.1 Computing on bounds 486
13.3.2 Counting digits 488
13.3.3 Manipulating expressions 490
13.3.4 Handling the relative error 494
13.4 Using Gappa 495
13.4.1 Toy implementation of sine 495
13.4.2 Integer division on Itanium 499
14 Extending the Precision 503
14.1 Double-Words, Triple-Words… 504
14.1.1 Double-word arithmetic 505
14.1.2 Static triple-word arithmetic 508
14.1.3 Quad-word arithmetic 510
14.2 Floating-Point Expansions 513
14.3 Floating-Point Numbers with Batched Additional Exponent 519
14.4 Large Precision Relying on Processor Integers 520
14.4.1 Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic 522
14.4.2 A brief introduction to arbitrary-precision integer arithmetic 523
VI Perspectives and Appendix 527
15 Conclusion and Perspectives 528
16 Appendix: Number Theory Tools for Floating-Point Arithmetic 529
16.1 Continued Fractions 529
16.2 The LLL Algorithm 532
Bibliography 537
Index 574
Publication date | November 11, 2009
Additional info | XXIV, 572 p.
Place of publication | Boston
Language | English
Subject areas | Mathematics/Computer Science ► Computer Science ► Programming Languages and Tools; Mathematics ► Analysis; Mathematics ► Statistics; Mathematics ► Probability/Combinatorics; Technology
Keywords | algorithm • Algorithm analysis and problem complexity • algorithms • approximation under constraints • certification, verification, big precision • Compilers • Computer • correctly rounded software division, square roots • elementary functions • error analysis, error bounds • Floating-point arithmetic • fused mul • Hardware • IEEE-754 standards • Number Theory • Numerical analysis • Operator • verification
ISBN-10 | 0-8176-4705-8 / 0817647058
ISBN-13 | 978-0-8176-4705-6 / 9780817647056
Size: 6.9 MB
DRM: digital watermark
File format: PDF (Portable Document Format)