
11. Package for machine-independent IEEE floating point arithmetic

Abstract data `IEEE' may be used for the implementation of a cross-compiler. It implements IEEE floating point arithmetic in a machine-independent way with the aid of package `arithm'. It is needed because the host machine may not support the arithmetic of the target machine. For example, VAX does not support IEEE floating point arithmetic. The floating point numbers are represented by bytes in big endian order. The package functions are not efficient enough for use at run time; they are intended for implementing constant folding in compilers. All integer sizes (see transformation functions) are given in bytes and must be positive.
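To make the big endian byte representation concrete, here is a minimal sketch that stores the 32-bit pattern of an IEEE single precision number as four bytes, most significant byte first. The names `single_to_big_endian` and `big_endian_to_single` are hypothetical and are not part of the package; only the byte order is taken from the text above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of package `IEEE'): splits the 32-bit
// pattern of an IEEE single precision number into 4 bytes, most
// significant byte first, independently of the host byte order.
std::array<std::uint8_t, 4> single_to_big_endian(std::uint32_t bits) {
  return {{static_cast<std::uint8_t>(bits >> 24),
           static_cast<std::uint8_t>(bits >> 16),
           static_cast<std::uint8_t>(bits >> 8),
           static_cast<std::uint8_t>(bits)}};
}

// Inverse operation: reassembles the pattern from big endian bytes.
std::uint32_t big_endian_to_single(const std::array<std::uint8_t, 4> &b) {
  return (static_cast<std::uint32_t>(b[0]) << 24) |
         (static_cast<std::uint32_t>(b[1]) << 16) |
         (static_cast<std::uint32_t>(b[2]) << 8) |
         static_cast<std::uint32_t>(b[3]);
}

// Usage: 1.0 in IEEE single precision has the bit pattern 0x3F800000.
void check_big_endian_examples() {
  const std::array<std::uint8_t, 4> one = single_to_big_endian(0x3F800000u);
  assert(one[0] == 0x3F && one[1] == 0x80 && one[2] == 0 && one[3] == 0);
  assert(big_endian_to_single(one) == 0x3F800000u);
}
```

Because the bytes are produced with shifts rather than by copying host memory, the result does not depend on the byte order of the host.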

The functions for addition, subtraction, multiplication, division, and conversion between floating point formats can raise input exceptions. If an operand of such an operation is a trapping (signaling) NaN, the invalid operation and reserved operand exceptions are raised and the result is a (quiet) NaN; otherwise, if an operand is a (quiet) NaN, only the reserved operand exception is raised and the result is a (quiet) NaN. Operation-specific processing of the remaining special case operand values is given with the description of each operation. In the general case a function can also raise output exceptions and produces results for them according to the following table. The result and status for a given exceptional operation are determined by the highest priority exception. If, for example, an operation produces both overflow and imprecise result exceptions, the overflow exception, having higher priority, determines the behavior of the operation. The behavior of this operation is therefore described by the Overflow entry of the table.

    Exception|Condition|                     |Result |Status
  -----------|---------|---------------------|-------|-------------
             |masked   |         IEEE_RN(_RP)| +Inf  |IEEE_OFL and
             |overflow | sign +  IEEE_RZ(_RM)| +Max  |IEEE_IMP    
             |exception|---------------------|-------|-------------
    Overflow |         | sign -  IEEE_RN(_RM)| -Inf  |IEEE_OFL and
             |         |         IEEE_RZ(_RP)| -Max  |IEEE_IMP    
             |---------|---------------------|-------|-------------
             |unmasked | Precise result      |See    |IEEE_OFL
             |overflow |---------------------|above  |-------------
             |exception| Imprecise result    |       |IEEE_OFL and
             |         |                     |       |IEEE_IMP    
  -----------|---------|---------------------|-------|-------------
             |masked   |                     |Rounded|IEEE_UFL and
             |underflow| Imprecise result    |result |IEEE_IMP    
   Underflow |exception|                     |       |     
             |---------|---------------------|-------|-------------
             |unmasked | Precise result      |result |IEEE_UFL
             |underflow|---------------------|-------|-------------
             |exception| Imprecise result    |Rounded|IEEE_UFL and
             |         |                     |result |IEEE_IMP    
  -----------|-------------------------------|-------|-------------
             |masked imprecise exception     |Rounded|IEEE_IMP
   Imprecise |                               |result |             
             |-------------------------------|-------|-------------
             |unmasked imprecise exception   |Rounded|IEEE_IMP
             |                               |result |             
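The masked overflow rows of the table can be read as a small decision function. The sketch below is only an illustration: the enumeration `Round` stands in for the package constants `IEEE_RN', `IEEE_RM', `IEEE_RP', `IEEE_RZ', and the function name `masked_overflow_result` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the rounding constants of the package.
enum class Round { RN, RM, RP, RZ };

// Result of a masked overflow as given by the table: an infinity when
// the mode is round to nearest or rounds away from zero for the given
// sign, otherwise the finite value of maximal magnitude.
std::string masked_overflow_result(bool negative, Round mode) {
  if (!negative)
    return (mode == Round::RN || mode == Round::RP) ? "+Inf" : "+Max";
  return (mode == Round::RN || mode == Round::RM) ? "-Inf" : "-Max";
}

// Usage: a positive overflow under round toward zero saturates.
void check_overflow_examples() {
  assert(masked_overflow_result(false, Round::RZ) == "+Max");
  assert(masked_overflow_result(true, Round::RN) == "-Inf");
}
```

In every one of these cases the table reports both `IEEE_OFL' and `IEEE_IMP' as status.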

The package uses package `bits'. The interface part of the abstract data is file `IEEE.h'. The implementation part is file `IEEE.cpp'. The interface contains the following external definitions:

Macros `IEEE_FLOAT_SIZE', `IEEE_DOUBLE_SIZE', `IEEE_QUAD_SIZE'

have values which are the sizes of IEEE single, double, and quad precision floating point numbers (`4', `8', and `16' respectively).

Macros `MAX_SINGLE_10_STRING_LENGTH', `MAX_DOUBLE_10_STRING_LENGTH', `MAX_QUAD_10_STRING_LENGTH'

have values which are the maximal lengths of the strings generated by the functions creating the decimal ascii representation of IEEE floats (see functions to_string).

Macros `MAX_SINGLE_16_STRING_LENGTH', `MAX_DOUBLE_16_STRING_LENGTH', `MAX_QUAD_16_STRING_LENGTH', `MAX_SINGLE_8_STRING_LENGTH', `MAX_DOUBLE_8_STRING_LENGTH', `MAX_QUAD_8_STRING_LENGTH', `MAX_SINGLE_4_STRING_LENGTH', `MAX_DOUBLE_4_STRING_LENGTH', `MAX_QUAD_4_STRING_LENGTH', `MAX_SINGLE_2_STRING_LENGTH', `MAX_DOUBLE_2_STRING_LENGTH', `MAX_QUAD_2_STRING_LENGTH'

have values which are the maximal lengths of the strings generated by the functions creating the binary ascii representation of IEEE floats with the given base (see functions to_binary_string).

Types `IEEE_float_t', `IEEE_double_t', and `IEEE_quad_t'

are simply synonyms of classes `IEEE_float', `IEEE_double', and `IEEE_quad', which represent IEEE single, double, and quad precision floating point numbers respectively.

Constants `IEEE_RN', `IEEE_RM', `IEEE_RP', `IEEE_RZ'

define the rounding control (round to nearest representable number, round toward minus infinity, round toward plus infinity, round toward zero).

Round to nearest means the result produced is the representable value nearest to the infinitely precise result. There is a special case when the infinitely precise result falls exactly halfway between two representable values. In this case the result is whichever of those two representable values has a fractional part whose least significant bit is zero.

Round toward minus infinity means the result produced is the representable value closest to but no greater than the infinitely precise result.

Round toward plus infinity means the result produced is the representable value closest to but no less than the infinitely precise result.

Round toward zero means the result produced is the representable value closest to but no greater in magnitude than the infinitely precise result.
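The four rounding modes can be illustrated on a simpler problem than floating point: rounding an exact rational number to an integer. The following sketch uses only integer arithmetic, so it is machine independent in the same spirit as the package. The enumeration `Round` and the function `round_quotient` are hypothetical names introduced for this illustration; they are not part of the package.

```cpp
#include <cassert>

// Hypothetical stand-ins for IEEE_RN, IEEE_RM, IEEE_RP, IEEE_RZ.
enum class Round { RN, RM, RP, RZ };

// Rounds the exact quotient n / d (with d > 0) to an integer.
long long round_quotient(long long n, long long d, Round mode) {
  const long long q = n / d;  // C++ division truncates toward zero
  const long long r = n % d;  // remainder has the sign of n
  if (r == 0) return q;       // exactly representable, no rounding
  const long long floor_q = (r < 0) ? q - 1 : q;
  const long long ceil_q = floor_q + 1;
  switch (mode) {
    case Round::RM: return floor_q;  // toward minus infinity
    case Round::RP: return ceil_q;   // toward plus infinity
    case Round::RZ: return q;        // toward zero
    case Round::RN: {
      // Twice the distance from floor_q, compared against d.
      const long long twice = 2 * (n - floor_q * d);
      if (twice < d) return floor_q;
      if (twice > d) return ceil_q;
      // Halfway case: choose the neighbor that is even.
      return (floor_q % 2 == 0) ? floor_q : ceil_q;
    }
  }
  return q;
}

// Usage: 5/2 and 7/2 are both halfway cases.
void check_rounding_examples() {
  assert(round_quotient(5, 2, Round::RN) == 2);
  assert(round_quotient(7, 2, Round::RN) == 4);
  assert(round_quotient(-5, 2, Round::RM) == -3);
}
```

The halfway rule picks the even neighbor here, which corresponds to the "least significant bit is zero" rule stated above for representable floating point values.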

Class `IEEE'

The class has the following functions common for all packages:

Static public function `reset'

           `void reset (void)'
           
and to separate bits in mask returned by functions
           `IEEE_get_sticky_status_bits',
           `IEEE_get_status_bits', and
           `IEEE_get_trap_mask'.
           

Function `IEEE_reset'

           `void IEEE_reset (void)'
           
and to separate bits in mask returned by functions
           `IEEE_get_sticky_status_bits',
           `IEEE_get_status_bits', and
           `IEEE_get_trap_mask'.
           

Static public function `get_trap_mask'

           `int get_trap_mask (void)'
           
returns the exception trap mask. Static public function
           `int set_trap_mask (int mask)'
           
sets up a new exception trap mask and returns the previous one.

If the mask bit corresponding to a given exception is set, a floating point exception trap does not occur for that exception. Such an exception is said to be a masked exception. The initial exception trap mask is zero. Remember that more than one exception may occur simultaneously.

Static public function `set_sticky_status_bits'

           `int set_sticky_status_bits (int mask)'
           
changes sticky status bits and returns the previous bits.

Static public function `get_sticky_status_bits'

           `int get_sticky_status_bits (void)'
           
returns the mask of the current sticky status bits. Only sticky status bits corresponding to masked exceptions are updated, regardless of whether a floating point exception trap is taken or not. The initial values of the sticky status bits are zero.

Static public function `get_status_bits'

           `int get_status_bits (void)'
           
returns the mask of status bits. The function is intended to be used in a trap on a floating point exception. Status bits are updated regardless of the current exception trap mask, but only when a floating point exception trap is taken. The initial values of the status bits are zero.

Static public functions `set_round', `get_round'

           `int set_round (int round_mode)'
           
which sets up the current rounding mode and returns the previous mode, and
           `int get_round (void)'
           
which returns the current mode. The initial rounding mode is round to nearest.

Static public function `default_floating_point_exception_trap'

           `void default_floating_point_exception_trap (void)'
           
Initially, the reaction to a trap on an unmasked floating point exception is this function. The function does nothing. All exceptions that occurred can be found in the trap with the aid of the status bits.

Static public function `set_floating_point_exception_trap'

           `void (*set_floating_point_exception_trap
                   (void (*function) (void))) (void)'
           
sets up the trap on an unmasked exception and returns the previous trap function. The function given as parameter simulates the floating point exception trap.
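The interplay of the trap mask, the sticky status bits, the status bits, and the trap function can be modeled in a few lines. The sketch below is one reading of the descriptions above, not the package implementation: the bit values, the structure `TrapModel`, and the helper `counting_trap` are hypothetical, and it is an assumption that the status bits are overwritten (rather than accumulated) with all exceptions of the trapping operation.

```cpp
#include <cassert>

// Hypothetical bit values; the real constants are defined in `IEEE.h'.
enum : int { INV = 1, RO = 2, DZ = 4, OFL = 8, UFL = 16, IMP = 32 };

struct TrapModel {
  int trap_mask = 0;         // a set bit masks the exception (no trap)
  int sticky_bits = 0;       // accumulated bits of masked exceptions
  int status_bits = 0;       // exceptions of the operation that trapped
  void (*trap)() = nullptr;  // plays the role of the installed trap

  // Called by an operation that raised the exceptions in `raised'.
  void raise(int raised) {
    sticky_bits |= raised & trap_mask;  // only masked exceptions
    if (raised & ~trap_mask) {          // at least one is unmasked
      status_bits = raised;             // assumption: overwritten
      if (trap) trap();
    }
  }
};

// Hypothetical trap function that only counts how often it is called.
int trap_calls = 0;
void counting_trap() { ++trap_calls; }

// Usage: overflow is masked, imprecise result is not.
void check_trap_examples() {
  TrapModel m;
  m.trap = counting_trap;
  m.trap_mask = OFL;
  m.raise(OFL | IMP);
  assert(m.sticky_bits == OFL);           // masked exception recorded
  assert(m.status_bits == (OFL | IMP));   // both visible in the trap
  assert(trap_calls == 1);
}
```

With every exception masked, `raise` only accumulates sticky bits and the trap function is never called.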

Classes `IEEE_float', `IEEE_double', and `IEEE_quad'

The classes implement IEEE floating point numbers in an object-oriented style. The following functions are described for class `IEEE_float'. The classes `IEEE_double' and `IEEE_quad' have analogous functions (unless stated otherwise) with the same names but for IEEE double and quad precision numbers.

Public constructors `IEEE_float', `IEEE_double', `IEEE_quad'

           `IEEE_float (void)'
           `IEEE_float (float f)'
           `IEEE_double (void)'
           `IEEE_double (float f)'
           `IEEE_quad (void)'
           `IEEE_quad (float f)'
           
create IEEE single, IEEE double, or IEEE quad precision numbers with a positive zero value or with the given value.

Public function `positive_zero'

           `void positive_zero (void)'
           
Given float becomes positive single precision zero constant. There are analogous functions which return other special case values:
           `negative_zero',
           `NaN',
           `trapping_NaN',
           `positive_infinity',
           `negative_infinity',
           

According to the IEEE standard, a NaN (and a trapping NaN) can be represented by more than one bit string. But all functions of the package generate and use only the one representation created by function `NaN' (and `trapping_NaN'). A (quiet) NaN does not cause an Invalid Operation exception and can be reported as an operation result. A trapping NaN causes an Invalid Operation exception if used as an input operand of a floating point operation. A trapping NaN cannot be reported as an operation result.

Public function `is_positive_zero'

           `int is_positive_zero (void)'
           
returns 1 if given number is positive single precision zero constant. There are analogous functions for other special case values:
           `is_negative_zero',
           `is_NaN',
           `is_trapping_NaN',
           `is_positive_infinity',
           `is_negative_infinity',
           `is_positive_maximum' (positive max value),
           `is_negative_maximum',
           `is_positive_minimum' (positive min value),
           `is_negative_minimum',
           
Although all functions of the package generate and use only the one representation created by function `NaN' (or `trapping_NaN'), the function `is_NaN' (and `is_trapping_NaN') recognizes any representation of a NaN.

Public function `is_normalized'

           `int is_normalized (void)'
           
returns TRUE if the given number is normalized (special case values are not normalized). There is an analogous function
           `is_denormalized'
           
for determination of denormalized number.
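The special case values and the normalized/denormalized distinction follow from the exponent and fraction fields of the bit pattern. Here is a minimal sketch for IEEE single precision, working on the 32-bit pattern directly. The names `Kind` and `classify_single` are hypothetical, and treating the most significant fraction bit as the quiet NaN marker is an assumption (it is the common convention, but the text above does not state which representation the package uses).

```cpp
#include <cassert>
#include <cstdint>

enum class Kind {
  Zero, Denormalized, Normalized, Infinity, QuietNaN, SignalingNaN
};

// Classifies an IEEE single precision bit pattern:
// 1 sign bit, 8 exponent bits, 23 fraction bits.
Kind classify_single(std::uint32_t bits) {
  const std::uint32_t exponent = (bits >> 23) & 0xFFu;
  const std::uint32_t fraction = bits & 0x7FFFFFu;
  if (exponent == 0)
    return fraction == 0 ? Kind::Zero : Kind::Denormalized;
  if (exponent != 0xFFu) return Kind::Normalized;
  if (fraction == 0) return Kind::Infinity;
  // Assumption: a set most significant fraction bit means quiet NaN.
  return (fraction & 0x400000u) ? Kind::QuietNaN : Kind::SignalingNaN;
}

// Usage: two different bit strings that are both quiet NaNs.
void check_classify_examples() {
  assert(classify_single(0x7FC00000u) == Kind::QuietNaN);
  assert(classify_single(0x7FC00001u) == Kind::QuietNaN);
  assert(classify_single(0x3F800000u) == Kind::Normalized);  // 1.0
}
```

Any nonzero fraction with an all-ones exponent is a NaN, which is why more than one bit string can represent it.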

Public operator `+'

           `class IEEE_float operator + (class IEEE_float &op)'
           
performs single precision addition of floating point numbers. There are analogous operators which implement the other floating point operations:
           `-',
           `*',
           `/',
           
Results and input exceptions for special case operand values (except for NaNs) are described for addition by the following table
               first  |         second operand                
               operand|---------------------------------------
                      |    +Inf      |    -Inf     |   Others
               -------|--------------|-------------|----------
               +Inf   |    +Inf      |     NaN     |   +Inf
                      |    none      |IEEE_INV(_RO)|   none
               -------|--------------|-------------|----------
               -Inf   |    NaN       |    -Inf     |   -Inf
                      |IEEE_INV(_RO) |    none     |   none
               -------|--------------|-------------|----------
               Others |    +Inf      |    -Inf     |
                      |    none      |    none     |          
   
Results and input exceptions for special case operand values (except for NaNs) are described for subtraction by the following table
               first  |         second operand                
               operand|---------------------------------------
                      |    +Inf     |    -Inf      |   Others
               -------|-------------|--------------|----------
               +Inf   |     NaN     |    +Inf      |   +Inf
                      |IEEE_INV(_RO)|    none      |   none
               -------|-------------|--------------|----------
               -Inf   |    -Inf     |    NaN       |   -Inf
                      |    none     |IEEE_INV(_RO) |   none
               -------|-------------|--------------|----------
               Others |    -Inf     |    +Inf      |
                      |    none     |    none      |          
   
Results and input exceptions for special case operand values (except for NaNs) are described for multiplication by the following table
           first  |         second operand                
           operand|--------------------------------------------
                  |  +Inf    |   -Inf    |    0       |  Others
           -------|----------|-----------|------------|--------
           +Inf   |  +Inf    |   -Inf    |    NaN     | (+-)Inf
                  |  none    |   none    |  IEEE_INV  |  none  
                  |          |           |    (_RO)   | 
           -------|----------|-----------|------------|--------
           -Inf   |  -Inf    |   +Inf    |    NaN     | (+-)Inf
                  |  none    |   none    |  IEEE_INV  |  none  
                  |          |           |   (_RO)    | 
           -------|----------|-----------|------------|--------
           0      |   NaN    |   NaN     |   (+-)0    | (+-)0  
                  | IEEE_INV | IEEE_INV  |   none     | none   
                  |  (_RO)   |   (_RO)   |            | 
           -------|----------|-----------|------------|--------
           Others | (+-)Inf  |  (+-)Inf  |   (+-)0    |        
                  |  none    |   none    |   none     |        
   
Results and input exceptions for special case operand values (except for NaNs) are described for division by the following table
           first  |         second operand                
           operand|--------------------------------------------
                  |   +Inf    |   -Inf    |   0       |  Others
           -------|-----------|-----------|-----------|--------
           +Inf   |    NaN    |    NaN    |  (+-)Inf  | (+-)Inf
                  |  IEEE_INV |  IEEE_INV |  none     |  none  
                  |   (_RO)   |   (_RO)   |           |
           -------|-----------|-----------|-----------|--------
           -Inf   |    NaN    |    NaN    |  (+-)Inf  | (+-)Inf
                  |  IEEE_INV |  IEEE_INV |  none     |  none  
                  |   (_RO)   |   (_RO)   |           |
           -------|-----------|-----------|-----------|--------
           0      |  (+-)0    |  (+-)0    |    NaN    | (+-)0  
                  |  none     |  none     |  IEEE_INV | none   
                  |           |           |   (_RO)   |
           -------|-----------|-----------|-----------|--------
           Others |  (+-)0    |  (+-)0    |  (+-)Inf  |        
                  |  none     |   none    |  IEEE_DZ  |        
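The division table can be transcribed into a small function, which makes the order of the checks explicit. This is a sketch only: the enumerations, the bit values, and the name `divide_special` are hypothetical, and the sign of the result (shown as (+-) in the table) is left out.

```cpp
#include <cassert>

enum class Operand { Infinity, Zero, Other };        // sign ignored here
enum class Result { NaN, Infinity, Zero, Ordinary };
enum : int { NONE = 0, INV = 1, RO = 2, DZ = 4 };    // hypothetical bits

struct Outcome {
  Result result;
  int exceptions;
};

// Special case handling of a / b as given by the division table.
Outcome divide_special(Operand a, Operand b) {
  if (a == b && a != Operand::Other)       // Inf / Inf or 0 / 0
    return {Result::NaN, INV | RO};
  if (a == Operand::Infinity) return {Result::Infinity, NONE};
  if (b == Operand::Infinity) return {Result::Zero, NONE};
  if (a == Operand::Zero) return {Result::Zero, NONE};
  if (b == Operand::Zero) return {Result::Infinity, DZ};
  return {Result::Ordinary, NONE};         // the empty table cell
}

// Usage: only a finite nonzero dividend raises division by zero.
void check_division_examples() {
  assert(divide_special(Operand::Other, Operand::Zero).exceptions == DZ);
  assert(divide_special(Operand::Infinity, Operand::Zero).exceptions == NONE);
  assert(divide_special(Operand::Zero, Operand::Zero).result == Result::NaN);
}
```

The `Ordinary` outcome stands for the empty cell of the table, where the usual arithmetic and the output exceptions described earlier apply.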
   

Public operator `=='

           `int operator == (class IEEE_float &op)'
           
compares two floating point numbers for equality and returns 1 or 0 depending on the result of the comparison. There are analogous operators which implement the other comparison operations:
           `!=',
           `>',
           `>=',
           `<',
           `<='.
           
Results and input exceptions for special case operand values are described for equality and inequality by the following table
        
           first  |         second operand                
           operand|---------------------------------------
                  |    SNaN     |    QNaN      |   Others
           -------|-------------|--------------|----------
           SNaN   |   FALSE     |   FALSE      |  FALSE
                  |  IEEE_INV   |  IEEE_INV    | IEEE_INV
           -------|-------------|--------------|----------
           QNaN   |   FALSE     |   FALSE      |  FALSE
                  |  IEEE_INV   |    none      |   none
           -------|-------------|--------------|----------
           Others |   FALSE     |   FALSE      |
                  |  IEEE_INV   |    none      |          
   
Results and input exceptions for special case operand values are described for the other comparison operations by the following table
        
           first  |         second operand                
           operand|---------------------------------------
                  |    SNaN     |    QNaN      |   Others
           -------|-------------|--------------|----------
           SNaN   |   FALSE     |   FALSE      |  FALSE
                  |  IEEE_INV   |  IEEE_INV    | IEEE_INV
           -------|-------------|--------------|----------
           QNaN   |   FALSE     |   FALSE      |  FALSE
                  |  IEEE_INV   |  IEEE_INV    | IEEE_INV
           -------|-------------|--------------|----------
           Others |   FALSE     |   FALSE      |
                  |  IEEE_INV   |  IEEE_INV    |          
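The two comparison tables differ only in how a quiet NaN is treated. The following sketch condenses them into two predicates; the enumeration `Operand` and the function names are hypothetical and not part of the package.

```cpp
#include <cassert>

enum class Operand { SNaN, QNaN, Other };

// True when the comparison result is forced to FALSE by a NaN operand.
bool comparison_forced_false(Operand a, Operand b) {
  return a != Operand::Other || b != Operand::Other;
}

// True when the comparison raises IEEE_INV. `ordering' selects the
// second table (>, >=, <, <=); otherwise the first table (==, !=).
bool comparison_raises_invalid(bool ordering, Operand a, Operand b) {
  const bool any_snan = a == Operand::SNaN || b == Operand::SNaN;
  const bool any_nan = comparison_forced_false(a, b);
  return ordering ? any_nan : any_snan;
}

// Usage: a quiet NaN is silent for == but not for <.
void check_comparison_examples() {
  assert(!comparison_raises_invalid(false, Operand::QNaN, Operand::Other));
  assert(comparison_raises_invalid(true, Operand::QNaN, Operand::Other));
  assert(comparison_forced_false(Operand::QNaN, Operand::Other));
}
```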
   

Public functions `to_string'

           `char *to_string (char *result)'
           
transform a single precision number to a decimal ascii representation with an obligatory integer part (1 digit), a fractional part (of constant length), and an optional exponent. Minus signs are present where needed. The special case IEEE floating point values are represented by the strings `SNaN', `QNaN', `+Inf', `-Inf', `+0', and `-0'. The functions return the value `result'. The current rounding mode does not affect the resulting ascii representation. The functions output 9 decimal fraction digits for a single precision number, 17 decimal fraction digits for a double precision number, and 36 decimal fraction digits for a quad precision number.

Public functions `to_binary_string'

           `char *to_binary_string (int base, char *result)'
   
           
The functions are analogous to to_string but transform the floating point number into a binary ascii representation with an obligatory integer part (1 digit) of the given base, an optional fractional part of the given base, and an optional binary exponent (a decimal number giving a power of 2). The binary exponent starts with the character `p' instead of `e'. Minus signs are present where needed. The special case IEEE floating point values are represented by the strings `SNaN', `QNaN', `+Inf', `-Inf', `+0', and `-0'. The functions return the value `result'. The value of parameter base should be 2, 4, 8, or 16. The current rounding mode does not affect the resulting ascii representation.

Public functions `from_string'

           `char *from_string (const char *operand)'
           
skip all white space at the beginning of the source string and transform the rest of the source string to a single precision floating point number. The number must correspond to the following syntax
              ['+' | '-'] [<decimal digits>]
                          [ '.' [<decimal digits>] ]
                   [ ('e' | 'E') ['+' | '-'] <decimal digits>]
           
or must be one of the following strings: `SNaN', `QNaN', `+Inf', `-Inf', `+0', or `-0'. The functions return a pointer to the first character in the source string after the floating point number read. If the string does not correspond to the floating point number syntax, the result will be zero and the functions return the source string.

The functions can raise output exceptions as described above. The current rounding mode may affect the resulting floating point number. It is guaranteed that the transformation `IEEE floating point number -> string -> IEEE floating point number' results in the same IEEE floating point number if round to nearest mode is used. But the reverse transformation `string with 9 (or 17) digits -> IEEE floating point number -> string' may result in different fraction digits in the ascii representation, because one floating point number may represent several such strings that differ in the least significant digit. The ascii representations are identical, however, when the functions for IEEE single, double, and quad precision numbers do not raise the imprecise result exception, or when fewer than 9 (17 or 36) digits of the fractions in the ascii representations are compared.
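The round trip guarantee can be observed with the standard C library on a host whose `float' is IEEE single precision. This sketch does not use the package: it relies on the host's `snprintf' and `strtof' under the default round to nearest mode, and the names `bits_of` and `round_trips` are hypothetical. The format `%.9e' produces one integer digit and 9 fraction digits, the same shape as described for `to_string'.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Bit pattern of a host float (assumed to be a 32-bit IEEE single).
std::uint32_t bits_of(float f) {
  std::uint32_t b;
  std::memcpy(&b, &f, sizeof b);
  return b;
}

// number -> string with 9 fraction digits -> number, compared bitwise.
bool round_trips(float f) {
  char text[32];
  std::snprintf(text, sizeof text, "%.9e", static_cast<double>(f));
  const float back = std::strtof(text, nullptr);
  return bits_of(back) == bits_of(f);
}

// Usage: values that have no exact decimal representation.
void check_round_trip_examples() {
  assert(round_trips(0.1f));
  assert(round_trips(3.14159274f));
  assert(round_trips(16777216.0f));
}
```

The reverse direction is not covered by this check: two different 10-digit strings can convert to the same float, so `string -> number -> string' need not reproduce the last digit.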

Public functions `from_binary_string'

           `char *from_binary_string (const char *operand, int base)'
           
The functions are analogous to from_string but transform a binary representation of the floating point number. The number must correspond to the following syntax
              ['+' | '-'] [<digits less than base>] [ '.' [<digits less than base>] ]
                   [ ('p' | 'P') ['+' | '-'] <decimal digits>]
           
or must be one of the following strings: `SNaN', `QNaN', `+Inf', `-Inf', `+0', or `-0'. The functions return a pointer to the first character in the source string after the floating point number read. If the string does not correspond to the floating point number syntax, the result will be zero and the functions return the source string. The exponent (after character `p' or `P') defines a power of two.

The functions can raise output exceptions as described above. The current rounding mode can affect the resulting floating point number if too many digits are given.
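The binary string syntax can be read by a short hand-written parser. The sketch below is not the package implementation: the function `parse_binary_float` is hypothetical, it produces a host `double' instead of an IEEE class object, it does not handle the special strings such as `QNaN', and it assumes that leading white space is skipped and that at least one digit is required. It follows the same return convention as described above.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>

// Value of a digit character in bases up to 16, or -1.
static int digit_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Parses [sign] digits ['.' digits] [('p'|'P') [sign] decimal digits]
// for base 2, 4, 8, or 16. Returns a pointer just after the number, or
// `text' itself (with result 0) if no number is found.
const char *parse_binary_float(const char *text, int base, double &result) {
  const int bits_per_digit = base == 2 ? 1 : base == 4 ? 2 : base == 8 ? 3 : 4;
  const char *p = text;
  result = 0.0;
  while (std::isspace(static_cast<unsigned char>(*p))) ++p;
  bool negative = false;
  if (*p == '+' || *p == '-') negative = (*p++ == '-');
  double mantissa = 0.0;
  int digits = 0, scale = 0;  // scale is the power of two to apply
  for (int d; (d = digit_value(*p)) >= 0 && d < base; ++p, ++digits)
    mantissa = mantissa * base + d;
  if (*p == '.') {
    ++p;
    for (int d; (d = digit_value(*p)) >= 0 && d < base; ++p, ++digits) {
      mantissa = mantissa * base + d;
      scale -= bits_per_digit;  // each fraction digit shifts right
    }
  }
  if (digits == 0) return text;  // syntax error: nothing consumed
  if (*p == 'p' || *p == 'P') {
    const char *q = p + 1;
    bool exp_negative = false;
    if (*q == '+' || *q == '-') exp_negative = (*q++ == '-');
    if (std::isdigit(static_cast<unsigned char>(*q))) {
      int exponent = 0;
      for (; std::isdigit(static_cast<unsigned char>(*q)); ++q)
        exponent = exponent * 10 + (*q - '0');
      scale += exp_negative ? -exponent : exponent;
      p = q;  // accept the exponent only if it has digits
    }
  }
  result = std::ldexp(negative ? -mantissa : mantissa, scale);
  return p;
}

// Usage: hexadecimal 1.8 is 1.5, times 2 to the power 1.
void check_binary_parse_examples() {
  double v = 0.0;
  parse_binary_float("1.8p1", 16, v);
  assert(v == 3.0);
  parse_binary_float("-101.1p-1", 2, v);
  assert(v == -2.75);
}
```

Since `e' is itself a hexadecimal digit, a separate exponent character such as `p' is needed to keep base 16 input unambiguous. The sketch is exact only while the digits fit in the 53-bit significand of a host double; the package instead rounds according to the current rounding mode.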

Public transformation functions

In class `IEEE_float'

           `class IEEE_double to_double (void)'
           `class IEEE_quad to_quad (void)'
           `class IEEE_float &from_signed_integer
                              (int size, const void *integer)'
           `class IEEE_float &from_unsigned_integer
                              (int size,
                               const void *unsigned_integer)'
           `void to_signed_integer (int size, void *integer)'
           `void to_unsigned_integer (int size,
                                      void *unsigned_integer)'
           
In class `IEEE_double'
           `class IEEE_float to_single (void)'
           `class IEEE_quad to_quad (void)'
           `class IEEE_double &from_signed_integer
                               (int size,
                                const void *integer)'
           `class IEEE_double &from_unsigned_integer
                                (int size,
                                 const void *unsigned_integer)'
           `void to_signed_integer (int size, void *integer)'
           `void to_unsigned_integer (int size,
                                      void *unsigned_integer)'
           
In class `IEEE_quad'
           `class IEEE_float to_single (void)'
           `class IEEE_double to_double (void)'
           `class IEEE_quad &from_signed_integer
                               (int size,
                                const void *integer)'
           `class IEEE_quad &from_unsigned_integer
                                (int size,
                                 const void *unsigned_integer)'
           `void to_signed_integer (int size, void *integer)'
           `void to_unsigned_integer (int size,
                                      void *unsigned_integer)'
           
In fact, no output exceptions occur during the transformation of a single precision floating point number to a double (or quad) precision number, or of a double precision floating point number to a quad precision number. No input exceptions occur during the transformation of integer numbers to floating point numbers. Results and input exceptions for special case operand values (and for NaNs) are described for the conversion of a floating point number to a signed integer by the following table
                       Operand     | Result & Exception
                     --------------|-------------------
                         SNaN      |     0  
                                   |IEEE_INV(_RO)
                     --------------|-------------------
                         QNaN      |     0    
                                   |IEEE_INV(_RO)     
                     --------------|-------------------
                         +Inf      |    IMax     
                                   |  IEEE_INV     
                     --------------|-------------------
                         -Inf      |    IMin     
                                   |  IEEE_INV     
                     --------------|-------------------
                         Others    |             
                                   |               
   
Results and input exceptions for special case operand values (and for NaNs) are described for the conversion of a floating point number to an unsigned integer by the following table
                       Operand     | Result & Exception
                     --------------|-------------------
                         SNaN      |     0  
                                   |IEEE_INV(_RO)
                     --------------|-------------------
                         QNaN      |     0    
                                   |IEEE_INV(_RO)     
                     --------------|-------------------
                         +Inf      |    IMax     
                                   |  IEEE_INV     
                     --------------|-------------------
                         -Inf or   |    0     
                    negative number|  IEEE_INV     
                     --------------|-------------------
                         Others    |             
                                   |               
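The two conversion tables can be sketched as follows for a 4-byte target integer. Host `double' values stand in for IEEE numbers here, and the structures and function names are hypothetical. Only the special cases of the tables are handled; for "Others" the sketch reports that no special case applies, since the tables leave that cell empty.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

struct SignedCase {
  bool applies;           // false for the "Others" row
  std::int32_t value;
  bool invalid;           // IEEE_INV
  bool reserved_operand;  // IEEE_RO, reported for NaN operands only
};

SignedCase signed_special_case(double x) {
  if (std::isnan(x)) return {true, 0, true, true};
  if (std::isinf(x))
    return {true,
            x > 0 ? std::numeric_limits<std::int32_t>::max()
                  : std::numeric_limits<std::int32_t>::min(),
            true, false};
  return {false, 0, false, false};
}

struct UnsignedCase {
  bool applies;
  std::uint32_t value;
  bool invalid;
  bool reserved_operand;
};

UnsignedCase unsigned_special_case(double x) {
  if (std::isnan(x)) return {true, 0u, true, true};
  if (std::isinf(x) && x > 0)
    return {true, std::numeric_limits<std::uint32_t>::max(), true, false};
  if (x < 0) return {true, 0u, true, false};  // -Inf or negative number
  return {false, 0u, false, false};
}

// Usage: infinities saturate, negative numbers become unsigned 0.
void check_conversion_examples() {
  const double inf = std::numeric_limits<double>::infinity();
  assert(signed_special_case(-inf).value ==
         std::numeric_limits<std::int32_t>::min());
  assert(unsigned_special_case(-1.5).value == 0u);
  assert(unsigned_special_case(-1.5).invalid);
}
```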
   
Results and exceptions for NaNs during the transformation of floating point numbers to (unsigned) integers differ from those for the operations of addition, multiplication, and so on.

Template transformation functions

As mentioned above, there are template classes `sint' and `unsint' in package `arithm'. Therefore package `IEEE' contains template functions for transformation between IEEE numbers and integer numbers. As in package `arithm', if you define macro `NO_TEMPLATE' before inclusion of the interface file, these template transformation functions will be absent. There are the following functions:

        `template <int size> 
         class IEEE_float &IEEE_float_from_unsint
                           (class IEEE_float &single,
                            class unsint<size> &unsigned_integer)'

        `template <int size>
         class IEEE_float &IEEE_float_from_sint
                           (class IEEE_float &single,
                            class sint<size> &integer)'

        `template <int size>
         void IEEE_float_to_sint (class IEEE_float &single,
                                  class sint<size> &integer)'
        `template <int size>
         void IEEE_float_to_unsint
                      (class IEEE_float &single,
                       class unsint<size> &unsigned_integer)'
        `template <int size> 
         class IEEE_double &IEEE_double_from_unsint
                            (class IEEE_double &single,
                             class unsint<size> &unsigned_integer)'

        `template <int size>
         class IEEE_double &IEEE_double_from_sint
                            (class IEEE_double &single,
                             class sint<size> &integer)'

        `template <int size>
         void IEEE_double_to_sint (class IEEE_double &single,
                                   class sint<size> &integer)'
        `template <int size>
         void IEEE_double_to_unsint
              (class IEEE_double &single,
               class unsint<size> &unsigned_integer)'
        `template <int size> 
         class IEEE_quad &IEEE_quad_from_unsint
                            (class IEEE_quad &single,
                             class unsint<size> &unsigned_integer)'

        `template <int size>
         class IEEE_quad &IEEE_quad_from_sint
                            (class IEEE_quad &single,
                             class sint<size> &integer)'

        `template <int size>
         void IEEE_quad_to_sint (class IEEE_quad &single,
                                 class sint<size> &integer)'
        `template <int size>
         void IEEE_quad_to_unsint
              (class IEEE_quad &single,
               class unsint<size> &unsigned_integer)'
        
Exceptions for these functions are the same as described above for functions `from_signed_integer', `to_signed_integer' and so on.

Important note: All items related to IEEE 128-bit floating point numbers (those containing the word quad or QUAD in their names) are defined only when macro `IEEE_QUAD' is defined. By default `IEEE_QUAD' is not defined, because supporting IEEE 128-bit numbers requires more than 100Kb of memory.

