The scope of this document is to provide information on how floating point arithmetic is implemented in the C language on different platforms and by different compilers. It will explain why it is problematic to simply use floating point math without considering the implications. It will also try to provide a set of preprocessor macros that allow the consistent use of floating point arithmetic across as many platforms and compilers as possible.
This document is far from complete. Please be encouraged to contact me if you have any further information on this topic whatsoever. I'm especially interested in floating point arithmetic on non-x86 platforms, for example SPARC, IA-64, ARM or Alpha. I'd also be interested in someone with knowledge of 64-bit Windows, or at least the possibility to test some things with different compilers there (especially Visual Studio 9, MinGW-gcc, Cygwin-gcc).
This document currently only deals with floating point precision. Two other topics will be discussed in future versions of this document: floating point exceptions and internal rounding control.
Please note that some understanding of assembly language and the inner workings of microprocessors is extremely helpful for understanding this document completely. Nevertheless, the conclusions should be understandable without it.
For a general explanation of floating point numbers, please refer to the Wikipedia article on IEEE 754. The following is only a small overview.
There are four main types of floating point numbers that are widely implemented today:

- Single precision (32 bits: 1 sign bit, 8 exponent bits, 23 fraction bits)
- Double precision (64 bits: 1 sign bit, 11 exponent bits, 52 fraction bits)
- Double-extended precision (80 bits: 1 sign bit, 15 exponent bits, 64 fraction bits)
- Quad precision (128 bits: 1 sign bit, 15 exponent bits, 112 fraction bits)
Some processors (most notably the x87 FPU) have a so-called internal precision, i.e. instead of having different registers and instructions for different floating point types (as is the case with integers), they have a single internal floating point register type (in the x87 case, double-extended precision) and a single set of operations for doing math.
Many floating point experts argue that using a higher internal precision for floating point calculations is better because it helps keep algorithms numerically stable. While I basically agree with this argument, the problem with x87 FPUs is that they hide the usage of this internal precision from the programmer. This causes two different kinds of headaches:

- Results depend on when the compiler stores intermediate values back to memory (truncating them), so the same source code can yield different results depending on compiler, optimization level and platform.
- Rounding twice (first to the internal precision, then to the target precision) can yield a different result than rounding directly to the target precision, so even after truncation the result may differ from a calculation carried out directly in the target precision.
Note that the second problem apparently does not occur for single precision floating point numbers: the number of fraction bits is so small that there appears to be no short decimal number that maps to a floating point number with these characteristics. For single precision, calculating in a higher precision and then truncating therefore appears to have the same effect as calculating directly in single precision.
This section will take a look at different hardware implementations and analyze them.
x86 platforms support two types of floating point instruction sets nowadays: The classical x87 floating point unit and the new MMX/SSE instructions that can even perform parallel calculations.
The characteristics of the x87 FPU are:

- It always calculates in its internal precision (double-extended by default).
- The internal precision can be changed via the precision control field of the FPU control word.
- Results are only truncated to the target precision when they are stored back to memory.
The characteristics of the MMX/SSE instructions are:

- They always operate with the same precision as the operands (single precision with SSE, double precision with SSE2).
- There is no double-extended precision support.
According to the documentation (see: Book I: PowerPC User Instruction Set Architecture) and some tests done by Scott MacVicar, the following holds for PPC:

- The processor uses different instructions for the different floating point types, so the precision of a calculation is determined by the datatypes of the operands.
- Double-extended precision is not supported.
The PPC processor also seems to support a non-IEEE-mode for the FPU. The documentation does not clearly state how this mode is defined and for portability reasons it clearly should be avoided.
Johannes Schlüter did some tests on an UltraSPARC IIIi platform which yielded the following results:

- Different instructions are generated for different floating point datatypes, so the operand types determine the precision.
- Double-extended precision is not supported; quad precision calculations are software-emulated by default.
I'd love to have some details on how other platforms implement floating point operations. The following is a non-exhaustive list of platforms I don't have any information on:
Some notes:
This section focusses on the implementation of floating point operations in compilers and system libraries, i.e. how C code is transformed into assembly instructions and how they interact with the internal processor state.
Calling conventions define how function parameters and return values are passed. This is platform-, operating-system- and compiler-dependent. Please note that quite a few systems pass floating point parameters and return values directly in FPU registers and not in memory. This can have interesting side effects: take, for example, a function that returns a float (single precision) whose return value is not stored but used directly within another function call. With certain calling conventions, the FP register may be passed straight through to the second function - without the value ever being stored in memory and thus without it ever being truncated from internal precision. Other systems may have calling conventions that use the stack for parameters, so the value is stored in memory and truncated. This also causes non-deterministic behaviour in the sense that different systems react differently to the same piece of code. I will discuss this issue separately for each compiler where applicable.
I did extensive tests with cl.exe from Microsoft Visual C++ 2005 Express Edition under Windows 2000 and with MinGW (gcc 3.4.5 mingw-special). I could only test on 32-bit Windows; I'd be happy to hear about 64-bit Windows.
Most notably, the Visual C++ compiler does not support double-extended precision; it only supports single precision and double precision. ([1], [2]) It does, however, know a long double type, which is treated the same way as double for arithmetic purposes but is a distinct type with regard to type coercion. The compiler also switches the processor's internal precision to double precision, so all calculations are executed in double precision.
TODO: It would be nice to know when exactly this happens. Is it the kernel or some part of the runtime initialization?
The Microsoft Visual C++ runtime also provides the functions _controlfp and _controlfp_s to switch the x87 FP mode. They are documented in MSDN: _controlfp, _controlfp_s. The latter is available since Visual C++ 2005. In order to set the current precision, use:
// Both are defined here
#include <float.h>

// old version
unsigned int fpu_oldcw;
fpu_oldcw = _controlfp(0, 0); // store old value
_controlfp(precision, _MCW_PC);
// calculation here
_controlfp(fpu_oldcw, _MCW_PC);

// new version
unsigned int fpu_oldcw, fpu_cw;
_controlfp_s(&fpu_cw, 0, 0);
fpu_oldcw = fpu_cw;
_controlfp_s(&fpu_cw, precision, _MCW_PC);
// calculation here
_controlfp_s(&fpu_cw, fpu_oldcw, _MCW_PC);
Since the new version only performs additional range checking on the values, it is OK to use the old version as long as the supplied values are valid.
The precision may be one of:

- _PC_24 (single precision)
- _PC_53 (double precision)
- _PC_64 (double-extended precision)
Please note that the double-extended precision setting is for internal precision only. Values will always be truncated to double precision with Visual C++.
In order to ensure that the optimizer does not make wrong assumptions, please add #pragma fenv_access(on) to the source code. This tells the compiler that the floating point mode may be changed by the source.
Calling conventions: Windows does use FPU registers for return values, but it uses a small trick to ensure that values behave deterministically. For single precision values, the compiler temporarily stores the value in a variable and loads it again - the value is thereby truncated to single precision. (Double precision values don't matter, since the compiler assumes the processor is always in double precision mode.) Consider the following source code:
float divf (float a, float b)
{
    return a / b;
}
This is transformed to:
_divf PROC              ; COMDAT
; Line 36
    fld   DWORD PTR _a$[esp-4]
    fdiv  DWORD PTR _b$[esp-4]
    fstp  DWORD PTR tv130[esp-4]
    fld   DWORD PTR tv130[esp-4]
; Line 37
    ret   0
_divf ENDP
Note: Is it possible to have cl.exe generate SSE code?
Conclusion:

- Single precision: Use the float datatype for all operands. Since single-precision values will always be truncated before use, the calculations will always appear to have been done in single precision. Also, it couldn't hurt to use _controlfp, just in case.
- Double precision: Use the double datatype for all operands. Since the internal precision set by the compiler is double precision, it will be used automatically. However, it cannot hurt to use _controlfp to force the precision, just in case.
- Double-extended precision: Use _controlfp to force the internal precision to double-extended. But be aware that whenever a value is stored back to memory (which may happen arbitrarily, depending on the optimizer), it will be truncated to double precision. Using double-extended precision with Microsoft Visual C++ is not feasible.

This compiler supports double-extended precision (the long double datatype). It sets the processor to (or leaves it in) double-extended internal precision.
MinGW offers access to _controlfp but not _controlfp_s. The semantics are the same as with Visual C++, except that MinGW can actually use double-extended precision.
Calling conventions: Please note that MinGW never truncates to either double or float before returning a value. You can, however, force it to do so by either using the compiler option -ffloat-store or by storing the result in an explicit variable declared volatile and then returning that variable. This will produce the desired code.
Conclusion:

- Single precision: Use the float datatype for all operands and use the compiler option -ffloat-store. If it is not feasible to specify a compiler option, use
  { volatile float result = /* actual result */; return result; }
  instead of
  return /* actual result */;
  Also, it couldn't hurt to use _controlfp, just in case.
- Double precision: Use the double datatype and use _controlfp to set and restore the precision at the beginning of the function (-ffloat-store will not suffice!).
- Double-extended precision: Use the long double datatype. If you haven't touched the internal precision, you don't need to set it explicitly, but it wouldn't hurt, just in case.

Not tested, feedback welcome.
Like the GCC that comes with MinGW, this compiler supports double-extended precision and also uses it as the internal processor precision.
glibc provides a header file fpu_control.h which defines several useful macros to set the precision of the FPU. These macros evaluate to inline assembler code that directly sets the FPU state. Example usage:
#include <fpu_control.h>

fpu_control_t fpu_oldcw, fpu_cw;
_FPU_GETCW(fpu_oldcw); // store old cw
fpu_cw = (fpu_oldcw & ~_FPU_EXTENDED & ~_FPU_DOUBLE & ~_FPU_SINGLE) | precision;
_FPU_SETCW(fpu_cw);
// calculations here
_FPU_SETCW(fpu_oldcw); // restore old cw
The precision may be one of:

- _FPU_SINGLE (single precision)
- _FPU_DOUBLE (double precision)
- _FPU_EXTENDED (double-extended precision)
Calling conventions: In order for GCC to truncate variables to the correct precision before returning them, use the -ffloat-store compiler option or declare an explicit return value variable volatile (as with MinGW). Please note that this alone will not solve the problem described in this article, since the above-mentioned example 0.002877 will result in different doubles even after truncation.
Caution: Since the FPU control macros only expand to inline assembler, the optimizer will sometimes reorder the actual calculation to after the restoring of the FPU control word. Example:
double divcw (double a, double b)
{
    double res;
    fpu_control_t _oldcw, _cw;
    _FPU_GETCW(_oldcw);
    _cw = (_oldcw & ~_FPU_EXTENDED & ~_FPU_SINGLE) | _FPU_DOUBLE;
    _FPU_SETCW(_cw);
    res = a / b;
    _FPU_SETCW(_oldcw);
    return res;
}
Without optimization, this will be expanded to:
divcw:
    # stuff left out for brevity
#APP
    fldcw   -12(%ebp)
#NO_APP
    fldl    -24(%ebp)
    fdivl   -32(%ebp)
    fstpl   -8(%ebp)
#APP
    fldcw   -10(%ebp)
#NO_APP
    fldl    -8(%ebp)
    leave
    ret
    .size   divcw, .-divcw
With optimization (-O2, but -O1 has the same problem) however, this will result in:
divcw:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $16, %esp
    fldl    8(%ebp)
#APP
    fnstcw  -2(%ebp)
#NO_APP
    movzwl  -2(%ebp), %eax
    andl    $-769, %eax
    orl     $512, %eax
    movw    %ax, -4(%ebp)
#APP
    fldcw   -4(%ebp)
    fldcw   -2(%ebp)
#NO_APP
    fdivl   16(%ebp)
    leave
    ret
    .size   divcw, .-divcw
As one can see, the optimizer reordered the instructions. I consider this a bug in GCC. GCC 3.3.6 and 3.4.6 were not affected according to my tests (but they only apply -ffloat-store to explicit variables). I have not yet tested newer versions.
This reordering can be avoided by two means: either use the -ffloat-store compiler option or declare the return variable (here: res) volatile.
Using MMX/SSE: GCC also supports the compiler option -mfpmath=sse. For it to take effect, at least the additional options -msse and -march=XXX (where XXX is e.g. pentium4 or any other arch value that supports SSE) must be supplied. In that case, single and double precision floating point calculations will use the MMX/SSE extensions instead of the FPU, eliminating the need to change the FPU precision, because the MMX/SSE extensions always operate with the same precision as the operands.
Conclusion:

- Single precision: Use the float datatype for all operands and use the compiler option -ffloat-store. If it is not feasible to specify a compiler option, use
  { volatile float result = /* actual result */; return result; }
  instead of
  return /* actual result */;
  Also, it couldn't hurt to use _FPU_SETCW, just in case.
- Double precision: Use the double datatype and use _FPU_SETCW to set and restore the precision at the beginning of the function (-ffloat-store will not suffice!). Please note that in addition to _FPU_SETCW, -ffloat-store or the volatile trick (just with double instead of float) must be used, or else the compiler will apply the wrong optimization.
- Double-extended precision: Use the long double datatype. If you haven't touched the internal precision, you don't need to set it explicitly, but it wouldn't hurt, just in case.

Not tested, feedback welcome.
With FreeBSD, the x87 FPU is in double precision mode by default. It does support the double-extended datatype (long double), but calculations with that datatype will by default still be executed in double precision only.
FreeBSD also provides several macros to change the processor's floating point mode:
#include <machine/ieeefp.h>

fp_prec_t fpu_oldprec;
fpu_oldprec = fpgetprec();
fpsetprec(precision);
// calculations here
fpsetprec(fpu_oldprec);
Valid precision values are:

- FP_PS (single precision)
- FP_PD (double precision)
- FP_PE (double-extended precision)
Calling conventions: The compiler does not generally store the result of an operation to memory and back into the registers to truncate it. -ffloat-store itself is only interpreted for actual variables, not for expressions (i.e. double result = a / b; return result; and return a / b; will result in different code with -ffloat-store active - even with the optimizer enabled).
Conclusion:

- Single precision: Use the float datatype. Also use a temporary floating point variable to store the result:
  { volatile float result = /* actual result */; return result; }
  instead of
  return /* actual result */;
  Using -ffloat-store would require an additional variable anyway to have any effect, so simply declaring a new volatile variable suffices. Using fpsetprec couldn't hurt either.
- Double precision: Use the double datatype. Since the default internal precision is already double, nothing needs to be changed. Using fpsetprec and a volatile return variable couldn't hurt, though.
- Double-extended precision: Use the long double datatype. Also be sure to use fpsetprec to ensure the correct internal precision.

Not tested, feedback welcome.
It does not appear to be possible to switch to double precision on OpenSolaris on Intel platforms without resorting to inline assembler or relying on GCC and the SSE extensions (and OpenSolaris also runs on non-SSE x86 processors, so that is not portable).
The compiler supports double-extended precision and uses it as the default floating point precision on the platform. There seems to be no way to explicitly change the FP precision setting without resorting to inline assembler (there are some macros / functions defined in sys/ieeefp.h, but the interesting part is enclosed in an #ifdef _KERNEL). The only possibilities appear to be to use the MMX/SSE instructions (-mfpmath=sse -msse -march=pentium4 or similar) or to use inline assembly to set the internal precision.
Calling conventions: The compiler does not generally store the result of an operation to memory and back into the registers to truncate it. -ffloat-store itself is only interpreted for actual variables, not for expressions (i.e. double result = a / b; return result; and return a / b; will result in different code with -ffloat-store active - even with the optimizer enabled).
Conclusion:

- Single precision: Use the float datatype. Also use a temporary floating point variable to store the result:
  { volatile float result = /* actual result */; return result; }
  instead of
  return /* actual result */;
  Using -ffloat-store would require an additional variable anyway to have any effect, so simply declaring a new volatile variable suffices. Using inline assembly to set the precision manually couldn't hurt either.
- Double precision: Use the double datatype and use inline assembly to manually set the precision.
- Double-extended precision: Use the long double datatype. Using inline assembly to force the precision couldn't hurt either.

The compiler behaves more or less identically to GCC on OpenSolaris; there are only two differences:
Conclusion:

- Single precision: Use the float datatype for all operands. Since the compiler will always truncate results, the semantics will work correctly.

Not tested, feedback welcome.
This section is based on tests that other people did for me.
The FPU mode is double-extended by default and that datatype is also supported. But since Mac OS X runs only on x86 platforms that already support MMX/SSE, the compiler will always use MMX/SSE for single and double precision floating point calculations. Only double-extended precision floating point calculations will be done with the x87 FPU.
This is actually the most desirable behaviour, since the precision is then always determined by the largest operand type in C.
Because of this, Mac OS X does not provide any C wrapper macros to change the internal precision setting of the x87 FPU; it is simply not necessary. Should this really be wanted, inline assembler would probably work - I have not tested this, however.
Not tested: See what happens if -mfpmath=i387 is used as a compiler option.
Conclusion:
Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision, double for double precision and long double for double-extended precision.
Not tested, feedback welcome.
GCC on 64-bit x86 Linux behaves the same as GCC on Mac OS X: the FPU mode is double-extended by default, but since SSE is supported by all x86_64 processors, SSE is always used for single and double precision floating point calculations, and thus the largest operand type in C determines the calculation precision.
However, it is possible to use the -mfpmath=i387 switch. In that case, glibc also provides fpu_control.h with the same macros as in the x86 case.
Conclusion:
Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision, double for double precision and long double for double-extended precision.
Scott MacVicar did some tests on the Power4 platform with GCC 4.0.1, which yielded the expected result that the compiler generates different instructions for different datatypes, thus using the correct precision. An interesting detail: quad precision calculations are done via software emulation by default, perhaps for compatibility with older processor revisions.
Conclusion:
Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision, double for double precision and long double for quad precision.
Johannes Schlüter did some tests on the UltraSPARC IIIi platform with both GCC 3.4.3 and Sun's CC, which yielded the expected result that both compilers generate different instructions for different datatypes, thus using the correct precision. An interesting detail: quad precision calculations are done via software emulation by default (by both (!) compilers), perhaps for compatibility with older processor revisions.
Conclusion:
Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision, double for double precision and long double for quad precision.
A friend of mine ran the test suite below on a Nokia N810 with Linux installed. Since the N810 does not possess an FPU, all FPU operations are emulated via function calls.
The emulation uses the same precision as the largest operand type. long double is treated the same as double, i.e. there is no extended precision.
Conclusion:
Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision and double for double precision.
The best solution would probably be to use the MPFR library (or something similar). That has its drawbacks, though: it is slower than native FPU arithmetic and it requires memory allocation on the heap. Also, having such a big library as a requirement for one's own software may not always be the optimal choice. Thus, a solution using the native FP unit will be built here.
Using single or double precision for floating point calculations seems to be the only portable alternative short of using an entire library: double-extended precision is not supported on PPC and SPARC, quad precision is not supported on x86, and double-extended precision is also not supported by Microsoft's Visual C++ compiler.
Since there is probably no good reason to restrict calculations to single precision (lower precision means less accuracy), this document will focus on using double precision portably.
In order to achieve this, the following steps should be taken:

- Always use the double datatype for all operands.
- Switch the FPU's internal precision to double precision where necessary (and possible).

The following macros provide platform-independent support for saving, switching and restoring the FPU precision. If a platform does not support this, or it is not necessary, they simply expand to no-ops. XPFPA stands for "cross platform floating point arithmetic".
Example usage:
#include "xpfpa.h"

double somefn (double a)
{
    XPFPA_DECLARE()

    double localvar1;

    XPFPA_SWITCH_DOUBLE()

    // do some calculation
    localvar1 = a * a;

    // restore state temporarily, call an external library function
    XPFPA_RESTORE()
    localvar1 = external_lib_solve_linear_equation(localvar1);
    XPFPA_SWITCH_DOUBLE()

    // do some more calculations
    localvar1 = localvar1 / 2.0;

    // return the value
    XPFPA_RETURN_DOUBLE(localvar1);
}
Please be sure to always use the correct datatype for the calculations, since several architectures do not possess an internal precision but rather determine the precision according to the datatypes of the operands!
The header file xpfpa.h defines the above macros on systems where the internal precision can be changed. On other systems, the macros expand to nothing and we assume that the compiler chooses the correct precision based on the operand types. The file can also be downloaded in an archive: xpfpa.tar.gz
In order to set the correct defines for the macros, I have written an m4 file that may be included in your configure.in. Its name is float_precision.m4 and it is available in the archive autoconf-macros.tar.gz. Usage:
dnl ... top of configure.in or acinclude.m4 or whatever:
sinclude(float_precision.m4)

dnl ... below:
CHECK_FLOAT_PRECISION
It will set one or more of the following variables; please make sure your config.h.in contains them:

- HAVE__CONTROLFP
- HAVE__CONTROLFP_S
- HAVE__FPU_SETCW
- HAVE_FPSETPREC
- HAVE_FPU_INLINE_ASM_X86
I have also provided a macro for CMake that may be included in your CMakeLists.txt. Its name is CheckFloatPrecision.cmake and it is available in the archive cmake-macros.tar.gz. Usage:
include(CheckFloatPrecision.cmake)
# or, if in module path
include(CheckFloatPrecision)

check_float_precision()
It will set one or more of the following variables; please make sure your config.h.in contains them:

- HAVE__CONTROLFP
- HAVE__CONTROLFP_S
- HAVE__FPU_SETCW
- HAVE_FPSETPREC
- HAVE_FPU_INLINE_ASM_X86
I have started to create a simple test suite that performs some checks on the default floating point behaviour. It will only run on POSIX-like platforms with a C compiler that supports GCC-style options (e.g. GCC itself, ICC, the Sun C compiler, ...).
It can be downloaded here: fp-testsuite.tar.bz2 / fp-testsuite.tar.gz.
If you have a new platform / new compiler that I haven't discussed yet, please run the following (in the directory where you unpacked the archive):
CC=/path/to/your/compiler ./fptest.sh | tee messages.txt
Then please send me messages.txt and results.tar.gz.
Author: Christian Seiler, Last change: 2008-10-28