<HTML>
|
|
|
|
<HEAD>
|
|
<TITLE>Berkeley SoftFloat Library Interface</TITLE>
|
|
</HEAD>
|
|
|
|
<BODY>
|
|
|
|
<H1>Berkeley SoftFloat Release 3: Library Interface</H1>
|
|
|
|
<P>
|
|
John R. Hauser<BR>
|
|
2015 February 16<BR>
|
|
</P>
|
|
|
|
|
|
<H2>Contents</H2>
|
|
|
|
<BLOCKQUOTE>
|
|
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
|
|
<COL WIDTH=25>
|
|
<COL WIDTH=*>
|
|
<TR><TD COLSPAN=2>1. Introduction</TD></TR>
|
|
<TR><TD COLSPAN=2>2. Limitations</TD></TR>
|
|
<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
|
|
<TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
|
|
<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
|
|
<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
|
|
<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
|
|
<TR>
|
|
<TD></TD>
|
|
<TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
|
|
</TR>
|
|
<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
|
|
<TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
|
|
<TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
|
|
<TR><TD></TD><TD>6.1. Rounding Mode</TD></TR>
|
|
<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
|
|
<TR>
|
|
<TD></TD>
|
|
<TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
|
|
</TR>
|
|
<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
|
|
<TR><TD COLSPAN=2>8. Function Details</TD></TR>
|
|
<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
|
|
<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
|
|
<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
|
|
<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
|
|
<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
|
|
<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
|
|
<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
|
|
<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
|
|
<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
|
|
<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
|
|
<TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
|
|
<TR><TD></TD><TD>9.1. Name Changes</TD></TR>
|
|
<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
|
|
<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
|
|
<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
|
|
<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
|
|
<TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR>
|
|
<TR><TD COLSPAN=2>10. Future Directions</TD></TR>
|
|
<TR><TD COLSPAN=2>11. Contact Information</TD></TR>
|
|
</TABLE>
|
|
</BLOCKQUOTE>
|
|
|
|
|
|
<H2>1. Introduction</H2>
|
|
|
|
<P>
|
|
Berkeley SoftFloat is a software implementation of binary floating-point that
|
|
conforms to the IEEE Standard for Floating-Point Arithmetic.
|
|
The current release supports four binary formats: <NOBR>32-bit</NOBR>
|
|
single-precision, <NOBR>64-bit</NOBR> double-precision, <NOBR>80-bit</NOBR>
|
|
double-extended-precision, and <NOBR>128-bit</NOBR> quadruple-precision.
|
|
The following functions are supported for each format:
|
|
<UL>
|
|
<LI>
|
|
addition, subtraction, multiplication, division, and square root;
|
|
<LI>
|
|
fused multiply-add as defined by the IEEE Standard, except for
|
|
<NOBR>80-bit</NOBR> double-extended-precision;
|
|
<LI>
|
|
remainder as defined by the IEEE Standard;
|
|
<LI>
|
|
round to integral value;
|
|
<LI>
|
|
comparisons;
|
|
<LI>
|
|
conversions to/from other supported formats; and
|
|
<LI>
|
|
conversions to/from <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers,
|
|
signed and unsigned.
|
|
</UL>
|
|
All operations required by the original 1985 version of the IEEE Floating-Point
|
|
Standard are implemented, except for conversions to and from decimal.
|
|
</P>
|
|
|
|
<P>
|
|
This document gives information about the types defined and the routines
|
|
implemented by SoftFloat.
|
|
It does not attempt to define or explain the IEEE Floating-Point Standard.
|
|
Information about the standard is available elsewhere.
|
|
</P>
|
|
|
|
<P>
|
|
The current version of SoftFloat is <NOBR>Release 3</NOBR>.
|
|
The functional interface of this version differs in many details from that of
|
|
earlier SoftFloat releases.
|
|
For specifics of these differences, see <NOBR>section 9</NOBR> below,
|
|
<I>Changes from SoftFloat <NOBR>Release 2</NOBR></I>.
|
|
</P>
|
|
|
|
|
|
<H2>2. Limitations</H2>
|
|
|
|
<P>
|
|
SoftFloat assumes the computer has an addressable byte size of 8 or
|
|
<NOBR>16 bits</NOBR>.
|
|
(Nearly all computers in use today have <NOBR>8-bit</NOBR> bytes.)
|
|
</P>
|
|
|
|
<P>
|
|
SoftFloat is written in C and is designed to work with other C code.
|
|
The C compiler used must conform at a minimum to the 1989 ANSI standard for the
|
|
C language (same as the 1990 ISO standard) and must in addition support basic
|
|
arithmetic on <NOBR>64-bit</NOBR> integers.
|
|
Earlier releases of SoftFloat included implementations of <NOBR>32-bit</NOBR>
|
|
single-precision and <NOBR>64-bit</NOBR> double-precision floating-point that
|
|
did not require <NOBR>64-bit</NOBR> integers, but this option is not supported
|
|
with <NOBR>Release 3</NOBR>.
|
|
Since 1999, ISO standards for C have mandated compiler support for
|
|
<NOBR>64-bit</NOBR> integers.
|
|
A compiler conforming to the 1999 C Standard or later is recommended but not
|
|
strictly required.
|
|
</P>
|
|
|
|
<P>
|
|
Most operations not required by the original 1985 version of the IEEE
|
|
Floating-Point Standard but added in the 2008 version are not yet supported in
|
|
SoftFloat <NOBR>Release 3</NOBR>.
|
|
</P>
|
|
|
|
|
|
<H2>3. Acknowledgments and License</H2>
|
|
|
|
<P>
|
|
The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
|
|
<NOBR>Release 3</NOBR> of SoftFloat is a completely new implementation
|
|
supplanting earlier releases.
|
|
This project (<NOBR>Release 3</NOBR> only, not earlier releases) was done in
|
|
the employ of the University of California, Berkeley, within the Department of
|
|
Electrical Engineering and Computer Sciences, first for the Parallel Computing
|
|
Laboratory (Par Lab) and then for the ASPIRE Lab.
|
|
The work was officially overseen by Prof. Krste Asanovic, with funding provided
|
|
by these sources:
|
|
<BLOCKQUOTE>
|
|
<TABLE>
|
|
<COL>
|
|
<COL WIDTH=10>
|
|
<COL>
|
|
<TR>
|
|
<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
|
|
<TD></TD>
|
|
<TD>
|
|
Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
|
|
(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
|
|
NVIDIA, Oracle, and Samsung.
|
|
</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
|
|
<TD></TD>
|
|
<TD>
|
|
DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
|
|
ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
|
|
Oracle, and Samsung.
|
|
</TD>
|
|
</TR>
|
|
</TABLE>
|
|
</BLOCKQUOTE>
|
|
</P>
|
|
|
|
<P>
|
|
The following applies to the whole of SoftFloat <NOBR>Release 3</NOBR> as well
|
|
as to each source file individually.
|
|
</P>
|
|
|
|
<P>
|
|
Copyright 2011, 2012, 2013, 2014, 2015 The Regents of the University of
|
|
California (Regents).
|
|
All Rights Reserved.
|
|
Redistribution and use in source and binary forms, with or without
|
|
modification, are permitted provided that the following conditions are met:
|
|
</P>
|
|
|
|
<P>
|
|
Redistributions of source code must retain the above copyright notice, this
|
|
list of conditions, and the following two paragraphs of disclaimer.
|
|
Redistributions in binary form must reproduce the above copyright notice, this
|
|
list of conditions, and the following two paragraphs of disclaimer in the
|
|
documentation and/or other materials provided with the distribution.
|
|
Neither the name of the Regents nor the names of its contributors may be used
|
|
to endorse or promote products derived from this software without specific
|
|
prior written permission.
|
|
</P>
|
|
|
|
<P>
|
|
IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
|
|
INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF
|
|
THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN
|
|
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
</P>
|
|
|
|
<P>
|
|
REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
|
|
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
|
|
THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS
|
|
PROVIDED "<NOBR>AS IS</NOBR>".
|
|
REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES,
|
|
ENHANCEMENTS, OR MODIFICATIONS.
|
|
</P>
|
|
|
|
|
|
<H2>4. Types and Functions</H2>
|
|
|
|
<P>
|
|
The types and functions of SoftFloat are declared in header file
|
|
<CODE>softfloat.h</CODE>.
|
|
</P>
|
|
|
|
<H3>4.1. Boolean and Integer Types</H3>
|
|
|
|
<P>
|
|
Header file <CODE>softfloat.h</CODE> depends on standard headers
|
|
<CODE>&lt;stdbool.h&gt;</CODE> and <CODE>&lt;stdint.h&gt;</CODE> to define type
|
|
<CODE>bool</CODE> and several integer types.
|
|
These standard headers have been part of the ISO C Standard Library since 1999.
|
|
With any recent compiler, they are likely to be supported, even if the compiler
|
|
does not claim complete conformance to the ISO C Standard.
|
|
For older or nonstandard compilers, a port of SoftFloat may have substitutes
|
|
for these headers.
|
|
Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
|
|
<CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
<CODE>&lt;stdint.h&gt;</CODE>:
|
|
<BLOCKQUOTE>
|
|
<PRE>
uint16_t
uint32_t
uint64_t
int32_t
int64_t
uint_fast8_t
uint_fast32_t
uint_fast64_t
</PRE>
|
|
</BLOCKQUOTE>
|
|
</P>
|
|
|
|
|
|
<H3>4.2. Floating-Point Types</H3>
|
|
|
|
<P>
|
|
The <CODE>softfloat.h</CODE> header defines four floating-point types:
|
|
<BLOCKQUOTE>
|
|
<TABLE CELLSPACING=0 CELLPADDING=0>
|
|
<TR>
|
|
<TD><CODE>float32_t</CODE></TD>
|
|
<TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>float64_t</CODE></TD>
|
|
<TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>extFloat80_t </CODE></TD>
|
|
<TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
|
|
Motorola format)</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>float128_t</CODE></TD>
|
|
<TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
|
|
</TR>
|
|
</TABLE>
|
|
</BLOCKQUOTE>
|
|
The non-extended types are each exactly the size specified:
|
|
<NOBR>32 bits</NOBR> for <CODE>float32_t</CODE>, <NOBR>64 bits</NOBR> for
|
|
<CODE>float64_t</CODE>, and <NOBR>128 bits</NOBR> for <CODE>float128_t</CODE>.
|
|
Aside from these size requirements, the definitions of all these types may
|
|
differ for different ports of SoftFloat to specific systems.
|
|
A given port of SoftFloat may or may not define some of the floating-point
|
|
types as aliases for the C standard types <CODE>float</CODE>,
|
|
<CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>.
|
|
</P>
|
|
|
|
<P>
|
|
Header file <CODE>softfloat.h</CODE> also defines a structure,
|
|
<CODE>struct</CODE> <CODE>extFloat80M</CODE>, for the representation of
|
|
<NOBR>80-bit</NOBR> double-extended-precision floating-point values in memory.
|
|
This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
|
|
at least these two fields (not necessarily in this order):
|
|
<BLOCKQUOTE>
|
|
<PRE>
uint16_t signExp;
uint64_t signif;
</PRE>
|
|
</BLOCKQUOTE>
|
|
Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
|
|
value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
|
|
encoded exponent in the other <NOBR>15 bits</NOBR>.
|
|
Field <CODE>signif</CODE> is the complete <NOBR>64-bit</NOBR> significand of
|
|
the floating-point value.
|
|
(In the usual encoding for <NOBR>80-bit</NOBR> extended floating-point, the
|
|
leading <NOBR>1 bit</NOBR> of normalized numbers is not implicit but is stored
|
|
in the most significant bit of the significand.)
|
|
</P>
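<P>
The field layout described above can be illustrated with a short sketch.
The structure <CODE>demo_ext80_fields</CODE> and the three helper functions are
hypothetical stand-ins written for this example only; they are not part of
<CODE>softfloat.h</CODE>, and the real <CODE>struct</CODE>
<CODE>extFloat80M</CODE> may order or pad its fields differently.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in holding the two documented fields.  The real
   struct extFloat80M may order its fields differently. */
struct demo_ext80_fields {
    uint16_t signExp;
    uint64_t signif;
};

/* Sign is bit 15 of signExp. */
bool demo_ext80_sign( const struct demo_ext80_fields *p )
{
    return ( p->signExp >> 15 ) != 0;
}

/* Encoded (biased) exponent is the low 15 bits of signExp. */
uint16_t demo_ext80_exp( const struct demo_ext80_fields *p )
{
    return p->signExp & 0x7FFF;
}

/* The explicit leading significand bit is bit 63 of signif. */
bool demo_ext80_leading_bit( const struct demo_ext80_fields *p )
{
    return ( p->signif >> 63 ) != 0;
}
```

<P>
For example, the value &minus;1 is encoded with <CODE>signExp</CODE> equal to
<CODE>0xBFFF</CODE> and <CODE>signif</CODE> equal to
<CODE>0x8000000000000000</CODE>, so the helpers report a set sign bit, an
encoded exponent of <CODE>0x3FFF</CODE>, and a leading significand bit
<NOBR>of 1</NOBR>.
</P>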
|
|
|
|
<H3>4.3. Supported Floating-Point Functions</H3>
|
|
|
|
<P>
|
|
SoftFloat implements these arithmetic operations for its floating-point types:
|
|
<UL>
|
|
<LI>
|
|
conversions between any two floating-point formats;
|
|
<LI>
|
|
for each floating-point format, conversions to and from signed and unsigned
|
|
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers;
|
|
<LI>
|
|
for each format, the usual addition, subtraction, multiplication, division, and
|
|
square root operations;
|
|
<LI>
|
|
for each format except <CODE>extFloat80_t</CODE>, the fused multiply-add
|
|
operation defined by the IEEE Standard;
|
|
<LI>
|
|
for each format, the floating-point remainder operation defined by the IEEE
|
|
Standard;
|
|
<LI>
|
|
for each format, a “round to integer” operation that rounds to the
|
|
nearest integer value in the same format; and
|
|
<LI>
|
|
comparisons between two values in the same floating-point format.
|
|
</UL>
|
|
</P>
|
|
|
|
<P>
|
|
The following operations required by the 2008 IEEE Floating-Point Standard are
|
|
not supported in SoftFloat <NOBR>Release 3</NOBR>:
|
|
<UL>
|
|
<LI>
|
|
<B>nextUp</B>, <B>nextDown</B>, <B>minNum</B>, <B>maxNum</B>, <B>minNumMag</B>,
|
|
<B>maxNumMag</B>, <B>scaleB</B>, and <B>logB</B>;
|
|
<LI>
|
|
conversions between floating-point formats and decimal or hexadecimal character
|
|
sequences;
|
|
<LI>
|
|
all “quiet-computation” operations (<B>copy</B>, <B>negate</B>,
|
|
<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
|
|
manipulation of the floating-point sign bit); and
|
|
<LI>
|
|
all “non-computational” operations other than <B>isSignaling</B>
|
|
(which is supported).
|
|
</UL>
|
|
</P>
|
|
|
|
<H3>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></H3>
|
|
|
|
<P>
|
|
Because the <NOBR>80-bit</NOBR> double-extended-precision format,
|
|
<CODE>extFloat80_t</CODE>, stores an explicit leading significand bit, many
|
|
floating-point numbers are encodable in this type in equivalent normalized and
|
|
denormalized forms.
|
|
Zeros and values in the subnormal range have each only a single possible
|
|
encoding, for which the leading significand bit must <NOBR>be 0</NOBR>.
|
|
For other finite values (outside the subnormal range), a unique normalized
|
|
representation, with leading significand bit set <NOBR>to 1</NOBR>, always
|
|
exists, and is considered the <I>canonical</I> representation of the value.
|
|
Any equivalent denormalized representations (having leading significand bit
|
|
<NOBR>of 0</NOBR>) are <I>non-canonical</I>.
|
|
Similarly, the leading significand bit is expected to <NOBR>be 1</NOBR> for
|
|
infinities and NaNs as well;
|
|
any infinity or NaN with a leading significand bit <NOBR>of 0</NOBR> is again
|
|
considered non-canonical.
|
|
In short, for an <CODE>extFloat80_t</CODE> representation to be canonical, the
|
|
leading significand bit must <NOBR>be 1</NOBR> unless it is required to
|
|
<NOBR>be 0</NOBR> because the encoded value is zero or a subnormal.
|
|
</P>
|
|
|
|
<P>
|
|
For <NOBR>Release 3</NOBR> of SoftFloat, functions are not guaranteed to
|
|
operate as expected when inputs of type <CODE>extFloat80_t</CODE> are
|
|
non-canonical.
|
|
Assuming all of a function’s <CODE>extFloat80_t</CODE> inputs (if any)
|
|
are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
|
|
be canonical.
|
|
</P>
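<P>
The rule summarized above might be checked along the following lines.
This is a minimal sketch, not code from SoftFloat: the function name
<CODE>demo_ext80_is_canonical</CODE> is hypothetical, and the test is keyed on
the encoded exponent field.
The text does not spell out the case of a zero exponent field combined with a
leading significand bit <NOBR>of 1</NOBR>; the sketch conservatively reports
that case as non-canonical, which is an assumption.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a SoftFloat function.  Returns true when the
   leading significand bit agrees with the encoded exponent field. */
bool demo_ext80_is_canonical( uint16_t signExp, uint64_t signif )
{
    uint16_t exp = signExp & 0x7FFF;        /* drop the sign bit */
    bool leadingBit = ( signif >> 63 ) != 0;

    if ( exp == 0 ) {
        /* Zero or subnormal: the leading bit must be 0.  Treating a set
           leading bit here as non-canonical is an assumption. */
        return ! leadingBit;
    }
    /* Other finite values, infinities, and NaNs: the leading bit must
       be 1. */
    return leadingBit;
}
```

<P>
A caller that builds <CODE>extFloat80_t</CODE> values from raw bits could apply
such a check before passing them to SoftFloat functions.
</P>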
|
|
|
|
<H3>4.5. Conventions for Passing Arguments and Results</H3>
|
|
|
|
<P>
|
|
Values that are at most <NOBR>64 bits</NOBR> in size (i.e., not the
|
|
<NOBR>80-bit</NOBR> or <NOBR>128-bit</NOBR> floating-point formats) are in all
|
|
cases passed as function arguments by value.
|
|
Likewise, when an output of a function is no more than <NOBR>64 bits</NOBR>, it
|
|
is always returned directly as the function result.
|
|
Thus, for example, the SoftFloat function for adding two <NOBR>64-bit</NOBR>
|
|
floating-point values has this simple signature:
|
|
<BLOCKQUOTE>
|
|
<CODE>float64_t f64_add( float64_t, float64_t );</CODE>
|
|
</BLOCKQUOTE>
|
|
</P>
|
|
|
|
<P>
|
|
The story is more complex when function inputs and outputs are
|
|
<NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point.
|
|
For these types, SoftFloat always provides a function that passes these larger
|
|
values into or out of the function indirectly, via pointers.
|
|
For example, for adding two <NOBR>128-bit</NOBR> floating-point values,
|
|
SoftFloat supplies this function:
|
|
<BLOCKQUOTE>
|
|
<CODE>void f128M_add( const float128_t *, const float128_t *, float128_t * );</CODE>
|
|
</BLOCKQUOTE>
|
|
The first two arguments point to the values to be added, and the last argument
|
|
points to the location where the sum will be stored.
|
|
The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
|
|
that the <NOBR>128-bit</NOBR> inputs and outputs are “in memory”,
|
|
pointed to by pointer arguments.
|
|
</P>
|
|
|
|
<P>
|
|
All ports of SoftFloat implement these <I>pass-by-pointer</I> functions for
|
|
types <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE>.
|
|
At the same time, SoftFloat ports may also implement alternate versions of
|
|
these same functions that pass <CODE>extFloat80_t</CODE> and
|
|
<CODE>float128_t</CODE> by value, like the smaller formats.
|
|
Thus, besides the function with name <CODE>f128M_add</CODE> shown above, a
|
|
SoftFloat port may also supply an equivalent function with this signature:
|
|
<BLOCKQUOTE>
|
|
<CODE>float128_t f128_add( float128_t, float128_t );</CODE>
|
|
</BLOCKQUOTE>
|
|
</P>
|
|
|
|
<P>
|
|
As a general rule, on computers where the machine word size is
|
|
<NOBR>32 bits</NOBR> or smaller, only the pass-by-pointer versions of functions
|
|
(e.g., <CODE>f128M_add</CODE>) are provided for types <CODE>extFloat80_t</CODE>
|
|
and <CODE>float128_t</CODE>, because passing such large types directly can have
|
|
significant extra cost.
|
|
On computers where the word size is <NOBR>64 bits</NOBR> or larger, both
|
|
function versions (<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>) are
|
|
provided, because the cost of passing by value is then more reasonable.
|
|
Applications that must be portable across both classes of computers must use
|
|
the pointer-based functions, as these are always implemented.
|
|
However, if it is known that SoftFloat includes the by-value functions for all
|
|
platforms of interest, programmers can use whichever version they prefer.
|
|
</P>
|
|
|
|
|
|
<H2>5. Reserved Names</H2>
|
|
|
|
<P>
|
|
In addition to the variables and functions documented here, SoftFloat defines
|
|
some symbol names for its own private use.
|
|
These private names always begin with the prefix
|
|
‘<CODE>softfloat_</CODE>’.
|
|
When a program includes header <CODE>softfloat.h</CODE> or links with the
|
|
SoftFloat library, all names with prefix ‘<CODE>softfloat_</CODE>’
|
|
are reserved for possible use by SoftFloat.
|
|
Applications that use SoftFloat should not define their own names with this
|
|
prefix, and should reference only such names as are documented.
|
|
</P>
|
|
|
|
|
|
<H2>6. Mode Variables</H2>
|
|
|
|
<P>
|
|
The following variables control rounding mode, underflow detection, and the
|
|
<NOBR>80-bit</NOBR> extended format’s rounding precision:
|
|
<BLOCKQUOTE>
|
|
<CODE>softfloat_roundingMode</CODE><BR>
|
|
<CODE>softfloat_detectTininess</CODE><BR>
|
|
<CODE>extF80_roundingPrecision</CODE>
|
|
</BLOCKQUOTE>
|
|
These mode variables are covered in the next several subsections.
|
|
</P>
|
|
|
|
<H3>6.1. Rounding Mode</H3>
|
|
|
|
<P>
|
|
All five rounding modes defined by the 2008 IEEE Floating-Point Standard are
|
|
implemented for all operations that require rounding.
|
|
The rounding mode is selected by the global variable
|
|
<BLOCKQUOTE>
|
|
<CODE>uint_fast8_t softfloat_roundingMode;</CODE>
|
|
</BLOCKQUOTE>
|
|
This variable may be set to one of the values
|
|
<BLOCKQUOTE>
|
|
<TABLE CELLSPACING=0 CELLPADDING=0>
|
|
<TR>
|
|
<TD><CODE>softfloat_round_near_even</CODE></TD>
|
|
<TD>round to nearest, with ties to even</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>softfloat_round_near_maxMag </CODE></TD>
|
|
<TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>softfloat_round_minMag</CODE></TD>
|
|
<TD>round to minimum magnitude (toward zero)</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>softfloat_round_min</CODE></TD>
|
|
<TD>round to minimum (down)</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>softfloat_round_max</CODE></TD>
|
|
<TD>round to maximum (up)</TD>
|
|
</TR>
|
|
</TABLE>
|
|
</BLOCKQUOTE>
|
|
Variable <CODE>softfloat_roundingMode</CODE> is initialized to
|
|
<CODE>softfloat_round_near_even</CODE>.
|
|
</P>
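<P>
To make the difference between the five modes concrete, the following sketch
rounds a small fixed-point value (a count of quarters) to an integer under each
mode.
It is an illustration only and does not use SoftFloat: the
<CODE>demo_round_*</CODE> names are hypothetical stand-ins for the
<CODE>softfloat_round_*</CODE> constants, and their numeric values here are
arbitrary.
</P>

```c
#include <stdint.h>

/* Hypothetical stand-ins for the five softfloat_round_* constants. */
enum demo_rounding_mode {
    demo_round_near_even,
    demo_round_near_maxMag,
    demo_round_minMag,
    demo_round_min,
    demo_round_max
};

/* Round the value quarters/4 to an integer under the given mode.
   Intended for small inputs; overflow is not considered. */
int32_t demo_round_quarters( int32_t quarters, enum demo_rounding_mode mode )
{
    /* Floor division: lower <= quarters/4 < lower + 1, rem in 0..3. */
    int32_t lower = quarters / 4;
    int32_t rem = quarters % 4;
    if ( rem < 0 ) { lower -= 1; rem += 4; }
    if ( rem == 0 ) return lower;           /* already an integer */

    int32_t upper = lower + 1;
    switch ( mode ) {
     case demo_round_min:
        return lower;
     case demo_round_max:
        return upper;
     case demo_round_minMag:
        return ( quarters >= 0 ) ? lower : upper;
     case demo_round_near_maxMag:
        if ( rem != 2 ) return ( rem < 2 ) ? lower : upper;
        return ( quarters >= 0 ) ? upper : lower;   /* tie: away from zero */
     case demo_round_near_even:
     default:
        if ( rem != 2 ) return ( rem < 2 ) ? lower : upper;
        return ( lower % 2 == 0 ) ? lower : upper;  /* tie: to even */
    }
}
```

<P>
With an input of 10 quarters (2.5), the two nearest modes differ: ties-to-even
gives 2, while ties-to-maximum-magnitude gives 3.
With &minus;10 quarters (&minus;2.5), rounding to minimum gives &minus;3 and
rounding to minimum magnitude gives &minus;2.
</P>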
|
|
|
|
<H3>6.2. Underflow Detection</H3>
|
|
|
|
<P>
|
|
In the terminology of the IEEE Standard, SoftFloat can detect tininess for
|
|
underflow either before or after rounding.
|
|
The choice is made by the global variable
|
|
<BLOCKQUOTE>
|
|
<CODE>uint_fast8_t softfloat_detectTininess;</CODE>
|
|
</BLOCKQUOTE>
|
|
which can be set to either
|
|
<BLOCKQUOTE>
|
|
<CODE>softfloat_tininess_beforeRounding</CODE><BR>
|
|
<CODE>softfloat_tininess_afterRounding</CODE>
|
|
</BLOCKQUOTE>
|
|
Detecting tininess after rounding is better because it results in fewer
|
|
spurious underflow signals.
|
|
The other option is provided for compatibility with some systems.
|
|
Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
|
|
always detects loss of accuracy for underflow as an inexact result.
|
|
</P>
|
|
|
|
<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>
|
|
|
|
<P>
|
|
For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
|
|
arithmetic operations is controlled by the global variable
|
|
<BLOCKQUOTE>
|
|
<CODE>uint_fast8_t extF80_roundingPrecision;</CODE>
|
|
</BLOCKQUOTE>
|
|
The operations affected are:
|
|
<BLOCKQUOTE>
|
|
<CODE>extF80_add</CODE><BR>
|
|
<CODE>extF80_sub</CODE><BR>
|
|
<CODE>extF80_mul</CODE><BR>
|
|
<CODE>extF80_div</CODE><BR>
|
|
<CODE>extF80_sqrt</CODE>
|
|
</BLOCKQUOTE>
|
|
When <CODE>extF80_roundingPrecision</CODE> is set to its default value of 80,
|
|
these operations are rounded to the full precision of the <NOBR>80-bit</NOBR>
|
|
double-extended-precision format, as occurs for other formats.
|
|
Setting <CODE>extF80_roundingPrecision</CODE> to 32 or to 64 causes the
|
|
operations listed to be rounded to <NOBR>32-bit</NOBR> precision (equivalent to
|
|
<CODE>float32_t</CODE>) or to <NOBR>64-bit</NOBR> precision (equivalent to
|
|
<CODE>float64_t</CODE>), respectively.
|
|
When rounding to reduced precision, additional bits in the result significand
|
|
beyond the rounding point are set to zero.
|
|
The consequences of setting <CODE>extF80_roundingPrecision</CODE> to a value
|
|
other than 32, 64, or 80 are not specified.
|
|
Operations other than the ones listed above are not affected by
|
|
<CODE>extF80_roundingPrecision</CODE>.
|
|
</P>
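<P>
The effect of reduced rounding precision on the <NOBR>64-bit</NOBR> significand
can be sketched as follows.
This is an illustration of the behavior described above, not SoftFloat's own
code.
The function name <CODE>demo_round_signif</CODE> is hypothetical, only
round-to-nearest-even is shown, and the sketch assumes that the
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> precisions correspond to
significands of 24 and <NOBR>53 bits</NOBR>, as in the IEEE single-precision
and double-precision formats.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: round a 64-bit significand to keepBits bits
   (24 or 53) using round-to-nearest-even, zeroing the discarded bits.
   *carryOut is set when rounding overflows the significand, in which
   case the caller would also need to increment the exponent. */
uint64_t demo_round_signif( uint64_t signif, unsigned keepBits, bool *carryOut )
{
    unsigned dropBits = 64 - keepBits;          /* valid for 1..63 */
    uint64_t unit = (uint64_t) 1 << dropBits;   /* weight of last kept bit */
    uint64_t mask = unit - 1;
    uint64_t half = unit >> 1;
    uint64_t frac = signif & mask;              /* bits being discarded */
    uint64_t kept = signif & ~mask;             /* low bits set to zero */

    *carryOut = false;
    if ( frac > half || ( frac == half && ( kept & unit ) != 0 ) ) {
        uint64_t sum = kept + unit;
        if ( sum < kept ) {
            /* Wrapped past 2^64: renormalize to a leading 1 bit. */
            *carryOut = true;
            sum = UINT64_C( 0x8000000000000000 );
        }
        kept = sum;
    }
    return kept;
}
```

<P>
For instance, rounding the significand <CODE>0x8000018000000000</CODE> to
<NOBR>24 bits</NOBR> is a tie that resolves upward to
<CODE>0x8000020000000000</CODE>, with all <NOBR>40 low-order</NOBR> bits zero.
</P>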
|
|
|
|
|
|
<H2>7. Exceptions and Exception Flags</H2>
|
|
|
|
<P>
|
|
All five exception flags required by the IEEE Floating-Point Standard are
|
|
implemented.
|
|
Each flag is stored as a separate bit in the global variable
|
|
<BLOCKQUOTE>
|
|
<CODE>uint_fast8_t softfloat_exceptionFlags;</CODE>
|
|
</BLOCKQUOTE>
|
|
The positions of the exception flag bits within this variable are determined by
|
|
the bit masks
|
|
<BLOCKQUOTE>
|
|
<CODE>softfloat_flag_inexact</CODE><BR>
|
|
<CODE>softfloat_flag_underflow</CODE><BR>
|
|
<CODE>softfloat_flag_overflow</CODE><BR>
|
|
<CODE>softfloat_flag_infinite</CODE><BR>
|
|
<CODE>softfloat_flag_invalid</CODE>
|
|
</BLOCKQUOTE>
|
|
Variable <CODE>softfloat_exceptionFlags</CODE> is initialized to all zeros,
|
|
meaning no exceptions.
|
|
</P>
|
|
|
|
<P>
|
|
An individual exception flag can be cleared with the statement
|
|
<BLOCKQUOTE>
|
|
<CODE>softfloat_exceptionFlags &amp;= ~softfloat_flag_&lt;<I>exception</I>&gt;;</CODE>
</BLOCKQUOTE>
where <CODE>&lt;<I>exception</I>&gt;</CODE> is the appropriate name.
|
|
To raise a floating-point exception, function <CODE>softfloat_raise</CODE>
|
|
should normally be used.
|
|
</P>
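<P>
A common pattern is to test a flag and clear it in one step.
The sketch below shows the bit manipulation involved, using local stand-ins so
that it can be compiled without SoftFloat: <CODE>demo_exceptionFlags</CODE>
plays the role of <CODE>softfloat_exceptionFlags</CODE>, and the
<CODE>demo_flag_*</CODE> masks are hypothetical values chosen for this example,
not necessarily the values defined in <CODE>softfloat.h</CODE>.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask values; real code would use the softfloat_flag_*
   names from softfloat.h instead. */
enum {
    demo_flag_inexact   = 1,
    demo_flag_underflow = 2,
    demo_flag_overflow  = 4,
    demo_flag_infinite  = 8,
    demo_flag_invalid   = 16
};

/* Stand-in for the global variable softfloat_exceptionFlags. */
uint_fast8_t demo_exceptionFlags = 0;

/* Report whether any flag in mask is set, then clear those flags,
   leaving all other flags untouched. */
bool demo_test_and_clear( uint_fast8_t mask )
{
    bool wasSet = ( demo_exceptionFlags & mask ) != 0;
    demo_exceptionFlags &= ~mask;
    return wasSet;
}
```

<P>
Because each exception has its own bit, clearing one flag in this way does not
disturb the others.
</P>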
|
|
|
|
<P>
|
|
When SoftFloat detects an exception other than <I>inexact</I>, it calls
|
|
<CODE>softfloat_raise</CODE>.
|
|
The default version of this function simply raises the corresponding exception
|
|
flags.
|
|
Particular ports of SoftFloat may support alternate behavior, such as exception
|
|
traps, by modifying the default <CODE>softfloat_raise</CODE>.
|
|
A program may also supply its own <CODE>softfloat_raise</CODE> function to
|
|
override the one from the SoftFloat library.
|
|
</P>
|
|
|
|
<P>
|
|
Because inexact results occur frequently under most circumstances (and thus are
|
|
hardly exceptional), SoftFloat does not ordinarily call
|
|
<CODE>softfloat_raise</CODE> for <I>inexact</I> exceptions.
|
|
It does always raise the <I>inexact</I> exception flag as required.
|
|
</P>
|
|
|
|
|
|
<H2>8. Function Details</H2>
|
|
|
|
<P>
|
|
In this section, <CODE>&lt;<I>float</I>&gt;</CODE> appears in function names as
|
|
a substitute for one of these abbreviations:
|
|
<BLOCKQUOTE>
|
|
<TABLE CELLSPACING=0 CELLPADDING=0>
|
|
<TR>
|
|
<TD><CODE>f32</CODE></TD>
|
|
<TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>f64</CODE></TD>
|
|
<TD>indicates <CODE>float64_t</CODE>, passed by value</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>extF80M </CODE></TD>
|
|
<TD>indicates <CODE>extFloat80_t</CODE>, passed indirectly via pointers</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>extF80</CODE></TD>
|
|
<TD>indicates <CODE>extFloat80_t</CODE>, passed by value</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>f128M</CODE></TD>
|
|
<TD>indicates <CODE>float128_t</CODE>, passed indirectly via pointers</TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><CODE>f128</CODE></TD>
|
|
<TD>indicates <CODE>float128_t</CODE>, passed by value</TD>
|
|
</TR>
|
|
</TABLE>
|
|
</BLOCKQUOTE>
|
|
The circumstances under which values of floating-point types
|
|
<CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE> may be passed either by
|
|
value or indirectly via pointers were discussed earlier in
|
|
<NOBR>section 4.5</NOBR>, <I>Conventions for Passing Arguments and Results</I>.
|
|
</P>
|
|
|
|
<H3>8.1. Conversions from Integer to Floating-Point</H3>
|
|
|
|
<P>
|
|
All conversions from a <NOBR>32-bit</NOBR> or <NOBR>64-bit</NOBR> integer,
|
|
signed or unsigned, to a floating-point format are supported.
|
|
Functions performing these conversions have these names:
|
|
<BLOCKQUOTE>
|
|
<CODE>ui32_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>ui64_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>i32_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>i64_to_&lt;<I>float</I>&gt;</CODE>
|
|
</BLOCKQUOTE>
|
|
Conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
|
|
double-precision and larger formats are always exact, and likewise conversions
|
|
from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
|
|
double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision are also
|
|
always exact.
|
|
</P>
|
|
|
|
<P>
|
|
Each conversion function takes one input of the appropriate type and generates
|
|
one output.
|
|
The following illustrates the signatures of these functions in cases when the
|
|
floating-point result is passed either by value or via pointers:
|
|
<BLOCKQUOTE>
|
|
<PRE>
float64_t i32_to_f64( int32_t <I>a</I> );
</PRE>
<PRE>
void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
|
|
</BLOCKQUOTE>
|
|
</P>
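<P>
The exactness claims above follow from the significand widths of the formats.
As a rough check, the sketch below tests whether an integer magnitude fits in a
significand of a given width.
The function name <CODE>demo_fits_in_significand</CODE> is hypothetical, and
the widths of 24, 53, 64, and <NOBR>113 bits</NOBR> for the four formats are
the standard IEEE and <NOBR>80-bit</NOBR> extended values rather than something
stated in this document.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the integer magnitude can be held
   exactly in a binary significand of precisionBits bits.  Trailing
   zero bits are absorbed by the exponent, so they do not count. */
bool demo_fits_in_significand( uint64_t magnitude, unsigned precisionBits )
{
    if ( magnitude == 0 ) return true;
    while ( ( magnitude & 1 ) == 0 ) magnitude >>= 1;
    if ( precisionBits >= 64 ) return true;
    return ( magnitude >> precisionBits ) == 0;
}
```

<P>
Every <NOBR>32-bit</NOBR> magnitude passes this test for a width of 53, whereas
2<SUP>53</SUP>&nbsp;+&nbsp;1 fails for a width of 53 but passes for 64.
</P>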
|
|
|
|
<H3>8.2. Conversions from Floating-Point to Integer</H3>
|
|
|
|
<P>
|
|
Conversions from a floating-point format to a <NOBR>32-bit</NOBR> or
|
|
<NOBR>64-bit</NOBR> integer, signed or unsigned, are supported with these
|
|
functions:
|
|
<BLOCKQUOTE>
|
|
<CODE>&lt;<I>float</I>&gt;_to_ui32</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64</CODE>
|
|
</BLOCKQUOTE>
|
|
The functions have signatures as follows, depending on whether the
|
|
floating-point input is passed by value or via pointers:
|
|
<BLOCKQUOTE>
|
|
<PRE>
int_fast32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t
 f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
|
|
</BLOCKQUOTE>
|
|
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
|
|
the conversion.
|
|
The variable that usually indicates rounding mode,
|
|
<CODE>softfloat_roundingMode</CODE>, is ignored.
|
|
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
|
|
exception flag is raised if the conversion is not exact.
|
|
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
|
|
be raised;
|
|
otherwise, it will not be, even if the conversion is inexact.
|
|
</P>
|
|
|
|
<P>
|
|
Conversions from floating-point to integer raise the <I>invalid</I> exception
|
|
if the source value cannot be rounded to a representable integer of the desired
|
|
size (32 or 64 bits).
|
|
In such a circumstance, if the floating-point input is a NaN or if the
|
|
conversion is to an unsigned integer type, the largest positive integer is
|
|
returned;
|
|
otherwise, the largest integer with the same sign as the input is returned.
|
|
The functions that convert to integer types never raise the <I>overflow</I>
|
|
exception.
|
|
</P>
|
|
|
|
<P>
|
|
Note that, when converting to an unsigned integer type, if the <I>invalid</I>
|
|
exception is raised because the input floating-point value would round to a
|
|
negative integer, the value returned is the <EM>maximum positive unsigned
|
|
integer</EM>.
|
|
Zero is not returned when the <I>invalid</I> exception is raised, even when
|
|
zero is the closest integer to the original floating-point value.
|
|
</P>
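<P>
The rule for the value returned alongside an <I>invalid</I> exception can be
restated as a small sketch for the <NOBR>32-bit</NOBR> case.
The function names are hypothetical and the code only models the choice of
return value described above; it reads &ldquo;largest integer with the same
sign&rdquo; as the integer of largest magnitude with that sign.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the signed 32-bit result when the conversion
   raises invalid: NaN gives the largest positive integer; otherwise
   the result is the extreme integer with the sign of the input. */
int32_t demo_invalid_result_i32( bool isNaN, bool isNegative )
{
    if ( isNaN ) return INT32_MAX;
    return isNegative ? INT32_MIN : INT32_MAX;
}

/* Hypothetical model of the unsigned 32-bit result when the conversion
   raises invalid: always the maximum unsigned integer, even for
   negative inputs. */
uint32_t demo_invalid_result_ui32( void )
{
    return UINT32_MAX;
}
```

<P>
Code that converts possibly negative values to unsigned integers should
therefore check the <I>invalid</I> flag rather than rely on a zero result.
</P>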
|
|
|
|
<P>
|
|
Because languages such <NOBR>as C</NOBR> require that conversions to integers
|
|
be rounded toward zero, the following functions are provided for improved speed
|
|
and convenience:
|
|
<BLOCKQUOTE>
|
|
<CODE>&lt;<I>float</I>&gt;_to_ui32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64_r_minMag</CODE>
|
|
</BLOCKQUOTE>
|
|
These functions round only toward zero (to minimum magnitude).
|
|
The signatures for these functions are the same as above without the redundant
|
|
<CODE><I>roundingMode</I></CODE> argument:
|
|
<BLOCKQUOTE>
|
|
<PRE>
int_fast32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
</PRE>
|
|
</BLOCKQUOTE>
|
|
</P>
|
|
|
|
<H3>8.3. Conversions Among Floating-Point Types</H3>
|
|
|
|
<P>
|
|
Conversions between floating-point formats are done by functions with these
|
|
names:
|
|
<BLOCKQUOTE>
|
|
<CODE>&lt;<I>float</I>&gt;_to_&lt;<I>float</I>&gt;</CODE>
|
|
</BLOCKQUOTE>
|
|
All combinations of source and result type are supported where the source and
|
|
result are different formats.
|
|
There are four different styles of signature for these functions, depending on
|
|
whether the input and the output floating-point values are passed by value or
|
|
via pointers:
|
|
<BLOCKQUOTE>
|
|
<PRE>
float32_t f64_to_f32( float64_t <I>a</I> );
</PRE>
<PRE>
float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
</PRE>
<PRE>
void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
|
|
</BLOCKQUOTE>
|
|
</P>
|
|
|
|
<P>
|
|
Conversions from a smaller to a larger floating-point format are always exact
|
|
and so require no rounding.
|
|
</P>
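<P>
The exactness of widening conversions can be illustrated with C&rsquo;s native
types.
This sketch assumes that <CODE>float</CODE> and <CODE>double</CODE> are the
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> IEEE binary formats, as on most
current platforms; it does not call SoftFloat, and both helper names are
hypothetical.
</P>

```c
#include <stdbool.h>

/* Widen a float to double and narrow it back.  Because every 32-bit
   binary value is exactly representable in the 64-bit format, the
   round trip returns the original value (for non-NaN inputs). */
bool widen_round_trips( float a )
{
    volatile double wide = (double) a;   /* exact: no rounding needed */
    volatile float back = (float) wide;
    return back == a;
}

/* Narrowing, by contrast, may round.  Returns true only if the double
   value survives a trip through float unchanged. */
bool narrow_is_exact( double a )
{
    volatile float narrow = (float) a;   /* may round to nearest float */
    return (double) narrow == a;
}
```

<P>
A value such as <CODE>0.5</CODE> narrows exactly, while <CODE>0.1</CODE>, whose
<NOBR>64-bit</NOBR> representation uses more fraction bits than the
<NOBR>32-bit</NOBR> format holds, does not.
</P>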
|
|
|
|
<H3>8.4. Basic Arithmetic Functions</H3>
|
|
|
|
<P>
|
|
The following basic arithmetic functions are provided:
|
|
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_add</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_sub</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_mul</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_div</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_sqrt</CODE>
</BLOCKQUOTE>
|
|
Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
(square root), which takes only one.
|
|
The operands and result are all of the same floating-point format.
|
|
Signatures for these functions take the following forms:
|
|
<BLOCKQUOTE>
<PRE>
float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_add(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
float64_t f64_sqrt( float64_t <I>a</I> );
</PRE>
<PRE>
void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
|
|
When floating-point values are passed indirectly through pointers, arguments
|
|
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
|
|
operands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the
|
|
location where the result is stored.
|
|
</P>
|
|
|
|
<P>
|
|
Rounding of the <NOBR>80-bit</NOBR> double-extended-precision
|
|
(<CODE>extFloat80_t</CODE>) functions is affected by variable
|
|
<CODE>extF80_roundingPrecision</CODE>, as explained earlier in
|
|
<NOBR>section 6.3</NOBR>,
|
|
<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
|
|
</P>
|
|
|
|
<H3>8.5. Fused Multiply-Add Functions</H3>
|
|
|
|
<P>
|
|
The 2008 version of the IEEE Floating-Point Standard defines a <I>fused
|
|
multiply-add</I> operation that does a combined multiplication and addition
|
|
with only a single rounding.
|
|
SoftFloat implements fused multiply-add with functions
|
|
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_mulAdd</CODE>
</BLOCKQUOTE>
|
|
Unlike other operations, fused multiply-add is supported only for the
non-extended formats, <CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and
<CODE>float128_t</CODE>.
No fused multiply-add function is currently provided for the
<NOBR>80-bit</NOBR> double-extended-precision type, <CODE>extFloat80_t</CODE>.
|
|
</P>
|
|
|
|
<P>
|
|
Depending on whether floating-point values are passed by value or via pointers,
|
|
the fused multiply-add functions have signatures of these forms:
|
|
<BLOCKQUOTE>
<PRE>
float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
</PRE>
<PRE>
void
 f128M_mulAdd(
     const float128_t *<I>aPtr</I>,
     const float128_t *<I>bPtr</I>,
     const float128_t *<I>cPtr</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
|
|
The functions compute
|
|
<NOBR>(<CODE><I>a</I></CODE> × <CODE><I>b</I></CODE>)
|
|
+ <CODE><I>c</I></CODE></NOBR>
|
|
with a single rounding.
|
|
When floating-point values are passed indirectly through pointers, arguments
|
|
<CODE><I>aPtr</I></CODE>, <CODE><I>bPtr</I></CODE>, and
|
|
<CODE><I>cPtr</I></CODE> point to operands <CODE><I>a</I></CODE>,
|
|
<CODE><I>b</I></CODE>, and <CODE><I>c</I></CODE> respectively, and
|
|
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
|
|
</P>
|
|
|
|
<P>
|
|
If one of the multiplication operands <CODE><I>a</I></CODE> and
|
|
<CODE><I>b</I></CODE> is infinite and the other is zero, these functions raise
|
|
the <I>invalid</I> exception even if operand <CODE><I>c</I></CODE> is a quiet NaN.
|
|
</P>
|
|
|
|
<H3>8.6. Remainder Functions</H3>
|
|
|
|
<P>
|
|
For each format, SoftFloat implements the remainder operation defined by the
|
|
IEEE Floating-Point Standard.
|
|
The remainder functions have names
|
|
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_rem</CODE>
</BLOCKQUOTE>
|
|
Each remainder operation takes two floating-point operands of the same format
|
|
and returns a result in the same format.
|
|
Depending on whether floating-point values are passed by value or via pointers,
|
|
the remainder functions have signatures of these forms:
|
|
<BLOCKQUOTE>
<PRE>
float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_rem(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
|
|
When floating-point values are passed indirectly through pointers, arguments
|
|
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
|
|
<CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
|
|
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
|
|
</P>
|
|
|
|
<P>
|
|
The IEEE Standard remainder operation computes the value
|
|
<NOBR><CODE><I>a</I></CODE>
|
|
− <I>n</I> × <CODE><I>b</I></CODE></NOBR>,
|
|
where <I>n</I> is the integer closest to
|
|
<NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
|
|
If <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR> is exactly
|
|
halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
|
|
<NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
|
|
The IEEE Standard’s remainder operation is always exact and so requires
|
|
no rounding.
|
|
</P>
|
|
|
|
<P>
|
|
Depending on the relative magnitudes of the operands, the remainder
|
|
functions can take considerably longer to execute than the other SoftFloat
|
|
functions.
|
|
This is inherent in the remainder operation itself and is not a flaw in the
|
|
SoftFloat implementation.
|
|
</P>
|
|
|
|
<H3>8.7. Round-to-Integer Functions</H3>
|
|
|
|
<P>
|
|
For each format, SoftFloat implements the round-to-integer operation specified
|
|
by the IEEE Floating-Point Standard.
|
|
These functions are named
|
|
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_roundToInt</CODE>
</BLOCKQUOTE>
|
|
Each round-to-integer operation takes a single floating-point operand.
|
|
This operand is rounded to an integer according to a specified rounding mode,
|
|
and the resulting integer value is returned in the same floating-point format.
|
|
(Note that the result is not an integer type.)
|
|
</P>
|
|
|
|
<P>
|
|
The signatures of the round-to-integer functions are similar to those for
|
|
conversions to an integer type:
|
|
<BLOCKQUOTE>
<PRE>
float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
void
 f128M_roundToInt(
     const float128_t *<I>aPtr</I>,
     uint_fast8_t <I>roundingMode</I>,
     bool <I>exact</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
|
|
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
|
|
apply.
|
|
The variable that usually indicates rounding mode,
|
|
<CODE>softfloat_roundingMode</CODE>, is ignored.
|
|
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
|
|
exception flag is raised if the conversion is not exact.
|
|
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
|
|
be raised;
|
|
otherwise, it will not be, even if the conversion is inexact.
|
|
When floating-point values are passed indirectly through pointers,
|
|
<CODE><I>aPtr</I></CODE> points to the input operand and
|
|
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
|
|
</P>
|
|
|
|
<H3>8.8. Comparison Functions</H3>
|
|
|
|
<P>
|
|
For each format, the following floating-point comparison functions are
|
|
provided:
|
|
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt</CODE>
</BLOCKQUOTE>
|
|
Each comparison takes two operands of the same type and returns a Boolean.
|
|
The abbreviation <CODE>eq</CODE> stands for “equal” (=);
|
|
<CODE>le</CODE> stands for “less than or equal” (≤);
|
|
and <CODE>lt</CODE> stands for “less than” (&lt;).
|
|
Depending on whether the floating-point operands are passed by value or via
|
|
pointers, the comparison functions have signatures of these forms:
|
|
<BLOCKQUOTE>
<PRE>
bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
</PRE>
</BLOCKQUOTE>
|
|
</P>
|
|
|
|
<P>
|
|
The usual greater-than (&gt;), greater-than-or-equal (≥), and not-equal
|
|
(≠) comparisons are easily obtained from the functions provided.
|
|
The not-equal function is just the logical complement of the equal function.
|
|
The greater-than-or-equal function is identical to the less-than-or-equal
|
|
function with the arguments in reverse order, and likewise the greater-than
|
|
function is identical to the less-than function with the arguments reversed.
|
|
</P>
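<P>
These three derivations can be written as one-line wrappers.
The sketch below uses native <CODE>double</CODE> comparisons as stand-ins for
<CODE>f64_eq</CODE>, <CODE>f64_le</CODE>, and <CODE>f64_lt</CODE>; all of the
<CODE>demo_</CODE> names are hypothetical.
</P>

```c
#include <stdbool.h>

/* Stand-ins for f64_eq, f64_le, and f64_lt using native double. */
static bool demo_eq( double a, double b ) { return a == b; }
static bool demo_le( double a, double b ) { return a <= b; }
static bool demo_lt( double a, double b ) { return a < b; }

/* The three derived comparisons: complement the equality test, or
   swap the arguments of the ordering tests. */
bool demo_ne( double a, double b ) { return ! demo_eq( a, b ); }
bool demo_ge( double a, double b ) { return demo_le( b, a ); }
bool demo_gt( double a, double b ) { return demo_lt( b, a ); }
```

<P>
Swapping the arguments, rather than complementing the opposite comparison,
matters when an operand is a NaN: every ordering test involving a NaN is false,
so greater-than-or-equal is not simply the complement of less-than.
</P>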
|
|
|
|
<P>
|
|
The IEEE Floating-Point Standard specifies that the less-than-or-equal and
|
|
less-than comparisons by default raise the <I>invalid</I> exception if either
|
|
operand is any kind of NaN.
|
|
Equality comparisons, on the other hand, are defined by default to raise the
|
|
<I>invalid</I> exception only for signaling NaNs, not quiet NaNs.
|
|
For completeness, SoftFloat provides these complementary functions:
|
|
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_eq_signaling</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_le_quiet</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_lt_quiet</CODE>
</BLOCKQUOTE>
|
|
The <CODE>signaling</CODE> equality comparisons are identical to the default
|
|
equality comparisons except that the <I>invalid</I> exception is raised for any
|
|
NaN input, not just for signaling NaNs.
|
|
Similarly, the <CODE>quiet</CODE> comparison functions are identical to their
|
|
default counterparts except that the <I>invalid</I> exception is not raised for
|
|
quiet NaNs.
|
|
</P>
|
|
|
|
<H3>8.9. Signaling NaN Test Functions</H3>
|
|
|
|
<P>
|
|
Functions for testing whether a floating-point value is a signaling NaN are
|
|
provided with these names:
|
|
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_isSignalingNaN</CODE>
</BLOCKQUOTE>
|
|
The functions take one floating-point operand and return a Boolean indicating
|
|
whether the operand is a signaling NaN.
|
|
Accordingly, the functions have the forms
|
|
<BLOCKQUOTE>
<PRE>
bool f64_isSignalingNaN( float64_t <I>a</I> );
</PRE>
<PRE>
bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
</PRE>
</BLOCKQUOTE>
|
|
</P>
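<P>
As an illustration of what such a test involves, the sketch below examines a
raw <NOBR>64-bit</NOBR> bit pattern.
It assumes the common convention in which the most significant fraction bit
is 1 for quiet NaNs and 0 for signaling NaNs; some platforms use the opposite
sense, so this is not necessarily how a given SoftFloat port decides.
The function name is hypothetical.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a signaling-NaN test on a raw 64-bit floating-point bit
   pattern, ASSUMING the convention that the most significant fraction
   bit is 1 for quiet NaNs and 0 for signaling NaNs. */
bool demo_f64_isSignalingNaN( uint64_t bits )
{
    const uint64_t expMask  = UINT64_C( 0x7FF0000000000000 );
    const uint64_t quietBit = UINT64_C( 0x0008000000000000 );
    const uint64_t fracMask = UINT64_C( 0x000FFFFFFFFFFFFF );

    /* A NaN has an all-ones exponent and a nonzero fraction. */
    bool isNaN =
        ( ( bits & expMask ) == expMask ) && ( ( bits & fracMask ) != 0 );
    return isNaN && ( ( bits & quietBit ) == 0 );
}
```

<P>
Under this assumption an infinity (all-ones exponent, zero fraction) and a
quiet NaN are both rejected, and only a NaN with the quiet bit clear is
reported as signaling.
</P>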
|
|
|
|
<H3>8.10. Raise-Exception Function</H3>
|
|
|
|
<P>
|
|
SoftFloat provides a single function for raising floating-point exceptions:
|
|
<BLOCKQUOTE>
<PRE>
void softfloat_raise( uint_fast8_t <I>exceptions</I> );
</PRE>
</BLOCKQUOTE>
|
|
The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
|
|
exceptions to raise.
|
|
(See earlier section 7, <I>Exceptions and Exception Flags</I>.)
|
|
In addition to setting the specified exception flags in variable
|
|
<CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_raise</CODE>
|
|
function may cause a trap or abort appropriate for the current system.
|
|
</P>
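<P>
The flag-setting part of this behavior amounts to a bitwise OR into a flags
variable.
The following is a minimal sketch of that idea only; the variable, the flag
constants and their values, and the function <CODE>demo_raise</CODE> are all
hypothetical and are not SoftFloat&rsquo;s definitions.
</P>

```c
#include <stdint.h>

/* Hypothetical flag bits and flags variable for this sketch; the names
   and values are illustrative, not SoftFloat's definitions. */
enum {
    demo_flag_inexact = 1,
    demo_flag_invalid = 16
};
uint_fast8_t demo_exceptionFlags = 0;

/* Minimal raise function: accumulate the requested flags.  A real
   port might additionally trap or abort at this point. */
void demo_raise( uint_fast8_t exceptions )
{
    demo_exceptionFlags |= exceptions;
}
```

<P>
Because the flags accumulate, a flag stays set across later calls until the
program clears the variable itself.
</P>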
|
|
|
|
|
|
<H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>
|
|
|
|
<P>
|
|
Apart from the change in the legal use license, there are numerous technical
|
|
differences between <NOBR>Release 3</NOBR> of SoftFloat and earlier releases.
|
|
</P>
|
|
|
|
<H3>9.1. Name Changes</H3>
|
|
|
|
<P>
|
|
The most obvious and pervasive difference compared to <NOBR>Release 2</NOBR> is
|
|
that the names of most functions and variables have changed, even when the
|
|
behavior has not.
|
|
First, the floating-point types, the mode variables, the exception flags
|
|
variable, the function to raise exceptions, and various associated constants
|
|
have been renamed as follows:
|
|
<BLOCKQUOTE>
|
|
<TABLE>
<TR> <TD>old name, Release 2:</TD> <TD>new name, Release 3:</TD> </TR>
<TR> <TD><CODE>float32</CODE></TD> <TD><CODE>float32_t</CODE></TD> </TR>
<TR> <TD><CODE>float64</CODE></TD> <TD><CODE>float64_t</CODE></TD> </TR>
<TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extFloat80_t</CODE></TD> </TR>
<TR> <TD><CODE>float128</CODE></TD> <TD><CODE>float128_t</CODE></TD> </TR>
<TR> <TD><CODE>float_rounding_mode</CODE></TD> <TD><CODE>softfloat_roundingMode</CODE></TD> </TR>
<TR> <TD><CODE>float_round_nearest_even</CODE></TD> <TD><CODE>softfloat_round_near_even</CODE></TD> </TR>
<TR> <TD><CODE>float_round_to_zero</CODE></TD> <TD><CODE>softfloat_round_minMag</CODE></TD> </TR>
<TR> <TD><CODE>float_round_down</CODE></TD> <TD><CODE>softfloat_round_min</CODE></TD> </TR>
<TR> <TD><CODE>float_round_up</CODE></TD> <TD><CODE>softfloat_round_max</CODE></TD> </TR>
<TR> <TD><CODE>float_detect_tininess</CODE></TD> <TD><CODE>softfloat_detectTininess</CODE></TD> </TR>
<TR> <TD><CODE>float_tininess_before_rounding&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD> <TD><CODE>softfloat_tininess_beforeRounding</CODE></TD> </TR>
<TR> <TD><CODE>float_tininess_after_rounding</CODE></TD> <TD><CODE>softfloat_tininess_afterRounding</CODE></TD> </TR>
<TR> <TD><CODE>floatx80_rounding_precision</CODE></TD> <TD><CODE>extF80_roundingPrecision</CODE></TD> </TR>
<TR> <TD><CODE>float_exception_flags</CODE></TD> <TD><CODE>softfloat_exceptionFlags</CODE></TD> </TR>
<TR> <TD><CODE>float_flag_inexact</CODE></TD> <TD><CODE>softfloat_flag_inexact</CODE></TD> </TR>
<TR> <TD><CODE>float_flag_underflow</CODE></TD> <TD><CODE>softfloat_flag_underflow</CODE></TD> </TR>
<TR> <TD><CODE>float_flag_overflow</CODE></TD> <TD><CODE>softfloat_flag_overflow</CODE></TD> </TR>
<TR> <TD><CODE>float_flag_divbyzero</CODE></TD> <TD><CODE>softfloat_flag_infinite</CODE></TD> </TR>
<TR> <TD><CODE>float_flag_invalid</CODE></TD> <TD><CODE>softfloat_flag_invalid</CODE></TD> </TR>
<TR> <TD><CODE>float_raise</CODE></TD> <TD><CODE>softfloat_raise</CODE></TD> </TR>
</TABLE>
|
|
</BLOCKQUOTE>
|
|
</P>
|
|
|
|
<P>
|
|
Furthermore, <NOBR>Release 3</NOBR> has adopted the following new abbreviations
|
|
for function names:
|
|
<BLOCKQUOTE>
|
|
<TABLE>
|
|
<TR>
|
|
<TD>used in names in Release 2:<CODE> </CODE></TD>
|
|
<TD>used in names in Release 3:</TD>
|
|
</TR>
|
|
<TR> <TD><CODE>int32</CODE></TD> <TD><CODE>i32</CODE></TD> </TR>
|
|
<TR> <TD><CODE>int64</CODE></TD> <TD><CODE>i64</CODE></TD> </TR>
|
|
<TR> <TD><CODE>float32</CODE></TD> <TD><CODE>f32</CODE></TD> </TR>
|
|
<TR> <TD><CODE>float64</CODE></TD> <TD><CODE>f64</CODE></TD> </TR>
|
|
<TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extF80</CODE></TD> </TR>
|
|
<TR> <TD><CODE>float128</CODE></TD> <TD><CODE>f128</CODE></TD> </TR>
|
|
</TABLE>
|
|
</BLOCKQUOTE>
|
|
Thus, for example, the function to add two <NOBR>32-bit</NOBR> floating-point
|
|
numbers, previously called <CODE>float32_add</CODE> in <NOBR>Release 2</NOBR>,
|
|
is now <CODE>f32_add</CODE>.
|
|
Lastly, there are a few other changes to function names:
|
|
<BLOCKQUOTE>
|
|
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE>&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>used in names in Release 3:<CODE>&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>relevant functions:</TD>
</TR>
<TR>
<TD><CODE>_round_to_zero</CODE></TD>
<TD><CODE>_r_minMag</CODE></TD>
<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>round_to_int</CODE></TD>
<TD><CODE>roundToInt</CODE></TD>
<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>is_signaling_nan&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><CODE>isSignalingNaN</CODE></TD>
<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
</TR>
</TABLE>
|
|
</BLOCKQUOTE>
|
|
</P>
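<P>
For code being ported from <NOBR>Release 2</NOBR>, the renamings listed above
could be collected into a small compatibility header.
The fragment below is only a sketch assembled from some of those table
entries; no such header is described as part of SoftFloat, and it covers
names only, not the changes to function arguments.
</P>

```c
/* Sketch of a Release 2 -> Release 3 name shim, built only from the
   renamings listed above.  It maps names; it does not adapt any
   function whose arguments changed between the releases. */
#define float32                     float32_t
#define float64                     float64_t
#define floatx80                    extFloat80_t
#define float128                    float128_t
#define float_rounding_mode         softfloat_roundingMode
#define float_round_nearest_even    softfloat_round_near_even
#define float_round_to_zero         softfloat_round_minMag
#define float_round_down            softfloat_round_min
#define float_round_up              softfloat_round_max
#define float_exception_flags       softfloat_exceptionFlags
#define float_flag_divbyzero        softfloat_flag_infinite
#define float_raise                 softfloat_raise
#define float32_add                 f32_add
```

<P>
Macros of this kind substitute whole identifiers only, so defining
<CODE>float32</CODE> does not disturb a longer name such as
<CODE>float32_add</CODE>, which needs its own entry.
</P>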
|
|
|
|
<H3>9.2. Changes to Function Arguments</H3>
|
|
|
|
<P>
|
|
Besides simple name changes, some operations have a different interface in
|
|
<NOBR>Release 3</NOBR> than they had in <NOBR>Release 2</NOBR>:
|
|
<UL>
|
|
|
|
<LI>
|
|
<P>
|
|
In <NOBR>Release 3</NOBR>, integer arguments and results of functions have
|
|
standard types from header <CODE>&lt;stdint.h&gt;</CODE>, such as
|
|
<CODE>uint32_t</CODE>, whereas previously their types could be defined
|
|
differently for each port of SoftFloat, usually using traditional C types such
|
|
as <CODE>unsigned</CODE> <CODE>int</CODE>.
|
|
Likewise, functions in <NOBR>Release 3</NOBR> pass Booleans as standard type
|
|
<CODE>bool</CODE> from <CODE>&lt;stdbool.h&gt;</CODE>, whereas previously these
|
|
were again passed as a port-specific type (usually <CODE>int</CODE>).
|
|
</P>
|
|
|
|
<LI>
|
|
<P>
|
|
As explained earlier in <NOBR>section 4.5</NOBR>, <I>Conventions for Passing
|
|
Arguments and Results</I>, SoftFloat functions in <NOBR>Release 3</NOBR> may
|
|
pass <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point values through
|
|
pointers, meaning that functions take pointer arguments and then read or write
|
|
floating-point values at the locations indicated by the pointers.
|
|
In <NOBR>Release 2</NOBR>, floating-point arguments and results were always
|
|
passed by value, regardless of their size.
|
|
</P>
|
|
|
|
<LI>
|
|
<P>
|
|
Functions that round to an integer have additional
|
|
<CODE><I>roundingMode</I></CODE> and <CODE><I>exact</I></CODE> arguments that
|
|
they did not have in <NOBR>Release 2</NOBR>.
|
|
Refer to sections 8.2 <NOBR>and 8.7</NOBR> for descriptions of these functions
|
|
in <NOBR>Release 3</NOBR>.
|
|
For <NOBR>Release 2</NOBR>, the rounding mode, when needed, was taken from the
|
|
same global variable that affects the basic arithmetic operations (now called
|
|
<CODE>softfloat_roundingMode</CODE> but previously known as
|
|
<CODE>float_rounding_mode</CODE>).
|
|
Also, for <NOBR>Release 2</NOBR>, if the original floating-point input was not
|
|
an exact integer value, and if the <I>invalid</I> exception was not raised by
|
|
the function, the <I>inexact</I> exception was always raised.
|
|
<NOBR>Release 2</NOBR> had no option to suppress raising <I>inexact</I> in this
|
|
case.
|
|
Applications using SoftFloat <NOBR>Release 3</NOBR> can get the same effect as
|
|
<NOBR>Release 2</NOBR> by passing variable <CODE>softfloat_roundingMode</CODE>
|
|
for argument <CODE><I>roundingMode</I></CODE> and <CODE>true</CODE> for
|
|
argument <CODE><I>exact</I></CODE>.
|
|
</P>
|
|
|
|
</UL>
|
|
</P>
|
|
|
|
<H3>9.3. Added Capabilities</H3>
|
|
|
|
<P>
|
|
<NOBR>Release 3</NOBR> adds some new features not present in
|
|
<NOBR>Release 2</NOBR>:
|
|
<UL>
|
|
|
|
<LI>
|
|
<P>
|
|
With <NOBR>Release 3</NOBR>, a port of SoftFloat can now define any of the
|
|
floating-point types <CODE>float32_t</CODE>, <CODE>float64_t</CODE>,
|
|
<CODE>extFloat80_t</CODE>, and <CODE>float128_t</CODE> as aliases for C’s
|
|
standard floating-point types <CODE>float</CODE>, <CODE>double</CODE>, and
|
|
<CODE>long</CODE> <CODE>double</CODE>, using either <CODE>#define</CODE> or
|
|
<CODE>typedef</CODE>.
|
|
This potential convenience was not supported under <NOBR>Release 2</NOBR>.
|
|
</P>
|
|
|
|
<P>
|
|
(Note, however, that there may be a performance cost to defining
|
|
SoftFloat’s floating-point types this way, depending on the platform and
|
|
the applications using SoftFloat.
|
|
Ports of SoftFloat may choose to forgo the convenience in favor of better
|
|
speed.)
|
|
</P>
|
|
|
|
<LI>
<P>
|
|
Functions have been added for converting between the floating-point types and
|
|
unsigned integers.
|
|
<NOBR>Release 2</NOBR> supported only signed integers, not unsigned.
|
|
</P>
|
|
|
|
<LI>
<P>
|
|
A new, fifth rounding mode, <CODE>softfloat_round_near_maxMag</CODE> (round to
|
|
nearest, with ties to maximum magnitude, away from zero) is supported for all
|
|
cases involving rounding.
|
|
</P>
|
|
|
|
<LI>
<P>
|
|
Fused multiply-add functions have been added for the non-extended formats,
|
|
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and <CODE>float128_t</CODE>.
|
|
</P>
|
|
|
|
</UL>
|
|
</P>
|
|
|
|
<H3>9.4. Better Compatibility with the C Language</H3>
|
|
|
|
<P>
|
|
<NOBR>Release 3</NOBR> of SoftFloat is written to conform better to the ISO C
|
|
Standard’s rules for portability.
|
|
For example, older releases of SoftFloat employed type conversions in ways
|
|
that, while commonly practiced, are not fully defined by the C Standard.
|
|
Such problematic type conversions have generally been replaced by the use of
unions, whose behavior is now more strictly defined by the C Standard.
|
|
</P>
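<P>
The difference can be illustrated by reading the bits of a floating-point
value.
Instead of casting a pointer, as in <CODE>*(uint32_t *) &amp;f</CODE>, the
sketch below goes through a union.
It assumes that C&rsquo;s <CODE>float</CODE> is the <NOBR>32-bit</NOBR> IEEE
binary format; the function names are hypothetical and are not taken from the
SoftFloat sources.
</P>

```c
#include <stdint.h>

/* Reinterpret the bits of a float through a union rather than through
   a pointer cast.  Assumes float is the 32-bit IEEE binary format, as
   on most current platforms. */
uint32_t demo_f32_bits( float f )
{
    union { float f; uint32_t ui; } u;
    u.f = f;
    return u.ui;
}

/* The reverse direction: build a float from a 32-bit pattern. */
float demo_f32_from_bits( uint32_t ui )
{
    union { float f; uint32_t ui; } u;
    u.ui = ui;
    return u.f;
}
```

<P>
Under the stated assumption, the value 1.0 maps to the pattern
<CODE>0x3F800000</CODE>, and the pattern <CODE>0x3F000000</CODE> maps back to
0.5.
</P>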
|
|
|
|
<H3>9.5. New Organization as a Library</H3>
|
|
|
|
<P>
|
|
With <NOBR>Release 3</NOBR>, SoftFloat now builds as a library.
|
|
Previously, SoftFloat compiled into a single, monolithic object file containing
|
|
all the SoftFloat functions, with the consequence that a program linking with
|
|
SoftFloat would get every SoftFloat function in its binary file even if only a
|
|
few functions were actually used.
|
|
With SoftFloat in the form of a library, a program that is linked by a standard
|
|
linker will include only those functions of SoftFloat that it needs and no
|
|
others.
|
|
</P>
|
|
|
|
<H3>9.6. Optimization Gains (and Losses)</H3>
|
|
|
|
<P>
|
|
Individual SoftFloat functions are variously improved in <NOBR>Release 3</NOBR>
|
|
compared to earlier releases.
|
|
In particular, better, faster algorithms have been deployed for the operations
|
|
of division, square root, and remainder.
|
|
For functions operating on the larger <NOBR>80-bit</NOBR> and
|
|
<NOBR>128-bit</NOBR> formats, <CODE>extFloat80_t</CODE> and
|
|
<CODE>float128_t</CODE>, code size has also generally been reduced.
|
|
</P>
|
|
|
|
<P>
|
|
However, because <NOBR>Release 2</NOBR> compiled all of SoftFloat together as a
|
|
single object file, compilers could make optimizations across function calls
|
|
when one SoftFloat function calls another.
|
|
Now that the functions of SoftFloat are compiled separately and only afterward
|
|
linked together into a program, there is not usually the same opportunity to
|
|
optimize across function calls.
|
|
Some loss of speed has been observed due to this change.
|
|
</P>
|
|
|
|
|
|
<H2>10. Future Directions</H2>
|
|
|
|
<P>
|
|
The following improvements are anticipated for future releases of SoftFloat:
|
|
<UL>
|
|
<LI>
|
|
support for the common <NOBR>16-bit</NOBR> “half-precision”
|
|
floating-point format;
|
|
<LI>
|
|
more functions from the 2008 version of the IEEE Floating-Point Standard;
|
|
<LI>
|
|
consistent, defined behavior for non-canonical representations of extended
|
|
format <CODE>extFloat80_t</CODE> (discussed in <NOBR>section 4.4</NOBR>,
|
|
<I>Non-canonical Representations in <CODE>extFloat80_t</CODE></I>).
|
|
|
|
</UL>
|
|
</P>
|
|
|
|
|
|
<H2>11. Contact Information</H2>
|
|
|
|
<P>
|
|
At the time of this writing, the most up-to-date information about SoftFloat
|
|
and the latest release can be found at the Web page
|
|
<A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></A>.
|
|
</P>
|
|
|
|
|
|
</BODY>
|
|
|