I wanted to optimize fmod to be a bit faster. This is my C++20 solution.[snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.[snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - (long)div);
}
On 24.02.2025 at 13:00, Muttley@DastardlyHQ.org wrote:
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.[snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - (long)div);
}
If the exponent difference between x and y is large enough this
returns results which are larger than y. glibc also does it entirely
with integer operations.
If the values are so large or small that you start to get floating-point errors then you should probably be using integer arithmetic or a large-number library like GMP anyway.
On Mon, 24 Feb 2025 13:48:02 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
On 24.02.2025 at 13:00, Muttley@DastardlyHQ.org wrote:
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wibbled:
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.[snip]
double myFmod( double x, double y )
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - (long)div);
}
If the exponent difference between x and y is large enough this
returns results which are larger than y. glibc also does it entirely
with integer operations.
If the values are so large or small that you start to get floating-
point errors then you should probably be using integer
arithmetic or a large-number library like GMP anyway.
On Mon, 24 Feb 2025 13:09:50 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating-
point errors then you should probably be using integer
arithmetic or a large-number library like GMP anyway.
Your method will sometimes produce results that are 1 LSB off
relative to the IEEE-754 prescription when values are neither small nor
large.
And sometimes 1 LSB off means that the result is 2x off.
For example, for x=0.9999999999999999, y=0.9999999999999998 your method
produces 2.2204460492503126e-16. The correct result is, of course,
1.1102230246251565e-16.
Also, I don't think that your method is any faster than correct methods.
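For reference, a small standalone check (my own program, not part of the post) that reproduces the two values quoted above:
#include <cmath>
#include <cstdio>
int main()
{
    double x = 0.9999999999999999, y = 0.9999999999999998;
    double div = x / y;
    std::printf( "%.17g\n", y * (div - (long)div) ); // 2.2204460492503126e-16
    std::printf( "%.17g\n", std::fmod( x, y ) );     // 1.1102230246251565e-16
}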
On 24.02.2025 at 14:09, Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating-
point errors then you should probably be using integer
arithmetic or a large-number library like GMP anyway.
There's no need for large-integer arithmetic since each calculation
step results in a mantissa with equal or fewer bits than the divisor.
Even if the exponents are close enough that no integer bits are
missing, the following multiplication is very likely to lose
precision. All current solutions (MSVC, libstdc++) work with integer
operations and are always 100% precise.
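To make the described reduction concrete, here is a stripped-down sketch (my own illustration, not the posted code; the function name is mine, and it handles only positive, normal, nonzero inputs with no NaN/Inf/denormal-input cases):
#include <bit>
#include <cstdint>
double fmodPositiveNormals( double x, double y )
{
    constexpr uint64_t IMPLICIT = 1ull << 52, MANT = IMPLICIT - 1;
    uint64_t binX = std::bit_cast<uint64_t>( x ), binY = std::bit_cast<uint64_t>( y );
    int expX = int(binX >> 52), expY = int(binY >> 52);
    uint64_t mantX = binX & MANT | IMPLICIT, mantY = binY & MANT | IMPLICIT;
    auto normalize = []( int &e, uint64_t &m )
    {
        int shift = std::countl_zero( m ) - 11; // put the top bit back at position 52
        m <<= shift;
        e -= shift;
    };
    while( expX > expY )
    {
        int bits = expX - expY < 11 ? expX - expY : 11; // shift stays within 64 bits
        mantX = (mantX << bits) % mantY;                // fold back below the divisor
        if( !mantX )
            return 0.0;
        expX -= bits;
        normalize( expX, mantX );
    }
    if( expX == expY && mantX >= mantY )                // at most one subtraction left
    {
        if( !(mantX -= mantY) )
            return 0.0;
        normalize( expX, mantX );
    }
    if( expX <= 0 )                                     // result underflows to a denormal
    {
        mantX >>= unsigned(1 - expX);                   // exact: only zero bits are dropped
        expX = 0;
    }
    return std::bit_cast<double>( uint64_t(expX) << 52 | mantX & MANT );
}
The running remainder never exceeds the divisor's 53-bit mantissa, so a plain 64-bit modulo per step is enough; the full posted code adds the sign, NaN/Inf/zero and denormal-input handling around this core.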
Do you have a real application for fast fmod() or are you just playing?
For as long as y is positive and abs(x/y) <= 2**53, a very simple
formula will produce a precise result: fma(trunc(x/fabs(y)), -fabs(y), x).
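Wrapped up as a function (my own wrapper name, under exactly the stated assumptions y > 0 and abs(x/y) <= 2**53), the formula reads:
#include <cmath>
double fmodViaFma( double x, double y )
{
    double q = std::trunc( x / std::fabs( y ) );  // integral quotient, exact in this range
    return std::fma( q, -std::fabs( y ), x );     // x - q*|y| with a single rounding
}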
On Mon, 24 Feb 2025 16:22:46 +0200
Michael S <already5chosen@yahoo.com> wibbled:
On Mon, 24 Feb 2025 13:09:50 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
If the values are so large or small that you start to get floating-
point errors then you should probably be using integer
arithmetic or a large-number library like GMP anyway.
Your method will sometimes produce results that are 1 LSB off
relative to the IEEE-754 prescription when values are neither small nor large.
And sometimes 1 LSB off means that result is 2x off.
For example, for x=0.9999999999999999, y=0.9999999999999998 your
method produces 2.2204460492503126e-16. A correct result is, of
course, 1.1102230246251565e-16
Frankly I doubt anyone would care, it's zero in all but name.
Also, I don't think that your method is any faster than correct
methods.
Don't know, but it's only 3 mathematical operations, all of which can
be done by the hardware, so it's going to be pretty fast.
On 24.02.2025 at 16:13, Michael S wrote:
Do you have real application for fast fmod() or just playing?
I experimented with x87 FPREM and wanted to know whether it is
precise; it isn't, and the results can be > y.
So I developed my own routine which is always 100% precise.
For as long as y is positive and abs(x/y) <= 2**53, a very simple
formula will produce a precise result: fma(trunc(x/fabs(y)), -fabs(y), x).
The multiplication will mostly drop bits, so that the difference might
become larger than y.
On Mon, 24 Feb 2025 15:10:53 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
Don't know, but its only 3 mathematical operations all of which can
be done by the hardware so its going to be pretty fast.
Looks like 4 operations to me - division, truncation, subtraction,
multiplication. If the compiler takes it literally, which it probably will.
Nevertheless, after a bit of thinking I concur that your formula is
faster than 100% correct methods. Initially, I didn't take into account
all the difficulties that correct methods have to face in cases of very
large x to y ratios.
However, your method is approximately the same speed as the *mostly correct*
method shown in my post above. Maybe yours is even a little slower,
at least as long as we use a good optimizing compiler and target modern
CPUs that support trunc() and fma() as fast hardware instructions.
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.
double myFmod( double x, double y )
{
constexpr uint64_t
SIGN = 1ull << 63,
IMPLICIT = 1ull << 52,
MANT = IMPLICIT - 1,
QBIT = 1ull << 51;
uint64_t const
binX = bit_cast<uint64_t>( x ),
binY = bit_cast<uint64_t>( y );
static auto abs = []( uint64_t m ) { return m & ~SIGN; };
auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
auto isSig = []( uint64_t m ) { return !(m & QBIT); };
if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT :
binX );
#else
{
if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
feraiseexcept( FE_INVALID );
return bit_cast<double>( binX | QBIT );
}
#endif
if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
return y;
#else
{
if( isSig( binY ) ) [[unlikely]]
feraiseexcept( FE_INVALID );
return bit_cast<double>( binY | QBIT );
}
#endif
auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
if( isInf( binX ) ) // x == Inf
{
feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
return bit_cast<double>( binX & ~MANT | QBIT );
#else
return -numeric_limits<double>::quiet_NaN();
#endif
}
if( !abs( binY ) ) [[unlikely]] // y == 0
{
feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
return numeric_limits<double>::quiet_NaN();
#else
return -numeric_limits<double>::quiet_NaN();
#endif
}
if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
return x;
auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
int
expX = exp( binX ),
expY = exp( binY );
auto mant = []( uint64_t b ) { return b & MANT; };
uint64_t
mantX = mant( binX ),
mantY = mant( binY );
static auto normalize = []( int &exp, uint64_t &mant )
{
unsigned shift = countl_zero( mant ) - 11;
mant <<= shift;
exp -= shift;
};
auto build = []( int &exp, uint64_t &mant )
{
if( exp ) [[likely]]
mant |= IMPLICIT;
else
{
exp = 1;
normalize( exp, mant );
}
};
build( expX, mantX );
build( expY, mantY );
uint64_t signX = binX & SIGN;
int expDiff;
while( (expDiff = expX - expY) > 0 )
{
unsigned bits = expDiff <= 11 ? expDiff : 11;
if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
return bit_cast<double>( signX );
expX -= bits;
normalize( expX, mantX );
}
if( !expDiff && mantX >= mantY ) [[unlikely]]
if( (mantX -= mantY) ) [[likely]]
normalize( expX, mantX );
else
return bit_cast<double>( signX );
if( expX <= 0 ) [[unlikely]]
{
assert(expX >= -51);
mantX = mantX >> (unsigned)(-expX + 1);
expX = 0;
}
return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - std::round(div));
}
That is why I don't use multiplication. Did you ever ask yourself
what the meaning of 'f' in 'fma' is?
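For anyone wondering: the 'f' stands for "fused" - the product feeds the addition unrounded. A tiny illustration (my own example values; it assumes FP contraction is off, which, as noted later in the thread, is MSVC's default behaviour):
#include <cmath>
#include <cstdio>
int main()
{
    double a = 1.0 + 0x1p-52, b = 1.0 - 0x1p-52;
    std::printf( "%a\n", a * b - 1.0 );            // 0x0p+0: a*b was already rounded to 1
    std::printf( "%a\n", std::fma( a, b, -1.0 ) ); // -0x1p-104: exact product, one rounding
}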
On Mon, 24 Feb 2025 11:48:08 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - std::round(div));
}
On 24.02.2025 at 16:52, Michael S wrote:
That is why I don't use multiplication. Did you ever ask yourself
what the meaning of 'f' in 'fma' is?
The FMA-instructions produce the same results:
#include <iostream>
#include <random>
#include <bit>
#include <cmath>
#include <iomanip>
#include <intrin.h>
using namespace std;
int main()
{
auto fma = []( double a, double b, double c )
{
__m128d mA, mB, mC;
mA.m128d_f64[0] = a;
mB.m128d_f64[0] = b;
mC.m128d_f64[0] = c;
return _mm_fmadd_pd( mA, mB, mC ).m128d_f64[0];
};
mt19937_64 mt;
uniform_int_distribution<uint64_t> finites( 1, 0x7FEFFFFFFFFFFFFFu );
auto rnd = [&]() -> double { return
bit_cast<double>( finites( mt ) ); };
ptrdiff_t nEQs = 0;
for( ptrdiff_t r = 0; r != 1'000'000; ++r )
{
double
a = rnd(), b = rnd(), c = rnd(),
rA = fma( a, b, c ),
rB = a * b + c;
nEQs = rA != rB;
}
cout << hexfloat << nEQs / 1.0e6 << endl;
}
On Tue, 25 Feb 2025 09:09:21 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 24.02.2025 at 16:52, Michael S wrote:
That is why I don't use multiplication. Did you ever ask yourself
what the meaning of 'f' in 'fma' is?
The FMA-instructions produce the same results:[snip]
GIGO.
Do a proper test then you'd get a proper answer.
On 25.02.2025 at 16:26, Michael S wrote:
On Tue, 25 Feb 2025 09:09:21 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 24.02.2025 at 16:52, Michael S wrote:
That is why I don't use multiplication. Did you ever ask yourself
what the meaning of 'f' in 'fma' is?
The FMA-instructions produce the same results:[snip]
GIGO.
Do a proper test then you'd get a proper answer.
The test is proper with MSVC since MSVC doesn't replace the
"a * b + c" operation with an FMA operation.
With your code it isn't guaranteed that the CPU-specific FMA
operations are used.
I'm using the SSE FMA operation explicitly and I'm using it for a
million random finite double values.
Don't invent your own fma(). Use the one provided by the library.
Then MSVC will do what it is prescribed to do by the standard.
Originally, I didn't even try to investigate what garbage exactly you
are feeding to your test. Now I took a look. It seems that you are
doing something fundamentally stupid, like making all fma inputs positive.
On 25.02.2025 at 18:17, Michael S wrote:
Don't invent your own fma(). Use the one provided by the library.
Then MSVC will do what it is prescribed to do by the standard.
I want to be sure that I'm using the SSE FMA operation and not a
conventional substitute of two instructions.
Originally, I didn't even try to investigate what garbage exactly
you are feeding to your test. Now I took a look. It seems that you
are doing something fundamentally stupid, like making all fma inputs
positive.
I extended the test to incorporate all possible double
bit representations (taken from mt()) - with no difference.
On 24.02.2025 at 23:21, Mr Flibble wrote:
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - std::round(div));
}
Doesn't work, not only for the reasons already mentioned.
On Tue, 25 Feb 2025 07:37:23 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 24.02.2025 at 23:21, Mr Flibble wrote:
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - std::round(div));
}
Doesn't work, not only for the reasons already mentioned.
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - std::trunc(div));
}
/Flibble
double myFmod(double x, double y)
{
double div = x / y;
return y * (div - std::trunc(div));
}
SSEn has no FMA operations.
Here is an example of the program that prints 250508 on Intel Haswell
CPU, but prints 0 on Intel Ivy Bridge.
Compiled as 'cl -O1 -W4 -MD fma_tst0.c'.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
static
unsigned long long rnd(void)
{
unsigned long long x = 0;
for (int i = 0; i < 5; ++i)
x = (x << 15) + (rand() & 0x7FFF);
return x;
}
int main(void)
{
srand(1);
int n = 0;
for (int i = 0; i < 1000000; ++i) {
double x = rnd() * 0x1p-64;
double y = rnd() * 0x1p-64;
double z = rnd() * 0x1p-114;
double r1 = x*y + z;
double r2 = fma(x, y, z);
n += r1 != r2;
}
printf("%d\n", n);
return 0;
}
It is certainly worth a bug report, but I am afraid that Microsoft will
do nothing to fix it, likely claiming that they don't care about old
hardware.
On Wed, 26 Feb 2025 00:36:19 +0200
Michael S <already5chosen@yahoo.com> wibbled:
Here is an example of the program that prints 250508 on Intel Haswell
CPU, but prints 0 on Intel Ivy Bridge.
Compiled as 'cl -O1 -W4 -MD fma_tst0.c'.[snip]
It is certainly worth a bug report, but I am afraid that Microsoft
will do nothing to fix it, likely claiming that they don't care
about old hardware.
Just FYI - it also returns 0 when compiled by Clang on an ARM Mac.
On Wed, 26 Feb 2025 08:16:01 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
I played a little on godbolt and it seems that the bug is relatively
new. clang 13 still generates correct code. clang 14 does not. I.e.
slightly less than 3 years.
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.[snip]
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double myFmod(double x, double y)
{
return x / y - std::trunc(x / y) * y;
}
double myFmod(double x, double y)
{
return x / y - std::trunc(x / y) * y;
}
/Flibble
On Wed, 26 Feb 2025 14:27:21 +0200
Michael S <already5chosen@yahoo.com> wibbled:
On Wed, 26 Feb 2025 08:16:01 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
I played a little on godbolt and it seems that the bug is relatively
new. clang 13 still generates correct code. clang 14 does not, i.e.
slightly less than 3 years.
I don't think they've noticed:
R8603$ cc --version
Apple clang version 16.0.0 (clang-1600.0.26.6)
On Thu, 27 Feb 2025 21:16:50 +0000
Mr Flibble <leigh@i42.co.uk> wrote:
double myFmod(double x, double y)
{
return x / y - std::trunc(x / y) * y;
}
/Flibble
Nonsense.
The one below is not nonsense, but still very bad.
double myFmod(double x, double y)
{
return x - trunc(x / y) * y;
}
Even ignoring potential overflow during division, this method is
very imprecise.
(1e3/9 - trunc(1e3/9))*9 = 1.000000000000028
(1e6/9 - trunc(1e6/9))*9 = 0.999999999985448
(1e9/9 - trunc(1e9/9))*9 = 0.999999940395355
(1e12/9 - trunc(1e12/9))*9 = 1.000030517578125
(1e15/9 - trunc(1e15/9))*9 = 0.984375
OTOH
1e3/9 - trunc(1e3/9)*9 = 1
1e6/9 - trunc(1e6/9)*9 = 1
1e9/9 - trunc(1e9/9)*9 = 1
1e12/9 - trunc(1e12/9)*9 = 1
1e15/9 - trunc(1e15/9)*9 = 1
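A short program (my own, not from the post) that reproduces the numbers above:
#include <cmath>
#include <cstdio>
int main()
{
    const double xs[] = { 1e3, 1e6, 1e9, 1e12, 1e15 };
    for( double x : xs )
    {
        double bad  = (x / 9 - std::trunc( x / 9 )) * 9; // multiply last: rounding error shows
        double good = x - std::trunc( x / 9 ) * 9;       // subtract the multiple from x: exact here
        std::printf( "%g: %.16g vs %g\n", x, bad, good );
    }
}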
double myFmod(double x, double y)
{
return x / y - std::trunc(x / y) * y;
}
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.[snip]
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
double my_fmod(double x, double y)
{
if (y == 0.0)
return x / y;
return x - std::trunc(x / y) * y;
}
On 28.02.2025 at 19:30, Mr Flibble wrote:
double my_fmod(double x, double y)
{
if (y == 0.0)
return x / y;
return x - std::trunc(x / y) * y;
}
This still sucks. Try it with this test:
#include <iostream>
#include <cmath>
#include <random>
#include <bit>
using namespace std;
double trivialFmod( double a, double b );
int main()
{
mt19937_64 mt;
uniform_int_distribution<uint64_t> gen( 1, 0x7FEFFFFFFFFFFFFFu );
size_t imprecise = 0, outOfRange = 0;
for( size_t r = 1'000'000; r; --r )
{
double
a = bit_cast<double>( gen( mt ) ),
b = bit_cast<double>( gen( mt ) ),
fm = fmod( a, b ),
tfm = trivialFmod( a, b );
imprecise += fm != tfm;
outOfRange += tfm >= b;
}
auto print = []( char const *what, size_t n ) { cout << what <<
(ptrdiff_t)n / (1.0e6 / 100) << "%" << endl; };
print( "imprecise: ", imprecise );
print( "out of range: ", outOfRange );
}
double trivialFmod( double a, double b )
{
return a - trunc( a / b ) * b;
}
IEEE 754 does not define how std::fmod should behave, only
std::remainder.
On 28.02.2025 at 20:47, Mr Flibble wrote:
IEEE 754 does not define how std::fmod should behave, only
std::remainder.
There's only one way to do it for finite numbers.
On Fri, 28 Feb 2025 20:49:38 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 28.02.2025 at 20:47, Mr Flibble wrote:
IEEE 754 does not define how std::fmod should behave, only
std::remainder.
There's only one way to do it for finite numbers.
Not true as there is a fixed mantissa size, so finite precision,
making your test case useless if x is sufficiently large.
On Wed, 26 Feb 2025 14:39:34 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
On Wed, 26 Feb 2025 14:27:21 +0200
Michael S <already5chosen@yahoo.com> wibbled:
On Wed, 26 Feb 2025 08:16:01 -0000 (UTC)
Muttley@DastardlyHQ.org wrote:
I played a little on godbolt and it seems that the bug is relatively
new. clang 13 still generates correct code. clang 14 does not. I.e.
slightly less than 3 years.
I don't think they've noticed:
R8603$ cc --version
Apple clang version 16.0.0 (clang-1600.0.26.6)
More googling/stack-overflowing.
clang/LLVM people think that it is a feature rather than a bug. They
claim that the standard allows fusing. I think that they are wrong, but
I didn't read the respective part of the standard.
The behavior can be turned back to the clang 13 way with -ffp-contract=off.
Or with the pragma
#pragma STDC FP_CONTRACT OFF
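A minimal example of applying that pragma (my own; whether it is honoured in C++ mode is compiler-specific, the -ffp-contract=off switch is the reliable command-line equivalent):
#include <cmath>
#include <cstdio>
#pragma STDC FP_CONTRACT OFF
int main()
{
    double a = 1.0 / 3.0, b = 3.0, c = -1.0;
    std::printf( "%a\n", a * b + c );           // two roundings when not contracted: 0x0p+0
    std::printf( "%a\n", std::fma( a, b, c ) ); // always a single rounding: -0x1p-54
}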
On 01.03.2025 at 15:37, Mr Flibble wrote:
On Fri, 28 Feb 2025 20:49:38 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 28.02.2025 at 20:47, Mr Flibble wrote:
IEEE 754 does not define how std::fmod should behave, only
std::remainder.
There's only one way to do it for finite numbers.
Not true as there is a fixed mantissa size, so finite precision,
making your test case useless if x is sufficiently large.
The way to do a modulo calculation for every floating-point value
except inf or nan (finite numbers) is always the same for all
implementations. And correct implementations are always without
precision loss, i.e. exact: if |x| >= |y|, both x and y are integer
multiples of y's ulp, so the exact remainder is too, and since its
magnitude is below |y| it fits in a double without rounding.
As I've shown, solutions like yours are only 50% exact and in 2% of
cases they generate out-of-range results.
Thus you are asserting that all finite numbers have an exact IEEE 754 floating point representation ...
On 01.03.2025 at 18:11, Mr Flibble wrote:
Thus you are asserting that all finite numbers have an exact IEEE 754
floating point representation ...
But the results of floating-point operations usually have precision loss;
except fmod(), which is always correct - when implemented properly. You
didn't implement it correctly.
On Sat, 1 Mar 2025 21:02:32 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 01.03.2025 at 18:11, Mr Flibble wrote:
Thus you are asserting that all finite numbers have an exact IEEE 754
floating point representation ...
But the result of floating-point operations usually have precision-loss;
except fmod(), which is always correct - when implemented properly. You
didn't implement it correctly.
False, see my other post.
/Flibble
On 01.03.2025 at 21:32, Mr Flibble wrote:
On Sat, 1 Mar 2025 21:02:32 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
On 01.03.2025 at 18:11, Mr Flibble wrote:
Thus you are asserting that all finite numbers have an exact IEEE 754
floating point representation ...
But the results of floating-point operations usually have precision loss;
except fmod(), which is always correct - when implemented properly. You
didn't implement it correctly.
False, see my other post.
/Flibble
This:
#include <iostream>
#include <cmath>
#include <random>
#include <bit>
using namespace std;
double my_fmod( double x, double y );
int main()
{
mt19937_64 mt;
uniform_int_distribution<uint64_t> gen( 1, 0x7FEFFFFFFFFFFFFFu );
size_t imprecise = 0, outOfRange = 0;
for( size_t r = 1'000'000; r; --r )
{
double
a = bit_cast<double>( gen( mt ) ),
b = bit_cast<double>( gen( mt ) ),
fm = fmod( a, b ),
tfm = my_fmod( a, b );
imprecise += fm != tfm;
outOfRange += tfm >= b;
}
auto print = []( char const *what, size_t n ) { cout << what <<
(ptrdiff_t)n / (1.0e6 / 100) << "%" << endl; };
print( "imprecise: ", imprecise );
print( "out of range: ", outOfRange );
}
double my_fmod( double x, double y )
{
if( y == 0.0 )
return x / y;
return x - std::trunc( x / y ) * y;
}
... prints this ...
imprecise: 49.9096%
out of range: 2.0039%
So your solution is unusable.
I wanted to optimize fmod to be a bit faster. This is my C++20 solution.[snip]
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
Thus you are asserting that all finite numbers have an exact IEEE 754 floating point representation ...
I wanted to optimize fmod to be a bit faster. This is my C++20
solution.
It's about six times faster than the glibc 2.31 solution in my
benchmark. The returned NaNs and the raised exceptions are MSVC-
and glibc-compatible.
This is my code, improved by the _udiv128 intrinsic of MSVC which
provides a 128 / 64 division. With that my algorithm becomes nearly
three times as fast as before. I'll provide a g++ / clang++ compatible
version with inline assembly later.
On Sun, 2 Mar 2025 17:10:37 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
This is my code, improved by the _udiv128 intrinsic of MSVC which
provides a 128 / 64 division. With that my algorithm becomes nearly
three times as fast as before. I'll provide a g++ / clang++ compatible
version with inline assembly later.
Still slow tho.
On Mon, 24 Feb 2025 11:48:08 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
I wanted to optimize fmod to be a bit faster. This is my C++20
solution.
How about that?
Pay attention, it's C rather than C++. So 5 times shorter :-)
It's not the fastest for big x/y ratios, but rather simple and not
*too* slow. At least as long as hardware supports FMA.
For small x/y ratios it should be pretty close to best possible.
Still, it is better than Bonita's in all possible circumstances.
With MSVC and the fairer random numbers I chose I'm 2.5 times
faster.
On Sun, 2 Mar 2025 17:26:18 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
With MSVC and the fairer random numbers I chose I'm 2.5 times
faster.
I don't trust your benchmarking skills.
On 02.03.2025 at 18:07, Michael S wrote:
On Sun, 2 Mar 2025 17:26:18 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
With MSVC and the fairer random numbers I chose I'm 2.5 times
faster.
I don't trust your benchmarking skills.
I don't prefer your fast-path value combinations but I chose 75% random finite combinations.
On Sun, 2 Mar 2025 17:10:37 +0100, Bonita Montero
<Bonita.Montero@gmail.com> wrote:
This is my code, improved by the _udiv128 intrinsic of MSVC which
provides a 128 / 64 division. With that my algorithm becomes nearly
three times as fast as before. I'll provide a g++ / clang++ compatible
version with inline assembly later.
Still slow tho.
/Flibble
The truth is that the relative speed of FP vs integer algorithms depends
on the specific CPU that one is using for measurements.
I measured on a relatively old CPU - Intel Skylake. On this CPU integer
division is very significantly slower than floating-point division.
On newer CPUs, like Intel IceLake/Tiger Lake and Alder Lake or AMD Zen
3/4/5, and even more so on the Apple M-series, the difference in speed
between floating-point and integer division is less significant, and
in a few cases integer division is even faster, so in theory Bonita's
code could be more competitive.
From Agner Fog's tables:
Arch DIVSD DIV r64
Skylake 13-14 35-88
IceLake 13-14 15
Alder Lake 14 10
Zen3 13.5 9-17
My problem is that because of Bonita's horrible coding style I am not
even trying to understand what is going on within his/her code.
On 02.03.2025 at 18:09, Bonita Montero wrote:
On 02.03.2025 at 18:07, Michael S wrote:
On Sun, 2 Mar 2025 17:26:18 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
With MSVC and the fairer random numbers I chose I'm 2.5 times
faster.
I don't trust your benchmarking skills.
I don't prefer your fast-path value combinations but I chose 75%
random finite combinations.
This generates the random values for me:
mt19937_64 mt;
uniform_int_distribution<uint64_t>
    genType( 0, 15 ),
    genFinite( 0x0010000000000000u, 0x7FEFFFFFFFFFFFFFu ),
    genDen( 1, 0x000FFFFFFFFFFFFFu ),
    genNaN( 0x7FF0000000000001u, 0x7FFFFFFFFFFFFFFFu );
auto get = [&]()
{
    constexpr uint64_t
        FINITE_THRESH = 4, // 75% finites
        ZERO = 3,          // 6.25% zeroes
        DENORMALS = 2,     // 6.25% denormals
        INF = 1,           // 6.25% Infs
        NAN_ = 0;          // 6.25% NaNs
    uint64_t
        sign = mt() & -numeric_limits<int64_t>::min(),
        type = genType( mt );
    if( type >= FINITE_THRESH )
        return bit_cast<double>( sign | genFinite( mt ) );
    if( type == ZERO )
        return bit_cast<double>( sign );
    if( type == DENORMALS )
        return bit_cast<double>( sign | genDen( mt ) );
    if( type == INF )
        return bit_cast<double>( sign | 0x7FF0000000000000u );
    assert(type == NAN_);
    return bit_cast<double>( sign | genNaN( mt ) );
};
Your distribution is very different from what one would expect in
real-world usage.
In real-world usage, apart from the debugging stage, there are no inf, nan
or y=zero cases. x=zero happens, but with lower probability than 6%. Denormals
also happen, but with even lower probability than x=zero.
Also, in the majority of real-world scenarios huge x/y ratios either do not
happen at all or are extremely rare.
Your coding-style is horrible.
Mine is "too beautiful" (my employer).
Mine is "too beautiful" (my employer).
It sounds like your employer agrees with me, but he expresses his
thought in a humorous style.
On 02.03.2025 at 18:52, Michael S wrote:
Your distribution is very different from what one would expect in real-world usage.
There's no real-world distribution, so I chose all finites to be
equally likely.
In real-world usage, apart from the debugging stage, there are no inf,
nan or y=zero cases. x=zero happens, but with lower probability than 6%.
Denormals also happen, but with even lower probability than x=zero.
As I said, if I chose 100% finites from the 1 to 0x7FEFFFFFFFFFFFFFu
range I'm still 2.4 times faster.
Also, in the majority of real-world scenarios huge x/y ratios either do
not happen at all or are extremely rare.
On Sun, 2 Mar 2025 18:55:22 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 02.03.2025 at 18:52, Michael S wrote:
Your distribution is very different from what one would expect in
real-world usage.
There's no real-world distribution, so I chose all finites to be
equally likely.
In real-world usage, apart from the debugging stage, there are no inf,
nan or y=zero cases. x=zero happens, but with lower probability than 6%.
Denormals also happen, but with even lower probability than x=zero.
As I said, if I chose 100% finites from the 1 to 0x7FEFFFFFFFFFFFFFu
range I'm still 2.4 times faster.
Also, in the majority of real-world scenarios huge x/y ratios either do
not happen at all or are extremely rare.
You didn't answer the second point, which is critical.
In your fully random scenario 48.7% of cases are huge x/y. That is
completely unrealistic.
I can easily improve the speed of huge x/y at the cost of less simple
code and a small slowdown of the more typical case, but I consider it
counterproductive. It seems the authors of standard libraries agree with
my judgment.
On 02.03.2025 at 19:09, Michael S wrote:
You didn't answer the second point, which is critical.
In your fully random scenario 48.7% of cases are huge x/y. That is completely unrealistic.
I can easily improve speed of huge x/y at cost of less simple code
and of small slowdown of more typical case, but I consider it counterproductive. It seems, authors of standard libraries agree
with my judgment.
And you use fesetround, which takes about 40 clock cycles on my
CPU under Linux (WSL2). Better choose _mm_getcsr() and _mm_setcsr()
for that, which directly set the FPU control word for SSE / AVX*
/ AVX-512. This is multiple times faster. For the x87-FPU you'd
have to choose different code, but the x87-FPU is totally broken
anyway.
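A bare sketch of that suggestion (my own helper name; 0x6000 sets both MXCSR rounding-control bits, i.e. round toward zero, matching the CHOP constant used in the code further below; a real build must also keep the compiler from reordering the operation across the CSR writes):
#include <xmmintrin.h>
double divTowardZero( double x, double y )
{
    unsigned int oldCsr = _mm_getcsr();
    _mm_setcsr( oldCsr | 0x6000 ); // RC = 11b: SSE/AVX operations now truncate
    double d = x / y;              // division performed with round-toward-zero
    _mm_setcsr( oldCsr );          // restore the previous rounding mode
    return d;
}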
On 02.03.2025 at 18:52, Michael S wrote:
Your distribution is very different from what one would expect in real-world usage.
There's no real-world distribution, so I chose all finites to be
equally likely.
In real-world usage, apart from the debugging stage, there are no inf,
nan or y=zero cases. x=zero happens, but with lower probability than 6%.
Denormals also happen, but with even lower probability than x=zero.
As I said if I chose 100% finites from the 1 to 0x7FEFFFFFFFFFFFFFu
range I'm still 2.4 times faster.
If it was on the fast path, I'd consider it.
But improving speed of unimportant slow path at cost of portability?
Nah.
On 02.03.2025 at 21:58, Michael S wrote:
If it was on the fast path, I'd consider it.
But improving speed of unimportant slow path at cost of portability?
Nah.
For the 75% random finites case I've shown your code becomes about
28% faster with _mm_getcsr() and _mm_setcsr().
For the x87-FPU you'd
have to choose different code, but the x87-FPU is totally broken
anyway.
x87 is not broken relative to its own specifications. It just happens
to be slightly different from the IEEE-754 specifications. Which is not
surprising considering that it predates the IEEE-754 standard by several
years.
Today there are very few reasons to still use x87 in new software.
However, back in its time x87 was an astonishingly good piece of work,
less so in performance (it was not fast, even by the standards of its
time), more so for features, precision and especially for the
consistency of its arithmetic.
On 03.03.2025 at 17:20, Michael S wrote:
x87 is not broken relative to its own specifications. It just
happens to be slightly different from the IEEE-754 specifications.
Which is not surprising considering that it predates the IEEE-754
standard by several years.
You can reduce the width of the mantissa to 53 or 24 bits, but the
exponent is always 15 bits; that's not up to any specification.
Today there are very few reasons to still use x87 in new software.
However back in it's time x87 was an astonishingly good piece of
work, less so in performance (it was not fast, even by standards of
its time) more so for features, precision and especially for
consistency of its arithmetic.
There are compiler settings which enforce consistency by storing
values with reduced precision and re-loading them to give predictable
results when you use values < long double. That's a mess.
That's up to the x87 specification, which predates IEEE-754.
On Mon, 3 Mar 2025 12:52:51 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 02.03.2025 at 21:58, Michael S wrote:
If it was on the fast path, I'd consider it.
But improving speed of unimportant slow path at cost of
portability? Nah.
For the 75% random finites case I've shown your code becomes about
28% faster with _mm_getcsr() and _mm_setcsr().
It seems that the major slowdown is specific to the combination of msys2
libraries with Zen3/4 CPUs.
I see even worse slowness of get/set rounding mode on msys2/Zen3.
The same msys-compiled binary on Intel CPUs is o.k., at least
relative to the other heavy things going on on the slow path.
On Zen3 with Microsoft's compiler/library it is also o.k.
As long as it only affects the slow path there is nothing to get
agitated about.
It seems that major slowdown is specific to combination of msys2
libraries with Zen3/4 CPU.
I see even worse slowness of get/set rounding mode on msys2/Zen3.
The same msys-compiled binary on Intel CPUs is o.k., at least
relatively to other heavy things going on on the slow path.
On Zen3 with Microsoft's compiler/library it is also o.k.
This is my code, improved by the _udiv128-intrinsic of MSVC which
provides a 128 / 64 division. With that my algorithm becomes nearly
three times as fast as before. I'll provide a g++ / clang++ compatible
version with inline-assembly later.
template<bool _32 = false>
double xMyFmod( double x, double y )
{
constexpr uint64_t
SIGN = 1ull << 63,
IMPLICIT = 1ull << 52,
MANT = IMPLICIT - 1,
QBIT = 1ull << 51;
uint64_t const
binX = bit_cast<uint64_t>( x ),
binY = bit_cast<uint64_t>( y );
static auto abs = []( uint64_t m ) { return m & ~SIGN; };
auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
auto isSig = []( uint64_t m ) { return !(m & QBIT); };
if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT : binX );
#else
{
if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
feraiseexcept( FE_INVALID );
return bit_cast<double>( binX | QBIT );
}
#endif
auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
{
if constexpr( _32 )
if( isInf( binX ) )
feraiseexcept( FE_INVALID );
return y;
}
#else
{
if( isSig( binY ) ) [[unlikely]]
feraiseexcept( FE_INVALID );
return bit_cast<double>( binY | QBIT );
}
#endif
if( isInf( binX ) ) // x == Inf
{
feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
return bit_cast<double>( binX & ~MANT | QBIT );
#else
return -numeric_limits<double>::quiet_NaN();
#endif
}
if( !abs( binY ) ) [[unlikely]] // y == 0
{
feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
return numeric_limits<double>::quiet_NaN();
#else
return -numeric_limits<double>::quiet_NaN();
#endif
}
if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
return x;
auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
int
expX = exp( binX ),
expY = exp( binY );
auto mant = []( uint64_t b ) { return b & MANT; };
uint64_t
mantX = mant( binX ),
mantY = mant( binY );
int headBits = 11;
static auto normalize = [&]( int &exp, uint64_t &mant )
{
unsigned shift = countl_zero( mant ) - headBits;
mant <<= shift;
exp -= shift;
};
auto build = []( int &exp, uint64_t &mant )
{
if( exp ) [[likely]]
mant |= IMPLICIT;
else
{
exp = 1;
normalize( exp, mant );
}
};
build( expX, mantX );
build( expY, mantY );
int
tailX = countr_zero( mantX ),
tailY = countr_zero( mantY ),
tailBits = tailX <= tailY ? tailX : tailY;
headBits += tailBits;
mantX >>= tailBits;
mantY >>= tailBits;
uint64_t signX = binX & SIGN;
int expDiff;
#if defined(_MSC_VER)
while( (expDiff = expX - expY) > 63 )
{
unsigned long long hi = mantX >> 1, lo = mantX << 63, remainder;
(void)_udiv128( hi, lo, mantY, &remainder );
expX -= 63;
mantX = remainder;
normalize( expX, mantX );
}
#endif
while( (expDiff = expX - expY) > 0 )
{
unsigned bits = expDiff <= headBits ? expDiff : headBits;
if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
return bit_cast<double>( signX );
expX -= bits;
normalize( expX, mantX );
}
if( !expDiff && mantX >= mantY ) [[unlikely]]
if( (mantX -= mantY) ) [[likely]]
normalize( expX, mantX );
else
return bit_cast<double>( signX );
mantX <<= tailBits;
mantY <<= tailBits;
if( expX <= 0 ) [[unlikely]]
{
assert(expX >= -51);
mantX = mantX >> (unsigned)(-expX + 1);
expX = 0;
}
return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}
double myFmod( double x, double y )
{
return xMyFmod( x, y );
}
inline float myFmod( float x, float y )
{
return (float)xMyFmod<true>( (double)x, (double)y );
}
This code does not work in plenty of cases. It seems, your test vectors
have poor coverage.
Try, for example, x=1.8037919852882307, y=2.22605637008665934e-194
On 09.03.2025 at 00:31, Michael S wrote:
This code does not work in plenty of cases. It seems, your test
vectors have poor coverage.
Try, for example, x=1.8037919852882307, y=2.22605637008665934e-194
cout << hexfloat << myFmod( 1.8037919852882307, 2.22605637008665934e-194 ) << endl;
cout << hexfloat << fmod( 1.8037919852882307,
2.22605637008665934e-194 ) << endl;
Prints the same result under Linux and Windows.
This prints the same result (0.0) under Windows and Linux:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting full
files, with all includes and using directives.
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting
full files, with all includes and using directives.
You could simply replace the single function I've shown.
On Sun, 9 Mar 2025 10:54:40 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting
full files, with all includes and using directives.
You could simply replace the single function I've shown.
I can. I don't want to do it.
You want me to look at/test your code? You post full code.
Simple, isn't it?
This prints the same result (0.0) under Windows and Linux:
double myFmod( double x, double y )
{
constexpr uint64_t
SIGN = 1ull << 63,
IMPLICIT = 1ull << 52,
MANT = IMPLICIT - 1,
QBIT = 1ull << 51;
uint64_t const
binX = bit_cast<uint64_t>( x ),
binY = bit_cast<uint64_t>( y );
static auto abs = []( uint64_t m ) { return m & ~SIGN; };
auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
auto isSig = []( uint64_t m ) { return !(m & QBIT); };
if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT :
binX );
#else
{
if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
feraiseexcept( FE_INVALID );
return bit_cast<double>( binX | QBIT );
}
#endif
if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
return y;
#else
{
if( isSig( binY ) ) [[unlikely]]
feraiseexcept( FE_INVALID );
return bit_cast<double>( binY | QBIT );
}
#endif
auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
if( isInf( binX ) ) // x == Inf
{
feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
return bit_cast<double>( binX & ~MANT | QBIT );
#else
return -numeric_limits<double>::quiet_NaN();
#endif
}
if( !abs( binY ) ) [[unlikely]] // y == 0
{
feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
return numeric_limits<double>::quiet_NaN();
#else
return -numeric_limits<double>::quiet_NaN();
#endif
}
if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
return x;
auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
int
expX = exp( binX ),
expY = exp( binY );
auto mant = []( uint64_t b ) { return b & MANT; };
uint64_t
mantX = mant( binX ),
mantY = mant( binY );
int headBits = 11;
static auto normalize = [&]( int &exp, uint64_t &mant )
{
unsigned shift = countl_zero( mant ) - headBits;
mant <<= shift;
exp -= shift;
};
auto build = []( int &exp, uint64_t &mant )
{
if( exp ) [[likely]]
mant |= IMPLICIT;
else
{
exp = 1;
normalize( exp, mant );
}
};
build( expX, mantX );
build( expY, mantY );
int
tailX = countr_zero( mantX ),
tailY = countr_zero( mantY ),
tailBits = tailX <= tailY ? tailX : tailY;
mantX >>= tailBits;
mantY >>= tailBits;
headBits += tailBits;
uint64_t signX = binX & SIGN;
int expDiff;
#if defined(_MSC_VER) && !defined(__llvm__) && defined(_M_X64)
while( (expDiff = expX - expY) > 0 )
{
unsigned bits = expDiff <= 63 ? expDiff : 63;
unsigned long long hi = mantX >> 64 - bits, lo = mantX << bits, remainder;
(void)_udiv128( hi, lo, mantY, &remainder );
if( !remainder ) [[unlikely]]
return bit_cast<double>( signX );
mantX = remainder;
expX -= bits;
normalize( expX, mantX );
}
#else
while( (expDiff = expX - expY) > 0 )
{
unsigned bits = expDiff <= headBits ? expDiff : headBits;
if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
return bit_cast<double>( signX );
expX -= bits;
normalize( expX, mantX );
}
#endif
if( !expDiff && mantX >= mantY ) [[unlikely]]
if( (mantX -= mantY) ) [[likely]]
normalize( expX, mantX );
else
return bit_cast<double>( signX );
mantX <<= tailBits;
mantY <<= tailBits;
if( expX <= 0 ) [[unlikely]]
{
assert(expX >= -51);
mantX = mantX >> (unsigned)(-expX + 1);
expX = 0;
}
return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}
From the viewpoint of implementing myFmod, I think using a C-like coding
style would be better, but it all depends on what you want to achieve.
On 09.03.2025 at 11:09, Michael S wrote:
On Sun, 9 Mar 2025 10:54:40 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting
full files, with all includes and using directives.
You could simply replace the single function I've shown.
I can. I don't want to do it.
You want me to look at/test your code? You post full code.
Simple, isn't it?
I've read you don't trust my tests, so use your own with myFmod.
On 09.03.2025 at 11:51, wij wrote:
From the view of implementing myFmod, I think using C-like coding
style would be better, but all depending on what you want to
achieve.
A C coding style would result in about two times the code.
So far all we have seen from you is 2-3 times longer than C code (real C,
not C-style C++) ...
On 09.03.2025 at 13:23, Michael S wrote:
So far all we have seen from you is 2-3 times longer than C code
(real C, not C-style C++) ...
Not true, since I save a lot of redundant code with [&]-lambdas.
On Sun, 9 Mar 2025 11:21:55 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 11:09, Michael S wrote:
On Sun, 9 Mar 2025 10:54:40 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
On 09.03.2025 at 10:46, Michael S wrote:
This prints the same result (0.0) under Windows and Linux:
I am no longer going to look at your code until you start posting
full files, with all includes and using directives.
You could simply replace the single function I've shown.
I can. I don't want to do it.
You want me to look at/test your code? You post full code.
Simple, isn't it?
I've read you don't trust my tests, so use your own with myFmod.
ok. I was too curios :(
This version produces correct results both when compiled under MSVC and
when compiled with other compilers. It is a little faster too.
With MSVC on an old Intel CPU it is only 2.5 times slower than the
standard library in the relevant range of x/y. The previous version was
3.4 times slower.
With gcc and clang it is still more than 6 times slower than the standard
library.
The coding style is now less insane.
Measurements in nsec.
First result - Intel Skylake at 4.25 GHz
Second result - AMD Zen3 at 3.7 GHz
abs(x/y) in range that matters [0.5:2**53]:
Standard MSVC Library - 11.1 10.4
Standard gnu Library - 5.4 10.7
Yours (MSVc) - 27.6 11.5
Yours (gcc) - 36.4 23.7
Yours (clang) - 37.4 24.3
abs(x/y) in full range [2**-2090:2**2090]:
Standard MSVC Library - 109.4 153.5
Standard glib Library - 102.3 155.5
Yours (MSVc) - 134.9 52.6
Yours (gcc) - 284.7 151.8
Yours (clang) - 285.2 156.5
So it is slow ergo a pointless alternative to what we already have.
On 09.03.2025 at 16:26, Mr Flibble wrote:
So it is slow ergo a pointless alternative to what we already have.
glibc does it in nearly the same way I do it, because the FMA solution
isn't portable.
If fma( a, b, c ) is substituted with a * b + c
because there's no proper CPU instruction, the whole approach doesn't
work.
And with support for _udiv128 my solution has about the same
performance as Michael's solution with clang++ 18.1.7.
On 09.03.2025 at 13:09, Michael S wrote:
Measurements in nsec.
First result - Intel Skylake at 4.25 GHz
Second result - AMD Zen3 at 3.7 GHz
abs(x/y) in range that matters [0.5:2**53]:
Standard MSVC Library - 11.1 10.4
Standard gnu Library - 5.4 10.7
Yours (MSVc) - 27.6 11.5
Yours (gcc) - 36.4 23.7
Yours (clang) - 37.4 24.3
abs(x/y) in full range [2**-2090:2**2090]:
Standard MSVC Library - 109.4 153.5
Standard glib Library - 102.3 155.5
Yours (MSVc) - 134.9 52.6
Yours (gcc) - 284.7 151.8
Yours (clang) - 285.2 156.5
With MSVC and an arbitrary combination of finite x and y on my
Zen4-machine:
your fmod: 77.1214
my: 38.4486
With MSVC and an arbitrary combination of finite x with exponents
ranging from 0x3FF to 0x433 (close exponents) on my Zen4-machine:
your fmod: 23.6423
my: 9.79146
This is a nearly proper implementation of your idea with
FMA-intrinsics and SSE/AVX control register access:
double fmody( double x, double y )
{
    if( isnan( x ) ) [[unlikely]]
        return x;
    if( isnan( y ) ) [[unlikely]]
        return y;
    if( isinf( x ) || !y ) [[unlikely]]
    {
        feraiseexcept( FE_INVALID );
        return numeric_limits<double>::quiet_NaN();
    }
    if( !x || isinf( y ) ) [[unlikely]]
        return x;
    uint64_t sign = bit_cast<uint64_t>( x ) & numeric_limits<int64_t>::min();
    x = abs( x );
    y = abs( y );
    int oldCsr = _mm_getcsr();
    constexpr int CHOP = 0x6000;
    _mm_setcsr( oldCsr | CHOP );
    constexpr uint64_t
        EXP = -(1ll << 52),
        MANT = ~EXP;
    uint64_t binY = bit_cast<uint64_t>( y );
    int64_t expY = binY & EXP;
    if( !expY ) [[unlikely]]
        expY = (uint64_t)(0 - (countl_zero( binY & MANT ) - 12)) << 52;
    while( x >= y )
    {
        uint64_t yExpAdd = 0;
        double div = x / y;
        if( div < 0x1.FFFFFFFFFFFFFp+1023 ) [[likely]]
            div = xtrunc( div );
        else
        {
            uint64_t
                binX = bit_cast<uint64_t>( x ),
                newExp = expY + (54ull << 52);
            yExpAdd = (binX & EXP) - newExp;
            div = xtrunc( bit_cast<double>( newExp | binX & MANT ) / y );
        }
        __m128d mult1, mult2, add;
#if defined(_MSC_VER)
        mult1.m128d_f64[0] = div;
        mult2.m128d_f64[0] = -bit_cast<double>( binY + yExpAdd );
        add.m128d_f64[0] = x;
        x = _mm_fmadd_sd( mult1, mult2, add ).m128d_f64[0];
#else
        mult1[0] = div;
        mult2[0] = -bit_cast<double>( binY + yExpAdd );
        add[0] = x;
        x = _mm_fmadd_sd( mult1, mult2, add )[0];
#endif
        if( !x ) [[unlikely]]
            return bit_cast<double>( sign );
    }
    _mm_setcsr( oldCsr );
    return bit_cast<double>( sign | bit_cast<uint64_t>( x ) );
}
The only thing that doesn't work currently is the support for denormal values.
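For reference, a minimal check harness one could drop next to fmody
(my own sketch, not part of the posted code; it assumes fmody is defined
in the same translation unit):

#include <cmath>
#include <cstdio>
#include <random>

double fmody( double x, double y );   // the function posted above

int main()
{
    std::mt19937_64 gen( 42 );
    std::uniform_real_distribution<double> dist( -1e300, 1e300 );
    long mismatches = 0;
    for( long i = 0; i != 1000000; ++i )
    {
        double x = dist( gen ), y = dist( gen );
        double a = fmody( x, y ), b = std::fmod( x, y );
        // NaN == NaN is false, so treat two NaNs as a match
        if( a != b && !(std::isnan( a ) && std::isnan( b )) )
            ++mismatches;
    }
    std::printf( "mismatches: %ld\n", mismatches );
}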
In absence of FMA hardware it is expected to be rather slow, but still
has to produce correct results.
It would be interesting to test if glibc is better in that regard.
I readily admitted that from a practical perspective all 3 solutions
that I posted here (one of which is incorrect) are pointless.
Still, what you say about relative speed is true only on newer CPUs and
only when compiled with MSVC. ...
I already said that I don't approve of non-portable constructs like
_mm_getcsr()/_mm_setcsr() except when they help important cases and
help A LOT. Neither applies here.
I've recently posted a comparison of your solution against mine on
a Zen4 CPU with MSVC. Your solution is equally performant with close
exponents (0x3FF to 0x433) if I compile it under WSL2 with g++-12.
The problem with MSVC is that the trunc() function is extremely slow.
In my implementation of your idea (x86 FMA) I use my own function
xtrunc, which makes the code twice as fast.
My solution is much simpler than yours since there are no separate
fast and slow paths. For close exponents the performance is about the
same as your solution.
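xtrunc itself was never posted in the thread, so the following is only a
guess at what such a helper could look like - a branch-free SSE4.1
truncation that avoids MSVC's library call for trunc() (the name
xtrunc_guess and the whole body are my assumption, not Bonita's code):

#include <smmintrin.h>   // SSE4.1

static inline double xtrunc_guess( double v )
{
    __m128d x = _mm_set_sd( v );
    // round toward zero, suppress the inexact exception
    x = _mm_round_sd( x, x, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC );
    return _mm_cvtsd_f64( x );
}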
I already said that I don't approve of non-portable constructs like
_mm_getcsr()/_mm_setcsr() except when they help important cases and
help A LOT. Neither applies here.
Am 09.03.2025 um 19:04 schrieb Michael S:
I already said that I don't approve of non-portable constructs like
_mm_getcsr()/_mm_setcsr() except when they help important cases and
help A LOT. Neither applies here.
I dropped getting and setting the MXCSR register to set the rounding
mode. Now I set the rounding mode directly with the intrinsics
_mm_div_round_sd and _mm_fmadd_round_sd. This more manual code does
help "A LOT", i.e. the solution is about two-thirds faster than your
initial solution with clang++-18 under Linux.
But there's still a problem with the denormals.
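The exact code wasn't posted, but a minimal sketch of what "set the
rounding mode directly in the intrinsic" can look like is below; it
assumes AVX-512F (the scalar _*_round_* intrinsics are only encodable
with EVEX), and reduce_step is a name I made up:

#include <immintrin.h>

// one reduction step x -> x - trunc(x/y) * y, everything rounded toward
// zero, without touching MXCSR
static inline double reduce_step( double x, double y )
{
    constexpr int RZ = _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC;
    __m128d vx = _mm_set_sd( x ), vy = _mm_set_sd( y );
    __m128d q  = _mm_div_round_sd( vx, vy, RZ );  // x / y, rounded toward zero
    q = _mm_round_sd( q, q, RZ );                 // truncate to an integer value
    return _mm_cvtsd_f64(                         // x + (-y) * q, one rounding
        _mm_fmadd_round_sd( q, _mm_set_sd( -y ), vx, RZ ) );
}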
I can't check your claim about speed, because the code does not compile.
The compiler has no idea WTF xtrunc is. But considering that so far all
your claims about speed were false, I can safely assume that this one
is false as well.
On y or on x?
BTW, for the last 20-25 years IEEE-754 has preferred to call binary
floating-point numbers in the range (0:DBL_MIN) 'subnormal' rather than
'denormal'. I'd guess that it is because the term 'denormal' has a wider
meaning.
I can't check your claim about speed, because the code does not compile.
I tried to use an unsigned __int128 / uint64_t division and I expected
that the compiler calls a library function which does the subtract and
shift manually. But g++ as well as clang++ handle this as a 128 : 64
division somehow.
And now my original solution is about 23% faster than your solution
for close exponents (exponent difference <= 53), and for arbitrary
exponent differences your solution is about 3% faster.
double fmodO( double x, double y )
{
    constexpr uint64_t
        SIGN = 1ull << 63,
        IMPLICIT = 1ull << 52,
        MANT = IMPLICIT - 1,
        QBIT = 1ull << 51;
    uint64_t const
        binX = bit_cast<uint64_t>( x ),
        binY = bit_cast<uint64_t>( y );
    static auto abs = []( uint64_t m ) { return m & ~SIGN; };
    auto isNaN = []( uint64_t m ) { return abs( m ) >= 0x7FF0000000000001u; };
    auto isSig = []( uint64_t m ) { return !(m & QBIT); };
    if( isNaN( binX ) ) [[unlikely]] // x == NaN
#if defined(_MSC_VER)
        return bit_cast<double>( isNaN( binY ) ? binY | binX & binY & QBIT : binX );
#else
    {
        if( isSig( binX ) || isNaN( binY ) && isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binX | QBIT );
    }
#endif
    if( isNaN( binY ) ) [[unlikely]] // x != NaN || y == NaN
#if defined(_MSC_VER)
        return y;
#else
    {
        if( isSig( binY ) ) [[unlikely]]
            feraiseexcept( FE_INVALID );
        return bit_cast<double>( binY | QBIT );
    }
#endif
    auto isInf = []( uint64_t m ) { return abs( m ) == 0x7FF0000000000000u; };
    if( isInf( binX ) ) // x == Inf
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return bit_cast<double>( binX & ~MANT | QBIT );
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binY ) ) [[unlikely]] // y == 0
    {
        feraiseexcept( FE_INVALID );
#if defined(_MSC_VER)
        return numeric_limits<double>::quiet_NaN();
#else
        return -numeric_limits<double>::quiet_NaN();
#endif
    }
    if( !abs( binX ) || isInf( binY ) ) [[unlikely]] // x == 0 || y == Inf
        return x;
    auto exp = []( uint64_t b ) -> int { return b >> 52 & 0x7FF; };
    int
        expX = exp( binX ),
        expY = exp( binY );
    auto mant = []( uint64_t b ) { return b & MANT; };
    uint64_t
        mantX = mant( binX ),
        mantY = mant( binY );
    int headBits = 11;
    static auto normalize = [&]( int &exp, uint64_t &mant )
    {
        unsigned shift = countl_zero( mant ) - headBits;
        mant <<= shift;
        exp -= shift;
    };
    auto build = []( int &exp, uint64_t &mant )
    {
        if( exp ) [[likely]]
            mant |= IMPLICIT;
        else
        {
            exp = 1;
            normalize( exp, mant );
        }
    };
    build( expX, mantX );
    build( expY, mantY );
    int
        tailX = countr_zero( mantX ),
        tailY = countr_zero( mantY ),
        tailBits = tailX <= tailY ? tailX : tailY;
    mantX >>= tailBits;
    mantY >>= tailBits;
    headBits += tailBits;
    uint64_t signX = binX & SIGN;
    int expDiff;
#if defined(_MSC_VER) && !defined(__llvm__) && defined(_M_X64)
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= 63 ? expDiff : 63;
        unsigned long long hi = mantX >> 64 - bits, lo = mantX << bits, remainder;
        (void)_udiv128( hi, lo, mantY, &remainder );
        if( !remainder ) [[unlikely]]
            return bit_cast<double>( signX );
        mantX = remainder;
        expX -= bits;
        normalize( expX, mantX );
    }
#elif defined(__GNUC__) || defined(__clang__)
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= 63 ? expDiff : 63;
        unsigned __int128 dividend = (unsigned __int128)mantX << bits;
        mantX = (uint64_t)(dividend % mantY);
        if( !mantX ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
#else
    while( (expDiff = expX - expY) > 0 )
    {
        unsigned bits = expDiff <= headBits ? expDiff : headBits;
        if( !(mantX = (mantX << bits) % mantY) ) [[unlikely]]
            return bit_cast<double>( signX );
        expX -= bits;
        normalize( expX, mantX );
    }
#endif
    if( !expDiff && mantX >= mantY ) [[unlikely]]
        if( (mantX -= mantY) ) [[likely]]
            normalize( expX, mantX );
        else
            return bit_cast<double>( signX );
    mantX <<= tailBits;
    mantY <<= tailBits;
    if( expX <= 0 ) [[unlikely]]
    {
        assert(expX >= -51);
        mantX = mantX >> (unsigned)(-expX + 1);
        expX = 0;
    }
    return bit_cast<double>( signX | (uint64_t)expX << 52 | mantX & MANT );
}
It should be an obvious thing to anybody who cared to think for 15
seconds.
A library or compiler starts with hi1 = rem(0:x_hi, y). Now hi1 is
guaranteed to be smaller than y, so it's safe to do rem(hi1:x_lo, y).
It is not *very* slow, but still there are 2 dependent division
operations.
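In portable-ish C++ those two steps look roughly like the sketch below
(my own illustration, not code from the thread; it assumes the divisor
fits in 64 bits and uses the gcc/clang __int128 extension - the second
remainder typically lands in __umodti3, which special-cases the situation
where the high half is already smaller than the divisor):

#include <cstdint>

uint64_t rem128by64( uint64_t x_hi, uint64_t x_lo, uint64_t y )
{
    uint64_t hi1 = x_hi % y;                        // rem(0:x_hi, y), so hi1 < y
    unsigned __int128 rest =
        ((unsigned __int128)hi1 << 64) | x_lo;      // the 128-bit value hi1:x_lo
    return (uint64_t)(rest % y);                    // rem(hi1:x_lo, y)
}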
Measurements in nsec.
First result - Intel Skylake at 4.25 GHz
Second result - AMD Zen3 at 3.7 GHz
abs(x/y) in range that matters [0.5:2**53]:
Standard MSVC Library - 11.1 10.4
Standard gnu Library - 5.4 10.7
Yours (MSVC) - 27.6 11.5
Yours (gcc) - 36.4 23.7
Yours (clang) - 37.4 24.3
my (MSVC) - 10.7 11.3
my (gcc) - 7.7 7.6
my (clang) - 6.3 7.5
Your last (MSVC) - 27.6 11.6
Your last (gcc) - 33.8 28.6
Your last (clang) - 32.6 26.9
abs(x/y) in full range [2**-2090:2**2090]:
Standard MSVC Library - 109.4 153.5
Standard glibc Library - 102.3 155.5
Yours (MSVC) - 134.9 52.6
Yours (gcc) - 284.7 151.8
Yours (clang) - 285.2 156.5
my (MSVC) - 62.1 61.1
my (gcc) - 60.8 59.1
my (clang) - 59.9 59.3
Your last (MSVC) - 135.0 52.5
Your last (gcc) - 172.1 137.3
Your last (clang) - 167.7 126.3
Your code would be much, much cleaner and more readable if you replaced
the lambdas with proper functions. ...
Am 10.03.2025 um 13:26 schrieb Michael S:
It should be an obvious thing to anybody who cared to think for 15
seconds.
I was wrong and it absolutely isn't obvious. The compiler calls the
glibc function __umodti3, which has a shortcut for results that fit
into 64 bits. Although there's an additional call on Linux, the code
with clang++-18 is still a bit faster than my Windows solution with
the _udiv128 intrinsic; that's really surprising.
A library-or-compiler starts with hi1=rem(0:x_hi, y). Now hi1 is
guaranteed to be smaller than y, so it's safe to do rem(hi1:x_lo,
y). It is not *very* slow, but still there are 2 dependent division operations.
Both parameters are variable, so there can be no static evaluation
at compile time.
Measurements in nsec.
First result - Intel Skylake at 4.25 GHz
Second result - AMD Zen3 at 3.7 GHz
abs(x/y) in range that matters [0.5:2**53]:
Standard MSVC Library - 11.1 10.4
Standard gnu Library - 5.4 10.7
Yours (MSVC) - 27.6 11.5
Yours (gcc) - 36.4 23.7
Yours (clang) - 37.4 24.3
my (MSVC) - 10.7 11.3
my (gcc) - 7.7 7.6
my (clang) - 6.3 7.5
Your last (MSVC) - 27.6 11.6
Your last (gcc) - 33.8 28.6
Your last (clang) - 32.6 26.9
abs(x/y) in full range [2**-2090:2**2090]:
Standard MSVC Library - 109.4 153.5
Standard glibc Library - 102.3 155.5
Yours (MSVC) - 134.9 52.6
Yours (gcc) - 284.7 151.8
Yours (clang) - 285.2 156.5
my (MSVC) - 62.1 61.1
my (gcc) - 60.8 59.1
my (clang) - 59.9 59.3
Your last (MSVC) - 135.0 52.5
Your last (gcc) - 172.1 137.3
Your last (clang) - 167.7 126.3
These are the clang++-18 results on my Zen4 computer under WSL2 with
close exponents (0x3ff to 0x433):
fmodO: 9.42929
fmodM: 11.7907
So my code is about 23% faster on my computer.
These are the results for arbitrary exponents:
fmodO: 41.9115
fmodM: 41.2062
Exactly what I already mentioned.
Maybe that depends on the glibc version, because a different glibc
version might have a differently efficient __umodti3 function.
Your code would be much, much cleaner and more readable if you
replaced the lambdas with proper functions. ...
For me that doesn't make a difference.
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/udivmodti4.c
On line 114 they specialize the case of a 64-bit divisor.
On line 116 they further specialize our specific case of x.hi < y.
So, at the end they use the same single division instruction as MSVC.
The only difference is the overhead of a call and of two very predictable branches.
For my code it's strangely slow. On a 4+ GHz Zen4 I would expect ~5 nsec.
But it makes a difference for your potential readers.
Am 10.03.2025 um 15:33 schrieb Michael S:
That's what I also guessed, but maybe we don't have the same
glibc version. Or the code just runs more efficiently on my Zen4 CPU.
For my code it's strangely slow. On 4+ GHz Zen4 I would expect ~5
nsec.
I want your crystal ball.
One does not need a crystal ball to extrapolate the speed of simple integer
code from a 3.7 GHz Zen3 to a 4+ GHz Zen4 (probably 4.7 or 4.8 GHz).
But your repetitive avoidance of answering my direct questions about
what exactly you are using as "my code" makes me even more suspicious.
Am 10.03.2025 um 15:59 schrieb Michael S:
One does not need crystal ball to extrapolate speed of simple
integer code from 3.7 GHz Zen3 to 4+ GHz Zen4 (probably 4.7 or 4.8
GHz).
If only one core is computing, the clock is even 5.7 GHz.
But the results aren't better than shown.
But your repetitive avoidance of answering my direct questions about
what exactly you are using as "my code" makes me even more
suspicious.
I've shown the latest code of fmodO; you can easily integrate it into
your own benchmark.
I don't use uniform_real_distribution for the random numbers but
uniform_int_distribution with bounds of 0x3FFull << 52 and
0x433ull << 52 for the close-exponent case. The whole test code has
hardly changed since my initial post.
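My reading of that description, as a small sketch (assumptions: both
operands are drawn the same way and the sign bit stays clear; this is
not Bonita's actual test code):

#include <bit>
#include <cstdint>
#include <random>

int main()
{
    std::mt19937_64 gen( 42 );
    // raw 64-bit patterns whose exponent field lies between 0x3FF and 0x433
    std::uniform_int_distribution<uint64_t> dist( 0x3FFull << 52, 0x433ull << 52 );
    for( int i = 0; i != 10; ++i )
    {
        double x = std::bit_cast<double>( dist( gen ) );
        double y = std::bit_cast<double>( dist( gen ) );
        (void)x; (void)y;   // feed these into the fmod benchmark
    }
}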
That's not the question I am asking for the 4th or 5th time.
My question is: what *exactly* is fmodM?
I did, and presented the results. I am fully willing to believe that
the difference between our clang results is explained by a difference in
the compiler support library.
But so far I find no explanation for why results for what you claim to
be *my* code are so much slower in your measurements, despite your
faster CPU.
BTW, the number you did not publish at all was the speed of the fmod()
routine from the standard library. ...
Assuming that the exponent of y is fixed at 1023, that is approximately
the same as my own test.
Am 10.03.2025 um 16:43 schrieb Michael S:
That's not the question I am asking for 4th or 5th time.
My question is what *exactly* is fmodM.
fmodM is your code, M like Michael.
I did, and presented the results. I am fully willing to believe that
the difference between our clang results is explained by a difference in the compiler support library.
Or maybe the different CPU.
But so far I find no explanation for why results for what you claim
to be *my* code are so much slower in your measurements, despite
your faster CPU.
I compiled with -O2 and -march=native; that should be sufficient.
BTW, the number you did not publish at all was the speed of fmod()
routine from standard library. ...
I have the fmod code of glibc 2.31, which is rather slow since it
does the subtract and shift manually - code from Sun from the 90s.
Assuming that the exponent of y is fixed at 1023 that is
approximately the same as my own test.
Yes, but as you said earlier close exponents are more relevant.
What "my code"?
Post it.
Am 10.03.2025 um 18:51 schrieb Michael S:
What my code?
Post it.
I just changed the function name, and the code uses xtrunc()
instead of trunc() since trunc() is slow with MSVC. I removed my
_mm_getcsr() / _mm_setcsr() improvement since its speedup was
noticeable but not significant, unlike the xtrunc() optimization,
which gave a speedup of about +100% with MSVC.
double fmodM( double x, double y )
{
    if( isnan( x ) )
        return x;
    // pre-process y
    if( isless( y, 0 ) )
        y = -y;
    else if( isgreater( y, 0 ) )
        ;
    else {
        if( isnan( y ) )
            return y;
        // y==0
        feraiseexcept( FE_INVALID );
        return nan( "y0" );
    }
    // y in (0:+inf]
    // Quick path
    double xx = x * 0x1p-53;
    if( xx > -y && xx < y ) {
        // among other things, x guaranteed to be finite
        if( x > -y && x < y )
            return x; // case y=+-inf covered here
        double d = xtrunc( x / y );
        double res = fma( -y, d, x );
        if( signbit( x ) != signbit( res ) ) {
            // overshoot because of unfortunate division rounding
            // it is extremely rare for small x/y,
            // but not rare when x/y is close to 2**53
            res = fma( -y, d + (signbit( x ) * 2 - 1), x );
        }
        return res;
    }
    // slow path
    if( isinf( x ) ) {
        feraiseexcept( FE_INVALID );
        return nan( "xinf" );
    }
    int oldRnd = fegetround();
    fesetround( FE_TOWARDZERO );
    double ax = fabs( x );
    do {
        double yy = y;
        while( yy < ax * 0x1p-1022 )
            yy *= 0x1p1021;
        do
            ax = fma( -yy, xtrunc( ax / yy ), ax );
        while( ax >= yy );
    } while( ax >= y );
    ax = copysign( ax, x );
    fesetround( oldRnd );
    return ax;
}
Your idea is really elegant, and as I've shown it could be
significantly improved with SSE 4.1 along with FMA3. But at the point
where I noticed how performant a 128 : 64 modulo division through
glibc is, and that this is superior to the FMA solution, I dropped the
whole idea and removed the SSE-FMA code from my test program.
The relevant code is the one posted a week ago. I am posting it for
the second time:
#include <math.h>
#include <fenv.h>
double my_fmod(double x, double y)
{
if (isnan(x))
return x;
// pre-process y
if (y < 0)
y = -y;
else if (y > 0)
;
else {
if (isnan(y))
return y;
// y==0
feraiseexcept(FE_INVALID);
return NAN;
}
// y in (0:+inf]
double ax = fabs(x);
// Quick path
if (ax * 0x1p-53 < y) {
// among other things, x guaranteed to be finite
if (ax < y)
return x; // case y=+-inf covered here
double d = floor(ax/y);
double res = fma(-y, d, ax);
if (res < 0) {
// overshoot because of unfortunate division rounding
// it is extremely rare for small x/y,
// but not rare when x/y is close to 2**53
res += y;
}
if (x < 0)
res = -res;
return res;
}
// slow path
if (isinf(x)) {
feraiseexcept(FE_INVALID);
return NAN;
}
int flipflop = 0;
do {
double yy = y;
while (yy < ax * 0x1p-1022)
yy *= 0x1p1021;
do {
ax = fma(-yy, floor(ax/yy), ax);
flipflop ^= (ax < 0);
ax = fabs(ax);
} while (ax >= yy);
} while (ax >= y);
if (flipflop)
ax = y - ax;
if (x < 0)
ax = -ax;
return ax;
}
I'd rather call it "simple" or "straightforward". "Elegant" in my book
is something else. For example, the code above is closer to what I
consider elegant.
1. Long division is very slow on the majority of older CPUs. That includes
CPUs that are quite fast in the absolute sense, like Intel Skylake
with all its subvariants (Kaby Lake, Coffee Lake, Whiskey Lake, etc...)
and AMD Zen2.
2. The source language is not standard C (or C++ for that matter). One
has to use either gnu extensions or Microsoft's extensions. In the latter
case it becomes non-portable to ARM64.
4. Even on CPUs with fast long division and with compilers/libraries
that are able to generate long division, it is measurably slower than
fdiv/floor/fma in the case that corresponds to my quick path.
And for arbitrary exponents (0x1 to 0x7FE):
fmodO: 9.29622
fmodM: 11.4518
I compared your new version against fmod() of MSVC in terms of accuracy,
and your solution isn't absolutely accurate:
fmod: 80.0059
fmodM: 44.21
50.8631 bits shared accuracy
equal results: 95.917%
equal exceptions: 91.017%
equal NaN signs: 96.466%
equal NaN-types: 85.78%
equal NaNs: 66.164%
All my solutions so far have 100% values against glibc and the MSVC runtime.
Partially false alarm: I forgot to convert my xtrunc() function into
an xfloor() function. These are the accuracy results now, compared to
fmod() of MSVC:
53 bits shared accuracy
equal results: 100%
equal exceptions: 91.017%
equal NaN signs: 96.475%
equal NaN-types: 99.78%
equal NaNs: 96.253%
These are the accuracy results compared to glibc:
53 bits shared accuracy
equal results: 100%
equal exceptions: 99.901%
equal NaN signs: 87.224%
equal NaN-types: 93.181%
equal NaNs: 80.405%
Am 10.03.2025 um 22:34 schrieb Bonita Montero:
Let's establish common measurement methodology.
And for arbitrary exponents (0x1 to 0x7FE):
fmodO: 9.29622
fmodM: 11.4518
Sorry, the copy-buffer wasn't refreshed with the new results:
fmodO: 40.4702
fmodM: 40.1652
Pay attention that fmod() has no requirements w.r.t. such
exceptions as FE_INEXACT, FE_UNDERFLOW and the non-standard FE_DENORMAL.
Strictly speaking, even raising FE_OVERFLOW is not illegal,
but doing so would be bad quality of implementation.
Also the spec does not say what happens to FE_INVALID when one of the
inputs is a signalling NaN.
Am 11.03.2025 um 11:29 schrieb Michael S:
Pay attention that fmod() has no requirements w.r.t. to such
exceptions as FE_INEXACT, FE_UNDERFLOW and non-standard
FE_DENORMAL.
Yes, that's why I evaluate FE_INVALID only. But your code can also
set FE_INEXACT due to your "rounding" with sign change. MSVC also
seems to try to do the math with the FPU with an integer fallback,
because with exponent differences <= 53 MSVC's fmod() often sets
FE_INEXACT; but I ignore that because that shouldn't be part of fmod().
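A minimal way to check which flags a given fmod variant actually raises
(my own sketch; fmodX, x and y are placeholder names; strictly speaking
this needs FENV_ACCESS enabled):

#include <cfenv>
#include <cmath>
#include <cstdio>

void reportFlags( double (*fmodX)(double, double), double x, double y )
{
    std::feclearexcept( FE_ALL_EXCEPT );        // start with a clean FP status
    double r = fmodX( x, y );
    int raised = std::fetestexcept( FE_INVALID | FE_INEXACT );
    std::printf( "result=%a invalid=%d inexact=%d\n",
                 r, !!(raised & FE_INVALID), !!(raised & FE_INEXACT) );
}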
Strictly speaking, even raising FE_OVERFLOW is not illegal,
but doing so would be bad quality of implementation.
Couldn't FE_OVERFLOW happen with your implementation when the
exponents are so far apart that you get inf from the division?
Also spec does not say what happens to FE_INVALID when one of the
inputs is signalling NAN.
See my code; I return MSVC- and glibc-compatible NaNs and I raise
the same exceptions. MSVC sets FE_INVALID only when x is inf or y
is zero; glibc in addition raises FE_INVALID when either operand
is a signalling NaN.
Am 11.03.2025 um 11:29 schrieb Michael S:
Pay attention that fmod() has no requirements w.r.t. to such
exceptions as FE_INEXACT, FE_UNDERFLOW and non-standard
FE_DENORMAL.
Yes, that's why I evaluate FE_INVALID only.
Exactly. Both options are legal. MS's decision to not set FE_INVALID is
as good as glibc's decision to set it.
BTW, what is the output of MS library in that case? SNAN or QNAN?
Am 11.03.2025 um 12:34 schrieb Michael S:
Exactly. Both options are legal. MS's decision to not set
FE_INVALID is as good as glibc decision to set it.
If I do an SSE/AVX operation where either operand is a signalling NaN
I get FE_INVALID; since the FPU behaves this way, the MSVC runtime
should do that also.
BTW, what is the output of MS library in that case? SNAN or QNAN?
Results with SNaN parameters are always QNaN; that should be common
with any FPU.
But not when the library routine does not use the FPU, or uses it only for comparison ops.
The point is, it does not sound right if an SNaN is *silently* converted
to a QNaN. That type of conversion has to be loud, i.e. accompanied by
setting FE_INVALID.
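A small demonstration of that behaviour (my own example; whether the
compiler keeps the operation at run time is an assumption, hence the
volatile):

#include <cfenv>
#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
    std::feclearexcept( FE_ALL_EXCEPT );
    volatile double s = std::numeric_limits<double>::signaling_NaN();
    volatile double q = s + 0.0;   // arithmetic on an sNaN quiets it ...
    std::printf( "FE_INVALID raised: %d, result is NaN: %d\n",
                 !!std::fetestexcept( FE_INVALID ),  // ... and raises FE_INVALID
                 (int)std::isnan( q ) );
}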
On Tue, 11 Mar 2025 13:10:31 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
Am 11.03.2025 um 12:34 schrieb Michael S:
Exactly. Both options are legal. MS's decision to not set
FE_INVALID is as good as glibc decision to set it.
If I do a SSE-/AVX-operation where either operand is a signalling
NaN I get a FE_INVALID; since the FPU behaves this way the MSVC
runtime should do that also.
BTW, what is the output of MS library in that case? SNAN or QNAN?
Results with SNaN parameters are always QNaN, that shoud be common
with any FPU.
But not when library routine does not use FPU. Or uses FPU only for comparison ops.
The point is, it does not sound right if SNAN is *silently* converted
to QNAN. That type of conversion has to be loud i.e. accompanied by
setting of FE_INVALID.
Pay attention that fmod() has no requirements w.r.t. to such exceptions
as FE_INEXACT, FE_UNDERFLOW and non-standard FE_DENORMAL.
Strictly speaking, even raising FE_OVERFLOW is not illegal, but doing
so would be bad quality of implementation.
Also spec does not say what happens to FE_INVALID when one of the
inputs is signalling NAN.
On Mon, 10 Mar 2025 19:00:06 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
Your idea is really elegant
I'd rather call it "simple" or "straightforward". "Elegant" in my book
is something else. For example, the code above is closer to what I
consider elegant.
Maybe, later today or tomorrow, I'll show you a solution that I
consider bright. Bright, but impractical.
On Mon, 10 Mar 2025 20:38:18 +0200
Michael S <already5chosen@yahoo.com> wrote:
On Mon, 10 Mar 2025 19:00:06 +0100
Bonita Montero <Bonita.Montero@gmail.com> wrote:
Your idea is really elegant
I'd rather call it "simple" or "straightforward". "Elegant" in my
book is something else. For example, the code above is closer to
what I consider elegant.
Maybe, later today or tomorrow, I'll show you a solution that I
consider bright. Bright, but impractical.
Here, here!
A bright part is in lines 18 to 29. The rest are hopefully competent technicalities.