Forum: War Ensemble BBS

Floating point implementations on AMD64

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 13 17:55:18 2024

From Newsgroup: comp.lang.forth

I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):

1 FLOATS .

reports:

16 iforth
10 sf64
10 vfx64

For

: foo f+ f* ;

the resulting code is:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )

iForth:
$10226000 : foo 488BC04883ED088F4500 H.@H.m..E. $1022600A fld [r13 0 +] tbyte41DB6D00 A[m. $1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m. $10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m $10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m. $10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}. $10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

Gforth sticks out by using 8-byte FP values; most of those are stored
in memory (supporting deep FP stacks), with the top of FP stack in an
xmm register on AMD64:

: foo f+ f* ; ok
see-code foo
$7FF2CE8034E0 f+ 1->1
7FF2CE4A6E43: mov rax,r12
7FF2CE4A6E46: lea r12,$08[r12]
7FF2CE4A6E4B: addsd xmm15,$08[rax]
$7FF2CE8034E8 f* 1->1
7FF2CE4A6E51: mov rax,r12
7FF2CE4A6E54: lea r12,$08[r12]
7FF2CE4A6E59: mulsd xmm15,$08[rax]
$7FF2CE8034F0 ;s 1->1
7FF2CE4A6E5F: mov rbx,[r14]
7FF2CE4A6E62: add r14,$08
7FF2CE4A6E66: mov rax,[rbx]
7FF2CE4A6E69: jmp eax

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Sat Apr 13 18:47:20 2024

From Newsgroup: comp.lang.forth

On 4/13/24 12:55, Anton Ertl wrote:

I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):

1 FLOATS .

reports:

16 iforth
10 sf64
10 vfx64

For

: foo f+ f* ;

the resulting code is:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )

iForth:
$10226000 : foo 488BC04883ED088F4500 H.@H.m..E. $1022600A fld [r13 0 +] tbyte41DB6D00 A[m. $1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m. $10226012 fxch ST(2) D9CA YJ $10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m $10226018 faddp ST(1), ST DEC1 ^A $1022601A fxch ST(1) D9C9 YI $1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m. $10226026 fmulp ST(1), ST DEC9 ^I $10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}. $10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

...

For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables (arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code wit
such a limit.

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 14 13:29:01 2024

From Newsgroup: comp.lang.forth

On 14/04/2024 3:55 am, Anton Ertl wrote:

I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):
...
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

All the more reason to make fp loadable so users can choose the model
they want instead of built-in. IIRC VFX and SWF previously did this.

--- Synchronet 3.20a-Linux NewsLink 1.114

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Sun Apr 14 07:03:15 2024

From Newsgroup: comp.lang.forth

Anton Ertl wrote:
[..]

iForth:
$10226000 : foo 488BC04883ED088F4500 H.@H.m..E. $1022600A fld [r13 0 +] tbyte41DB6D00 A[m. $1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m. $10226012 fxch ST(2) D9CA YJ $10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m $10226018 faddp ST(1), ST DEC1 ^A $1022601A fxch ST(1) D9C9 YI $1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m. $10226026 fmulp ST(1), ST DEC9 ^I $10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}. $10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

Turbo Pascal had a fast FP mode that used the FPU stack. I found almost immediately that that is unusable for serious work.

The used scheme is rather complicated. iForth uses the internal stack
when it can prove that there will be no under- or overflow. Non-inlined
calls (F. below) always use the memory stack.

FORTH> pi fvalue val1 pi/2 fvalue val2 ok
FORTH> : test val1 fdup val2 foo val1 f+ val2 f* f. ; ok
FORTH> see test
Flags: ANSI
$015FDA80 : test
$015FDA8A fld $015FD650 tbyte-offset
$015FDA90 fld ST(0)
$015FDA92 fld $015FD670 tbyte-offset
$015FDA98 faddp ST(1), ST
$015FDA9A fmulp ST(1), ST
$015FDA9C fld $015FD650 tbyte-offset
$015FDAA2 faddp ST(1), ST
$015FDAA4 fld $015FD670 tbyte-offset
$015FDAAA fmulp ST(1), ST
$015FDAAC fpush,
$015FDAB6 jmp F.+10 ( $0124ED42 ) offset NEAR
$015FDABB ;

Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably. The software
analysis works fine, but can be fooled in case of rather
contrived circumstances. I have not encountered a bug in the
past two decades.

-marcel
--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 14 18:02:10 2024

From Newsgroup: comp.lang.forth

On 14/04/2024 5:03 pm, mhx wrote:

Anton Ertl wrote:
[..]

iForth:
$10226000 : foo                        488BC04883ED088F4500      H.@H.m..E.
$1022600A fld           [r13 0 +] tbyte41DB6D00                  A[m.
$1022600E fld           [r13 #16 +] tbyte
                                        41DB6D10                  A[m.
$10226012 fxch          ST(2)          D9CA                      YJ
$10226014 lea           r13, [r13 #32 +] qword
                                        4D8D6D20                  M.m $10226018 faddp         ST(1), ST      DEC1                      ^A
$1022601A fxch          ST(1)          D9C9                      YI
$1022601C fpopswap,                    41DB6D00D9CA4D8D6D10      A[m.YJM.m.
$10226026 fmulp         ST(1), ST      DEC9                      ^I
$10226028 fpush,                       4D8D6DF0D9C941DB7D00      M.mpYIA[}.
$10226032 ;                            488B45004883C508FFE0      H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

Turbo Pascal had a fast FP mode that used the FPU stack. I found almost immediately that that is unusable for serious work.

Were that the case Intel had plenty opportunity to change it. They had
an academic advising them.

--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 14 08:34:35 2024

From Newsgroup: comp.lang.forth

mhx@iae.nl (mhx) writes:

Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably.

From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. Hoever, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.

And apparently this feature was not considered to be important enough
to add a new architectural feature that allows implementing the FP
stack extension to memory. I guess that the implementations of
Fortran and Algol-family languages (e.g., C) in the 1980s only used
the stack within an expression, so avoiding FP stack overflows with
compiler analysis (like you do), is relatively easy.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 14 20:05:33 2024

From Newsgroup: comp.lang.forth

On 14/04/2024 6:34 pm, Anton Ertl wrote:

mhx@iae.nl (mhx) writes:

Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably.

From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. Hoever, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.

Do you have a reference for that? Below is a paper written by one of the designers and it doesn't appear to be mentioned. It's of course possible
to maintain a stack in software and use the FPU to do the calculations.
There are instructions to load/store Temp Real format to memory and that
gets a mention.

https://dl.acm.org/doi/pdf/10.1145/800053.801923

--- Synchronet 3.20a-Linux NewsLink 1.114

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 14 13:12:19 2024

From Newsgroup: comp.lang.forth

In article <2024Apr13.195518@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):

1 FLOATS .

reports:

16 iforth
10 sf64
10 vfx64

For

: foo f+ f* ;

the resulting code is:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )

I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?

- anton

--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
--- Synchronet 3.20a-Linux NewsLink 1.114

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 14 13:21:14 2024

From Newsgroup: comp.lang.forth

In article <27089a13c7ce61da7ffb927cb6c365d2@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:

Anton Ertl wrote:
[..]

iForth:
$10226000 : foo 488BC04883ED088F4500 H.@H.m..E. >> $1022600A fld [r13 0 +] tbyte41DB6D00 A[m.
$1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m.
$10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m
$10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m. >> $10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}. >> $10226032 ; 488B45004883C508FFE0

H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

Turbo Pascal had a fast FP mode that used the FPU stack. I found almost >immediately that that is unusable for serious work.

The used scheme is rather complicated. iForth uses the internal stack
when it can prove that there will be no under- or overflow. Non-inlined
calls (F. below) always use the memory stack.

FORTH> pi fvalue val1 pi/2 fvalue val2 ok
FORTH> : test val1 fdup val2 foo val1 f+ val2 f* f. ; ok
FORTH> see test
Flags: ANSI
$015FDA80 : test
$015FDA8A fld $015FD650 tbyte-offset
$015FDA90 fld ST(0)
$015FDA92 fld $015FD670 tbyte-offset
$015FDA98 faddp ST(1), ST
$015FDA9A fmulp ST(1), ST
$015FDA9C fld $015FD650 tbyte-offset
$015FDAA2 faddp ST(1), ST
$015FDAA4 fld $015FD670 tbyte-offset
$015FDAAA fmulp ST(1), ST
$015FDAAC fpush,
$015FDAB6 jmp F.+10 ( $0124ED42 ) offset NEAR
$015FDABB ;

Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably. The software
analysis works fine, but can be fooled in case of rather
contrived circumstances. I have not encountered a bug in the
past two decades.

This is a practical way.

I researched whether it is possible to detect whether the
circular stack overflows. There are instructions to
detect whether a position in this stack is occupied.
For a word that using a stack 4 deep, you could detect whether
it is necessary to save words this way, I thought.
I couldn't make it work, because essential assembler instruction
are missing. (Or I'm not clever enough.)

-marcel

--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 14 11:25:07 2024

From Newsgroup: comp.lang.forth

dxf <dxforth@gmail.com> writes:

On 14/04/2024 6:34 pm, Anton Ertl wrote:

From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. Hoever, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.

Do you have a reference for that?

Kahan writes about the original intention in

http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

especially starting at the last paragraph of page 2.

And about the bug (or rather design mistake):

https://history.siam.org/pdfs2/Kahan_final.pdf

Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Sun Apr 14 11:59:51 2024

From Newsgroup: comp.lang.forth

albert@spenarnc.xs4all.nl wrote:

In article <27089a13c7ce61da7ffb927cb6c365d2@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:

Anton Ertl wrote:
[..]

This is a practical way.

I researched whether it is possible to detect whether the
circular stack overflows. There are instructions to
detect whether a position in this stack is occupied.
For a word that using a stack 4 deep, you could detect whether
it is necessary to save words this way, I thought.
I couldn't make it work, because essential assembler instruction
are missing. (Or I'm not clever enough.)

I vaguely remember something like that for the FPU stack (combined
with interrupts?). It falls under the category "I couldn't make
it work."

-marcel
--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Sun Apr 14 07:50:42 2024

From Newsgroup: comp.lang.forth

On 4/14/24 03:02, dxf wrote:

On 14/04/2024 5:03 pm, mhx wrote:

Anton Ertl wrote:
[..]

iForth:
$10226000 : foo                        488BC04883ED088F4500      H.@H.m..E.
$1022600A fld           [r13 0 +] tbyte41DB6D00                  A[m.
$1022600E fld           [r13 #16 +] tbyte
                                        41DB6D10                  A[m.
$10226012 fxch          ST(2)          D9CA                      YJ
$10226014 lea           r13, [r13 #32 +] qword
                                        4D8D6D20                  M.m $10226018 faddp         ST(1), ST      DEC1                      ^A
$1022601A fxch          ST(1)          D9C9                      YI
$1022601C fpopswap,                    41DB6D00D9CA4D8D6D10      A[m.YJM.m.
$10226026 fmulp         ST(1), ST      DEC9                      ^I
$10226028 fpush,                       4D8D6DF0D9C941DB7D00      M.mpYIA[}.
$10226032 ;                            488B45004883C508FFE0      H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
immediately that that is unusable for serious work.

Were that the case Intel had plenty opportunity to change it. They had
an academic advising them.

Let's take a non-trivial example to illustrate why the 8-deep fp stack
may not be that useful for numerical computation. This example is
actually from the FSL demo. The word computes the Lorenz equations,
which give rise to the famous butterfly attractor. This is a system of
three nonlinear first order differential equations in three variables,
x, y, z, which are time dependent. The Lorenz equations define the instantaneous derivatives of these variables:

dx/dt = sigma*(y - x)
dy/dt = x*(rho -z) - y
dz/dt = x*y - beta*z

where sigma, rho, and beta are constant parameters.

Let's say we want to write a word DERIVS which computes and stores the derivatives, given the instantaneous values of x, y, z. This is the
basis for any numerical code which solves the trajectory in time,
starting from an initial condition.

DERIVS ( F: x y z -- )

Hence, we want to place some values x, y, and z onto the fp stack and
compute the three derivatives. Ideally these three values remain on the
fp stack and don't need to be fetched from memory constantly until the
three derivatives are computed, especially if one is using the hardware
fp stack. We allow the constant parameters to be fetched from memory and
the results of the derivative computation to be stored to memory so they
don't overflow the stack. This should be doable with the 8-element
hardware fp stack.

Below I give Forth code which computes the derivatives. This code is
usable only on systems with a separate FP stack. It will be interesting
to see the compiled code given by Forth systems using the hardware fpu
stack to compute the results. While this example may behave properly, if
we go to a fourth order system or higher, it gets less likely that the hardware stack remains usable.

--
Krishna

== begin fpstack-test.4th ==
\ fpstack-test.4th
\
\ Compute the Lorenz equations, a set of three coupled
\ nonlinear differential equations.
\
\ dx/dt = sigma*(y - x)
\ dy/dt = x*(rho -z) - y
\ dz/dt = x*y - beta*z
\
\ sigma, rho, and beta are constant parameters.
\
\ The following code requires a separate fp stack

include ans-words \ only for kForth64
include fsl/fsl-util

[UNDEFINED] FPICK [IF]
cr .( Your system may not use a separate floating point stack!)
ABORT
[THEN]

[UNDEFINED] F2OVER [IF]
: f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
3 fpick 3 fpick ;
[THEN]

[UNDEFINED] F+! [IF]
: f+! ( a -- ) ( F: r -- ) dup f@ f+ f! ;
[THEN]

16.0e0 fconstant sigma
45.92e0 fconstant rho
4.0e0 fconstant beta

\ Compute the derivatives given the instantaneous values
\ x, y, z for a given time t.

\ xdot{ is an array consisting of dx/dt, dy/dt, dz/dt
3 float array xdot{

: derivs ( F: x y z -- )
fdup f2over \ F: x y z z x y
f- sigma f* fnegate
xdot{ 0 } f! \ F: x y z z
rho fover f- \ F: x y z z rho-z
4 fpick f* \ F: x y z z x*(rho - z)
3 fpick f-
xdot{ 1 } f! \ F: x y z z
fdrop
beta f* fnegate
xdot{ 2 } f!
f* xdot{ 2 } f+!
;

0 [IF]
include ttester
\ Test DERIVS
1e-15 set-near
t{ 0.1e 0.6e 4.0e derivs -> }t
t{ xdot{ 0 } f@ -> 8.0e0 }t
t{ xdot{ 1 } f@ -> 3.592e0 }t
t{ xdot{ 2 } f@ -> -15.94e0 }t
[THEN]

== end fpstack-test.4th ==

--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 14 23:35:24 2024

From Newsgroup: comp.lang.forth

On 14/04/2024 9:25 pm, Anton Ertl wrote:

dxf <dxforth@gmail.com> writes:

On 14/04/2024 6:34 pm, Anton Ertl wrote:

From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. Hoever, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.

Do you have a reference for that?

Kahan writes about the original intention in

http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

especially starting at the last paragraph of page 2.

And about the bug (or rather design mistake):

https://history.siam.org/pdfs2/Kahan_final.pdf

Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.

The latter sounds like someone not getting his way more than a design mistake. In the first reference Kahan states:

"When the 8087 was designed, I knew that stack over/underflow was an issue of
more aesthetic than practical importance. I still regret that the 8087's stack
implementation was not quite so neat as my original intention described in the
accompanying note."

Intel decided Kahan's aesthetic afterthought could be dispensed with. History appears to have proven them correct. Were 8 levels of stack actually insufficient,
it would have made more sense for Intel to double it (if not for the 8087 then the
next incarnation) than to spill to memory which was bad in every way.

--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 15 00:27:41 2024

From Newsgroup: comp.lang.forth

On 14/04/2024 10:50 pm, Krishna Myneni wrote:

...
Below I give Forth code which computes the derivatives. This code is usable only on systems with a separate FP stack. It will be interesting to see the compiled code given by Forth systems using the hardware fpu stack to compute the results. While this example may behave properly, if we go to a fourth order system or higher, it gets less likely that the hardware stack remains usable.

Systems that default to hardware fpu stack may well offer a software stack option.

--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 14 15:19:41 2024

From Newsgroup: comp.lang.forth

Krishna Myneni <krishna.myneni@ccreweb.org> writes:

dx/dt = sigma*(y - x)
dy/dt = x*(rho -z) - y
dz/dt = x*y - beta*z

where sigma, rho, and beta are constant parameters.

Let's say we want to write a word DERIVS which computes and stores the >derivatives, given the instantaneous values of x, y, z. This is the
basis for any numerical code which solves the trajectory in time,
starting from an initial condition.

DERIVS ( F: x y z -- )

Hence, we want to place some values x, y, and z onto the fp stack and >compute the three derivatives. Ideally these three values remain on the
fp stack and don't need to be fetched from memory constantly until the
three derivatives are computed, especially if one is using the hardware
fp stack. We allow the constant parameters to be fetched from memory and
the results of the derivative computation to be stored to memory so they >don't overflow the stack. This should be doable with the 8-element
hardware fp stack.

I have adapted your Forth code:

[UNDEFINED] F2OVER [IF]
: f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
3 fpick 3 fpick ;
[THEN]

16.0e0 fconstant sigma
45.92e0 fconstant rho
4.0e0 fconstant beta

fvariable dx/dt
fvariable dy/dt
fvariable dz/dt

: derivs ( F: x y z -- )
fdup f2over \ F: x y z z x y
f- sigma f* fnegate
dx/dt f! \ F: x y z z
rho fover f- \ F: x y z z rho-z
4 fpick f* \ F: x y z z x*(rho - z)
3 fpick f-
dy/dt f! \ F: x y z z
fdrop
beta f* fnegate
frot frot f* f+ dz/dt f!
;

0.1e 0.6e 4.0e derivs
dx/dt f@ f. cr \ 8.
dy/dt f@ f. cr \ 3.592
dz/dt f@ f. cr \ -15.94

In particular, I eliminated the additional memory accesses to DZ/DT.

SwiftForth, VFX and iforth produce the expected results for your test
case. The code is:

SwiftForth 4.0.0-RC87 VFX Forth 64 5.43 iforth-5.1-mini
ST(0) FLD FLD ST fld ST(0)
44E8BC ( f2over ) CALL CALL 0050A080 F2OVER fld [r13 0 +] tbyte
ST(0) ST(1) FSUBP FSUBP ST(1), ST fxch ST(1)
44E8FB ( sigma ) CALL CALL 0050A2BB SIGMA fld [r13 #16 +] tby
ST(0) ST(1) FMULP FMULP ST(1), ST lea r13, [r13 #32 +]
FCHS FCHS fxch ST(3)
-8 [RBP] RBP LEA FSTP TBYTE FFF9CFE8 [RIP] fxch ST(1)
RBX 0 [RBP] MOV CALL 0050A2FB RHO fld ST(3)
4C508 [RDI] RBX LEA FLD ST(1) fld ST(3)
0 [RBX] TBYTE FSTP FSUBP ST(1), ST fsubp ST(1), ST
0 [RBP] RBX MOV LEA RBP, [RBP+-08] fld $101BC720 tbyte
8 [RBP] RBP LEA MOV [RBP], RBX fmulp ST(1), ST
44E923 ( rho ) CALL MOV EBX, # 00000004 fchs
ST(1) FLD CALL 005030C0 FPICK fstp $10226470 tbyte
ST(0) ST(1) FSUBP FMULP ST(1), ST fld $101BC710 tbyte
-8 [RBP] RBP LEA LEA RBP, [RBP+-08] fld ST(1)
RBX 0 [RBP] MOV MOV [RBP], RBX fsubp ST(1), ST
4 # EBX MOV MOV EBX, # 00000003 fld ST(4)
43C901 ( FPICK ) CALL CALL 005030C0 FPICK fmulp ST(1), ST
ST(0) ST(1) FMULP FSUBP ST(1), ST fld ST(3)
-8 [RBP] RBP LEA FSTP TBYTE FFF9CFC1 [RIP] fsubp ST(1), ST
RBX 0 [RBP] MOV FSTP ST fstp $10226490 tbyte
3 # EBX MOV CALL 0050A33B BETA ffreep ST(0)
43C901 ( FPICK ) CALL FMULP ST(1), ST fld $101BC700 tbyte
ST(0) ST(1) FSUBP FCHS fmulp ST(1), ST
-8 [RBP] RBP LEA FXCH ST(1) fchs
RBX 0 [RBP] MOV FXCH ST(2) fxch ST(1)
4C530 [RDI] RBX LEA FXCH ST(1) fxch ST(2)
0 [RBX] TBYTE FSTP FXCH ST(2) fxch ST(1)
0 [RBP] RBX MOV FMULP ST(1), ST fxch ST(2)
8 [RBP] RBP LEA FADDP ST(1), ST fmulp ST(1), ST
ST(0) FSTP FSTP TBYTE FFF9CFB4 [RIP] fxch ST(1)
44E94B ( beta ) CALL RET/NEXT fpopswap,
ST(0) ST(1) FMULP faddp ST(1), ST
FCHS fstp $102264B0 tbyte
43C807 ( FROT ) CALL ;
43C807 ( FROT ) CALL
ST(0) ST(1) FMULP
ST(0) ST(1) FADDP
-8 [RBP] RBP LEA
RBX 0 [RBP] MOV
4C558 [RDI] RBX LEA
0 [RBX] TBYTE FSTP
0 [RBP] RBX MOV
8 [RBP] RBP LEA
RET

FPICK is apparently implemented on SwiftForth and VFX through an
indirect branch that branches to one of 8 variants of "FLD ST(...)",
while iForth manages to resolve this during compilation.

I have also looked at VFX 5.11 which uses XMM registers instead of the
FP stack, but it does not inline FP operations, so you mostly see a long sequence of calls.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 14 15:53:40 2024

From Newsgroup: comp.lang.forth

dxf <dxforth@gmail.com> writes:

On 14/04/2024 9:25 pm, Anton Ertl wrote:

Kahan writes about the original intention in

http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

especially starting at the last paragraph of page 2.

And about the bug (or rather design mistake):

https://history.siam.org/pdfs2/Kahan_final.pdf

Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.

The latter sounds like someone not getting his way more than a design mistake. >In the first reference Kahan states:

"When the 8087 was designed, I knew that stack over/underflow was an issue of more aesthetic than practical importance. I still regret that the 8087's stack
implementation was not quite so neat as my original intention described in the
accompanying note."

Intel decided Kahan's aesthetic afterthought could be dispensed with.

In a way, they did, and Kahan obviously did not get his way. But to
me it sounds like they tried and failed at implementing a stack that
extends into memory. The tags that indicate the presence of a stack
item are there. If they had made a conscious decision at the start to
dispense with the idea of an extensible stack, they would have
discpensed with these bits indicating the presence of a stack item as
well. So what happened is that they botched the first attempt, and
then decided that they did not want to do what would have been
necessary to fix it.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From Stephen Pelc@stephen@vfxforth.com to comp.lang.forth on Sun Apr 14 20:02:01 2024

From Newsgroup: comp.lang.forth

On 14 Apr 2024 at 01:47:20 CEST, "Krishna Myneni" <krishna.myneni@ccreweb.org> wrote:

On 4/13/24 12:55, Anton Ertl wrote:

I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):

1 FLOATS .

reports:

16 iforth
10 sf64
10 vfx64

For

: foo f+ f* ;

the resulting code is:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )

iForth:
$10226000 : foo 488BC04883ED088F4500 H.@H.m..E. >> $1022600A fld [r13 0 +] tbyte41DB6D00 A[m.
$1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m.
$10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m
$10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m. >> $10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}. >> $10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

...

For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables (arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code wit
such a limit.

The manual (gasp) documents how to change the default FP package.

Changing the default pack also changes the system call interfaces to
match.

Stephen
--
Stephen Pelc, stephen@vfxforth.com
MicroProcessor Engineering, Ltd. - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)78 0390 3612, +34 649 662 974
http://www.mpeforth.com
MPE website
http://www.vfxforth.com/downloads/VfxCommunity/
downloads
--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Sun Apr 14 18:32:11 2024

From Newsgroup: comp.lang.forth

On 4/14/24 10:19, Anton Ertl wrote:

Krishna Myneni <krishna.myneni@ccreweb.org> writes:

dx/dt = sigma*(y - x)
dy/dt = x*(rho -z) - y
dz/dt = x*y - beta*z

where sigma, rho, and beta are constant parameters.

Let's say we want to write a word DERIVS which computes and stores the
derivatives, given the instantaneous values of x, y, z. This is the
basis for any numerical code which solves the trajectory in time,
starting from an initial condition.

DERIVS ( F: x y z -- )

Hence, we want to place some values x, y, and z onto the fp stack and
compute the three derivatives. Ideally these three values remain on the
fp stack and don't need to be fetched from memory constantly until the
three derivatives are computed, especially if one is using the hardware
fp stack. We allow the constant parameters to be fetched from memory and
the results of the derivative computation to be stored to memory so they
don't overflow the stack. This should be doable with the 8-element
hardware fp stack.

I have adapted your Forth code:

[UNDEFINED] F2OVER [IF]
: f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
3 fpick 3 fpick ;
[THEN]

16.0e0 fconstant sigma
45.92e0 fconstant rho
4.0e0 fconstant beta

fvariable dx/dt
fvariable dy/dt
fvariable dz/dt

: derivs ( F: x y z -- )
fdup f2over \ F: x y z z x y
f- sigma f* fnegate
dx/dt f! \ F: x y z z
rho fover f- \ F: x y z z rho-z
4 fpick f* \ F: x y z z x*(rho - z)
3 fpick f-
dy/dt f! \ F: x y z z
fdrop
beta f* fnegate
frot frot f* f+ dz/dt f!
;

0.1e 0.6e 4.0e derivs
dx/dt f@ f. cr \ 8.
dy/dt f@ f. cr \ 3.592
dz/dt f@ f. cr \ -15.94

In particular, I eliminated the additional memory accesses to DZ/DT.

Nice. FROT FROT is expensive on a memory based FP stack, unless it is optimized by the compiler, but for fpu stack use it's probably very
fast. I see that VFX Forth and iforth use a series of FXCH instructions
to implement FROT FROT.

SwiftForth, VFX and iforth produce the expected results for your test
case. The code is:

SwiftForth 4.0.0-RC87 VFX Forth 64 5.43 iforth-5.1-mini
ST(0) FLD FLD ST fld ST(0)
44E8BC ( f2over ) CALL CALL 0050A080 F2OVER fld [r13 0 +] tbyte
ST(0) ST(1) FSUBP FSUBP ST(1), ST fxch ST(1)
44E8FB ( sigma ) CALL CALL 0050A2BB SIGMA fld [r13 #16 +] tby
ST(0) ST(1) FMULP FMULP ST(1), ST lea r13, [r13 #32 +]
FCHS FCHS fxch ST(3)
-8 [RBP] RBP LEA FSTP TBYTE FFF9CFE8 [RIP] fxch ST(1)
RBX 0 [RBP] MOV CALL 0050A2FB RHO fld ST(3)
4C508 [RDI] RBX LEA FLD ST(1) fld ST(3)
0 [RBX] TBYTE FSTP FSUBP ST(1), ST fsubp ST(1), ST
0 [RBP] RBX MOV LEA RBP, [RBP+-08] fld $101BC720 tbyte
8 [RBP] RBP LEA MOV [RBP], RBX fmulp ST(1), ST
44E923 ( rho ) CALL MOV EBX, # 00000004 fchs
ST(1) FLD CALL 005030C0 FPICK fstp $10226470 tbyte
ST(0) ST(1) FSUBP FMULP ST(1), ST fld $101BC710 tbyte
-8 [RBP] RBP LEA LEA RBP, [RBP+-08] fld ST(1)
RBX 0 [RBP] MOV MOV [RBP], RBX fsubp ST(1), ST
4 # EBX MOV MOV EBX, # 00000003 fld ST(4)
43C901 ( FPICK ) CALL CALL 005030C0 FPICK fmulp ST(1), ST
ST(0) ST(1) FMULP FSUBP ST(1), ST fld ST(3)
-8 [RBP] RBP LEA FSTP TBYTE FFF9CFC1 [RIP] fsubp ST(1), ST
RBX 0 [RBP] MOV FSTP ST fstp $10226490 tbyte
3 # EBX MOV CALL 0050A33B BETA ffreep ST(0)
43C901 ( FPICK ) CALL FMULP ST(1), ST fld $101BC700 tbyte
ST(0) ST(1) FSUBP FCHS fmulp ST(1), ST
-8 [RBP] RBP LEA FXCH ST(1) fchs
RBX 0 [RBP] MOV FXCH ST(2) fxch ST(1)
4C530 [RDI] RBX LEA FXCH ST(1) fxch ST(2)
0 [RBX] TBYTE FSTP FXCH ST(2) fxch ST(1)
0 [RBP] RBX MOV FMULP ST(1), ST fxch ST(2)
8 [RBP] RBP LEA FADDP ST(1), ST fmulp ST(1), ST
ST(0) FSTP FSTP TBYTE FFF9CFB4 [RIP] fxch ST(1)
44E94B ( beta ) CALL RET/NEXT fpopswap,
ST(0) ST(1) FMULP faddp ST(1), ST
FCHS fstp $102264B0 tbyte 43C807 ( FROT ) CALL ;
43C807 ( FROT ) CALL
ST(0) ST(1) FMULP
ST(0) ST(1) FADDP
-8 [RBP] RBP LEA
RBX 0 [RBP] MOV
4C558 [RDI] RBX LEA
0 [RBX] TBYTE FSTP
0 [RBP] RBX MOV
8 [RBP] RBP LEA
RET

FPICK is apparently implemented on SwiftForth and VFX through an
indirect branch that branches to one of 8 variants of "FLD ST(...)",
while iForth manages to resolve this during compilation.

Good to see that x, y, z are not repeatedly fetched from memory.

For this example, the hardware fpu stack is sufficient. But, it's easy
to see that the benefits of a hardware-only stack would diminish quickly
as the size of the problem increased a small amount, and then the
programmer (or compiler) would have to keep careful track of how many
fpu registers are used.

I have also looked at VFX 5.11 which uses XMM registers instead of the
FP stack, but it does not inline FP operations, so you mostly see a long sequence of calls.

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Sun Apr 14 18:34:45 2024

From Newsgroup: comp.lang.forth

On 4/14/24 15:02, Stephen Pelc wrote:

On 14 Apr 2024 at 01:47:20 CEST, "Krishna Myneni" <krishna.myneni@ccreweb.org>
wrote:

On 4/13/24 12:55, Anton Ertl wrote:

I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):

...

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

...

For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables
(arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code wit
such a limit.

The manual (gasp) documents how to change the default FP package.

Good to know.

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 15 11:37:39 2024

From Newsgroup: comp.lang.forth

On 15/04/2024 6:02 am, Stephen Pelc wrote:

On 14 Apr 2024 at 01:47:20 CEST, "Krishna Myneni" <krishna.myneni@ccreweb.org>
wrote:

On 4/13/24 12:55, Anton Ertl wrote:

I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):

1 FLOATS .

reports:

16 iforth
10 sf64
10 vfx64

For

: foo f+ f* ;

the resulting code is:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )

iForth:
$10226000 : foo 488BC04883ED088F4500 H.@H.m..E.
$1022600A fld [r13 0 +] tbyte41DB6D00 A[m.
$1022600E fld [r13 #16 +] tbyte
41DB6D10 A[m.
$10226012 fxch ST(2) D9CA YJ
$10226014 lea r13, [r13 #32 +] qword
4D8D6D20 M.m
$10226018 faddp ST(1), ST DEC1 ^A
$1022601A fxch ST(1) D9C9 YI
$1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m.
$10226026 fmulp ST(1), ST DEC9 ^I
$10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}.
$10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok

So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.

...

For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables
(arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code wit
such a limit.

The manual (gasp) documents how to change the default FP package.

Changing the default pack also changes the system call interfaces to
match.

Specifically chapter 14 in the PDF doc.

integers
remove-FP-pack
include Lib/x64/Hfpx64

swaps in the 80-bit external stack model.

The HTML doc appears to lack this information (or hard to find) should a user select that by mistake.

--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 15 12:07:47 2024

From Newsgroup: comp.lang.forth

On 15/04/2024 1:53 am, Anton Ertl wrote:

dxf <dxforth@gmail.com> writes:

On 14/04/2024 9:25 pm, Anton Ertl wrote:

Kahan writes about the original intention in

http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

especially starting at the last paragraph of page 2.

And about the bug (or rather design mistake):

https://history.siam.org/pdfs2/Kahan_final.pdf

Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.

The latter sounds like someone not getting his way more than a design mistake.
In the first reference Kahan states:

"When the 8087 was designed, I knew that stack over/underflow was an issue of
more aesthetic than practical importance. I still regret that the 8087's stack
implementation was not quite so neat as my original intention described in the
accompanying note."

Intel decided Kahan's aesthetic afterthought could be dispensed with.

In a way, they did, and Kahan obviously did not get his way. But to
me it sounds like they tried and failed at implementing a stack that
extends into memory. The tags that indicate the presence of a stack
item are there. If they had made a conscious decision at the start to dispense with the idea of an extensible stack, they would have
discpensed with these bits indicating the presence of a stack item as
well. So what happened is that they botched the first attempt, and
then decided that they did not want to do what would have been
necessary to fix it.

My impression is Palmer (the mathematician Intel hired to co-head the
project) was trying to placate Kahan and it fell through for various
reasons.

The design criteria that never changed was the 8-level hardware stack.
Forthers can either accept it for best performance - or pick something
more forgiving at a lesser performance.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Mon Apr 15 03:47:33 2024

From Newsgroup: comp.lang.forth

On 4/14/24 21:07, dxf wrote:
...

The design criteria that never changed was the 8-level hardware stack. Forthers can either accept it for best performance - or pick something
more forgiving at a lesser performance.

In the Lorenz equation example, which works with the 8 deep fpu stack,
we have assumed that the fpu hardware stack was empty before calling
DERIVS. In a real use case, the call to DERIVS is likely to occur within
a deeper call chain, resulting in items already on the fpu stack before
args for DERIVS are pushed. As Marcel said, using only a hardware-based
fp stack is not realistic for any non-trivial floating point work.

The loss of performance with a memory-based fp stack is far less a
concern than having to consider the limited stack depth when writing
code involving floating point arithmetic. Failure from overflowing the
fpu stack is silent. Debugging is likely to be a nightmare.

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Mon Apr 15 09:35:22 2024

From Newsgroup: comp.lang.forth

In most cases 'bigger' fp data will be stored in memory anyhow,
which can be cached before disk access. The old 8087 improvements
were caused by its new fp operators, the stack was unusable.

And if CPU based stacks were so lucrative for high performance,
CPU makers would have implemented them since long for normal
integer data.
--- Synchronet 3.20a-Linux NewsLink 1.114

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Mon Apr 15 12:37:11 2024

From Newsgroup: comp.lang.forth

In article <3e419396b1ee93c7a391a7ffc0e44ed8@www.novabbs.com>,
minforth <minforth@gmx.net> wrote:

In most cases 'bigger' fp data will be stored in memory anyhow,
which can be cached before disk access. The old 8087 improvements
were caused by its new fp operators, the stack was unusable.

And if CPU based stacks were so lucrative for high performance,
CPU makers would have implemented them since long for normal
integer data.

The iA64 comes to mind. Apparently a failure but was a
technical or commercial failure?

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Mon Apr 15 11:37:00 2024

From Newsgroup: comp.lang.forth

albert@spenarnc.xs4all.nl wrote:

In article <3e419396b1ee93c7a391a7ffc0e44ed8@www.novabbs.com>,
minforth <minforth@gmx.net> wrote:

In most cases 'bigger' fp data will be stored in memory anyhow,
which can be cached before disk access. The old 8087 improvements
were caused by its new fp operators, the stack was unusable.

And if CPU based stacks were so lucrative for high performance,
CPU makers would have implemented them since long for normal
integer data.

The iA64 comes to mind. Apparently a failure but was a
technical or commercial failure?

Both i.e. poor developer tool stack and strong AMD competition.
And it was overly complex. https://softwareengineering.stackexchange.com/questions/279334/why-was-the-itanium-processor-difficult-to-write-a-compiler-for
--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 15 22:14:16 2024

From Newsgroup: comp.lang.forth

On 15/04/2024 6:47 pm, Krishna Myneni wrote:

On 4/14/24 21:07, dxf wrote:
...

The design criteria that never changed was the 8-level hardware stack.
Forthers can either accept it for best performance - or pick something
more forgiving at a lesser performance.

In the Lorenz equation example, which works with the 8 deep fpu stack, we have assumed that the fpu hardware stack was empty before calling DERIVS. In a real use case, the call to DERIVS is likely to occur within a deeper call chain, resulting in items already on the fpu stack before args for DERIVS are pushed. As Marcel said, using only a hardware-based fp stack is not realistic for any non-trivial floating point work.

The loss of performance with a memory-based fp stack is far less a concern than having to consider the limited stack depth when writing code involving floating point arithmetic. Failure from overflowing the fpu stack is silent. Debugging is likely to be a nightmare.

Likely the designers never considered Forth. OTOH ANS-Forth did and said
6 items ought to be good enough for anyone :)

--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Apr 15 14:09:28 2024

From Newsgroup: comp.lang.forth

Krishna Myneni <krishna.myneni@ccreweb.org> writes:

Failure from overflowing the
fpu stack is silent.

Reality check:

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041

SwiftForth also seems to notice it in some way, but does not report it
as an error:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok f.
ok

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Apr 15 14:13:27 2024

From Newsgroup: comp.lang.forth

albert@spenarnc.xs4all.nl writes:

In article <3e419396b1ee93c7a391a7ffc0e44ed8@www.novabbs.com>,
minforth <minforth@gmx.net> wrote:

And if CPU based stacks were so lucrative for high performance,
CPU makers would have implemented them since long for normal
integer data.

They have: SPARC has a stack of integer register windows, AMD29K and
IA-64 have a register stack of 128 integer registers. However, the
29K died in the early 1990s, IA-64 always remained niche and has been
killed (last order date in January 2020), and SPARC delivered OoO too
late to save it, and the last new SPARC designs were introduced in
2017.

The iA64 comes to mind. Apparently a failure but was a
technical or commercial failure?

It was a great commercial success. Several companies killed their
RISCs in favour of IA-64 based just on the IA-64 roadmaps, and when
IA-64 failed to achieve the predicted technical and market
superiority, they switched to Intel's AMD64 CPUs: HP, DEC (bought by
Compaq bought by HP), SGI. Apple switched to Intel's AMD64 CPUs
without the IA-64 step, so maybe it would have happened for the others
as well without IA-64, but maybe one of the others would have done
what ARM and Apple did 15-20 years later.

Technically, IA-64 was a bet on in-order execution with compiler
scheduling being superior to hardware scheduling (out-of-order
execution). That was a bet on the wrong horse, as the superior
hardware branch prediction allows OoO hardware to outperform IA-64 on
branchy code, while SIMD and GPGPUs eat IA-64's lunch on data-parallel
code.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Mon Apr 15 17:02:40 2024

From Newsgroup: comp.lang.forth

On 4/15/24 09:09, Anton Ertl wrote:

Krishna Myneni <krishna.myneni@ccreweb.org> writes:

Failure from overflowing the
fpu stack is silent.

Reality check:

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041

SwiftForth also seems to notice it in some way, but does not report it
as an error:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok

I tried overflowing the fpu stack in kforth32, and no exception is
raised. Perhaps one needs to configure the fpu to raise an exception.
Also tried it in C with an assembly procedure. The executable throws no exception.

--
Krishna

== begin fpu-stack-overflow.4th ==
fpu-stack-overflow.4th
\ for use with kforth32

include ans-words
include strings
include modules
include syscalls
include mc
include asm-x86

code fpu-stack-overflow
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
end-code

fpu-stack-overflow

== end fpu-stack-overflow.4th ==

== begin example ==
$ kforth32
kForth-32 v 2.4.5 (Build: 2024-03-30)
Copyright (c) 1998--2023 Krishna Myneni
Contributions by: dpw gd mu bk abs tn cmb bg dnw
Provided under the GNU Affero General Public License, v3.0 or later

include fpu-stack-overflow
ok
== end example ==

--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Tue Apr 16 01:10:42 2024

From Newsgroup: comp.lang.forth

I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Mon Apr 15 20:41:43 2024

From Newsgroup: comp.lang.forth

On 4/15/24 20:10, minforth wrote:

I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.

The FPU has maskable interrupts for arithmetic -- see

https://github.com/mynenik/kForth-32/blob/master/forth-src/fpu-x86.4th

But, yes, I'm not aware of any interrupts from stack ops.

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Apr 16 05:53:15 2024

From Newsgroup: comp.lang.forth

minforth@gmx.net (minforth) writes:

X87 mode has long been deprecated

Citation needed.

and replaced by SSE2.

I just tried compiling the following program with gcc with different
options:

float sfplus(float a, float b)
{
return a+b;
}

double dfplus(double a, double b)
{
return a+b;
}

long double lfplus(long double a, long double b)
{
return a+b;
}

what I got was:

sfplus() dfplus() lfplus()
387 387 387 gcc -m32 -O
SSE2 SSE2 387 gcc -m64 -O
387 387 387 gcc -m64 -mfpmath=387 -O

The System V calling convention for AMD64 passes and returns float and
double in xmm registers, so the last option leads to moving the values
between the xmm register and the 387 FP stack through memory, e.g.:

000000000000001f <dfplus>:
1f: f2 0f 11 44 24 f0 movsd %xmm0,-0x10(%rsp)
25: f2 0f 11 4c 24 f8 movsd %xmm1,-0x8(%rsp)
2b: dd 44 24 f0 fldl -0x10(%rsp)
2f: dc 44 24 f8 faddl -0x8(%rsp)
33: dd 5c 24 f0 fstpl -0x10(%rsp)
37: f2 0f 10 44 24 f0 movsd -0x10(%rsp),%xmm0
3d: c3 retq

By contrast, the middle option produces:

0000000000000005 <dfplus>:
5: f2 0f 58 c1 addsd %xmm1,%xmm0
9: c3 retq

and the first option:

00000009 <dfplus>:
9: dd 44 24 04 fldl 0x4(%esp)
d: dc 44 24 0c faddl 0xc(%esp)
11: c3 ret

The difference between the first and last option is due to the
differences in calling convention.

gcc also has the option -mfpmath=sse,387 which tells the compiler that
it can use both. A small experiment only resulted in using SSE2, but
maybe if I had used code with more values alive at the same time it
would have used both.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Tue Apr 16 06:58:37 2024

From Newsgroup: comp.lang.forth

Sure, x87 mode is still there for backwards compatibility and you can
instruct compilers to use it eg for 80-bit floats. Godbolt is your friend.
I seem to remember the "deprecation" had to do with inefficient load/store mechanisms between memory and x87 registes.

I can't remember or name an exact citation but found a somewhat related discussion: https://retrocomputing.stackexchange.com/questions/9751/did-any-compiler-fully-use-intel-x87-80-bit-floating-point
--- Synchronet 3.20a-Linux NewsLink 1.114

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Tue Apr 16 07:21:16 2024

From Newsgroup: comp.lang.forth

Anton Ertl wrote:
[..]

gcc also has the option -mfpmath=sse,387 which tells the compiler that
it can use both.

Nice to know! Last time I checked, only the Intel compiler could do that.

-marcel
--- Synchronet 3.20a-Linux NewsLink 1.114

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Apr 16 09:59:31 2024

From Newsgroup: comp.lang.forth

In article <uvk861$gs90$1@dont-email.me>,
Krishna Myneni <krishna.myneni@ccreweb.org> wrote:

On 4/15/24 09:09, Anton Ertl wrote:

Krishna Myneni <krishna.myneni@ccreweb.org> writes:

Failure from overflowing the
fpu stack is silent.

Reality check:

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041

SwiftForth also seems to notice it in some way, but does not report it
as an error:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok

I tried overflowing the fpu stack in kforth32, and no exception is
raised. Perhaps one needs to configure the fpu to raise an exception.
Also tried it in C with an assembly procedure. The executable throws no >exception.

--
Krishna

== begin fpu-stack-overflow.4th ==
fpu-stack-overflow.4th
\ for use with kforth32

include ans-words
include strings
include modules
include syscalls
include mc
include asm-x86

code fpu-stack-overflow
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
end-code

fpu-stack-overflow

== end fpu-stack-overflow.4th ==

== begin example ==
$ kforth32
kForth-32 v 2.4.5 (Build: 2024-03-30)
Copyright (c) 1998--2023 Krishna Myneni
Contributions by: dpw gd mu bk abs tn cmb bg dnw
Provided under the GNU Affero General Public License, v3.0 or later

include fpu-stack-overflow
ok
== end example ==

I tried this on ciforth. It crashes with the 10th item
not the 8th.

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Tue Apr 16 07:14:46 2024

From Newsgroup: comp.lang.forth

On 4/16/24 02:59, albert@spenarnc.xs4all.nl wrote:

In article <uvk861$gs90$1@dont-email.me>,
Krishna Myneni <krishna.myneni@ccreweb.org> wrote:

On 4/15/24 09:09, Anton Ertl wrote:

Krishna Myneni <krishna.myneni@ccreweb.org> writes:

Failure from overflowing the
fpu stack is silent.

Reality check:

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023

1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041

SwiftForth also seems to notice it in some way, but does not report it
as an error:

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok

SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok

I tried overflowing the fpu stack in kforth32, and no exception is
raised. Perhaps one needs to configure the fpu to raise an exception.
Also tried it in C with an assembly procedure. The executable throws no
exception.

--
Krishna

== begin fpu-stack-overflow.4th ==
fpu-stack-overflow.4th
\ for use with kforth32

include ans-words
include strings
include modules
include syscalls
include mc
include asm-x86

code fpu-stack-overflow
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
end-code

fpu-stack-overflow

== end fpu-stack-overflow.4th ==

== begin example ==
$ kforth32
kForth-32 v 2.4.5 (Build: 2024-03-30)
Copyright (c) 1998--2023 Krishna Myneni
Contributions by: dpw gd mu bk abs tn cmb bg dnw
Provided under the GNU Affero General Public License, v3.0 or later

include fpu-stack-overflow
ok
== end example ==

I tried this on ciforth. It crashes with the 10th item
not the 8th.

That may be for some other reason. The following code executes an
arbitrary number of FLD1 instructions:

code fpu-stack-overflow ( n -- -n )
0 [ebx] ecx mov, \ set loop count
0 # eax mov,
DO,
fld1,
eax dec,
LOOP,
eax 0 [ebx] mov,
0 # eax mov,
end-code

16384 fpu-stack-overflow .
\ end of prog

\ run it

include clf-code/fpu-stack-overflow

-16384
ok

--
Krishna

Groetjes Albert

--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Wed Apr 17 06:24:15 2024

From Newsgroup: comp.lang.forth

On 4/15/24 20:10, minforth wrote:

I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.

Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating point arithmetic e.g. transcendental functions.

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 17 13:07:22 2024

From Newsgroup: comp.lang.forth

Krishna Myneni wrote:

On 4/15/24 20:10, minforth wrote:

I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.

Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating point arithmetic e.g. transcendental functions.

Hardware x87 support isn't necessarily faster: https://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf

But I think the main advantage lies in the possibility of parallel and/or vectorized execution.

And of course Forth is very close to assembler and as such it is natural
to use x87 instructions, unless a Forth system is implemented using libc
or using math libraries that better exploit those many features of modern CPUs. --- Synchronet 3.20a-Linux NewsLink 1.114

From peter.m.falth@peter.m.falth@gmail.com (PMF) to comp.lang.forth on Wed Apr 17 13:08:01 2024

From Newsgroup: comp.lang.forth

Krishna Myneni wrote:

On 4/15/24 20:10, minforth wrote:

I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.

Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating point arithmetic e.g. transcendental functions.

Yes that is right. There is a sqrt but nothing more.

In lxf64/ntf64 I use an external C library, specifically fdlibm53.

This claims to be within 1 ulp correct. My testing also suggests this.
fdlibm looks to be the base for most other math libraries.
I could have used libm from gcc but I wanted the same code for both
Linux and Windows.

In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
8 stack items is to low to be useful for anything more complicated.
complex numbers is an example that quickly eats all 8 stack items.

Best Regards
Peter Fälth

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 17 17:37:45 2024

From Newsgroup: comp.lang.forth

minforth wrote:

Krishna Myneni wrote:

[..]

Hardware x87 support isn't necessarily faster: https://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf

.. which shows it also can be a lot slower. And if I'm not mistaken, it can't deliver the 80 bits of the FPU without breaking down by a factor of at least 1.3.

But I think the main advantage lies in the possibility of parallel and/or vectorized execution.

I have not yet seen algorithms where that would bring something. It might when all other code is done properly with SSE. An on-the-fly reconfigured FPGA (available on some microcontrollers) might be a better idea :--)

And of course Forth is very close to assembler and as such it is natural
to use x87 instructions, unless a Forth system is implemented using libc
or using math libraries that better exploit those many features of modern CPUs.

It all depends, that is what I like about Forth.

-marcel
--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 17 19:16:47 2024

From Newsgroup: comp.lang.forth

mhx wrote:

minforth wrote:

And of course Forth is very close to assembler and as such it is natural
to use x87 instructions, unless a Forth system is implemented using libc
or using math libraries that better exploit those many features of modern CPUs.

It all depends, that is what I like about Forth.

True, it all depends. Copied from a Visual Studio documentation:

Many of the floating point math library functions have different implementations
for different CPU architectures. For example, the 32-bit x86 CRT may have a different implementation than the 64-bit x64 CRT. In addition, some of the functions may have multiple implementations for a given CPU architecture. The most efficient implementation is selected dynamically at run-time depending on the instruction sets supported by the CPU. For example, in the 32-bit x86 CRT, some functions have both an x87 implementation and an SSE2 implementation. When running on a CPU that supports SSE2, the faster SSE2 implementation is used. When running on a CPU that does not support SSE2, the slower x87 implementation is used.
--- Synchronet 3.20a-Linux NewsLink 1.114

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Thu Apr 18 07:23:48 2024

From Newsgroup: comp.lang.forth

minforth wrote:
[..]

[..] For example, in the 32-bit x86 CRT,
some functions have both an x87 implementation and an SSE2 implementation. When
running on a CPU that supports SSE2, the faster SSE2 implementation is used. When running on a CPU that does not support SSE2, the **slower**
[ my emphasis -mhx ] x87 implementation is used.

This strikes me as showing a strong bias (i.e. not strictly based on technical arguments) towards SSE2. I've noticed before that Microsoft has a dislike for the x87 FPU, if not boycotting it outright (e.g. no long double in their compiler).

-marcel
--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Thu Apr 18 17:53:45 2024

From Newsgroup: comp.lang.forth

On 17/04/2024 11:07 pm, minforth wrote:

...
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions

In addition to being available on most Intel x86 cpu's, x87 was cheap to support. A relatively small amount of code was needed to implement the assembler extensions:

https://pastebin.com/Md6BGWmj

There wasn't a good reason not to choose x87. If the hardware stack didn't appeal, use a software stack. Performance will still be very good. I recall pitting my 16-bit DTC forth with x87 against VFX using an fp intensive program/ benchmark. The difference was only a factor of 4. I couldn't believe it.

--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Thu Apr 18 08:08:07 2024

From Newsgroup: comp.lang.forth

mhx wrote:

minforth wrote:
[..]

[..] For example, in the 32-bit x86 CRT,
some functions have both an x87 implementation and an SSE2 implementation. When
running on a CPU that supports SSE2, the faster SSE2 implementation is used. >> When running on a CPU that does not support SSE2, the **slower**
[ my emphasis -mhx ] x87 implementation is used.

This strikes me as showing a strong bias (i.e. not strictly based on technical
arguments) towards SSE2. I've noticed before that Microsoft has a dislike for the x87 FPU, if not boycotting it outright (e.g. no long double in their compiler).

Could be. At least gcc has support for _float80. FWIW Intel's icc even has a compiler flag for x87 stack overflow warnings. It all depends. ;-)
--- Synchronet 3.20a-Linux NewsLink 1.114

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Apr 18 11:00:45 2024

From Newsgroup: comp.lang.forth

In article <6620d188$1@news.ausics.net>, dxf <dxforth@gmail.com> wrote:

On 17/04/2024 11:07 pm, minforth wrote:

...
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions

In addition to being available on most Intel x86 cpu's, x87 was cheap to >support. A relatively small amount of code was needed to implement the >assembler extensions:

https://pastebin.com/Md6BGWmj

There wasn't a good reason not to choose x87. If the hardware stack didn't >appeal, use a software stack. Performance will still be very good. I recall >pitting my 16-bit DTC forth with x87 against VFX using an fp intensive program/
benchmark. The difference was only a factor of 4. I couldn't believe it.

The loadable extension for ciforth runs 20 screens.
This is not politically correct. Only 80 bits floats, only 8 stack depth.
(The assembler is require, which contains 2 screens of fp related instructions.)
The transcendental functions are very low cost,
Nice if you have an occasional cosine.
In implementing the floating point for the transputer, we had
to generate Chebychov polynomials with a special UBASIC
program and a truckload of testing. At least they
accommodated ISO single and double precision because
that was the transputer baseline.

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
--- Synchronet 3.20a-Linux NewsLink 1.114

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Apr 18 11:04:46 2024

From Newsgroup: comp.lang.forth

In article <37580f936ac4a2d21fb84ccf59e6b7a4@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:

minforth wrote:
[..]

[..] For example, in the 32-bit x86 CRT,
some functions have both an x87 implementation and an SSE2 implementation. When
running on a CPU that supports SSE2, the faster SSE2 implementation is used. >> When running on a CPU that does not support SSE2, the **slower**
[ my emphasis -mhx ] x87 implementation is used.

This strikes me as showing a strong bias (i.e. not strictly based on technical >arguments) towards SSE2. I've noticed before that Microsoft has a dislike for >the x87 FPU, if not boycotting it outright (e.g. no long double in their >compiler).

Microsoft are no hobbyists. There has to be a commercial incentive to
take it up, that is not a boycott. OTOH the gcc folks like a
challenge, resulting in a baroque building.

-marcel

--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
--- Synchronet 3.20a-Linux NewsLink 1.114

From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Thu Apr 18 09:09:19 2024

From Newsgroup: comp.lang.forth

minforth wrote:

mhx wrote:

[..]

Could be. At least gcc has support for _float80. FWIW Intel's icc even has a compiler flag for x87 stack overflow warnings.

[..]

At runtime?

-marcel
--- Synchronet 3.20a-Linux NewsLink 1.114

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Apr 18 11:11:17 2024

From Newsgroup: comp.lang.forth

In article <2024Apr14.132507@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

dxf <dxforth@gmail.com> writes:

On 14/04/2024 6:34 pm, Anton Ertl wrote:

From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. Hoever, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.

Do you have a reference for that?

Kahan writes about the original intention in

http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

especially starting at the last paragraph of page 2.

And about the bug (or rather design mistake):

https://history.siam.org/pdfs2/Kahan_final.pdf

Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.

Interesting read. The reverse polish calculator from HP was a
resounding success, with great profits.
Kahan contributed to this.
It was killed by the bean counters at HP.
There was huge demand but they refused to expand production.
Then the calculator died because it simply was not available.

- anton

Groetjes Albert
--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Thu Apr 18 09:27:07 2024

From Newsgroup: comp.lang.forth

mhx wrote:

minforth wrote:

mhx wrote:

[..]

Could be. At least gcc has support for _float80. FWIW Intel's icc even has a >> compiler flag for x87 stack overflow warnings.

[..]

At runtime?

AFAIK they check if computation results are popped off the x87 stack and put into the xmm0 register. Could be a flag for compile-time checking.
--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Thu Apr 18 09:51:55 2024

From Newsgroup: comp.lang.forth

minforth wrote:

mhx wrote:

minforth wrote:

mhx wrote:

[..]

Could be. At least gcc has support for _float80. FWIW Intel's icc even has a
compiler flag for x87 stack overflow warnings.

[..]

At runtime?

AFAIK they check if computation results are popped off the x87 stack and put into the xmm0 register. Could be a flag for compile-time checking.

Correction. It generates runtime code https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/fp-stack-check-qfp-stack-check.html

but it doesn't look very convincing to me.
--- Synchronet 3.20a-Linux NewsLink 1.114

From Stephen Pelc@stephen@vfxforth.com to comp.lang.forth on Thu Apr 18 12:09:22 2024

From Newsgroup: comp.lang.forth

On 17 Apr 2024 at 15:08:01 CEST, "PMF" <PMF> wrote:

In lxf64/ntf64 I use an external C library, specifically fdlibm53.

This claims to be within 1 ulp correct. My testing also suggests this.
fdlibm looks to be the base for most other math libraries.
I could have used libm from gcc but I wanted the same code for both
Linux and Windows.

In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
8 stack items is to low to be useful for anything more complicated.
complex numbers is an example that quickly eats all 8 stack items.

I share your pain. In 2023, I spent far too long trying to satisfy FP users. For VFX64, we ended up providing support for
1) 8 item 387 FP, internal FP stack,
2) 387 FP and FP stack in memory,
3) SSE2 FP and FP stack in memory.

Conversion of FP parameters to/from OS call interfaces is done
automagically by the EXTERN: declarations for system calls and
callbacks.

For a significant proportion of our users, 64 bit FP is inadequate,
and ony 64 bit FP is available on CPUs other than AMD64/x64
unless you are using rare and expensive hardware.

In retrospect, if I were doing this again I would standardise on an
external double-double library (about 106 bits). In most cases that
we encounter, the desire for 387 FP is to gain the extra precision.
Since very few CPUs support quad precision natively, the most
obvious solution is a double-double library.

Stephen
--
Stephen Pelc, stephen@vfxforth.com
MicroProcessor Engineering, Ltd. - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)78 0390 3612, +34 649 662 974
http://www.mpeforth.com
MPE website
http://www.vfxforth.com/downloads/VfxCommunity/
downloads
--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Thu Apr 18 07:52:31 2024

From Newsgroup: comp.lang.forth

On 4/17/24 08:08, PMF wrote:

Krishna Myneni wrote:

On 4/15/24 20:10, minforth wrote:

I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.

Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating point arithmetic e.g. transcendental functions.

Yes that is right. There is a sqrt but nothing more.

In lxf64/ntf64 I use an external C library, specifically fdlibm53.

This claims to be within 1 ulp correct. My testing also suggests this.
fdlibm looks to be the base for most other math libraries.
I could have used libm from gcc but I wanted the same code for both
Linux and Windows.

In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
8 stack items is to low to be useful for anything more complicated.
complex numbers is an example that quickly eats all 8 stack items.
Best Regards
Peter Fälth

Thank you for the clarification. I disassembled the GNU C Math Library
(64-bit libm.so.6) and had a look at the code. It is a huge file. The
long double functions are coded for x87, while functions for float,
double, and float128 use the SSE instructions.

This implies that if x87 is deprecated on 64-bit systems, then 80-bit
floats are also deprecated?

Does fdlibm53 support 80-bit floats and float128?

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Thu Apr 18 08:00:17 2024

From Newsgroup: comp.lang.forth

On 4/18/24 07:09, Stephen Pelc wrote:
...

In retrospect, if I were doing this again I would standardise on an
external double-double library (about 106 bits). In most cases that
we encounter, the desire for 387 FP is to gain the extra precision.
Since very few CPUs support quad precision natively, the most
obvious solution is a double-double library.

I also thought double-double would be sufficient, until last year, when
I ran into having to code an application for which it was unsuitable.
The problem with double-double was not in the number of bits in the significand (106) -- the problem was that the exponent range of double precision was not large enough. I believe the double-double type even
reduces the available exponent range from that of a double type. In any
case I ended up using the MPFR library for this application. IEEE quad precision would be more suitable, since it expands the exponent range.

--
Krishna

--- Synchronet 3.20a-Linux NewsLink 1.114

From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Thu Apr 18 17:10:17 2024

From Newsgroup: comp.lang.forth

The Cephes math library supports extended and 128b floats https://netlib.org/cephes/128bdoc.html

But gnu libquadmath had been good enough for me until now.
--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Fri Apr 19 13:48:04 2024

From Newsgroup: comp.lang.forth

On 18/04/2024 7:11 pm, albert@spenarnc.xs4all.nl wrote:

...
Interesting read. The reverse polish calculator from HP was a
resounding success, with great profits.
Kahan contributed to this.
It was killed by the bean counters at HP.
There was huge demand but they refused to expand production.
Then the calculator died because it simply was not available.

We had quite a few HP instruments but then the bottom fell out of
the test instrument market (and maintenance generally). It was
embarrassing to see the once great HP pursue the consumer PC market.

--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 20 15:23:22 2024

From Newsgroup: comp.lang.forth

Stephen Pelc <stephen@vfxforth.com> writes:

The manual (gasp) documents how to change the default FP package.

Changing the default pack also changes the system call interfaces to
match.

It is great that now an FP package is available by default, rather
than having to load some obscurely-named file from an
installation-specific path. E.g. appbench-1.4 contains a file
setup/vfx.fth which contains the lines:

1 cells 4 = [if]
include /usr/share/doc/VfxForth/Lib/Ndp387.fth
[then]

The result is that when I just tested various systems on the
appbench-1.4 suite, vfx64 worked (for four of the benchmarks), and I
won't spoil my positive message by reporting on what did not work.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 20 15:58:03 2024

From Newsgroup: comp.lang.forth

mhx@iae.nl (mhx) writes:

minforth wrote:

But I think the main advantage lies in the possibility of parallel and/or
vectorized execution.

I have not yet seen algorithms where that would bring something.

Matrix multiplication is an easy case. I have also done a version of
Jon Bentley's greedy TSP program that benefitted from SSE and AVX; I
had to use assembly language to do this, however; see the thread
starting at <2016Nov14.164726@mips.complang.tuwien.ac.at>.

OTOH, yesterday I saw what gcc did for the inner loop of the bubble
benchmark from the Stanford integer benchmarks:

while ( i<top ) {

if ( sortlist[i] > sortlist[i+1] ) {
j = sortlist[i];
sortlist[i] = sortlist[i+1];
sortlist[i+1] = j;
};
i=i+1;
};

top=top-1;
};

gcc-12.2 -O1 produces straighforward scalar code, gcc-12.2 -O3 wants
to use SIMD instructions:

gcc -01 gcc -O3
1c: add $0x4,%rax c0: movq (%rax),%xmm0
cmp %rsi,%rax add $0x1,%edx
je 35 pshufd $0xe5,%xmm0,%xmm1
25: mov (%rax),%edx movd %xmm0,%edi
mov 0x4(%rax),%ecx movd %xmm1,%ecx
cmp %ecx,%edx cmp %ecx,%edi
jle 1c jle e1
mov %ecx,(%rax) pshufd $0xe1,%xmm0,%xmm0
mov %edx,0x4(%rax) movq %xmm0,(%rax)
jmp 1c e1: add $0x4,%rax
35: cmp %r8d,%edx
jl c0

The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast. I think that the reason is that
the movq towards the end stores two items, and the movq at the start
of the next iteration loads one of these item, i.e., there is partial
overlap between the store and the load. In this case the hardware
takes a slow path, which means that the slowdown is much bigger than
the instruction count suggests.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 20 16:23:21 2024

From Newsgroup: comp.lang.forth

albert@spenarnc.xs4all.nl writes:

I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?

Which IEEE requirement of which standard?

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Sat Apr 20 12:00:00 2024

From Newsgroup: comp.lang.forth

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast.

Wow, that seems worth a bug report. How about gcc -O2 ?
--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 21 06:45:29 2024

From Newsgroup: comp.lang.forth

Paul Rubin <no.email@nospam.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast.

Wow, that seems worth a bug report.

At least the compiled code is working as the programmer intended. gcc
people regularly resolve bug reports as "INVALID" where earlier gcc
versions work as intended and later gcc versions do not. And then
there are the cases where they just do nothing, as on PR93811.

How about gcc -O2 ?

I have not tried it.

If you want to measure it and/or report this as a bug, you can find
the source code in <http://www.complang.tuwien.ac.at/forth/bench.zip>
in the directory c-manual.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 21 09:12:54 2024

From Newsgroup: comp.lang.forth

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

OTOH, yesterday I saw what gcc did for the inner loop of the bubble
benchmark from the Stanford integer benchmarks:

while ( i<top ) {

if ( sortlist[i] > sortlist[i+1] ) {
j = sortlist[i];
sortlist[i] = sortlist[i+1];
sortlist[i+1] = j;
};
i=i+1;
};

top=top-1;
};

gcc-12.2 -O1 produces straighforward scalar code, gcc-12.2 -O3 wants
to use SIMD instructions:

gcc -01 gcc -O3
1c: add $0x4,%rax c0: movq (%rax),%xmm0
cmp %rsi,%rax add $0x1,%edx
je 35 pshufd $0xe5,%xmm0,%xmm1
25: mov (%rax),%edx movd %xmm0,%edi
mov 0x4(%rax),%ecx movd %xmm1,%ecx
cmp %ecx,%edx cmp %ecx,%edi
jle 1c jle e1
mov %ecx,(%rax) pshufd $0xe1,%xmm0,%xmm0
mov %edx,0x4(%rax) movq %xmm0,(%rax)
jmp 1c e1: add $0x4,%rax
35: cmp %r8d,%edx
jl c0

The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast. I think that the reason is that
the movq towards the end stores two items, and the movq at the start
of the next iteration loads one of these item, i.e., there is partial
overlap between the store and the load. In this case the hardware
takes a slow path, which means that the slowdown is much bigger than
the instruction count suggests.

I was curious if a more recent Intel core had improved on that (and
maybe such a more recent Intel core was targeted by the "optimization"
that caused the slowdown), so I measured it on a P-core of a Core
i3-1315U. The results are as follows:

O1/bubble O3/bubble
424,248,952 2,061,809,866 cpu_core/cycles/
1,536,825,253 1,986,035,580 cpu_core/instructions/

So, more than a factor of 4 on this microarchitecture.

The differences in the topdown analysis are also interesting:

O1
1,177,188,340 cpu_core/topdown-retiring/ # 46.1% Retiring
279,332,826 cpu_core/topdown-bad-spec/ # 10.9% Bad Speculation
778,141,445 cpu_core/topdown-fe-bound/ # 30.5% Frontend Bound
319,237,516 cpu_core/topdown-be-bound/ # 12.5% Backend Bound
0 cpu_core/topdown-heavy-ops/ # 0.0% Heavy Operations
269,356,654 cpu_core/topdown-br-mispredict/ # 10.5% Branch Mispredict
269,356,654 cpu_core/topdown-fetch-lat/ # 10.5% Fetch Latency
59,857,034 cpu_core/topdown-mem-bound/ # 2.3% Memory Bound

O3
1,599,831,263 cpu_core/topdown-retiring/ # 12.9% Retiring
630,236,558 cpu_core/topdown-bad-spec/ # 5.1% Bad Speculation
533,277,087 cpu_core/topdown-fe-bound/ # 4.3% Frontend Bound
9,598,987,583 cpu_core/topdown-be-bound/ # 77.6% Backend Bound
280,169 cpu_core/topdown-heavy-ops/ # 0.0% Heavy Operations
630,236,558 cpu_core/topdown-br-mispredict/ # 5.1% Branch Mispredict
193,918,941 cpu_core/topdown-fetch-lat/ # 1.6% Fetch Latency
5,623,649,291 cpu_core/topdown-mem-bound/ # 45.5% Memory Bound

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 21 11:44:38 2024

From Newsgroup: comp.lang.forth

In article <2024Apr20.182321@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

albert@spenarnc.xs4all.nl writes:

I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?

Which IEEE requirement of which standard?

The ISO 9x talks about

64-bit IEEE double-precision number
32-bit IEEE double-precision number

IEEE floating point number as defined in ANSI/IEEE Standard
754-1985

80 bit '87 is not compliant with either of these, I think.

- anton

--
Don't praise the day before the evening. One swallow doesn't make spring.
You must not say "hey" before you have crossed the bridge. Don't sell the
hide of the bear until you shot it. Better one bird in the hand than ten in
the air. First gain is a cat purring. - the Wise from Antrim -
--- Synchronet 3.20a-Linux NewsLink 1.114

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 21 10:57:40 2024

From Newsgroup: comp.lang.forth

albert@spenarnc.xs4all.nl writes:

In article <2024Apr20.182321@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote: >>albert@spenarnc.xs4all.nl writes:

I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?

Which IEEE requirement of which standard?

The ISO 9x talks about

Whatever "ISO 9x" may be: Why do you worry about it and mention it
here?

64-bit IEEE double-precision number
32-bit IEEE double-precision number

IEEE floating point number as defined in ANSI/IEEE Standard
754-1985

So?

80 bit '87 is not compliant with either of these, I think.

So?

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023
--- Synchronet 3.20a-Linux NewsLink 1.114

From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 22 21:05:10 2024

From Newsgroup: comp.lang.forth

On 21/04/2024 1:23 am, Anton Ertl wrote:

Stephen Pelc <stephen@vfxforth.com> writes:

The manual (gasp) documents how to change the default FP package.

Changing the default pack also changes the system call interfaces to
match.

It is great that now an FP package is available by default, rather
than having to load some obscurely-named file from an
installation-specific path. E.g. appbench-1.4 contains a file
setup/vfx.fth which contains the lines:

1 cells 4 = [if]
include /usr/share/doc/VfxForth/Lib/Ndp387.fth
[then]

The result is that when I just tested various systems on the
appbench-1.4 suite, vfx64 worked (for four of the benchmarks), and I
won't spoil my positive message by reporting on what did not work.

What didn't work?

When loading an alternate f/p package I found it removes parts of the
compiler unrelated to fp. Hopefully this can be rectified. IIRC this
problem didn't exist with the previous setup where fp had to be explicitly loaded.

--- Synchronet 3.20a-Linux NewsLink 1.114

From Stephen Pelc@stephen@vfxforth.com to comp.lang.forth on Mon Apr 22 19:56:55 2024

From Newsgroup: comp.lang.forth

On 22 Apr 2024 at 13:05:10 CEST, "dxf" <dxforth@gmail.com> wrote:

When loading an alternate f/p package I found it removes parts of the compiler unrelated to fp. Hopefully this can be rectified. IIRC this problem didn't exist with the previous setup where fp had to be explicitly loaded.

Please send me more details by email.

Stephen
--
Stephen Pelc, stephen@vfxforth.com
MicroProcessor Engineering, Ltd. - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)78 0390 3612, +34 649 662 974
http://www.mpeforth.com
MPE website
http://www.vfxforth.com/downloads/VfxCommunity/
downloads
--- Synchronet 3.20a-Linux NewsLink 1.114

Who's Online

System Info

Sysop:	DaiTengu
Location:	Appleton, WI
Users:	915
Nodes:	10 (2 / 8)
Uptime:	45:57:45
Calls:	12,170
Files:	186,521
Messages:	2,234,593

Floating point implementations on AMD64

Who's Online

System Info