I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):
1 FLOATS .
reports:
16 iforth
10 sf64
10 vfx64
For
: foo f+ f* ;
the resulting code is:
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )
iForth:
$10226000  : foo                       488BC04883ED088F4500  H.@H.m..E.
$1022600A  fld [r13 0 +] tbyte         41DB6D00              A[m.
$1022600E  fld [r13 #16 +] tbyte       41DB6D10              A[m.
$10226012  fxch ST(2)                  D9CA                  YJ
$10226014  lea r13, [r13 #32 +] qword  4D8D6D20              M.m
$10226018  faddp ST(1), ST             DEC1                  ^A
$1022601A  fxch ST(1)                  D9C9                  YI
$1022601C  fpopswap,                   41DB6D00D9CA4D8D6D10  A[m.YJM.m.
$10226026  fmulp ST(1), ST             DEC9                  ^I
$10226028  fpush,                      4D8D6DF0D9C941DB7D00  M.mpYIA[}.
$10226032  ;                           488B45004883C508FFE0  H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
Anton Ertl wrote:
[..]
iForth:
$10226000  : foo                       488BC04883ED088F4500  H.@H.m..E.
$1022600A  fld [r13 0 +] tbyte         41DB6D00              A[m.
$1022600E  fld [r13 #16 +] tbyte       41DB6D10              A[m.
$10226012  fxch ST(2)                  D9CA                  YJ
$10226014  lea r13, [r13 #32 +] qword  4D8D6D20              M.m
$10226018  faddp ST(1), ST             DEC1                  ^A
$1022601A  fxch ST(1)                  D9C9                  YI
$1022601C  fpopswap,                   41DB6D00D9CA4D8D6D10  A[m.YJM.m.
$10226026  fmulp ST(1), ST             DEC9                  ^I
$10226028  fpush,                      4D8D6DF0D9C941DB7D00  M.mpYIA[}.
$10226032  ;                           488B45004883C508FFE0  H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
Turbo Pascal had a fast FP mode that used the FPU stack. I found almost immediately that that is unusable for serious work.
Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably.
mhx@iae.nl (mhx) writes:
Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably.
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. However, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.
- anton
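The scheme described in the post above (store some elements on overflow, reload some on underflow) can be modelled in plain Forth. This is only a sketch of the idea, not what the 8087 or any Forth system does: the "registers" are an ordinary array, all names (XPUSH, XPOP, SPILL-OLDEST, RELOAD-NEWEST and the constants) are hypothetical, and moving four items per fault is an arbitrary choice.

```forth
\ Model only: an 8-slot "register" stack that spills to / reloads from memory.
\ Nothing here touches the real x87 registers; all names are hypothetical.
8  constant #regs               \ slots in the modelled register stack
4  constant #move               \ items moved per overflow/underflow (arbitrary)
64 constant #mem                \ capacity of the spill area (not bounds-checked)

falign here #regs floats allot constant regs
falign here #mem  floats allot constant spill

variable in-regs   0 in-regs !  \ items currently in the register model
variable in-mem    0 in-mem !   \ items currently spilled to memory

: reg ( i -- f-addr )  floats regs + ;
: mem ( i -- f-addr )  floats spill + ;

: spill-oldest ( -- )           \ overflow handler: oldest #move items go to memory
  #move 0 do  i reg f@  in-mem @ mem f!  1 in-mem +!  loop
  #regs #move do  i reg f@  i #move - reg f!  loop   \ slide survivors down
  #move negate in-regs +! ;

: reload-newest ( -- )          \ underflow handler: bring back up to #move items
  in-mem @ #move min            ( n )
  dup negate in-mem +!          \ in-mem now indexes the first item to reload
  dup 0 ?do  in-mem @ i + mem f@  i reg f!  loop
  in-regs ! ;

: xpush ( F: r -- )
  in-regs @ #regs = if spill-oldest then
  in-regs @ reg f!  1 in-regs +! ;

: xpop ( F: -- r )
  in-regs @ 0= if reload-newest then
  in-regs @ 0= abort" modelled FP stack is empty"
  -1 in-regs +!  in-regs @ reg f@ ;

: demo ( -- )                   \ push 10 values, then print them in LIFO order
  10 0 do  i 1+ s>d d>f xpush  loop
  10 0 do  xpop f.  loop ;
```

Running DEMO pushes ten values through the eight modelled slots; the ninth push triggers one spill and the seventh pop triggers one reload. The model passes only one value at a time over the system's own FP stack, so it should behave the same on hardware-stack and memory-stack systems.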
Anton Ertl wrote:
[..]
iForth:
$10226000  : foo                       488BC04883ED088F4500  H.@H.m..E.
$1022600A  fld [r13 0 +] tbyte         41DB6D00              A[m.
$1022600E  fld [r13 #16 +] tbyte       41DB6D10              A[m.
$10226012  fxch ST(2)                  D9CA                  YJ
$10226014  lea r13, [r13 #32 +] qword  4D8D6D20              M.m
$10226018  faddp ST(1), ST             DEC1                  ^A
$1022601A  fxch ST(1)                  D9C9                  YI
$1022601C  fpopswap,                   41DB6D00D9CA4D8D6D10  A[m.YJM.m.
$10226026  fmulp ST(1), ST             DEC9                  ^I
$10226028  fpush,                      4D8D6DF0D9C941DB7D00  M.mpYIA[}.
$10226032  ;                           488B45004883C508FFE0  H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
immediately that that is unusable for serious work.
The scheme used is rather complicated. iForth uses the internal stack
when it can prove that there will be no under- or overflow. Non-inlined
calls (F. below) always use the memory stack.
FORTH> pi fvalue val1 pi/2 fvalue val2 ok
FORTH> : test val1 fdup val2 foo val1 f+ val2 f* f. ; ok
FORTH> see test
Flags: ANSI
$015FDA80 : test
$015FDA8A fld $015FD650 tbyte-offset
$015FDA90 fld ST(0)
$015FDA92 fld $015FD670 tbyte-offset
$015FDA98 faddp ST(1), ST
$015FDA9A fmulp ST(1), ST
$015FDA9C fld $015FD650 tbyte-offset
$015FDAA2 faddp ST(1), ST
$015FDAA4 fld $015FD670 tbyte-offset
$015FDAAA fmulp ST(1), ST
$015FDAAC fpush,
$015FDAB6 jmp F.+10 ( $0124ED42 ) offset NEAR
$015FDABB ;
Apparently there are special interrupts that one can enable
to signal FPU stack underflow (and then spill to memory)
but I never got them to work reliably. The software
analysis works fine, but can be fooled in case of rather
contrived circumstances. I have not encountered a bug in the
past two decades.
-marcel
On 14/04/2024 6:34 pm, Anton Ertl wrote:
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. However, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.
Do you have a reference for that?
In article <27089a13c7ce61da7ffb927cb6c365d2@www.novabbs.com>,
mhx <mhx@iae.nl> wrote:
Anton Ertl wrote:
[..]
This is a practical way.
I researched whether it is possible to detect whether the
circular stack overflows. There are instructions to
detect whether a position in this stack is occupied.
For a word that uses a stack 4 deep, you could detect whether
it is necessary to save words this way, I thought.
I couldn't make it work, because essential assembler instructions
are missing. (Or I'm not clever enough.)
On 14/04/2024 5:03 pm, mhx wrote:
Anton Ertl wrote:
[..]
Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
immediately that that is unusable for serious work.
Were that the case, Intel had plenty of opportunity to change it. They had
an academic advising them.
dxf <dxforth@gmail.com> writes:
On 14/04/2024 6:34 pm, Anton Ertl wrote:
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. However, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.
Do you have a reference for that?
Kahan writes about the original intention in
http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
especially starting at the last paragraph of page 2.
And about the bug (or rather design mistake):
https://history.siam.org/pdfs2/Kahan_final.pdf
Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.
...
Below I give Forth code which computes the derivatives. This code is usable only on systems with a separate FP stack. It will be interesting to see the compiled code given by Forth systems using the hardware fpu stack to compute the results. While this example may behave properly, if we go to a fourth order system or higher, it gets less likely that the hardware stack remains usable.
dx/dt = sigma*(y - x)
dy/dt = x*(rho -z) - y
dz/dt = x*y - beta*z
where sigma, rho, and beta are constant parameters.
Let's say we want to write a word DERIVS which computes and stores the
derivatives, given the instantaneous values of x, y, z. This is the
basis for any numerical code which solves the trajectory in time,
starting from an initial condition.
DERIVS ( F: x y z -- )
Hence, we want to place some values x, y, and z onto the fp stack and
compute the three derivatives. Ideally these three values remain on the
fp stack and don't need to be fetched from memory constantly until the
three derivatives are computed, especially if one is using the hardware
fp stack. We allow the constant parameters to be fetched from memory and
the results of the derivative computation to be stored to memory so they
don't overflow the stack. This should be doable with the 8-element
hardware fp stack.
On 14/04/2024 9:25 pm, Anton Ertl wrote:
Kahan writes about the original intention in
http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
especially starting at the last paragraph of page 2.
And about the bug (or rather design mistake):
https://history.siam.org/pdfs2/Kahan_final.pdf
Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.
The latter sounds like someone not getting his way more than a design mistake.
In the first reference Kahan states:
"When the 8087 was designed, I knew that stack over/underflow was an issue of more aesthetic than practical importance. I still regret that the 8087's stack
implementation was not quite so neat as my original intention described in the
accompanying note."
Intel decided Kahan's aesthetic afterthought could be dispensed with.
On 4/13/24 12:55, Anton Ertl wrote:
I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):
1 FLOATS .
reports:
16 iforth
10 sf64
10 vfx64
For
: foo f+ f* ;
the resulting code is:
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )
iForth:
$10226000  : foo                       488BC04883ED088F4500  H.@H.m..E.
$1022600A  fld [r13 0 +] tbyte         41DB6D00              A[m.
$1022600E  fld [r13 #16 +] tbyte       41DB6D10              A[m.
$10226012  fxch ST(2)                  D9CA                  YJ
$10226014  lea r13, [r13 #32 +] qword  4D8D6D20              M.m
$10226018  faddp ST(1), ST             DEC1                  ^A
$1022601A  fxch ST(1)                  D9C9                  YI
$1022601C  fpopswap,                   41DB6D00D9CA4D8D6D10  A[m.YJM.m.
$10226026  fmulp ST(1), ST             DEC9                  ^I
$10226028  fpush,                      4D8D6DF0D9C941DB7D00  M.mpYIA[}.
$10226032  ;                           488B45004883C508FFE0  H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables
(arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code with
such a limit.
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
dx/dt = sigma*(y - x)
dy/dt = x*(rho -z) - y
dz/dt = x*y - beta*z
where sigma, rho, and beta are constant parameters.
Let's say we want to write a word DERIVS which computes and stores the
derivatives, given the instantaneous values of x, y, z. This is the
basis for any numerical code which solves the trajectory in time,
starting from an initial condition.
DERIVS ( F: x y z -- )
Hence, we want to place some values x, y, and z onto the fp stack and
compute the three derivatives. Ideally these three values remain on the
fp stack and don't need to be fetched from memory constantly until the
three derivatives are computed, especially if one is using the hardware
fp stack. We allow the constant parameters to be fetched from memory and
the results of the derivative computation to be stored to memory so they
don't overflow the stack. This should be doable with the 8-element
hardware fp stack.
I have adapted your Forth code:
[UNDEFINED] F2OVER [IF]
: f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
3 fpick 3 fpick ;
[THEN]
16.0e0 fconstant sigma
45.92e0 fconstant rho
4.0e0 fconstant beta
fvariable dx/dt
fvariable dy/dt
fvariable dz/dt
: derivs ( F: x y z -- )
fdup f2over \ F: x y z z x y
f- sigma f* fnegate
dx/dt f! \ F: x y z z
rho fover f- \ F: x y z z rho-z
4 fpick f* \ F: x y z z x*(rho - z)
3 fpick f-
dy/dt f! \ F: x y z z
fdrop
beta f* fnegate
frot frot f* f+ dz/dt f!
;
0.1e 0.6e 4.0e derivs
dx/dt f@ f. cr \ 8.
dy/dt f@ f. cr \ 3.592
dz/dt f@ f. cr \ -15.94
In particular, I eliminated the additional memory accesses to DZ/DT.
SwiftForth, VFX and iforth produce the expected results for your test
case. The code is:
SwiftForth 4.0.0-RC87    VFX Forth 64 5.43           iforth-5.1-mini
ST(0) FLD                FLD ST                      fld ST(0)
44E8BC ( f2over ) CALL   CALL 0050A080 F2OVER        fld [r13 0 +] tbyte
ST(0) ST(1) FSUBP        FSUBP ST(1), ST             fxch ST(1)
44E8FB ( sigma ) CALL    CALL 0050A2BB SIGMA         fld [r13 #16 +] tbyte
ST(0) ST(1) FMULP        FMULP ST(1), ST             lea r13, [r13 #32 +]
FCHS                     FCHS                        fxch ST(3)
-8 [RBP] RBP LEA         FSTP TBYTE FFF9CFE8 [RIP]   fxch ST(1)
RBX 0 [RBP] MOV          CALL 0050A2FB RHO           fld ST(3)
4C508 [RDI] RBX LEA      FLD ST(1)                   fld ST(3)
0 [RBX] TBYTE FSTP       FSUBP ST(1), ST             fsubp ST(1), ST
0 [RBP] RBX MOV          LEA RBP, [RBP+-08]          fld $101BC720 tbyte
8 [RBP] RBP LEA          MOV [RBP], RBX              fmulp ST(1), ST
44E923 ( rho ) CALL      MOV EBX, # 00000004         fchs
ST(1) FLD                CALL 005030C0 FPICK         fstp $10226470 tbyte
ST(0) ST(1) FSUBP        FMULP ST(1), ST             fld $101BC710 tbyte
-8 [RBP] RBP LEA         LEA RBP, [RBP+-08]          fld ST(1)
RBX 0 [RBP] MOV          MOV [RBP], RBX              fsubp ST(1), ST
4 # EBX MOV              MOV EBX, # 00000003         fld ST(4)
43C901 ( FPICK ) CALL    CALL 005030C0 FPICK         fmulp ST(1), ST
ST(0) ST(1) FMULP        FSUBP ST(1), ST             fld ST(3)
-8 [RBP] RBP LEA         FSTP TBYTE FFF9CFC1 [RIP]   fsubp ST(1), ST
RBX 0 [RBP] MOV          FSTP ST                     fstp $10226490 tbyte
3 # EBX MOV              CALL 0050A33B BETA          ffreep ST(0)
43C901 ( FPICK ) CALL    FMULP ST(1), ST             fld $101BC700 tbyte
ST(0) ST(1) FSUBP        FCHS                        fmulp ST(1), ST
-8 [RBP] RBP LEA         FXCH ST(1)                  fchs
RBX 0 [RBP] MOV          FXCH ST(2)                  fxch ST(1)
4C530 [RDI] RBX LEA      FXCH ST(1)                  fxch ST(2)
0 [RBX] TBYTE FSTP       FXCH ST(2)                  fxch ST(1)
0 [RBP] RBX MOV          FMULP ST(1), ST             fxch ST(2)
8 [RBP] RBP LEA          FADDP ST(1), ST             fmulp ST(1), ST
ST(0) FSTP               FSTP TBYTE FFF9CFB4 [RIP]   fxch ST(1)
44E94B ( beta ) CALL     RET/NEXT                    fpopswap,
ST(0) ST(1) FMULP                                    faddp ST(1), ST
FCHS                                                 fstp $102264B0 tbyte
43C807 ( FROT ) CALL                                 ;
43C807 ( FROT ) CALL
ST(0) ST(1) FMULP
ST(0) ST(1) FADDP
-8 [RBP] RBP LEA
RBX 0 [RBP] MOV
4C558 [RDI] RBX LEA
0 [RBX] TBYTE FSTP
0 [RBP] RBX MOV
8 [RBP] RBP LEA
RET
FPICK is apparently implemented on SwiftForth and VFX through an
indirect branch that branches to one of 8 variants of "FLD ST(...)",
while iForth manages to resolve this during compilation.
I have also looked at VFX 5.11 which uses XMM registers instead of the
FP stack, but it does not inline FP operations, so you mostly see a long sequence of calls.
On 14 Apr 2024 at 01:47:20 CEST, "Krishna Myneni" <krishna.myneni@ccreweb.org>
wrote:
On 4/13/24 12:55, Anton Ertl wrote:
I just looked at the floating-point implementations of recent
SwiftForth and VFX (finally present in the system from the start), and
on iForth-5.1-mini (for comparison):
1 FLOATS .
reports:
16 iforth
10 sf64
10 vfx64
For
: foo f+ f* ;
the resulting code is:
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
: foo f+ f* ; ok
see foo
44E8B9 ST(0) ST(1) FADDP DEC1
44E8BB ST(0) ST(1) FMULP DEC9
44E8BD RET C3 ok
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
: foo f+ f* ; ok
see foo
FOO
( 0050A250 DEC1 ) FADDP ST(1), ST
( 0050A252 DEC9 ) FMULP ST(1), ST
( 0050A254 C3 ) RET/NEXT
( 5 bytes, 3 instructions )
iForth:
$10226000  : foo                       488BC04883ED088F4500  H.@H.m..E.
$1022600A  fld [r13 0 +] tbyte         41DB6D00              A[m.
$1022600E  fld [r13 #16 +] tbyte       41DB6D10              A[m.
$10226012  fxch ST(2)                  D9CA                  YJ
$10226014  lea r13, [r13 #32 +] qword  4D8D6D20              M.m
$10226018  faddp ST(1), ST             DEC1                  ^A
$1022601A  fxch ST(1)                  D9C9                  YI
$1022601C  fpopswap,                   41DB6D00D9CA4D8D6D10  A[m.YJM.m.
$10226026  fmulp ST(1), ST             DEC9                  ^I
$10226028  fpush,                      4D8D6DF0D9C941DB7D00  M.mpYIA[}.
$10226032  ;                           488B45004883C508FFE0  H.E.H.E..` ok
So apparently the 8 hardware FP stack items are enough for SwiftForth
and VFX, while iForth prefers to use an FP stack in memory to allow
for a deeper FP stack.
For me, an 8 item hardware fp stack limit is too limiting to be useful.
This is mostly because of my use of the fp stack for initializing tables
(arrays and matrices), and my coding style of returning more than 8
floats on the fp stack for some types of computation. No doubt one can
limit themselves to an 8-item fp stack, but I'd hate to have to code with
such a limit.
The manual (gasp) documents how to change the default FP package.
Changing the default pack also changes the system call interfaces to
match.
dxf <dxforth@gmail.com> writes:
On 14/04/2024 9:25 pm, Anton Ertl wrote:
Kahan writes about the original intention in
http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
especially starting at the last paragraph of page 2.
And about the bug (or rather design mistake):
https://history.siam.org/pdfs2/Kahan_final.pdf
Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.
The latter sounds like someone not getting his way more than a design mistake.
In the first reference Kahan states:
"When the 8087 was designed, I knew that stack over/underflow was an issue of
more aesthetic than practical importance. I still regret that the 8087's stack
implementation was not quite so neat as my original intention described in the
accompanying note."
Intel decided Kahan's aesthetic afterthought could be dispensed with.
In a way, they did, and Kahan obviously did not get his way. But to
me it sounds like they tried and failed at implementing a stack that
extends into memory. The tags that indicate the presence of a stack
item are there. If they had made a conscious decision at the start to
dispense with the idea of an extensible stack, they would have
dispensed with these bits indicating the presence of a stack item as
well. So what happened is that they botched the first attempt, and
then decided that they did not want to do what would have been
necessary to fix it.
The design criterion that never changed was the 8-level hardware stack.
Forthers can either accept it for best performance - or pick something
more forgiving at a lesser performance.
In most cases 'bigger' fp data will be stored in memory anyhow,
which can be cached before disk access. The old 8087 improvements
were due to its new fp operators; the stack was unusable.
And if CPU-based stacks were so lucrative for high performance,
CPU makers would have implemented them long ago for normal
integer data.
In article <3e419396b1ee93c7a391a7ffc0e44ed8@www.novabbs.com>,
minforth <minforth@gmx.net> wrote:
In most cases 'bigger' fp data will be stored in memory anyhow,
which can be cached before disk access. The old 8087 improvements
were caused by its new fp operators, the stack was unusable.
And if CPU based stacks were so lucrative for high performance,
CPU makers would have implemented them since long for normal
integer data.
The iA64 comes to mind. Apparently a failure, but was it a
technical or a commercial failure?
On 4/14/24 21:07, dxf wrote:
...
The design criterion that never changed was the 8-level hardware stack.
Forthers can either accept it for best performance - or pick something
more forgiving at a lesser performance.
In the Lorenz equation example, which works with the 8 deep fpu stack, we have assumed that the fpu hardware stack was empty before calling DERIVS. In a real use case, the call to DERIVS is likely to occur within a deeper call chain, resulting in items already on the fpu stack before args for DERIVS are pushed. As Marcel said, using only a hardware-based fp stack is not realistic for any non-trivial floating point work.
The loss of performance with a memory-based fp stack is far less a concern than having to consider the limited stack depth when writing code involving floating point arithmetic. Failure from overflowing the fpu stack is silent. Debugging is likely to be a nightmare.
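One way to make the depth limit explicit rather than silent is a guard built on the standard word FDEPTH. The following is a minimal sketch under stated assumptions: the names HW-FP-SLOTS, FP-ROOM? and ?FP-ROOM are hypothetical, the limit of 8 applies only to systems that keep the FP stack in the x87 registers, and whether FDEPTH is trustworthy when the hardware stack is nearly full is system-dependent.

```forth
\ Guard against exceeding an 8-slot FP stack (names are hypothetical).
8 constant hw-fp-slots          \ assumption: x87 register stack size

: fp-room? ( n -- flag )        \ true if n more items would still fit
  fdepth + hw-fp-slots > 0= ;

: ?fp-room ( n -- )             \ abort instead of overflowing silently
  fp-room? 0= abort" FP stack: not enough free slots" ;
```

A caller would state how many slots a word needs beyond what is already on the FP stack, for example `3 ?fp-room` before a word that peaks at three items above its arguments. The check costs a call on every use, so it is more of a debugging aid than something to leave in an inner loop.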
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
Failure from overflowing the
fpu stack is silent.
Reality check:
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041
SwiftForth also seems to notice it in some way, but does not report it
as an error:
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok
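The value `NDP SW = 0041` in the VFX output above is the x87 status word. As a small sketch, the two bits that together mark a stack fault can be tested as follows; the bit positions are those of the documented x87 status word (bit 0 is the invalid-operation flag IE, bit 6 is the stack-fault flag SF), while the word and constant names are hypothetical.

```forth
\ Test an x87 status-word value for a stack fault (names are hypothetical).
hex
0001 constant sw-ie             \ bit 0: invalid-operation exception flag
0040 constant sw-sf             \ bit 6: stack-fault flag
decimal

: stack-fault? ( sw -- flag )   \ true if both IE and SF are set
  dup sw-ie and 0<>  swap sw-sf and 0<>  and ;

hex 0041 stack-fault? . decimal \ prints a true flag for the value shown by VFX
```

How a system obtains the status word (for example via FSTSW in an assembler word) is outside this sketch.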
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
X87 mode has long been deprecated
and replaced by SSE2.
gcc also has the option -mfpmath=sse,387 which tells the compiler that
it can use both.
On 4/15/24 09:09, Anton Ertl wrote:
Krishna Myneni <krishna.myneni@ccreweb.org> writes:
Failure from overflowing the
fpu stack is silent.
Reality check:
VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
© MicroProcessor Engineering Ltd, 1998-2023
1e 2e 3e 4e 5e 6e 7e ok F:-7
8e ok
NDP Stack Fault: NDP SW = 0041
NDP Potential Exception: NDP SW = 0041
SwiftForth also seems to notice it in some way, but does not report it
as an error:
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e ok
f. 6.00000000 ok
SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
1e 2e 3e 4e 5e 6e 7e ok
f.
ok
I tried overflowing the fpu stack in kforth32, and no exception is
raised. Perhaps one needs to configure the fpu to raise an exception.
Also tried it in C with an assembly procedure. The executable throws no
exception.
--
Krishna
== begin fpu-stack-overflow.4th ==
\ fpu-stack-overflow.4th
\ for use with kforth32
include ans-words
include strings
include modules
include syscalls
include mc
include asm-x86
code fpu-stack-overflow
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
fld1,
end-code
fpu-stack-overflow
== end fpu-stack-overflow.4th ==
== begin example ==
$ kforth32
kForth-32 v 2.4.5 (Build: 2024-03-30)
Copyright (c) 1998--2023 Krishna Myneni
Contributions by: dpw gd mu bk abs tn cmb bg dnw
Provided under the GNU Affero General Public License, v3.0 or later
include fpu-stack-overflow
ok
== end example ==
In article <uvk861$gs90$1@dont-email.me>,
Krishna Myneni <krishna.myneni@ccreweb.org> wrote:
[..]
I tried overflowing the fpu stack in kforth32, and no exception is
raised. Perhaps one needs to configure the fpu to raise an exception.
Also tried it in C with an assembly procedure. The executable throws no
exception.
[..]
I tried this on ciforth. It crashes with the 10th item, not the 8th.
Groetjes Albert
On 4/15/24 20:10, minforth wrote:
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating point arithmetic e.g. transcendental functions.
Krishna
Krishna Myneni wrote:
[..]
Hardware x87 support isn't necessarily faster:
https://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf
But I think the main advantage lies in the possibility of parallel and/or
vectorized execution.
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions, unless a Forth system is implemented using libc
or using math libraries that better exploit those many features of modern CPUs.
minforth wrote:
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions, unless a Forth system is implemented using libc
or using math libraries that better exploit those many features of modern CPUs.
It all depends, that is what I like about Forth.
[..] For example, in the 32-bit x86 CRT,
some functions have both an x87 implementation and an SSE2 implementation. When
running on a CPU that supports SSE2, the faster SSE2 implementation is used. When running on a CPU that does not support SSE2, the **slower**
[ my emphasis -mhx ] x87 implementation is used.
This strikes me as showing a strong bias (i.e. not strictly based on technical
arguments) towards SSE2. I've noticed before that Microsoft has a dislike for the x87 FPU, if not boycotting it outright (e.g. no long double in their compiler).
On 17/04/2024 11:07 pm, minforth wrote:
...
And of course Forth is very close to assembler and as such it is natural
to use x87 instructions
In addition to being available on most Intel x86 CPUs, x87 was cheap to
support. A relatively small amount of code was needed to implement the
assembler extensions:
https://pastebin.com/Md6BGWmj
There wasn't a good reason not to choose x87. If the hardware stack didn't
appeal, use a software stack. Performance will still be very good. I recall
pitting my 16-bit DTC forth with x87 against VFX using an fp intensive program/
benchmark. The difference was only a factor of 4. I couldn't believe it.
mhx wrote:
[..]
Could be. At least gcc has support for __float80. FWIW Intel's icc even has a
compiler flag for x87 stack overflow warnings.
dxf <dxforth@gmail.com> writes:
On 14/04/2024 6:34 pm, Anton Ertl wrote:
From what I read about this, the intention was that the FP stack would
extend into memory (and thus not be limited to 8 elements): software
should react to FP stack overflows and underflows and store some
elements on overflow, and reload some elements on underflow. However,
this functionality was implemented in a buggy way on the 8087, so it
never worked as intended. However, when they noticed this, the 8087
was already on the market, and Hyrum's law ensured that this behaviour
could not be changed.
Do you have a reference for that?
Kahan writes about the original intention in
http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
especially starting at the last paragraph of page 2.
And about the bug (or rather design mistake):
https://history.siam.org/pdfs2/Kahan_final.pdf
Start with the second-to-last paragraph on page 163. He digresses for
a page, but continues on the fourth paragraph of page 165 and
continues to the first paragraph of page 168.
- anton
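The intended scheme described above (spill some elements to memory on overflow, reload some on underflow) can be modelled in plain C. This is only a sketch of the idea: it does not touch the real x87 exception mechanism, and the names `fpstack`, `fp_push`, `fp_pop`, the spill size, and the choice to move half of the register file at a time are all assumptions made for illustration.

```c
#include <assert.h>

#define HW_SLOTS  8     /* the x87 has eight stack registers */
#define SPILL_MAX 256   /* arbitrary memory-extension size for this sketch */

typedef struct {
    long double hw[HW_SLOTS];      /* stands in for the register stack;
                                      hw[hw_depth-1] is the top of stack */
    int hw_depth;
    long double spill[SPILL_MAX];  /* deeper items that were moved to memory */
    int spill_depth;
} fpstack;

/* "Overflow handler": move the deepest half of the registers to memory. */
static void spill_half(fpstack *s) {
    int n = HW_SLOTS / 2;
    assert(s->spill_depth + n <= SPILL_MAX);
    for (int i = 0; i < n; i++)
        s->spill[s->spill_depth++] = s->hw[i];
    for (int i = n; i < s->hw_depth; i++)
        s->hw[i - n] = s->hw[i];
    s->hw_depth -= n;
}

/* "Underflow handler": reload up to half a register file from memory.
   Only called when the register part is empty. */
static void refill_half(fpstack *s) {
    int n = s->spill_depth < HW_SLOTS / 2 ? s->spill_depth : HW_SLOTS / 2;
    for (int i = 0; i < n; i++)
        s->hw[i] = s->spill[s->spill_depth - n + i];
    s->spill_depth -= n;
    s->hw_depth = n;
}

void fp_push(fpstack *s, long double x) {
    if (s->hw_depth == HW_SLOTS)
        spill_half(s);
    s->hw[s->hw_depth++] = x;
}

long double fp_pop(fpstack *s) {
    if (s->hw_depth == 0)
        refill_half(s);
    assert(s->hw_depth > 0);       /* a genuinely empty stack is an error */
    return s->hw[--s->hw_depth];
}
```

With this model a program can push 20 values and pop them back in reverse order even though only eight "registers" exist; the handlers run only at the overflow and underflow boundaries.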
minforth wrote:
mhx wrote:[..]
Could be. At least gcc has support for _float80. FWIW Intel's icc even has a compiler flag for x87 stack overflow warnings.[..]
At runtime?
mhx wrote:
minforth wrote:
mhx wrote:[..]
Could be. At least gcc has support for _float80. FWIW Intel's icc even has a compiler flag for x87 stack overflow warnings.[..]
At runtime?
AFAIK they check if computation results are popped off the x87 stack and put into the xmm0 register. Could be a flag for compile-time checking.
Krishna Myneni wrote:
On 4/15/24 20:10, minforth wrote:
I would be surprised to get a SIGFPE interrupt from x87 stack ops.
X87 mode has long been deprecated and replaced by SSE2.
Maybe I'm missing something, but SSE2 does not seem to have anything
beyond basic floating point arithmetic e.g. transcendental functions.
Yes that is right. There is a sqrt but nothing more.
In lxf64/ntf64 I use an external C library, specifically fdlibm53.
This claims to be within 1 ulp correct. My testing also suggests this.
fdlibm looks to be the base for most other math libraries.
I could have used libm from gcc but I wanted the same code for both
Linux and Windows.
In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
8 stack items is too low to be useful for anything more complicated.
Complex numbers are an example that quickly eats all 8 stack items.
Best Regards
Peter Fälth
In retrospect, if I were doing this again I would standardise on an
external double-double library (about 106 bits). In most cases that
we encounter, the desire for 387 FP is to gain the extra precision.
Since very few CPUs support quad precision natively, the most
obvious solution is a double-double library.
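For readers unfamiliar with the technique: a double-double value is an unevaluated sum of two doubles, which gives roughly 2 x 53 = 106 significand bits. The following is a minimal sketch of addition only, built on Knuth's TwoSum. The type and function names (`dd`, `two_sum`, `dd_add`, `dd_from_double`) are hypothetical and not taken from any particular library, and the code assumes strict IEEE double evaluation (no `-ffast-math`, no x87 excess precision).

```c
/* A double-double: value = hi + lo, with lo much smaller than hi. */
typedef struct { double hi, lo; } dd;

/* Knuth's TwoSum: returns s = fl(a + b) and the rounding error e,
   so that s + e equals a + b exactly. */
static dd two_sum(double a, double b) {
    double s  = a + b;
    double bb = s - a;
    double e  = (a - (s - bb)) + (b - bb);
    dd r = { s, e };
    return r;
}

dd dd_from_double(double a) {
    dd r = { a, 0.0 };
    return r;
}

/* Simple ("sloppy") double-double addition: add the high parts exactly,
   fold in the low parts, then renormalise so that hi carries the
   leading bits again. */
dd dd_add(dd x, dd y) {
    dd s = two_sum(x.hi, y.hi);
    double lo = s.lo + (x.lo + y.lo);
    double hi = s.hi + lo;          /* fast two-sum renormalisation */
    lo = lo - (hi - s.hi);
    dd r = { hi, lo };
    return r;
}
```

As a usage example, adding `0x1p-80` to `1.0` is lost entirely in plain double arithmetic, whereas `dd_add` keeps the small term in `lo`, and subtracting `1.0` again recovers it.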
...
Interesting read. The reverse polish calculator from HP was a
resounding success, with great profits.
Kahan contributed to this.
It was killed by the bean counters at HP.
There was huge demand but they refused to expand production.
Then the calculator died because it simply was not available.
The manual (gasp) documents how to change the default FP package.
Changing the default pack also changes the system call interfaces to
match.
minforth wrote:
But I think the main advantage lies in the possibility of parallel and/or
vectorized execution.
I have not yet seen algorithms where that would bring something.
I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast.
Wow, that seems worth a bug report.
How about gcc -O2 ?
OTOH, yesterday I saw what gcc did for the inner loop of the bubble
benchmark from the Stanford integer benchmarks:
    while ( i<top ) {
      if ( sortlist[i] > sortlist[i+1] ) {
        j = sortlist[i];
        sortlist[i] = sortlist[i+1];
        sortlist[i+1] = j;
      };
      i=i+1;
    };
    top=top-1;
  };
gcc-12.2 -O1 produces straightforward scalar code, gcc-12.2 -O3 wants
to use SIMD instructions:
    gcc -O1                        gcc -O3
1c: add    $0x4,%rax          c0: movq   (%rax),%xmm0
    cmp    %rsi,%rax              add    $0x1,%edx
    je     35                     pshufd $0xe5,%xmm0,%xmm1
25: mov    (%rax),%edx            movd   %xmm0,%edi
    mov    0x4(%rax),%ecx         movd   %xmm1,%ecx
    cmp    %ecx,%edx              cmp    %ecx,%edi
    jle    1c                     jle    e1
    mov    %ecx,(%rax)            pshufd $0xe1,%xmm0,%xmm0
    mov    %edx,0x4(%rax)         movq   %xmm0,(%rax)
    jmp    1c                 e1: add    $0x4,%rax
35:                               cmp    %r8d,%edx
                                  jl     c0
The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1 and is actually slower than several
Forth systems, including gforth-fast. I think that the reason is that
the movq towards the end stores two items, and the movq at the start
of the next iteration loads one of these items, i.e., there is partial
overlap between the store and the load. In this case the hardware
takes a slow path, which means that the slowdown is much bigger than
the instruction count suggests.
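To try the comparison on other hardware or compiler versions, the loop quoted above can be wrapped in a small self-contained function. This is a sketch, not the Stanford benchmark source: the function name `bubble`, its signature, and the 0-based indexing are assumptions made here, since the post shows only the inner loop.

```c
/* Bubble sort using the same compare-and-swap inner loop as in the post.
   Each pass reads sortlist[i] and sortlist[i+1], and the next iteration
   reads sortlist[i+1] again, which is the access pattern that matters
   for the store/load overlap discussed above. */
void bubble(int *sortlist, int n) {
    int top = n - 1;                /* index of the last unsorted element */
    while (top > 0) {
        int i = 0;
        while (i < top) {
            if (sortlist[i] > sortlist[i + 1]) {
                int j = sortlist[i];
                sortlist[i] = sortlist[i + 1];
                sortlist[i + 1] = j;
            }
            i = i + 1;
        }
        top = top - 1;
    }
}
```

Compiling this once with `gcc -O1` and once with `gcc -O3` and inspecting the result with `objdump -d` shows whether a given gcc version picks the scalar or the SIMD form; other versions may well generate different code from what is listed above.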
albert@spenarnc.xs4all.nl writes:
I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?
Which IEEE requirement of which standard?
- anton
In article <2024Apr20.182321@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
albert@spenarnc.xs4all.nl writes:
I cut the same corners with ciforth. However I think this
cannot be compliant with the IEEE requirement of the standard?
Which IEEE requirement of which standard?
The ISO 9x talks about
64-bit IEEE double-precision number
32-bit IEEE single-precision number
IEEE floating point number as defined in ANSI/IEEE Standard
754-1985
80 bit '87 is not compliant with either of these, I think.
Stephen Pelc <stephen@vfxforth.com> writes:
The manual (gasp) documents how to change the default FP package.
Changing the default pack also changes the system call interfaces to
match.
It is great that now an FP package is available by default, rather
than having to load some obscurely-named file from an
installation-specific path. E.g. appbench-1.4 contains a file
setup/vfx.fth which contains the lines:
1 cells 4 = [if]
include /usr/share/doc/VfxForth/Lib/Ndp387.fth
[then]
The result is that when I just tested various systems on the
appbench-1.4 suite, vfx64 worked (for four of the benchmarks), and I
won't spoil my positive message by reporting on what did not work.
When loading an alternate f/p package I found it removes parts of the compiler unrelated to fp. Hopefully this can be rectified. IIRC this problem didn't exist with the previous setup where fp had to be explicitly loaded.