• Floating point implementations on AMD64

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 13 17:55:18 2024
    From Newsgroup: comp.lang.forth

    I just looked at the floating-point implementations of recent
    SwiftForth and VFX (finally present in the system from the start), and
    on iForth-5.1-mini (for comparison):

    1 FLOATS .

    reports:

    16 iforth
    10 sf64
    10 vfx64

    For

    : foo f+ f* ;

    the resulting code is:

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    : foo f+ f* ; ok
    see foo
    44E8B9 ST(0) ST(1) FADDP DEC1
    44E8BB ST(0) ST(1) FMULP DEC9
    44E8BD RET C3 ok


    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2023

    : foo f+ f* ; ok
    see foo
    FOO
    ( 0050A250 DEC1 ) FADDP ST(1), ST
    ( 0050A252 DEC9 ) FMULP ST(1), ST
    ( 0050A254 C3 ) RET/NEXT
    ( 5 bytes, 3 instructions )


    iForth:
    $10226000  : foo                        488BC04883ED088F4500      H.@H.m..E.
    $1022600A  fld           [r13 0 +] tbyte
                                            41DB6D00                  A[m.
    $1022600E  fld           [r13 #16 +] tbyte
                                            41DB6D10                  A[m.
    $10226012  fxch          ST(2)          D9CA                      YJ
    $10226014  lea           r13, [r13 #32 +] qword
                                            4D8D6D20                  M.m
    $10226018  faddp         ST(1), ST      DEC1                      ^A
    $1022601A  fxch          ST(1)          D9C9                      YI
    $1022601C  fpopswap,                    41DB6D00D9CA4D8D6D10      A[m.YJM.m.
    $10226026  fmulp         ST(1), ST      DEC9                      ^I
    $10226028  fpush,                       4D8D6DF0D9C941DB7D00      M.mpYIA[}.
    $10226032  ;                            488B45004883C508FFE0      H.E.H.E..` ok

    So apparently the 8 hardware FP stack items are enough for SwiftForth
    and VFX, while iForth prefers to use an FP stack in memory to allow
    for a deeper FP stack.
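
    A rough way to check which model a given system follows is to push more
    than eight floats and consume them again. The sketch below uses only
    standard words; PUSH-FLOATS, SUM-FLOATS and DEEP-TEST are names made up
    for this illustration and are not taken from any of the systems above.

    ```forth
    \ Illustration only; the word names are hypothetical.
    : push-floats ( n -- ) ( F: -- r0 .. rn-1 )    \ push 0e 1e .. n-1
       0 ?do  i s>d d>f  loop ;
    : sum-floats ( n -- ) ( F: r0 .. rn-1 -- sum ) \ n-1 additions; needs n >= 1
       1 ?do  f+  loop ;
    : deep-test ( n -- )
       dup push-floats  sum-floats  f. ;

    12 deep-test   \ 0+1+...+11 = 66 when the FP stack can hold 12 items
    ```

    On a system restricted to the eight x87 registers the same line is
    expected to misbehave (a NaN or a stack error, for example), but the
    exact symptom depends on the system.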

    Gforth sticks out by using 8-byte FP values; most of those are stored
    in memory (supporting deep FP stacks), with the top of FP stack in an
    xmm register on AMD64:

    : foo f+ f* ; ok
    see-code foo
    $7FF2CE8034E0 f+ 1->1
    7FF2CE4A6E43: mov rax,r12
    7FF2CE4A6E46: lea r12,$08[r12]
    7FF2CE4A6E4B: addsd xmm15,$08[rax]
    $7FF2CE8034E8 f* 1->1
    7FF2CE4A6E51: mov rax,r12
    7FF2CE4A6E54: lea r12,$08[r12]
    7FF2CE4A6E59: mulsd xmm15,$08[rax]
    $7FF2CE8034F0 ;s 1->1
    7FF2CE4A6E5F: mov rbx,[r14]
    7FF2CE4A6E62: add r14,$08
    7FF2CE4A6E66: mov rax,[rbx]
    7FF2CE4A6E69: jmp eax

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Sat Apr 13 18:47:20 2024

    On 4/13/24 12:55, Anton Ertl wrote:
    > I just looked at the floating-point implementations of recent
    > SwiftForth and VFX (finally present in the system from the start), and
    > on iForth-5.1-mini (for comparison):
    > [..]
    > So apparently the 8 hardware FP stack items are enough for SwiftForth
    > and VFX, while iForth prefers to use an FP stack in memory to allow
    > for a deeper FP stack.
    >
    > ...

    For me, an 8-item hardware fp stack limit is too limiting to be useful.
    This is mostly because of my use of the fp stack for initializing tables
    (arrays and matrices), and my coding style of returning more than 8
    floats on the fp stack for some types of computation. No doubt one can
    limit oneself to an 8-item fp stack, but I'd hate to have to code with
    such a limit.

    --
    Krishna


  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 14 13:29:01 2024

    On 14/04/2024 3:55 am, Anton Ertl wrote:
    > I just looked at the floating-point implementations of recent
    > SwiftForth and VFX (finally present in the system from the start), and
    > on iForth-5.1-mini (for comparison):
    > ...
    > So apparently the 8 hardware FP stack items are enough for SwiftForth
    > and VFX, while iForth prefers to use an FP stack in memory to allow
    > for a deeper FP stack.

    All the more reason to make fp loadable rather than built-in, so users
    can choose the model they want. IIRC VFX and SWF previously did this.


  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Sun Apr 14 07:03:15 2024

    Anton Ertl wrote:
    > [..]
    > iForth:
    > [..]
    > So apparently the 8 hardware FP stack items are enough for SwiftForth
    > and VFX, while iForth prefers to use an FP stack in memory to allow
    > for a deeper FP stack.

    Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
    immediately that it was unusable for serious work.

    The scheme used is rather complicated. iForth uses the internal stack
    when it can prove that there will be no under- or overflow. Non-inlined
    calls (F. below) always use the memory stack.

    FORTH> pi fvalue val1 pi/2 fvalue val2 ok
    FORTH> : test val1 fdup val2 foo val1 f+ val2 f* f. ; ok
    FORTH> see test
    Flags: ANSI
    $015FDA80 : test
    $015FDA8A fld $015FD650 tbyte-offset
    $015FDA90 fld ST(0)
    $015FDA92 fld $015FD670 tbyte-offset
    $015FDA98 faddp ST(1), ST
    $015FDA9A fmulp ST(1), ST
    $015FDA9C fld $015FD650 tbyte-offset
    $015FDAA2 faddp ST(1), ST
    $015FDAA4 fld $015FD670 tbyte-offset
    $015FDAAA fmulp ST(1), ST
    $015FDAAC fpush,
    $015FDAB6 jmp F.+10 ( $0124ED42 ) offset NEAR
    $015FDABB ;

    Apparently there are special interrupts that one can enable
    to signal FPU stack underflow (and then spill to memory)
    but I never got them to work reliably. The software
    analysis works fine, but can be fooled in case of rather
    contrived circumstances. I have not encountered a bug in the
    past two decades.

    -marcel
  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 14 18:02:10 2024

    On 14/04/2024 5:03 pm, mhx wrote:
    > Anton Ertl wrote:
    >> [..]
    >> So apparently the 8 hardware FP stack items are enough for SwiftForth
    >> and VFX, while iForth prefers to use an FP stack in memory to allow
    >> for a deeper FP stack.
    >
    > Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
    > immediately that that is unusable for serious work.

    Were that the case, Intel had plenty of opportunity to change it. They
    had an academic advising them.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 14 08:34:35 2024

    mhx@iae.nl (mhx) writes:
    > Apparently there are special interrupts that one can enable
    > to signal FPU stack underflow (and then spill to memory)
    > but I never got them to work reliably.

    From what I read about this, the intention was that the FP stack would
    extend into memory (and thus not be limited to 8 elements): software
    should react to FP stack overflows and underflows and store some
    elements on overflow, and reload some elements on underflow. However,
    this functionality was implemented in a buggy way on the 8087, so it
    never worked as intended. By the time they noticed this, the 8087
    was already on the market, and Hyrum's law ensured that this behaviour
    could not be changed.

    And apparently this feature was not considered to be important enough
    to add a new architectural feature that allows implementing the FP
    stack extension to memory. I guess that the implementations of
    Fortran and Algol-family languages (e.g., C) in the 1980s only used
    the stack within an expression, so avoiding FP stack overflows with
    compiler analysis (like you do) is relatively easy.

    - anton
  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 14 20:05:33 2024

    On 14/04/2024 6:34 pm, Anton Ertl wrote:
    > mhx@iae.nl (mhx) writes:
    >> Apparently there are special interrupts that one can enable
    >> to signal FPU stack underflow (and then spill to memory)
    >> but I never got them to work reliably.
    >
    > From what I read about this, the intention was that the FP stack would
    > extend into memory (and thus not be limited to 8 elements): software
    > should react to FP stack overflows and underflows and store some
    > elements on overflow, and reload some elements on underflow. However,
    > this functionality was implemented in a buggy way on the 8087, so it
    > never worked as intended. However, when they noticed this, the 8087
    > was already on the market, and Hyrum's law ensured that this behaviour
    > could not be changed.

    Do you have a reference for that? Below is a paper written by one of the
    designers and it doesn't appear to be mentioned. It's of course possible
    to maintain a stack in software and use the FPU to do the calculations.
    There are instructions to load/store Temp Real format to memory and that
    gets a mention.

    https://dl.acm.org/doi/pdf/10.1145/800053.801923

  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 14 13:12:19 2024

    In article <2024Apr13.195518@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    > [..]
    > SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    > : foo f+ f* ; ok
    > see foo
    > 44E8B9 ST(0) ST(1) FADDP DEC1
    > 44E8BB ST(0) ST(1) FMULP DEC9
    > 44E8BD RET C3 ok
    >
    > VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    > [..]
    > ( 0050A250 DEC1 ) FADDP ST(1), ST
    > ( 0050A252 DEC9 ) FMULP ST(1), ST
    > ( 0050A254 C3 ) RET/NEXT
    > ( 5 bytes, 3 instructions )

    I cut the same corners with ciforth. However, I think this
    cannot be compliant with the IEEE requirement of the standard?

    > - anton
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 14 13:21:14 2024

    In article <27089a13c7ce61da7ffb927cb6c365d2@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    > Anton Ertl wrote:
    > [..]
    >> So apparently the 8 hardware FP stack items are enough for SwiftForth
    >> and VFX, while iForth prefers to use an FP stack in memory to allow
    >> for a deeper FP stack.
    >
    > [..]
    > Apparently there are special interrupts that one can enable
    > to signal FPU stack underflow (and then spill to memory)
    > but I never got them to work reliably. The software
    > analysis works fine, but can be fooled in case of rather
    > contrived circumstances. I have not encountered a bug in the
    > past two decades.

    This is a practical way.

    I researched whether it is possible to detect whether the
    circular stack overflows. There are instructions to
    detect whether a position in this stack is occupied.
    For a word that uses a stack 4 deep, you could detect whether
    it is necessary to save words this way, I thought.
    I couldn't make it work, because essential assembler instructions
    are missing. (Or I'm not clever enough.)

    > -marcel
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 14 11:25:07 2024

    dxf <dxforth@gmail.com> writes:
    > On 14/04/2024 6:34 pm, Anton Ertl wrote:
    >> [..] this functionality was implemented in a buggy way on the 8087,
    >> so it never worked as intended. [..]
    >
    > Do you have a reference for that?

    Kahan writes about the original intention in

    http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

    especially starting at the last paragraph of page 2.

    And about the bug (or rather design mistake):

    https://history.siam.org/pdfs2/Kahan_final.pdf

    Start with the second-to-last paragraph on page 163. He digresses for
    a page, but resumes in the fourth paragraph of page 165 and
    continues to the first paragraph of page 168.

    - anton
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Sun Apr 14 11:59:51 2024

    albert@spenarnc.xs4all.nl wrote:

    > In article <27089a13c7ce61da7ffb927cb6c365d2@www.novabbs.com>,
    > mhx <mhx@iae.nl> wrote:
    >> Anton Ertl wrote:
    [..]
    > This is a practical way.
    >
    > I researched whether it is possible to detect whether the
    > circular stack overflows. There are instructions to
    > detect whether a position in this stack is occupied.
    > For a word that uses a stack 4 deep, you could detect whether
    > it is necessary to save words this way, I thought.
    > I couldn't make it work, because essential assembler instructions
    > are missing. (Or I'm not clever enough.)

    I vaguely remember something like that for the FPU stack (combined
    with interrupts?). It falls under the category "I couldn't make
    it work."

    -marcel
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Sun Apr 14 07:50:42 2024

    On 4/14/24 03:02, dxf wrote:
    > On 14/04/2024 5:03 pm, mhx wrote:
    >> Anton Ertl wrote:
    >>> [..]
    >>> So apparently the 8 hardware FP stack items are enough for SwiftForth
    >>> and VFX, while iForth prefers to use an FP stack in memory to allow
    >>> for a deeper FP stack.
    >>
    >> Turbo Pascal had a fast FP mode that used the FPU stack. I found almost
    >> immediately that that is unusable for serious work.
    >
    > Were that the case Intel had plenty opportunity to change it. They had
    > an academic advising them.


    Let's take a non-trivial example to illustrate why the 8-deep fp stack
    may not be that useful for numerical computation. This example is
    actually from the FSL demo. The word computes the Lorenz equations,
    which give rise to the famous butterfly attractor. This is a system of
    three nonlinear first-order differential equations in three variables,
    x, y, z, which are time dependent. The Lorenz equations define the
    instantaneous derivatives of these variables:

    dx/dt = sigma*(y - x)
    dy/dt = x*(rho - z) - y
    dz/dt = x*y - beta*z

    where sigma, rho, and beta are constant parameters.

    Let's say we want to write a word DERIVS which computes and stores the
    derivatives, given the instantaneous values of x, y, z. This is the
    basis for any numerical code which solves the trajectory in time,
    starting from an initial condition.

    DERIVS ( F: x y z -- )

    Hence, we want to place some values x, y, and z onto the fp stack and
    compute the three derivatives. Ideally these three values remain on the
    fp stack and don't need to be fetched from memory constantly until the
    three derivatives are computed, especially if one is using the hardware
    fp stack. We allow the constant parameters to be fetched from memory and
    the results of the derivative computation to be stored to memory so they
    don't overflow the stack. This should be doable with the 8-element
    hardware fp stack.

    Below I give Forth code which computes the derivatives. This code is
    usable only on systems with a separate FP stack. It will be interesting
    to see the compiled code given by Forth systems using the hardware fpu
    stack to compute the results. While this example may behave properly, if
    we go to a fourth-order system or higher, it gets less likely that the
    hardware stack remains usable.

    --
    Krishna

    == begin fpstack-test.4th ==
    \ fpstack-test.4th
    \
    \ Compute the Lorenz equations, a set of three coupled
    \ nonlinear differential equations.
    \
    \ dx/dt = sigma*(y - x)
    \ dy/dt = x*(rho -z) - y
    \ dz/dt = x*y - beta*z
    \
    \ sigma, rho, and beta are constant parameters.
    \
    \ The following code requires a separate fp stack

    include ans-words \ only for kForth64
    include fsl/fsl-util


    [UNDEFINED] FPICK [IF]
    cr .( Your system may not use a separate floating point stack!)
    ABORT
    [THEN]

    [UNDEFINED] F2OVER [IF]
    : f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
    3 fpick 3 fpick ;
    [THEN]

    [UNDEFINED] F+! [IF]
    : f+! ( a -- ) ( F: r -- ) dup f@ f+ f! ;
    [THEN]

    16.0e0 fconstant sigma
    45.92e0 fconstant rho
    4.0e0 fconstant beta

    \ Compute the derivatives given the instantaneous values
    \ x, y, z for a given time t.

    \ xdot{ is an array consisting of dx/dt, dy/dt, dz/dt
    3 float array xdot{

    : derivs ( F: x y z -- )
        fdup f2over            \ F: x y z z x y
        f- sigma f* fnegate
        xdot{ 0 } f!           \ F: x y z z
        rho fover f-           \ F: x y z z rho-z
        4 fpick f*             \ F: x y z z x*(rho - z)
        3 fpick f-
        xdot{ 1 } f!           \ F: x y z z
        fdrop
        beta f* fnegate
        xdot{ 2 } f!
        f* xdot{ 2 } f+!
    ;

    0 [IF]
    include ttester
    \ Test DERIVS
    1e-15 set-near
    t{ 0.1e 0.6e 4.0e derivs -> }t
    t{ xdot{ 0 } f@ -> 8.0e0 }t
    t{ xdot{ 1 } f@ -> 3.592e0 }t
    t{ xdot{ 2 } f@ -> -15.94e0 }t
    [THEN]

    == end fpstack-test.4th ==
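
    For comparison, the same derivatives can be computed with the inputs
    parked in memory, which keeps the FP stack depth at three or less at the
    cost of extra loads. This is only a sketch of that alternative, not code
    from the FSL; the names X0, Y0, Z0, XDOT, YDOT, ZDOT and DERIVS-MEM are
    hypothetical, and the constants are repeated so that the fragment stands
    alone.

    ```forth
    \ Illustration only; all names except the constants are hypothetical.
    16.0e0  fconstant sigma
    45.92e0 fconstant rho
    4.0e0   fconstant beta

    fvariable x0    fvariable y0    fvariable z0     \ inputs
    fvariable xdot  fvariable ydot  fvariable zdot   \ results

    : derivs-mem ( F: x y z -- )
       z0 f!  y0 f!  x0 f!                           \ FP stack is empty again
       y0 f@ x0 f@ f-  sigma f*            xdot f!   \ sigma*(y - x)
       rho z0 f@ f-  x0 f@ f*  y0 f@ f-    ydot f!   \ x*(rho - z) - y
       x0 f@ y0 f@ f*  beta z0 f@ f*  f-   zdot f!   \ x*y - beta*z
    ;

    0.1e 0.6e 4.0e derivs-mem
    xdot f@ f. cr   \ about 8
    ydot f@ f. cr   \ about 3.592
    zdot f@ f. cr   \ about -15.94
    ```

    The trade-off is the one discussed in this thread: more memory traffic
    in exchange for never needing more than a few FP stack items.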


  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 14 23:35:24 2024

    On 14/04/2024 9:25 pm, Anton Ertl wrote:
    > dxf <dxforth@gmail.com> writes:
    >> On 14/04/2024 6:34 pm, Anton Ertl wrote:
    >>> [..]
    >>
    >> Do you have a reference for that?
    >
    > Kahan writes about the original intention in
    >
    > http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf
    >
    > especially starting at the last paragraph of page 2.
    >
    > And about the bug (or rather design mistake):
    >
    > https://history.siam.org/pdfs2/Kahan_final.pdf
    >
    > Start with the second-to-last paragraph on page 163. He digresses for
    > a page, but continues on the fourth paragraph of page 165 and
    > continues to the first paragraph of page 168.

    The latter sounds like someone not getting his way more than a design
    mistake. In the first reference Kahan states:

    "When the 8087 was designed, I knew that stack over/underflow was an
    issue of more aesthetic than practical importance. I still regret that
    the 8087's stack implementation was not quite so neat as my original
    intention described in the accompanying note."

    Intel decided Kahan's aesthetic afterthought could be dispensed with.
    History appears to have proven them correct. Were 8 levels of stack
    actually insufficient, it would have made more sense for Intel to double
    it (if not for the 8087 then the next incarnation) than to spill to
    memory which was bad in every way.

  • From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 15 00:27:41 2024

    On 14/04/2024 10:50 pm, Krishna Myneni wrote:
    > ...
    > Below I give Forth code which computes the derivatives. This code is
    > usable only on systems with a separate FP stack. It will be interesting
    > to see the compiled code given by Forth systems using the hardware fpu
    > stack to compute the results. While this example may behave properly, if
    > we go to a fourth order system or higher, it gets less likely that the
    > hardware stack remains usable.

    Systems that default to hardware fpu stack may well offer a software stack option.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 14 15:19:41 2024

    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    > dx/dt = sigma*(y - x)
    > dy/dt = x*(rho -z) - y
    > dz/dt = x*y - beta*z
    >
    > where sigma, rho, and beta are constant parameters.
    >
    > Let's say we want to write a word DERIVS which computes and stores the
    > derivatives, given the instantaneous values of x, y, z. This is the
    > basis for any numerical code which solves the trajectory in time,
    > starting from an initial condition.
    >
    > DERIVS ( F: x y z -- )
    >
    > Hence, we want to place some values x, y, and z onto the fp stack and
    > compute the three derivatives. Ideally these three values remain on the
    > fp stack and don't need to be fetched from memory constantly until the
    > three derivatives are computed, especially if one is using the hardware
    > fp stack. We allow the constant parameters to be fetched from memory and
    > the results of the derivative computation to be stored to memory so they
    > don't overflow the stack. This should be doable with the 8-element
    > hardware fp stack.

    I have adapted your Forth code:

    [UNDEFINED] F2OVER [IF]
    : f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
    3 fpick 3 fpick ;
    [THEN]

    16.0e0 fconstant sigma
    45.92e0 fconstant rho
    4.0e0 fconstant beta

    fvariable dx/dt
    fvariable dy/dt
    fvariable dz/dt

    : derivs ( F: x y z -- )
        fdup f2over            \ F: x y z z x y
        f- sigma f* fnegate
        dx/dt f!               \ F: x y z z
        rho fover f-           \ F: x y z z rho-z
        4 fpick f*             \ F: x y z z x*(rho - z)
        3 fpick f-
        dy/dt f!               \ F: x y z z
        fdrop
        beta f* fnegate
        frot frot f* f+ dz/dt f!
    ;

    0.1e 0.6e 4.0e derivs
    dx/dt f@ f. cr \ 8.
    dy/dt f@ f. cr \ 3.592
    dz/dt f@ f. cr \ -15.94

    In particular, I eliminated the additional memory accesses to DZ/DT.

    SwiftForth, VFX and iforth produce the expected results for your test
    case. The code is:

    SwiftForth 4.0.0-RC87    VFX Forth 64 5.43          iforth-5.1-mini
    ST(0) FLD                FLD ST                     fld ST(0)
    44E8BC ( f2over ) CALL   CALL 0050A080 F2OVER       fld [r13 0 +] tbyte
    ST(0) ST(1) FSUBP        FSUBP ST(1), ST            fxch ST(1)
    44E8FB ( sigma ) CALL    CALL 0050A2BB SIGMA        fld [r13 #16 +] tby
    ST(0) ST(1) FMULP        FMULP ST(1), ST            lea r13, [r13 #32 +]
    FCHS                     FCHS                       fxch ST(3)
    -8 [RBP] RBP LEA         FSTP TBYTE FFF9CFE8 [RIP]  fxch ST(1)
    RBX 0 [RBP] MOV          CALL 0050A2FB RHO          fld ST(3)
    4C508 [RDI] RBX LEA      FLD ST(1)                  fld ST(3)
    0 [RBX] TBYTE FSTP       FSUBP ST(1), ST            fsubp ST(1), ST
    0 [RBP] RBX MOV          LEA RBP, [RBP+-08]         fld $101BC720 tbyte
    8 [RBP] RBP LEA          MOV [RBP], RBX             fmulp ST(1), ST
    44E923 ( rho ) CALL      MOV EBX, # 00000004        fchs
    ST(1) FLD                CALL 005030C0 FPICK        fstp $10226470 tbyte
    ST(0) ST(1) FSUBP        FMULP ST(1), ST            fld $101BC710 tbyte
    -8 [RBP] RBP LEA         LEA RBP, [RBP+-08]         fld ST(1)
    RBX 0 [RBP] MOV          MOV [RBP], RBX             fsubp ST(1), ST
    4 # EBX MOV              MOV EBX, # 00000003        fld ST(4)
    43C901 ( FPICK ) CALL    CALL 005030C0 FPICK        fmulp ST(1), ST
    ST(0) ST(1) FMULP        FSUBP ST(1), ST            fld ST(3)
    -8 [RBP] RBP LEA         FSTP TBYTE FFF9CFC1 [RIP]  fsubp ST(1), ST
    RBX 0 [RBP] MOV          FSTP ST                    fstp $10226490 tbyte
    3 # EBX MOV              CALL 0050A33B BETA         ffreep ST(0)
    43C901 ( FPICK ) CALL    FMULP ST(1), ST            fld $101BC700 tbyte
    ST(0) ST(1) FSUBP        FCHS                       fmulp ST(1), ST
    -8 [RBP] RBP LEA         FXCH ST(1)                 fchs
    RBX 0 [RBP] MOV          FXCH ST(2)                 fxch ST(1)
    4C530 [RDI] RBX LEA      FXCH ST(1)                 fxch ST(2)
    0 [RBX] TBYTE FSTP       FXCH ST(2)                 fxch ST(1)
    0 [RBP] RBX MOV          FMULP ST(1), ST            fxch ST(2)
    8 [RBP] RBP LEA          FADDP ST(1), ST            fmulp ST(1), ST
    ST(0) FSTP               FSTP TBYTE FFF9CFB4 [RIP]  fxch ST(1)
    44E94B ( beta ) CALL     RET/NEXT                   fpopswap,
    ST(0) ST(1) FMULP                                   faddp ST(1), ST
    FCHS                                                fstp $102264B0 tbyte
    43C807 ( FROT ) CALL                                ;
    43C807 ( FROT ) CALL
    ST(0) ST(1) FMULP
    ST(0) ST(1) FADDP
    -8 [RBP] RBP LEA
    RBX 0 [RBP] MOV
    4C558 [RDI] RBX LEA
    0 [RBX] TBYTE FSTP
    0 [RBP] RBX MOV
    8 [RBP] RBP LEA
    RET

    FPICK is apparently implemented on SwiftForth and VFX through an
    indirect branch that branches to one of 8 variants of "FLD ST(...)",
    while iForth manages to resolve this during compilation.
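
    How a constant FPICK index can be resolved at compile time is easy to
    sketch in portable Forth. The immediate word below (FPICK-CONST, a
    made-up name) selects FDUP or FOVER when the index is 0 or 1 and
    otherwise compiles the index followed by the system's own FPICK. This
    only illustrates the idea; it is not how iForth, SwiftForth or VFX
    actually do it, and FPICK itself is a common but non-standard word.
    PICK-SECOND and PICK-THIRD are likewise hypothetical test words.

    ```forth
    \ Illustration only; FPICK-CONST is a hypothetical name.
    \ Used inside a colon definition as   [ n ] fpick-const
    : fpick-const ( n -- )   \ runs at compile time; n is the constant index
       case
          0 of  postpone fdup   endof      \ 0 FPICK is FDUP
          1 of  postpone fover  endof      \ 1 FPICK is FOVER
          dup postpone literal  postpone fpick   \ general case: n FPICK
       endcase ; immediate

    : pick-second ( F: r1 r2 -- r1 r2 r1 )        [ 1 ] fpick-const ;
    : pick-third  ( F: r1 r2 r3 -- r1 r2 r3 r1 )  [ 2 ] fpick-const ;
    ```

    A native-code compiler can go one step further and emit the matching
    register load directly instead of a call.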

    I have also looked at VFX 5.11 which uses XMM registers instead of the
    FP stack, but it does not inline FP operations, so you mostly see a long
    sequence of calls.

    - anton
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 14 15:53:40 2024

    dxf <dxforth@gmail.com> writes:
    On 14/04/2024 9:25 pm, Anton Ertl wrote:
    Kahan writes about the original intention in

    http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

    especially starting at the last paragraph of page 2.

    And about the bug (or rather design mistake):

    https://history.siam.org/pdfs2/Kahan_final.pdf

    Start with the second-to-last paragraph on page 163. He digresses for
    a page, but continues on the fourth paragraph of page 165 and
    continues to the first paragraph of page 168.

    The latter sounds like someone not getting his way more than a design mistake.
    In the first reference Kahan states:

    "When the 8087 was designed, I knew that stack over/underflow was an issue of
    more aesthetic than practical importance. I still regret that the 8087's stack
    implementation was not quite so neat as my original intention described in the
    accompanying note."

    Intel decided Kahan's aesthetic afterthought could be dispensed with.

    In a way, they did, and Kahan obviously did not get his way. But to
    me it sounds like they tried and failed at implementing a stack that
    extends into memory. The tags that indicate the presence of a stack
    item are there. If they had made a conscious decision at the start to
    dispense with the idea of an extensible stack, they would have
    dispensed with these bits indicating the presence of a stack item as
    well. So what happened is that they botched the first attempt, and
    then decided that they did not want to do what would have been
    necessary to fix it.
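
    To make the role of the tags concrete, here is a minimal C model of an
    8-register stack with one "in use" flag per register. It is a sketch under
    stated assumptions, not a description of the 8087's internals: the names
    X87Model, x87_init, x87_push and x87_pop are invented for illustration,
    doubles stand in for the 80-bit values, and the masked-fault behaviour
    (set IE and SF, use C1 to tell overflow from underflow, deliver a NaN)
    follows the behaviour documented for the x87 status word.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Status-word bits as documented for the x87 FPU. */
#define SW_IE 0x0001u /* invalid operation */
#define SW_SF 0x0040u /* stack fault */
#define SW_C1 0x0200u /* with SF set: 1 = overflow, 0 = underflow */

/* Hypothetical model type; not taken from any Forth system. */
typedef struct {
    double   reg[8];  /* physical registers (80-bit on real hardware) */
    bool     used[8]; /* simplified tag: true = register holds a value */
    unsigned top;     /* physical index of ST(0) */
    uint16_t sw;      /* sticky status flags */
} X87Model;

static void x87_init(X87Model *f) {
    for (int i = 0; i < 8; i++) {
        f->reg[i] = 0.0;
        f->used[i] = false;
    }
    f->top = 0;
    f->sw = 0;
}

/* Push in the manner of FLD: TOP is decremented modulo 8. If the target
   register is already tagged as used, a stack overflow is flagged and,
   as with a masked exception, a NaN replaces the old value. */
static void x87_push(X87Model *f, double v) {
    f->top = (f->top + 7u) & 7u;
    if (f->used[f->top]) {
        f->sw |= SW_IE | SW_SF | SW_C1;
        v = NAN;
    }
    f->reg[f->top] = v;
    f->used[f->top] = true;
}

/* Pop in the manner of FSTP: an empty ST(0) flags a stack underflow
   (SF set, C1 cleared) and yields a NaN. */
static double x87_pop(X87Model *f) {
    double v = NAN;
    if (f->used[f->top]) {
        v = f->reg[f->top];
    } else {
        f->sw = (uint16_t)((f->sw | SW_IE | SW_SF) & ~SW_C1);
    }
    f->used[f->top] = false;
    f->top = (f->top + 1u) & 7u;
    return v;
}
```

    In this model the tags are exactly what lets the ninth push be noticed at
    all; spilling the overwritten register to memory at that point is the
    step an extensible stack would have needed on top of the detection.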

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
  • From Stephen Pelc@stephen@vfxforth.com to comp.lang.forth on Sun Apr 14 20:02:01 2024
    From Newsgroup: comp.lang.forth

    On 14 Apr 2024 at 01:47:20 CEST, "Krishna Myneni" <krishna.myneni@ccreweb.org> wrote:

    On 4/13/24 12:55, Anton Ertl wrote:
    I just looked at the floating-point implementations of recent
    SwiftForth and VFX (finally present in the system from the start), and
    on iForth-5.1-mini (for comparison):

    1 FLOATS .

    reports:

    16 iforth
    10 sf64
    10 vfx64

    For

    : foo f+ f* ;

    the resulting code is:

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    : foo f+ f* ; ok
    see foo
    44E8B9 ST(0) ST(1) FADDP DEC1
    44E8BB ST(0) ST(1) FMULP DEC9
    44E8BD RET C3 ok


    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2023

    : foo f+ f* ; ok
    see foo
    FOO
    ( 0050A250 DEC1 ) FADDP ST(1), ST
    ( 0050A252 DEC9 ) FMULP ST(1), ST
    ( 0050A254 C3 ) RET/NEXT
    ( 5 bytes, 3 instructions )


    iForth:
    $10226000 : foo 488BC04883ED088F4500 H.@H.m..E.
    $1022600A fld [r13 0 +] tbyte 41DB6D00 A[m.
    $1022600E fld [r13 #16 +] tbyte 41DB6D10 A[m.
    $10226012 fxch ST(2) D9CA YJ
    $10226014 lea r13, [r13 #32 +] qword 4D8D6D20 M.m
    $10226018 faddp ST(1), ST DEC1 ^A
    $1022601A fxch ST(1) D9C9 YI
    $1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m.
    $10226026 fmulp ST(1), ST DEC9 ^I
    $10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}.
    $10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok

    So apparently the 8 hardware FP stack items are enough for SwiftForth
    and VFX, while iForth prefers to use an FP stack in memory to allow
    for a deeper FP stack.

    ...

    For me, an 8 item hardware fp stack limit is too limiting to be useful.
    This is mostly because of my use of the fp stack for initializing tables
    (arrays and matrices), and my coding style of returning more than 8
    floats on the fp stack for some types of computation. No doubt one can
    limit themselves to an 8-item fp stack, but I'd hate to have to code with
    such a limit.

    The manual (gasp) documents how to change the default FP package.

    Changing the default pack also changes the system call interfaces to
    match.

    Stephen
    --
    Stephen Pelc, stephen@vfxforth.com
    MicroProcessor Engineering, Ltd. - More Real, Less Time
    133 Hill Lane, Southampton SO15 5AF, England
    tel: +44 (0)78 0390 3612, +34 649 662 974
    http://www.mpeforth.com - MPE website
    http://www.vfxforth.com/downloads/VfxCommunity/ - downloads
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Sun Apr 14 18:32:11 2024
    From Newsgroup: comp.lang.forth

    On 4/14/24 10:19, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    dx/dt = sigma*(y - x)
    dy/dt = x*(rho -z) - y
    dz/dt = x*y - beta*z

    where sigma, rho, and beta are constant parameters.

    Let's say we want to write a word DERIVS which computes and stores the
    derivatives, given the instantaneous values of x, y, z. This is the
    basis for any numerical code which solves the trajectory in time,
    starting from an initial condition.

    DERIVS ( F: x y z -- )

    Hence, we want to place some values x, y, and z onto the fp stack and
    compute the three derivatives. Ideally these three values remain on the
    fp stack and don't need to be fetched from memory constantly until the
    three derivatives are computed, especially if one is using the hardware
    fp stack. We allow the constant parameters to be fetched from memory and
    the results of the derivative computation to be stored to memory so they
    don't overflow the stack. This should be doable with the 8-element
    hardware fp stack.

    I have adapted your Forth code:

    [UNDEFINED] F2OVER [IF]
    : f2over ( F: r1 r2 r3 r4 -- r1 r2 r3 r4 r1 r2 )
    3 fpick 3 fpick ;
    [THEN]

    16.0e0 fconstant sigma
    45.92e0 fconstant rho
    4.0e0 fconstant beta

    fvariable dx/dt
    fvariable dy/dt
    fvariable dz/dt

    : derivs ( F: x y z -- )
    fdup f2over \ F: x y z z x y
    f- sigma f* fnegate
    dx/dt f! \ F: x y z z
    rho fover f- \ F: x y z z rho-z
    4 fpick f* \ F: x y z z x*(rho - z)
    3 fpick f-
    dy/dt f! \ F: x y z z
    fdrop
    beta f* fnegate
    frot frot f* f+ dz/dt f!
    ;

    0.1e 0.6e 4.0e derivs
    dx/dt f@ f. cr \ 8.
    dy/dt f@ f. cr \ 3.592
    dz/dt f@ f. cr \ -15.94

    In particular, I eliminated the additional memory accesses to DZ/DT.
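
    As a cross-check of the three values in the comments above, the same
    right-hand sides can be written directly in C. This is only a sketch: the
    struct Derivs and the function lorenz_derivs are names invented here, and
    doubles are used rather than the 80-bit format of the Forth systems, so
    the results agree with 8, 3.592 and -15.94 only up to rounding.

```c
/* Parameters as defined by the FCONSTANTs in the Forth code. */
static const double SIGMA = 16.0;
static const double RHO   = 45.92;
static const double BETA  = 4.0;

/* Hypothetical result type; the Forth code uses three FVARIABLEs. */
typedef struct {
    double dx, dy, dz;
} Derivs;

/* Right-hand sides of the Lorenz equations for one state (x, y, z). */
static Derivs lorenz_derivs(double x, double y, double z) {
    Derivs d;
    d.dx = SIGMA * (y - x);
    d.dy = x * (RHO - z) - y;
    d.dz = x * y - BETA * z;
    return d;
}
```

    Calling lorenz_derivs(0.1, 0.6, 4.0) corresponds to the test case
    0.1e 0.6e 4.0e derivs.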


    Nice. FROT FROT is expensive on a memory based FP stack, unless it is optimized by the compiler, but for fpu stack use it's probably very
    fast. I see that VFX Forth and iforth use a series of FXCH instructions
    to implement FROT FROT.

    SwiftForth, VFX and iforth produce the expected results for your test
    case. The code is:

    SwiftForth 4.0.0-RC87   VFX Forth 64 5.43          iforth-5.1-mini
    ST(0) FLD               FLD ST                     fld ST(0)
    44E8BC ( f2over ) CALL  CALL 0050A080 F2OVER       fld [r13 0 +] tbyte
    ST(0) ST(1) FSUBP       FSUBP ST(1), ST            fxch ST(1)
    44E8FB ( sigma ) CALL   CALL 0050A2BB SIGMA        fld [r13 #16 +] tby
    ST(0) ST(1) FMULP       FMULP ST(1), ST            lea r13, [r13 #32 +]
    FCHS                    FCHS                       fxch ST(3)
    -8 [RBP] RBP LEA        FSTP TBYTE FFF9CFE8 [RIP]  fxch ST(1)
    RBX 0 [RBP] MOV         CALL 0050A2FB RHO          fld ST(3)
    4C508 [RDI] RBX LEA     FLD ST(1)                  fld ST(3)
    0 [RBX] TBYTE FSTP      FSUBP ST(1), ST            fsubp ST(1), ST
    0 [RBP] RBX MOV         LEA RBP, [RBP+-08]         fld $101BC720 tbyte
    8 [RBP] RBP LEA         MOV [RBP], RBX             fmulp ST(1), ST
    44E923 ( rho ) CALL     MOV EBX, # 00000004        fchs
    ST(1) FLD               CALL 005030C0 FPICK        fstp $10226470 tbyte
    ST(0) ST(1) FSUBP       FMULP ST(1), ST            fld $101BC710 tbyte
    -8 [RBP] RBP LEA        LEA RBP, [RBP+-08]         fld ST(1)
    RBX 0 [RBP] MOV         MOV [RBP], RBX             fsubp ST(1), ST
    4 # EBX MOV             MOV EBX, # 00000003        fld ST(4)
    43C901 ( FPICK ) CALL   CALL 005030C0 FPICK        fmulp ST(1), ST
    ST(0) ST(1) FMULP       FSUBP ST(1), ST            fld ST(3)
    -8 [RBP] RBP LEA        FSTP TBYTE FFF9CFC1 [RIP]  fsubp ST(1), ST
    RBX 0 [RBP] MOV         FSTP ST                    fstp $10226490 tbyte
    3 # EBX MOV             CALL 0050A33B BETA         ffreep ST(0)
    43C901 ( FPICK ) CALL   FMULP ST(1), ST            fld $101BC700 tbyte
    ST(0) ST(1) FSUBP       FCHS                       fmulp ST(1), ST
    -8 [RBP] RBP LEA        FXCH ST(1)                 fchs
    RBX 0 [RBP] MOV         FXCH ST(2)                 fxch ST(1)
    4C530 [RDI] RBX LEA     FXCH ST(1)                 fxch ST(2)
    0 [RBX] TBYTE FSTP      FXCH ST(2)                 fxch ST(1)
    0 [RBP] RBX MOV         FMULP ST(1), ST            fxch ST(2)
    8 [RBP] RBP LEA         FADDP ST(1), ST            fmulp ST(1), ST
    ST(0) FSTP              FSTP TBYTE FFF9CFB4 [RIP]  fxch ST(1)
    44E94B ( beta ) CALL    RET/NEXT                   fpopswap,
    ST(0) ST(1) FMULP                                  faddp ST(1), ST
    FCHS                                               fstp $102264B0 tbyte
    43C807 ( FROT ) CALL                               ;
    43C807 ( FROT ) CALL
    ST(0) ST(1) FMULP
    ST(0) ST(1) FADDP
    -8 [RBP] RBP LEA
    RBX 0 [RBP] MOV
    4C558 [RDI] RBX LEA
    0 [RBX] TBYTE FSTP
    0 [RBP] RBX MOV
    8 [RBP] RBP LEA
    RET

    FPICK is apparently implemented on SwiftForth and VFX through an
    indirect branch that branches to one of 8 variants of "FLD ST(...)",
    while iForth manages to resolve this during compilation.


    Good to see that x, y, z are not repeatedly fetched from memory.

    For this example, the hardware fpu stack is sufficient. But, it's easy
    to see that the benefits of a hardware-only stack would diminish quickly
    as the size of the problem increased a small amount, and then the
    programmer (or compiler) would have to keep careful track of how many
    fpu registers are used.

    I have also looked at VFX 5.11 which uses XMM registers instead of the
    FP stack, but it does not inline FP operations, so you mostly see a long sequence of calls.


    --
    Krishna


  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Sun Apr 14 18:34:45 2024
    From Newsgroup: comp.lang.forth

    On 4/14/24 15:02, Stephen Pelc wrote:
    On 14 Apr 2024 at 01:47:20 CEST, "Krishna Myneni" <krishna.myneni@ccreweb.org>
    wrote:

    On 4/13/24 12:55, Anton Ertl wrote:
    I just looked at the floating-point implementations of recent
    SwiftForth and VFX (finally present in the system from the start), and
    on iForth-5.1-mini (for comparison):
    ...

    So apparently the 8 hardware FP stack items are enough for SwiftForth
    and VFX, while iForth prefers to use an FP stack in memory to allow
    for a deeper FP stack.

    ...

    For me, an 8 item hardware fp stack limit is too limiting to be useful.
    This is mostly because of my use of the fp stack for initializing tables
    (arrays and matrices), and my coding style of returning more than 8
    floats on the fp stack for some types of computation. No doubt one can
    limit themselves to an 8-item fp stack, but I'd hate to have to code with
    such a limit.

    The manual (gasp) documents how to change the default FP package.


    Good to know.

    --
    Krishna


  • From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 15 11:37:39 2024
    From Newsgroup: comp.lang.forth

    On 15/04/2024 6:02 am, Stephen Pelc wrote:
    On 14 Apr 2024 at 01:47:20 CEST, "Krishna Myneni" <krishna.myneni@ccreweb.org>
    wrote:

    On 4/13/24 12:55, Anton Ertl wrote:
    I just looked at the floating-point implementations of recent
    SwiftForth and VFX (finally present in the system from the start), and
    on iForth-5.1-mini (for comparison):

    1 FLOATS .

    reports:

    16 iforth
    10 sf64
    10 vfx64

    For

    : foo f+ f* ;

    the resulting code is:

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    : foo f+ f* ; ok
    see foo
    44E8B9 ST(0) ST(1) FADDP DEC1
    44E8BB ST(0) ST(1) FMULP DEC9
    44E8BD RET C3 ok


    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2023

    : foo f+ f* ; ok
    see foo
    FOO
    ( 0050A250 DEC1 ) FADDP ST(1), ST
    ( 0050A252 DEC9 ) FMULP ST(1), ST
    ( 0050A254 C3 ) RET/NEXT
    ( 5 bytes, 3 instructions )


    iForth:
    $10226000 : foo 488BC04883ED088F4500 H.@H.m..E.
    $1022600A fld [r13 0 +] tbyte 41DB6D00 A[m.
    $1022600E fld [r13 #16 +] tbyte 41DB6D10 A[m.
    $10226012 fxch ST(2) D9CA YJ
    $10226014 lea r13, [r13 #32 +] qword 4D8D6D20 M.m
    $10226018 faddp ST(1), ST DEC1 ^A
    $1022601A fxch ST(1) D9C9 YI
    $1022601C fpopswap, 41DB6D00D9CA4D8D6D10 A[m.YJM.m.
    $10226026 fmulp ST(1), ST DEC9 ^I
    $10226028 fpush, 4D8D6DF0D9C941DB7D00 M.mpYIA[}.
    $10226032 ; 488B45004883C508FFE0 H.E.H.E..` ok

    So apparently the 8 hardware FP stack items are enough for SwiftForth
    and VFX, while iForth prefers to use an FP stack in memory to allow
    for a deeper FP stack.

    ...

    For me, an 8 item hardware fp stack limit is too limiting to be useful.
    This is mostly because of my use of the fp stack for initializing tables
    (arrays and matrices), and my coding style of returning more than 8
    floats on the fp stack for some types of computation. No doubt one can
    limit themselves to an 8-item fp stack, but I'd hate to have to code with
    such a limit.

    The manual (gasp) documents how to change the default FP package.

    Changing the default pack also changes the system call interfaces to
    match.

    Specifically chapter 14 in the PDF doc.

    integers
    remove-FP-pack
    include Lib/x64/Hfpx64

    swaps in the 80-bit external stack model.

    The HTML doc appears to lack this information (or it is hard to find),
    should a user select that one by mistake.




  • From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 15 12:07:47 2024
    From Newsgroup: comp.lang.forth

    On 15/04/2024 1:53 am, Anton Ertl wrote:
    dxf <dxforth@gmail.com> writes:
    On 14/04/2024 9:25 pm, Anton Ertl wrote:
    Kahan writes about the original intention in

    http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

    especially starting at the last paragraph of page 2.

    And about the bug (or rather design mistake):

    https://history.siam.org/pdfs2/Kahan_final.pdf

    Start with the second-to-last paragraph on page 163. He digresses for
    a page, but continues on the fourth paragraph of page 165 and
    continues to the first paragraph of page 168.

    The latter sounds like someone not getting his way more than a design mistake.
    In the first reference Kahan states:

    "When the 8087 was designed, I knew that stack over/underflow was an issue of
    more aesthetic than practical importance. I still regret that the 8087's stack
    implementation was not quite so neat as my original intention described in the
    accompanying note."

    Intel decided Kahan's aesthetic afterthought could be dispensed with.

    In a way, they did, and Kahan obviously did not get his way. But to
    me it sounds like they tried and failed at implementing a stack that
    extends into memory. The tags that indicate the presence of a stack
    item are there. If they had made a conscious decision at the start to
    dispense with the idea of an extensible stack, they would have
    dispensed with these bits indicating the presence of a stack item as
    well. So what happened is that they botched the first attempt, and
    then decided that they did not want to do what would have been
    necessary to fix it.

    My impression is Palmer (the mathematician Intel hired to co-head the
    project) was trying to placate Kahan and it fell through for various
    reasons.

    The design criterion that never changed was the 8-level hardware stack.
    Forthers can either accept it for best performance - or pick something
    more forgiving at a lesser performance.

  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Mon Apr 15 03:47:33 2024
    From Newsgroup: comp.lang.forth

    On 4/14/24 21:07, dxf wrote:
    ...
    The design criterion that never changed was the 8-level hardware stack.
    Forthers can either accept it for best performance - or pick something
    more forgiving at a lesser performance.


    In the Lorenz equation example, which works with the 8 deep fpu stack,
    we have assumed that the fpu hardware stack was empty before calling
    DERIVS. In a real use case, the call to DERIVS is likely to occur within
    a deeper call chain, resulting in items already on the fpu stack before
    args for DERIVS are pushed. As Marcel said, using only a hardware-based
    fp stack is not realistic for any non-trivial floating point work.
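
    The headroom argument can be made concrete by counting. The sketch below
    replays the net stack effect of each word in the DERIVS definition quoted
    earlier in the thread and records the peak depth. It assumes, as a
    simplification, that no word needs scratch registers beyond its own stack
    effect; Effect, DERIVS_BODY, peak_depth and derivs_peak are illustrative
    names, not part of any Forth system.

```c
#include <stddef.h>

/* Net change in FP stack depth caused by one word (hypothetical type). */
typedef struct {
    const char *word;
    int net;
} Effect;

/* The body of DERIVS, word by word, with each word's net stack effect. */
static const Effect DERIVS_BODY[] = {
    {"fdup", +1}, {"f2over", +2}, {"f-", -1}, {"sigma", +1}, {"f*", -1},
    {"fnegate", 0}, {"dx/dt f!", -1},
    {"rho", +1}, {"fover", +1}, {"f-", -1}, {"4 fpick", +1}, {"f*", -1},
    {"3 fpick", +1}, {"f-", -1}, {"dy/dt f!", -1},
    {"fdrop", -1}, {"beta", +1}, {"f*", -1}, {"fnegate", 0},
    {"frot", 0}, {"frot", 0}, {"f*", -1}, {"f+", -1}, {"dz/dt f!", -1},
};

/* Highest depth reached while executing body, starting at entry_depth. */
static int peak_depth(const Effect *body, size_t n, int entry_depth) {
    int depth = entry_depth;
    int peak = entry_depth;
    for (size_t i = 0; i < n; i++) {
        depth += body[i].net;
        if (depth > peak) {
            peak = depth;
        }
    }
    return peak;
}

/* Peak depth of DERIVS when already_on_stack items lie below x, y, z. */
static int derivs_peak(int already_on_stack) {
    size_t n = sizeof DERIVS_BODY / sizeof DERIVS_BODY[0];
    return peak_depth(DERIVS_BODY, n, already_on_stack + 3);
}
```

    With an empty stack on entry the peak is 6, so under these assumptions at
    most two items may already be on the hardware stack when DERIVS is
    called; a third would push the peak to 9.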

    The loss of performance with a memory-based fp stack is far less a
    concern than having to consider the limited stack depth when writing
    code involving floating point arithmetic. Failure from overflowing the
    fpu stack is silent. Debugging is likely to be a nightmare.

    --
    Krishna






  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Mon Apr 15 09:35:22 2024
    From Newsgroup: comp.lang.forth

    In most cases 'bigger' fp data will be stored in memory anyhow,
    which can be cached before disk access. The old 8087 improvements
    were caused by its new fp operators, the stack was unusable.

    And if CPU based stacks were so lucrative for high performance,
    CPU makers would have implemented them since long for normal
    integer data.
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Mon Apr 15 12:37:11 2024
    From Newsgroup: comp.lang.forth

    In article <3e419396b1ee93c7a391a7ffc0e44ed8@www.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    In most cases 'bigger' fp data will be stored in memory anyhow,
    which can be cached before disk access. The old 8087 improvements
    were caused by its new fp operators, the stack was unusable.

    And if CPU based stacks were so lucrative for high performance,
    CPU makers would have implemented them since long for normal
    integer data.

    The iA64 comes to mind. Apparently a failure, but was it a
    technical or a commercial failure?

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Mon Apr 15 11:37:00 2024
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl wrote:

    In article <3e419396b1ee93c7a391a7ffc0e44ed8@www.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    In most cases 'bigger' fp data will be stored in memory anyhow,
    which can be cached before disk access. The old 8087 improvements
    were caused by its new fp operators, the stack was unusable.

    And if CPU based stacks were so lucrative for high performance,
    CPU makers would have implemented them since long for normal
    integer data.

    The iA64 comes to mind. Apparently a failure, but was it a
    technical or a commercial failure?

    Both, i.e. a poor developer tool stack and strong AMD competition.
    And it was overly complex:
    https://softwareengineering.stackexchange.com/questions/279334/why-was-the-itanium-processor-difficult-to-write-a-compiler-for
  • From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 15 22:14:16 2024
    From Newsgroup: comp.lang.forth

    On 15/04/2024 6:47 pm, Krishna Myneni wrote:
    On 4/14/24 21:07, dxf wrote:
    ...
    The design criterion that never changed was the 8-level hardware stack.
    Forthers can either accept it for best performance - or pick something
    more forgiving at a lesser performance.


    In the Lorenz equation example, which works with the 8 deep fpu stack,
    we have assumed that the fpu hardware stack was empty before calling
    DERIVS. In a real use case, the call to DERIVS is likely to occur within
    a deeper call chain, resulting in items already on the fpu stack before
    args for DERIVS are pushed. As Marcel said, using only a hardware-based
    fp stack is not realistic for any non-trivial floating point work.

    The loss of performance with a memory-based fp stack is far less a
    concern than having to consider the limited stack depth when writing
    code involving floating point arithmetic. Failure from overflowing the
    fpu stack is silent. Debugging is likely to be a nightmare.

    Likely the designers never considered Forth. OTOH ANS-Forth did and said
    6 items ought to be good enough for anyone :)

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Apr 15 14:09:28 2024
    From Newsgroup: comp.lang.forth

    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    Failure from overflowing the
    fpu stack is silent.

    Reality check:

    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2023

    1e 2e 3e 4e 5e 6e 7e ok F:-7
    8e ok
    NDP Stack Fault: NDP SW = 0041
    NDP Potential Exception: NDP SW = 0041

    SwiftForth also seems to notice it in some way, but does not report it
    as an error:

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    1e 2e 3e 4e 5e 6e ok
    f. 6.00000000 ok

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    1e 2e 3e 4e 5e 6e 7e ok
    f.
    ok
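
    The value 0041 in the VFX report can be taken apart bit by bit, assuming
    that "NDP SW" is the raw x87 status word (the transcript does not say so
    explicitly). The following sketch decodes only the fields relevant to
    stack faults; X87Status and decode_sw are names made up for this
    illustration, while the bit positions are those documented for the x87
    status word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected fields of the x87 status word (hypothetical helper type). */
typedef struct {
    bool     invalid;     /* bit 0, IE: invalid operation */
    bool     stack_fault; /* bit 6, SF: stack fault */
    bool     c1;          /* bit 9, C1: with SF, 1 = overflow, 0 = underflow */
    unsigned top;         /* bits 11-13, TOP: register currently ST(0) */
} X87Status;

static X87Status decode_sw(uint16_t sw) {
    X87Status s;
    s.invalid     = (sw & 0x0001u) != 0;
    s.stack_fault = (sw & 0x0040u) != 0;
    s.c1          = (sw & 0x0200u) != 0;
    s.top         = (sw >> 11) & 7u;
    return s;
}
```

    For 0x0041 this gives IE and SF set, with C1 clear and TOP equal to 0.
    The transcript does not show at which point VFX samples the word, so this
    sketch draws no conclusion from the C1 bit.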

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Apr 15 14:13:27 2024
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    In article <3e419396b1ee93c7a391a7ffc0e44ed8@www.novabbs.com>,
    minforth <minforth@gmx.net> wrote:
    And if CPU based stacks were so lucrative for high performance,
    CPU makers would have implemented them since long for normal
    integer data.

    They have: SPARC has a stack of integer register windows, AMD29K and
    IA-64 have a register stack of 128 integer registers. However, the
    29K died in the early 1990s, IA-64 always remained niche and has been
    killed (last order date in January 2020), and SPARC delivered OoO too
    late to save it, and the last new SPARC designs were introduced in
    2017.

    The iA64 comes to mind. Apparently a failure, but was it a
    technical or a commercial failure?

    It was a great commercial success. Several companies killed their
    RISCs in favour of IA-64 based just on the IA-64 roadmaps, and when
    IA-64 failed to achieve the predicted technical and market
    superiority, they switched to Intel's AMD64 CPUs: HP, DEC (bought by
    Compaq bought by HP), SGI. Apple switched to Intel's AMD64 CPUs
    without the IA-64 step, so maybe it would have happened for the others
    as well without IA-64, but maybe one of the others would have done
    what ARM and Apple did 15-20 years later.

    Technically, IA-64 was a bet on in-order execution with compiler
    scheduling being superior to hardware scheduling (out-of-order
    execution). That was a bet on the wrong horse, as the superior
    hardware branch prediction allows OoO hardware to outperform IA-64 on
    branchy code, while SIMD and GPGPUs eat IA-64's lunch on data-parallel
    code.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Mon Apr 15 17:02:40 2024
    From Newsgroup: comp.lang.forth

    On 4/15/24 09:09, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    Failure from overflowing the
    fpu stack is silent.

    Reality check:

    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2023

    1e 2e 3e 4e 5e 6e 7e ok F:-7
    8e ok
    NDP Stack Fault: NDP SW = 0041
    NDP Potential Exception: NDP SW = 0041

    SwiftForth also seems to notice it in some way, but does not report it
    as an error:

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    1e 2e 3e 4e 5e 6e ok
    f. 6.00000000 ok

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    1e 2e 3e 4e 5e 6e 7e ok
    f.
    ok

    I tried overflowing the fpu stack in kforth32, and no exception is
    raised. Perhaps one needs to configure the fpu to raise an exception.
    Also tried it in C with an assembly procedure. The executable throws no
    exception.

    --
    Krishna

    == begin fpu-stack-overflow.4th ==
    fpu-stack-overflow.4th
    \ for use with kforth32

    include ans-words
    include strings
    include modules
    include syscalls
    include mc
    include asm-x86

    code fpu-stack-overflow
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    end-code

    fpu-stack-overflow


    == end fpu-stack-overflow.4th ==

    == begin example ==
    $ kforth32
    kForth-32 v 2.4.5 (Build: 2024-03-30)
    Copyright (c) 1998--2023 Krishna Myneni
    Contributions by: dpw gd mu bk abs tn cmb bg dnw
    Provided under the GNU Affero General Public License, v3.0 or later

    include fpu-stack-overflow
    ok
    == end example ==


  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Tue Apr 16 01:10:42 2024
    From Newsgroup: comp.lang.forth

    I would be surprised to get a SIGFPE interrupt from x87 stack ops.
    X87 mode has long been deprecated and replaced by SSE2.
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Mon Apr 15 20:41:43 2024
    From Newsgroup: comp.lang.forth

    On 4/15/24 20:10, minforth wrote:
    I would be surprised to get a SIGFPE interrupt from x87 stack ops.
    X87 mode has long been deprecated and replaced by SSE2.

    The FPU has maskable interrupts for arithmetic -- see

    https://github.com/mynenik/kForth-32/blob/master/forth-src/fpu-x86.4th

    But, yes, I'm not aware of any interrupts from stack ops.

    --
    Krishna



  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Apr 16 05:53:15 2024
    From Newsgroup: comp.lang.forth

    minforth@gmx.net (minforth) writes:
    X87 mode has long been deprecated

    Citation needed.

    and replaced by SSE2.

    I just tried compiling the following program with gcc with different
    options:

    float sfplus(float a, float b)
    {
    return a+b;
    }

    double dfplus(double a, double b)
    {
    return a+b;
    }

    long double lfplus(long double a, long double b)
    {
    return a+b;
    }

    what I got was:

    sfplus() dfplus() lfplus()
    387 387 387 gcc -m32 -O
    SSE2 SSE2 387 gcc -m64 -O
    387 387 387 gcc -m64 -mfpmath=387 -O

    The System V calling convention for AMD64 passes and returns float and
    double in xmm registers, so the last option leads to moving the values
    between the xmm register and the 387 FP stack through memory, e.g.:

    000000000000001f <dfplus>:
    1f: f2 0f 11 44 24 f0 movsd %xmm0,-0x10(%rsp)
    25: f2 0f 11 4c 24 f8 movsd %xmm1,-0x8(%rsp)
    2b: dd 44 24 f0 fldl -0x10(%rsp)
    2f: dc 44 24 f8 faddl -0x8(%rsp)
    33: dd 5c 24 f0 fstpl -0x10(%rsp)
    37: f2 0f 10 44 24 f0 movsd -0x10(%rsp),%xmm0
    3d: c3 retq

    By contrast, the middle option produces:

    0000000000000005 <dfplus>:
    5: f2 0f 58 c1 addsd %xmm1,%xmm0
    9: c3 retq

    and the first option:

    00000009 <dfplus>:
    9: dd 44 24 04 fldl 0x4(%esp)
    d: dc 44 24 0c faddl 0xc(%esp)
    11: c3 ret

    The difference between the first and last option is due to the
    differences in calling convention.

    gcc also has the option -mfpmath=sse,387 which tells the compiler that
    it can use both. A small experiment only resulted in using SSE2, but
    maybe if I had used code with more values alive at the same time it
    would have used both.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Tue Apr 16 06:58:37 2024
    From Newsgroup: comp.lang.forth

    Sure, x87 mode is still there for backwards compatibility and you can
    instruct compilers to use it eg for 80-bit floats. Godbolt is your friend.
    I seem to remember the "deprecation" had to do with inefficient load/store
    mechanisms between memory and x87 registers.

    I can't remember or name an exact citation but found a somewhat related
    discussion:
    https://retrocomputing.stackexchange.com/questions/9751/did-any-compiler-fully-use-intel-x87-80-bit-floating-point
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Tue Apr 16 07:21:16 2024
    From Newsgroup: comp.lang.forth

    Anton Ertl wrote:
    [..]
    gcc also has the option -mfpmath=sse,387 which tells the compiler that
    it can use both.

    Nice to know! Last time I checked, only the Intel compiler could do that.

    -marcel
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Apr 16 09:59:31 2024
    From Newsgroup: comp.lang.forth

    In article <uvk861$gs90$1@dont-email.me>,
    Krishna Myneni <krishna.myneni@ccreweb.org> wrote:
    On 4/15/24 09:09, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    Failure from overflowing the
    fpu stack is silent.

    Reality check:

    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2023

    1e 2e 3e 4e 5e 6e 7e ok F:-7
    8e ok
    NDP Stack Fault: NDP SW = 0041
    NDP Potential Exception: NDP SW = 0041

    SwiftForth also seems to notice it in some way, but does not report it
    as an error:

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    1e 2e 3e 4e 5e 6e ok
    f. 6.00000000 ok

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    1e 2e 3e 4e 5e 6e 7e ok
    f.
    ok

    I tried overflowing the fpu stack in kforth32, and no exception is
    raised. Perhaps one needs to configure the fpu to raise an exception.
    Also tried it in C with an assembly procedure. The executable throws no
    exception.

    --
    Krishna

    == begin fpu-stack-overflow.4th ==
    fpu-stack-overflow.4th
    \ for use with kforth32

    include ans-words
    include strings
    include modules
    include syscalls
    include mc
    include asm-x86

    code fpu-stack-overflow
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    end-code

    fpu-stack-overflow


    == end fpu-stack-overflow.4th ==

    == begin example ==
    $ kforth32
    kForth-32 v 2.4.5 (Build: 2024-03-30)
    Copyright (c) 1998--2023 Krishna Myneni
    Contributions by: dpw gd mu bk abs tn cmb bg dnw
    Provided under the GNU Affero General Public License, v3.0 or later

    include fpu-stack-overflow
    ok
    == end example ==


    I tried this on ciforth. It crashes with the 10th item
    not the 8th.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Tue Apr 16 07:14:46 2024
    From Newsgroup: comp.lang.forth

    On 4/16/24 02:59, albert@spenarnc.xs4all.nl wrote:
    In article <uvk861$gs90$1@dont-email.me>,
    Krishna Myneni <krishna.myneni@ccreweb.org> wrote:
    On 4/15/24 09:09, Anton Ertl wrote:
    Krishna Myneni <krishna.myneni@ccreweb.org> writes:
    Failure from overflowing the
    fpu stack is silent.

    Reality check:

    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2023

    1e 2e 3e 4e 5e 6e 7e ok F:-7
    8e ok
    NDP Stack Fault: NDP SW = 0041
    NDP Potential Exception: NDP SW = 0041

    SwiftForth also seems to notice it in some way, but does not report it
    as an error:

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    1e 2e 3e 4e 5e 6e ok
    f. 6.00000000 ok

    SwiftForth x64-Linux 4.0.0-RC87 24-Mar-2024
    1e 2e 3e 4e 5e 6e 7e ok
    f.
    ok

    I tried overflowing the fpu stack in kforth32, and no exception is
    raised. Perhaps one needs to configure the fpu to raise an exception.
    Also tried it in C with an assembly procedure. The executable throws no
    exception.

    --
    Krishna

    == begin fpu-stack-overflow.4th ==
    fpu-stack-overflow.4th
    \ for use with kforth32

    include ans-words
    include strings
    include modules
    include syscalls
    include mc
    include asm-x86

    code fpu-stack-overflow
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    fld1,
    end-code

    fpu-stack-overflow


    == end fpu-stack-overflow.4th ==

    == begin example ==
    $ kforth32
    kForth-32 v 2.4.5 (Build: 2024-03-30)
    Copyright (c) 1998--2023 Krishna Myneni
    Contributions by: dpw gd mu bk abs tn cmb bg dnw
    Provided under the GNU Affero General Public License, v3.0 or later

    include fpu-stack-overflow
    ok
    == end example ==


    I tried this on ciforth. It crashes with the 10th item
    not the 8th.


    That may be for some other reason. The following code executes an
    arbitrary number of FLD1 instructions:

    code fpu-stack-overflow ( n -- -n )
    0 [ebx] ecx mov, \ set loop count
    0 # eax mov,
    DO,
    fld1,
    eax dec,
    LOOP,
    eax 0 [ebx] mov,
    0 # eax mov,
    end-code

    16384 fpu-stack-overflow .
    \ end of prog

    \ run it

    include clf-code/fpu-stack-overflow

    -16384
    ok
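
    A toy Python model of why such a loop can finish without any visible error. It assumes the invalid-operation exception is masked (the state FINIT leaves the FPU in), in which case a load into a non-empty register only sets status flags and overwrites the register with a NaN. The class and function names are hypothetical, and the model does not try to reproduce the status value printed by any particular Forth system.

```python
import math

IE, SF, C1 = 0x0001, 0x0040, 0x0200   # status-word bits used by the model

class X87StackModel:
    """Toy model of the 8-register x87 stack with masked invalid-operation exceptions."""

    def __init__(self):
        self.regs = [None] * 8   # None stands for a register tagged "empty"
        self.top = 0
        self.status = 0

    def fld(self, value):
        # A load first decrements TOP (modulo 8), then writes the new top register.
        self.top = (self.top - 1) % 8
        if self.regs[self.top] is not None:
            # Stack overflow: only flags are set; the register receives a NaN.
            self.status |= IE | SF | C1
            self.regs[self.top] = float("nan")
        else:
            self.regs[self.top] = value

def run_fld1(count):
    """Execute `count` loads of 1.0 on a fresh model and return the model."""
    fpu = X87StackModel()
    for _ in range(count):
        fpu.fld(1.0)
    return fpu

fpu = run_fld1(16384)
print(hex(fpu.status), math.isnan(fpu.regs[fpu.top]))
```

    In this model the only traces of the overflow are the sticky status bits and the NaNs left in the registers; nothing interrupts the loop.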


    --
    Krishna
    Groetjes Albert

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Wed Apr 17 06:24:15 2024
    From Newsgroup: comp.lang.forth

    On 4/15/24 20:10, minforth wrote:
    I would be surprised to get a SIGFPE interrupt from x87 stack ops.
    X87 mode has long been deprecated and replaced by SSE2.

    Maybe I'm missing something, but SSE2 does not seem to have anything
    beyond basic floating point arithmetic e.g. transcendental functions.

    --
    Krishna

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 17 13:07:22 2024
    From Newsgroup: comp.lang.forth

    Krishna Myneni wrote:

    On 4/15/24 20:10, minforth wrote:
    I would be surprised to get a SIGFPE interrupt from x87 stack ops.
    X87 mode has long been deprecated and replaced by SSE2.

    Maybe I'm missing something, but SSE2 does not seem to have anything
    beyond basic floating point arithmetic e.g. transcendental functions.

    Hardware x87 support isn't necessarily faster: https://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf

    But I think the main advantage lies in the possibility of parallel and/or vectorized execution.

    And of course Forth is very close to assembler and as such it is natural
    to use x87 instructions, unless a Forth system is implemented using libc
    or using math libraries that better exploit those many features of modern CPUs.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From peter.m.falth@peter.m.falth@gmail.com (PMF) to comp.lang.forth on Wed Apr 17 13:08:01 2024
    From Newsgroup: comp.lang.forth

    Krishna Myneni wrote:

    On 4/15/24 20:10, minforth wrote:
    I would be surprised to get a SIGFPE interrupt from x87 stack ops.
    X87 mode has long been deprecated and replaced by SSE2.

    Maybe I'm missing something, but SSE2 does not seem to have anything
    beyond basic floating point arithmetic e.g. transcendental functions.

    Yes that is right. There is a sqrt but nothing more.

    In lxf64/ntf64 I use an external C library, specifically fdlibm53.

    This claims to be within 1 ulp correct. My testing also suggests this.
    fdlibm looks to be the base for most other math libraries.
    I could have used libm from gcc but I wanted the same code for both
    Linux and Windows.

    In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
    8 stack items is too low to be useful for anything more complicated.
    complex numbers is an example that quickly eats all 8 stack items.

    Best Regards
    Peter Fälth

    --
    Krishna
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Apr 17 17:37:45 2024
    From Newsgroup: comp.lang.forth

    minforth wrote:

    Krishna Myneni wrote:
    [..]
    Hardware x87 support isn't necessarily faster: https://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf

    .. which shows it also can be a lot slower. And if I'm not mistaken, it can't deliver the 80 bits of the FPU without breaking down by a factor of at least 1.3.

    But I think the main advantage lies in the possibility of parallel and/or vectorized execution.

    I have not yet seen algorithms where that would bring something. It might when all other code is done properly with SSE. An on-the-fly reconfigured FPGA (available on some microcontrollers) might be a better idea :--)

    And of course Forth is very close to assembler and as such it is natural
    to use x87 instructions, unless a Forth system is implemented using libc
    or using math libraries that better exploit those many features of modern CPUs.

    It all depends, that is what I like about Forth.

    -marcel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Apr 17 19:16:47 2024
    From Newsgroup: comp.lang.forth

    mhx wrote:
    minforth wrote:
    And of course Forth is very close to assembler and as such it is natural
    to use x87 instructions, unless a Forth system is implemented using libc
    or using math libraries that better exploit those many features of modern CPUs.
    It all depends, that is what I like about Forth.

    True, it all depends. Copied from a Visual Studio documentation:

    Many of the floating point math library functions have different implementations
    for different CPU architectures. For example, the 32-bit x86 CRT may have a different implementation than the 64-bit x64 CRT. In addition, some of the functions may have multiple implementations for a given CPU architecture. The most efficient implementation is selected dynamically at run-time depending on the instruction sets supported by the CPU. For example, in the 32-bit x86 CRT, some functions have both an x87 implementation and an SSE2 implementation. When running on a CPU that supports SSE2, the faster SSE2 implementation is used. When running on a CPU that does not support SSE2, the slower x87 implementation is used.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Thu Apr 18 07:23:48 2024
    From Newsgroup: comp.lang.forth

    minforth wrote:
    [..]
    [..] For example, in the 32-bit x86 CRT,
    some functions have both an x87 implementation and an SSE2 implementation. When
    running on a CPU that supports SSE2, the faster SSE2 implementation is used. When running on a CPU that does not support SSE2, the **slower**
    [ my emphasis -mhx ] x87 implementation is used.

    This strikes me as showing a strong bias (i.e. not strictly based on technical arguments) towards SSE2. I've noticed before that Microsoft has a dislike for the x87 FPU, if not boycotting it outright (e.g. no long double in their compiler).

    -marcel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Apr 18 17:53:45 2024
    From Newsgroup: comp.lang.forth

    On 17/04/2024 11:07 pm, minforth wrote:
    ...
    And of course Forth is very close to assembler and as such it is natural
    to use x87 instructions

    In addition to being available on most Intel x86 cpu's, x87 was cheap to support. A relatively small amount of code was needed to implement the assembler extensions:

    https://pastebin.com/Md6BGWmj

    There wasn't a good reason not to choose x87. If the hardware stack didn't appeal, use a software stack. Performance will still be very good. I recall pitting my 16-bit DTC forth with x87 against VFX using an fp intensive program/benchmark. The difference was only a factor of 4. I couldn't believe it.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Thu Apr 18 08:08:07 2024
    From Newsgroup: comp.lang.forth

    mhx wrote:

    minforth wrote:
    [..]
    [..] For example, in the 32-bit x86 CRT,
    some functions have both an x87 implementation and an SSE2 implementation. When
    running on a CPU that supports SSE2, the faster SSE2 implementation is used. When running on a CPU that does not support SSE2, the **slower**
    [ my emphasis -mhx ] x87 implementation is used.

    This strikes me as showing a strong bias (i.e. not strictly based on technical
    arguments) towards SSE2. I've noticed before that Microsoft has a dislike for the x87 FPU, if not boycotting it outright (e.g. no long double in their compiler).

    Could be. At least gcc has support for __float80. FWIW Intel's icc even has a compiler flag for x87 stack overflow warnings. It all depends. ;-)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Apr 18 11:00:45 2024
    From Newsgroup: comp.lang.forth

    In article <6620d188$1@news.ausics.net>, dxf <dxforth@gmail.com> wrote:
    On 17/04/2024 11:07 pm, minforth wrote:
    ...
    And of course Forth is very close to assembler and as such it is natural
    to use x87 instructions

    In addition to being available on most Intel x86 cpu's, x87 was cheap to support. A relatively small amount of code was needed to implement the assembler extensions:

    https://pastebin.com/Md6BGWmj

    There wasn't a good reason not to choose x87. If the hardware stack didn't appeal, use a software stack. Performance will still be very good. I recall pitting my 16-bit DTC forth with x87 against VFX using an fp intensive program/benchmark. The difference was only a factor of 4. I couldn't believe it.


    The loadable extension for ciforth runs 20 screens.
    This is not politically correct. Only 80 bits floats, only 8 stack depth.
    (The assembler is required, which contains 2 screens of fp related instructions.)
    The transcendental functions are very low cost,
    Nice if you have an occasional cosine.
    In implementing the floating point for the transputer, we had
    to generate Chebyshev polynomials with a special UBASIC
    program and a truckload of testing. At least they
    accommodated ISO single and double precision because
    that was the transputer baseline.

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Apr 18 11:04:46 2024
    From Newsgroup: comp.lang.forth

    In article <37580f936ac4a2d21fb84ccf59e6b7a4@www.novabbs.com>,
    mhx <mhx@iae.nl> wrote:
    minforth wrote:
    [..]
    [..] For example, in the 32-bit x86 CRT,
    some functions have both an x87 implementation and an SSE2 implementation. When
    running on a CPU that supports SSE2, the faster SSE2 implementation is used. When running on a CPU that does not support SSE2, the **slower**
    [ my emphasis -mhx ] x87 implementation is used.

    This strikes me as showing a strong bias (i.e. not strictly based on technical arguments) towards SSE2. I've noticed before that Microsoft has a dislike for the x87 FPU, if not boycotting it outright (e.g. no long double in their compiler).

    Microsoft are no hobbyists. There has to be a commercial incentive to
    take it up, that is not a boycott. OTOH the gcc folks like a
    challenge, resulting in a baroque building.


    -marcel
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Thu Apr 18 09:09:19 2024
    From Newsgroup: comp.lang.forth

    minforth wrote:

    mhx wrote:
    [..]
    Could be. At least gcc has support for __float80. FWIW Intel's icc even has a compiler flag for x87 stack overflow warnings.
    [..]

    At runtime?

    -marcel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Apr 18 11:11:17 2024
    From Newsgroup: comp.lang.forth

    In article <2024Apr14.132507@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    dxf <dxforth@gmail.com> writes:
    On 14/04/2024 6:34 pm, Anton Ertl wrote:
    From what I read about this, the intention was that the FP stack would
    extend into memory (and thus not be limited to 8 elements): software
    should react to FP stack overflows and underflows and store some
    elements on overflow, and reload some elements on underflow. However,
    this functionality was implemented in a buggy way on the 8087, so it
    never worked as intended. However, when they noticed this, the 8087
    was already on the market, and Hyrum's law ensured that this behaviour
    could not be changed.
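
    The intended scheme described above (spill some elements to memory on overflow, reload them on underflow) can be sketched in Python as follows. This is a software illustration only; the class name and the one-item-at-a-time spill policy are assumptions made for the example, not a description of what the 8087 or any Forth system does.

```python
class SpillingFPStack:
    """FP stack with at most 8 items in 'registers'; deeper items live in memory."""

    REGISTERS = 8

    def __init__(self):
        self.regs = []     # register-resident items, top of stack at the end
        self.memory = []   # spilled items, deepest first

    def push(self, value):
        if len(self.regs) == self.REGISTERS:
            # Overflow handler: move the deepest register item out to memory.
            self.memory.append(self.regs.pop(0))
        self.regs.append(value)

    def pop(self):
        if not self.regs:
            if not self.memory:
                raise IndexError("FP stack underflow")
            # Underflow handler: reload the most recently spilled item.
            self.regs.append(self.memory.pop())
        return self.regs.pop()

stack = SpillingFPStack()
for n in range(1, 21):
    stack.push(float(n))
print(len(stack.regs), len(stack.memory))
print([stack.pop() for _ in range(20)][:3])
```

    The caller sees one deep stack; only the handlers know that no more than eight items sit in registers at any time.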

    Do you have a reference for that?

    Kahan writes about the original intention in

    http://web.archive.org/web/20170118054747/https://cims.nyu.edu/~dbindel/class/cs279/87stack.pdf

    especially starting at the last paragraph of page 2.

    And about the bug (or rather design mistake):

    https://history.siam.org/pdfs2/Kahan_final.pdf

    Start with the second-to-last paragraph on page 163. He digresses for
    a page, but continues on the fourth paragraph of page 165 and
    continues to the first paragraph of page 168.

    Interesting read. The reverse polish calculator from HP was a
    resounding success, with great profits.
    Kahan contributed to this.
    It was killed by the bean counters at HP.
    There was huge demand but they refused to expand production.
    Then the calculator died because it simply was not available.


    - anton

    Groetjes Albert
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Thu Apr 18 09:27:07 2024
    From Newsgroup: comp.lang.forth

    mhx wrote:

    minforth wrote:

    mhx wrote:
    [..]
    Could be. At least gcc has support for __float80. FWIW Intel's icc even has a compiler flag for x87 stack overflow warnings.
    [..]

    At runtime?

    AFAIK they check if computation results are popped off the x87 stack and put into the xmm0 register. Could be a flag for compile-time checking.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Thu Apr 18 09:51:55 2024
    From Newsgroup: comp.lang.forth

    minforth wrote:

    mhx wrote:

    minforth wrote:

    mhx wrote:
    [..]
    Could be. At least gcc has support for __float80. FWIW Intel's icc even has a
    compiler flag for x87 stack overflow warnings.
    [..]

    At runtime?

    AFAIK they check if computation results are popped off the x87 stack and put into the xmm0 register. Could be a flag for compile-time checking.

    Correction. It generates runtime code https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/fp-stack-check-qfp-stack-check.html

    but it doesn't look very convincing to me.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Pelc@stephen@vfxforth.com to comp.lang.forth on Thu Apr 18 12:09:22 2024
    From Newsgroup: comp.lang.forth

    On 17 Apr 2024 at 15:08:01 CEST, "PMF" <PMF> wrote:
    In lxf64/ntf64 I use an external C library, specifically fdlibm53.

    This claims to be within 1 ulp correct. My testing also suggests this.
    fdlibm looks to be the base for most other math libraries.
    I could have used libm from gcc but I wanted the same code for both
    Linux and Windows.

    In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
    8 stack items is too low to be useful for anything more complicated.
    complex numbers is an example that quickly eats all 8 stack items.

    I share your pain. In 2023, I spent far too long trying to satisfy FP users. For VFX64, we ended up providing support for
    1) 8 item 387 FP, internal FP stack,
    2) 387 FP and FP stack in memory,
    3) SSE2 FP and FP stack in memory.

    Conversion of FP parameters to/from OS call interfaces is done
    automagically by the EXTERN: declarations for system calls and
    callbacks.

    For a significant proportion of our users, 64 bit FP is inadequate,
    and only 64 bit FP is available on CPUs other than AMD64/x64
    unless you are using rare and expensive hardware.

    In retrospect, if I were doing this again I would standardise on an
    external double-double library (about 106 bits). In most cases that
    we encounter, the desire for 387 FP is to gain the extra precision.
    Since very few CPUs support quad precision natively, the most
    obvious solution is a double-double library.
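
    To illustrate what a double-double library does, here is a minimal Python sketch (Python floats are IEEE doubles). It uses the well-known error-free TwoSum transformation and a simple, not fully rigorous, addition; the function names are chosen for this example and do not come from any particular library.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def dd_add(x, y):
    """Add two double-double numbers given as (hi, lo) pairs (simple variant)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    hi = s + e               # renormalise so that |lo| is small relative to hi
    lo = e - (hi - s)
    return hi, lo

total = dd_add((1.0, 0.0), (1e-20, 0.0))
print(total)
print(1.0 + 1e-20 == 1.0)    # a plain double loses the small term
print(Fraction(total[0]) + Fraction(total[1]) == Fraction(1.0) + Fraction(1e-20))
```

    The pair keeps the 1e-20 term that a single double drops, which is where the extra precision comes from.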

    Stephen
    --
    Stephen Pelc, stephen@vfxforth.com
    MicroProcessor Engineering, Ltd. - More Real, Less Time
    133 Hill Lane, Southampton SO15 5AF, England
    tel: +44 (0)78 0390 3612, +34 649 662 974
    http://www.mpeforth.com
    MPE website
    http://www.vfxforth.com/downloads/VfxCommunity/
    downloads
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Thu Apr 18 07:52:31 2024
    From Newsgroup: comp.lang.forth

    On 4/17/24 08:08, PMF wrote:
    Krishna Myneni wrote:

    On 4/15/24 20:10, minforth wrote:
    I would be surprised to get a SIGFPE interrupt from x87 stack ops.
    X87 mode has long been deprecated and replaced by SSE2.

    Maybe I'm missing something, but SSE2 does not seem to have anything
    beyond basic floating point arithmetic e.g. transcendental functions.

    Yes that is right. There is a sqrt but nothing more.

    In lxf64/ntf64 I use an external C library, specifically fdlibm53.

    This claims to be within 1 ulp correct. My testing also suggests this.
    fdlibm looks to be the base for most other math libraries.
    I could have used libm from gcc but I wanted the same code for both
    Linux and Windows.

    In lxf/ntf I use the 387 fp stack. I think this was a wrong decision.
    8 stack items is too low to be useful for anything more complicated.
    complex numbers is an example that quickly eats all 8 stack items.
    Best Regards
    Peter Fälth

    Thank you for the clarification. I disassembled the GNU C Math Library
    (64-bit libm.so.6) and had a look at the code. It is a huge file. The
    long double functions are coded for x87, while functions for float,
    double, and float128 use the SSE instructions.

    This implies that if x87 is deprecated on 64-bit systems, then 80-bit
    floats are also deprecated?

    Does fdlibm53 support 80-bit floats and float128?

    --
    Krishna


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Krishna Myneni@krishna.myneni@ccreweb.org to comp.lang.forth on Thu Apr 18 08:00:17 2024
    From Newsgroup: comp.lang.forth

    On 4/18/24 07:09, Stephen Pelc wrote:
    ...
    In retrospect, if I were doing this again I would standardise on an
    external double-double library (about 106 bits). In most cases that
    we encounter, the desire for 387 FP is to gain the extra precision.
    Since very few CPUs support quad precision natively, the most
    obvious solution is a double-double library.

    I also thought double-double would be sufficient, until last year, when
    I ran into having to code an application for which it was unsuitable.
    The problem with double-double was not in the number of bits in the significand (106) -- the problem was that the exponent range of double precision was not large enough. I believe the double-double type even
    reduces the available exponent range from that of a double type. In any
    case I ended up using the MPFR library for this application. IEEE quad precision would be more suitable, since it expands the exponent range.
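
    The exponent-range point can be checked with a small Python sketch. It assumes, roughly, that a double-double whose high word is 2**e needs its low word to reach down to about 2**(e-105) to carry the full 106 bits; the helper name is made up for this example.

```python
import math
import sys

def lowest_dd_bit(exponent, bits=106):
    """Value of the lowest of `bits` significand bits below 2**exponent, as a double."""
    return math.ldexp(1.0, exponent - (bits - 1))

# Smallest normal double is 2**-1022, the largest exponent bound is 2**1024.
print(sys.float_info.min_exp - 1, sys.float_info.max_exp)

# The low word is itself a double, so the 106th bit vanishes well before the
# high word gets anywhere near the bottom of the double range.
print(lowest_dd_bit(-969) > 0.0)
print(lowest_dd_bit(-970) == 0.0)
```

    Under that assumption the upper limit is the same as for a plain double, while full precision at the lower end is only available down to about 2**-969 instead of 2**-1022.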

    --
    Krishna


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Thu Apr 18 17:10:17 2024
    From Newsgroup: comp.lang.forth

    The Cephes math library supports extended and 128b floats https://netlib.org/cephes/128bdoc.html

    But gnu libquadmath had been good enough for me until now.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From dxf@dxforth@gmail.com to comp.lang.forth on Fri Apr 19 13:48:04 2024
    From Newsgroup: comp.lang.forth

    On 18/04/2024 7:11 pm, albert@spenarnc.xs4all.nl wrote:
    ...
    Interesting read. The reverse polish calculator from HP was a
    resounding success, with great profits.
    Kahan contributed to this.
    It was killed by the bean counters at HP.
    There was huge demand but they refused to expand production.
    Then the calculator died because it simply was not available.

    We had quite a few HP instruments but then the bottom fell out of
    the test instrument market (and maintenance generally). It was
    embarrassing to see the once great HP pursue the consumer PC market.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 20 15:23:22 2024
    From Newsgroup: comp.lang.forth

    Stephen Pelc <stephen@vfxforth.com> writes:
    The manual (gasp) documents how to change the default FP package.

    Changing the default pack also changes the system call interfaces to
    match.

    It is great that now an FP package is available by default, rather
    than having to load some obscurely-named file from an
    installation-specific path. E.g. appbench-1.4 contains a file
    setup/vfx.fth which contains the lines:

    1 cells 4 = [if]
    include /usr/share/doc/VfxForth/Lib/Ndp387.fth
    [then]

    The result is that when I just tested various systems on the
    appbench-1.4 suite, vfx64 worked (for four of the benchmarks), and I
    won't spoil my positive message by reporting on what did not work.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 20 15:58:03 2024
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    minforth wrote:
    But I think the main advantage lies in the possibility of parallel and/or
    vectorized execution.

    I have not yet seen algorithms where that would bring something.

    Matrix multiplication is an easy case. I have also done a version of
    Jon Bentley's greedy TSP program that benefitted from SSE and AVX; I
    had to use assembly language to do this, however; see the thread
    starting at <2016Nov14.164726@mips.complang.tuwien.ac.at>.

    OTOH, yesterday I saw what gcc did for the inner loop of the bubble
    benchmark from the Stanford integer benchmarks:

        while ( i<top ) {

            if ( sortlist[i] > sortlist[i+1] ) {
                j = sortlist[i];
                sortlist[i] = sortlist[i+1];
                sortlist[i+1] = j;
            };
            i=i+1;
        };

        top=top-1;
    };

    gcc-12.2 -O1 produces straightforward scalar code, gcc-12.2 -O3 wants
    to use SIMD instructions:

    gcc -O1                      gcc -O3
    1c: add $0x4,%rax            c0: movq (%rax),%xmm0
        cmp %rsi,%rax                add $0x1,%edx
        je 35                        pshufd $0xe5,%xmm0,%xmm1
    25: mov (%rax),%edx              movd %xmm0,%edi
        mov 0x4(%rax),%ecx           movd %xmm1,%ecx
        cmp %ecx,%edx                cmp %ecx,%edi
        jle 1c                       jle e1
        mov %ecx,(%rax)              pshufd $0xe1,%xmm0,%xmm0
        mov %edx,0x4(%rax)           movq %xmm0,(%rax)
        jmp 1c                   e1: add $0x4,%rax
    35:                              cmp %r8d,%edx
                                     jl c0

    The version produced by gcc -O3 is almost three times slower on a
    Skylake than the one by gcc -O1 and is actually slower than several
    Forth systems, including gforth-fast. I think that the reason is that
    the movq towards the end stores two items, and the movq at the start
    of the next iteration loads one of these items, i.e., there is partial
    overlap between the store and the load. In this case the hardware
    takes a slow path, which means that the slowdown is much bigger than
    the instruction count suggests.
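
    The overlap described above can be made concrete with a small Python sketch that only classifies the geometry of a store followed by a load. Addresses are byte offsets relative to the pointer at the start of one iteration; the function name is hypothetical, and which of these cases a given core handles quickly is not something the sketch claims to know.

```python
def forwarding_case(store, load):
    """Classify a later load against an earlier store; both are (address, size) in bytes."""
    s_start, s_end = store[0], store[0] + store[1]
    l_start, l_end = load[0], load[0] + load[1]
    if l_end <= s_start or s_end <= l_start:
        return "no overlap"
    if s_start <= l_start and l_end <= s_end:
        return "load contained in store"
    return "partial overlap"

# -O3 loop: movq stores 8 bytes at offset 0, the pointer advances by 4,
# and the next movq loads 8 bytes at offset 4.
print(forwarding_case((0, 8), (4, 8)))

# -O1 loop: a 4-byte store at offset 4, then a 4-byte load at offset 4
# in the next iteration.
print(forwarding_case((4, 4), (4, 4)))
```

    In the -O3 pattern half of the loaded bytes come from the preceding store and half from older memory, which is the partial-overlap case named in the post.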

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 20 16:23:21 2024
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    I cut the same corners with ciforth. However I think this
    cannot be compliant with the IEEE requirement of the standard?

    Which IEEE requirement of which standard?

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Sat Apr 20 12:00:00 2024
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The version produced by gcc -O3 is almost three times slower on a
    Skylake than the one by gcc -O1 and is actually slower than several
    Forth systems, including gforth-fast.

    Wow, that seems worth a bug report. How about gcc -O2 ?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 21 06:45:29 2024
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The version produced by gcc -O3 is almost three times slower on a
    Skylake than the one by gcc -O1 and is actually slower than several
    Forth systems, including gforth-fast.

    Wow, that seems worth a bug report.

    At least the compiled code is working as the programmer intended. gcc
    people regularly resolve bug reports as "INVALID" where earlier gcc
    versions work as intended and later gcc versions do not. And then
    there are the cases where they just do nothing, as on PR93811.

    How about gcc -O2 ?

    I have not tried it.

    If you want to measure it and/or report this as a bug, you can find
    the source code in <http://www.complang.tuwien.ac.at/forth/bench.zip>
    in the directory c-manual.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 21 09:12:54 2024
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    OTOH, yesterday I saw what gcc did for the inner loop of the bubble
    benchmark from the Stanford integer benchmarks:

        while ( i<top ) {

            if ( sortlist[i] > sortlist[i+1] ) {
                j = sortlist[i];
                sortlist[i] = sortlist[i+1];
                sortlist[i+1] = j;
            };
            i=i+1;
        };

        top=top-1;
    };

    gcc-12.2 -O1 produces straightforward scalar code, gcc-12.2 -O3 wants
    to use SIMD instructions:

    gcc -O1                      gcc -O3
    1c: add $0x4,%rax            c0: movq (%rax),%xmm0
        cmp %rsi,%rax                add $0x1,%edx
        je 35                        pshufd $0xe5,%xmm0,%xmm1
    25: mov (%rax),%edx              movd %xmm0,%edi
        mov 0x4(%rax),%ecx           movd %xmm1,%ecx
        cmp %ecx,%edx                cmp %ecx,%edi
        jle 1c                       jle e1
        mov %ecx,(%rax)              pshufd $0xe1,%xmm0,%xmm0
        mov %edx,0x4(%rax)           movq %xmm0,(%rax)
        jmp 1c                   e1: add $0x4,%rax
    35:                              cmp %r8d,%edx
                                     jl c0

    The version produced by gcc -O3 is almost three times slower on a
    Skylake than the one by gcc -O1 and is actually slower than several
    Forth systems, including gforth-fast. I think that the reason is that
    the movq towards the end stores two items, and the movq at the start
    of the next iteration loads one of these items, i.e., there is partial
    overlap between the store and the load. In this case the hardware
    takes a slow path, which means that the slowdown is much bigger than
    the instruction count suggests.

    I was curious if a more recent Intel core had improved on that (and
    maybe such a more recent Intel core was targeted by the "optimization"
    that caused the slowdown), so I measured it on a P-core of a Core
    i3-1315U. The results are as follows:

        O1/bubble      O3/bubble
      424,248,952  2,061,809,866  cpu_core/cycles/
    1,536,825,253  1,986,035,580  cpu_core/instructions/

    So, more than a factor of 4 on this microarchitecture.
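
    The figures above can be turned into a slowdown factor and instructions per cycle with a few lines of Python. The numbers are copied from the measurement; the variable names are just for this sketch.

```python
# Counter values from the Core i3-1315U P-core measurement quoted above.
cycles = {"O1": 424_248_952, "O3": 2_061_809_866}
instructions = {"O1": 1_536_825_253, "O3": 1_986_035_580}

slowdown = cycles["O3"] / cycles["O1"]
extra_instructions = instructions["O3"] / instructions["O1"]
ipc = {opt: instructions[opt] / cycles[opt] for opt in cycles}

print(f"cycles O3/O1:       {slowdown:.2f}")
print(f"instructions O3/O1: {extra_instructions:.2f}")
for opt in ("O1", "O3"):
    print(f"IPC {opt}: {ipc[opt]:.2f}")
```

    The instruction count grows far less than the cycle count, so most of the slowdown shows up as a drop in instructions per cycle rather than as extra work.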

    The differences in the topdown analysis are also interesting:

    O1
    1,177,188,340  cpu_core/topdown-retiring/       # 46.1% Retiring
      279,332,826  cpu_core/topdown-bad-spec/       # 10.9% Bad Speculation
      778,141,445  cpu_core/topdown-fe-bound/       # 30.5% Frontend Bound
      319,237,516  cpu_core/topdown-be-bound/       # 12.5% Backend Bound
                0  cpu_core/topdown-heavy-ops/      #  0.0% Heavy Operations
      269,356,654  cpu_core/topdown-br-mispredict/  # 10.5% Branch Mispredict
      269,356,654  cpu_core/topdown-fetch-lat/      # 10.5% Fetch Latency
       59,857,034  cpu_core/topdown-mem-bound/      #  2.3% Memory Bound

    O3
    1,599,831,263  cpu_core/topdown-retiring/       # 12.9% Retiring
      630,236,558  cpu_core/topdown-bad-spec/       #  5.1% Bad Speculation
      533,277,087  cpu_core/topdown-fe-bound/       #  4.3% Frontend Bound
    9,598,987,583  cpu_core/topdown-be-bound/       # 77.6% Backend Bound
          280,169  cpu_core/topdown-heavy-ops/      #  0.0% Heavy Operations
      630,236,558  cpu_core/topdown-br-mispredict/  #  5.1% Branch Mispredict
      193,918,941  cpu_core/topdown-fetch-lat/      #  1.6% Fetch Latency
    5,623,649,291  cpu_core/topdown-mem-bound/      # 45.5% Memory Bound

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023: https://euro.theforth.net/2023
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 21 11:44:38 2024
    From Newsgroup: comp.lang.forth

    In article <2024Apr20.182321@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    >albert@spenarnc.xs4all.nl writes:
    >>I cut the same corners with ciforth. However I think this
    >>cannot be compliant with the IEEE requirement of the standard?
    >
    >Which IEEE requirement of which standard?

    The ISO 9x talks about

    64-bit IEEE double-precision number
    32-bit IEEE single-precision number

    IEEE floating point number as defined in ANSI/IEEE Standard
    754-1985

    80 bit '87 is not compliant with either of these, I think.
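
    To make the difference between the formats concrete, here is a small
    C sketch (not from the thread; the function names are invented for
    the example) that reports the significand widths from <float.h>. It
    assumes a binary floating-point platform. On x86-64 Linux with GCC,
    long double is typically the 80-bit x87 format with a 64-bit
    significand, but that is platform-dependent, so the code only prints
    it.

```c
#include <float.h>
#include <stdio.h>

/* Significand width in radix digits (bits when FLT_RADIX is 2). */
int float_bits(void)       { return FLT_MANT_DIG; }
int double_bits(void)      { return DBL_MANT_DIG; }
int long_double_bits(void) { return LDBL_MANT_DIG; }

/* Prints storage size and significand width of the three C FP types. */
void print_formats(void)
{
    printf("float:       %zu bytes, %d-bit significand\n",
           sizeof(float), float_bits());
    printf("double:      %zu bytes, %d-bit significand\n",
           sizeof(double), double_bits());
    printf("long double: %zu bytes, %d-bit significand\n",
           sizeof(long double), long_double_bits());
}
```

    With IEEE single and double formats the first two widths are 24 and
    53 bits; a 64-bit significand for long double indicates the x87
    extended format, which is a different format from both.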

    >- anton
    --
    Don't praise the day before the evening. One swallow doesn't make spring.
    You must not say "hey" before you have crossed the bridge. Don't sell the
    hide of the bear until you shot it. Better one bird in the hand than ten in
    the air. First gain is a cat purring. - the Wise from Antrim -
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 21 10:57:40 2024
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    >In article <2024Apr20.182321@mips.complang.tuwien.ac.at>,
    >Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    >>albert@spenarnc.xs4all.nl writes:
    >>>I cut the same corners with ciforth. However I think this
    >>>cannot be compliant with the IEEE requirement of the standard?
    >>
    >>Which IEEE requirement of which standard?
    >
    >The ISO 9x talks about

    Whatever "ISO 9x" may be: Why do you worry about it and mention it
    here?

    >64-bit IEEE double-precision number
    >32-bit IEEE single-precision number
    >
    >IEEE floating point number as defined in ANSI/IEEE Standard
    >754-1985

    So?

    >80 bit '87 is not compliant with either of these, I think.

    So?

    - anton
  • From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 22 21:05:10 2024
    From Newsgroup: comp.lang.forth

    On 21/04/2024 1:23 am, Anton Ertl wrote:
    >Stephen Pelc <stephen@vfxforth.com> writes:
    >>The manual (gasp) documents how to change the default FP package.
    >>
    >>Changing the default pack also changes the system call interfaces to
    >>match.
    >
    >It is great that now an FP package is available by default, rather
    >than having to load some obscurely-named file from an
    >installation-specific path. E.g. appbench-1.4 contains a file
    >setup/vfx.fth which contains the lines:
    >
    >1 cells 4 = [if]
    >include /usr/share/doc/VfxForth/Lib/Ndp387.fth
    >[then]
    >
    >The result is that when I just tested various systems on the
    >appbench-1.4 suite, vfx64 worked (for four of the benchmarks), and I
    >won't spoil my positive message by reporting on what did not work.

    What didn't work?

    When loading an alternate f/p package I found it removes parts of the
    compiler unrelated to fp. Hopefully this can be rectified. IIRC this
    problem didn't exist with the previous setup where fp had to be
    explicitly loaded.

  • From Stephen Pelc@stephen@vfxforth.com to comp.lang.forth on Mon Apr 22 19:56:55 2024
    From Newsgroup: comp.lang.forth

    On 22 Apr 2024 at 13:05:10 CEST, "dxf" <dxforth@gmail.com> wrote:

    >When loading an alternate f/p package I found it removes parts of the
    >compiler unrelated to fp. Hopefully this can be rectified. IIRC this
    >problem didn't exist with the previous setup where fp had to be
    >explicitly loaded.

    Please send me more details by email.

    Stephen
    --
    Stephen Pelc, stephen@vfxforth.com
    MicroProcessor Engineering, Ltd. - More Real, Less Time
    133 Hill Lane, Southampton SO15 5AF, England
    tel: +44 (0)78 0390 3612, +34 649 662 974
    http://www.mpeforth.com - MPE website
    http://www.vfxforth.com/downloads/VfxCommunity/ - downloads