• Stack vs stackless operation

    From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Mon Feb 24 19:49:01 2025
    From Newsgroup: comp.lang.forth

    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean, instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere, wouldn't it
    be practical to use the variables directly? Like this:

    : +> ( addr1 addr2 addr3 -- )
    rot @ rot @ + swap ! ;

    Of course the above is just an illustration; I mean coding
    such a word directly in ML (machine language). It should be
    significantly faster than going through the stack the usual way.

    But after I came up with this idea I realized someone
    surely invented it before — it looks so obvious — yet
    I haven't seen it anywhere. Did any of you see something
    like this in any code? If so, why hasn't such a solution
    become widespread? It looks good to me; the math can be
    done completely in ML, avoiding the "Forth machine"
    engagement and therefore saving many cycles.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@gmx.net (minforth) to comp.lang.forth on Mon Feb 24 20:34:25 2025

    An optimising Forth compiler does exactly that.

    NT/FORTH for example:

    : +> rot @ rot @ + swap ! ; ok
    see +>
    A49E6C 409196 21 C80000 5 normal +>

    409196 8B4504 mov eax , [ebp+4h]
    409199 8B00 mov eax , [eax]
    40919B 8B4D00 mov ecx , [ebp]
    40919E 8B09 mov ecx , [ecx]
    4091A0 01C8 add eax , ecx
    4091A2 8903 mov [ebx] , eax
    4091A4 8B5D08 mov ebx , [ebp+8h]
    4091A7 8D6D0C lea ebp , [ebp+Ch]
    4091AA C3 ret near
    ok
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Mon Feb 24 20:39:59 2025

    So for a non-optimizing one it would be handy, correct?

    BTW: could you please list all the optimizing Forth
    compilers -- or at least the ones you know of?

  • From minforth@gmx.net (minforth) to comp.lang.forth on Mon Feb 24 21:51:26 2025

    With respect, the more important questions are:
    For what type of machine?
    Desktop or embedded?
    Minimal kernel only or full standard compliant?
    Hobby or professional support/service required?

    But to mention another example:
    https://mecrisp.sourceforge.net/#
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Feb 24 21:50:21 2025

    zbigniew2011@gmail.com (LIT) writes:
    > I wonder: wouldn't it be useful to have stackless basic
    > arith operations? I mean instead of fetching the values
    > first and putting them on the stack, then doing something,
    > and in the end storing the result somewhere wouldn't it
    > be practical to use directly the variables?

    I don't remember ever doing that, so no, it would not be practical.

    Forth has had values for quite a while, so you could avoid the need to
    write @ and !; you would instead write something like:

    a b + to c

    For global values the code is typically not better than when using
    variables, though.

    > But after I came up with this idea I realized someone
    > surely invented that before - it looks so obvious — yet
    > I didn't see it anywhere.

    The VAX architecture has instructions with three memory operands,
    including ADDL3. That feature makes it pretty hard to implement
    efficiently.

    > Did anyone of you see something
    > like this in any code?

    No.

    > If so — actually why somehow
    > (probably?) such solution has not become widespread?

    Probably because the case where the two operands of a + are in memory,
    and the result is needed in memory is not that frequent.

    > Looks good to me; math can be done completely in ML
    > avoiding "Forth machine" engagement, therefore saving many
    > cycles.

    Not sure what you mean with "Forth machine engagement"; with good
    Forth compilers these days, a typical stack-to-stack addition is
    faster than the best machine code for a memory-to-memory addition.
    E.g. VFX64 turns

    : dec-u#b ( u1 -- u2 )
    dup #-3689348814741910323 um* nip 3 rshift tuck 10 * - '0' + hold ; ok

    into

    ( 0050A300 48BACDCCCCCCCCCCCCCC ) MOV RDX, # CCCCCCCC:CCCCCCCD
    ( 0050A30A 488BC2 ) MOV RAX, RDX
    ( 0050A30D 48F7E3 ) MUL RBX
    ( 0050A310 48C1EA03 ) SHR RDX, # 03
    ( 0050A314 486BCA0A ) IMUL RCX, RDX, # 0A
    ( 0050A318 482BD9 ) SUB RBX, RCX
    ( 0050A31B 4883C330 ) ADD RBX, # 30
    ( 0050A31F 488D6DF8 ) LEA RBP, [RBP+-08]
    ( 0050A323 48895500 ) MOV [RBP], RDX
    ( 0050A327 E87CA7F1FF ) CALL 00424AA8 HOLD
    ( 0050A32C C3 ) RET/NEXT
    ( 45 bytes, 11 instructions )
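The magic-number sequence in the code above is a reciprocal division by 10; a quick check of the arithmetic (assuming nothing beyond 64-bit cells, which is what VFX64 uses):

```python
# As an unsigned 64-bit cell, -3689348814741910323 is CCCCCCCC:CCCCCCCD,
# and "um* nip 3 rshift" -- the high half of the 64x64->128 product,
# shifted right by 3 -- yields u/10 exactly for any 64-bit u.
M = -3689348814741910323 % (1 << 64)
assert M == 0xCCCCCCCCCCCCCCCD

def div10(u):
    high = (u * M) >> 64          # um* nip : keep the high half
    return high >> 3              # 3 rshift

for u in (0, 9, 10, 4711, 123456789, (1 << 64) - 1):
    assert div10(u) == u // 10
    # "tuck 10 * -" then leaves the remainder, and "'0' +" makes it a digit:
    assert u - 10 * div10(u) + ord('0') == ord('0') + u % 10
```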

    I don't think that it would be faster or shorter to use
    memory-to-memory operations here. That's also why the VAX died: RISCs
    just outperformed it.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Mon Feb 24 22:35:28 2025

    On Mon, 24 Feb 2025 21:51:26 +0000, minforth wrote:

    > With respect, the more important questions are:
    > For what type of machine?
    > Desktop or embedded?

    Wow, so there are really so many?
    Say, for desktop.

    > Minimal kernel only or full standard compliant?
    > Hobby or professional support/service required?
    >
    > But to mention another example:
    > https://mecrisp.sourceforge.net/#

    Indeed, I downloaded Mecrisp long ago already, and somehow
    I still haven't found time for my Stellaris LaunchPad.
    Maybe now it's finally time.

  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Mon Feb 24 22:41:57 2025

    > Probably because the case where the two operands
    > of a + are in memory, and the result is needed
    > in memory is not that frequent.

    One example could be matrix multiplication.
    It's a rather trivial but cumbersome operation,
    where usually a few transitional variables are
    used to maintain clarity of the code.

    > I don't think that it would be faster or shorter to use
    > memory-to-memory operations here. That's also why the VAX died: RISCs
    > just outperformed it.

    Probably the "bigger" Forth compilers are indeed
    already "too good" for the difference to be
    (practically) noticeable — but maybe for
    simpler Forths, I mean the ones for DOS
    or even for 8-bit machines, it would make sense?
    I'll try to do a few checks in the coming days.

  • From dxforth@gmail.com to comp.lang.forth on Tue Feb 25 16:24:40 2025

    On 25/02/2025 6:49 am, LIT wrote:
    > I wonder: wouldn't it be useful to have stackless basic
    > arith operations? I mean instead of fetching the values
    > first and putting them on the stack, then doing something,
    > and in the end storing the result somewhere wouldn't it
    > be practical to use directly the variables? Like this:
    >
    > : +> ( addr1 addr2 addr3 -- )
    >  rot @ rot @ + swap ! ;
    >
    > Of course the above is just an illustration; I mean coding
    > such word directly in ML. It should be significantly
    > faster than going through stack usual way.

    A set of three addresses on the stack is messy even before
    one does anything with them.

  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 07:04:26 2025

    >> : +> ( addr1 addr2 addr3 -- )
    >>  rot @ rot @ + swap ! ;
    >>
    >> Of course the above is just an illustration; I mean coding
    >> such word directly in ML. It should be significantly
    >> faster than going through stack usual way.

    > A set of three addresses on the stack is messy even before
    > one does anything with them.

    Yep, but I meant the case of, for example:

    var1 @ var2 @ + var3 !

    The above isn't messy at all.

    So IMHO, by using such an OOS (out-of-stack) operation — coded
    directly in ML — we can replace the above with:

    var1 var2 var3 +>

    Or we can create an increment-by-one operation (and its
    counterpart):

    var1 ++
    var1 --

    Or "multiply/divide by two a number of times":

    var1 2 lshift ( multiply by 4 )

    etc.

    In the case of slower, non-optimizing ITC Forths — with
    fig-Forth as the most obvious example — the "boost"
    may be noticeable.
    I'll check that.

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 07:26:58 2025

    zbigniew2011@gmail.com (LIT) writes:
    >> Probably because the case where the two operands
    >> of a + are in memory, and the result is needed
    >> in memory is not that frequent.

    > One example could be matrix multiplication.
    > It's rather trivial but cumbersome operation,
    > where usually a few transitional variables are
    > used to maintain clarity of the code.

    Earlier you wrote about performance, now you switch to clarity of the
    code. What is the goal?

    If we stick with performance: the fastest version in
    <http://theforth.net/package/matmul/current-view/matmul.4th>
    on all the systems I measured (and one that does not use a
    primitive FAXPY) is version 2, and that spends most of its time in:

    : faxpy-nostride ( ra f_x f_y ucount -- )
        \ vy=ra*vx+vy
        dup >r 3 and 0 ?do
            fdup over f@ f* dup f+! float+ swap float+ swap
        loop
        r> 2 rshift 0 ?do
            fdup over f@ f* dup f+! float+ swap float+ swap
            fdup over f@ f* dup f+! float+ swap float+ swap
            fdup over f@ f* dup f+! float+ swap float+ swap
            fdup over f@ f* dup f+! float+ swap float+ swap
        loop
        2drop fdrop ;

    It's not the clearest code, and certainly the version without
    unrolling is clearer (and may be almost as fast in the newer versions
    of SwiftForth and VFX which make counted loops significantly faster):

    : faxpy-nostride ( ra f_x f_y ucount -- )
        \ vy=ra*vx+vy
        0 ?do
            fdup over f@ f* dup f+! float+ swap float+ swap
        loop
        2drop fdrop ;
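Restated in plain Python (a sketch of the semantics only; the Forth version walks raw float arrays with f@ / f+! / float+):

```python
# What FAXPY-NOSTRIDE computes: vy = ra*vx + vy, element by element.
def faxpy_nostride(ra, x, y):
    for i in range(len(y)):
        y[i] += ra * x[i]     # per element: 2 FP loads, 1 FP store
    return y

assert faxpy_nostride(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0]) \
       == [12.0, 14.0, 16.0]
```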

    Each iteration performs 2 FP loads and 1 FP store. With
    memory-to-memory variants of F* and F+ that would be 4 FP loads
    and 2 FP stores, and I don't think it would be any clearer. And
    if you used memory-to-memory variants for the address computation,
    things would become even slower. And I doubt that they would
    become clearer.

    Some time later I worked on how SIMD could be integrated into
    Forth, and used matrix multiplication as an example. With the
    wordset I propose, this whole loop became

    ( v1 r addr ) v@ f*vs f+v ( v2 )

    Only one memory access is visible here at all; there are some more
    in the implementation of these words, however. You can find the
    paper about that at
    <http://www.euroforth.org/ef17/papers/ertl.pdf>. A further
    refinement of that work can be found at
    <https://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf>
    (presented in a Java setting for the audience of the conference,
    but the implementation was in a Forth setting, see
    <https://github.com/AntonErtl/vectors>). This work eliminates many
    of the memory accesses that the earlier implementation performs,
    demonstrating that the memory accesses are not fundamental to the
    model. In particular, Figure 11 shows code corresponding to

    ( v1 r1 addr1 r2 addr2 ) v@ f*vs v@ f+v v@ f*vs f+v ( v2 )

    i.e., the code above unrolled by a factor of 2; it has 3 SIMD loads
    and 1 SIMD store per SIMD granule processed (the SIMD granule is 4
    doubles for AVX). Further unrolling results in even fewer loads and
    stores per FLOP (FP multiplication and FP addition).

    > Probably "bigger" Forth compilers are indeed
    > already "too good" for the difference to be
    > (practically) noticeable — still maybe for
    > simpler Forths, I mean like the ones for DOS
    > or even for 8-bit machines it would make sense?

    Forth was designed for small machines and very simple implementations.
    We have words like "1+" that are beneficial in that setting. We also
    have "+!", which is the closest to what you have in mind. But even in
    those times nobody went for a word like "+> ( addr1 addr2 addr3 -- )",
    because it is not useful often enough.

    - anton
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 08:33:16 2025

    >> One example could be matrix multiplication.
    >> It's rather trivial but cumbersome operation,
    >> where usually a few transitional variables are
    >> used to maintain clarity of the code.

    > Earlier you wrote about performance, now you switch to clarity of the
    > code. What is the goal?

    Both — one isn't contrary to the other.

    > If we stick with performance, the fastest version in
    > [..]
    > Forth was designed for small machines and very simple implementations.
    > We have words like "1+" that are beneficial in that setting. We also
    > have "+!", which is the closest to what you have in mind. But even in
    > those times nobody went for a word like "+> ( addr1 addr2 addr3 -- )",
    > because it is not useful often enough.

    What I have in mind is: by performing an OOS operation we don't
    have to employ the whole "Forth machine" to do the usual things
    (I mean, of course, the usual steps described by Brad Rodriguez
    in his "Moving Forth" paper).

    It comes at a cost: the usual Forth words, which use the stack,
    are versatile, while such OOS words aren't that versatile
    anymore — yet (at least in the case of non-optimizing ITC Forths)
    they should be faster. I'll create a few of them, compare the
    processing times, and then publish the results.

    Clarity of the code comes as a "bonus" :) Yes, we've got VALUEs
    and I use them when needed, but their use still means employing
    the "Forth machine".

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 09:07:19 2025

    zbigniew2011@gmail.com (LIT) writes:
    > [Anton Ertl:]
    >> Earlier you wrote about performance, now you switch to clarity of the
    >> code. What is the goal?

    > Both — one isn't contrary to another.

    Sometimes the clearer code is slower and the faster code is less clear
    (as in the FAXPY-NOSTRIDE example).

    > What I have in mind is: by performing OOS operation
    > we don't have to employ the whole "Forth machine" to
    > do the usual things (I mean, of course, the usual
    > steps described by Brad Rodriguez in his "Moving
    > Forth" paper).

    What does "OOS" stand for? And what do you mean by "the usual
    steps"? I am not going to read the whole paper and guess which
    of the code shown there you have in mind.

    > It comes with a cost: usual Forth words, that use
    > the stack, are versatile, while such OOS words
    > aren't that versatile anymore — yet (at least in
    > the case of ITC non-optimizing Forths) they should
    > be faster.

    One related thing is the work on "register"-based virtual machines.
    For interpreted implementations the VM registers are in memory, but
    they are accessed by "register number"; these usually correspond to
    locals slots on machines like the JavaVM. A well-known example of
    that is the switch of the Lua VM from stack-based to register-based.
    A later example is Android's Dalvik VM for Java, in contrast to the
    stack-based JavaVM.

    There is a paper [shi+08] that provides an academic justification
    for this approach. The gist of it is that, with some additional
    compiler complexity, the register-based machine can reduce the
    number of NEXTs (in Forth threaded-code terminology); depending on
    the implementation approach and the hardware, the NEXTs could be
    the major cost at the time. However, already at the time, dynamic
    superinstructions (an implementation technique for virtual-machine
    interpreters) reduced the number of NEXTs to one per basic block,
    and VM registers did nothing to reduce NEXTs in that case; Shi et
    al. also showed that, with a lot of compiler sophistication (data
    flow analysis etc.), VM registers can be as fast as stacks even
    with dynamic superinstructions.

    However, given that dynamic superinstructions are easier to
    implement and that VM registers give no benefit when dynamic
    superinstructions are employed, why would one go for VM registers?
    Of course, in the Forth setting one could offload the optimization
    onto the programmer, but even Chuck Moore did not go there.

    In any case, here's an example extracted from Figure 6 of the paper:

    Java VM (stack)      VM registers
    19 iload_1
    20 bipush #31        iconst #31 -> r1
    21 imul              imul r6 r1 -> r3
    22 aload_0
    23 getfield value    getfield r0.value -> r5
    24 iload_3
    26 caload            caload r5 r7 -> r5
    27 iadd              iadd r3 r5 -> r6
    28 istore_1

    So yes, the VM register code contains fewer VM instructions. Is it
    clearer?
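The comparison in that table can be sketched as two toy interpreters (Python; the opcode names and heap layout are invented for the demo, not real JVM bytecode) running the same expression, r1 = r1*31 + value[r3], and counting dispatched VM instructions:

```python
def run_stack(regs, heap):
    """Stack VM: 9 dispatched instructions for the basic block."""
    st, n = [], 0
    prog = [("load", 1), ("push", 31), ("mul",), ("load", 0),
            ("getfield", "value"), ("load", 3), ("caload",),
            ("add",), ("store", 1)]
    for op, *arg in prog:
        n += 1
        if op == "push":       st.append(arg[0])
        elif op == "load":     st.append(regs[arg[0]])
        elif op == "store":    regs[arg[0]] = st.pop()
        elif op == "mul":      b, a = st.pop(), st.pop(); st.append(a * b)
        elif op == "add":      b, a = st.pop(), st.pop(); st.append(a + b)
        elif op == "getfield": st.append(heap[st.pop()][arg[0]])
        elif op == "caload":   i, a = st.pop(), st.pop(); st.append(a[i])
    return n

def run_reg(r, heap):
    """Register VM: 5 dispatched instructions, operands name registers."""
    n = 0
    r["t"] = 31;                   n += 1   # iconst #31 -> t
    r[1] = r[1] * r["t"];          n += 1   # imul
    r["v"] = heap[r[0]]["value"];  n += 1   # getfield
    r["v"] = r["v"][r[3]];         n += 1   # caload
    r[1] = r[1] + r["v"];          n += 1   # iadd
    return n

heap = {"obj": {"value": [5, 6, 7]}}
regs = {0: "obj", 1: 2, 3: 1}
n_stack = run_stack(regs, heap)
r = {0: "obj", 1: 2, 3: 1}
n_reg = run_reg(r, heap)
assert regs[1] == r[1] == 2 * 31 + 6      # both compute the same value
assert (n_stack, n_reg) == (9, 5)
```

Fewer dispatches, but each register-VM instruction carries more operands to decode, and (as discussed below) the register slots themselves live in memory.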

    The corresponding Gforth code is the stuff between IF and THEN in
    the following:

    0
    value: some-field
    value: value
    constant some-struct

    : foo
    {: r0 r1 r3 :}
    if
    r1 31 * r0 value r3 + c@ + to r1
    then
    r1 ;

    The code that Gforth produces for the basic block under consideration
    is:

    $7FC624AA0958 @local1 1->1
    7FC62464A5BA: mov [r10],r13
    7FC62464A5BD: sub r10,$08
    7FC62464A5C1: mov r13,$08[rbp]
    $7FC624AA0960 lit 1->2
    $7FC624AA0968 #31
    7FC62464A5C5: sub rbx,$50
    7FC62464A5C9: mov r15,-$08[rbx]
    $7FC624AA0970 * 2->1
    7FC62464A5CD: imul r13,r15
    $7FC624AA0978 @local0 1->2
    7FC62464A5D1: mov r15,$00[rbp]
    $7FC624AA0980 lit+ 2->2
    $7FC624AA0988 #8
    7FC62464A5D5: add r15,$18[rbx]
    $7FC624AA0990 @ 2->2
    7FC62464A5D9: mov r15,[r15]
    $7FC624AA0998 @local2 2->3
    7FC62464A5DC: mov r9,$10[rbp]
    $7FC624AA09A0 + 3->2
    7FC62464A5E0: add r15,r9
    $7FC624AA09A8 c@ 2->2
    7FC62464A5E3: movzx r15d,byte PTR [r15]
    $7FC624AA09B0 + 2->1
    7FC62464A5E7: add r13,r15
    $7FC624AA09B8 !local1 1->1
    7FC62464A5EA: add r10,$08
    7FC62464A5EE: mov $08[rbp],r13
    7FC62464A5F2: mov r13,[r10]
    7FC62464A5F5: add rbx,$50

    There are 8 loads and 2 stores in that code. If the VM registers are
    held in memory (as they usually are, and as the Gforth locals are),
    the VM register code performs at least 9 loads (7 register accesses,
    the getfield, and the caload) and 5 stores. Of course, in Forth one
    would write the block as:

    : foo1 ( n3 a0 n1 -- n )
    31 * swap value rot + c@ + ;

    and the code for that is (without the ";"):

    $7FC624AA0A10 lit 1->2
    $7FC624AA0A18 #31
    7FC62464A617: mov r15,$08[rbx]
    $7FC624AA0A20 * 2->1
    7FC62464A61B: imul r13,r15
    $7FC624AA0A28 swap 1->2
    7FC62464A61F: mov r15,$08[r10]
    7FC62464A623: add r10,$08
    $7FC624AA0A30 lit+ 2->2
    $7FC624AA0A38 #8
    7FC62464A627: add r15,$28[rbx]
    $7FC624AA0A40 @ 2->2
    7FC62464A62B: mov r15,[r15]
    $7FC624AA0A48 rot 2->3
    7FC62464A62E: mov r9,$08[r10]
    7FC62464A632: add r10,$08
    $7FC624AA0A50 + 3->2
    7FC62464A636: add r15,r9
    $7FC624AA0A58 c@ 2->2
    7FC62464A639: movzx r15d,byte PTR [r15]
    $7FC624AA0A60 + 2->1
    7FC62464A63D: add r13,r15

    6 loads, 0 stores.

    And if we feed the equivalent standard code

    0
    field: some-field
    field: value-addr
    constant some-struct

    : foo1 ( n3 a0 n1 -- n )
    31 * swap value-addr @ rot + c@ + ;

    into other Forth systems, some produce even better code:

    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    FOO1
    ( 0050A310 486BDB1F ) IMUL RBX, RBX, # 1F
    ( 0050A314 488B5500 ) MOV RDX, [RBP]
    ( 0050A318 488B4D08 ) MOV RCX, [RBP+08]
    ( 0050A31C 48034A08 ) ADD RCX, [RDX+08]
    ( 0050A320 480FB609 ) MOVZX RCX, Byte 0 [RCX]
    ( 0050A324 4803D9 ) ADD RBX, RCX
    ( 0050A327 488D6D10 ) LEA RBP, [RBP+10]
    ( 0050A32B C3 ) RET/NEXT
    ( 28 bytes, 8 instructions )

    5 loads, 0 stores. And VFX does not do data-flow analysis across
    basic blocks, unlike the Java VM -> VM register compiler that Shi
    used; i.e., VFX is probably simpler than the compiler Shi used.

    @Article{shi+08,
    author = {Yunhe Shi and Kevin Casey and M. Anton Ertl and
    David Gregg},
    title = {Virtual machine showdown: Stack versus registers},
    journal = {ACM Transactions on Architecture and Code
    Optimization (TACO)},
    year = {2008},
    volume = {4},
    number = {4},
    pages = {21:1--21:36},
    month = jan,
    url = {http://doi.acm.org/10.1145/1328195.1328197},
    abstract = {Virtual machines (VMs) enable the distribution of
    programs in an architecture-neutral format, which
    can easily be interpreted or compiled. A
    long-running question in the design of VMs is
    whether a stack architecture or register
    architecture can be implemented more efficiently
    with an interpreter. We extend existing work on
    comparing virtual stack and virtual register
    architectures in three ways. First, our translation
    from stack to register code and optimization are
    much more sophisticated. The result is that we
    eliminate an average of more than 46\% of
    executed VM instructions, with the bytecode size of
    the register machine being only 26\% larger
    than that of the corresponding stack one. Second, we
    present a fully functional virtual-register
    implementation of the Java virtual machine (JVM),
    which supports Intel, AMD64, PowerPC and Alpha
    processors. This register VM supports
    inline-threaded, direct-threaded, token-threaded,
    and switch dispatch. Third, we present experimental
    results on a range of additional optimizations such
    as register allocation and elimination of redundant
    heap loads. On the AMD64 architecture the register
    machine using switch dispatch achieves an average
    speedup of 1.48 over the corresponding stack
    machine. Even using the more efficient
    inline-threaded dispatch, the register VM achieves a
    speedup of 1.15 over the equivalent stack-based VM.}
    }

    > Clarity of the code comes as a "bonus" :) yes, we've
    > got VALUEs and I use them when needed, but their use
    > still means employing the "Forth machine".

    What do you mean with 'the "Forth machine"', and how does "OOS"
    (whatever that is) avoid it?

    - anton
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 10:58:30 2025

    >>> Earlier you wrote about performance, now you switch to clarity of the
    >>> code. What is the goal?

    >> Both — one isn't contrary to another.

    > Sometimes the clearer code is slower and the faster code is less clear
    > (as in the FAXPY-NOSTRIDE example).

    >> What I have in mind is: by performing OOS operation
    >> we don't have to employ the whole "Forth machine" to
    >> do the usual things (I mean, of course, the usual
    >> steps described by Brad Rodriguez in his "Moving
    >> Forth" paper).

    > What does "OOS" stand for?

    It's an acronym for the term I propose: "out-of-stack" (operation).

    > What do you mean with "the usual steps"; I
    > am not going to read the whole paper and guess which of the code shown
    > there you have in mind.

    I mean the description of how the "Forth machine" works:

    "Assume SQUARE is encountered while executing some other Forth word.
    Forth's Interpreter Pointer (IP) will be pointing to a cell in memory -- contained within that "other" word -- which contains the address of the
    word SQUARE. (To be precise, that cell contains the address of SQUARE's
    Code Field.) The interpreter fetches that address, and then uses it to
    fetch the contents of SQUARE's Code Field. These contents are yet
    another address -- the address of a machine language subroutine which
    performs the word SQUARE. In pseudo-code, this is:

    (IP) -> W    fetch memory pointed by IP into "W" register
                 ...W now holds address of the Code Field
    IP+2 -> IP   advance IP, just like a program counter
                 (assuming 2-byte addresses in the thread)
    (W) -> X     fetch memory pointed by W into "X" register
                 ...X now holds address of the machine code
    JP (X)       jump to the address in the X register

    This illustrates an important but rarely-elucidated principle: the
    address of the Forth word just entered is kept in W. CODE words don't
    need this information, but all other kinds of Forth words do.

    If SQUARE were written in machine code, this would be the end of the
    story: that bit of machine code would be executed, and then jump back to
    the Forth interpreter -- which, since IP was incremented, is pointing to
    the next word to be executed. This is why the Forth interpreter is
    usually called NEXT.

    But, SQUARE is a high-level "colon" definition… [..]” etc.

    ( https://www.bradrodriguez.com/papers/moving1.htm )
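Brad's four pseudo-code steps can be transcribed literally (a Python sketch assuming 2-byte little-endian cells; all addresses here are invented for the demo):

```python
# Flat memory plus the IP/W/X register shuffle of an ITC NEXT.
mem = bytearray(64)

def rd16(a): return mem[a] | (mem[a + 1] << 8)
def wr16(a, v): mem[a] = v & 0xFF; mem[a + 1] = v >> 8

# Thread cell at 10 holds the address of SQUARE's code field at 20,
# whose cell holds the address of the machine code at 30 (a stub here).
wr16(10, 20)                 # cell in the "other" word: SQUARE's CFA
wr16(20, 30)                 # SQUARE's code field: address of machine code
machine_code = {30: "code for SQUARE"}

IP = 10
W = rd16(IP)                 # (IP) -> W : address of the Code Field
IP += 2                      # IP+2 -> IP : advance IP like a PC
X = rd16(W)                  # (W) -> X  : address of the machine code
routine = machine_code[X]    # JP (X)    : jump to the address in X

assert (W, IP, X, routine) == (20, 12, 30, "code for SQUARE")
```

Every one of these memory fetches happens once per word executed, which is the overhead the proposed OOS words would amortize over a longer machine-code sequence.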

    Many of these steps can, in particular cases, be avoided by the
    use of the proposed OOS words, making the Forth program faster
    (at least sometimes) — and, as a kind of "bonus", the clarity of
    the code increases.

    Probably in the case of an "optimizing compiler" the gain may not
    be too significant, from what I've already learned here; still, in
    the case of simpler compilers — and maybe especially the ones
    created for CPUs not that suitable for Forth at all (lack of
    registers, like the 8051, for example) — it will probably be
    advantageous.

    [..]
    >> Clarity of the code comes as a "bonus" :) yes, we've
    >> got VALUEs and I use them when needed, but their use
    >> still means employing the "Forth machine".

    > What do you mean with 'the "Forth machine"', and how does "OOS"
    > (whatever that is) avoid it?

    By the "Forth machine" I mean the internal workings of the Forth
    system — see the above quote from Brad's paper. When we don't need
    to "fetch memory pointed by IP into "W" register, advance IP, just
    like a program counter" etc., replacing the whole process (which
    is repeated for each subsequent word again and again) with a short
    string of ML instructions, we should see a significant gain in
    processing speed.

    I'll be able to say more after I do the comparison, at least for
    fig-Forth on x86 under DOS control — because so far all of the
    above is just my own conjecture. I'll publish the first results
    this evening.

  • From minforth@gmx.net (minforth) to comp.lang.forth on Tue Feb 25 11:16:26 2025

    > But, SQUARE is a high-level "colon" definition... [..]" etc.
    >
    > ( https://www.bradrodriguez.com/papers/moving1.htm )
    >
    > Many of these steps in particular cases can be avoided
    > by the use of proposed OOS words, making (at least sometimes)
    > the Forth program faster — and, as a kinda "bonus", clarity
    > of the code increases.

    After having avoided premature optimisation, every 'decent' Forth
    programmer will recode a few bottleneck words, e.g. in assembler,
    where necessary. IOW, microbenchmarking SQUARE, which can be
    implemented in a handful of lines of machine code or less, does
    not bring new insights.
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 11:40:46 2025

    >>> But, SQUARE is a high-level "colon" definition... [..]" etc.
    >>>
    >>> ( https://www.bradrodriguez.com/papers/moving1.htm )
    >>>
    >>> Many of these steps in particular cases can be avoided
    >>> by the use of proposed OOS words, making (at least sometimes)
    >>> the Forth program faster — and, as a kinda "bonus", clarity
    >>> of the code increases.

    > After having avoided premature optimisation, every 'decent'
    > Forth programmer will recode some few bottleneck words e.g.
    > in assembler, where necessary. IOW microbenchmarking SQUARE,
    > which can be implemented in a handful of lines of machine code
    > or less, does not bring new insights.

    I agree with you — still, it does take a decent Forth programmer.
    Recall the ones described by Jeff Fox? Those Forth programmers
    who refused to use Machine Forth just because "they were hired
    to program in ANS Forth"?
    I don't believe they would have been able to recode anything in
    assembler — and note, that was about 30 years ago. Since then,
    assembler programming has become even less popular.

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 11:20:47 2025

    zbigniew2011@gmail.com (LIT) writes:
    > I mean the description how the "Forth machine" works:
    >
    > "Assume SQUARE is encountered while executing some other Forth word.
    > Forth's Interpreter Pointer (IP) will be pointing to a cell in memory --
    > contained within that "other" word -- which contains the address of the
    > word SQUARE. (To be precise, that cell contains the address of SQUARE's
    > Code Field.) The interpreter fetches that address, and then uses it to
    > fetch the contents of SQUARE's Code Field. These contents are yet
    > another address -- the address of a machine language subroutine which
    > performs the word SQUARE. In pseudo-code, this is:
    >
    > (IP) -> W    fetch memory pointed by IP into "W" register
    >              ...W now holds address of the Code Field
    > IP+2 -> IP   advance IP, just like a program counter
    >              (assuming 2-byte addresses in the thread)
    > (W) -> X    fetch memory pointed by W into "X" register
    >              ...X now holds address of the machine code
    > JP (X)      jump to the address in the X register
    >
    > This illustrates an important but rarely-elucidated principle: the
    > address of the Forth word just entered is kept in W. CODE words don't
    > need this information, but all other kinds of Forth words do.
    >
    > If SQUARE were written in machine code, this would be the end of the
    > story: that bit of machine code would be executed, and then jump back to
    > the Forth interpreter -- which, since IP was incremented, is pointing to
    > the next word to be executed. This is why the Forth interpreter is
    > usually called NEXT.
    >
    > But, SQUARE is a high-level "colon" definition... [..]" etc.
    >
    > ( https://www.bradrodriguez.com/papers/moving1.htm )
    >
    > Many of these steps in particular cases can be avoided
    > by the use of proposed OOS words, making (at least sometimes)
    > the Forth program faster — and, as a kinda "bonus", clarity
    > of the code increases.

    What Rodriguez describes above is NEXT. As I mentioned in the earlier
    posting, using a VM with VM registers reduces the number of NEXTs
    executed, but if you go for dynamic superinstructions or native-code compilation, the number of NEXTs is reduced even more. And this can
    be done while still working with ordinary Forth code, no OOS needed.
    And these kinds of compilers can be done with relatively little
    effort.

    Probably in the case of an "optimizing compiler" the gain
    may not be too significant, from what I've already learned
    here; still, in the case of simpler compilers — and maybe
    especially ones created for CPUs not that suitable
    for Forth at all (lack of registers, like the 8051,
    for example) — it may be advantageous.

    I cannot speak about the 8051, but machine Forth is a simple
    native-code system and it's stack-based.

    By the "Forth machine" I mean that internal work of the
    Forth compiler - see the above quote from Brad's paper
    - and when we don't need to "fetch memory pointed by
    IP into "W" register, advance IP, just like a program
    counter" etc. etc. — replacing the whole process
    (which is repeated for each subsequent word again and
    again) with a short string of ML instructions — we should
    see a significant gain in processing speed.

    Yes, dynamic superinstructions provide a good speedup for Gforth, and native-code systems also show a good speedup compared to classic
    threaded-code systems. But it's not necessary to eliminate the stack
    for that. Actually dealing with the stack is orthogonal to
    threaded code vs. native code.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Tue Feb 25 23:07:04 2025
    From Newsgroup: comp.lang.forth

    On 25/02/2025 6:04 pm, LIT wrote:
    : +> ( addr1 addr2 addr3 -- )
     rot @ rot @ + swap ! ;

    Of course the above is just an illustration; I mean coding
    such word directly in ML. It should be significantly
    faster than going through stack usual way.

    A set of three addresses on the stack is messy even before
    one does anything with them.

    Yep, but I meant the case of, for example:

    var1 @ var2 @ + var3 !

    The above isn't messy at all.

    So IMHO by using such OOS (out-of-stack) operation - coded
    directly in ML - we can replace the above by:

    var1 var2 var3 +>

    ...
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS



    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 12:10:46 2025
    From Newsgroup: comp.lang.forth

    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    I expect an even bigger gain with the older fig-Forth
    model.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 12:24:32 2025
    From Newsgroup: comp.lang.forth

    What Rodriguez describes above is NEXT. As I mentioned in the earlier posting, using a VM with VM registers reduces the number of NEXTs
    executed, but if you go for dynamic superinstructions or native-code compilation, the number of NEXTs is reduced even more. And this can
    be done while still working with ordinary Forth code, no OOS needed.
    And these kinds of compilers can be done with relatively little
    effort.

    Yes, but by "Forth machine" I actually mean the
    work of the internals of the Forth compiler (which
    has been nicely described by Brad), not just NEXT alone.

    I do realize that the proposed OOS technique may not be
    that advantageous with a more sophisticated compiler
    that does a fine job for the programmer. Still, such a
    compiler isn't available everywhere; if I find enough
    time during the coming weekend, I'll try to do some
    exercises and comparisons using Camel Forth for the 8051.
    I expect the difference may be really big.

    Probably in the case of an "optimizing compiler" the gain
    may not be too significant, from what I've already learned
    here; still, in the case of simpler compilers — and maybe
    especially ones created for CPUs not that suitable
    for Forth at all (lack of registers, like the 8051,
    for example) — it may be advantageous.

    I cannot speak about the 8051, but machine Forth is a simple
    native-code system and it's stack-based.

    Never used that one. I know nothing about Machine Forth.
    BTW: is it available for download anywhere (if not
    commercial/restricted)?

    By the "Forth machine" I mean that internal work of the
    Forth compiler - see the above quote from Brad's paper
    - and when we don't need to "fetch memory pointed by
    IP into "W" register, advance IP, just like a program
    counter" etc. etc. — replacing the whole process
    (which is repeated for each subsequent word again and
    again) with a short string of ML instructions — we should
    see a significant gain in processing speed.

    Yes, dynamic superinstructions provide a good speedup for Gforth, and native-code systems also show a good speedup compared to classic threaded-code systems. But it's not necessary to eliminate the stack
    for that. Actually dealing with the stack is orthogonal to
    threaded code vs. native code.

    Yes, I agree it's not necessary - still, in particular
    cases it may mean a noticeable speed-up. In other
    cases - and especially for the mentioned optimizing
    compilers - it may not make much sense, probably.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Feb 25 13:45:09 2025
    From Newsgroup: comp.lang.forth

    In article <591e7bf58ebb1f90bd34fba20c730b83@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere wouldn't it
    be practical to use directly the variables? Like this:

    : +> ( addr1 addr2 addr3 -- )
    rot @ rot @ + swap ! ;

    Of course the above is just an illustration; I mean coding
    such word directly in ML. It should be significantly
    faster than going through stack usual way.

    But after I came up with this idea I realized someone
    surely invented that before - it looks so obvious — yet
    I didn't see it anywhere. Did anyone of you see something
    like this in any code? If so — actually why somehow
    (probably?) such solution has not become widespread?
    Looks good to me; math can be done completely in ML
    avoiding "Forth machine" engagement, therefore saving many
    cycles.

    I have done some work on optimisation in ciforth.
    This work has stalled, but on the infamous byte prime benchmark
    it was in the ballpark of SwiftForth and MPE Forth.
    (Disingenuous, because this was the example I used.)
    See https://home.hccnet.nl/a.w.m.van.der.horst/forthlecture5.html
    This is about folding, a generalisation of constant folding.
    This requires that you know the properties of the Forth words,
    i.e. that you can execute + at compile time if the inputs
    are constant.

    The next step is inlining, which requires transforming control
    structures to jumps. This eliminates all call/return pairs.

    A further step is replacing stack-offset operations with register
    operations. I have succeeded in eliminating the use of the return
    stack in a resulting block of code. Remember, there are no longer
    return addresses on the return stack.

    Then I got stalled. I introduced complicated rules to handle
    pop, push and operators, simplifying by interchanging and transforming.
    E.g. a rule
    movipop-pattern DUP matches? IF ?movipop-replace? ELSE
    tests if a pattern applies, then executes the replacement.
    This is one rule of the "no brain" matches, the simplest.

    "
    <! !Q! MOVI|X, !!T 0 {L,} ~!!T !Q! POP|X, !!T !>
    <A Q: POP|X, !TALLY NEXT A>
    { bufv 7 + C@ 7 AND bufc 1 + OR!U }
    optimisation movipop-pattern \ A object

    \ Relying heavily on a smart assembler/disassembler
    \ optimisation is a class name.

    \ :" it is all the same register."
    : movipop-same bufv 1+ C@ bufv 7 + C@ XOR 7 AND 0= ;

    \ Optional replace, leave " was replaced".
    : ?movipop-replace? movipop-same DUP IF replace THEN ;
    REGRESS HERE Q: MOVI|X, BX| 0 IL, Q: POP|X, AX| matches? movipop-same S: TRUE
    "

    This is going nowhere. Instead the technique of replacing
    cells offset from the data stack must be used.
    It has proven to work, totally replacing return stack manipulations
    with registers.

    I chase a different goal here: code that I can't improve by studying
    the assembler code. Pretty silly, given that i86 is a dying
    architecture.

    The goal can be attained. I remember a 4-page comparison function
    in C, compiled by the Intel C compiler. There was not a single
    thing to improve upon.

    And a general remark: optimise where it counts, replace the
    bottleneck. That is practical. All the rest is sport, like
    Mount Everest or the South Pole.

    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Feb 25 13:46:52 2025
    From Newsgroup: comp.lang.forth

    In article <a81ac9ee2ed92686e940a55bed9d4dfb@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    I expect an even bigger gain with the older fig-Forth
    model.

    Gain is only to be expected in the context of an application.

    --

    Groetjes Albert
    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Tue Feb 25 13:25:00 2025
    From Newsgroup: comp.lang.forth

    On Tue, 25 Feb 2025 11:40:46 +0000, LIT wrote:

    I agree with you - still it does take decent Forth programmer.
    Recall the ones described by Jeff Fox? These Forth programmers,
    that refused to use Machine Forth just because "they were hired
    to program in ANS Forth"?
    I don't believe they were able to recode anything in
    assembler - and note, it was about 30 years ago. Since that
    time assembler programming has become even less popular.

    A bit off-topic: I have been in a similar situation when some of
    our service engineers were very reluctant to modify inner
    software parts of controllers. The guys were not dumb, but with
    such modifications comes responsibility when something unexpected
    happens, like a system crash. So it was more of a legal than a
    technical issue.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 14:36:35 2025
    From Newsgroup: comp.lang.forth

    A bit off-topic: I have been in a similar situation when some of
    our service engineers were very reluctant to modify inner
    software parts of controllers. The guys were not dumb, but with
    such modifications comes responsibility when s.th. unexpected
    happens like a system crash. So it was more of a legal than a
    technical issue.

    Yes, I'm aware the reason may be different in a different
    case; still, Jeff portrayed that situation in a rather clear
    way: they didn't want to use Machine Forth just because "they
    were paid for ANS Forth programming" - they had signed a kind
    of agreement for that, therefore they "weren't interested" in
    any changes etc.

    Unfortunately we won't have any opportunity anymore to ask
    Jeff for more details.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 14:44:17 2025
    From Newsgroup: comp.lang.forth

    I have done some work on optimisation on ciforth.
    This work has stalled, but the infamous byte prime benchmark,
    was in the ballpark of swiftforth and mpeforth.
    (Disingenuous, because this was the example I used.)
    See https://home.hccnet.nl/a.w.m.van.der.horst/forthlecture5.html
    This is about folding, a generalisation of constant folding.
    This requires that you know the properties of the Forth Words,
    i.e. that you can execute + at compile time, if the inputs
    are constant.
    [..]
    Then I got stalled. I introduced complicated rules [..]

    This is a more sophisticated way.

    My proposal is rather humble: a modest extension
    of the Forth programming paradigm, from "all data goes
    through the stack" to "any data can go through the
    stack, but it's not strictly mandatory in every
    single case".

    Now I'm pondering the DO..LOOP construct; actually
    it probably doesn't need to rely on the return
    stack. I wonder how lightweight the loop can become
    when rewritten using OOS words.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Gerry Jackson@do-not-use@swldwa.uk to comp.lang.forth on Tue Feb 25 15:32:47 2025
    From Newsgroup: comp.lang.forth

    On 25/02/2025 14:36, LIT wrote:
    A bit off-topic: I have been in a similar situation when some of
    our service engineers were very reluctant to modify inner
    software parts of controllers. The guys were not dumb, but with
    such modifications comes responsibility when s.th. unexpected
    happens like a system crash. So it was more of a legal than a
    technical issue.

    Yes, I'm aware the reason may be different in a different
    case; still, Jeff portrayed that situation in a rather clear
    way: they didn't want to use Machine Forth just because "they
    were paid for ANS Forth programming" - they had signed a kind
    of agreement for that, therefore they "weren't interested" in
    any changes etc.

    Unfortunately we won't have any opportunity anymore to ask
    Jeff for more details.

    --

    Sounds like a management failure; they should have mandated that Machine
    Forth was to be used when the programmers were hired.
    --
    Gerry
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Tue Feb 25 18:43:28 2025
    From Newsgroup: comp.lang.forth

    On 25-02-2025 08:04, LIT wrote:
    var1 @ var2 @ + var3 !

    The above isn't messy at all.

    Frankly, it's far worse than "messy". We passed that station 30 miles ago.

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Tue Feb 25 18:52:02 2025
    From Newsgroup: comp.lang.forth

    On 24-02-2025 20:49, LIT wrote:
    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere wouldn't it
    be practical to use directly the variables?
    No. Because you should minimize the use of variables. So if you're using
    THREE variables, you're definitely doing something VERY WRONG.

    If you wanna write Forth, write Forth. If you wanna write C, write C.
    If you can't handle a stack, you're definitely a C programmer. It's very simple.

    Furthermore, it allows me to write code like this:

    CODE (MINUS) DSIZE (2); DDROP; DS (1) -= DS (0); NEXT;
    CODE (MUL) DSIZE (2); DDROP; DS (1) *= DS (0); NEXT;
    CODE (NEGATE) DSIZE (1); DS (1) = -(DS (1)); NEXT;
    CODE (OR) DSIZE (2); DDROP; DS (1) |= DS (0); NEXT;
    CODE (AND) DSIZE (2); DDROP; DS (1) &= DS (0); NEXT;
    CODE (XOR) DSIZE (2); DDROP; DS (1) ^= DS (0); NEXT;

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 18:04:30 2025
    From Newsgroup: comp.lang.forth

    zbigniew2011@gmail.com (LIT) writes:
    I know nothing about Machine Forth.
    BTW: is it available for download anywhere (if not
    commercial/restricted)?

    I think it's the compiler part of colorForth <https://colorforth.github.io/cf.htm>. Looking around a bit I find <https://colorforth.github.io/forth.html>, which shows how machine
    Forth primitives are compiled.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 18:17:55 2025
    From Newsgroup: comp.lang.forth

    zbigniew2011@gmail.com (LIT) writes:
    Now I'm pondering the DO..LOOP construct; actually
    it probably doesn't need to rely on the return
    stack.

    There are many ways to skin this cat, and I have written at length
    about that here. For performance you should keep those data in
    registers that you update in the loop. E.g., a simple way is to have
    the index and the limit around, and to update the index; then you
    should keep the index in a register; leaving the unchanging limit in
    memory is not so bad for performance on many CPUs.

    You need to save the old index and old limit when entering another
    do...loop, and restore them on exiting the do...loop, including when
    you exit with UNLOOP or THROW. Both SwiftForth and VFX switched to
    keeping the loop control parameters in registers in their 64-bit
    ports, and at first forgot to restore the old loop control parameters
    on THROW; they have fixed this bug as soon as it was found.

    Instead of keeping index and limit, there are also variants that keep
    other values around, to make +LOOP more efficient (sometimes at the
    cost of a more expensive I).
    <2021Jan10.112340@mips.complang.tuwien.ac.at> discusses a number of
    these variants.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 20:57:28 2025
    From Newsgroup: comp.lang.forth

    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere wouldn't it
    be practical to use directly the variables?
    No. Because you should minimize the use of variables. So if you're using THREE variables, you're definitely doing something VERY WRONG.

    If you wanna write Forth, write Forth. If you wanna write C, write C.
    If you can't handle a stack, you're definitely a C programmer. It's very simple..

    Forgive me for being contrary, but IMHO use of locals
    is much more C-ish than the use of "as many as" three
    (OMG!) variables in a single program. ;)

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 21:08:36 2025
    From Newsgroup: comp.lang.forth

    So I did some quite basic testing with x86
    fig-Forth for DOS. I devised 4 OOS words:

    :=: (exchange values among two variables)
    pop BX
    pop DI
    mov AX,[BX]
    xchg AX,[DI]
    mov [BX],AX
    jmp NEXT

    ++ (increment variable by one)
    pop BX
    inc WORD PTR [BX]
    jmp NEXT

    -- (similar to above, just uses dec -- not tested, it'll give same
    result)

    +> (add two variables then store result into third one)
    pop DI
    pop BX
    mov CX,[BX]
    pop BX
    mov AX,[BX]
    add AX,CX
    mov [DI],AX
    jmp NEXT

    How the simplistic tests have been done:

    7 VARIABLE V1
    8 VARIABLE V2
    9 VARIABLE V3

    : TOOK ( t1 t2 -- )
    DROP SPLIT TIME@ DROP SPLIT
    ROT SWAP - CR ." It took " U. ." seconds and "
    - 10 * U. ." milliseconds "
    ;

    : TEST1
    1000 0 DO 10000 0 DO
    ...expression...
    LOOP LOOP
    ;

    0 0 TIME! TIME@ TEST1 TOOK

    The results are (for the following expressions):

    V1 @ V2 @ + V3 ! - 25s 430ms
    V1 V2 V3 +> - 17s 240ms

    1 V1 +! - 14s 60ms
    V1 ++ - 10s 820ms

    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms
    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 22:35:42 2025
    From Newsgroup: comp.lang.forth

    zbigniew2011@gmail.com (LIT) writes:
    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms

    Too much OOS thinking? Try

    V1 @ V2 @ V1 ! V2 !

    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    The question is how often you use these new words in applications.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 22:44:04 2025
    From Newsgroup: comp.lang.forth

    On Tue, 25 Feb 2025 22:35:42 +0000, Anton Ertl wrote:

    zbigniew2011@gmail.com (LIT) writes:
    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms

    Too much OOS thinking? Try

    V1 @ V2 @ V1 ! V2 !

    Yep. My bad.

    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    The question is how often you use these new words in applications.

    Indeed - but that's not everything that's possible,
    just a few examples.
    Another example: I'll try to do a DO..LOOP that
    avoids the return stack, and I'm curious about
    the change.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Wed Feb 26 11:35:50 2025
    From Newsgroup: comp.lang.forth

    On 25/02/2025 11:10 pm, LIT wrote:
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
      dx pop  cx pop  bx pop  0 [bx] ax mov  cx bx xchg
      0 [bx] ax add  dx bx xchg  ax 0 [bx] mov  next
    end-code

    Timing (adjusted for loop time):

     var1 @ var2 @ + var3 !   8019 mS
     var1 var2 var3 +>        5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    ...

    I remain skeptical of such optimizations. Not even twice the performance,
    and the hope it represents a bottleneck in order to realize that gain.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.lang.forth on Wed Feb 26 00:50:52 2025
    From Newsgroup: comp.lang.forth

    LIT <zbigniew2011@gmail.com> wrote:
    So I did some quite basic testing with x86
    fig-Forth for DOS. I devised 4 OOS words:

    :=: (exchange values among two variables)
    pop BX
    pop DI
    mov AX,[BX]
    xchg AX,[DI]
    mov [BX],AX
    jmp NEXT

    ++ (increment variable by one)
    pop BX
    inc WORD PTR [BX]
    jmp NEXT

    -- (similar to above, just uses dec -- not tested, it'll give same
    result)

    +> (add two variables then store result into third one)
    pop DI
    pop BX
    mov CX,[BX]
    pop BX
    mov AX,[BX]
    add AX,CX
    mov [DI],AX
    jmp NEXT

    How the simplistic tests have been done:

    7 VARIABLE V1
    8 VARIABLE V2
    9 VARIABLE V3

    : TOOK ( t1 t2 -- )
    DROP SPLIT TIME@ DROP SPLIT
    ROT SWAP - CR ." It took " U. ." seconds and "
    - 10 * U. ." milliseconds "
    ;

    : TEST1
    1000 0 DO 10000 0 DO
    ...expression...
    LOOP LOOP
    ;

    0 0 TIME! TIME@ TEST1 TOOK

    The results are (for the following expressions):

    V1 @ V2 @ + V3 ! - 25s 430ms
    V1 V2 V3 +> - 17s 240ms

    1 V1 +! - 14s 60ms
    V1 ++ - 10s 820ms

    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms
    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    If your expected use case is operations on variables, then
    what you gain is merging @ and ! into the operations. Since
    you still have variables, the gain is at most a factor of 2
    (you replace things like V1 @ by plain V1). The cost is the need
    to have several extra operations. A potential alternative is
    a pair of operations, say PUSH and POP, and a Forth compiler
    that replaces a pair like V1 @ by PUSH(V1). Note that here the
    address of V1 is intended to be part of PUSH (so it will
    take as much space as separate V1 and @, but is only a
    single primitive).

    More generally, a simple "optimizer" that replaces short
    sequences of Forth primitives by a different, shorter sequence
    of primitives is likely to give a similar gain. However,
    the chance of a match decreases with the length of the sequence.
    Above you bet on relatively long sequences (and on the programmer
    writing the alternative sequence). Shorter sequences have more
    chance of matching, so you need a smaller number of them
    for a similar gain.

    Extra thing: while simple memory-to-memory operations appear
    with some frequency, the rather typical pattern is expressions
    that produce some value that is immediately used by another
    operation; a stack is a very good fit for such use. One can
    do better than using the machine stack, namely keeping things in
    registers, but that means generating machine code and doing
    optimization. OTOH on 64-bit machines machine code is
    very natural: machine instructions are typically smaller
    than machine words (which are the natural unit for threaded
    code) and Forth primitives are likely to produce a very
    small number of instructions.
    --
    Waldek Hebisch
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 09:08:19 2025
    From Newsgroup: comp.lang.forth

    I remain skeptical of such optimizations. Not even twice the
    performance, and the hope it represents a bottleneck in order
    to realize that gain.

    I've got a feeling it would have had more
    significance in the 8088 era, say the IBM 5150
    or XTs. The 486 is probably already "too good"
    to see as much as a 50% gain.
    I've got a working XT board - if I manage to
    get at least an FDD interface for it (no,
    not today... it'll take some time) I'll
    do some more testing.

    In general I've got a feeling that this
    approach may be the more profitable, the
    less suitable the CPU is for Forth - say the 8051,
    lacking registers.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Feb 26 08:30:28 2025
    From Newsgroup: comp.lang.forth

    antispam@fricas.org (Waldek Hebisch) writes:
    A potential alternative is
    a pair of operations, say PUSH and POP, and a Forth compiler
    that replaces a pair like V1 @ by PUSH(V1). Note that here the
    address of V1 is intended to be part of PUSH (so it will
    take as much space as separate V1 and @, but is only a
    single primitive).

    In Gforth variables are compiled as "lit <addr>", and Gforth has a
    primitive LIT@, and a generalized constant-folding optimization that
    replaces "lit <addr> @" with "LIT@ <addr>". In the Gforth image there
    are 490 uses of lit@ (out of 33611 uses of primitives) and 76
    occurrences of "lit @" (in parts that are compiled before the
    generalized constant-folding is active). There are also 293
    occurrences of "lit !".

    However, given the minimal difference between the code produced for
    "LIT@" and "LIT @", LIT@ is no longer beneficial. E.g.:

    variable v ok
    : foo1 v @ + ; ok
    : foo2 v [ basic-block-end ] @ + ; ok
    see-code foo1
    $7F27184A08B0 lit@ 1->2
    $7F27184A08B8 v
    7F271806B580: mov rax,$08[rbx]
    7F271806B584: mov r15,[rax]
    $7F27184A08C0 + 2->1
    7F271806B587: add r13,r15
    $7F27184A08C8 ;s 1->1
    ...
    see-code foo2
    $7F27184A08F8 lit 1->2
    $7F27184A0900 v
    7F271806B596: mov r15,$08[rbx]
    $7F27184A0908 @ 2->2
    7F271806B59A: mov r15,[r15]
    $7F27184A0910 + 2->1
    7F271806B59D: add r13,r15
    $7F27184A0918 ;s 1->1
    ...

    More generally, a simple "optimizer" that replaces short
    sequences of Forth primitives by different, shorter sequence
    of primitives is likely to give similar gain. However,
    chance of match decreases with length of the sequence.

    Gforth has that as static superinstructions. You can see the
    sequences in
    <http://git.savannah.gnu.org/cgit/gforth.git/tree/peeprules.vmg>, in
    the lines before there is any occurrence of prim-states or something
    similar. As you can see, many of the formerly-used sequences are now
    commented out, because static superinstructions do not play well with
    a) static stack caching (currently static superinstructions only work
    for the default stack cache state) and b) IP-update optimization (if
    one of the primitives in the sequence has an immediate argument (e.g.,
    LIT), you would need additional variants for various IP offsets, or
    update the IP before the sequence).

    The remaining static superinstructions

    * have to do with stacks where we do not have stack caching (FP stack,
    locals stack, return stack),

    * are combinations of comparison primitives and ?BRANCH (this avoids
    the need to reify the result of the comparison in a general-purpose
    register), or

    * are sequences of typical memory-access words (not because they occur
    so often, but because it's better to have a small number of words
    that can be combined, and a number of combinations in the optimizer
    than to have a combinatorial explosion of words).
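
    At the source level one can approximate such fused memory-access
    sequences by hand; a minimal sketch (the names @+ and !+ are
    illustrative inventions here, not Gforth's actual superinstructions):

    ```forth
    \ hand-fused memory-access words (illustrative sketch)
    : @+  ( addr -- n addr' )  dup @ swap cell+ ;   \ fetch, then advance one cell
    : !+  ( n addr -- addr' )  tuck ! cell+ ;       \ store, then advance one cell
    ```

    An optimizer does the same thing internally: it matches the
    two-primitive sequence and emits one fused primitive instead.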

    Above you bet on relatively long sequences (and on the programmer
    writing an alternative sequence). Shorter sequences have a better
    chance of matching, so you need a smaller number of them
    for a similar gain.

    That's certainly our experience. Long sequences with high dynamic
    counts often come out of the inner loop of a single benchmark, and do
    not help other programs at all. We later preferred to go with static
    usage counts (i.e., the sequence occurs several times in the code),
    and this naturally leads to short sequences.

    One can
    do better than using the machine stack, namely keeping things in
    registers, but that means generating machine code and doing
    optimization.

    Gforth does stack caching at the level of primitives, by having
    several variants of the primitives for different start and end states
    of the primitives, and using a shortest-path search for finding out
    which combination of these variants to use. However, for multiple
    stacks this leads to a large number of states, and the shortest-path
    algorithm becomes too expensive. For now we only stack-cache the data
    stack.

    For extending this to multiple stacks, I see several alternatives:

    * Use a greedy algorithm instead of an optimal shortest-path
    algorithm. The difference is probably non-existent in most cases.

    * Manage the stack cache using register allocation techniques instead
    of representing it as an abstract state. This would often produce
    similar results as the greedy technique, but it can also handle
    stack manipulation words cheaply without having an explosion of
    stack states and the related complexity in the generator that
    generates the states and the tables for the state-handling.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Feb 26 11:45:52 2025
    From Newsgroup: comp.lang.forth

    On 25-02-2025 21:57, LIT wrote:
    Forgive me for being contrary, but IMHO use of locals
    is much more C-ish than the use of "as many as" three
    (OMG!) variables in a single program. ;)

    The goal is to use as few variables as possible. There are plenty of
    words I wrote using NO variables at all. Arrays (and strings) are
    another story, since it's hard to represent them using a stack (unless
    you dump every single element there - which is not realistic).

    If I apply that rule, I wrote an entire 1000+ line BASIC interpreter
    using *three* variables (stack frame pointer, partition pointer and a
    counter on the number of currently emitted characters on a line - TAB() remember?).

    So yeah, three variables is quite a lot. It should be rare enough not
    to worry about too much - let alone require special facilities to
    accommodate such a construct.

    Hans Bezemer
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Wed Feb 26 11:48:04 2025
    From Newsgroup: comp.lang.forth

    In article <a81ac9ee2ed92686e940a55bed9d4dfb@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    I expect even bigger gain in case of older fig-Forth
    model.

    ciforth is actually fig-Forth 5.5.3 with some ansification
    and abandonment of seventies-style tricks.

    I don't expect a gain in ciforth from this.

    --

    Groetjes Albert
    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Wed Feb 26 12:04:10 2025
    From Newsgroup: comp.lang.forth

    In article <2025Feb25.233542@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    zbigniew2011@gmail.com (LIT) writes:
    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms

    Too much OOS thinking? Try

    V1 @ V2 @ V1 ! V2 !

    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    The question is how often you use these new words in applications.

    These words might make sense connected to a sorting application. 1]
    Define those words there and don't clobber the global name space.


    - anton

    1] After testing of course.

    Groetjes Albert
    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Feb 26 11:23:03 2025
    From Newsgroup: comp.lang.forth

    On Wed, 26 Feb 2025 9:08:19 +0000, LIT wrote:

    I remain skeptical of such optimizations. Not even twice the
    performance
    and the hope it represents a bottle-neck in order to realize that gain.

    I've got a feeling it would have more
    significance in the 8088 era, say the IBM 5150
    or XTs. The 486 is probably already "too good"
    to show as much as a 50% gain.
    I've got a working XT board - if I manage to
    get at least an FDD interface for it (no,
    not today... it'll take some time) I'll
    do some more testing.

    Save yourself the time: use an emulator, e.g. PCem, DOSBox(X) or QEMU
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 11:31:20 2025
    From Newsgroup: comp.lang.forth

    Save yourself the time: use an emulator eg PCem, DOSBox(X) or QEMU

    I was usually using QEMU, but for some time now
    it no longer compiles properly. :(

    Some improvements(?) in GCC about two years ago.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 11:29:03 2025
    From Newsgroup: comp.lang.forth

    If I apply that rule, I wrote an entire 1000+ line BASIC interpreter
    using *three* variables (stack frame pointer, partition pointer and a
    counter on the number of currently emitted characters on a line - TAB() remember?).

    I didn't create BASIC interpreters, but I'm afraid
    only rather trivial programs can "live" without a
    handful of variables. For example: how do you create
    even a modest (screen-oriented) editor without adding
    several variables that reflect its state - where the
    cursor is at the moment, what's the filename in use,
    what the user's settings/preferences are - all
    that - etc. etc.?
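
    To make that concrete, a minimal sketch of such editor state (all
    names invented here for illustration):

    ```forth
    \ minimal editor state kept in global variables (illustrative)
    variable cur-row   variable cur-col   \ cursor position
    variable top-line                     \ first visible line on screen
    create fname 64 allot                 \ counted string: current filename

    : cursor-home ( -- )  0 cur-row !  0 cur-col ! ;
    : scroll-down ( -- )  1 top-line +! ;
    ```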

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 11:41:02 2025
    From Newsgroup: comp.lang.forth

    These words might make sense connected to a sorting application. 1]
    Define those words there and don't clobber the global name space.

    These words, as I already wrote, were just
    examples to illustrate the approach, which
    isn't limited to operations commonly associated
    with sorting-type work.

    I created also ROR/ROL words, that have nothing
    to do with any sorting processes:

    ROR ( n1 u -- n2 ? )
    xor AX,AX
    pop CX
    pop DX
    ror DX,CL
    adc AX,0
    jmp DPUSH
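
    For comparison, a carry-less high-level sketch of a 16-bit rotate
    right (assuming cells wider than 16 bits; the carry result returned
    by the code word above is not modelled):

    ```forth
    \ 16-bit rotate right in high-level Forth (illustrative sketch)
    : ror16 ( n u -- n' )
      15 and  2dup rshift >r        \ low part: n shifted right by u mod 16
      16 swap - lshift $FFFF and    \ high part: bits rotated around
      r> or ;
    ```

    E.g. $8001 1 ror16 gives $C000.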

    What for? At the moment just tinkering with OOS
    approach, trying to explore it some more.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 11:45:49 2025
    From Newsgroup: comp.lang.forth

    Oh, that's the code for the "other" ROR... :D
    Never mind, you get the idea.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Feb 26 12:35:48 2025
    From Newsgroup: comp.lang.forth

    Results for iForth64.

    The runtime of test3 is remarkable. I think not much can be done
    about it, given the context.

    -marcel

    ---

    VARIABLE V1 7 V1 !
    VARIABLE V2 8 V2 !
    VARIABLE V3 9 V3 !

    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ++ ( a -- ) \ increment variable by one
    1 SWAP +! ;

    : +> ( a b c -- ) \ add two variables then store result into third one
    -ROT @ SWAP @ + SWAP ! ;

    : t1a V1 @ V2 @ + V3 ! ; : t1b V1 V2 V3 +> ;
    : t2a 1 V1 +! ; : t2b V1 ++ ;
    : t3a V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! ; : t3b V1 V2 :=: ;

    : TESTa S" TIMER-RESET #100000 0 DO #10000 0 DO " EVALUATE ; IMMEDIATE
    : TESTb S" LOOP LOOP 3 SPACES .ELAPSED " EVALUATE ; IMMEDIATE

    : test1 CR ." \ TEST1 : " TESTa t1a TESTb TESTa t1b TESTb ;
    : test2 CR ." \ TEST2 : " TESTa t2a TESTb TESTa t2b TESTb ;
    : test3 CR ." \ TEST3 : " TESTa t3a TESTb TESTa t3b TESTb ;

    : TESTS test1 test2 test3 ;

    TESTS
    \ TEST1 : 1.646 seconds elapsed. 1.661 seconds elapsed.
    \ TEST2 : 1.778 seconds elapsed. 1.728 seconds elapsed.
    \ TEST3 : 2.194 seconds elapsed. 1.645 seconds elapsed. ok
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Feb 26 13:05:35 2025
    From Newsgroup: comp.lang.forth

    This 1 billion times test of 3 cache cells is indeed remarkable. ;-)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Feb 26 14:32:50 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    <https://www.complang.tuwien.ac.at/forth/programs/sort.fs> contains:

    : exchange ( addr1 addr2 -- )
    over @ over @ >r swap ! r> swap ! ;

    Let's see if Gforth produces better code for one of them:

    see-code :=:
    $7FBD6B6A06A8 over 1->2
    7FBD6B26B3B0: mov r15,$08[r10]
    $7FBD6B6A06B0 @ 2->2
    7FBD6B26B3B4: mov r15,[r15]
    $7FBD6B6A06B8 >r 2->1
    7FBD6B26B3B7: mov -$08[r14],r15
    7FBD6B26B3BB: sub r14,$08
    $7FBD6B6A06C0 dup 1->2
    7FBD6B26B3BF: mov r15,r13
    $7FBD6B6A06C8 @ 2->2
    7FBD6B26B3C2: mov r15,[r15]
    $7FBD6B6A06D0 rot 2->3
    7FBD6B26B3C5: mov r9,$08[r10]
    7FBD6B26B3C9: add r10,$08
    $7FBD6B6A06D8 ! 3->1
    7FBD6B26B3CD: mov [r9],r15
    $7FBD6B6A06E0 r> 1->2
    7FBD6B26B3D0: mov r15,[r14]
    7FBD6B26B3D3: add r14,$08
    $7FBD6B6A06E8 swap 2->3
    7FBD6B26B3D7: add r10,$08
    7FBD6B26B3DB: mov r9,r13
    7FBD6B26B3DE: mov r13,[r10]
    $7FBD6B6A06F0 ! 3->1
    7FBD6B26B3E1: mov [r9],r15
    $7FBD6B6A06F8 ;s 1->1
    7FBD6B26B3E4: mov rbx,[r14]
    7FBD6B26B3E7: add r14,$08
    7FBD6B26B3EB: mov rax,[rbx]
    7FBD6B26B3EE: jmp eax

    see-code exchange
    $7FBD6B6A0728 over 1->2
    7FBD6B26B3F0: mov r15,$08[r10]
    $7FBD6B6A0730 @ 2->2
    7FBD6B26B3F4: mov r15,[r15]
    $7FBD6B6A0738 over 2->3
    7FBD6B26B3F7: mov r9,r13
    $7FBD6B6A0740 @ 3->3
    7FBD6B26B3FA: mov r9,[r9]
    $7FBD6B6A0748 >r 3->2
    7FBD6B26B3FD: mov -$08[r14],r9
    7FBD6B26B401: sub r14,$08
    $7FBD6B6A0750 swap 2->3
    7FBD6B26B405: add r10,$08
    7FBD6B26B409: mov r9,r13
    7FBD6B26B40C: mov r13,[r10]
    $7FBD6B6A0758 ! 3->1
    7FBD6B26B40F: mov [r9],r15
    $7FBD6B6A0760 r> 1->2
    7FBD6B26B412: mov r15,[r14]
    7FBD6B26B415: add r14,$08
    $7FBD6B6A0768 swap 2->3
    7FBD6B26B419: add r10,$08
    7FBD6B26B41D: mov r9,r13
    7FBD6B26B420: mov r13,[r10]
    $7FBD6B6A0770 ! 3->1
    7FBD6B26B423: mov [r9],r15
    $7FBD6B6A0778 ;s 1->1
    7FBD6B26B426: mov rbx,[r14]
    7FBD6B26B429: add r14,$08
    7FBD6B26B42D: mov rax,[rbx]
    7FBD6B26B430: jmp eax

    These things are hard to predict:-)

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.lang.forth on Wed Feb 26 08:04:44 2025
    From Newsgroup: comp.lang.forth

    On Wed, 26 Feb 2025 11:41:02 +0000
    zbigniew2011@gmail.com (LIT) wrote:

    These words, as I already wrote, were just examples to illustrate the approach, which isn't limited to operations commonly associated
    to do sorting kind of work.

    I created also ROR/ROL words, that have nothing to do with any
    sorting processes:

    I mean, that's the whole thing with Forth - you *can* define any words
    you like, based on your needs, extending the basic set of operations
    into a whole domain-specific language suited to the problem you're
    trying to solve. But that's not in itself a strong argument for adding
    XYZ to the "standard" dictionary.*

    * (And let the use of "standard" in connection with Forth never pass by
    without a muffled snort into one's sleeve.)

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 16:18:32 2025
    From Newsgroup: comp.lang.forth

    These words, as I already wrote, were just examples to illustrate the
    approach, which isn't limited to operations commonly associated
    to do sorting kind of work.

    I created also ROR/ROL words, that have nothing to do with any
    sorting processes:

    I mean, that's the whole thing with Forth - you *can* define any words
    you like, based on your needs, extending the basic set of operations
    into a whole domain-specific language suited to the problem you're
    trying to solve. But that's not in itself a strong argument for adding
    XYZ to the "standard" dictionary.*

    If you could, please, remind me when and where I was
    proposing to add these XYZs to standard dictionary?

    Thanks in advance!

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Feb 26 17:46:13 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mhx@iae.nl (mhx) writes:
    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    Another variant:

    : exchange ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    This uses the primitive

    '!@' ( u1 a-addr -- u2 ) gforth-experimental "store-fetch"
    load U2 from A_ADDR, and store U1 there, as atomic operation

    I worry that the atomic part will result in it being slower than the
    versions that do not use !@. Let's measure that:

    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;

    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : bench-exchange ( addr1 addr2 -- )
    100000000 0 do 2dup exchange loop ;

    : bench-:=: ( addr1 addr2 -- )
    100000000 0 do 2dup :=: loop ;

    variable v1
    variable v2

    1 v1 !
    2 v2 !

    Measurement with
    perf stat -e cycles -e instructions gforth-fast xxxx.fs -e "v1 v2 bench-exchange bye"
    perf stat -e cycles -e instructions gforth-fast xxxx.fs -e "v1 v2 bench-:=: bye"

    Results on a Zen4:

    exchange :=:
    877_054_156 812_761_422 cycles
    3_708_692_329 3_908_642_117 instructions

    So the !@ variant is indeed slower, but only a little (0.65 cycles
    per execution of these words); however, I would have expected either
    a big slowdown (from latency when dealing with the memory subsystem,
    broadcasting to other cores, etc.) or none at all.

    And here's the code:
    see-code exchange
    $7EFDC12A06A8 over 1->2
    7EFDC0DEA3B0: mov r15,$08[r10]
    $7EFDC12A06B0 @ 2->2
    7EFDC0DEA3B4: mov r15,[r15]
    $7EFDC12A06B8 swap 2->1
    7EFDC0DEA3B7: mov [r10],r15
    7EFDC0DEA3BA: sub r10,$08
    $7EFDC12A06C0 !@ 1->1
    7EFDC0DEA3BE: mov rax,$08[r10]
    7EFDC0DEA3C2: add r10,$08
    7EFDC0DEA3C6: xchg $00[r13],rax
    7EFDC0DEA3CA: mov r13,rax
    $7EFDC12A06C8 swap 1->2
    7EFDC0DEA3CD: mov r15,$08[r10]
    7EFDC0DEA3D1: add r10,$08
    $7EFDC12A06D0 ! 2->0
    7EFDC0DEA3D5: mov [r15],r13
    $7EFDC12A06D8 ;s 0->1
    7EFDC0DEA3D8: mov r13,$08[r10]
    7EFDC0DEA3DC: add r10,$08
    7EFDC0DEA3E0: mov rbx,[r14]
    7EFDC0DEA3E3: add r14,$08
    7EFDC0DEA3E7: mov rax,[rbx]
    7EFDC0DEA3EA: jmp eax

    see-code :=:
    $7FBD6B6A06A8 over 1->2
    7FBD6B26B3B0: mov r15,$08[r10]
    $7FBD6B6A06B0 @ 2->2
    7FBD6B26B3B4: mov r15,[r15]
    $7FBD6B6A06B8 >r 2->1
    7FBD6B26B3B7: mov -$08[r14],r15
    7FBD6B26B3BB: sub r14,$08
    $7FBD6B6A06C0 dup 1->2
    7FBD6B26B3BF: mov r15,r13
    $7FBD6B6A06C8 @ 2->2
    7FBD6B26B3C2: mov r15,[r15]
    $7FBD6B6A06D0 rot 2->3
    7FBD6B26B3C5: mov r9,$08[r10]
    7FBD6B26B3C9: add r10,$08
    $7FBD6B6A06D8 ! 3->1
    7FBD6B26B3CD: mov [r9],r15
    $7FBD6B6A06E0 r> 1->2
    7FBD6B26B3D0: mov r15,[r14]
    7FBD6B26B3D3: add r14,$08
    $7FBD6B6A06E8 swap 2->3
    7FBD6B26B3D7: add r10,$08
    7FBD6B26B3DB: mov r9,r13
    7FBD6B26B3DE: mov r13,[r10]
    $7FBD6B6A06F0 ! 3->1
    7FBD6B26B3E1: mov [r9],r15
    $7FBD6B6A06F8 ;s 1->1
    7FBD6B26B3E4: mov rbx,[r14]
    7FBD6B26B3E7: add r14,$08
    7FBD6B26B3EB: mov rax,[rbx]
    7FBD6B26B3EE: jmp eax

    The difference looks bigger than it is: There are lines for 4
    additional primitives (no influence on performance) and 2 additional instructions, resulting in a 6-line difference.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Wed Feb 26 11:44:15 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;

    looks a little simpler.

    The effort of implementing special native words for this, though, is
    probably better spent on locals.

    : ex {: x y -- :} x @ y @ x ! y ! ;
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Feb 26 21:02:52 2025
    From Newsgroup: comp.lang.forth

    Really? ;-)

    NT/FORTH (C) 2005 Peter Fälth Version 1.6-983-824 Compiled on
    2017-12-03
    Running on Windows NT 6.2 Build 9200
    Current directory is e:\Develop\Forth\lxf
    : ex {: x y -- :} x @ y @ x ! y ! ; ok
    see ex
    A49E58 40917C 23 C80000 5 normal EX

    40917C 8B4500 mov eax , [ebp]
    40917F 8B00 mov eax , [eax]
    409181 8BCB mov ecx , ebx
    409183 8B09 mov ecx , [ecx]
    409185 8B5500 mov edx , [ebp]
    409188 890A mov [edx] , ecx
    40918A 8903 mov [ebx] , eax
    40918C 8B5D04 mov ebx , [ebp+4h]
    40918F 8D6D08 lea ebp , [ebp+8h]
    409192 C3 ret near
    ok
    : :=: OVER @ >R DUP @ ROT ! R> SWAP ! ; ok
    see :=:
    A49E6C 409193 23 C80000 5 normal :=:

    409193 8B4500 mov eax , [ebp]
    409196 8B00 mov eax , [eax]
    409198 8BCB mov ecx , ebx
    40919A 8B09 mov ecx , [ecx]
    40919C 8B5500 mov edx , [ebp]
    40919F 890A mov [edx] , ecx
    4091A1 8903 mov [ebx] , eax
    4091A3 8B5D04 mov ebx , [ebp+4h]
    4091A6 8D6D08 lea ebp , [ebp+8h]
    4091A9 C3 ret near
    ok
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Feb 26 22:23:51 2025
    From Newsgroup: comp.lang.forth

    On 26-02-2025 12:29, LIT wrote:
    I didn't create BASIC interpreters, but I'm afraid
    just the rather trivial programs can "live" without
    handful of variables.
    I won't call a BASIC interpreter trivial.

    For example: how do you create
    even modest (screen-oriented) editor without adding
    several variables that reflect its state - where the
    cursor is at the moment, what's the filename in use,
    how are values of user's settings/preferences - all
    that - etc. etc.?

    Like I said - arrays are a different thing. But my repository holds a
    screen editor. It has two variables. Which screen and which position in
    that screen. Your trivial example holds three. One for parameter one,
    one for parameter two and one for the result.

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 21:48:49 2025
    From Newsgroup: comp.lang.forth

    For example: how do you create
    even modest (screen-oriented) editor without adding
    several variables that reflect its state - where the
    cursor is at the moment, what's the filename in use,
    how are values of user's settings/preferences - all
    that - etc. etc.?

    Like I said - arrays are a different thing.

    Oh, so it's simpler way than anyone could guess:
    "just use different term, avoid the word 'variable' ".
    Done. :)

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Feb 27 10:53:40 2025
    From Newsgroup: comp.lang.forth

    On 27/02/2025 3:18 am, LIT wrote:
    These words, as I already wrote, were just examples to illustrate the
    approach, which isn't limited to operations commonly associated
    to do sorting kind of work.

    I created also ROR/ROL words, that have nothing to do with any
    sorting processes:

    I mean, that's the whole thing with Forth - you *can* define any words
    you like, based on your needs, extending the basic set of operations
    into a whole domain-specific language suited to the problem you're
    trying to solve. But that's not in itself a strong argument for adding
    XYZ to the "standard" dictionary.*

    If you could, please, remind me when and where I was
    proposing to add these XYZs to standard dictionary?

    Thanks in advance!

    Then you have an existing application that demonstrates the benefit after having examined and ruled out other ways of optimizing the code?

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Feb 27 07:29:44 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;

    looks a little simpler.

    This inspires another one:

    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    With some other versions this results in the following benchmark
    program:

    [defined] !@ [if]
    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;
    [then]

    \ Paul Rubin <875xkwo5io.fsf@nightsong.com>
    : ex ( addr1 addr2 -- )
    2>r 2r@ @ swap @ r> ! r> ! ;

    : ex-locals {: x y -- :} x @ y @ x ! y ! ;

    \ Anton Ertl
    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    \ Marcel Hendrix
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    variable v1
    variable v2

    1 v1 !
    2 v2 !

    : bench ( "name" -- )
    v1 v2
    :noname ]] 100000000 0 do 2dup [[ parse-name evaluate ]] loop ; [[
    execute ;

    Results (on Zen4):

    gforth-fast (development):
    :=:            exchange       ex             ex-locals      exchange2
    814_881_277    879_389_133    928_825_521    875_574_895    808_543_975    cycles
    3_908_874_164  3_708_891_336  4_508_966_770  4_209_778_557  3_708_865_505  instructions

    vfx64 5.43:
    :=:            ex             ex-locals      exchange2
    335_298_202    432_614_804    928_542_678    336_134_513    cycles
    1_166_400_242  1_366_264_943  2_866_547_067  1_166_280_641  instructions

    And here's the code produced by gforth-fast:

    :=: ex ex-locals exchange2
    over 1->2 2>r 1->0 l 1->1 dup >r 1->1
    mov r15,$08[r10] add r10,$08 mov rax,rbp >r 1->1
    @ 2->2 mov r15,r13 add r10,$08 mov -$8[r14],r13
    mov r15,[r15] mov r13,[r10] lea rbp,-$8[rbp] sub r14,$08
    r 2->1 mov -$8[r14],r13 mov -$8[rax],r13 @ 1->1
    mov -$8[r14],r15 sub r14,$10 mov r13,[r10] mov r13,$00[r13]
    sub r14,$08 mov [r14],r15 >l @local0 1->1 over 1->2
    dup 1->2 2r@ 0->2 @local0 1->1 mov r15,$08[r10]
    mov r15,r13 mov r13,$08[r14] mov rax,rbp @ 2->2
    @ 2->2 mov r15,[r14] lea rbp,-$8[rbp] mov r15,[r15]
    mov r15,[r15] @ 2->2 mov -$8[rax],r13 r> 2->3
    rot 2->3 mov r15,[r15] @ 1->1 mov r9,[r14]
    mov r9,$08[r10] swap 2->2 mov r13,$00[r13] add r14,$08
    add r10,$08 mov rax,r13 @local1 1->2 ! 3->1
    ! 3->1 mov r13,r15 mov r15,$08[rbp] mov [r9],r15
    mov [r9],r15 mov r15,rax @ 2->2 swap 1->2
    1->2 @ 2->2 mov r15,[r15] mov r15,$08[r10]
    mov r15,[r14] mov r15,[r15] @local0 2->3 add r10,$08
    add r14,$08 r> 2->3 mov r9,$00[rbp] ! 2->0
    swap 2->3 mov r9,[r14] ! 3->1 mov [r15],r13
    add r10,$08 add r14,$08 mov [r9],r15 ;s 0->1
    mov r9,r13 ! 3->1 @local1 1->2 mov r13,$08[r10]
    mov r13,[r10] mov [r9],r15 mov r15,$08[rbp] add r10,$08
    ! 3->1 r> 1->2 ! 2->0 mov rbx,[r14]
    mov [r9],r15 mov r15,[r14] mov [r15],r13 add r14,$08
    ;s 1->1 add r14,$08 lp+2 0->1 mov rax,[rbx]
    mov rbx,[r14] ! 2->0 mov r13,$08[r10] jmp eax
    add r14,$08 mov [r15],r13 add r10,$08
    mov rax,[rbx] ;s 0->1 add rbp,$10
    jmp eax mov r13,$08[r10] ;s 1->1
    add r10,$08 mov rbx,[r14]
    mov rbx,[r14] add r14,$08
    add r14,$08 mov rax,[rbx]
    mov rax,[rbx] jmp eax
    jmp eax

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Feb 27 09:28:15 2025
    From Newsgroup: comp.lang.forth

    These words, as I already wrote, were just examples to illustrate the
    approach, which isn't limited to operations commonly associated
    to do sorting kind of work.

    I created also ROR/ROL words, that have nothing to do with any
    sorting processes:

    I mean, that's the whole thing with Forth - you *can* define any words
    you like, based on your needs, extending the basic set of operations
    into a whole domain-specific language suited to the problem you're
    trying to solve. But that's not in itself a strong argument for adding
    XYZ to the "standard" dictionary.*

    If you could, please, remind me when and where I was
    proposing to add these XYZs to standard dictionary?

    Thanks in advance!

    Then you have an existing application that demonstrates the benefit
    after
    having examined and ruled out other ways of optimizing the code?

    You expected me to "have an existing application..." etc. etc.
    immediately after I came up with this idea? You mean: within
    hours, literally?

    I'd like to create one - unfortunately, I'm busy with other
    things.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Feb 27 12:51:12 2025
    From Newsgroup: comp.lang.forth

    On 26-02-2025 22:48, LIT wrote:
    Oh, so it's simpler way than anyone could guess:
    "just use different term, avoid the word 'variable' ".
    Done. :)

    Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
    for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

    You know, when Leo Brodie said "the stack is not an array" it was kind
    of a tautology. As if saying "a cactus is not a cow". An array (yes,
    strings can be considered to be special arrays) is accessed randomly,
    and hence has a time complexity of O(1). A stack is accessed
    sequentially and hence has a time complexity of O(n).

    Now Chuck has made that stack a bit more accessible by providing stack
    operators, but those go three elements deep (usually). Anyway, it is
    impossible to represent an array on a stack, since a stack cannot be
    accessed randomly.

    The only way you can have anything array-related on the stack is its
    address or the contents of a single element. However, a stack is
    perfectly capable of replacing (local) variables, since their values
    can reside on the stack. It does require some skill, though, to manage
    that stack properly, agreed.

    And sure - it can be handy to have a fast word to exchange the values of
    two array elements. But that is an entirely different question from
    having two variables, do some arithmetic on their values and store the
    result in another variable.
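
    For the array-element case, a sketch built on mhx's :=: body (the
    array a and the indexing word a[] are invented here for illustration):

    ```forth
    create a 8 cells allot                \ example array of 8 cells
    : a[] ( i -- addr )  cells a + ;      \ address of element i
    : xchg-elems ( i j -- )               \ exchange the values of a[i] and a[j]
      a[] swap a[]  over @ >r dup @ rot ! r> swap ! ;
    ```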

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Feb 27 12:58:56 2025
    From Newsgroup: comp.lang.forth

    On 27-02-2025 10:28, LIT wrote:
    You expected me to "have an existing application..." etc. etc.
    immediately after I came up with this idea? You mean: within
    hours range, literally?

    I'd like to create one - unfortunately, I'm busy with others
    things.

    Of course. Any normal human being would only look for solutions when
    having a problem. You don't look for (C-like) solutions when there is no problem to solve.

    There are plenty of areas I never covered by inventing a hammer, because
    I never had a nail to hit. That's why many of my most-beloved libraries
    are related to my professional work. I had a problem, I fixed it and now
    I can reuse it.

    A lot of the libraries I wrote "just for fun" remain unused for exactly
    that reason - I obviously never really needed them to begin with.

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Feb 27 12:21:28 2025
    From Newsgroup: comp.lang.forth

    A lot of the libraries I wrote "just for fun" remain unused for exactly
    that reason - I obviously never really needed them to begin with.

    My heart goes out to you.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Feb 27 12:19:16 2025
    From Newsgroup: comp.lang.forth

Oh, so it's a simpler way than anyone could guess:
"just use a different term, avoid the word 'variable' ".
    Done. :)

    Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
    for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

„In computer science, an array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Feb 27 15:11:01 2025
    From Newsgroup: comp.lang.forth

    On 27-02-2025 13:19, LIT wrote:
Oh, so it's a simpler way than anyone could guess:
"just use a different term, avoid the word 'variable' ".
    Done. :)

    Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
    for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

„In computer science, an array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    Interesting.. In your class they taught computer science by Wikipedia?
    Didn't they have money for real books? Must have been a real poor city college..

    Hans Bezemer


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Feb 27 14:20:28 2025
    From Newsgroup: comp.lang.forth

Oh, so it's a simpler way than anyone could guess:
"just use a different term, avoid the word 'variable' ".
    Done. :)

    Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
    for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

„In computer science, an array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    Interesting.. In your class they taught computer science by Wikipedia?
    Didn't they have money for real books? Must have been a real poor city college..

At least in that college they didn't teach that
    „Forth uses FIFO stack” -- as they taught you in
    your really rich city college. :]

    Anything wrong with the quoted definition?

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Thu Feb 27 17:47:08 2025
    From Newsgroup: comp.lang.forth

    An even weirder result for TEST3, although it probably has more to do
    with my aging DO LOOP construct.

    -marcel

    ---
    ANEW -oos

    VARIABLE V1 7 V1 !
    VARIABLE V2 8 V2 !
    VARIABLE V3 9 V3 !

    : :=: ( a b -- ) \ exchange values among two variables
    PARAMS| a b | a @ b @ swap b ! a ! ;

    : ++ ( a -- ) \ increment variable by one
    1 SWAP +! ;

    : +> ( a b c -- ) \ add two variables then store result into third one
    PARAMS| a b c | a @ b @ + c ! ;

    : t1a V1 @ V2 @ + V3 ! ; : t1b V1 V2 V3 +> ;
    : t2a 1 V1 +! ; : t2b V1 ++ ;
    : t3a V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! ; : t3b V1 V2 :=: ;

    : TESTS
    CR ." \ TEST1 : " TIMER-RESET #1000000000 0 DO t1a t1a t1a t1a t1a
    t1a t1a t1a t1a t1a LOOP .ELAPSED
    3 SPACES TIMER-RESET #1000000000 0 DO t1b t1b t1b t1b t1b
    t1b t1b t1b t1b t1b LOOP .ELAPSED
    CR ." \ TEST2 : " TIMER-RESET #1000000000 0 DO t2a t2a t2a t2a t2a
    t2a t2a t2a t2a t2a LOOP .ELAPSED
    3 SPACES TIMER-RESET #1000000000 0 DO t2b t2b t2b t2b t2b
    t2b t2b t2b t2b t2b LOOP .ELAPSED
    CR ." \ TEST3 : " TIMER-RESET #1000000000 0 DO t3a t3a t3a t3a t3a
    t3a t3a t3a t3a t3a LOOP .ELAPSED
    3 SPACES TIMER-RESET #1000000000 0 DO t3b t3b t3b t3b t3b
    t3b t3b t3b t3b t3b LOOP .ELAPSED ;

    \ old version
    \ TEST1 : 1.646 seconds elapsed. 1.661 seconds elapsed.
    \ TEST2 : 1.778 seconds elapsed. 1.728 seconds elapsed.
    \ TEST3 : 2.194 seconds elapsed. 1.645 seconds elapsed. ok

    \ new version (above, note: 10 Giga executions)
    \ TEST1 : 1.959 seconds elapsed. 1.958 seconds elapsed.
    \ TEST2 : 1.826 seconds elapsed. 1.827 seconds elapsed.
    \ TEST3 : 18.711 seconds elapsed. 3.849 seconds elapsed. ok
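For readers tracing what the TEST3 pair actually does: t3a exchanges V1 and V2 through V3 as scratch, while t3b exchanges them directly and leaves V3 untouched. A quick Python model (illustrative only; the names mirror the Forth above, the dict stands in for the variable cells) confirms both have the same net effect on V1 and V2:

```python
# Python model of the TEST3 semantics above (illustrative, not part of the benchmark).
# Variables are cells in a dict; t3a and t3b should agree on V1/V2.

def t3a(v):
    # : t3a  V1 @ V3 !  V2 @ V1 !  V3 @ V2 ! ;  -- exchange via V3 as scratch
    v['V3'] = v['V1']
    v['V1'] = v['V2']
    v['V2'] = v['V3']

def t3b(v):
    # : t3b  V1 V2 :=: ;  -- direct exchange, V3 untouched
    v['V1'], v['V2'] = v['V2'], v['V1']

cells_a = {'V1': 7, 'V2': 8, 'V3': 9}
cells_b = {'V1': 7, 'V2': 8, 'V3': 9}
t3a(cells_a)
t3b(cells_b)
# Both leave V1=8 and V2=7; only t3a clobbers V3.
```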
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Thu Feb 27 12:23:43 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Results (on Zen4):
    gforth-fast (development): ...

    It's interesting how little difference there is with gforth-fast. Could
    you also do gforth-itc? exchange2 is a big win with VFX, suggesting its optimizer could do better with some of the other versions.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Gerry Jackson@do-not-use@swldwa.uk to comp.lang.forth on Thu Feb 27 22:05:09 2025
    From Newsgroup: comp.lang.forth

    On 27/02/2025 07:29, Anton Ertl wrote:
    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;

    looks a little simpler.

    This inspires another one:

    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    With some other versions this results in the following benchmark
    program:

    [defined] !@ [if]
    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;
    [then]

    \ Paul Rubin <875xkwo5io.fsf@nightsong.com>
    : ex ( addr1 addr2 -- )
    2>r 2r@ @ swap @ r> ! r> ! ;

    : ex-locals {: x y -- :} x @ y @ x ! y ! ;

    \ Anton Ertl
    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    \ Marcel Hendrix
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    variable v1
    variable v2

    1 v1 !
    2 v2 !

    : bench ( "name" -- )
    v1 v2
    :noname ]] 100000000 0 do 2dup [[ parse-name evaluate ]] loop ; [[
    execute ;
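The net effect of the variants above can be cross-checked with a tiny Forth-like stack machine in Python (an illustrative model, not part of the benchmark; cell addresses are modeled as dict keys). Each word should swap the contents of the two cells whose addresses are passed on the stack, leaving both stacks empty:

```python
# Tiny Forth-like stack machine (illustrative only), used to check that
# the exchange variants all have the same net effect.

class VM:
    def __init__(self, mem):
        self.ds, self.rs, self.mem = [], [], mem   # data stack, return stack, cells
    def push(self, x): self.ds.append(x)
    def pop(self): return self.ds.pop()
    # stack-shuffling and memory primitives
    def dup(self): self.push(self.ds[-1])
    def over(self): self.push(self.ds[-2])
    def swap(self): self.ds[-1], self.ds[-2] = self.ds[-2], self.ds[-1]
    def rot(self): self.push(self.ds.pop(-3))                  # ( a b c -- b c a )
    def fetch(self): self.push(self.mem[self.pop()])           # @
    def store(self): a = self.pop(); self.mem[a] = self.pop()  # !
    def to_r(self): self.rs.append(self.pop())                 # >R
    def r_from(self): self.push(self.rs.pop())                 # R>
    def two_to_r(self): x2 = self.pop(); x1 = self.pop(); self.rs += [x1, x2]  # 2>R
    def two_r_fetch(self): self.push(self.rs[-2]); self.push(self.rs[-1])      # 2R@

def exchange2(vm):   # : exchange2  dup >r @ over @ r> ! swap ! ;
    vm.dup(); vm.to_r(); vm.fetch(); vm.over(); vm.fetch()
    vm.r_from(); vm.store(); vm.swap(); vm.store()

def colon_exch(vm):  # : :=:  OVER @ >R DUP @ ROT ! R> SWAP ! ;
    vm.over(); vm.fetch(); vm.to_r(); vm.dup(); vm.fetch()
    vm.rot(); vm.store(); vm.r_from(); vm.swap(); vm.store()

def ex(vm):          # : ex  2>r 2r@ @ swap @ r> ! r> ! ;
    vm.two_to_r(); vm.two_r_fetch(); vm.fetch(); vm.swap(); vm.fetch()
    vm.r_from(); vm.store(); vm.r_from(); vm.store()

def run(word):
    vm = VM({'addr1': 1, 'addr2': 2})
    vm.push('addr1'); vm.push('addr2')
    word(vm)
    return vm.mem, vm.ds, vm.rs
```

All three runs end with the cell contents swapped and both stacks empty, which is what the benchmark relies on.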

    Results (on Zen4):

    gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.

    vfx64 5.43:
    :=: ex ex-locals exchange2
335_298_202 432_614_804 928_542_678 336_134_513 cyc.
1_166_400_242 1_366_264_943 2_866_547_067 1_166_280_641 inst.

    And here's the code produced by gforth-fast:

    :=: ex ex-locals exchange2
    over 1->2 2>r 1->0 l 1->1 dup >r 1->1
    mov r15,$08[r10] add r10,$08 mov rax,rbp >r 1->1
    @ 2->2 mov r15,r13 add r10,$08 mov -$8[r14],r13
    mov r15,[r15] mov r13,[r10] lea rbp,-$8[rbp] sub r14,$08
    r 2->1 mov -$8[r14],r13 mov -$8[rax],r13 @ 1->1
    mov -$8[r14],r15 sub r14,$10 mov r13,[r10] mov r13,$00[r13]
    sub r14,$08 mov [r14],r15 >l @local0 1->1 over 1->2
    dup 1->2 2r@ 0->2 @local0 1->1 mov r15,$08[r10]
    mov r15,r13 mov r13,$08[r14] mov rax,rbp @ 2->2
    @ 2->2 mov r15,[r14] lea rbp,-$8[rbp] mov r15,[r15]
    mov r15,[r15] @ 2->2 mov -$8[rax],r13 r> 2->3
    rot 2->3 mov r15,[r15] @ 1->1 mov r9,[r14]
    mov r9,$08[r10] swap 2->2 mov r13,$00[r13] add r14,$08
    add r10,$08 mov rax,r13 @local1 1->2 ! 3->1
    ! 3->1 mov r13,r15 mov r15,$08[rbp] mov [r9],r15
    mov [r9],r15 mov r15,rax @ 2->2 swap 1->2
    1->2 @ 2->2 mov r15,[r15] mov r15,$08[r10]
    mov r15,[r14] mov r15,[r15] @local0 2->3 add r10,$08
    add r14,$08 r> 2->3 mov r9,$00[rbp] ! 2->0
    swap 2->3 mov r9,[r14] ! 3->1 mov [r15],r13
    add r10,$08 add r14,$08 mov [r9],r15 ;s 0->1
    mov r9,r13 ! 3->1 @local1 1->2 mov r13,$08[r10]
    mov r13,[r10] mov [r9],r15 mov r15,$08[rbp] add r10,$08
    ! 3->1 r> 1->2 ! 2->0 mov rbx,[r14]
    mov [r9],r15 mov r15,[r14] mov [r15],r13 add r14,$08
    ;s 1->1 add r14,$08 lp+2 0->1 mov rax,[rbx]
    mov rbx,[r14] ! 2->0 mov r13,$08[r10] jmp eax
    add r14,$08 mov [r15],r13 add r10,$08
    mov rax,[rbx] ;s 0->1 add rbp,$10
    jmp eax mov r13,$08[r10] ;s 1->1
    add r10,$08 mov rbx,[r14]
    mov rbx,[r14] add r14,$08
    add r14,$08 mov rax,[rbx]
    mov rax,[rbx] jmp eax
    jmp eax


    How does a crude definition not involving the R stack compare:
    : ex3 over @ over @ 3 pick ! over ! 2drop ;
    --
    Gerry
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Feb 27 22:03:55 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Results (on Zen4):
    gforth-fast (development): ...

    It's interesting how little difference there is with gforth-fast. Could
    you also do gforth-itc?

    gforth-itc (development):
    :=: exchange ex ex-locals exchange2
7_527_256_553 5_224_615_325 6_825_283_178 9_238_357_501 7_036_128_309 cyc.
13_127_503_990 9_326_561_471 12_927_054_153 16_927_820_825 12_027_146_677 inst.

    For comparison: gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.

    exchange2 is a big win with VFX, suggesting its
    optimizer could do better with some of the other versions.

    On VFX exchange2 takes the same speed and the same number of
    instructions as :=:. EX is slower because VFX does not analyse the
    return stack, unlike the data stack. EX-LOCALS is slow because VFX's
    locals implementation is not particularly good.

    To see what a better analysis can do, let's look at lxf:

    :=: ex ex-locals exchange2
    502_740_029 502_189_567 502_134_842 502_043_217 cycles
    1_701_663_782 1_701_657_866 1_701_677_273 1_701_684_186 instructions

    The cycles and instructions are worse (except for ex-locals) than with
    VFX, but that's due to inlining (which VFX does and lxf does not).

    E.g., here's lxf's code for EX-LOCALS:

    869204C 804FCE2 23 88C8000 5 normal EX-LOCALS

    804FCE2 8B4500 mov eax , [ebp]
    804FCE5 8B00 mov eax , [eax]
    804FCE7 8BCB mov ecx , ebx
    804FCE9 8B09 mov ecx , [ecx]
    804FCEB 8B5500 mov edx , [ebp]
    804FCEE 890A mov [edx] , ecx
    804FCF0 8903 mov [ebx] , eax
    804FCF2 8B5D04 mov ebx , [ebp+4h]
    804FCF5 8D6D08 lea ebp , [ebp+8h]
    804FCF8 C3 ret near

    It's the same code as lxf produces for :=:.

    The code lxf produces for EX and EXCHANGE2 is:

    804FCF9 8BC3 mov eax , ebx
    804FCFB 8B00 mov eax , [eax]
    804FCFD 8B4D00 mov ecx , [ebp]
    804FD00 8B09 mov ecx , [ecx]
    804FD02 890B mov [ebx] , ecx
    804FD04 8B5D00 mov ebx , [ebp]
    804FD07 8903 mov [ebx] , eax
    804FD09 8B5D04 mov ebx , [ebp+4h]
    804FD0C 8D6D08 lea ebp , [ebp+8h]
    804FD0F C3 ret near

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Feb 27 22:53:47 2025
    From Newsgroup: comp.lang.forth

    Gerry Jackson <do-not-use@swldwa.uk> writes:
    On 27/02/2025 07:29, Anton Ertl wrote:
    \ Anton Ertl
    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;
    ...
    Results (on Zen4):

    gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.
...
    How does a crude definition not involving the R stack compare:
    : ex3 over @ over @ 3 pick ! over ! 2drop ;

    exchange2 ex3
    dup >r 1->1 over 1->1
    r 1->1 mov [r10],r13
    mov -$08[r14],r13 sub r10,$08
    sub r14,$08 mov r13,$10[r10]
    @ 1->1 @ 1->1
    mov r13,$00[r13] mov r13,$00[r13]
    over 1->2 over 1->2
    mov r15,$08[r10] mov r15,$08[r10]
    @ 2->2 @ 2->2
    mov r15,[r15] mov r15,[r15]
    2->3 fourth 2->3
    mov r9,[r14] mov r9,$10[r10]
    add r14,$08 ! 3->1
    ! 3->1 mov [r9],r15
    mov [r9],r15 over 1->2
    swap 1->2 mov r15,$08[r10]
    mov r15,$08[r10] ! 2->0
    add r10,$08 mov [r15],r13
    ! 2->0 2drop 0->0
    mov [r15],r13 add r10,$10
    ;s 0->1 ;s 0->1
    mov r13,$08[r10] mov r13,$08[r10]
    add r10,$08 add r10,$08
    mov rbx,[r14] mov rbx,[r14]
    add r14,$08 add r14,$08
    mov rax,[rbx] mov rax,[rbx]
    jmp eax jmp eax

EX3 plays to Gforth's strengths: copying words (e.g., OVER) instead of
shuffling words (e.g., SWAP), and removing superfluous stuff with 2DROP.

It also plays to VFX's strengths: being analytic about the data stack.
EXCHANGE2 was the fastest version (together with :=:) before; here's
    that compared to EX3:

    exchange2 ex3
    334_718_398 273_592_214 cycles
    1_167_276_392 967_258_380 instructions

    EXCHANGE2 EX3
    PUSH RBX MOV RDX, [RBP]
    MOV RDX, [RBP] MOV RDX, 0 [RDX]
    MOV RDX, 0 [RDX] MOV RCX, 0 [RBX]
    POP RCX MOV RAX, [RBP]
    MOV RBX, 0 [RBX] MOV 0 [RAX], RCX
    MOV 0 [RCX], RDX MOV 0 [RBX], RDX
    MOV RDX, [RBP] MOV RBX, [RBP+08]
    MOV 0 [RDX], RBX LEA RBP, [RBP+10]
    MOV RBX, [RBP+08] RET/NEXT
    LEA RBP, [RBP+10] ( 29 bytes, 9 instructions )
    RET/NEXT
    ( 31 bytes, 11 instructions )

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Fri Feb 28 15:03:47 2025
    From Newsgroup: comp.lang.forth

    On 27-02-2025 15:20, LIT wrote:
Oh, so it's a simpler way than anyone could guess:
"just use a different term, avoid the word 'variable' ".
    Done. :)

Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

„In computer science, an array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    Interesting.. In your class they taught computer science by Wikipedia?
    Didn't they have money for real books? Must have been a real poor city
    college..

At least in that college they didn't teach that
    „Forth uses FIFO stack” -- as they taught you in
    your really rich city college. :]

    I don't think I ever did that in any publication, but even if I did -
    people get confused when calling bit 0 "bit 1" because it represents
    "1". They get confused choosing the wrong side when they talk about "big endian". They get confused when classifying the 8088. They go left when
    their instructor calls "right".

It's like a spelling error. Only petty people try to use that as a
counter-argument. It's a different kind of error compared to proposing
"stackless operations" in a stack-based language. It's like asking why a
Ferrari can't pour a concrete floor.

    Anything wrong with the quoted definition?

    Yes. You couldn't produce one. You had to look it up. Something as basic
    a concept as "array".

    Hans Bezemer
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Feb 28 21:55:05 2025
    From Newsgroup: comp.lang.forth

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Another variant:

    : exchange ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    This uses the primitive

    '!@' ( u1 a-addr -- u2 ) gforth-experimental "store-fetch"
    load U2 from A_ADDR, and store U1 there, as atomic operation

    I worry that the atomic part will result in it being slower than the
    versions that do not use !@.

It's barely noticeable on Zen4, but it makes a big difference on the
Cortex-A55. Therefore we decided to also have a nonatomic !@. We
renamed the atomic one to ATOMIC!@, and !@ is now the nonatomic version.
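The distinction can be sketched outside Forth. In Python terms (a sketch with hypothetical names, not Gforth's implementation), !@ is a plain load-then-store of the same cell, while ATOMIC!@ makes that load/store pair indivisible, e.g. under a lock:

```python
import threading

# Sketch of the store-fetch semantics discussed above (names hypothetical).
# store_fetch models !@ ( u1 a-addr -- u2 ): load the old value, store the new.
# atomic_store_fetch models ATOMIC!@ : the same exchange, but indivisible,
# so no other thread can slip in between the load and the store.

mem = {'a': 7}
mem_lock = threading.Lock()

def store_fetch(addr, new):          # nonatomic !@
    old = mem[addr]
    mem[addr] = new
    return old

def atomic_store_fetch(addr, new):   # ATOMIC!@ : load/store under a lock
    with mem_lock:
        old = mem[addr]
        mem[addr] = new
        return old
```

The lock is the sketch's stand-in for whatever atomic exchange instruction the CPU provides; the Cortex-A55 numbers below suggest that instruction is what costs the extra cycles.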

    How do they perform?

    On Zen4:
    !@ atomic!@
    821_538_216 880_459_702 cycles
    3_815_202_629 3_710_937_849 instructions

    On Cortex-A55:
    !@ atomic!@
    3355427045 5856496676 cycles
    3115589778 4318749543 instructions

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri Feb 28 14:45:14 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often? We've done without it all this time.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 1 07:32:09 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often?

    Some numbers of uses in the Gforth image:

    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.
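The claimed replacement can be checked mechanically. A minimal Python stack model (illustrative; not how Gforth implements its primitives) shows that DUP @ >R ! R> leaves the same stacks and memory as a direct !@ :

```python
# Check that DUP @ >R ! R>  has the stack effect of !@ ( u1 a-addr -- u2 ).
# ds = data stack, rs = return stack, mem = cells by address.

def dup_fetch_seq(ds, rs, mem):
    ds.append(ds[-1])                  # DUP
    ds.append(mem[ds.pop()])           # @
    rs.append(ds.pop())                # >R
    a = ds.pop(); mem[a] = ds.pop()    # !
    ds.append(rs.pop())                # R>

def store_fetch(ds, rs, mem):          # !@ done directly
    a = ds.pop(); u1 = ds.pop()
    ds.append(mem[a]); mem[a] = u1

ds1, rs1, mem1 = [99, 'x'], [], {'x': 42}
ds2, rs2, mem2 = [99, 'x'], [], {'x': 42}
dup_fetch_seq(ds1, rs1, mem1)
store_fetch(ds2, rs2, mem2)
# Both end with the old value 42 on the data stack and 99 stored at 'x'.
```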

    I have now added stack-state variants for !@, resulting in better
    performance in some cases. Is !@ used often enough to merit the extra
    build time of Gforth? That's not clear, but the benefit I see is that
    I want to provide a system where the programmer does not have to
    wonder whether he should avoid !@ for better performance.

    I also tried out another variant that uses !@:

    : exchange4 ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    The resulting code for EXCHANGE, EXCHANGE4, and EXCHANGE2 (the latter
    without !@):

    see-code exchange see-code exchange4 see-code exchange2
    over 1->2 dup 1->2 dup >r 1->1
    mov r15,$08[r12] mov r15,r8 >r 1->1
    @ 2->2 @ 2->2 mov -$08[r13],r8
    mov r15,[r15] mov r15,[r15] sub r13,$08
    swap 2->3 rot 2->3 @ 1->1
    add r12,$08 mov r9,$08[r12] mov r8,[r8]
    mov r9,r8 add r12,$08 over 1->2
    mov r8,[r12] !@ 3->2 mov r15,$08[r12]
    !@ 3->2 mov rax,r15 @ 2->2
    mov rax,r15 mov r15,[r9] mov r15,[r15]
    mov r15,[r9] mov [r9],rax r> 2->3
    mov [r9],rax swap 2->3 mov r9,$00[r13]
    swap 2->3 add r12,$08 add r13,$08
    add r12,$08 mov r9,r8 ! 3->1
    mov r9,r8 mov r8,[r12] mov [r9],r15
    mov r8,[r12] ! 3->1 swap 1->2
    ! 3->1 mov [r9],r15 mov r15,$08[r12]
    mov [r9],r15 ;s 1->1 add r12,$08
    ;s 1->1 mov rbx,$00[r13] ! 2->0
    mov rbx,$00[r13] add r13,$08 mov [r15],r8
    add r13,$08 mov rax,[rbx] ;s 0->1
    mov rax,[rbx] jmp eax mov r8,$08[r12]
    jmp eax add r12,$08
    mov rbx,$00[r13]
    add r13,$08
    mov rax,[rbx]
    jmp eax

EXCHANGE performs 1 instruction fewer than EXCHANGE2, and EXCHANGE4
performs 2 instructions fewer than EXCHANGE2; both contain three fewer
primitives.

    Performance on Zen4:
    exchange exchange4 exchange2
    748_033_428 699_870_875 809_204_577 cycles
    3_610_871_416 3_510_578_833 3_710_662_751 instructions

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 1 08:18:06 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    see-code exchange see-code exchange4 see-code exchange2
    over 1->2 dup 1->2 dup >r 1->1
    mov r15,$08[r12] mov r15,r8 >r 1->1
    @ 2->2 @ 2->2 mov -$08[r13],r8
    mov r15,[r15] mov r15,[r15] sub r13,$08
    swap 2->3 rot 2->3 @ 1->1
    add r12,$08 mov r9,$08[r12] mov r8,[r8]
    mov r9,r8 add r12,$08 over 1->2
    mov r8,[r12] !@ 3->2 mov r15,$08[r12]
    !@ 3->2 mov rax,r15 @ 2->2
    mov rax,r15 mov r15,[r9] mov r15,[r15]
    mov r15,[r9] mov [r9],rax r> 2->3
    mov [r9],rax swap 2->3 mov r9,$00[r13]
    swap 2->3 add r12,$08 add r13,$08
    add r12,$08 mov r9,r8 ! 3->1
    mov r9,r8 mov r8,[r12] mov [r9],r15
    mov r8,[r12] ! 3->1 swap 1->2
    ! 3->1 mov [r9],r15 mov r15,$08[r12]
    mov [r9],r15 ;s 1->1 add r12,$08
    ;s 1->1 mov rbx,$00[r13] ! 2->0
    mov rbx,$00[r13] add r13,$08 mov [r15],r8
    add r13,$08 mov rax,[rbx] ;s 0->1
    mov rax,[rbx] jmp eax mov r8,$08[r12]
    jmp eax add r12,$08
    mov rbx,$00[r13]
    add r13,$08
    mov rax,[rbx]
    jmp eax

    The difference between exchange and exchange4 shows how stack caching
    can have a hard-to-predict effect. Gforth searches for the shortest
    path through the available stack-cache states, where shortness is
    defined by the native-code length. E.g., it starts with state 1, and
    from there it can use any of the dup variants starting in state 1, or
    first transition to another state and use a dup variant starting from
    there.
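The search described here is essentially a shortest-path problem over (position in the code, stack-cache state) nodes. The toy sketch below illustrates the idea with invented byte costs and a flat transition cost; it is not Gforth's actual table or algorithm:

```python
import heapq

# Toy shortest-path selection of primitive variants (all costs invented).
# Nodes are (instructions compiled so far, stack-cache state); edges are
# either extra state-transition code or a variant of the next primitive.

VARIANTS = {   # primitive -> {(in_state, out_state): copied-code bytes}
    'swap': {(1, 1): 13, (2, 2): 9, (1, 2): 9, (2, 3): 11, (3, 2): 11},
    'rot':  {(1, 1): 23, (3, 3): 12, (2, 3): 9, (1, 3): 17},
}
TRANSITION = 5   # assumed flat cost of inserted state-transition code

def shortest(seq, start=1, end=1, states=(1, 2, 3)):
    """Minimal total code bytes to compile seq, entering in start, leaving in end."""
    INF = float('inf')
    dist = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        d, i, s = heapq.heappop(heap)
        if d > dist.get((i, s), INF):
            continue
        if i == len(seq) and s == end:
            return d
        for t in states:                      # insert transition code
            if t != s and d + TRANSITION < dist.get((i, t), INF):
                dist[(i, t)] = d + TRANSITION
                heapq.heappush(heap, (d + TRANSITION, i, t))
        if i < len(seq):                      # use a variant of the next primitive
            for (si, so), c in VARIANTS[seq[i]].items():
                if si == s and d + c < dist.get((i + 1, so), INF):
                    dist[(i + 1, so)] = d + c
                    heapq.heappush(heap, (d + c, i + 1, so))
    return None
```

With these made-up costs, a lone SWAP compiles as the 13-byte SWAP 1->1, but SWAP ROT is cheaper as SWAP 1->2 plus ROT 2->3 plus one transition back to state 1; which variant "wins" depends on the whole sequence, matching the hard-to-predict effects described above.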

    For SWAP and ROT gforth-fast has the following variants:

    primitive in-out # code bytes
    swap 1-1 132 len= 4+ 13+ 3
    swap 2-2 37 len= 4+ 9+ 3
    swap 3-3 4 len= 4+ 9+ 3
    swap 0-2 8 len= 4+ 14+ 3
    swap 1-2 82 len= 4+ 9+ 3
    swap 2-1 74 len= 4+ 8+ 3
    swap 2-3 30 len= 4+ 11+ 3
    swap 3-2 3 len= 4+ 11+ 3
    swap 2-0 20 len= 4+ 13+ 3
    rot 1-1 46 len= 4+ 23+ 3
    rot 3-3 6 len= 4+ 12+ 3
    rot 3-1 24 len= 4+ 13+ 3
    rot 2-3 15 len= 4+ 9+ 3
    rot 1-3 17 len= 4+ 17+ 3
    rot 0-3 1 len= 4+ 19+ 3

    You get these data (in a rawer form) with

    gforth-fast --print-prim -e bye |& grep ^swap
    gforth-fast --print-prim -e bye |& grep ^rot

    The in column is the stack-cache state on entering the word, the out
    column is the stack-cache state on leaving the word. The # column
    shows how many times this variant of the primitive is used (static
counts). The code-length column shows three parts, the middle of which
    is the part that's copied to dynamic superinstructions like the ones
    shown above, and this length is what is used in the search for the
    shortest path.

    In EXCHANGE4, the shortest variant of ROT is used: ROT 2->3; and the
    primitives of the other variants are selected to also result in the
    shortest overall code.

    In EXCHANGE and EXCHANGE4, SWAP 2->3 is not the shortest variant of
    SWAP, not even the shortest variant starting from state 2, but ending
    in state 3 allows to use cheap variants of further primitives such as
    !@ and !, resulting in the overall shortest code for this sequence.
    In EXCHANGE2, we see the selection of a shorter version of SWAP, but
    one of the costs is that ;s becomes longer (but in this case the
    overall savings from using a shorter version of SWAP and shorter
    versions of earlier instructions make up for that).

    Why am I looking at this? For stack-shuffling primitives like SWAP
    and ROT, it's not obvious which variant is how long and which variant
    should be selected.

These stack-shuffling words therefore are also good candidates for
performing stack-cache state transitions that would otherwise require
inserting extra transition code:

    E.g., EXCHANGE and its variants consume two stack items, but need to
    start in stack-cache state 1 and end in stack-cache state 1 (gforth is currently not smart enough to deal with other states at basic-block boundaries), so not everything can be done in the stack cache; the
    stack pointer needs to be increased by two cells, and there need to be
    accesses to the memory part of the stack for two stack items.

    In EXCHANGE, the adjustment by one cell and memory access for one
    stack item is done in the first SWAP 2->3, and another one in the
    second one. In EXCHANGE4, ROT 2->3 and SWAP 2->3 perform these tasks.
In EXCHANGE2, the SWAP 1->2 does it for one cell, and the stack-cache
state transition 0->1 in the first two instructions of ;s does it for
    the other cell (gforth-fast actually does not have a built-in variant
    ;S 0->1 and the code shown as ;S 0->1 by SEE-CODE is actually composed
    of a transition 0->1 and the ;S 1->1 variant).

    I wanted to know how often which variant of these stack-shuffling
    primitives is used, and how this relates to their length. One
    interesting result is that ROT 1->3 is used relatively frequently
    despite having relatively long code. Apparently the code that comes
    before these 17 instances of ROT benefits a lot from being in
stack-cache state 1, and this amortizes the longer code of ROT 1->3
compared to ROT 2->3.

    Another interesting result is the low usage of SWAP 3->2 compared to
    SWAP 2->3. This may say something about how SWAP is used in Forth
    programs. Or it may be an artifact of tie-breaking: If two paths have
    the same length, one is chosen rather arbitrarily, but consistently,
    and this may make one variant appear more useful than merited by the
    benefit that the existence of the variant has on code length.

    For those interested, here's the code for the various variants shown
    above:

    r12: stack pointer
r8: stack-cache register a (tos in state 1, 2nd in state 2, 3rd in state 3)
r15: stack-cache register b (tos in state 2, 2nd in state 3)
    r9: stack-cache register c (tos in state 3)

    swap 1-1
    559E7F769425: mov rax,$08[r12]
    559E7F76942A: mov $08[r12],r8
    559E7F76942F: mov r8,rax

    swap 2-2
    559E7F76E5B1: mov rax,r8
    559E7F76E5B4: mov r8,r15
    559E7F76E5B7: mov r15,rax

    swap 3-3
    559E7F76E5C3: mov rax,r15
    559E7F76E5C6: mov r15,r9
    559E7F76E5C9: mov r9,rax

    swap 0-2
    559E7F76E5D5: mov r15,$10[r12]
    559E7F76E5DA: mov r8,$08[r12]
    559E7F76E5DF: add r12,$10

    swap 1-2
    559E7F76E5EC: mov r15,$08[r12]
    559E7F76E5F1: add r12,$08

    swap 2-1
    559E7F76E5FE: mov [r12],r15
    559E7F76E602: sub r12,$08

    swap 2-3
    559E7F76E60F: add r12,$08
    559E7F76E613: mov r9,r8
    559E7F76E616: mov r8,[r12]

    swap 3-2
    559E7F76E623: mov [r12],r8
    559E7F76E627: mov r8,r9
    559E7F76E62A: sub r12,$08

    swap 2-0
    559E7F76E637: mov [r12],r15
    559E7F76E63B: sub r12,$10
    559E7F76E63F: mov $08[r12],r8

    rot 1-1
    559E7F76944C: mov rdx,$08[r12]
    559E7F769451: mov rax,$10[r12]
    559E7F769456: mov $08[r12],r8
    559E7F76945B: mov $10[r12],rdx
    559E7F769460: mov r8,rax

    rot 3-3
    559E7F76EDC0: mov rax,r8
    559E7F76EDC3: mov r8,r15
    559E7F76EDC6: mov r15,r9
    559E7F76EDC9: mov r9,rax

    rot 3-1
    559E7F76EDD5: mov [r12],r15
    559E7F76EDD9: sub r12,$10
    559E7F76EDDD: mov $08[r12],r9

    rot 2-3
    559E7F76EDEB: mov r9,$08[r12]
    559E7F76EDF0: add r12,$08

    rot 1-3
    559E7F76EDFD: mov r9,$10[r12]
    559E7F76EE02: mov r15,r8
    559E7F76EE05: add r12,$10
    559E7F76EE09: mov r8,-$08[r12]

    rot 0-3
    559E7F76EE17: mov r9,$18[r12]
    559E7F76EE1C: mov r8,$10[r12]
    559E7F76EE21: add r12,$18
    559E7F76EE25: mov r15,-$10[r12]

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 1 11:47:54 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Paul Rubin <no.email@nospam.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often?

    Some numbers of uses in the Gforth image:

    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.

    Another point: These 11 uses of non-atomic !@ used to be uses of the
    slow atomic !@. So even the slow atomic !@ was preferred by the
    programmer to doing it with @, ! and stack manipulation. In that
    situation the non-atomic !@ provides the wanted capability without
    incurring the atomic slowness.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sat Mar 1 16:20:24 2025
    From Newsgroup: comp.lang.forth

    On 01-03-2025 12:47, Anton Ertl wrote:
    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.

    I found the sequence exactly twice in my code - one an application
    program and one a library. I agree whole-heartedly that such a sequence
    may help a programmer to abstract such a pattern - I've added several of
    those myself.

However, if it is that rare there is no point in adding it. Creating too
many superfluous abstractions may even become counterproductive in the
sense that predefined abstractions are ignored and reinvented. So, this
    is not one I'd add immediately - but never say never. May be it will pop
    up in the future. Who knows.. ;-)

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 1 17:22:45 2025
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 01-03-2025 12:47, Anton Ertl wrote:
    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.

    I found the sequence exactly twice in my code

Yes, you can replace !@ with that sequence, but not every case where
one fetches one value from an address and stores another value to
that address is expressed by this sequence. E.g., another equivalent
sequence is: DUP >R @ SWAP R> !; and another: DUP @ -ROT !. And you
can also use the word profitably in cases where some other
functionality is mixed in with the code without !@. E.g., in none of
the variants of :=:/etc. without !@ in this thread did either of the
two sequences occur; in several of them the ! of the other address was
inserted before the ! of the address that was fetched the second time.
E.g.,

    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    Yet

    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;

    is shorter, easier to follow, and (in gforth-fast) faster.
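For readers skimming the thread, the semantics of !@ as used above can be
sketched in standard Forth. This is only an illustration built from the
sequences quoted in this thread; in Gforth !@ is an actual primitive (with
a separate atomic variant), not this colon definition:

```
\ !@ ( x addr -- x' )
\ Store x at addr and return the value previously stored there.
\ Sketch only, using the equivalent sequence quoted in this thread.
: !@  ( x addr -- x' )  dup @ >r ! r> ;

\ Exchange the contents of two variables using !@ (Anton's EXCHANGE):
: exchange ( addr1 addr2 -- )  over @ swap !@ swap ! ;

variable a  variable b
1 a !  2 b !
a b exchange
a @ .  \ prints 2
b @ .  \ prints 1
```

Tracing EXCHANGE: OVER @ fetches the first variable's value, SWAP !@
stores it into the second variable while returning that variable's old
value, and the final SWAP ! stores the old value back into the first
variable.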

    As mentioned, Bernd Paysan used !@ 11 times in the Gforth image in
    code where atomicity is not needed. Up to yesterday we only had the
    atomic version and I have avoided using !@ because I was worried that
    it would be slow, so there may be some additional opportunity in the
    Gforth image for using it.

> However, if it is that rare, there is no point in adding it. Creating
> too many superfluous abstractions may even get counterproductive in the
> sense that predefined abstractions are ignored and reinvented.

    In that case they are obviously not superfluous. Yes, reinvention
    happens; it shows that the word is needed. Then at some point
    somebody notices the duplication, decides on a canonical version and
    goes through the code and replaces all uses of the duplicated words
    with the canonical version.

    There is a valid reason to avoid rarely used words that can be
    replaced by a sequence: human memory load. I don't think that !@ is
    such a case, though.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Sat Mar 1 21:35:16 2025
    From Newsgroup: comp.lang.forth

    I can't find `DUP @ >R ! R>` (+ variants with spacings)
    in any of 1667 files.
    However, `DUP @ >R` is found 12 times and `! R>` 29 times.

    `DUP @ -ROT !` gets hit 0 times, `DUP >R @ SWAP R> !` once.

    -marcel
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Sun Mar 2 10:35:27 2025
    From Newsgroup: comp.lang.forth

    Oh, so it's simpler way than anyone could guess:
    "just use different term, avoid the word 'variable' ".
    Done. :)

Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
for my money back. You know - it's not a different term - it's a
different concept, with quite different characteristics.

    „In computer science, array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    Interesting.. In your class they taught computer science by Wikipedia?
    Didn't they have money for real books? Must have been a real poor city
    college..

At least in that college they didn't teach that
„Forth uses FIFO stack” -- as they taught you in
your really rich city college. :]

I don't think I ever did that in any publication, but even if I did -
people get confused when calling bit 0 "bit 1" because it represents
"1". They get confused choosing the wrong side when they talk about
"big endian". They get confused when classifying the 8088. They go left
when their instructor calls "right".

It's like a spelling error. Only petty people try to use that as a
counterargument. It's a different kind of error compared to proposing
"stackless operations" in a stack-based language. It's like asking why
a Ferrari can't pour a concrete floor.

No, it WASN'T a "spelling error"; you stated that out loud
in your "educational" YT clip. Did you forget? Or are you
trying very hard to forget? No, it wasn't just replacing L
with F; you stated it word for word: "Forth uses FIFO stack
— first in, first out".

    I'm going to remind you:

    https://groups.google.com/g/comp.lang.forth/c/EWLqO2b26nM/m/B7gnoD7dAgAJ

    "> > It's here - https://www.youtube.com/watch?v=hpw__rmBisU
01:45 -- but you know: Forth's stack works on the rule „last in —
first out”, not „first in, first out”. Or am I wrong?
    No, you're not. I pulled it and I'm uploading an updated version.

    Hans Bezemer"

    And you try to present yourself as an authority after something
    like that?

    Anything wrong with the quoted definition?

    Yes. You couldn't produce one. You had to look it up. Something as basic
    a concept as "array".

    Mr. FIFO, don't you be ridiculous again... :]

    --
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sun Mar 2 11:39:38 2025
    From Newsgroup: comp.lang.forth

    On 01-03-2025 22:35, mhx wrote:
    I can't find `DUP @ >R ! R>` (+ variants with spacings)
    in any of 1667 files.
    However, `DUP @ >R` is found 12 times and `! R>` 29 times.

    `DUP @ -ROT !` gets hit 0 times, `DUP >R @ SWAP R> !` once.

    In 1073 files, `DUP @ -ROT !` and `DUP >R @ SWAP R> !` aren't found.
    The sequence `DUP @ >R` is found 10 times, `! R>` a whopping 65 times.

    Hans Bezemer



  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Mar 5 16:40:55 2025
    From Newsgroup: comp.lang.forth

    On 02-03-2025 11:35, LIT wrote:
    And you try to present yourself as an authority after something
    like that?

    "Mr. Twain - you made a spelling error. And you call yourself the
    greatest American writer of the 19th century?"

    I told you you were petty.. :)

    Mr. FIFO, don't you be ridiculous again... :]

    Well, until you can make a non-trivial Forth program without resorting
    to variables, I think I still have an edge on you where Forth is concerned!

    A significant edge, I might add..! ;-)

    Hans Bezemer

  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Mar 6 11:33:35 2025
    From Newsgroup: comp.lang.forth

    And you try to present yourself as an authority after something
    like that?

    "Mr. Twain - you made a spelling error. And you call yourself the
    greatest American writer of the 19th century?"

    I told you you were petty.. :)

    No, it WASN'T humble "spelling error"; YOU STATED
    THAT OUT LOUD, in a complete sentence. :]

BTW: comparing yourself to Twain? It seems you're
not just the greatest "computer scientist", if not in
the world then at least in this newsgroup, sure :D
— but also the most modest one... :)))

    --
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Mar 6 13:36:53 2025
    From Newsgroup: comp.lang.forth

    On 06-03-2025 12:33, LIT wrote:
    No, it WASN'T humble "spelling error"; YOU STATED
    THAT OUT LOUD, in a complete sentence. :]

BTW: comparing yourself to Twain? It seems you're
not just the greatest "computer scientist", if not in
the world then at least in this newsgroup, sure :D
— but also the most modest one... :)))

    Oh dear - it actually was a typo - and I can prove it.

First, the audio file has a timestamp of 15:32. Now, you know it's not
hard to "touch" a file, so if you don't want to believe me - fair enough.
The error was reported at 18:36. You can check that for yourself. The
text is part of an animation, so I had to redo the entire animation. I
have to do the timing of the animation with the audio manually. Then I
have to render the video. That takes about ten minutes. The render was
complete at 18:56. Again, you don't have to believe me.

But I confirmed the reupload at 19:06. You can check that for yourself.
That means uploading the file, adding all the information, adding the
subtitle. I think ten minutes is hardly unreasonable. So, that is half
an hour to:
• Edit the animation;
• Sync the animation;
• Edit the video;
• Render the video;
• Upload and process the video.

Now, I use Audacity - and that doesn't support "punch and roll". It
would be quite hard to seamlessly edit the audio that way. And also - my
script clearly states: "If we want to get a dime out, the first one that
drops out is the one you put in last." Again, you don't have to believe
me.

    But either I just fixed the video - OR I did the video AND the audio
    plus the script in less than ten minutes. If that were the case, it
    would make me even more formidable! Your choice..

    Hans Bezemer
