• Stack vs stackless operation

    From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Mon Feb 24 19:49:01 2025
    From Newsgroup: comp.lang.forth

    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean, instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere, wouldn't it
    be practical to use the variables directly? Like this:

    : +> ( addr1 addr2 addr3 -- )
    rot @ rot @ + swap ! ;

    Of course the above is just an illustration; I mean coding
    such a word directly in ML (machine language). It should be
    significantly faster than going through the stack the usual way.

    But after I came up with this idea I realized someone
    surely invented it before — it looks so obvious — yet
    I haven't seen it anywhere. Did any of you see something
    like this in any code? If so, why hasn't such a solution
    become widespread? It looks good to me; the math can be
    done completely in ML, avoiding the "Forth machine"
    engagement and therefore saving many cycles.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@gmx.net (minforth) to comp.lang.forth on Mon Feb 24 20:34:25 2025

    An optimising Forth compiler does exactly that.

    NT/FORTH for example:

    : +> rot @ rot @ + swap ! ; ok
    see +>
    A49E6C 409196 21 C80000 5 normal +>

    409196 8B4504 mov eax , [ebp+4h]
    409199 8B00 mov eax , [eax]
    40919B 8B4D00 mov ecx , [ebp]
    40919E 8B09 mov ecx , [ecx]
    4091A0 01C8 add eax , ecx
    4091A2 8903 mov [ebx] , eax
    4091A4 8B5D08 mov ebx , [ebp+8h]
    4091A7 8D6D0C lea ebp , [ebp+Ch]
    4091AA C3 ret near
    ok
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Mon Feb 24 20:39:59 2025

    So for a non-optimizing one it would be handy, correct?

    BTW: could you please list all the optimizing Forth
    compilers -- or at least the ones you know of?

  • From minforth@gmx.net (minforth) to comp.lang.forth on Mon Feb 24 21:51:26 2025

    With respect, the more important questions are:
    For what type of machine?
    Desktop or embedded?
    Minimal kernel only or full standard compliant?
    Hobby or professional support/service required?

    But to mention another example:
    https://mecrisp.sourceforge.net/#
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Feb 24 21:50:21 2025

    zbigniew2011@gmail.com (LIT) writes:
    > I wonder: wouldn't it be useful to have stackless basic
    > arith operations? I mean instead of fetching the values
    > first and putting them on the stack, then doing something,
    > and in the end storing the result somewhere wouldn't it
    > be practical to use directly the variables?

    I don't remember ever doing that, so no, it would not be practical.

    Forth has had values for quite a while, so you could avoid the need to
    write @ and !; you would instead write something like:

    a b + to c

    For global values the code is typically not better than when using
    variables, though.

    > But after I came up with this idea I realized someone
    > surely invented that before - it looks so obvious — yet
    > I didn't see it anywhere.

    The VAX architecture has instructions with three memory operands,
    including ADDL3. That feature makes it pretty hard to implement
    efficiently.

    > Did anyone of you see something
    > like this in any code?

    No.

    > If so — actually why somehow
    > (probably?) such solution has not become widespread?

    Probably because the case where the two operands of a + are in memory,
    and the result is needed in memory is not that frequent.

    > Looks good to me; math can be done completely in ML
    > avoiding "Forth machine" engagement, therefore saving many
    > cycles.

    Not sure what you mean with "Forth machine engagement"; with good
    Forth compilers these days, a typical stack-to-stack addition is
    faster than the best machine code for a memory-to-memory addition.
    E.g. VFX64 turns

    : dec-u#b ( u1 -- u2 )
    dup #-3689348814741910323 um* nip 3 rshift tuck 10 * - '0' + hold ; ok

    into

    ( 0050A300 48BACDCCCCCCCCCCCCCC ) MOV RDX, # CCCCCCCC:CCCCCCCD
    ( 0050A30A 488BC2 ) MOV RAX, RDX
    ( 0050A30D 48F7E3 ) MUL RBX
    ( 0050A310 48C1EA03 ) SHR RDX, # 03
    ( 0050A314 486BCA0A ) IMUL RCX, RDX, # 0A
    ( 0050A318 482BD9 ) SUB RBX, RCX
    ( 0050A31B 4883C330 ) ADD RBX, # 30
    ( 0050A31F 488D6DF8 ) LEA RBP, [RBP+-08]
    ( 0050A323 48895500 ) MOV [RBP], RDX
    ( 0050A327 E87CA7F1FF ) CALL 00424AA8 HOLD
    ( 0050A32C C3 ) RET/NEXT
    ( 45 bytes, 11 instructions )
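The magic-number sequence in the code above is a reciprocal division by 10; a quick check of the arithmetic (assuming nothing beyond 64-bit cells, which is what VFX64 uses):

```python
# As an unsigned 64-bit cell, -3689348814741910323 is CCCCCCCC:CCCCCCCD,
# and "um* nip 3 rshift" -- the high half of the 64x64->128 product,
# shifted right by 3 -- yields u/10 exactly for any 64-bit u.
M = -3689348814741910323 % (1 << 64)
assert M == 0xCCCCCCCCCCCCCCCD

def div10(u):
    high = (u * M) >> 64          # um* nip : keep the high half
    return high >> 3              # 3 rshift

for u in (0, 9, 10, 4711, 123456789, (1 << 64) - 1):
    assert div10(u) == u // 10
    # "tuck 10 * -" then leaves the remainder, and "'0' +" makes it a digit:
    assert u - 10 * div10(u) + ord('0') == ord('0') + u % 10
```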

    I don't think that it would be faster or shorter to use
    memory-to-memory operations here. That's also why the VAX died: RISCs
    just outperformed it.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Mon Feb 24 22:35:28 2025

    On Mon, 24 Feb 2025 21:51:26 +0000, minforth wrote:

    > With respect, the more important questions are:
    > For what type of machine?
    > Desktop or embedded?

    Wow, so there are really so many?
    Say, for desktop.

    > Minimal kernel only or full standard compliant?
    > Hobby or professional support/service required?
    >
    > But to mention another example:
    > https://mecrisp.sourceforge.net/#

    Indeed, I downloaded Mecrisp long ago already, and somehow
    I still haven't found time for my Stellaris LaunchPad.
    Maybe now it's finally time.

  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Mon Feb 24 22:41:57 2025

    > Probably because the case where the two operands
    > of a + are in memory, and the result is needed
    > in memory is not that frequent.

    One example could be matrix multiplication.
    It's a rather trivial but cumbersome operation,
    where usually a few transitional variables are
    used to maintain clarity of the code.

    > I don't think that it would be faster or shorter to use
    > memory-to-memory operations here. That's also why the VAX died: RISCs
    > just outperformed it.

    Probably the "bigger" Forth compilers are indeed
    already "too good" for the difference to be
    (practically) noticeable — but maybe for
    simpler Forths, I mean the ones for DOS
    or even for 8-bit machines, it would make sense?
    I'll try to do a few checks in the coming days.

  • From dxforth@gmail.com to comp.lang.forth on Tue Feb 25 16:24:40 2025

    On 25/02/2025 6:49 am, LIT wrote:
    > I wonder: wouldn't it be useful to have stackless basic
    > arith operations? I mean instead of fetching the values
    > first and putting them on the stack, then doing something,
    > and in the end storing the result somewhere wouldn't it
    > be practical to use directly the variables? Like this:
    >
    > : +> ( addr1 addr2 addr3 -- )
    >  rot @ rot @ + swap ! ;
    >
    > Of course the above is just an illustration; I mean coding
    > such word directly in ML. It should be significantly
    > faster than going through stack usual way.

    A set of three addresses on the stack is messy even before
    one does anything with them.

  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 07:04:26 2025

    >> : +> ( addr1 addr2 addr3 -- )
    >>  rot @ rot @ + swap ! ;
    >>
    >> Of course the above is just an illustration; I mean coding
    >> such word directly in ML. It should be significantly
    >> faster than going through stack usual way.

    > A set of three addresses on the stack is messy even before
    > one does anything with them.

    Yep, but I meant the case of, for example:

    var1 @ var2 @ + var3 !

    The above isn't messy at all.

    So IMHO, by using such an OOS (out-of-stack) operation — coded
    directly in ML — we can replace the above with:

    var1 var2 var3 +>

    Or we can create an increment-by-one operation (and its
    counterpart):

    var1 ++
    var1 --

    Or "multiply/divide by two a number of times":

    var1 2 lshift ( multiply by 4 )

    etc.

    In the case of slower, non-optimizing ITC Forths — with
    fig-Forth as the most obvious example — the "boost"
    may be noticeable.
    I'll check that.

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 07:26:58 2025

    zbigniew2011@gmail.com (LIT) writes:
    >> Probably because the case where the two operands
    >> of a + are in memory, and the result is needed
    >> in memory is not that frequent.

    > One example could be matrix multiplication.
    > It's rather trivial but cumbersome operation,
    > where usually a few transitional variables are
    > used to maintain clarity of the code.

    Earlier you wrote about performance, now you switch to clarity of the
    code. What is the goal?

    If we stick with performance: the fastest version in
    <http://theforth.net/package/matmul/current-view/matmul.4th>
    on all the systems I measured (and one that does not use a
    primitive FAXPY) is version 2, and that spends most of its time in:

    : faxpy-nostride ( ra f_x f_y ucount -- )
        \ vy=ra*vx+vy
        dup >r 3 and 0 ?do
            fdup over f@ f* dup f+! float+ swap float+ swap
        loop
        r> 2 rshift 0 ?do
            fdup over f@ f* dup f+! float+ swap float+ swap
            fdup over f@ f* dup f+! float+ swap float+ swap
            fdup over f@ f* dup f+! float+ swap float+ swap
            fdup over f@ f* dup f+! float+ swap float+ swap
        loop
        2drop fdrop ;

    It's not the clearest code, and certainly the version without
    unrolling is clearer (and may be almost as fast in the newer versions
    of SwiftForth and VFX which make counted loops significantly faster):

    : faxpy-nostride ( ra f_x f_y ucount -- )
        \ vy=ra*vx+vy
        0 ?do
            fdup over f@ f* dup f+! float+ swap float+ swap
        loop
        2drop fdrop ;
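Restated in plain Python (a sketch of the semantics only; the Forth version walks raw float arrays with f@ / f+! / float+):

```python
# What FAXPY-NOSTRIDE computes: vy = ra*vx + vy, element by element.
def faxpy_nostride(ra, x, y):
    for i in range(len(y)):
        y[i] += ra * x[i]     # per element: 2 FP loads, 1 FP store
    return y

assert faxpy_nostride(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0]) \
       == [12.0, 14.0, 16.0]
```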

    Each iteration performs 2 FP loads and 1 FP store. With
    memory-to-memory variants of F* and F+ that would be 4 FP loads
    and 2 FP stores, and I don't think it would be any clearer. And
    if you used memory-to-memory variants for the address computation,
    things would become even slower. And I doubt that they would
    become clearer.

    Some time later I worked on how SIMD could be integrated into
    Forth, and used matrix multiplication as an example. With the
    wordset I propose, this whole loop became

    ( v1 r addr ) v@ f*vs f+v ( v2 )

    Only one memory access is visible here at all; there are some more
    in the implementation of these words, however. You can find the
    paper about that at
    <http://www.euroforth.org/ef17/papers/ertl.pdf>. A further
    refinement of that work can be found at
    <https://www.complang.tuwien.ac.at/papers/ertl18manlang.pdf>
    (presented in a Java setting for the audience of the conference,
    but the implementation was in a Forth setting, see
    <https://github.com/AntonErtl/vectors>). This work eliminates many
    of the memory accesses that the earlier implementation performs,
    demonstrating that the memory accesses are not fundamental to the
    model. In particular, Figure 11 shows code corresponding to

    ( v1 r1 addr1 r2 addr2 ) v@ f*vs v@ f+v v@ f*vs f+v ( v2 )

    i.e., the code above unrolled by a factor of 2; it has 3 SIMD loads
    and 1 SIMD store per SIMD granule processed (the SIMD granule is 4
    doubles for AVX). Further unrolling results in even fewer loads and
    stores per FLOP (FP multiplication and FP addition).

    > Probably "bigger" Forth compilers are indeed
    > already "too good" for the difference to be
    > (practically) noticeable — still maybe for
    > simpler Forths, I mean like the ones for DOS
    > or even for 8-bit machines it would make sense?

    Forth was designed for small machines and very simple implementations.
    We have words like "1+" that are beneficial in that setting. We also
    have "+!", which is the closest to what you have in mind. But even in
    those times nobody went for a word like "+> ( addr1 addr2 addr3 -- )",
    because it is not useful often enough.

    - anton
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 08:33:16 2025

    >> One example could be matrix multiplication.
    >> It's rather trivial but cumbersome operation,
    >> where usually a few transitional variables are
    >> used to maintain clarity of the code.

    > Earlier you wrote about performance, now you switch to clarity of the
    > code. What is the goal?

    Both — one isn't contrary to the other.

    > If we stick with performance, the fastest version in
    > [..]
    > Forth was designed for small machines and very simple implementations.
    > We have words like "1+" that are beneficial in that setting. We also
    > have "+!", which is the closest to what you have in mind. But even in
    > those times nobody went for a word like "+> ( addr1 addr2 addr3 -- )",
    > because it is not useful often enough.

    What I have in mind is: by performing an OOS operation we don't
    have to employ the whole "Forth machine" to do the usual things
    (I mean, of course, the usual steps described by Brad Rodriguez
    in his "Moving Forth" paper).

    It comes at a cost: the usual Forth words, which use the stack,
    are versatile, while such OOS words aren't that versatile
    anymore — yet (at least in the case of non-optimizing ITC Forths)
    they should be faster. I'll create a few of them, compare the
    processing times, and then publish the results.

    Clarity of the code comes as a "bonus" :) Yes, we've got VALUEs
    and I use them when needed, but their use still means employing
    the "Forth machine".

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 09:07:19 2025

    zbigniew2011@gmail.com (LIT) writes:
    > [Anton Ertl:]
    >> Earlier you wrote about performance, now you switch to clarity of the
    >> code. What is the goal?

    > Both — one isn't contrary to another.

    Sometimes the clearer code is slower and the faster code is less clear
    (as in the FAXPY-NOSTRIDE example).

    > What I have in mind is: by performing OOS operation
    > we don't have to employ the whole "Forth machine" to
    > do the usual things (I mean, of course, the usual
    > steps described by Brad Rodriguez in his "Moving
    > Forth" paper).

    What does "OOS" stand for? And what do you mean by "the usual
    steps"? I am not going to read the whole paper and guess which
    of the code shown there you have in mind.

    > It comes with a cost: usual Forth words, that use
    > the stack, are versatile, while such OOS words
    > aren't that versatile anymore — yet (at least in
    > the case of ITC non-optimizing Forths) they should
    > be faster.

    One related thing is the work on "register"-based virtual machines.
    For interpreted implementations the VM registers are in memory, but
    they are accessed by "register number"; these usually correspond to
    locals slots on machines like the JavaVM. A well-known example of
    that is the switch of the Lua VM from stack-based to register-based.
    A later example is Android's Dalvik VM for Java, in contrast to the
    stack-based JavaVM.

    There is a paper [shi+08] that provides an academic justification
    for this approach. The gist of it is that, with some additional
    compiler complexity, the register-based machine can reduce the
    number of NEXTs (in Forth threaded-code terminology); depending on
    the implementation approach and the hardware, the NEXTs could be
    the major cost at the time. However, already at the time, dynamic
    superinstructions (an implementation technique for virtual-machine
    interpreters) reduced the number of NEXTs to one per basic block,
    and VM registers did nothing to reduce NEXTs in that case; Shi et
    al. also showed that, with a lot of compiler sophistication (data
    flow analysis etc.), VM registers can be as fast as stacks even
    with dynamic superinstructions.

    However, given that dynamic superinstructions are easier to
    implement and that VM registers give no benefit when dynamic
    superinstructions are employed, why would one go for VM registers?
    Of course, in the Forth setting one could offload the optimization
    onto the programmer, but even Chuck Moore did not go there.

    In any case, here's an example extracted from Figure 6 of the paper:

    Java VM (stack)      VM registers
    19 iload_1
    20 bipush #31        iconst #31 -> r1
    21 imul              imul r6 r1 -> r3
    22 aload_0
    23 getfield value    getfield r0.value -> r5
    24 iload_3
    26 caload            caload r5 r7 -> r5
    27 iadd              iadd r3 r5 -> r6
    28 istore_1

    So yes, the VM register code contains fewer VM instructions. Is it
    clearer?
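The comparison in that table can be sketched as two toy interpreters (Python; the opcode names and heap layout are invented for the demo, not real JVM bytecode) running the same expression, r1 = r1*31 + value[r3], and counting dispatched VM instructions:

```python
def run_stack(regs, heap):
    """Stack VM: 9 dispatched instructions for the basic block."""
    st, n = [], 0
    prog = [("load", 1), ("push", 31), ("mul",), ("load", 0),
            ("getfield", "value"), ("load", 3), ("caload",),
            ("add",), ("store", 1)]
    for op, *arg in prog:
        n += 1
        if op == "push":       st.append(arg[0])
        elif op == "load":     st.append(regs[arg[0]])
        elif op == "store":    regs[arg[0]] = st.pop()
        elif op == "mul":      b, a = st.pop(), st.pop(); st.append(a * b)
        elif op == "add":      b, a = st.pop(), st.pop(); st.append(a + b)
        elif op == "getfield": st.append(heap[st.pop()][arg[0]])
        elif op == "caload":   i, a = st.pop(), st.pop(); st.append(a[i])
    return n

def run_reg(r, heap):
    """Register VM: 5 dispatched instructions, operands name registers."""
    n = 0
    r["t"] = 31;                   n += 1   # iconst #31 -> t
    r[1] = r[1] * r["t"];          n += 1   # imul
    r["v"] = heap[r[0]]["value"];  n += 1   # getfield
    r["v"] = r["v"][r[3]];         n += 1   # caload
    r[1] = r[1] + r["v"];          n += 1   # iadd
    return n

heap = {"obj": {"value": [5, 6, 7]}}
regs = {0: "obj", 1: 2, 3: 1}
n_stack = run_stack(regs, heap)
r = {0: "obj", 1: 2, 3: 1}
n_reg = run_reg(r, heap)
assert regs[1] == r[1] == 2 * 31 + 6      # both compute the same value
assert (n_stack, n_reg) == (9, 5)
```

Fewer dispatches, but each register-VM instruction carries more operands to decode, and (as discussed below) the register slots themselves live in memory.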

    The corresponding Gforth code is the stuff between IF and THEN in
    the following:

    0
    value: some-field
    value: value
    constant some-struct

    : foo
    {: r0 r1 r3 :}
    if
    r1 31 * r0 value r3 + c@ + to r1
    then
    r1 ;

    The code that Gforth produces for the basic block under consideration
    is:

    $7FC624AA0958 @local1 1->1
    7FC62464A5BA: mov [r10],r13
    7FC62464A5BD: sub r10,$08
    7FC62464A5C1: mov r13,$08[rbp]
    $7FC624AA0960 lit 1->2
    $7FC624AA0968 #31
    7FC62464A5C5: sub rbx,$50
    7FC62464A5C9: mov r15,-$08[rbx]
    $7FC624AA0970 * 2->1
    7FC62464A5CD: imul r13,r15
    $7FC624AA0978 @local0 1->2
    7FC62464A5D1: mov r15,$00[rbp]
    $7FC624AA0980 lit+ 2->2
    $7FC624AA0988 #8
    7FC62464A5D5: add r15,$18[rbx]
    $7FC624AA0990 @ 2->2
    7FC62464A5D9: mov r15,[r15]
    $7FC624AA0998 @local2 2->3
    7FC62464A5DC: mov r9,$10[rbp]
    $7FC624AA09A0 + 3->2
    7FC62464A5E0: add r15,r9
    $7FC624AA09A8 c@ 2->2
    7FC62464A5E3: movzx r15d,byte PTR [r15]
    $7FC624AA09B0 + 2->1
    7FC62464A5E7: add r13,r15
    $7FC624AA09B8 !local1 1->1
    7FC62464A5EA: add r10,$08
    7FC62464A5EE: mov $08[rbp],r13
    7FC62464A5F2: mov r13,[r10]
    7FC62464A5F5: add rbx,$50

    There are 8 loads and 2 stores in that code. If the VM registers are
    held in memory (as they usually are, and as the Gforth locals are),
    the VM register code performs at least 9 loads (7 register accesses,
    the getfield, and the caload) and 5 stores. Of course, in Forth one
    would write the block as:

    : foo1 ( n3 a0 n1 -- n )
    31 * swap value rot + c@ + ;

    and the code for that is (without the ";"):

    $7FC624AA0A10 lit 1->2
    $7FC624AA0A18 #31
    7FC62464A617: mov r15,$08[rbx]
    $7FC624AA0A20 * 2->1
    7FC62464A61B: imul r13,r15
    $7FC624AA0A28 swap 1->2
    7FC62464A61F: mov r15,$08[r10]
    7FC62464A623: add r10,$08
    $7FC624AA0A30 lit+ 2->2
    $7FC624AA0A38 #8
    7FC62464A627: add r15,$28[rbx]
    $7FC624AA0A40 @ 2->2
    7FC62464A62B: mov r15,[r15]
    $7FC624AA0A48 rot 2->3
    7FC62464A62E: mov r9,$08[r10]
    7FC62464A632: add r10,$08
    $7FC624AA0A50 + 3->2
    7FC62464A636: add r15,r9
    $7FC624AA0A58 c@ 2->2
    7FC62464A639: movzx r15d,byte PTR [r15]
    $7FC624AA0A60 + 2->1
    7FC62464A63D: add r13,r15

    6 loads, 0 stores.

    And if we feed the equivalent standard code

    0
    field: some-field
    field: value-addr
    constant some-struct

    : foo1 ( n3 a0 n1 -- n )
    31 * swap value-addr @ rot + c@ + ;

    into other Forth systems, some produce even better code:

    VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
    FOO1
    ( 0050A310 486BDB1F ) IMUL RBX, RBX, # 1F
    ( 0050A314 488B5500 ) MOV RDX, [RBP]
    ( 0050A318 488B4D08 ) MOV RCX, [RBP+08]
    ( 0050A31C 48034A08 ) ADD RCX, [RDX+08]
    ( 0050A320 480FB609 ) MOVZX RCX, Byte 0 [RCX]
    ( 0050A324 4803D9 ) ADD RBX, RCX
    ( 0050A327 488D6D10 ) LEA RBP, [RBP+10]
    ( 0050A32B C3 ) RET/NEXT
    ( 28 bytes, 8 instructions )

    5 loads, 0 stores. And VFX does not do data-flow analysis across
    basic blocks, unlike the Java VM -> VM register compiler that Shi
    used; i.e., VFX is probably simpler than the compiler Shi used.

    @Article{shi+08,
    author = {Yunhe Shi and Kevin Casey and M. Anton Ertl and
    David Gregg},
    title = {Virtual machine showdown: Stack versus registers},
    journal = {ACM Transactions on Architecture and Code
    Optimization (TACO)},
    year = {2008},
    volume = {4},
    number = {4},
    pages = {21:1--21:36},
    month = jan,
    url = {http://doi.acm.org/10.1145/1328195.1328197},
    abstract = {Virtual machines (VMs) enable the distribution of
    programs in an architecture-neutral format, which
    can easily be interpreted or compiled. A
    long-running question in the design of VMs is
    whether a stack architecture or register
    architecture can be implemented more efficiently
    with an interpreter. We extend existing work on
    comparing virtual stack and virtual register
    architectures in three ways. First, our translation
    from stack to register code and optimization are
    much more sophisticated. The result is that we
    eliminate an average of more than 46\% of
    executed VM instructions, with the bytecode size of
    the register machine being only 26\% larger
    than that of the corresponding stack one. Second, we
    present a fully functional virtual-register
    implementation of the Java virtual machine (JVM),
    which supports Intel, AMD64, PowerPC and Alpha
    processors. This register VM supports
    inline-threaded, direct-threaded, token-threaded,
    and switch dispatch. Third, we present experimental
    results on a range of additional optimizations such
    as register allocation and elimination of redundant
    heap loads. On the AMD64 architecture the register
    machine using switch dispatch achieves an average
    speedup of 1.48 over the corresponding stack
    machine. Even using the more efficient
    inline-threaded dispatch, the register VM achieves a
    speedup of 1.15 over the equivalent stack-based VM.}
    }

    > Clarity of the code comes as a "bonus" :) yes, we've
    > got VALUEs and I use them when needed, but their use
    > still means employing the "Forth machine".

    What do you mean with 'the "Forth machine"', and how does "OOS"
    (whatever that is) avoid it?

    - anton
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 10:58:30 2025

    >>> Earlier you wrote about performance, now you switch to clarity of the
    >>> code. What is the goal?

    >> Both — one isn't contrary to another.

    > Sometimes the clearer code is slower and the faster code is less clear
    > (as in the FAXPY-NOSTRIDE example).

    >> What I have in mind is: by performing OOS operation
    >> we don't have to employ the whole "Forth machine" to
    >> do the usual things (I mean, of course, the usual
    >> steps described by Brad Rodriguez in his "Moving
    >> Forth" paper).

    > What does "OOS" stand for?

    It's an acronym for the term I propose: "out-of-stack" (operation).

    > What do you mean with "the usual steps"; I
    > am not going to read the whole paper and guess which of the code shown
    > there you have in mind.

    I mean the description of how the "Forth machine" works:

    "Assume SQUARE is encountered while executing some other Forth word.
    Forth's Interpreter Pointer (IP) will be pointing to a cell in memory -- contained within that "other" word -- which contains the address of the
    word SQUARE. (To be precise, that cell contains the address of SQUARE's
    Code Field.) The interpreter fetches that address, and then uses it to
    fetch the contents of SQUARE's Code Field. These contents are yet
    another address -- the address of a machine language subroutine which
    performs the word SQUARE. In pseudo-code, this is:

    (IP) -> W    fetch memory pointed by IP into "W" register
                 ...W now holds address of the Code Field
    IP+2 -> IP   advance IP, just like a program counter
                 (assuming 2-byte addresses in the thread)
    (W) -> X     fetch memory pointed by W into "X" register
                 ...X now holds address of the machine code
    JP (X)       jump to the address in the X register

    This illustrates an important but rarely-elucidated principle: the
    address of the Forth word just entered is kept in W. CODE words don't
    need this information, but all other kinds of Forth words do.

    If SQUARE were written in machine code, this would be the end of the
    story: that bit of machine code would be executed, and then jump back to
    the Forth interpreter -- which, since IP was incremented, is pointing to
    the next word to be executed. This is why the Forth interpreter is
    usually called NEXT.

    But, SQUARE is a high-level "colon" definition… [..]” etc.

    ( https://www.bradrodriguez.com/papers/moving1.htm )
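Brad's four pseudo-code steps can be transcribed literally (a Python sketch assuming 2-byte little-endian cells; all addresses here are invented for the demo):

```python
# Flat memory plus the IP/W/X register shuffle of an ITC NEXT.
mem = bytearray(64)

def rd16(a): return mem[a] | (mem[a + 1] << 8)
def wr16(a, v): mem[a] = v & 0xFF; mem[a + 1] = v >> 8

# Thread cell at 10 holds the address of SQUARE's code field at 20,
# whose cell holds the address of the machine code at 30 (a stub here).
wr16(10, 20)                 # cell in the "other" word: SQUARE's CFA
wr16(20, 30)                 # SQUARE's code field: address of machine code
machine_code = {30: "code for SQUARE"}

IP = 10
W = rd16(IP)                 # (IP) -> W : address of the Code Field
IP += 2                      # IP+2 -> IP : advance IP like a PC
X = rd16(W)                  # (W) -> X  : address of the machine code
routine = machine_code[X]    # JP (X)    : jump to the address in X

assert (W, IP, X, routine) == (20, 12, 30, "code for SQUARE")
```

Every one of these memory fetches happens once per word executed, which is the overhead the proposed OOS words would amortize over a longer machine-code sequence.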

    Many of these steps can, in particular cases, be avoided by the
    use of the proposed OOS words, making the Forth program faster
    (at least sometimes) — and, as a kind of "bonus", the clarity of
    the code increases.

    Probably in the case of an "optimizing compiler" the gain may not
    be too significant, from what I've already learned here; still, in
    the case of simpler compilers — and maybe especially the ones
    created for CPUs not that suitable for Forth at all (lack of
    registers, like the 8051, for example) — it will probably be
    advantageous.

    [..]
    >> Clarity of the code comes as a "bonus" :) yes, we've
    >> got VALUEs and I use them when needed, but their use
    >> still means employing the "Forth machine".

    > What do you mean with 'the "Forth machine"', and how does "OOS"
    > (whatever that is) avoid it?

    By the "Forth machine" I mean the internal workings of the Forth
    system — see the above quote from Brad's paper. When we don't need
    to "fetch memory pointed by IP into "W" register, advance IP, just
    like a program counter" etc., replacing the whole process (which
    is repeated for each subsequent word again and again) with a short
    string of ML instructions, we should see a significant gain in
    processing speed.

    I'll be able to say more after I do the comparison, at least for
    fig-Forth on x86 under DOS control — because so far all of the
    above is just my own conjecture. I'll publish the first results
    this evening.

  • From minforth@gmx.net (minforth) to comp.lang.forth on Tue Feb 25 11:16:26 2025

    > But, SQUARE is a high-level "colon" definition... [..]" etc.
    >
    > ( https://www.bradrodriguez.com/papers/moving1.htm )
    >
    > Many of these steps in particular cases can be avoided
    > by the use of proposed OOS words, making (at least sometimes)
    > the Forth program faster — and, as a kinda "bonus", clarity
    > of the code increases.

    After having avoided premature optimisation, every 'decent' Forth
    programmer will recode a few bottleneck words, e.g. in assembler,
    where necessary. IOW, microbenchmarking SQUARE, which can be
    implemented in a handful of lines of machine code or less, does
    not bring new insights.
  • From zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 11:40:46 2025

    >>> But, SQUARE is a high-level "colon" definition... [..]" etc.
    >>>
    >>> ( https://www.bradrodriguez.com/papers/moving1.htm )
    >>>
    >>> Many of these steps in particular cases can be avoided
    >>> by the use of proposed OOS words, making (at least sometimes)
    >>> the Forth program faster — and, as a kinda "bonus", clarity
    >>> of the code increases.

    > After having avoided premature optimisation, every 'decent'
    > Forth programmer will recode some few bottleneck words e.g.
    > in assembler, where necessary. IOW microbenchmarking SQUARE,
    > which can be implemented in a handful of lines of machine code
    > or less, does not bring new insights.

    I agree with you — still, it does take a decent Forth programmer.
    Recall the ones described by Jeff Fox? Those Forth programmers
    who refused to use Machine Forth just because "they were hired
    to program in ANS Forth"?
    I don't believe they would have been able to recode anything in
    assembler — and note, that was about 30 years ago. Since then,
    assembler programming has become even less popular.

  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 11:20:47 2025

    zbigniew2011@gmail.com (LIT) writes:
    > I mean the description how the "Forth machine" works:
    >
    > "Assume SQUARE is encountered while executing some other Forth word.
    > Forth's Interpreter Pointer (IP) will be pointing to a cell in memory --
    > contained within that "other" word -- which contains the address of the
    > word SQUARE. (To be precise, that cell contains the address of SQUARE's
    > Code Field.) The interpreter fetches that address, and then uses it to
    > fetch the contents of SQUARE's Code Field. These contents are yet
    > another address -- the address of a machine language subroutine which
    > performs the word SQUARE. In pseudo-code, this is:
    >
    > (IP) -> W    fetch memory pointed by IP into "W" register
    >              ...W now holds address of the Code Field
    > IP+2 -> IP   advance IP, just like a program counter
    >              (assuming 2-byte addresses in the thread)
    > (W) -> X    fetch memory pointed by W into "X" register
    >              ...X now holds address of the machine code
    > JP (X)      jump to the address in the X register
    >
    > This illustrates an important but rarely-elucidated principle: the
    > address of the Forth word just entered is kept in W. CODE words don't
    > need this information, but all other kinds of Forth words do.
    >
    > If SQUARE were written in machine code, this would be the end of the
    > story: that bit of machine code would be executed, and then jump back to
    > the Forth interpreter -- which, since IP was incremented, is pointing to
    > the next word to be executed. This is why the Forth interpreter is
    > usually called NEXT.
    >
    > But, SQUARE is a high-level "colon" definition... [..]" etc.
    >
    > ( https://www.bradrodriguez.com/papers/moving1.htm )
    >
    > Many of these steps in particular cases can be avoided
    > by the use of proposed OOS words, making (at least sometimes)
    > the Forth program faster — and, as a kinda "bonus", clarity
    > of the code increases.

    What Rodriguez describes above is NEXT. As I mentioned in the earlier
    posting, using a VM with VM registers reduces the number of NEXTs
    executed, but if you go for dynamic superinstructions or native-code compilation, the number of NEXTs is reduced even more. And this can
    be done while still working with ordinary Forth code, no OOS needed.
    And these kinds of compilers can be done with relatively little
    effort.

    Probably in the case of an "optimizing compiler" the gain
    may not be too significant, from what I've already learned
    here; still, in the case of simpler compilers — and maybe
    especially ones created for CPUs not that suitable
    for Forth at all (lack of registers, like the 8051,
    for example) — it may be advantageous.

    I cannot speak about the 8051, but machine Forth is a simple
    native-code system and it's stack-based.

    By the "Forth machine" I mean that internal work of the
    Forth compiler - see the above quote from Brad's paper
    - and when we don't need to "fetch memory pointed by
    IP into "W" register, advance IP, just like a program
    counter" etc. etc. — replacing the whole process
    (which is repeated for each subsequent word again and
    again) with a short string of ML instructions — we should
    see a significant gain in processing speed.

    Yes, dynamic superinstructions provide a good speedup for Gforth, and native-code systems also show a good speedup compared to classic
    threaded-code systems. But it's not necessary to eliminate the stack
    for that. Actually dealing with the stack is orthogonal to
    threaded code vs. native code.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Tue Feb 25 23:07:04 2025
    From Newsgroup: comp.lang.forth

    On 25/02/2025 6:04 pm, LIT wrote:
    : +> ( addr1 addr2 addr3 -- )
     rot @ rot @ + swap ! ;

    Of course the above is just an illustration; I mean coding
    such word directly in ML. It should be significantly
    faster than going through stack usual way.

    A set of three addresses on the stack is messy even before
    one does anything with them.

    Yep, but I meant the case of, for example:

    var1 @ var2 @ + var3 !

    The above isn't messy at all.

    So IMHO by using such OOS (out-of-stack) operation - coded
    directly in ML - we can replace the above by:

    var1 var2 var3 +>

    ...
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS



    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 12:10:46 2025
    From Newsgroup: comp.lang.forth

    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    I expect an even bigger gain with the older fig-Forth
    model.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 12:24:32 2025
    From Newsgroup: comp.lang.forth

    What Rodriguez describes above is NEXT. As I mentioned in the earlier posting, using a VM with VM registers reduces the number of NEXTs
    executed, but if you go for dynamic superinstructions or native-code compilation, the number of NEXTs is reduced even more. And this can
    be done while still working with ordinary Forth code, no OOS needed.
    And these kinds of compilers can be done with relatively little
    effort.

    Yes, but by "Forth machine" I actually mean the
    work of the internals of the Forth compiler (which
    has been nicely described by Brad), not just NEXT alone.

    I do realize that the proposed OOS technique may not be
    that advantageous with a more sophisticated compiler
    that does a fine job for the programmer. Still, such a
    compiler isn't available everywhere; if I find enough
    time during the coming weekend, I'll try to do some
    exercises and comparisons using Camel Forth for the 8051.
    I expect the difference may be really big.

    Probably in the case of an "optimizing compiler" the gain
    may not be too significant, from what I've already learned
    here; still, in the case of simpler compilers — and maybe
    especially ones created for CPUs not that suitable
    for Forth at all (lack of registers, like the 8051,
    for example) — it may be advantageous.

    I cannot speak about the 8051, but machine Forth is a simple
    native-code system and it's stack-based.

    Never used that one. I know nothing about Machine Forth.
    BTW: is it available for download anywhere (if not
    commercial/restricted)?

    By the "Forth machine" I mean that internal work of the
    Forth compiler - see the above quote from Brad's paper
    - and when we don't need to "fetch memory pointed by
    IP into "W" register, advance IP, just like a program
    counter" etc. etc. — replacing the whole process
    (which is repeated for each subsequent word again and
    again) with a short string of ML instructions — we should
    see a significant gain in processing speed.

    Yes, dynamic superinstructions provide a good speedup for Gforth, and native-code systems also show a good speedup compared to classic threaded-code systems. But it's not necessary to eliminate the stack
    for that. Actually dealing with the stack is orthogonal to
    threaded code vs. native code.

    Yes, I agree it's not necessary - still, in particular
    cases it may mean a noticeable speed-up. In other
    cases - and especially for the mentioned optimizing
    compilers - it may not make much sense, probably.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Feb 25 13:45:09 2025
    From Newsgroup: comp.lang.forth

    In article <591e7bf58ebb1f90bd34fba20c730b83@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere wouldn't it
    be practical to use directly the variables? Like this:

    : +> ( addr1 addr2 addr3 -- )
    rot @ rot @ + swap ! ;

    Of course the above is just an illustration; I mean coding
    such word directly in ML. It should be significantly
    faster than going through stack usual way.

    But after I came up with this idea I realized someone
    surely invented that before - it looks so obvious — yet
    I didn't see it anywhere. Did anyone of you see something
    like this in any code? If so — actually why somehow
    (probably?) such solution has not become widespread?
    Looks good to me; math can be done completely in ML
    avoiding "Forth machine" engagement, therefore saving many
    cycles.

    I have done some work on optimisation in ciforth.
    This work has stalled, but on the infamous byte prime benchmark
    it was in the ballpark of SwiftForth and MPE Forth.
    (Disingenuous, because this was the example I used.)
    See https://home.hccnet.nl/a.w.m.van.der.horst/forthlecture5.html
    This is about folding, a generalisation of constant folding.
    This requires that you know the properties of the Forth words,
    i.e. that you can execute + at compile time if the inputs
    are constant.

    The next step is inlining, which requires transforming control
    structures to jumps. This eliminates all call/return pairs.

    A further step is replacing stack-offset operations with register
    operations. I have succeeded in eliminating the use of the return
    stack in a resulting block of code. Remember, there are no longer
    return addresses on the return stack.

    Then I got stalled. I introduced complicated rules to handle
    pop, push and operators, simplifying by interchanging and transforming.
    E.g. a rule
    movipop-pattern DUP matches? IF ?movipop-replace? ELSE
    tests if a pattern applies, then executes the replacement.
    This is one rule of the "no brain" matches, the simplest.

    "
    <! !Q! MOVI|X, !!T 0 {L,} ~!!T !Q! POP|X, !!T !>
    <A Q: POP|X, !TALLY NEXT A>
    { bufv 7 + C@ 7 AND bufc 1 + OR!U }
    optimisation movipop-pattern \ A object

    \ Relying heavily on a smart assembler/disassembler
    \ optimisation is a class name.

    \ :" it is all the same register."
    : movipop-same bufv 1+ C@ bufv 7 + C@ XOR 7 AND 0= ;

    \ Optional replace, leave " was replaced".
    : ?movipop-replace? movipop-same DUP IF replace THEN ;
    REGRESS HERE Q: MOVI|X, BX| 0 IL, Q: POP|X, AX| matches? movipop-same S: TRUE
    "

    This is going nowhere. Instead the technique of replacing
    cells offset from the data stack must be used.
    It has proven to work, totally replacing return stack manipulations
    with registers.

    I chase a different goal here: code that I can't improve by studying
    the assembler code. Pretty silly, given that i86 is a dying
    architecture.

    The goal can be attained. I remember a 4-page comparison function
    in C, compiled by the Intel C compiler. There was not a single
    thing to improve upon.

    And a general remark: optimise where it counts, replace the
    bottleneck. That is practical. All the rest is sport, like
    Mount Everest or the South Pole.

    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Tue Feb 25 13:46:52 2025
    From Newsgroup: comp.lang.forth

    In article <a81ac9ee2ed92686e940a55bed9d4dfb@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    I expect an even bigger gain with the older fig-Forth
    model.

    Gain is only to be expected in the context of an application.

    --

    Groetjes Albert
    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Tue Feb 25 13:25:00 2025
    From Newsgroup: comp.lang.forth

    On Tue, 25 Feb 2025 11:40:46 +0000, LIT wrote:

    I agree with you - still it does take decent Forth programmer.
    Recall the ones described by Jeff Fox? These Forth programmers,
    that refused to use Machine Forth just because "they were hired
    to program in ANS Forth"?
    I don't believe they were able to recode anything in
    assembler - and note, it was about 30 years ago. Since that
    time assembler programming has become even less popular.

    A bit off-topic: I have been in a similar situation when some of
    our service engineers were very reluctant to modify inner
    software parts of controllers. The guys were not dumb, but with
    such modifications comes responsibility when something unexpected
    happens, like a system crash. So it was more of a legal than a
    technical issue.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 14:36:35 2025
    From Newsgroup: comp.lang.forth

    A bit off-topic: I have been in a similar situation when some of
    our service engineers were very reluctant to modify inner
    software parts of controllers. The guys were not dumb, but with
    such modifications comes responsibility when s.th. unexpected
    happens like a system crash. So it was more of a legal than a
    technical issue.

    Yes, I'm aware the reason may be different in a different
    case; still, Jeff portrayed that situation in a rather clear
    way: they didn't want to use Machine Forth just because "they
    were paid for ANS Forth programming" - they had signed a kind
    of agreement for that, therefore they "weren't interested" in
    any changes etc.

    Unfortunately we won't have any opportunity anymore to ask
    Jeff for more details.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 14:44:17 2025
    From Newsgroup: comp.lang.forth

    I have done some work on optimisation on ciforth.
    This work has stalled, but the infamous byte prime benchmark,
    was in the ballpark of swiftforth and mpeforth.
    (Disingenuous, because this was the example I used.)
    See https://home.hccnet.nl/a.w.m.van.der.horst/forthlecture5.html
    This is about folding, a generalisation of constant folding.
    This requires that you know the properties of the Forth Words,
    i.e. that you can execute + at compile time, if the inputs
    are constant.
    [..]
    Then I got stalled. I introduced complicated rules [..]

    This is a more sophisticated way.

    My proposal is rather humble: a modest extension
    of the Forth programming paradigm, from "all data goes
    through the stack" to "any data can go through the
    stack, but it's not strictly mandatory in every
    single case".

    Now I'm pondering the DO..LOOP construct; actually
    it probably doesn't need to rely on the return
    stack. I wonder how lightweight the loop can become
    when rewritten using OOS words.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Gerry Jackson@do-not-use@swldwa.uk to comp.lang.forth on Tue Feb 25 15:32:47 2025
    From Newsgroup: comp.lang.forth

    On 25/02/2025 14:36, LIT wrote:
    A bit off-topic: I have been in a similar situation when some of
    our service engineers were very reluctant to modify inner
    software parts of controllers. The guys were not dumb, but with
    such modifications comes responsibility when s.th. unexpected
    happens like a system crash. So it was more of a legal than a
    technical issue.

    Yes, I'm aware the reason may be different in a different
    case; still, Jeff portrayed that situation in a rather clear
    way: they didn't want to use Machine Forth just because "they
    were paid for ANS Forth programming" - they had signed a kind
    of agreement for that, therefore they "weren't interested" in
    any changes etc.

    Unfortunately we won't have any opportunity anymore to ask
    Jeff for more details.

    --

    Sounds like a management failure; they should have mandated that Machine
    Forth was to be used when the programmers were hired.
    --
    Gerry
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Tue Feb 25 18:43:28 2025
    From Newsgroup: comp.lang.forth

    On 25-02-2025 08:04, LIT wrote:
    var1 @ var2 @ + var3 !

    The above isn't messy at all.

    Frankly, it's far worse than "messy". We passed that station 30 miles ago.

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Tue Feb 25 18:52:02 2025
    From Newsgroup: comp.lang.forth

    On 24-02-2025 20:49, LIT wrote:
    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere wouldn't it
    be practical to use directly the variables?
    No. Because you should minimize the use of variables. So if you're using
    THREE variables, you're definitely doing something VERY WRONG.

    If you wanna write Forth, write Forth. If you wanna write C, write C.
    If you can't handle a stack, you're definitely a C programmer. It's very simple.

    Furthermore, it allows me to write code like this:

    CODE (MINUS) DSIZE (2); DDROP; DS (1) -= DS (0); NEXT;
    CODE (MUL) DSIZE (2); DDROP; DS (1) *= DS (0); NEXT;
    CODE (NEGATE) DSIZE (1); DS (1) = -(DS (1)); NEXT;
    CODE (OR) DSIZE (2); DDROP; DS (1) |= DS (0); NEXT;
    CODE (AND) DSIZE (2); DDROP; DS (1) &= DS (0); NEXT;
    CODE (XOR) DSIZE (2); DDROP; DS (1) ^= DS (0); NEXT;

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 18:04:30 2025
    From Newsgroup: comp.lang.forth

    zbigniew2011@gmail.com (LIT) writes:
    I know nothing about Machine Forth.
    BTW: is it available for download anywhere (if not
    commercial/restricted)?

    I think it's the compiler part of colorForth <https://colorforth.github.io/cf.htm>. Looking around a bit I find <https://colorforth.github.io/forth.html>, which shows how machine
    Forth primitives are compiled.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 18:17:55 2025
    From Newsgroup: comp.lang.forth

    zbigniew2011@gmail.com (LIT) writes:
    Now I'm pondering the DO..LOOP construct; actually
    it probably doesn't need to rely on the return
    stack.

    There are many ways to skin this cat, and I have written at length
    about that here. For performance you should keep those data in
    registers that you update in the loop. E.g., a simple way is to have
    the index and the limit around, and to update the index; then you
    should keep the index in a register; leaving the unchanging limit in
    memory is not so bad for performance on many CPUs.

    You need to save the old index and old limit when entering another
    do...loop, and restore them on exiting the do...loop, including when
    you exit with UNLOOP or THROW. Both SwiftForth and VFX switched to
    keeping the loop control parameters in registers in their 64-bit
    ports, and at first forgot to restore the old loop control parameters
    on THROW; they have fixed this bug as soon as it was found.

    Instead of keeping index and limit, there are also variants that keep
    other values around, to make +LOOP more efficient (sometimes at the
    cost of a more expensive I).
    <2021Jan10.112340@mips.complang.tuwien.ac.at> discusses a number of
    these variants.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 20:57:28 2025
    From Newsgroup: comp.lang.forth

    I wonder: wouldn't it be useful to have stackless basic
    arith operations? I mean instead of fetching the values
    first and putting them on the stack, then doing something,
    and in the end storing the result somewhere wouldn't it
    be practical to use directly the variables?
    No. Because you should minimize the use of variables. So if you're using THREE variables, you're definitely doing something VERY WRONG.

    If you wanna write Forth, write Forth. If you wanna write C, write C.
    If you can't handle a stack, you're definitely a C programmer. It's very simple..

    Forgive me for being contrary, but IMHO use of locals
    is much more C-ish than the use of "as many as" three
    (OMG!) variables in a single program. ;)

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 21:08:36 2025
    From Newsgroup: comp.lang.forth

    So I did some quite basic testing with x86
    fig-Forth for DOS. I devised 4 OOS words:

    :=: (exchange values among two variables)
    pop BX
    pop DI
    mov AX,[BX]
    xchg AX,[DI]
    mov [BX],AX
    jmp NEXT

    ++ (increment variable by one)
    pop BX
    inc WORD PTR [BX]
    jmp NEXT

    -- (similar to above, just uses dec -- not tested, it'll give same
    result)

    +> (add two variables then store result into third one)
    pop DI
    pop BX
    mov CX,[BX]
    pop BX
    mov AX,[BX]
    add AX,CX
    mov [DI],AX
    jmp NEXT

    How the simplistic tests have been done:

    7 VARIABLE V1
    8 VARIABLE V2
    9 VARIABLE V3

    : TOOK ( t1 t2 -- )
    DROP SPLIT TIME@ DROP SPLIT
    ROT SWAP - CR ." It took " U. ." seconds and "
    - 10 * U. ." milliseconds "
    ;

    : TEST1
    1000 0 DO 10000 0 DO
    ...expression...
    LOOP LOOP
    ;

    0 0 TIME! TIME@ TEST1 TOOK

    The results are (for the following expressions):

    V1 @ V2 @ + V3 ! - 25s 430ms
    V1 V2 V3 +> - 17s 240ms

    1 V1 +! - 14s 60ms
    V1 ++ - 10s 820ms

    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms
    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Tue Feb 25 22:35:42 2025
    From Newsgroup: comp.lang.forth

    zbigniew2011@gmail.com (LIT) writes:
    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms

    Too much OOS thinking? Try

    V1 @ V2 @ V1 ! V2 !

    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    The question is how often you use these new words in applications.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Tue Feb 25 22:44:04 2025
    From Newsgroup: comp.lang.forth

    On Tue, 25 Feb 2025 22:35:42 +0000, Anton Ertl wrote:

    zbigniew2011@gmail.com (LIT) writes:
    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms

    Too much OOS thinking? Try

    V1 @ V2 @ V1 ! V2 !

    Yep. My bad.

    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    The question is how often you use these new words in applications.

    Indeed - but that's not everything that's possible,
    just a few examples.
    Another example: I'll try to do a DO..LOOP that
    avoids the return stack, and I'm curious about
    the change.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Wed Feb 26 11:35:50 2025
    From Newsgroup: comp.lang.forth

    On 25/02/2025 11:10 pm, LIT wrote:
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
      dx pop  cx pop  bx pop  0 [bx] ax mov  cx bx xchg
      0 [bx] ax add  dx bx xchg  ax 0 [bx] mov  next
    end-code

    Timing (adjusted for loop time):

     var1 @ var2 @ + var3 !   8019 mS
     var1 var2 var3 +>        5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    ...

    I remain skeptical of such optimizations. Not even twice the performance,
    and the hope it represents a bottleneck in order to realize that gain.

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.lang.forth on Wed Feb 26 00:50:52 2025
    From Newsgroup: comp.lang.forth

    LIT <zbigniew2011@gmail.com> wrote:
    So I did some quite basic testing with x86
    fig-Forth for DOS. I devised 4 OOS words:

    :=: (exchange values among two variables)
    pop BX
    pop DI
    mov AX,[BX]
    xchg AX,[DI]
    mov [BX],AX
    jmp NEXT

    ++ (increment variable by one)
    pop BX
    inc WORD PTR [BX]
    jmp NEXT

    -- (similar to above, just uses dec -- not tested, it'll give same
    result)

    +> (add two variables then store result into third one)
    pop DI
    pop BX
    mov CX,[BX]
    pop BX
    mov AX,[BX]
    add AX,CX
    mov [DI],AX
    jmp NEXT

    How the simplistic tests have been done:

    7 VARIABLE V1
    8 VARIABLE V2
    9 VARIABLE V3

    : TOOK ( t1 t2 -- )
    DROP SPLIT TIME@ DROP SPLIT
    ROT SWAP - CR ." It took " U. ." seconds and "
    - 10 * U. ." milliseconds "
    ;

    : TEST1
    1000 0 DO 10000 0 DO
    ...expression...
    LOOP LOOP
    ;

    0 0 TIME! TIME@ TEST1 TOOK

    The results are (for the following expressions):

    V1 @ V2 @ + V3 ! - 25s 430ms
    V1 V2 V3 +> - 17s 240ms

    1 V1 +! - 14s 60ms
    V1 ++ - 10s 820ms

    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms
    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    If your expected use case is operations on variables, then
    what you gain is merging @ and ! into the operations. Since
    you still have variables, the gain is at most a factor of 2
    (you replace things like V1 @ by plain V1). The cost is the need
    to have several extra operations. A potential alternative is
    a pair of operations, say PUSH and POP, and a Forth compiler
    that replaces a pair like V1 @ by PUSH(V1). Note that here the
    address of V1 is intended to be part of PUSH (so it will
    take as much space as separate V1 and @, but is only a
    single primitive).

    More generally, a simple "optimizer" that replaces short
    sequences of Forth primitives by a different, shorter sequence
    of primitives is likely to give a similar gain. However,
    the chance of a match decreases with the length of the sequence.
    Above you bet on relatively long sequences (and on the programmer
    writing the alternative sequence). Shorter sequences have more
    chance of matching, so you need a smaller number of them
    for a similar gain.

    Extra thing: while simple memory-to-memory operations appear
    with some frequency, the rather typical pattern is expressions
    that produce some value that is immediately used by another
    operation; a stack is a very good fit for such use. One can
    do better than using the machine stack, namely keeping things in
    registers, but that means generating machine code and doing
    optimization. OTOH on 64-bit machines machine code is
    very natural: machine instructions are typically smaller
    than machine words (which are the natural unit for threaded
    code) and Forth primitives are likely to produce a very
    small number of instructions.
    --
    Waldek Hebisch
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 09:08:19 2025
    From Newsgroup: comp.lang.forth

    I remain skeptical of such optimizations. Not even twice the
    performance, and the hope it represents a bottleneck in order
    to realize that gain.

    I've got a feeling it would have had more
    significance in the 8088 era, say the IBM 5150
    or XTs. The 486 is probably already "too good"
    to see as much as a 50% gain.
    I've got a working XT board - if I manage to
    get at least an FDD interface for it (no,
    not today... it'll take some time) I'll
    do some more testing.

    In general I've got a feeling that this
    approach may be the more profitable, the
    less suitable the CPU is for Forth - say the 8051,
    lacking registers.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Feb 26 08:30:28 2025
    From Newsgroup: comp.lang.forth

    antispam@fricas.org (Waldek Hebisch) writes:
    A potential alternative is
    a pair of operations, say PUSH and POP, and a Forth compiler
    that replaces a pair like V1 @ by PUSH(V1). Note that here the
    address of V1 is intended to be part of PUSH (so it will
    take as much space as separate V1 and @, but is only a
    single primitive).

    In Gforth variables are compiled as "lit <addr>", and Gforth has a
    primitive LIT@, and a generalized constant-folding optimization that
    replaces "lit <addr> @" with "LIT@ <addr>". In the Gforth image there
    are 490 uses of lit@ (out of 33611 uses of primitives) and 76
    occurrences of "lit @" (in parts that are compiled before the
    generalized constant-folding is active). There are also 293
    occurrences of "lit !".

    However, given the minimal difference between the code produced for
    "LIT@" and "LIT @", LIT@ is no longer beneficial. E.g.:

    variable v ok
    : foo1 v @ + ; ok
    : foo2 v [ basic-block-end ] @ + ; ok
    see-code foo1
    $7F27184A08B0 lit@ 1->2
    $7F27184A08B8 v
    7F271806B580: mov rax,$08[rbx]
    7F271806B584: mov r15,[rax]
    $7F27184A08C0 + 2->1
    7F271806B587: add r13,r15
    $7F27184A08C8 ;s 1->1
    ...
    see-code foo2
    $7F27184A08F8 lit 1->2
    $7F27184A0900 v
    7F271806B596: mov r15,$08[rbx]
    $7F27184A0908 @ 2->2
    7F271806B59A: mov r15,[r15]
    $7F27184A0910 + 2->1
    7F271806B59D: add r13,r15
    $7F27184A0918 ;s 1->1
    ...

    More generally, a simple "optimizer" that replaces short
    sequences of Forth primitives by different, shorter sequence
    of primitives is likely to give similar gain. However,
    chance of match decreases with length of the sequence.

    Gforth has that as static superinstructions. You can see the
    sequences in
    <http://git.savannah.gnu.org/cgit/gforth.git/tree/peeprules.vmg>, in
    the lines before there is any occurrence of prim-states or something
    similar. As you can see, many of the formerly-used sequences are now
    commented out, because static superinstructions do not play well with
    a) static stack caching (currently static superinstructions only work
    for the default stack cache state) and b) IP-update optimization (if
    one of the primitives in the sequence has an immediate argument (e.g.,
    LIT), you would need additional variants for various IP offsets, or
    update the IP before the sequence).

    The remaining static superinstructions

    * have to do with stacks where we do not have stack caching (FP stack,
    locals stack, return stack),

    * are combinations of comparison primitives and ?BRANCH (this avoids
    the need to reify the result of the comparison in a general-purpose
    register), or

    * are sequences of typical memory-access words (not because they occur
    so often, but because it's better to have a small number of words
    that can be combined, and a number of combinations in the optimizer
    than to have a combinatorial explosion of words).
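
    At the source level one can approximate such fused memory-access
    sequences by hand; a minimal sketch (the names @+ and !+ are
    illustrative inventions here, not Gforth's actual superinstructions):

    ```forth
    \ hand-fused memory-access words (illustrative sketch)
    : @+  ( addr -- n addr' )  dup @ swap cell+ ;   \ fetch, then advance one cell
    : !+  ( n addr -- addr' )  tuck ! cell+ ;       \ store, then advance one cell
    ```

    An optimizer does the same thing internally: it matches the
    two-primitive sequence and emits one fused primitive instead.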

    Above you bet on relatively long sequences (and on the programmer
    writing an alternative sequence). Shorter sequences have a better
    chance of matching, so you need a smaller number of them
    for a similar gain.

    That's certainly our experience. Long sequences with high dynamic
    counts often come out of the inner loop of a single benchmark, and do
    not help other programs at all. We later preferred to go with static
    usage counts (i.e., the sequence occurs several times in the code),
    and this naturally leads to short sequences.

    One can
    do better than using the machine stack, namely keeping things in
    registers, but that means generating machine code and doing
    optimization.

    Gforth does stack caching at the level of primitives, by having
    several variants of the primitives for different start and end states
    of the primitives, and using a shortest-path search for finding out
    which combination of these variants to use. However, for multiple
    stacks this leads to a large number of states, and the shortest-path
    algorithm becomes too expensive. For now we only stack-cache the data
    stack.

    For extending this to multiple stacks, I see several alternatives:

    * Use a greedy algorithm instead of an optimal shortest-path
    algorithm. The difference is probably non-existent in most cases.

    * Manage the stack cache using register allocation techniques instead
    of representing it as an abstract state. This would often produce
    similar results as the greedy technique, but it can also handle
    stack manipulation words cheaply without having an explosion of
    stack states and the related complexity in the generator that
    generates the states and the tables for the state-handling.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Feb 26 11:45:52 2025
    From Newsgroup: comp.lang.forth

    On 25-02-2025 21:57, LIT wrote:
    Forgive me for being contrary, but IMHO use of locals
    is much more C-ish than the use of "as many as" three
    (OMG!) variables in a single program. ;)

    The goal is to use as few variables as possible. There are plenty of
    words I wrote using NO variables at all. Arrays (and strings) are
    another story, since it's hard to represent them using a stack (unless
    you dump every single element there - which is not realistic).

    If I apply that rule, I wrote an entire 1000+ line BASIC interpreter
    using *three* variables (stack frame pointer, partition pointer and a
    counter on the number of currently emitted characters on a line - TAB() remember?).

    So yeah, three variables is quite a lot. It should be rare enough not
    to worry about too much - let alone require special facilities to
    accommodate such a construct.

    Hans Bezemer
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Wed Feb 26 11:48:04 2025
    From Newsgroup: comp.lang.forth

    In article <a81ac9ee2ed92686e940a55bed9d4dfb@www.novabbs.com>,
    LIT <zbigniew2011@gmail.com> wrote:
    In case of slower ITC non-optimizing Forths - like
    fig-Forth, as the most obvious example - the "boost"
    may be noticeable.
    I'll check that.

    code +> ( x y z -- )
    dx pop cx pop bx pop 0 [bx] ax mov cx bx xchg
    0 [bx] ax add dx bx xchg ax 0 [bx] mov next
    end-code

    Timing (adjusted for loop time):

    var1 @ var2 @ + var3 ! 8019 mS
    var1 var2 var3 +> 5657 mS

    So even in the case of a fast DTC Forth, like DX Forth,
    it's already something worth closer attention,
    I believe.
    I expect even bigger gain in case of older fig-Forth
    model.

    ciforth is actually fig-Forth 5.5.3 with some ansification
    and abandonment of seventies-style tricks.

    I don't expect a gain in ciforth from this.

    --

    Groetjes Albert
    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Wed Feb 26 12:04:10 2025
    From Newsgroup: comp.lang.forth

    In article <2025Feb25.233542@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    zbigniew2011@gmail.com (LIT) writes:
    V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! - 40s 150ms

    Too much OOS thinking? Try

    V1 @ V2 @ V1 ! V2 !

    V1 V2 :=: - 15s 260ms

    So there is a noticeable difference indeed.

    The question is how often you use these new words in applications.

    These words might make sense connected to a sorting application. 1]
    Define those words there and don't clobber the global name space.


    - anton

    1] After testing of course.

    Groetjes Albert
    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Feb 26 11:23:03 2025
    From Newsgroup: comp.lang.forth

    On Wed, 26 Feb 2025 9:08:19 +0000, LIT wrote:

    I remain skeptical of such optimizations. Not even twice the
    performance
    and the hope it represents a bottle-neck in order to realize that gain.

    I've got a feeling it would have more
    significance in the 8088 era, say the IBM 5150
    or XTs. The 486 is probably already "too good"
    to show as much as a 50% gain.
    I've got a working XT board - if I manage to
    get at least an FDD interface for it (no,
    not today... it'll take some time) I'll
    do some more testing.

    Save yourself the time: use an emulator, e.g. PCem, DOSBox(X) or QEMU
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 11:31:20 2025
    From Newsgroup: comp.lang.forth

    Save yourself the time: use an emulator eg PCem, DOSBox(X) or QEMU

    I was usually using QEMU, but for some time now
    it no longer compiles properly. :(

    Some improvements(?) in GCC about two years ago.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 11:29:03 2025
    From Newsgroup: comp.lang.forth

    If I apply that rule, I wrote an entire 1000+ line BASIC interpreter
    using *three* variables (stack frame pointer, partition pointer and a
    counter on the number of currently emitted characters on a line - TAB() remember?).

    I didn't create BASIC interpreters, but I'm afraid
    only rather trivial programs can "live" without a
    handful of variables. For example: how do you create
    even a modest (screen-oriented) editor without adding
    several variables that reflect its state - where the
    cursor is at the moment, what's the filename in use,
    what the user's settings/preferences are - all
    that - etc. etc.?
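
    To make that concrete, a minimal sketch of such editor state (all
    names invented here for illustration):

    ```forth
    \ minimal editor state kept in global variables (illustrative)
    variable cur-row   variable cur-col   \ cursor position
    variable top-line                     \ first visible line on screen
    create fname 64 allot                 \ counted string: current filename

    : cursor-home ( -- )  0 cur-row !  0 cur-col ! ;
    : scroll-down ( -- )  1 top-line +! ;
    ```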

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 11:41:02 2025
    From Newsgroup: comp.lang.forth

    These words might make sense connected to a sorting application. 1]
    Define those words there and don't clobber the global name space.

    These words, as I already wrote, were just
    examples to illustrate the approach, which
    isn't limited to operations commonly associated
    with sorting-type work.

    I created also ROR/ROL words, that have nothing
    to do with any sorting processes:

    ROR ( n1 u -- n2 ? )
    xor AX,AX
    pop CX
    pop DX
    ror DX,CL
    adc AX,0
    jmp DPUSH
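
    For comparison, a carry-less high-level sketch of a 16-bit rotate
    right (assuming cells wider than 16 bits; the carry result returned
    by the code word above is not modelled):

    ```forth
    \ 16-bit rotate right in high-level Forth (illustrative sketch)
    : ror16 ( n u -- n' )
      15 and  2dup rshift >r        \ low part: n shifted right by u mod 16
      16 swap - lshift $FFFF and    \ high part: bits rotated around
      r> or ;
    ```

    E.g. $8001 1 ror16 gives $C000.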

    What for? At the moment just tinkering with OOS
    approach, trying to explore it some more.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 11:45:49 2025
    From Newsgroup: comp.lang.forth

    Oh, that's the code for the "other" ROR... :D
    Never mind, you get the idea.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Wed Feb 26 12:35:48 2025
    From Newsgroup: comp.lang.forth

    Results for iForth64.

    The runtime of test3 is remarkable. I think not much can be done
    about it, given the context.

    -marcel

    ---

    VARIABLE V1 7 V1 !
    VARIABLE V2 8 V2 !
    VARIABLE V3 9 V3 !

    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ++ ( a -- ) \ increment variable by one
    1 SWAP +! ;

    : +> ( a b c -- ) \ add two variables then store result into third one
    -ROT @ SWAP @ + SWAP ! ;

    : t1a V1 @ V2 @ + V3 ! ; : t1b V1 V2 V3 +> ;
    : t2a 1 V1 +! ; : t2b V1 ++ ;
    : t3a V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! ; : t3b V1 V2 :=: ;

    : TESTa S" TIMER-RESET #100000 0 DO #10000 0 DO " EVALUATE ; IMMEDIATE
    : TESTb S" LOOP LOOP 3 SPACES .ELAPSED " EVALUATE ; IMMEDIATE

    : test1 CR ." \ TEST1 : " TESTa t1a TESTb TESTa t1b TESTb ;
    : test2 CR ." \ TEST2 : " TESTa t2a TESTb TESTa t2b TESTb ;
    : test3 CR ." \ TEST3 : " TESTa t3a TESTb TESTa t3b TESTb ;

    : TESTS test1 test2 test3 ;

    TESTS
    \ TEST1 : 1.646 seconds elapsed. 1.661 seconds elapsed.
    \ TEST2 : 1.778 seconds elapsed. 1.728 seconds elapsed.
    \ TEST3 : 2.194 seconds elapsed. 1.645 seconds elapsed. ok
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Feb 26 13:05:35 2025
    From Newsgroup: comp.lang.forth

    This 1 billion times test of 3 cache cells is indeed remarkable. ;-)
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Feb 26 14:32:50 2025
    From Newsgroup: comp.lang.forth

    mhx@iae.nl (mhx) writes:
    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    <https://www.complang.tuwien.ac.at/forth/programs/sort.fs> contains:

    : exchange ( addr1 addr2 -- )
    over @ over @ >r swap ! r> swap ! ;

    Let's see if Gforth produces better code for one of them:

    see-code :=:
    $7FBD6B6A06A8 over 1->2
    7FBD6B26B3B0: mov r15,$08[r10]
    $7FBD6B6A06B0 @ 2->2
    7FBD6B26B3B4: mov r15,[r15]
    $7FBD6B6A06B8 >r 2->1
    7FBD6B26B3B7: mov -$08[r14],r15
    7FBD6B26B3BB: sub r14,$08
    $7FBD6B6A06C0 dup 1->2
    7FBD6B26B3BF: mov r15,r13
    $7FBD6B6A06C8 @ 2->2
    7FBD6B26B3C2: mov r15,[r15]
    $7FBD6B6A06D0 rot 2->3
    7FBD6B26B3C5: mov r9,$08[r10]
    7FBD6B26B3C9: add r10,$08
    $7FBD6B6A06D8 ! 3->1
    7FBD6B26B3CD: mov [r9],r15
    $7FBD6B6A06E0 r> 1->2
    7FBD6B26B3D0: mov r15,[r14]
    7FBD6B26B3D3: add r14,$08
    $7FBD6B6A06E8 swap 2->3
    7FBD6B26B3D7: add r10,$08
    7FBD6B26B3DB: mov r9,r13
    7FBD6B26B3DE: mov r13,[r10]
    $7FBD6B6A06F0 ! 3->1
    7FBD6B26B3E1: mov [r9],r15
    $7FBD6B6A06F8 ;s 1->1
    7FBD6B26B3E4: mov rbx,[r14]
    7FBD6B26B3E7: add r14,$08
    7FBD6B26B3EB: mov rax,[rbx]
    7FBD6B26B3EE: jmp eax

    see-code exchange
    $7FBD6B6A0728 over 1->2
    7FBD6B26B3F0: mov r15,$08[r10]
    $7FBD6B6A0730 @ 2->2
    7FBD6B26B3F4: mov r15,[r15]
    $7FBD6B6A0738 over 2->3
    7FBD6B26B3F7: mov r9,r13
    $7FBD6B6A0740 @ 3->3
    7FBD6B26B3FA: mov r9,[r9]
    $7FBD6B6A0748 >r 3->2
    7FBD6B26B3FD: mov -$08[r14],r9
    7FBD6B26B401: sub r14,$08
    $7FBD6B6A0750 swap 2->3
    7FBD6B26B405: add r10,$08
    7FBD6B26B409: mov r9,r13
    7FBD6B26B40C: mov r13,[r10]
    $7FBD6B6A0758 ! 3->1
    7FBD6B26B40F: mov [r9],r15
    $7FBD6B6A0760 r> 1->2
    7FBD6B26B412: mov r15,[r14]
    7FBD6B26B415: add r14,$08
    $7FBD6B6A0768 swap 2->3
    7FBD6B26B419: add r10,$08
    7FBD6B26B41D: mov r9,r13
    7FBD6B26B420: mov r13,[r10]
    $7FBD6B6A0770 ! 3->1
    7FBD6B26B423: mov [r9],r15
    $7FBD6B6A0778 ;s 1->1
    7FBD6B26B426: mov rbx,[r14]
    7FBD6B26B429: add r14,$08
    7FBD6B26B42D: mov rax,[rbx]
    7FBD6B26B430: jmp eax

    These things are hard to predict:-)

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.lang.forth on Wed Feb 26 08:04:44 2025
    From Newsgroup: comp.lang.forth

    On Wed, 26 Feb 2025 11:41:02 +0000
    zbigniew2011@gmail.com (LIT) wrote:

    These words, as I already wrote, were just examples to illustrate the approach, which isn't limited to operations commonly associated
    to do sorting kind of work.

    I created also ROR/ROL words, that have nothing to do with any
    sorting processes:

    I mean, that's the whole thing with Forth - you *can* define any words
    you like, based on your needs, extending the basic set of operations
    into a whole domain-specific language suited to the problem you're
    trying to solve. But that's not in itself a strong argument for adding
    XYZ to the "standard" dictionary.*

    * (And let the use of "standard" in connection with Forth never pass by
    without a muffled snort into one's sleeve.)

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 16:18:32 2025
    From Newsgroup: comp.lang.forth

    These words, as I already wrote, were just examples to illustrate the
    approach, which isn't limited to operations commonly associated
    to do sorting kind of work.

    I created also ROR/ROL words, that have nothing to do with any
    sorting processes:

    I mean, that's the whole thing with Forth - you *can* define any words
    you like, based on your needs, extending the basic set of operations
    into a whole domain-specific language suited to the problem you're
    trying to solve. But that's not in itself a strong argument for adding
    XYZ to the "standard" dictionary.*

    If you could, please, remind me when and where I was
    proposing to add these XYZs to standard dictionary?

    Thanks in advance!

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Wed Feb 26 17:46:13 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mhx@iae.nl (mhx) writes:
    : :=: ( a b -- ) \ exchange values among two variables
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    Another variant:

    : exchange ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    This uses the primitive

    '!@' ( u1 a-addr -- u2 ) gforth-experimental "store-fetch"
    load U2 from A_ADDR, and store U1 there, as atomic operation

    I worry that the atomic part will result in it being slower than the
    versions that do not use !@. Let's measure that:

    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;

    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : bench-exchange ( addr1 addr2 -- )
    100000000 0 do 2dup exchange loop ;

    : bench-:=: ( addr1 addr2 -- )
    100000000 0 do 2dup :=: loop ;

    variable v1
    variable v2

    1 v1 !
    2 v2 !

    Measurement with
    perf stat -e cycles -e instructions gforth-fast xxxx.fs -e "v1 v2 bench-exchange bye"
    perf stat -e cycles -e instructions gforth-fast xxxx.fs -e "v1 v2 bench-:=: bye"

    Results on a Zen4:

    exchange :=:
    877_054_156 812_761_422 cycles
    3_708_692_329 3_908_642_117 instructions

    So the !@ variant is indeed slower, but only a little (0.65 cycles
    per execution of these words); however, I would have expected either
    a big slowdown (from latency when dealing with the memory subsystem,
    broadcasting to other cores, etc.) or none at all.

    And here's the code:
    see-code exchange
    $7EFDC12A06A8 over 1->2
    7EFDC0DEA3B0: mov r15,$08[r10]
    $7EFDC12A06B0 @ 2->2
    7EFDC0DEA3B4: mov r15,[r15]
    $7EFDC12A06B8 swap 2->1
    7EFDC0DEA3B7: mov [r10],r15
    7EFDC0DEA3BA: sub r10,$08
    $7EFDC12A06C0 !@ 1->1
    7EFDC0DEA3BE: mov rax,$08[r10]
    7EFDC0DEA3C2: add r10,$08
    7EFDC0DEA3C6: xchg $00[r13],rax
    7EFDC0DEA3CA: mov r13,rax
    $7EFDC12A06C8 swap 1->2
    7EFDC0DEA3CD: mov r15,$08[r10]
    7EFDC0DEA3D1: add r10,$08
    $7EFDC12A06D0 ! 2->0
    7EFDC0DEA3D5: mov [r15],r13
    $7EFDC12A06D8 ;s 0->1
    7EFDC0DEA3D8: mov r13,$08[r10]
    7EFDC0DEA3DC: add r10,$08
    7EFDC0DEA3E0: mov rbx,[r14]
    7EFDC0DEA3E3: add r14,$08
    7EFDC0DEA3E7: mov rax,[rbx]
    7EFDC0DEA3EA: jmp eax

    see-code :=:
    $7FBD6B6A06A8 over 1->2
    7FBD6B26B3B0: mov r15,$08[r10]
    $7FBD6B6A06B0 @ 2->2
    7FBD6B26B3B4: mov r15,[r15]
    $7FBD6B6A06B8 >r 2->1
    7FBD6B26B3B7: mov -$08[r14],r15
    7FBD6B26B3BB: sub r14,$08
    $7FBD6B6A06C0 dup 1->2
    7FBD6B26B3BF: mov r15,r13
    $7FBD6B6A06C8 @ 2->2
    7FBD6B26B3C2: mov r15,[r15]
    $7FBD6B6A06D0 rot 2->3
    7FBD6B26B3C5: mov r9,$08[r10]
    7FBD6B26B3C9: add r10,$08
    $7FBD6B6A06D8 ! 3->1
    7FBD6B26B3CD: mov [r9],r15
    $7FBD6B6A06E0 r> 1->2
    7FBD6B26B3D0: mov r15,[r14]
    7FBD6B26B3D3: add r14,$08
    $7FBD6B6A06E8 swap 2->3
    7FBD6B26B3D7: add r10,$08
    7FBD6B26B3DB: mov r9,r13
    7FBD6B26B3DE: mov r13,[r10]
    $7FBD6B6A06F0 ! 3->1
    7FBD6B26B3E1: mov [r9],r15
    $7FBD6B6A06F8 ;s 1->1
    7FBD6B26B3E4: mov rbx,[r14]
    7FBD6B26B3E7: add r14,$08
    7FBD6B26B3EB: mov rax,[rbx]
    7FBD6B26B3EE: jmp eax

    The difference looks bigger than it is: There are lines for 4
    additional primitives (no influence on performance) and 2 additional instructions, resulting in a 6-line difference.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Wed Feb 26 11:44:15 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;

    looks a little simpler.

    The effort of implementing special native words for this, though, is
    probably better spent on locals.

    : ex {: x y -- :} x @ y @ x ! y ! ;
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From minforth@minforth@gmx.net (minforth) to comp.lang.forth on Wed Feb 26 21:02:52 2025
    From Newsgroup: comp.lang.forth

    Really? ;-)

    NT/FORTH (C) 2005 Peter Fälth Version 1.6-983-824 Compiled on
    2017-12-03
    Running on Windows NT 6.2 Build 9200
    Current directory is e:\Develop\Forth\lxf
    : ex {: x y -- :} x @ y @ x ! y ! ; ok
    see ex
    A49E58 40917C 23 C80000 5 normal EX

    40917C 8B4500 mov eax , [ebp]
    40917F 8B00 mov eax , [eax]
    409181 8BCB mov ecx , ebx
    409183 8B09 mov ecx , [ecx]
    409185 8B5500 mov edx , [ebp]
    409188 890A mov [edx] , ecx
    40918A 8903 mov [ebx] , eax
    40918C 8B5D04 mov ebx , [ebp+4h]
    40918F 8D6D08 lea ebp , [ebp+8h]
    409192 C3 ret near
    ok
    : :=: OVER @ >R DUP @ ROT ! R> SWAP ! ; ok
    see :=:
    A49E6C 409193 23 C80000 5 normal :=:

    409193 8B4500 mov eax , [ebp]
    409196 8B00 mov eax , [eax]
    409198 8BCB mov ecx , ebx
    40919A 8B09 mov ecx , [ecx]
    40919C 8B5500 mov edx , [ebp]
    40919F 890A mov [edx] , ecx
    4091A1 8903 mov [ebx] , eax
    4091A3 8B5D04 mov ebx , [ebp+4h]
    4091A6 8D6D08 lea ebp , [ebp+8h]
    4091A9 C3 ret near
    ok
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Feb 26 22:23:51 2025
    From Newsgroup: comp.lang.forth

    On 26-02-2025 12:29, LIT wrote:
    I didn't create BASIC interpreters, but I'm afraid
    just the rather trivial programs can "live" without
    handful of variables.
    I won't call a BASIC interpreter trivial.

    For example: how do you create
    even modest (screen-oriented) editor without adding
    several variables that reflect its state - where the
    cursor is at the moment, what's the filename in use,
    how are values of user's settings/preferences - all
    that - etc. etc.?

    Like I said - arrays are a different thing. But my repository holds a
    screen editor. It has two variables. Which screen and which position in
    that screen. Your trivial example holds three. One for parameter one,
    one for parameter two and one for the result.

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Wed Feb 26 21:48:49 2025
    From Newsgroup: comp.lang.forth

    For example: how do you create
    even modest (screen-oriented) editor without adding
    several variables that reflect its state - where the
    cursor is at the moment, what's the filename in use,
    how are values of user's settings/preferences - all
    that - etc. etc.?

    Like I said - arrays are a different thing.

    Oh, so it's simpler way than anyone could guess:
    "just use different term, avoid the word 'variable' ".
    Done. :)

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From dxf@dxforth@gmail.com to comp.lang.forth on Thu Feb 27 10:53:40 2025
    From Newsgroup: comp.lang.forth

    On 27/02/2025 3:18 am, LIT wrote:
    These words, as I already wrote, were just examples to illustrate the
    approach, which isn't limited to operations commonly associated
    to do sorting kind of work.

    I created also ROR/ROL words, that have nothing to do with any
    sorting processes:

    I mean, that's the whole thing with Forth - you *can* define any words
    you like, based on your needs, extending the basic set of operations
    into a whole domain-specific language suited to the problem you're
    trying to solve. But that's not in itself a strong argument for adding
    XYZ to the "standard" dictionary.*

    If you could, please, remind me when and where I was
    proposing to add these XYZs to standard dictionary?

    Thanks in advance!

    Then you have an existing application that demonstrates the benefit after having examined and ruled out other ways of optimizing the code?

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Feb 27 07:29:44 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;

    looks a little simpler.

    This inspires another one:

    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    With some other versions this results in the following benchmark
    program:

    [defined] !@ [if]
    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;
    [then]

    \ Paul Rubin <875xkwo5io.fsf@nightsong.com>
    : ex ( addr1 addr2 -- )
    2>r 2r@ @ swap @ r> ! r> ! ;

    : ex-locals {: x y -- :} x @ y @ x ! y ! ;

    \ Anton Ertl
    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    \ Marcel Hendrix
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    variable v1
    variable v2

    1 v1 !
    2 v2 !

    : bench ( "name" -- )
    v1 v2
    :noname ]] 100000000 0 do 2dup [[ parse-name evaluate ]] loop ; [[
    execute ;

    Results (on Zen4):

    gforth-fast (development):
    :=:            exchange       ex             ex-locals      exchange2
    814_881_277    879_389_133    928_825_521    875_574_895    808_543_975    cycles
    3_908_874_164  3_708_891_336  4_508_966_770  4_209_778_557  3_708_865_505  instructions

    vfx64 5.43:
    :=:            ex             ex-locals      exchange2
    335_298_202    432_614_804    928_542_678    336_134_513    cycles
    1_166_400_242  1_366_264_943  2_866_547_067  1_166_280_641  instructions

    And here's the code produced by gforth-fast:

    :=: ex ex-locals exchange2
    over 1->2 2>r 1->0 l 1->1 dup >r 1->1
    mov r15,$08[r10] add r10,$08 mov rax,rbp >r 1->1
    @ 2->2 mov r15,r13 add r10,$08 mov -$8[r14],r13
    mov r15,[r15] mov r13,[r10] lea rbp,-$8[rbp] sub r14,$08
    r 2->1 mov -$8[r14],r13 mov -$8[rax],r13 @ 1->1
    mov -$8[r14],r15 sub r14,$10 mov r13,[r10] mov r13,$00[r13]
    sub r14,$08 mov [r14],r15 >l @local0 1->1 over 1->2
    dup 1->2 2r@ 0->2 @local0 1->1 mov r15,$08[r10]
    mov r15,r13 mov r13,$08[r14] mov rax,rbp @ 2->2
    @ 2->2 mov r15,[r14] lea rbp,-$8[rbp] mov r15,[r15]
    mov r15,[r15] @ 2->2 mov -$8[rax],r13 r> 2->3
    rot 2->3 mov r15,[r15] @ 1->1 mov r9,[r14]
    mov r9,$08[r10] swap 2->2 mov r13,$00[r13] add r14,$08
    add r10,$08 mov rax,r13 @local1 1->2 ! 3->1
    ! 3->1 mov r13,r15 mov r15,$08[rbp] mov [r9],r15
    mov [r9],r15 mov r15,rax @ 2->2 swap 1->2
    1->2 @ 2->2 mov r15,[r15] mov r15,$08[r10]
    mov r15,[r14] mov r15,[r15] @local0 2->3 add r10,$08
    add r14,$08 r> 2->3 mov r9,$00[rbp] ! 2->0
    swap 2->3 mov r9,[r14] ! 3->1 mov [r15],r13
    add r10,$08 add r14,$08 mov [r9],r15 ;s 0->1
    mov r9,r13 ! 3->1 @local1 1->2 mov r13,$08[r10]
    mov r13,[r10] mov [r9],r15 mov r15,$08[rbp] add r10,$08
    ! 3->1 r> 1->2 ! 2->0 mov rbx,[r14]
    mov [r9],r15 mov r15,[r14] mov [r15],r13 add r14,$08
    ;s 1->1 add r14,$08 lp+2 0->1 mov rax,[rbx]
    mov rbx,[r14] ! 2->0 mov r13,$08[r10] jmp eax
    add r14,$08 mov [r15],r13 add r10,$08
    mov rax,[rbx] ;s 0->1 add rbp,$10
    jmp eax mov r13,$08[r10] ;s 1->1
    add r10,$08 mov rbx,[r14]
    mov rbx,[r14] add r14,$08
    add r14,$08 mov rax,[rbx]
    mov rax,[rbx] jmp eax
    jmp eax

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Feb 27 09:28:15 2025
    From Newsgroup: comp.lang.forth

    These words, as I already wrote, were just examples to illustrate the
    approach, which isn't limited to operations commonly associated
    to do sorting kind of work.

    I created also ROR/ROL words, that have nothing to do with any
    sorting processes:

    I mean, that's the whole thing with Forth - you *can* define any words
    you like, based on your needs, extending the basic set of operations
    into a whole domain-specific language suited to the problem you're
    trying to solve. But that's not in itself a strong argument for adding
    XYZ to the "standard" dictionary.*

    If you could, please, remind me when and where I was
    proposing to add these XYZs to standard dictionary?

    Thanks in advance!

    Then you have an existing application that demonstrates the benefit
    after
    having examined and ruled out other ways of optimizing the code?

    You expected me to "have an existing application..." etc. etc.
    immediately after I came up with this idea? You mean: within
    hours, literally?

    I'd like to create one - unfortunately, I'm busy with other
    things.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Feb 27 12:51:12 2025
    From Newsgroup: comp.lang.forth

    On 26-02-2025 22:48, LIT wrote:
    Oh, so it's simpler way than anyone could guess:
    "just use different term, avoid the word 'variable' ".
    Done. :)

    Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
    for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

    You know, when Leo Brodie said "the stack is not an array" it was kind
    of a tautology. As if saying "a cactus is not a cow". An array (yes,
    strings can be considered to be special arrays) is accessed randomly,
    and hence has a time complexity of O(1). A stack is accessed
    sequentially and hence has a time complexity of O(n).

    Now Chuck has made that stack a bit more accessible by providing stack
    operators, but those go three elements deep (usually). Anyway, it is
    impossible to represent an array on a stack, since a stack cannot be
    accessed randomly.

    The only way you can have anything array-related on the stack is its
    address or the contents of a single element. However, a stack is
    perfectly capable of replacing (local) variables, since their values
    can reside on the stack. It does require some skill, though, to manage
    that stack properly, agreed.

    And sure - it can be handy to have a fast word to exchange the values of
    two array elements. But that is an entirely different question from
    having two variables, do some arithmetic on their values and store the
    result in another variable.
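
    For the array-element case, a sketch built on mhx's :=: body (the
    array a and the indexing word a[] are invented here for illustration):

    ```forth
    create a 8 cells allot                \ example array of 8 cells
    : a[] ( i -- addr )  cells a + ;      \ address of element i
    : xchg-elems ( i j -- )               \ exchange the values of a[i] and a[j]
      a[] swap a[]  over @ >r dup @ rot ! r> swap ! ;
    ```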

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Feb 27 12:58:56 2025
    From Newsgroup: comp.lang.forth

    On 27-02-2025 10:28, LIT wrote:
    You expected me to "have an existing application..." etc. etc.
    immediately after I came up with this idea? You mean: within
    hours range, literally?

    I'd like to create one - unfortunately, I'm busy with others
    things.

    Of course. Any normal human being would only look for solutions when
    having a problem. You don't look for (C-like) solutions when there is no problem to solve.

    There are plenty of areas I never covered by inventing a hammer, because
    I never had a nail to hit. That's why many of my most-beloved libraries
    are related to my professional work. I had a problem, I fixed it and now
    I can reuse it.

    A lot of the libraries I wrote "just for fun" remain unused for exactly
    that reason - I obviously never really needed them to begin with.

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Feb 27 12:21:28 2025
    From Newsgroup: comp.lang.forth

    A lot of the libraries I wrote "just for fun" remain unused for exactly
    that reason - I obviously never really needed them to begin with.

    My heart goes out to you.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Feb 27 12:19:16 2025
    From Newsgroup: comp.lang.forth

Oh, so it's a simpler way than anyone could guess:
"just use a different term, avoid the word 'variable' ".
    Done. :)

    Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
    for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

„In computer science, an array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Feb 27 15:11:01 2025
    From Newsgroup: comp.lang.forth

    On 27-02-2025 13:19, LIT wrote:
Oh, so it's a simpler way than anyone could guess:
"just use a different term, avoid the word 'variable' ".
    Done. :)

    Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
    for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

„In computer science, an array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    Interesting.. In your class they taught computer science by Wikipedia?
    Didn't they have money for real books? Must have been a real poor city college..

    Hans Bezemer


    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Feb 27 14:20:28 2025
    From Newsgroup: comp.lang.forth

Oh, so it's a simpler way than anyone could guess:
"just use a different term, avoid the word 'variable' ".
    Done. :)

    Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
    for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

„In computer science, an array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    Interesting.. In your class they taught computer science by Wikipedia?
    Didn't they have money for real books? Must have been a real poor city college..

At least in that college they didn't teach that
    „Forth uses FIFO stack” -- as they taught you in
    your really rich city college. :]

    Anything wrong with the quoted definition?

    --
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Thu Feb 27 17:47:08 2025
    From Newsgroup: comp.lang.forth

    An even weirder result for TEST3, although it probably has more to do
    with my aging DO LOOP construct.

    -marcel

    ---
    ANEW -oos

    VARIABLE V1 7 V1 !
    VARIABLE V2 8 V2 !
    VARIABLE V3 9 V3 !

    : :=: ( a b -- ) \ exchange values among two variables
    PARAMS| a b | a @ b @ swap b ! a ! ;

    : ++ ( a -- ) \ increment variable by one
    1 SWAP +! ;

    : +> ( a b c -- ) \ add two variables then store result into third one
    PARAMS| a b c | a @ b @ + c ! ;

    : t1a V1 @ V2 @ + V3 ! ; : t1b V1 V2 V3 +> ;
    : t2a 1 V1 +! ; : t2b V1 ++ ;
    : t3a V1 @ V3 ! V2 @ V1 ! V3 @ V2 ! ; : t3b V1 V2 :=: ;

    : TESTS
    CR ." \ TEST1 : " TIMER-RESET #1000000000 0 DO t1a t1a t1a t1a t1a
    t1a t1a t1a t1a t1a LOOP .ELAPSED
    3 SPACES TIMER-RESET #1000000000 0 DO t1b t1b t1b t1b t1b
    t1b t1b t1b t1b t1b LOOP .ELAPSED
    CR ." \ TEST2 : " TIMER-RESET #1000000000 0 DO t2a t2a t2a t2a t2a
    t2a t2a t2a t2a t2a LOOP .ELAPSED
    3 SPACES TIMER-RESET #1000000000 0 DO t2b t2b t2b t2b t2b
    t2b t2b t2b t2b t2b LOOP .ELAPSED
    CR ." \ TEST3 : " TIMER-RESET #1000000000 0 DO t3a t3a t3a t3a t3a
    t3a t3a t3a t3a t3a LOOP .ELAPSED
    3 SPACES TIMER-RESET #1000000000 0 DO t3b t3b t3b t3b t3b
    t3b t3b t3b t3b t3b LOOP .ELAPSED ;

    \ old version
    \ TEST1 : 1.646 seconds elapsed. 1.661 seconds elapsed.
    \ TEST2 : 1.778 seconds elapsed. 1.728 seconds elapsed.
    \ TEST3 : 2.194 seconds elapsed. 1.645 seconds elapsed. ok

    \ new version (above, note: 10 Giga executions)
    \ TEST1 : 1.959 seconds elapsed. 1.958 seconds elapsed.
    \ TEST2 : 1.826 seconds elapsed. 1.827 seconds elapsed.
    \ TEST3 : 18.711 seconds elapsed. 3.849 seconds elapsed. ok
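For readers tracing what the TEST3 pair actually does: t3a exchanges V1 and V2 through V3 as scratch, while t3b exchanges them directly and leaves V3 untouched. A quick Python model (illustrative only; the names mirror the Forth above, the dict stands in for the variable cells) confirms both have the same net effect on V1 and V2:

```python
# Python model of the TEST3 semantics above (illustrative, not part of the benchmark).
# Variables are cells in a dict; t3a and t3b should agree on V1/V2.

def t3a(v):
    # : t3a  V1 @ V3 !  V2 @ V1 !  V3 @ V2 ! ;  -- exchange via V3 as scratch
    v['V3'] = v['V1']
    v['V1'] = v['V2']
    v['V2'] = v['V3']

def t3b(v):
    # : t3b  V1 V2 :=: ;  -- direct exchange, V3 untouched
    v['V1'], v['V2'] = v['V2'], v['V1']

cells_a = {'V1': 7, 'V2': 8, 'V3': 9}
cells_b = {'V1': 7, 'V2': 8, 'V3': 9}
t3a(cells_a)
t3b(cells_b)
# Both leave V1=8 and V2=7; only t3a clobbers V3.
```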
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Thu Feb 27 12:23:43 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Results (on Zen4):
    gforth-fast (development): ...

    It's interesting how little difference there is with gforth-fast. Could
    you also do gforth-itc? exchange2 is a big win with VFX, suggesting its optimizer could do better with some of the other versions.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Gerry Jackson@do-not-use@swldwa.uk to comp.lang.forth on Thu Feb 27 22:05:09 2025
    From Newsgroup: comp.lang.forth

    On 27/02/2025 07:29, Anton Ertl wrote:
    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    : ex ( a1 a2 -- ) 2>r 2r@ @ swap @ r> ! r> ! ;

    looks a little simpler.

    This inspires another one:

    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    With some other versions this results in the following benchmark
    program:

    [defined] !@ [if]
    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;
    [then]

    \ Paul Rubin <875xkwo5io.fsf@nightsong.com>
    : ex ( addr1 addr2 -- )
    2>r 2r@ @ swap @ r> ! r> ! ;

    : ex-locals {: x y -- :} x @ y @ x ! y ! ;

    \ Anton Ertl
    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    \ Marcel Hendrix
    : :=: ( addr1 addr2 -- )
    OVER @ >R DUP @ ROT ! R> SWAP ! ;

    variable v1
    variable v2

    1 v1 !
    2 v2 !

    : bench ( "name" -- )
    v1 v2
    :noname ]] 100000000 0 do 2dup [[ parse-name evaluate ]] loop ; [[
    execute ;
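The net effect of the variants above can be cross-checked with a tiny Forth-like stack machine in Python (an illustrative model, not part of the benchmark; cell addresses are modeled as dict keys). Each word should swap the contents of the two cells whose addresses are passed on the stack, leaving both stacks empty:

```python
# Tiny Forth-like stack machine (illustrative only), used to check that
# the exchange variants all have the same net effect.

class VM:
    def __init__(self, mem):
        self.ds, self.rs, self.mem = [], [], mem   # data stack, return stack, cells
    def push(self, x): self.ds.append(x)
    def pop(self): return self.ds.pop()
    # stack-shuffling and memory primitives
    def dup(self): self.push(self.ds[-1])
    def over(self): self.push(self.ds[-2])
    def swap(self): self.ds[-1], self.ds[-2] = self.ds[-2], self.ds[-1]
    def rot(self): self.push(self.ds.pop(-3))                  # ( a b c -- b c a )
    def fetch(self): self.push(self.mem[self.pop()])           # @
    def store(self): a = self.pop(); self.mem[a] = self.pop()  # !
    def to_r(self): self.rs.append(self.pop())                 # >R
    def r_from(self): self.push(self.rs.pop())                 # R>
    def two_to_r(self): x2 = self.pop(); x1 = self.pop(); self.rs += [x1, x2]  # 2>R
    def two_r_fetch(self): self.push(self.rs[-2]); self.push(self.rs[-1])      # 2R@

def exchange2(vm):   # : exchange2  dup >r @ over @ r> ! swap ! ;
    vm.dup(); vm.to_r(); vm.fetch(); vm.over(); vm.fetch()
    vm.r_from(); vm.store(); vm.swap(); vm.store()

def colon_exch(vm):  # : :=:  OVER @ >R DUP @ ROT ! R> SWAP ! ;
    vm.over(); vm.fetch(); vm.to_r(); vm.dup(); vm.fetch()
    vm.rot(); vm.store(); vm.r_from(); vm.swap(); vm.store()

def ex(vm):          # : ex  2>r 2r@ @ swap @ r> ! r> ! ;
    vm.two_to_r(); vm.two_r_fetch(); vm.fetch(); vm.swap(); vm.fetch()
    vm.r_from(); vm.store(); vm.r_from(); vm.store()

def run(word):
    vm = VM({'addr1': 1, 'addr2': 2})
    vm.push('addr1'); vm.push('addr2')
    word(vm)
    return vm.mem, vm.ds, vm.rs
```

All three runs end with the cell contents swapped and both stacks empty, which is what the benchmark relies on.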

    Results (on Zen4):

    gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.

    vfx64 5.43:
    :=: ex ex-locals exchange2
335_298_202 432_614_804 928_542_678 336_134_513 cyc.
1_166_400_242 1_366_264_943 2_866_547_067 1_166_280_641 inst.

    And here's the code produced by gforth-fast:

    :=: ex ex-locals exchange2
    over 1->2 2>r 1->0 l 1->1 dup >r 1->1
    mov r15,$08[r10] add r10,$08 mov rax,rbp >r 1->1
    @ 2->2 mov r15,r13 add r10,$08 mov -$8[r14],r13
    mov r15,[r15] mov r13,[r10] lea rbp,-$8[rbp] sub r14,$08
    r 2->1 mov -$8[r14],r13 mov -$8[rax],r13 @ 1->1
    mov -$8[r14],r15 sub r14,$10 mov r13,[r10] mov r13,$00[r13]
    sub r14,$08 mov [r14],r15 >l @local0 1->1 over 1->2
    dup 1->2 2r@ 0->2 @local0 1->1 mov r15,$08[r10]
    mov r15,r13 mov r13,$08[r14] mov rax,rbp @ 2->2
    @ 2->2 mov r15,[r14] lea rbp,-$8[rbp] mov r15,[r15]
    mov r15,[r15] @ 2->2 mov -$8[rax],r13 r> 2->3
    rot 2->3 mov r15,[r15] @ 1->1 mov r9,[r14]
    mov r9,$08[r10] swap 2->2 mov r13,$00[r13] add r14,$08
    add r10,$08 mov rax,r13 @local1 1->2 ! 3->1
    ! 3->1 mov r13,r15 mov r15,$08[rbp] mov [r9],r15
    mov [r9],r15 mov r15,rax @ 2->2 swap 1->2
    1->2 @ 2->2 mov r15,[r15] mov r15,$08[r10]
    mov r15,[r14] mov r15,[r15] @local0 2->3 add r10,$08
    add r14,$08 r> 2->3 mov r9,$00[rbp] ! 2->0
    swap 2->3 mov r9,[r14] ! 3->1 mov [r15],r13
    add r10,$08 add r14,$08 mov [r9],r15 ;s 0->1
    mov r9,r13 ! 3->1 @local1 1->2 mov r13,$08[r10]
    mov r13,[r10] mov [r9],r15 mov r15,$08[rbp] add r10,$08
    ! 3->1 r> 1->2 ! 2->0 mov rbx,[r14]
    mov [r9],r15 mov r15,[r14] mov [r15],r13 add r14,$08
    ;s 1->1 add r14,$08 lp+2 0->1 mov rax,[rbx]
    mov rbx,[r14] ! 2->0 mov r13,$08[r10] jmp eax
    add r14,$08 mov [r15],r13 add r10,$08
    mov rax,[rbx] ;s 0->1 add rbp,$10
    jmp eax mov r13,$08[r10] ;s 1->1
    add r10,$08 mov rbx,[r14]
    mov rbx,[r14] add r14,$08
    add r14,$08 mov rax,[rbx]
    mov rax,[rbx] jmp eax
    jmp eax


    How does a crude definition not involving the R stack compare:
    : ex3 over @ over @ 3 pick ! over ! 2drop ;
    --
    Gerry
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Feb 27 22:03:55 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Results (on Zen4):
    gforth-fast (development): ...

    It's interesting how little difference there is with gforth-fast. Could
    you also do gforth-itc?

    gforth-itc (development):
    :=: exchange ex ex-locals exchange2
7_527_256_553 5_224_615_325 6_825_283_178 9_238_357_501 7_036_128_309 cyc.
13_127_503_990 9_326_561_471 12_927_054_153 16_927_820_825 12_027_146_677 inst.

    For comparison: gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.

    exchange2 is a big win with VFX, suggesting its
    optimizer could do better with some of the other versions.

    On VFX exchange2 takes the same speed and the same number of
    instructions as :=:. EX is slower because VFX does not analyse the
    return stack, unlike the data stack. EX-LOCALS is slow because VFX's
    locals implementation is not particularly good.

    To see what a better analysis can do, let's look at lxf:

    :=: ex ex-locals exchange2
    502_740_029 502_189_567 502_134_842 502_043_217 cycles
    1_701_663_782 1_701_657_866 1_701_677_273 1_701_684_186 instructions

    The cycles and instructions are worse (except for ex-locals) than with
    VFX, but that's due to inlining (which VFX does and lxf does not).

    E.g., here's lxf's code for EX-LOCALS:

    869204C 804FCE2 23 88C8000 5 normal EX-LOCALS

    804FCE2 8B4500 mov eax , [ebp]
    804FCE5 8B00 mov eax , [eax]
    804FCE7 8BCB mov ecx , ebx
    804FCE9 8B09 mov ecx , [ecx]
    804FCEB 8B5500 mov edx , [ebp]
    804FCEE 890A mov [edx] , ecx
    804FCF0 8903 mov [ebx] , eax
    804FCF2 8B5D04 mov ebx , [ebp+4h]
    804FCF5 8D6D08 lea ebp , [ebp+8h]
    804FCF8 C3 ret near

    It's the same code as lxf produces for :=:.

    The code lxf produces for EX and EXCHANGE2 is:

    804FCF9 8BC3 mov eax , ebx
    804FCFB 8B00 mov eax , [eax]
    804FCFD 8B4D00 mov ecx , [ebp]
    804FD00 8B09 mov ecx , [ecx]
    804FD02 890B mov [ebx] , ecx
    804FD04 8B5D00 mov ebx , [ebp]
    804FD07 8903 mov [ebx] , eax
    804FD09 8B5D04 mov ebx , [ebp+4h]
    804FD0C 8D6D08 lea ebp , [ebp+8h]
    804FD0F C3 ret near

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Feb 27 22:53:47 2025
    From Newsgroup: comp.lang.forth

    Gerry Jackson <do-not-use@swldwa.uk> writes:
    On 27/02/2025 07:29, Anton Ertl wrote:
    \ Anton Ertl
    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;
    ...
    Results (on Zen4):

    gforth-fast (development):
    :=: exchange ex ex-locals exchange2
814_881_277 879_389_133 928_825_521 875_574_895 808_543_975 cyc.
3_908_874_164 3_708_891_336 4_508_966_770 4_209_778_557 3_708_865_505 inst.
...
    How does a crude definition not involving the R stack compare:
    : ex3 over @ over @ 3 pick ! over ! 2drop ;

    exchange2 ex3
    dup >r 1->1 over 1->1
    r 1->1 mov [r10],r13
    mov -$08[r14],r13 sub r10,$08
    sub r14,$08 mov r13,$10[r10]
    @ 1->1 @ 1->1
    mov r13,$00[r13] mov r13,$00[r13]
    over 1->2 over 1->2
    mov r15,$08[r10] mov r15,$08[r10]
    @ 2->2 @ 2->2
    mov r15,[r15] mov r15,[r15]
    2->3 fourth 2->3
    mov r9,[r14] mov r9,$10[r10]
    add r14,$08 ! 3->1
    ! 3->1 mov [r9],r15
    mov [r9],r15 over 1->2
    swap 1->2 mov r15,$08[r10]
    mov r15,$08[r10] ! 2->0
    add r10,$08 mov [r15],r13
    ! 2->0 2drop 0->0
    mov [r15],r13 add r10,$10
    ;s 0->1 ;s 0->1
    mov r13,$08[r10] mov r13,$08[r10]
    add r10,$08 add r10,$08
    mov rbx,[r14] mov rbx,[r14]
    add r14,$08 add r14,$08
    mov rax,[rbx] mov rax,[rbx]
    jmp eax jmp eax

EX3 plays to Gforth's strengths: copying words (e.g., OVER) instead of
shuffling words (e.g., SWAP), and removing superfluous stuff with 2DROP.

It also plays to VFX's strengths: being analytic about the data stack.
EXCHANGE2 was the fastest version (together with :=:) before; here's
    that compared to EX3:

    exchange2 ex3
    334_718_398 273_592_214 cycles
    1_167_276_392 967_258_380 instructions

    EXCHANGE2 EX3
    PUSH RBX MOV RDX, [RBP]
    MOV RDX, [RBP] MOV RDX, 0 [RDX]
    MOV RDX, 0 [RDX] MOV RCX, 0 [RBX]
    POP RCX MOV RAX, [RBP]
    MOV RBX, 0 [RBX] MOV 0 [RAX], RCX
    MOV 0 [RCX], RDX MOV 0 [RBX], RDX
    MOV RDX, [RBP] MOV RBX, [RBP+08]
    MOV 0 [RDX], RBX LEA RBP, [RBP+10]
    MOV RBX, [RBP+08] RET/NEXT
    LEA RBP, [RBP+10] ( 29 bytes, 9 instructions )
    RET/NEXT
    ( 31 bytes, 11 instructions )

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Fri Feb 28 15:03:47 2025
    From Newsgroup: comp.lang.forth

    On 27-02-2025 15:20, LIT wrote:
Oh, so it's a simpler way than anyone could guess:
"just use a different term, avoid the word 'variable' ".
    Done. :)

Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
for my money back. You know - it's not a different term - it's a
    different concept, with quite different characteristics.

„In computer science, an array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    Interesting.. In your class they taught computer science by Wikipedia?
    Didn't they have money for real books? Must have been a real poor city
    college..

At least in that college they didn't teach that
    „Forth uses FIFO stack” -- as they taught you in
    your really rich city college. :]

    I don't think I ever did that in any publication, but even if I did -
    people get confused when calling bit 0 "bit 1" because it represents
    "1". They get confused choosing the wrong side when they talk about "big endian". They get confused when classifying the 8088. They go left when
    their instructor calls "right".

It's like a spelling error. Only petty people try to use that as a
counter-argument. It's a different kind of error compared to proposing
"stackless operations" in a stack-based language. It's like asking why a
Ferrari can't pour a concrete floor.

    Anything wrong with the quoted definition?

    Yes. You couldn't produce one. You had to look it up. Something as basic
    a concept as "array".

    Hans Bezemer
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri Feb 28 21:55:05 2025
    From Newsgroup: comp.lang.forth

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Another variant:

    : exchange ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    This uses the primitive

    '!@' ( u1 a-addr -- u2 ) gforth-experimental "store-fetch"
    load U2 from A_ADDR, and store U1 there, as atomic operation

    I worry that the atomic part will result in it being slower than the
    versions that do not use !@.

It's barely noticeable on Zen4, but it makes a big difference on the
Cortex-A55. Therefore we decided to also have a nonatomic !@. We
renamed the atomic one to ATOMIC!@, and !@ is now the nonatomic version.
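The distinction can be sketched outside Forth. In Python terms (a sketch with hypothetical names, not Gforth's implementation), !@ is a plain load-then-store of the same cell, while ATOMIC!@ makes that load/store pair indivisible, e.g. under a lock:

```python
import threading

# Sketch of the store-fetch semantics discussed above (names hypothetical).
# store_fetch models !@ ( u1 a-addr -- u2 ): load the old value, store the new.
# atomic_store_fetch models ATOMIC!@ : the same exchange, but indivisible,
# so no other thread can slip in between the load and the store.

mem = {'a': 7}
mem_lock = threading.Lock()

def store_fetch(addr, new):          # nonatomic !@
    old = mem[addr]
    mem[addr] = new
    return old

def atomic_store_fetch(addr, new):   # ATOMIC!@ : load/store under a lock
    with mem_lock:
        old = mem[addr]
        mem[addr] = new
        return old
```

The lock is the sketch's stand-in for whatever atomic exchange instruction the CPU provides; the Cortex-A55 numbers below suggest that instruction is what costs the extra cycles.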

    How do they perform?

    On Zen4:
    !@ atomic!@
    821_538_216 880_459_702 cycles
    3_815_202_629 3_710_937_849 instructions

    On Cortex-A55:
    !@ atomic!@
    3355427045 5856496676 cycles
    3115589778 4318749543 instructions

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri Feb 28 14:45:14 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often? We've done without it all this time.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 1 07:32:09 2025
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often?

    Some numbers of uses in the Gforth image:

    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.
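The claimed replacement can be checked mechanically. A minimal Python stack model (illustrative; not how Gforth implements its primitives) shows that DUP @ >R ! R> leaves the same stacks and memory as a direct !@ :

```python
# Check that DUP @ >R ! R>  has the stack effect of !@ ( u1 a-addr -- u2 ).
# ds = data stack, rs = return stack, mem = cells by address.

def dup_fetch_seq(ds, rs, mem):
    ds.append(ds[-1])                  # DUP
    ds.append(mem[ds.pop()])           # @
    rs.append(ds.pop())                # >R
    a = ds.pop(); mem[a] = ds.pop()    # !
    ds.append(rs.pop())                # R>

def store_fetch(ds, rs, mem):          # !@ done directly
    a = ds.pop(); u1 = ds.pop()
    ds.append(mem[a]); mem[a] = u1

ds1, rs1, mem1 = [99, 'x'], [], {'x': 42}
ds2, rs2, mem2 = [99, 'x'], [], {'x': 42}
dup_fetch_seq(ds1, rs1, mem1)
store_fetch(ds2, rs2, mem2)
# Both end with the old value 42 on the data stack and 99 stored at 'x'.
```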

    I have now added stack-state variants for !@, resulting in better
    performance in some cases. Is !@ used often enough to merit the extra
    build time of Gforth? That's not clear, but the benefit I see is that
    I want to provide a system where the programmer does not have to
    wonder whether he should avoid !@ for better performance.

    I also tried out another variant that uses !@:

    : exchange4 ( addr1 addr2 -- )
    dup @ rot !@ swap ! ;

    The resulting code for EXCHANGE, EXCHANGE4, and EXCHANGE2 (the latter
    without !@):

    see-code exchange see-code exchange4 see-code exchange2
    over 1->2 dup 1->2 dup >r 1->1
    mov r15,$08[r12] mov r15,r8 >r 1->1
    @ 2->2 @ 2->2 mov -$08[r13],r8
    mov r15,[r15] mov r15,[r15] sub r13,$08
    swap 2->3 rot 2->3 @ 1->1
    add r12,$08 mov r9,$08[r12] mov r8,[r8]
    mov r9,r8 add r12,$08 over 1->2
    mov r8,[r12] !@ 3->2 mov r15,$08[r12]
    !@ 3->2 mov rax,r15 @ 2->2
    mov rax,r15 mov r15,[r9] mov r15,[r15]
    mov r15,[r9] mov [r9],rax r> 2->3
    mov [r9],rax swap 2->3 mov r9,$00[r13]
    swap 2->3 add r12,$08 add r13,$08
    add r12,$08 mov r9,r8 ! 3->1
    mov r9,r8 mov r8,[r12] mov [r9],r15
    mov r8,[r12] ! 3->1 swap 1->2
    ! 3->1 mov [r9],r15 mov r15,$08[r12]
    mov [r9],r15 ;s 1->1 add r12,$08
    ;s 1->1 mov rbx,$00[r13] ! 2->0
    mov rbx,$00[r13] add r13,$08 mov [r15],r8
    add r13,$08 mov rax,[rbx] ;s 0->1
    mov rax,[rbx] jmp eax mov r8,$08[r12]
    jmp eax add r12,$08
    mov rbx,$00[r13]
    add r13,$08
    mov rax,[rbx]
    jmp eax

EXCHANGE performs 1 instruction fewer than EXCHANGE2, and EXCHANGE4
performs 2 instructions fewer than EXCHANGE2; both contain three fewer
primitives.

    Performance on Zen4:
    exchange exchange4 exchange2
    748_033_428 699_870_875 809_204_577 cycles
    3_610_871_416 3_510_578_833 3_710_662_751 instructions

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 1 08:18:06 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    see-code exchange see-code exchange4 see-code exchange2
    over 1->2 dup 1->2 dup >r 1->1
    mov r15,$08[r12] mov r15,r8 >r 1->1
    @ 2->2 @ 2->2 mov -$08[r13],r8
    mov r15,[r15] mov r15,[r15] sub r13,$08
    swap 2->3 rot 2->3 @ 1->1
    add r12,$08 mov r9,$08[r12] mov r8,[r8]
    mov r9,r8 add r12,$08 over 1->2
    mov r8,[r12] !@ 3->2 mov r15,$08[r12]
    !@ 3->2 mov rax,r15 @ 2->2
    mov rax,r15 mov r15,[r9] mov r15,[r15]
    mov r15,[r9] mov [r9],rax r> 2->3
    mov [r9],rax swap 2->3 mov r9,$00[r13]
    swap 2->3 add r12,$08 add r13,$08
    add r12,$08 mov r9,r8 ! 3->1
    mov r9,r8 mov r8,[r12] mov [r9],r15
    mov r8,[r12] ! 3->1 swap 1->2
    ! 3->1 mov [r9],r15 mov r15,$08[r12]
    mov [r9],r15 ;s 1->1 add r12,$08
    ;s 1->1 mov rbx,$00[r13] ! 2->0
    mov rbx,$00[r13] add r13,$08 mov [r15],r8
    add r13,$08 mov rax,[rbx] ;s 0->1
    mov rax,[rbx] jmp eax mov r8,$08[r12]
    jmp eax add r12,$08
    mov rbx,$00[r13]
    add r13,$08
    mov rax,[rbx]
    jmp eax

    The difference between exchange and exchange4 shows how stack caching
    can have a hard-to-predict effect. Gforth searches for the shortest
    path through the available stack-cache states, where shortness is
    defined by the native-code length. E.g., it starts with state 1, and
    from there it can use any of the dup variants starting in state 1, or
    first transition to another state and use a dup variant starting from
    there.
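The search described here is essentially a shortest-path problem over (position in the code, stack-cache state) nodes. The toy sketch below illustrates the idea with invented byte costs and a flat transition cost; it is not Gforth's actual table or algorithm:

```python
import heapq

# Toy shortest-path selection of primitive variants (all costs invented).
# Nodes are (instructions compiled so far, stack-cache state); edges are
# either extra state-transition code or a variant of the next primitive.

VARIANTS = {   # primitive -> {(in_state, out_state): copied-code bytes}
    'swap': {(1, 1): 13, (2, 2): 9, (1, 2): 9, (2, 3): 11, (3, 2): 11},
    'rot':  {(1, 1): 23, (3, 3): 12, (2, 3): 9, (1, 3): 17},
}
TRANSITION = 5   # assumed flat cost of inserted state-transition code

def shortest(seq, start=1, end=1, states=(1, 2, 3)):
    """Minimal total code bytes to compile seq, entering in start, leaving in end."""
    INF = float('inf')
    dist = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        d, i, s = heapq.heappop(heap)
        if d > dist.get((i, s), INF):
            continue
        if i == len(seq) and s == end:
            return d
        for t in states:                      # insert transition code
            if t != s and d + TRANSITION < dist.get((i, t), INF):
                dist[(i, t)] = d + TRANSITION
                heapq.heappush(heap, (d + TRANSITION, i, t))
        if i < len(seq):                      # use a variant of the next primitive
            for (si, so), c in VARIANTS[seq[i]].items():
                if si == s and d + c < dist.get((i + 1, so), INF):
                    dist[(i + 1, so)] = d + c
                    heapq.heappush(heap, (d + c, i + 1, so))
    return None
```

With these made-up costs, a lone SWAP compiles as the 13-byte SWAP 1->1, but SWAP ROT is cheaper as SWAP 1->2 plus ROT 2->3 plus one transition back to state 1; which variant "wins" depends on the whole sequence, matching the hard-to-predict effects described above.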

    For SWAP and ROT gforth-fast has the following variants:

    primitive in-out # code bytes
    swap 1-1 132 len= 4+ 13+ 3
    swap 2-2 37 len= 4+ 9+ 3
    swap 3-3 4 len= 4+ 9+ 3
    swap 0-2 8 len= 4+ 14+ 3
    swap 1-2 82 len= 4+ 9+ 3
    swap 2-1 74 len= 4+ 8+ 3
    swap 2-3 30 len= 4+ 11+ 3
    swap 3-2 3 len= 4+ 11+ 3
    swap 2-0 20 len= 4+ 13+ 3
    rot 1-1 46 len= 4+ 23+ 3
    rot 3-3 6 len= 4+ 12+ 3
    rot 3-1 24 len= 4+ 13+ 3
    rot 2-3 15 len= 4+ 9+ 3
    rot 1-3 17 len= 4+ 17+ 3
    rot 0-3 1 len= 4+ 19+ 3

    You get these data (in a rawer form) with

    gforth-fast --print-prim -e bye |& grep ^swap
    gforth-fast --print-prim -e bye |& grep ^rot

    The in column is the stack-cache state on entering the word, the out
    column is the stack-cache state on leaving the word. The # column
    shows how many times this variant of the primitive is used (static
counts). The code-length column shows three parts, the middle of which
    is the part that's copied to dynamic superinstructions like the ones
    shown above, and this length is what is used in the search for the
    shortest path.

    In EXCHANGE4, the shortest variant of ROT is used: ROT 2->3; and the
    primitives of the other variants are selected to also result in the
    shortest overall code.

    In EXCHANGE and EXCHANGE4, SWAP 2->3 is not the shortest variant of
    SWAP, not even the shortest variant starting from state 2, but ending
    in state 3 allows to use cheap variants of further primitives such as
    !@ and !, resulting in the overall shortest code for this sequence.
    In EXCHANGE2, we see the selection of a shorter version of SWAP, but
    one of the costs is that ;s becomes longer (but in this case the
    overall savings from using a shorter version of SWAP and shorter
    versions of earlier instructions make up for that).

    Why am I looking at this? For stack-shuffling primitives like SWAP
    and ROT, it's not obvious which variant is how long and which variant
    should be selected.

These stack-shuffling words therefore are also good candidates for
performing stack-cache state transitions that would otherwise require
inserting extra transition code:

    E.g., EXCHANGE and its variants consume two stack items, but need to
    start in stack-cache state 1 and end in stack-cache state 1 (gforth is currently not smart enough to deal with other states at basic-block boundaries), so not everything can be done in the stack cache; the
    stack pointer needs to be increased by two cells, and there need to be
    accesses to the memory part of the stack for two stack items.

    In EXCHANGE, the adjustment by one cell and memory access for one
    stack item is done in the first SWAP 2->3, and another one in the
    second one. In EXCHANGE4, ROT 2->3 and SWAP 2->3 perform these tasks.
In EXCHANGE2, the SWAP 1->2 does it for one cell, and the stack-cache
state transition 0->1 in the first two instructions of ;s does it for
    the other cell (gforth-fast actually does not have a built-in variant
    ;S 0->1 and the code shown as ;S 0->1 by SEE-CODE is actually composed
    of a transition 0->1 and the ;S 1->1 variant).

    I wanted to know how often which variant of these stack-shuffling
    primitives is used, and how this relates to their length. One
    interesting result is that ROT 1->3 is used relatively frequently
    despite having relatively long code. Apparently the code that comes
    before these 17 instances of ROT benefits a lot from being in
stack-cache state 1, and this amortizes the longer code of ROT 1->3
compared to ROT 2->3.

    Another interesting result is the low usage of SWAP 3->2 compared to
    SWAP 2->3. This may say something about how SWAP is used in Forth
    programs. Or it may be an artifact of tie-breaking: If two paths have
    the same length, one is chosen rather arbitrarily, but consistently,
    and this may make one variant appear more useful than merited by the
    benefit that the existence of the variant has on code length.

    For those interested, here's the code for the various variants shown
    above:

    r12: stack pointer
r8: stack-cache register a (tos in state 1, 2nd in state 2, 3rd in state 3)
r15: stack-cache register b (tos in state 2, 2nd in state 3)
    r9: stack-cache register c (tos in state 3)

    swap 1-1
    559E7F769425: mov rax,$08[r12]
    559E7F76942A: mov $08[r12],r8
    559E7F76942F: mov r8,rax

    swap 2-2
    559E7F76E5B1: mov rax,r8
    559E7F76E5B4: mov r8,r15
    559E7F76E5B7: mov r15,rax

    swap 3-3
    559E7F76E5C3: mov rax,r15
    559E7F76E5C6: mov r15,r9
    559E7F76E5C9: mov r9,rax

    swap 0-2
    559E7F76E5D5: mov r15,$10[r12]
    559E7F76E5DA: mov r8,$08[r12]
    559E7F76E5DF: add r12,$10

    swap 1-2
    559E7F76E5EC: mov r15,$08[r12]
    559E7F76E5F1: add r12,$08

    swap 2-1
    559E7F76E5FE: mov [r12],r15
    559E7F76E602: sub r12,$08

    swap 2-3
    559E7F76E60F: add r12,$08
    559E7F76E613: mov r9,r8
    559E7F76E616: mov r8,[r12]

    swap 3-2
    559E7F76E623: mov [r12],r8
    559E7F76E627: mov r8,r9
    559E7F76E62A: sub r12,$08

    swap 2-0
    559E7F76E637: mov [r12],r15
    559E7F76E63B: sub r12,$10
    559E7F76E63F: mov $08[r12],r8

    rot 1-1
    559E7F76944C: mov rdx,$08[r12]
    559E7F769451: mov rax,$10[r12]
    559E7F769456: mov $08[r12],r8
    559E7F76945B: mov $10[r12],rdx
    559E7F769460: mov r8,rax

    rot 3-3
    559E7F76EDC0: mov rax,r8
    559E7F76EDC3: mov r8,r15
    559E7F76EDC6: mov r15,r9
    559E7F76EDC9: mov r9,rax

    rot 3-1
    559E7F76EDD5: mov [r12],r15
    559E7F76EDD9: sub r12,$10
    559E7F76EDDD: mov $08[r12],r9

    rot 2-3
    559E7F76EDEB: mov r9,$08[r12]
    559E7F76EDF0: add r12,$08

    rot 1-3
    559E7F76EDFD: mov r9,$10[r12]
    559E7F76EE02: mov r15,r8
    559E7F76EE05: add r12,$10
    559E7F76EE09: mov r8,-$08[r12]

    rot 0-3
    559E7F76EE17: mov r9,$18[r12]
    559E7F76EE1C: mov r8,$10[r12]
    559E7F76EE21: add r12,$18
    559E7F76EE25: mov r15,-$10[r12]

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 1 11:47:54 2025
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Paul Rubin <no.email@nospam.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    !@ is now the nonatomic version.

    Is the nonatomic one useful often?

    Some numbers of uses in the Gforth image:

    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.

    Another point: These 11 uses of non-atomic !@ used to be uses of the
    slow atomic !@. So even the slow atomic !@ was preferred by the
    programmer to doing it with @, ! and stack manipulation. In that
    situation the non-atomic !@ provides the wanted capability without
    incurring the atomic slowness.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sat Mar 1 16:20:24 2025
    From Newsgroup: comp.lang.forth

    On 01-03-2025 12:47, Anton Ertl wrote:
    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.

    I found the sequence exactly twice in my code - one an application
    program and one a library. I agree whole-heartedly that such a sequence
    may help a programmer to abstract such a pattern - I've added several of
    those myself.

However, if it is that rare there is no point in adding it. Creating too
many superfluous abstractions may even become counterproductive in the
sense that predefined abstractions are ignored and reinvented. So, this
    is not one I'd add immediately - but never say never. May be it will pop
    up in the future. Who knows.. ;-)

    Hans Bezemer

    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Mar 1 17:22:45 2025
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 01-03-2025 12:47, Anton Ertl wrote:
    11 !@
    3 atomic!@
    66 +!

    We've done without it all this time.

    Sure, you can replace it with DUP @ >R ! R>. Having a word for that
    relieves the programmer of producing such a sequence (possibly with a
    bug) and the reader of having to analyse what's going on here.

    I found the sequence exactly twice in my code

Yes, you can replace !@ with that sequence, but not every case where
one fetches one value from an address and stores another value to
that address is expressed by this sequence. E.g., another equivalent
sequence is: DUP >R @ SWAP R> !; and another: DUP @ -ROT !. And you
can also use the word profitably in cases where some other
functionality is mixed in with the code without !@. E.g., in none of
the variants of :=:/etc. without !@ in this thread did either of the
two sequences occur; in several of them the ! of the other address was
inserted before the ! of the address that was fetched the second time.
E.g.,

    : exchange2 ( addr1 addr2 -- )
    dup >r @ over @ r> ! swap ! ;

    Yet

    : exchange ( addr1 addr2 -- )
    over @ swap !@ swap ! ;

    is shorter, easier to follow, and (in gforth-fast) faster.
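For readers skimming the thread, the semantics of !@ as used above can be
sketched in standard Forth. This is only an illustration built from the
sequences quoted in this thread; in Gforth !@ is an actual primitive (with
a separate atomic variant), not this colon definition:

```
\ !@ ( x addr -- x' )
\ Store x at addr and return the value previously stored there.
\ Sketch only, using the equivalent sequence quoted in this thread.
: !@  ( x addr -- x' )  dup @ >r ! r> ;

\ Exchange the contents of two variables using !@ (Anton's EXCHANGE):
: exchange ( addr1 addr2 -- )  over @ swap !@ swap ! ;

variable a  variable b
1 a !  2 b !
a b exchange
a @ .  \ prints 2
b @ .  \ prints 1
```

Tracing EXCHANGE: OVER @ fetches the first variable's value, SWAP !@
stores it into the second variable while returning that variable's old
value, and the final SWAP ! stores the old value back into the first
variable.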

    As mentioned, Bernd Paysan used !@ 11 times in the Gforth image in
    code where atomicity is not needed. Up to yesterday we only had the
    atomic version and I have avoided using !@ because I was worried that
    it would be slow, so there may be some additional opportunity in the
    Gforth image for using it.

> However, if it is that rare, there is no point in adding it. Creating
> too many superfluous abstractions may even get counterproductive in the
> sense that predefined abstractions are ignored and reinvented.

    In that case they are obviously not superfluous. Yes, reinvention
    happens; it shows that the word is needed. Then at some point
    somebody notices the duplication, decides on a canonical version and
    goes through the code and replaces all uses of the duplicated words
    with the canonical version.

    There is a valid reason to avoid rarely used words that can be
    replaced by a sequence: human memory load. I don't think that !@ is
    such a case, though.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
  • From mhx@mhx@iae.nl (mhx) to comp.lang.forth on Sat Mar 1 21:35:16 2025
    From Newsgroup: comp.lang.forth

    I can't find `DUP @ >R ! R>` (+ variants with spacings)
    in any of 1667 files.
    However, `DUP @ >R` is found 12 times and `! R>` 29 times.

    `DUP @ -ROT !` gets hit 0 times, `DUP >R @ SWAP R> !` once.

    -marcel
  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Sun Mar 2 10:35:27 2025
    From Newsgroup: comp.lang.forth

    Oh, so it's simpler way than anyone could guess:
    "just use different term, avoid the word 'variable' ".
    Done. :)

Oh dear, I hope you don't have a formal education in CS. If so, I'd ask
for my money back. You know - it's not a different term - it's a
different concept, with quite different characteristics.

    „In computer science, array is a data type that represents
    a collection of elements (values or variables), each
    selected by one or more indices”

    Now feel free to go and ask for your money back.

    Interesting.. In your class they taught computer science by Wikipedia?
    Didn't they have money for real books? Must have been a real poor city
    college..

At least in that college they didn't teach that
„Forth uses FIFO stack” -- as they taught you in
your really rich city college. :]

I don't think I ever did that in any publication, but even if I did -
people get confused when calling bit 0 "bit 1" because it represents
"1". They get confused choosing the wrong side when they talk about
"big endian". They get confused when classifying the 8088. They go left
when their instructor calls "right".

It's like a spelling error. Only petty people try to use that as a
counterargument. It's a different kind of error compared to proposing
"stackless operations" in a stack-based language. It's like asking why
a Ferrari can't pour a concrete floor.

No, it WASN'T a "spelling error"; you stated that out loud
in your "educational" YT clip. Did you forget? Or are you
trying very hard to forget? No, it wasn't just replacing L
with F; you stated it word for word: "Forth uses FIFO stack
— first in, first out".

    I'm going to remind you:

    https://groups.google.com/g/comp.lang.forth/c/EWLqO2b26nM/m/B7gnoD7dAgAJ

    "> > It's here - https://www.youtube.com/watch?v=hpw__rmBisU
01:45 -- but you know: Forth's stack works on the rule „last in —
first out”, not „first in, first out”. Or am I wrong?
    No, you're not. I pulled it and I'm uploading an updated version.

    Hans Bezemer"

    And you try to present yourself as an authority after something
    like that?

    Anything wrong with the quoted definition?

    Yes. You couldn't produce one. You had to look it up. Something as basic
    a concept as "array".

    Mr. FIFO, don't you be ridiculous again... :]

    --
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sun Mar 2 11:39:38 2025
    From Newsgroup: comp.lang.forth

    On 01-03-2025 22:35, mhx wrote:
    I can't find `DUP @ >R ! R>` (+ variants with spacings)
    in any of 1667 files.
    However, `DUP @ >R` is found 12 times and `! R>` 29 times.

    `DUP @ -ROT !` gets hit 0 times, `DUP >R @ SWAP R> !` once.

    In 1073 files, `DUP @ -ROT !` and `DUP >R @ SWAP R> !` aren't found.
    The sequence `DUP @ >R` is found 10 times, `! R>` a whopping 65 times.

    Hans Bezemer



  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Mar 5 16:40:55 2025
    From Newsgroup: comp.lang.forth

    On 02-03-2025 11:35, LIT wrote:
    And you try to present yourself as an authority after something
    like that?

    "Mr. Twain - you made a spelling error. And you call yourself the
    greatest American writer of the 19th century?"

    I told you you were petty.. :)

    Mr. FIFO, don't you be ridiculous again... :]

    Well, until you can make a non-trivial Forth program without resorting
    to variables, I think I still have an edge on you where Forth is concerned!

    A significant edge, I might add..! ;-)

    Hans Bezemer

  • From zbigniew2011@zbigniew2011@gmail.com (LIT) to comp.lang.forth on Thu Mar 6 11:33:35 2025
    From Newsgroup: comp.lang.forth

    And you try to present yourself as an authority after something
    like that?

    "Mr. Twain - you made a spelling error. And you call yourself the
    greatest American writer of the 19th century?"

    I told you you were petty.. :)

    No, it WASN'T humble "spelling error"; YOU STATED
    THAT OUT LOUD, in a complete sentence. :]

BTW: comparing yourself to Twain? It seems you're
not just the greatest "computer scientist", if not in
the world then at least in this newsgroup, sure :D
— but also the most modest one... :)))

    --
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Thu Mar 6 13:36:53 2025
    From Newsgroup: comp.lang.forth

    On 06-03-2025 12:33, LIT wrote:
    No, it WASN'T humble "spelling error"; YOU STATED
    THAT OUT LOUD, in a complete sentence. :]

BTW: comparing yourself to Twain? It seems you're
not just the greatest "computer scientist", if not in
the world then at least in this newsgroup, sure :D
— but also the most modest one... :)))

    Oh dear - it actually was a typo - and I can prove it.

First, the audio file has a timestamp of 15:32. Now, you know it's not
hard to "touch" a file, so if you don't want to believe me - fair enough.
The error was reported at 18:36. You can check that for yourself. The
text is part of an animation, so I had to redo the entire animation. I
have to do the timing of the animation with the audio manually. Then I
have to render the video. That takes about ten minutes. The render was
complete at 18:56. Again, you don't have to believe me.

But I confirmed the reupload at 19:06. You can check that for yourself.
That means uploading the file, adding all the information, adding the
subtitle. I think ten minutes is hardly unreasonable. So, that is half
an hour to:
• Edit the animation;
• Sync the animation;
• Edit the video;
• Render the video;
• Upload and process the video.

Now, I use Audacity - and that doesn't support "punch and roll". It
would be quite hard to seamlessly edit the audio that way. And also - my
script clearly states: "If we want to get a dime out, the first one that
drops out is the one you put in last." Again, you don't have to believe
me.

    But either I just fixed the video - OR I did the video AND the audio
    plus the script in less than ten minutes. If that were the case, it
    would make me even more formidable! Your choice..

    Hans Bezemer
