From Newsgroup: comp.lang.forth
dxf <dxforth@gmail.com> writes:
>On 10/03/2024 5:09 am, Anton Ertl wrote:
>It's difficult to imagine under what circumstances a loop address on
>the stack is faster, but it suggests one is starting from an
>inefficient or compromised base.
The starting point is gforth-fast from June 2023. Here's an example.
The inner loop of the siev benchmark is:
0 i c! dup +loop
The following shows the threaded code intermixed with the native code:
loop-back address in threaded code  loop-back address in return stack
lit 1->2                            lit 1->2
#0                                  #0
  mov r15,[r14]                       mov r15,[r14]
  add r14,$10                         add r14,$10
i 2->3                              i 2->3
  mov r9,[rbx]                        mov r9,[rbx]
  add r14,$08                         add r14,$08
c! 3->1                             c! 3->1
  mov [r9],r15lb                      mov [r9],r15lb
  add r14,$08                         add r14,$08
dup 1->2                            dup 1->2
  mov r15,r8                          mov r15,r8
  add r14,$08                         add r14,$08
(+loop) 2->1                        (+loop)-rstack 2->1
<PRIMES+$108>
  mov rax,[rbx]                       mov rdx,[rbx]
  mov rsi,[r14]                       mov rsi,$10[rbx]
  lea r10,$08[r14]                    mov rax,rdx
  mov rdx,rax                         sub rax,$08[rbx]
  sub rdx,$08[rbx]                    add rdx,r15
  add rax,r15                         lea rcx,[r15][rax]
  lea rcx,[r15][rdx]                  xor rcx,rax
  xor rcx,rdx                         xor rax,r15
  xor rdx,r15                         test rcx,rax
  test rcx,rdx                        js $7F22DC4C075F
  js $7F860CE101F1                    mov r14,rsi
  mov [rbx],rax                       mov [rbx],rdx
  mov rcx,[rsi]                       add r14,$08
  lea r14,$08[rsi]                    mov rcx,-$08[r14]
  jmp ecx                             jmp ecx
On Zen3 (Ryzen 5800X) and Tiger Lake (Core i5-1135G7) the return stack
variant is faster by a factor >2; we also see speedups on other
processors, but they are smaller. Where do these speedups come from?
If you look at the updates to r14, which contains the virtual-machine
instruction pointer, they are as follows:
loop-back address in threaded code  loop-back address in return stack
  add r14,$10                         add r14,$10
  add r14,$08                         add r14,$08
  add r14,$08                         add r14,$08
  add r14,$08                         add r14,$08
  mov rsi,[r14]                       mov rsi,$10[rbx]
  lea r14,$08[rsi]                    mov r14,rsi
                                      add r14,$08
The crucial difference is that in the left column there is an unbroken
dependence chain from the r14 at the end of the previous iteration to
the r14 at the end of the present iteration; this dependence chain has
a latency of 9 cycles per iteration on Zen3, meaning that, with enough
iterations, the loop takes at least 9 cycles per iteration.
In the right column r14 at the end of one iteration does not depend on
r14 at the end of the previous iteration, because the dependence chain
starts from the instruction "mov rsi,$10[rbx]". This means that the
loop can be executed faster, and on Zen3 and Tiger Lake that speedup
happens to be more than a factor of 2.
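
The latency count above can be reproduced with a small Python sketch.
This is only an illustration of the argument, not a measurement: the
latencies (1 cycle for add, lea and mov, 4 cycles for a load) are
assumed values chosen to be plausible for Zen3, and the names LATENCY,
THREADED, RSTACK and loop_carried_latency are made up for this
example. It models only the r14 chain and ignores every other limit on
loop speed.

```python
# Assumed latencies in cycles; illustrative values, not measurements.
LATENCY = {"add": 1, "lea": 1, "mov": 1, "load": 4}

# Each step is (kind, depends_on_r14): True if the step's result is
# computed from the current value of the r14 chain, False if the step
# produces a value that does not depend on it (a new chain starts).
THREADED = [
    ("add", True),    # add r14,$10
    ("add", True),    # add r14,$08
    ("add", True),    # add r14,$08
    ("add", True),    # add r14,$08
    ("load", True),   # mov rsi,[r14]     (address comes from r14)
    ("lea", True),    # lea r14,$08[rsi]
]

RSTACK = [
    ("add", True),    # add r14,$10
    ("add", True),    # add r14,$08
    ("add", True),    # add r14,$08
    ("add", True),    # add r14,$08
    ("load", False),  # mov rsi,$10[rbx]  (address comes from rbx)
    ("mov", True),    # mov r14,rsi
    ("add", True),    # add r14,$08
]


def loop_carried_latency(steps):
    """Latency of the chain from the previous iteration's final r14 to
    this iteration's final r14, or 0 if some step cuts the chain."""
    if any(not depends for _, depends in steps):
        return 0
    return sum(LATENCY[kind] for kind, _ in steps)


print(loop_carried_latency(THREADED))  # 4 adds + 1 load + 1 lea = 9
print(loop_carried_latency(RSTACK))    # chain is cut by the rbx load: 0
```

A result of 0 for the right column does not mean that the loop takes
no time; it only means that r14 no longer imposes a lower bound that
is carried from one iteration to the next.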
- anton
--
M. Anton Ertl
http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs:
http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard:
https://forth-standard.org/
EuroForth 2023:
https://euro.theforth.net/2023