Forum: War Ensemble BBS

On my AMD FX-8370 I don't benefit from a compact code area.

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Feb 27 13:18:57 2025

From Newsgroup: comp.lang.forth

I test lina64 on my 8-core, 4 GHz AMD FX-8370.

The genuine Byte benchmark sieve takes 1.5 ms on my unmodified lina.
That is an indirect-threaded Forth with no optimisation and all the
machine code scattered throughout the dictionary.
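
For readers without the benchmark at hand, here is a minimal sketch of one pass of the Byte sieve, written in Python only to show the size of the workload. It is not lina's Forth source; the names SIZE and byte_sieve are chosen here for illustration, and the flag-array size of 8190 is the one used in the classic Byte listing.

```python
SIZE = 8190  # flags cover indices 0..SIZE, as in the classic Byte listing


def byte_sieve(size=SIZE):
    """Run one pass of the sieve and return the number of primes found.

    flags[i] stands for the odd number 2*i + 3, so the even numbers
    (and the prime 2) are never represented.
    """
    flags = bytearray([1]) * (size + 1)
    count = 0
    for i in range(size + 1):
        if flags[i]:
            prime = i + i + 3
            # Strike out every odd multiple of this prime.
            for k in range(i + prime, size + 1, prime):
                flags[k] = 0
            count += 1
    return count


if __name__ == "__main__":
    print(byte_sieve())  # 1899 primes for size 8190
```

The working set is a flag array of about 8 KB plus a handful of primitives, which is why the whole benchmark can sit in the L1 caches.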

I built a version where there is actually a code segment and all code
is collected there. There was no significant difference in speed.

All the code of the Forth fits comfortably in the L1 cache.
Is this to be expected?
An L1 cache hit is an L1 cache hit?

Could Intel processors respond more to this distinction?

Groetjes Albert
--
Temu exploits Christians: (Disclaimer, only 10 apostles)
Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
And Gifts For Friends Family And Colleagues.
--- Synchronet 3.20c-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Feb 27 18:18:46 2025

From Newsgroup: comp.lang.forth

albert@spenarnc.xs4all.nl writes:

> I test lina64 on my 8-core, 4 GHz AMD FX-8370.
>
> The genuine Byte benchmark sieve takes 1.5 ms on my unmodified lina.
> That is an indirect-threaded Forth with no optimisation and all the
> machine code scattered throughout the dictionary.
>
> I built a version where there is actually a code segment and all code
> is collected there. There was no significant difference in speed.
>
> All the code of the Forth fits comfortably in the L1 cache.
> Is this to be expected?
> An L1 cache hit is an L1 cache hit?

Not at all. Since the Pentium and the K5 (I think) there is an
instruction cache and a data cache (and then uop caches, which can be
seen as a kind of instruction cache). However, apart from the early
ones (Pentium, K6, and probably K5), the same grains (with typically
64-byte granularity these days) can reside in both the I-cache and the
D-cache, as long as that grain is not written to.

So if your complete Forth system including the primitives and the
sieve program fits into the D-cache and fits into the I-cache, and you
have no writes close to code, you will indeed only see compulsory
misses.
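
The condition "no writes close to code" can be made concrete with a little address arithmetic. The following is only a sketch: it assumes a 64-byte line (the "typical" granularity mentioned above), the addresses are made up, and the helper names cache_lines and write_shares_line_with_code are hypothetical. Real hardware may react to a wider neighbourhood than a single line, so this models just the simplest case.

```python
LINE = 64  # assumed cache-line size in bytes


def cache_lines(start, length, line=LINE):
    """Return the set of line indices touched by bytes [start, start+length)."""
    first = start // line
    last = (start + length - 1) // line
    return set(range(first, last + 1))


def write_shares_line_with_code(code_start, code_len,
                                data_addr, data_len=8, line=LINE):
    """True if a write to the data lands in a line that also holds code."""
    code = cache_lines(code_start, code_len, line)
    data = cache_lines(data_addr, data_len, line)
    return bool(code & data)


if __name__ == "__main__":
    # Mixed dictionary: 24 bytes of machine code at 0x1000,
    # immediately followed by a variable's cell at 0x1018.
    print(write_shares_line_with_code(0x1000, 24, 0x1018))    # True
    # Separated layout: the same cell lives in a distant data area.
    print(write_shares_line_with_code(0x1000, 24, 0x200000))  # False
```

In the mixed layout every store to the variable touches a line that also contains executed code; in the separated layout it cannot.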

I have posted here about the performance pitfalls of keeping code
close to data since 1995, and Forth system implementors typically have
taken measures only when I presented benchmark results where their
system looks bad. But they usually only did the minimum necessary for
that particular benchmark, so over the years the issue has come up
again and again.

One interesting aspect is that small benchmarks like the sieve are
often not affected, but larger application benchmarks are. E.g., in
my recent work [ertl24] all the small benchmarks are unaffected by the
problem, whereas several of the larger benchmarks were affected in
SwiftForth-4.0.0-RC87 and saw significant speedups from a fix in RC89.

So I applaud that you have done the right thing and completely
separated code from data. You may not see a benefit on Sieve, but
there may be a difference in a different program (and you may not even
notice until you measure both variants).

@InProceedings{ertl24,
author = {M. Anton Ertl},
title = {How to Implement Words (Efficiently)},
crossref = {euroforth24},
pages = {43--52},
url = {http://www.euroforth.org/ef24/papers/ertl.pdf},
url-slides = {http://www.euroforth.org/ef24/papers/ertl-slides.pdf},
video = {https://www.youtube.com/watch?v=bAq4760h5ZQ},
OPTnote = {not refereed},
abstract = {The implementation of Forth words has to satisfy the
following requirements: 1) A word must be
represented by a single cell (for
\code{execute}). 2) A word may represent a
combination of code and data (for, e.g.,
\code{does>}). In addition, on some hardware,
keeping executed native code and (written) data
close together results in slowness and therefore
should be avoided; moreover, failing to pair up
calls with returns results in (slow) branch
mispredictions. The present work describes how
various Forth systems over the decades have
satisfied the requirements, and how many systems run
into performance pitfalls in various situations.
This paper also discusses how to avoid this
slowness, including in native-code systems.}
}
@Proceedings{euroforth24,
title = {40th EuroForth Conference},
booktitle = {40th EuroForth Conference},
year = {2024},
key = {EuroForth'24},
url = {http://www.euroforth.org/ef24/papers/proceedings.pdf}
}

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Fri Feb 28 12:28:51 2025

From Newsgroup: comp.lang.forth

In article <2025Feb27.191846@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
<SNIP>
Thanks for the insight.

<SNIP>

> So I applaud that you have done the right thing and completely
> separated code from data. You may not see a benefit on Sieve, but
> there may be a difference in a different program (and you may not even
> notice until you measure both variants).

Actually I have not done that. I added one more configuration file to
the existing 20, which makes it possible to build with code and data
separated, and that for 64-bit Linux only.
I have tested it with only one of the three assemblers.
So ciforth is merely prepared for such a change.

I introduced a
    define( {_SEPARATED_}, _yes)dnl
All other configurations have
    define( {_SEPARATED_}, _no)dnl
The other configurations are not affected, because this line is in
prelude.m4, so the latter is the default.
The switching of segments is governed by _SEPARATED_ in the fasm.m4,
gas.m4 and nasm.m4 macro files, because segment switching is dependent
on the actual assembler used.
So there are no changes to the generic i86 assembler base (ci86.gnr).

The slight complication will not make it into a release unless there
is convincing evidence that it is beneficial and I have used it myself
extensively.

It was more of an exercise to convince myself that I could add that.
(There is one single i86 source file, and the rest, i.e. 16/32/64,
linux/DOS/windows and data separation, is done by macros governed by
configuration files.)

You draw attention to the effect of assembler snippets in large programs.
This situation is unlikely to happen in ciforth.
If machine code is added for the sake of speed, that is not likely to
be done with CODE ... END-CODE words, but by compacting words into a
block by inlining everything. This would not be entangled with data.


