• On my AMD FX-8370 I don't benefit from a compact code area.

    From albert@spenarnc.xs4all.nl to comp.lang.forth on Thu Feb 27 13:18:57 2025
    From Newsgroup: comp.lang.forth

    I test lina64 on my AMD FX-8370 (8 cores, 4 GHz).

    The genuine Byte benchmark sieve takes 1.5 ms on my unmodified lina.
    That is an indirect threaded Forth with no optimisation and all the
    machine code scattered throughout the dictionary.

    I built a version where there is actually a code segment and all code
    is collected there. There was no significant difference in speed.

    All the code of the Forth fits comfortably in the L1 cache.
    Is this to be expected?
    An L1 cache hit is an L1 cache hit?

    Could Intel processors respond more to this distinction?


    Groetjes Albert
    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Thu Feb 27 18:18:46 2025
    From Newsgroup: comp.lang.forth

    albert@spenarnc.xs4all.nl writes:
    > I test lina64 on my AMD FX-8370 (8 cores, 4 GHz).
    >
    > The genuine Byte benchmark sieve takes 1.5 ms on my unmodified lina.
    > That is an indirect threaded Forth with no optimisation and all the
    > machine code scattered throughout the dictionary.
    >
    > I built a version where there is actually a code segment and all code
    > is collected there. There was no significant difference in speed.
    >
    > All the code of the Forth fits comfortably in the L1 cache.
    > Is this to be expected?
    > An L1 cache hit is an L1 cache hit?

    Not at all. Since the Pentium and the K5 (I think) there is an
    instruction cache and a data cache (and then uop caches, which can be
    seen as a kind of instruction cache). However, apart from the early
    ones (Pentium, K6, and probably K5), the same grains (with typically
    64-byte granularity these days) can reside in both the I-cache and the
    D-cache, as long as that grain is not written to.

    So if your complete Forth system including the primitives and the
    sieve program fits into the D-cache and fits into the I-cache, and you
    have no writes close to code, you will indeed only see compulsory
    misses.

    I have posted here about the performance pitfalls of keeping code
    close to data since 1995, and Forth system implementors typically have
    taken measures only when I presented benchmark results where their
    system looks bad. But they usually only did the minimum necessary for
    that particular benchmark, so over the years the issue has come up
    again and again.

    One interesting aspect is that small benchmarks like the sieve are
    often not affected, but larger application benchmarks are. E.g., in
    my recent work [ertl24] all the small benchmarks are unaffected by the
    problem, whereas several of the larger benchmarks were affected in
    SwiftForth-4.0.0-RC87 and saw significant speedups from a fix in RC89.

    So I applaud that you have done the right thing and completely
    separated code from data. You may not see a benefit on Sieve, but
    there may be a difference in a different program (and you may not even
    notice until you measure both variants).

    @InProceedings{ertl24,
    author = {M. Anton Ertl},
    title = {How to Implement Words (Efficiently)},
    crossref = {euroforth24},
    pages = {43--52},
    url = {http://www.euroforth.org/ef24/papers/ertl.pdf},
    url-slides = {http://www.euroforth.org/ef24/papers/ertl-slides.pdf},
    video = {https://www.youtube.com/watch?v=bAq4760h5ZQ},
    OPTnote = {not refereed},
    abstract = {The implementation of Forth words has to satisfy the
    following requirements: 1) A word must be
    represented by a single cell (for
    \code{execute}). 2) A word may represent a
    combination of code and data (for, e.g.,
    \code{does>}). In addition, on some hardware,
    keeping executed native code and (written) data
    close together results in slowness and therefore
    should be avoided; moreover, failing to pair up
    calls with returns results in (slow) branch
    mispredictions. The present work describes how
    various Forth systems over the decades have
    satisfied the requirements, and how many systems run
    into performance pitfalls in various situations.
    This paper also discusses how to avoid this
    slowness, including in native-code systems.}
    }
    @Proceedings{euroforth24,
    title = {40th EuroForth Conference},
    booktitle = {40th EuroForth Conference},
    year = {2024},
    key = {EuroForth'24},
    url = {http://www.euroforth.org/ef24/papers/proceedings.pdf}
    }

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
    EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
  • From albert@spenarnc.xs4all.nl to comp.lang.forth on Fri Feb 28 12:28:51 2025
    From Newsgroup: comp.lang.forth

    In article <2025Feb27.191846@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>
    Thanks for the insight.

    <SNIP>
    > So I applaud that you have done the right thing and completely
    > separated code from data. You may not see a benefit on Sieve, but
    > there may be a difference in a different program (and you may not even
    > notice until you measure both variants).

    Actually I have not done that. I added another configuration file to
    the existing 20 so that I can build with code and data separated, and
    that for 64-bit Linux only.
    I have tested it with only one of the three assemblers.
    So ciforth is merely prepared for such a change.

    I introduced a
    define( {_SEPARATED_}, _yes)dnl
    All other configurations have
    define( {_SEPARATED_}, _no)dnl
    The other configurations are not affected, because this second line is
    in prelude.m4, so _no is the default.
    The switching of segments is governed by _SEPARATED_ in the fasm.m4,
    gas.m4 and nasm.m4 macro files, because segment switching is dependent
    on the actual assembler used.
    So there are no changes to the generic i86 assembler base (ci86.gnr).
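
    As a rough illustration of how a switch like _SEPARATED_ could drive
    assembler-specific segment switching, here is a minimal m4 sketch. It is
    not the actual ciforth macro code: the macro names _CODE_SEGMENT_ and
    _DATA_SEGMENT_ are hypothetical, _yes and _no are treated as plain
    tokens rather than macros, and the emitted directives assume NASM
    syntax (gas or fasm would need their own directives).

    ```m4
    changequote({,})dnl  assumption: braces are the quote characters
    dnl Sketch only: _yes and _no are plain tokens here, and the two
    dnl segment macros below are hypothetical names.
    define({_SEPARATED_}, _yes)dnl
    define({_CODE_SEGMENT_}, {ifelse(_SEPARATED_, _yes, {section .text})})dnl
    define({_DATA_SEGMENT_}, {ifelse(_SEPARATED_, _yes, {section .data})})dnl
    dnl
    _DATA_SEGMENT_
    ; dictionary headers and other data would follow here
    _CODE_SEGMENT_
    ; machine code of the primitives would follow here
    ```

    With _yes the two macro calls expand to the section directives; with
    _no they expand to nothing, so code and data stay interleaved as before
    and the generic source needs no change.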

    The slight complication will not make it into a release unless there
    is convincing evidence that it is beneficial and I have used it myself
    extensively.

    It was more of an exercise to convince myself that I could add that.
    (There is one single i86 source file; the rest, i.e. 16/32/64 bit,
    Linux/DOS/Windows and data separation, is done by macros governed by
    configuration files.)

    You draw attention to the effect of assembler snippets in large programs.
    This situation is unlikely to happen in ciforth.
    If machine code is added for the sake of speed, that is not likely to be
    done with CODE END-CODE words, but by compacting words into a block by
    inlining everything. This would not be entangled with data.
