• memcpy and friend (was: 80286 protected mode)

    From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 13:12:41 2024
    From Newsgroup: comp.arch

    On Tue, 15 Oct 2024 10:53:30 +0200
    David Brown <david.brown@hesbynett.no> wrote:
    On 14/10/2024 18:08, Michael S wrote:
    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to
    different objects?
    For almost all C programmers, the answer is "never".
    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler, for
    example, might want to store the fact that an error occurred while
    parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c.)

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove(), then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let
    you write an efficient memmove() in standard C.  That's why I
    said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in
    assembly or using inline assembly, rather than in non-portable
    C (which is the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove()
    and memcpy() on large transfers, and the overhead in setting
    things up that is proportionally more costly for small
    transfers. Often that can be eliminated when the compiler
    optimises the functions inline - when the compiler knows the
    size of the move/copy, it can optimise directly.
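
    A small example of that known-size case (the struct is made up for
    illustration): with the size fixed at compile time, gcc and clang at
    -O2 inline the copy as a couple of loads and stores rather than
    emitting a call.

    #include <string.h>
    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } pair;

    /* The 16-byte copy is inlined as two 8-byte (or one 16-byte
       vector) load/store pairs - no call to memcpy(). */
    void copy_pair(pair *dest, const pair *src)
    {
        memcpy(dest, src, sizeof *dest);
    }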

    What you are missing here, David, is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation,
    and has the inside knowledge about cache (residency at level x?
    width in bytes)/memory ranges/access rights/etc needed to do so
    in a very close to optimal manner, for both short and long
    transfers.

    I am not missing that at all.  And I agree that an advanced
    hardware MM instruction could be a very efficient way to
    implement both memcpy and memmove.  (For my own kind of work,
    I'd worry about such looping instructions causing an unbounded
    increase in interrupt latency, but that too is solvable given
    enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction,
    you don't need to re-write the implementation for your memmove()
    and memcpy() library functions for every new generation of
    processors of a given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will
    /sometimes/ get benefits from doing so, but it is not as simple
    as Mitch made out.

    I.e. totally removing the need for compiler tricks or wide
    register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and
    to recognize common patterns (just as most compilers already do
    today), and then the memmove() calls will usually be inlined.


    The original compiler library issue was that it is impossible to
    write an efficient memmove() implementation using pure portable
    standard C. That is independent of any ISA, any specialist
    instructions for memory moves, and any compiler optimisations.
    And it is independent of the fact that some good compilers can
    inline at least some calls to memcpy() and memmove() today, using
    whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on
    c.arch. I really don't think any of us actually disagree; it is just
    that we have been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation (a Norwegian
    idiom for two people talking past each other).

    I have a huge respect for Mitch, his knowledge and experience, and
    his willingness to share that freely with others. That's why I
    have found this very frustrating.


    a) memmove/memcpy are so important that people have been spending
    a lot of time & effort trying to make them faster, with the
    complication that in general they cannot be implemented in pure C
    (which disallows direct comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a
    simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation-dependent behaviour to determine alignments, or it
    must rely on unaligned accesses being allowed by the
    implementation.)
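
    A minimal sketch of that portable byte-copy form (the function name
    is illustrative, not the library's):

    #include <stddef.h>   /* size_t */

    void *memcpy_bytewise(void *dest, const void *src, size_t count)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;
        while (count--) {
            *d++ = *s++;   /* one byte per iteration - correct but slow */
        }
        return dest;
    }
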
    b) Mitch has, like Andy ("Crazy") Glew many years before,
    realized that if a cpu architecture actually has an instruction
    designed to do this particular job, it behooves cpu architects to
    make sure that it is in fact so fast that it obviates any need
    for tricky coding to replace it.

    Yes.

    Ideally, it should be able to copy a single object, up to a cache
    line in size, in the same or less time than needed to do so manually
    with a SIMD 512-bit load followed by a 512-bit store (both ops
    masked so as not to touch anything they shouldn't).
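
    A minimal sketch of that masked load/store pair, assuming AVX-512BW
    and C intrinsics (the function name is made up):

    #include <immintrin.h>
    #include <stddef.h>

    /* Copy up to 64 bytes (one cache line) with one masked 512-bit
       load and one masked 512-bit store. */
    static void copy_line(void *dest, const void *src, size_t n)
    {
        __mmask64 m = (n >= 64) ? ~(__mmask64)0
                                : (((__mmask64)1 << n) - 1);
        __m512i v = _mm512_maskz_loadu_epi8(m, src);
        _mm512_mask_storeu_epi8(dest, m, v);
    }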


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally
    by moving single bytes, and this was so slow that we also had REP
    MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
    REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the
    actual transfer at maximum bus speeds, i.e. at least one cache
    line/cycle for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do
    these basic operations faster than a software loop or the x86 "rep"
    instructions.

    No, that's not true. And according to my understanding, that's not
    what Terje wrote.
    REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
    details - fixed registers for src, dest and len, and the Direction
    flag in PSW instead of being part of the opcode).
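
    For illustration, a sketch of that fixed-register interface in GNU C
    inline assembly (x86-64; the function name is made up):

    #include <stddef.h>

    /* memcpy via REP MOVSB: dest in RDI, src in RSI, count in RCX -
       exactly the fixed registers noted above. Assumes DF is clear,
       as the ABI guarantees. */
    static void *copy_rep_movsb(void *dest, const void *src, size_t count)
    {
        void *d = dest;
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(src), "+c"(count)
                         :
                         : "memory");
        return dest;
    }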

    My understanding of what Terje wrote is that REP MOVSB /could/ be an
    efficient solution if it were backed by a hardware block to run well
    (i.e., transferring as many bytes per cycle as memory bus bandwidth
    allows). But REP MOVSB is /not/ efficient - and rather than making
    it work faster, Intel introduced variants with wider fixed sizes.

    Above a count of ~2000 bytes, REP MOVSB on the last few generations
    of Intel and AMD is very efficient.
    One can construct a case where a software implementation is a little
    faster in one or another selected benchmark, but typically at the
    cost of being slower in other situations.
    For smaller counts the story is different.
    Could REP MOVSB realistically be improved to be as efficient as the
    instructions in ARMv9, RISC-V, and Mitch's "MM" instruction? I
    don't know. Intel and AMD have had many decades to do so, so I
    assume it's not an easy improvement.

    You somehow assume that REP MOVSB would have to be improved. That
    remains to be seen.
    It's quite likely that when (or 'if', in the case of My 66000) those
    alternatives you mention hit silicon, we will find out that REP MOVSB
    is already better as it is, at least for memcpy(). For memmove(),
    especially short memmove(), REP MOVSB is easier to beat, because it
    was not designed with memmove() in mind.
    REP MOVSW/D/Q were introduced because back then processors were
    small and stupid. When your processor is big and smart you don't
    need them any longer. REP MOVSB is sufficient.
    The new Arm64 instructions that are hopefully coming next year are
    akin to REP MOVSB rather than to REP MOVSW/D/Q.
    Instructions for memmove, also defined by Arm and by Mitch, are the
    next logical step. IMHO, the main gain here is not a measurable
    improvement in performance, but the saving in code size when inlined.

    Now, is all that a good idea?

    That's a very important question.

    I am not 100% convinced.
    One can argue that the streaming alignment hardware that is necessary
    for a first-class implementation of these instructions is useful not
    only for memory copy.
    So, maybe, it makes sense to expose this hardware in more generic
    ways.

    I believe that is the idea of "scalable vector" instructions as an
    alternative philosophy to wide explicit SIMD registers. My
    expectation is that SVE implementations will be more effort in the
    hardware than SIMD for any specific SIMD-friendly size point (i.e.,
    power-of-two widths). That usually corresponds to lower clock rates
    and/or higher latency and more coordination from extra pipeline
    stages.

    But once you have SVE support in place, then memcpy() and memset()
    are just examples of vector operations that you get almost for free
    when you have hardware for vector MACs and other operations.

    You don't seem to understand what the 'S' in SVE stands for.
    Read more manuals. Read fewer marketing slides.
    Or try to write and profile code that utilizes SVE - that would
    improve your understanding more than anything else.
    Also, you don't seem to understand the issue at hand, which is
    exposing hardware that aligns a *stream* of N+1 aligned loads,
    turning it into N unaligned loads.
    In the absence of a 'load multiple' instruction, 128-bit SVE would
    help you here no more than 128-bit NEON. More so, 512-bit SVE
    wouldn't help enough, even ignoring the absence of any prospect of
    512-bit SVE in mainstream ARM64 cores.
    Maybe, at the ISA level, SME is a better base for what is wanted.
    But
    - SME would be quite bad for copying small segments.
    - SME does not appear to get much love from Arm vendors other than
    Apple.
    - SME blocks are expected to be implemented not in close proximity to
    the rest of the CPU core, which would make them problematic not just
    for copying small segments, but for medium-length segments (a few KB)
    as well.
    Maybe via Load Multiple Register? It was present in Arm's A32/T32,
    but didn't make it into ARM64. Or, maybe, there are even better
    ways that I have not thought of.

    And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving the goalposts.

    No, my goalposts have been in the same place all the time. Some
    others have been kicking the ball at a completely different set of
    goalposts, but I have kept the same point all along.

    One does not need "good implementation" in a sense you have in
    mind.

    Maybe not - but /that/ would be moving the goalposts.

    All one needs is an implementation that the compiler's pattern
    matching logic unmistakably recognizes as memmove/memcpy. That is
    very easily done in standard C. For memmove, I showed how to do it
    in one of the posts below. For memcpy it's very obvious, so there is
    no need to show it.

    But that would /not/ be an efficient implementation of memmove() in
    plain portable standard C.

    What do I mean by an "efficient" implementation in fully portable
    standard C? There are two possible ways to think about that. One is
    that the operations on the abstract machine are efficient. The other
    is that the code is likely to result in efficient code over a wide
    range of real-world compilers, options, and targets.
    No, there is no need for a wide range of compilers or options.
    The standard library (well, maybe I should say the "core of the
    standard library"; there is no such distinction in the C Standard,
    but it exists in many real-world implementations, in particular in
    gcc) is compiled with one compiler and one set of options. Or, at
    most, several selected sets of options that affect low-level code
    generation, but do not affect high-level optimizations.
    A range of targets is indeed desirable, but it does not have to be
    too wide.
    Besides, you forget that the argument was about the theoretical
    possibility of writing an efficient implementation of memmove() in
    Standard C, not about the practicality of doing so.
    My example achieves that target easily, and even exceeds it, because
    it's obvious that the required pattern matching is not just
    theoretically possible. Existing compilers are capable of handling
    much more complex cases. They likely cannot handle this particular
    case, but only because nobody has cared to add the few dozen lines
    of code to the compiler's logic.
    And I think it goes without saying that the implementation must not
    rely on any implementation-defined behaviour or anything beyond the
    minimal limits given in the C standards, and it must not introduce
    any new real or potential UB.

    Your "memmove()" implementation fails on several counts. It is
    inefficient in the abstract machine - it copies everything twice
    instead of once. It is inefficient in real-world implementations of
    all sorts and countless targets - being efficient for some compilers
    with some options on some targets (most of them hypothetical) does
    /not/ qualify as an efficient implementation. And quite clearly it
    risks causing failures from stack overflow in situations where the
    user would normally expect memmove() to function safely (on
    implementations other than those few that turn it into efficient
    object code).

    They would make it easier to write efficient
    implementations of these standard library functions for targets
    that had such instructions - but that would be
    implementation-specific code. And that is one of the reasons that
    C standard library implementations are tied to the specific
    compiler and target, and the writers of these libraries have
    "superpowers" and are not limited to standard C.



  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 15 13:20:31 2024
    From Newsgroup: comp.arch

    On 15/10/2024 12:12, Michael S wrote:
    On Tue, 15 Oct 2024 10:53:30 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 18:08, Michael S wrote:
    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 16:40, Terje Mathisen wrote:

    (I'm snipping for space - hopefully not too much.)


    REP MOVSB on x86 does the canonical memcpy() operation, originally
    by moving single bytes, and this was so slow that we also had REP
    MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
    REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the
    actual transfer at maximum bus speeds, i.e. at least one cache
    line/cycle for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do
    these basic operations faster than a software loop or the x86 "rep"
    instructions.

    No, that's not true. And according to my understanding, that's not
    what Terje wrote.
    REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
    details - fixed registers for src, dest and len, and the Direction
    flag in PSW instead of being part of the opcode).

    My understanding of what Terje wrote is that REP MOVSB /could/ be an
    efficient solution if it were backed by a hardware block to run well
    (i.e., transferring as many bytes per cycle as memory bus bandwidth
    allows). But REP MOVSB is /not/ efficient - and rather than making
    it work faster, Intel introduced variants with wider fixed sizes.


    Above a count of ~2000 bytes, REP MOVSB on the last few generations
    of Intel and AMD is very efficient.

    OK. That is news to me, and different from what I had thought.

    One can construct a case where a software implementation is a little
    faster in one or another selected benchmark, but typically at the
    cost of being slower in other situations.
    For smaller counts the story is different.

    Could REP MOVSB realistically be improved to be as efficient as the
    instructions in ARMv9, RISC-V, and Mitch's "MM" instruction? I
    don't know. Intel and AMD have had many decades to do so, so I
    assume it's not an easy improvement.


    You somehow assume that REP MOVSB would have to be improved.

    That is certainly what I have been assuming. I haven't investigated
    it myself in any way; I've merely inferred it from other posts. So
    unless someone else provides more information, I'll take your word
    for it that at least for modern x86 devices and large copies, it's
    already about as efficient as it could be.

    That remains to be seen.
    It's quite likely that when (or 'if', in the case of My 66000) those
    alternatives you mention hit silicon, we will find out that REP MOVSB
    is already better as it is, at least for memcpy(). For memmove(),
    especially short memmove(), REP MOVSB is easier to beat, because it
    was not designed with memmove() in mind.

    REP MOVSW/D/Q were introduced because back then processors were
    small and stupid. When your processor is big and smart you don't
    need them any longer. REP MOVSB is sufficient.
    The new Arm64 instructions that are hopefully coming next year are
    akin to REP MOVSB rather than to REP MOVSW/D/Q.
    Instructions for memmove, also defined by Arm and by Mitch, are the
    next logical step. IMHO, the main gain here is not a measurable
    improvement in performance, but the saving in code size when inlined.

    Now, is all that a good idea?

    That's a very important question.

    I am not 100% convinced.
    One can argue that the streaming alignment hardware that is necessary
    for a first-class implementation of these instructions is useful not
    only for memory copy.
    So, maybe, it makes sense to expose this hardware in more generic
    ways.

    I believe that is the idea of "scalable vector" instructions as an
    alternative philosophy to wide explicit SIMD registers. My
    expectation is that SVE implementations will be more effort in the
    hardware than SIMD for any specific SIMD-friendly size point (i.e.,
    power-of-two widths). That usually corresponds to lower clock rates
    and/or higher latency and more coordination from extra pipeline
    stages.

    But once you have SVE support in place, then memcpy() and memset()
    are just examples of vector operations that you get almost for free
    when you have hardware for vector MACs and other operations.


    You don't seem to understand what the 'S' in SVE stands for.
    Read more manuals. Read fewer marketing slides.
    Or try to write and profile code that utilizes SVE - that would
    improve your understanding more than anything else.


    It means "scalable". The idea is that the same binary code will use
    different stride sizes on different hardware - a bigger implementation
    of the core might have vector units handling wider strides than a
    smaller one. Am I missing something?

    Also, you don't seem to understand the issue at hand, which is
    exposing hardware that aligns a *stream* of N+1 aligned loads,
    turning it into N unaligned loads.
    In the absence of a 'load multiple' instruction, 128-bit SVE would
    help you here no more than 128-bit NEON. More so, 512-bit SVE
    wouldn't help enough, even ignoring the absence of any prospect of
    512-bit SVE in mainstream ARM64 cores.
    Maybe, at the ISA level, SME is a better base for what is wanted.
    But
    - SME would be quite bad for copying small segments.

    I would expect a certain amount of overhead, which will be a cost for
    small copies.

    - SME does not appear to get much love from Arm vendors other than
    Apple.

    If you say so. My main interest is in microcontrollers, and I don't
    track all the details of larger devices.

    - SME blocks are expected to be implemented not in close proximity to
    the rest of the CPU core, which would make them problematic not just
    for copying small segments, but for medium-length segments (a few KB)
    as well.


    That sounds like a poor design choice to me, but again I don't know
    the details.

    Maybe via Load Multiple Register? It was present in Arm's A32/T32,
    but didn't make it into ARM64. Or, maybe, there are even better
    ways that I have not thought of.

    And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving the goalposts.

    No, my goalposts have been in the same place all the time. Some
    others have been kicking the ball at a completely different set of
    goalposts, but I have kept the same point all along.

    One does not need "good implementation" in a sense you have in
    mind.

    Maybe not - but /that/ would be moving the goalposts.

    All one needs is an implementation that the compiler's pattern
    matching logic unmistakably recognizes as memmove/memcpy. That is
    very easily done in standard C. For memmove, I showed how to do it
    in one of the posts below. For memcpy it's very obvious, so there is
    no need to show it.

    But that would /not/ be an efficient implementation of memmove() in
    plain portable standard C.

    What do I mean by an "efficient" implementation in fully portable
    standard C? There are two possible ways to think about that. One is
    that the operations on the abstract machine are efficient. The other
    is that the code is likely to result in efficient code over a wide
    range of real-world compilers, options, and targets.

    No, there is no need for a wide range of compilers or options.

    There /is/ a wide range of compilers and options. If one were to try to
    make an efficient portable standard C implementation of a function
    (whether or not it is a standard library function), then it needs to
    work on any of these compilers with any options as long as they are at
    least reasonably standards compliant, and it should be reasonably
    efficient on a large proportion of them.

    The standard library (well, maybe I should say the "core of the
    standard library"; there is no such distinction in the C Standard,
    but it exists in many real-world implementations, in particular in
    gcc) is compiled with one compiler and one set of options. Or, at
    most, several selected sets of options that affect low-level code
    generation, but do not affect high-level optimizations.
    A range of targets is indeed desirable, but it does not have to be
    too wide.

    Of course. That is the /whole/ point. A C standard library is part
    of the implementation - it is tied to the compiler, options and
    target (as tightly or loosely as you want). When writing a
    "memmove()" implementation, there is no requirement for it to be
    portable or limited to standard C - there is no requirement for it
    to be in C at all. That is how we have functions like "memmove" at
    all, despite the fact that they cannot be implemented efficiently in
    portable standard C.



    Besides, you forget that the argument was about the theoretical
    possibility of writing an efficient implementation of memmove() in
    Standard C, not about the practicality of doing so.

    I have not forgotten that at all.

    My example achieves that target easily, and even exceeds it, because
    it's obvious that the required pattern matching is not just
    theoretically possible. Existing compilers are capable of handling
    much more complex cases. They likely cannot handle this particular
    case, but only because nobody has cared to add the few dozen lines
    of code to the compiler's logic.

    Just to be clear - your example was this:

    void *memmove(void *dest, const void *src, size_t count)
    {
        if (count > 0) {
            char tmp[count];
            memcpy(tmp, src, count);
            memcpy(dest, tmp, count);
        }
        return dest;
    }


    Some existing compilers may recognize that pattern, others do not.
    It is certainly true that it is /possible/ for compilers to recognize
    this pattern. It is equally certain that virtually all existing C
    compilers and option combinations do not recognize it. (Even ones
    that do, such as clang with -O2, generate a dozen instructions with a
    call to the library memmove() in the middle.) By no conceivable
    stretch of the imagination is your "solution" here a good, efficient,
    portable and standard C implementation of memmove(). It may, of
    course, be a perfectly good implementation for a /specific/ compiler
    and /specific/ target.



  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 14:55:30 2024
    From Newsgroup: comp.arch

    On Tue, 15 Oct 2024 13:20:31 +0200
    David Brown <david.brown@hesbynett.no> wrote:


    It means "scalable". The idea is that the same binary code will use different stride sizes on different hardware - a bigger
    implementation of the core might have vector units handling wider
    strides than a smaller one. Am I missing something?


    In practice, it means that on any given implementation you have a
    fixed width.
    The spec does not say that the width is equal to the width of the FP
    execution engine, but that appears to be the case in all
    implementations so far (one 512-bit implementation from Fujitsu, one
    256-bit implementation from Arm Inc. that was quickly de-emphasized,
    and many 128-bit implementations from several vendors).
    'S' means that a width-agnostic implementation of common algorithms,
    especially of Linear Algebra, is possible. Nobody promised that
    width-agnostic code would be as efficient as width-aware code.
    Especially so in algorithms that do little or no math. As it
    happens, in the case of memmove() we do want to be very efficient.
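
    A minimal width-agnostic copy loop in ACLE intrinsics, for
    illustration (the function name is made up; compile for a target
    with SVE):

    #include <arm_sve.h>
    #include <stdint.h>

    /* Copies n bytes, svcntb() bytes per iteration, the predicate
       masking off the final partial vector. The same binary runs at
       whatever vector width the hardware implements. */
    void sve_copy(uint8_t *dest, const uint8_t *src, uint64_t n)
    {
        for (uint64_t i = 0; i < n; i += svcntb()) {
            svbool_t pg = svwhilelt_b8(i, n);
            svst1_u8(pg, dest + i, svld1_u8(pg, src + i));
        }
    }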

    Also, you don't seem to understand the issue at hand, which is
    exposing hardware that aligns a *stream* of N+1 aligned loads,
    turning it into N unaligned loads.
    In the absence of a 'load multiple' instruction, 128-bit SVE would
    help you here no more than 128-bit NEON. More so, 512-bit SVE
    wouldn't help enough, even ignoring the absence of any prospect of
    512-bit SVE in mainstream ARM64 cores.
    Maybe, at the ISA level, SME is a better base for what is wanted.
    But
    - SME would be quite bad for copying small segments.

    I would expect a certain amount of overhead, which will be a cost for
    small copies.

    - SME does not appear to get much love from Arm vendors other than
    Apple.

    If you say so. My main interest is in microcontrollers, and I don't
    track all the details of larger devices.

    - SME blocks are expected to be implemented not in close
    proximity to the rest of the CPU core, which would make them
    problematic not just for copying small segments, but for
    medium-length segments (a few KB) as well.


    That sounds like a poor design choice to me, but again I don't know
    the details.


    That (i.e. implementation of SME as a very powerful accelerator
    shared by several cores) is an excellent design choice for what SME
    was invented for - matrix multiplications and kernels that are
    similar to matrix multiplication.
    For that purpose it works very well on Apple chips, delivering lots
    of FLOPs to a single thread/core. Programmers like it, both because
    single-threaded programming is easier than multi-threaded and because
    when fewer cores are tied up driving FPUs, more cores are left
    available for something else.
    To me it also sounds like a very suitable choice for long
    memcpy/memmove, i.e. for segments bigger than the size of the L1D$.
    But I am sure that it was not a major consideration for Apple's
    designers.


  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 15 14:03:18 2024
    From Newsgroup: comp.arch

    On 15/10/2024 13:55, Michael S wrote:
    On Tue, 15 Oct 2024 13:20:31 +0200
    David Brown <david.brown@hesbynett.no> wrote:


    It means "scalable". The idea is that the same binary code will use
    different stride sizes on different hardware - a bigger
    implementation of the core might have vector units handling wider
    strides than a smaller one. Am I missing something?


    In practice, it means that on any given implementation you have a
    fixed width.

    Yes. Or, more accurately, you have a fixed maximum width. If the
    final step doesn't use the full width, that's okay.

    The spec does not say that the width is equal to the width of the FP
    execution engine, but that appears to be the case in all
    implementations so far (one 512-bit implementation from Fujitsu, one
    256-bit implementation from Arm Inc. that was quickly de-emphasized,
    and many 128-bit implementations from several vendors).

    In general I'd expect the maximum width per step to match that of
    the appropriate execution engines, yes.

    'S' means that a width-agnostic implementation of common algorithms,
    especially of Linear Algebra, is possible. Nobody promised that
    width-agnostic code would be as efficient as width-aware code.
    Especially so in algorithms that do little or no math. As it
    happens, in the case of memmove() we do want to be very efficient.


    Perhaps. That is an issue of the quality of the SVE implementation,
    rather than the principle of it. I don't know the details of
    real-world SVE implementations, but I see no reason why the maximum
    stride size should be the same for all SVE instructions. An
    implementation that has a small cost-optimised floating point unit
    may have wider integer SVE strides or wider memory set/copy strides.

    Also, you don't seem to understand the issue at hand, which is
    exposing hardware that aligns a *stream* of N+1 aligned loads,
    turning it into N unaligned loads.
    In the absence of a 'load multiple' instruction, 128-bit SVE would
    help you here no more than 128-bit NEON. More so, 512-bit SVE
    wouldn't help enough, even ignoring the absence of any prospect of
    512-bit SVE in mainstream ARM64 cores.
    Maybe, at the ISA level, SME is a better base for what is wanted.
    But
    - SME would be quite bad for copying small segments.

    I would expect a certain amount of overhead, which will be a cost for
    small copies.

    - SME does not appear to get much love from Arm vendors other than
    Apple.

    If you say so. My main interest is in microcontrollers, and I don't
    track all the details of larger devices.

    - SME blocks are expected to be implemented not in close
    proximity to the rest of the CPU core, which would make them
    problematic not just for copying small segments, but for
    medium-length segments (a few KB) as well.


    That sounds like a poor design choice to me, but again I don't know
    the details.


    That (i.e. implementation of SME as a very powerful accelerator
    shared by several cores) is an excellent design choice for what SME
    was invented for - matrix multiplications and kernels that are
    similar to matrix multiplication.
    For that purpose it works very well on Apple chips, delivering lots
    of FLOPs to a single thread/core. Programmers like it, both because
    single-threaded programming is easier than multi-threaded and because
    when fewer cores are tied up driving FPUs, more cores are left
    available for something else.
    To me it also sounds like a very suitable choice for long
    memcpy/memmove, i.e. for segments bigger than the size of the L1D$.
    But I am sure that it was not a major consideration for Apple's
    designers.


