• Re: 80286 protected mode

    From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 9 16:28:19 2024
    From Newsgroup: comp.arch

    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library. (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Oct 9 16:42:38 2024

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library. (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    In most every mainstream implementation, memmove() is written
    in assembler in order to inject the appropriate prefetches and
    follow the recommended instruction usage per the target architecture
    software optimization guide.
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 9 18:10:44 2024

    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).
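    One well-defined way to get such an extra "condition" value in portable C is
    to take the address of a dedicated sentinel object; pointer equality against
    a real object's address is always defined, unlike integer-encoded pointers.
    A minimal sketch (the parser names here are hypothetical, not from any real
    compiler):

```c
#include <stddef.h>

/* Sketch: distinguish "no subexpression" (NULL) from "parse error"
 * using the address of a dedicated sentinel object.  Comparing any
 * pointer for equality against &parse_error_sentinel is well-defined
 * standard C, unlike comparing arbitrary integer-encoded pointers. */
struct expr { int kind; };

static struct expr parse_error_sentinel;
#define PARSE_ERROR (&parse_error_sentinel)

static struct expr some_expr = { 1 };   /* stand-in parse result */

struct expr *parse_subexpr(const char *text)
{
    if (text == NULL || text[0] == '\0')
        return NULL;                    /* nothing to parse */
    if (text[0] == '?')
        return PARSE_ERROR;             /* error, no valid expression */
    return &some_expr;                  /* stand-in for a real parse */
}
```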
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 9 22:20:42 2024

    On 09/10/2024 18:28, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".  Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.  (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    They don't have to write it in standard, portable C. Standard libraries
    will, sometimes, use "magic" - they can be in assembly, or use compiler
    extensions, or target-specific features, or "-fsecret-flag-for-std-lib"
    compiler flags, or implementation-dependent features, or whatever they
    want.

    You will find that most implementations of memmove() are done by
    converting the pointers to an unsigned integer type and comparing those
    values. The type chosen may be implementation-dependent, or it may be
    "uintptr_t" (even if you are using C90 for your code, the library
    writers can use C99 for theirs).
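    A minimal sketch of that approach, assuming a flat address space where
    converting pointers to uintptr_t preserves ordering (the name my_memmove
    and the byte-at-a-time loops are illustrative; real libraries copy in
    larger units):

```c
#include <stddef.h>
#include <stdint.h>

/* Non-portable sketch: decide copy direction by comparing the pointers
 * as integers.  ISO C does not define relational comparison of pointers
 * to different objects, but on common flat-memory targets the uintptr_t
 * comparison below behaves as expected. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        for (size_t i = 0; i < n; i++)      /* copy forwards */
            d[i] = s[i];
    } else {
        for (size_t i = n; i > 0; i--)      /* copy backwards */
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```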

    Such implementations will not be portable to all systems. They won't
    work on a target that has some kind of "fat" pointers or segmented
    pointers that can't be translated properly to integers.

    That's okay, of course. For targets that have such complications, that
    standard library function will be written in a different way.

    The avrlibc library used by gcc for the AVR has its memmove()
    implemented in assembly for speed, as does musl for some architectures.

    There are lots of parts of the standard C library that cannot be written
    completely in portable standard C. (How would you write a function that
    handles files? You need non-portable OS calls.) That's why these
    things are in the standard library in the first place.

  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 9 22:22:16 2024

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 9 21:37:30 2024

    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 9 14:52:39 2024

    On 10/9/2024 1:20 PM, David Brown wrote:
    On 09/10/2024 18:28, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".  Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.  (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    They don't have to write it in standard, portable C.  Standard libraries
    will, sometimes, use "magic" - they can be in assembly, or use compiler
    extensions, or target-specific features, or "-fsecret-flag-for-std-lib"
    compiler flags, or implementation-dependent features, or whatever they
    want.

    You will find that most implementations of memmove() are done by
    converting the pointers to an unsigned integer type and comparing those
    values.  The type chosen may be implementation-dependent, or it may be
    "uintptr_t" (even if you are using C90 for your code, the library
    writers can use C99 for theirs).

    Such implementations will not be portable to all systems.  They won't
    work on a target that has some kind of "fat" pointers or segmented
    pointers that can't be translated properly to integers.

    That's okay, of course.  For targets that have such complications, that
    standard library function will be written in a different way.

    The avrlibc library used by gcc for the AVR has its memmove()
    implemented in assembly for speed, as does musl for some architectures.

    There are lots of parts of the standard C library that cannot be written
    completely in portable standard C.  (How would you write a function that
    handles files?  You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.

    I agree with everything you say up until the last sentence. There are
    several languages, mostly older ones like Fortran and COBOL, where the
    file handling/I/O are defined portably within the language proper, not
    in a separate library. It just moves the non-portable stuff from the
    library writer (as in C) to the compiler writer (as in Fortran, COBOL, etc.)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 10 00:33:41 2024

    On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be written
    completely in portable standard C.  (How would you write a function that
    handles files? 

    Do you mean things other than open(), close(), read(), write(), lseek()
    ??

    You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 10 08:24:32 2024

    On 09/10/2024 23:52, Stephen Fuld wrote:
    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be
    written completely in portable standard C.  (How would you write a
    function that handles files?  You need non-portable OS calls.)  That's
    why these things are in the standard library in the first place.

    I agree with everything you say up until the last sentence.  There are
    several languages, mostly older ones like Fortran and COBOL, where the
    file handling/I/O are defined portably within the language proper, not
    in a separate library.  It just moves the non-portable stuff from the
    library writer (as in C) to the compiler writer (as in Fortran, COBOL,
    etc.)



    I meant that this is why these features have to be provided, rather than
    left for the user to implement themselves. They could also have been
    provided in the language itself (as was done in many other languages) -
    the point is that you cannot write the file access functions in pure
    standard C.


  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 10 08:30:37 2024

    On 10/10/2024 02:33, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be written
    completely in portable standard C.  (How would you write a function that
    handles files?

    Do you mean things other than open(), close(), read(), write(), lseek()
    ??


    The C standard library provides functions like fopen(), fclose(),
    fwrite(), etc. It provides them because programs often need such
    functionality, and you cannot write them yourself in portable standard
    C. (As Stephen pointed out, C could have had them built into the
    language - for many good reasons, C did not go that route.)

    The functions you list here are the POSIX names - not the C standard
    library names. Those POSIX functions cannot be implemented in portable
    standard C either if you exclude making wrappers around the standard
    library functions.

    In both cases - implementing the standard library functions or
    implementing the POSIX functions - you need something beyond standard C,
    such as a way to call OS API's.

    You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.
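    As a concrete illustration of that dependency: even a trivial file-opening
    helper bottoms out in a platform API at some point. A hedged sketch,
    assuming a POSIX system (the function open_readonly is illustrative, not
    how any particular libc is written):

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch: the portable-looking wrapper ends in a non-ISO-C call.
 * open() is POSIX; a Windows implementation would call a different
 * OS API, and a freestanding target might have no file API at all. */
int open_readonly(const char *path)
{
    return open(path, O_RDONLY);     /* the unavoidable OS call */
}
```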

  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 10 08:31:52 2024

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not. It has absolutely /nothing/ to do with the ISA.



  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 10 18:38:55 2024

    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not. It has absolutely /nothing/ to do with the ISA.

    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove(), then absolutely no heroics are needed
    or desired in the libc call.

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 10 21:21:20 2024

    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove(), then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers. Often that can be
    eliminated when the compiler optimises the functions inline - when the
    compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code generator,
    its run-time support library, and C standard libraries that can work
    better if they are optimised for each new generation of processor.
    Sometimes you just need to re-compile the library with a newer compiler
    and appropriate flags, other times you need to modify the library source
    code. None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcpy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.
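    The inlining point above can be seen in a plain C fragment: when the length
    is a compile-time constant, compilers commonly expand the call inline
    rather than branching to the library (copy16 is an illustrative name):

```c
#include <string.h>

/* With a constant size, compilers typically lower this memcpy() to a
 * few register-width loads and stores, so none of the library's
 * runtime size-dispatch overhead is paid for small transfers. */
void copy16(void *dst, const void *src)
{
    memcpy(dst, src, 16);
}
```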


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 10 20:00:29 2024

    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove(), then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C. That's why I said there was no
    connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers. Often that can be
    eliminated when the compiler optimises the functions inline - when the
    compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code generator,
    its run-time support library, and C standard libraries that can work
    better if they are optimised for each new generation of processor.
    Sometimes you just need to re-compile the library with a newer compiler
    and appropriate flags, other times you need to modify the library source
    code. None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcpy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle memcpy
    and memset.

    They're three-instruction sets; prolog/body/epilog. There are separate
    sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction copies an IMPDEF portion, and

    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 10 23:54:15 2024

    On Thu, 10 Oct 2024 20:00:29 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove(), then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C. That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers. Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    The use of wider register sizes can help to some extent, but not
    once you have reached the width of the internal buses or cache
    bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries
    that can work better if they are optimised for each new generation
    of processor. Sometimes you just need to re-compile the library with
    a newer compiler and appropriate flags, other times you need to
    modify the library source code. None of this is specific to
    memmove().

    But it is true that you get an easier and more future-proof
    memmove() and memcpy() if you have an ISA that supports scalable
    vector processing of some kind, such as ARM and RISC-V have, rather
    than explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
    memcpy and memset.

    They're three-instruction sets; prolog/body/epilog. There are
    separate sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction copies an IMPDEF portion, and

    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.

    People who have more clue about Arm Inc's schedule than myself
    expect Arm Cortex cores that implement these instructions to be
    announced next May and to appear in actual [expensive] phones in 2026.
    Which probably means 2027 at best for Neoverse cores.

    It's hard to make an educated guess about the schedules of other Arm
    core designers.

  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 10 21:03:33 2024

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 10 Oct 2024 20:00:29 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove(), then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C. That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers. Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    The use of wider register sizes can help to some extent, but not
    once you have reached the width of the internal buses or cache
    bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries
    that can work better if they are optimised for each new generation
    of processor. Sometimes you just need to re-compile the library with
    a newer compiler and appropriate flags, other times you need to
    modify the library source code. None of this is specific to
    memmove().

    But it is true that you get an easier and more future-proof
    memmove() and memcpy() if you have an ISA that supports scalable
    vector processing of some kind, such as ARM and RISC-V have, rather
    than explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
    memcpy and memset.

    They're three-instruction sets; prolog/body/epilog. There are
    separate sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction copies an IMPDEF portion, and

    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.

    People who have more clue about Arm Inc's schedule than myself
    expect Arm Cortex cores that implement these instructions to be
    announced next May and to appear in actual [expensive] phones in 2026.
    Which probably means 2027 at best for Neoverse cores.

    It's hard to make an educated guess about the schedules of other Arm
    core designers.

    In the meantime, they've had "DC ZVA" for the special case of
    memset(,0,) since ARMv8.0.
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Thu Oct 10 16:19:31 2024

    On 10/10/24 2:21 PM, David Brown wrote:
    [ SNIP]

    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.  That's why I said there was no
    connection between the two concepts.

    If the compiler generates the memmove instruction, then one doesn't
    have to write memmove() in C - it is never called/used.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers.  Often that can be
    eliminated when the compiler optimises the functions inline - when the
    compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code generator,
    its run-time support library, and C standard libraries that can work
    better if they are optimised for each new generation of processor.
    Sometimes you just need to re-compile the library with a newer compiler
    and appropriate flags, other times you need to modify the library source
    code.  None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcpy() if you have an ISA that supports scalable vector processing
    of some kind, such as ARM and RISC-V have, rather than explicitly sized
    SIMD registers.


    Not applicable.


  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 10 21:30:38 2024

    On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:

    On 10/10/2024 20:38, MitchAlsup1 wrote:
    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove(), then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.

    {
    memmove( p, q, size );
    }

    Where the compiler produces the MM instruction itself. Looks damn
    close to standard C to me !!
    OR
    for( int i = 0; i < size; i++ )
        p[i] = q[i];

    Which gets compiled to memcpy()--also looks to be standard C.
    OR

    p_struct = q_struct;

    gets compiled to::

    memmove( &p_struct, &q_struct, sizeof( q_struct ) );

    also looks to be std C.

    The thing is you are no longer writing memmove(), you are simply
    teaching the compiler to recognize its _use_ cases directly. In
    addition, these will always be within spitting distance of as fast
    as one can perform those activities.

    That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers.

    Given that we are talking about GBOoO machines here, the several
    AGEN units[1,2,3] have plenty of calculation BW to determine order
    without wasting cycles getting started.

    But given LBIO machine, the ability to process memory to memory moves
    at cache port width is always an advantage except for cases needing
    only 1 read or 1 write--if you build the HW with these in mind.

    Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.

    In HW they should always be optimized.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Oct 11 13:37:03 2024
    From Newsgroup: comp.arch

    On 10/10/2024 23:19, Brian G. Lucas wrote:
    On 10/10/24 2:21 PM, David Brown wrote:
    [ SNIP]

    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    If the compiler generates the memmove instruction, then one doesn't
    have to write memmove() in C - it is never called/used.


    The common case is that a good compiler will generate inline code for
    some cases - typically known (at compile-time) small sizes - and call a generic library function when the size is not known or is over a certain
    size. Then there are some targets where it will always call the library
    code, and some where it will always generate inline code.

    Even if the compiler /can/ generate inline code, there can be
    circumstances when it will not do so - such as if you have not enabled optimisation, or are optimising for size, or using a weaker compiler, or calling the function indirectly.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries that
    can work better if they are optimised for each new generation of
    processor. Sometimes you just need to re-compile the library with a
    newer compiler and appropriate flags, other times you need to modify
    the library source code.  None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcpy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.


    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable to
    /what/ ?


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Oct 11 14:10:13 2024
    From Newsgroup: comp.arch

    On 10/10/2024 23:30, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:

    On 10/10/2024 20:38, MitchAlsup1 wrote:
    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove(), then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.

          {
               memmove( p, q, size );
          }


    What is that circular reference supposed to do? The whole discussion
    has been about the /fact/ that you cannot implement the "memmove"
    function in a C standard library using fully portable standard C code.

    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?


    You can implement "memcpy" in portable standard C, using a loop and
    array or pointer syntax (somewhat like your loop below, but with the
    correct type for the index). But you cannot do so for memmove() because
    you cannot identify the direction you need to run your loop in an
    efficient and fully portable manner.
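
    The contrast can be made concrete with a minimal sketch (an illustration, not standard-library code): my_memcpy below is fully portable, while my_memmove needs an ordered pointer comparison that standard C only defines for pointers into the same object.

```c
#include <stddef.h>

/* Portable memcpy: legal in standard C because the regions are
   required not to overlap, so the copy order is irrelevant. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* A typical memmove picks the copy direction with an ordered
   pointer comparison.  When dst and src point into different
   objects, the comparison below is undefined behaviour in
   standard C - exactly the sticking point discussed above. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d < s) {                        /* UB across distinct objects */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

    The comparison is well-defined in the test below only because both pointers land in the same array.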

    It does not matter what the target is - the target is totally irrelevant
    for /portable/ standard C code. If the target made a difference, it
    would not be portable!

    I can't understand why this is causing you difficulty.

    Perhaps you simply didn't understand what you wrote a few posts back,
    when you claimed that the reason people writing portable standard C code cannot write an efficient memmove() implementation is "a symptom of bad
    ISA design".


    Where the compiler produces the MM instruction itself. Looks damn
    close to standard C to me !!
    OR
          for( int i = 0; i < size; i++ )
               p[i] = q[i];

    Which gets compiled to memcpy()--also looks to be standard C.
    OR

          p_struct = q_struct;

    gets compiled to::

          memmove( &p_struct, &q_struct, sizeof( q_struct ) );

    also looks to be std C.


    Those are standard C, yes. And a good compiler will optimise such code.
    And if the target has some kind of scalable vector support or other dedicated instructions for moving or copying memory, it can do a better
    job of optimising the code.

    That has /nothing/ to do with the point under discussion.


    I think you are simply confused about what you are talking about here.
    Either you don't know what is meant by writing portable standard C, or
    you don't know what is meant by implementing a C standard library, or
    you haven't actually been reading the posts you replied to. You seem determined to make the point that /your/ ISA has useful and efficient instructions and features for memory copy functionality, while the x86
    ISA does not, and that means /your/ ISA is good design and the x86 ISA
    is bad design.

    Now, I will fully agree with you that the x86 is not a good design. The modern x86 processor devices are proof that you /can/ polish a turd.
    And I fully agree with you that instructions for arbitrary length vector instructions of various sorts (of which memory copying is the simplest operation) have many advantages over SIMD using fixed-size vector
    registers. (ARM and RISC-V also agree with you there.)

    But that is all irrelevant to the discussion.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 11 15:13:17 2024
    From Newsgroup: comp.arch

    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his my66k
    LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad idea
    for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification, i.e.
    exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
    effort.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Oct 11 16:54:13 2024
    From Newsgroup: comp.arch

    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his my66k
    LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad idea
    for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification, i.e.
    exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
    effort.


    That explanation helps a little, but only a little. I wasn't suggesting anything - or if I was, it was several posts ago and the context has
    long since been snipped. Can you be more explicit about what you think
    I was suggesting, and why it might not be a good idea for targeting a
    "my66k" ISA? (That is not a processor I have heard of, so you'll have
    to give a brief summary of any particular features that are relevant here.)


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Fri Oct 11 08:15:29 2024
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be
    written completely in portable standard C. (How would you write
    a function that handles files? You need non-portable OS calls.)
    That's why these things are in the standard library in the first
    place.

    I agree with everything you say up until the last sentence. There
    are several languages, mostly older ones like Fortran and COBOL,
    where the file handling/I/O are defined portably within the
    language proper, not in a separate library. It just moves the
    non-portable stuff from the library writer (as in C) to the
    compiler writer (as in Fortran, COBOL, etc.)

    What I think you mean is that I/O and file handling are defined as
    part of the language rather than being written in the language.
    Assuming that's true, what you're saying is not at odds with what
    David said. I/O and so forth cannot be written in unaugmented
    standard C without changing the language. Given the language as
    it is, these things must be put in the standard library, because
    they cannot be provided in the existing language.

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library. In
    particular, it makes for a very clean distinction between two
    kinds of implementation, what the C standard calls a freestanding implementation (which excludes most of the library) and a hosted
    implementation (which includes the whole library). This facility
    is what allows C to run easily on very small processors, because
    there is no overhead for non-essential language features. That is
    not to say such things couldn't be arranged for Fortran or COBOL,
    but it would be harder, because those languages are not designed
    to be separable.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Oct 11 18:55:29 2024
    From Newsgroup: comp.arch

    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Oct 12 00:02:32 2024
    From Newsgroup: comp.arch

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

          .global memmove
    memmove:
          MM     R2,R1,R3
          RET

    sure !

    You are either totally clueless, or you are trolling. And I know you
    are not clueless.

    This discussion has become pointless.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Oct 11 23:32:20 2024
    From Newsgroup: comp.arch

    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

          .global memmove
    memmove:
          MM     R2,R1,R3
          RET

    sure !

    You are either totally clueless, or you are trolling. And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed, these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers'
    efforts over decades, so they don't have to re-write libc every
    time a new set of instructions comes out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brett@ggtgp@yahoo.com to comp.arch on Sat Oct 12 05:06:05 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const? That causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Sat Oct 12 05:11:44 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:

    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different
    objects? For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they
    can implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard
    library memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    Throughout this long thread you keep missing the point. Having
    different instructions available doesn't change the definition
    of the C language. It is possible to write code in standard C
    (which means, C that does NOT depend on any internal details of
    any implementation) to copy bytes from one place to another with
    semantics matching those of memmove(), BUT that code is clunky.
    To get a decent implementation of memmove() semantics requires
    knowledge of some internal implementation details that are not
    part of standard C. Whether those details are part of the
    compiler or part of the runtime environment (the library) is
    irrelevant - they still aren't part of standard C. Adding new
    instructions to the ISA, no matter what those new instructions
    are, cannot change that.
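
    One clunky but fully portable way to get memmove() semantics, offered here only to illustrate the clunkiness (no real libc does this), is to route the copy through a temporary buffer so the overlap question never arises:

```c
#include <stdlib.h>
#include <stddef.h>

/* memmove() semantics in strictly conforming C: no pointer
   comparison between the two regions is ever needed, because the
   source is staged in a temporary buffer first.  The cost is an
   allocation that can fail, which the real memmove() cannot do -
   a limitation of the sketch, and part of why this is clunky. */
void *clunky_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    unsigned char *tmp = malloc(n ? n : 1);
    if (!tmp)
        return NULL;                  /* sketch limitation */
    for (size_t i = 0; i < n; i++)
        tmp[i] = s[i];
    for (size_t i = 0; i < n; i++)
        d[i] = tmp[i];
    free(tmp);
    return dst;
}
```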
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Oct 12 17:16:44 2024
    From Newsgroup: comp.arch

    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed, these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers'
    efforts over decades, so they don't have to re-write libc every
    time a new set of instructions comes out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of the different ways functions like
    memcpy() and memset() can be implemented in different architectures and
    optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Sat Oct 12 19:26:30 2024
    From Newsgroup: comp.arch

    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?
    --
    Bernd Linsel
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Sat Oct 12 12:36:43 2024
    From Newsgroup: comp.arch

    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brett@ggtgp@yahoo.com to comp.arch on Sat Oct 12 18:17:18 2024
    From Newsgroup: comp.arch

    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not
    available?

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is
    loaded and that count value could be say 5. Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over prefetch
    to cover being wrong.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brett@ggtgp@yahoo.com to comp.arch on Sat Oct 12 18:33:17 2024
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of the different ways functions like
    memcpy() and memset() can be implemented in different architectures and
    optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.

    There are only two decisions to make in memcpy: are the copies aligned
    to the copy size, and do the pointers overlap within the copy size.

    For hardware this simplifies down to perhaps two types of copies, easy and hard.

    If you make hard fast, and you will, then two versions are all you need,
    not the dozens of choices with 1k of code you need in C.

    Often you know which of the two you want at compile time from the pointer
    type.

    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.
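
    The two-way split described above can be sketched as a classifier. This is only an illustration: the address arithmetic on uintptr_t is implementation-defined rather than strictly portable, which is the usual state of real library code.

```c
#include <stdint.h>
#include <stddef.h>

/* Classify a copy as "easy" or "hard" per the two decisions above:
   (1) is everything aligned to the copy width, and (2) do the two
   regions overlap within the span of the copy?  Anything failing
   either test falls into the single "hard" path. */
enum copy_kind { COPY_EASY, COPY_HARD };

static enum copy_kind classify(const void *dst, const void *src, size_t n)
{
    const uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    const uintptr_t w = sizeof(uintptr_t);       /* copy width */

    int aligned    = ((d | s | (uintptr_t)n) % w) == 0;
    int no_overlap = (d >= s + n) || (s >= d + n);

    return (aligned && no_overlap) ? COPY_EASY : COPY_HARD;
}
```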

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Oct 12 18:32:48 2024
    From Newsgroup: comp.arch

    On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.

    The 3rd Operand can, indeed, be a constant.
    That causes no restartability problem when you have a place to
    store the current count==index, so that when control returns
    and you re-execute MM, it sees that x amount has already been
    done, and C-X is left.
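
    In software terms the scheme reads like the model below (a behavioural sketch of the description above, not the actual My 66000 implementation): the current index X is saved state, so re-executing the operation after an interruption copies only the remaining C - X bytes.

```c
#include <stddef.h>

/* Model of a restartable MM: 'budget' stands in for "work done
   before the next interrupt", and the struct stands in for the
   architectural register holding the count==index. */
typedef struct {
    size_t done;                     /* X: bytes copied so far */
} mm_state;

/* Returns 1 once the full count C has been copied. */
int mm_resume(unsigned char *dst, const unsigned char *src,
              size_t count, size_t budget, mm_state *st)
{
    while (st->done < count && budget > 0) {
        dst[st->done] = src[st->done];
        st->done++;
        budget--;
    }
    return st->done == count;
}
```

    Re-invoking mm_resume() with the same state picks up exactly where the "interrupt" left off, which is the restart property being described.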

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    That is what Predication is for.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Oct 12 18:37:35 2024
    From Newsgroup: comp.arch

    On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not available?

    There is always a count available; it can come from a register or an
    immediate.

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is loaded and that count value could be say 5.

    The instruction cannot start until the count is known. You don't start
    an FMAC until all 3 operands are ready, either.

    Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over
    prefetch to cover being wrong.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brett@ggtgp@yahoo.com to comp.arch on Sun Oct 13 01:25:13 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not
    available?

    There is always a count available; it can come from a register or an immediate.

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is
    loaded and that count value could be say 5.

    The instruction cannot start until the count is known. You don't start
    an FMAC until all 3 operands are ready, either.

    That simplifies a lot of issues, thanks!

    Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over
    prefetch to cover being wrong.




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Oct 12 23:09:27 2024
    From Newsgroup: comp.arch

    On 10/12/24 2:37 PM, MitchAlsup1 wrote:
    On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:
    [snip]
    Worst case the source and dest are in cache, and the count is
    150 cycles
    away in memory. So hundreds of chars could be copied until the
    value is
    loaded and that count value could be say 5.

    The instruction cannot start until the count is known. You don't
    start
    an FMAC until all 3 operands are ready, either.

    This is not _strictly_ true. Some ARM implementations start an
    FMADD before the addend is available when it is known that it
    will be available in time. This allows dependent accumulation
    with a latency equal to the ADD part.

    One might even be able to start the shift to align addend and
    product early as this value is easy to calculate for normal FP
    values.

    In many microarchitectures, an operation will be scheduled to
    execute when an L1 cache hit would be expected to make an operand
    available. I.e., the instruction "starts" before the operand is
    actually available.

    With branch prediction, a branch instruction is "started" before
    the condition has been evaluated. Your statement implies that
    My 66000 MM implementations will not do such prediction.

    In the case of a memory copy, performing rollback of
    misspeculation is potentially much easier than in the general case
    of a loop with store operations.

    Memory copy also facilitates deeper speculation. The source data
    can be preserved in memory more readily than arbitrary sequences
    of register contents. If both source and destination start points
    are known, destination reads can be translated into source reads
    within a speculation domain. (The source could also be prefetched
    before the destination is known.)

    It does seem that My 66000's MM does not completely eliminate the
    potential for faster special case software even if every
    implementation is perfect. Software might know that the tail part
    of a cache block that is not overwritten is dead data. This can
    avoid a read-for-ownership of the last destination block: software
    could do a cache-block zero for the last block and then copy the
    data over that. This special case might apply when appending to a
    buffer.

    I do not know that adding a MM instruction variant to handle that
    special case would be worthwhile.

    I am skeptical that all implementations of MM would be perfect,
    i.e., perform at least as well as software more specifically
    controlling hardware if such control had been provided by the ISA.
    E.g., ISA support for byte-masks for stores might not only allow
    non-contiguous stores (such as updating more than one field in a
    structure while leaving other intermediately placed fields
    unchanged) but might have higher performance than a general MM if
    the source happened to be replicated in a register.

    "Hard cases make bad law" may be generalized to special cases make
    bad (general) interfaces. Clean interfaces that can be implemented
    almost optimally have advantages over complicated interfaces that
    can theoretically handle more cases optimally **if one uses the
    proper (highly specific) incantation!!!**
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sun Oct 13 10:31:49 2024
    From Newsgroup: comp.arch

    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of different ways things functions like
    memcpy() and memset() can be implemented in different architectures and
    optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.



    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that David
    is defending is that memmove() cannot be implemented "efficiently" in /standard/ C source code, on /any/ HW, because it would require
    comparing /C pointers/ that point to potentially different /C objects/,
    which is not defined behavior in standard C, whether compiled to machine
    code, or executed by an interpreter of C code, or executed by a human programmer performing what was called "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficiently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU instructions, or by dedicated instructions such as Mitch's MM, and David
    is not disputing that. But Mitch seems not to understand or not to see
    the issue about standard C vs memmove().
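    Concretely, the non-standard-C route usually looks something like the
    sketch below: convert both pointers to uintptr_t (implementation-defined,
    not strictly conforming C) and pick the copy direction from the
    comparison. The name memmove_np is invented for illustration; a real
    libc version would also copy in word-sized chunks rather than bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-portable memmove sketch: comparing pointers to possibly different
 * objects via uintptr_t is implementation-defined, but it behaves as
 * expected on the flat-memory targets libc implementors care about. */
void *memmove_np(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination below source: a forward copy is overlap-safe. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination above source: copy backwards so overlapping
         * bytes are read before they are overwritten. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    /* d == s: nothing to do. */
    return dest;
}
```

    The same two loops also show why the comparison is needed at all: with
    an overlap, only one of the two copy directions preserves the source
    bytes long enough.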

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Oct 13 10:56:20 2024
    From Newsgroup: comp.arch

    On Sat, 12 Oct 2024 18:32:48 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.

    The 3rd Operand can, indeed, be a constant.
    That causes no restartability problem when you have a place to
    store the current count==index, so that when control returns
    and you re-execute MM, it sees that x amount has already been
    done, and C-X is left.

    I don't understand this paragraph.
    Does constant as a 3rd operand cause restartability problem?
    Or does it not?
    If it does not, then how?
    Do you have a private field in thread state? Saved on stack by
    interrupt uCode?
    OS people would not like it. They prefer to have full control even when
    they don't use it 99.999% of the time.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Oct 13 12:00:48 2024
    From Newsgroup: comp.arch

    On Fri, 11 Oct 2024 16:54:13 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his
    my66k LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad
    idea for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification,
    i.e. exactly the same mechanism that is done on "non-scalable" architectures, would provide better performance. And memcpy/memmove
    is certainly sufficiently important to justify an additional
    development effort.


    That explanation helps a little, but only a little. I wasn't
    suggesting anything - or if I was, it was several posts ago and the
    context has long since been snipped.

    You suggested that "scalable" vector extensions are preferable for memcpy/memmove implementation over "non-scalable" SIMD.

    Can you be more explicit about
    what you think I was suggesting, and why it might not be a good idea
    for targeting a "my66k" ISA? (That is not a processor I have heard
    of, so you'll have to give a brief summary of any particular features
    that are relevant here.)


    The proper spelling appears to be My 66000.
    For starters, My 66000 has no SIMD. It does not even have a dedicated FP
    register file. Both FP and Int share a common 32x64-bit register space.

    More importantly, it has a dedicated instruction with exactly the same
    semantics as memmove(). Pretty much the same as ARM64. In both cases
    the instruction is defined, but not yet implemented in production
    silicon. The difference is that in the case of ARM64 we can be
    reasonably sure that eventually it will be implemented in production
    silicon. Which means that in at least several out of a multitude of
    implementations it will suck.




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Oct 13 12:26:22 2024
    From Newsgroup: comp.arch

    On Sun, 13 Oct 2024 10:31:49 +0300
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I
    know you are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply
    to? Are you interested in replying, and engaging in the
    discussion? Or are you just looking for a chance to promote your
    own architecture, no matter how tenuous the connection might be to
    other posts?

    Again, let me say that I agree with what you are saying - I agree
    that an ISA should have instructions that are efficient for what
    people actually want to do. I agree that it is a good thing to
    have instructions that let performance scale with advances in
    hardware ideally without needing changes in compiled binaries, and
    at least without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I
    would enjoy hearing about comparisons of different ways things
    functions like memcpy() and memset() can be implemented in
    different architectures and optimised for different sizes, or how
    scalable vector instructions can work in comparison to fixed-size
    SIMD instructions.

    But at the moment, this potential is lost because you are posting
    total shite about implementing memmove() in standard C. It is
    disappointing that someone with your extensive knowledge and
    experience cannot see this. I am finding it all very frustrating.




    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that
    David is defending is that memmove() cannot be implemented
    "efficiently" in /standard/ C source code, on /any/ HW, because it
    would require comparing /C pointers/ that point to potentially
    different /C objects/, which is not defined behavior in standard C,
    whether compiled to machine code, or executed by an interpreter of C
    code, or executed by a human programmer performing what was called
    "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU instructions, or by dedicated instructions such as Mitch's MM, and
    David is not disputing that. But Mitch seems not to understand or not
    to see the issue about standard C vs memmove().

    Sufficiently advanced compiler can recognize patterns and replace them
    with built-in sequences.
    In case of memmove() the most easily recognizable pattern in 100%
    standard C99 appears to be:
    void *memmove( void *dest, const void *src, size_t count)
    {
        if (count > 0) {
            char tmp[count];
            memcpy(tmp, src, count);
            memcpy(dest, tmp, count);
        }
        return dest;
    }
    I don't suggest that the real implementation in Brian's compiler is like
    that. Much more likely his implementation uses non-standard C and looks approximately like:
    void *memmove(void *dest, const void *src, size_t count)
    {
        return __builtin_memmove(dest, src, count);
    }
    However, implementing the first variant efficiently is well within
    abilities of good compiler.
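    For what it is worth, the C99 pattern above can be checked against an
    overlapping copy. Renaming it (here memmove_vla, an invented name)
    avoids colliding with the library memmove(). Note that a compiler is
    permitted, but not required, to elide the temporary.

```c
#include <stddef.h>
#include <string.h>

/* The VLA-based pattern from the post above: bounce the whole copy
 * through a temporary, which yields memmove() semantics from two
 * memcpy() calls in strictly conforming C99. */
void *memmove_vla(void *dest, const void *src, size_t count)
{
    if (count > 0) {
        char tmp[count];            /* C99 variable-length array */
        memcpy(tmp, src, count);
        memcpy(dest, tmp, count);
    }
    return dest;
}
```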
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sun Oct 13 13:33:55 2024
    From Newsgroup: comp.arch

    On 2024-10-13 12:26, Michael S wrote:
    On Sun, 13 Oct 2024 10:31:49 +0300
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:

    [ snip ]

    But at the moment, this potential is lost because you are posting
    total shite about implementing memmove() in standard C. It is
    disappointing that someone with your extensive knowledge and
    experience cannot see this. I am finding it all very frustrating.




    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that
    David is defending is that memmove() cannot be implemented
    "efficiently" in /standard/ C source code, on /any/ HW, because it
    would require comparing /C pointers/ that point to potentially
    different /C objects/, which is not defined behavior in standard C,
    whether compiled to machine code, or executed by an interpreter of C
    code, or executed by a human programmer performing what was called
    "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficiently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU
    instructions, or by dedicated instructions such as Mitch's MM, and
    David is not disputing that. But Mitch seems not to understand or not
    to see the issue about standard C vs memmove().


    Sufficiently advanced compiler can recognize patterns and replace them
    with built-in sequences.


    Sure.


    In case of memmove() the most easily recognizable pattern in 100%
    standard C99 appears to be:

    void *memmove( void *dest, const void *src, size_t count)
    {
        if (count > 0) {
            char tmp[count];
            memcpy(tmp, src, count);
            memcpy(dest, tmp, count);
        }
        return dest;
    }


    Yes.


    I don't suggest that real implementation in Brian's compiler is like
    that. Much more likely his implementation uses non-standard C and looks approximately like:
    void *memmove(void *dest, const void *src, size_t count)
    {
        return __builtin_memmove(dest, src, count);
    }

    However, implementing the first variant efficiently is well within
    abilities of good compiler.


    Yes, but it is not required by the C standard, so the fact remains that
    there is no standard way of implementing memmove() in a way that is "efficient" in the sense that it ensures that a copy to and from a
    temporary will /not/ happen.

    In practice, of course, memmove() is implemented in a non-portable way
    or by in-line code, as everybody understands.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Oct 13 12:57:06 2024
    From Newsgroup: comp.arch

    On 12/10/2024 19:26, Bernd Linsel wrote:
    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?


    Absolutely, yes.

    But in a thread branch discussing C, details of C are relevant.

    I don't expect any random regular here to know "language lawyer" details
    of the C standards. I don't expect people here to care about them.
    People in comp.lang.c care about them - for people here, the main
    interest in C is for programs to run on the computer architectures that
    are the real interest.


    But if someone engages in a conversation about C, I /do/ expect them to understand some basics, and I /do/ expect them to read and think about
    what other posters write. The point under discussion was that you
    cannot implement an efficient "memmove()" function in fully portable
    standard C. That's a fact - it is a well-established fact. Another
    clear and inarguable fact is that particular ISAs or implementations are completely irrelevant to fully portable standard C - that is both the advantage and the disadvantage of writing code in portable standard C.

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable standard C is weaknesses in the x86 ISA), so that we can clear up his misunderstandings and move on to the more interesting computer
    architecture discussions.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Oct 13 13:58:14 2024
    From Newsgroup: comp.arch

    On 12/10/2024 20:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of different ways things functions like
    memcpy() and memset() can be implemented in different architectures and
    optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.

    There are only two decisions to make in memcpy, are the copies less than
    copy sized aligned, and do the pointers overlap in copy size.


    Are you confused about memcpy() and memmove()? If so, let's clear that
    one up from the start. For memcpy(), there are no overlap issues - the
    person using it promises that the source and destination areas do not
    overlap, and no one cares what might happen if they do. For memmove(),
    the areas /may/ overlap, and the copy is done as though the source were
    copied first to a temporary area, and then from the temporary area to
    the destination.

    For memcpy(), there can be several issues to consider for efficient implementations that can be skipped for a simple loop copying byte for
    byte. An efficient implementation will probably want to copy with
    larger sizes, such as using 32-bit, 64-bit, or bigger registers. For
    some targets, that is only possible for aligned data (and for some,
    unaligned accesses may be allowed but emulated by traps, making them
    massively slower than byte-by-byte accesses). The best choice of size
    will be implementation and target dependent, as will methods of
    determining alignment (if that is relevant). I'm guessing that by your somewhat muddled phrase "are the copies less than copy sized aligned",
    you meant something on those lines.

    For memmove(), you generally also need to decide if your copy loop
    should run upwards or downwards, and that must be done in an implementation-dependent manner. It is conceivable that for a target
    with more complex memory setups - perhaps allowing the same memory to be accessible in different ways via different segments - that this is not
    enough.

    For hardware this simplifies down to perhaps two types of copies, easy and hard.

    For most targets, yes.


    If you make hard fast, and you will, then two versions is all you need, not the dozens of choices with 1k of code you need in C.


    That makes little sense. What "1k of code" do you need in C?
    Implementations of memcpy() and memmove() are implementation and target-specific, not general portable standard C. There is no single C implementation of these functions.

    It is an obvious truism that if you have hardware instructions that can implement an efficient memcpy() and/or memmove() on a target, then the implementation-specific implementations of these functions on that
    target will be small, simple and efficient.

    Often you know which of the two you want at compile time from the pointer type.

    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    What complaints? I haven't made any complaints about implementing these functions.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Oct 13 14:10:20 2024
    From Newsgroup: comp.arch

    On 13/10/2024 11:00, Michael S wrote:
    On Fri, 11 Oct 2024 16:54:13 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his
    my66k LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad
    idea for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification,
    i.e. exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove
    is certainly sufficiently important to justify an additional
    development effort.


    That explanation helps a little, but only a little. I wasn't
    suggesting anything - or if I was, it was several posts ago and the
    context has long since been snipped.

    You suggested that "scalable" vector extensions are preferable for memcpy/memmove implementation over "non-scalable" SIMD.

    I certainly suggested that they have some advantages, yes. I don't know nearly enough details about implementations and practical usage to know
    if scalable vector instructions are /always/ better than non-scalable
    SIMD with fixed-size registers, either from the viewpoint of their
    efficiency at runtime or their implementation in hardware.

    It seems to me that if the compiler knows the size of a memcpy/memmove,
    then the best results would probably be achieved by the compiler
    inlining the copy using fixed size registers of a suitable size. If it
    does not know the size, then I would expect (but I don't know for sure)
    that a hardware scalable vector instruction should be more efficient
    than using fixed-size registers. If that were not the case, then I
    wonder why scalable vector hardware has become popular recently in ISAs.

    If you - or someone else - knows enough to say more about this, then I'd
    be glad to learn about it.


    Can you be more explicit about
    what you think I was suggesting, and why it might not be a good idea
    for targeting a "my66k" ISA? (That is not a processor I have heard
    of, so you'll have to give a brief summary of any particular features
    that are relevant here.)


    The proper spelling appears to be My 66000.
    For starters, My 66000 has no SIMD. It does not even have a dedicated FP register file. Both FP and Int share a common 32x64-bit register space.


    OK.

    More importantly, it has a dedicated instruction with exactly the same
    semantics as memmove(). Pretty much the same as ARM64. In both cases
    the instruction is defined, but not yet implemented in production
    silicon. The difference is that in the case of ARM64 we can be
    reasonably sure that eventually it will be implemented in production
    silicon. Which means that in at least several out of a multitude of
    implementations it will suck.


    So if I understand you correctly, your argument is that scalable vector instructions - at least for copying memory - are slow in hardware implementations, and thus it would be better to simply copy memory in a
    loop using larger fixed-size registers? I would find that surprising,
    but as I said, I don't know the details of implementations.

    (I do know that in the 68k family, the hardware division instruction was dropped for later devices after it was realised that a software division routine was faster than the hardware instruction. So such strange
    things have happened.)




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Oct 13 15:45:37 2024
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.

    When you implements something like, say

    vsum(double *a, double *b, double *c, size_t n);

    where a, b, and c may point to arrays in different objects, or may
    point to overlapping parts of the same object, and the result vector c
    in the overlap case should be the same as in the no-overlap case
    (similar to memmove()), being able to compare pointers to possibly
    different objects also comes in handy.

    Another example is when the programmer uses the address as a key in,
    e.g., a binary search tree. And, as you write, casting to intptr_t is
    not guaranteed to work by the C standard, either.

    An example that probably compares pointers to the same object as far
    as the C standard is concerned, but ones that feel like pointers to
    different objects to the programmer, is logic variables (in, e.g., a
    Prolog implementation).
    When you have two free variables, and you unify them, in the
    implementation one variable points to the other one. Now which should
    point to which? The younger variable should point to the older one,
    because it will die sooner. How do you know which variable is
    younger? You compare the addresses; the variables reside on a stack,
    so the younger one is closer to the top.
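    A minimal sketch of that binding rule, assuming the variable cells live
    in a single C array (so the pointer comparison stays within one object
    and is well defined) and that the stack grows toward higher addresses.
    The names cell and bind_younger_to_older are invented for illustration.

```c
#include <stddef.h>

/* A free logic variable is a cell bound to itself; unification makes
 * one cell point at another. */
typedef struct cell { struct cell *ref; } cell;

/* Unify two free variables: the one at the higher address is younger
 * (closer to the top of an upward-growing stack) and will die sooner,
 * so it is made to point at the older one. */
void bind_younger_to_older(cell *a, cell *b)
{
    if (a > b)      /* well defined: both point into the same array */
        a->ref = b;
    else
        b->ref = a;
}
```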

    If that stack is one object as far as the C standard is concerned,
    there is no problem with that solution. If the stack is implemented
    as several objects (to make it easier growable; I don't know if there
    is a Prolog implementation that does that), you first have to check in
    which piece it is (maybe with a binary search), and then possibly
    compare within the stack piece at hand.
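In C terms, that age comparison is just an address comparison within the (single-object) stack. A minimal sketch, with made-up names, assuming an upward-growing stack where higher addresses are younger:

```c
#include <stddef.h>

/* Sketch of the binding-direction choice: cells live in one array
   (a single C object), so comparing their addresses is defined. */
typedef struct cell {
    struct cell *ref;   /* NULL while the variable is free */
} cell;

void unify_free(cell *x, cell *y)
{
    if (x > y)
        x->ref = y;     /* x is younger: point it at the older y */
    else
        y->ref = x;
}
```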

    An interesting case is the Forth standard. It specifies "contiguous
    regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.


    Forth does not require a flat memory model in the hardware, as far as I
    am aware, any more than C does. (I appreciate that your knowledge of
    Forth is /vastly/ greater than mine.) A Forth implementation could
    interpret part of the address value as the segment or other memory block
    identifier and part of it as an index into that block, just as a C
    implementation can.

    I.e., what you are saying is that one can simulate a flat-memory model
    on a segmented memory model. Certainly. In the case of the 8086 (and
    even more so on the 286) the costs of that are so high that no
    widely-used Forth system went there.

    One can also simulate segmented memory (a natural fit for many
    programming languages) on flat memory. In this case the cost is much
    smaller, plus it gives the maximum flexibility about segment/object
    sizes and numbers. That is why flat memory has won.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sun Oct 13 13:32:32 2024
    From Newsgroup: comp.arch

    On 10/13/24 3:56 AM, Michael S wrote:
    On Sat, 12 Oct 2024 18:32:48 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:
    [snip memory copy instruction]
    The 3rd Operand can, indeed, be a constant.
    That causes no restartability problem when you have a place to
    store the current count==index, so that when control returns
    and you re-execute MM, it sees that x amount has already been
    done, and C-X is left.

    I don't understand this paragraph.
    Does constant as a 3rd operand cause restartability problem?
    Or does it not?
    If it does not, then how?
    Do you have a private field in thread state? Saved on stack by
    interrupt uCode?

    The extra state is saved in the context save area (like
    for My 66000's extra state for the PREDicate instruction
    modifier).

    (Of course, restartability could also be provided by using
    an ordinary register for the in-progress count even for
    immediate counts. The instruction would effectively become a
    load immediate and memory copy. Implicit/extra state has
    some benefits.)
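The restart behaviour being described can be mimicked in plain C: treat the saved count/index as explicit state, so that "re-executing" the operation resumes where it left off. The names and the quantum parameter below are invented for illustration:

```c
#include <stddef.h>

typedef struct { size_t done; } mm_state;  /* stand-in for the saved index */

/* Each call copies at most `quantum` bytes (as if interrupted after
   that much progress), then returns; calling again with the same
   state resumes at state->done rather than starting over.
   Returns nonzero when the whole copy is complete. */
int mm_step(unsigned char *dst, const unsigned char *src,
            size_t count, mm_state *state, size_t quantum)
{
    size_t i = state->done;
    size_t end = (count - i < quantum) ? count : i + quantum;
    while (i < end) {
        dst[i] = src[i];
        i++;
    }
    state->done = i;
    return i == count;
}
```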

    OS people would not like it. They prefer to have full control even when
    they don't use it 99.999% of the time.

    On the other hand, isolating some state and functionality might
    facilitate less trust requirements? Some OS people might not like
    having the OS be less than fully trusted.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Oct 13 21:21:11 2024
    From Newsgroup: comp.arch

    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.  That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers.  Often that can be
    eliminated when the compiler optimises the functions inline - when the
    compiler knows the size of the move/copy, it can optimise directly.
    What you are missing here David is the fact that Mitch's MM is a single instruction which does the entire memmove() operation, and has the
    inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.
    I.e. totally removing the need for compiler tricks or wide register operations.
    Also apropos the compiler library issue:
    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today), and then the memmove() calls will usually be inlined.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brett@ggtgp@yahoo.com to comp.arch on Sun Oct 13 19:36:04 2024
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 19:26, Bernd Linsel wrote:
    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?


    Absolutely, yes.

    But in a thread branch discussing C, details of C are relevant.

    I don't expect any random regular here to know "language lawyer" details
    of the C standards. I don't expect people here to care about them.
    People in comp.lang.c care about them - for people here, the main
    interest in C is for programs to run on the computer architectures that
    are the real interest.


    But if someone engages in a conversation about C, I /do/ expect them to understand some basics, and I /do/ expect them to read and think about
    what other posters write. The point under discussion was that you
    cannot implement an efficient "memmove()" function in fully portable
    standard C. That is a well-established fact. Another clear and
    inarguable fact is that particular ISAs or implementations are
    completely irrelevant to fully portable standard C - that is both the
    advantage and the disadvantage of writing code in portable standard C.

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable standard C is weaknesses in the x86 ISA), so that we can clear up his misunderstandings and move on to the more interesting computer
    architecture discussions.

    MemMove in C is fundamentally two void pointers and a count of bytes to
    move.

    C does not care what the alignment of those two void pointers is.

    ALUs are so cheap as to be free; a dedicated MM unit can have a shifter
    and mask with a buffer, and happily copy aligned chunks from the source
    and write aligned chunks to the dest, even though both are odd aligned in
    different ways, and overlapping the same buffer.

    Note that writes have byte enables, you can write 5 bytes in one go to
    cache, to finish off the end of a series of aligned writes.
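A software analogue of that datapath, as a sketch: copy in word-sized chunks regardless of the two misalignments. The per-word memcpy idiom below typically compiles to single unaligned load/store instructions; a real MM unit would instead use its shifter plus byte enables for the ragged edges.

```c
#include <stdint.h>
#include <string.h>

/* Word-at-a-time copy that tolerates any source/destination
   alignment.  The tail loop plays the role of the byte-enabled
   partial write described above. */
void copy_words(unsigned char *dst, const unsigned char *src, size_t n)
{
    uint64_t w;
    while (n >= sizeof w) {
        memcpy(&w, src, sizeof w);   /* unaligned load  */
        memcpy(dst, &w, sizeof w);   /* unaligned store */
        src += sizeof w;
        dst += sizeof w;
        n -= sizeof w;
    }
    while (n--)
        *dst++ = *src++;             /* ragged tail, "byte enables" */
}
```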

    My 66000 only has one MM instruction because when you throw enough hardware
    at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy version.

    I detailed the hardware to do this several years ago on Real World Tech.
    And such hardware has been available for many decades in DMA units.

    The GameBoy (whose Sharp CPU was 8080-like, not 6502-based) had a MemMove
    DMA unit, as it was many times faster at copying bytes than the CPU was,
    and it doubled the overall performance of the GameBoy.

    One ring to rule them all.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Oct 13 19:43:34 2024
    From Newsgroup: comp.arch

    Brett <ggtgp@yahoo.com> writes:
    David Brown <david.brown@hesbynett.no> wrote:

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable
    standard C is weaknesses in the x86 ISA), so that we can clear up his
    misunderstandings and move on to the more interesting computer
    architecture discussions.
    <snip>
    My 66000 only has one MM instruction because when you throw enough
    hardware at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy version.

    I detailed the hardware to do this several years ago on Real World Tech.

    Such hardware (memcpy/memmove/memfill) was available in 1965 on the Burroughs medium systems mainframes. In the 80s, support was added for hashing
    strings as well.

    It's not a new concept. In fact, there were some tricks that could
    be used with overlapping source and destination buffers that would
    replicate chunks of data.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Oct 13 23:01:53 2024
    From Newsgroup: comp.arch

    On Sun, 13 Oct 2024 19:43:34 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Brett <ggtgp@yahoo.com> writes:
    David Brown <david.brown@hesbynett.no> wrote:

    All I am asking Mitch to do is to understand this, and to stop
    saying silly things (such as implementing memmove() by calling
    memmove(), or that the /reason/ you can't implement memmove()
    efficiently in portable standard C is weaknesses in the x86 ISA),
    so that we can clear up his misunderstandings and move on to the
    more interesting computer architecture discussions.
    <snip>
    My 66000 only has one MM instruction because when you throw enough
    hardware at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy
    version.

    I detailed the hardware to do this several years ago on Real World
    Tech.

    Such hardware (memcpy/memmove/memfill) was available in 1965 on the
    Burroughs medium systems mainframes. In the 80s, support was added
    for hashing strings as well.

    It's not a new concept. In fact, there were some tricks that could
    be used with overlapping source and destination buffers that would
    replicate chunks of data.

    The difference is that today, for strings of certain size, say from 200
    bytes to half of your L1D cache, if your precious HW copies fewer than 50
    bytes per clock then people will complain that it is slower than a snail.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Sun Oct 13 15:32:04 2024
    From Newsgroup: comp.arch

    On 10/13/24 4:26 AM, Michael S wrote:
    On Sun, 13 Oct 2024 10:31:49 +0300
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I
    know you are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed, these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers'
    efforts over decades, so they don't have to re-write libc every
    time a new set of instructions comes out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply
    to? Are you interested in replying, and engaging in the
    discussion? Or are you just looking for a chance to promote your
    own architecture, no matter how tenuous the connection might be to
    other posts?

    Again, let me say that I agree with what you are saying - I agree
    that an ISA should have instructions that are efficient for what
    people actually want to do. I agree that it is a good thing to
    have instructions that let performance scale with advances in
    hardware ideally without needing changes in compiled binaries, and
    at least without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I
    would enjoy hearing about comparisons of different ways things
    functions like memcpy() and memset() can be implemented in
    different architectures and optimised for different sizes, or how
    scalable vector instructions can work in comparison to fixed-size
    SIMD instructions.

    But at the moment, this potential is lost because you are posting
    total shite about implementing memmove() in standard C. It is
    disappointing that someone with your extensive knowledge and
    experience cannot see this. I am finding it all very frustrating.




    [ snip discussion of HW ]


    In short, your complaints are wrong-headed in not understanding what
    hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that
    David is defending is that memmove() cannot be implemented
    "efficiently" in /standard/ C source code, on /any/ HW, because it
    would require comparing /C pointers/ that point to potentially
    different /C objects/, which is not defined behavior in standard C,
    whether compiled to machine code, or executed by an interpreter of C
    code, or executed by a human programmer performing what was called
    "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU
    instructions, or by dedicated instructions such as Mitch's MM, and
    David is not disputing that. But Mitch seems not to understand or not
    to see the issue about standard C vs memmove().


    A sufficiently advanced compiler can recognize patterns and replace them
    with built-in sequences.

    In case of memmove() the most easily recognizable pattern in 100%
    standard C99 appears to be:

    void *memmove(void *dest, const void *src, size_t count)
    {
        if (count > 0) {
            char tmp[count];
            memcpy(tmp, src, count);
            memcpy(dest, tmp, count);
        }
        return dest;
    }

    I don't suggest that the real implementation in Brian's compiler is like
    that. Much more likely his implementation uses non-standard C and looks
    approximately like:

    void *memmove(void *dest, const void *src, size_t count)
    {
        return __builtin_memmove(dest, src, count);
    }

    Well, something like that. Clang will generate LLVM IR which acts like
    a builtin_memmove that the backend can match and emit the MM instruction.

    However, implementing the first variant efficiently is well within
    the abilities of a good compiler.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 14 15:19:32 2024
    From Newsgroup: comp.arch

    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different
    objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a single instruction which does the entire memmove() operation, and has the
    inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.

    I am not missing that at all. And I agree that an advanced hardware MM
    instruction could be a very efficient way to implement both memcpy and
    memmove. (For my own kind of work, I'd worry about such looping
    instructions causing an unbounded increase in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time. You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to recognize common patterns (just as most compilers already do today), and then the memmove() calls will usually be inlined.


    The original compile library issue was that it is impossible to write an efficient memmove() implementation using pure portable standard C. That
    is independent of any ISA, any specialist instructions for memory moves,
    and any compiler optimisations. And it is independent of the fact that
    some good compilers can inline at least some calls to memcpy() and
    memmove() today, using whatever instructions are most efficient for the target.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Oct 14 16:40:26 2024
    From Newsgroup: comp.arch

    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.
    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and has
    the inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced hardware MM
    instruction could be a very efficient way to implement both memcpy and
    memmove.  (For my own kind of work, I'd worry about such looping
    instructions causing an unbounded increase in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register
    operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today),
    and then the memmove() calls will usually be inlined.


    The original compile library issue was that it is impossible to write an efficient memmove() implementation using pure portable standard C.  That
    is independent of any ISA, any specialist instructions for memory moves,
    and any compiler optimisations.  And it is independent of the fact that some good compilers can inline at least some calls to memcpy() and
    memmove() today, using whatever instructions are most efficient for the target.
    David, you and Mitch are among my most cherished writers here on c.arch;
    I really don't think any of us really disagree, it is just that we have
    been discussing two (mostly) orthogonal issues.
    a) memmove/memcpy are so important that people have been spending a lot
    of time & effort trying to make them faster, with the complication that
    in general they cannot be implemented in pure C (which disallows direct
    comparison of arbitrary pointers).
    b) Mitch has, like Andy ("Crazy") Glew many years before, realized that
    if a cpu architecture actually has an instruction designed to do this
    particular job, it behooves cpu architects to make sure that it is in
    fact so fast that it obviates any need for tricky coding to replace it. Ideally, it should be able to copy a single object, up to a cache line
    in size, in the same or less time needed to do so manually with a SIMD
    512-bit load followed by a 512-bit store (both ops masked to not touch
    anything it shouldn't).
    REP MOVSB on x86 does the canonical memcpy() operation, originally by
    moving single bytes, and this was so slow that we also had REP MOVSW
    (moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ on
    64-bit cpus.
    With a suitable chunk of logic, the basic MOVSB operation could in fact
    handle any kinds of alignments and sizes, while doing the actual
    transfer at maximum bus speeds, i.e. at least one cache line/cycle for
    things already in $L1.
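The REP MOVSB idiom Terje describes is a couple of lines of GNU-style inline assembly. A sketch (x86 only, with a portable fallback; the function name is invented):

```c
#include <stddef.h>
#include <string.h>

/* memcpy via REP MOVSB.  Cores with fast-string/ERMS support run
   this at bulk rates despite the "byte" in the mnemonic; the
   original 8086 really did move one byte per iteration. */
void copy_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);    /* fallback on other targets */
#endif
}
```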
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 14 17:04:28 2024
    From Newsgroup: comp.arch

    On 13/10/2024 17:45, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.

    When you implement something like, say

    void vsum(double *a, double *b, double *c, size_t n);

    where a, b, and c may point to arrays in different objects, or may
    point to overlapping parts of the same object, and the result vector c
    in the overlap case should be the same as in the no-overlap case
    (similar to memmove()), being able to compare pointers to possibly
    different objects also comes in handy.


    OK, I can agree with that - /if/ you need such a function. I'd suggest
    that when you are writing code that might call such a function, you've a
    very good idea whether you want to do "vec_c = vec_a + vec_b;", or
    "vec_c += vec_a;" (that is, "b" and "c" are the same). In other words,
    the programmer calling vsum already knows if there are overlaps, and
    you'd get the best results if you had different functions for the
    separate cases.

    It is conceivable that you don't know if there is an overlap, especially
    if you are only dealing with parts of arrays rather than full arrays,
    but I think such cases will be rare.

    I do think it would be convenient if there were a fully standard way to compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it. Since a fully
    defined portable method might not be possible (or at least, not
    efficiently possible) for some weird targets, and it's a good thing that
    C supports weird targets, I think perhaps the ideal would be to have
    some feature that exists if and only if you can do sensible comparisons.
    This could be an additional <stdint.h> pointer type, or some pointer
    compare macros, or a pre-defined macro to say if you can simply use
    uintptr_t for the purpose (as you can on most modern C implementations).

    Another example is when the programmer uses the address as a key in,
    e.g., a binary search tree. And, as you write, casting to intptr_t is
    not guaranteed to work by the C standard, either.

    Casting to uintptr_t (why would one want a /signed/ address?) is all you
    need for most systems - and for any target where casting to uintptr_t
    will not be sufficient here, the type uintptr_t will not exist and you
    get a nice, safe hard compile-time error rather than silently UB code.
    For uses like this, you don't need to compare pointers - comparing the integers converted from the pointers is fine. (Imagine a system where converted addresses consist of a 16-bit segment number and a 16-bit
    offset, where the absolute address is the segment number times a scale
    factor, plus the offset. You can't easily compare two pointers for real address ordering by converting them to an integer type, but the result
    of casting to uintptr_t is still fine for your binary tree.)
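As a sketch, the address-as-key comparison described here needs nothing beyond uintptr_t (the function name is invented):

```c
#include <stdint.h>

/* Compare two pointers as tree keys by converting to uintptr_t.
   On a segmented target the resulting order need not match "real"
   address order, but it is a consistent total order, which is all
   a binary search tree requires. */
int ptr_key_cmp(const void *p, const void *q)
{
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    return (a > b) - (a < b);
}
```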


    An example that probably compares pointers to the same object as far
    as the C standard is concerned, but feels like different objects to
    the programmer, is logic variables (in, e.g., a Prolog implementation).
    When you have two free variables, and you unify them, in the
    implementation one variable points to the other one. Now which should
    point to which? The younger variable should point to the older one,
    because it will die sooner. How do you know which variable is
    younger? You compare the addresses; the variables reside on a stack,
    so the younger one is closer to the top.

    If that stack is one object as far as the C standard is concerned,
    there is no problem with that solution. If the stack is implemented
    as several objects (to make it more easily growable; I don't know if
    there is a Prolog implementation that does that), you first have to
    check in which piece it is (maybe with a binary search), and then
    possibly compare within the stack piece at hand.
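    The single-stack case above can be sketched in C: with all the
    variables in one array, the relational comparison is fully defined by
    the standard. The types and names here are made up for illustration,
    not taken from any real Prolog system.

```c
#include <stddef.h>

/* Illustrative logic-variable cell: ref == NULL means unbound. */
typedef struct var { struct var *ref; } var_t;

/* Unify two free variables living in the same stack (one C array).
 * Comparing their addresses is well-defined here, and the variable
 * nearer the top (higher address) is the younger one, so it is made
 * to point at the older one. */
static void unify_free(var_t *a, var_t *b)
{
    if (a == b)
        return;
    if (a < b)
        b->ref = a;     /* b is younger: bind b to a */
    else
        a->ref = b;     /* a is younger: bind a to b */
}
```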


    My only experience of Prolog was working through a short tutorial
    article when I was a teenager - I have no idea about implementations!

    But again I come back to the same conclusion - there are situations
    where being able to compare addresses can be useful, but it is very rare
    for most programmers to ever actually need to do so. And I think it is
    good that there is a widely portable way to achieve this, by casting to
    uintptr_t and comparing those integers. There are things that people
    want to do with C programming that can be done with
    implementation-specific code, but which cannot be done with fully
    portable standard code. While it is always nice if you /can/ use fully
    portable solutions (while still being clear and efficient), it's okay
    to have non-portable code when you need it.

    An interesting case is the Forth standard. It specifies "contiguous
    regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.


    Forth does not require a flat memory model in the hardware, as far as I
    am aware, any more than C does. (I appreciate that your knowledge of
    Forth is /vastly/ greater than mine.) A Forth implementation could
    interpret part of the address value as the segment or other memory block
    identifier and part of it as an index into that block, just as a C
    implementation can.

    I.e., what you are saying is that one can simulate a flat-memory model
    on a segmented memory model.

    Yes.

    Certainly. In the case of the 8086 (and
    even more so on the 286) the costs of that are so high that no
    widely-used Forth system went there.


    OK.

    That's much the same as C on segmented targets.

    One can also simulate segmented memory (a natural fit for many
    programming languages) on flat memory. In this case the cost is much
    smaller, plus it gives the maximum flexibility about segment/object
    sizes and numbers. That is why flat memory has won.


    Sure, flat memory is nicer in many ways.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 14 17:19:40 2024
    From Newsgroup: comp.arch

    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different
    objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they
    can implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard
    library memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.
    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said
    there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and has
    the inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very
    close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced hardware
    MM instruction could be a very efficient way to implement both memcpy
    and memmove.  (For my own kind of work, I'd worry about such looping
    instructions causing an unbounded increase in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will /sometimes/
    get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register
    operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today),
    and then the memmove() calls will usually be inlined.


    The original compiler library issue was that it is impossible to write
    an efficient memmove() implementation using pure portable standard C.
    That is independent of any ISA, any specialist instructions for memory
    moves, and any compiler optimisations.  And it is independent of the
    fact that some good compilers can inline at least some calls to
    memcpy() and memmove() today, using whatever instructions are most
    efficient for the target.

    David, you and Mitch are among my most cherished writers here on c.arch,
    I really don't think any of us really disagree, it is just that we have
    been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" (Norwegian idiom for
    talking past each other) situation.

    I have a huge respect for Mitch, his knowledge and experience, and
    his willingness to share that freely with others. That's why I have
    found this very frustrating.


    a) memmove/memcpy are so important that people have been spending a
    lot of time & effort trying to make it faster, with the complication
    that in general it cannot be implemented in pure C (which disallows
    direct comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a
    simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation-dependent behaviour to determine alignments, or it
    must rely on unaligned accesses being allowed by the implementation.)
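    That byte-copy loop is short enough to show in full. This is a
    minimal standard-C sketch, not any particular library's
    implementation; the name my_memcpy is made up.

```c
#include <stddef.h>

/* Portable memcpy equivalent: a plain byte loop.  No pointer
 * comparison is needed because the caller promises (via restrict,
 * as in the standard prototype) that the regions do not overlap.
 * Real implementations copy wider, aligned chunks, which requires
 * implementation-specific knowledge. */
static void *my_memcpy(void *restrict dst, const void *restrict src,
                       size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```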

    b) Mitch has, like Andy ("Crazy") Glew many years before, realized
    that if a cpu architecture actually has an instruction designed to do
    this particular job, it behooves cpu architects to make sure that it
    is in fact so fast that it obviates any need for tricky coding to
    replace it.


    Yes.

    Ideally, it should be able to copy a single object, up to a cache
    line in size, in the same or less time needed to do so manually with
    a SIMD 512-bit load followed by a 512-bit store (both ops masked to
    not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally by
    moving single bytes, and this was so slow that we also had REP MOVSW
    (moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ
    on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the actual
    transfer at maximum bus speeds, i.e. at least one cache line/cycle
    for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do these
    basic operations faster than a software loop or the x86 "rep"
    instructions. And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C. They would make it easier to write efficient
    implementations of these standard library functions for targets that had
    such instructions - but that would be implementation-specific code. And
    that is one of the reasons that C standard library implementations are
    tied to the specific compiler and target, and the writers of these
    libraries have "superpowers" and are not limited to standard C.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Oct 14 19:08:56 2024
    From Newsgroup: comp.arch

    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:
    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to
    different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in
    pointers, rather than having only a valid pointer or
    NULL.  A compiler, for example, might want to store the
    fact that an error occurred while parsing a subexpression
    as a special pointer constant.

    Compilers often have the unfair advantage, though, that
    they can rely on what application programmers cannot, their
    implementation details.  (Some do not, such as f2c).
    Standard library authors have the same superpowers, so that
    they can implement an efficient memmove() even though a pure
    standard C programmer cannot (other than by simply calling the
    standard library memmove() function!).

    This is more a symptom of bad ISA design/evolution than of
    libc writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let
    you write an efficient memmove() in standard C.  That's why I
    said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in
    assembly or using inline assembly, rather than in non-portable C
    (which is the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things
    up that is proportionally more costly for small transfers.
    Often that can be eliminated when the compiler optimises the
    functions inline - when the compiler knows the size of the
    move/copy, it can optimise directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and
    has the inside knowledge about cache (residency at level x? width
    in bytes)/memory ranges/access rights/etc needed to do so in a
    very close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced
    hardware MM instruction could be a very efficient way to implement
    both memcpy and memmove.  (For my own kind of work, I'd worry
    about such looping instructions causing an unbounded increase in
    interrupt latency, but that too is solvable given enough hardware
    effort.)

    And I agree that once you have an "MM" (or similar) instruction,
    you don't need to re-write the implementation for your memmove()
    and memcpy() library functions for every new generation of
    processors of a given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will
    /sometimes/ get benefits from doing so, but it is not as simple as
    Mitch made out.

    I.e. totally removing the need for compiler tricks or wide
    register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and
    to recognize common patterns (just as most compilers already do
    today), and then the memmove() calls will usually be inlined.


    The original compiler library issue was that it is impossible to
    write an efficient memmove() implementation using pure portable
    standard C. That is independent of any ISA, any specialist
    instructions for memory moves, and any compiler optimisations.
    And it is independent of the fact that some good compilers can
    inline at least some calls to memcpy() and memmove() today, using
    whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on
    c.arch, I really don't think any of us really disagree, it is just
    that we have been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation.

    I have a huge respect for Mitch, his knowledge and experience, and
    his willingness to share that freely with others. That's why I have
    found this very frustrating.


    a) memmove/memcpy are so important that people have been spending a
    lot of time & effort trying to make it faster, with the
    complication that in general it cannot be implemented in pure C
    (which disallows direct comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a
    simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation-dependent behaviour to determine alignments, or it
    must rely on unaligned accesses being allowed by the implementation.)

    b) Mitch has, like Andy ("Crazy") Glew many years before, realized
    that if a cpu architecture actually has an instruction designed to
    do this particular job, it behooves cpu architects to make sure
    that it is in fact so fast that it obviates any need for tricky
    coding to replace it.

    Yes.

    Ideally, it should be able to copy a single object, up to a cache
    line in size, in the same or less time needed to do so manually
    with a SIMD 512-bit load followed by a 512-bit store (both ops
    masked to not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally
    by moving single bytes, and this was so slow that we also had REP
    MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
    REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the
    actual transfer at maximum bus speeds, i.e. at least one cache
    line/cycle for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do
    these basic operations faster than a software loop or the x86 "rep"
    instructions.
    No, that's not true. And according to my understanding, that's not what
    Terje wrote.
    REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
    details - fixed registers for src, dest, len and the Direction flag in
    PSW instead of being part of the opcode).
    REP MOVSW/D/Q were introduced because back then processors were small
    and stupid. When your processor is big and smart you don't need them
    any longer. REP MOVSB is sufficient.
    The new Arm64 instructions that are hopefully coming next year are akin
    to REP MOVSB rather than to MOVSW/D/Q.
    Instructions for memmove, also defined by Arm and by Mitch, are the
    next logical step. IMHO, the main gain here is not a measurable
    improvement in performance, but the saving of code size when inlined.
    Now, is all that a good idea? I am not 100% convinced.
    One can argue that the streaming alignment hardware that is necessary
    for a 1st-class implementation of these instructions is useful not
    only for memory copy.
    So, maybe it makes sense to expose this hardware in more generic ways.
    Maybe via Load Multiple Registers? It was present in Arm's A32/T32,
    but didn't make it into ARM64. Or maybe there are even better ways
    that I was not thinking about.
    And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.
    You are moving the goalposts.
    One does not need a "good implementation" in the sense you have in
    mind. All one needs is an implementation that the pattern-matching
    logic of the compiler unmistakably recognizes as memmove/memcpy. That
    is very easily done in standard C. For memmove, I had shown how to do
    it in one of the posts below. For memcpy it's very obvious, so no
    need to show.
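    For reference, one strictly conforming way to do it (not necessarily
    the code referred to above) relies on the fact that pointer
    /equality/, unlike ordering, is defined across different objects, so
    overlap can be detected by scanning before choosing the copy
    direction. The function name is made up for the sketch.

```c
#include <stddef.h>

/* A strictly conforming memmove sketch: detect whether dst lies
 * inside [src, src+n) using only pointer equality (defined across
 * objects), then copy forward or backward as needed. */
static void *portable_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i, k = n;                /* k = offset of d within src, if any */

    for (i = 0; i < n; i++)
        if (s + i == (const unsigned char *)d) { k = i; break; }

    if (k == 0 || k >= n) {         /* no overlap forcing a backward copy */
        for (i = 0; i < n; i++)
            d[i] = s[i];
    } else {                        /* d overlaps ahead of s: go backward */
        for (i = n; i-- > 0; )
            d[i] = s[i];
    }
    return dst;
}
```

    The scan costs O(n) equality tests, which is why real libraries use
    implementation-specific pointer comparisons instead; but a compiler
    that recognizes the pattern can drop the scan entirely.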
    They would make it easier to write efficient
    implementations of these standard library functions for targets that
    had such instructions - but that would be implementation-specific
    code. And that is one of the reasons that C standard library
    implementations are tied to the specific compiler and target, and the
    writers of these libraries have "superpowers" and are not limited to
    standard C.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Oct 14 19:02:51 2024
    From Newsgroup: comp.arch

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way
    to compare independent pointers (other than just for equality).
    Rarely needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Oct 14 22:20:42 2024
    From Newsgroup: comp.arch

    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for equality).
    Rarely needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer because
    of odd birds.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Oct 14 19:39:41 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32 bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).
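    Packed into nibbles, the 8-digit layout above could be decoded along
    these lines. The nibble packing, struct, and names here are
    assumptions made for the sketch, not taken from Burroughs
    documentation.

```c
#include <stdint.h>

/* Illustrative decode of an 8-BCD-digit pointer held one digit per
 * nibble in a 32-bit word: sign digit 'S', segment digit 's', then a
 * six-digit decimal offset. */
struct bptr { unsigned sign, segment; uint32_t offset; };

static struct bptr decode_bptr(uint32_t digits)
{
    struct bptr p;
    p.sign    = (digits >> 28) & 0xF;   /* 'S': C or D        */
    p.segment = (digits >> 24) & 0xF;   /* 's': 0-7           */
    p.offset  = 0;
    for (int i = 5; i >= 0; i--)        /* six BCD offset digits */
        p.offset = p.offset * 10 + ((digits >> (i * 4)) & 0xF);
    return p;
}
```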

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Mon Oct 14 23:46:10 2024
    From Newsgroup: comp.arch

    On Tue, 8 Oct 2024 20:53:00 +0000, MitchAlsup1 wrote:

    The Algol family of block structure gave the illusion that flat was
    less necessary and it could all be done with lexical addressing and
    block scoping rules.

    Then malloc() and mmap() came along.

    Algol-68 already had heap allocation and flex arrays. (The folks over
    in MULTICS land were working on mmap.)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 15 00:14:25 2024
    From Newsgroup: comp.arch

    On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:

    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for equality).
    Rarely needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer because
    of odd birds.

    So, you are saying that the 286 in its heyday was/is odd ?!?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 15 00:15:49 2024
    From Newsgroup: comp.arch

    On Mon, 14 Oct 2024 19:39:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    Stick to the question asked. Registers were 16 binary bits,
    and segment registers enabled access to a 24-bit address space.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 15 05:20:10 2024
    From Newsgroup: comp.arch

    On Tue, 8 Oct 2024 21:03:40 -0000 (UTC), John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.

    Right, and they appeared not to care or realize it was a performance
    problem.

    They didn’t expect anybody to make serious use of it.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 10:41:41 2024
    From Newsgroup: comp.arch

    On Tue, 15 Oct 2024 00:14:25 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:

    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer
    because of odd birds.

    So, you are saying that 286 in its hey-day was/is odd ?!?

    In its heyday the 80286 was used as a MUCH faster 8088.
    286-as-286 was/is an odd creature. I'd dare to say that it had no
    heyday.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 15 10:53:30 2024
    From Newsgroup: comp.arch

    On 14/10/2024 18:08, Michael S wrote:
    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to
    different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in
    pointers, rather than having only a valid pointer or
    NULL.  A compiler, for example, might want to store the
    fact that an error occurred while parsing a subexpression
    as a special pointer constant.

    Compilers often have the unfair advantage, though, that
    they can rely on what application programmers cannot, their
    implementation details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that
    they can implement an efficient memmove() even though a pure
    standard C programmer cannot (other than by simply calling the
    standard library memmove() function!).

    This is more a symptom of bad ISA design/evolution than of
    libc writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if the ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let
    you write an efficient memmove() in standard C.  That's why I
    said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in
    assembly or using inline assembly, rather than in non-portable C
    (which is the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things
    up that is proportionally more costly for small transfers.
    Often that can be eliminated when the compiler optimises the
    functions inline - when the compiler knows the size of the
    move/copy, it can optimise directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and
    has the inside knowledge about cache (residency at level x? width
    in bytes)/memory ranges/access rights/etc needed to do so in a
    very close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced
    hardware MM instruction could be a very efficient way to implement
    both memcpy and memmove.  (For my own kind of work, I'd worry
    about such looping instructions causing an unbounded increase in
    interrupt latency, but that too is solvable given enough hardware
    effort.)

    And I agree that once you have an "MM" (or similar) instruction,
    you don't need to re-write the implementation for your memmove()
    and memcpy() library functions for every new generation of
    processors of a given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will
    /sometimes/ get benefits from doing so, but it is not as simple as
    Mitch made out.

    I.e. totally removing the need for compiler tricks or wide
    register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and
    to recognize common patterns (just as most compilers already do
    today), and then the memmove() calls will usually be inlined.


    The original compiler library issue was that it is impossible to
    write an efficient memmove() implementation using pure portable
    standard C. That is independent of any ISA, any specialist
    instructions for memory moves, and any compiler optimisations.
    And it is independent of the fact that some good compilers can
    inline at least some calls to memcpy() and memmove() today, using
    whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on
    c.arch, I really don't think any of us really disagree, it is just
    that we have been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation (a Norwegian
    idiom, literally "good day man, axe handle" - an answer that has
    nothing to do with the question).

    I have a huge respect for Mitch, his knowledge and experience, and
    his willingness to share that freely with others. That's why I have
    found this very frustrating.


    a) memmove/memcpy are so important that people have been spending a
    lot of time & effort trying to make it faster, with the
    complication that in general it cannot be implemented in pure C
    (which disallows direct comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a
    simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation dependent behaviour to determine alignments, or it
    must rely on unaligned accesses being allowed by the implementation.)
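    The byte-copy loop mentioned above can be sketched as follows (the
    name my_memcpy is illustrative, not a real library function):

    ```c
    #include <stddef.h>

    /* memcpy() as a plain byte-copy loop in fully portable standard C.
       The regions are guaranteed not to overlap, so no pointer
       comparison is needed - only the per-byte copy, which is slow but
       strictly conforming. */
    void *my_memcpy(void *dest, const void *src, size_t n)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dest;
    }
    ```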

    b) Mitch has, like Andy ("Crazy") Glew many years before, realized
    that if a cpu architecture actually has an instruction designed to
    do this particular job, it behooves cpu architects to make sure
    that it is in fact so fast that it obviates any need for tricky
    coding to replace it.

    Yes.

    Ideally, it should be able to copy a single object, up to a cache
    line in size, in the same or less time needed to do so manually
    with a SIMD 512-bit load followed by a 512-bit store (both ops
    masked to not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally
    by moving single bytes, and this was so slow that we also had REP
    MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
    REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the
    actual transfer at maximum bus speeds, i.e. at least one cache
    line/cycle for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do
    these basic operations faster than a software loop or the x86 "rep"
    instructions.

    No, that's not true. And according to my understanding, that's not what
    Terje wrote.
    REP MOVSB _is_ almost ideal instruction for memcpy (modulo minor
    details - fixed registers for src, dest, len and Direction flag in PSW instead of being part of the opcode).

    My understanding of what Terje wrote is that REP MOVSB /could/ be an
    efficient solution if it were backed by a hardware block to run well
    (i.e., transferring as many bytes per cycle as memory bus bandwidth
    allows). But REP MOVSB is /not/ efficient - and rather than making it
    work faster, Intel introduced variants with wider fixed sizes.

    Could REP MOVSB realistically be improved to be as efficient as the instructions in ARMv9, RISC-V, and Mitch's "MM" instruction? I don't
    know. Intel and AMD have had many decades to do so, so I assume it's
    not an easy improvement.

    REP MOVSW/D/Q were introduced because back then processors were small
    and stupid. When your processor is big and smart you don't need them
    any longer. REP MOVSB is sufficient.
    New Arm64 instructions that are hopefully coming next year are akin to
    REP MOVSB rather than to MOVSW/D/Q.
    Instructions for memmove, also defined by Arm and by Mitch, are the next logical step. IMHO, the main gain here is not a measurable improvement in performance, but the saving in code size when inlined.

    Now, is all that a good idea?

    That's a very important question.

    I am not 100% convinced.
    One can argue that streaming alignment hardware that is necessary for 1st-class implementation of these instructions is useful not only for
    memory copy.
    So, maybe, it makes sense to expose this hardware in more generic ways.

    I believe that is the idea of "scalable vector" instructions as an
    alternative philosophy to wide explicit SIMD registers. My expectation
    is that SVE implementations will be more effort in the hardware than
    SIMD for any specific SIMD-friendly size point (i.e., power-of-two
    widths). That usually corresponds to lower clock rates and/or higher
    latency and more coordination from extra pipeline stages.

    But once you have SVE support in place, then memcpy() and memset() are
    just examples of vector operations that you get almost for free when you
    have hardware for vector MACs and other operations.

    Maybe via Load Multiple Register? It was present in Arm's A32/T32,
    but didn't make it into ARM64. Or, maybe, there are even better ways
    that I was not thinking about.

    And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving a goalpost.

    No, my goalposts have been in the same place all the time. Some others
    have been kicking the ball at a completely different set of goalposts,
    but I have kept the same point all along.

    One does not need "good implementation" in a sense you have in mind.

    Maybe not - but /that/ would be moving the goalposts.

    All one needs is an implementation that the pattern-matching logic
    of the compiler unmistakably recognizes as memmove/memcpy. That is very easily
    done in standard C. For memmove, I had shown how to do it in one of the
    posts below. For memcpy it's very obvious, so no need to show.
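    A hypothetical sketch of the double-copy pattern being discussed
    (the name naive_memmove is illustrative; this is a reconstruction of
    the approach, not the exact code from the earlier post):

    ```c
    #include <stddef.h>

    /* Standard-C memmove() with no pointer comparison at all: buffer
       the whole source in a temporary array, then copy it out.  In the
       abstract machine everything is copied twice, and the VLA (which
       is optional in C11 and later) risks stack overflow for large n -
       but a compiler that recognizes the pattern can lower it to a
       single optimal move. */
    void *naive_memmove(void *dest, const void *src, size_t n)
    {
        unsigned char tmp[n ? n : 1];   /* VLA: stack overflow risk */
        unsigned char *d = dest;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; i++)
            tmp[i] = s[i];
        for (size_t i = 0; i < n; i++)
            d[i] = tmp[i];
        return dest;
    }
    ```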


    But that would /not/ be an efficient implementation of memmove() in
    plain portable standard C.

    What do I mean by an "efficient" implementation in fully portable
    standard C? There are two possible ways to think about that. One is
    that the operations on the abstract machine are efficient. The other is
    that the code is likely to result in efficient code over a wide range of real-world compilers, options, and targets. And I think it goes without saying that the implementation must not rely on any
    implementation-defined behaviour or anything beyond the minimal limits
    given in the C standards, and it must not introduce any new real or
    potential UB.

    Your "memmove()" implementation fails on several counts. It is
    inefficient in the abstract machine - it copies everything twice instead
    of once. It is inefficient in real-world implementations of all sorts
    and countless targets - being efficient for some compilers with some
    options on some targets (most of them hypothetical) does /not/ qualify
    as an efficient implementation. And quite clearly it risks causing
    failures from stack overflow in situations where the user would normally expect memmove() to function safely (on implementations other than those
    few that turn it into efficient object code).

    They would make it easier to write efficient
    implementations of these standard library functions for targets that
    had such instructions - but that would be implementation-specific
    code. And that is one of the reasons that C standard library
    implementations are tied to the specific compiler and target, and the
    writers of these libraries have "superpowers" and are not limited to
    standard C.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 11:59:27 2024
    From Newsgroup: comp.arch

    On Tue, 8 Oct 2024 21:03:40 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If you look at the 8086 manuals, that's clearly what they had in
    mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.

    Right, and they appeared not to care or realize it was a performance
    problem.

    They didn't even do obvious things like see if you're reloading the
    same value into the segment register and skip the rest of the setup.
    Sure, you could put checks in your code and skip the segment load but
    that would make your code a lot bigger and uglier.


    The question is how slowness of 80286 segments compares to
    contemporaries that used segment-based protected memory.
    Wikipedia lists following machines as examples of segmentation:
    - Burroughs B5000 and following Burroughs Large Systems
    - GE 645 -> Honeywell 6080
    - Prime 400 and successors
    - IBM System/38
    They also mention S/370, but to me segmentation in S/370 looks very
    different and probably not intended for fine-grained protection.

    Of those Burroughs B5900 looks to me as the most comparable to 80286.






    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 15 12:38:40 2024
    From Newsgroup: comp.arch

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality).  Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to statically allocated data. Then you would expect the segment to be the
    same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different segments.
    Then the comparison here might not give the same result as a full
    virtual address comparison - but that does not matter. If the pointers
    came from different mallocs, they could not overlap and memmove() can
    run either direction.

    The same applies to other uses, such as indexing in a binary search tree
    or a hash map - the comparison above will be correct when it matters.
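    Putting the comparison above to work, a memmove() that picks its copy
    direction could look like this (cmp_memmove is a hypothetical name
    for illustration):

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Use the uintptr_t comparison to pick the copy direction: if dest
       sits "above" src, copy backwards so the overlapping tail is read
       before it is overwritten; otherwise copy forwards.  When the
       pointers come from unrelated objects the comparison result is
       unspecified - but then the regions cannot overlap and either
       direction is safe, which is exactly the point made above. */
    void *cmp_memmove(void *dest, const void *src, size_t n)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;
        if ((uintptr_t)d > (uintptr_t)s) {
            while (n--)
                d[n] = s[n];            /* backwards */
        } else {
            for (size_t i = 0; i < n; i++)
                d[i] = s[i];            /* forwards */
        }
        return dest;
    }
    ```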



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 14:22:46 2024
    From Newsgroup: comp.arch

    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:
    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.



    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!
    ---
    * https://theretroweb.com/motherboards/s/compaq-deskpro-286e-p-n-001226
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 15 14:09:58 2024
    From Newsgroup: comp.arch

    On 15/10/2024 13:22, Michael S wrote:
    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to
    statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the
    pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.




    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!


    I don't know how such objects would be allocated and addressed in such a system. (I didn't do much DOS/Win16 programming, and on the few
    occasions when I needed structures bigger than 64KB in total, they were structured in multiple levels.)

    But I would expect that in almost any practical system where you can use
    "p++" to step through big arrays, you can also convert the pointer to a uintptr_t and compare as shown above.

    The exceptions would be systems where pointers hold more than just
    addresses, such as access control information or bounds that mean they
    are larger than the largest integer type on the target.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brett@ggtgp@yahoo.com to comp.arch on Tue Oct 15 19:46:23 2024
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> wrote:
    On 15/10/2024 13:22, Michael S wrote:
    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to
    statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the
    pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.




    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!


    I don't know how such objects would be allocated and addressed in such a system. (I didn't do much DOS/Win16 programming, and on the few
    occasions when I needed structures bigger than 64KB in total, they were structured in multiple levels.)

    But I would expect that in almost any practical system where you can use "p++" to step through big arrays, you can also convert the pointer to a uintptr_t and compare as shown above.

    The exceptions would be systems where pointers hold more than just addresses, such as access control information or bounds that mean they
    are larger than the largest integer type on the target.

    EGA graphics had more than 64k, smart software would group one or more scan lines into segments for bit mapping the array. A bit mapper works a scan
    line at a time so segment changes were not that expensive. This was
    profoundly faster than using pixel pokes and the other default methods of changing bits.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Oct 15 17:26:29 2024
    From Newsgroup: comp.arch

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 15 21:55:44 2024
    From Newsgroup: comp.arch

    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Oct 15 22:05:56 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html

    POSIX adds some extensions (marked 'CX').


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 00:24:07 2024
    From Newsgroup: comp.arch

    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html

    POSIX adds some extensions (marked 'CX').
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 16 09:21:59 2024
    From Newsgroup: comp.arch

    On 15/10/2024 23:26, Stefan Monnier wrote:
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.


    I don't see an advantage in being able to implement them in standard C.
    I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require specific
    time constraints on these functions. In such cases, you are not
    interested in writing fully portable software - it will already contain
    many implementation-specific features or use compiler extensions.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 16 09:38:20 2024
    From Newsgroup: comp.arch

    On 15/10/2024 23:55, MitchAlsup1 wrote:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    It's a very good philosophy in programming language design that the core language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need non-standard C - the standard library is part of the implementation.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was standardised, and AFAIK it was in K&R C. But no one (in authority) ever claimed it could be implemented purely in standard C. What do you think
    has changed?


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Oct 16 11:18:19 2024
    From Newsgroup: comp.arch

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.
    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".
    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.
    I don't see an advantage in being able to implement them in standard C.

    It means you can likely also implement a related yet different API
    without having your code "demoted" to non-standard.
    E.g. say if your application wants to use a region/pool/zone-based
    memory management.

    The fact that malloc can't be implemented in standard C is evidence
    that standard C may not be general-purpose enough to accommodate an
    application that wants to use a custom-designed allocator.

    I don't disagree with you, from a practical perspective:

    - in practice, C serves us well for Emacs's GC, even though that can't
    be written in standard C.
    - it's not like there are lots of other languages out there that offer
    you portability together with the ability to define your own `malloc`.

    But it's still a weakness, just a fairly minor one.

    The reason why you might want your own special memmove, or your own special malloc, is that you are doing niche and specialised software.

    Region/pool/zone-based memory management is common enough that I would
    not call it "niche", FWIW, and it's also used in applications that do want portability (GCC and Apache come to mind).
    Can't think of a practical reason to implement my own `memmove`, OTOH.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Oct 16 15:38:47 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 16 19:57:03 2024
    From Newsgroup: comp.arch

    (Please do not snip or omit attributions. There are Usenet standards
    for a reason.)

    On 16/10/2024 17:18, Stefan Monnier wrote:
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.
    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".
    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.
    I don't see an advantage in being able to implement them in standard C.

    It means you can likely also implement a related yet different API
    without having your code "demoted" to non-standard.

    That makes no sense to me. We are talking about implementing standard
    library functions. If you want to implement other functions, go ahead.

    Or do you mean that it is only possible to implement related functions
    (such as memory pools) if you also can implement malloc in fully
    portable standard C? That would make a little more sense if it were
    true, but it is not. First, you can implement such functions in implementation-specific code, so you are not hindered from writing the
    code you want. Secondly, standard C provides functions such as malloc()
    and aligned_alloc() that give you the parts you need - the fact that you
    need something outside of standard C to implement malloc() does not
    imply that you need those same features to implement your additional functions.

    E.g. say if your application wants to use a region/pool/zone-based
    memory management.

    The fact that malloc can't be implemented in standard C is evidence
    that standard C may not be general-purpose enough to accommodate an application that wants to use a custom-designed allocator.


    No, it is not - see above.

    And remember how C was designed and how it was intended to be used. The
    aim was to be able to write portable code that could be reused on many systems, and /also/ implementation, OS and target specific code for
    maximum efficiency, systems programming, and other non-portable work. A typical C program combines these - some parts can be fully portable,
    other parts are partially portable (such as to any POSIX system, or
    targets with 32-bit int and 8-bit char), and some parts may be very compiler-specific or target specific.

    That's not an indication of failure of C for general-purpose
    programming. (But I would certainly not suggest that C is the best
    choice of language for many "general" programming tasks.)

    I don't disagree with you, from a practical perspective:

    - in practice, C serves us well for Emacs's GC, even though that can't
    be written in standard C.
    - it's not like there are lots of other languages out there that offer
    you portability together with the ability to define your own `malloc`.

    But it's still a weakness, just a fairly minor one.

    The reason why you might want your own special memmove, or your own special malloc, is that you are doing niche and specialised software.

    Region/pool/zone-based memory management is common enough that I would
    not call it "niche", FWIW, and it's also used in applications that do want portability (GCC and Apache come to mind).
    Can't think of a practical reason to implement my own `memmove`, OTOH.


    Stefan

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 16 20:00:27 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    But more problematic is the implementation of free() without
    knowing how to compare pointers.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 16 22:18:49 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Agreed, but once you HAVE a way of getting memory (by whatever name)
    you can write malloc in std. C.

    But more problematic is the implementation of free() without
    knowing how to compare pointers.

    Never wrote a program that actually needs free--I have re-written
    programs that used free to avoid using free, though.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From George Neuner@gneuner2@comcast.net to comp.arch on Wed Oct 16 23:06:24 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space. Because that
    space can be treated as a single array of char, and comparing pointers
    to elements of the same array is legal, the only thing I can see that
    prevents writing malloc() in standard C would be the need to somehow
    define the array from the /language's/ POV (not the compiler's) prior
    to using it.

    Which circles back to why something like

    char (*heap)[ULONG_MAX] = ... ;

    would/does not satisfy the language's requirement. All the compilers
    I have ever seen would have been happy with it, but none of them ever
    needed something like it anyway. Conversion to <an integer type> also
    would always work, but also was never needed.

    I am not a language lawyer - I don't even pretend to understand the
    arguments against allowing general pointer comparison.


    Aside: I have worked on architectures (DSPs) having disjoint memory
    spaces, spaces with differing bit widths, and even spaces where [sans
    MMU] the same physical address had multiple logical addresses whose
    use depended on the type of access.

    I have written allocators and even a GC for such architectures. Never
    had a problem convincing C compilers to compare pointers - the only
    issue I ever faced was whether the result actually was meaningful to
    the program.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From George Neuner@gneuner2@comcast.net to comp.arch on Wed Oct 16 23:32:41 2024
    From Newsgroup: comp.arch

    On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
    <david.brown@hesbynett.no> wrote:


    It's a very good philosophy in programming language design that the core
    language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and
    otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need
    non-standard C - the standard library is part of the implementation.

    But it is a problem if the library has to be written using a different compiler. [For this purpose I would consider specifying different
    compiler flags to be using a different compiler.]

    Why? Because once these things are discovered, many programmers will
    see their advantages and lack the discipline to avoid using them for
    more general application work.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was
    standardised, and AFAIK it was in K&R C. But no one (in authority) ever
    claimed it could be implemented purely in standard C. What do you think
    has changed?

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 00:40:34 2024
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Right. And that is why malloc(), or some essential internal component
    of malloc(), has to be platform specific, and thus malloc() must be
    supplied by the implementation (which means both the compiler and the
    standard library).

    But more problematic is the implementation of free() without knowing
    how to compare pointers.

    Once there is a way to get additional memory from whatever underlying environment is there, malloc() and free() can be implemented (and I
    believe most often are implemented) without needing to compare
    pointers. Note: pointers can be tested for equality without having
    to compare them relationally, and testing pointers for equality is
    well-defined between any two pointers (which may need to be converted
    to 'void *' to avoid a type mismatch).
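
    [The point above can be made concrete. A fixed-size-block pool - a
    common building block underneath malloc-style allocators - can
    implement free() with no pointer comparisons at all, not even
    equality between the freed pointer and anything else. The sketch
    below is illustrative only; the names pool_alloc/pool_free and the
    LIFO design are ours, not any standard API.]

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Illustrative sketch: a fixed-size-block pool whose free() needs no
     * pointer comparisons, relational or otherwise. Freed blocks are
     * simply pushed onto a LIFO free list. */

    enum { BLOCK_SIZE = 64, NBLOCKS = 32 };

    typedef union block {
        union block *next;              /* valid only while on the free list */
        unsigned char payload[BLOCK_SIZE];
    } Block;

    static Block pool[NBLOCKS];
    static Block *free_list = NULL;
    static size_t high_water = 0;       /* count of blocks never yet handed out */

    void *pool_alloc(void)
    {
        if (free_list != NULL) {        /* equality test against NULL only */
            Block *b = free_list;
            free_list = b->next;
            return b->payload;
        }
        if (high_water < NBLOCKS)
            return pool[high_water++].payload;
        return NULL;                    /* pool exhausted */
    }

    void pool_free(void *p)
    {
        /* payload is the first member of the union, so this conversion
         * recovers the block; no cross-object pointer arithmetic occurs. */
        Block *b = (Block *)p;
        b->next = free_list;            /* push: no comparison needed at all */
        free_list = b;
    }
    ```

    [Note the LIFO behaviour: the most recently freed block is the next
    one returned, which makes the scheme trivially testable.]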
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 01:18:04 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C. I am not
    asking if it is still in the std libraries, I am asking what
    happened to make it impossible to write malloc() in standard C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Agreed, but once you HAVE a way of getting memory (by whatever name)
    you can write malloc in standard C.

    The point is that getting more memory is inherently platform
    specific, which is why malloc() must be defined by each particular implementation, and so was put in the standard library.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 02:48:49 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any
    existing functionality that cannot be written using the language
    is a sign of a weakness because it shows that despite being
    "general purpose" it fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc`
    and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be standard K&R C--what dropped it from the
    standard??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C.

    No, it didn't. In the original book (my copy is from the third
    printing of the first edition, copyright 1978), on page 175 there
    is a function 'alloc()' that shows how to write a memory allocator.
    The code in alloc() calls 'morecore()', described as follows:

    The function morecore obtains storage from the operating system.
    The details of how this is done of course vary from system to
    system. In UNIX, the system entry sbrk() returns a pointer to n
    more bytes of storage. [...]

    An implementation of morecore() is shown on the next page, and
    it indeed uses sbrk() to get more memory. That makes it UNIX
    specific, not portable standard C. Both alloc() and morecore()
    are part of chapter 8, "The UNIX System Interface".
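
    [For reference, the shape of that alloc()/morecore() pair can be
    sketched in modern C. Everything below is illustrative, not the K&R
    code: the real morecore() calls sbrk(), and this stand-in draws from
    a static arena precisely to show that the OS interface is the only
    non-portable piece.]

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* K&R-style first-fit allocator, simplified. */

    typedef union header {          /* block header; union forces alignment */
        struct {
            union header *next;     /* next block on circular free list */
            size_t nunits;          /* size of this block, in header units */
        } s;
        long double align;          /* worst-case alignment */
    } Header;

    static Header base;             /* degenerate list head */
    static Header *freep = NULL;

    /* Stand-in for K&R morecore(): hands out slabs from a static arena
     * instead of calling sbrk(). This is the platform-specific part. */
    static Header *morecore(size_t nunits)
    {
        static Header arena[4096];
        static size_t used = 0;
        if (nunits < 128) nunits = 128;     /* ask for a minimum slab */
        if (used + nunits > 4096) return NULL;
        Header *up = &arena[used];
        used += nunits;
        up->s.nunits = nunits;
        return up;
    }

    void *kr_alloc(size_t nbytes)
    {
        size_t nunits = (nbytes + sizeof(Header) - 1) / sizeof(Header) + 1;
        Header *prevp = freep;
        if (prevp == NULL) {                /* first call: empty list */
            base.s.next = freep = prevp = &base;
            base.s.nunits = 0;
        }
        for (Header *p = prevp->s.next; ; prevp = p, p = p->s.next) {
            if (p->s.nunits >= nunits) {    /* big enough */
                if (p->s.nunits == nunits) {
                    prevp->s.next = p->s.next;  /* exact fit: unlink */
                } else {
                    p->s.nunits -= nunits;      /* carve from the tail */
                    p += p->s.nunits;
                    p->s.nunits = nunits;
                }
                freep = prevp;
                return (void *)(p + 1);
            }
            if (p == freep) {               /* wrapped: need more memory */
                Header *up = morecore(nunits);
                if (up == NULL) return NULL;
                up->s.next = p->s.next;     /* splice fresh slab into list */
                p->s.next = up;
            }
        }
    }
    ```

    [The only pointer comparisons are equality tests within the list and
    arithmetic within a single slab, consistent with the point made
    elsewhere in this thread.]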

    Note also that chapter 7, titled "Input and Output" and describing
    the standard library, mentions in section 7.9, "Some Miscellaneous
    Functions", the function calloc() as part of the standard library.
    (There is no mention of malloc().) The point of having a standard
    library is that the functions it contains depend on details of the
    underlying OS and thus cannot be written in platform-agnostic code.
    Being platform portable is the defining property of "standard C".

    (Amusing aside: the entire standard library seems to be covered by
    just #include <stdio.h>.)

    I am not
    asking if it is still in the standard libraries, I am asking what
    happened to make it impossible to write malloc() in standard C ?!?

    Functions such as sbrk() are not part of the C language. Whether
    it's called calloc() or malloc(), memory allocation has always
    needed access to some facilities not provided by the C language
    itself. The function malloc() is not any more writable in standard
    K&R C than it is in standard ISO C (except of course malloc() can
    be implemented by using calloc() internally, but that depends on
    calloc() being part of the standard library).
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 03:16:13 2024
    From Newsgroup: comp.arch

    George Neuner <gneuner2@comcast.net> writes:

    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    [...]

    malloc() used to be standard K&R C--what dropped it from the
    standard ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space.

    Not necessarily.

    Because that space can be treated as a single array of char,

    Not always.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Thu Oct 17 03:17:33 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    That is a foolish statement.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 17 16:16:42 2024
    From Newsgroup: comp.arch

    On 17/10/2024 05:06, George Neuner wrote:
    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space. Because that
    space can be treated as a single array of char, and comparing pointers
    to elements of the same array is legal, the only thing I can see that prevents writing malloc() in standard C would be the need to somehow
    define the array from the /language's/ POV (not the compiler's) prior
    to using it.


    It is common for malloc() implementations to ask the OS for large chunks
    of memory, then subdivide it and pass it out to the application. When
    those chunks run out, it will ask the OS for more. You
    could reasonably argue that each chunk it gets may be considered a
    single unsigned char array, but that is certainly not true across
    separate chunks.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 17 16:25:01 2024
    From Newsgroup: comp.arch

    On 17/10/2024 05:32, George Neuner wrote:
    On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
    <david.brown@hesbynett.no> wrote:


    It's a very good philosophy in programming language design that the core
    language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and
    otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need
    non-standard C - the standard library is part of the implementation.

    But it is a problem if the library has to be written using a different compiler. [For this purpose I would consider specifying different
    compiler flags to be using a different compiler.]

    Specifying different flags would technically give you a different /implementation/, but it would not normally be considered a different /compiler/. I see no problem at all if libraries (standard library or otherwise) are compiled with different flags. I can absolutely
    guarantee that the flags I use for compiling my application code are not
    the same as those used for compiling the static libraries that came with
    my toolchains. Using different /compilers/ could be a significant inconvenience, and might mean you lose additional features (such as
    link-time optimisation), but as long as the ABI is consistent then they
    should work fine.


    Why? Because once these things are discovered, many programmers will
    see their advantages and lack the discipline to avoid using them for
    more general application work.


    Really? Have you ever looked at the source code for a library such as
    glibc or newlib? Most developers would look at that and quickly shy
    away from all the macros, additional compiler-specific attributes,
    conditional compilation, and the rest of it. Very, very few would look
    into the details to see if there are any "tricks" or "secret" compiler extensions they can copy. And with very few exceptions, all the compiler-specific features will already be documented and available to programmers enthusiastic enough to RTFM.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was
    standardised, and AFAIK it was in K&R C. But no one (in authority) ever
    claimed it could be implemented purely in standard C. What do you think
    has changed?


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Fri Oct 18 06:00:54 2024
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    [...]

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving a goalpost.

    No, he isn't.

    One does not need a "good implementation" in the sense you have in
    mind. All one needs is an implementation that the compiler's
    pattern-matching logic unmistakably recognizes as memmove/memcpy.
    That is very easily done in standard C. For memmove, I showed how
    to do it in one of the posts below. For memcpy it's very obvious,
    so no need to show it.

    You have misunderstood the meaning of "standard C", which means
    code that does not rely on any implementation-specific behavior.
    "All one needs is an implementation that ..." already invalidates
    the requirement that the code not rely on implementation-specific
    behavior.
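
    [For what it's worth, the kind of fully portable memmove being
    debated here can at least be written down. The sketch below never
    orders the two pointers, staying within standard C by copying
    through a temporary buffer; the cost is that, unlike the real
    memmove, it can fail when the fallback allocation fails - part of
    why production libraries do not do it this way. The function name
    is ours.]

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* memmove written in fully portable standard C: no relational
     * comparison between the (possibly unrelated) pointers ever occurs.
     * Returns NULL if the temporary buffer cannot be allocated, which
     * deviates from the real memmove contract - the price of the
     * portable approach. */
    void *portable_memmove(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        unsigned char small[64];                    /* fast path for short copies */
        unsigned char *tmp = (n <= sizeof small) ? small : malloc(n);
        if (tmp == NULL)
            return NULL;
        for (size_t i = 0; i < n; i++)              /* compilers often recognize   */
            tmp[i] = s[i];                          /* these loops and emit memcpy */
        for (size_t i = 0; i < n; i++)
            d[i] = tmp[i];
        if (tmp != small)
            free(tmp);
        return dst;
    }
    ```

    [Because source and destination are never compared, overlapping
    copies in either direction are handled correctly.]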
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Oct 18 14:06:17 2024
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 used string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.
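
    [The "S s OOOOOO" layout above can be illustrated with a small
    decoder. The field layout follows the description in the post; the
    packed-BCD byte representation chosen here (two digits per byte,
    most significant first) is an assumption made only for the sketch.]

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Decode an 8-BCD-digit Burroughs-style pointer: one sign digit
     * ('S', C or D), one segment digit ('s', 0-7), and six offset
     * digits (OOOOOO, 0-999999). Purely illustrative. */

    struct bpointer {
        int sign;       /* the 'S' digit */
        int segment;    /* 's': segment number 0-7 */
        long offset;    /* OOOOOO: digit offset within the segment */
    };

    static struct bpointer decode_bcd_pointer(const uint8_t bytes[4])
    {
        int digits[8];
        for (int i = 0; i < 4; i++) {            /* unpack two digits per byte */
            digits[2 * i]     = bytes[i] >> 4;
            digits[2 * i + 1] = bytes[i] & 0x0F;
        }
        struct bpointer p;
        p.sign = digits[0];
        p.segment = digits[1];
        p.offset = 0;
        for (int i = 2; i < 8; i++)              /* six decimal offset digits */
            p.offset = p.offset * 10 + digits[i];
        return p;
    }
    ```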

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    Unisys discontinued that line of systems in 1992.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 18 17:34:16 2024
    From Newsgroup: comp.arch

    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/
    needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 used string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of the B6500-compatible models
    suffered from the same problem as the 80286 - the maximal-size
    segment didn't cover the entire linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9
    digits in a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that hardware
    implementations (alongside emulation on Xeons) were still sold
    until about 15 years ago.





    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Oct 18 16:19:08 2024
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/
    needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32 bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 used string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space ?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of B6500-compatible models

    No. The systems I described above are from the medium
    systems family (B2000/B3000/B4000). The B5000/B6000/B7000
    (large) family systems were a completely different stack based
    architecture with a 48-bit word size. The Small systems (B1000)
    supported task-specific dynamic microcode loading (different
    microcode for a cobol app vs. a fortran app).

    Medium systems evolved from the Electrodata Datatron and 220 (1954) through
    the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
    was also developed at the old Electrodata plant in Pasadena
    (where I worked in the 80s) - eventually large systems moved
    out - the more capable large systems (B7XXX) were designed in Tredyffrin
    Pa, the less capable large systems (B5XXX) were designed in Mission Viejo, Ca.

    suffered from the same problem as 80286 - the segment of maximal size
    didn't cover all linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9 digits
    in a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that there were still
    hardware implementations (alongside emulation on Xeons) sold up
    until 15 years ago.

    Large systems still exist today in emulation[*], as do the
    former Univac (Sperry 2200) systems. The last medium system
    (V380) was retired by the City of Santa Ana in 2010 (almost two
    decades after Unisys cancelled the product line) and was moved
    to the Living Computer Museum.

    City of Santa Ana replaced the single 1980 vintage V380 with
    29 windows servers.

    After the merger of Burroughs and Sperry in '86 there were six
    different mainframe architectures - by 1990, all but
    two (2200 and large systems) had been terminated.

    [*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@vir.campestris@invalid.invalid to comp.arch on Fri Oct 18 17:38:55 2024
    From Newsgroup: comp.arch

    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard C.
    I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require specific time constraints on these functions.  In such cases, you are not
    interested in writing fully portable software - it will already contain
    many implementation-specific features or use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.

    Andy
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Oct 18 21:45:37 2024
    From Newsgroup: comp.arch

    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in
    non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require
    specific time constraints on these functions.  In such cases, you are
    not interested in writing fully portable software - it will already
    contain many implementation-specific features or use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying on implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations). It is normal to
    write this kind of thing in C, but it is non-portable C. (Or at least,
    not fully portable C.)

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly. The bit you can't write in fully portable standard C is
    the comparison of the pointers so you know which direction to do the
    copying.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Oct 19 19:46:41 2024
    From Newsgroup: comp.arch

    On Fri, 18 Oct 2024 16:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully
    standard way to compare independent pointers (other than
    just for equality). Rarely needing something does not mean
    /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32 bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 used string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space ?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of B6500-compatible models

    No. The systems I described above are from the medium
    systems family (B2000/B3000/B4000).

    I didn't realize that you were not talking about Large Systems.
    I didn't even know that Medium Systems used segmented memory.
    Sorry.

    The B5000/B6000/B7000
    (large) family systems were a completely different stack based
    architecture with a 48-bit word size. The Small systems (B1000)
    supported task-specific dynamic microcode loading (different
    microcode for a cobol app vs. a fortran app).

    Medium systems evolved from the Electrodata Datatron and 220 (1954)
    through the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
    was also developed at the old Electrodata plant in Pasadena
    (where I worked in the 80s) - eventually large systems moved
    out - the more capable large systems (B7XXX) were designed in
    Tredyffrin Pa, the less capable large systems (B5XXX) were designed
    in Mission Viejo, Ca.

    suffered from the same problem as 80286 - the segment of maximal size
    didn't cover all linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9 digits
    in a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that there were still
    hardware implementations (alongside emulation on Xeons) sold up
    until 15 years ago.

    Large systems still exist today in emulation[*], as do the
    former Univac (Sperry 2200) systems. The last medium system
    (V380) was retired by the City of Santa Ana in 2010 (almost two
    decades after Unisys cancelled the product line) and was moved
    to the Living Computer Museum.

    City of Santa Ana replaced the single 1980 vintage V380 with
    29 windows servers.

    After the merger of Burroughs and Sperry in '86 there were six
    different mainframe architectures - by 1990, all but
    two (2200 and large systems) had been terminated.

    [*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@vir.campestris@invalid.invalid to comp.arch on Sun Oct 20 21:51:30 2024
    From Newsgroup: comp.arch

    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in non-
    standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require
    specific time constraints on these functions.  In such cases, you are
    not interested in writing fully portable software - it will already
    contain many implementation-specific features or use compiler
    extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying on
    implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations). It is normal to
    write this kind of thing in C, but it is non-portable C. (Or at least,
    not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly. The bit you can't write in fully portable standard C
    is the comparison of the pointers so you know which direction to do
    the copying.

    It's a long time since I had to mistrust a compiler so much that I was
    pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.

    Andy
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 21 08:58:05 2024
    From Newsgroup: comp.arch

    On 20/10/2024 22:51, Vir Campestris wrote:
    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in non-
    standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised
    software. For example, you might be making real-time software and
    require specific time constraints on these functions.  In such
    cases, you are not interested in writing fully portable software -
    it will already contain many implementation-specific features or use
    compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C.  You are relying
    on implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations).  It is normal
    to write this kind of thing in C, but it is non-portable C.  (Or at
    least, not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    Yes.

    I think /every/ implementation will require communication with the OS,
    if there is an OS - otherwise it will need support from other parts of
    the toolchain (such as symbols created in a linker script to define the
    heap area - that's the typical implementation in small embedded systems).

    The nearest you could get to a portable implementation would be using a
    local unsigned char array as the heap, but I don't believe that would be
    fully correct according to the effective type rules (or the "strict
    aliasing" or type-based aliasing rules, if you prefer those terms). It
    would also not be good enough for the needs of many programs.

    Of course, a fair amount of the code for malloc/free can be written in
    fully portable C - and almost all of it can be written in a somewhat
    vaguely defined "widely portable C" where you can mask pointer bits to
    handle alignment, and other such conveniences.


    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly.  The bit you can't write in fully portable standard C
    is the comparison of the pointers so you know which direction to do
    the copying.

    It's a long time since I had to mistrust a compiler so much that I was
    pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.


    Looking at the generated assembly is usually not a matter of mistrusting
    the compiler. One of the reasons I do so is to check that the compiler
    can generate efficient object code from my source code, in cases where I
    need maximal efficiency. I'd rather not write assembly unless I really
    have to!




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Oct 21 09:21:42 2024
    From Newsgroup: comp.arch

    David Brown wrote:
    On 20/10/2024 22:51, Vir Campestris wrote:
    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in
    standard C. I /do/ see an advantage in being able to do so well in
    non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised
    software. For example, you might be making real-time software and
    require specific time constraints on these functions. In such
    cases, you are not interested in writing fully portable software -
    it will already contain many implementation-specific features or
    use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying
    on implementation details, or writing code that is only suitable for
    a particular implementation (or set of implementations). It is
    normal to write this kind of thing in C, but it is non-portable C.
    (Or at least, not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    Yes.

    I think /every/ implementation will require communication with the OS,
    if there is an OS - otherwise it will need support from other parts of
    the toolchain (such as symbols created in a linker script to define the
    heap area - that's the typical implementation in small embedded systems).

    The nearest you could get to a portable implementation would be using a
    local unsigned char array as the heap, but I don't believe that would
    be fully correct according to the effective type rules (or the "strict
    aliasing" or type-based aliasing rules, if you prefer those terms). It
    would also not be good enough for the needs of many programs.

    Of course, a fair amount of the code for malloc/free can be written in
    fully portable C - and almost all of it can be written in a somewhat
    vaguely defined "widely portable C" where you can mask pointer bits to
    handle alignment, and other such conveniences.


    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.

    It would normally be written in C, and the compiler will generate the
    "rep" assembly.  The bit you can't write in fully portable standard
    C is the comparison of the pointers so you know which direction to do
    the copying.

    It's a long time since I had to mistrust a compiler so much that I was
    pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.


    Looking at the generated assembly is usually not a matter of mistrusting
    the compiler.  One of the reasons I do so is to check that the compiler
    can generate efficient object code from my source code, in cases where
    I need maximal efficiency. I'd rather not write assembly unless I
    really have to!
    For near-light-speed code I used to write it first in C, optimize that,
    then I would translate it into (inline) asm and re-optimize based on
    having the full cpu architecture available, before in the final stage I
    would use the asm experience to tweak the C just enough to let the
    compiler generate machine code quite close (90+%) to my best asm, while
    still being portable to any cpu with more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C version
    was still fast enough that a couple of years later I got a prize in the
    mail: Someone in France had submitted my C code, with my name & address,
    to a similar competition there and it was still faster than anyone
    else. :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Oct 21 14:04:42 2024
    From Newsgroup: comp.arch

    I don't see an advantage in being able to implement them in standard C.

    It means you can likely also implement a related yet different API
    without having your code "demoted" to non-standard.

    That makes no sense to me. We are talking about implementing standard
    library functions. If you want to implement other functions, go ahead.

    No, I'm talking about a very general principle that applies to
    languages, libraries, etc...

    For example, in Emacs I always try [and don't always succeed] to make
    sure that the default behavior for a given functionality can be
    implemented using the official API entry points of the underlying
    library, because it makes it more likely that whoever wants to replace
    that behavior with something else will be able to do it without having
    to break abstraction barriers.


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Mon Oct 21 23:17:10 2024
    From Newsgroup: comp.arch

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Oct 21 23:52:59 2024
    From Newsgroup: comp.arch

    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 22 01:09:49 2024
    From Newsgroup: comp.arch

    On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:

    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require communication with the OS
    there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.

    Guess what the “OS” part of “POSIX” stands for.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Mon Oct 21 18:32:27 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Oct 22 08:27:12 2024
    From Newsgroup: comp.arch

    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such I book I guarantee I will want to buy one.

    Thank you Tim!

    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire: My wife and I
    will both go on "permanent vacation" starting a week before
    Christmas. :-)

    I already know that this will give me more time to work on digital
    mapping projects (ref my https://mapant.no/ Norwegian topo map generated
    from ~50 TB of LiDAR), but if there's an interest in optimization I
    might do that as well.

    BTW, I am also open to doing some consulting work, if the problems are
    interesting enough. :-)

    Regards,
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From George Neuner@gneuner2@comcast.net to comp.arch on Tue Oct 22 17:26:06 2024
    From Newsgroup: comp.arch

    On Tue, 22 Oct 2024 01:09:49 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:

    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require communication with the OS
    there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.

    Guess what the “OS” part of “POSIX” stands for.

    It's still an just environment - POSIX defines only an interface, not
    an implementation.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Wed Oct 23 07:25:42 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.

    Thank you Tim!

    I know from past experience you are good at this. I would love
    to hear what you have to say.

    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    P.S. Is the email address in your message a good way to reach you?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 23 18:11:57 2024
    From Newsgroup: comp.arch

    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Oct 23 18:27:06 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    And start working for "HER". (Honeydew list).
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 23 21:11:59 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".
    Exactly!
    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.
    We recently started (officially) on the 754-2029 revision.
    I'm still connected to Mill Computing as well.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 23 21:12:57 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    And start working for "HER". (Honeydew list).

    My wife does have a small list of things that we (i.e. I) could do when we retire...

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 23 21:09:47 2024
    From Newsgroup: comp.arch

    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.

    Thank you Tim!

    I know from past experience you are good at this. I would love
    to hear what you have to say.

    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    I'm sure you're right!

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    P.S. Is the email address in your message a good way to reach you?

    Yes, that is my personal domain, so it won't be affected by my retirement.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 23 21:01:01 2024
    From Newsgroup: comp.arch

    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    I'm still connected to Mill Computing as well.

    Terje
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Oct 24 07:39:52 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??
    I don't know that usage, I thought quires was a typesetting/printing
    measure?
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 24 06:55:20 2024
    From Newsgroup: comp.arch

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    One thing I have thought of is a wiki of optimization techniques that
    contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 24 10:00:16 2024
    From Newsgroup: comp.arch

    On 24/10/2024 08:55, Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    One thing I have thought of is a wiki of optimization techniques that contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.


    Would it make sense to start something under Wikibooks on Wikipedia? I
    have no experience with it myself, but it looks to me like a way to have
    a collaborative collection of related knowledge. It could provide the structure and framework, saving you (plural) from having to set up a
    wiki, blog, or whatever.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 24 16:34:45 2024
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 24/10/2024 08:55, Anton Ertl wrote:
    One thing I have thought of is a wiki of optimization techniques that
    contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.


    Would it make sense to start something under Wikibooks on Wikipedia?

    Yes, I was thinking about that. In the bookshelf on computer
    programming <https://en.wikibooks.org/wiki/Shelf:Computer_programming>
    there are two "Books nearing completion" that have "Opti" in the
    title:

    https://en.wikibooks.org/wiki/Optimizing_Code_for_Speed
    https://en.wikibooks.org/wiki/Optimizing_C%2B%2B

    Looking at the contents of the former, it's rather short and
    high-level, and I don't think it's intended for the kind of project we
    have in mind.

    The latter is more in the direction I have in mind, but the limitation
    to C++ is, well, limiting.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 24 18:32:22 2024
    From Newsgroup: comp.arch

    On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    I don't know that usage, I thought quires was a typesetting/printing
    measure?

    Terje

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@vir.campestris@invalid.invalid to comp.arch on Sun Oct 27 20:42:09 2024
    From Newsgroup: comp.arch

    On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:
    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    One of the other groups I'm following just for the hell of it is
    comp.os.cpm. I'm pretty sure you don't get POSIX in your 64 KB (max).

    "cannot be a _truly_ portable" is what I meant. Portable to most machine
    is easy - just write for Windows. POSIX will give you a larger subset -
    but still a subset.

    Andy
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@vir.campestris@invalid.invalid to comp.arch on Sun Oct 27 20:45:09 2024
    From Newsgroup: comp.arch

    On 23/10/2024 20:12, Terje Mathisen wrote:

    My wife does have a small list of things that we (i.e. I) could do when we retire...

    Since I retired the garden is looking much better, I've started to win
    the odd trophy sailing, most of the house has been redecorated...

    But best of all - I've lost 5 kg and been able to stop worrying about my weight!

    Andy
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 27 21:04:49 2024
    From Newsgroup: comp.arch

    On Sun, 27 Oct 2024 20:42:09 +0000, Vir Campestris wrote:

    I'm pretty sure you don't get POSIX in your 64kb (max).

    <https://news.ycombinator.com/item?id=34981059>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From David Schultz@david.schultz@earthlink.net to comp.arch on Sun Oct 27 17:55:52 2024
    From Newsgroup: comp.arch

    On 10/27/24 3:42 PM, Vir Campestris wrote:
    On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:
    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    One of the other groups I'm following just for the hell of it is
    comp.os.cpm. I'm pretty sure you don't get POSIX in your 64 KB (max).

    Ignores the 16-bit versions of CP/M: 8086, 68000, Z8000.
    --
    http://davesrocketworks.com
    David Schultz
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Oct 28 11:39:57 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.
    OK, I have seen and used "Super-accumulator" as the term for those. I
    have thought about implementing one in carry-save redundant form, but
    that might be more redundancy than is really needed?
    Having a carry bit for every byte should still make it possible to
    handle several additions/cycle, right?
    I'm assuming the real cost is in the alignment network needed to route
    incoming addends into the right slice.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Oct 28 16:30:46 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Oct 28 10:12:08 2024
    From Newsgroup: comp.arch

    On 10/28/2024 9:30 AM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    Another newer alternative. This came up on my news feed. I haven't
    looked at the details at all, so I can't comment on it.

    https://arxiv.org/abs/2410.03692
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Oct 28 18:14:20 2024
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 10/28/2024 9:30 AM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    Another newer alternative. This came up on my news feed. I haven't
    looked at the details at all, so I can't comment on it.

    https://arxiv.org/abs/2410.03692

    That is about another number representation for AI, trying to squeeze
    more AI performance out of few bits.

    Personally, I like the approach of doing analog calculation for
    the low-accuracy dot products that they do, followed by an A/D
    converter. There is a company doing that, but I forget its name.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Oct 28 15:24:18 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and load
    these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Of course, once you have 168-byte registers people are going to
    think of new uses for them.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Oct 29 06:33:50 2024
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and load
    these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.
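
    The required width can be derived directly: finite binary64 values are
    integer multiples of 2**-1074 with magnitude below 2**1024. A quick
    check (the 32 guard bits are an arbitrary illustrative choice):

```python
# Back-of-envelope size of an exact fixed-point accumulator for binary64.
FRACTION_BITS = 1074  # down to the smallest subnormal ulp, 2**-1074
INTEGER_BITS = 1024   # magnitudes are strictly below 2**1024
base = FRACTION_BITS + INTEGER_BITS   # 2098 bits: "a bit more than 2048"
guard = 32            # headroom for ~2**32 additions without overflow
total = base + 1 + guard              # plus a sign bit
regs_512 = -(-total // 512)           # ceiling division: AVX-512 registers
```

    This gives 2098 bits of span and five 512-bit registers, matching the
    estimate above.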


    Of course, once you have 168-byte registers people are going to
    think of new uses for them.

    SIMD from hell? Pretend that a CPU is a graphics card? :-)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Oct 29 08:07:50 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and load >> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time; at the very least its central, in-use cache
    line would.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than IEEE double.

    If I was implementing this I would probably want some redundant storage
    to limit carry propagation, so maybe 48 bits per 64-bit chunk, in which
    case I would need about 2800 bits or 6 of those 512-bit SIMD regs.
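
    As a quick sanity check of that estimate (the ~2100-bit payload figure
    is an assumption based on the binary64 range discussed above):

```python
import math

# Back-of-envelope check of the redundant-storage estimate:
# ~2100 payload bits, stored as 48 payload bits per 64-bit chunk to
# leave room for deferred carries.
payload_bits = 2100                       # roughly the full binary64 span
chunks = math.ceil(payload_bits / 48)     # 44 chunks
stored_bits = chunks * 64                 # 2816 bits, "about 2800"
simd_regs = math.ceil(stored_bits / 512)  # 6 x 512-bit registers
```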

    SIMD from hell? Pretend that a CPU is a graphics card? :-)

    Writing this as a throughput task could make it fit better within a GPU?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Oct 29 14:19:13 2024
    From Newsgroup: comp.arch

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    IIUC you can already implement such a thing with standard IEEE
    operations, based on the "standard" Knuth approach of computing the
    exact result of `a + b` as the sum of x + y where x is the "normal" sum
    of a + b (and hence y holds the remaining bits lost to rounding).

    I wonder how often this is used in practice.

    Intuitively it should be possible to make it reasonably efficient, where
    you first compute the "naive" sum but also keep the N remaining numbers representing the bits lost to each of the N roundings. I.e. you take in
    a vector "as" of N numbers and return a pair of the "naive" sum plus
    a vector of N rounding errors.

    Σ as => (round(Σ as), rs)
    such that round(Σ as) = the naive IEEE sum of as
    and Σ as = round(Σ as) + Σ rs

    You can then recursively compute "Σ rs" in the same way. At each step of
    the recursion you can compute round(Σ |rs|) to estimate an upper bound
    on the remaining error and thus stop when that error is smaller than
    1 ULP or somesuch.

    AFAICT, if your sum is well-conditioned you should need at most 2 steps
    of the recursion, and I suspect you can predict when the next estimated
    error will be too small before you start the last recursion, so the last recursion might skip the generation of the last "rs" vector.
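
    The Knuth error-free transformation referred to above (often called
    TwoSum), and one version of the recursive scheme sketched here, can be
    written out as follows (a sketch assuming round-to-nearest IEEE
    doubles; the function names are mine, not from any library):

```python
def two_sum(a, b):
    """Knuth's TwoSum: returns (s, err) with s = round(a + b) and
    a + b == s + err exactly (round-to-nearest binary floating point)."""
    s = a + b
    bb = s - a                        # the part of b that made it into s
    err = (a - (s - bb)) + (b - bb)   # what was lost to rounding
    return s, err

def sum_with_errors(xs):
    """One pass of the scheme: the naive running sum plus the vector of
    rounding errors, so that sum(xs) == s + sum(rs) exactly."""
    s = 0.0
    rs = []
    for x in xs:
        s, r = two_sum(s, x)
        rs.append(r)
    return s, rs

def accurate_sum(xs, max_passes=4):
    """Recursively re-sum the error vectors until they all vanish
    (or until max_passes, for ill-conditioned inputs)."""
    s, rs = sum_with_errors(xs)
    for _ in range(max_passes):
        if not any(rs):
            break
        c, rs = sum_with_errors(rs)   # exact: sum(old rs) == c + sum(rs)
        s, r0 = two_sum(s, c)         # fold c into s, keep its error
        rs.append(r0)
    return s
```

    For the well-conditioned case mentioned above, one recursion step
    suffices: accurate_sum([1e16, 1.0, -1e16]) returns the exact 1.0,
    while the naive left-to-right sum returns 0.0.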


    Stefan
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 29 14:29:28 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.
    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
    These would be very large registers. You'd need some way to store and load
    these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    Right, something like 2048+52+3 = 2103 bits for data, plus some status bits. For x64 they could overlay it onto AVX-512 register file in groups of 5
    and use existing SIMD instructions for management.
    That would allow them to pack 3 accumulators into registers z0..z14.

    For RISC-V they have the large vector registers, 32 * 256-bits each I think,
    so again 3 accumulators.

    So it's a plausible proposition.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Oct 29 19:57:25 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
    These would be very large registers. You'd need some way to store and load
    these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.
    (Insert fear and loathing for hex float here).
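    [The layout Thomas describes (sign bit, 7-bit excess-64 base-16
    exponent, then the hex fraction digits) can be decoded in a few lines.
    A sketch for the 64-bit format, rounding through a Python double:]

```python
import math

def ibm_hfp64_to_float(word: int) -> float:
    """Decode a 64-bit IBM S/360 hex float (sketch; rounds to a double)."""
    sign = -1.0 if word >> 63 else 1.0
    exp = (word >> 56) & 0x7F            # base-16 exponent, bias 64
    frac = word & ((1 << 56) - 1)        # 14 hex fraction digits, value/2**56 in [0,1)
    return sign * math.ldexp(frac, 4 * (exp - 64) - 56)
```

    [E.g. 0x4110000000000000 decodes to 1.0. The base-16 exponent is why
    precision "wobbles": up to three leading fraction bits can be zero.]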
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 29 20:21:11 2024
    From Newsgroup: comp.arch

    On Tue, 29 Oct 2024 19:57:25 +0000, Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
    These would be very large registers. You'd need some way to store and
    load these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory
    accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    Terje--IEEE is all capitals.

    IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.

    The span of an IEEE double "quire" would be the exponent range * 2 + fraction.
    a) The most significant non-infinity has an exponent of +1023
    b) The least significant non-underflow has an exponent of -1023
    Leaving a span of 2046 bits plus 52 denormalized bits or 2098-bits
    or 262 bytes.

    One note: When left in memory, one indexes the accumulator with
    the (exponent>>6) and fetches 2 doublewords. A carry out requires
    accessing the 3rd doubleword (possibly transitively).
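    [That exponent-indexed access can be modeled in software (a sketch, not
    from the thread; a real design would keep the words in memory or a
    register file and propagate carries explicitly):]

```python
import math
from fractions import Fraction

# In-memory superaccumulator for IEEE doubles: (shifted exponent) >> 6
# picks the 64-bit word, and each addend touches at most two adjacent
# words. Python's big ints absorb the carries a hardware version would
# propagate to the next doubleword.
BIAS = 1138      # 1074 + one 64-bit guard word, so subnormals index >= 0
NLIMBS = 36      # ~2100 useful bits plus headroom

class Quire:
    def __init__(self):
        self.limbs = [0] * NLIMBS     # limb i holds bits [64i, 64i+64)

    def add(self, x: float):
        m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
        mant = int(m * (1 << 53))     # signed 53-bit integer significand
        pos = e - 53 + BIAS           # bit position of mant's LSB
        i, off = pos >> 6, pos & 63   # word index from the shifted exponent
        v = mant << off               # spans at most two 64-bit words
        self.limbs[i] += v & ((1 << 64) - 1)
        self.limbs[i + 1] += v >> 64

    def value(self) -> float:
        # Reassemble the exact sum and round once, as a quire promises.
        total = sum(l << (64 * i) for i, l in enumerate(self.limbs))
        return float(Fraction(total, 1 << BIAS))
```

    [Cancellation is exact: adding 1e16, 1.0 and -1e16 reads back exactly
    1.0, with the single rounding deferred to value().]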

    (Insert fear and loathing for hex float here).

    Heck, watching Kahan's notes on FP problems leaves one in fear of
    binary floating point representations.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Oct 29 20:30:12 2024
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
    These would be very large registers. You'd need some way to store and load
    these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    IBM format had one sign bit, seven exponent bits and six or fourteen
    hexadecimal digits for single and double precision, respectively.
    (Insert fear and loathing for hex float here).

    Burroughs Medium systems had four exponent sign bits, eight exponent bits,
    four mantissa sign bits, and up to 400 mantissa bits. BCD, so that's an exponent range of -99 to +99 and a 1 to 100 digit mantissa.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Oct 29 21:27:29 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    (Insert fear and loathing for hex float here).

    Heck, watching Kahan's notes on FP problems leaves one in fear of
    binary floating point representations.

    True, but... hex float is so much worse.

    "Hacker's Delight" has some choice words there, and the
    author worked for IBM :-)
    --- Synchronet 3.20a-Linux NewsLink 1.114