• Concertina IV Has Arrived

    From quadi@quadibloc@ca.invalid to comp.arch on Tue May 19 20:14:37 2026
    From Newsgroup: comp.arch

    It had to happen?
    I was not sure if it ever could happen.
    There was Concertina II - an attempt at a practical ISA, unlike the
    original Concertina, which was merely illustrative.
    But it had a block structure, which was highly criticized. And I had to
    admit it was overly complicated. And so I went on and used the Concertina
    III designation again - for a CISC-like instruction set with variable-
    length instructions. The price was, though, switching to register banks of
    16 registers instead of 32.
    The IBM 360 had banks of 16 registers, and more modern CISC designs, like
    the 680x0 and the x86 have banks of only eight. Only RISC designs can
    offer banks of 32 registers.
    And yet it seemed so tantalizingly close - that it might just be possible, using what I've learned about squeezing an ISA into the available opcode space, to go back to banks of 32 registers.
    I found it to be possible - at a price.
    It could be done, but I wouldn't have much space left for 16-bit short instructions.
    Even if I had lots of space for 16-bit short instructions, though, they
    would still, just by being 16 bits long, where the banks of registers have
    32 registers in them, be badly compromised.
    And so I decided to offer only a very limited set of 16-bit short instructions, and to chiefly provide... 24-bit short instructions.
    I didn't want to depart from the example of the 680x0 and the System/360
    by allowing instructions to start on odd bytes, but it seemed like I had no
    choice if I wanted to offer a reasonably complete set of short
    instructions at all.
    Concertina IV is described at:
    http://www.quadibloc.com/arch/cw01int.htm

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 00:03:27 2026
    From Newsgroup: comp.arch

    I've made my first change to Concertina IV. I'm not happy with the way
    things were before the change or the way they are now, so I may change it again.

    The 16-bit short instructions only have 12 free bits available. That's not much to work with when there are 32 registers in each register bank.

    Initially, I settled on four bits of opcode, along with the basic register specification scheme used for the 15-bit paired short instructions in Concertina II.

    But choosing single and double precision floating-point as the only two
    types supported didn't rest easily with me. Single precision isn't really precise enough to be useful, or so I've heard.

    The alternative of supporting 48-bit intermediate precision and double precision, while it appeals to me personally... is clearly untenable.
    Medium is a nonstandard data type, and so it would not be widely used.

    So instead I decided to only support double precision, and use the extra
    bits to allow additional ways to specify registers.

    The result, of course, is messy.

    So I'm considering going back to the earlier format, but instead of
    supporting two floating-point data types, to support one integer type and
    one floating type. But which integer type? 32-bit integer, or 64-bit long?

    I could get more bits by going to _paired_ instructions. But I have some
    free space between 32-bit instructions so that I could just add those
    while keeping 16-bit short instructions.

    And this also led me to thinking about something else.

    I align different integer types on the right, even while aligning
    different floating-point types on the left like everyone else. So integer operations must sign-extend if they're on values shorter than 64 bits.
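
    A minimal sketch of the sign extension described above, assuming
    right-aligned two's-complement values held in a 64-bit register. The
    names `sign_extend` and `add32` are hypothetical illustrations and are
    not taken from the Concertina IV description.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Return the 64-bit register image of a right-aligned signed value."""
    value &= (1 << bits) - 1      # keep only the low `bits` bits
    sign = 1 << (bits - 1)        # position of the narrow sign bit
    # Flipping and then subtracting the sign bit copies it into all
    # higher positions; the final mask models a 64-bit register.
    return ((value ^ sign) - sign) & MASK64


def add32(a: int, b: int) -> int:
    """32-bit add whose result is sign-extended into a 64-bit register."""
    return sign_extend(a + b, 32)
```

    The low 32 bits of the sum are final before the upper 32 bits are
    filled in, which is the property the question about forwarding results
    early relies on.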

    Propagating a bit takes time.

    So should I design the ALU so that the sign extension takes place after
    the rest of the instruction, and allow another 32-bit (or shorter) integer instruction to use results when they're ready, before sign extension? Is
    that just normal efficiency, or wasteful complexity?

    In any case, I think I've come up with something that is a reasonable compromise I can live with after all.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 00:34:59 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 00:03:27 +0000, quadi wrote:

    In any case, I think I've come up with something that is a reasonable compromise I can live with after all.

    And what was that compromise?

    When it comes to floating-point types, there was only one that I valued
    above all the rest, so I couldn't decide what second one to use.

    With integer types, on the other hand, there were two types that I
    couldn't decide between.

    So go with it!

    Support two integer types - even with room for logical as well as
    arithmetic operations - but with a more limited specification of source
    and destination registers... and one floating-point type.

    Done!

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 20 01:35:01 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    I've made my first change to Concertina IV. I'm not happy with the way things were before the change or the way they are now, so I may change it again.

    The 16-bit short instructions only have 12 free bits available. That's not much to work with when there are 32 registers in each register bank.

    Initially, I settled on four bits of opcode, along with the basic register specification scheme used for the 15-bit paired short instructions in Concertina II.

    But choosing single and double precision floating-point as the only two types supported didn't rest easily with me. Single precision isn't really precise enough to be useful, or so I've heard.

    Everything you have heard is both true and false::

    There are many applications where DP is de rigueur {galactic simulations} smaller precision simply will not do. Many of these would like to go
    FP128 but performance is not there yet.
    There is a growing demand for FP16 and FP8 data types for memory-size
    and BW reasons.
    There is a growing background need for FP128, too.

    The alternative of supporting 48-bit intermediate precision and double precision, while it appeals to me personally... is clearly untenable.
    Medium is a nonstandard data type, and so it would not be widely used.

    So instead I decided to only support double precision, and use the extra bits to allow additional ways to specify registers.

    My 66000 started out that way and the compiler showed that this choice sucks.

    The result, of course, is messy.

    No, it becomes unacceptable when FP32 takes 3 instructions while FP64
    takes but 1.

    So I'm considering going back to the earlier format, but instead of supporting two floating-point data types, to support one integer type and one floating type. But which integer type? 32-bit integer, or 64-bit long?

    You will find you have no <marketable> choice; you need to support::

    Integer{S8, S16, S32, S64, U8, U16, U32, U64}
    Float {FP8, FP16, FP32, FP64 and some way to get FP128}

    I could get more bits by going to _paired_ instructions. But I have some free space between 32-bit instructions so that I could just add those
    while keeping 16-bit short instructions.

    And this also led me to thinking about something else.

    I align different integer types on the right, even while aligning
    different floating-point types on the left like everyone else. So integer operations must sign-extend if they're on values shorter than 64 bits.

    Go LE all the way. LE won; get over BE thinking.

    As far as integers go: all calculations produce proper integer values
    in the 64-bit destination register.
    S8 has range [-128..127]
    u8 has range [0..255]
    ...

    Propagating a bit takes time.

    A solved HW gate-level problem.

    So should I design the ALU so that the sign extension takes place after
    the rest of the instruction, and allow another 32-bit (or shorter) integer instruction to use results when they're ready, before sign extension? Is that just normal efficiency, or wasteful complexity?

    All the sign and zero stuff goes "in the CARRY chain".

    In any case, I think I've come up with something that is a reasonable compromise I can live with after all.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 02:09:08 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    I align different integer types on the right, even while aligning
    different floating-point types on the left like everyone else. So
    integer operations must sign-extend if they're on values shorter than
    64 bits.

    Go LE all the way. LE won; get over BE thinking.

    a) I didn't think this really had anything to do with little-endian versus big-endian.

    b) Yes, little-endian is more popular, but that's just because the PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work as well
    *if* you also want to put packed decimal values in registers.

    As far as integers go: all calculations produce proper integer values in
    the 64-bit destination register.
    S8 has range [-128..127]
    u8 has range [0..255]
    ...

    If you have 64-bit registers, then if you want to avoid a gap between the
    sign in a 32-bit number and the sign of a 64-bit number by placing the
    32-bit number on the most significant side, a 32-bit 1 is equal to a
    64-bit 4,294,967,296.

    Propagating a bit takes time.

    A solved HW gate-level problem.

    That's good news; then I don't have a problem. I figured the solution
    would be to use slightly slower gates with larger current output.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 02:26:42 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 02:09:08 +0000, quadi wrote:

    On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    I align different integer types on the right, even while aligning
    different floating-point types on the left like everyone else. So
    integer operations must sign-extend if they're on values shorter than
    64 bits.

    Go LE all the way. LE won; get over BE thinking.

    a) I didn't think this really had anything to do with little-endian
    versus big-endian.

    b) Yes, little-endian is more popular, but that's just because the
    PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work
    as well *if* you also want to put packed decimal values in registers.

    As far as integers go: all calculations produce proper integer values
    in the 64-bit destination register.
    S8 has range [-128..127]
    u8 has range [0..255]
    ...

    If you have 64-bit registers, then if you want to avoid a gap between
    the sign in a 32-bit number and the sign of a 64-bit number by placing
    the 32-bit number on the most significant side, a 32-bit 1 is equal to
    a 64-bit 4,294,967,296.

    While the majority of computers nowadays are little-endian, back in the
    old days only a very few computers treated fixed-point numbers as
    fractions in the range [-1,1) instead of as integers.

    Those that did that either wasted a bit in double-word integers, or
    required one to do a right shift by one bit after doing a multiplication
    if you wanted the result of the multiplication to correspond with treating
    the numbers as integers instead.
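
    A minimal sketch of the fractional convention described above, assuming
    two's-complement words in which an n-bit word `a` stands for the
    fraction a / 2**(n-1). The function names are hypothetical and do not
    model any particular historical machine.

```python
def as_fraction(word: int, bits: int) -> float:
    """Read a signed `bits`-wide word as a fraction in [-1, 1)."""
    return word / (1 << (bits - 1))


def frac_mul(a: int, b: int) -> int:
    """Multiply two n-bit fractions into a 2n-bit fraction word.

    The raw integer product has 2n-2 fraction bits, while a 2n-bit
    fraction word has 2n-1, so the product is shifted left by one bit.
    """
    return (a * b) << 1


def integer_product(double_word: int) -> int:
    """Undo that alignment: shift right by one to get the integer product."""
    return double_word >> 1
```

    With 8-bit operands, 64 and 32 stand for 0.5 and 0.25; `frac_mul`
    yields 4096, which is 0.125 as a 16-bit fraction, and shifting right
    by one bit recovers the integer product 2048.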

    This, not little-endian versus big-endian, was what I was talking about
    not doing.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 07:21:04 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

    Everything you have heard is both true and false::

    There are many applications where DP is de rigueur {galactic
    simulations} smaller precision simply will not do. Many of these would
    like to go FP128 but performance is not there yet.
    There is a growing demand for FP16 and FP8 data types for memory-size
    and BW reasons.
    There is a growing background need for FP128, too.

    I'm aware of all of this.

    You will find you have no <marketable> choice; you need to support::

    Integer{S8, S16, S32, S64, U8, U16, U32, U64}
    Float {FP8, FP16, FP32, FP64 and some way to get FP128}

    I *do* intend to support them all. However, U8, U16, U32, and U64 don't
    get special instructions; the compiler will just have to remember the
    meaning of the condition codes for signed numbers when doing comparisons
    on unsigned numbers.

    Actually, though, that does mean I have to modify the conditional branch instructions. One will actually want to test for combinations of less,
    equal, and greater when overflow is present, and I've assumed that some combinations can be excluded!
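
    For illustration, one well-known way to obtain unsigned comparisons from
    signed-only compare hardware is to flip the sign bit of both operands
    first. This is a generic sketch assuming 64-bit two's-complement
    registers; it is not claimed to be how Concertina IV handles condition
    codes, and all names are hypothetical.

```python
MASK64 = (1 << 64) - 1
MSB = 1 << 63


def as_signed(x: int) -> int:
    """Interpret a 64-bit register image as a two's-complement integer."""
    x &= MASK64
    return x - (1 << 64) if x & MSB else x


def signed_less(a: int, b: int) -> bool:
    """What a signed compare followed by a 'less' branch would report."""
    return as_signed(a) < as_signed(b)


def unsigned_less(a: int, b: int) -> bool:
    """Unsigned a < b using only the signed comparison.

    XOR with the sign bit maps 0..2**64-1 onto -2**63..2**63-1 while
    preserving order, so the signed result equals the unsigned one.
    """
    return signed_less(a ^ MSB, b ^ MSB)
```

    The all-ones register is -1 as a signed value but the largest possible
    unsigned value, so `signed_less(1, MASK64)` is false while
    `unsigned_less(1, MASK64)` is true.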

    So in commenting on a different part of my design entirely, you've pointed
    out an important flaw I will have to correct.

    It's just that the pigeonhole principle prevents me, quite effectively,
    from supporting them all *in 16-bit short instructions with only 12 bits available*. I don't care what marketing says; I believe engineering when
    they say they can't do the impossible.
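
    The counting behind that pigeonhole argument can be sketched as follows.
    The 12 free bits and the 32-register banks come from this thread, and
    the type list is the one quoted above (leaving FP128 aside). The 8-bit
    register specification is inferred from "four bits of opcode" out of 12
    bits, and the function name is hypothetical.

```python
FREE_BITS = 12        # free bits in a 16-bit short instruction
REG_FIELD_BITS = 5    # 32 registers per bank need 5 bits per field

INTEGER_TYPES = ["S8", "S16", "S32", "S64", "U8", "U16", "U32", "U64"]
FLOAT_TYPES = ["FP8", "FP16", "FP32", "FP64"]


def opcode_slots(free_bits: int, register_bits: int) -> int:
    """Distinct opcodes left once the register fields are paid for."""
    spare = free_bits - register_bits
    return 0 if spare < 0 else 1 << spare


# Two unrestricted register fields leave only 2 opcode bits.
full_two_operand = opcode_slots(FREE_BITS, 2 * REG_FIELD_BITS)
# A restricted 8-bit register specification leaves 4 opcode bits.
restricted = opcode_slots(FREE_BITS, 8)
types_wanted = len(INTEGER_TYPES) + len(FLOAT_TYPES)
```

    Even a single operation per type would need 12 opcode slots, more than
    the 4 left by two unrestricted register fields, and the 16 slots of the
    restricted layout cannot hold two operations for each of the 12 types.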

    John Savard
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 05:38:07 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    b) Yes, little-endian is more popular, but that's just because the PDP-11,
    8080, and 6502 happened to choose it.

    Thinking about it:

    * The last descendant of the PDP-11 was canceled long before the most
    prominent big-endian architecture (SPARC) was canceled, and long
    before Power switched its Linux support to little-endian, so the
    PDP-11 had little, if any, influence on the outcome.

    * 8080: Yes, because AMD64 inherited its byte order from it. But if
    we go to the origin here, it's not the 8080 and not the 8008, but
    the Datapoint 2200, which is remarkable, because it was designed as
    a terminal for mainframes, and S/360 is big-endian.
    <https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
    says:

    |The fact that most laptops and cloud computers today store numbers
    |in little-endian format is carried forward from the original
    |Datapoint 2200. Because the original Datapoint 2200 had a serial
    |processor, it needed to start with the lowest bit of the lowest byte
    |in order to handle carries. Microprocessors descended from the
    |Datapoint 2200 (the 8008, Z80, and the x86 chips used in most
    |laptops and cloud computers today) kept the little-endian format
    |used by that original Datapoint 2200.

    * 6502: Yes, because ARM A64 inherited its byte order from it. The
    6502 is remarkable because it is a child of the 6800, which is
    big-endian. So the choice of little-endian byte order was
    deliberate.

    RISC-V inherits its original byte order from the descendents of 8080
    and 6502. The ISA manual comments on this:

    |We originally chose little-endian byte ordering for the RISC-V memory
    |system because little-endian systems are currently dominant
    |commercially (all x86 systems; iOS, Android, and Windows for ARM). A
    |minor point is that we have also found little-endian memory systems to
    |be more natural for hardware designers. However, certain application
    |areas, such as IP networking, operate on big-endian data structures,
    |and certain legacy code bases have been built assuming big-endian
    |processors, so we have defined big-endian and bi-endian variants of
    |RISC-V.
    [...]
    |We further make the instruction parcels themselves little-endian to
    |decouple the instruction encoding from the memory system endianness
    |altogether.

    I expect that big-endian RISC-V's will be as common as big-endian
    Alphas and big-endian ARMs (all Alphas and ARMs after a certain point
    in time support a big-endian mode), i.e., not at all.

    Little-endian doesn't work as well
    *if* you also want to put packed decimal values in registers.

    It certainly does. I know it because we had a group exercise in
    assembly language on 80286s that dealt with BCD numbers, and we split
    the project into submodules, one for each student. In integration
    testing we found that we had forgotten to specify the byte order in
    our interface descriptions. In our group, two students (including
    me) had chosen little-endian and IIRC two had chosen big-endian. I
    did not find that doing the BCD stuff in little-endian byte order
    worked badly.

    With the BCD support of instruction sets typically requiring piecing
    together the complete operation of suboperations of less than full
    length (e.g., bytes on the 6502 and the 80(2)86), little-endian is
    actually easier. When you add two BCD numbers that are longer than a
    byte, you don't have to first go to the end of the number and then go
    backwards from there. This is especially relevant if you do not want
    to completely unroll the loop that handles these bytes.
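
    A minimal sketch of such a loop, assuming packed BCD (two digits per
    byte) stored least significant byte first and operands of equal length.
    The helper names are hypothetical; the per-byte step stands in for what
    a byte-wide decimal add with carry would do on an 8-bit CPU.

```python
def bcd_add_byte(x: int, y: int, carry: int) -> tuple[int, int]:
    """Add two packed-BCD bytes plus a carry-in; return (byte, carry-out)."""
    lo = (x & 0x0F) + (y & 0x0F) + carry
    carry_lo = 1 if lo > 9 else 0
    lo -= 10 * carry_lo
    hi = (x >> 4) + (y >> 4) + carry_lo
    carry_hi = 1 if hi > 9 else 0
    hi -= 10 * carry_hi
    return (hi << 4) | lo, carry_hi


def bcd_add_le(a: bytes, b: bytes) -> tuple[bytes, int]:
    """Add two little-endian packed-BCD numbers of equal length."""
    out = bytearray()
    carry = 0
    for x, y in zip(a, b):    # ascending addresses, no seek to the end
        byte, carry = bcd_add_byte(x, y, carry)
        out.append(byte)
    return bytes(out), carry
```

    For example, 1999 is stored as the bytes 0x99, 0x19 and 1 as 0x01,
    0x00; the loop produces 0x00, 0x20 (that is, 2000) with no carry out.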

    Note that the 6502 includes BCD support with its decimal mode, and the designers of the 6502 obviously did not agree with the claim you made
    above.

    When the 8080 added BCD support in form of the DAA instruction (the
    8086 added DAS), the byte order decision had already been made with
    the Datapoint 2200, but if they really thought that decimal operation
    is a good reason for big-endian byte order, they could have done what
    the 6502 had done and switched the byte order around from its
    ancestors.

    On the other hand, given that the 6502 and 8080 BCD support worked on
    bytes, the programmers were free to choose any byte order they prefer,
    as our student project proved. Maybe some (how many?) of the
    programmers who wrote BCD code for the 6502 and for the 8080 and its descendants actually chose a big-endian format. Things get more
    interesting if the granularity of BCD support is bigger than a byte,
    e.g., on the HPPA or IIRC S/360.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed May 20 08:10:10 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    * 8080: Yes, because AMD64 inherited its byte order from it. But if
    we go to the origin here, it's not the 8080 and not the 8008, but
    the Datapoint 2200, which is remarkable, because it was designed as
    a terminal for mainframes, and S/360 is big-endian.
    <https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
    says:

    |The fact that most laptops and cloud computers today store numbers
    |in little-endian format is carried forward from the original
    |Datapoint 2200. Because the original Datapoint 2200 had a serial
    |processor, it needed to start with the lowest bit of the lowest byte
    |in order to handle carries. Microprocessors descended from the
    |Datapoint 2200 (the 8008, Z80, and the x86 chips used in most
    |laptops and cloud computers today) kept the little-endian format
    |used by that original Datapoint 2200.

    For the Datapoint 2200, there was a solid technical reason:
    It used shift register memory which supplied one bit at a time,
    so the adder *had* to be little-endian.
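
    A minimal sketch of a bit-serial adder, assuming operand bits arrive
    one at a time with the least significant bit first. It illustrates the
    bit order only and is not a model of the actual Datapoint 2200 logic;
    all names are hypothetical.

```python
def to_lsb_first(value: int, width: int) -> list[int]:
    """Split a value into `width` bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_lsb_first(bits: list[int]) -> int:
    """Reassemble a value from bits given least significant bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits: list[int], b_bits: list[int]) -> tuple[list[int], int]:
    """Add two equal-length bit streams using one full adder and one carry bit."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        out.append(total & 1)   # sum bit leaves as soon as it is known
        carry = total >> 1      # the only state kept between bits
    return out, carry
```

    Feeding the most significant bit first would require buffering the
    whole operand, because every sum bit depends on the carries from all
    lower bits.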

    See https://www.righto.com/2014/12/inside-intel-1405-die-photos-of-shift.html
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Wed May 20 10:42:44 2026
    From Newsgroup: comp.arch

    On 5/20/26 04:09, quadi wrote:

    b) Yes, little-endian is more popular, but that's just because the
    PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work
    as well *if* you also want to put packed decimal values in registers.

    For packed decimals that are processed in memory, little endian is
    superior to big endian, because you don't have to look for the LSB when performing an addition, you can proceed bytewise on ascending addresses.

    As a consequence, packed decimals in registers should also be little
    endian, conceding the fact that the classic byte-wise representation is
    skewed (but when displaying words, the reading order is natural).
    --
    Bernd Linsel
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 08:36:05 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    <https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
    says:

    |[...] Because the original Datapoint 2200 had a serial
    |processor, it needed to start with the lowest bit of the lowest byte
    |in order to handle carries.
    [...]
    For the Datapoint 2200, there was a solid technical reason:
    It used shift register memory which supplied one bit at a time,
    so the adder *had* to be little-endian.

    Looks plausible at first, but when I think about it some more, both
    claims are wrong.

    Yes, you start with the least significant bit, but given that the
    architecture is not bit-addressed, this is irrelevant.

    The architecture is byte-addressed, and the ALU only works on a single
    byte, so the ALU does not work any better for little-endian than for big-endian.

    For the 6502, when dealing with carries in addressing, both in the relative
    addressing of conditional branches and in the indexed addressing
    modes with 16-bit base addresses, little-endian made the
    implementation a little simpler. The Datapoint 2200 does not have
    indexed addressing modes, so relative branches may have been the issue
    (if the Datapoint 2200 has them).

    Did I miss any other reason why little-endian byte order is easier to
    implement on these processors than big-endian?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed May 20 10:37:39 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    <https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
    says:

    |[...] Because the original Datapoint 2200 had a serial
    |processor, it needed to start with the lowest bit of the lowest byte
    |in order to handle carries.
    [...]
    For the Datapoint 2200, there was a solid technical reason:
    It used shift register memory which supplied one bit at a time,
    so the adder *had* to be little-endian.

    Looks plausible at first, but when I think about it some more, both
    claims are wrong.

    Unfortunately, you are mistaken.

    Yes, you start with the least significant bit, but given that the architecture is not bit-addressed, this is irrelevant.

    JMP with a two-byte address was little-endian on the Datapoint 2200,
    and so had to be on the Intel 8008, which had to be binary compatible
    with the TTL CPU of the 2200.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 20 15:03:22 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    quadi <quadibloc@ca.invalid> writes:
    b) Yes, little-endian is more popular, but that's just because the PDP-11,
    8080, and 6502 happened to choose it.

    Thinking about it:



    With the BCD support of instruction sets typically requiring piecing
    together the complete operation of suboperations of less than full
    length (e.g., bytes on the 6502 and the 80(2)86), little-endian is
    actually easier. When you add two BCD numbers that are longer than a
    byte, you don't have to first go to the end of the number and then go
    backwards from there. This is especially relevant if you do not want
    to completely unroll the loop that handles these bytes.

    The B3500 had a clever algorithm for adding BCD numbers. The
    addend and augend could each be from 1 to 100 digits in length.
    The algorithm would start adding from the lowest (most significant
    digit in the longest operand) address of each operand adding
    each digit in turn.

    "The processor uses an adder that accumulates two fields
    from the most significant to the least significant digit
    positions. Reverse addition, as incorporated in the
    B2500 and B3500 systems has the advantage of detecting
    an overflow condition prior to altering the receiving field"

    The algorithm used a 9's counter to track the leading
    digits.
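
    A minimal sketch of most-significant-digit-first addition with a nines
    counter, assuming equal-length operands given as lists of decimal
    digits. It illustrates the general technique only and is not a
    reconstruction of the actual B2500/B3500 logic; the names are
    hypothetical.

```python
def add_msd_first(a: list[int], b: list[int]) -> tuple[list[int], bool]:
    """Add two digit lists, most significant digit first."""
    out = []          # digits that are final and could be written out
    pending = None    # last non-9 digit, held back in case a carry arrives
    nines = 0         # run of 9s following `pending`
    overflow = False
    for x, y in zip(a, b):
        s = x + y
        if s == 9:
            nines += 1                  # a later carry could still ripple through
        elif s < 9:
            # No carry can pass this position, so held digits are final.
            if pending is not None:
                out.append(pending)
            out.extend([9] * nines)
            pending, nines = s, 0
        else:
            # Carry: bump the pending digit and turn the held 9s into 0s.
            if pending is None:
                overflow = True         # carry out of the top position
            else:
                out.append(pending + 1)
            out.extend([0] * nines)
            pending, nines = s - 10, 0
    if pending is not None:
        out.append(pending)
    out.extend([9] * nines)
    return out, overflow
```

    In this sketch an overflow can only be flagged while `out` is still
    empty, so it is known before any result digit has been committed,
    which is the property the quoted manual text describes.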
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 15:28:16 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 05:38:07 +0000, Anton Ertl wrote:

    * The last descendant of the PDP-11 was canceled long before the most
    prominent big-endian architecture (SPARC) was canceled, and long
    before Power switched its Linux support to little-endian, so the
    PDP-11 had little, if any, influence on the outcome.

    The reason I blame the PDP-11 for everything is that it was a hugely influential machine. It was widely used in academic settings, and it was
    also the machine for which UNIX was first widely distributed.

    When you add two BCD numbers that are longer than a
    byte, you don't have to first go to the end of the number and then go backwards from there. This is especially relevant if you do not want to completely unroll the loop that handles these bytes.

    This is the reason little-endian was popular for small processors. It is
    no longer relevant if a processor has a 64-bit data bus. And, of course,
    it applies equally to binary and BCD.

    The reason I claim that BCD support strongly favors big-endian byte order
    is this:

    Character strings are, of course, in "big endian" order; that is,
    normally, a character string is written in memory with successive
    characters at increasing addresses - and, at least in languages that are written from left to right, numerals appear in texts with the most
    significant digit first.

    So if one has a hardware instruction to convert from BCD to the string representation of numbers, such as UNPK or EDIT, then those two representations should have the same endian-ness.

    And if one wants to use the same ALU for binary and BCD arithmetic, then
    those have to have the same endianness.
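
    A minimal sketch of the conversion argument, assuming unsigned packed
    BCD and ASCII digits. A real UNPK or EDIT also handles sign nibbles,
    zones, and EBCDIC, none of which is modeled here, and the function
    names are hypothetical.

```python
def unpack_be(packed: bytes) -> str:
    """Big-endian packed BCD to a digit string in one ascending pass."""
    digits = []
    for byte in packed:                              # source addresses ascend
        digits.append(chr(0x30 + (byte >> 4)))       # high digit first
        digits.append(chr(0x30 + (byte & 0x0F)))
    return "".join(digits)


def unpack_le(packed: bytes) -> str:
    """Little-endian packed BCD: the source must be walked backwards."""
    return unpack_be(packed[::-1])
```

    The value 1999 is the bytes 0x19, 0x99 in big-endian order and 0x99,
    0x19 in little-endian order; both functions return "1999", but only
    the big-endian one reads and writes in the same direction.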

    John Savard
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 20 15:32:41 2026
    From Newsgroup: comp.arch

    Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> writes:
    On 5/20/26 04:09, quadi wrote:

    b) Yes, little-endian is more popular, but that's just because the
    PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work
    as well *if* you also want to put packed decimal values in registers.

    For packed decimals that are processed in memory, little endian is
    superior to big endian, because you don't have to look for the LSB when
    performing an addition, you can proceed bytewise on ascending addresses.

    Burroughs figured that problem out half a century ago, and was
    able to add two big-endian BCD numbers memory-to-memory. Overflow
    was detected before the receiving field was modified (without
    intermediate or internal storage) by counting leading 9s.


  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 11:04:44 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    <https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
    says:

    |[...] Because the original Datapoint 2200 had a serial
    |processor, it needed to start with the lowest bit of the lowest byte
    |in order to handle carries.
    [...]
    For the Datapoint 2200, there was a solid technical reason:
    It used shift register memory which supplied one bit at a time,
    so the adder *had* to be little-endian.

    Looks plausible at first, but when I think about it some more, both
    claims are wrong.

    Unfortunately, you are mistaken.

    A claim without any supporting argument.

    Yes, you start with the least significant bit, but given that the
    architecture is not bit-addressed, this is irrelevant.

    JMP with a two-byte address was little-endian on the Datapoint 2200,

    Yes, but is the bit-serial memory the reason for that? No, the ALU is
    not involved, and they could just have decided to represent the
    address in big-endian byte order, and load the 16 bits into the PC (or
    next-PC) register.

    The conditional jump instructions of the Datapoint 2200 also have
    absolute target addresses and don't involve the ALU.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 15:42:03 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    On Wed, 20 May 2026 05:38:07 +0000, Anton Ertl wrote:

    * The last descendant of the PDP-11 was canceled long before the most
    prominent big-endian architecture (SPARC) was canceled, and long
    before Power switched its Linux support to little-endian, so the
    PDP-11 had little, if any, influence on the outcome.

    The reason I blame the PDP-11 for everything is that it was a hugely
    influential machine. It was widely used in academic settings, and it was
    also the machine for which UNIX was first widely distributed.

    But its byte order was not influential into this century. Unix and
    its applications are portable, including between byte orders (or at
    least they were, when there were still enough machines of either byte
    order around that one could test that). And somehow the PDP-11 and
    its offspring did not capture the workstation market and the server
    market that evolved from that, and which constituted the Unix
    markets.

    Instead, the big-endian 68000 and its offspring dominated that market
    for a while, and was replaced with RISCs later, which had the same
    byte order as the earlier machines from the same company (i.e.,
    little-endian for DEC and big-endian for the others). And when the
    market for workstations and servers on RISCs shrank down to almost
    nothing, not only did these big-endian machines vanish, but the
    offspring of the PDP-11 as well (and actually before some of the
    big-endian RISCs). What remains of this world is AIX on Power, and I
    have no idea how many installations there still are.

    Linux on Power was switched to little-endian with the introduction of
    OpenPower, not because of the PDP-11 descendants, but because of the
    Datapoint 2200 descendants. And the Datapoint 2200 (announced in June
    1970) was probably not influenced by the PDP-11 (announced in January
    1970).

    When you add two BCD numbers that are longer than a
    byte, you don't have to first go to the end of the number and then go
    backwards from there. This is especially relevant if you do not want to
    completely unroll the loop that handles these bytes.

    This is the reason little-endian was popular for small processors. It is
    no longer relevant if a processor has a 64-bit data bus. And, of course,
    it applies equally to binary and BCD.

    If the numbers fit in one granule, yes, that benefit does not matter.
    But 64 bits are not enough for all binary numbers and probably not for
    all BCD numbers, either: the decimal FP people were not satisfied with
    the 15-digit mantissas that are easily possible with their
    representations in 64 bits; they did not even define a decimal64
    format last I checked. So will 16-digit BCD numbers be satisfactory?

    The reason I claim that BCD support strongly favors big-endian byte order
    is this:

    Character strings are, of course, in "big endian" order; that is,
    normally, a character string is written in memory with successive
    characters at increasing addresses - and, at least in languages that are
    written from left to right, numerals appear in texts with the most
    significant digit first.

    So if one has a hardware instruction to convert from BCD to the string
    representation of numbers, such as UNPK or EDIT, then those two
    representations should have the same endian-ness.

    Reality check: Modern architectures tend to have byte-swap and shuffle instructions. They tend not to have BCD-to-ASCII instructions, but
    these can be implemented easily enough with the help of shuffle and
    bitwise instructions. And given that you need to use shuffle anyway,
    the byte-swapping does not cost extra.

    For doing it for more than one granule, you have to pay the big-endian
    cost on that conversion (for storing into the string, the loading of
    the BCD number would still be in little-endian order), but at least
    not for the arithmetic operations.

    And if one wants to use the same ALU for binary and BCD arithmetic, then
    those have to have the same endianness.

    Sure, but that's not a reason to use big-endian byte order, see above.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed May 20 17:01:56 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    <https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
    says:

    |[...] Because the original Datapoint 2200 had a serial
    |processor, it needed to start with the lowest bit of the lowest byte
    |in order to handle carries.
    [...]
    For the Datapoint 2200, there was a solid technical reason:
    It used shift register memory which supplied one bit at a time,
    so the adder *had* to be little-endian.

    Looks plausible at first, but when I think about it some more, both
    claims are wrong.

    Unfortunately, you are mistaken.

    A claim without any supporting argument.

    Then maybe some more explanation is needed. It is sometimes difficult
    to think back to the limitations those designers faced.

    The 2200 did not have byte-addressable memory; memory contents only
    could be used when they bubbled up through the shift registers.
    Otherwise, the CPU had to wait. (It was a silicon version of the
    mercury delay lines of the UNIVAC I).

    So, how do you add or subtract values in memory? From low to high
    value, saving carries. You then have a choice of either loading
    them in sequence, in a single go, or to load the high value,
    wait for half a microsecond and then load the low value.

    Would you build such a machine in big-endian or little-endian?

    (And yes, it seems negative branches could take ~ 500 cycles, as
    well.)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 17:25:21 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 15:42:03 +0000, Anton Ertl wrote:
    quadi <quadibloc@ca.invalid> writes:

    Character strings are, of course, in "big endian" order; that is,
    normally, a character string is written in memory with successive
    characters at increasing addresses - and, at least in languages that are
    written from left to right, numerals appear in texts with the most
    significant digit first.

    So if one has a hardware instruction to convert from BCD to the string
    representation of numbers, such as UNPK or EDIT, then those two
    representations should have the same endian-ness.

    Reality check: Modern architectures tend to have byte-swap and shuffle instructions. They tend not to have BCD-to-ASCII instructions, but
    these can be implemented easily enough with the help of shuffle and
    bitwise instructions. And given that you need to use shuffle anyway,
    the byte-swapping does not cost extra.

    An additional instruction is an additional instruction! But I think you
    simply mean that the hardware is present. I'm not saying that BCD can't be implemented in a little-endian architecture; I'm saying it's much easier
    to understand and define when BCD and character strings and binary all go
    the same way - and the byte order of character strings is fixed.

    John Savard
  • From John Levine@johnl@taugh.com to comp.arch on Wed May 20 17:47:59 2026
    From Newsgroup: comp.arch

    According to Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com>:
    For packed decimals that are processed in memory, little endian is
    superior to big endian, because you don't have to look for the LSB when
    performing an addition, you can proceed bytewise on ascending addresses.

    It depends what you're doing. If you're doing arithmetic, you need to start at the low end. If you're packing or unpacking or editing for display, you need to start at the high end. My understanding is that back in the day when performance
    mattered, the applications that used BCD arithmetic typically did one arithmetic
    operation on each value, so the pack/edit mattered more.
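
    As a rough illustration of the arithmetic case (a sketch only, not any machine's microcode): the function name `bcd_add_le` and the layout (two digits per byte, least significant byte at the lowest address, no sign nibble) are assumptions made for this example.

```python
def bcd_add_le(a: bytes, b: bytes):
    """Add two equal-length packed-BCD numbers stored least significant byte first.

    Each byte holds two decimal digits (high nibble = tens, low nibble = units).
    Returns (sum_bytes, carry_out). The layout is an assumption for illustration.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    out = bytearray()
    carry = 0
    # Ascending addresses correspond to ascending significance, so the
    # carry can simply be passed from one iteration to the next.
    for x, y in zip(a, b):
        lo = (x & 0x0F) + (y & 0x0F) + carry
        carry = 1 if lo > 9 else 0
        if carry:
            lo -= 10
        hi = (x >> 4) + (y >> 4) + carry
        carry = 1 if hi > 9 else 0
        if carry:
            hi -= 10
        out.append((hi << 4) | lo)
    return bytes(out), carry
```

    With this layout the loop walks ascending addresses and the carry rides along; packing, unpacking or editing for display would want the opposite walk.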

    Having looked into this in some detail, both when IBM used big-endian order on
    S/360 and DEC used little-endian on the PDP-11, neither documented the reasons
    for the byte order choice at all. Not even a little bit.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From John Levine@johnl@taugh.com to comp.arch on Wed May 20 18:07:14 2026
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    The B3500 had a clever algorithm for adding BCD numbers. The
    addend and augend could each be from 1 to 100 digits in length.
    The algorithm would start adding from the lowest (most significant
    digit in the longest operand) address of each operand adding
    each digit in turn.

    "The processor uses an adder that accumulates two fields
    from the most significant to the least significant digit
    positions. Reverse addition, as incorporated in the
    B2500 and B3500 systems has the advantage of detecting
    an overflow condition prior to altering the receiving field"

    The algorithm used a 9's counter to track the leading
    digits.

    How did it handle carries? Let's say you're adding

    099999999999999999999999999999999999999999999999999
    000000000000000000000000000000000000000000000000001

    If it starts at the high digit, it won't know until it gets to the end
    that it has to propagate carries all the way back to the beginning.

    S/360 had operand lengths in the instructions so even though it
    addressed the high byte, it could do one add and get the address
    of the low byte. On S/370 and later machines with virtual memory
    it was more complicated since it had to check and be sure that all
    of the pages where the operands resided were available.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 17:30:49 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    The 2200 did not have byte-addressable memory; memory contents only
    could be used when they bubbled up through the shift registers.
    Otherwise, the CPU had to wait. (It was a silicon version of the
    mercury delay lines of the UNIVAC I).

    So, how do you add or subtract values in memory? From low to high
    value, saving carries. You then have a choice of either loading
    them in sequence, in a single go, or to load the high value,
    wait for half a microsecond and then load the low value.

    The Datapoint 2200 has only instructions for adding or subtracting the
    bits of a byte. For adding two 16-bit values X and Y, you load the
    LSB of X and the LSB of Y, add them, store the result, load the MSB of
    X and MSB of Y, adc them, and store the result.
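
    The add/adc sequence just described can be sketched as follows. This is an illustrative model, not Datapoint 2200 code; the helper names `add8` and `add16_bytewise` are made up for the example. The routine only ever sees individual bytes, so where the LSB and MSB live in memory is invisible to it.

```python
def add8(a, b, carry_in=0):
    """8-bit add with carry in; returns (result, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add16_bytewise(x_lsb, x_msb, y_lsb, y_msb):
    """Add two 16-bit values one byte at a time.

    The bytes arrive as separate arguments: the arithmetic fixes the order of
    the operations (LSB first), not the order of the bytes in memory.
    """
    lsb, carry = add8(x_lsb, y_lsb)          # plain add on the low bytes
    msb, carry = add8(x_msb, y_msb, carry)   # add-with-carry on the high bytes
    return lsb, msb, carry
```

    For instance, `add16_bytewise(0xFF, 0x12, 0x01, 0x00)` models 0x12FF + 0x0001 and yields the bytes of 0x1300 with no carry out.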

    Given that you have only HL for memory access, and several registers,
    if the LSBs and MSBs are adjacent, you probably first want to load the
    LSB and MSB of X (and in that case, there is no preferred order), and
    add the LSB of Y, move A to some other register, then move the MSB of
    X into A, and adc the MSB of Y, then store the LSB and MSB of the
    result (again, no preferred order). And note that for any new address
    you access, you have to change at least L between the memory accesses,
    and maybe also H.

    Even with that kind of drum-like memory, how will little-endian
    provide a benefit? At best in the memory accesses to Y, but only if
    the other stuff that is going on between these two memory accesses
    does not advance the memory chip across the MSB (if the MSB is
    actually in the same memory chip as the LSB).

    And in any case, this is pure software convention. There is nothing
    in the architecture that tells programmers how to arrange the two
    bytes of a 16-bit data number. They could also do an array for the
    LSBs and an array for the MSBs (structure-of-array style), and then
    one would not need so many registers for intermediate storage. Load
    LSB of X, (update L), add LSB of Y, (update L), store LSB of the
    result, then (update L and maybe H), load MSB of X, (update L), adc
    MSB of Y, (update L) store the MSB of the result.

    The only thing in the architecture that actually specifies
    little-endian byte order is in the control-flow instructions where the
    byte order of the target address is little-endian. But bit-serial
    memory is not the reason for that, implementing these instructions
    with a big-endian target address would have been just as fast and just
    as hard.

    Would you build such a machine in big-endian or little-endian?

    It's not about what I would do, but about what is little-endian about
    the Datapoint 2200, and if there were technical reasons for that. I
    don't see any.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 20 18:13:22 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    I align different integer types on the right, even while aligning
    different floating-point types on the left like everyone else. So
    integer operations must sign-extend if they're on values shorter than
    64 bits.

    Go LE all the way. LE won; get over BE thinking.

    a) I didn't think this really had anything to do with little-endian versus big-endian.

    b) Yes, little-endian is more popular, but that's just because the PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work as well *if* you also want to put packed decimal values in registers.

    BE's advantage is only when packed decimal is both not a power of 2 in
    size, and residing in memory. Once in a register those advantages vanish.
    One could make a LE in MEM PD solution work with modern resource counts,
    too.

    As far as integers go: all calculations produce proper integer values in the 64-bit destination register.
    s8 has range [-128..127]
    u8 has range [0..255]
    ...

    If you have 64 bit registers, then if you want to avoid a gap between the
    sign in a 32-bit number and the sign of a 64-bit number by placing the
    32-bit number on the most significant side, a 32-bit 1 is equal to a
    64-bit 4,294,967,296.

    Propagating a bit takes time.

    A solved HW gate-level problem.

    That's good news, then I don't have a problem. I figured the solution
    would be to use slightly slower gates with larger current output.

    John Savard
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 20 19:03:01 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    The B3500 had a clever algorithm for adding BCD numbers. The
    addend and augend could each be from 1 to 100 digits in length.
    The algorithm would start adding from the lowest (most significant
    digit in the longest operand) address of each operand adding
    each digit in turn.

    "The processor uses an adder that accumulates two fields
    from the most significant to the least significant digit
    positions. Reverse addition, as incorporated in the
    B2500 and B3500 systems has the advantage of detecting
    an overflow condition prior to altering the receiving field"

    The algorithm used a 9's counter to track the leading
    digits.

    How did it handle carries? Let's say you're adding

    099999999999999999999999999999999999999999999999999
    000000000000000000000000000000000000000000000000001

    A value that overflows the size of the receiving field
    cannot be represented, so the overflow toggle is set and
    the instruction terminates _without modifying the
    receiving field_.

    The size of the receiving field is the larger of the
    two source fields. So

    ADD 0508 000000 100000 200000

    would add the 5 digit value at address 0 to the
    8 digit value at address 100000 and store the
    result at address 200000.


    If it starts at the high digit, it won't know until it gets to the end
    that it has to propagate carries all the way back to the beginning.

    Actually, that's the clever part. They count 9s.

    Example 1: 10 digit receiving field, 10 digit addend, 1 digit augend:

    Memory contents before:

    000000: 9999999999
    000010: 1

    ADD 1001 000000 000010 000020

    The result of the instruction is that the overflow toggle
    will be set and the destination field will remain unmodified.

    The algorithm implicitly fills leading zeros into
    the shorter operand.

    The first digit of the addend operand is read. '9' in
    this case. The first digit of the augend is added (in this
    case, implicitly zero) and the result is 9. A special
    register (the 9's counter) is incremented and the algorithm
    proceeds to the next digit. Wash, rinse and repeat until
    reaching the last digit, where the sum of 9 + 1 will overflow
    a single digit, so the instruction terminates with overflow.

    If in the case you showed above, there was a zero in the
    first digit of both operands, there is no possibility of
    overflow and the algorithm will simply process each
    digit of the addend+augend sequentially from higher
    magnitude to lower magnitude. It delays writing each
    digit of the sum (other than the last) until it knows
    the following digit doesn't overflow. If it does
    overflow, it increments the delayed value before
    writing. To the extent that there are multiple sequential
    9s in the sum, when the next digit would overflow, the
    processor uses the 9's counter and the saved digit to
    store the correct digits to the receiving field.

    There's a flow chart in 1025475_B2500_B3500_RefMan_Oct69.pdf
    which is available on bitsavers.
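
    The delayed-write scheme described above can be sketched in Python. This is a reconstruction from the prose in this post only, not from the flow chart in the manual; the function name `add_msd_first`, the use of digit strings, and the omission of signs are all assumptions made for illustration.

```python
def add_msd_first(a, b):
    """Add two unsigned decimal digit strings, most significant digit first.

    Returns the sum as a string as wide as the longer operand, or None on
    overflow. One digit is held back, plus a count of the 9s that follow it,
    until it is known whether a carry will arrive from below.
    """
    width = max(len(a), len(b))
    a = a.rjust(width, "0")   # implicit leading zeros for the shorter operand
    b = b.rjust(width, "0")
    out = []       # digits already committed to the receiving field
    held = None    # the delayed digit, not yet written
    nines = 0      # the 9's counter: 9s seen since the delayed digit
    for da, db in zip(a, b):
        s = int(da) + int(db)
        if s == 9:
            nines += 1            # may still become 0 if a carry arrives
            continue
        if s > 9:
            if held is None:
                # Carry out of the top digit: overflow, nothing written yet.
                return None
            out.append(held + 1)      # the carry bumps the delayed digit
            out.extend([0] * nines)   # and turns the pending 9s into 0s
            held = s - 10
        else:
            if held is not None:
                out.append(held)      # no carry can reach it any more
            out.extend([9] * nines)
            held = s
        nines = 0
    if held is not None:
        out.append(held)
    out.extend([9] * nines)
    return "".join(str(d) for d in out)
```

    In this sketch a carry can only meet an empty `held` slot while `out` is still empty, which matches the claim that overflow is detected before the receiving field is altered.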
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed May 20 21:33:05 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    quadi <quadibloc@ca.invalid> writes:
    On Wed, 20 May 2026 05:38:07 +0000, Anton Ertl wrote:

    * The last descendent of the PDP-11 was canceled long before the most
    prominent big-endian architecture (SPARC) was canceled, and long
    before Power switched its Linux support to little-endian, so the
    PDP-11 had little, if any, influence on the outcome.

    The reason I blame the PDP-11 for everything is that it was a hugely
    influential machine. It was widely used in academic settings, and it was
    also the machine for which UNIX was first widely distributed.

    But its byte order was not influential into this century. Unix and
    its applications are portable, including between byte orders (or at
    least they were, when there were still enough machines of either byte
    order around that one could test that). And somehow the PDP-11 and
    its offspring did not capture the workstation market and the server
    market that evolved from that, and which constituted the Unix
    markets.

    Instead, the big-endian 68000 and its offspring dominated that market
    for a while, and was replaced with RISCs later, which had the same
    byte order as the earlier machines from the same company (i.e.,
    little-endian for DEC and big-endian for the others). And when the
    market for workstations and servers on RISCs shrank down to almost
    nothing, not only did these big-endian machines vanish, but the
    offspring of the PDP-11 as well (and actually before some of the
    big-endian RISCs). What remains of this world is AIX on Power, and I
    have no idea how many installations there still are.

    Linux on Power was switched to little-endian with the introduction of
    OpenPower, not because of the PDP-11 descendants, but because of the
    Datapoint 2200 descendants. And the Datapoint 2200 (announced in June
    1970) was probably not influenced by the PDP-11 (announced in January
    1970).

    When you add two BCD numbers that are longer than a
    byte, you don't have to first go to the end of the number and then go
    backwards from there. This is especially relevant if you do not want to
    completely unroll the loop that handles these bytes.

    This is the reason little-endian was popular for small processors. It is
    no longer relevant if a processor has a 64-bit data bus. And, of course,
    it applies equally to binary and BCD.

    If the numbers fit in one granule, yes, that benefit does not matter.
    But 64 bits are not enough for all binary numbers and probably not for
    all BCD numbers, either: the decimal FP people were not satisfied with
    the 15-digit mantissas that are easily possible with their
    representations in 64 bits; they did not even define a decimal64
    format last I checked. So will 16-digit BCD numbers be satisfactory?

    IEEE 754 does define decimal64, decimal128 and even decimal32, but the
    first two have pretty much all the actual usage, probably (?) decimal128
    as the majority, at least for all accumulators.


    The reason I claim that BCD support strongly favors big-endian byte order
    is this:

    Character strings are, of course, in "big endian" order; that is,
    normally, a character string is written in memory with successive
    characters at increasing addresses - and, at least in languages that are
    written from left to right, numerals appear in texts with the most
    significant digit first.

    So if one has a hardware instruction to convert from BCD to the string
    representation of numbers, such as UNPK or EDIT, then those two
    representations should have the same endian-ness.

    Reality check: Modern architectures tend to have byte-swap and shuffle instructions. They tend not to have BCD-to-ASCII instructions, but
    these can be implemented easily enough with the help of shuffle and
    bitwise instructions. And given that you need to use shuffle anyway,
    the byte-swapping does not cost extra.

    BCD-to-ASCII, with the input in an AVX 32-byte register, so up to 64
    digits, would start with an exchange of the high and low 16-byte halves,
    then a permute of each half to reverse the order. The final single-cycle
    operation is the only overhead of little- vs. big-endian inputs.

    Next we double the size of the input by unpacking the high and low 16
    bytes, each byte value into one of 16 16-bit shorts with the leading byte
    0. Then (in parallel) you copy and mask the low nybble while shifting all
    shorts up by 4 bits, then use the same all-15 mask to save the high nybbles.
    OR these two back together, and do the same for the other half of the
    original input. About 15-20 cycles in total with well under 10% being
    the byte order swap.
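
    A scalar Python stand-in for the data flow above, as a sketch only: it is not AVX code and says nothing about cycle counts. The function name `packed_bcd_le_to_ascii` is made up, and OR-ing in 0x30 to form ASCII digits is an assumption about the final step, which the description leaves implicit.

```python
def packed_bcd_le_to_ascii(packed: bytes) -> str:
    """Convert little-endian packed BCD (two digits per byte) to a digit string.

    Scalar model of the vector sequence: reverse the byte order, split every
    byte into its two nybbles, then turn each nybble into an ASCII digit.
    """
    msd_first = packed[::-1]                  # the byte-order swap step
    out = bytearray()
    for byte in msd_first:
        out.append(0x30 | (byte >> 4))        # high nybble becomes a digit
        out.append(0x30 | (byte & 0x0F))      # low nybble, same all-15 mask
    return out.decode("ascii")
```

    Only the first line of the function body depends on the byte order of the input; the nybble split and merge are the same for either convention.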

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed May 20 21:45:57 2026
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    The B3500 had a clever algorithm for adding BCD numbers. The
    addend and augend could each be from 1 to 100 digits in length.
    The algorithm would start adding from the lowest (most significant
    digit in the longest operand) address of each operand adding
    each digit in turn.

    "The processor uses an adder that accumulates two fields
    from the most significant to the least significant digit
    positions. Reverse addition, as incorporated in the
    B2500 and B3500 systems has the advantage of detecting
    an overflow condition prior to altering the receiving field"

    The algorithm used a 9's counter to track the leading
    digits.

    How did it handle carries? Let's say you're adding

    099999999999999999999999999999999999999999999999999
    000000000000000000000000000000000000000000000000001

    A value that overflows the size of the receiving field
    cannot be represented, so the overflow toggle is set and
    the instruction terminates _without modifying the
    receiving field_.

    The size of the receiving field is the larger of the
    two source fields. So

    ADD 0508 000000 100000 200000

    would add the 5 digit value at address 0 to the
    8 digit value at address 100000 and store the
    result at address 200000.


    If it starts at the high digit, it won't know until it gets to the end
    that it has to propagate carries all the way back to the beginning.

    Actually, that's the clever part. They count 9s.

    Example 1: 10 digit receiving field, 10 digit addend, 1 digit augend:

    Memory contents before:

    000000: 9999999999
    000010: 1

    ADD 1001 000000 000010 000020

    The example he showed had an 11 digit receive field so it would not
    overflow, but the two inputs would cause a full carry propagate all the
    way to the top digit.


    The result of the instruction is that the overflow toggle
    will be set and the destination field will remain unmodified.

    The algorithm implicitly fills leading zeros into
    the shorter operand.

    The first digit of the addend operand is read. '9' in
    this case. The first digit of the augend is added (in this
    case, implicitly zero) and the result is 9. A special
    register (the 9's counter) is incremented and the algorithm
    proceeds to the next digit. Wash, rinse and repeat until
    reaching the last digit, where the sum of 9 + 1 will overflow
    a single digit, so the instruction terminates with overflow.

    If in the case you showed above, there was a zero in the
    first digit of both operands, there is no possibility of

    That's what he showed afair?

    overflow and the algorithm will simply process each
    digit of the addend+augend sequentially from higher
    magnitude to lower magnitude. It delays writing each
    digit of the sum (other than the last) until it knows
    the following digit doesn't overflow. If it does
    overflow, it increments the delayed value before
    writing. To the extent that there are multiple sequential
    9s in the sum, when the next digit would overflow, the
    processor uses the 9's counter and the saved digit to
    store the correct digits to the receiving field.

    There's a flow chart in 1025475_B2500_B3500_RefMan_Oct69.pdf
    which is available on bitsavers.


    So it did process them top-down, but delayed writing anything to the
    output field until it was known that it would not overflow, and the same
    happened for every subsequent partial sum of 9.

    Yeah, that works but it probably caused some output hiccups when a long
    chain of potential carries finally resolved. :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 22:50:04 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 18:07:14 +0000, John Levine wrote:

    On S/370 and later machines with virtual memory it was more complicated
    since it had to check and be sure that all of the pages where the
    operands resided were available.

    Yes, since while the System/360 gave you an error if you tried to use unaligned operands in memory, this restriction was abolished with the System/370. Only an unaligned operand can possibly cross a page boundary, since pages have a power-of-two size greater than the size of any data
    type.

    But this means that even on the System/370, it's a rare event that an instruction will refer to an unaligned operand. So that there is some
    extra overhead for unaligned values might well have been considered acceptable.

    John Savard

  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 00:06:54 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 07:21:04 +0000, quadi wrote:

    So in commenting on a different part of my design entirely, you've
    pointed out an important flaw I will have to correct.

    It's possible that I panicked needlessly, and the conditional branches I support, being the conventional set, are indeed sufficient for unsigned
    values as well; for them, they would have alternate names in assembler,
    but no additional types of branch perhaps are needed.

    I will have to review this point, however, to be sure.

    John Savard
  • From John Levine@johnl@taugh.com to comp.arch on Thu May 21 00:37:39 2026
    From Newsgroup: comp.arch

    It appears that quadi <quadibloc@ca.invalid> said:
    On Wed, 20 May 2026 18:07:14 +0000, John Levine wrote:

    On S/370 and later machines with virtual memory it was more complicated
    since it had to check and be sure that all of the pages where the
    operands resided were available.

    Yes, since while the System/360 gave you an error if you tried to use
    unaligned operands in memory, this restriction was abolished with the
    System/370. Only an unaligned operand can possibly cross a page boundary,
    since pages have a power-of-two size greater than the size of any data
    type.

    While that is true for the RX and RS instructions that do loads and
    stores and arithmetic operations, it is not at all true for the SS
    instructions common in commercial code.

    They have two storage operands with the length specified in the second
    byte of the instruction. Even on S/360 there is no alignment
    requirement for any of the operands. In most cases it can tell the
    sizes of the operands at the time the instruction is decoded, e.g.,
    decimal add (AP) has two four-bit length codes that say how long each
    operand is and move characters (MVC) has a single 8-bit length code
    that applies to both operands.

    But sometimes it is not that simple. Translate and test (TRT) has
    a string operand with a length, and a second 256 byte table operand.
    It fetches the bytes from the string one at a time, looks them up
    in the table, and stops as soon as the looked up value is non-zero,
    putting the address of the source byte and the lookup values in
    R1 and R2. Only the bytes actually fetched have to be resident.
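
    The TRT behaviour described here can be modelled roughly as below. This is an illustrative sketch, not the architected definition: the function name `translate_and_test` is made up, an index stands in for the address placed in R1, and condition codes are ignored.

```python
def translate_and_test(data: bytes, table: bytes):
    """Scan data left to right, looking each byte up in a 256-entry table.

    Stops at the first byte whose table entry is non-zero and returns
    (index_of_that_byte, table_value); returns None if every entry was zero.
    """
    if len(table) != 256:
        raise ValueError("table must have exactly 256 entries")
    for index, byte in enumerate(data):
        value = table[byte]
        if value != 0:
            # Bytes after this point are never fetched, so how much of the
            # string is touched depends on the data itself.
            return index, value
    return None
```

    Because the scan can stop early, the extent of the storage access is not known when the instruction is decoded, which is the complication for paging noted above.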

    The Edit instruction (ED) takes a packed decimal operand and
    a pattern, with the length specifying the length of the pattern.
    It goes through the pattern a byte at a time with some pattern
    bytes ("digit selector") taking the next digit from the input
    operand and others just copied literally. The length of the
    input operand depends on the contents of the pattern.

    To make this work S/370 and its successors first do a trial
    execution of the instruction without storing anything to see
    if it causes a page fault. If not, it then redoes the
    instruction for real, storing the result. I suspect that
    if they had known how soon S/370 would add paging to the 360
    architecture, they might have designed these instructions
    differently.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 02:18:27 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 00:37:39 +0000, John Levine wrote:

    To make this work S/370 and its successors first do a trial execution of
    the instruction without storing anything to see if it causes a page
    fault. If not, it then redoes the instruction for real, storing the
    result. I suspect that if they had known how soon S/370 would add
    paging to the 360 architecture, they might have designed these
    instructions differently.

    When I first read that, I thought that you meant they would have designed
    it differently when they designed the 370, but, of course, the
    instructions already existed. After I realized my mistake, of course, I
    also knew that back in 1964 or before, there was really no way that they
    could possibly have known that.

    John Savard

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 02:33:51 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    So instead I decided to only support double precision, and use the
    extra bits to allow additional ways to specify registers.

    My 66000 started out that way and the compiler showed that this choice
    sucks.

    The good news is that this only concerns the 16-bit short instructions. A compiler can choose to ignore them if it can't handle them.

    Currently, the 16-bit instructions provide the following:

    All the basic operate instructions for two integer types; they can only operate on the first eight integer registers.

    The basic floating operate instructions for one floating-point type; the register specification is the one used with Concertina II's paired 15-bit operate instructions; choose one of four banks of eight registers, and
    both operands must be in that bank.

    The idea is that it can be used for efficient pipelined code where four sequences of instructions which are independent are interleaved.

    Everything else is straightforward; the 24-bit short instructions and all
    the 32-bit and longer instructions that operate on registers allow the use
    of all 32 registers in a bank.

    Of course, though, the other restrictions are still present - seven
    choices for an index register, seven choices for a base register (for each
    of three displacement sizes, 20, 16, and 12 bits).

    I think I have indeed achieved the goal which, when I started out, I
    thought might prove to be an "impossible dream" - combining what a CISC instruction set offers with what a RISC instruction set offers, and yet
    doing so without making the instructions longer than they usually are in
    those instruction types.

    Except for register-to-register operate instructions being 24 bits instead
    of 16 bits, this has been achieved - but for a very limited subset of the possible register-to-register operate instructions, chosen by me as the
    ones I think are the most useful and popular - and I realize the choice is subjective and hence potentially controversial - the 16-bit instruction
    length is retained!

    I think it's an ISA that, in this respect, has achieved more than anyone
    could have expected!

    Now, of course, whether or not this is an achievement that anyone cares
    about, that anyone wants, that anyone is interested in... well, I don't
    know.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 06:12:11 2026
    From Newsgroup: comp.arch

    I had a tiny bit of unused opcode space within the 32-bit operate instructions.

    As well, there were a couple of lengths of instructions longer than 32
    bits which were allocated more opcode space than they actually needed.

    That let me move those two lengths of instructions, plus one other length
    of instructions longer than 32 bits which kept its entire, though small, allocation of opcode space, into that unused space.

    And that let me increase the opcode space allocated to 16-bit short instructions from 1/16th of the opcode space to 3/32nds of the opcode
    space.

    Which allowed me to give them a much simpler and plainer format, of which
    it finally could be argued - without the claim being utterly laughable -
    that they offer just about what 16-bit short instructions do in a CISC architecture.

    So now the 16-bit short instructions have all 96 basic operate opcodes, so they can perform all the basic operations on all the basic integer and floating-point types.

    They are in all cases now limited to just the first eight registers. So
    this is inferior to the System/360, which has sixteen, but it matches the 680x0 which had eight.
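
    The numbers above are at least consistent with each other. A quick check,
    assuming (this is not stated in the post) that each 16-bit short
    instruction carries one opcode and two 3-bit register fields, and that
    "opcode space" means the share of all 16-bit patterns:

```python
from fractions import Fraction

OPCODES = 96          # basic operate opcodes, from the post
REGISTERS = 8         # only the first eight registers are reachable
TOTAL = 2 ** 16       # all possible 16-bit patterns

# Assumption: one destination and one source register field per instruction.
encodings = OPCODES * REGISTERS * REGISTERS
share = Fraction(encodings, TOTAL)      # fraction of the 16-bit space used
growth = share / Fraction(1, 16)        # relative to the earlier 1/16th
```

    Under those assumptions 96 opcodes times 64 register pairs is 6144
    encodings, which is exactly 3/32 of 65536, or 1.5 times the earlier 1/16.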

    Finally, I have achieved my dream, insane and useless though it may be!

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 21 06:29:29 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    When I first read that, I thought that you meant they would have designed
    it differently when they designed the 370, but, of course, the
    instructions already existed. After I realized my mistake, of course, I
    also knew that back in 1964 or before, there was really no way that they
    could possibly have known that.

    The Atlas existed in 1962 and did have paging. So it was possible.
    Is it excusable that the S/360 designers did not consider this
    development at the time? Probably, although according to <https://en.wikipedia.org/wiki/Atlas_(computer)> "it was a 1959
    description of Muse [the 1959 name for Atlas] that gave CDC ideas that significantly accelerated the development of the 6600 and allowed it
    to be delivered earlier than originally estimated".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 07:03:47 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 00:06:54 +0000, quadi wrote:

    I will have to review this point, however, to be sure.

    Although I have not yet completed that review, it has become apparent
    that, since I want the compare instruction to produce a correct result for signed numbers even if one is comparing, say, a positive number and a
    negative number which are both over half of the maximum possible magnitude
    for their format... it will be necessary to have a special compare
    instruction for unsigned integers.

    Since there is opcode space for that readily available, though, there is
    no difficulty in adding that.

    John Savard

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 21 10:29:12 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    But 64 bits are not enough for all binary numbers and probably not for
    all BCD numbers, either: the decimal FP people were not satisfied with
    the 15-digit mantissa that are easily possible with their
    representations in 64 bits; they did not even define a decimal64
    format last I checked. So will 16-digit BCD numbers be satisfactory?

    ieee754 does define decimal64, decimal128 and even decimal32, but the
    first two have pretty much all the actual usage, probably (?) decimal128
    as the majority, at least for all accumulators.

    I should check half-known things before I make claims in a posting.

    Anyway, looking at <https://en.wikipedia.org/wiki/Decimal64_floating-point_format>, I see
    that Decimal64 even has 16 digits of mantissa. So 15 digits is not
    enough. (And, as an aside, they complicated things by not specifying
    a 54-bit mantissa, but combining the exponent with the upper bits of
    the mantissa).

    To the point: these 16 digits are not enough, as the lack of
    popularity of decimal64 (even relative to decimal128) shows, so 64-bit
    BCD numbers are not enough in all cases, either.

    Reality check: Modern architectures tend to have byte-swap and shuffle
    instructions. They tend not to have BCD-to-ASCII instructions, but
    these can be implemented easily enough with the help of shuffle and
    bitwise instructions. And given that you need to use shuffle anyway,
    the byte-swapping does not cost extra.

    BCD-to-ASCII, with the input in an AVX 32-byte register, so up to 64
    digits, would start with an exchange of the high and low 16-byte halves,
    then a permute of each half to reverse the order. The final single-cycle
    operation is the only overhead of the little vs high-endian inputs.

    Next we duplicate the input by unpacking the high and low 16 bytes into
    each byte value into 16 16-bit shorts, with the leading byte 0, then (in
    parallel) you copy and mask the low nybble while shifting all shorts up
    by 4 bits, then use the same all-15 mask to save the high nybbles.
    OR these two back together, and do the same for the other half of the
    original input. About 15-20 cycles in total with well under 10% being
    the byte order swap.

    My thinking was along the lines of using VPERMB to do the
    byte-swapping, the duplicating, and the unpacking in one step. E.g.,
    if you have a 64-bit BCD number 1234567890123456 as the following
    sequence of bytes

    56 34 12 90 78 56 34 12

    Then you have the index vector

    7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0

    and VPERMB xmm1, xmm2, xmm3

    (where the BCD number is in xmm3 and the index vector is in xmm2) will
    put the following in xmm1:

    12 12 34 34 56 56 78 78 90 90 12 12 34 34 56 56

    So no extra instruction for the byte swapping.
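
    To make the permute step concrete, here is a small scalar model of a
    128-bit byte permute applied to the example above. `vpermb_model` is a
    made-up helper that only imitates the lane selection (result byte i is
    the source byte picked by the low four bits of index byte i); it says
    nothing about timing or masking.

```python
def vpermb_model(index, source):
    """Byte permute in the style of 128-bit VPERMB.

    result[i] = source[index[i] & 15]
    """
    assert len(index) == 16 and len(source) == 16
    return bytes(source[i & 0x0F] for i in index)


bcd = bytes.fromhex("5634129078563412")      # little-endian 1234567890123456
source = bcd + bytes(8)                      # upper half of the register unused
index = bytes([7, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1, 0, 0])
result = vpermb_model(index, source)
```

    The single permute both reverses the byte order and duplicates each byte,
    which is why the little-endian layout needs no separate swap here.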

    The problem is that I now would like a masked parallel byte shift to
    shift the even-indexed bytes right by 4 bits, but I don't find
    parallel byte shifts. I guess the answer is to let the VPERMB arrange
    the result as follows

    1234 1234 5678 5678 9012 9012 3456 3456
    ^^^^ ^^^^ ^^^^ ^^^^

    then use a masked VPSRLW for shifting the marked 16-bit pieces to the
    right by 4 bits, resulting in

    0123 1234 0567 5678 0901 9012 0345 3456

    Now use VPSHUFB or VPERMB to rearrange the bytes in the intended order:

    01 12 23 34 45 56 67 78 89 90 01 12 23 34 45 56

    Now mask away the top 4 bits of each byte with VPAND and turn it into
    ASCII by VPORing every byte with 0x30.

    And the whole thing can be done with BCD numbers of up to 64 digits
    per pass.

    The absence of VPSRLB caused an additional instruction, but that's
    also necessary for dealing with big-endian BCD numbers. So storing
    the BCD numbers in little-endian format costs no additional
    instruction.

    VPERMB is not in AVX2, so if you want to limit yourself to that,
    little-endian needs an extra instruction indeed.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 21 11:52:51 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    On Wed, 20 May 2026 15:42:03 +0000, Anton Ertl wrote:
    quadi <quadibloc@ca.invalid> writes:
    Reality check: Modern architectures tend to have byte-swap and shuffle
    instructions. They tend not to have BCD-to-ASCII instructions, but
    these can be implemented easily enough with the help of shuffle and
    bitwise instructions. And given that you need to use shuffle anyway,
    the byte-swapping does not cost extra.

    An additional instruction is an additional instruction!

    There is no additional instruction. VPERMB does the byte swapping and
    byte duplication at the same time, see <2026May21.122912@mips.complang.tuwien.ac.at>.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 21 12:04:40 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    BCD-to-ASCII, with the input in an AVX 32-byte register, so up to 64
    digits, would start with an exchange of the high and low 16-byte halves,
    then a permute of each half to reverse the order. The final single-cycle
    operation is the only overhead of the little vs high-endian inputs.

    Next we duplicate the input by unpacking the high and low 16 bytes into
    each byte value into 16 16-bit shorts, with the leading byte 0, then (in
    parallel) you copy and mask the low nybble while shifting all shorts up
    by 4 bits, then use the same all-15 mask to save the high nybbles.
    OR these two back together, and do the same for the other half of the
    original input. About 15-20 cycles in total with well under 10% being
    the byte order swap.

    My thinking was along the lines of using VPERMB to do the
    byte-swapping, the duplicating, and the unpacking in one step. E.g.,
    if you have a 64-bit BCD number 1234567890123456 as the following
    sequence of bytes

    56 34 12 90 78 56 34 12

    Then you have the index vector

    7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0

    and VPERMB xmm1, xmm2, xmm3

    (where the BCD number is in xmm3 and the index vector is in xmm2) will
    put the following in xmm1:

    12 12 34 34 56 56 78 78 90 90 12 12 34 34 56 56

    So no extra instruction for the byte swapping.

    The problem is that I now would like a masked parallel byte shift to
    shift the even-indexed bytes right by 4 bits, but I don't find
    parallel byte shifts. I guess the answer is to let the VPERMB arrange
    the result as follows

    1234 1234 5678 5678 9012 9012 3456 3456
    ^^^^ ^^^^ ^^^^ ^^^^

    then use a masked VPSRLW for shifting the marked 16-bit pieces to the
    right by 4 bits, resulting in

    0123 1234 0567 5678 0901 9012 0345 3456

    Now use VPSHUFB or VPERMB to rearrange the bytes in the intended order:

    01 12 23 34 45 56 67 78 89 90 01 12 23 34 45 56

    I have a better approach:

    First do the shifting with, e.g. VPSRLW, with the result in a new
    register. So you now have

    56 34 12 90 78 56 34 12 #original data
    0563 0129 0785 0341 #shifted version

    Now you use VPERMT2B to rearrange the bytes from both registers into a
    third one, doing the byte-swapping while you are at it, resulting in:

    41 12 03 34 85 56 07 78 29 90 01 12 63 34 05 56

    The remainder uses VPAND and VPOR, as described earlier.

    If you have BCD numbers with more than 64, but at most 128 digits, the
    first step would only have to be performed once. You would then use
    two VPERMI2B instructions with different index inputs to produce the
    64 least significant and the 64 most significant digits, and the VPAND
    and VPOR would also have to be duplicated.

    So 4 central instructions for a BCD number with up to 64 digits, and 7
    for up to 128 digits. In addition, you need the VPERMT2B index, the
    VPSRLW shift amounts and the other operand for VPAND and VPOR in
    registers, but if you are converting a lot of BCD numbers, you may
    already have them in registers when you convert the next BCD number.
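
    A scalar model of the four-step sequence (shift, two-table permute, AND,
    OR) for one 16-digit number may help. All function names are invented,
    the permute is modelled as a plain lookup into the two registers
    concatenated, and the index vector is one I derived for little-endian
    16-bit lanes, so treat this as a sketch of the idea rather than as the
    exact constants one would load.

```python
def vpsrlw_model(data, count):
    """Shift every little-endian 16-bit lane right by count bits."""
    out = bytearray()
    for i in range(0, len(data), 2):
        word = int.from_bytes(data[i:i + 2], "little") >> count
        out += word.to_bytes(2, "little")
    return bytes(out)


def two_table_permute(index, first, second):
    """VPERMT2B-like lookup: each index byte selects from first + second."""
    table = first + second
    return bytes(table[i % len(table)] for i in index)


def bcd64_to_ascii(bcd):
    """Convert 8 bytes of little-endian packed BCD to 16 ASCII digits."""
    original = bytes(bcd) + bytes(8)          # pad to one 16-byte register
    shifted = vpsrlw_model(original, 4)       # high nibbles move to low nibbles
    index = []
    for k in range(8):                        # most significant byte first
        index.append(16 + 7 - k)              # high digit: from shifted copy
        index.append(7 - k)                   # low digit: from original
    mixed = two_table_permute(index, original, shifted)
    masked = bytes(b & 0x0F for b in mixed)   # VPAND with 0x0F in every byte
    return bytes(b | 0x30 for b in masked)    # VPOR with 0x30 in every byte


text = bcd64_to_ascii(bytes.fromhex("5634129078563412"))
```

    The byte reversal lives entirely in the index vector, so in this model a
    big-endian input would need a different index but the same number of
    steps.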

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 13:13:23 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 06:12:11 +0000, quadi wrote:

    Finally, I have achieved my dream, insane and useless though it may be!

    Someone once suggested that, if a genie grants you three wishes, you
    should use one of them to wish for more wishes.

    Well, I have taken the opportunity to squeeze one more little thing into
    the instruction set that Concertina III had, but this time I could not
    squeeze quite as many of them in... 16-bit prefixes for instructions,
    which allow the instruction set to be extended.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 13:22:55 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 13:13:23 +0000, quadi wrote:

    Well, I have taken the opportunity to squeeze one more little thing into
    the instruction set that Concertina III had, but this time I could not squeeze quite as many of them in... 16-bit prefixes for instructions,
    which allow the instruction set to be extended.

    I've taken the opportunity now, before things go on, to modify this
    addition in one important way: I've precluded the possibility that the complexity of instruction length encoding might grow without bounds by specifying the length scheme now for any prefixed instructions that might
    be added.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 13:42:09 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 07:03:47 +0000, quadi wrote:

    it will be necessary to have a special
    compare instruction for unsigned integers.

    I have now back-propagated this needful change to Concertina II. The description of Concertina III hadn't gotten to the point where this would
    be placed.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu May 21 14:36:13 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Scott Lurndal wrote:

    <snip>


    overflow and the algorithm will simply process each
    digit of the addend+augend sequentially from higher
    magnitude to lower magnitude. It delays writing each
    digit of the sum (other than the last) until it knows
    the following digit doesn't overflow. If it does
    overflow, it increments the delayed value before
    writing. To the extent that there are multiple sequential
    9s in the sum, when the next digit would overflow, the
    processor uses the 9's counter and the saved digit to
    store the correct digits to the receiving field.

    There's a flow chart in 1025475_B2500_B3500_RefMan_Oct69.pdf
    which is available on bitsavers.
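
    The delayed-write scheme can be sketched in a few lines. This is my own
    reconstruction from the description above, not a transcription of the
    flow chart: `add_msd_first`, `pending` and `nines` are made-up names, and
    how the real machine detected overflow is in the snipped part, so the
    sketch simply reports a carry out of the top digit.

```python
def add_msd_first(a, b):
    """Add two equal-length digit lists, most significant digit first.

    Returns (sum_digits, overflow). A digit is held back until the next
    digit sum shows whether a carry will reach it.
    """
    out = []
    pending = None    # delayed digit not yet written
    nines = 0         # consecutive digit sums of 9 after the pending digit
    overflow = False
    for x, y in zip(a, b):
        s = x + y
        if s == 9:
            nines += 1                 # a later carry could still ripple through
        elif s < 9:
            # No carry can reach the held digits: write them unchanged.
            if pending is not None:
                out.append(pending)
            out.extend([9] * nines)
            pending, nines = s, 0
        else:
            # Carry: bump the held digit and turn the run of 9s into 0s.
            if pending is None:
                overflow = True        # carry out of the most significant digit
            else:
                out.append(pending + 1)
            out.extend([0] * nines)
            pending, nines = s - 10, 0
    if pending is not None:
        out.append(pending)
    out.extend([9] * nines)
    return out, overflow
```

    The held digit is never 9 (a sum of 9 goes into the counter instead), so
    incrementing it cannot itself produce a carry.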


    So it did process them top-down, but delayed writing anything to the
    output field until it was known that it would not overflow, and the same
    happened for every subsequent partial sum of 9.

    Yeah, that works but it probably caused some output hiccups when a long
    chain of potential carries finally resolved. :-)

    The maximum size of an operand was 100 digits.

    To add to the potential for a long hiccup, each of the operands
    could be indirect, which in turn could point to indirect
    operands ad infinitum. A processor timer was started with
    each instruction, and if it expired before the instruction
    finished, the processor would raise a fault and the application
    would be terminated.

    There were also search table and linked list instructions, which had
    variable timing depending on the number of entries in the
    list or table (the instruction timer would handle infinite
    loops in the list).

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu May 21 15:41:17 2026
    From Newsgroup: comp.arch

    According to quadi <quadibloc@ca.invalid>:
    On Thu, 21 May 2026 00:37:39 +0000, John Levine wrote:
    result. I suspect that if they had known how soon S/370 would add
    paging to the 360 architecture, they might have designed these
    instructions differently.

    When I first read that, I thought that you meant they would have designed
    it differently when they designed the 370, but, of course, the
    instructions already existed. After I realized my mistake, of course, I
    also knew that back in 1964 or before, there was really no way that they
    could possibly have known that.

    According to Pugh et al., IBM Research was quite aware of Atlas and
    was doing its own work on one-level store and time sharing. They were
    also close to CTSS at MIT Project MAC. Atlas' performance was terrible
    (later solved partly by better paging schemes but mostly by larger
    real memory) and I get the impression that there was an internal
    institutional bias that only batch was real computing and time sharing
    was somewhere between a niche and a fad.

    The MIT people were deeply disappointed when S/360 had no memory
    mapping at all, which led to Multics switching from IBM to GE for its
    new computer. IBM then came out with the 360/67 which had quite decent
    virtual memory but it was too late. It didn't help that its intended
    main operating system was TSS which was overambitious and didn't work.
    Lucky for them CP/67 escaped from the lab to become VM/370.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 21 18:26:32 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    So instead I decided to only support double precision, and use the
    extra bits to allow additional ways to specify registers.

    My 66000 started out that way and the compiler showed that this choice sucks.

    The good news is that this only concerns the 16-bit short instructions. A compiler can choose to ignore them if it can't handle them.

    Currently, the 16-bit instructions provide the following:

    All the basic operate instructions for two integer types; they can only operate on the first eight integer registers.

    I suspect you (and compiler) will end up not liking the restriction.

    The basic floating operate instructions for one floating-point type; the register specification is the one used with Concertina II's paired 15-bit operate instructions; choose one of four banks of eight registers, and
    both operands must be in that bank.

    I suspect you (and compiler) will end up not liking the restriction.

    The idea is that it can be used for efficient pipelined code where four sequences of instructions which are independent are interleaved.

    I suspect you (and compiler) will end up not finding that much parallelism.

    Everything else is straightforward; the 24-bit short instructions and all the 32-bit and longer instructions that operate on registers allow the use of all 32 registers in a bank.

    Of course, though, the other restrictions are still present - seven
    choices for an index register, seven choices for a base register (for each of three displacement sizes, 20, 16, and 12 bits).

    I think I have indeed achieved the goal which, when I started out, I
    thought might prove to be an "impossible dream" - combining what a CISC instruction set offers with what a RISC instruction set offers, and yet doing so without making the instructions longer than they usually are in those instruction types.

    Except for register-to-register operate instructions being 24 bits instead of 16 bits, this has been achieved - but for a very limited subset of the possible register-to-register operate instructions, chosen by me as the
    ones I think are the most useful and popular - and I realize the choice is subjective and hence potentially controversial - the 16-bit instruction length is retained!

    I think it's an ISA that, in this respect, has achieved more than anyone could have expected!

    Now, of course, whether or not this is an achievement that anyone cares about, that anyone wants, that anyone is interested in... well, I don't know.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 21 18:32:48 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Thu, 21 May 2026 00:06:54 +0000, quadi wrote:

    I will have to review this point, however, to be sure.

    Although I have not yet completed that review, it has become apparent
    that, since I want the compare instruction to produce a correct result for signed numbers even if one is comparing, say, a positive number and a negative number which are both over half of the maximum possible magnitude for their format... it will be necessary to have a special compare instruction for unsigned integers.

    Or a wider condition register !

    Since there is opcode space for that readily available, though, there is
    no difficulty in adding that.

    John Savard

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 22:14:51 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 07:03:47 +0000, quadi wrote:

    Although I have not yet completed that review, it has become apparent
    that, since I want the compare instruction to produce a correct result
    for signed numbers even if one is comparing, say, a positive number and
    a negative number which are both over half of the maximum possible
    magnitude for their format... it will be necessary to have a special
    compare instruction for unsigned integers.

    I have now given the matter thought, and I found that it would indeed be necessary to add an extra bit to all the conditional jump, branch, or set
    flag instructions to indicate the test was being applied to the condition
    code settings left after an integer arithmetic instruction on integers
    deemed to be unsigned.

    Amazingly enough, however, it turned out that in each case there was no difficulty in finding the additional opcode space that was needed.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 22:21:53 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 18:32:48 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    it will be necessary to have a special
    compare instruction for unsigned integers.

    Or a wider condition register !

    A wider condition register isn't enough by itself.

    I have now realized that I will have to add a bit to the conditional
    branch instructions. Amazingly, though, that bit was readily available
    without much trouble.

    In the case of conditional branches after integer arithmetic, a wider condition register might be needed, although it seems that carry,
    overflow, negative, and zero will suffice.

    The compare instruction in my ISA _does not_ return the same condition
    codes as the subtract instruction. So if I compare bytes, the compare instruction will correctly indicate that -100 is less than 100. The fact
    that if you subtracted -100 from 100 as byte values, you wouldn't get 200, since that doesn't fit into a signed byte, but the negative value -56 is neither here nor there.

    Because of this special handling of the MSB, I do need a different compare instruction - not just the modified branch instructions for unsigned
    values - to yield correct behavior.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 23:44:34 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 18:26:32 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    Currently, the 16-bit instructions provide the following:

    All the basic operate instructions for two integer types; they can only
    operate on the first eight integer registers.

    I suspect you (and compiler) will end up not liking the restriction.

    I don't like the restriction, but since there's not much opcode space available, there's not much I can do.

    The basic floating operate instructions for one floating-point type;
    the register specification is the one used with Concertina II's paired
    15-bit operate instructions; choose one of four banks of eight
    registers, and both operands must be in that bank.

    I suspect you (and compiler) will end up not liking the restriction.

    The compiler will, indeed, probably have difficulty dealing with a kind of restriction that no one else has ever put in an ISA.

    But this is moot now. I've found some additional opcode space for 16-bit
    short instructions. Not much, just enough to increase the available opcode space by a factor of 1.5.

    So now all the operations are restricted to only the first eight registers
    - but 16-bit short instructions now support all the basic data types.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 23:46:05 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 22:14:51 +0000, quadi wrote:

    Amazingly enough, however, it turned out that in each case there was no difficulty in finding the additional opcode space that was needed.

    I even managed to find enough opcode space to increase the size of the displacement field from 8 bits to 9 bits in all the branch instructions,
    so that having 24-bit short instructions doesn't shorten their range.
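
    One way to read the range remark, under the assumption (mine, not stated
    above) that the old 8-bit displacement counted 16-bit units while the new
    one has to count bytes because 24-bit instructions can start on odd
    addresses:

```python
def signed_range(bits, unit_bytes):
    """Byte range reachable by a signed displacement counted in units."""
    lowest = -(1 << (bits - 1)) * unit_bytes
    highest = ((1 << (bits - 1)) - 1) * unit_bytes
    return lowest, highest


halfword_8 = signed_range(8, 2)   # 8-bit field, 16-bit units: (-256, 254)
byte_8 = signed_range(8, 1)       # 8-bit field, byte units:   (-128, 127)
byte_9 = signed_range(9, 1)       # 9-bit field, byte units:   (-256, 255)
```

    On that reading the extra bit exactly makes up for the finer granularity.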

    John Savard

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri May 22 02:20:14 2026
    From Newsgroup: comp.arch

    On Thu, 21 May 2026 23:46:05 +0000, quadi wrote:

    I even managed to find enough opcode space to increase the size of the displacement field from 8 bits to 9 bits in all the branch instructions,
    so that having 24-bit short instructions doesn't shorten their range.

    However, there were a number of serious mistakes on the page, which I have
    now corrected.

    John Savard

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri May 22 07:22:05 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    Although I have not yet completed that review, it has become apparent
    that, since I want the compare instruction to produce a correct result for
    signed numbers even if one is comparing, say, a positive number and a
    negative number which are both over half of the maximum possible magnitude
    for their format... it will be necessary to have a special compare
    instruction for unsigned integers.

    The fact that IA-32/AMD64 and ARM A64 do not have a special compare
    instruction for unsigned integers (and manage to do with NCVZ) shows
    that this is unnecessary. What you do for your "if (-100<100)" case
    is encode it (on AMD64) as

    cmpb %r8b, %r9b #note that AT&T syntax has the arguments reversed
    jnl target
    ... code to execute if r9<r8 ...
    target:

    And JNL (jump if not less) tests for N=V (the Intel manual writes SF=OF).

    See <https://www.felixcloutier.com/x86/jcc>
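
    For anyone who wants to check the -100 versus 100 case by hand, here is a
    small Python model of the flags an 8-bit compare or subtract produces.
    The helper names are made up, and C is modelled as the x86-style borrow
    flag.

```python
def sub8_flags(a, b):
    """Flags from the 8-bit subtraction a - b, as CMP or SUB would set them.

    a and b are Python ints in -128..255; only their low 8 bits are used.
    C is modelled as the x86-style borrow flag (set when a < b unsigned).
    """
    ua, ub = a & 0xFF, b & 0xFF
    result = (ua - ub) & 0xFF
    n = result >> 7                        # sign bit of the 8-bit result
    z = int(result == 0)
    c = int(ua < ub)                       # borrow out
    # Signed overflow: operand signs differ and the result sign differs from a.
    v = ((ua ^ ub) & (ua ^ result)) >> 7
    return {"N": n, "Z": z, "C": c, "V": v, "result": result}


def signed_less(a, b):
    flags = sub8_flags(a, b)
    return flags["N"] != flags["V"]        # JL; JNL is the complement (N == V)


def unsigned_less(a, b):
    return sub8_flags(a, b)["C"] == 1      # JB
```

    Comparing -100 with 100 wraps around and sets V, and N != V still gives
    the right signed answer; the same flags read through C give the unsigned
    answer for the bit patterns 156 and 100.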

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri May 22 07:35:36 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    The compare instruction in my ISA _does not_ return the same condition
    codes as the subtract instruction. So if I compare bytes, the compare
    instruction will correctly indicate that -100 is less than 100. The fact
    that if you subtracted -100 from 100 as byte values, you wouldn't get 200,
    since that doesn't fit into a signed byte, but the negative value -56 is
    neither here nor there.

    8086, IA-32, AMD64, and AFAIK ARM A64 produce the same condition codes
    for compare and subtract instructions. That the subtract instruction
    writes back the result does not influence the condition codes. The
    fact that you see an overflow/underflow if you byte-subtract/compare
    100 with -100 and want to interpret the result as a signed byte is
    reflected in the overflow flag for both subtract and compare, and the conditional jumps for signed <, <=, >, >= take the overflow flag into
    account (as well as the sign flag, and, in some cases, the zero flag).

    Because of this special handling of the MSB, I do need a different compare
    instruction - not just the modified branch instructions for unsigned
    values - to yield correct behavior.

    You only need that if your flags are insufficiently expressive (i.e.,
    less powerful than NCZV).

    An interesting case is PowerPC (and Power). It stores < = > flags
    (for comparisons, for other instructions it's <0, =0, and >0) and a
    sticky overflow flag in one of the CRs (for many instructions, CR0,
    for comparison instructions, the CR can be selected). It has an
    overflow flag and a carry flag elsewhere, so it could use the
    subtraction instruction together with these flags for both signed and
    unsigned conditional branches, but instead it has unsigned and signed comparisons, and the conditional branches are only conditional on
    flags in a CR register.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri May 22 15:48:18 2026
    From Newsgroup: comp.arch

    On Fri, 22 May 2026 07:35:36 +0000, Anton Ertl wrote:

    You only need that if your flags are insufficiently expressive (i.e.,
    less powerful than NCZV).

    While the System/360 had only two condition code bits, I do plan to have
    full VZNC bits. However, unlike the System/360, I do not have a complete
    set of sixteen conditional branch instructions. I just have twelve: eight instructions for testing between negative, zero, and positive nonzero in
    any combination, and instructions for separately testing for carry and overflow.

    However, if I have enough opcode space to add a U bit to all the
    conditional branch instructions, then I also have enough opcode space to
    fix that instead, so I likely will rework this part of the ISA into
    something more conventional.

    I want a compare instruction which, for integers, isn't fooled by
    overflows - and overflows happen at a different point in the two's
    complement number circle for signed and unsigned; for unsigned, basically carry takes the role of overflow. And I don't want to have to do two instructions for the conditional branch afterwards to handle that. So I
    _may_ still need a separate compare unsigned, even though the rest of your points are well taken.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri May 22 21:22:48 2026
    From Newsgroup: comp.arch

    On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

    However, if I have enough opcode space to add a U bit to all the
    conditional branch instructions, then I also have enough opcode space to
    fix that instead, so I likely will rework this part of the ISA into
    something more conventional.

    I have made the first set of changes, using five-bit condition code fields
    to nicely and fully handle both the signed and unsigned cases; I checked
    what the Motorola 68000 did, and found that it only provided a complete
    set of tests for signed values, but only two tests for unsigned ones.
    (Worse yet, it used separate condition codes for floating-point numbers,
    which makes sense, given that they were originally in a coprocessor, but
    that means an extra set of instructions is needed.)

    So, while it used a four-bit condition code field, I needed a five-bit one.

    I did notice it didn't just always fail the signed tests if overflow was present; instead, in that case it switched plus and minus. Given that, and treating carry the same way for unsigned tests, you likely are right that
    an unsigned compare is not needed. Oh, wait; my assumed behavior that everything should just fail if there's an overflow... is reasonable for floating-point numbers.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 23 08:36:49 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    On Fri, 22 May 2026 07:35:36 +0000, Anton Ertl wrote:

    You only need that if your flags are insufficiently expressive (i.e.,
    less powerful than NCZV).

    While the System/360 had only two condition code bits, I do plan to have
    full VZNC bits. However, unlike the System/360,

    The S/360 is a mess as far as dealing with conditions is concerned.
    Or is there a great underlying principle involved, and I fail to see
    it? I doubt it, for the following reasons: 1) I have not come across
    any description that explained the underlying principle, and in fact I
    have come across few descriptions at all. 2) In the 62 years that
    S/360 has been available, it has not found any successors in its
    particular approach to conditions.

    So my recommendation is that you look at other architectures for
    inspiration. 8086, 88000, MIPS/Alpha/RISC-V (including the
    differences between them), and IA-64 all have quite different
    approaches that are worthy of study. And if you want to look for
    something unproven, look at <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.

    I do not have a complete
    set of sixteen conditional branch instructions. I just have twelve: eight
    instructions for testing between negative, zero, and positive nonzero in
    any combination, and instructions for separately testing for carry and
    overflow.
    ...
    I want a compare instruction which, for integers, isn't fooled by
    overflows - and overflows happen at a different point in the two's
    complement number circle for signed and unsigned; for unsigned, basically
    carry takes the role of overflow. And I don't want to have to do two
    instructions for the conditional branch afterwards to handle that.

    What's this thing about "two instructions for the conditional branch afterwards"? On the 8086, if you want to branch on signed <, you use
    JL, and if you want to branch on unsigned <, you use JB; each of them
    is one instruction (and the 8086 has IIRC signed and unsigned <= > >=,
    too).

    If you mean the opcode space, then yes, you may use less opcode space
    if you have a signed and unsigned comparison, and fewer conditional
    branches (depending on how large a proportion of your opcode space the
    respective instructions take). You can also save opcode space by
    leaving out the <= and > conditions (reverse the operands of < and >=).
    One question in such a design is if there are cases where you
    want to have the unsigned and signed conditions for the same operands,
    but it's probably rare enough that it is not a big disadvantage that
    you need to use both comparison instructions for those cases (at least
    I have never seen a complaint about this aspect of PowerPC).
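
    The operand-swap saving can be sketched as a small lowering table.
    This is only an illustration: PRIMITIVES, LOWERING and evaluate are
    hypothetical names, and the table assumes the ISA keeps only < and
    >= as native conditions.

```python
# Conditions the hypothetical ISA implements directly.
PRIMITIVES = {
    "<": lambda a, b: a < b,
    ">=": lambda a, b: a >= b,
}

# How each source-level condition maps to (native condition, swap operands?).
LOWERING = {
    "<": ("<", False),
    ">=": (">=", False),
    "<=": (">=", True),   # a <= b  is the same as  b >= a
    ">": ("<", True),     # a > b   is the same as  b < a
}


def evaluate(cond, a, b):
    """Evaluate cond(a, b) using only the two native conditions."""
    native, swap = LOWERING[cond]
    if swap:
        a, b = b, a
    return PRIMITIVES[native](a, b)


print(evaluate("<=", 3, 3), evaluate(">", 3, 3))
```

    The swap costs nothing when both operands are registers; it only
    becomes awkward when one operand is an immediate that the compare
    instruction accepts in a single position.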

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 23 09:28:45 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

    However, if I have enough opcode space to add a U bit to all the
    conditional branch instructions, then I also have enough opcode space to
    fix that instead, so I likely will rework this part of the ISA into
    something more conventional.

    I have made the first set of changes, using five-bit condition code fields
    to nicely and fully handle both the signed and unsigned cases; I checked
    what the Motorola 68000 did, and found that it only provided a complete
    set of tests for signed values, but only two tests for unsigned ones.

    I see four tests for unsigned conditions on the 68000 <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

    HI >
    LS <=
    CC >=
    CS <

    For the signed ones there is

    GT >
    LE <=
    GE >=
    LT <
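
    The eight tests above can be checked against the NZCV flags with a
    short Python sketch. The names cmp_flags, signed and TESTS are
    invented for this example, and the flags are modelled for an 8-bit
    compare of destination minus source, with C meaning borrow.

```python
def signed(x, bits=8):
    """Interpret the low `bits` bits of x as a two's complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x & (1 << (bits - 1)) else x


def cmp_flags(dst, src, bits=8):
    """Flags produced by comparing dst with src (computes dst - src)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    d, s = dst & mask, src & mask
    r = (d - s) & mask
    return {
        "N": bool(r & sign),
        "Z": r == 0,
        "C": d < s,                               # borrow
        "V": bool((d ^ s) & (d ^ r) & sign),      # signed overflow
    }


TESTS = {
    # unsigned
    "HI": lambda f: not f["C"] and not f["Z"],    # >
    "LS": lambda f: f["C"] or f["Z"],             # <=
    "CC": lambda f: not f["C"],                   # >=
    "CS": lambda f: f["C"],                       # <
    # signed
    "GT": lambda f: f["N"] == f["V"] and not f["Z"],
    "LE": lambda f: f["Z"] or f["N"] != f["V"],
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
}

f = cmp_flags(0x9C, 0x64)   # 156 vs 100 unsigned, -100 vs 100 signed
print({name: test(f) for name, test in TESTS.items()})
```

    One compare feeds all eight tests; only the branch condition decides
    whether the operands are treated as signed or unsigned.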

    my assumed behavior that
    everything should just fail if there's an overflow... is reasonable for
    floating-point numbers.

    The usual setup is that FP operations silently overflow to +INF and
    underflow to -INF. They do set sticky flags (called "exceptions" in
    the IEEE FP standard) on various conditions, including on overflows,
    but also on rounding errors ("inexact").

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 16:19:35 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

    However, if I have enough opcode space to add a U bit to all the conditional branch instructions, then I also have enough opcode space to fix that instead, so I likely will rework this part of the ISA into something more conventional.

    I have made the first set of changes, using five-bit condition code fields to nicely and fully handle both the signed and unsigned cases; I checked what the Motorola 68000 did, and found that it only provided a complete
    set of tests for signed values, but only two tests for unsigned ones.
    (Worse yet, it used separate condition codes for floating-point numbers, which makes sense, given that they were originally in a coprocessor, but that means an extra set of instructions is needed.)

    So, while it used a four-bit condition code field, I needed a five-bit one.

    x86 uses COZAP but this includes P=parity, which it is unlikely you do.
    Thus, 4 bits are sufficient to define 16-states, of which you only need 10-states signless{EQ, NEQ}, signed{>=, >, <, <=}, unsigned{>=, >, <, <=}.

    I did notice it didn't just always fail the signed tests if overflow was present; instead, in that case it switched plus and minus. Given that, and treating carry the same way for unsigned tests, you likely are right that
    an unsigned compare is not needed. Oh, wait; my assumed behavior that everything should just fail if there's an overflow... is reasonable for floating-point numbers.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat May 23 16:38:57 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    quadi <quadibloc@ca.invalid> posted:

    On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

    However, if I have enough opcode space to add a U bit to all the
    conditional branch instructions, then I also have enough opcode space to
    fix that instead, so I likely will rework this part of the ISA into
    something more conventional.

    I have made the first set of changes, using five-bit condition code fields
    to nicely and fully handle both the signed and unsigned cases; I checked
    what the Motorola 68000 did, and found that it only provided a complete
    set of tests for signed values, but only two tests for unsigned ones.
    (Worse yet, it used separate condition codes for floating-point numbers,
    which makes sense, given that they were originally in a coprocessor, but
    that means an extra set of instructions is needed.)

    So, while it used a four-bit condition code field, I needed a five-bit one.

    x86 uses COZAP but this includes P=parity, which it is unlikely you do.
    Thus, 4 bits are sufficient to define 16-states, of which you only need
    10-states signless{EQ, NEQ}, signed{>=, >, <, <=}, unsigned{>=, >, <, <=}.

    ARM includes the Q flag (saturation).

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat May 23 16:46:40 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    quadi <quadibloc@ca.invalid> writes:
    On Fri, 22 May 2026 07:35:36 +0000, Anton Ertl wrote:

    You only need that if your flags are insufficiently expressive (i.e.,
    less powerful than NCZV).

    While the System/360 had only two condition code bits, I do plan to have
    full VZNC bits. However, unlike the System/360,

    The S/360 is a mess as far as dealing with conditions is concerned.
    Or is there a great underlying principle involved, and I fail to see
    it? I doubt it, for the following reasons: 1) I have not come across
    any description that explained the underlying principle, and in fact I
    have come across few descriptions at all. 2) In the 62 years that
    S/360 has been available, it has not found any successors in its
    particular approach to conditions.

    The B3500 had three bits: Overflow, COM Low and COM High. The
    V-Series added COM null, used by the search linked list (SLT)
    instruction when the search key wasn't found.

    Condition     Flags
    ------------  --------------
    EQUAL         COML=1, COMH=1
    Less Than     COML=1, COMH=0
    Greater Than  COML=0, COMH=1
    NULL          COML=0, COMH=0
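
    Read as a lookup, the table decodes as in the Python sketch below.
    The names CONDITIONS, decode, less_or_equal and greater_or_equal
    are invented for this example, and the two single-bit predicates
    are only an observation drawn from the table as posted, not
    something taken from Burroughs documentation.

```python
# (COML, COMH) -> condition, exactly as listed in the table.
CONDITIONS = {
    (1, 1): "EQUAL",
    (1, 0): "Less Than",
    (0, 1): "Greater Than",
    (0, 0): "NULL",
}


def decode(coml, comh):
    """Return the condition named by a COML/COMH pair."""
    return CONDITIONS[(coml, comh)]


def less_or_equal(coml, comh):
    """COML alone is set for both 'Less Than' and 'EQUAL'."""
    return coml == 1


def greater_or_equal(coml, comh):
    """COMH alone is set for both 'Greater Than' and 'EQUAL'."""
    return comh == 1


for pair in CONDITIONS:
    print(pair, decode(*pair), less_or_equal(*pair), greater_or_equal(*pair))
```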

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat May 23 17:01:10 2026
    From Newsgroup: comp.arch

    On Sat, 23 May 2026 09:28:45 +0000, Anton Ertl wrote:

    quadi <quadibloc@ca.invalid> writes:
    On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

    I have made the first set of changes, using five-bit condition code
    fields to nicely and fully handle both the signed and unsigned cases; I
    checked what the Motorola 68000 did, and found that it only provided a
    complete set of tests for signed values, but only two tests for unsigned
    ones.

    I see four tests for unsigned conditions on the 68000 <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

    HI >
    LS <=
    CC >=
    CS <

    For the signed ones there is

    GT >
    LE <=
    GE >=
    LT <

    What I was going by was Table 3-19 on page 3-19 of the M68000 Family Programmer's Reference Manual on the Internet Archive from Bitsavers; it
    gives the available condition code tests on the architecture as:

    0000 True
    0001 False
    0010 High              not C and not Z
    0011 Low or Same       C or Z
    0100 Carry Clear
    0101 Carry Set
    0110 Not Equal         not Z
    0111 Equal             Z
    1000 Overflow Clear    not V
    1001 Overflow Set      V
    1010 Plus              not N
    1011 Minus             N
    1100 Greater or Equal  (N and V) or (not N and not V)
    1101 Less Than         (N and not V) or (not N and V)
    1110 Greater Than      (N and V and not Z) or (not N and not V and not Z)
    1111 Less or Equal     Z or (N and not V) or (not N and V)

    I took Low or Same as unsigned, and Plus, Minus, Greater or Equal, Less
    Than, Greater Than, and Less or Equal as signed.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat May 23 14:15:46 2026
    From Newsgroup: comp.arch

    On 2026-05-23 5:28 a.m., Anton Ertl wrote:
    quadi <quadibloc@ca.invalid> writes:
    On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

    However, if I have enough opcode space to add a U bit to all the
    conditional branch instructions, then I also have enough opcode space to
    fix that instead, so I likely will rework this part of the ISA into
    something more conventional.

    I have made the first set of changes, using five-bit condition code fields
    to nicely and fully handle both the signed and unsigned cases; I checked
    what the Motorola 68000 did, and found that it only provided a complete
    set of tests for signed values, but only two tests for unsigned ones.

    I see four tests for unsigned conditions on the 68000 <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

    HI >
    LS <=
    CC >=
    CS <

    CS may also be called LO
    CC may also be called HS

    For the signed ones there is

    GT >
    LE <=
    GE >=
    LT <

    my assumed behavior that
    everything should just fail if there's an overflow... is reasonable for
    floating-point numbers.

    The usual setup is that FP operations silently overflow to +INF and
    underflow to -INF. They do set sticky flags (called "exceptions" in

    Methinks overflow could be to +/- INF and underflow to zero or a denormal.

    the IEEE FP standard) on various conditions, including on overflows,
    but also on rounding errors ("inexact").

    - anton

    If one has CVNZ it is enough for both signed and unsigned integer
    conditional testing using only four bits.

    The CVNZ could be repurposed for float comparisons. V = INF. C=inexact
    for instance.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 23 18:37:39 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    On Sat, 23 May 2026 09:28:45 +0000, Anton Ertl wrote:
    I see four tests for unsigned conditions on the 68000
    <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

    HI >
    LS <=
    CC >=
    CS <

    For the signed ones there is

    GT >
    LE <=
    GE >=
    LT <

    What I was going by was Table 3-19 on page 3-19 of the M68000 Family
    Programmer's Reference Manual on the Internet Archive from Bitsavers; it
    gives the available condition code tests on the architecture as:

    0000 True
    0001 False
    0010 High              not C and not Z
    0011 Low or Same       C or Z
    0100 Carry Clear
    0101 Carry Set
    0110 Not Equal         not Z
    0111 Equal             Z
    1000 Overflow Clear    not V
    1001 Overflow Set      V
    1010 Plus              not N
    1011 Minus             N
    1100 Greater or Equal  (N and V) or (not N and not V)
    1101 Less Than         (N and not V) or (not N and V)
    1110 Greater Than      (N and V and not Z) or (not N and not V and not Z)
    1111 Less or Equal     Z or (N and not V) or (not N and V)

    I took Low or Same as unsigned, and Plus, Minus, Greater or Equal, Less
    Than, Greater Than, and Less or Equal as signed.

    Carry Clear (CC) is unsigned >=
    Carry Set (CS) is unsigned <

    after a CMP or SUB instruction.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat May 23 19:33:46 2026
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    The S/360 is a mess as far as dealing with conditions is concerned.
    Or is there a great underlying principle involved, and I fail to see
    it? I doubt it, for the following reasons: 1) I have not come across
    any description that explained the underlying principle, and in fact I
    have come across few descriptions at all. 2) In the 62 years that
    S/360 has been available, it has not found any successors in its
    particular approach to conditions.

    I suspect the encoded condition bits in S/360 are a reflection of
    the expensive memory era in which it was created. If they had
    decoded condition codes, they'd have had to find more bits in
    the PSW to store them, and it was already quite full.

    I agree that nobody else did that, and in retrospect it was an overoptimization.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 20:01:07 2026
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-23 5:28 a.m., Anton Ertl wrote:
    quadi <quadibloc@ca.invalid> writes:
    On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

    However, if I have enough opcode space to add a U bit to all the
    conditional branch instructions, then I also have enough opcode space to
    fix that instead, so I likely will rework this part of the ISA into
    something more conventional.

    I have made the first set of changes, using five-bit condition code fields
    to nicely and fully handle both the signed and unsigned cases; I checked
    what the Motorola 68000 did, and found that it only provided a complete
    set of tests for signed values, but only two tests for unsigned ones.

    I see four tests for unsigned conditions on the 68000 <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

    HI >
    LS <=
    CC >=
    CS <

    CS may also be called LO
    CC may also be called HS

    For the signed ones there is

    GT >
    LE <=
    GE >=
    LT <

    my assumed behavior that
    everything should just fail if there's an overflow... is reasonable for
    floating-point numbers.

    The usual setup is that FP operations silently overflow to +INF and underflow to -INF. They do set sticky flags (called "exceptions" in

    Methinks overflow could be to +/- INF and underflow to zero or a denormal.

    IEEE defines OVERFLOW as finite becomes signed infinite.
    IEEE defines UNDERFLOW as finite becomes signed sub-finite*.
    Sub-finite ={deNormal or zero}
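
    Those two definitions can be seen in the results of ordinary
    binary64 arithmetic. The Python sketch below is only illustrative:
    the variable names are invented, and it shows the resulting values
    only, since it does not read the sticky exception flags.

```python
import math
import sys

big = sys.float_info.max          # largest finite binary64 value
tiny = sys.float_info.min         # smallest positive *normal* value

# Overflow: a finite result becomes a signed infinity.
overflow_pos = big * 2.0
overflow_neg = -big * 2.0

# Underflow: a finite result becomes a signed denormal or a signed zero.
denormal = tiny / 4.0             # below the normal range, but nonzero
to_zero = 5e-324 / 4.0            # 5e-324 is the smallest denormal
neg_zero = -5e-324 / 4.0          # underflows to minus zero

print(overflow_pos, overflow_neg)
print(denormal, to_zero, neg_zero)
```

    The sign survives in every case, including the zero, which is why
    the results are described as signed infinite and signed sub-finite.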

    the IEEE FP standard) on various conditions, including on overflows,
    but also on rounding errors ("inexact").

    - anton

    If one has CVNZ it is enough for both signed and unsigned integer conditional testing using only four bits.

    The CVNZ could be repurposed for float comparisons. V = INF. C=inexact
    for instance.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 20:03:34 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    The S/360 is a mess as far as dealing with conditions is concerned.
    Or is there a great underlying principle involved, and I fail to see
    it? I doubt it, for the following reasons: 1) I have not come across
    any description that explained the underlying principle, and in fact I
    have come across few descriptions at all. 2) In the 62 years that
    S/360 has been available, it has not found any successors in its
    particular approach to conditions.

    I suspect the encoded condition bits in S/360 are a reflection of
    the expensive memory era in which it was created. If they had
    decoded condition codes, they'd have had to find more bits in
    the PSW to store them, and it was already quite full.

    S/360 would have been better off as defining PSW as a PSQW (128-bits)
    which would have alleviated several problems associated with running
    out of PSW space.

    I agree that nobody else did that, and in retrospect it was an overoptimization.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat May 23 20:09:54 2026
    From Newsgroup: comp.arch

    According to MitchAlsup <user5857@newsgrouper.org.invalid>:
    I suspect the encoded condition bits in S/360 are a reflection of
    the expensive memory era in which it was created. If they had
    decoded condition codes, they'd have had to find more bits in
    the PSW to store them, and it was already quite full.

    S/360 would have been better off as defining PSW as a PSQW (128-bits)
    which would have alleviated several problems associated with running
    out of PSW space.

    They'd also have been better off making the addresses 32 bits and not
    putting junk in the high byte, which caused endless pain later, but
    they were really really worried about making low end models with 8K
    bytes usable.

    Remember that the major reason for B+D addressing was that it let them
    have 16 bit address fields in instructions while keeping 24 bit flat addressing.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 22:15:30 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to MitchAlsup <user5857@newsgrouper.org.invalid>:
    I suspect the encoded condition bits in S/360 are a reflection of
    the expensive memory era in which it was created. If they had
    decoded condition codes, they'd have had to find more bits in
    the PSW to store them, and it was already quite full.

    S/360 would have been better off as defining PSW as a PSQW (128-bits)
    which would have alleviated several problems associated with running
    out of PSW space.

    They'd also have been better off making the addresses 32 bits and not
    putting junk in the high byte, which caused endless pain later, but
    they were really really worried about making low end models with 8K
    bytes usable.

    Remember that the major reason for B+D addressing was that it let them
    have 16 bit address fields in instructions while keeping 24 bit flat addressing.

    B+X+D addressing only got 12-bits
    B+D addressing was for RS and SS instructions

    I think they thought they were saving on complexity and HW logic, but
    I think the whole RS and SS could have used a "more regular format pattern"; and they (IBM) would have been better off long term.

    But that was "Oh so long ago."
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun May 24 01:43:29 2026
    From Newsgroup: comp.arch

    According to MitchAlsup <user5857@newsgrouper.org.invalid>:
    Remember that the major reason for B+D addressing was that it let them
    have 16 bit address fields in instructions while keeping 24 bit flat
    addressing.

    B+X+D addressing only got 12-bits
    B+D addressing was for RS and SS instructions

    four bits of B, 12 bits of D, 16 bit addresses
    you're right that RX used another four bits.
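
    As a rough Python sketch of that layout: a four-bit B field and a
    12-bit D field pack into one halfword, and the effective address is
    base plus optional index plus displacement, truncated to 24 bits.
    The function names are made up for this example; the rule that a
    register field of 0 contributes nothing is the S/360 convention.

```python
ADDRESS_MASK = 0xFFFFFF           # 24-bit flat address space


def pack_bd(b, d):
    """Pack a 4-bit base register number and a 12-bit displacement."""
    assert 0 <= b < 16 and 0 <= d < 4096
    return (b << 12) | d


def effective_address(regs, b, d, x=0):
    """B+D (x omitted) or B+X+D address; a register field of 0 adds zero."""
    assert 0 <= b < 16 and 0 <= x < 16 and 0 <= d < 4096
    base = regs[b] if b != 0 else 0
    index = regs[x] if x != 0 else 0
    return (base + index + d) & ADDRESS_MASK


regs = [0] * 16
regs[12] = 0xFF012000             # high byte is ignored by the address mask
regs[3] = 0x10

print(hex(pack_bd(12, 0x345)))                      # the 16-bit field
print(hex(effective_address(regs, 12, 0x345)))      # B+D
print(hex(effective_address(regs, 12, 0x345, x=3))) # B+X+D
```

    A 12-bit displacement reaches only 4096 bytes past the base, so a
    larger program needs several base registers or reloads of one.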

    I think they thought they were saving on complexity and HW logic, but

    We don't have to guess. "Architecture of the IBM System/360" by Amdahl, Blaauw, and Brooks in the IBM Systems Journal in April 1964 described a lot
    of the reasoning, and they wrote a whole book about it.

    They had to make a lot of other design decisions like 6 vs 8 bit
    bytes, ones- vs twos-complement, length fields vs word marks for
    variable length data, stack vs registers, floating point format (they
    blew that one).

    They said that the combination of a full length base register and a
    short displacement "gives consequent gains in instruction density. The base-register approach was adopted, and then augmented, for some
    instructions, with a second level of indexing."

    In retrospect, B+X+D was probably a mistake since I believe that
    double indexing is rarely used, and easy to do with an extra register
    add. On the other hand, it's not obvious what a better use of the X
    field would have been. I suppose they could have made instructions
    three operand, e.g.

    A Rx,Ry,B(D)

    would add the memory operand to Ry and put it in Rx but it was a long
    time until compilers could make good use of that.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 24 03:10:27 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to MitchAlsup <user5857@newsgrouper.org.invalid>:
    Remember that the major reason for B+D addressing was that it let them
    have 16 bit address fields in instructions while keeping 24 bit flat
    addressing.

    B+X+D addressing only got 12-bits
    B+D addressing was for RS and SS instructions

    four bits of B, 12 bits of D, 16 bit addresses
    you're right that RX used another four bits.

    I think they thought they were saving on complexity and HW logic, but

    We don't have to guess. "Architecture of the IBM System/360" by Amdahl, Blaauw, and Brooks in the IBM Systems Journal in April 1964 described a lot of the reasoning, and they wrote a whole book about it.

    They had to make a lot of other design decisions like 6 vs 8 bit
    bytes, ones- vs twos-complement, length fields vs word marks for
    variable length data, stack vs registers, floating point format (they
    blew that one).

    They said that the combination of a full length base register and a
    short displacement "gives consequent gains in instruction density. The base-register approach was adopted, and then augmented, for some instructions, with a second level of indexing."

    In retrospect, B+X+D was probably a mistake since I believe that
    double indexing is rarely used, and easy to do with an extra register
    add.

    That is the view of MIPS and RISC-V
    That is not the view of x86 or ARM or My 66000 or Mc 88K

    On the other hand, it's not obvious what a better use of the X
    field would have been. I suppose they could have made instructions
    three operand, e.g.

    A Rx,Ry,B(D)

    would add the memory operand to Ry and put it in Rx but it was a long
    time until compilers could make good use of that.

    Agreed about the time it took compilers to be taught how to use it.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 13:30:42 2026
    From Newsgroup: comp.arch

    On Sat, 23 May 2026 20:09:54 +0000, John Levine wrote:

    Remember that the major reason for B+D addressing was that it let them
    have 16 bit address fields in instructions while keeping 24 bit flat addressing.

    12 bits, of course. And they felt that 12 bits were enough because memory
    was such an issue back then.

    In hindsight, of course having a two-bit condition code was a "mistake".
    But C hadn't been invented yet, so nobody knew there would be any real use
    for unsigned integers.

    And the PSW really was full - when IBM went to System/370, they had to repurpose a bit in the PSW that was already assigned to an existing
    feature, ASCII mode. Since nobody ever used it, however, using it instead
    for the System/370's "Extended Control Mode", wherein the PSW *did* get doubled in length, was possible.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 13:32:41 2026
    From Newsgroup: comp.arch

    On Sat, 23 May 2026 20:03:34 +0000, MitchAlsup wrote:

    S/360 would have been better off as defining PSW as a PSQW (128-bits)
    which would have alleviated several problems associated with running out
    of PSW space.

    It's not as if these problems were impossible to fix.

    Remember the System/370, and its Extended Control Mode? All they lost was
    the ability to switch the computer into an ASCII mode nobody ever used.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 13:49:26 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 13:32:41 +0000, quadi wrote:
    On Sat, 23 May 2026 20:03:34 +0000, MitchAlsup wrote:

    S/360 would have been better off as defining PSW as a PSQW (128-bits)
    which would have alleviated several problems associated with running
    out of PSW space.

    Remember the System/370, and its Extended Control Mode? All they lost
    was the ability to switch the computer into an ASCII mode nobody ever
    used.

    Come to think of it, though, that fact doesn't make you wrong. They
    would have been better off defining it as 128 bits long in the first
    place, since one thing they _couldn't_ do with Extended Control Mode was
    change the condition codes from two bits to full NZVC, since user programs
    had to remain compatible.

    Of course, though, people must have been able to get C compilers working
    on z/Architecture, despite inefficiencies, or it wouldn't be possible to install Linux on those machines.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 13:57:12 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 01:43:29 +0000, John Levine wrote:

    In retrospect, B+X+D was probably a mistake since I believe that double indexing is rarely used, and easy to do with an extra register add. On
    the other hand, it's not obvious what a better use of the X field would
    have been. I suppose they could have made instructions three operand,
    e.g.

    A Rx,Ry,B(D)

    would add the memory operand to Ry and put it in Rx but it was a long
    time until compilers could make good use of that.

    Since there were three-address machines back in the days before general registers, I am surprised to hear that they didn't know how to write
    compilers that made use of such a field.

    But the "better use of the X field" is obvious - make the displacement
    field 16 bits instead of 12 bits. Except, of course, that this would have killed the SS format of instructions.

    But I don't agree that B+X+D is a bad thing. An extra register add is an
    extra instruction. And it's not rarely used; it's used every time an array
    is accessed, and arrays are often accessed in inner loops!

    Of course, there are counterarguments. B+X+D, when used, involves an extra
    add inside the instruction. Doesn't that take time too? Wouldn't it be
    better to add just once at the beginning of the loop?

    The thing is, though, there's also *register pressure* to think about.
    Plus, the extra add inside the instruction just means a three-input add,
    and once one recalls how *multipliers* are designed, one realizes that
    this extra add, though it may still take time, does not take _much_ time.
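
    One way to picture the three-input add mentioned above is the carry-save
    step used inside multiplier arrays: three addends are compressed to two
    with no carry propagation, and only the final add pays for a carry chain.
    What follows is a minimal sketch in Python, not a description of any
    actual S/360 implementation; the function names and the 24-bit mask are
    assumptions chosen for illustration.

```python
ADDR_MASK = (1 << 24) - 1  # assumption: 24-bit addresses, as on the original S/360


def carry_save(a, b, c):
    """Compress three addends into a (sum, carry) pair with no carry propagation.

    Every bit position is handled independently, so the delay of this step
    does not grow with the word length.
    """
    partial_sum = a ^ b ^ c
    carries = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum & ADDR_MASK, carries & ADDR_MASK


def effective_address(base, index, disp):
    """B+X+D: one carry-save step followed by a single carry-propagate add."""
    partial_sum, carries = carry_save(base, index, disp)
    return (partial_sum + carries) & ADDR_MASK


print(hex(effective_address(0x012000, 0x000040, 0x7FC)))
```

    The carry-save step costs roughly one full-adder delay, which is why the
    third input adds little to the time of the address calculation.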

    John Savard

  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 14:14:41 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 13:49:26 +0000, quadi wrote:

    Of course, though, people must have been able to get C compilers working
    on z/Architecture, despite inefficiencies, or it wouldn't be possible to install Linux on those machines.

    I did a search, and found that z/Architecture added add-with-carry, subtract-with-borrow, and LLGF and LLGH which appear to be UL and ULH in
    my architectures.

    John Savard
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun May 24 09:32:07 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    I suspect the encoded condition bits in S/360 are a reflection of
    the expensive memory era in which it was created. If they had
    decoded condition codes, they'd have had to find more bits in
    the PSW to store them, and it was already quite full.

    Some additional possible reasons:

    Most[1] architectures before the S/360 use ones-complement or
    sign/magnitude representation for integers, and trap on overflow [2],
    and I guess they have separate comparison instructions (I don't know
    that much about these machines, so I may be wrong here). So there was
    no need for flags indicating signed overflow (V), or unsigned overflow
    (C). Having only two flag bits was good enough to represent the three
    possible outcomes of a comparison.

    [1] Zuse chose twos-complement in the early 1940s. I don't know if he
    stuck with that in his later machines.

    [2] Reading the IBM 704 manual, it just says for some instructions
    that "Ac overflow is possible", but does not describe the
    behaviour. For division, the IBM 704 has "divide or halt" and
    "divide or proceed", so I guess that trapping in the modern sense
    was not yet on the table.

    S/360 also supports the trap-on-overflow behaviour for signed
    arithmetic, but one can turn the trapping off. Arithmetic
    instructions set the flags in different ways depending on whether they
    are signed or unsigned. So S/360 has separate add-signed (A) and
    add-unsigned (AL) instructions; thanks to 2s-complement arithmetic,
    when they don't trap, they produce the same result in the target
    register, but different behaviour in the flags and in trapping.
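
    The difference between A and AL can be illustrated with a small model.
    This is a hedged sketch in Python, not IBM's definition: the function
    names are made up, the condition-code values are assumed from memory of
    the Principles of Operation, and the overflow trap itself is not modelled.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return x - (1 << 32) if x & SIGN_BIT else x


def add_signed(a, b):
    """Model of A: returns (32-bit result, condition code).

    Assumed CC values: 0 = zero, 1 = negative, 2 = positive, 3 = overflow.
    """
    true_sum = to_signed(a) + to_signed(b)
    result = (a + b) & MASK32
    if not -(1 << 31) <= true_sum < (1 << 31):
        return result, 3
    if true_sum == 0:
        return result, 0
    return result, 1 if true_sum < 0 else 2


def add_logical(a, b):
    """Model of AL: returns (32-bit result, condition code).

    Assumed CC encoding: high bit = carry out, low bit = result nonzero.
    """
    total = a + b
    result = total & MASK32
    carry = total >> 32
    return result, (carry << 1) | (1 if result else 0)


# Same bit pattern in the register, different condition codes.
print(add_signed(0x7FFFFFFF, 1), add_logical(0x7FFFFFFF, 1))
```

    In both models the register result is the same 32-bit pattern; only the
    two-bit condition code differs, which is the behaviour described above.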

    I expect that this all costs in control logic, so more constrained
    processors like the 6502 and the 8080 then later went with NCZV. The
    PDP-11 too AFAIK, but that may be due to the features of the bit-slice
    ALUs available when the PDP-11 was designed. These machines also did
    not have as many encoding bits to waste on separate signed and
    unsigned integer arithmetic, thanks to their very narrow memory
    bandwidth.

    The architectures before the S/360 do not support multi-word
    integer arithmetic (unless you count the digit-serial and
    character-serial machines), and S/360 does not, either. It takes
    until 1990 for IBM to add ALCR (add with carry-in) to the
    architecture. For architectures with smaller word sizes like the
    PDP-11, the 6502 and 8080, the need for multi-word integer arithmetic
    was much greater.
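
    To show why an add-with-carry instruction matters on narrow machines,
    here is a minimal Python sketch of multi-word addition. The function name,
    the little-endian word order, and the 8-bit default word size are
    assumptions for illustration, not taken from any of the machines named.

```python
def add_multiword(a_words, b_words, word_bits=8):
    """Add two equal-length integers stored as lists of words, least
    significant word first. Returns (sum_words, carry_out)."""
    if len(a_words) != len(b_words):
        raise ValueError("operands must have the same number of words")
    mask = (1 << word_bits) - 1
    carry = 0
    result = []
    for a, b in zip(a_words, b_words):
        total = a + b + carry          # this is the add-with-carry step
        result.append(total & mask)
        carry = total >> word_bits     # carry out feeds the next word
    return result, carry


# 0x01FF + 0x0001 on an 8-bit machine: the carry ripples into the high word.
print(add_multiword([0xFF, 0x01], [0x01, 0x00]))
```

    Without a carry flag and an instruction that consumes it, each step of
    the loop needs an extra compare to recover the carry.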

    Interestingly, the IBM 704 has the ACL instruction, an unsigned
    addition with carry-in, like the ESA/390's ALCR.

    Bottom line: When the S/360 was designed, the design of 2s-complement
    machines was in its infancy (if we ignore Zuse, and the S/360
    designers may have ignored him), so it was not known how to design the
    flags for them.

    One other aspect that may have played a role is that various S/360
    implementations included compatibility modes for earlier IBM models,
    and they may have designed the flags with that in mind. However, given
    the vast differences between a 36-bit machine with sign/magnitude
    representation (IBM 704) and the S/360, implementing different flags
    for the different architectures was probably just a minor
    complication. Moreover, different S/360 models offer compatibility
    for different older architectures, where the flags probably were
    different.

    Concerning the question about why IBM chose big-endian for the S/360.
    I see <https://en.wikipedia.org/wiki/IBM_704#Registers> that already
    the IBM 704 used big-endian bit-numbering. As long as you only have
    one width at which to talk to registers or memory, that's as good as little-endian. It only becomes an issue if you talk to registers at
    different widths (e.g., 32-bit and 64-bit Power(PC)), and likewise,
    for memory it only becomes an issue when you talk to memory at
    different widths; i.e., not word-addressed machines, but
    byte-addressed machines.
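
    The point about access widths can be seen in a few lines of Python. This
    is only an illustrative sketch: the helper name and the sample value are
    made up, and a byte string stands in for byte-addressed memory.

```python
def read_at_start(memory, width, byteorder):
    """Read `width` bytes from the start of `memory` as an unsigned integer."""
    return int.from_bytes(memory[:width], byteorder)


value = 0x11223344
little = value.to_bytes(4, "little")   # stored as bytes 44 33 22 11
big = value.to_bytes(4, "big")         # stored as bytes 11 22 33 44

# Full-width accesses agree regardless of byte order.
assert read_at_start(little, 4, "little") == read_at_start(big, 4, "big") == value

# A narrower access at the same address sees different halves.
print(hex(read_at_start(little, 2, "little")))  # low half of the value
print(hex(read_at_start(big, 2, "big")))        # high half of the value
```

    As long as every access uses one width, the two orders are
    indistinguishable; the difference shows up only with the narrower read.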

    For FP the machines have different widths from early on, but they tend
    not to access the halves of a double-precision number, so the
    difference between big- and little-endian rarely makes a difference
    there.

    Actually, one does see some effects of big-endian bit numbering in the
    IBM 704, because the Accumulator has additional bits, and they are
    called P and Q (with little-endian bit ordering starting with bit 0,
    they would just be called 35 and 36). Also the 15-bit index registers
    run from bit 3 (MSB) to bit 17 (LSB), not from 0 to 14.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From John Levine@johnl@taugh.com to comp.arch on Sun May 24 15:12:48 2026
    From Newsgroup: comp.arch

    According to quadi <quadibloc@ca.invalid>:
    On Sun, 24 May 2026 01:43:29 +0000, John Levine wrote:

    In retrospect, B+X+D was probably a mistake since I believe that double
    indexing is rarely used, and easy to do with an extra register add. On
    the other hand, it's not obvious what a better use of the X field would
    have been. I suppose they could have made instructions three operand,
    e.g.

    A Rx,Ry,B(D)

    would add the memory operand to Ry and put it in Rx but it was a long
    time until compilers could make good use of that.

    Since there were three-address machines back in the days before general
    registers, I am surprised to hear that they didn't know how to write
    compilers that made use of such a field.

    Optimizing compilers largely meant Fortran, which came from the single address 70x series. Human programmers did all sorts of clever tricks but it took a while to get compilers to do it. It probably needed graph coloring register allocation which wasn't invented until 1980.

    But the "better use of the X field" is obvious - make the displacement
    field 16 bits instead of 12 bits. Except, of course, that this would have
    killed the SS format of instructions.

    Or worse, had some instructions with 12 bit displacement and some with 16,
    which would have been a programming nightmare.

    But I don't agree that B+X+D is a bad thing. An extra register add is an
    extra instruction. And it's not rarely used; it's used every time an array
    is accessed, and arrays are often accessed in inner loops!

    A decent optimizing compiler will do strength reduction so there's a register pointing at the array and stepping through it. You're right about register pressure but with 16 registers it shouldn't be hard to find one for an inner loop value.
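
    As a rough illustration of the strength reduction described above, the
    following Python sketch generates the same address stream two ways. The
    function names and parameters are hypothetical; a real compiler performs
    this transformation on its intermediate representation, not on lists.

```python
def addresses_indexed(base, disp, count, elem_size):
    """B+X+D style: a scaled index plus a three-input add on every access."""
    return [base + i * elem_size + disp for i in range(count)]


def addresses_strength_reduced(base, disp, count, elem_size):
    """One add before the loop, then a single pointer increment per access."""
    addrs = []
    pointer = base + disp          # hoisted out of the loop
    for _ in range(count):
        addrs.append(pointer)      # the access needs only the pointer register
        pointer += elem_size       # the multiply has been reduced to an add
    return addrs


print(addresses_indexed(0x1000, 8, 4, 4) == addresses_strength_reduced(0x1000, 8, 4, 4))
```

    The second form ties up one register for the stepping pointer, which is
    the register-pressure trade-off under discussion.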
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From John Levine@johnl@taugh.com to comp.arch on Sun May 24 15:24:22 2026
    From Newsgroup: comp.arch

    According to quadi <quadibloc@ca.invalid>:
    On Sat, 23 May 2026 20:09:54 +0000, John Levine wrote:

    Remember that the major reason for B+D addressing was that it let them
    have 16 bit address fields in instructions while keeping 24 bit flat
    addressing.

    12 bits, of course. And they felt that 12 bits were enough because memory
    was such an issue back then.

    It was also to force all addresses to be base relative to make code relocatable.
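
    A small sketch may make the relocation argument concrete. It is written
    in Python under stated assumptions: the helper name and the sample load
    addresses are invented, and only the 12-bit displacement and 24-bit
    address widths come from the discussion.

```python
ADDR_MASK = (1 << 24) - 1   # 24-bit flat addresses
MAX_DISP = (1 << 12) - 1    # 12-bit displacement field


def resolve(base_value, disp):
    """Form an address from a base register value and a 12-bit displacement."""
    if not 0 <= disp <= MAX_DISP:
        raise ValueError("displacement must fit in 12 bits")
    return (base_value + disp) & ADDR_MASK


# The instruction encodes only the displacement 0x123. Loading the program
# at a different address changes the base register, not the instruction.
print(hex(resolve(0x010000, 0x123)))
print(hex(resolve(0x0A4000, 0x123)))
```

    The code itself is unchanged between the two runs; only the value placed
    in the base register at entry differs.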

    You should read the 1964 paper. It's not very long. Here's a copy:

    https://www.ece.ucdavis.edu/~vojin/CLASSES/EEC272/S2005/Papers/IBM360-Amdahl_april64.pdf


    In hindsight, of course having a two-bit condition code was a "mistake".
    But C hadn't been invented yet, so nobody knew there would be any real use
    for unsigned integers.

    Sure they did. S/360 had separate unsigned versions of add and subtract instructions. The results were the same but the condition codes were
    different and the unsigned versions couldn't overflow. There were also arithmetic and logical shifts.

    And the PSW really was full - when IBM went to System/370, they had to
    repurpose a bit in the PSW that was already assigned to an existing
    feature, ASCII mode. Since nobody ever used it, however, using it instead
    for the System/370's "Extended Control Mode", wherein the PSW *did* get
    doubled in length was possible.

    Yup.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 16:39:25 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 15:24:22 +0000, John Levine wrote:

    Sure they did. S/360 had separate unsigned versions of add and subtract instructions. The results were the same but the condition codes were different and the unsigned versions couldn't overflow.

    Ah, I didn't remember that!

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 16:44:26 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 15:12:48 +0000, John Levine wrote:
    According to quadi <quadibloc@ca.invalid>:

    But the "better use of the X field" is obvious - make the displacement
    field 16 bits instead of 12 bits. Except, of course, that this would
    have killed the SS format of instructions.

    Or worse had some instructions with 12 bit displacement and some with 16 which would have been a programming nightmare.

    Of course, the z/Architecture does have instructions with 20 bit
    displacements as well as 12 bits. But unlike the case where only the SS instructions have a 12-bit displacement, it has a complete set of
    instructions in each size.

    And my Concertina II and IV also have 12, 16, and 20 bit displacements -
    but they use a different set of registers as the base registers for each,
    and also have a complete set of instructions for each, thus avoiding the
    nightmare.

    John Savard
  • From John Levine@johnl@taugh.com to comp.arch on Sun May 24 17:06:52 2026
    From Newsgroup: comp.arch

    According to MitchAlsup <user5857@newsgrouper.org.invalid>:
    In retrospect, B+X+D was probably a mistake since I believe that
    double indexing is rarely used, and easy to do with an extra register
    add.

    That is the view of MIPS and RISC_V
    That is not the view of x86 or ARM or My 66000 or Mc 88K

    I suppose, but I don't think any of them reserved four instruction bits
    for an index register that's rarely used. On x86 it's one bit in the r/m
    field and arguably not even that since it's part of a three bit field
    that's overloaded as a register number, or in 32 bit mode one address
    form out of 8 that takes an extra byte for the base and index registers.

    Vax also had double indexing, but it was an extra prefix byte in the
    address field that said add register N scaled by the operand size to
    whatever other address followed, so there it was one address mode out of
    16.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 17:16:35 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 09:32:07 +0000, Anton Ertl wrote:

    Most[1] architecture before the S/360 use ones-complement or
    sign/magnitude representation for integers, and trap on overflow [2],

    It makes sense to trap on a floating-point overflow, but trapping on an integer overflow is usually a terrible idea.

    Before the System/360, it's definitely true that one's complement and
    sign-magnitude representations of integers were valid options for designers.
    I'm not sure of their relative frequency.

    I do know of a claim made by one maker of a 24-bit computer in its
    advertising literature, and I suspect it did represent the situation then.

    Sign-magnitude was what the IBM 704 and its descendants used. As a result,
    it was the... aspirational... integer representation.

    One's complement was very popular back then - simpler to implement than sign-magnitude, but almost equivalent, in some sense. Thus, one's
    complement was the preferred representation in the PDP-4, which also had a limited two's complement capability.

    And two's complement was the simplest to implement, and thus chosen where
    cost savings were paramount. So the PDP-5 used two's complement.

    And then the IBM 360 came along, and woke everyone up to the fact that
    there was no real reason to use anything but two's complement.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 17:30:40 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 09:32:07 +0000, Anton Ertl wrote:

    Concerning the question about why IBM chose big-endian for the S/360

    ...I'm not really aware that they had a choice.

    Some machines before the IBM System/360 did use little-endian ordering for multiple words, to simplify handling the carries when adding pairs of
    words.

    Until the PDP-11 came along, though, _nobody_ thought of putting the characters inside a word starting at the least-significant end, so that
    the ordering of bytes would be consistent with the ordering of words.

    Until the PDP-11 came along, therefore, little-endian wasn't a "thing".
    The most significant part of a two-word integer might be placed
    second, so that you could fetch the parts in forward order and start
    adding right away, but that wasn't part of a philosophy.

    The System/360 _only_ did BCD arithmetic with the SS instructions; it
    didn't put BCD in registers. So it wasn't forced to use big-endian by my
    consistency argument; binary values could still have been little-endian if
    they had preferred. But the different machines in the System/360 family
    had different bus widths.

    So they couldn't just be little-endian at the 16-bit level; they would
    have had to have been consistent. I suppose they could have thought of it first even if they didn't have the PDP-11 to copy from. But because almost
    all their machines were microcoded, they were in a position to do things
    like working backwards from the end of a number to do arithmetic to avoid having a severe cost penalty for big-endian.

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 24 17:32:10 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Sun, 24 May 2026 09:32:07 +0000, Anton Ertl wrote:

    Most[1] architecture before the S/360 use ones-complement or
    sign/magnitude representation for integers, and trap on overflow [2],

    It makes sense to trap on a floating-point overflow, but trapping on an integer overflow is usually a terrible idea.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 21:39:42 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    No, so being able to turn the trap for integer overflow on should
    definitely be allowed. But that shouldn't be the default behavior.
    Otherwise, programs like random number generators wouldn't work.

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 24 22:07:18 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    No, so being able to turn the trap for integer overflow on should
    definitely be allowed. But that shouldn't be the default behavior. Otherwise, programs like random number generators wouldn't work.

    They work just fine using unSigned integers.


    John Savard
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun May 24 15:22:46 2026
    From Newsgroup: comp.arch

    On 5/24/2026 3:07 PM, MitchAlsup wrote:

    quadi <quadibloc@ca.invalid> posted:

    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    No, so being able to turn the trap for integer overflow on should
    definitely be allowed. But that shouldn't be the default behavior.
    Otherwise, programs like random number generators wouldn't work.

    They work just fine using unSigned integers.

    Ditto!


    [...]

  • From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 01:04:36 2026
    From Newsgroup: comp.arch

    On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

    You will find you have no <marketable> choice; you need to support::

    Integer{S8, S16, S32, S64, U8, U16, U32, U64}
    Float {FP8, FP16, FP32, FP64 and some way to get FP128}

    After realizing that I did need a second instruction for unsigned
    _division_ I then learned, to my shock, that division was not one, but
    two, instructions, at least in my architecture, for integers.

    And there didn't seem to be enough opcode space left for Divide Extensibly Unsigned.

    I was able to re-adjust the 32-bit operate instructions so that the two
    places where only 96 opcodes were provided for the basic operate
    instructions could now provide 128 opcodes.

    The 16-bit and 24-bit short instructions could not be so modified. But
    there were a few unused opcodes; so Divide Extensibly Unsigned could still
    fit in, just out of place.

    But that meant that this one operation would be missing from the
    minimum-length immediate instructions, and would still be treated as out
    of the basic instruction set, getting immediate instructions that were 16
    bits longer, for them.

    The Pigeonhole Principle has finally bit me!

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 01:29:37 2026
    From Newsgroup: comp.arch

    On Mon, 25 May 2026 01:04:36 +0000, quadi wrote:

    The 16-bit and 24-bit short instructions could not be so modified. But
    there were a few unused opcodes; so Divide Extensibly Unsigned could
    still fit in, just out of place.

    But that meant that this one operation would be missing from the
    minimum-length immediate instructions, and would still be treated as
    out of the basic instruction set, getting immediate instructions that
    were 16 bits longer, for them.

    I have found a way around even that problem. There is no use for a "swap immediate" instruction, so I'll put Divide Extensibly Unsigned in its
    spot, so it will be in the columns for its types, and put the swap instruction, another exotic one, in the out-of-place spots left over.

    John Savard
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon May 25 10:23:00 2026
    From Newsgroup: comp.arch

    On 24/05/2026 23:39, quadi wrote:
    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    No, so being able to turn the trap for integer overflow on should
    definitely be allowed. But that shouldn't be the default behavior.
    Otherwise, programs like random number generators wouldn't work.

    John Savard

    That does not make sense. Code such as random number generators should
    be written so that they are correct in the language they are written in.
    If that is C, signed integer overflow is UB while unsigned integers
    have wrapping behaviour - thus if your code depends on wrapping, and it
    is written in C, it needs to use unsigned types or compiler-specific extensions, flags, etc. (Or C23 ckd_add and other checked arithmetic functions.)

    If it is written in Zig, you need to use the specific modulo arithmetic functions even for unsigned arithmetic. If it is written in Java,
    signed integer arithmetic is fine.

    It all depends on the language and/or any options the language and tools
    might support - and code should be written to work correctly according
    to the language rules.
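
    As an example of writing a generator to the rules of its language, here
    is a linear congruential generator in Python, where integers never wrap
    and the reduction modulo 2^32 must therefore be spelled out. It is a
    sketch only: the function name is made up, and the multiplier and
    increment are the widely published "Numerical Recipes" constants, used
    here as an assumption rather than a recommendation.

```python
MASK32 = (1 << 32) - 1


def lcg_next(state):
    """One step of a 32-bit linear congruential generator.

    Python integers are arbitrary precision, so the wraparound that a
    32-bit unsigned type would give for free is made explicit with a mask.
    """
    return (1664525 * state + 1013904223) & MASK32


state = 0
for _ in range(3):
    state = lcg_next(state)
    print(hex(state))
```

    Nothing here depends on what the hardware does on overflow; the modular
    arithmetic is part of the program text.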

    The hardware, of course, cannot always enable trapping on overflow if it
    is going to efficiently support a range of programming languages. But
    as an optional feature it can be helpful for catching a few bugs in
    code, so it can be a good idea (both for signed and unsigned overflow).

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 25 14:28:21 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 24/05/2026 23:39, quadi wrote:
    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    Most programming environments I have had contact with don't trap on floating-point overflow.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    The question is if an integer overflow means that something went
    wrong. Despite their eagerness to "optimize" based on the assumption
    that signed integer overflow does not happen, the GCC developers have
    avoided making -ftrapv the default, even on platforms like MIPS and
    Alpha where the implementation of -ftrapv just means to use different instructions (e.g., add instead of addu on MIPS, and addv instead of
    add on Alpha).

    The hardware, of course, cannot always enable trapping on overflow if it
    is going to efficiently support a range of programming languages. But
    as an optional feature it can be helpful for catching a few bugs in
    code, so it can be a good idea (both for signed and unsigned overflow).

    This supposedly helpful feature has been neglected by C compiler
    developers, and you see in the progression from MIPS (1986) to Alpha
    (1992) and then RISC-V (2011) that the hardware architects have
    accepted that:

    MIPS: add traps on signed overflow, you need to write addu if you
    don't want that.

    Alpha: add ignores signed overflow, you need to write addv if you want
    the trapping.

    RISC-V: add ignores signed overflow, there is no add that traps on
    signed overflow (and detecting signed overflow is pretty
    involved if both operands are unknown to the compiler).
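
    For readers wondering what flag-free detection looks like, one common
    formulation compares the sign of the wrapped result with the signs of
    both operands. The Python sketch below models 32-bit registers with
    masks; the function names are hypothetical, and this is not claimed to
    be the sequence any particular compiler emits for RISC-V.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def add_wrapping(a, b):
    """32-bit add that ignores overflow, like a plain add instruction."""
    return (a + b) & MASK32


def signed_add_overflows(a, b):
    """True if a + b overflows when both are read as two's-complement.

    Overflow happens exactly when the operands share a sign and the
    result's sign differs from it, i.e. the result's sign bit differs
    from the sign bit of *both* operands.
    """
    result = add_wrapping(a, b)
    return bool((a ^ result) & (b ^ result) & SIGN_BIT)


print(signed_add_overflows(0x7FFFFFFF, 1))   # INT_MAX + 1
print(signed_add_overflows(0xFFFFFFFF, 1))   # -1 + 1
```

    That is an add, two exclusive-ors, an and, and a sign test, against a
    single instruction on an architecture with a trapping add.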

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon May 25 17:18:18 2026
    From Newsgroup: comp.arch

    On 25/05/2026 16:28, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 24/05/2026 23:39, quadi wrote:
    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    Most programming environments I have had contact with don't trap on floating-point overflow.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    The question is if an integer overflow means that something went
    wrong.

    At the source code level, that is often the case - but not always. I
    think it is quite clear that if you do something the language does not
    allow, the code is wrong, but it might give the correct results for some
    tools nonetheless. And overflow will often mean something went wrong
    even when the language (or compiler options) specifically allow it. At
    the object code level, things may be different again. (For an obvious example, if you are using a double-width integer type then the source
    code may have no overflow but the implementation might use two "add-with-carry" instructions where overflow is a natural part of the implementation.)

    Despite their eagerness to "optimize" based on the assumption
    that signed integer overflow does not happen, the GCC developers have
    avoided making -ftrapv the default, even on platforms like MIPS and
    Alpha where the implementation of -ftrapv just means to use different instructions (e.g., add instead of addu on MIPS, and addv instead of
    add on Alpha).

    An awkward thing about using trap on overflow is determining how
    precisely it is defined. Supposing you have the expression "a + b - a".
    Perhaps "a + b" overflows. I would hope that when using debug-related
    compiler flags such as "-fsanitize=signed-integer-overflow", a compiler
    would check for overflow on "a + b", and report it at runtime.
    (Unfortunately, gcc does not do that unless the partial expression is
    assigned to a variable.) But in "normal" usage, I'd expect the
    expression to be simplified, resulting in just "b" and no overflow.

    If "trap on overflow" has precise semantics in the code, then this
    disables a range of useful optimisations and re-arrangements. If it is
    just "use trapping arithmetic instructions", then it will miss many
    possible cases of actual overflow in the code, which we might want to
    catch. And "trap on overflow" might either trigger when there is no
    overflow in the original code, or hinder optimisations. (Consider the expression "x / 2 + y / 2" - the compiler could implement that as a
    combined "(x + y) / 2", but that might introduce overflow.)
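
    The "x / 2 + y / 2" case can be checked numerically. The sketch below is
    Python standing in for 32-bit signed arithmetic: the helper names are
    invented, and the inputs are assumed non-negative and even so that
    Python's floor division matches C's truncating division and both forms
    give the same final value.

```python
INT_MIN = -(1 << 31)
INT_MAX = (1 << 31) - 1


def fits_int32(value):
    """True if value is representable as a 32-bit signed integer."""
    return INT_MIN <= value <= INT_MAX


def halve_then_add(x, y):
    """Every intermediate value of x / 2 + y / 2, final result last."""
    return [x // 2, y // 2, x // 2 + y // 2]


def add_then_halve(x, y):
    """Every intermediate value of (x + y) / 2, final result last."""
    return [x + y, (x + y) // 2]


x = y = INT_MAX - 1
original = halve_then_add(x, y)
combined = add_then_halve(x, y)

print(original[-1] == combined[-1])               # same mathematical result
print(all(fits_int32(v) for v in original))       # source form stays in range
print(all(fits_int32(v) for v in combined))       # combined form does not
```

    With trapping adds, the rewritten form would trap on input for which the
    source expression is well defined, which is the false positive in
    question.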

    It is not easy to see how a tool can avoid false positives and false
    negatives and also conveniently optimise and re-arrange code.


    The hardware, of course, cannot always enable trapping on overflow if it
    is going to efficiently support a range of programming languages. But
    as an optional feature it can be helpful for catching a few bugs in
    code, so it can be a good idea (both for signed and unsigned overflow).

    This supposedly helpful feature has been neglected by C compiler
    developers, and you see in the progression from MIPS (1986) to Alpha
    (1992) and then RISC-V (2011) that the hardware architects have
    accepted that:

    MIPS: add traps on signed overflow, you need to write addu if you
    don't want that.

    Alpha: add ignores signed overflow, you need to write addv if you want
    the trapping.

    RISC-V: add ignores signed overflow, there is no add that traps on
    signed overflow (and detecting signed overflow is pretty
    involved if both operands are unknown to the compiler).

    - anton

    Compilers have not always been good at taking advantage of all the
    features provided by hardware - nor have languages been good at exposing
    the possibilities in the language so that programmers can take advantage
    of them.


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 16:45:07 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

    You will find you have no <marketable> choice; you need to support::

    Integer{S8, S16, S32, S64, U8, U16, U32, U64}
    Float {FP8, FP16, FP32, FP64 and some way to get FP128}

    After realizing that I did need a second instruction for unsigned
    _division_ I then learned, to my shock, that division was not one, but
    two, instructions, at least in my architecture, for integers.

    And there didn't seem to be enough opcode space left for Divide Extensibly Unsigned.

    My 66000 has an instruction bit that denotes the signedness of integer calculations {Signed, unSigned}. This bit is available as another OpCode
    bit for non-integer calculation instructions.

    I was able to re-adjust the 32-bit operate instructions so that the two places where only 96 opcodes were provided for the basic operate instructions could now provide 128 opcodes.

    The 16-bit and 24-bit short instructions could not be so modified. But
    there were a few unused opcodes; so Divide Extensibly Unsigned could still fit in, just out of place.

    But that meant that this one operation would be missing from the
    minimum-length immediate instructions, and would still be treated as
    outside the basic instruction set, getting immediate instructions that
    were 16 bits longer.

    The Pigeonhole Principle has finally bit me!

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 16:49:59 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    On 24/05/2026 23:39, quadi wrote:
    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    -----------------
    This supposedly helpful feature has been neglected by C compiler
    developers, and you see in the progression from MIPS (1986) to Alpha
    (1992) and then RISC-V (2011) that the hardware architects have
    accepted that:

    MIPS: add traps on signed overflow, you need to write addu if you
    don't want that.

    Alpha: add ignores signed overflow, you need to write addv if you want
    the trapping.

    RISC-V: add ignores signed overflow, there is no add that traps on
    signed overflow (and detecting signed overflow is pretty
    involved if both operands are unknown to the compiler).

    The worst of all possible semantic encodings


    - anton
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 25 16:43:07 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 25/05/2026 16:28, Anton Ertl wrote:
    Despite their eagerness to "optimize" based on the assumption
    that signed integer overflow does not happen, the GCC developers have
    avoided making -ftrapv the default, even on platforms like MIPS and
    Alpha where the implementation of -ftrapv just means to use different
    instructions (e.g., add instead of addu on MIPS, and addv instead of
    add on Alpha).

    An awkward thing about using trap on overflow is determining how
    precisely it is defined. Supposing you have the expression "a + b - a".
    Perhaps "a + b" overflows. I would hope that when using debug-related
    compiler flags such as "-fsanitize=signed-integer-overflow", a compiler
    would check for overflow on "a + b", and report it at runtime.
    (Unfortunately, gcc does not do that unless the partial expression is
    assigned to a variable.) But in "normal" usage, I'd expect the
    expression to be simplified, resulting in just "b" and no overflow.

    OTOH, cases like a+b+c where the result is in range, while an
    intermediate result is out of range are one of the reasons why I
    prefer -fwrapv over -ftrapv. As for your preference of nasal demons,
    given enough information, the compiler might "optimize" "a+b-a" into,
    e.g., 0.
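
    The a+b+c case can be made concrete with a small C sketch. It assumes a
    two's-complement machine and a compiler (such as GCC or Clang) that
    provides __builtin_add_overflow and defines unsigned-to-signed
    conversion as reduction modulo 2^N; the function names are invented for
    illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wrapping sum of three longs, done in unsigned arithmetic.  The final
   conversion back to long is implementation-defined in C; GCC and Clang
   define it as reduction modulo 2^N. */
long sum3_wrapping(long a, long b, long c)
{
    return (long)((unsigned long)a + (unsigned long)b + (unsigned long)c);
}

/* Reports whether the first partial sum a + b leaves the range of long.
   __builtin_add_overflow is a GCC/Clang extension. */
bool partial_sum_overflows(long a, long b)
{
    long t;
    return __builtin_add_overflow(a, b, &t);
}

/* Usage check: the intermediate sum overflows, the final result does not. */
void sum3_example(void)
{
    assert(partial_sum_overflows(LONG_MAX, 1));
    assert(sum3_wrapping(LONG_MAX, 1, -1) == LONG_MAX);
}
```

    A trapping add would stop at the first partial sum, while wrapping
    arithmetic still delivers the in-range final result.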

    Anyway, the definition of -ftrapv is not very precise; for gcc-12.2:

    |'-ftrapv'
    | This option generates traps for signed overflow on addition,
    | subtraction, multiplication operations.

    As for what gcc-12.2 does for your example on AMD64:

    long foo(long a, long b)
    {
      return a+b-a;
    }

    is compiled with gcc -O3 -ftrapv to:

    0: 48 89 f0 mov %rsi,%rax
    3: c3 ret

    If "trap on overflow" has precise semantics in the code, then this
    disables a range of useful optimisations and re-arrangements. If it is
    just "use trapping arithmetic instructions", then it will miss many
    possible cases of actual overflow in the code, which we might want to
    catch.

    Which would you prefer by default?

    The gcc developers apparently took the latter approach, even when you
    ask for -ftrapv explicitly. So what, IYO, speaks against doing that
    by default on machines like MIPS and Alpha.

    And "trap on overflow" might either trigger when there is no
    overflow in the original code, or hinder optimisations. (Consider the
    expression "x / 2 + y / 2" - the compiler could implement that as a
    combined "(x + y) / 2", but that might introduce overflow.)

    x/2+y/2 produces a different result from (x+y)/2 when both x and y are
    odd integers.
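
    A short worked example of that difference, as a C sketch with made-up
    function names. It relies only on C99's truncating integer division.

```c
#include <assert.h>
#include <limits.h>

/* Halve each operand first: the sum of two halves always fits in long. */
long half_sum_split(long x, long y)  { return x / 2 + y / 2; }

/* Add first, then halve: the caller must ensure x + y is representable. */
long half_sum_joined(long x, long y) { return (x + y) / 2; }

void half_sum_example(void)
{
    assert(half_sum_split(3, 5) == 3);    /* 1 + 2 */
    assert(half_sum_joined(3, 5) == 4);   /* 8 / 2 */
    assert(half_sum_split(4, 6) == half_sum_joined(4, 6));
    /* The split form stays in range where x + y would overflow. */
    assert(half_sum_split(LONG_MAX, LONG_MAX) == LONG_MAX - 1);
}
```

    So the two forms are not interchangeable even before overflow is
    considered: with 3 and 5 the split form loses one.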

    gcc-12.2 compiles

    long bar(long x, long y)
    {
      return x/2+y/2;
    }


    on AMD64 to:

    gcc -O3 -ftrapv         gcc -O3
    mov  %rdi,%rax          mov  %rdi,%rax
    sub  $0x8,%rsp          mov  %rsi,%rdx
    shr  $0x3f,%rax         shr  $0x3f,%rax
    add  %rax,%rdi          shr  $0x3f,%rdx
    mov  %rsi,%rax          add  %rdi,%rax
    shr  $0x3f,%rax         add  %rsi,%rdx
    sar  %rdi               sar  %rax
    add  %rax,%rsi          sar  %rdx
    sar  %rsi               add  %rdx,%rax
    call __addvdi3@PLT      ret
    add  $0x8,%rsp
    ret

    so the -ftrapv introduces an additional mov and a call; I would have
    expected that the + would be compiled to an ADD instruction followed
    by a JO instruction.
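
    For comparison, an inline check can be requested explicitly from C with
    the GCC/Clang extension __builtin_add_overflow. The sketch below uses
    invented wrapper names; on AMD64 such code is commonly compiled to an
    add followed by a branch on the overflow flag, though the exact output
    depends on compiler version and options and was not checked against the
    gcc versions discussed here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Returns true on overflow; *sum receives the wrapped result either way. */
bool add_checked(long a, long b, long *sum)
{
    return __builtin_add_overflow(a, b, sum);
}

/* Trap-style variant: __builtin_trap() aborts the program on overflow. */
long add_or_trap(long a, long b)
{
    long sum;
    if (add_checked(a, b, &sum))
        __builtin_trap();
    return sum;
}

void add_checked_example(void)
{
    long s;
    assert(!add_checked(2, 3, &s) && s == 5);
    assert(add_checked(LONG_MAX, 1, &s));
}
```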

    Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
    produces ILP32 code) produces a call to __addvsi3 instead of the
    expected add instruction:

    gcc -O3 -ftrapv           gcc -O3
    lui   gp,0x0              srl   v0,a0,0x1f
    addiu gp,gp,0             srl   v1,a1,0x1f
    addu  gp,gp,t9            addu  v0,v0,a0
    srl   v1,a0,0x1f          addu  a1,v1,a1
    lw    t9,__addvsi3(gp)    sra   v0,v0,0x1
    srl   v0,a1,0x1f          sra   a1,a1,0x1
    addiu sp,sp,-32           jr    ra
    addu  a0,v1,a0            addu  v0,v0,a1
    addu  a1,v0,a1
    sra   a0,a0,0x1
    sw    ra,28(sp)
    sw    gp,16(sp)
    jalr  t9
    sra   a1,a1,0x1
    lw    ra,28(sp)
    jr    ra
    addiu sp,sp,32

    The call costs a lot of overhead.

    It is not easy to see how a tool can avoid false positives and false
    negatives and also conveniently optimise and re-arrange code.

    It can't. But it does not try to avoid false negatives even when
    explicitly asked for trapping on overflow.

    If some overflow trapping when it can be done without additional
    instructions would be preferable over no overflow, gcc would compile
    signed adds that survive after optimization into add on MIPS rather
    than addu, by default. Given that it does not, the GCC developers
    probably found out that it is not preferable. I guess they would get
    too many customer complaints, including for "relevant" code, i.e.,
    code where the usual "it's UB, so your code is broken" excuse does not
    work.

    The fact that they don't even try to make -ftrapv produce efficient
    code indicates that there is no "relevant" interest in efficient
    -ftrapv. It would be interesting to know who came up with the idea of
    adding -ftrapv, and why they are still keeping it.

    Compilers have not always been good at taking advantage of all the
    features provided by hardware

    GCC is pretty good at implementing -fwrapv. For the two examples
    above, "gcc -O3 -fwrapv" produces the same code on AMD64 and MIPS as
    "gcc -O3".

    nor have languages been good at exposing
    the possibilities in the language so that programmers can take advantage
    of them.

    Yes. But I leave that for another day.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 19:20:01 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    On 25/05/2026 16:28, Anton Ertl wrote:
    Despite their eagerness to "optimize" based on the assumption
    that signed integer overflow does not happen, the GCC developers have
    avoided making -ftrapv the default, even on platforms like MIPS and
    Alpha where the implementation of -ftrapv just means to use different
    instructions (e.g., add instead of addu on MIPS, and addv instead of
    add on Alpha).

    An awkward thing about using trap on overflow is determining how
    precisely it is defined. Supposing you have the expression "a + b - a".
    Perhaps "a + b" overflows. I would hope that when using debug-related
    compiler flags such as "-fsanitize=signed-integer-overflow", a compiler
    would check for overflow on "a + b", and report it at runtime.
    (Unfortunately, gcc does not do that unless the partial expression is
    assigned to a variable.) But in "normal" usage, I'd expect the
    expression to be simplified, resulting in just "b" and no overflow.

    OTOH, cases like a+b+c where the result is in range, while an
    intermediate result is out of range are one of the reasons why I
    prefer -fwrapv over -ftrapv. As for your preference of nasal demons,
    given enough information, the compiler might "optimize" "a+b-a" into,
    e.g., 0.

    a/0/b/


    Anyway, the definition of -ftrapv is not very precise; for gcc-12.2:

    |'-ftrapv'
    | This option generates traps for signed overflow on addition,
    | subtraction, multiplication operations.

    As for what gcc-12.2 does for your example on AMD64:

    long foo(long a, long b)
    {
      return a+b-a;
    }

    is compiled with gcc -O3 -ftrapv to:

    0: 48 89 f0 mov %rsi,%rax
    3: c3 ret

    If "trap on overflow" has precise semantics in the code, then this
    disables a range of useful optimisations and re-arrangements. If it is
    just "use trapping arithmetic instructions", then it will miss many
    possible cases of actual overflow in the code, which we might want to
    catch.

    Which would you prefer by default?

    What you do want is compiled code that can trap on overflow and avoid
    trapping on overflow without code substitution or being re-compiled.
    This way production code can avoid trapping but if the debugger is
    turned on, you can trap.

    The gcc developers apparently took the latter approach, even when you
    ask for -ftrapv explicitly. So what, IYO, speaks against doing that
    by default on machines like MIPS and Alpha.

    Both architectures got this one wrong--IMO--and so does RISC-V.

    And "trap on overflow" might either trigger when there is no
    overflow in the original code, or hinder optimisations. (Consider the
    expression "x / 2 + y / 2" - the compiler could implement that as a
    combined "(x + y) / 2", but that might introduce overflow.)

    x/2+y/2 produces a different result from (x+y)/2 when both x and y are
    odd integers.

    gcc-12.2 compiles

    long bar(long x, long y)
    {
      return x/2+y/2;
    }


    on AMD64 to:

    gcc -O3 -ftrapv         gcc -O3
    mov  %rdi,%rax          mov  %rdi,%rax
    sub  $0x8,%rsp          mov  %rsi,%rdx
    shr  $0x3f,%rax         shr  $0x3f,%rax
    add  %rax,%rdi          shr  $0x3f,%rdx
    mov  %rsi,%rax          add  %rdi,%rax
    shr  $0x3f,%rax         add  %rsi,%rdx
    sar  %rdi               sar  %rax
    add  %rax,%rsi          sar  %rdx
    sar  %rsi               add  %rdx,%rax
    call __addvdi3@PLT      ret
    add  $0x8,%rsp
    ret

    so the -ftrapv introduces an additional mov and a call; I would have
    expected that the + would be compiled to an ADD instruction followed
    by a JO instruction.

    Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
    produces ILP32 code) produces a call to __addvsi3 instead of the
    expected add instruction:

    gcc -O3 -ftrapv           gcc -O3
    lui   gp,0x0              srl   v0,a0,0x1f
    addiu gp,gp,0             srl   v1,a1,0x1f
    addu  gp,gp,t9            addu  v0,v0,a0
    srl   v1,a0,0x1f          addu  a1,v1,a1
    lw    t9,__addvsi3(gp)    sra   v0,v0,0x1
    srl   v0,a1,0x1f          sra   a1,a1,0x1
    addiu sp,sp,-32           jr    ra
    addu  a0,v1,a0            addu  v0,v0,a1
    addu  a1,v0,a1
    sra   a0,a0,0x1
    sw    ra,28(sp)
    sw    gp,16(sp)
    jalr  t9
    sra   a1,a1,0x1
    lw    ra,28(sp)
    jr    ra
    addiu sp,sp,32

    The call costs a lot of overhead.

    Architectures without overflow traps are notorious for excess instruction
    count when overflow detection is desired or mandated.

    It is not easy to see how a tool can avoid false positives and false
    negatives and also conveniently optimise and re-arrange code.

    It can't. But it does not try to avoid false negatives even when
    explicitly asked for trapping on overflow.

    Granted, Optimization can do a lot of strange code emission and movement
    when one does not care about precise overflow semantics. But, as a whole,
    we are a society where we want high HP automobiles more than we want
    safe automobiles ('we' not including *.gov's).

    If some overflow trapping when it can be done without additional
    instructions would be preferable over no overflow, gcc would compile
    signed adds that survive after optimization into add on MIPS rather
    than addu, by default. Given that it does not, the GCC developers
    probably found out that it is not preferable. I guess they would get
    too many customer complaints, including for "relevant" code, i.e.,
    code where the usual "it's UB, so your code is broken" excuse does not
    work.

    It is much harder than that. For example: does a signed shift left
    overflow when significant bits are shifted out ?? What if the
    subsequent instruction shifts the result back and the pair are acting
    as a bit-field extract ?? My 66000 has bit field extracts for exactly
    this reason. Floating-point has a lot of these cases, too.

    The fact that they don't even try to make -ftrapv produce efficient
    code indicates that there is no "relevant" interest in efficient
    -ftrapv. It would be interesting to know who came up with the idea of
    adding -ftrapv, and why they are still keeping it.

    Compilers have not always been good at taking advantage of all the
    features provided by hardware

    GCC is pretty good at implementing -fwrapv. For the two examples
    above, "gcc -O3 -fwrapv" produces the same code on AMD64 and MIPS as
    "gcc -O3".

    nor have languages been good at exposing
    the possibilities in the language so that programmers can take advantage
    of them.

    Yes. But I leave that for another day.

    A whole new kettle of fish...

    - anton
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 20:26:24 2026
    From Newsgroup: comp.arch

    On Mon, 25 May 2026 10:23:00 +0200, David Brown wrote:

    The hardware, of course, cannot always enable trapping on overflow if it
    is going to efficiently support a range of programming languages.

    Yes. And I am used to FORTRAN, which did not trap on integer overflows.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 20:32:15 2026
    From Newsgroup: comp.arch

    On Mon, 25 May 2026 19:20:01 +0000, MitchAlsup wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    David Brown <david.brown@hesbynett.no> writes:
    On 25/05/2026 16:28, Anton Ertl wrote:

    Despite their eagerness to "optimize" based on the assumption
    that signed integer overflow does not happen, the GCC developers
    have avoided making -ftrapv the default, even on platforms like MIPS
    and Alpha where the implementation of -ftrapv just means to use
    different instructions (e.g., add instead of addu on MIPS, and addv
    instead of add on Alpha).

    Both architectures got this one wrong--IMO--and so does RISC-V.

    You may not have been replying to what Anton Ertl wrote above, since there
    was a lot in between that I snipped. But it does mention two architectures
    that took an approach to trapping on integer overflow... that I also tend
    to disagree with.

    What I'm used to is the System/360. While it made the mistake of having
    two condition code bits instead of NZVC, the idea of having "trap on
    overflow" controlled by a bit in the PSW is... what I assumed to be normal
    and correct.

    I could be wrong, as I haven't examined that approach critically and given full consideration to the alternatives.

    John Savard
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon May 25 20:32:15 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> schrieb:
    On 24/05/2026 23:39, quadi wrote:
    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    No, so being able to turn the trap for integer overflow on should
    definitely be allowed. But that shouldn't be the default behavior.
    Otherwise, programs like random number generators wouldn't work.

    John Savard

    That does not make sense. Code such as random number generators should
    be written so that they are correct in the language they are written in.

    In principle, yes.

    In practice, people often used whatever "worked" on their systems.
    Implementors have a certain right because they control what their
    compiler does or does not do. But users did so, as well, with
    Numerical Recipes a(n in)famous example.

    And yes, this bites people. You can see this at https://gcc.gnu.org/gcc-13/porting_to.html :

    # GCC 13 includes new optimizations which may change behavior
    # on integer overflow. Traditional code, like linear congruential
    # pseudo-random number generators in old programs and relying on
    # a specific, non-standard behavior may now generate unexpected
    # results. The option -fsanitize=undefined can be used to detect
    # such code at runtime.

    # It is recommended to use the intrinsic subroutine RANDOM_NUMBER for
    # random number generators or, if the old behavior is desired, to use
    # the -fwrapv option. Note that this option can impact performance.


    If that is C, signed integer overflow is UB while unsigned integers
    have wrapping behaviour - thus if your code depends on wrapping, and it
    is written in C, it needs to use unsigned types or compiler-specific
    extensions, flags, etc. (Or C23 ckd_add and other checked arithmetic
    functions.)
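
    As an illustration of the "use unsigned types" route, here is a minimal
    C sketch of a 32-bit linear congruential step whose wraparound is fully
    defined by the language. The function name is invented, and the
    multiplier and increment are one commonly cited 32-bit parameter pair
    chosen for this example, not something taken from the code under
    discussion.

```c
#include <assert.h>
#include <stdint.h>

/* One LCG step, reduced modulo 2^32 by the conversion to uint32_t.
   Unsigned arithmetic wraps by definition, so no compiler flag is needed. */
uint32_t lcg_next(uint32_t state)
{
    return state * UINT32_C(1664525) + UINT32_C(1013904223);
}

void lcg_example(void)
{
    assert(lcg_next(0) == UINT32_C(1013904223));
    assert(lcg_next(1) == UINT32_C(1015568748));
    /* The product wraps here, and the result is still well defined. */
    assert(lcg_next(UINT32_MAX) == UINT32_C(1012239698));
}
```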

    If it is written in Zig, you need to use the specific modulo arithmetic
    functions even for unsigned arithmetic. If it is written in Java,
    signed integer arithmetic is fine.

    It all depends on the language and/or any options the language and tools
    might support - and code should be written to work correctly according
    to the language rules.

    Fortran has no standard way of implementing this unless you
    restrict yourself to sizes which do not overflow a signed integer.
    Implementing LCGRNGs was one reason why I pushed for unsigned
    arithmetic (modulo 2**n) in Fortran. The attempt failed (not
    taken up by WG5 after being endorsed by J3), but I implemented it
    for gfortran anyway.

    The hardware, of course, cannot always enable trapping on overflow if it
    is going to efficiently support a range of programming languages. But
    as an optional feature it can be helpful for catching a few bugs in
    code, so it can be a good idea (both for signed and unsigned overflow).

    Sanitizers are also fairly good now, but of course cost performance.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 20:34:41 2026
    From Newsgroup: comp.arch

    On Mon, 25 May 2026 16:49:59 +0000, MitchAlsup wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    RISC-V: add ignores signed overflow, there is no add that traps on
    signed overflow (and detecting signed overflow is pretty
    involved if both operands are unknown to the compiler).

    The worst of all possible semantic encodings

    Although I thought that making trapping on fixed-point overflow the
    default is a bad idea, I agree that making it impossible to do so, or even test for fixed-point overflow, is a much worse idea.

    John Savard
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 20:45:20 2026
    From Newsgroup: comp.arch

    On Mon, 25 May 2026 16:45:07 +0000, MitchAlsup wrote:

    My 66000 has an instruction bit that denotes the signedness of integer calculations {Signed, unSigned}. This bit is available as another OpCode
    bit for non-integer calculation instructions.

    That's nice. It's not an option I can consider, as having lots of
    orthogonal modifiers on instructions would tend to increase their length.
    A major goal of the Concertina II, III, and IV architectures is for instructions not to be longer than similar instructions on the Motorola
    68020 or the IBM System/360 if at all possible.

    Basically, the selling point is... "Your programs only get 10% bigger, if that, and yet you have 32 registers, so they run faster!".

    Or they _would_, if the design didn't have so many extra transistors for
    supporting both IBM-format and Intel-format Decimal Floating Point,
    old-style IBM floats, simple floating (You too can work with numbers that
    go around the world 2 1/2 times!), packed decimal, mixed-radix arithmetic...

    But, hey, supporting these things in hardware is faster than doing them in software!

    And are people even going to _read_ the part of the manual that
    explains... as is noted in the description of the original Concertina architecture...

    This chip has 8-way simultaneous multi-threading, but only for programs
    which do not make use of extensions to the register set.

    Only two programs per core may use the extended register banks with 128 elements.

    Only one program per core may use the vector registers for long vector instructions. The 256-bit short vector registers, on the other hand, like
    the integer and floating-point registers, are available to all
    simultaneous threads.

    John Savard
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 25 20:32:35 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    What you do want is compiled code that can trap on overflow and avoid
    trapping on overflow without code substitution or being re-compiled.
    This way production code can avoid trapping but if the debugger is
    turned on, you can trap.

    Why do you consider that desirable?

    long bar(long x, long y)
    {
      return x/2+y/2;
    }
    ...
    Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
    produces ILP32 code) produces a call to __addvsi3 instead of the
    expected add instruction:

    gcc -O3 -ftrapv           gcc -O3
    lui   gp,0x0              srl   v0,a0,0x1f
    addiu gp,gp,0             srl   v1,a1,0x1f
    addu  gp,gp,t9            addu  v0,v0,a0
    srl   v1,a0,0x1f          addu  a1,v1,a1
    lw    t9,__addvsi3(gp)    sra   v0,v0,0x1
    srl   v0,a1,0x1f          sra   a1,a1,0x1
    addiu sp,sp,-32           jr    ra
    addu  a0,v1,a0            addu  v0,v0,a1
    addu  a1,v0,a1
    sra   a0,a0,0x1
    sw    ra,28(sp)
    sw    gp,16(sp)
    jalr  t9
    sra   a1,a1,0x1
    lw    ra,28(sp)
    jr    ra
    addiu sp,sp,32

    The call costs a lot of overhead.

    Architectures without overflow traps are notorious for excess instruction
    count when overflow detection is desired or mandated.

    MIPS' add traps on overflow. gcc could have emitted almost the same
    code for gcc -O3 -ftrapv as for gcc -O3, except that the last
    instruction would be an add, not an addu. But apparently nobody gives
    a damn about the efficiency of -ftrapv, possibly rightly so.

    If some overflow trapping when it can be done without additional
    instructions would be preferable over no overflow, gcc would compile
    signed adds that survive after optimization into add on MIPS rather
    than addu, by default. Given that it does not, the GCC developers
    probably found out that it is not preferable. I guess they would get
    too many customer complaints, including for "relevant" code, i.e.,
    code where the usual "it's UB, so your code is broken" excuse does not
    work.

    It is much harder than that. For example: does a signed shift left
    overflow when significant bits are shifted out ??

    -ftrapv specifies trapping on overflow only for additions,
    subtractions, and multiplications.
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From BGB@cr88192@gmail.com to comp.arch on Mon May 25 16:34:50 2026
    From Newsgroup: comp.arch

    On 5/25/2026 9:28 AM, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 24/05/2026 23:39, quadi wrote:
    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    Most programming environments I have had contact with don't trap on floating-point overflow.


    Many just go Inf...

    Division by zero is usually handled by going NaN.

    Contrast with integer division by zero which does usually trap.


    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    The question is if an integer overflow means that something went
    wrong. Despite their eagerness to "optimize" based on the assumption
    that signed integer overflow does not happen, the GCC developers have
    avoided making -ftrapv the default, even on platforms like MIPS and
    Alpha where the implementation of -ftrapv just means to use different
    instructions (e.g., add instead of addu on MIPS, and addv instead of
    add on Alpha).


    Integer overflow happens far too often for trapping to be a good solution.


    We almost need a separate "integer that should not overflow" type, with
    more explicit "do something special if it does" semantics.


    Though, more likely to be useful would be a "detect if an overflow had happened" mechanism.

    errno_t ovfstate;
    __int_no_overflow x, y, z;
    ...
    __start_errsense(&ovfstate);
    z=x+y;
    __end_errsense(&ovfstate);
    if(ovfstate&ERRSENSE_FLAG_OVERFLOW)
    ...

    Which would be awkward, but probably more useful than, say, raising a
    signal and/or terminating the program.
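
    Something close to this can already be approximated in software. The
    sketch below is only an illustration: errsense_t, errsense_start and
    errsense_add are hypothetical names modelled on the pseudocode above,
    and the check itself relies on the GCC/Clang extension
    __builtin_add_overflow rather than on any hardware support.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical sticky overflow state, set by checked operations. */
typedef struct { bool overflow; } errsense_t;

void errsense_start(errsense_t *st) { st->overflow = false; }

/* Adds x and y, returns the wrapped result, and records any overflow. */
int errsense_add(errsense_t *st, int x, int y)
{
    int z;
    if (__builtin_add_overflow(x, y, &z))
        st->overflow = true;    /* sticky: stays set until the next start */
    return z;
}

void errsense_example(void)
{
    errsense_t st;
    errsense_start(&st);
    int z = errsense_add(&st, 1, 2);
    assert(z == 3 && !st.overflow);
    z = errsense_add(&st, INT_MAX, 1);
    assert(z == INT_MIN && st.overflow);
}
```

    The flag is tested once after a run of operations, so the program
    decides what to do about an overflow instead of receiving a signal.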


    The hardware, of course, cannot always enable trapping on overflow if it
    is going to efficiently support a range of programming languages. But
    as an optional feature it can be helpful for catching a few bugs in
    code, so it can be a good idea (both for signed and unsigned overflow).

    This supposedly helpful feature has been neglected by C compiler
    developers, and you see in the progression from MIPS (1986) to Alpha
    (1992) and then RISC-V (2011) that the hardware architects have
    accepted that:

    MIPS: add traps on signed overflow, you need to write addu if you
    don't want that.

    Alpha: add ignores signed overflow, you need to write addv if you want
    the trapping.

    RISC-V: add ignores signed overflow, there is no add that traps on
    signed overflow (and detecting signed overflow is pretty
    involved if both operands are unknown to the compiler).


    In practice, given:
    We have instructions like ADDW, etc, whose behavior is explicitly to sign-extend the results of 32-bit ADD;
    Behavior in practice is often to meticulously follow wrap-on-overflow semantics;
    Exceptions to wrap-on-overflow usually exist as edge cases;
    Various programs exist that will actively break if wrap-on-overflow is
    not the observed behavior in C land;
    ...

    The expectation that 'int' can meaningfully do something other than
    wrap on overflow is more of a fantasy.


    Or like some other "portability boogeymen":
    Non two's complement integer arithmetic;
    Big endian machines;
    Machines that don't allow unaligned loads and stores;
    Types with sizes other than the "usually accepted" set;
    ...


    The argument has often been, "but, 64-bit machines might not provide
    native 32-bit arithmetic".

    But, often in 64-bit machines, a pattern emerges:
    Most ops are full 64-bit;
    A subset of instructions have variants that produce sign and/or zero
    extended results;
    The instructions which produce these results, typically being, the ones
    needed to preserve the usual wrap-on-overflow semantics in those places
    where something could happen that would produce a deviation from the
    expected semantics.

    The ones that have zero-extension usually treating signed integers as zero-extended.

    The reverse has also been done; treating unsigned as sign-extended, as
    in the standard RISC-V ABI, but IMO this is stupid. Even in the absence
    of a native zero-extension op (as in plain RV64G), the mess that results
    from sign-extending unsigned is worse than the cost of explicit zero extension.

    Best case here being to keep values using "native extension":
    'int' : Always sign extended;
    'unsigned int': Always zero extended.
    Then 32-bit types are a strict subset of the 64-bit range, and
    up-promotion becomes free. Not sure why some people don't see this as
    obvious though. Well, and people keep making the choice of adding
    garbage edge cases to RISC-V that would have been entirely unnecessary
    if people weren't being stupid about the ABI rules.
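
    The "native extension" rule can be pictured with plain C conversions.
    This is only a sketch of the value ranges involved, with invented
    function names; it says nothing about how any particular ABI keeps
    values in registers.

```c
#include <assert.h>
#include <stdint.h>

/* 'int' kept sign-extended: the 64-bit value equals the 32-bit value. */
int64_t widen_int(int32_t v)   { return (int64_t)v; }

/* 'unsigned int' kept zero-extended: likewise, and it still fits in int64_t. */
int64_t widen_uint(uint32_t v) { return (int64_t)v; }

void widen_example(void)
{
    assert(widen_int(-1) == -1);
    assert(widen_uint(UINT32_MAX) == INT64_C(4294967295));
    /* Narrowing back recovers the original 32-bit value. */
    assert((int32_t)widen_int(INT32_MIN) == INT32_MIN);
    assert((uint32_t)widen_uint(UINT32_MAX) == UINT32_MAX);
}
```

    Under this rule both 32-bit types map to distinct, in-range 64-bit
    values, so promotion to a 64-bit type needs no extra instruction.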

    But yeah...


    But, all this would not be expected to happen unless one accepts that it
    is already generally accepted that wrap-on-overflow for 'int' and
    similar is the only really practical or viable solution here.





    Otherwise, recently:
    In my case I decided to live with a "breaking change" in XG3 and to
    change some things that may matter later. Then ended up tweaking some
    other things on my annoyance list (since I was already breaking existing binaries, better to cluster breakage to a singular event if doing it).

    ADD, ADDS.L, and ADDU.L have all been changed from Imm10u/n to Imm10s.
    The Imm10u cases are now Imm10s;
    The Imm10n sub-case is now dropped/reserved.
    May be reused later.
    This reclaims 3 out of the 20 Imm10 spots.
    Was mostly a case of it being harder to justify the encoding space.
    Old behavior will need to remain for XG1 and XG2.
    In this case, XG3 will explicitly deviate from XG1 and XG2 here.
    Does mean that XG3 now has less ADD/SUB Imm range than XG2, but...
    Only goes from 97.1% hit rate to 95.9%,
    no significant effect on overall code density.
    Could use the RV Imm12 ops (ADDI / ADDIW), but:
    Hit rate for the RV ops here is negligible;
    Much of these also happen to miss on one or both registers.
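
    A rough sketch of the range trade-off, assuming Imm10s means a 10-bit
    sign-extended immediate (-512..511), Imm10u a 10-bit zero-extended one
    (0..1023), and Imm10n a 10-bit one-extended one (-1024..-1). Those
    ranges are my reading of the notation, and the function names are
    hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ranges for the three immediate forms (see the note above). */
static bool fits_imm10u(int64_t v) { return v >= 0 && v <= 1023; }
static bool fits_imm10n(int64_t v) { return v >= -1024 && v <= -1; }
static bool fits_imm10s(int64_t v) { return v >= -512 && v <= 511; }

/* Old scheme: two encodings (u and n) together cover -1024..1023. */
static bool fits_imm10un(int64_t v)
{
    return fits_imm10u(v) || fits_imm10n(v);
}
```

    Under these assumptions the old u/n pair covers twice the range of a
    single Imm10s form, at the cost of a second encoding spot per op.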

    The MULS.L and MULU.L ops were also switched to Imm10s.
    This means all of the Imm10 ALU ops are now unified on Imm10s.

    Relocated TST and TSTN from the F0-8 block (with the XMOV instructions)
    to the F0-9 block (with the other CMPxx 3R ops).

    A few very rarely used instructions were demoted from 32-bit to 64-bit encodings.


    Have experimentally added some 32-bit:
    Bcc Rm, Imm6s, (PC, Disp6s)
    instructions, where:
    Imm6s: Hits ~ 80% of these cases;
    Disp6s: Hits ~ 60% of these cases;
    Imm5s + Disp7s would hit slightly better, but,
    would have needed more new decoder logic...
    Resulting in it hitting about half over the:
    Bcc Rm, Imm17s, (PC, Disp10s)
    Cases, for an overall code-density improvement of ~ 0.5%, ...
    Dominant use-case: Final compare-and-branch in a short "for()" loop.
    Secondary use-case: Short non-predicated "if()" branches.
    But, is out-weighed by said predicated "if()" branches.
    Would likely see more use here if not using predication.
    If it would have hit for 100% of these, would have saved ~ 1%.

    This is debatable.

    This reused the encoding spots previously used for the Load-Disp5us ops,
    which still exist for XG1 and XG2 (decoder special-case handling), but
    were N/A in XG3 (they would be in effect entirely redundant with the
    Disp10s forms in XG3; but had non-redundant edge-cases in XG1 and XG2).

    Like with the Imm17s+Disp10s ops, these will still depend on the IMMB extension, as they still need the same basic mechanism.

    Was a fairly low-priority feature, in any case.


    Seemingly running low on obvious optimization paths.


    - anton

  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 22:49:58 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Mon, 25 May 2026 10:23:00 +0200, David Brown wrote:

    The hardware, of course, cannot always enable trapping on overflow if it
    is going to efficiently support a range of programming languages.

    Yes. And I am used to FORTRAN, which did not trap on integer overflows.

    WATfor and WATfive trapped on integer overflows.

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 22:51:42 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Mon, 25 May 2026 19:20:01 +0000, MitchAlsup wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    David Brown <david.brown@hesbynett.no> writes:
    On 25/05/2026 16:28, Anton Ertl wrote:

    Despite their eagerness to "optimize" based on the assumption
    that signed integer overflow does not happen, the GCC developers
    have avoided making -ftrapv the default, even on platforms like MIPS
    and Alpha where the implementation of -ftrapv just means to use
    different instructions (e.g., add instead of addu on MIPS, and addv
    instead of add on Alpha).

    Both architectures got this one wrong--IMO--and so does RISC-V.

    You may not have been replying to what Anton Ertl wrote above, since there was a lot in between that I snipped. But it does mention two architectures that took an approach to trapping on integer overflow... that I also tend
    to disagree with.

    What I'm used to is the System/360. While it made the mistake of having
    two condition code bits instead of NZVC, the idea of having "trap on overflow" controlled by a bit in the PSW is... what I assumed to be normal and correct.

    And what My 66000 does....

    I purport that ANY Industrial quality ISA should provide a means to
    trap on integer overflow.

    I could be wrong, as I haven't examined that approach critically and given full consideration to the alternatives.

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 22:59:10 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 24/05/2026 23:39, quadi wrote:
    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    No, so being able to turn the trap for integer overflow on should
    definitely be allowed. But that shouldn't be the default behavior.
    Otherwise, programs like random number generators wouldn't work.

    John Savard

    That does not make sense. Code such as random number generators should
    be written so that they are correct in the language they are written in.

    In principle, yes.

    Principle is better in theory than in practice.

    In practice, people often used whatever "worked" on their systems.

    Face it, the poor slug writing the code may not have the faintest
    grasp of the system qualities we are discussing, and does not care
    to learn as long as he can slug through the writing and his program
    does not blow up catastrophically while it is under his purview.

    That defines a lot of what is wrong with SW programming today.

    Implementors have a certain right because they control what their
    compiler does or does not do.

    You would be surprised at how little influence implementors have
    on compilers and other software.

    But users did so, as well, with
    Numerical Recipes a(n in)famous example.

    And yes, this bites people. You can see this at https://gcc.gnu.org/gcc-13/porting_to.html :

    # GCC 13 includes new optimizations which may change behavior
    # on integer overflow. Traditional code, like linear congruential
    # pseudo-random number generators in old programs and relying on
    # a specific, non-standard behavior may now generate unexpected
    # results. The option -fsanitize=undefined can be used to detect
    # such code at runtime.

    My VAX favorite was:

    for( int i = 1; i; i+=i )

    Traps instead of exiting the loop normally.

    # It is recommended to use the intrinsic subroutine RANDOM_NUMBER for
    # random number generators or, if the old behavior is desired, to use
    # the -fwrapv option. Note that this option can impact performance.
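
    For the doubling loop mentioned above, a sketch of a variant whose
    termination does not depend on signed overflow: with an unsigned
    32-bit counter the wrap to zero is defined behaviour in C. The
    function name is made up for illustration.

```c
#include <stdint.h>

/* Counts how many times the body of the doubling loop runs when the
   counter is a 32-bit unsigned integer: i takes the values 2^0 .. 2^31,
   then 2^31 + 2^31 wraps to 0 and the loop exits. */
static int doubling_iterations(void)
{
    int n = 0;
    for (uint32_t i = 1; i; i += i)
        n++;
    return n;
}
```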


    If that is C, signed integer overflow is UB while unsigned integers
    have wrapping behaviour - thus if your code depends on wrapping, and it
    is written in C, it needs to use unsigned types or compiler-specific extensions, flags, etc. (Or C23 ckd_add and other checked arithmetic functions.)
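
    A minimal sketch of a linear congruential generator written with
    unsigned arithmetic, so that the modulo 2**32 wrap is what the C
    language defines rather than something that merely happens to work.
    The multiplier and increment are a commonly cited pair chosen only for
    illustration; nothing here is claimed to be the generator any poster
    had in mind.

```c
#include <stdint.h>

/* One LCG step, computed modulo 2^32 by unsigned wraparound. */
static uint32_t lcg_next(uint32_t state)
{
    return state * 1664525u + 1013904223u;
}
```

    Because every operand is unsigned, neither -fwrapv nor any other
    compiler flag is needed for the wrap to be well defined.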

    If it is written in Zig, you need to use the specific modulo arithmetic functions even for unsigned arithmetic. If it is written in Java,
    signed integer arithmetic is fine.

    It all depends on the language and/or any options the language and tools might support - and code should be written to work correctly according
    to the language rules.

    Fortran has no standard way of implementing this unless you
    restrict yourself to sizes which do not overflow a signed integer.

    Old FORTRAN had no unSigned integer type and no way to avoid overflows.

    Implementing LCGRNGs was one reason why I pushed for unsigned
    arithmetic (modulo 2**n) in Fortran. The attempt failed (not
    taken up by WG5 after being endorsed by J3), but I implemented it
    for gfortran anyway.

    The hardware, of course, cannot always enable trapping on overflow if it is going to efficiently support a range of programming languages. But
    as an optional feature it can be helpful for catching a few bugs in
    code, so it can be a good idea (both for signed and unsigned overflow).

    Sanitizers are also fairly good now, but of course cost performance.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 23:00:32 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    What you do want is compiled code that can trap on overflow and avoid
    trapping on overflow without code substitution or being re-compiled.
    This way production code can avoid trapping but if the debugger is
    turned on, you can trap.

    Why do you consider that desirable?

    So you can debug production/released code to find subtle errors.

    long bar(long x, long y)
    {
        return x/2+y/2;
    }
    ...
    Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
    produces ILP32 code) produces a call to __addvsi3 instead of the
    expected add instruction:

    gcc -O3 -ftrapv             gcc -O3
    lui gp,0x0                  srl v0,a0,0x1f
    addiu gp,gp,0               srl v1,a1,0x1f
    addu gp,gp,t9               addu v0,v0,a0
    srl v1,a0,0x1f              addu a1,v1,a1
    lw t9,__addvsi3(gp)         sra v0,v0,0x1
    srl v0,a1,0x1f              sra a1,a1,0x1
    addiu sp,sp,-32             jr ra
    addu a0,v1,a0               addu v0,v0,a1
    addu a1,v0,a1
    sra a0,a0,0x1
    sw ra,28(sp)
    sw gp,16(sp)
    jalr t9
    sra a1,a1,0x1
    lw ra,28(sp)
    jr ra
    addiu sp,sp,32

    The call costs a lot of overhead.

    Architectures without overflow traps are notorious for excess instruction
    count when overflow detection is desired or mandated.

    MIPS' add traps on overflow. gcc could have emitted almost the same
    code for gcc -O3 -ftrapv as for gcc -O3, except that the last
    instruction would be an add, not an addu. But apparently nobody gives
    a damn about the efficiency of -ftrapv, possibly rightly so.

    If some overflow trapping when it can be done without additional
    instructions would be preferable over no overflow, gcc would compile
    signed adds that survive after optimization into add on MIPS rather
    than addu, by default. Given that it does not, the GCC developers
    probably found out that it is not preferable. I guess they would get
    too many customer complaints, including for "relevant" code, i.e.,
    code where the usual "it's UB, so your code is broken" excuse does not
    work.

    It is much harder than that. For example: does a signed shift left
    overflow when significant bits are shifted out ??

    -ftrapv specifies trapping on overflow only for additions,
    subtractions, and multiplications.


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 23:03:03 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Mon, 25 May 2026 16:45:07 +0000, MitchAlsup wrote:

    My 66000 has an instruction bit that denotes the signedness of integer calculations {Signed, unSigned}. This bit is available as another OpCode bit for non-integer calculation instructions.

    That's nice. It's not an option I can consider, as having lots of
    orthogonal modifiers on instructions would tend to increase their length.

    And harm instruction Entropy.

    A major goal of the Concertina II, III, and IV architectures is for instructions not to be longer than similar instructions on the Motorola 68020 or the IBM System/360 if at all possible.

    Basically, the selling point is... "Your programs only get 10% bigger, if that, and yet you have 32 registers, so they run faster!".

    Mine are getting 30% smaller and needing fewer instructions at the same
    time

    Or they _would_, if the design didn't have so many extra transistors for supporting both IBM-format and Intel-format Decimal Floating Point, old-style IBM floats, simple floating (You too can work with numbers that go around the world 2 1/2 times!), packed decimal, mixed-radix arithmetic...

    But, hey, supporting these things in hardware is faster than doing them in software!

    And are people even going to _read_ the part of the manual that
    explains... as is noted in the description of the original Concertina architecture...

    This chip has 8-way simultaneous multi-threading, but only for programs which do not make use of extensions to the register set.

    Another One Bites the Dust.....

    Only two programs per core may use the extended register banks with 128 elements.

    Only one program per core may use the vector registers for long vector instructions. The 256-bit short vector registers, on the other hand, like the integer and floating-point registers, are available to all
    simultaneous threads.

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 23:05:06 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 5/25/2026 9:28 AM, Anton Ertl wrote:
    --------------
    Integer overflow happens far too often for trapping to be a good solution.

    Even on 64-bit variables/machines ??
  • From BGB@cr88192@gmail.com to comp.arch on Mon May 25 20:02:52 2026
    From Newsgroup: comp.arch

    On 5/25/2026 3:34 PM, quadi wrote:
    On Mon, 25 May 2026 16:49:59 +0000, MitchAlsup wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    RISC-V: add ignores signed overflow, there is no add that traps on
    signed overflow (and detecting signed overflow is pretty
    involved if both operands are unknown to the compiler).

    The worst of all possible semantic encodings

    Although I thought that making trapping on fixed-point overflow the
    default is a bad idea, I agree that making it impossible to do so, or even test for fixed-point overflow, is a much worse idea.


    Possibly true.

    The lack of things like ADD-with-Carry or ADD-with-Overflow are
    annoyance points on RISC-V.
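
    To give an idea of what has to be spelled out when the ISA offers no
    overflow or carry indication, here is a sketch in portable C of the
    usual bit tricks. The function names are hypothetical; a compiler for
    a flagless target would emit roughly this kind of extra logic after
    the add.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed overflow of a + b: happens exactly when both operands have the
   same sign and the wrapped sum has the opposite sign. The add itself is
   done on unsigned values so that it wraps without undefined behaviour. */
static bool add_overflows_i64(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t sum = ua + ub;
    return (((ua ^ sum) & (ub ^ sum)) >> 63) != 0;
}

/* Carry out of an unsigned add: the wrapped sum is smaller than an operand. */
static bool add_carries_u64(uint64_t a, uint64_t b)
{
    return a + b < a;
}
```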


    Though, it is less obvious what a useful behavior is at the language level:
    "signal()" ? ...
    Something like try/catch (mostly N/A to C)?
    Something similar to FENV_ACCESS?
    ...


    Well, and that if trapping were applied globally:
    Overhead due to trap detection/handling code causing excessive bloat;
    Overflow traps from any code that naively assumes wrap-on-overflow
    semantics;
    ...

    In some codebases, it is already enough of a pain to hunt and fix all
    the out-of-bounds and uninitialized variables mess.
    Signed integer overflows would likely "turn it up to 11";
    Then, how does one fix it? Ask that people start adding a bunch of casts
    to make it work?...

    One might say:
    Add "if()" cases to deal with the overflows, but, ... this only makes
    sense for cases where the overflows are not the expected behavior.

    Then again, could maybe classify code, say:
    1, signed, value doesn't (or shouldn't) go out-of-range;
    2, unsigned, value doesn't (or shouldn't) go out-of-range;
    3, signed, value is expected to be modulo;
    4, unsigned, value is expected to be modulo.

    "nasal demons" types assume 1 and 4 as dominant.
    Or, 1 as exclusive vs 3.

    For compilers, we often need to assume 3 and 4.
    Because, failure to uphold 3 results in misbehaving programs.
    And, if 3 were uncommon, RISC-V's "ADDW"/etc would be pure stupidity.
    Instead:
    Something like plain ADD plus ADDWU would have made sense.
    But, they dropped ADDWU instead (also stupid IMO).

    While, granted, a lot of 1 code likely exists, 3 code tends to generate
    the vast majority of overflows; and if there is any reasonable
    expectation for 'int' to overflow, and it is not desired for int to
    overflow.

    We mostly ignore 2 vs 4, because standard specifies 4 making 2 to be
    purely a programming error, in which case "2" becomes "should have used
    a bigger signed type instead".


    Then again, could maybe make sense to add a semantic distinction, say:
    "int" (plain):
    Maybe a case could be made that overflow be assumed unexpected.
    "signed int":
    Maybe make separate from plain case, explicitly modulo;
    So, could be made distinct;
    Explicitly like the "unsigned" case in being modulo.
    "unsigned int":
    Remains the same, no real controversy here.

    Or, say:
    char, short, int, long, long long:
    For code, assume that overflow may be unexpected / undesirable;
    signed char, signed int, signed long, signed long long:
    Assume signed modulo;
    Compiler should, ideally, always produce wrap-on-overflow semantics.
    unsigned ...:
    Unsigned modulo.

    For a compiler, then:
    -ftrapv:
    May ideally trap on lack of "signed";
    Explicit "signed", continues to wrap.
    -fwrapv:
    Both default and signed will wrap.
    Neither:
    Dunno, probably better for compiler to assume "-fwrapv" semantics;
    Maybe assume UB opts are safe if no "signed".


    Well, and for the programmer POV:
    If assuming maximum portability:
    Only unsigned overflow wrapping is "safe".
    If assuming "any reasonable system":
    Both will wrap in most cases;
    Absent "-fwrapv", UB opts may occur in certain obscure edge cases.
    Though usually in the form of "early" vs "late" type promotion;
    In most cases, where it does occur, early promotion is benign.
    Vs whatever "nasal demons" people may assert.
    What else, that it late promotes?
    (as "-fwrapv" semantics would dictate...)


    Like, say:
    int x;
    long z;
    ...
    z = 42 - x;
    //Oh no! UB opt has turned this into a 64-bit RSUB instruction!

    Yeah...
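
    A sketch of the "early" versus "late" promotion difference for the
    `z = 42 - x` case, written so that both variants are well defined in
    C. The function names are made up; "late" models -fwrapv-style
    semantics (subtract in 32 bits, wrap, then widen) and "early" models
    widening first and subtracting in 64 bits.

```c
#include <stdint.h>

/* Late promotion: 32-bit subtract with wraparound, then sign-extend. */
static int64_t rsub_late(int32_t x)
{
    uint32_t r = 42u - (uint32_t)x;                    /* modulo 2^32 */
    return (int64_t)(r ^ 0x80000000u) - 0x80000000LL;  /* sign-extend */
}

/* Early promotion: widen to 64 bits first, then subtract. */
static int64_t rsub_early(int32_t x)
{
    return 42 - (int64_t)x;
}
```

    The two only disagree when the 32-bit subtract would overflow, for
    example x = INT32_MIN; for every other input the results match.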


    Granted, ATM, for BGBCC, wouldn't make much difference at present. Could
    maybe make sense to add a distinction either to strengthen semantic
    analysis, or if I decided to change away from my existing "assume wrap
    on overflow semantics as sole option" policy. Or maybe adding an
    "-fno-wrapv" option, with "wrapv" remaining default but allowing an
    option to opt-out, sort of like how there is an "-fptropts" option to
    "opt into" strict-aliasing / TBAA semantics, vs the default semantics of "assume every explicit store may alias" semantics. Though, may still
    assume that loads may be cached and reordered, unless "volatile" is
    used, which explicitly disallows caching and reordering loads, though at present is a little "shotgun" and will basically disable caching
    throughout the whole basic block; which works as a detractor to the
    "casually use volatile as a way to dispel TBAA" interpretation (works on
    GCC, and is less adverse for performance than the "use memcpy" option on
    some other compilers, ...).


    Or, say:
    Bare pointer cast and deref:
    GCC: averse (falls afoul of default semantics);
    MSVC: benign;
    BGBCC: benign.
    Volatile pointer cast and deref:
    GCC: benign (doesn't use TBAA on volatile pointers);
    MSVC: benign;
    BGBCC: detrimental, disables caching and ld/st reordering;
    Using memcpy:
    GCC: benign;
    MSVC:
    Old (15+ years):
    Averse (actually calls memcpy, significant impact);
    Some intermediate versions would do an inline for "REP MOVSB".
    Also kinda crap, but less bad vs calling "memcpy()".
    Mostly only matters if still targeting WinXP or similar.
    Newer: Mild detriment in some cases.
    Inline loads/stores
    may fail to optimize to plain register moves for locals.
    BGBCC;
    Mostly similar to newer MSVC here;
    Works, just less efficient than plain "cast and deref".
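
    For reference, a sketch of the "use memcpy" form of type punning from
    the list above, which stays within the language rules regardless of
    strict-aliasing settings. It assumes float is a 32-bit IEEE 754 type;
    the function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer (assumes
   sizeof(float) == sizeof(uint32_t)). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* The reverse direction. */
static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

    Whether the copy turns into a plain register move or into actual
    loads and stores is up to the compiler, which is the cost difference
    the list describes.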

    ...


  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon May 25 15:27:29 2026
    From Newsgroup: comp.arch

    An awkward thing about using trap on overflow is determining how
    precisely it is defined.

    Indeed, this is a nasty part of language design.

    [ IMO, the only sane choice (beside wrapping and explicit `ckd_add`) is
    to treat overflow not as an exception (in the sense of `try..catch`
    thingies, not in the CPU hardware sense of the word) but as an
    execution error comparable to memory exhaustion. ]

    Luckily, for `comp.arch` the same problem doesn't plague ISAs because
    it's accepted that a CPU should stick religiously to the literal
    semantics of the machine code, no matter how far it is from what
    really happens inside the machine.


    === Stefan
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue May 26 05:39:02 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> schrieb:
    On Mon, 25 May 2026 10:23:00 +0200, David Brown wrote:

    The hardware, of course, cannot always enable trapping on overflow if it
    is going to efficiently support a range of programming languages.

    Yes. And I am used to FORTRAN, which did not trap on integer overflows.

    Incorrect.

    Integer overflow is illegal in Fortran, so what the compiler then
    does is not determined (see my post on random number generators).

    Example:

    $ cat overfl.f90
    program main
    integer :: a, b
    a = 12345678
    b = 2345678
    print *,a*b
    end program main
    $ gfortran -fsanitize=undefined overfl.f90
    $ ./a.out
    overfl.f90:5:13: runtime error: signed integer overflow: 12345678 * 2345678 cannot be represented in type 'integer(kind=4)'
    -1979197244
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue May 26 08:18:17 2026
    From Newsgroup: comp.arch

    On 25/05/2026 18:43, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 25/05/2026 16:28, Anton Ertl wrote:
    Despite their eagerness to "optimize" based on the assumption
    that signed integer overflow does not happen, the GCC developers have
    avoided making -ftrapv the default, even on platforms like MIPS and
    Alpha where the implementation of -ftrapv just means to use different
    instructions (e.g., add instead of addu on MIPS, and addv instead of
    add on Alpha).

    An awkward thing about using trap on overflow is determining how
    precisely it is defined. Supposing you have the expression "a + b - a".
    Perhaps "a + b" overflows. I would hope that when using debug-related
    compiler flags such as "-fsanitize=signed-integer-overflow", a compiler
    would check for overflow on "a + b", and report it at runtime.
    (Unfortunately, gcc does not do that unless the partial expression is
    assigned to a variable.) But in "normal" usage, I'd expect the
    expression to be simplified, resulting in just "b" and no overflow.

    OTOH, cases like a+b+c where the result is in range, while an
    intermediate result is out of range are one of the reasons why I
    prefer -fwrapv over -ftrapv. As for your preference of nasal demons,
    given enough information, the compiler might "optimize" "a+b-a" into,
    e.g., 0.
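
    A small sketch of why wrapping semantics make such intermediate
    overflows harmless: with explicit modulo 2**32 arithmetic, "a + b - a"
    and "a + b + c" come out right whenever the final value is
    representable. The helpers emulate 32-bit wraparound by hand and are
    named for illustration only.

```c
#include <stdint.h>

/* Convert a 32-bit pattern to int32_t without implementation-defined
   behaviour (two's complement interpretation). */
static int32_t to_i32(uint32_t u)
{
    return (int32_t)((int64_t)(u ^ 0x80000000u) - 0x80000000LL);
}

/* 32-bit signed add and subtract with explicit wraparound. */
static int32_t wrap_add(int32_t a, int32_t b)
{
    return to_i32((uint32_t)a + (uint32_t)b);
}

static int32_t wrap_sub(int32_t a, int32_t b)
{
    return to_i32((uint32_t)a - (uint32_t)b);
}
```

    A trapping add would stop at the intermediate step in both cases even
    though the final result is in range.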

    Anyway, the definition of -ftrapv is not very precise; for gcc-12.2:

    |'-ftrapv'
    | This option generates traps for signed overflow on addition,
    | subtraction, multiplication operations.


    My understanding is that the GCC developers would rather deprecate
    -ftrapv entirely, and encourage the use of -fsanitize instead as a way
    to detect run-time errors. I don't know the details of the internals,
    but I believe the GCC developers see the sanitize options as more
    accurate and more likely to be further developed in the future.

    As for what gcc-12.2 does for your example on AMD64:

    long foo(long a, long b)
    {
        return a+b-a;
    }

    is compiled with gcc -O3 -ftrapv to:

    0: 48 89 f0 mov %rsi,%rax
    3: c3 ret

    If "trap on overflow" has precise semantics in the code, then this
    disables a range of useful optimisations and re-arrangements. If it is
    just "use trapping arithmetic instructions", then it will miss many
    possible cases of actual overflow in the code, which we might want to
    catch.

    Which would you prefer by default?

    I don't know for sure. A "by default" choice has to be suitable for a
    wide variety of users and a wide variety of cases, and preferably err on
    the side of caution. For my own personal use, I'm happy with UB
    overflow and would have preferred that as the default even for unsigned arithmetic (but of course with a way to specify wrapping when I need
    it). But that's for /my/ use - I don't think that should necessarily be
    the default for others. Let those who are willing to spend the time and effort learning the details and the care needed use compiler flags to
    get the highest efficiency from their code, and let the defaults help
    others catch their bugs. However, the logical endpoint of that is that
    C should only be used by those that have a detailed understanding of the language and need it for peak efficiency, while other programmers should
    work with other languages that have more error handling.



    The gcc developers apparently took the latter approach, even when you
    ask for -ftrapv explicitly. So what, IYO, speaks against doing that
    by default on machines like MIPS and Alpha.

    And "trap on overflow" might either trigger when there is no
    overflow in the original code, or hinder optimisations. (Consider the
    expression "x / 2 + y / 2" - the compiler could implement that as a
    combined "(x + y) / 2", but that might introduce overflow.)

    x/2+y/2 produces a different result from (x+y)/2 when both x and y are
    odd integers.


    True. Can we pretend that is not the case, and still see my point? The
    point is that the compiler can, during re-arrangements, introduce new overflows as long as it knows the final results are correct (since the compiler knows the details of how instructions are actually implemented).
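
    The difference between the two forms is easy to check. A sketch, with
    made-up function names; the second form is only safe to call when
    x + y itself does not overflow.

```c
/* Halve each operand first: cannot overflow, but drops a unit when both
   operands are odd (C division truncates toward zero). */
static long half_then_add(long x, long y)
{
    return x / 2 + y / 2;
}

/* Add first, then halve: keeps that unit, but x + y may overflow for
   large operands. */
static long add_then_half(long x, long y)
{
    return (x + y) / 2;
}
```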

    gcc-12.2 compiles

    long bar(long x, long y)
    {
        return x/2+y/2;
    }


    on AMD64 to:

    gcc -O3 -ftrapv             gcc -O3
    mov %rdi,%rax               mov %rdi,%rax
    sub $0x8,%rsp               mov %rsi,%rdx
    shr $0x3f,%rax              shr $0x3f,%rax
    add %rax,%rdi               shr $0x3f,%rdx
    mov %rsi,%rax               add %rdi,%rax
    shr $0x3f,%rax              add %rsi,%rdx
    sar %rdi                    sar %rax
    add %rax,%rsi               sar %rdx
    sar %rsi                    add %rdx,%rax
    call __addvdi3@PLT          ret
    add $0x8,%rsp
    ret

    so the -ftrapv introduces an additional mov and a call; I would have
    expected that the + would be compiled to an ADD instruction followed
    by a JO instruction.

    Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
    produces ILP32 code) produces a call to __addvsi3 instead of the
    expected add instruction:

    gcc -O3 -ftrapv             gcc -O3
    lui gp,0x0                  srl v0,a0,0x1f
    addiu gp,gp,0               srl v1,a1,0x1f
    addu gp,gp,t9               addu v0,v0,a0
    srl v1,a0,0x1f              addu a1,v1,a1
    lw t9,__addvsi3(gp)         sra v0,v0,0x1
    srl v0,a1,0x1f              sra a1,a1,0x1
    addiu sp,sp,-32             jr ra
    addu a0,v1,a0               addu v0,v0,a1
    addu a1,v0,a1
    sra a0,a0,0x1
    sw ra,28(sp)
    sw gp,16(sp)
    jalr t9
    sra a1,a1,0x1
    lw ra,28(sp)
    jr ra
    addiu sp,sp,32

    The call costs a lot of overhead.

    Agreed. I don't know why GCC uses a function call here. In my quick
    godbolt testing, clang uses the "add, jump-on-overflow" sequence.

    Using

    -fsanitize=signed-integer-overflow -fsanitize-trap

    gives an add followed by a jump-on-overflow sequence.
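
    For code that wants a checked add without depending on either flag, a
    sketch using the __builtin_add_overflow builtin that GCC and Clang
    provide. The wrapper names are hypothetical, and how the builtin is
    lowered (for example to an add followed by a conditional jump) is up
    to the compiler and target.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Returns true and stores the sum if a + b is representable in long;
   returns false on overflow. */
static bool checked_add(long a, long b, long *sum)
{
    return !__builtin_add_overflow(a, b, sum);
}

/* -ftrapv-like behaviour for a single operation: abort on overflow. */
static long trapping_add(long a, long b)
{
    long sum;
    if (__builtin_add_overflow(a, b, &sum))
        abort();
    return sum;
}
```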


    It is not easy to see how a tool can avoid false positives and false
    negatives and also conveniently optimise and re-arrange code.

    It can't. But it does not try to avoid false negatives even when
    explicitly asked for trapping on overflow.

    If some overflow trapping when it can be done without additional
    instructions would be preferable over no overflow, gcc would compile
    signed adds that survive after optimization into add on MIPS rather
    than addu, by default. Given that it does not, the GCC developers
    probably found out that it is not preferable. I guess they would get
    too many customer complaints, including for "relevant" code, i.e.,
    code where the usual "it's UB, so your code is broken" excuse does not
    work.

    If "-ftrapv" is to have any use at all, then overflow is no longer UB -
    it has to be defined to trap. But I have to conclude that in GCC,
    -ftrapv is too vaguely defined and too inconsistently and inefficiently implemented to be of any use. This matches my understanding that the "-fsanitize=signed-integer-overflow -fsanitize-trap" flags are preferred
    by the GCC developers.


    The fact that they don't even try to make -ftrapv produce efficient
    code indicates that there is no "relevant" interest in efficient
    -ftrapv. It would be interesting to know who came up with the idea of
    adding -ftrapv, and why they are still keeping it.

    Compilers have not always been good at taking advantage of all the
    features provided by hardware

    GCC is pretty good at implementing -fwrapv. For the two examples
    above, "gcc -O3 -fwrapv" produces the same code on AMD64 and MIPS as
    "gcc -O3".

    That is my experience too (though I expect your experience here vastly outweighs mine).


    nor have languages been good at exposing
    the possibilities in the language so that programmers can take advantage
    of them.

    Yes. But I leave that for another day.


    Good idea :-)

  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue May 26 08:27:28 2026
    From Newsgroup: comp.arch

    On 26/05/2026 01:00, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    What you do want is compiled code that can trap on overflow and avoid
    trapping on overflow without code substitution or being re-compiled.
    This way production code can avoid trapping but if the debugger is
    turned on, you can trap.

    Why do you consider that desirable?

    So you can debug production/released code to find subtle errors.

    I think that when an unexpected error is detected (whether it is with
    hardware acceleration, like trap on overflow, or via explicit generated
    code), the way to handle it depends strongly on the situation. If a
    debugger is present, then it is most helpful to lead to a debugger break
    so that the developer can figure out what went wrong. When not
    debugging, there is no sensible default handling that works for jet
    engine controllers and video game frame generators.

    But I do support the aim of having the same generated code when
    debugging and when shipping - I am not a fan of "release" builds and
    "debug" builds. (Of course you might temporarily do builds with
    different flags while chasing down a particular bug.)

  • From quadi@quadibloc@ca.invalid to comp.arch on Tue May 26 15:13:31 2026
    From Newsgroup: comp.arch

    On Sun, 24 May 2026 16:39:25 +0000, quadi wrote:

    On Sun, 24 May 2026 15:24:22 +0000, John Levine wrote:

    Sure they did. S/360 had separate unsigned versions of add and subtract
    instructions. The results were the same but the condition codes were
    different and the unsigned versions couldn't overflow.

    Ah, I didn't remember that!

    I just looked it up. It was, and is, the Add Logical instruction.

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 26 18:02:51 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 5/25/2026 3:34 PM, quadi wrote:
    On Mon, 25 May 2026 16:49:59 +0000, MitchAlsup wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    RISC-V: add ignores signed overflow, there is no add that traps on
    signed overflow (and detecting signed overflow is pretty
    involved if both operands are unknown to the compiler).

    The worst of all possible semantic encodings

    Although I thought that making trapping on fixed-point overflow the
    default is a bad idea, I agree that making it impossible to do so, or even test for fixed-point overflow, is a much worse idea.


    Possibly true.

    The lack of things like ADD-with-Carry or ADD-with-Overflow are
    annoyance points on RISC-V.


    Though, it is less obvious what a useful behavior is at the language level:
    "signal()" ? ...
    Something like try/catch (mostly N/A to C)?
    Something similar to FENV_ACCESS?
    ...

    The important property is that overflow is detected precisely.
    Whether {trap, signal, throw} is performed is an environmental choice
    not an ISA choice.
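
    A minimal C sketch of that split, assuming two hypothetical helpers
    (add_detect and add_with_policy are illustrative names, not from any
    ISA or library): detection is exact and always yields the wrapped sum,
    while the reaction is supplied by the caller. It also shows the sign
    comparison needed when the hardware offers no overflow flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: stores the wrapped 64-bit sum in *sum and returns
   true when the mathematically exact sum does not fit in int64_t. */
static bool add_detect(int64_t a, int64_t b, int64_t *sum)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t us = ua + ub;               /* unsigned add wraps, no UB */

    /* Signed overflow occurred iff both operands have the same sign
       and the sign of the result differs from it. */
    bool overflow = (((ua ^ us) & (ub ^ us)) >> 63) != 0;

    /* Out-of-range conversion is implementation-defined before C23;
       common compilers wrap in two's complement. */
    *sum = (int64_t)us;
    return overflow;
}

/* The reaction (trap, signal, log, ...) is chosen by the environment. */
typedef void (*overflow_policy)(void);

static int64_t add_with_policy(int64_t a, int64_t b, overflow_policy on_overflow)
{
    int64_t sum;
    if (add_detect(a, b, &sum) && on_overflow)
        on_overflow();
    return sum;
}
```

    Passing a null policy gives plain wrap-on-overflow behaviour; passing a
    handler that aborts or raises a signal gives trapping behaviour from
    the same detection code.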

    Well, and that if trapping were applied globally:
    Overhead due to trap detection/handling code causing excessive bloat;
    Overflows traps from any code that naively assumes wrap-on-overflow
    semantics;
    ...

    In some codebases, it is already enough of a pain to hunt and fix all
    the out-of-bounds and uninitialized variables mess.
    Signed integer overflows would likely "turn it up to 11";
    Then, how does one fix it? Ask that people start adding a bunch of casts
    to make it work?...

    One might say:
    Add "if()" cases to deal with the overflows, but, ... this only makes
    sense for cases where the overflows are not the expected behavior.

    If(overflow(??)) requires some flag to carry overflow from point of
    detection to if(()).

    And what happens if there is more than 1 overflow ??

    Then again, could maybe classify code, say:
    1, signed, value doesn't (or shouldn't) go out-of-range;
    2, unsigned, value doesn't (or shouldn't) go out-of-range;
    3, signed, value is expected to be modulo;
    4, unsigned, value is expected to be modulo.
    5, a language hint about in-range, wrap, trap, signal, throw

    "nasal demons" types assume 1 and 4 as dominant.
    Or, 1 as exclusive vs 3.

    For compilers, we often need to assume 3 and 4.
    Because, failure to uphold 3 results in misbehaving programs.
    And, if 3 were uncommon, RISC-V's "ADDW"/etc would be pure stupidity.

    You would prefer::

    AND R7,Rleft,#~(~0<<31)
    AND R8,Rright,#~(~0<<31)
    ADD Rd,R7,R8
    AND Rd,Rd,#~(~0<<31)

    That is ADDW range limits operands and performs a shorter ADD.
    Matching C's int a,b; semantic. In general the integer instructions
    ending with W apply C's int properties to the arithmetic. If compilers
    were (WERE) really good at range determination those instructions would
    be unnecessary--but they are not.

    I (My 66000) had to put in sized integer calculation reasons, and by
    doing so, gained 2%-4% in code density and a bit more in latency.
    -----------------------
  • From BGB@cr88192@gmail.com to comp.arch on Tue May 26 14:28:56 2026
    From Newsgroup: comp.arch

    On 5/26/2026 1:02 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 5/25/2026 3:34 PM, quadi wrote:
    On Mon, 25 May 2026 16:49:59 +0000, MitchAlsup wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    RISC-V: add ignores signed overflow, there is no add that traps on
    signed overflow (and detecting signed overflow is pretty
    involved if both operands are unknown to the compiler).

    The worst of all possible semantic encodings

    Although I thought that making trapping on fixed-point overflow the
    default is a bad idea, I agree that making it impossible to do so, or even
    test for fixed-point overflow, is a much worse idea.


    Possibly true.

    The lack of things like ADD-with-Carry or ADD-with-Overflow are
    annoyance points on RISC-V.


    Though, it is less obvious what a useful behavior is at the language level:
    "signal()" ? ...
    Something like try/catch (mostly N/A to C)?
    Something similar to FENV_ACCESS?
    ...

    The important property is that overflow is detected precisely.
    Whether {trap, signal, throw} is performed is an environmental choice
    not an ISA choice.

    Yeah.

    Say:
    ADDV Rs, Rt, Rd
    BT __trap_overflow

    Which is how I would assume doing it, if I were to re-add ADDV to my ISA
    (this had existed in SuperH and BJX1, but got lost along the way; it
    could be re-added if needed, it was just needed less often than even
    ADC/ADDC).



    Well, and that if trapping were applied globally:
    Overhead due to trap detection/handling code causing excessive bloat;
    Overflows traps from any code that naively assumes wrap-on-overflow
    semantics;
    ...

    In some codebases, it is already enough of a pain to hunt and fix all
    the out-of-bounds and uninitialized variables mess.
    Signed integer overflows would likely "turn it up to 11";
    Then, how does one fix it? Ask that people start adding a bunch of casts
    to make it work?...

    One might say:
    Add "if()" cases to deal with the overflows, but, ... this only makes
    sense for cases where the overflows are not the expected behavior.

    If(overflow(??)) requires some flag to carry overflow from point of
    detection to if(()).

    And what happens if there is more than 1 overflow ??


    Dunno.
    You would need to set a start point and an end/detection point, and have
    some way for the compiler to know to track overflows.

    Say:
    ADDV ...
    OR?T Re, 0x100, Re

    Then a way to feed Re back into C land to act upon.
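
    A rough C sketch of that idea, assuming a hypothetical sticky flag in
    place of the Re register (ovf_region and add32_sticky are made-up names
    for illustration): each add ORs its overflow into the flag, so more
    than one overflow between the start point and the check is harmless
    and the code only needs to test once at the end.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accumulator standing in for the "Re" register. */
typedef struct {
    bool overflowed;
} ovf_region;

/* 32-bit add that records overflow in a sticky flag and still returns
   the wrapped result, so the surrounding code keeps running. */
static int32_t add32_sticky(ovf_region *r, int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;   /* exact in 64 bits */

    if (wide < INT32_MIN || wide > INT32_MAX)
        r->overflowed = true;                 /* set only, never cleared */

    /* Keep the low 32 bits; the conversion back to int32_t is
       implementation-defined before C23 and wraps on common compilers. */
    return (int32_t)(uint32_t)wide;
}
```

    The region is started by clearing the flag and ended by testing it, for
    example `ovf_region r = { false }; ...; if (r.overflowed) { ... }`.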


    There could maybe either be a 32-bit variant (ADDV.L), or some shorthand
    way to detect that the value has gone outside of 32-bit range.


    Then again, could maybe classify code, say:
    1, signed, value doesn't (or shouldn't) go out-of-range;
    2, unsigned, value doesn't (or shouldn't) go out-of-range;
    3, signed, value is expected to be modulo;
    4, unsigned, value is expected to be modulo.
    5, a language hint about in-range, wrap, trap, signal, throw

    Well, possible, but C doesn't have any hints here...

    But, yeah:
    Leaving plain 'int' as the "probably shouldn't overflow" and 'signed
    int' and 'unsigned int' as "wrap on overflow expected" could make sense.



    "nasal demons" types assume 1 and 4 as dominant.
    Or, 1 as exclusive vs 3.

    For compilers, we often need to assume 3 and 4.
    Because, failure to uphold 3 results in misbehaving programs.
    And, if 3 were uncommon, RISC-V's "ADDW"/etc would be pure stupidity.

    You would prefer::

    AND R7,Rleft,#~(~0<<31)
    AND R8,Rright,#~(~0<<31)
    ADD Rd,R7,R8
    AND Rd,Rd,#~(~0<<31)

    That is ADDW range limits operands and performs a shorter ADD.
    Matching C's int a,b; semantic. In general the integer instructions
    ending with W apply C's int properties to the arithmetic. If compilers
    were (WERE) really good at range determination those instructions would
    be unnecessary--but they are not.

    I (My 66000) had to put in sized integer calculation reasons, and by
    doing so, gained 2%-4% in code density and a bit more in latency.
    -----------------------

    OK.

    Ironically, the 4-op sequence above would have been a single "ADDWU"
    instruction in the RV BitManip drafts, but ADDWU was dropped as arguably
    it didn't make a big enough difference on SPEC scores. They decided to
    keep a whole bunch of other random crap though that serves no real
    purpose other than to micro-optimize the benchmarks...
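
    For comparison, a small C model of the two behaviours being discussed.
    The addw function follows the RV64I definition (32-bit add, result
    sign-extended to 64 bits). The addwu function assumes the draft
    instruction zero-extended the low 32 bits of the sum, which is how it
    is described here; the function names are only illustrative.

```c
#include <stdint.h>

/* Model of RV64 ADDW: add, keep the low 32 bits, sign-extend to 64. */
static uint64_t addw(uint64_t rs1, uint64_t rs2)
{
    uint32_t low = (uint32_t)(rs1 + rs2);     /* wraps modulo 2^32 */
    /* int32_t conversion wraps on common compilers (impl-defined pre-C23). */
    return (uint64_t)(int64_t)(int32_t)low;
}

/* Assumed model of the draft ADDWU: same add, result zero-extended. */
static uint64_t addwu(uint64_t rs1, uint64_t rs2)
{
    uint32_t low = (uint32_t)(rs1 + rs2);
    return (uint64_t)low;
}
```

    The two differ only when bit 31 of the sum is set: 0x7FFFFFFF + 1
    gives 0xFFFFFFFF80000000 from addw and 0x0000000080000000 from addwu.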

    I revived this for my own extensions, but left out ADDIWU as it was
    still not common enough to justify the encoding space cost (if one has
    jumbo-prefixes, this could be handled well enough via
    immediate-synthesis, and the 64-bit encoding wasn't too bad for
    something that is comparably infrequent).

    ...


  • From George Neuner@gneuner2@comcast.net to comp.arch on Tue May 26 15:29:08 2026
    From Newsgroup: comp.arch

    On Mon, 25 May 2026 23:05:06 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:


    BGB <cr88192@gmail.com> posted:

    On 5/25/2026 9:28 AM, Anton Ertl wrote:
    --------------
    Integer overflow happens far too often for trapping to be a good solution.

    Even on 64-bit variables/machines ??

    Yes if there are options for 8/16/32 bit ops in 64 bit registers.
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue May 26 22:09:28 2026
    From Newsgroup: comp.arch

    David Brown wrote:
    On 26/05/2026 01:00, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    What you do want is compiled code that can trap on overflow and avoid
    trapping on overflow without code substitution or being re-compiled.
    This way production code can avoid trapping but if the debugger is
    turned on, you can trap.

    Why do you consider that desirable?

    So you can debug production/released code to find subtle errors.

    I think that when an unexpected error is detected (whether it is with
    hardware acceleration, like trap on overflow, or via explicit generated
    code), the way to handle it depends strongly on the situation.  If a
    debugger is present, then it is most helpful to lead to a debugger break
    so that the developer can figure out what went wrong.  When not
    debugging, there is no sensible default handling that works for jet
    engine controllers and video game frame generators.

    But I do support the aim of having the same generated code when
    debugging and when shipping - I am not a fan of "release" builds and
    "debug" builds.  (Of course you might temporarily do builds with
    different flags while chasing down a particular bug.)

    I tend to like "Release with sometimes hard-to-grok debug info",
    typically resulting in a separate file with a best effort debug map of
    the executable.

    Then I can at least get some help when running the debugger and trying
    to binary search my way into the spot where the bug resides.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 26 20:54:30 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    David Brown wrote:
    On 26/05/2026 01:00, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    What you do want is compiled code that can trap on overflow and avoid
    trapping on overflow without code substitution or being re-compiled.
    This way production code can avoid trapping but if the debugger is
    turned on, you can trap.

    Why do you consider that desirable?

    So you can debug production/released code to find subtle errors.

    I think that when an unexpected error is detected (whether it is with
    hardware acceleration, like trap on overflow, or via explicit generated
    code), the way to handle it depends strongly on the situation.  If a
    debugger is present, then it is most helpful to lead to a debugger break
    so that the developer can figure out what went wrong.  When not
    debugging, there is no sensible default handling that works for jet
    engine controllers and video game frame generators.

    But I do support the aim of having the same generated code when
    debugging and when shipping - I am not a fan of "release" builds and
    "debug" builds.  (Of course you might temporarily do builds with
    different flags while chasing down a particular bug.)

    I tend to like "Release with sometimes hard-to-grok debug info",
    typically resulting in a separate file with a best effort debug map of
    the executable.

    Encrypt the debug information (and put it in a {1234-5678-9101-1121-...}
    folder) so that only the owner (not licensee) of the code can debug it.

    Then I can at least get some help when running the debugger and trying
    to binary search my way into the spot where the bug resides.

    Terje

  • From BGB@cr88192@gmail.com to comp.arch on Tue May 26 19:13:21 2026
    From Newsgroup: comp.arch

    On 5/26/2026 2:29 PM, George Neuner wrote:
    On Mon, 25 May 2026 23:05:06 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    BGB <cr88192@gmail.com> posted:

    On 5/25/2026 9:28 AM, Anton Ertl wrote:
    --------------
    Integer overflow happens far too often for trapping to be a good solution.

    Even on 64-bit variables/machines ??

    Yes if there are options for 8/16/32 bit ops in 64 bit registers.

    32-bit overflow is the dominant scenario here.
    While 8 and 16-bit ranges do overflow readily, the normal semantics are
    for them to auto-promote to 32 bits before then being narrowed back down
    to 8 or 16 bits, so they don't count.
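
    A short C illustration of that promotion rule (the function names are
    made up for the example): the operands are promoted to int before the
    add, so the addition itself cannot overflow, and only the explicit
    narrowing reduces the value.

```c
#include <stdint.h>

/* uint8_t operands are promoted to int, so a + b is computed exactly. */
static int sum_as_int(uint8_t a, uint8_t b)
{
    return a + b;
}

/* Narrowing back to uint8_t is well-defined: reduced modulo 256. */
static uint8_t sum_as_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

    With a = 200 and b = 100, sum_as_int returns 300 while sum_as_u8
    returns 44; neither involves an overflow in the C sense.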


    Ironically, for my BS2 language, the semantics were in cases like this
    to instead auto-promote to 64 bits; but can't really do this for C as it
    gives different results in some cases (and early promotion is itself a
    bug, even if early promotion would often be the most natural semantics
    for a 64-bit machine).


    Well, and there is the usual thing that one can't usually allow a
    variable to hold values outside the range of what would be allowed for
    that variable.


    Well, except for floating-point types, where code typically doesn't care
    about out-of-range values (if a value fails to go to 0 or Inf in a
    computation in local variables, typically no one cares).

    For float, it isn't obvious because the dynamic range of Binary32 is
    already quite large. A "short float" effectively having Binary64's
    dynamic range when used in scalar computations is a bit incredible, but
    given these smaller formats are non-standard anyway, it is reasonable to
    say "these formats are only necessarily confined to their formal
    range when in-memory, otherwise all bets are off".

    Or: precision and dynamic range >= requested format.

    Code can't entirely rely on the higher precision though, as the format
    may also revert to its defined precision without warning (even if
    intermediate computations may potentially wildly exceed it).

    But, then again, this would be analogous to if one has an FPU with
    native Binary128, occasionally performing "double" calculations at
    Binary128 precision even though "double" is stated as Binary64.

    Well, or implementing some operations by widening temporarily to a higher-precision format before narrowing the result.


    Though, OTOH, the main use-case for things like scalar "short float" is
    more for saving memory in structs and arrays, not for trying to rely on
    its crappy range and precision.

    So, floating point math is very different from integer math in this regard.

    ...

  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed May 27 10:59:31 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-05-26 20:54:30] wrote:
    Encrypt the debug information (and put it in
    a {1234-5678-9101-1121-...} folder) so that only the owner (not
    licensee) of the code can debug it.

    I resent that. All code should be Free Software.


    === Stefan
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed May 27 18:19:49 2026
    From Newsgroup: comp.arch

    On Wed, 27 May 2026 10:59:31 -0400, Stefan Monnier wrote:
    MitchAlsup [2026-05-26 20:54:30] wrote:

    Encrypt the debug information (and put it in a
    {1234-5678-9101-1121-...} folder) so that only the owner (not
    licensee) of the code can debug it.

    I resent that. All code should be Free Software.

    It is wonderful that we have the open-source software movement.

    However, people have the right to the fruit of their labors. To give them
    away for free is generous, but it should remain a personal choice.

    Of course, copyright has been misused, and deserves a critical
    examination, not the sort of uncritical expansion given to it by
    legislators in the United States - and imposed on the rest of the world by trade threats.

    John Savard
  • From BGB@cr88192@gmail.com to comp.arch on Wed May 27 15:24:09 2026
    From Newsgroup: comp.arch

    On 5/25/2026 5:59 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 24/05/2026 23:39, quadi wrote:
    On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    It makes sense to trap on a floating-point overflow, but trapping on an
    integer overflow is usually a terrible idea.

    So, detecting something went wrong and you should inform the programmer
    is a bad idea ???

    No, so being able to turn the trap for integer overflow on should
    definitely be allowed. But that shouldn't be the default behavior.
    Otherwise, programs like random number generators wouldn't work.

    John Savard

    That does not make sense. Code such as random number generators should
    be written so that they are correct in the language they are written in.

    In principle, yes.

    Principle is better in theory than in practice.

    In practice, people often used whatever "worked" on their systems.

    Face it, the poor slug writing the code may not have the faintest
    grasp of the system qualities we are discussing, and does not care
    to learn as long as he can slug through the writing and his program
    does not blow up catastrophically while it is under his purview.

    That defines a lot of what is wrong with SW programming today.

    Implementors have a certain right because they control what their
    compiler does or does not do.

    You would be surprised at how little influence implementors have
    on compilers and other software.


    Yeah.

    One can design the ISA and compiler as one likes.
    But, if existing C code breaks, then this is not good.


    One might think:
    You know, wrap on overflow, and type promotion where it overflows and
    wraps, and *then* promotes to the wider type on the final assignment, is
    kinda stupid and sucks.

    And, if one goes by "well, signed overflow is UB anyways", then they
    should be able to turn it into a "promote first, then ADD" scenario (may
    be both potentially faster, and less likely to lose information).

    I would be inclined to agree.

    But... there is old code around that will quietly break if the integer
    overflow and promotion doesn't follow the specific behavior that mimics
    how it would have behaved on 32-bit systems.
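
    The two orderings can be contrasted in C as follows. This is only a
    sketch with made-up function names; the wrapping version uses unsigned
    arithmetic so that the wrap itself is not undefined behaviour, and it
    assumes a 32-bit int as on the usual 64-bit targets.

```c
#include <stdint.h>

/* "Wrap, then promote": the 32-bit behaviour old code may depend on. */
static int64_t add_wrap_then_widen(int32_t a, int32_t b)
{
    uint32_t low = (uint32_t)a + (uint32_t)b;   /* wraps modulo 2^32 */
    /* Conversion to int32_t wraps on common compilers (impl-defined pre-C23). */
    return (int64_t)(int32_t)low;
}

/* "Promote first, then ADD": keeps the exact result. */
static int64_t add_widen_then_add(int32_t a, int32_t b)
{
    return (int64_t)a + (int64_t)b;
}
```

    For small operands both agree. For INT32_MAX + 1 the first returns
    -2147483648 and the second returns 2147483648, which is the kind of
    difference that silently changes the behaviour of code written
    against 32-bit wrapping.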


    I vaguely remember a case of this involving some robot enemies that
    drive around in ROTT, where if the integer overflow failed to work in
    just the right way, they would all miss their way-points and end up
    crashing into walls or similar.

    Where, the robot enemies followed a path defined as a series of
    waypoints (in a grid world), and once the robot hits a particular spot
    on the grid cell, it will change directions and head along the path.
    But, the particular way the expression to handle this was written was sensitive to the type promotion and wrap-on-overflow semantics in C.

    Also a similar case involving the "elevators", which were effectively
    timed teleporters between different parts of the map (would close door,
    play elevator sound, then right at the end as the door opens, it would teleport the player to the other location and initiate a screen shaking
    effect at around the same time). If the overflow was wrong, the teleport
    would fail and the player would still be in the original location.


    One could fix this stuff with casts or similar, but, when does one draw
    the line exactly?...

    Sometimes it is easier to make it work than to try to argue that the code
    was already broken due to reliance on UB.

    Well, and to match the behavior of the other compilers, needed to
    implement the behavior the way ROTT expected.


    Where, as noted, ROTT uses fixed-point math with "fixed" as a signed
    32-bit integer, and some cases involve calculations with coordinates
    well outside the world bounds with the seeming intention that these
    high-order components simply disappear into the ether (with the world essentially treated as a wrapping modulo space).


    But, as noted, it differed from my BS2 language, where the default was effectively to auto-promote values to the widest reasonable integer type
    in these cases and then drop down to the final range afterwards (to
    avoid some integer overflows in cases they would happen in C).

    Well, and within BGBCC, there was some non-zero bleed-over between C and
    BS2 (where originally I had been implementing BS2 via BGBCC, with the intention that it would compile to an IL image that would then be run in
    the VM).

    The original VM however, while fast, ended up with horrible code-bloat.
    I had gotten creative with the use of the C preprocessor in ways that were
    ultimately a terrible idea (errm, trying to use it sorta like a
    poor-man's version of C++ templates). Binaries got huge, build times
    sucked. This VM was a dead end.


    Ironically, some of my current ISA projects were built on some of the groundwork left by this experiment, but also as a warning for something
    not to do.

    Or, when I learned the merit of actually writing all the opcode handler functions and similar by hand and not trying to do combinatorial stuff
    via the preprocessor.


    Also, for the follow-up VM (for BS2), I had gone back to a ye-olde stack
    machine (vs a Register IR model). But, some parts of this were relevant
    to targeting an "actual CPU".

    The way JX2VM works isn't too far removed from those VMs in some ways,
    apart from JX2VM's general avoidance of getting too clever with the C preprocessor.

    ...

