Forum: War Ensemble BBS

Concertina IV Has Arrived

From quadi@quadibloc@ca.invalid to comp.arch on Tue May 19 20:14:37 2026

From Newsgroup: comp.arch

It had to happen?
I was not sure if it ever could happen.
There was Concertina II - an attempt at a practical ISA, unlike the
original Concertina, which was merely illustrative.
But it had a block structure, which was highly criticized. And I had to
admit it was overly complicated. And so I went on and used the Concertina
III designation again - for a CISC-like instruction set with variable-
length instructions. The price was, though, switching to register banks of
16 registers instead of 32.
The IBM 360 had banks of 16 registers, and more modern CISC designs, like
the 680x0 and the x86 have banks of only eight. Only RISC designs can
offer banks of 32 registers.
And yet it seemed so tantalizingly close - that it might just be possible, using what I've learned about squeezing an ISA into the available opcode space, to go back to banks of 32 registers.
I found it to be possible - at a price.
It could be done, but I wouldn't have much space left for 16-bit short instructions.
Even if I had lots of space for 16-bit short instructions, though, they
would still, just by being 16 bits long, where the banks of registers have
32 registers in them, be badly compromised.
And so I decided to offer only a very limited set of 16-bit short instructions, and to chiefly provide... 24-bit short instructions.
I didn't want to depart from the example of the 680x0 and the System/360
to allow instructions to start on odd bytes, but it seemed like I had no choice if I wanted to offer a reasonably complete set of short
instructions at all.
Concertina IV is described at:
http://www.quadibloc.com/arch/cw01int.htm

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 00:03:27 2026

From Newsgroup: comp.arch

I've made my first change to Concertina IV. I'm not happy with the way
things were before the change or the way they are now, so I may change it again.

The 16-bit short instructions only have 12 free bits available. That's not much to work with when there are 32 registers in each register bank.
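
The squeeze can be put in numbers. The following is only a rough sketch of the bit budget: the helper name and the idea of giving every operand a full-width register field are assumptions for illustration, not the actual Concertina IV encoding.

```python
# Bit budget for a short instruction (illustrative only; the field
# layout is an assumption, not the real Concertina IV format).
def opcode_bits(free_bits: int, registers: int, operands: int) -> int:
    """Bits left for the opcode after `operands` full register fields."""
    reg_field = (registers - 1).bit_length()  # 5 bits for 32 registers
    return free_bits - operands * reg_field

print(opcode_bits(12, 32, 2))  # 2 -> room for only four opcodes
print(opcode_bits(12, 16, 2))  # 4 -> room for sixteen opcodes
```

With two unrestricted 5-bit register fields only two opcode bits remain, which is why a four-bit opcode forces some restriction on how registers are specified.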

Initially, I settled on four bits of opcode, along with the basic register specification scheme used for the 15-bit paired short instructions in Concertina II.

But choosing single and double precision floating-point as the only two
types supported didn't rest easily with me. Single precision isn't really precise enough to be useful, or so I've heard.

The alternative of supporting 48-bit intermediate precision and double precision, while it appeals to me personally... is clearly untenable.
Medium is a nonstandard data type, and so it would not be widely used.

So instead I decided to only support double precision, and use the extra
bits to allow additional ways to specify registers.

The result, of course, is messy.

So I'm considering going back to the earlier format, but instead of
supporting two floating-point data types, to support one integer type and
one floating type. But which integer type? 32-bit integer, or 64-bit long?

I could get more bits by going to _paired_ instructions. But I have some
free space between 32-bit instructions so that I could just add those
while keeping 16-bit short instructions.

And this also led me to thinking about something else.

I align different integer types on the right, even while aligning
different floating-point types on the left like everyone else. So integer operations must sign-extend if they're on values shorter than 64 bits.

Propagating a bit takes time.

So should I design the ALU so that the sign extension takes place after
the rest of the instruction, and allow another 32-bit (or shorter) integer instruction to use results when they're ready, before sign extension? Is
that just normal efficiency, or wasteful complexity?
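
To make the question concrete, here is a small software model of a 32-bit add whose result is then sign-extended into a 64-bit register. The function names are hypothetical and this is a sketch of the data flow, not of any real ALU.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value into a 64-bit register image."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK64

def add32(a: int, b: int) -> int:
    """32-bit add on right-aligned operands held in 64-bit registers."""
    low = (a + b) & 0xFFFFFFFF    # the low 32 bits are complete at this point
    return sign_extend(low, 32)   # the upper 32 bits are just copies of bit 31
```

A following 32-bit instruction needs only `low`; the upper half depends on nothing but bit 31 of the result.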

In any case, I think I've come up with something that is a reasonable compromise I can live with after all.

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 00:34:59 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 00:03:27 +0000, quadi wrote:

In any case, I think I've come up with something that is a reasonable compromise I can live with after all.

And what was that compromise?

When it comes to floating-point types, there was only one that I valued
above all the rest, so I couldn't decide what second one to use.

With integer types, on the other hand, there were two types that I
couldn't decide between.

So go with it!

Support two integer types - even with room for logical as well as
arithmetic operations - but with a more limited specification of source
and destination registers... and one floating-point type.

Done!

John Savard

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 20 01:35:01 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

I've made my first change to Concertina IV. I'm not happy with the way things were before the change or the way they are now, so I may change it again.

The 16-bit short instructions only have 12 free bits available. That's not much to work with when there are 32 registers in each register bank.

Initially, I settled on four bits of opcode, along with the basic register specification scheme used for the 15-bit paired short instructions in Concertina II.

But choosing single and double precision floating-point as the only two types supported didn't rest easily with me. Single precision isn't really precise enough to be useful, or so I've heard.

Everything you have heard is both true and false::

There are many applications where DP is de rigueur {galactic simulations} smaller precision simply will not do. Many of these would like to go
FP128 but performance is not there yet.
There is a growing demand for FP16 and FP8 data types for memory-size
and BW reasons.
There is a growing background need for FP128, too.

The alternative of supporting 48-bit intermediate precision and double precision, while it appeals to me personally... is clearly untenable.
Medium is a nonstandard data type, and so it would not be widely used.

So instead I decided to only support double precision, and use the extra bits to allow additional ways to specify registers.

My 66000 started out that way and the compiler showed that this choice sucks.

The result, of course, is messy.

No, it becomes unacceptable when FP32 takes 3 instructions while FP64
takes but 1.

So I'm considering going back to the earlier format, but instead of supporting two floating-point data types, to support one integer type and one floating type. But which integer type? 32-bit integer, or 64-bit long?

You will find you have no <marketable> choice; you need to support::

Integer{S8, S16, S32, S64, U8, U16, U32, U64}
Float {FP8, FP16, FP32, FP64 and some way to get FP128}

I could get more bits by going to _paired_ instructions. But I have some free space between 32-bit instructions so that I could just add those
while keeping 16-bit short instructions.

And this also led me to thinking about something else.

I align different integer types on the right, even while aligning
different floating-point types on the left like everyone else. So integer operations must sign-extend if they're on values shorter than 64 bits.

Go LE all the way. LE won; get over BE thinking.

As far as integers go: all calculations produce proper integer values
in the 64-bit destination register.
S8 has range [-128..127]
u8 has range [0..255]
...

Propagating a bit takes time.

A solved HW gate-level problem.

So should I design the ALU so that the sign extension takes place after
the rest of the instruction, and allow another 32-bit (or shorter) integer instruction to use results when they're ready, before sign extension? Is that just normal efficiency, or wasteful complexity?

All the sign and zero stuff goes "in the CARRY chain".

In any case, I think I've come up with something that is a reasonable compromise I can live with after all.

John Savard


From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 02:09:08 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

I align different integer types on the right, even while aligning
different floating-point types on the left like everyone else. So
integer operations must sign-extend if they're on values shorter than
64 bits.

Go LE all the way. LE won; get over BE thinking.

a) I didn't think this really had anything to do with little-endian versus big-endian.

b) Yes, little-endian is more popular, but that's just because the PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work as well
*if* you also want to put packed decimal values in registers.

As far as integers go: all calculations produce proper integer values in
the 64-bit destination register.
S8 has range [-128..127]
u8 has range [0..255]
...

If you have 64 bit registers, then if you want to avoid a gap between the
sign in a 32-bit number and the sign of a 64-bit number by placing the
32-bit number on the most significant side, a 32-bit 1 is equal to a
64-bit 4,294,967,296.
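
A quick check of the left-aligned layout: if a 32-bit integer occupies the most significant half of a 64-bit register, the integer 1 lands at bit 32, so read as a 64-bit value it is 2**32. The helper name below is hypothetical.

```python
def left_aligned(value32: int) -> int:
    """Place a 32-bit value in the upper half of a 64-bit register."""
    return (value32 & 0xFFFFFFFF) << 32

print(f"{left_aligned(1):,}")  # 4,294,967,296
```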

Propagating a bit takes time.

A solved HW gate-level problem.

That's good news, then I don't have a problem. I figured the solution
would be to use slightly slower gates with larger current output.

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 02:26:42 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 02:09:08 +0000, quadi wrote:

On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

I align different integer types on the right, even while aligning
different floating-point types on the left like everyone else. So
integer operations must sign-extend if they're on values shorter than
64 bits.

Go LE all the way. LE won; get over BE thinking.

a) I didn't think this really had anything to do with little-endian
versus big-endian.

b) Yes, little-endian is more popular, but that's just because the
PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work
as well *if* you also want to put packed decimal values in registers.

As far as integers go: all calculations produce proper integer values
in the 64-bit destination register.
S8 has range [-128..127]
u8 has range [0..255]
...

If you have 64 bit registers, then if you want to avoid a gap between
the sign in a 32-bit number and the sign of a 64-bit number by placing
the 32-bit number on the most significant side, a 32-bit 1 is equal to a
64-bit 4,294,967,296.

While the majority of computers nowadays are little-endian, back in the
old days only a very few computers treated fixed-point numbers as
fractions in the range [-1,1) instead of as integers.

Those that did that either wasted a bit in double-word integers, or
required one to do a right shift by one bit after doing a multiplication
if you wanted the result of the multiplication to correspond with treating
the numbers as integers instead.
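
The one-bit shift can be seen in a small model. This is a sketch assuming 16-bit words read as fractions with 15 fraction bits and a 32-bit double word with 31 fraction bits; the function name is hypothetical.

```python
def frac_mul16(a: int, b: int) -> int:
    """Multiply two 16-bit fractions in [-1, 1).

    a and b are signed integers in [-32768, 32767], read as a / 2**15.
    The result is a 32-bit fraction, read as result / 2**31.
    """
    # The raw product has scale 2**30; the fractional convention wants 2**31.
    return (a * b) << 1

half = 1 << 14                       # 0.5 as a 16-bit fraction
quarter = frac_mul16(half, half)     # 0.25 as a 32-bit fraction
integer_product = frac_mul16(3, 5) >> 1   # shift right to get 15 back
```

The one case that does not fit is (-1) * (-1), whose true result +1 lies outside [-1, 1).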

This, not little-endian versus big-endian, was what I was talking about
not doing.

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 07:21:04 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

Everything you have heard is both true and false::

There are many applications where DP is de rigueur {galactic
simulations} smaller precision simply will not do. Many of these would
like to go FP128 but performance is not there yet.
There is a growing demand for FP16 and FP8 data types for memory-size
and BW reasons.
There is a growing background need for FP128, too.

I'm aware of all of this.

You will find you have no <marketable> choice; you need to support::

Integer{S8, S16, S32, S64, U8, U16, U32, U64}
Float {FP8, FP16, FP32, FP64 and some way to get FP128}

I *do* intend to support them all. However, U8, U16, U32, and U64 don't
get special instructions; the compiler will just have to remember the
meaning of the condition codes for signed numbers when doing comparisons
on unsigned numbers.

Actually, though, that does mean I have to modify the conditional branch instructions. One will actually want to test for combinations of less,
equal, and greater when overflow is present, and I've assumed that some combinations can be excluded!
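
Why overflow matters for comparisons can be illustrated with a conventional flag model (negative, zero, overflow, and a borrow flag), as found in many existing ISAs. This is only an assumption for illustration; Concertina IV's actual condition codes may be organized differently.

```python
def sub_flags(a: int, b: int, bits: int = 8):
    """Flags produced by computing a - b on `bits`-bit registers."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    r = (a - b) & mask
    n = bool(r & sign)                    # result looks negative
    z = r == 0                            # result is zero
    v = bool((a ^ b) & (a ^ r) & sign)    # signed overflow
    borrow = a < b                        # unsigned borrow out
    return n, z, v, borrow

def signed_less(a: int, b: int, bits: int = 8) -> bool:
    n, _, v, _ = sub_flags(a, b, bits)
    return n != v          # overflow inverts the meaning of the sign bit

def unsigned_less(a: int, b: int, bits: int = 8) -> bool:
    return sub_flags(a, b, bits)[3]
```

For example, comparing 0x80 with 0x01 gives "less" as signed (-128 < 1) but "greater" as unsigned (128 > 1), and the signed answer is only right because the overflow flag is consulted.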

So in commenting on a different part of my design entirely, you've pointed
out an important flaw I will have to correct.

It's just that the pigeonhole principle prevents me, quite effectively,
from supporting them all *in 16-bit short instructions with only 12 bits available*. I don't care what marketing says; I believe engineering when
they say they can't do the impossible.

John Savard

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 05:38:07 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

b) Yes, little-endian is more popular, but that's just because the PDP-11, 8080, and 6502 happened to choose it.

Thinking about it:

* The last descendent of the PDP-11 was canceled long before the most
prominent big-endian architecture (SPARC) was canceled, and long
before Power switched its Linux support to little-endian, so the
PDP-11 had little, if any, influence on the outcome.

* 8080: Yes, because AMD64 inherited its byte order from it. But if
we go to the origin here, it's not the 8080 and not the 8008, but
the Datapoint 2200, which is remarkable, because it was designed as
a terminal for mainframes, and S/360 is big-endian.
<https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
says:

|The fact that most laptops and cloud computers today store numbers
|in little-endian format is carried forward from the original
|Datapoint 2200. Because the original Datapoint 2200 had a serial
|processor, it needed to start with the lowest bit of the lowest byte
|in order to handle carries. Microprocessors descended from the
|Datapoint 2200 (the 8008, Z80, and the x86 chips used in most
|laptops and cloud computers today) kept the little-endian format
|used by that original Datapoint 2200.

* 6502: Yes, because ARM A64 inherited its byte order from it. The
6502 is remarkable because it is a child of the 6800, which is
big-endian. So the choice of little-endian byte order was
deliberate.

RISC-V inherits its original byte order from the descendents of 8080
and 6502. The ISA manual comments on this:

|We originally chose little-endian byte ordering for the RISC-V memory
|system because little-endian systems are currently dominant
|commercially (all x86 systems; iOS, Android, and Windows for ARM). A
|minor point is that we have also found little-endian memory systems to
|be more natural for hardware designers. However, certain application
|areas, such as IP networking, operate on big-endian data structures,
|and certain legacy code bases have been built assuming big-endian
|processors, so we have defined big-endian and bi-endian variants of
|RISC-V.
[...]
|We further make the instruction parcels themselves little-endian to
|decouple the instruction encoding from the memory system endianness
|altogether.

I expect that big-endian RISC-V's will be as common as big-endian
Alphas and big-endian ARMs (all Alphas and ARMs after a certain point
in time support a big-endian mode), i.e., not at all.

Little-endian doesn't work as well
*if* you also want to put packed decimal values in registers.

It certainly does. I know it because we had a group exercise in
assembly language on 80286s that dealt with BCD numbers, and we split
the project into submodules, one for each student. In integration
testing we found that we had forgotten to specify the byte order in
our interface descriptions. In our group, two students (including
me) had chosen little-endian and IIRC two had chosen big-endian. I
did not find that doing the BCD stuff in little-endian byte order
worked badly.

With the BCD support of instruction sets typically requiring piecing
together the complete operation of suboperations of less than full
length (e.g., bytes on the 6502 and the 80(2)86), little-endian is
actually easier. When you add two BCD numbers that are longer than a
byte, you don't have to first go to the end of the number and then go
backwards from there. This is especially relevant if you do not want
to completely unroll the loop that handles these bytes.
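
The byte-at-a-time loop described above might look as follows. This is a sketch in Python rather than 6502 or 8086 assembly, with a hypothetical function name; the digit correction stands in for what decimal mode or DAA would do per byte.

```python
def bcd_add_le(a: bytes, b: bytes) -> tuple[bytes, int]:
    """Add two equal-length packed-BCD numbers stored least significant
    byte first. Returns the sum bytes and the final carry (0 or 1)."""
    assert len(a) == len(b)
    out = bytearray()
    carry = 0
    for x, y in zip(a, b):                 # walk ascending addresses
        lo = (x & 0x0F) + (y & 0x0F) + carry
        carry = 1 if lo > 9 else 0
        lo -= 10 * carry
        hi = (x >> 4) + (y >> 4) + carry
        carry = 1 if hi > 9 else 0
        hi -= 10 * carry
        out.append((hi << 4) | lo)
    return bytes(out), carry

# 1234 + 8766 = 10000: four zero digits plus a carry out
total, carry_out = bcd_add_le(bytes([0x34, 0x12]), bytes([0x66, 0x87]))
```

The loop starts at the lowest address and never has to find the end of the number first; with big-endian storage the same loop would have to run over descending addresses.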

Note that the 6502 includes BCD support with its decimal mode, and the designers of the 6502 obviously did not agree with the claim you made
above.

When the 8080 added BCD support in form of the DAA instruction (the
8086 added DAS), the byte order decision had already been made with
the Datapoint 2200, but if they really thought that decimal operation
is a good reason for big-endian byte order, they could have done what
the 6502 had done and switched the byte order around from its
ancestors.

On the other hand, given that the 6502 and 8080 BCD support worked on
bytes, the programmers were free to choose any byte order they prefer,
as our student project proved. Maybe some (how many?) of the
programmers who wrote BCD code for the 6502 and for the 8080 and its descendants actually chose a big-endian format. Things get more
interesting if the granularity of BCD support is bigger than a byte,
e.g., on the HPPA or IIRC S/360.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed May 20 08:10:10 2026

From Newsgroup: comp.arch

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

* 8080: Yes, because AMD64 inherited its byte order from it. But if
we go to the origin here, it's not the 8080 and not the 8008, but
the Datapoint 2200, which is remarkable, because it was designed as
a terminal for mainframes, and S/360 is big-endian.
<https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
says:

|The fact that most laptops and cloud computers today store numbers
|in little-endian format is carried forward from the original
|Datapoint 2200. Because the original Datapoint 2200 had a serial
|processor, it needed to start with the lowest bit of the lowest byte
|in order to handle carries. Microprocessors descended from the
|Datapoint 2200 (the 8008, Z80, and the x86 chips used in most
|laptops and cloud computers today) kept the little-endian format
|used by that original Datapoint 2200.

For the Datapoint 2200, there was a solid technical reason:
It used shift register memory which supplied one bit at a time,
so the adder *had* to be little-endian.
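
A bit-serial adder can be modeled in a few lines. This is a generic sketch of the technique with hypothetical helper names, not a description of the Datapoint 2200's actual circuitry.

```python
def to_bits(n: int, width: int) -> list[int]:
    """Bits of n, least significant bit first."""
    return [(n >> i) & 1 for i in range(width)]

def serial_add(a_bits: list[int], b_bits: list[int]) -> tuple[list[int], int]:
    """Add two LSB-first bit streams with a one-bit full adder and a carry latch."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry
```

The carry latch only works if the bits arrive least significant first, which is the bit-level ordering constraint referred to above.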

See https://www.righto.com/2014/12/inside-intel-1405-die-photos-of-shift.html
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Wed May 20 10:42:44 2026

From Newsgroup: comp.arch

On 5/20/26 04:09, quadi wrote:

b) Yes, little-endian is more popular, but that's just because the
PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work
as well *if* you also want to put packed decimal values in registers.

For packed decimals that are processed in memory, little endian is
superior to big endian, because you don't have to look for the LSB when performing an addition, you can proceed bytewise on ascending addresses.

As a consequence, packed decimals in registers should also be little
endian, conceding the fact that the classic byte-wise representation is
skewed (but when displaying words, the reading order is natural).
--
Bernd Linsel

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 08:36:05 2026

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> writes:

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

<https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
says:

|[...] Because the original Datapoint 2200 had a serial
|processor, it needed to start with the lowest bit of the lowest byte
|in order to handle carries.

[...]

For the Datapoint 2200, there was a solid technical reason:
It used shift register memory which supplied one bit at a time,
so the adder *had* to be little-endian.

Looks plausible at first, but when I think about it some more, both
claims are wrong.

Yes, you start with the least significant bit, but given that the
architecture is not bit-addressed, this is irrelevant.

The architecture is byte-addressed, and the ALU only works on a single
byte, so the ALU does not work any better for little-endian than for big-endian.

For the 6502 dealing with carries in addressing, both in the relative addressing of conditional branches, and in the indexed addressing
modes with 16-bit base addresses, little-endian made the
implementation a little simpler. The Datapoint 2200 does not have
indexed addressing modes, so relative branches may have been the issue
(if the Datapoint 2200 has them).
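
The ordering argument for indexed addressing can be sketched as below. This is a simplified model with a hypothetical function name, not a cycle-accurate 6502 description: it only shows that with the low address byte stored first, the index can be added to it before the high byte has been fetched.

```python
def absolute_indexed(memory: bytes, pc: int, x: int) -> int:
    """Effective address of a 16-bit little-endian base at memory[pc] plus
    an 8-bit index x, computed one byte at a time."""
    lo = memory[pc]              # first operand byte: low half of the base
    lo_sum = lo + x              # can proceed while the high byte is fetched
    hi = memory[pc + 1]          # second operand byte: high half of the base
    hi = (hi + (lo_sum >> 8)) & 0xFF   # apply the carry from the low half
    return (hi << 8) | (lo_sum & 0xFF)

# Base 0x12F0 stored as F0 12, index 0x20 -> 0x1310
address = absolute_indexed(bytes([0xF0, 0x12]), 0, 0x20)
```

With a big-endian operand the high byte would arrive first and would have to wait for the low-byte addition before it could be finalized.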

Did I miss any other reason why little-endian byte order is easier to
implement on these processors than big-endian?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed May 20 10:37:39 2026

From Newsgroup: comp.arch

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

Thomas Koenig <tkoenig@netcologne.de> writes:

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

<https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
says:

|[...] Because the original Datapoint 2200 had a serial
|processor, it needed to start with the lowest bit of the lowest byte
|in order to handle carries.

[...]

For the Datapoint 2200, there was a solid technical reason:
It used shift register memory which supplied one bit at a time,
so the adder *had* to be little-endian.

Looks plausible at first, but when I think about it some more, both
claims are wrong.

Unfortunately, you are mistaken.

Yes, you start with the least significant bit, but given that the architecture is not bit-addressed, this is irrelevant.

JMP with a two-byte address was little-endian on the Datapoint 2200,
and so had to be on the Intel 8008, which had to be binary compatible
with the TTL CPU of the 2200.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 20 15:03:22 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

quadi <quadibloc@ca.invalid> writes:

b) Yes, little-endian is more popular, but that's just because the PDP-11, 8080, and 6502 happened to choose it.

Thinking about it:

With the BCD support of instruction sets typically requiring piecing
together the complete operation of suboperations of less than full
length (e.g., bytes on the 6502 and the 80(2)86), little-endian is
actually easier. When you add two BCD numbers that are longer than a
byte, you don't have to first go to the end of the number and then go
backwards from there. This is especially relevant if you do not want
to completely unroll the loop that handles these bytes.

The B3500 had a clever algorithm for adding BCD numbers. The
addend and augend could each be from 1 to 100 digits in length.
The algorithm would start adding from the lowest address of each
operand (the most significant digit of the longest operand), adding
each digit in turn.

"The processor uses an adder that accumulates two fields
from the most significant to the least significant digit
positions. Reverse addition, as incorporated in the
B2500 and B3500 systems has the advantage of detecting
an overflow condition prior to altering the receiving field"

The algorithm used a 9's counter to track the leading
digits.
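
The general idea of adding from the most significant digit while tracking runs of 9s can be sketched as follows. This is a generic reconstruction of the technique under the assumption of equal-length operands, with a hypothetical function name; the thread does not give the actual B2500/B3500 hardware algorithm, which may differ in detail.

```python
def add_msd_first(a: list[int], b: list[int]) -> list[int]:
    """Add two equal-length decimal digit lists, most significant digit first.

    Returns len(a) + 1 digits; a leading 1 means the sum overflows the field.
    """
    out = []
    held = 0    # last digit not yet committed (initially the overflow digit)
    nines = 0   # run of 9s after `held` that a later carry would turn into 0s
    for x, y in zip(a, b):
        s = x + y
        if s == 9:
            nines += 1          # cannot decide yet: a carry may still arrive
        else:
            carry = 1 if s > 9 else 0
            out.append(held + carry)
            out.extend([0 if carry else 9] * nines)
            held, nines = s % 10, 0
    out.append(held)
    out.extend([9] * nines)
    return out

print(add_msd_first([1, 9, 9], [8, 0, 1]))  # [1, 0, 0, 0]
```

Only one held digit and a counter are needed, so no digit is committed until it is known whether a carry will reach it; in this sketch that includes the leading overflow digit.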

From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 15:28:16 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 05:38:07 +0000, Anton Ertl wrote:

* The last descendent of the PDP-11 was canceled long before the most
prominent big-endian architecture (SPARC) was canceled, and long
before Power switched its Linux support to little-endian, so the
PDP-11 had little, if any, influence on the outcome.

The reason I blame the PDP-11 for everything is that it was a hugely influential machine. It was widely used in academic settings, and it was
also the machine for which UNIX was first widely distributed.

When you add two BCD numbers that are longer than a
byte, you don't have to first go to the end of the number and then go backwards from there. This is especially relevant if you do not want to completely unroll the loop that handles these bytes.

This is the reason little-endian was popular for small processors. It is
no longer relevant if a processor has a 64-bit data bus. And, of course,
it applies equally to binary and BCD.

The reason I claim that BCD support strongly favors big-endian byte order
is this:

Character strings are, of course, in "big endian" order; that is,
normally, a character string is written in memory with successive
characters at increasing addresses - and, at least in languages that are written from left to right, numerals appear in texts with the most
significant digit first.

So if one has a hardware instruction to convert from BCD to the string representation of numbers, such as UNPK or EDIT, then those two representations should have the same endian-ness.

And if one wants to use the same ALU for binary and BCD arithmetic, then
those have to have the same endianness.

John Savard

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 20 15:32:41 2026

From Newsgroup: comp.arch

Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com> writes:

On 5/20/26 04:09, quadi wrote:

b) Yes, little-endian is more popular, but that's just because the
PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work
as well *if* you also want to put packed decimal values in registers.

For packed decimals that are processed in memory, little endian is
superior to big endian, because you don't have to look for the LSB when
performing an addition, you can proceed bytewise on ascending addresses.

Burroughs figured that problem out a half century ago, and were
able to add two big-endian BCD numbers memory-to-memory. Overflow was
detected before the receiving field was modified (without intermediate
or internal storage) by counting leading 9s.


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 11:04:44 2026

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> writes:

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

Thomas Koenig <tkoenig@netcologne.de> writes:

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

<https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
says:

|[...] Because the original Datapoint 2200 had a serial
|processor, it needed to start with the lowest bit of the lowest byte
|in order to handle carries.

[...]

For the Datapoint 2200, there was a solid technical reason:
It used shift register memory which supplied one bit at a time,
so the adder *had* to be little-endian.

Looks plausible at first, but when I think about it some more, both
claims are wrong.

Unfortunately, you are mistaken.

A claim without any supporting argument.

Yes, you start with the least significant bit, but given that the
architecture is not bit-addressed, this is irrelevant.

JMP with a two-byte address was little-endian on the Datapoint 2200,

Yes, but is the bit-serial memory the reason for that? No, the ALU is
not involved, and they could just have decided to represent the
address in big-endian byte order, and put the 16 bits into the PC (or
next-PC) register.

The conditional jump instructions of the Datapoint 2200 also have
absolute target addresses and don't involve the ALU.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 15:42:03 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

On Wed, 20 May 2026 05:38:07 +0000, Anton Ertl wrote:

* The last descendent of the PDP-11 was canceled long before the most
prominent big-endian architecture (SPARC) was canceled, and long
before Power switched its Linux support to little-endian, so the
PDP-11 had little, if any, influence on the outcome.

The reason I blame the PDP-11 for everything is that it was a hugely
influential machine. It was widely used in academic settings, and it was
also the machine for which UNIX was first widely distributed.

But its byte order was not influential into this century. Unix and
its applications are portable, including between byte orders (or at
least they were, when there were still enough machines of either byte
order around that one could test that). And somehow the PDP-11 and
its offspring did not capture the workstation market and the server
market that evolved from that, and which constituted the Unix
markets.

Instead, the big-endian 68000 and its offspring dominated that market
for a while, and was replaced with RISCs later, which had the same
byte order as the earlier machines from the same company (i.e.,
little-endian for DEC and big-endian for the others). And when the
market for workstations and servers on RISCs shrank down to almost
nothing, not only did these big-endian machines vanish, but the
offspring of the PDP-11 as well (and actually before some of the
big-endian RISCs). What remains of this world is AIX on Power, and I
have no idea how many installations there still are.

Linux on Power was switched to little-endian with the introduction of OpenPower, not because of the PDP-11 descendants, but because of the
Datapoint 2200 descendants. And the Datapoint 2200 (announced in June
1970) was probably not influenced by the PDP-11 (announced in January
1970).

When you add two BCD numbers that are longer than a
byte, you don't have to first go to the end of the number and then go
backwards from there. This is especially relevant if you do not want to
completely unroll the loop that handles these bytes.

This is the reason little-endian was popular for small processors. It is
no longer relevant if a processor has a 64-bit data bus. And, of course,
it applies equally to binary and BCD.

If the numbers fit in one granule, yes, that benefit does not matter.
But 64 bits are not enough for all binary numbers and probably not for
all BCD numbers, either: the decimal FP people were not satisfied with
the 15-digit mantissa that are easily possible with their
representations in 64 bits; they did not even define a decimal64
format last I checked. So will 16-digit BCD numbers be satisfactory?

The reason I claim that BCD support strongly favors big-endian byte order
is this:

Character strings are, of course, in "big endian" order; that is,
normally, a character string is written in memory with successive
characters at increasing addresses - and, at least in languages that are
written from left to right, numerals appear in texts with the most
significant digit first.

So if one has a hardware instruction to convert from BCD to the string
representation of numbers, such as UNPK or EDIT, then those two
representations should have the same endian-ness.

Reality check: Modern architectures tend to have byte-swap and shuffle instructions. They tend not to have BCD-to-ASCII instructions, but
these can be implemented easily enough with the help of shuffle and
bitwise instructions. And given that you need to use shuffle anyway,
the byte-swapping does not cost extra.

For doing it for more than one granule, you have to pay the big-endian
cost on that conversion (for storing into the string, the loading of
the BCD number would still be in little-endian order), but at least
not for the arithmetic operations.

And if one wants to use the same ALU for binary and BCD arithmetic, then
those have to have the same endianness.

Sure, but that's not a reason to use big-endian byte order, see above.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed May 20 17:01:56 2026

From Newsgroup: comp.arch

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

Thomas Koenig <tkoenig@netcologne.de> writes:

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

Thomas Koenig <tkoenig@netcologne.de> writes:

Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

<https://en.wikipedia.org/wiki/Datapoint_2200#Technical_description>
says:

|[...] Because the original Datapoint 2200 had a serial
|processor, it needed to start with the lowest bit of the lowest byte
|in order to handle carries.

[...]

For the Datapoint 2200, there was a solid technical reason:
It used shift register memory which supplied one bit at a time,
so the adder *had* to be little-endian.

Looks plausible at first, but when I think about it some more, both
claims are wrong.

Unfortunately, you are mistaken.

A claim without any supporting argument.

Then maybe some more explanation is needed. It is sometimes difficult
to think back to the limitations those designers faced.

The 2200 did not have byte-addressable memory; memory contents only
could be used when they bubbled up through the shift registers.
Otherwise, the CPU had to wait. (It was a silicon version of the
mercury delay lines of the UNIVAC I).

So, how do you add or subtract values in memory? From low to high
value, saving carries. You then have a choice of either loading
them in sequence, in a single go, or loading the high value,
waiting for half a microsecond, and then loading the low value.

Would you build such a machine in big-endian or little-endian?

(And yes, it seems negative branches could take ~ 500 cycles, as
well.)
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 17:25:21 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 15:42:03 +0000, Anton Ertl wrote:

quadi <quadibloc@ca.invalid> writes:

Character strings are, of course, in "big endian" order; that is,
normally, a character string is written in memory with successive
characters at increasing addresses - and, at least in languages that are
written from left to right, numerals appear in texts with the most
significant digit first.

So if one has a hardware instruction to convert from BCD to the string
representation of numbers, such as UNPK or EDIT, then those two
representations should have the same endian-ness.

Reality check: Modern architectures tend to have byte-swap and shuffle instructions. They tend not to have BCD-to-ASCII instructions, but
these can be implemented easily enough with the help of shuffle and
bitwise instructions. And given that you need to use shuffle anyway,
the byte-swapping does not cost extra.

An additional instruction is an additional instruction! But I think you
simply mean that the hardware is present. I'm not saying that BCD can't be implemented in a little-endian architecture; I'm saying it's much easier
to understand and define when BCD and character strings and binary all go
the same way - and the byte order of character strings is fixed.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From John Levine@johnl@taugh.com to comp.arch on Wed May 20 17:47:59 2026

From Newsgroup: comp.arch

According to Bernd Linsel <bl1-thispartdoesnotbelonghere@gmx.com>:

For packed decimals that are processed in memory, little endian is
superior to big endian, because you don't have to look for the LSB when
performing an addition, you can proceed bytewise on ascending addresses.

It depends what you're doing. If you're doing arithmetic, you need to start at the low end. If you're packing or unpacking or editing for display, you need to start at the high end. My understanding is that back in the day when performance
mattered, the applications that used BCD arithmetic typically did one arithmetic
operation on each value, so the pack/edit mattered more.

I have looked into this in some detail. When IBM used big-endian order on
S/360 and DEC used little-endian on the PDP-11, neither documented the
reasons for the byte order choice at all. Not even a little bit.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From John Levine@johnl@taugh.com to comp.arch on Wed May 20 18:07:14 2026

From Newsgroup: comp.arch

According to Scott Lurndal <slp53@pacbell.net>:

The B3500 had a clever algorithm for adding BCD numbers. The
addend and augend could each be from 1 to 100 digits in length.
The algorithm would start adding from the lowest (most significant
digit in the longest operand) address of each operand adding
each digit in turn.

"The processor uses an adder that accumulates two fields
from the most significant to the least significant digit
positions. Reverse addition, as incorporated in the
B2500 and B3500 systems has the advantage of detecting
an overflow condition prior to altering the receiving field"

The algorithm used a 9's counter to track the leading
digits.

How did it handle carries? Let's say you're adding

099999999999999999999999999999999999999999999999999
000000000000000000000000000000000000000000000000001

If it starts at the high digit, it won't know until it gets to the end
that it has to propagate carries all the way back to the beginning.

S/360 had operand lengths in the instructions so even though it
addressed the high byte, it could do one add and get the address
of the low byte. On S/370 and later machines with virtual memory
it was more complicated since it had to check and be sure that all
of the pages where the operands resided were available.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed May 20 17:30:49 2026

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> writes:

The 2200 did not have byte-addressable memory; memory contents only
could be used when they bubbled up through the shift registers.
Otherwise, the CPU had to wait. (It was a silicon version of the
mercury delay lines of the UNIVAC I).

So, how do you add or subtract values in memory? From low to high
value, saving carries. You then have a choice of either loading
them in sequence, in a single go, or to load the high value,
wait for half a microsecond and then load the low value.

The Datapoint 2200 has only instructions for adding or subtracting the
bits of a byte. For adding two 16-bit values X and Y, you load the
LSB of X and the LSB of Y, add them, store the result, load the MSB of
X and MSB of Y, adc them, and store the result.
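The sequence just described can be sketched in a few lines of Python. This is only an illustration of the add/adc pattern, not Datapoint 2200 code; the function name and the way the bytes are passed in are made up for the example.

```python
def add16_bytewise(x_lo, x_hi, y_lo, y_hi):
    """Add two 16-bit values held as separate bytes, 8 bits at a time.

    Returns (result_lo, result_hi, carry_out).
    """
    # Low bytes first: a plain add that produces a carry.
    total = x_lo + y_lo
    r_lo = total & 0xFF
    carry = total >> 8

    # High bytes second: add with carry (adc).
    total = x_hi + y_hi + carry
    r_hi = total & 0xFF
    carry = total >> 8

    return r_lo, r_hi, carry


# 0x00FF + 0x0001 = 0x0100
print(add16_bytewise(0xFF, 0x00, 0x01, 0x00))
```

The order of the two additions is fixed by the carry, but nothing in the sketch depends on which address each byte was loaded from.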

Given that you have only HL for memory access, and several registers,
if the LSBs and MSBs are adjacent, you probably first want to load the
LSB and MSB of X (and in that case, there is no preferred order), and
add the LSB of Y, move A to some other register, then move the MSB of
X into A, and adc the MSB of Y, then store the LSB and MSB of the
result (again, no preferred order). And note that for any new address
you access, you have to change at least L between the memory accesses,
and maybe also H.

Even with that kind of drum-like memory, how will little-endian
provide a benefit? At best in the memory accesses to Y, but only if
the other stuff that is going on between these two memory accesses
does not advance the memory chip across the MSB (if the MSB is
actually in the same memory chip as the LSB).

And in any case, this is pure software convention. There is nothing
in the architecture that tells programmers how to arrange the two
bytes of a 16-bit data number. They could also do an array for the
LSBs and an array for the MSBs (structure-of-array style), and then
one would not need so many registers for intermediate storage. Load
LSB of X, (update L), add LSB of Y, (update L), store LSB of the
result, then (update L and maybe H), load MSB of X, (update L), adc
MSB of Y, (update L) store the MSB of the result.

The only thing in the architecture that actually specifies
little-endian byte order is in the control-flow instructions where the
byte order of the target address is little-endian. But bit-serial
memory is not the reason for that, implementing these instructions
with a big-endian target address would have been just as fast and just
as hard.

Would you build such a machine in big-endian or little-endian?

It's not about what I would do, but about what is little-endian about
the Datapoint 2200, and if there were technical reasons for that. I
don't see any.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 20 18:13:22 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

I align different integer types on the right, even while aligning
different floating-point types on the left like everyone else. So
integer operations must sign-extend if they're on values shorter than
64 bits.

Go LE all the way. LE won; get over BE thinking.

a) I didn't think this really had anything to do with little-endian versus big-endian.

b) Yes, little-endian is more popular, but that's just because the PDP-11, 8080, and 6502 happened to choose it. Little-endian doesn't work as well *if* you also want to put packed decimal values in registers.

BE's advantage is only when packed decimal is both not a power of 2 in
size, and residing in memory. Once in a register those advantages vanish.
One could make a LE in MEM PD solution work with modern resource counts,
too.

As far as integers go: all calculations produce proper integer values in the 64-bit destination register.
s8 has range [-128..127]
u8 has range [0..255]
...

If you have 64 bit registers, then if you want to avoid a gap between the
sign in a 32-bit number and the sign of a 64-bit number by placing the
32-bit number on the most significant side, a 32-bit 1 is equal to a
64-bit 4,294,967,296.

Propagating a bit takes time.

A solved HW gate-level problem.

That's good news, then I don't have a problem. I figured the solution
would be to use slightly slower gates with larger current output.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 20 19:03:01 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> writes:

According to Scott Lurndal <slp53@pacbell.net>:

The B3500 had a clever algorithm for adding BCD numbers. The
addend and augend could each be from 1 to 100 digits in length.
The algorithm would start adding from the lowest (most significant
digit in the longest operand) address of each operand adding
each digit in turn.

"The processor uses an adder that accumulates two fields
from the most significant to the least significant digit
positions. Reverse addition, as incorporated in the
B2500 and B3500 systems has the advantage of detecting
an overflow condition prior to altering the receiving field"

The algorithm used a 9's counter to track the leading
digits.

How did it handle carries? Let's say you're adding

099999999999999999999999999999999999999999999999999
000000000000000000000000000000000000000000000000001

A value that overflows the size of the receiving field
cannot be represented, so the overflow toggle is set and
the instruction terminates _without modifying the
receiving field_.

The size of the receiving field is the larger of the
two source fields. So

ADD 0508 000000 100000 200000

would add the 5 digit value at address 0 to the
8 digit value at address 100000 and store the
result at address 200000.

If it starts at the high digit, it won't know until it gets to the end
that it has to propagate carries all the way back to the beginning.

Actually, that's the clever part. They count 9s.

Example 1: 10 digit receiving field, 10 digit addend, 1 digit augend:

Memory contents before:

000000: 9999999999
000010: 1

ADD 1001 000000 000010 000020

The result of the instruction is that the overflow toggle
will be set and the destination field will remain unmodified.

The algorithm implicitly fills leading zeros into
the shorter operand.

The first digit of the addend operand is read. '9' in
this case. The first digit of the augend is added (in this
case, implicitly zero) and the result is 9. A special
register (the 9's counter) is incremented and the algorithm
proceeds to the next digit. Wash, rinse and repeat until
reaching the last digit, where the sum of 9 + 1 will overflow
a single digit, so the instruction terminates with overflow.

If in the case you showed above, there was a zero in the
first digit of both operands, there is no possibility of
overflow and the algorithm will simply process each
digit of the addend+augend sequentially from higher
magnitude to lower magnitude. It delays writing each
digit of the sum (other than the last) until it knows
the following digit doesn't overflow. If it does
overflow, it increments the delayed value before
writing. To the extent that there are multiple sequential
9s in the sum, when the next digit would overflow, the
processor uses the 9's counter and the saved digit to
store the correct digits to the receiving field.

There's a flow chart in 1025475_B2500_B3500_RefMan_Oct69.pdf
which is available on bitsavers.
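
A rough Python reconstruction of the 9-counting scheme as described above. It is not taken from the flow chart in the manual; the function name, the use of digit strings, and the output list standing in for the receiving field are all assumptions made for illustration.

```python
def add_msd_first(addend, augend):
    """Add two decimal digit strings starting at the most significant digit.

    Returns (sum_digits, overflow). On overflow nothing has been written.
    """
    width = max(len(addend), len(augend))
    a = addend.rjust(width, "0")   # implicit leading zeros
    b = augend.rjust(width, "0")

    out = []         # stands in for the receiving field
    pending = None   # digit held back until we know whether a carry arrives
    nines = 0        # the "9's counter": run of 9s after the pending digit

    for da, db in zip(a, b):
        s = int(da) + int(db)
        if s == 9:
            # A later carry could still ripple through this digit.
            nines += 1
        elif s < 9:
            # No carry can pass this digit: the held digits are final.
            if pending is not None:
                out.append(pending)
            out.extend([9] * nines)
            pending, nines = s, 0
        else:
            # Carry: bump the held digit and turn the run of 9s into 0s.
            if pending is None:
                return None, True   # carry out of the top, field untouched
            out.append(pending + 1)
            out.extend([0] * nines)
            pending, nines = s - 10, 0

    if pending is not None:
        out.append(pending)
    out.extend([9] * nines)
    return "".join(str(d) for d in out), False


print(add_msd_first("0999", "0001"))
print(add_msd_first("9999999999", "1"))
```

In this sketch a carry out of the top digit can only occur while no digit has been written yet, which is consistent with the manual's claim that overflow is detected before the receiving field is altered.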
--- Synchronet 3.22a-Linux NewsLink 1.2

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed May 20 21:33:05 2026

From Newsgroup: comp.arch

Anton Ertl wrote:

quadi <quadibloc@ca.invalid> writes:

On Wed, 20 May 2026 05:38:07 +0000, Anton Ertl wrote:

* The last descendent of the PDP-11 was canceled long before the most
prominent big-endian architecture (SPARC) was canceled, and long
before Power switched its Linux support to little-endian, so the
PDP-11 had little, if any, influence on the outcome.

The reason I blame the PDP-11 for everything is that it was a hugely
influential machine. It was widely used in academic settings, and it was
also the machine for which UNIX was first widely distributed.

But its byte order was not influential into this century. Unix and
its applications are portable, including between byte orders (or at
least they were, when there were still enough machines of either byte
order around that one could test that). And somehow the PDP-11 and
its offspring did not capture the workstation market and the server
market that evolved from that, and which constituted the Unix
markets.

Instead, the big-endian 68000 and its offspring dominated that market
for a while, and were replaced with RISCs later, which had the same
byte order as the earlier machines from the same company (i.e.,
little-endian for DEC and big-endian for the others). And when the
market for workstations and servers on RISCs shrank down to almost
nothing, not only did these big-endian machines vanish, but the
offspring of the PDP-11 as well (and actually before some of the
big-endian RISCs). What remains of this world is AIX on Power, and I
have no idea how many installations there still are.

Linux on Power was switched to little-endian with the introduction of
OpenPower, not because of the PDP-11 descendants, but because of the
Datapoint 2200 descendants. And the Datapoint 2200 (announced in June
1970) was probably not influenced by the PDP-11 (announced in January
1970).

When you add two BCD numbers that are longer than a
byte, you don't have to first go to the end of the number and then go
backwards from there. This is especially relevant if you do not want to
completely unroll the loop that handles these bytes.

This is the reason little-endian was popular for small processors. It is
no longer relevant if a processor has a 64-bit data bus. And, of course,
it applies equally to binary and BCD.

If the numbers fit in one granule, yes, that benefit does not matter.
But 64 bits are not enough for all binary numbers and probably not for
all BCD numbers, either: the decimal FP people were not satisfied with
the 15-digit mantissas that are easily possible with their
representations in 64 bits; they did not even define a decimal64
format last I checked. So will 16-digit BCD numbers be satisfactory?

IEEE 754 does define decimal64, decimal128 and even decimal32, but the
first two have pretty much all the actual usage, probably (?) decimal128
as the majority, at least for all accumulators.

The reason I claim that BCD support strongly favors big-endian byte order
is this:

Character strings are, of course, in "big endian" order; that is,
normally, a character string is written in memory with successive
characters at increasing addresses - and, at least in languages that are
written from left to right, numerals appear in texts with the most
significant digit first.

So if one has a hardware instruction to convert from BCD to the string
representation of numbers, such as UNPK or EDIT, then those two
representations should have the same endian-ness.

Reality check: Modern architectures tend to have byte-swap and shuffle instructions. They tend not to have BCD-to-ASCII instructions, but
these can be implemented easily enough with the help of shuffle and
bitwise instructions. And given that you need to use shuffle anyway,
the byte-swapping does not cost extra.

BCD-to-ASCII, with the input in an AVX 32-byte register, so up to 64
digits, would start with an exchange of the high and low 16-byte halves,
then a permute of each half to reverse the byte order. That final
single-cycle operation is the only overhead of little- vs big-endian
inputs.

Next we widen the input by unpacking the high and low 16 bytes so that
each byte value becomes one of 16 16-bit shorts, with the leading byte 0.
Then (in parallel) you copy and mask the low nybble while shifting all
shorts up by 4 bits, then use the same all-15 mask to save the high
nybbles. OR these two back together, and do the same for the other half
of the original input. About 15-20 cycles in total with well under 10%
being the byte order swap.
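
For readers who want to see the data movement without the SIMD details, here is a scalar Python sketch of the same conversion. It is not the AVX sequence described above; it assumes unsigned packed BCD with no sign nybble, stored least significant byte first, and the function name is made up.

```python
def bcd_le_to_ascii(packed: bytes) -> str:
    """Convert little-endian packed BCD (two digits per byte) to ASCII text."""
    out = bytearray()
    for byte in reversed(packed):          # the byte-order swap
        out.append(0x30 | (byte >> 4))     # high nybble becomes an ASCII digit
        out.append(0x30 | (byte & 0x0F))   # low nybble becomes an ASCII digit
    return out.decode("ascii")


# The value 1234 stored least significant byte first.
print(bcd_le_to_ascii(bytes([0x34, 0x12])))
```

The reversal is one step among several here too; the nybble splitting and the OR with 0x30 are needed for either byte order.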

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.22a-Linux NewsLink 1.2

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed May 20 21:45:57 2026

From Newsgroup: comp.arch

Scott Lurndal wrote:

John Levine <johnl@taugh.com> writes:

According to Scott Lurndal <slp53@pacbell.net>:

The B3500 had a clever algorithm for adding BCD numbers. The
addend and augend could each be from 1 to 100 digits in length.
The algorithm would start adding from the lowest (most significant
digit in the longest operand) address of each operand adding
each digit in turn.

"The processor uses an adder that accumulates two fields
from the most significant to the least significant digit
positions. Reverse addition, as incorporated in the
B2500 and B3500 systems has the advantage of detecting
an overflow condition prior to altering the receiving field"

The algorithm used a 9's counter to track the leading
digits.

How did it handle carries? Let's say you're adding

099999999999999999999999999999999999999999999999999
000000000000000000000000000000000000000000000000001

A value that overflows the size of the receiving field
cannot be represented, so the overflow toggle is set and
the instruction terminates _without modifying the
receiving field_.

The size of the receiving field is the larger of the
two source fields. So

ADD 0508 000000 100000 200000

would add the 5 digit value at address 0 to the
8 digit value at address 100000 and store the
result at address 200000.

If it starts at the high digit, it won't know until it gets to the end
that it has to propagate carries all the way back to the beginning.

Actually, that's the clever part. They count 9s.

Example 1: 10 digit receiving field, 10 digit addend, 1 digit augend:

Memory contents before:

000000: 9999999999
000010: 1

ADD 1001 000000 000010 000020

The example he showed had an 11 digit receive field so it would not
overflow, but the two inputs would cause a full carry propagate all the
way to the top digit.

The result of the instruction is that the overflow toggle
will be set and the destination field will remain unmodified.

The algorithm implicitly fills leading zeros into
the shorter operand.

The first digit of the addend operand is read. '9' in
this case. The first digit of the augend is added (in this
case, implicitly zero) and the result is 9. A special
register (the 9's counter) is incremented and the algorithm
proceeds to the next digit. Wash, rinse and repeat until
reaching the last digit, where the sum of 9 + 1 will overflow
a single digit, so the instruction terminates with overflow.

If in the case you showed above, there was a zero in the
first digit of both operands, there is no possibility of

That's what he showed afair?

overflow and the algorithm will simply process each
digit of the addend+augend sequentially from higher
magnitude to lower magnitude. It delays writing each
digit of the sum (other than the last) until it knows
the following digit doesn't overflow. If it does
overflow, it increments the delayed value before
writing. To the extent that there are multiple sequential
9s in the sum, when the next digit would overflow, the
processor uses the 9's counter and the saved digit to
store the correct digits to the receiving field.

There's a flow chart in 1025475_B2500_B3500_RefMan_Oct69.pdf
which is available on bitsavers.

So it did process them top-down, but delayed writing anything to the
output field until it was known that it would not overflow, and the same
happened for every subsequent partial sum of 9.

Yeah, that works but it probably caused some output hiccups when a long
chain of potential carries finally resolved. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Wed May 20 22:50:04 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 18:07:14 +0000, John Levine wrote:

On S/370 and later machines with virtual memory it was more complicated
since it had to check and be sure that all of the pages where the
operands resided were available.

Yes, since while the System/360 gave you an error if you tried to use unaligned operands in memory, this restriction was abolished with the System/370. Only an unaligned operand can possibly cross a page boundary, since pages have a power-of-two size greater than the size of any data
type.

But this means that even on the System/370, it's a rare event that an instruction will refer to an unaligned operand. So that there is some
extra overhead for unaligned values might well have been considered acceptable.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 00:06:54 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 07:21:04 +0000, quadi wrote:

So in commenting on a different part of my design entirely, you've
pointed out an important flaw I will have to correct.

It's possible that I panicked needlessly, and the conditional branches I support, being the conventional set, are indeed sufficient for unsigned
values as well; for them, they would have alternate names in assembler,
but no additional types of branch perhaps are needed.

I will have to review this point, however, to be sure.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From John Levine@johnl@taugh.com to comp.arch on Thu May 21 00:37:39 2026

From Newsgroup: comp.arch

It appears that quadi <quadibloc@ca.invalid> said:

On Wed, 20 May 2026 18:07:14 +0000, John Levine wrote:

On S/370 and later machines with virtual memory it was more complicated
since it had to check and be sure that all of the pages where the
operands resided were available.

Yes, since while the System/360 gave you an error if you tried to use
unaligned operands in memory, this restriction was abolished with the
System/370. Only an unaligned operand can possibly cross a page boundary,
since pages have a power-of-two size greater than the size of any data
type.

While that is true for the RX and RS instructions that do loads and
stores and arithmetic operations, it is not at all true for the SS
instructions common in commercial code.

They have two storage operands with the length specified in the second
byte of the instruction. Even on S/360 there is no alignment
requirement for any of the operands. In most cases it can tell the
sizes of the operands at the time the instruction is decoded, e.g.,
decimal add (AP) has two four-bit length codes that say how long each
operand is and move characters (MVC) has a single 8-bit length code
that applies to both operands.

But sometimes it is not that simple. Translate and test (TRT) has
a string operand with a length, and a second 256 byte table operand.
It fetches the bytes from the string one at a time, looks them up
in the table, and stops as soon as the looked up value is non-zero,
putting the address of the source byte and the lookup values in
R1 and R2. Only the bytes actually fetched have to be resident.
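
A small Python model of the TRT behavior described above may make the paging problem easier to see. It is a sketch only: the function name and the return convention are invented, and it ignores condition codes and the exact register layout.

```python
def translate_and_test(memory, start, length, table):
    """Scan memory[start:start+length] one byte at a time.

    Returns (address, table_value) for the first byte whose table entry
    is nonzero, or None if every entry looked up was zero.
    """
    for addr in range(start, start + length):
        value = table[memory[addr]]   # look the source byte up in the table
        if value != 0:
            return addr, value        # stop early; later bytes are never read
    return None


# Table that flags only the comma character.
table = bytearray(256)
table[ord(",")] = 1
print(translate_and_test(b"abc,def", 0, 7, table))
```

Because the loop can stop early, how much of the string operand is actually touched depends on the data, which is why the set of pages that must be resident cannot be known at decode time.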

The Edit instruction (ED) takes a packed decimal operand and
a pattern, with the length specifying the length of the pattern.
It goes through the pattern a byte at a time with some pattern
bytes ("digit selector") taking the next digit from the input
operand and others just copied literally. The length of the
input operand depends on the contents of the pattern.

To make this work S/370 and its successors first do a trial
execution of the instruction without storing anything to see
if it causes a page fault. If not, it then redoes the
instruction for real, storing the result. I suspect that
if they had known how soon S/370 would add paging to the 360
architecture, they might have designed these instructions
differently.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 02:18:27 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 00:37:39 +0000, John Levine wrote:

To make this work S/370 and its successors first do a trial execution of
the instruction without storing anything to see if it causes a page
fault. If not, it then redoes the instruction for real, storing the
result. I suspect that if they had known how soon S/370 would add
paging to the 360 architecture, they might have designed these
instructions differently.

When I first read that, I thought that you meant they would have designed
it differently when they designed the 370, but, of course, the
instructions already existed. After I realized my mistake, of course, I
also knew that back in 1964 or before, there was really no way that they
could possibly have known that.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 02:33:51 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

So instead I decided to only support double precision, and use the
extra bits to allow additional ways to specify registers.

My 66000 started out that way and the compiler showed that this choice
sucks.

The good news is that this only concerns the 16-bit short instructions. A compiler can choose to ignore them if it can't handle them.

Currently, the 16-bit instructions provide the following:

All the basic operate instructions for two integer types; they can only operate on the first eight integer registers.

The basic floating operate instructions for one floating-point type; the register specification is the one used with Concertina II's paired 15-bit operate instructions; choose one of four banks of eight registers, and
both operands must be in that bank.

The idea is that it can be used for efficient pipelined code where four sequences of instructions which are independent are interleaved.

Everything else is straightforward; the 24-bit short instructions and all
the 32-bit and longer instructions that operate on registers allow the use
of all 32 registers in a bank.

Of course, though, the other restrictions are still present - seven
choices for an index register, seven choices for a base register (for each
of three displacement sizes, 20, 16, and 12 bits).

I think I have indeed achieved the goal which, when I started out, I
thought might prove to be an "impossible dream" - combining what a CISC instruction set offers with what a RISC instruction set offers, and yet
doing so without making the instructions longer than they usually are in
those instruction types.

Except for register-to-register operate instructions being 24 bits instead
of 16 bits, this has been achieved - but for a very limited subset of the possible register-to-register operate instructions, chosen by me as the
ones I think are the most useful and popular - and I realize the choice is subjective and hence potentially controversial - the 16-bit instruction
length is retained!

I think it's an ISA that, in this respect, has achieved more than anyone
could have expected!

Now, of course, whether or not this is an achievement that anyone cares
about, that anyone wants, that anyone is interested in... well, I don't
know.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 06:12:11 2026

From Newsgroup: comp.arch

I had a tiny bit of unused opcode space within the 32-bit operate instructions.

As well, there were a couple of lengths of instructions longer than 32
bits which were allocated more opcode space than they actually needed.

That let me move those two lengths of instructions, plus one other length
of instructions longer than 32 bits which kept its entire, though small, allocation of opcode space, into that unused space.

And that let me increase the opcode space allocated to 16-bit short instructions from 1/16th of the opcode space to 3/32nds of the opcode
space.

Which allowed me to give them a much simpler and plainer format, of which
it finally could be argued - without the claim being utterly laughable -
that they offer just about what 16-bit short instructions do in a CISC architecture.

So now the 16-bit short instructions have all 96 basic operate opcodes, so they can perform all the basic operations on all the basic integer and floating-point types.

They are in all cases now limited to just the first eight registers. So
this is inferior to the System/360, which has sixteen, but it matches the 680x0 which had eight.
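
As a rough sanity check on those numbers, here is a small sketch. It rests on two assumptions that are not spelled out above: that "opcode space" is counted over all 2^16 values of an instruction's first 16 bits, and that each 16-bit short instruction carries two 3-bit register fields.

```python
from fractions import Fraction

TOTAL = 1 << 16                      # all 16-bit patterns (assumed basis for the fractions)

old_share = Fraction(1, 16) * TOTAL  # encodings available before the change
new_share = Fraction(3, 32) * TOTAL  # encodings available after the change

OPCODES = 96                         # basic operate opcodes, as stated in the post
REG_BITS = 3                         # first eight registers only (two operands assumed)
needed = OPCODES * (1 << (2 * REG_BITS))

print(old_share, new_share, needed)
```

Under those assumptions, 96 opcodes times 64 register pairs need 6144 encodings, which is exactly 3/32 of 65536, while the earlier 1/16 share offered only 4096.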

Finally, I have achieved my dream, insane and useless though it may be!

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 21 06:29:29 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

When I first read that, I thought that you meant they would have designed
it differently when they designed the 370, but, of course, the
instructions already existed. After I realized my mistake, of course, I
also knew that back in 1964 or before, there was really no way that they
could possibly have known that.

The Atlas existed in 1962 and did have paging. So it was possible.
Is it excusable that the S/360 designers did not consider this
development at the time? Probably, although according to <https://en.wikipedia.org/wiki/Atlas_(computer)> "it was a 1959
description of Muse [the 1959 name for Atlas] that gave CDC ideas that significantly accelerated the development of the 6600 and allowed it
to be delivered earlier than originally estimated".

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 07:03:47 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 00:06:54 +0000, quadi wrote:

I will have to review this point, however, to be sure.

Although I have not yet completed that review, it has become apparent
that, since I want the compare instruction to produce a correct result for signed numbers even if one is comparing, say, a positive number and a
negative number which are both over half of the maximum possible magnitude
for their format... it will be necessary to have a special compare
instruction for unsigned integers.

Since there is opcode space for that readily available, though, there is
no difficulty in adding that.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 21 10:29:12 2026

From Newsgroup: comp.arch

Terje Mathisen <terje.mathisen@tmsw.no> writes:

Anton Ertl wrote:

But 64 bits are not enough for all binary numbers and probably not for
all BCD numbers, either: the decimal FP people were not satisfied with
the 15-digit mantissa that are easily possible with their
representations in 64 bits; they did not even define a decimal64
format last I checked. So will 16-digit BCD numbers be satisfactory?

IEEE 754 does define decimal64, decimal128 and even decimal32, but the
first two have pretty much all the actual usage, probably (?) decimal128
as the majority, at least for all accumulators.

I should check half-known things before I make claims in a posting.

Anyway, looking at <https://en.wikipedia.org/wiki/Decimal64_floating-point_format>, I see
that Decimal64 even has 16 digits of mantissa. So 15 digits is not
enough. (And, as an aside, they complicated things by not specifying
a 54-bit mantissa, but combining the exponent with the upper bits of
the mantissa).

To the point: these 16 digits are not enough, as the lack of
popularity of decimal64 (even relative to decimal128) shows, so 64-bit
BCD numbers are not enough in all cases, either.

Reality check: Modern architectures tend to have byte-swap and shuffle
instructions. They tend not to have BCD-to-ASCII instructions, but
these can be implemented easily enough with the help of shuffle and
bitwise instructions. And given that you need to use shuffle anyway,
the byte-swapping does not cost extra.

BCD-to-ASCII, with the input in an AVX 32-byte register, so up to 64
digits, would start with an exchange of the high and low 16-byte halves,
then a permute of each half to reverse the order. The final single-cycle
operation is the only overhead of the little vs high-endian inputs.

Next we duplicate the input by unpacking the high and low 16 bytes into
each byte value into 16 16-bit shorts, with the leading byte 0, then (in
parallel) you copy and mask the low nybble while shifting all shorts up
by 4 bits, then use the same all-15 mask to save the high nybbles.
OR these two back together, and do the same for the other half of the
original input. About 15-20 cycles in total with well under 10% being
the byte order swap.

My thinking was along the lines of using VPERMB to do the
byte-swapping, the duplicating, and the unpacking in one step. E.g.,
if you have a 64-bit BCD number 1234567890123456 as the following
sequence of bytes

56 34 12 90 78 56 34 12

Then you have the index vector

7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0

and VPERMB xmm1, xmm2, xmm3

(where the BCD number is in xmm3 and the index vector is in xmm2) will
put the following in xmm1:

12 12 34 34 56 56 78 78 90 90 12 12 34 34 56 56

So no extra instruction for the byte swapping.

The problem is that I now would like a masked parallel byte shift to
shift the even-indexed bytes right by 4 bits, but I don't find
parallel byte shifts. I guess the answer is to let the VPERMB arrange
the result as follows

1234 1234 5678 5678 9012 9012 3456 3456
^^^^ ^^^^ ^^^^ ^^^^

then use a masked VPSRLW for shifting the marked 16-bit pieces to the
right by 4 bits, resulting in

0123 1234 0567 5678 0901 9012 0345 3456

Now use VPSHUFB or VPERMB to rearrange the bytes in the intended order:

01 12 23 34 45 56 67 78 89 90 01 12 23 34 45 56

Now mask away the top 4 bits of each byte with VPAND and turn it into
ASCII by VPORing every byte with 0x30.

And the whole thing can be done with BCD numbers of up to 64 digits
per pass.

The absence of VPSRLB caused an additional instruction, but that's
also necessary for dealing with big-endian BCD numbers. So storing
the BCD numbers in little-endian format costs no additional
instruction.

VPERMB is not in AVX2, so if you want to limit yourself to that,
little-endian needs an extra instruction indeed.
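
For readers who want to check the data flow, the steps above can be modelled in scalar Python. This is only a sketch: the function name is made up for illustration, no SIMD is involved, and a per-byte shift stands in for the word shift plus reshuffle that the missing VPSRLB forces on real hardware.

```python
def bcd_le_to_ascii(bcd: bytes) -> str:
    """Convert little-endian packed BCD to an ASCII digit string (scalar model)."""
    # Permute step: reverse the byte order and duplicate every byte,
    # which is the job given to VPERMB in the post.
    dup = [b for b in reversed(bcd) for _ in (0, 1)]
    # Shift step: move the high digit of each first copy into the low nibble.
    shifted = [b >> 4 if i % 2 == 0 else b for i, b in enumerate(dup)]
    # Mask step (VPAND with 0x0F) and ASCII step (VPOR with 0x30).
    return bytes((b & 0x0F) | 0x30 for b in shifted).decode("ascii")


# The example number from the post, stored least significant byte first.
digits = bcd_le_to_ascii(bytes.fromhex("5634129078563412"))
print(digits)
```

The byte reversal and the duplication happen in the same list comprehension, which mirrors the point that a single permute covers both.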

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 21 11:52:51 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

On Wed, 20 May 2026 15:42:03 +0000, Anton Ertl wrote:

quadi <quadibloc@ca.invalid> writes:
Reality check: Modern architectures tend to have byte-swap and shuffle
instructions. They tend not to have BCD-to-ASCII instructions, but
these can be implemented easily enough with the help of shuffle and
bitwise instructions. And given that you need to use shuffle anyway,
the byte-swapping does not cost extra.

An additional instruction is an additional instruction!

There is no additional instruction. VPERMB does the byte swapping and
byte duplication at the same time, see <2026May21.122912@mips.complang.tuwien.ac.at>.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu May 21 12:04:40 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Terje Mathisen <terje.mathisen@tmsw.no> writes:

BCD-to-ASCII, with the input in an AVX 32-byte register, so up to 64
digits, would start with an exchange of the high and low 16-byte halves,
then a permute of each half to reverse the order. The final single-cycle
operation is the only overhead of the little vs high-endian inputs.

Next we duplicate the input by unpacking the high and low 16 bytes into
each byte value into 16 16-bit shorts, with the leading byte 0, then (in
parallel) you copy and mask the low nybble while shifting all shorts up
by 4 bits, then use the same all-15 mask to save the high nybbles.
OR these two back together, and do the same for the other half of the
original input. About 15-20 cycles in total with well under 10% being
the byte order swap.

My thinking was along the lines of using VPERMB to do the
byte-swapping, the duplicating, and the unpacking in one step. E.g.,
if you have a 64-bit BCD number 1234567890123456 as the following
sequence of bytes

56 34 12 90 78 56 34 12

Then you have the index vector

7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0

and VPERMB xmm1, xmm2, xmm3

(where the BCD number is in xmm3 and the index vector is in xmm2) will
put the following in xmm1:

12 12 34 34 56 56 78 78 90 90 12 12 34 34 56 56

So no extra instruction for the byte swapping.

The problem is that I now would like a masked parallel byte shift to
shift the even-indexed bytes right by 4 bits, but I don't find
parallel byte shifts. I guess the answer is to let the VPERMB arrange
the result as follows

1234 1234 5678 5678 9012 9012 3456 3456
^^^^ ^^^^ ^^^^ ^^^^

then use a masked VPSRLW for shifting the marked 16-bit pieces to the
right by 4 bits, resulting in

0123 1234 0567 5678 0901 9012 0345 3456

Now use VPSHUFB or VPERMB to rearrange the bytes in the intended order:

01 12 23 34 45 56 67 78 89 90 01 12 23 34 45 56

I have a better approach:

First do the shifting with, e.g. VPSRLW, with the result in a new
register. So you now have

56 34 12 90 78 56 34 12 #original data
0563 0129 0785 0341 #shifted version

Now you use VPERMT2B to rearrange the bytes from both registers into a
third one, doing the byte-swapping while you are at it, resulting in:

41 12 03 34 85 56 07 78 29 90 01 12 63 34 05 56

The remainder uses VPAND and VPOR, as described earlier.

If you have BCD numbers with more than 64, but at most 128 digits, the
first step would only have to be performed once. You would then use
two VPERMI2B instructions with different index inputs to produce the
64 least significant and the 64 most significant digits, and the VPAND
and VPOR would also have to be duplicated.

So 4 central instructions for a BCD number with up to 64 digits, and 7
for up to 128 digits. In addition, you need the VPERMT2B index, the
VPSRLW shift amounts and the other operand for VPAND and VPOR in
registers, but if you are converting a lot of BCD numbers, you may
already have them in registers when you convert the next BCD number.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 13:13:23 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 06:12:11 +0000, quadi wrote:

Finally, I have achieved my dream, insane and useless though it may be!

Someone once suggested that, if a genie grants you three wishes, you
should use one of them to wish for more wishes.

Well, I have taken the opportunity to squeeze one more little thing into
the instruction set that Concertina III had, but this time I could not
squeeze quite as many of them in... 16-bit prefixes for instructions,
which allow the instruction set to be extended.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 13:22:55 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 13:13:23 +0000, quadi wrote:

Well, I have taken the opportunity to squeeze one more little thing into
the instruction set that Concertina III had, but this time I could not squeeze quite as many of them in... 16-bit prefixes for instructions,
which allow the instruction set to be extended.

I've taken the opportunity now, before things go on, to modify this
addition in one important way: I've precluded the possibility that the complexity of instruction length encoding might grow without bounds by specifying the length scheme now for any prefixed instructions that might
be added.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 13:42:09 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 07:03:47 +0000, quadi wrote:

it will be necessary to have a special
compare instruction for unsigned integers.

I have now back-propagated this needful change to Concertina II. The description of Concertina III hadn't gotten to the point where this would
be placed.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu May 21 14:36:13 2026

From Newsgroup: comp.arch

Terje Mathisen <terje.mathisen@tmsw.no> writes:

Scott Lurndal wrote:

<snip>

overflow and the algorithm will simply process each
digit of the addend+augend sequentially from higher
magnitude to lower magnitude. It delays writing each
digit of the sum (other than the last) until it knows
the following digit doesn't overflow. If it does
overflow, it increments the delayed value before
writing. To the extent that there are multiple sequential
9s in the sum, when the next digit would overflow, the
processor uses the 9's counter and the saved digit to
store the correct digits to the receiving field.

There's a flow chart in 1025475_B2500_B3500_RefMan_Oct69.pdf
which is available on bitsavers.

So it did process them top-down, but delayed writing anything to the
output field until it was known that it would not overflow, and the same
happened for every subsequent partial sum of 9.

Yeah, that works but it probably caused some output hiccups when a long
chain of potential carries finally resolved. :-)

The maximum size of an operand was 100 digits.

To add to the potential for a long hiccup, each of the operands
could be indirect, which in turn could point to indirect
operands ad infinitum. A processor timer was started with
each instruction, and if it expired before the instruction
finished, the processor would raise a fault and the application
would be terminated.

There were also search table and linked list instructions, which had
variable timing depending on the number of entries in the
list or table (the instruction timer would handle infinite
loops in the list).
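
The delayed-write scheme quoted at the top of this post can be sketched in Python as follows. This is an illustration of the idea only, not the B2500/B3500 logic or the flow chart in the manual; the function name, the equal-length digit lists, and the way overflow is reported are all assumptions made for the example.

```python
def add_msd_first(a, b):
    """Add two equal-length decimal digit lists, most significant digit first.

    Returns (sum_digits, overflow). A digit is held back until the digits
    after it show whether a carry will reach it; runs of 9s are only counted.
    """
    out = []
    pending = None   # the one digit whose final value is not yet known
    nines = 0        # number of 9s seen since the pending digit
    overflow = False

    def flush(carry):
        nonlocal pending, nines, overflow
        if pending is None:
            overflow = overflow or bool(carry)   # carry out of the top digit
        else:
            out.append(pending + carry)          # pending is at most 8
        out.extend([0 if carry else 9] * nines)  # a carry turns every 9 into 0
        pending, nines = None, 0

    for x, y in zip(a, b):
        s = x + y
        if s == 9:
            nines += 1          # outcome depends on lower digits, keep waiting
        elif s < 9:
            flush(0)            # no carry can pass through this digit
            pending = s
        else:
            flush(1)            # a carry is certain, resolve what was waiting
            pending = s - 10
    flush(0)                    # nothing below the last digit can carry in
    return out, overflow


result = add_msd_first([1, 9, 9, 5], [0, 0, 0, 7])   # 1995 + 0007
print(result)
```

The longer the run of 9s, the more digits get written in one burst when the carry question is finally settled, which is the hiccup discussed above.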

--- Synchronet 3.22a-Linux NewsLink 1.2

From John Levine@johnl@taugh.com to comp.arch on Thu May 21 15:41:17 2026

From Newsgroup: comp.arch

According to quadi <quadibloc@ca.invalid>:

On Thu, 21 May 2026 00:37:39 +0000, John Levine wrote:

result. I suspect that if they had known how soon S/370 would add
paging to the 360 architecture, they might have designed these
instructions differently.

When I first read that, I thought that you meant they would have designed
it differently when they designed the 370, but, of course, the
instructions already existed. After I realized my mistake, of course, I
also knew that back in 1964 or before, there was really no way that they
could possibly have known that.

According to Pugh et al., IBM Research was quite aware of Atlas and
was doing its own work on one-level store and time sharing. They were
also close to CTSS at MIT Project MAC. Atlas' performance was terrible
(later solved partly by better paging schemes but mostly by larger
real memory) and I get the impression that there was an internal
institutional bias that only batch was real computing and time sharing
was somewhere between a niche and a fad.

The MIT people were deeply disappointed when S/360 had no memory
mapping at all, which led to Multics switching from IBM to GE for its
new computer. IBM then came out with the 360/67 which had quite decent
virtual memory but it was too late. It didn't help that its intended
main operating system was TSS which was overambitious and didn't work.
Lucky for them CP/67 escaped from the lab to become VM/370.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 21 18:26:32 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

So instead I decided to only support double precision, and use the
extra bits to allow additional ways to specify registers.

My 66000 started out that way and the compiler showed that this choice sucks.

The good news is that this only concerns the 16-bit short instructions. A compiler can choose to ignore them if it can't handle them.

Currently, the 16-bit instructions provide the following:

All the basic operate instructions for two integer types; they can only operate on the first eight integer registers.

I suspect you (and compiler) will end up not liking the restriction.

The basic floating operate instructions for one floating-point type; the register specification is the one used with Concertina II's paired 15-bit operate instructions; choose one of four banks of eight registers, and
both operands must be in that bank.

I suspect you (and compiler) will end up not liking the restriction.

The idea is that it can be used for efficient pipelined code where four sequences of instructions which are independent are interleaved.

I suspect you (and compiler) will end up not finding that much parallelism.

Everything else is straightforwards; the 24-bit short instructions and all the 32-bit and longer instructions that operate on registers allow the use of all 32 registers in a bank.

Of course, though, the other restrictions are still present - seven
choices for an index register, seven choices for a base register (for each of three displacement sizes, 20, 16, and 12 bits).

I think I have indeed achieved the goal which, when I started out, I
thought might prove to be an "impossible dream" - combining what a CISC instruction set offers with what a RISC instruction set offers, and yet doing so without making the instructions longer than they usually are in those instruction types.

Except for register-to-register operate instructions being 24 bits instead of 16 bits, this has been achieved - but for a very limited subset of the possible register-to-register operate instructions, chosen by me as the
ones I think are the most useful and popular - and I realize the choice is subjective and hence potentially controversial - the 16-bit instruction length is retained!

I think it's an ISA that, in this respect, has achieved more than anyone could have expected!

Now, of course, whether or not this is an achievement that anyone cares about, that anyone wants, that anyone is interested in... well, I don't know.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 21 18:32:48 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Thu, 21 May 2026 00:06:54 +0000, quadi wrote:

I will have to review this point, however, to be sure.

Although I have not yet completed that review, it has become apparent
that, since I want the compare instruction to produce a correct result for signed numbers even if one is comparing, say, a positive number and a negative number which are both over half of the maximum possible magnitude for their format... it will be necessary to have a special compare instruction for unsigned integers.

Or a wider condition register !

Since there is opcode space for that readily available, though, there is
no difficulty in adding that.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 22:14:51 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 07:03:47 +0000, quadi wrote:

Although I have not yet completed that review, it has become apparent
that, since I want the compare instruction to produce a correct result
for signed numbers even if one is comparing, say, a positive number and
a negative number which are both over half of the maximum possible
magnitude for their format... it will be necessary to have a special
compare instruction for unsigned integers.

I have now given the matter thought, and I found that it would indeed be necessary to add an extra bit to all the conditional jump, branch, or set
flag instructions to indicate the test was being applied to the condition
code settings left after an integer arithmetic instruction on integers
deemed to be unsigned.

Amazingly enough, however, it turned out that in each case there was no difficulty in finding the additional opcode space that was needed.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 22:21:53 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 18:32:48 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

it will be necessary to have a special
compare instruction for unsigned integers.

Or a wider condition register !

A wider condition register isn't enough by itself.

I have now realized that I will have to add a bit to the conditional
branch instructions. Amazingly, though, that bit was readily available
without much trouble.

In the case of conditional branches after integer arithmetic, a wider condition register might be needed, although it seems that carry,
overflow, negative, and zero will suffice.

The compare instruction in my ISA _does not_ return the same condition
codes as the subtract instruction. So if I compare bytes, the compare
instruction will correctly indicate that -100 is less than 100. The fact
that if you subtracted -100 from 100 as byte values, you wouldn't get 200,
since that doesn't fit into a signed byte, but the negative value -56, is
neither here nor there.

Because of this special handling of the MSB, I do need a different compare instruction - not just the modified branch instructions for unsigned
values - to yield correct behavior.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 23:44:34 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 18:26:32 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

Currently, the 16-bit instructions provide the following:

All the basic operate instructions for two integer types; they can only
operate on the first eight integer registers.

I suspect you (and compiler) will end up not liking the restriction.

I don't like the restriction, but since there's not much opcode space available, there's not much I can do.

The basic floating operate instructions for one floating-point type;
the register specification is the one used with Concertina II's paired
15-bit operate instructions; choose one of four banks of eight
registers, and both operands must be in that bank.

I suspect you (and compiler) will end up not liking the restriction.

The compiler will, indeed, probably have difficulty dealing with a kind of restriction that no one else has ever put in an ISA.

But this is moot now. I've found some additional opcode space for 16-bit
short instructions. Not much, just enough to increase the available opcode space by a factor of 1.5.

So now all the operations are restricted to only the first eight registers
- but 16-bit short instructions now support all the basic data types.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Thu May 21 23:46:05 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 22:14:51 +0000, quadi wrote:

Amazingly enough, however, it turned out that in each case there was no difficulty in finding the additional opcode space that was needed.

I even managed to find enough opcode space to increase the size of the displacement field from 8 bits to 9 bits in all the branch instructions,
so that having 24-bit short instructions doesn't shorten their range.
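
One way to read that claim, as a sketch under assumptions that the post does not state: before, branch targets were 16-bit aligned, so an 8-bit signed displacement counted 2-byte units; now instructions may start on any byte, so the displacement counts single bytes and needs one more bit to cover the same distance. The helper name is hypothetical.

```python
def reach_bytes(disp_bits, unit_bytes):
    """Byte range covered by a signed displacement of disp_bits bits,
    counted in units of unit_bytes (assumed two's complement)."""
    lowest = -(1 << (disp_bits - 1)) * unit_bytes
    highest = ((1 << (disp_bits - 1)) - 1) * unit_bytes
    return lowest, highest


old_range = reach_bytes(8, 2)   # 8 bits, halfword units (assumed old scheme)
new_range = reach_bytes(9, 1)   # 9 bits, byte units (assumed new scheme)
print(old_range, new_range)
```

Under these assumptions both schemes reach 256 bytes backwards and roughly the same distance forwards, so the extra bit simply pays for byte granularity.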

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Fri May 22 02:20:14 2026

From Newsgroup: comp.arch

On Thu, 21 May 2026 23:46:05 +0000, quadi wrote:

I even managed to find enough opcode space to increase the size of the displacement field from 8 bits to 9 bits in all the branch instructions,
so that having 24-bit short instructions doesn't shorten their range.

However, there were a number of serious mistakes on the page, which I have
now corrected.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri May 22 07:22:05 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

Although I have not yet completed that review, it has become apparent
that, since I want the compare instruction to produce a correct result for
signed numbers even if one is comparing, say, a positive number and a
negative number which are both over half of the maximum possible magnitude
for their format... it will be necessary to have a special compare
instruction for unsigned integers.

The fact that IA-32/AMD64 and ARM A64 do not have a special compare
instruction for unsigned integers (and manage to do with NCVZ) shows
that this is unnecessary. What you do for your "if (-100<100)" case
is encode it (on AMD64) as

cmpb %r8, %r9 #note that AT&T syntax has the arguments reversed
jnl target
... code to execute if r9<r8 ...
target:

And JNL (jump if not less) tests for N=V (the Intel manual writes SF=OF).

See <https://www.felixcloutier.com/x86/jcc>
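
To make the flag logic concrete, here is a small Python model of an 8-bit compare. It is a sketch with made-up function names; the carry flag follows the x86 borrow convention (ARM defines C the other way round for subtraction), and only the two "less than" conditions are shown.

```python
def cmp8(a, b):
    """Return (N, Z, C, V) after an 8-bit a - b, x86 style.

    Operands may be given as signed or unsigned Python ints; only the
    low 8 bits are used.
    """
    a &= 0xFF
    b &= 0xFF
    r = (a - b) & 0xFF
    n = r >> 7                       # sign bit of the wrapped result
    z = int(r == 0)
    c = int(a < b)                   # borrow, as x86 sets CF for CMP/SUB
    v = ((a ^ b) & (a ^ r)) >> 7     # signed overflow of the subtraction
    return n, z, c, v


def signed_less(a, b):
    n, _, _, v = cmp8(a, b)
    return n != v                    # the JL condition; JNL is its negation (N == V)


def unsigned_below(a, b):
    _, _, c, _ = cmp8(a, b)
    return c == 1                    # the JB condition


print(cmp8(-100, 100), signed_less(-100, 100), unsigned_below(-100, 100))
```

For -100 versus 100 the wrapped difference is positive (N=0) but the overflow flag is set, so N != V still reports "less" for the signed reading, while the same flags report "not below" for the unsigned reading of the same bit patterns (156 versus 100).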

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri May 22 07:35:36 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

The compare instruction in my ISA _does not_ return the same condition
codes as the subtract instruction. So if I compare bytes, the compare
instruction will correctly indicate that -100 is less than 100. The fact
that if you subtracted -100 from 100 as byte values, you wouldn't get 200,
since that doesn't fit into a signed byte, but the negative value -56, is
neither here nor there.

8086, IA-32, AMD64, and AFAIK ARM A64 produce the same condition codes
for compare and subtract instructions. That the subtract instruction
writes back the result does not influence the condition codes. The
fact that you see an overflow/underflow if you byte-subtract/compare
100 with -100 and want to interpret the result as a signed byte is
reflected in the overflow flag for both subtract and compare, and the conditional jumps for signed <, <=, >, >= take the overflow flag into
account (as well as the sign flag, and, in some cases, the zero flag).

Because of this special handling of the MSB, I do need a different compare
instruction - not just the modified branch instructions for unsigned
values - to yield correct behavior.

You only need that if your flags are insufficiently expressive (i.e.,
less powerful than NCZV).

An interesting case is PowerPC (and Power). It stores < = > flags
(for comparisons, for other instructions it's <0, =0, and >0) and a
sticky overflow flag in one of the CRs (for many instructions, CR0,
for comparison instructions, the CR can be selected). It has an
overflow flag and a carry flag elsewhere, so it could use the
subtraction instruction together with these flags for both signed and
unsigned conditional branches, but instead it has unsigned and signed comparisons, and the conditional branches are only conditional on
flags in a CR register.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Fri May 22 15:48:18 2026

From Newsgroup: comp.arch

On Fri, 22 May 2026 07:35:36 +0000, Anton Ertl wrote:

You only need that if your flags are insufficiently expressive (i.e.,
less powerful than NCZV).

While the System/360 had only two condition code bits, I do plan to have
full VZNC bits. However, unlike the System/360, I do not have a complete
set of sixteen conditional branch instructions. I just have twelve: eight instructions for testing between negative, zero, and positive nonzero in
any combination, and instructions for separately testing for carry and overflow.

However, if I have enough opcode space to add a U bit to all the
conditional branch instructions, then I also have enough opcode space to
fix that instead, so I likely will rework this part of the ISA into
something more conventional.

I want a compare instruction which, for integers, isn't fooled by
overflows - and overflows happen at a different point in the two's
complement number circle for signed and unsigned; for unsigned, basically carry takes the role of overflow. And I don't want to have to do two instructions for the conditional branch afterwards to handle that. So I
_may_ still need a separate compare unsigned, even though the rest of your points are well taken.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Fri May 22 21:22:48 2026

From Newsgroup: comp.arch

On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

However, if I have enough opcode space to add a U bit to all the
conditional branch instructions, then I also have enough opcode space to
fix that instead, so I likely will rework this part of the ISA into
something more conventional.

I have made the first set of changes, using five-bit condition code fields
to nicely and fully handle both the signed and unsigned cases; I checked
what the Motorola 68000 did, and found that it only provided a complete
set of tests for signed values, but only two tests for unsigned ones.
(Worse yet, it used separate condition codes for floating-point numbers,
which makes sense, given that they were originally in a coprocessor, but
that means an extra set of instructions is needed.)

So, while it used a four-bit condition code field, I needed a five-bit one.

I did notice it didn't just always fail the signed tests if overflow was present; instead, in that case it switched plus and minus. Given that, and treating carry the same way for unsigned tests, you likely are right that
an unsigned compare is not needed. Oh, wait; my assumed behavior that everything should just fail if there's an overflow... is reasonable for floating-point numbers.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 23 08:36:49 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

On Fri, 22 May 2026 07:35:36 +0000, Anton Ertl wrote:

You only need that if your flags are insufficiently expressive (i.e.,
less powerful than NCZV).

While the System/360 had only two condition code bits, I do plan to have
full VZNC bits. However, unlike the System/360,

The S/360 is a mess as far as dealing with conditions is concerned.
Or is there a great underlying principle involved, and I fail to see
it? I doubt it, for the following reasons: 1) I have not come across
any description that explained the underlying principle, and in fact I
have come across few descriptions at all. 2) In the 62 years that
S/360 has been available, it has not found any successors in its
particular approach to conditions.

So my recommendation is that you look at other architectures for
inspiration. 8086, 88000, MIPS/Alpha/RISC-V (including the
differences between them), and IA-64 all have quite different
approaches that are worthy of study. And if you want to look for
something unproven, look at <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.

I do not have a complete

set of sixteen conditional branch instructions. I just have twelve: eight instructions for testing between negative, zero, and positive nonzero in
any combination, and instructions for separately testing for carry and overflow.

...

I want a compare instruction which, for integers, isn't fooled by
overflows - and overflows happen at a different point in the two's complement number circle for signed and unsigned; for unsigned, basically carry takes the role of overflow. And I don't want to have to do two instructions for the conditional branch afterwards to handle that.

What's this thing about "two instructions for the conditional branch afterwards"? On the 8086, if you want to branch on signed <, you use
JL, and if you want to branch on unsigned <, you use JB; each of them
is one instruction (and the 8086 has IIRC signed and unsigned <= > >=,
too).
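
To make the JL/JB point concrete, here is a minimal Python sketch of a 16-bit compare that sets N, Z, C and V the way the 8086 does for CMP (C is the borrow). It is not from any poster; the names cmp_flags, jl and jb are made up for illustration. Each relation is then a single test on the resulting flags:

```python
# Illustrative sketch only: flag computation for a WIDTH-bit compare.
WIDTH = 16
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)

def cmp_flags(a, b):
    """Return the flags for a - b; operands are WIDTH-bit unsigned patterns."""
    a &= MASK
    b &= MASK
    diff = (a - b) & MASK
    return {
        "Z": diff == 0,
        "N": bool(diff & SIGN),
        "C": a < b,                              # borrow out (x86 convention)
        "V": bool((a ^ b) & (a ^ diff) & SIGN),  # signed overflow of a - b
    }

def jl(flags):
    """Signed less-than: taken when N differs from V."""
    return flags["N"] != flags["V"]

def jb(flags):
    """Unsigned less-than: taken when the subtraction borrowed."""
    return flags["C"]
```

For example, comparing 0xFFFF with 1 gives jl true (-1 < 1 as signed) and jb false (65535 is not below 1 as unsigned), from one compare and one branch test each.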

If you mean the opcode space, then yes, you may use less opcode space
if you have a signed and unsigned comparison, and fewer conditional
branches (depending on how much proportion of your opcode space the
respective instructions take). You can also save opcode space by
leaving away the <= and > conditions (reverse the operands of < and
>=). One question in such a design is if there are cases where you
want to have the unsigned and signed conditions for the same operands,
but it's probably rare enough that it is not a big disadvantage that
you need to use both comparison instructions for those cases (at least
I have never seen a complaint about this aspect of PowerPC).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 23 09:28:45 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

However, if I have enough opcode space to add a U bit to all the
conditional branch instructions, then I also have enough opcode space to
fix that instead, so I likely will rework this part of the ISA into
something more conventional.

I have made the first set of changes, using five-bit condition code fields to nicely and fully handle both the signed and unsigned cases; I checked what the Motorola 68000 did, and found that it only provided a complete
set of tests for signed values, but only two tests for unsigned ones.

I see four tests for unsigned conditions on the 68000 <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

HI >
LS <=
CC >=
CS <

For the signed ones there is

GT >
LE <=
GE >=
LT <

my assumed behavior that
everything should just fail if there's an overflow... is reasonable for floating-point numbers.

The usual setup is that FP operations silently overflow to +INF and
underflow to -INF. They do set sticky flags (called "exceptions" in
the IEEE FP standard) on various conditions, including on overflows,
but also on rounding errors ("inexact").

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 16:19:35 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

However, if I have enough opcode space to add a U bit to all the conditional branch instructions, then I also have enough opcode space to fix that instead, so I likely will rework this part of the ISA into something more conventional.

I have made the first set of changes, using five-bit condition code fields to nicely and fully handle both the signed and unsigned cases; I checked what the Motorola 68000 did, and found that it only provided a complete
set of tests for signed values, but only two tests for unsigned ones.
(Worse yet, it used separate condition codes for floating-point numbers, which makes sense, given that they were originally in a coprocessor, but that means an extra set of instructions is needed.)

So, while it used a four-bit condition code field, I needed a five-bit one.

x86 uses COZAP but this includes P=parity, which it is unlikely you do.
Thus, 4 bits are sufficient to define 16-states, of which you only need 10-states signless{EQ, NEQ}, signed{>=, >, <, <=}, unsigned{>=, >, <, <=}.

I did notice it didn't just always fail the signed tests if overflow was present; instead, in that case it switched plus and minus. Given that, and treating carry the same way for unsigned tests, you likely are right that
an unsigned compare is not needed. Oh, wait; my assumed behavior that everything should just fail if there's an overflow... is reasonable for floating-point numbers.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat May 23 16:38:57 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

quadi <quadibloc@ca.invalid> posted:

On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

However, if I have enough opcode space to add a U bit to all the
conditional branch instructions, then I also have enough opcode space to fix that instead, so I likely will rework this part of the ISA into
something more conventional.

I have made the first set of changes, using five-bit condition code fields to nicely and fully handle both the signed and unsigned cases; I checked
what the Motorola 68000 did, and found that it only provided a complete
set of tests for signed values, but only two tests for unsigned ones.
(Worse yet, it used separate condition codes for floating-point numbers,
which makes sense, given that they were originally in a coprocessor, but
that means an extra set of instructions is needed.)

So, while it used a four-bit condition code field, I needed a five-bit one.

x86 uses COZAP but this includes P=parity, which it is unlikely you do.
Thus, 4 bits are sufficient to define 16-states, of which you only need 10-states signless{EQ, NEQ}, signed{>=, >, <, <=}, unsigned{>=, >, <, <=}.

ARM includes the Q flag (saturation).

--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat May 23 16:46:40 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

quadi <quadibloc@ca.invalid> writes:

On Fri, 22 May 2026 07:35:36 +0000, Anton Ertl wrote:

You only need that if your flags are insufficiently expressive (i.e.,
less powerful than NCZV).

While the System/360 had only two condition code bits, I do plan to have full VZNC bits. However, unlike the System/360,

The S/360 is a mess as far as dealing with conditions is concerned.
Or is there a great underlying principle involved, and I fail to see
it? I doubt it, for the following reasons: 1) I have not come across
any description that explained the underlying principle, and in fact I
have come across few descriptions at all. 2) In the 62 years that
S/360 has been available, it has not found any successors in its
particular approach to conditions.

The B3500 had three bits: Overflow, COM Low and COM High. The
V-Series added COM null, used by the search linked list (SLT)
instruction when the search key wasn't found.

Condition     Flags
------------  --------------
EQUAL         COML=1, COMH=1
Less Than     COML=1, COMH=0
Greater Than  COML=0, COMH=1
NULL          COML=0, COMH=0

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Sat May 23 17:01:10 2026

From Newsgroup: comp.arch

On Sat, 23 May 2026 09:28:45 +0000, Anton Ertl wrote:

quadi <quadibloc@ca.invalid> writes:

On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

I have made the first set of changes, using five-bit condition code
fields to nicely and fully handle both the signed and unsigned cases; I checked what the Motorola 68000 did, and found that it only provided a complete set of tests for signed values, but only two tests for unsigned ones.

I see four tests for unsigned conditions on the 68000 <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

HI >
LS <=
CC >=
CS <

For the signed ones there is

GT >
LE <=
GE >=
LT <

What I was going by was Table 3-19 on page 3-19 of the M68000 Family Programmer's Reference Manual on the Internet Archive from Bitsavers; it
gives the available condition code tests on the architecture as:

0000 True
0001 False
0010 High              not C and not Z
0011 Low or Same       C or Z
0100 Carry Clear
0101 Carry Set
0110 Not Equal         not Z
0111 Equal             Z
1000 Overflow Clear    not V
1001 Overflow Set      V
1010 Plus              not N
1011 Minus             N
1100 Greater or Equal  (N and V) or (not N and not V)
1101 Less Than         (N and not V) or (not N and V)
1110 Greater Than      (N and V and not Z) or (not N and not V and not Z)
1111 Less or Equal     Z or (N and not V) or (not N and V)

I took Low or Same as unsigned, and Plus, Minus, Greater or Equal, Less
Than, Greater Than, and Less or Equal as signed.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From Robert Finch@robfi680@gmail.com to comp.arch on Sat May 23 14:15:46 2026

From Newsgroup: comp.arch

On 2026-05-23 5:28 a.m., Anton Ertl wrote:

quadi <quadibloc@ca.invalid> writes:

On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

However, if I have enough opcode space to add a U bit to all the
conditional branch instructions, then I also have enough opcode space to fix that instead, so I likely will rework this part of the ISA into
something more conventional.

I have made the first set of changes, using five-bit condition code fields to nicely and fully handle both the signed and unsigned cases; I checked
what the Motorola 68000 did, and found that it only provided a complete
set of tests for signed values, but only two tests for unsigned ones.

I see four tests for unsigned conditions on the 68000 <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

HI >
LS <=
CC >=
CS <

CS may also be called LO
CC may also be called HS

For the signed ones there is

GT >
LE <=
GE >=
LT <

my assumed behavior that
everything should just fail if there's an overflow... is reasonable for
floating-point numbers.

The usual setup is that FP operations silently overflow to +INF and
underflow to -INF. They do set sticky flags (called "exceptions" in

Methinks overflow could be to +/- INF and underflow to zero or a denormal.

the IEEE FP standard) on various conditions, including on overflows,
but also on rounding errors ("inexact").

- anton

If one has CVNZ it is enough for both signed and unsigned integer
conditional testing using only four bits.

The CVNZ could be repurposed for float comparisons. V = INF. C=inexact
for instance.

--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 23 18:37:39 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> writes:

On Sat, 23 May 2026 09:28:45 +0000, Anton Ertl wrote:

I see four tests for unsigned conditions on the 68000
<https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

HI >
LS <=
CC >=
CS <

For the signed ones there is

GT >
LE <=
GE >=
LT <

What I was going by was Table 3-19 on page 3-19 of the M68000 Family Programmer's Reference Manual on the Internet Archive from Bitsavers; it gives the available condition code tests on the architecture as:

0000 True
0001 False
0010 High              not C and not Z
0011 Low or Same       C or Z
0100 Carry Clear
0101 Carry Set
0110 Not Equal         not Z
0111 Equal             Z
1000 Overflow Clear    not V
1001 Overflow Set      V
1010 Plus              not N
1011 Minus             N
1100 Greater or Equal  (N and V) or (not N and not V)
1101 Less Than         (N and not V) or (not N and V)
1110 Greater Than      (N and V and not Z) or (not N and not V and not Z)
1111 Less or Equal     Z or (N and not V) or (not N and V)

I took Low or Same as unsigned, and Plus, Minus, Greater or Equal, Less Than, Greater Than, and Less or Equal as signed.

Carry Clear (CC) is unsigned >=
Carry Set (CS) is unsigned <

after a CMP or SUB instruction.
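
A small Python sketch, under stated assumptions, of how the sixteen tests in the table above fall out of a four-bit field: the upper three bits pick one of eight base predicates and the low bit complements it. The function names m68k_cond and cmp_nzvc are hypothetical, and the compare helper assumes C is set to the borrow out of the subtraction. With that assumption, HI, LS, CC and CS together cover all four unsigned relations:

```python
def cmp_nzvc(a, b, width=8):
    """Flags (N, Z, V, C) for a - b on width-bit patterns; C is the borrow."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    diff = (a - b) & mask
    n = bool(diff & sign)
    z = diff == 0
    v = bool((a ^ b) & (a ^ diff) & sign)
    c = a < b
    return n, z, v, c

def m68k_cond(code, n, z, v, c):
    """Evaluate a 4-bit condition code (0b0000..0b1111) against N, Z, V, C."""
    base = [
        True,               # 000x: True / False
        not c and not z,    # 001x: High / Low or Same
        not c,              # 010x: Carry Clear / Carry Set
        not z,              # 011x: Not Equal / Equal
        not v,              # 100x: Overflow Clear / Overflow Set
        not n,              # 101x: Plus / Minus
        n == v,             # 110x: Greater or Equal / Less Than
        n == v and not z,   # 111x: Greater Than / Less or Equal
    ][code >> 1]
    # Odd codes are the logical complement of the even code before them.
    return base != bool(code & 1)
```

Here n == v is the same predicate as "(N and V) or (not N and not V)" in the table, written more compactly.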

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From John Levine@johnl@taugh.com to comp.arch on Sat May 23 19:33:46 2026

From Newsgroup: comp.arch

According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

The S/360 is a mess as far as dealing with conditions is concerned.
Or is there a great underlying principle involved, and I fail to see
it? I doubt it, for the following reasons: 1) I have not come across
any description that explained the underlying principle, and in fact I
have come across few descriptions at all. 2) In the 62 years that
S/360 has been available, it has not found any successors in its
particular approach to conditions.

I suspect the encoded condition bits in S/360 are a reflection of
the expensive memory era in which it was created. If they had
decoded condition codes, they'd have had to find more bits in
the PSW to store them, and it was already quite full.

I agree that nobody else did that, and in retrospect it was an overoptimization.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 20:01:07 2026

From Newsgroup: comp.arch

Robert Finch <robfi680@gmail.com> posted:

On 2026-05-23 5:28 a.m., Anton Ertl wrote:

quadi <quadibloc@ca.invalid> writes:

On Fri, 22 May 2026 15:48:18 +0000, quadi wrote:

However, if I have enough opcode space to add a U bit to all the
conditional branch instructions, then I also have enough opcode space to fix that instead, so I likely will rework this part of the ISA into
something more conventional.

I have made the first set of changes, using five-bit condition code fields to nicely and fully handle both the signed and unsigned cases; I checked what the Motorola 68000 did, and found that it only provided a complete
set of tests for signed values, but only two tests for unsigned ones.

I see four tests for unsigned conditions on the 68000 <https://en.wikibooks.org/wiki/68000_Assembly/Conditional_Tests>:

HI >
LS <=
CC >=
CS <

CS may also be called LO
CC may also be called HS

For the signed ones there is

GT >
LE <=
GE >=
LT <

my assumed behavior that
everything should just fail if there's an overflow... is reasonable for
floating-point numbers.

The usual setup is that FP operations silently overflow to +INF and underflow to -INF. They do set sticky flags (called "exceptions" in

Methinks overflow could be to +/- INF and underflow to zero or a denormal.

IEEE defines OVERFLOW as finite becomes signed infinite.
IEEE defines UNDERFLOW as finite becomes signed sub-finite*.
Sub-finite ={deNormal or zero}
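
The two definitions can be observed from Python, assuming its float is an IEEE binary64 with the default round-to-nearest-even mode (true on mainstream platforms, but an assumption nonetheless). This is only a sketch of the result values; the sticky exception flags themselves are not visible from plain Python floats.

```python
import math
import sys

big = sys.float_info.max     # largest finite double
tiny = sys.float_info.min    # smallest positive *normal* double

overflowed = big * 2.0       # finite operands, result becomes +infinity
neg_overflowed = -big * 2.0  # same, with the sign preserved
subnormal = tiny / 2.0       # underflow to a denormal: nonzero, below normal range
flushed = 5e-324 / 2.0       # half the smallest denormal rounds to zero

print(overflowed, neg_overflowed, subnormal, flushed)
```

So overflow lands on a signed infinity, while underflow lands on a denormal or on zero, never on -INF.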

the IEEE FP standard) on various conditions, including on overflows,
but also on rounding errors ("inexact").

- anton

If one has CVNZ it is enough for both signed and unsigned integer conditional testing using only four bits.

The CVNZ could be repurposed for float comparisons. V = INF. C=inexact
for instance.

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 20:03:34 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> posted:

According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

The S/360 is a mess as far as dealing with conditions is concerned.
Or is there a great underlying principle involved, and I fail to see
it? I doubt it, for the following reasons: 1) I have not come across
any description that explained the underlying principle, and in fact I
have come across few descriptions at all. 2) In the 62 years that
S/360 has been available, it has not found any successors in its
particular approach to conditions.

I suspect the encoded condition bits in S/360 are a reflection of
the expensive memory era in which it was created. If they had
decoded condition codes, they'd have had to find more bits in
the PSW to store them, and it was already quite full.

S/360 would have been better off as defining PSW as a PSQW (128-bits)
which would have alleviated several problems associated with running
out of PSW space.

I agree that nobody else did that, and in retrospect it was an overoptimization.

--- Synchronet 3.22a-Linux NewsLink 1.2

From John Levine@johnl@taugh.com to comp.arch on Sat May 23 20:09:54 2026

From Newsgroup: comp.arch

According to MitchAlsup <user5857@newsgrouper.org.invalid>:

I suspect the encoded condition bits in S/360 are a reflection of
the expensive memory era in which it was created. If they had
decoded condition codes, they'd have had to find more bits in
the PSW to store them, and it was already quite full.

S/360 would have been better off as defining PSW as a PSQW (128-bits)
which would have alleviated several problems associated with running
out of PSW space.

They'd also have been better off making the addresses 32 bits and not
putting junk in the high byte, which caused endless pain later, but
they were really really worried about making low end models with 8K
bytes usable.

Remember that the major reason for B+D addressing was that it let them
have 16 bit address fields in instructions while keeping 24 bit flat addressing.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 23 22:15:30 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> posted:

According to MitchAlsup <user5857@newsgrouper.org.invalid>:

I suspect the encoded condition bits in S/360 are a reflection of
the expensive memory era in which it was created. If they had
decoded condition codes, they'd have had to find more bits in
the PSW to store them, and it was already quite full.

S/360 would have been better off as defining PSW as a PSQW (128-bits)
which would have alleviated several problems associated with running
out of PSW space.

They'd also have been better off making the addresses 32 bits and not
putting junk in the high byte, which caused endless pain later, but
they were really really worried about making low end models with 8K
bytes usable.

Remember that the major reason for B+D addressing was that it let them
have 16 bit address fields in instructions while keeping 24 bit flat addressing.

B+X+D addressing only got 12-bits
B+D addressing was for RS and SS instructions

I think they thought they were saving on complexity and HW logic, but
I think the whole RS and SS could have used a "more regular format pattern"; and they (IBM) would have been better off long term.

But that was "Oh so long ago."
--- Synchronet 3.22a-Linux NewsLink 1.2

From John Levine@johnl@taugh.com to comp.arch on Sun May 24 01:43:29 2026

From Newsgroup: comp.arch

According to MitchAlsup <user5857@newsgrouper.org.invalid>:

Remember that the major reason for B+D addressing was that it let them
have 16 bit address fields in instructions while keeping 24 bit flat
addressing.

B+X+D addressing only got 12-bits
B+D addressing was for RS and SS instructions

four bits of B, 12 bits of D, 16 bit addresses
you're right that RX used another four bits.

I think they thought they were saving on complexity and HW logic, but

We don't have to guess. "Architecture of the IBM System/360" by Amdahl, Blaauw, and Brooks in the IBM Systems Journal in April 1964 described a lot
of the reasoning, and they wrote a whole book about it.

They had to make a lot of other design decisions like 6 vs 8 bit
bytes, ones- vs twos-complement, length fields vs word marks for
variable length data, stack vs registers, floating point format (they
blew that one).

They said that the combination of a full length base register and a
short displacement "gives consequent gains in instruction density. The base-register approach was adopted, and then augmented, for some
instructions, with a second level of indexing."
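
As a rough illustration of the scheme being discussed, here is a Python sketch of S/360-style address formation: a 4-bit base register field, an optional 4-bit index register field, and a 12-bit unsigned displacement, summed and truncated to 24 bits. The function name and the list-of-registers model are made up for the example; a register field of 0 is treated as contributing zero rather than the contents of register 0.

```python
ADDR_MASK = (1 << 24) - 1   # 24-bit flat address space

def effective_address(regs, b, d, x=0):
    """Form B+D (or B+X+D when x is nonzero) from 16 general registers."""
    if not 0 <= d < 4096:
        raise ValueError("displacement must fit in 12 unsigned bits")
    if not (0 <= b < 16 and 0 <= x < 16):
        raise ValueError("register fields are 4 bits")
    base = regs[b] if b != 0 else 0     # field value 0 means "no base"
    index = regs[x] if x != 0 else 0    # field value 0 means "no index"
    # Only the low 24 bits take part in addressing; the high byte is ignored.
    return (base + index + d) & ADDR_MASK

regs = [0] * 16
regs[12] = 0x00012000
regs[3] = 0x00000010
print(hex(effective_address(regs, 12, 0x345)))        # B+D
print(hex(effective_address(regs, 12, 0x345, x=3)))   # B+X+D
```

The sixteen bits spent per operand (4 for B, 12 for D) reach anywhere in the 24-bit space, at the cost of keeping a base register loaded within 4K of the target.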

In retrospect, B+X+D was probably a mistake since I believe that
double indexing is rarely used, and easy to do with an extra register
add. On the other hand, it's not obvious what a better use of the X
field would have been. I suppose they could have made instructions
three operand, e.g.

A Rx,Ry,B(D)

would add the memory operand to Ry and put it in Rx but it was a long
time until compilers could make good use of that.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 24 03:10:27 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> posted:

According to MitchAlsup <user5857@newsgrouper.org.invalid>:

Remember that the major reason for B+D addressing was that it let them
have 16 bit address fields in instructions while keeping 24 bit flat
addressing.

B+X+D addressing only got 12-bits
B+D addressing was for RS and SS instructions

four bits of B, 12 bits of D, 16 bit addresses
you're right that RX used another four bits.

I think they thought they were saving on complexity and HW logic, but

We don't have to guess. "Architecture of the IBM System/360" by Amdahl, Blaauw, and Brooks in the IBM Systems Journal in April 1964 described a lot of the reasoning, and they wrote a whole book about it.

They had to make a lot of other design decisions like 6 vs 8 bit
bytes, ones- vs twos-complement, length fields vs word marks for
variable length data, stack vs registers, floating point format (they
blew that one).

They said that the combination of a full length base register and a
short displacement "gives consequent gains in instruction density. The base-register approach was adopted, and then augmented, for some instructions, with a second level of indexing."

In retrospect, B+X+D was probably a mistake since I believe that
double indexing is rarely used, and easy to do with an extra register
add.

That is the view of MIPS and RISC_V
That is not the view of x86 or ARM or My 66000 or Mc 88K

On the other hand, it's not obvious what a better use of the X
field would have been. I suppose they could have made instructions
three operand, e.g.

A Rx,Ry,B(D)

would add the memory operand to Ry and put it in Rx but it was a long
time until compilers could make good use of that.

Agreed about time it took compiler to be taught how to use it.

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 13:30:42 2026

From Newsgroup: comp.arch

On Sat, 23 May 2026 20:09:54 +0000, John Levine wrote:

Remember that the major reason for B+D addressing was that it let them
have 16 bit address fields in instructions while keeping 24 bit flat addressing.

12 bits, of course. And they felt that 12 bits were enough because memory
was such an issue back then.

In hindsight, of course having a two-bit condition code was a "mistake".
But C hadn't been invented yet, so nobody knew there would be any real use
for unsigned integers.

And the PSW really was full - when IBM went to System/370, they had to repurpose a bit in the PSW that was already assigned to an existing
feature, ASCII mode. Since nobody ever used it, however, using it instead
for the System/370's "Extended Control Mode", wherein the PSW *did* get doubled in length was possible.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 13:32:41 2026

From Newsgroup: comp.arch

On Sat, 23 May 2026 20:03:34 +0000, MitchAlsup wrote:

S/360 would have been better off as defining PSW as a PSQW (128-bits)
which would have alleviated several problems associated with running out
of PSW space.

It's not as if these problems were impossible to fix.

Remember the System/370, and its Extended Control Mode? All they lost was
the ability to switch the computer into an ASCII mode nobody ever used.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 13:49:26 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 13:32:41 +0000, quadi wrote:

On Sat, 23 May 2026 20:03:34 +0000, MitchAlsup wrote:

S/360 would have been better off as defining PSW as a PSQW (128-bits)
which would have alleviated several problems associated with running
out of PSW space.

Remember the System/370, and its Extended Control Mode? All they lost
was the ability to switch the computer into an ASCII mode nobody ever
used.

Come to think of it, though, that fact doesn't make you wrong. They
would have been better off defining it as 128 bits long in the first
place, since one thing they _couldn't_ do with Extended Control Mode was change the condition codes from two bits to full NZVC, since user programs
had to remain compatible.

Of course, though, people must have been able to get C compilers working
on z/Architecture, despite inefficiencies, or it wouldn't be possible to install Linux on those machines.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 13:57:12 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 01:43:29 +0000, John Levine wrote:

In retrospect, B+X+D was probably a mistake since I believe that double indexing is rarely used, and easy to do with an extra register add. On
the other hand, it's not obvious what a better use of the X field would
have been. I suppose they could have made instructions three operand,
e.g.

A Rx,Ry,B(D)

would add the memory operand to Ry and put it in Rx but it was a long
time until compilers could make good use of that.

Since there were three-address machines back in the days before general registers, I am surprised to hear that they didn't know how to write
compilers that made use of such a field.

But the "better use of the X field" is obvious - make the displacement
field 16 bits instead of 12 bits. Except, of course, that this would have killed the SS format of instructions.

But I don't agree that B+X+D is a bad thing. An extra register add is an
extra instruction. And it's not rarely used; it's used every time an array
is accessed, and arrays are often accessed in inner loops!

Of course, there are counterarguments. B+X+D, when used, involves an extra
add inside the instruction. Doesn't that take time too? Wouldn't it be
better to add just once at the beginning of the loop?

The thing is, though, there's also *register pressure* to think about.
Plus, the extra add inside the instruction just means a three-input add,
and once one recalls how *multipliers* are designed, one realizes that
this extra add, though it may still take time, does not take _much_ time.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 14:14:41 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 13:49:26 +0000, quadi wrote:

Of course, though, people must have been able to get C compilers working
on z/Architecture, despite inefficiencies, or it wouldn't be possible to install Linux on those machines.

I did a search, and found that z/Architecture added add-with-carry, subtract-with-borrow, and LLGF and LLGH which appear to be UL and ULH in
my architectures.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun May 24 09:32:07 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> writes:

I suspect the encoded condition bits in S/360 are a reflection of
the expensive memory era in which it was created. If they had
decoded condition codes, they'd have had to find more bits in
the PSW to store them, and it was already quite full.

Some additional possible reasons:

Most[1] architectures before the S/360 use ones-complement or
sign/magnitude representation for integers, and trap on overflow [2],
and I guess they have separate comparison instructions (I don't know
that much about these machines, so I may be wrong here). So there was
no need for flags indicating signed overflow (V), or unsigned overflow
(C). Having only two flag bits was good enough to represent the three
possible outcomes of a comparison.

[1] Zuse chose twos-complement in the early 1940s. I don't know if he
stuck with that in his later machines.

[2] Reading the IBM 704 manual, it just says for some instructions
that "Ac overflow is possible", but does not describe the
behaviour. For division, the IBM 704 has "divide or halt" and
"divide or proceed", so I guess that trapping in the modern sense
was not yet on the table.

S/360 also supports the trap-on-overflow behaviour for signed
arithmetics, but one can turn the trapping off. Arithmetic
instructions set the flags in different ways depending on whether they
are signed or unsigned. So S/360 has a separate add-signed (A) and add-unsigned (AL) instruction; thanks to 2s-complement arithmetics,
when they don't trap, they produce the same result in the target
register, but different behaviour in the flags and in trapping.
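
A sketch in Python of that difference, with the two-bit condition code meanings given here as I recall them from the Principles of Operation (treat them as an assumption and check the manual before relying on them). The function names are made up; trapping on overflow is not modelled. For A, the code is 0 zero, 1 negative, 2 positive, 3 overflow; for AL, it is 0 zero without carry, 1 nonzero without carry, 2 zero with carry, 3 nonzero with carry.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000

def add_signed(a, b):
    """Model of A: return (32-bit result, condition code)."""
    a &= MASK32
    b &= MASK32
    r = (a + b) & MASK32
    # Overflow: operands share a sign and the result's sign differs from it.
    overflow = bool(~(a ^ b) & (a ^ r) & SIGN32)
    if overflow:
        cc = 3
    elif r == 0:
        cc = 0
    elif r & SIGN32:
        cc = 1
    else:
        cc = 2
    return r, cc

def add_logical(a, b):
    """Model of AL: return (32-bit result, condition code)."""
    total = (a & MASK32) + (b & MASK32)
    r = total & MASK32
    carry = total > MASK32
    cc = (2 if carry else 0) + (1 if r != 0 else 0)
    return r, cc
```

Adding 0x7FFFFFFF and 1 yields 0x80000000 from both, but the signed form reports overflow (code 3) and the logical form reports a nonzero sum without carry (code 1).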

I expect that this all costs in control logic, so more constrained
processors like the 6502 and the 8080 then later went with NCZV. The
PDP-11 too AFAIK, but that may be due to the features of the bit-slice
ALUs available when the PDP-11 was designed. These machines also did
not have as many encoding bits to waste on separate signed and
unsigned integer arithmetic, thanks to their very narrow memory
bandwidth.

The architectures before the S/360 do not support multi-word
integer arithmetic (unless you count the digit-serial and
character-serial machines), and S/360 does not, either. It takes
until 1990 for IBM to add ALCR (add with carry-in) to the
architecture. For architectures with smaller word sizes like the
PDP-11, the 6502 and 8080, the need for multi-word integer arithmetic
was much greater.
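
For readers who have not written it by hand, this is roughly what multi-word addition looks like when an add-with-carry is available: each word add consumes the carry from the previous one. The sketch below is Python, with a hypothetical function name and a little-endian word list chosen purely for convenience.

```python
def add_multiword(a_words, b_words, width=16):
    """Add two equal-length lists of width-bit words, least significant first.

    Returns (sum_words, carry_out).
    """
    if len(a_words) != len(b_words):
        raise ValueError("operands must have the same number of words")
    mask = (1 << width) - 1
    carry = 0
    out = []
    for a, b in zip(a_words, b_words):
        total = (a & mask) + (b & mask) + carry   # add with carry-in
        out.append(total & mask)
        carry = total >> width                    # carry-out feeds the next word
    return out, carry
```

On a 16-bit machine, a 32-bit add is two such steps; without a carry flag or carry-in, the carry has to be recovered by an explicit comparison after each word.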

Interestingly, the IBM 704 has the ACL instruction, an unsigned
addition with carry-in, like the ESA/390's ALCR.

Bottom line: When the S/360 was designed, the design of 2s-complement
machines was in its infancy (if we ignore Zuse, and the S/360
designers may have ignored him), so it was not known how to design the
flags for them.

One other aspect that may have played a role is that various S/360 implementations included compatibility modes for earlier IBM models,
and they may have designed the flags with that in mind. However, given
the vast differences between a 36-bit machine with sign/magnitude representation (IBM 704) and the S/360, implementing different flags
for the different architectures was probably just a minor
complication. Moreover, different S/360 models offer compatibility
for different older architectures, where the flags probably were
different.

Concerning the question about why IBM chose big-endian for the S/360.
I see <https://en.wikipedia.org/wiki/IBM_704#Registers> that already
the IBM 704 used big-endian bit-numbering. As long as you only have
one width at which to talk to registers or memory, that's as good as little-endian. It only becomes an issue if you talk to registers at
different widths (e.g., 32-bit and 64-bit Power(PC)), and likewise,
for memory it only becomes an issue when you talk to memory at
different widths; i.e., not word-addressed machines, but
byte-addressed machines.

For FP the machines have different widths from early on, but they tend
not to access the halves of a double-precision number, so the
difference between big- and little-endian rarely makes a difference
there.

Actually, one does see some effects of big-endian bit numbering in the
IBM 704, because the Accumulator has additional bits, and they are
called P and Q (with little-endian bit ordering starting with bit 0,
they would just be called 35 and 36). Also the 15-bit index registers
run from bit 3 (MSB) to bit 17 (LSB), not from 0 to 17.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From John Levine@johnl@taugh.com to comp.arch on Sun May 24 15:12:48 2026

From Newsgroup: comp.arch

According to quadi <quadibloc@ca.invalid>:

On Sun, 24 May 2026 01:43:29 +0000, John Levine wrote:

In retrospect, B+X+D was probably a mistake since I believe that double
indexing is rarely used, and easy to do with an extra register add. On
the other hand, it's not obvious what a better use of the X field would
have been. I suppose they could have made instructions three operand,
e.g.

A Rx,Ry,B(D)

would add the memory operand to Ry and put it in Rx but it was a long
time until compilers could make good use of that.

Since there were three-address machines back in the days before general
registers, I am surprised to hear that they didn't know how to write
compilers that made use of such a field.

Optimizing compilers largely meant Fortran, which came from the single address 70x series. Human programmers did all sorts of clever tricks but it took a while to get compilers to do it. It probably needed graph coloring register allocation which wasn't invented until 1980.

But the "better use of the X field" is obvious - make the displacement
field 16 bits instead of 12 bits. Except, of course, that this would have
killed the SS format of instructions.

Or worse had some instructions with 12 bit displacement and some with 16
which would have been a programming nightmare.

But I don't agree that B+X+D is a bad thing. An extra register add is an
extra instruction. And it's not rarely used; it's used every time an array
is accessed, and arrays are often accessed in inner loops!

A decent optimizing compiler will do strength reduction so there's a register pointing at the array and stepping through it. You're right about register pressure but with 16 registers it shouldn't be hard to find one for an inner loop value.
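
A rough C sketch of the transformation, with hypothetical function names: the first loop is what the programmer writes (base plus scaled index on every access), the second is roughly what strength reduction turns it into (a single pointer stepping through the array). Whether a given compiler actually emits the second form depends on the compiler and the target.

```c
#include <stddef.h>

/* Indexed form: each access computes base + i * sizeof(long), which maps
   naturally onto a base+index addressing mode. */
static long sum_indexed(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Strength-reduced form: one pointer steps through the array, so each
   access needs only a base register and a zero displacement. */
static long sum_pointer(const long *a, size_t n)
{
    long total = 0;
    for (const long *p = a, *end = a + n; p != end; p++)
        total += *p;
    return total;
}
```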
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From John Levine@johnl@taugh.com to comp.arch on Sun May 24 15:24:22 2026

From Newsgroup: comp.arch

According to quadi <quadibloc@ca.invalid>:

On Sat, 23 May 2026 20:09:54 +0000, John Levine wrote:

Remember that the major reason for B+D addressing was that it let them
have 16 bit address fields in instructions while keeping 24 bit flat
addressing.

12 bits, of course. And they felt that 12 bits were enough because memory
was such an issue back then.

It was also to force all addresses to be base relative to make code relocatable.

You should read the 1964 paper. It's not very long. Here's a copy:

https://www.ece.ucdavis.edu/~vojin/CLASSES/EEC272/S2005/Papers/IBM360-Amdahl_april64.pdf

In hindsight, of course having a two-bit condition code was a "mistake".
But C hadn't been invented yet, so nobody knew there would be any real use
for unsigned integers.

Sure they did. S/360 had separate unsigned versions of add and subtract instructions. The results were the same but the condition codes were
different and the unsigned versions couldn't overflow. There were also arithmetic and logical shifts.
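
A small C sketch of that distinction, assuming 32-bit words. The struct and function names are made up, and the two booleans only stand in for the information a condition code carries; this does not model the actual S/360 condition-code encoding.

```c
#include <stdbool.h>
#include <stdint.h>

struct add_flags {
    uint32_t sum;      /* result bits, identical for signed and unsigned */
    bool carry;        /* carry out of bit 31: the unsigned view */
    bool overflow;     /* signed overflow: the two's-complement view */
};

static struct add_flags add32(uint32_t a, uint32_t b)
{
    struct add_flags f;
    f.sum = a + b;                     /* wraps modulo 2^32 */
    f.carry = f.sum < a;               /* wrapped iff the sum got smaller */
    /* Signed overflow: both operands share a sign that the sum lacks. */
    f.overflow = ((~(a ^ b) & (a ^ f.sum)) >> 31) != 0;
    return f;
}
```

For 0x7FFFFFFF + 1 the sum bits are 0x80000000 either way, but only the signed interpretation overflows; for 0xFFFFFFFF + 1 only the unsigned interpretation produces a carry.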

And the PSW really was full - when IBM went to System/370, they had to
repurpose a bit in the PSW that was already assigned to an existing
feature, ASCII mode. Since nobody ever used it, however, using it instead
for the System/370's "Extended Control Mode", wherein the PSW *did* get
doubled in length was possible.

Yup.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 16:39:25 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 15:24:22 +0000, John Levine wrote:

Sure they did. S/360 had separate unsigned versions of add and subtract instructions. The results were the same but the condition codes were different and the unsigned versions couldn't overflow.

Ah, I didn't remember that!

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 16:44:26 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 15:12:48 +0000, John Levine wrote:

According to quadi <quadibloc@ca.invalid>:

But the "better use of the X field" is obvious - make the displacement
field 16 bits instead of 12 bits. Except, of course, that this would
have killed the SS format of instructions.

Or worse had some instructions with 12 bit displacement and some with 16 which would have been a programming nightmare.

Of course, the z/Architecture does have instructions with 20 bit
displacements as well as 12 bits. But unlike the case where only the SS instructions have a 12-bit displacement, it has a complete set of
instructions in each size.

And my Concertina II and IV also have 12, 16, and 20 bit displacements -
but it uses a different set of registers as the base registers for each,
and also has a complete set of instructions for each, thus avoiding the nightmare.

John Savard

From John Levine@johnl@taugh.com to comp.arch on Sun May 24 17:06:52 2026

From Newsgroup: comp.arch

According to MitchAlsup <user5857@newsgrouper.org.invalid>:

In retrospect, B+X+D was probably a mistake since I believe that
double indexing is rarely used, and easy to do with an extra register
add.

That is the view of MIPS and RISC_V
That is not the view of x86 or ARM or My 66000 or Mc 88K

I suppose, but I don't think any of them reserved four instruction bits
for an index register that's rarely used. On x86 it's one bit in the r/m
field and arguably not even that since it's part of a three bit field
that's overloaded as a register number, or in 32 bit mode one address
form out of 8 that takes an extra byte for the base and index registers.

Vax also had double indexing, but it was an extra prefix byte in the
address field that said add register N scaled by the operand size to
whatever other address followed, so there it was one address mode out of
16.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 17:16:35 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 09:32:07 +0000, Anton Ertl wrote:

Most[1] architectures before the S/360 use ones-complement or
sign/magnitude representation for integers, and trap on overflow [2],

It makes sense to trap on a floating-point overflow, but trapping on an integer overflow is usually a terrible idea.

Before the System/360, it's definitely true that one's complement and
sign-magnitude representations of integers were valid options for designers.
I'm not sure of their relative frequency.

I do know of a claim made by one maker of a 24-bit computer in its
advertising literature, and I suspect it did represent the situation then.

Sign-magnitude was what the IBM 704 and its descendants used. As a result,
it was the... aspirational... integer representation.

One's complement was very popular back then - simpler to implement than sign-magnitude, but almost equivalent, in some sense. Thus, one's
complement was the preferred representation in the PDP-4, which also had a limited two's complement capability.

And two's complement was the simplest to implement, and thus chosen where
cost savings were paramount. So the PDP-5 used two's complement.

And then the IBM 360 came along, and woke everyone up to the fact that
there was no real reason to use anything but two's complement.

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 17:30:40 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 09:32:07 +0000, Anton Ertl wrote:

Concerning the question about why IBM chose big-endian for the S/360

...I'm not really aware that they had a choice.

Some machines before the IBM System/360 did use little-endian ordering for multiple words, to simplify handling the carries when adding pairs of
words.

Until the PDP-11 came along, though, _nobody_ thought of putting the characters inside a word starting at the least-significant end, so that
the ordering of bytes would be consistent with the ordering of words.

Until the PDP-11 came along, therefore, little-endian wasn't a "thing".
The most significant part of a two-word integer might be placed
second, so that you could fetch the parts in forwards order and start
adding right away, but that wasn't part of a philosophy.

The System/360 _only_ did BCD arithmetic with the SS instructions, it
didn't put BCD in registers. So it wasn't forced to use big-endian by my
consistency argument; binary values could still have been little-endian if
they had preferred. But the different machines in the System/360 family
had different bus widths.

So they couldn't just be little-endian at the 16-bit level; they would
have had to have been consistent. I suppose they could have thought of it first even if they didn't have the PDP-11 to copy from. But because almost
all their machines were microcoded, they were in a position to do things
like working backwards from the end of a number to do arithmetic to avoid having a severe cost penalty for big-endian.

John Savard

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 24 17:32:10 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Sun, 24 May 2026 09:32:07 +0000, Anton Ertl wrote:

Most[1] architectures before the S/360 use ones-complement or
sign/magnitude representation for integers, and trap on overflow [2],

It makes sense to trap on a floating-point overflow, but trapping on an integer overflow is usually a terrible idea.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

John Savard


From quadi@quadibloc@ca.invalid to comp.arch on Sun May 24 21:39:42 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

No, so being able to turn the trap for integer overflow on should
definitely be allowed. But that shouldn't be the default behavior.
Otherwise, programs like random number generators wouldn't work.

John Savard

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 24 22:07:18 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

No, so being able to turn the trap for integer overflow on should
definitely be allowed. But that shouldn't be the default behavior. Otherwise, programs like random number generators wouldn't work.

They work just fine using unSigned integers.

John Savard


From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun May 24 15:22:46 2026

From Newsgroup: comp.arch

On 5/24/2026 3:07 PM, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

No, so being able to turn the trap for integer overflow on should
definitely be allowed. But that shouldn't be the default behavior.
Otherwise, programs like random number generators wouldn't work.

They work just fine using unSigned integers.

Ditto!

[...]


From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 01:04:36 2026

From Newsgroup: comp.arch

On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

You will find you have no <marketable> choice; you need to support::

Integer{S8, S16, S32, S64, U8, U16, U32, U64}
Float {FP8, FP16, FP32, FP64 and some way to get FP128}

After realizing that I did need a second instruction for unsigned
_division_ I then learned, to my shock, that division was not one, but
two, instructions, at least in my architecture, for integers.

And there didn't seem to be enough opcode space left for Divide Extensibly Unsigned.

I was able to re-adjust the 32-bit operate instructions so that the two
places where only 96 opcodes were provided for the basic operate
instructions could now provide 128 opcodes.

The 16-bit and 24-bit short instructions could not be so modified. But
there were a few unused opcodes; so Divide Extensibly Unsigned could still
fit in, just out of place.

But that meant that this one operation would be missing from the
minimum-length immediate instructions, and would still be treated as out of the
basic instruction set, getting immediate instructions that were 16 bits longer, for them.

The Pigeonhole Principle has finally bit me!

John Savard

From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 01:29:37 2026

From Newsgroup: comp.arch

On Mon, 25 May 2026 01:04:36 +0000, quadi wrote:

The 16-bit and 24-bit short instructions could not be so modified. But
there were a few unused opcodes; so Divide Extensibly Unsigned could
still fit in, just out of place.

But that meant that this one operation would be missing from the
minimum-length immediate instructions, and would still be treated as
out of the basic instruction set, getting immediate instructions that
were 16 bits longer, for them.

I have found a way around even that problem. There is no use for a "swap immediate" instruction, so I'll put Divide Extensibly Unsigned in its
spot, so it will be in the columns for its types, and put the swap instruction, another exotic one, in the out-of-place spots left over.

John Savard

From David Brown@david.brown@hesbynett.no to comp.arch on Mon May 25 10:23:00 2026

From Newsgroup: comp.arch

On 24/05/2026 23:39, quadi wrote:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

No, so being able to turn the trap for integer overflow on should
definitely be allowed. But that shouldn't be the default behavior.
Otherwise, programs like random number generators wouldn't work.

John Savard

That does not make sense. Code such as random number generators should
be written so that they are correct in the language they are written in.
If that is C, signed integer overflow is UB while unsigned integers
have wrapping behaviour - thus if your code depends on wrapping, and it
is written in C, it needs to use unsigned types or compiler-specific extensions, flags, etc. (Or C23 ckd_add and other checked arithmetic functions.)

If it is written in Zig, you need to use the specific modulo arithmetic functions even for unsigned arithmetic. If it is written in Java,
signed integer arithmetic is fine.

It all depends on the language and/or any options the language and tools
might support - and code should be written to work correctly according
to the language rules.

The hardware, of course, cannot always enable trapping on overflow if it
is going to efficiently support a range of programming languages. But
as an optional feature it can be helpful for catching a few bugs in
code, so it can be a good idea (both for signed and unsigned overflow).
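
For the C case, a minimal sketch of a wrapping generator written with unsigned types, so that it stays well-defined regardless of how signed overflow is treated. The function name is hypothetical; the multiplier and increment are the widely published Numerical Recipes LCG constants, used here only as an example.

```c
#include <stdint.h>

/* One step of a 32-bit linear congruential generator.
   uint32_t arithmetic wraps modulo 2^32 by definition in C, so the
   multiply and add below never invoke undefined behaviour. */
static uint32_t lcg_next(uint32_t state)
{
    return state * 1664525u + 1013904223u;
}
```

Writing the same recurrence with int32_t would make the result depend on compiler flags such as -fwrapv or -ftrapv.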


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 25 14:28:21 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

On 24/05/2026 23:39, quadi wrote:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

Most programming environments I have had contact with don't trap on floating-point overflow.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

The question is if an integer overflow means that something went
wrong. Despite their eagerness to "optimize" based on the assumption
that signed integer overflow does not happen, the GCC developers have
avoided making -ftrapv the default, even on platforms like MIPS and
Alpha where the implementation of -ftrapv just means to use different instructions (e.g., add instead of addu on MIPS, and addv instead of
add on Alpha).

The hardware, of course, cannot always enable trapping on overflow if it
is going to efficiently support a range of programming languages. But
as an optional feature it can be helpful for catching a few bugs in
code, so it can be a good idea (both for signed and unsigned overflow).

This supposedly helpful feature has been neglected by C compiler
developers, and you see in the progression from MIPS (1986) to Alpha
(1992) and then RISC-V (2011) that the hardware architects have
accepted that:

MIPS: add traps on signed overflow, you need to write addu if you
don't want that.

Alpha: add ignores signed overflow, you need to write addv if you want
the trapping.

RISC-V: add ignores signed overflow, there is no add that traps on
signed overflow (and detecting signed overflow is pretty
involved if both operands are unknown to the compiler).
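
One way to do the general-case check in C, as a sketch: compute the wrapped sum, then compare signs, similar in spirit to the compare-based sequences used on machines without overflow flags or trapping adds. The function name is invented, and the conversion from uint32_t back to int32_t is implementation-defined before C23 (GCC and Clang wrap), so this rests on that assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stores the wrapped sum in *sum and returns true if a + b overflowed
   as a signed 32-bit addition. */
static bool add_overflows(int32_t a, int32_t b, int32_t *sum)
{
    /* Do the addition in unsigned arithmetic so it wraps instead of
       being undefined; the cast back is assumed to wrap as well. */
    int32_t s = (int32_t)((uint32_t)a + (uint32_t)b);
    *sum = s;
    /* Without overflow, adding a negative b must make the sum smaller
       than a, and adding a non-negative b must not. */
    return (b < 0) != (s < a);
}
```

That is one add plus two compares and a branch, against a single instruction on an architecture with a trapping add.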

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From David Brown@david.brown@hesbynett.no to comp.arch on Mon May 25 17:18:18 2026

From Newsgroup: comp.arch

On 25/05/2026 16:28, Anton Ertl wrote:

David Brown <david.brown@hesbynett.no> writes:

On 24/05/2026 23:39, quadi wrote:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

Most programming environments I have had contact with don't trap on floating-point overflow.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

The question is if an integer overflow means that something went
wrong.

At the source code level, that is often the case - but not always. I
think it is quite clear that if you do something the language does not
allow, the code is wrong, but it might give the correct results for some
tools nonetheless. And overflow will often mean something went wrong
even when the language (or compiler options) specifically allow it. At
the object code level, things may be different again. (For an obvious example, if you are using a double-width integer type then the source
code may have no overflow but the implementation might use two "add-with-carry" instructions where overflow is a natural part of the implementation.)

Despite their eagerness to "optimize" based on the assumption
that signed integer overflow does not happen, the GCC developers have
avoided making -ftrapv the default, even on platforms like MIPS and
Alpha where the implementation of -ftrapv just means to use different instructions (e.g., add instead of addu on MIPS, and addv instead of
add on Alpha).

An awkward thing about using trap on overflow is determining how
precisely it is defined. Supposing you have the expression "a + b - a".
Perhaps "a + b" overflows. I would hope that when using debug-related
compiler flags such as "-fsanitize=signed-integer-overflow", a compiler
would check for overflow on "a + b", and report it at runtime.
(Unfortunately, gcc does not do that unless the partial expression is
assigned to a variable.) But in "normal" usage, I'd expect the
expression to be simplified, resulting in just "b" and no overflow.

If "trap on overflow" has precise semantics in the code, then this
disables a range of useful optimisations and re-arrangements. If it is
just "use trapping arithmetic instructions", then it will miss many
possible cases of actual overflow in the code, which we might want to
catch. And "trap on overflow" might either trigger when there is no
overflow in the original code, or hinder optimisations. (Consider the expression "x / 2 + y / 2" - the compiler could implement that as a
combined "(x + y) / 2", but that might introduce overflow.)

It is not easy to see how a tool can avoid false positives and false
negatives and also conveniently optimise and re-arrange code.

The hardware, of course, cannot always enable trapping on overflow if it
is going to efficiently support a range of programming languages. But
as an optional feature it can be helpful for catching a few bugs in
code, so it can be a good idea (both for signed and unsigned overflow).

This supposedly helpful feature has been neglected by C compiler
developers, and you see in the progression from MIPS (1986) to Alpha
(1992) and then RISC-V (2011) that the hardware architects have
accepted that:

MIPS: add traps on signed overflow, you need to write addu if you
don't want that.

Alpha: add ignores signed overflow, you need to write addv if you want
the trapping.

RISC-V: add ignores signed overflow, there is no add that traps on
signed overflow (and detecting signed overflow is pretty
involved if both operands are unknown to the compiler).

- anton

Compilers have not always been good at taking advantage of all the
features provided by hardware - nor have languages been good at exposing
the possibilities in the language so that programmers can take advantage
of them.


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 16:45:07 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Wed, 20 May 2026 01:35:01 +0000, MitchAlsup wrote:

You will find you have no <marketable> choice; you need to support::

Integer{S8, S16, S32, S64, U8, U16, U32, U64}
Float {FP8, FP16, FP32, FP64 and some way to get FP128}

After realizing that I did need a second instruction for unsigned
_division_ I then learned, to my shock, that division was not one, but
two, instructions, at least in my architecture, for integers.

And there didn't seem to be enough opcode space left for Divide Extensibly Unsigned.

My 66000 has an instruction bit that denotes the signedness of integer calculations {Signed, unSigned}. This bit is available as another OpCode
bit for non-integer calculation instructions.

I was able to re-adjust the 32-bit operate instructions so that the two places where only 96 opcodes were provided for the basic operate instructions could now provide 128 opcodes.

The 16-bit and 24-bit short instructions could not be so modified. But
there were a few unused opcodes; so Divide Extensibly Unsigned could still fit in, just out of place.

But that meant that this one operation would be missing from the
minimum-length immediate instructions, and would still be treated as out of
the basic instruction set, getting immediate instructions that were 16 bits
longer, for them.

The Pigeonhole Principle has finally bit me!

John Savard


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 16:49:59 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:

On 24/05/2026 23:39, quadi wrote:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

-----------------

This supposedly helpful feature has been neglected by C compiler
developers, and you see in the progression from MIPS (1986) to Alpha
(1992) and then RISC-V (2011) that the hardware architects have
accepted that:

MIPS: add traps on signed overflow, you need to write addu if you
don't want that.

Alpha: add ignores signed overflow, you need to write addv if you want
the trapping.

RISC-V: add ignores signed overflow, there is no add that traps on
signed overflow (and detecting signed overflow is pretty
involved if both operands are unknown to the compiler).

The worst of all possible semantic encodings

- anton


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 25 16:43:07 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

On 25/05/2026 16:28, Anton Ertl wrote:

Despite their eagerness to "optimize" based on the assumption
that signed integer overflow does not happen, the GCC developers have
avoided making -ftrapv the default, even on platforms like MIPS and
Alpha where the implementation of -ftrapv just means to use different
instructions (e.g., add instead of addu on MIPS, and addv instead of
add on Alpha).

An awkward thing about using trap on overflow is determining how
precisely it is defined. Supposing you have the expression "a + b - a".
Perhaps "a + b" overflows. I would hope that when using debug-related
compiler flags such as "-fsanitize=signed-integer-overflow", a compiler
would check for overflow on "a + b", and report it at runtime.
(Unfortunately, gcc does not do that unless the partial expression is
assigned to a variable.) But in "normal" usage, I'd expect the
expression to be simplified, resulting in just "b" and no overflow.

OTOH, cases like a+b+c where the result is in range, while an
intermediate result is out of range are one of the reasons why I
prefer -fwrapv over -ftrapv. As for your preference of nasal demons,
given enough information, the compiler might "optimize" "a+b-a" into,
e.g., 0.
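
A sketch of the a+b+c case in C, emulating -fwrapv semantics with unsigned arithmetic so that the example does not itself rely on undefined behaviour. The function name is made up, and the final conversion back to int32_t is implementation-defined before C23 (it wraps on GCC and Clang).

```c
#include <stdint.h>

/* Three-operand add with wrapping intermediates. */
static int32_t wrap_add3(int32_t a, int32_t b, int32_t c)
{
    uint32_t partial = (uint32_t)a + (uint32_t)b;  /* may wrap around */
    uint32_t total = partial + (uint32_t)c;        /* may wrap back */
    return (int32_t)total;
}
```

With a = INT32_MAX, b = 1, c = -1 the intermediate a+b is out of range, yet the wrapped final result is the mathematically correct INT32_MAX; a trapping add would have stopped at the first step.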

Anyway, the definition of -ftrapv is not very precise; for gcc-12.2:

|'-ftrapv'
| This option generates traps for signed overflow on addition,
| subtraction, multiplication operations.

As for what gcc-12.2 does for your example on AMD64:

long foo(long a, long b)
{
  return a+b-a;
}

is compiled with gcc -O3 -ftrapv to:

0: 48 89 f0 mov %rsi,%rax
3: c3 ret

If "trap on overflow" has precise semantics in the code, then this
disables a range of useful optimisations and re-arrangements. If it is
just "use trapping arithmetic instructions", then it will miss many
possible cases of actual overflow in the code, which we might want to
catch.

Which would you prefer by default?

The gcc developers apparently took the latter approach, even when you
ask for -ftrapv explicitly. So what, IYO, speaks against doing that
by default on machines like MIPS and Alpha?

And "trap on overflow" might either trigger when there is no
overflow in the original code, or hinder optimisations. (Consider the
expression "x / 2 + y / 2" - the compiler could implement that as a
combined "(x + y) / 2", but that might introduce overflow.)

x/2+y/2 produces a different result from (x+y)/2 when both x and y are
odd integers.
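
A quick C check of that claim, with hypothetical function names. Signed division in C truncates toward zero, so in the first form each odd operand loses a half before the addition.

```c
/* Halve each operand first, then add: no intermediate overflow possible. */
static long halves_then_add(long x, long y)
{
    return x / 2 + y / 2;
}

/* Add first, then halve: x + y can overflow for large operands. */
static long add_then_halve(long x, long y)
{
    return (x + y) / 2;
}
```

For x = 3 and y = 5 the first form gives 3 and the second gives 4; for two even operands such as 4 and 6 both give 5.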

gcc-12.2 compiles

long bar(long x, long y)
{
  return x/2+y/2;
}

on AMD64 to:

gcc -O3 -ftrapv             gcc -O3
mov    %rdi,%rax            mov    %rdi,%rax
sub    $0x8,%rsp            mov    %rsi,%rdx
shr    $0x3f,%rax           shr    $0x3f,%rax
add    %rax,%rdi            shr    $0x3f,%rdx
mov    %rsi,%rax            add    %rdi,%rax
shr    $0x3f,%rax           add    %rsi,%rdx
sar    %rdi                 sar    %rax
add    %rax,%rsi            sar    %rdx
sar    %rsi                 add    %rdx,%rax
call   __addvdi3@PLT        ret
add    $0x8,%rsp
ret

so the -ftrapv introduces an additional mov and a call; I would have
expected that the + would be compiled to an ADD instruction followed
by a JO instruction.

Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
produces ILP32 code) produces a call to __addvsi3 instead of the
expected add instruction:

gcc -O3 -ftrapv             gcc -O3
lui    gp,0x0               srl    v0,a0,0x1f
addiu  gp,gp,0              srl    v1,a1,0x1f
addu   gp,gp,t9             addu   v0,v0,a0
srl    v1,a0,0x1f           addu   a1,v1,a1
lw     t9,__addvsi3(gp)     sra    v0,v0,0x1
srl    v0,a1,0x1f           sra    a1,a1,0x1
addiu  sp,sp,-32            jr     ra
addu   a0,v1,a0             addu   v0,v0,a1
addu   a1,v0,a1
sra    a0,a0,0x1
sw     ra,28(sp)
sw     gp,16(sp)
jalr   t9
sra    a1,a1,0x1
lw     ra,28(sp)
jr     ra
addiu  sp,sp,32

The call costs a lot of overhead.

It is not easy to see how a tool can avoid false positives and false
negatives and also conveniently optimise and re-arrange code.

It can't. But it does not try to avoid false negatives even when
explicitly asked for trapping on overflow.

If some overflow trapping, where it can be done without additional
instructions, were preferable to no overflow trapping, gcc would compile
signed adds that survive after optimization into add on MIPS rather
than addu, by default. Given that it does not, the GCC developers
probably found out that it is not preferable. I guess they would get
too many customer complaints, including for "relevant" code, i.e.,
code where the usual "it's UB, so your code is broken" excuse does not
work.

The fact that they don't even try to make -ftrapv produce efficient
code indicates that there is no "relevant" interest in efficient
-ftrapv. It would be interesting to know who came up with the idea of
adding -ftrapv, and why they are still keeping it.

Compilers have not always been good at taking advantage of all the
features provided by hardware

GCC is pretty good at implementing -fwrapv. For the two examples
above, "gcc -O3 -fwrapv" produces the same code on AMD64 and MIPS as
"gcc -O3".

nor have languages been good at exposing
the possibilities in the language so that programmers can take advantage
of them.

Yes. But I leave that for another day.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 19:20:01 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:

On 25/05/2026 16:28, Anton Ertl wrote:

Despite their eagerness to "optimize" based on the assumption
that signed integer overflow does not happen, the GCC developers have
avoided making -ftrapv the default, even on platforms like MIPS and
Alpha where the implementation of -ftrapv just means to use different
instructions (e.g., add instead of addu on MIPS, and addv instead of
add on Alpha).

An awkward thing about using trap on overflow is determining how
precisely it is defined. Supposing you have the expression "a + b - a".
Perhaps "a + b" overflows. I would hope that when using debug-related
compiler flags such as "-fsanitize=signed-integer-overflow", a compiler
would check for overflow on "a + b", and report it at runtime.
(Unfortunately, gcc does not do that unless the partial expression is
assigned to a variable.) But in "normal" usage, I'd expect the
expression to be simplified, resulting in just "b" and no overflow.

OTOH, cases like a+b+c where the result is in range, while an
intermediate result is out of range are one of the reasons why I
prefer -fwrapv over -ftrapv. As for your preference of nasal demons,
given enough information, the compiler might "optimize" "a+b-a" into,
e.g., 0.

a/0/b/

Anyway, the definition of -ftrapv is not very precise; for gcc-12.2:

|'-ftrapv'
| This option generates traps for signed overflow on addition,
| subtraction, multiplication operations.

As for what gcc-12.2 does for your example on AMD64:

long foo(long a, long b)
{
  return a+b-a;
}

is compiled with gcc -O3 -ftrapv to:

0: 48 89 f0 mov %rsi,%rax
3: c3 ret
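
The two readings of "trap on overflow" for this example can be put side by side in a minimal sketch. It assumes GCC or Clang, since `__builtin_add_overflow` and `__builtin_sub_overflow` are compiler extensions; the names `foo_strict` and `foo_simplified` are made up for illustration and are not from the posts above.

```c
#include <limits.h>
#include <stdbool.h>

/* Strict reading: every intermediate operation is checked.
   Returns false if a+b or (a+b)-a overflows; otherwise stores the result. */
bool foo_strict(long a, long b, long *out)
{
    long sum;
    if (__builtin_add_overflow(a, b, &sum))
        return false;                       /* a+b itself overflowed */
    return !__builtin_sub_overflow(sum, a, out);
}

/* Simplified reading: the expression is reduced to b before any
   arithmetic happens, so there is nothing left that could overflow. */
bool foo_simplified(long a, long b, long *out)
{
    (void)a;
    *out = b;
    return true;
}
```

With a = LONG_MAX and b = 1, foo_strict reports an overflow while foo_simplified just returns 1. The code shown above for gcc -O3 -ftrapv corresponds to the second reading.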

If "trap on overflow" has precise semantics in the code, then this
disables a range of useful optimisations and re-arrangements. If it is
just "use trapping arithmetic instructions", then it will miss many
possible cases of actual overflow in the code, which we might want to
catch.

Which would you prefer by default?

What you do want is compiled code that can trap on overflow and avoid
trapping on overflow without code substitution or being re-compiled.
This way production code can avoid trapping but if the debugger is
turned on, you can trap.

The gcc developers apparently took the latter approach, even when you
ask for -ftrapv explicitly. So what, IYO, speaks against doing that
by default on machines like MIPS and Alpha.

Both architectures got this one wrong--IMO--and so does RISC-V.

And "trap on overflow" might either trigger when there is no
overflow in the original code, or hinder optimisations. (Consider the
expression "x / 2 + y / 2" - the compiler could implement that as a
combined "(x + y) / 2", but that might introduce overflow.)

x/2+y/2 produces a different result from (x+y)/2 when both x and y are
odd integers.
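
A small sketch of why the two forms are not interchangeable; the function names are made up for illustration. C integer division truncates toward zero, so each odd operand loses its half before the add in the first form.

```c
/* x/2 + y/2: each division truncates toward zero before the add. */
long halves_then_add(long x, long y)
{
    return x / 2 + y / 2;
}

/* (x + y)/2: one division, but x + y can overflow for large inputs. */
long add_then_halve(long x, long y)
{
    return (x + y) / 2;
}
```

For x = 3 and y = 5 the first form gives 1 + 2 = 3 and the second gives 8 / 2 = 4, so a compiler may not substitute one for the other.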

gcc-12.2 compiles

long bar(long x, long y)
{
  return x/2+y/2;
}

on AMD64 to:

gcc -O3 -ftrapv          gcc -O3
mov %rdi,%rax            mov %rdi,%rax
sub $0x8,%rsp            mov %rsi,%rdx
shr $0x3f,%rax           shr $0x3f,%rax
add %rax,%rdi            shr $0x3f,%rdx
mov %rsi,%rax            add %rdi,%rax
shr $0x3f,%rax           add %rsi,%rdx
sar %rdi                 sar %rax
add %rax,%rsi            sar %rdx
sar %rsi                 add %rdx,%rax
call __addvdi3@PLT       ret
add $0x8,%rsp
ret

so the -ftrapv introduces an additional mov and a call; I would have
expected that the + would be compiled to an ADD instruction followed
by a JO instruction.

Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
produces ILP32 code) produces a call to __addvsi3 instead of the
expected add instruction:

gcc -O3 -ftrapv          gcc -O3
lui gp,0x0               srl v0,a0,0x1f
addiu gp,gp,0            srl v1,a1,0x1f
addu gp,gp,t9            addu v0,v0,a0
srl v1,a0,0x1f           addu a1,v1,a1
lw t9,__addvsi3(gp)      sra v0,v0,0x1
srl v0,a1,0x1f           sra a1,a1,0x1
addiu sp,sp,-32          jr ra
addu a0,v1,a0            addu v0,v0,a1
addu a1,v0,a1
sra a0,a0,0x1
sw ra,28(sp)
sw gp,16(sp)
jalr t9
sra a1,a1,0x1
lw ra,28(sp)
jr ra
addiu sp,sp,32

The call costs a lot of overhead.

Architectures without overflow traps are notorious for excess instruction
count when overflow detection is desired or mandated.

It is not easy to see how a tool can avoid false positives and false
negatives and also conveniently optimise and re-arrange code.

It can't. But it does not try to avoid false negatives even when
explicitly asked for trapping on overflow.

Granted, Optimization can do a lot of strange code emission and movement
when one does not care about precise overflow semantics. But, as a whole,
we are a society where we want high HP automobiles more than we want safe automobiles ('we' not including *.gov's).

If some overflow trapping when it can be done without additional
instructions would be preferable over no overflow, gcc would compile
signed adds that survive after optimization into add on MIPS rather
than addu, by default. Given that it does not, the GCC developers
probably found out that it is not preferable. I guess they would get
too many customer complaints, including for "relevant" code, i.e.,
code where the usual "it's UB, so your code is broken" excuse does not
work.

It is much harder than that. For example: does a signed shift left
overflow when significant bits are shifted out ?? What if the
subsequent instruction shifts the result back and the pair are acting
as a bit-field extract ?? My 66000 has bit field extracts for exactly
this reason. Floating-point has a lot of these cases, too.

The fact that they don't even try to make -ftrapv produce efficient
code indicates that there is no "relevant" interest in efficient
-ftrapv. It would be interesting to know who came up with the idea of
adding -ftrapv, and why they are still keeping it.

Compilers have not always been good at taking advantage of all the
features provided by hardware

GCC is pretty good at implementing -fwrapv. For the two examples
above, "gcc -O3 -fwrapv" produces the same code on AMD64 and MIPS as
"gcc -O3".

nor have languages been good at exposing
the possibilities in the language so that programmers can take advantage
of them.

Yes. But I leave that for another day.

A whole new kettle of fish...

- anton

--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 20:26:24 2026

From Newsgroup: comp.arch

On Mon, 25 May 2026 10:23:00 +0200, David Brown wrote:

The hardware, of course, cannot always enable trapping on overflow if it
is going to efficiently support a range of programming languages.

Yes. And I am used to FORTRAN, which did not trap on integer overflows.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 20:32:15 2026

From Newsgroup: comp.arch

On Mon, 25 May 2026 19:20:01 +0000, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:

On 25/05/2026 16:28, Anton Ertl wrote:

Despite their eagerness to "optimize" based on the assumption
that signed integer overflow does not happen, the GCC developers
have avoided making -ftrapv the default, even on platforms like MIPS
and Alpha where the implementation of -ftrapv just means to use
different instructions (e.g., add instead of addu on MIPS, and addv
instead of add on Alpha).

Both architectures got this one wrong--IMO--and so does RISC-V.

You may not have been replying to what Anton Ertl wrote above, since there
was a lot in between that I snipped. But it does mention two architectures that took an approach to trapping on integer overflow... that I also tend
to disagree with.

What I'm used to is the System/360. While it made the mistake of having
two condition code bits instead of NZVC, the idea of having "trap on
overflow" controlled by a bit in the PSW is... what I assumed to be normal
and correct.

I could be wrong, as I haven't examined that approach critically and given full consideration to the alternatives.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon May 25 20:32:15 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> schrieb:

On 24/05/2026 23:39, quadi wrote:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

No, so being able to turn the trap for integer overflow on should
definitely be allowed. But that shouldn't be the default behavior.
Otherwise, programs like random number generators wouldn't work.

John Savard

That does not make sense. Code such as random number generators should
be written so that they are correct in the language they are written in.

In principle, yes.

In practice, people often used whatever "worked" on their systems.
Implementors have a certain right because they control what their
compiler does or does not do. But users did so, as well, with
Numerical Recipes a(n in)famous example.

And yes, this bites people. You can see this at https://gcc.gnu.org/gcc-13/porting_to.html :

# GCC 13 includes new optimizations which may change behavior
# on integer overflow. Traditional code, like linear congruential
# pseudo-random number generators in old programs and relying on
# a specific, non-standard behavior may now generate unexpected
# results. The option -fsanitize=undefined can be used to detect
# such code at runtime.

# It is recommended to use the intrinsic subroutine RANDOM_NUMBER for
# random number generators or, if the old behavior is desired, to use
# the -fwrapv option. Note that this option can impact performance.

If that is C, signed integer overflow is UB while unsigned integers
have wrapping behaviour - thus if your code depends on wrapping, and it
is written in C, it needs to use unsigned types or compiler-specific extensions, flags, etc. (Or C23 ckd_add and other checked arithmetic functions.)

If it is written in Zig, you need to use the specific modulo arithmetic functions even for unsigned arithmetic. If it is written in Java,
signed integer arithmetic is fine.

It all depends on the language and/or any options the language and tools might support - and code should be written to work correctly according
to the language rules.
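
As a minimal sketch of the "write it correctly in the language" route for C: a linear congruential generator that relies only on unsigned wrap-around, which the standard defines. The multiplier and increment are the ones from the well-known "quick and dirty" generator in Numerical Recipes; the function name `lcg_next` is made up here, and the sketch assumes `int` is no wider than 32 bits so that `uint32_t` is not promoted to a signed type.

```c
#include <stdint.h>

/* One step of x -> (a*x + c) mod 2^32.
   The reduction mod 2^32 comes from uint32_t wrap-around, which is
   defined behaviour, so no -fwrapv is needed. */
uint32_t lcg_next(uint32_t state)
{
    return state * UINT32_C(1664525) + UINT32_C(1013904223);
}
```

The same recurrence written with a signed 32-bit type would overflow on most steps and depend on -fwrapv or on whatever the compiler happens to do.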

Fortran has no standard way of implementing this unless you
restrict yourself to sizes which do not overflow a signed integer.
Implementing LCGRNGs was one reason why I pushed for unsigned
arithmetic (modulo 2**n) in Fortran. The attempt failed (not
taken up by WG5 after being endorsed by J3), but I implemented it
for gfortran anyway.

The hardware, of course, cannot always enable trapping on overflow if it
is going to efficiently support a range of programming languages. But
as an optional feature it can be helpful for catching a few bugs in
code, so it can be a good idea (both for signed and unsigned overflow).

Sanitizers are also fairly good now, but of course cost performance.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 20:34:41 2026

From Newsgroup: comp.arch

On Mon, 25 May 2026 16:49:59 +0000, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

RISC-V: add ignores signed overflow, there is no add that traps on
signed overflow (and detecting signed overflow is pretty
involved if both operands are unknown to the compiler).

The worst of all possible semantic encodings

Although I thought that making trapping on fixed-point overflow the
default is a bad idea, I agree that making it impossible to do so, or even test for fixed-point overflow, is a much worse idea.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Mon May 25 20:45:20 2026

From Newsgroup: comp.arch

On Mon, 25 May 2026 16:45:07 +0000, MitchAlsup wrote:

My 66000 has an instruction bit that denotes the signedness of integer calculations {Signed, unSigned}. This bit is available as another OpCode
bit for non-integer calculation instructions.

That's nice. It's not an option I can consider, as having lots of
orthogonal modifiers on instructions would tend to increase their length.
A major goal of the Concertina II, III, and IV architectures is for instructions not to be longer than similar instructions on the Motorola
68020 or the IBM System/360 if at all possible.

Basically, the selling point is... "Your programs only get 10% bigger, if that, and yet you have 32 registers, so they run faster!".

Or they _would_, if the design didn't have so many extra transistors for supporting both IBM-format and Intel-format Decimal Floating Point, old-
style IBM floats, simple floating (You too can work with numbers that go around the world 2 1/2 times!), packed decimal, mixed-radix arithmetic...

But, hey, supporting these things in hardware is faster than doing them in software!

And are people even going to _read_ the part of the manual that
explains... as is noted in the description of the original Concertina architecture...

This chip has 8-way simultaneous multi-threading, but only for programs
which do not make use of extensions to the register set.

Only two programs per core may use the extended register banks with 128 elements.

Only one program per core may use the vector registers for long vector instructions. The 256-bit short vector registers, on the other hand, like
the integer and floating-point registers, are available to all
simultaneous threads.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 25 20:32:35 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
What you do want is compiled code that can trap on overflow and avoid
trapping on overflow without code substitution or being re-compiled.
This way production code can avoid trapping but if the debugger is
turned on, you can trap.

Why do you consider that desirable?

long bar(long x, long y)
{
  return x/2+y/2;
}

...

Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
produces ILP32 code) produces a call to __addvsi3 instead of the
expected add instruction:

gcc -O3 -ftrapv          gcc -O3
lui gp,0x0               srl v0,a0,0x1f
addiu gp,gp,0            srl v1,a1,0x1f
addu gp,gp,t9            addu v0,v0,a0
srl v1,a0,0x1f           addu a1,v1,a1
lw t9,__addvsi3(gp)      sra v0,v0,0x1
srl v0,a1,0x1f           sra a1,a1,0x1
addiu sp,sp,-32          jr ra
addu a0,v1,a0            addu v0,v0,a1
addu a1,v0,a1
sra a0,a0,0x1
sw ra,28(sp)
sw gp,16(sp)
jalr t9
sra a1,a1,0x1
lw ra,28(sp)
jr ra
addiu sp,sp,32

The call costs a lot of overhead.

Architectures without overflow traps are notorious for excess instruction
count when overflow detection is desired or mandated.

MIPS' add traps on overflow. gcc could have emitted almost the same
code for gcc -O3 -ftrapv as for gcc -O3, except that the last
instruction would be an add, not an addu. But apparently nobody gives
a damn about the efficiency of -ftrapv, possibly rightly so.

If some overflow trapping when it can be done without additional
instructions would be preferable over no overflow, gcc would compile
signed adds that survive after optimization into add on MIPS rather
than addu, by default. Given that it does not, the GCC developers
probably found out that it is not preferable. I guess they would get
too many customer complaints, including for "relevant" code, i.e.,
code where the usual "it's UB, so your code is broken" excuse does not
work.

It is much harder than that. For example: does a signed shift left
overflow when significant bits are shifted out ??

-ftrapv specifies trapping on overflow only for additions,
subtractions, and multiplications.
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Mon May 25 16:34:50 2026

From Newsgroup: comp.arch

On 5/25/2026 9:28 AM, Anton Ertl wrote:

David Brown <david.brown@hesbynett.no> writes:

On 24/05/2026 23:39, quadi wrote:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

Most programming environments I have had contact with don't trap on floating-point overflow.

Many just go Inf...

Division by zero is usually handled by going NaN.

Contrast with integer division by zero which does usually trap.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

The question is if an integer overflow means that something went
wrong. Despite their eagerness to "optimize" based on the assumption
that signed integer overflow does not happen, the GCC developers have
avoided making -ftrapv the default, even on platforms like MIPS and
Alpha where the implementation of -ftrapv just means to use different
instructions (e.g., add instead of addu on MIPS, and addv instead of
add on Alpha).

Integer overflow happens far too often for trapping to be a good solution.

We almost need a separate "integer that should not overflow" type, with
more explicit "do something special if it does" semantics.

Though, more likely to be useful would be a "detect if an overflow had happened" mechanism.

errno_t ovfstate;
__int_no_overflow x, y, z;
...
__start_errsense(&ovfstate);
z=x+y;
__end_errsense(&ovfstate);
if(ovfstate&ERRSENSE_FLAG_OVERFLOW)
...

Which would be awkward, but probably more useful than, say, raising a
signal and/or terminating the program.
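
Something close to this can be approximated today with compiler builtins. The following is only a sketch under assumptions: it uses the GCC/Clang extension `__builtin_add_overflow`, and `ovf_state` and `add_sense` are hypothetical stand-ins for the `__start_errsense` / `__end_errsense` idea above, not an existing API.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical sticky overflow state, standing in for ovfstate above. */
typedef struct {
    bool overflow;
} ovf_state;

/* Add that wraps like ordinary 'int' arithmetic in practice, but
   records in *st that an overflow happened instead of trapping. */
int add_sense(int x, int y, ovf_state *st)
{
    int z;
    if (__builtin_add_overflow(x, y, &z))
        st->overflow = true;    /* sticky: never cleared here */
    return z;                   /* wrapped result is still returned */
}
```

The caller clears the flag, runs a block of arithmetic through add_sense, and tests the flag once at the end, which is roughly the start/end bracket in the pseudocode.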

The hardware, of course, cannot always enable trapping on overflow if it
is going to efficiently support a range of programming languages. But
as an optional feature it can be helpful for catching a few bugs in
code, so it can be a good idea (both for signed and unsigned overflow).

This supposedly helpful feature has been neglected by C compiler
developers, and you see in the progression from MIPS (1986) to Alpha
(1992) and then RISC-V (2011) that the hardware architects have
accepted that:

MIPS: add traps on signed overflow, you need to write addu if you
don't want that.

Alpha: add ignores signed overflow, you need to write addv if you want
the trapping.

RISC-V: add ignores signed overflow, there is no add that traps on
signed overflow (and detecting signed overflow is pretty
involved if both operands are unknown to the compiler).
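
To give an idea of what "pretty involved" means, here is a sketch of the check that has to be computed when there are neither flags nor a trapping add and nothing is known about the operands. It is plain C with a made-up function name, and is not meant to show what any particular compiler emits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed 64-bit add overflows exactly when both operands have the same
   sign and the wrapped sum has the opposite sign. The add is done on
   unsigned values so that the wrap-around itself is well defined. */
bool add_overflows(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a;
    uint64_t ub = (uint64_t)b;
    uint64_t sum = ua + ub;
    /* Top bit is set iff sum's sign differs from both a's and b's. */
    return ((ua ^ sum) & (ub ^ sum)) >> 63;
}
```

That is two XORs, an AND and a shift or branch on top of the add itself, against a single instruction on an architecture with a trapping add.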

In practice, given:
We have instructions like ADDW, etc, whose behavior is explicitly to sign-extend the results of 32-bit ADD;
Behavior in practice is often to meticulously follow wrap-on-overflow semantics;
Exceptions to wrap-on-overflow usually exist as edge cases;
Various programs exist that will actively break if wrap-on-overflow is
not the observed behavior in C land;
...

The expectation that 'int' can or should meaningfully do something other than
wrap on overflow is more of a fantasy.

Or like some other "portability boogeymen":
Non two's complement integer arithmetic;
Big endian machines;
Machines that don't allow unaligned loads and stores;
Types with sizes other than the "usually accepted" set;
...

The argument has often been, "but, 64-bit machines might not provide
native 32-bit arithmetic".

But, often in 64-bit machines, a pattern emerges:
Most ops are full 64-bit;
A subset of instructions have variants that produce sign and/or zero
extended results;
The instructions which produce these results, typically being, the ones
needed to preserve the usual wrap-on-overflow semantics in those places
where something could happen that would produce a deviation from the
expected semantics.

The ones that have zero-extension usually treating signed integers as zero-extended.

The reverse has also been done; treating unsigned as sign-extended, as
in the standard RISC-V ABI, but IMO this is stupid. Even in the absence
of a native zero-extension op (as in plain RV64G), the mess that results
from sign-extending unsigned is worse than the cost of explicit zero extension.

Best case here being to keep values using "native extension":
'int' : Always sign extended;
'unsigned int': Always zero extended.
Then 32-bit types are a strict subset of the 64-bit range, and
up-promotion becomes free. Not sure why some people don't see this as
obvious though. Well, and people keep making the choice of adding
garbage edge cases to RISC-V that would have been entirely unnecessary
if people weren't being stupid about the ABI rules.
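
At the C level the "native extension" rule is just the usual value-preserving widening. A minimal sketch, with made-up function names: if a 64-bit register already holds the 32-bit value in the form shown, neither function needs an instruction beyond a register move.

```c
#include <stdint.h>

/* 'int' kept sign-extended: -1 widens to 0xFFFFFFFFFFFFFFFF. */
int64_t widen_signed(int32_t x)
{
    return x;
}

/* 'unsigned int' kept zero-extended: 0xFFFFFFFF widens to
   0x00000000FFFFFFFF. */
uint64_t widen_unsigned(uint32_t x)
{
    return x;
}
```

Under an ABI that keeps unsigned 32-bit values sign-extended in registers, the second function would need an explicit zero-extension.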

But yeah...

But, all this would not be expected to happen unless one accepts that it
is already generally accepted that wrap-on-overflow for 'int' and
similar is the only really practical or viable solution here.

Otherwise, recently:
In my case I decided to live with a "breaking change" in XG3 and to
change some things that may matter later. Then ended up tweaking some
other things on my annoyance list (since I was already breaking existing binaries, better to cluster breakage to a singular event if doing it).

ADD, ADDS.L, and ADDU.L have all been changed from Imm10u/n to Imm10s.
The Imm10u cases are now Imm10s;
The Imm10n sub-case is now dropped/reserved.
May be reused later.
This reclaims 3 out of the 20 Imm10 spots.
Was mostly a case of it being harder to justify the encoding space.
Old behavior will need to remain for XG1 and XG2.
In this case, XG3 will explicitly deviate from XG1 and XG2 here.
Does mean that XG3 now has less ADD/SUB Imm range than XG2, but...
Only goes from 97.1% hit rate to 95.9%,
no significant effect on overall code density.
Could use the RV Imm12 ops (ADDI / ADDIW), but:
Hit rate for the RV ops here is negligible;
Much of these also happen to miss on one or both registers.

The MULS.L and MULU.L ops were also switched to Imm10s.
This means all of the Imm10 ALU ops are now unified on Imm10s.

Relocated TST and TSTN from the F0-8 block (with the XMOV instructions)
to the F0-9 block (with the other CMPxx 3R ops).

A few very rarely used instructions were demoted from 32-bit to 64-bit encodings.

Have experimentally added some 32-bit:
Bcc Rm, Imm6s, (PC, Disp6s)
instructions, where:
Imm6s: Hits ~ 80% of these cases;
Disp6s: Hits ~ 60% of these cases;
Imm5s + Disp7s would hit slightly better, but,
would have needed more new decoder logic...
Resulting in it hitting about half over the:
Bcc Rm, Imm17s, (PC, Disp10s)
Cases, for an overall code-density improvement of ~ 0.5%, ...
Dominant use-case: Final compare-and-branch in a short "for()" loop.
Secondary use-case: Short non-predicated "if()" branches.
But, is out-weighed by said predicated "if()" branches.
Would likely see more use here if not using predication.
If it would have hit for 100% of these, would have saved ~ 1%.

This is debatable.

This reused the encoding spots previously used for the Load-Disp5us ops,
which still exist for XG1 and XG2 (decoder special-case handling), but
were N/A in XG3 (they would be in effect entirely redundant with the
Disp10s forms in XG3; but had non-redundant edge-cases in XG1 and XG2).

Like with the Imm17s+Disp10s ops, these will still depend on the IMMB extension, as they still need the same basic mechanism.

Was a fairly low-priority feature, in any case.

Seemingly running low on obvious optimization paths.

- anton

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 22:49:58 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Mon, 25 May 2026 10:23:00 +0200, David Brown wrote:

The hardware, of course, cannot always enable trapping on overflow if it
is going to efficiently support a range of programming languages.

Yes. And I am used to FORTRAN, which did not trap on integer overflows.

WATfor and WATfive trapped on integer overflows.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 22:51:42 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Mon, 25 May 2026 19:20:01 +0000, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

David Brown <david.brown@hesbynett.no> writes:

On 25/05/2026 16:28, Anton Ertl wrote:

Despite their eagerness to "optimize" based on the assumption
that signed integer overflow does not happen, the GCC developers
have avoided making -ftrapv the default, even on platforms like MIPS
and Alpha where the implementation of -ftrapv just means to use
different instructions (e.g., add instead of addu on MIPS, and addv
instead of add on Alpha).

Both architectures got this one wrong--IMO--and so does RISC-V.

You may not have been replying to what Anton Ertl wrote above, since there was a lot in between that I snipped. But it does mention two architectures that took an approach to trapping on integer overflow... that I also tend
to disagree with.

What I'm used to is the System/360. While it made the mistake of having
two condition code bits instead of NZVC, the idea of having "trap on overflow" controlled by a bit in the PSW is... what I assumed to be normal and correct.

And what My 66000 does....

I purport that ANY Industrial quality ISA should provide a means to
trap on integer overflow.

I could be wrong, as I haven't examined that approach critically and given full consideration to the alternatives.

John Savard

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 22:59:10 2026

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> posted:

David Brown <david.brown@hesbynett.no> schrieb:

On 24/05/2026 23:39, quadi wrote:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an
integer overflow is usually a terrible idea.

So, detecting something went wrong and you should inform the programmer
is a bad idea ???

No, so being able to turn the trap for integer overflow on should
definitely be allowed. But that shouldn't be the default behavior.
Otherwise, programs like random number generators wouldn't work.

John Savard

That does not make sense. Code such as random number generators should
be written so that they are correct in the language they are written in.

In principle, yes.

Principle is better in theory than in practice.

In practice, people often used whatever "worked" on their systems.

Face it, the poor slug writing the code may not have the faintest
grasp at the system qualities we are discussing, and does not care
to learn as long as he can slug through the writing and his
program not blow up catastrophically while it is under his purview.

That defines a lot of what is wrong with SW programming today.

Implementors have a certain right because they control what their
compiler does or does not do.

You would be surprised at how little influence implementors have
on compilers and other software.

But users did so, as well, with
Numerical Recipes a(n in)famous example.

And yes, this bites people. You can see this at https://gcc.gnu.org/gcc-13/porting_to.html :

# GCC 13 includes new optimizations which may change behavior
# on integer overflow. Traditional code, like linear congruential
# pseudo-random number generators in old programs and relying on
# a specific, non-standard behavior may now generate unexpected
# results. The option -fsanitize=undefined can be used to detect
# such code at runtime.

My VAX favorite was:

for( int i = 1; i; i+=i )

Traps instead of exiting the loop normally.
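
For comparison, a sketch of the same loop written so that it is correct by the language rules. With a signed `int` the last doubling overflows, which is undefined in C and is where the trap described above fires; with `unsigned` the value wraps to zero and the loop ends. The function name is made up, and the expected count assumes `unsigned` has no padding bits.

```c
#include <limits.h>

/* Counts how many doublings it takes for a single set bit to be
   shifted out of an unsigned value. */
unsigned count_doublings(void)
{
    unsigned n = 0;
    for (unsigned i = 1; i; i += i)
        n++;
    return n;
}
```

On a machine with 32-bit `unsigned` this returns 32, one iteration per bit position.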

# It is recommended to use the intrinsic subroutine RANDOM_NUMBER for
# random number generators or, if the old behavior is desired, to use
# the -fwrapv option. Note that this option can impact performance.

If that is C, signed integer overflow is UB while unsigned integers
have wrapping behaviour - thus if your code depends on wrapping, and it
is written in C, it needs to use unsigned types or compiler-specific extensions, flags, etc. (Or C23 ckd_add and other checked arithmetic functions.)

If it is written in Zig, you need to use the specific modulo arithmetic functions even for unsigned arithmetic. If it is written in Java,
signed integer arithmetic is fine.

It all depends on the language and/or any options the language and tools might support - and code should be written to work correctly according
to the language rules.

Fortran has no standard way of implementing this unless you
restrict yourself to sizes which do not overflow a signed integer.

Old FORTRAN had no unSigned integer type and no way to avoid overflows.

Implementing LCGRNGs was one reason why I pushed for unsigned
arithmetic (modulo 2**n) in Fortran. The attempt failed (not
taken up by WG5 after being endorsed by J3), but I implemented it
for gfortran anyway.

The hardware, of course, cannot always enable trapping on overflow if it is going to efficiently support a range of programming languages. But
as an optional feature it can be helpful for catching a few bugs in
code, so it can be a good idea (both for signed and unsigned overflow).

Sanitizers are also fairly good now, but of course cost performance.

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 23:00:32 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
What you do want is compiled code that can trap on overflow and avoid
trapping on overflow without code substitution or being re-compiled.
This way production code can avoid trapping but if the debugger is
turned on, you can trap.

Why do you consider that desirable?

So you can debug production/released code to find subtle errors.

long bar(long x, long y)
{
  return x/2+y/2;
}

...

Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
produces ILP32 code) produces a call to __addvsi3 instead of the
expected add instruction:

gcc -O3 -ftrapv          gcc -O3
lui gp,0x0               srl v0,a0,0x1f
addiu gp,gp,0            srl v1,a1,0x1f
addu gp,gp,t9            addu v0,v0,a0
srl v1,a0,0x1f           addu a1,v1,a1
lw t9,__addvsi3(gp)      sra v0,v0,0x1
srl v0,a1,0x1f           sra a1,a1,0x1
addiu sp,sp,-32          jr ra
addu a0,v1,a0            addu v0,v0,a1
addu a1,v0,a1
sra a0,a0,0x1
sw ra,28(sp)
sw gp,16(sp)
jalr t9
sra a1,a1,0x1
lw ra,28(sp)
jr ra
addiu sp,sp,32

The call costs a lot of overhead.

Architectures without overflow traps are notorious for excess instruction
count when overflow detection is desired or mandated.

MIPS' add traps on overflow. gcc could have emitted almost the same
code for gcc -O3 -ftrapv as for gcc -O3, except that the last
instruction would be an add, not an addu. But apparently nobody gives
a damn about the efficiency of -ftrapv, possibly rightly so.

If some overflow trapping when it can be done without additional
instructions would be preferable over no overflow, gcc would compile
signed adds that survive after optimization into add on MIPS rather
than addu, by default. Given that it does not, the GCC developers
probably found out that it is not preferable. I guess they would get
too many customer complaints, including for "relevant" code, i.e.,
code where the usual "it's UB, so your code is broken" excuse does not
work.

It is much harder than that. For example: does a signed shift left
overflow when significant bits are shifted out ??

-ftrapv specifies trapping on overflow only for additions,
subtractions, and multiplications.

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 23:03:03 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> posted:

On Mon, 25 May 2026 16:45:07 +0000, MitchAlsup wrote:

My 66000 has an instruction bit that denotes the signedness of integer calculations {Signed, unSigned}. This bit is available as another OpCode bit for non-integer calculation instructions.

That's nice. It's not an option I can consider, as having lots of
orthogonal modifiers on instructions would tend to increase their length.

And harm instruction Entropy.

A major goal of the Concertina II, III, and IV architectures is for instructions not to be longer than similar instructions on the Motorola 68020 or the IBM System/360 if at all possible.

Basically, the selling point is... "Your programs only get 10% bigger, if that, and yet you have 32 registers, so they run faster!".

Mine are getting 30% smaller and needing fewer instructions at the same
time

Or they _would_, if the design didn't have so many extra transistors for
supporting both IBM-format and Intel-format Decimal Floating Point,
old-style IBM floats, simple floating (You too can work with numbers that
go around the world 2 1/2 times!), packed decimal, mixed-radix arithmetic...

But, hey, supporting these things in hardware is faster than doing them in software!

And are people even going to _read_ the part of the manual that
explains... as is noted in the description of the original Concertina architecture...

This chip has 8-way simultaneous multi-threading, but only for programs which do not make use of extensions to the register set.

Another One Bites the Dust.....

Only two programs per core may use the extended register banks with 128 elements.

Only one program per core may use the vector registers for long vector instructions. The 256-bit short vector registers, on the other hand, like the integer and floating-point registers, are available to all
simultaneous threads.

John Savard


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 23:05:06 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 5/25/2026 9:28 AM, Anton Ertl wrote:

--------------

Integer overflow happens far too often for trapping to be a good solution.

Even on 64-bit variables/machines ??

From BGB@cr88192@gmail.com to comp.arch on Mon May 25 20:02:52 2026

From Newsgroup: comp.arch

On 5/25/2026 3:34 PM, quadi wrote:

On Mon, 25 May 2026 16:49:59 +0000, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

RISC-V: add ignores signed overflow, there is no add that traps on
signed overflow (and detecting signed overflow is pretty
involved if both operands are unknown to the compiler).

The worst of all possible semantic encodings

Although I thought that making trapping on fixed-point overflow the
default is a bad idea, I agree that making it impossible to do so, or even test for fixed-point overflow, is a much worse idea.

Possibly true.

The lack of things like ADD-with-Carry or ADD-with-Overflow are
annoyance points on RISC-V.
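
As an illustration (not from the original post), here is a minimal C sketch of the kind of sequence a compiler has to produce when the ISA offers neither flags nor a trapping add: the overflow condition is recovered from the sign bits of the operands and the wrapped sum. The name `add_overflows` is hypothetical, and the sketch assumes a two's-complement target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: wrapping 64-bit add plus a flag-free overflow test.
   Signed overflow occurred iff both operands have the same sign and the
   wrapped sum has the opposite sign. */
static bool add_overflows(int64_t a, int64_t b, int64_t *sum)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t us = ua + ub;          /* unsigned add wraps, no UB */

    /* Conversion back to signed is implementation-defined before C23;
       it is modular on the usual two's-complement compilers. */
    *sum = (int64_t)us;

    /* Top bit set iff the sum's sign differs from both operands' signs. */
    return (((ua ^ us) & (ub ^ us)) >> 63) != 0;
}
```

That is one add, two XORs, an AND, and a shift or branch per checked addition, which is the overhead being complained about here.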

Though, it is less obvious what a useful behavior is at the language level:
"signal()" ? ...
Something like try/catch (mostly N/A to C)?
Something similar to FENV_ACCESS?
...

Well, and that if trapping were applied globally:
Overhead due to trap detection/handling code causing excessive bloat;
Overflow traps from any code that naively assumes wrap-on-overflow
semantics;
...

In some codebases, it is already enough of a pain to hunt and fix all
the out-of-bounds and uninitialized variables mess.
Signed integer overflows would likely "turn it up to 11";
Then, how does one fix it? Ask that people start adding a bunch of casts
to make it work?...

One might say:
Add "if()" cases to deal with the overflows, but, ... this only makes
sense for cases where the overflows are not the expected behavior.

Then again, could maybe classify code, say:
1, signed, value doesn't (or shouldn't) go out-of-range;
2, unsigned, value doesn't (or shouldn't) go out-of-range;
3, signed, value is expected to be modulo;
4, unsigned, value is expected to be modulo.

"nasal demons" types assume 1 and 4 as dominant.
Or, 1 as exclusive vs 3.

For compilers, we often need to assume 3 and 4.
Because, failure to uphold 3 results in misbehaving programs.
And, if 3 were uncommon, RISC-V's "ADDW"/etc would be pure stupidity.
Instead:
Something like plain ADD plus ADDWU would have made sense.
But, they dropped ADDWU instead (also stupid IMO).

While, granted, a lot of 1 code likely exists, 3 code tends to generate
the vast majority of overflows; and if there is any reasonable
expectation for 'int' to overflow, and it is not desired for int to
overflow.

We mostly ignore 2 vs 4, because standard specifies 4 making 2 to be
purely a programming error, in which case "2" becomes "should have used
a bigger signed type instead".

Then again, could maybe make sense to add a semantic distinction, say:
"int" (plain):
Maybe a case could be made that overflow be assumed unexpected.
"signed int":
Maybe make separate from plain case, explicitly modulo;
So, could be made distinct;
Explicitly like the "unsigned" case in being modulo.
"unsigned int":
Remains the same, no real controversy here.

Or, say:
char, short, int, long, long long:
For code, assume that overflow may be unexpected / undesirable;
signed char, signed int, signed long, signed long long:
Assume signed modulo;
Compiler should, ideally, always produce wrap-on-overflow semantics.
unsigned ...:
Unsigned modulo.

For a compiler, then:
-ftrapv:
May ideally trap on lack of "signed";
Explicit "signed", continues to wrap.
-fwrapv:
Both default and signed will wrap.
Neither:
Dunno, probably better for compiler to assume "-fwrapv" semantics;
Maybe assume UB opts are safe if no "signed".

Well, and for the programmer POV:
If assuming maximum portability:
Only unsigned overflow wrapping is "safe".
If assuming "any reasonable system":
Both will wrap in most cases;
Absent "-fwrapv", UB opts may occur in certain obscure edge cases.
Though usually in the form of "early" vs "late" type promotion;
In most cases, where it does occur, early promotion is benign.
Vs whatever "nasal demons" people may assert.
What else, that it late promotes?
(as "-fwrapv" semantics would dictate...)

Like, say:
int x;
long z;
...
z = 42 - x;
//Oh no! UB opt has turned this into a 64-bit RSUB instruction!

Yeah...
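
To make the "early" versus "late" promotion distinction concrete, here is a small C sketch (an illustration, not BGBCC output). The function names are hypothetical; the late variant emulates -fwrapv behaviour with unsigned arithmetic so that the sketch itself has no undefined behaviour.

```c
#include <stdint.h>

/* "Late" promotion: subtract in 32 bits with wraparound, then widen.
   This is what -fwrapv semantics would dictate for z = 42 - x. */
static int64_t rsub42_late(int32_t x)
{
    uint32_t w = 42u - (uint32_t)x;     /* wraps modulo 2^32 */
    return (int64_t)(int32_t)w;         /* modular narrowing assumed */
}

/* "Early" promotion: widen first, then subtract in 64 bits.
   This is the single 64-bit reverse-subtract. */
static int64_t rsub42_early(int32_t x)
{
    return 42 - (int64_t)x;
}
```

The two agree whenever `42 - x` fits in 32 bits. They differ only for inputs where the 32-bit subtraction overflows, for example `x == INT32_MIN`, where the early form yields 2147483690 and the late form yields -2147483606.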

Granted, ATM, for BGBCC, wouldn't make much difference at present. Could
maybe make sense to add a distinction either to strengthen semantic
analysis, or if I decided to change away from my existing "assume wrap
on overflow semantics as sole option" policy. Or maybe adding an
"-fno-wrapv" option, with "wrapv" remaining default but allowing an
option to opt-out, sort of like how there is an "-fptropts" option to
"opt into" strict-aliasing / TBAA semantics, vs the default semantics of "assume every explicit store may alias" semantics. Though, may still
assume that loads may be cached and reordered, unless "volatile" is
used, which explicitly disallows caching and reordering loads, though at present is a little "shotgun" and will basically disable caching
throughout the whole basic block; which works as a detractor to the
"casually use volatile as a way to dispel TBAA" interpretation (works on
GCC, and is less adverse for performance than the "use memcpy" option on
some other compilers, ...).

Or, say:
Bare pointer cast and deref:
GCC: averse (falls afoul of default semantics);
MSVC: benign;
BGBCC: benign.
Volatile pointer cast and deref:
GCC: benign (doesn't use TBAA on volatile pointers);
MSVC: benign;
BGBCC: detrimental, disables caching and ld/st reordering;
Using memcpy:
GCC: benign;
MSVC:
Old (15+ years):
Averse (actually calls memcpy, significant impact);
Some intermediate versions would do an inline for "REP MOVSB".
Also kinda crap, but less bad vs calling "memcpy()".
Mostly only matters if still targeting WinXP or similar.
Newer: Mild detriment in some cases.
Inline loads/stores
may fail to optimize to plain register moves for locals.
BGBCC;
Mostly similar to newer MSVC here;
Works, just less efficient than plain "cast and deref".

...


From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon May 25 15:27:29 2026

From Newsgroup: comp.arch

An awkward thing about using trap on overflow is determining how
precisely it is defined.

Indeed, this is a nasty part of language design.

[ IMO, the only sane choice (beside wrapping and explicit `ckd_add`) is
to treat overflow not as an exception (in the sense of `try..catch`
thingies, not in the CPU hardware sense of the word) but as an
execution error comparable to memory exhaustion. ]
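
A minimal C sketch of the two options named above (an illustration, not from the original post): an explicit checked add, and a wrapper that treats overflow as a fatal execution error rather than something to catch. It uses the GCC/Clang builtin `__builtin_add_overflow` in place of C23's `ckd_add` from `<stdckdint.h>`, so it assumes one of those compilers; `checked_add` and `add_or_die` are hypothetical names.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Explicit checked add: returns true on success, false on overflow.
   __builtin_add_overflow is a GCC/Clang extension, not ISO C. */
static bool checked_add(long a, long b, long *out)
{
    return !__builtin_add_overflow(a, b, out);
}

/* Overflow as an execution error: no recovery path, just stop,
   much as a program might on memory exhaustion. */
static long add_or_die(long a, long b)
{
    long r;
    if (!checked_add(a, b, &r)) {
        fputs("fatal: signed overflow in add\n", stderr);
        abort();
    }
    return r;
}
```

With `checked_add` the caller decides what to do at each site; with `add_or_die` there is nothing to decide, which is the point of the second model.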

Luckily, for `comp.arch` the same problem doesn't plague ISAs because
it's accepted that a CPU should stick religiously to the literal
semantics of the machine code, no matter how far it is from what
really happens inside the machine.

=== Stefan

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue May 26 05:39:02 2026

From Newsgroup: comp.arch

quadi <quadibloc@ca.invalid> schrieb:

On Mon, 25 May 2026 10:23:00 +0200, David Brown wrote:

The hardware, of course, cannot always enable trapping on overflow if it
is going to efficiently support a range of programming languages.

Yes. And I am used to FORTRAN, which did not trap on integer overflows.

Incorrect.

Integer overflow is illegal in Fortran, so what the compiler then
does is not determined (see my post on random number generators).

Example:

$ cat overfl.f90
program main
integer :: a, b
a = 12345678
b = 2345678
print *,a*b
end program main
$ gfortran -fsanitize=undefined overfl.f90
$ ./a.out
overfl.f90:5:13: runtime error: signed integer overflow: 12345678 * 2345678 cannot be represented in type 'integer(kind=4)'
-1979197244
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From David Brown@david.brown@hesbynett.no to comp.arch on Tue May 26 08:18:17 2026

From Newsgroup: comp.arch

On 25/05/2026 18:43, Anton Ertl wrote:

David Brown <david.brown@hesbynett.no> writes:

On 25/05/2026 16:28, Anton Ertl wrote:

Despite their eagerness to "optimize" based on the assumption
that signed integer overflow does not happen, the GCC developers have
avoided making -ftrapv the default, even on platforms like MIPS and
Alpha where the implementation of -ftrapv just means to use different
instructions (e.g., add instead of addu on MIPS, and addv instead of
add on Alpha).

An awkward thing about using trap on overflow is determining how
precisely it is defined. Supposing you have the expression "a + b - a".
Perhaps "a + b" overflows. I would hope that when using debug-related
compiler flags such as "-fsanitize=signed-integer-overflow", a compiler
would check for overflow on "a + b", and report it at runtime.
(Unfortunately, gcc does not do that unless the partial expression is
assigned to a variable.) But in "normal" usage, I'd expect the
expression to be simplified, resulting in just "b" and no overflow.

OTOH, cases like a+b+c where the result is in range, while an
intermediate result is out of range are one of the reasons why I
prefer -fwrapv over -ftrapv. As for your preference of nasal demons,
given enough information, the compiler might "optimize" "a+b-a" into,
e.g., 0.

Anyway, the definition of -ftrapv is not very precise; for gcc-12.2:

|'-ftrapv'
| This option generates traps for signed overflow on addition,
| subtraction, multiplication operations.

My understanding is that the GCC developers would rather deprecate
-ftrapv entirely, and encourage the use of -fsanitize instead as a way
to detect run-time errors. I don't know the details of the internals,
but I believe the GCC developers see the sanitize options as more
accurate and more likely to be further developed in the future.

As for what gcc-12.2 does for your example on AMD64:

long foo(long a, long b)
{
return a+b-a;
}

is compiled with gcc -O3 -ftrapv to:

0: 48 89 f0 mov %rsi,%rax
3: c3 ret

If "trap on overflow" has precise semantics in the code, then this
disables a range of useful optimisations and re-arrangements. If it is
just "use trapping arithmetic instructions", then it will miss many
possible cases of actual overflow in the code, which we might want to
catch.

Which would you prefer by default?

I don't know for sure. A "by default" choice has to be suitable for a
wide variety of users and a wide variety of cases, and preferably err on
the side of caution. For my own personal use, I'm happy with UB
overflow and would have preferred that as the default even for unsigned arithmetic (but of course with a way to specify wrapping when I need
it). But that's for /my/ use - I don't think that should necessarily be
the default for others. Let those who are willing to spend the time and effort learning the details and the care needed use compiler flags to
get the highest efficiency from their code, and let the defaults help
others catch their bugs. However, the logical endpoint of that is that
C should only be used by those that have a detailed understanding of the language and need it for peak efficiency, while other programmers should
work with other languages that have more error handling.

The gcc developers apparently took the latter approach, even when you
ask for -ftrapv explicitly. So what, IYO, speaks against doing that
by default on machines like MIPS and Alpha.

And "trap on overflow" might either trigger when there is no
overflow in the original code, or hinder optimisations. (Consider the
expression "x / 2 + y / 2" - the compiler could implement that as a
combined "(x + y) / 2", but that might introduce overflow.)

x/2+y/2 produces a different result from (x+y)/2 when both x and y are
odd integers.

True. Can we pretend that is not the case, and still see my point? The
point is that the compiler can, during re-arrangements, introduce new overflows as long as it knows the final results are correct (since the compiler knows the details of how instructions are actually implemented).
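
For reference, the difference between the two forms is easy to check in C (an illustrative sketch; the function names are hypothetical). Integer division truncates toward zero in C99 and later.

```c
/* Halve each operand first: each division truncates separately,
   and the sum of two halves can never overflow. */
static long half_sum_split(long x, long y)
{
    return x / 2 + y / 2;
}

/* Add first, then halve: one truncation, but x + y may overflow. */
static long half_sum_joined(long x, long y)
{
    return (x + y) / 2;
}
```

For `x = 3, y = 5` the split form gives 3 and the joined form gives 4, while for two even inputs they agree. The split form is also well defined for `x = y = LONG_MAX`, where the joined form overflows in the addition.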

gcc-12.2 compiles

long bar(long x, long y)
{
return x/2+y/2;
}

on AMD64 to:

gcc -O3 -ftrapv              gcc -O3
mov  %rdi,%rax               mov  %rdi,%rax
sub  $0x8,%rsp               mov  %rsi,%rdx
shr  $0x3f,%rax              shr  $0x3f,%rax
add  %rax,%rdi               shr  $0x3f,%rdx
mov  %rsi,%rax               add  %rdi,%rax
shr  $0x3f,%rax              add  %rsi,%rdx
sar  %rdi                    sar  %rax
add  %rax,%rsi               sar  %rdx
sar  %rsi                    add  %rdx,%rax
call __addvdi3@PLT           ret
add  $0x8,%rsp
ret

so the -ftrapv introduces an additional mov and a call; I would have
expected that the + would be compiled to an ADD instruction followed
by a JO instruction.

Trying the same on a MIPS64 machine with gcc-8.3 (which apparently
produces ILP32 code) produces a call to __addvsi3 instead of the
expected add instruction:

gcc -O3 -ftrapv              gcc -O3
lui   gp,0x0                 srl   v0,a0,0x1f
addiu gp,gp,0                srl   v1,a1,0x1f
addu  gp,gp,t9               addu  v0,v0,a0
srl   v1,a0,0x1f             addu  a1,v1,a1
lw    t9,__addvsi3(gp)       sra   v0,v0,0x1
srl   v0,a1,0x1f             sra   a1,a1,0x1
addiu sp,sp,-32              jr    ra
addu  a0,v1,a0               addu  v0,v0,a1
addu  a1,v0,a1
sra   a0,a0,0x1
sw    ra,28(sp)
sw    gp,16(sp)
jalr  t9
sra   a1,a1,0x1
lw    ra,28(sp)
jr    ra
addiu sp,sp,32

The call costs a lot of overhead.

Agreed. I don't know why GCC uses a function call here. In my quick
godbolt testing, clang uses the "add, jump-on-overflow" sequence.

Using

-fsanitize=signed-integer-overflow -fsanitize-trap

gives an add followed by a jump-on-overflow sequence.

It is not easy to see how a tool can avoid false positives and false
negatives and also conveniently optimise and re-arrange code.

It can't. But it does not try to avoid false negatives even when
explicitly asked for trapping on overflow.

If some overflow trapping when it can be done without additional
instructions would be preferable over no overflow, gcc would compile
signed adds that survive after optimization into add on MIPS rather
than addu, by default. Given that it does not, the GCC developers
probably found out that it is not preferable. I guess they would get
too many customer complaints, including for "relevant" code, i.e.,
code where the usual "it's UB, so your code is broken" excuse does not
work.

If "-ftrapv" is to have any use at all, then overflow is no longer UB -
it has to be defined to trap. But I have to conclude that in GCC,
-ftrapv is too vaguely defined and too inconsistently and inefficiently implemented to be of any use. This matches my understanding that the "-fsanitize=signed-integer-overflow -fsanitize-trap" flags are preferred
by the GCC developers.

The fact that they don't even try to make -ftrapv produce efficient
code indicates that there is no "relevant" interest in efficient
-ftrapv. It would be interesting to know who came up with the idea of
adding -ftrapv, and why they are still keeping it.

Compilers have not always been good at taking advantage of all the
features provided by hardware

GCC is pretty good at implementing -fwrapv. For the two examples
above, "gcc -O3 -fwrapv" produces the same code on AMD64 and MIPS as
"gcc -O3".

That is my experience too (though I expect your experience here vastly outweighs mine).

nor have languages been good at exposing
the possibilities in the language so that programmers can take advantage
of them.

Yes. But I leave that for another day.

Good idea :-)


From David Brown@david.brown@hesbynett.no to comp.arch on Tue May 26 08:27:28 2026

From Newsgroup: comp.arch

On 26/05/2026 01:00, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
What you do want is compiled code that can trap on overflow and avoid
trapping on overflow without code substitution or being re-compiled.
This way production code can avoid trapping but if the debugger is
turned on, you can trap.

Why do you consider that desirable?

So you can debug production/released code to find subtle errors.

I think that when an unexpected error is detected (whether it is with
hardware acceleration, like trap on overflow, or via explicit generated
code), the way to handle it depends strongly on the situation. If a
debugger is present, then it is most helpful to lead to a debugger break
so that the developer can figure out what went wrong. When not
debugging, there is no sensible default handling that works for jet
engine controllers and video game frame generators.

But I do support the aim of having the same generated code when
debugging and when shipping - I am not a fan of "release" builds and
"debug" builds. (Of course you might temporarily do builds with
different flags while chasing down a particular bug.)


From quadi@quadibloc@ca.invalid to comp.arch on Tue May 26 15:13:31 2026

From Newsgroup: comp.arch

On Sun, 24 May 2026 16:39:25 +0000, quadi wrote:

On Sun, 24 May 2026 15:24:22 +0000, John Levine wrote:

Sure they did. S/360 had separate unsigned versions of add and subtract
instructions. The results were the same but the condition codes were
different and the unsigned versions couldn't overflow.

Ah, I didn't remember that!

I just looked it up. It was, and is, the Add Logical instruction.

John Savard

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 26 18:02:51 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 5/25/2026 3:34 PM, quadi wrote:

On Mon, 25 May 2026 16:49:59 +0000, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

RISC-V: add ignores signed overflow, there is no add that traps on
signed overflow (and detecting signed overflow is pretty
involved if both operands are unknown to the compiler).

The worst of all possible semantic encodings

Although I thought that making trapping on fixed-point overflow the
default is a bad idea, I agree that making it impossible to do so, or even test for fixed-point overflow, is a much worse idea.

Possibly true.

The lack of things like ADD-with-Carry or ADD-with-Overflow are
annoyance points on RISC-V.

Though, it is less obvious what a useful behavior is at the language level:
"signal()" ? ...
Something like try/catch (mostly N/A to C)?
Something similar to FENV_ACCESS?
...

The important property is that overflow is detected precisely.
Whether {trap, signal, throw} is performed is an environmental choice
not an ISA choice.

Well, and that if trapping were applied globally:
Overhead due to trap detection/handling code causing excessive bloat;
Overflow traps from any code that naively assumes wrap-on-overflow
semantics;
...

In some codebases, it is already enough of a pain to hunt and fix all
the out-of-bounds and uninitialized variables mess.
Signed integer overflows would likely "turn it up to 11";
Then, how does one fix it? Ask that people start adding a bunch of casts
to make it work?...

One might say:
Add "if()" cases to deal with the overflows, but, ... this only makes
sense for cases where the overflows are not the expected behavior.

If(overflow(??)) requires some flag to carry overflow from point of
detection to if(()).

And what happens if there is more than 1 overflow ??

Then again, could maybe classify code, say:
1, signed, value doesn't (or shouldn't) go out-of-range;
2, unsigned, value doesn't (or shouldn't) go out-of-range;
3, signed, value is expected to be modulo;
4, unsigned, value is expected to be modulo.

5, a language hint about in-range, wrap, trap, signal, throw

"nasal demons" types assume 1 and 4 as dominant.
Or, 1 as exclusive vs 3.

For compilers, we often need to assume 3 and 4.
Because, failure to uphold 3 results in misbehaving programs.
And, if 3 were uncommon, RISC-V's "ADDW"/etc would be pure stupidity.

You would prefer::

AND R7,Rleft,#~(~0<<31)
AND R8,Rright,#~(~0<<31)
ADD Rd,R7,R8
AND Rd,Rd,#~(~0<<31)

That is ADDW range limits operands and performs a shorter ADD.
Matching C's int a,b; semantic. In general the integer instructions
ending with W apply C's int properties to the arithmetic. If compilers
were (WERE) really good at range determination those instructions would
be unnecessary--but they are not.

I (My 66000) had to put in sized integer calculation reasons, and by
doing so, gained 2%-4% in code density and a bit more in latency.
-----------------------

From BGB@cr88192@gmail.com to comp.arch on Tue May 26 14:28:56 2026

From Newsgroup: comp.arch

On 5/26/2026 1:02 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 5/25/2026 3:34 PM, quadi wrote:

On Mon, 25 May 2026 16:49:59 +0000, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

RISC-V: add ignores signed overflow, there is no add that traps on
signed overflow (and detecting signed overflow is pretty
involved if both operands are unknown to the compiler).

The worst of all possible semantic encodings

Although I thought that making trapping on fixed-point overflow the
default is a bad idea, I agree that making it impossible to do so, or even test for fixed-point overflow, is a much worse idea.

Possibly true.

The lack of things like ADD-with-Carry or ADD-with-Overflow are
annoyance points on RISC-V.

Though, it is less obvious what a useful behavior is at the language level:
"signal()" ? ...
Something like try/catch (mostly N/A to C)?
Something similar to FENV_ACCESS?
...

The important property is that overflow is detected precisely.
Whether {trap, signal, throw} is performed is an environmental choice
not an ISA choice.

Yeah.

Say:
ADDV Rs, Rt, Rd
BT __trap_overflow

Which is how I would assume doing it, if I were to re-add ADDV to my ISA
(this had existed in SuperH and BJX1, but got lost along the way, but
could re-add if needed; just it was less often needed than even ADC/ADDC).

Well, and that if trapping were applied globally:
Overhead due to trap detection/handling code causing excessive bloat;
Overflow traps from any code that naively assumes wrap-on-overflow
semantics;
...

In some codebases, it is already enough of a pain to hunt and fix all
the out-of-bounds and uninitialized variables mess.
Signed integer overflows would likely "turn it up to 11";
Then, how does one fix it? Ask that people start adding a bunch of casts
to make it work?...

One might say:
Add "if()" cases to deal with the overflows, but, ... this only makes
sense for cases where the overflows are not the expected behavior.

If(overflow(??)) requires some flag to carry overflow from point of
detection to if(()).

And what happens if there is more than 1 overflow ??

Dunno.
You would need to set a start point and an end/detection point, and have
some way for the compiler to know to track overflows.

Say:
ADDV ...
OR?T Re, 0x100, Re

Then a way to feed Re back into C land to act upon.

There could maybe either be a 32-bit variant (ADDV.L), or some shorthand
way to detect that the value has gone outside of 32-bit range.
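
At the C level, the "start point / detection point" idea can be sketched as a sticky flag that every checked operation ORs into and that is tested once at the end. This is an illustration only, not BGBCC or any existing ISA feature; `sum_sticky` is a hypothetical name and `__builtin_add_overflow` is a GCC/Clang extension.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sum an array with wrapping adds, accumulating every overflow
   indication into one sticky flag (the role of Re above). */
static bool sum_sticky(const long *v, size_t n, long *out)
{
    long acc = 0;
    bool sticky = false;

    for (size_t i = 0; i < n; i++) {
        /* The builtin stores the wrapped sum and returns true on overflow. */
        sticky |= __builtin_add_overflow(acc, v[i], &acc);
    }

    *out = acc;
    return sticky;      /* true if at least one add overflowed */
}
```

A sticky flag answers "did anything overflow in this region?" but not "how many times?" or "where?". For the input `{LONG_MAX, 1, -1}` two adds overflow and the wrapped result ends up back at `LONG_MAX`, yet the caller only sees a single true.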

Then again, could maybe classify code, say:
1, signed, value doesn't (or shouldn't) go out-of-range;
2, unsigned, value doesn't (or shouldn't) go out-of-range;
3, signed, value is expected to be modulo;
4, unsigned, value is expected to be modulo.

5, a language hint about in-range, wrap, trap, signal, throw

Well, possible, but C doesn't have any hints here...

But, yeah:
Leaving plain 'int' as the "probably shouldn't overflow" and 'signed
int' and 'unsigned int' as "wrap on overflow expected" could make sense.

"nasal demons" types assume 1 and 4 as dominant.
Or, 1 as exclusive vs 3.

For compilers, we often need to assume 3 and 4.
Because, failure to uphold 3 results in misbehaving programs.
And, if 3 were uncommon, RISC-V's "ADDW"/etc would be pure stupidity.

You would prefer::

AND R7,Rleft,#~(~0<<31)
AND R8,Rright,#~(~0<<31)
ADD Rd,R7,R8
AND Rd,Rd,#~(~0<<31)

That is ADDW range limits operands and performs a shorter ADD.
Matching C's int a,b; semantic. In general the integer instructions
ending with W apply C's int properties to the arithmetic. If compilers
were (WERE) really good at range determination those instructions would
be unnecessary--but they are not.

I (My 66000) had to put in sized integer calculation reasons, and by
doing so, gained 2%-4% in code density and a bit more in latency.
-----------------------

OK.

Ironically, the 4-op sequence above would have been a single "ADDWU" instruction in the RV BitManip drafts, but ADDWU was dropped as arguably
it didn't make a big enough difference on SPEC scores. They decided to
keep a whole bunch of other random crap though that serves no real
purpose other than to micro-optimize the benchmarks...

I revived this for my own extensions, but left out ADDIWU as it was
still not common enough to justify the encoding space cost (if one has jumbo-prefixes, this could be handled well enough via
immediate-synthesis, and the 64-bit encoding wasn't too bad for
something that is comparably infrequent).
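
For readers less familiar with these instructions, here is a C model of the semantics under discussion (a sketch, not taken from any specification text). `addw` models RV64's ADDW: add the low 32 bits and sign-extend the result. `addwu` models ADDWU as I understand the BitManip draft being described: the same 32-bit add with the result zero-extended. Both function names are just labels for this illustration.

```c
#include <stdint.h>

/* ADDW model: 32-bit wrapping add, result sign-extended to 64 bits.
   Matches wrap-on-overflow 'int' arithmetic held in a 64-bit register. */
static int64_t addw(int64_t rs1, int64_t rs2)
{
    uint32_t lo = (uint32_t)rs1 + (uint32_t)rs2;    /* wraps modulo 2^32 */
    return (int64_t)(int32_t)lo;                    /* modular narrowing assumed */
}

/* ADDWU model (assumed draft semantics): 32-bit wrapping add,
   result zero-extended to 64 bits, as for 'unsigned int'. */
static uint64_t addwu(uint64_t rs1, uint64_t rs2)
{
    uint32_t lo = (uint32_t)rs1 + (uint32_t)rs2;
    return (uint64_t)lo;
}
```

The upper 32 bits of the inputs are ignored in both cases; only the extension of the result differs, so `addw(0x7FFFFFFF, 1)` is negative while `addwu(0x7FFFFFFF, 1)` is `0x80000000`.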

...


From George Neuner@gneuner2@comcast.net to comp.arch on Tue May 26 15:29:08 2026

From Newsgroup: comp.arch

On Mon, 25 May 2026 23:05:06 GMT, MitchAlsup
<user5857@newsgrouper.org.invalid> wrote:

BGB <cr88192@gmail.com> posted:

On 5/25/2026 9:28 AM, Anton Ertl wrote:

--------------

Integer overflow happens far too often for trapping to be a good solution.

Even on 64-bit variables/machines ??

Yes if there are options for 8/16/32 bit ops in 64 bit registers.

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue May 26 22:09:28 2026

From Newsgroup: comp.arch

David Brown wrote:

On 26/05/2026 01:00, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
What you do want is compiled code that can trap on overflow and avoid
trapping on overflow without code substitution or being re-compiled.
This way production code can avoid trapping but if the debugger is
turned on, you can trap.

Why do you consider that desirable?

So you can debug production/released code to find subtle errors.

I think that when an unexpected error is detected (whether it is with hardware acceleration, like trap on overflow, or via explicit generated code), the way to handle it depends strongly on the situation. If a debugger is present, then it is most helpful to lead to a debugger break
so that the developer can figure out what went wrong. When not
debugging, there is no sensible default handling that works for jet
engine controllers and video game frame generators.

But I do support the aim of having the same generated code when
debugging and when shipping - I am not a fan of "release" builds and
"debug" builds. (Of course you might temporarily do builds with
different flags while chasing down a particular bug.)

I tend to like "Release with sometimes hard-to-grok debug info",
typically resulting in a separate file with a best effort debug map of
the executable.
Then I can at least get some help when running the debugger and trying
to binary search my way into the spot where the bug resides.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 26 20:54:30 2026

From Newsgroup: comp.arch

Terje Mathisen <terje.mathisen@tmsw.no> posted:

David Brown wrote:

On 26/05/2026 01:00, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
What you do want is compiled code that can trap on overflow and avoid
trapping on overflow without code substitution or being re-compiled.
This way production code can avoid trapping but if the debugger is
turned on, you can trap.

Why do you consider that desirable?

So you can debug production/released code to find subtle errors.

I think that when an unexpected error is detected (whether it is with hardware acceleration, like trap on overflow, or via explicit generated code), the way to handle it depends strongly on the situation. If a debugger is present, then it is most helpful to lead to a debugger break so that the developer can figure out what went wrong. When not debugging, there is no sensible default handling that works for jet
engine controllers and video game frame generators.

But I do support the aim of having the same generated code when
debugging and when shipping - I am not a fan of "release" builds and "debug" builds. (Of course you might temporarily do builds with different flags while chasing down a particular bug.)

I tend to like "Release with sometimes hard-to-grok debug info",
typically resulting in a separate file with a best effort debug map of
the executable.

Encrypt the debug information (and put it in a {1234-5678-9101-1121-...} folder) so that only the owner (not licensee) of the code can debug
it.

Then I can at least get some help when running the debugger and trying
to binary search my way into the spot where the bug resides.

Terje


From BGB@cr88192@gmail.com to comp.arch on Tue May 26 19:13:21 2026

From Newsgroup: comp.arch

On 5/26/2026 2:29 PM, George Neuner wrote:

On Mon, 25 May 2026 23:05:06 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

BGB <cr88192@gmail.com> posted:

On 5/25/2026 9:28 AM, Anton Ertl wrote:

--------------

Integer overflow happens far too often for trapping to be a good solution.

Even on 64-bit variables/machines ??

Yes if there are options for 8/16/32 bit ops in 64 bit registers.

32-bit overflow is the dominant scenario here.
While 8 and 16-bit ranges do overflow readily, the normal semantics are
for them to auto-promote to 32 bits before then being narrowed back down
to 8 or 16 bits, so they don't count.
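
A short C sketch of that promote-then-narrow behaviour (illustrative only; the function names are hypothetical, and the sketch assumes the usual 32-bit `int`):

```c
#include <stdint.h>

/* Both operands promote to int, so the add itself cannot overflow;
   the narrowing to uint8_t reduces the result modulo 256. */
static uint8_t add_u8(uint8_t a, uint8_t b)
{
    int wide = a + b;           /* range 0..510 */
    return (uint8_t)wide;
}

/* Same for int16_t: the add is done in int. The narrowing conversion of
   an out-of-range value is implementation-defined (modular on the usual
   compilers), not undefined behaviour. */
static int16_t add_i16(int16_t a, int16_t b)
{
    int wide = a + b;           /* range -65536..65534 */
    return (int16_t)wide;
}
```

So `add_u8(200, 100)` yields 44 without any arithmetic overflow having occurred in the C sense, which is why these cases do not count as overflows for trapping purposes.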

Ironically, for my BS2 language, the semantics were in cases like this
to instead auto-promote to 64 bits; but can't really do this for C as it
gives different results in some cases (and early promotion is itself a
bug, even if early promotion would often be the most natural semantics
for a 64-bit machine).

Well, and there is the usual thing that one can't usually allow a
variable to hold values outside the range of what would be allowed for
that variable.

Well, except for floating-point types, where typically code doesn't care
about out of ranges of values (if a value fails to go to 0 or Inf in a computation in local variables, typically no one cares).

For float, it isn't obvious because the dynamic range of Binary32 is
already quite large. A "short float" effectively having Binary64's
dynamic range when used in scalar computations is a bit incredulous, but
given these smaller formats are non-standard anyways, it is reasonable to
be like "these formats are only necessarily confined to their formal
range when in-memory, otherwise all bets are off".

Or: precision and dynamic range >= requested format.
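A rough sketch of what "confined to the formal range only in memory" can mean in practice. C has no portable "short float", so float stands in for the narrow format and double for the wider working format; the function names are hypothetical.

```c
#include <assert.h>
#include <float.h>

/* Assumption: the wide format can hold the product of any two narrow values. */
static_assert(DBL_MAX_EXP > 2 * FLT_MAX_EXP, "double must hold any float product");

/* (a * b) / c with the intermediate forced back to float width, as a
   format strictly confined to its formal range would behave. */
static float muldiv_narrow(float a, float b, float c)
{
    volatile float t = a * b;   /* volatile forces a store at float width */
    return t / c;
}

/* The same computation with the intermediate held in double; only the
   final result is narrowed to float. */
static float muldiv_wide(float a, float b, float c)
{
    double t = (double)a * (double)b;
    return (float)(t / (double)c);
}
```

With a = b = c = 1e30f, the narrow version overflows to infinity at the intermediate product, while the wide version returns 1e30f. Code that only works because of the wide path is relying on exactly the extra range that may disappear without warning.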

Code can't entirely rely on the higher precision though, as the format
may also revert to its defined precision without warning (even if
intermediate computations may potentially wildly exceed it).

But, then again, this would be analogous to if one has an FPU with
native Binary128, occasionally performing "double" calculations at
Binary128 precision even though "double" is stated as Binary64.

Well, or implementing some operations by widening temporarily to a higher-precision format before narrowing the result.

Though, OTOH, the main use-case for things like scalar "short float" is
more for saving memory in structs and arrays, not for trying to rely on
its crappy range and precision.

So, floating point math is very different from integer math in this regard.

...

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed May 27 10:59:31 2026

From Newsgroup: comp.arch

MitchAlsup [2026-05-26 20:54:30] wrote:

Encrypt the debug information (and put it in
a {1234-5678-9101-1121-...} folder) so that only the owner (not
licensee) of the code can debug it.

I resent that. All code should be Free Software.

=== Stefan
--- Synchronet 3.22a-Linux NewsLink 1.2

From quadi@quadibloc@ca.invalid to comp.arch on Wed May 27 18:19:49 2026

From Newsgroup: comp.arch

On Wed, 27 May 2026 10:59:31 -0400, Stefan Monnier wrote:

MitchAlsup [2026-05-26 20:54:30] wrote:

Encrypt the debug information (and put it in a
{1234-5678-9101-1121-...} folder) so that only the owner (not
licensee) of the code can debug it.

I resent that. All code should be Free Software.

It is wonderful that we have the open-source software movement.

However, people have the right to the fruit of their labors. To give them
away for free is generous, but it should remain a personal choice.

Of course, copyright has been misused, and deserves a critical
examination, not the sort of uncritical expansion given to it by
legislators in the United States - and imposed on the rest of the world by trade threats.

John Savard
--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Wed May 27 15:24:09 2026

From Newsgroup: comp.arch

On 5/25/2026 5:59 PM, MitchAlsup wrote:

Thomas Koenig <tkoenig@netcologne.de> posted:

David Brown <david.brown@hesbynett.no> schrieb:

On 24/05/2026 23:39, quadi wrote:

On Sun, 24 May 2026 17:32:10 +0000, MitchAlsup wrote:

quadi <quadibloc@ca.invalid> posted:

It makes sense to trap on a floating-point overflow, but trapping on an integer overflow is usually a terrible idea.

So, detecting something went wrong and you should inform the programmer is a bad idea ???

No, so being able to turn the trap for integer overflow on should
definitely be allowed. But that shouldn't be the default behavior.
Otherwise, programs like random number generators wouldn't work.

John Savard

That does not make sense. Code such as random number generators should
be written so that they are correct in the language they are written in.

In principle, yes.

Principle is better in theory than in practice.

In practice, people often used whatever "worked" on their systems.

Face it, the poor slug writing the code may not have the faintest
grasp of the system qualities we are discussing, and does not care
to learn as long as he can slug through the writing and his program
does not blow up catastrophically while it is under his purview.

That defines a lot of what is wrong with SW programming today.

Implementors have a certain right because they control what their
compiler does or does not do.

You would be surprised at how little influence implementors have
on compilers and other software.

Yeah.

You can design the ISA and compiler as one likes.
But, if existing C code breaks, well then this is not good.

One might think:
You know, wrap on overflow, and type promotion where it overflows and
wraps, and *then* promotes to the wider type on the final assignment, is
kinda stupid and sucks.

And, if one goes by "well, signed overflow is UB anyways", then they
should be able to turn it into a "promote first, then ADD" scenario (may
be both potentially faster, and less likely to lose information).

I would be inclined to agree.

But... there is old code around that will quietly break if the integer overflow and promotion doesn't follow the specific behavior that mimics
how it would have behaved on 32-bit systems.
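To make the two orderings concrete, here is a small sketch (function names are illustrative only). The 32-bit wrap is done on uint32_t so the addition itself is well-defined; converting the out-of-range result back to int32_t is implementation-defined before C23 and is assumed here to wrap in two's complement.

```c
#include <assert.h>
#include <stdint.h>

/* "Wrap, then promote": add at 32 bits with wraparound, then widen.
   Assumes the uint32_t -> int32_t conversion wraps (two's complement). */
static int64_t add_wrap_then_widen(int32_t a, int32_t b)
{
    uint32_t wrapped = (uint32_t)a + (uint32_t)b;
    return (int64_t)(int32_t)wrapped;
}

/* "Promote first, then ADD": widen both operands, so nothing is lost. */
static int64_t add_widen_first(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;
    /* Sanity check: the sum of two 32-bit values always fits in this range. */
    assert(wide >= 2 * (int64_t)INT32_MIN && wide <= 2 * (int64_t)INT32_MAX);
    return wide;
}
```

For INT32_MAX + 1 the first version gives INT32_MIN and the second gives 2147483648; for operands that do not overflow, both agree. Old code that depends on the first result is what breaks under early promotion.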

I vaguely remember a case of this involving some robot enemies that
drive around in ROTT, where if the integer overflow failed to work in
just the right way, they would all miss their way-points and end up
crashing into walls or similar.

Where, the robot enemies followed a path defined as a series of
waypoints (in a grid world), and once the robot hits a particular spot
on the grid cell, it will change directions and head along the path.
But, the particular way the expression to handle this was written was sensitive to the type promotion and wrap-on-overflow semantics in C.

Also a similar case involving the "elevators", which were effectively
timed teleporters between different parts of the map (would close door,
play elevator sound, then right at the end as the door opens, it would teleport the player to the other location and initiate a screen shaking
effect at around the same time). If the overflow was wrong, the teleport
would fail and the player would still be in the original location.

One could fix this stuff with casts or similar, but, when does one draw
the line exactly?...

Easier sometimes to make it work, than to try to argue that the code was already broken due to reliance on UB.

Well, and to match the behavior of the other compilers, needed to
implement the behavior the way ROTT expected.

Where, as noted, ROTT uses fixed-point math with "fixed" as a signed
32-bit integer, and some cases involve calculations with coordinates
well outside the world bounds with the seeming intention that these
high-order components simply disappear into the ether (with the world essentially treated as a wrapping modulo space).
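A sketch of the wrapping fixed-point idea, not ROTT's actual code: the 16.16 split and the helper names are assumptions for illustration. The add is done on uint32_t so that the high-order bits fall off in a well-defined way.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed;      /* signed 32-bit fixed-point; 16.16 split assumed */
#define FRAC_BITS 16

static_assert(FRAC_BITS > 0 && FRAC_BITS < 32, "fraction must fit in 32 bits");

/* Wrapping add: computed on uint32_t so overflow is well-defined, then
   converted back to signed (two's-complement conversion assumed). */
static fixed fixed_add_wrap(fixed a, fixed b)
{
    return (fixed)((uint32_t)a + (uint32_t)b);
}

/* Grid cell index in the wrapping space, in the range 0..65535. */
static uint32_t fixed_cell(fixed x)
{
    return (uint32_t)x >> FRAC_BITS;
}
```

Starting at cell 32767 (0x7FFF0000) and moving two cells forward lands on cell 32769 of the modulo space, even though the signed value has gone negative. A compiler that widens before adding would instead keep the out-of-range coordinate.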

But, as noted, it differed from my BS2 language, where the default was effectively to auto-promote values to the widest reasonable integer type
in these cases and then drop down to the final range afterwards (to
avoid some integer overflows in cases they would happen in C).

Well, and within BGBCC, there was some non-zero bleed-over between C and
BS2 (where originally I had been implementing BS2 via BGBCC, with the intention that it would compile to an IL image that would then be run in
the VM).

The original VM however, while fast, ended up with horrible code-bloat.
Had gotten creative with the use of the C preprocessor in ways that were ultimately a terrible idea (errm, trying to use it sorta like a
poor-man's version of C++ templates). Binaries got huge, build times
sucked. This VM was a dead end.

Ironically, some of my current ISA projects were built on some of the groundwork left by this experiment, but also as a warning for something
not to do.

Or, when I learned the merit of actually writing all the opcode handler functions and similar by hand and not trying to do combinatorial stuff
via the preprocessor.

Also for the follow-up VM (for BS2), had gone back to ye-olde stack
machine (vs a Register IR model). But, some parts of this were relevant
to targeting an "actual CPU".

The way JX2VM works isn't too far removed from those VMs in some ways,
apart from JX2VM's general avoidance of getting too clever with the C preprocessor.

...

--- Synchronet 3.22a-Linux NewsLink 1.2
