• "Mini" tags to reduce the number of op codes

    From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Apr 3 09:43:44 2024
    From Newsgroup: comp.arch

    There has been discussion here about the benefits of reducing the number
    of op codes. One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible
    available for future use. Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory. I worked through this idea
    using the My 6600 as an example “substrate” for two reasons. First, it has several features that are “friendly” to the idea. Second, I know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea. It is
    certainly not fully worked out. I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1. Generate an exception.
    2. Use the sense of source 1 for the arithmetic operation, but perform
    the appropriate conversion on the second operand first, potentially
    saving an instruction.
    3. Always do the operation in floating point and convert the integer
    operand prior to the operation. (Or, if you prefer, change floating
    point to integer in the above description.)
    4. Same as 2 or 3 above, but don’t do the conversions.

    I suspect option 4 is the least useful choice. I am not sure which of
    the others is best.
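
    To make the dispatch rule concrete, here is a minimal C sketch of a
    decoder/executor resolving a shared ADD opcode from the tags, using
    option 3 for the mixed case. The tag array, register file, and helper
    names are hypothetical, purely for illustration; the proposal itself
    does not specify any of this:

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        /* Hypothetical model: one tag bit per architectural register,
           true = floating point, false = integer/address/etc. */
        static bool     reg_tag[32];
        static uint64_t reg_file[32];

        /* Read a register as a double: reinterpret if tagged FP,
           convert if tagged integer (option 3). */
        static double as_fp(int r) {
            double d;
            if (reg_tag[r]) { memcpy(&d, &reg_file[r], 8); return d; }
            return (double)(int64_t)reg_file[r];
        }

        /* One shared ADD opcode, resolved by the source tags. */
        void exec_add(int rd, int rs1, int rs2) {
            if (!reg_tag[rs1] && !reg_tag[rs2]) {     /* both integer */
                reg_file[rd] = reg_file[rs1] + reg_file[rs2];
                reg_tag[rd]  = false;
            } else {                                  /* at least one FP */
                double r = as_fp(rs1) + as_fp(rs2);
                memcpy(&reg_file[rd], &r, 8);
                reg_tag[rd] = true;
            }
        }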

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations. We can thus share eight op codes: the
    four arithmetic operations, max, min, abs, and compare. Less the two
    new load instructions, that is a net savings of six op codes.

    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with
    separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost). The same
    mechanism works for interrupts that take control away from a running
    process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it. These should be pretty rare.
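
    Continuing the sketch above, the decoder could special-case those OR
    forms. The zero-source test below assumes a register that reads as zero
    (or an immediate-zero encoding), which is an assumption of this sketch:

        /* OR rd,rs,rs  -> set the FP tag (value unchanged);
           OR rd,rs,r0  -> clear the tag (r0 assumed to read as zero);
           any other OR -> ordinary bitwise OR, tag follows source 1. */
        void exec_or(int rd, int rs1, int rs2) {
            reg_file[rd] = reg_file[rs1] | reg_file[rs2];
            if (rs1 == rs2)    reg_tag[rd] = true;
            else if (rs2 == 0) reg_tag[rd] = false;
            else               reg_tag[rd] = reg_tag[rs1];
        }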

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
    extra cycle per instruction, then it is almost certainly not worth it.
    IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Apr 3 17:24:05 2024
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.
    ...
    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    Certainly makes reading disassembler output fun (or writing the
    disassembler). This reminds me of the work on SafeTSA [amme+01] where
    they encode only programs that are correct (according to some notion
    of correctness).

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with
    separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000
    architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are
    saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost).

    That's expensive in an OoO CPU. There you want each tag to be stored
    alongside the other 64 bits of the register, because they should
    be renamed at the same time. So the ENTER instruction would depend on
    all the registers that it saves (or maybe on all registers). And upon
    EXIT the restored registers have to be reassembled (which is not that
    expensive).

    I have a similar problem for the carry and overflow bits in <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    The same
    mechanism works for interrupts that take control away from a running
    process.

    For context switches one cannot get around the problem, but they are
    much rarer than calls and returns, so requiring a pipeline drain for
    them is not so bad.

    Concerning interrupts, as long as nesting is limited, one could just
    treat the physical registers of the interrupted program as taken, and
    execute the interrupt with the remaining physical registers. No need
    to save any architectural registers or their tag, carry, or overflow
    bits.

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use?

    In an OoO CPU, that's pretty heavy.

    But actually, your idea does not need any computation results for
    determining the tag bits of registers (except during EXIT), so you
    probably can handle the tags in the front end (decoder and renamer).
    Then the tags are really separate and not part of the registers that
    have to be renamed, and you don't need to perform any waiting on
    ENTER.

    However, in EXIT the front end would have to wait for the result of
    the load/store unit loading the 32 bits, unless you add a special
    mechanism for that. So EXIT would become expensive, one way or the
    other.
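
    To illustrate the point, a small C sketch of decode-time tag tracking
    under the proposal above; the opcode names are hypothetical. Every
    instruction's effect on the tags is a pure function of its opcode and
    register fields, so the decoder can keep its own copy of the 32 tag
    bits, with EXIT as the one case that must wait on a load:

        #include <stdbool.h>

        enum opcode { LDD, LDF, OP_SHARED, ENTER, EXIT };

        static bool tags[32];   /* decoder's private copy of the tag bits */

        void decode_update_tags(enum opcode op, int rd, int rs1, int rs2) {
            switch (op) {
            case LDF:       tags[rd] = true;  break; /* FP load sets tag */
            case LDD:       tags[rd] = false; break; /* int load clears it */
            case OP_SHARED: tags[rd] = tags[rs1] || tags[rs2]; break;
            case ENTER:     break; /* tags only written out to memory */
            case EXIT:      break; /* tags reload from memory: must stall */
            }
        }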

    @InProceedings{amme+01,
      author =   {Wolfram Amme and Niall Dalton and Jeffery von Ronne
                  and Michael Franz},
      title =    {Safe{TSA}: A Type Safe and Referentially Secure
                  Mobile-Code Representation Based on Static Single
                  Assignment Form},
      crossref = {sigplan01},
      pages =    {137--147},
      annote =   {The basic ideas in this representation are:
                  variables are named as the pair (distance in the
                  dominator tree, assignment within basic block);
                  variables are separated by type, with operations
                  referring only to variables of the right type (like
                  integer and FP instructions and registers in
                  assemblers); memory references use types to encode
                  that a null-pointer check and/or a range check has
                  already occurred, allowing optimizing these
                  operations; the resulting code is encoded (using
                  text compression methods) in a way that supports
                  only correct code. These ideas are discussed mostly
                  in a general way, with some Java-specifics, but the
                  representation supposedly also supports Fortran95
                  and Ada95. The representation supports some CSE, but
                  not for address computation operations. The paper
                  also gives numbers on size (usually a little smaller
                  than Java bytecode), and some other static metrics,
                  especially wrt. the effect of optimizations.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Apr 3 14:44:27 2024
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes. One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible available for future use. Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory. I worked through this idea
    using the My 6600 as an example “substrate” for two reasons. First, it has several features that are “friendly” to the idea. Second, I know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea. It is
    certainly not fully worked out. I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.

    If you are adding a float/int data type flag, you might as well
    also add an operand size for floats at least, though some ISAs
    have both int32 and int64 ALU operations for result compatibility.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several possibilities.

    1. Generate an exception.
    2. Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction
    3. Always do the operation in floating point and convert the integer operand prior to the operation. (Or, if you prefer, change floating
    point to integer in the above description.)
    4. Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice. I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations. So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare. So far, a net savings of six opcodes.

    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same op-code. There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000 architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost). The same mechanism works for interrupts that take control away from a running process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it. These should be pretty
    rare.

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
    extra cycle per instruction, then it is almost certainly not worth it. IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.

    Currently the opcode data type can tell the uArch how to route
    the operands internally without knowing the data values.
    For example, FPU reservation stations monitor float operands
    and schedule for just the FPU FADD or FMUL units.

    Dynamic data typing would change that to be data dependent routing.
    It means, for example, you can't begin to schedule a uOp
    until you know all its operand types and opcode.

    Looks like it makes such distributed decisions impossible.
    Probably everything winds up in a big pile of logic in the center,
    which might be problematic for those things whose complexity grows N^2.
    Not sure how significant that is.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Apr 3 20:02:25 2024
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
    minor opcode for arithmetic instructions. Compare with
    https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    What is _really_ eating up opcode space are many- (usually 16-) bit
    constants in the instructions.
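
    Worked out for a 32-bit instruction word: 6 (major opcode) + 3 x 5
    (register specifiers) = 21 bits, leaving 11 bits, i.e. room for
    2^11 = 2048 distinct three-register operations under one major opcode.
    By contrast, a two-register format with a 16-bit constant uses
    6 + 2 x 5 + 16 = 32 bits, consuming an entire major opcode by itself.
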
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Wed Apr 3 15:25:01 2024
    From Newsgroup: comp.arch

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes.  One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible available for future use.  Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it has several features that are “friendly” to the idea.  Second, I know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.  If set, the bit indicates that the corresponding register contains a floating-point value.  Clear indicates not floating point (integer, address, etc.).  There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register.  Non-floating-point loads would clear the
    tag bit.  As I show below, I don’t think you need any special "store
    tag" instructions.

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction
    3.    Always do the operation in floating point and convert the integer operand prior to the operation.  (Or, if you prefer, change floating
    point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions.  And there are some operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts.  Given the tag bit, these could share the same op-code.  There may be several more of these.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with separate compilations?  The called function probably doesn’t know the
    tag value for callee saved registers.  Fortunately, the My 66000 architecture comes to the rescue here.  You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost).  The same mechanism works for interrupts that take control away from a running process.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code.   For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it.  These should be pretty rare.

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it?  To me, a major question is the effect on performance.  What
    is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use?  If it causes an
    extra cycle per instruction, then it is almost certainly not worth it. IANAHG, so I don’t know.  But even if it doesn’t cost any performance, I
    think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).

    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.



    FWIW:
    This doesn't seem too far off from what would be involved with dynamic
    typing at the ISA level, but with many of the same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
       00: Object Reference
         Next 2 bits:
           00: Pointer (with type-tag)
           01: ?
           1z: Bounded Array
       01: Fixnum (route to ALU)
       10: Flonum (route to FPU)
       11: Other types
         00: Smaller value types
           Say: int/uint, short/ushort, ...
         ...
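
    A hypothetical C model of that layout and of the "route to ALU/FPU"
    idea, with the tag held beside a 64-bit payload (the struct and names
    are mine, purely illustrative):

        #include <stdint.h>

        enum major { T_OBJREF = 0, T_FIXNUM = 1, T_FLONUM = 2, T_OTHER = 3 };

        typedef struct {
            uint64_t bits;  /* payload */
            uint8_t  tag;   /* bits 3..2: major type, bits 1..0: subtype */
        } reg_t;

        enum fu { FU_ALU, FU_FPU, FU_TRAP };

        /* Route a two-operand op by operand tags; anything mixed or
           non-numeric goes to a software handler. */
        static enum fu route(reg_t a, reg_t b) {
            enum major ma = (enum major)(a.tag >> 2);
            enum major mb = (enum major)(b.tag >> 2);
            if (ma == T_FIXNUM && mb == T_FIXNUM) return FU_ALU;
            if (ma == T_FLONUM && mb == T_FLONUM) return FU_FPU;
            return FU_TRAP;
        }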

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

    ID1: Unpack instruction to figure out register fields, etc.
    ID2: Fetch registers, specialize variable instructions based on tag bits.

    For timing though, one ideally doesn't want to do anything with the
    register values until the EX stages (since ID2 might already be tied up
    with the comparably expensive register-forwarding logic), but asking for
    3 cycles for decode is a bit much.

    Otherwise, if one does not know which FU should handle the operation
    until EX1, this has its own issues. Or, possibly, the FUs decide
    whether to accept the operation:
    ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

    But, a proper dynamic language allows mixing fixnum and flonum with the
    result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the
    Fixnum to Flonum conversion, one for the FADD itself).

    Many other cases could get hairy, but to have any real benefit, the CPU
    would need to be able to deal with them. In cases where the compiler
    deals with everything, the type-tags become mostly moot (or potentially detrimental).


    But, then, there is another issue:
    C code expects C type semantics to be respected, say:
    Signed int overflow wraps at 32 bits (sign extending);
    Unsigned int overflow wraps at 32 bits (zero extending);
    Variables may not hold values out-of-range for that type;
    The 'long long' and 'unsigned long long' types are exactly 64-bit;
    ...
    ...

    If one has tagged 64-bit registers, then fixnum might not hold the
    entire range of 'long long'. If one has 66 or 68 bit registers, then
    memory storage is a problem.
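
    The range mismatch is easy to see with concrete numbers; a small C
    example, assuming a 2-bit tag leaves a 62-bit signed payload:

        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            /* 62-bit signed fixnum range if 2 of 64 bits hold the tag: */
            int64_t fix_max = ((int64_t)1 << 61) - 1;  /*  2^61 - 1 */
            int64_t fix_min = -((int64_t)1 << 61);     /* -2^61     */
            /* 'long long' must cover at least +/- (2^63 - 1), so values
               like INT64_MAX cannot be represented as a tagged fixnum. */
            printf("fixnum range: [%lld, %lld]\n",
                   (long long)fix_min, (long long)fix_max);
            printf("int64 max:    %lld\n", (long long)INT64_MAX);
            return 0;
        }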

    If one has untagged registers for cases where they are needed, one has
    not saved any encoding space.

    And, if one type-tags statically-typed variables, there is no real
    "value-added" here (and saving a little encoding space at the cost of
    making the rest of the CPU more complicated and expensive isn't much of
    a win).

    Better as I see it, to leave the CPU itself mostly working with raw
    untagged values.


    It can make sense to have helper-ops for type-tags; these don't save
    any encoding space, but rather make the cases that deal with
    type-tagged data a little faster.

    Say:
    Sign-extending a fixnum to 64 bits;
    Setting the tag bits for a fixnum;
    Doing the twiddling to convert between Flonum and Double;
    Setting the tag for various bit patterns;
    Checking the tag(s) against various bit patterns;
    ...

    Where, on a more traditional ISA, the logic to do the bit-twiddling
    for type-checking and tag modification is a significant part of the
    runtime cost of a dynamically typed language.
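
    For a sense of what that bit-twiddling looks like in software, a
    sketch using a hypothetical layout with the tag in the top 2 bits
    (the constants and helper names are mine, not from any real ABI):

        #include <stdint.h>

        #define TAG_SHIFT  62
        #define TAG_FIXNUM ((uint64_t)1)
        #define TAG_MASK   ((uint64_t)3 << TAG_SHIFT)

        /* Each helper is a shift/mask/compare sequence that tagged
           helper-ops could collapse into a single instruction. */
        static int is_fixnum(uint64_t v) {
            return (v >> TAG_SHIFT) == TAG_FIXNUM;
        }
        static int64_t fixnum_value(uint64_t v) {
            return ((int64_t)(v << 2)) >> 2;  /* sign-extend 62 -> 64 bits */
        }
        static uint64_t make_fixnum(int64_t x) {
            return ((uint64_t)x & ~TAG_MASK) | (TAG_FIXNUM << TAG_SHIFT);
        }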

    With luck, one can have dynamic typing that isn't horribly slow.
    But, one still isn't likely to see serious use of dynamic typing in systems-level programming (if anything, Haskell style type-systems seem
    to be more in fashion in this space at present, where trying to get the
    code to be accepted by the compiler is itself an exercise in pain).

    Well, and those of us who would prefer a more ActionScript- or Haxe-like
    approach in a systems-level language (at least as an option for when it
    is useful to do so) are likely kind of the minority.

    Well, and having a C dialect where one can be like, say:
    __variant obj;
    obj = __var { .x=3, .y=4 }; //ex-nihilo object
    obj.z=5; //implicitly creates a new 'z' field in the object.
    Is, not exactly standard...

    And, I end up limiting its use, as any code which touches this stuff can
    only be compiled in BGBCC (for example, getting the TestKern core to
    build in GCC ended up needing to disable things like the BASIC
    interpreter and similar, as I had used some of these features to
    implement the interpreter).


    Though, personally I still prefer being a little more strict than JS/ES
    in some areas: any of ""==null or 0==null or 0=="" or null=="null"
    or similar being true is a misfeature as far as I am concerned (my
    original scripting language had quietly dropped this behavior, despite
    otherwise resembling JS/ES).


    Though, in my case, the ISA is not tagged, so all of this stuff is built
    on top of implicit runtime calls. There is not currently a garbage
    collector, but adding stuff to support precise GC could be possible in
    theory (and in my current project would be a tempting alternative to the
    use of conservative GC, though likely neither could be used in a
    real-time context).

    In one of my own languages, I had instead defined rules to try to allow
    the compiler to figure out object lifetimes in various cases (how an
    object is created implicitly also gave some information about its
    semantics and lifetime).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 3 21:30:02 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the number
    of op codes.  One reason not mentioned before is if you have fixed
    length instructions, you may want to leave as many codes as possible
    available for future use.  Of course, if you are doing a 16-bit
    instruction design, where instruction bits are especially tight, you may
    save enough op-codes to save a bit, perhaps allowing a larger register
    specifier field, or to allow more instructions in the smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
    66000
    has several features that are “friendly” to the idea.  Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.  If
    set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register.  Non-floating-point loads would clear the
    tag bit.  As I show below, I don’t think you need any special "store
    tag" instructions.

    What do you do when you want an FP bit pattern interpreted as an integer,
    or vice versa?

    When executing arithmetic instructions, if the tag bits of both sources
    of an instruction are the same, do the appropriate operation (floating
    or integer), and set the tag bit of the result register appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction

    Conversions to/from FP often require a rounding mode. How do you specify that?

    3.    Always do the operation in floating point and convert the integer
    operand prior to the operation.  (Or, if you prefer, change floating
    point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is the
    best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net
    savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions.  And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts.  Given the tag bit, these could share the same
    op-code.  There may be several more of these.

    Hands waving:: "Danger Will Robinson, Danger" more waving of hands.

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with
    separate compilations?  The called function probably doesn’t know the

    The compiler will certainly have a function prototype. In any event, if FP
    and Integers share a register file, the lack of a prototype is much less
    stressful to the compiler/linking system.

    tag value for callee saved registers.  Fortunately, the My 66000
    architecture comes to the rescue here.  You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are
    saving or restoring in the same data structure it uses for the registers
    (yes, it adds 32 bits to that structure – minimal cost).  The same
    mechanism works for interrupts that take control away from a running
    process.

    Yes, but we do just fine without the tag and without the stuff mentioned
    above. Neither ENTER nor EXIT cares about the 64-bit pattern in the
    register.

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some other
    instructions to do this, without requiring another op-code.   For
    example, Oring a register with itself could be used to set the tag bit
    and Oring a register with zero could clear it.  These should be pretty
    rare.

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it? 

    No.

    To me, a major question is the effect on performance.  What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use? 

    The problem is that you have made decode dependent on dynamic pipeline
    information. I suggest you don't want to do that. Consider a change from
    int to FP instruction as a predicated instruction: the pipeline cannot
    DECODE the instruction at hand until the predicate resolves. Yech.

    If it causes an
    extra cycle per instruction, then it is almost certainly not worth it.
    IANAHG, so I don’t know.  But even if it doesn’t cost any performance, I
    think the overall gains are pretty small, and probably not worth it
    unless the op-code space is really tight (which, for My 66000 it isn’t).
    Anyway, it has been fun thinking about this, so I hope you don’t mind
    the, probably too long, post.
    Any comments are welcome.

    It is actually an interesting idea if you want to limit your architecture
    to 1-wide.



    FWIW:
    This doesn't seem too far off from what would be involved with dynamic typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
       00: Object Reference
         Next 2 bits:
           00: Pointer (with type-tag)
           01: ?
           1z: Bounded Array
       01: Fixnum (route to ALU)
       10: Flonum (route to FPU)
       11: Other types
         00: Smaller value types
           Say: int/uint, short/ushort, ...
         ...

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

    More importantly, you added a cycle AFTER register READ/Forward before
    you can start executing (more when OoO is in use).

    And finally, the compiler KNOWS what the type is at compile time.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 3 21:53:26 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:


    FWIW:
    This doesn't seem too far off from what would be involved with dynamic typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
       00: Object Reference
         Next 2 bits:
           00: Pointer (with type-tag)
           01: ?
           1z: Bounded Array
       01: Fixnum (route to ALU)
       10: Flonum (route to FPU)
       11: Other types
         00: Smaller value types
           Say: int/uint, short/ushort, ...
         ...

    So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
    This solves nobody's problems; not even LISP.

    One issue:
    Decoding based on register tags would mean needing to know the register
    tag bits at the same time the instruction is being decoded. In this
    case, one is likely to need two clock-cycles to fully decode the opcode.

    Not good. But what if you don't know the tag until the register is
    delivered from a latent FU? Do you stall DECODE, or do you launch and
    make the instruction queue element deal with all outcomes?

    ID1: Unpack instruction to figure out register fields, etc.
    ID2: Fetch registers, specialize variable instructions based on tag bits.

    For timing though, one ideally doesn't want to do anything with the
    register values until the EX stages (since ID2 might already be tied up
    with the comparably expensive register-forwarding logic), but asking for
    3 cycles for decode is a bit much.

    Otherwise, if one does not know which FU should handle the operation
    until EX1, this has its own issues.

    Real-friggen-ely

    Or, possible, the FU's decide
    whether to accept the operation:
    ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

    What if IMUL is performed in FMAC, IDIV in FDIV, ... Int<->FP routing is
    based on calculation capability. (Even the CDC 6600 performed int × in
    the FP × unit; not in Thornton's book, but via conversation with a 6600
    logic designer at Asilomar some time ago. All they had to do to get FP ×
    to perform int × was disable 1 gate.)

    But, a proper dynamic language allows mixing fixnum and flonum with the result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....

    Many other cases could get hairy, but to have any real benefit, the CPU would need to be able to deal with them. In cases where the compiler
    deals with everything, the type-tags become mostly moot (or potentially detrimental).

    You are arguing that the added complexity would somehow pay for itself.
    I can't see it paying for itself.

    But, then, there is another issue:
    C code expects C type semantics to be respected, say:
    Signed int overflow wraps at 32 bits (sign extending);
    maybe
    Unsigned int overflow wraps at 32 bits (zero extending);
    maybe
    Variables may not hold values out-of-range for that type;
    LLVM does this GCC does not.
    The 'long long' and 'unsigned long long' types are exactly 64-bit;
    At least 64-bit not exactly.
    ...
    ...

    If one has tagged 64-bit registers, then fixnum might not hold the
    entire range of 'long long'. If one has 66 or 68 bit registers, then
    memory storage is a problem.

    Ya think ?

    If one has untagged registers for cases where they are needed, one has
    not saved any encoding space.

    I give up--not worth trying to teach cosmologists why the color of the
    lipstick going on the pig is not the problem.....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Apr 3 23:20:46 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB-Alt wrote:



    But, a proper dynamic language allows mixing fixnum and flonum with the
    result being implicitly converted to flonum, but from the FPU's POV,
    this would effectively require two chained FADD operations (one for the
    Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....

    The Burroughs B3500 would simply ignore the zone digit when adding
    a string to an integer, based on the address controller for the
    operand.

    ADD 1225 010000(UN) 020000(UA) 030000(UN)

    Would add the 12 unsigned numeric nibbles at address 10000
    to the 25 numeric digits of the 8-bit EBCDIC/ASCII data at address 20000
    and store the result as 25 numeric nibbles at address 30000.

    ADD 0507 010000(UN) 020000(UN) 030000(UA)

    Would add the 5 unsigned numeric nibbles at 10000 to
    the 7 unsigned numeric nibbles at 20000 and store them
    as 8-bit EBCDIC bytes at 30000 (inserting the zone digit @F@
    before each numeric nibble). A processor mode toggle selected
    whether the inserted zone digit should be @F@ (EBCDIC) or @3@ (ASCII).

    Likewise for SUB, INC, DEC, MPY, DIV and data movement instructions.

    The data movement instructions would left- or right-align the destination
    field (MVN (move numeric) would right justify and MVA (move alphanumeric) would left justify) when the destination and source field lengths differ.

    Floating point was BCD with an exponent sign digit, two exponent digits,
    a mantissa sign digit and a variable length mantissa of up
    to 100 digits in length. The integer instructions could be used
    on either the mantissa or exponent individually, as they were
    just fields in memory.
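
    A rough C model of the zone handling described above may make it
    concrete. For simplicity it stores one digit per byte (UN as bare digit
    values, UA as bytes with an @F@ zone in the high nibble); the field
    layout and helper names are mine, not the machine's:

        #include <stdio.h>

        static int digit(unsigned char b) { return b & 0x0F; } /* zone ignored */

        /* Add two right-aligned decimal fields of length n into dst,
           producing zoned (UA) or unzoned (UN) output per dst_zoned. */
        static void dec_add(unsigned char *dst, int dst_zoned,
                            const unsigned char *a, const unsigned char *b,
                            int n) {
            int carry = 0;
            for (int i = n - 1; i >= 0; i--) {  /* least significant last */
                int s = digit(a[i]) + digit(b[i]) + carry;
                carry = s >= 10;
                s %= 10;
                dst[i] = dst_zoned ? 0xF0 | s : s;  /* insert zone for UA */
            }
        }

        int main(void) {
            unsigned char a[3] = {1, 2, 5};           /* UN: 125   */
            unsigned char b[3] = {0xF2, 0xF0, 0xF7};  /* UA: "207" */
            unsigned char r[3];
            dec_add(r, 1, a, b, 3);                   /* 125 + 207 = 332 */
            printf("%02X %02X %02X\n", r[0], r[1], r[2]); /* F3 F3 F2 */
            return 0;
        }
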
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 3 19:27:59 2024
    From Newsgroup: comp.arch

    On 4/3/2024 4:53 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    FWIW:
    This doesn't seem too far off from what would be involved with dynamic
    typing at the ISA level, but with many of same sorts of drawbacks...



    Say, for example, top 2 bits of a register:
       00: Object Reference
         Next 2 bits:
           00: Pointer (with type-tag)
           01: ?
           1z: Bounded Array
       01: Fixnum (route to ALU)
       10: Flonum (route to FPU)
       11: Other types
         00: Smaller value types
           Say: int/uint, short/ushort, ...
         ...

    So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
    This solves nobody's problems; not even LISP.


    Yeah, there is likely no way to make this worthwhile...




    One issue:
    Decoding based on register tags would mean needing to know the
    register tag bits at the same time the instruction is being decoded.
    In this case, one is likely to need two clock-cycles to fully decode
    the opcode.

    Not good. But what if you don't know the tag until the register is
    delivered from a latent FU, do you stall DECODE, or do you launch and
    make the instruction
    queue element have to deal with all outcomes.


    It is likely that the pipeline would need to stall until results are available.

    It is also likely that such a CPU would have a minimum effective latency
    of 2 or 3 clock cycles for *every* instruction (and probably 4 or 5
    cycles for memory load), in addition to requiring pipeline stalls.


    ID1: Unpack instruction to figure out register fields, etc.
    ID2: Fetch registers, specialize variable instructions based on tag bits.

    For timing though, one ideally doesn't want to do anything with the
    register values until the EX stages (since ID2 might already be tied
    up with the comparably expensive register-forwarding logic), but
    asking for 3 cycles for decode is a bit much.

    Otherwise, if one does not know which FU should handle the operation
    until EX1, this has its own issues.

    Real-friggen-ely


    These issues could be a deal-breaker for such a CPU.


                                        Or, possible, the FU's decide
    whether to accept the operation:
       ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

    What if IMUL is performed in FMAC, IDIV in FDIV,... Int<->FP routing is
    based on calculation capability {Even CDC 6600 performed int × in the FP
    × unit (not in Thornton's book, but via conversation with 6600 logic designer at Asilomar some time ago. All they had to do to get FP × to perform int × was disable 1 gate.......)


    Then you have a mess...

    So, probably need to sort it out before EX in any case.


    But, a proper dynamic language allows mixing fixnum and flonum with
    the result being implicitly converted to flonum, but from the FPU's
    POV, this would effectively require two chained FADD operations (one
    for the Fixnum to Flonum conversion, one for the FADD itself).

    That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
    a string to an integer and the string would be converted to int before.....


    If you have dynamic types in hardware in this way, then effectively the typesystem mechanics switch from being a language issue to a hardware issue.


    One may also end up with, say, a CPU that can run Scheme or JavaScript
    or similar, but likely couldn't run C without significant hassles.



    Many other cases could get hairy, but to have any real benefit, the
    CPU would need to be able to deal with them. In cases where the
    compiler deals with everything, the type-tags become mostly moot (or
    potentially detrimental).

    You are arguing that the added complexity would somehow pay for itself.
    I can't see it paying for itself.


    One either goes all in, or abandons the idea entirely.
    There isn't really a middle option in this scenario (then one just ends
    up with something that is bad at everything).

    I was not saying it could work, but in a way, pointing out the issues
    that would likely make this unworkable.


    Though, that said, there could be possible merit in a CPU core that
    could run a language like ECMAScript at roughly C like speeds, even if
    it was basically unusable for pretty much anything else.

    Though, for ECMAScript, one could also make a case for taking the
    SpiderMonkey option and largely abandoning the use of an integer ALU
    (instead running all of the integer math through the FPU; which could be
    modified to support bitwise integer operations and similar as well).


    But, then, there is another issue:
       C code expects C type semantics to be respected, say:
         Signed int overflow wraps at 32 bits (sign extending);
    maybe
         Unsigned int overflow wraps at 32 bits (zero extending);
    maybe

    I am dealing with some code that has a bad habit of breaking if integer overflows don't happen in the expected ways (say, the ROTT engine is
    pretty bad about this one...).

    When I first started working on my ROTT port, there was also a lot of wackiness where the engine would go out of bounds, then behavior would
    depend on what other things in memory it encountered when it did so.


    I have mostly managed to fix up all the out-of-bounds issues, but this
    isn't enough to keep the demos from desyncing (a similar issue applies
    with my Doom port).

    Apparently, other engines like ZDoom and similar needed to do a bit of
    "heavy lifting" to get the demos from all of the various WAD versions to
    play without desync; Doom was also dependent on the behavior of
    out-of-bounds memory accesses, and they needed to turn these into
    in-bounds accesses (to larger memory objects) with the memory contents
    of the out-of-bounds accesses being faked.

    Of course, the other option is just to "fix" the out-of-bounds accesses,
    and live with a port where the demo playback desyncs.



    Meanwhile, Quake entirely avoided this issue:
    The demo playback is based on recording the location and orientation of
    the player and any enemies at every point in time and similar, rather
    than based on recording and replaying the original sequence of keyboard
    inputs (and assuming that everything always happens exactly the same
    each time).


    Then again, these sorts of issues are not unique to these games. Have
    watched more than a few speed-runs involving using glitches either to
    leave the playable parts of the map, or using convoluted sequences of
    actions to corrupt memory in such a way as to achieve a desired effect
    (such as triggering a warp to the end of the game).

    Like, during normal gameplay, these games are seemingly just sorta
    corrupting memory all over the place but, for the most part, no one
    notices until something goes more obviously wrong...


         Variables may not hold values out-of-range for that type;
    LLVM does this GCC does not.
         The 'long long' and 'unsigned long long' types are exactly 64-bit;
    At least 64-bit not exactly.

    C only requires at least 64 bits.
    I suspect that, in practice, most code expects exactly 64 bits.


           ...
         ...

    If one has tagged 64-bit registers, then fixnum might not hold the
    entire range of 'long long'. If one has 66 or 68 bit registers, then
    memory storage is a problem.

    Ya think ?


    Both options suck, granted.


    If one has untagged registers for cases where they are needed, one has
    not saved any encoding space.

    I give up--not worth trying to teach cosmologist why the color of the lipstick going on the pig is not the problem.....


    I was not trying to claim that this idea wouldn't suck.


    In my case, I went a different route that works a little better:
    Leaving all this stuff mostly up to software...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Apr 4 10:32:48 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
                   66000
    has several features that are “friendly” to the idea.  Second, I know
    Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.

    What do you do when you want a FP bit pattern interpreted as an integer,
    or vice versa.
    This is why, if you want to copy Mill, you have to do it properly:

    Mill does NOT care about the type of data loaded into a particular belt
    slot, only the size and whether it is a scalar or a vector filling up
    the full belt slot. In either case you will also have marker bits for
    special types like None and NaR.

    So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
    last being the same as the scalar anyway).

    Only load ops and explicit widening/narrowing ops set the size tag
    bits; from that point any op where it makes sense will do the right
    thing for either a scalar or a short vector, so you can add 16+16 8-bit
    vars with the same ADD encoding as you would use for a single 64-bit
    ADD.

    We do NOT make any attempt to interpret the actual bit patterns stored
    within each belt slot; that is up to the instructions. This means that
    there is no difference between loading a float or an int32_t, and it
    also means that it is perfectly legal (and supported) to use bit
    operations on an FP variable. This can be very useful, not just to fake
    exact arithmetic by splitting a double into two 26-bit mantissa parts:
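
    For a concrete picture of that split, here is a sketch of the standard
    Veltkamp splitting (which I take to be the technique meant; the code is
    mine). It needs round-to-nearest and a compiler that does not contract
    the expressions into FMAs:

        #include <stdio.h>

        /* Split a into hi + lo, each with at most 26 significant bits,
           so products like hi*hi and hi*lo are exact in double. */
        static void split(double a, double *hi, double *lo) {
            double c = 134217729.0 * a;   /* 2^27 + 1 */
            *hi = c - (c - a);
            *lo = a - *hi;
        }

        int main(void) {
            double hi, lo;
            split(3.141592653589793, &hi, &lo);
            printf("%.17g = %.17g + %.17g\n", hi + lo, hi, lo);
            return 0;
        }
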
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Apr 4 16:47:44 2024
    From Newsgroup: comp.arch

    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill
    project?



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Apr 4 21:13:21 2024
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill
    project?

    I am much less active than I used to be, but I still get the weekly conf
    call invites and respond to any interesting subject on our mailing list.

    So, yes, I do consider myself to still be involved.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Apr 4 22:25:30 2024
    From Newsgroup: comp.arch

    On Thu, 4 Apr 2024 21:13:21 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 4 Apr 2024 10:32:48 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    We do NOT make any attempt

    Terje


    Does the present tense mean that you are still involved in the Mill
    project?
    I am much less active than I used to be, but I still get the weekly
    conf call invites and respond to any interesting subject on our
    mailing list.

    So, yes, I do consider myself to still be involved.

    Terje


    Thank you

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Thu Apr 4 17:28:43 2024
    From Newsgroup: comp.arch

    On 4/4/2024 3:32 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing
    a larger register specifier field, or to allow more instructions in
    the smaller subset.

    It is in this spirit that I had an idea, partially inspired by
    Mill’s use of tags in registers, but not memory.  I worked through
    this idea using the My 6600 as an example “substrate” for two
    reasons.  First, it
                    66000
    has several features that are “friendly” to the idea.  Second, I
    know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value,
    set the tag bit for the destination register.  Non-floating-point
    loads would clear the tag bit.  As I show below, I don’t think you >>>> need any special "store tag" instructions.

    What do you do when you want an FP bit pattern interpreted as an
    integer, or vice versa?

    This is why, if you want to copy Mill, you have to do it properly:

    Mill does NOT care about the type of data loaded into a particular belt slot, only the size and if it is a scalar or a vector filling up the
    full belt slot. In either case you will also have marker bits for
    special types like None and NaR.

    So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
    last being the same as the scalar anyway).

    Only load ops and explicit widening/narrowing ops set the size tag
    bits, from that point any op where it makes sense will do the right
    thing for either a scalar or a short vector, so you can add 16+16 8-bit
    vars with the same ADD encoding as you would use for a single 64-bit ADD.

    We do NOT make any attempt to interpret the actual bit patterns stored
    within each belt slot; that is up to the instructions. This means that
    there is no difference between loading a float or an int32_t, and it
    also means that it is perfectly legal (and supported) to use bit
    operations on an FP variable. This can be very useful, not least for
    faking exact arithmetic by splitting a double into two 26-bit mantissa
    parts.


    I guess useful to know.

    Haven't heard much about Mill in a while, so I don't know what, if any,
    progress is being made.


    As I can note, in my actual ISA, any type-tagging in the registers was
    explicit and opt-in, generally managed by the compiler/runtime/etc.; in
    this case, the ISA merely provides facilities to assist with this.


    The main exception would likely have been the possible "Bounds Check
    Enforce" mode, which would still need a bit of work to implement, and is
    not likely to be terribly useful. The most complicated and expensive
    part is that it would require implicit register and memory tagging (to
    flag capabilities). The cheaper option is simply to not enable it, in
    which case things behave as before, with the new functionality
    essentially being a NOP. Much of the work still needed on this would be
    getting the 128-bit ABI working, and adding some new tweaks to the ABI
    to play well with the capability addressing (effectively, it requires
    partly reworking how global variables are accessed).


    The type-tagging scheme used in my case is very similar to that used in
    my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
    off of an early version of the BGBScript VM, effectively using a lax
    hybrid typesystem masquerading as C). Though, it has long since moved to
    a more proper C-style typesystem, with dynamic types more as an optional
    extension.


    But, as can be noted, since dynamic typing is implemented via runtime
    calls, it is slower than the use of static types. This is likely to be
    unavoidable with any kind of conventional-ish architecture (and some
    structures, like bounded array objects and ex-nihilo objects, are
    difficult to make performance-competitive with bare pointers and
    structs).

    Though, I don't think it is justifiable to forbid their existence
    entirely (as is more the philosophy in many strict static languages), or
    to mandate that programs roll their own (as is typical in C and C++
    land). With compiler and runtime support, it is possible to provide them
    in ways that perform better than a plain C implementation.

    Well, and also the annoyance that seemingly every dynamic-language VM
    takes a different approach to the implementation of its dynamic
    typesystem (along with language differences, ...).

    For example, Common Lisp is very different from Smalltalk, despite both
    being categorically similar in this sense (or, versus Python, or versus JavaScript, or, ...). Not likely viable to address all of them in the
    same runtime (and would likely result in a typesystem that doesn't
    really match with any of them, ...).


    Though, annoyingly, there are not really any mainstream languages in the "hybrid" category (say, in the gray area between C and ActionScript).
    And, then it ends up being a question of which is better in a choice
    between C with AS-like features, or "like AS but with C features".

    So, alas...


    Terje


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 5 01:48:33 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/4/2024 3:32 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:


    As I can note, in my actual ISA, any type-tagging in the registers was
    explicit and opt-in, generally managed by the compiler/runtime/etc.; in
    this case, the ISA merely provides facilities to assist with this.


    The main exception would likely have been the possible "Bounds Check Enforce" mode, which would still need a bit of work to implement, and is
    not likely to be terribly useful.

    A while back (and maybe in the future) My 66000 had what I called the
    Foreign Access Mode. When the HoB of the pointer was set, the first
    entry in the translation table was a 4-doubleword structure: a Root
    pointer, the Lowest addressable Byte, the Highest addressable Byte,
    and a DW of access rights, permissions, ... While sort of like a
    capability, I don't think it was close enough to actually be a
    capability or to be used as one.

    So it fell out of favor, and it was not clear how it fit into the
    HyperVisor/SuperVisor model, either.
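
    A rough C sketch of that 4-doubleword entry (field names and ordering
    are assumptions for illustration, not the actual My 66000 layout):

        #include <stdint.h>

        typedef struct {
            uint64_t root;     /* Root pointer of the foreign table */
            uint64_t lowest;   /* Lowest addressable Byte */
            uint64_t highest;  /* Highest addressable Byte */
            uint64_t rights;   /* access rights, permissions, ... */
        } ForeignAccessEntry;
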

    The most complicated and expensive part is that it would require
    implicit register and memory tagging (to flag capabilities). The
    cheaper option is simply to not enable it, in which case things behave
    as before, with the new functionality essentially being a NOP. Much of
    the work still needed on this would be getting the 128-bit ABI working,
    and adding some new tweaks to the ABI to play well with the capability
    addressing (effectively, it requires partly reworking how global
    variables are accessed).


    The type-tagging scheme used in my case is very similar to that used in
    my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
    off of an early version of the BGBScript VM, effectively using a lax
    hybrid typesystem masquerading as C). Though, it has long since moved to
    a more proper C-style typesystem, with dynamic types more as an optional
    extension.

    In general, any time one needs to change the type, you waste an
    instruction compared to typeless registers.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Thu Apr 4 21:13:13 2024
    From Newsgroup: comp.arch

    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of
    computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 5 00:54:54 2024
    From Newsgroup: comp.arch

    On 4/4/2024 8:48 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/4/2024 3:32 AM, Terje Mathisen wrote:
    MitchAlsup1 wrote:


    As I can note, in my actual ISA, any type-tagging in the registers was
    explicit and opt-in, generally managed by the compiler/runtime/etc.; in
    this case, the ISA merely provides facilities to assist with this.


    The main exception would likely have been the possible "Bounds Check
    Enforce" mode, which would still need a bit of work to implement, and
    is not likely to be terribly useful.

    A while back (and maybe in the future) My 66000 had what I called the
    Foreign Access Mode. When the HoB of the pointer was set, the first
    entry in the translation table was a 4-doubleword structure: a Root
    pointer, the Lowest addressable Byte, the Highest addressable Byte,
    and a DW of access rights, permissions, ... While sort of like a
    capability, I don't think it was close enough to actually be a
    capability or to be used as one.

    So it fell out of favor, and it was not clear how it fit into the
    HyperVisor/SuperVisor model, either.


    Possibly true.

    The idea with BCE mode would be that the pointers would contain an
    address along with an upper and lower bound, and possibly a few access
    flags. It would disable the narrower 64-bit pointer instructions,
    forcing the use of the 128-bit pointer instructions, which would
    perform bounds checks; some instructions would also gain additional
    semantics.
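
    Conceptually, such a pointer might look like the following C sketch (an
    uncompressed form for illustration only; a real 128-bit encoding would
    pack the bounds and flags more tightly):

        #include <stdint.h>

        typedef struct {
            uint64_t addr;   /* the address being dereferenced */
            uint64_t lower;  /* lowest legal byte */
            uint64_t upper;  /* one past the highest legal byte */
        } BcePtr;

        /* The check a bounds-checked load/store of 'size' bytes might
           perform before accessing memory. */
        static int bce_ok(const BcePtr *p, uint64_t size) {
            return p->addr >= p->lower && p->addr + size <= p->upper;
        }
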

    In addition, the Boot SRAM and DRAM gain some special "Tag Bits" areas.


    However, it is unclear if the enforcing mode gains much over the normal optional bounds checking to justify the extra cost. The main "merit"
    case is that, in theory, it could offer some additional protection
    against hostile machine code (whereas the non-enforcing mode is mostly
    useful for detecting out-of-bounds memory accesses).

    However, the optional mode is compatible with the use of 64-bit pointers
    and the existing C ABI, so there is less overhead.


    The most complicated and expensive part is that it would require
    implicit register and memory tagging (to flag capabilities). The
    cheaper option is simply to not enable it, in which case things behave
    as before, with the new functionality essentially being a NOP. Much of
    the work still needed on this would be getting the 128-bit ABI working,
    and adding some new tweaks to the ABI to play well with the capability
    addressing (effectively, it requires partly reworking how global
    variables are accessed).


    The type-tagging scheme used in my case is very similar to that used
    in my previous BGBScript VMs (where, as I can note, BGBCC was itself a
    fork off of an early version of the BGBScript VM, effectively using a
    lax hybrid typesystem masquerading as C). Though, it has long since
    moved to a more proper C-style typesystem, with dynamic types more as
    an optional extension.

    In general, any time one needs to change the type, you waste an
    instruction compared to typeless registers.


    In my case, both types of values are used:
        int x;        //x is a bare register
        void *p;      //may or may not have tag, high 16 bits 0000 if untagged
        __variant y;  //y is tagged
        auto z;       //may be tagged or untagged

    Here, untagged values will generally be used for non-variant types,
    whereas tagged values for variant types.

    Here, 'auto' and 'variant' differ, in that variant says "the type is
    only known at runtime", whereas 'auto' assumes that a type exists and
    may optionally be resolved at compile time (or, alternatively, it may
    decay into variant; assumption being that one may not use auto in ways
    that are incompatible with variant). In terms of behavior, both cases
    may appear superficially similar.

    Though:
        auto z = expr;
    would instead define 'z' as a type inferred from the expression (in a
    similar way to how it works in C++).


    Note that:
        __var x;
    would also give a variable of type variant, but is not exactly the same
    ("__variant" is the type, where "__var" is a statement/expression
    keyword that just so happens to declare a variable of type "__variant"
    when used in this way).



    Say, non-variant:
        int, long, double
        void*, char*, Foo*, ...
        __m128, __vec4f, ...
    Variant:
        __variant, __object, __fixnum, __string, ...

    Where, for example:
        __variant
            May hold (nearly) any type of value at runtime,
            though with some semantic restrictions.
        __object
            Tagged value, like variant;
            but does not allow using operators on it directly.
        __fixnum
            Represents a 62-bit signed integer value.
            Always exists in tagged form.
        __flonum
            Represents a 62-bit floating-point value.
            Effectively a tagged Binary64 shifted right by 2 bits.
        __string
            Holds a string;
            essentially 'char*' but with a type-tagged pointer.
            Defaults to CP-1252 at present, but may also hold a UCS-2 string.
            Strings are assumed to be a read-only character array.
        ...
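
    For illustration, one plausible reading of the __fixnum/__flonum
    encodings above, as a C sketch (the tag placement and tag values are
    assumptions; only the "62-bit" and "Binary64 shifted right by 2 bits"
    parts come from the description):

        #include <stdint.h>
        #include <string.h>

        #define TAG_FIXNUM 1ull  /* assumed 2-bit tag values, kept in */
        #define TAG_FLONUM 2ull  /* the top 2 bits of the word        */

        static uint64_t box_fixnum(int64_t v) {
            return ((uint64_t)v & ((1ull << 62) - 1)) | (TAG_FIXNUM << 62);
        }
        static int64_t unbox_fixnum(uint64_t b) {
            return ((int64_t)(b << 2)) >> 2;  /* sign-extend 62 bits */
        }

        static uint64_t box_flonum(double d) {
            uint64_t bits;
            memcpy(&bits, &d, sizeof bits);
            return (bits >> 2) | (TAG_FLONUM << 62);
        }
        static double unbox_flonum(uint64_t b) {
            uint64_t bits = b << 2;  /* low 2 mantissa bits are lost */
            double d;
            memcpy(&d, &bits, sizeof d);
            return d;
        }
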


    So, say:
        int x, z;
        __variant y;

        y=x;       //implicit int -> __fixnum -> __variant
        z=(int)y;  //coerces y to 'int'

    There are some operators that exist for variant types but not for
    non-variant types, such as __instanceof.

        if(y __instanceof __fixnum)
        {
            //y is known to be a fixnum here
        }

    Where __instanceof can also be used on class instances:
        __class Foo __extends Bar __implements IBaz {
            ... class members ...
        };


    In theory, one could add a header to #define a lot of these keywords in
    non-prefixed forms, in which case one could theoretically write, say:
        public class Foo extends Bar implements IBaz {
            private int x, y;
            public int someMethod()
                { return x+y; }
            public void setX(int val)
                { x=val; }
            ...
        };

    And, if one has, say:
        IBaz baz;
        ...
        if(baz instanceof Foo)
        {
            //baz is an instance of the Foo class
        }

    Though, I will note that object instances are pass-by-reference here
    (like in Java and C#) and not by-value. If one is familiar with Java,
    it is probably not too hard to figure out how some of this works. Also,
    as can be noted, the object model is more like Java-family languages
    than like C++.

    However, unlike Java (and more like ActionScript), one can throw a
    'dynamic' (or '__dynamic') keyword on a class, in which case it is
    possible to create new members in the object instances merely by
    assigning to them (where any members created this way will default to
    being 'public variant').


    Object member access will differ depending on the type of object.
    Direct access to a non-dynamic class member will use a fixed
    displacement (like when accessing a struct). Dynamic members will
    implicitly access an ex-nihilo object that exists as a hidden member in
    the class instance (and using the 'dynamic' modifier on a class will implicitly create this member).

    In this case, interfaces are pulled off by sticking an interface VTable
    pointer onto the end of the object, and then encoding the interface
    reference as a pointer to the pointer to this VTable (with the VTable
    encoding the offset to adjust the object pointer to give a pointer to
    the base class for the virtual method). Note that (unlike in the JVM),
    what interfaces a class implements is fixed at compile time ("interface
    injection" is not possible in BGBCC).
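
    A loose C sketch of that layout (names and the exact vtable contents
    are assumptions; it only illustrates the double indirection and the
    this-adjust offset described above):

        #include <stddef.h>

        typedef struct {
            ptrdiff_t this_adjust;          /* offset from the embedded
                                               vtable pointer back to the
                                               object's base */
            void (*method[1])(void *self);  /* virtual method slots */
        } IfaceVtab;

        /* An interface reference points at the vtable pointer stuck
           onto the end of the object. */
        typedef IfaceVtab **IfaceRef;

        static void *iface_self(IfaceRef ref) {
            return (char *)ref + (*ref)->this_adjust;
        }
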



    There was an experimental C++ mode, which tries to mimic C++ syntax and
    semantics (kinda), sort of awkwardly faking C++'s object system on top
    of the Java-like object system (with POD classes decaying into C
    structs, value objects faked with object cloning, ...). It will not
    take much to see through this illusion, though (and it almost doesn't
    seem worth it).


    If ex-nihilo objects are used, these are treated as separate from the
    instance-of-class objects. In the current implementation, these objects
    are represented as small B-Trees holding key/value associations. Here,
    each key is a 16-bit number (associated with a "symbol") and the value
    is a 64-bit value (variant). Each object has a fixed capacity (16
    members) and, if exceeded, splits apart into a tree (say, a 2-level
    tree representing up to 256 members, with the keys in the top-level
    node encoding the ranges of keys present in each sub-node).

    At present, there is a limit of 64K unique symbols, but this isn't too
    big of an issue in practice (each symbol can be seen as a mapping
    between a 16-bit number and an ASCII string representing the symbol's name).
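
    As a minimal sketch (field names and packing are assumptions), a node
    of such an object might look like:

        #include <stdint.h>

        #define EXN_CAP 16  /* fixed capacity before splitting */

        typedef struct {
            uint8_t  count;         /* live entries in this node */
            uint8_t  is_inner;      /* nonzero: val[] holds child nodes */
            uint16_t key[EXN_CAP];  /* 16-bit symbol numbers, sorted */
            uint64_t val[EXN_CAP];  /* 64-bit variant values (leaf), or
                                       child pointers (inner node) */
        } ExNihiloNode;
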

    A normal class member will be accessed via a direct memory load or
    store; if it is a dynamic member, an implicit runtime call will be
    used instead.



    For dynamic types (variant), pretty much all operations involve runtime
    calls. These calls will perform a dynamically-typed dispatch based on
    the tags of the values they are given.

    Similarly, getting/setting a member in an ex-nihilo object is
    accomplished via a runtime call.

    For performance reasons, high-traffic areas of the dynamic-type runtime
    were written in ASM.


    Though, for performance, the rule here is to avoid using variant except
    in cases where one actually needs dynamic types (and based on whether or
    not compatibility with mainline C compilers is needed).


    On the other side of the language border (BS), the syntax differs
    slightly:
        function foo(x:int, y:int):int
        {
            var z:int;
            z=x+y;
            ...
        }

    And, another language (BS2) of mine had switched to a more Java-like
    syntax (with parts of C syntax bolted on). Though, the practical
    difference between BS2 and the extended C variant is small (if you
    #define the keywords, it is possible to do similar things with mostly
    only minor syntactic differences).


    Similarly, the extended C variant has another advantage:
    It is backwards compatible with C.

    Though, not quite so compatible with C++, and I don't expect C++ fans to
    be all that interested in BGBCC's non-standard dialect.

    OTOH: I will argue that it is at least much less horrid looking than Objective-C.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Fri Apr 5 14:46:35 2024
    From Newsgroup: comp.arch

    On 4/4/2024 10:13 PM, John Savard wrote:
    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.


    This was how the FPU worked in SH-4. Reloading some bits in FPSCR would effectively bank out the current set of FPU instructions (say, between
    Single and Double, etc).


    Also, it was how 64-bit operations worked in early 64-bit versions
    of BJX1.

    Say, there were DQ and JQ bits added to the control register:
        DQ=0: 32-bit for variable-sized operations (like SH-4);
        DQ=1: 64-bit for variable-sized operations.
        JQ=0: 32-bit addressing (SH-4 memory map);
        JQ=1: 48-bit addressing (like the later BJX2 memory map).

    The DQ bit would also affect whether one had MOV.W or MOV.Q operations
    available:
        DQ=0: Interpret ops as MOV.W (16-bit);
        DQ=1: Interpret ops as MOV.Q (64-bit).

    In the DQ=JQ=0 case, it would have been mostly equivalent to SH-4 (and
    could still run GCC's compiler output). This was a similar situation to switching the FPU mode.
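
    As a toy illustration of the mode-dependent decode (bit positions and
    names are invented, not the real BJX1 control-register layout):

        #define DQ_BIT (1u << 0)  /* assumed bit positions */
        #define JQ_BIT (1u << 1)

        typedef enum { OP_MOV_W, OP_MOV_Q } MovKind;

        /* The same encoding decodes differently depending on DQ. */
        static MovKind decode_mov(unsigned ctl) {
            return (ctl & DQ_BIT) ? OP_MOV_Q : OP_MOV_W;
        }

        /* JQ selects between the SH-4 map and 48-bit addressing. */
        static unsigned addr_bits(unsigned ctl) {
            return (ctl & JQ_BIT) ? 48 : 32;
        }
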


    Though, a later version of the BJX1 ISA had dropped and repurposed some encodings, allowing MOV.W and MOV.Q to coexist (and avoiding the need
    for the compiler to endlessly toggle this bit), albeit with fewer
    addressing modes for the latter.

    All this was an issue mostly because SH-4 had used fixed-length 16-bit instructions, and the encoding space was effectively almost entirely
    full when I started (so new instructions required either sacrificing
    existing instructions, or using mode bits).

    Though, BJX1 did end up with some 32-bit ops, some borrowed from SH-2A
    and similar. These were mostly stuck into awkward ad-hoc places in the
    16-bit map, so decoding was kind of a pain.

    ...


    When I later rebooted things as my BJX2 project, I effectively dropped
    this whole mess and started over (with the caveat that it lost SH-4
    compatibility). However, it has since gained RISC-V compatibility, for
    better or worse; RISC-V is at least likely to get slightly better
    performance than SH-4 (and both ISAs can be 64-bit).

    ...


    John Savard

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 5 21:34:16 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On some older CPUs, there might be one set of integer opcodes and one
    set of floating-point opcodes, with a status register containing the
    integer precision, and the floating-point precision, currently in use.

    The idea was that this would be efficient because most programs only
    use one size of each type of number, so the number of opcodes would be
    the most appropriate, and that status register wouldn't need to be
    reloaded too often.

    Most programs I write use bytes (mostly unsigned), a few halfwords (mostly
    signed), a useful count of integers (both signed and unsigned--mainly as
    already-defined arguments/returns), and a vast majority of doublewords
    (invariably unsigned).

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    It's considered dangerous, though, to have a mechanism for changing
    what instructions mean, since this could let malware alter what
    programs do in a useful and sneaky fashion. Memory bandwidth is no
    longer a crippling constraint the way it was back in the days of core
    memory and discrete transistors - at least not for program code, even
    if memory bandwidth for _data_ often limits the processing speed of computers.

    This is basically because any program that does any real work, taking
    any real length of time to do its job, is going to mostly consist of
    loops that fit in cache. So letting program code be verbose if there
    are other benefits obtained thereby is the current conventional
    wisdom.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Sat Apr 6 21:30:47 2024
    From Newsgroup: comp.arch

    On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the
    standard way of an OpCode for each FP size × calculation.

    I do tend to agree.

    However, a silly idea has now occurred to me.

    256 bits can contain eight instructions that are 32 bits long.

    Or they can also contain seven instructions that are 36 bits long,
    with four bits left over.

    So they could contain *nine* instructions that are 28 bits long, also
    with four bits left over.

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.
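
    For concreteness, the block arithmetic works out as follows (a sketch;
    the choice of which 4-bit code selects 36-bit mode is an assumption):

        /* A 256-bit block: 4-bit format field plus either nine 28-bit
           or seven 36-bit instructions.
           4 + 9*28 = 256 and 4 + 7*36 = 256, both exact. */
        static unsigned instr_bits(unsigned fmt4) {
            return (fmt4 == 0xF) ? 36 : 28;
        }
        static unsigned instr_count(unsigned fmt4) {
            return (fmt4 == 0xF) ? 7 : 9;
        }
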

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 7 20:41:45 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Early in My 66000 LLVM development Brian looked at the cost of having
    only 1 FP OpCode set--and it did not look good--so we went back to the >>standard way of an OpCode for each FP size × calculation.

    I do tend to agree.

    However, a silly idea has now occurred to me.

    256 bits can contain eight instructions that are 32 bits long.

    Or they can also contain seven instructions that are 36 bits long,
    with four bits left over.

    So they could contain *nine* instructions that are 28 bits long, also
    with four bits left over.

    I agree with the arithmetic going into this statement. What I don't
    have sufficient data concerning is "whether these extra formats pay
    for themselves". For example, how many of the 36-bit encodings are
    irredundant with the 32-bit ones, and so on with the 28-bit ones.

    Take::

    ADD R7,R7,#1

    I suspect there is a 28-bit form, a 32-bit form, and a 36-bit form
    for this semantic step, that you pay for multiple times in decoding
    and possibly pipelining. {{There may also be other encodings for
    this; such as:: INC R7}}

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

    How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
    set of 256-bit instruction decodes ??

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    Agreed.............

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 7 21:01:15 2024
    From Newsgroup: comp.arch

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 7 21:22:50 2024
    From Newsgroup: comp.arch

    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??
    Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
    for 64 OpCodes. Now if you have floats and doubles and signed and
    unsigned, you get 16 of each and we have not looked at memory
    references or branching.

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

    But beating RISC-V is easy; try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Apr 8 06:21:43 2024
    From Newsgroup: comp.arch

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
    for 64 OpCodes.

    There could have been a case for adding this (maybe just for
    a few frequent ones: "add r1,r2,r3", "add r1,r2,-r3", "add
    r1,r2,#num" and "add r1,r2,#-num"), but I did not pursue that
    further.

    I looked at load and store instructions with short offsets
    (these would then have been scaled), and short branches. But
    the 21-bit opcode space filled up really, really rapidly.

    Also, it is easy to synthesize a 3-register operation from
    a 2-register operation and a memory move. If the decoder is
    set up for 42 bits anyway, instruction fusion is also a possibility.
    This got a bit weird.

    Now if you have floats and doubles and signed and
    unsigned, you get 16 of each and we have not looked at memory
    references or branching.

    For somebody who does Fortran, I find the frequency of floating
    point instructions surprisingly low, even in Fortran code.

    Did that look promising? Not really; the 21 bits offered a lot
    of useful opcode space for two-register operations and even for
    a few of the often-used three-register, but 42 bits was really
    a bit too long, so the advantage wasn't great. And embedding
    32-bit or 64-bit instructions in the code stream does not really
    fit the 21-bit raster well, so compared to an ISA which can do so
    (like My 66000) it came out at a disadvantage. Might be possible
    to beat RISC-V, though.

    But beating RISC-V is easy; try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

    To reach VAX instruction density, one would have to have things
    like memory operands (with the associated danger that compilers
    will not put intermediate results in registers, but since they have
    been optimized for x86 for decades, they are probably better now)
    and load with update, which would then have to be cracked
    into two micro-ops. Not sure about the benefit.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Apr 8 07:16:08 2024
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    But beating RISC-V is easy; try getting your instruction count down
    to VAX counts without losing the ability to pipeline and parallel
    instruction execution.

    At handwaving accuracy::
    VAX has 1.0 instructions
    My 66000 has 1.1 instructions
    RISC-V has 1.5 instructions

    To reach VAX instruction density

    Note that in recent times Mitch Alsup is writing not about code
    density (static code size or dynamically executed bytes), but about
    instruction counts. It's unclear why instruction count would be a
    primary metric, except that he thinks that he can score points for My
    66000 with it. As VAX demonstrates, you can produce an instruction
    set with low instruction counts that is bad at the metrics that really
    count: cycles for executing the program (for a given CPU chip area in
    a given manufacturing process), and, for very small systems, static
    code size.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Mon Apr 8 07:05:35 2024
    From Newsgroup: comp.arch

    On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

    Yes, that's a problem. Presumably, I would have to do without
    immediates.

    An option would be to reserve some 16-bit codes to indicate a block
    consisting of one 28-bit instruction and seven 32-bit instructions,
    but that means a third instruction set.

    How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
    set of 256-bit instruction decodes ??

    By using 36-bit instructions instead of 28-bit instructions.

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding deal
    with these efficiently ?? That is:: what happens when you jump to the
    middle of a block of 36-bit instructions ??

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicate whether it is composed of 36-bit instructions or
    28-bit instructions. So the computer knows where the instructions are;
    and thus a convention can be applied, such as addressing each 36-bit
    instruction by the addresses of the first seven 32-bit positions in
    the block.

    In the case of 28-bit instructions, the first eight correspond to the
    32-bit positions, the ninth corresponds to the last 16 bits of the
    block.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Apr 8 17:25:38 2024
    From Newsgroup: comp.arch

    John Savard <quadibloc@servername.invalid> schrieb:

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicate whether it is composed of 36-bit instructions or
    28-bit instructions.

    Do you think that instructions which require a certain size (almost)
    always happen to be situated together so they fit in a block?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Apr 8 19:56:27 2024
    From Newsgroup: comp.arch

    John Savard wrote:

    On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    In complicated if-then-else codes (and switches) I often see one
    instruction followed by a branch to a common point. Does your encoding
    deal with these efficiently ?? That is:: what happens when you jump to
    the middle of a block of 36-bit instructions ??

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicate whether it is composed of 36-bit instructions or
    28-bit instructions. So the computer knows where the instructions are;
    and thus a convention can be applied, such as addressing each 36-bit
    instruction by the addresses of the first seven 32-bit positions in
    the block.

    So, instead of using the branch target address, one rounds it down to
    a 256-bit boundary, reads 256 bits and looks at the first 4 bits to
    determine the format, and then uses the branch offset to pick a
    container which will become the first instruction executed.

    Sounds more complicated than necessary.

    In the case of 28-bit instructions, the first eight correspond to the
    32-bit positions, the ninth corresponds to the last 16 bits of the
    block.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Apr 9 18:24:55 2024
    From Newsgroup: comp.arch

    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
    destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, a pure 32-register,
    21-bit instruction ISA might actually work better.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Tue Apr 9 15:01:50 2024
    From Newsgroup: comp.arch

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
    destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, a pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
        16/32/48 bit instructions, with 32 GPRs, seem likely optimal for
        code density;
        32/64/96 bit instructions, with 64 GPRs, seem likely optimal for
        performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with 250+
    local variables to make effective use of this, *, which probably isn't
    going to happen).


    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice the number of CPU
    registers. With more local variables than this, the spill/fill rate
    goes up significantly; with fewer, the registers aren't utilized as
    effectively.

    Well, except in "tiny leaf" functions, where the criteria is instead
    that the number of local variables be less than the number of scratch registers. However, for many/most small leaf functions, the total number
    of variables isn't all that large either.


    Where, function categories:
        Tiny Leaf:
            Everything fits in scratch registers, no stack frame, no calls.
        Leaf:
            No function calls (either explicit or implicit);
            will have a stack frame.
        Non-Leaf:
            May call functions, has a stack frame.


    There is a "static assign everything" case in my case, where all of the variables are statically assigned to registers (for the scope of the function). This case typically requires that everything fit into callee
    save registers, so (like the "tiny leaf" category, requires that the
    number of local variables is less than the available registers).

    On a 32 register machine, if there are 14 available callee-save
    registers, the limit is 14 variables. On a 64 register machine, this
    limit might be 30 instead. This seems to have good coverage.

    In the non-static case, the top N variables might be static-assigned,
    and the remaining variables dynamically assigned. Though, it appears
    this is more an artifact of my naive register allocator, and might not
    be as effective a strategy with an "actually clever" register
    allocator (like those in GCC or LLVM), where purely dynamic allocation
    may be better (they are able to carry dynamic assignments across basic
    block boundaries, rather than needing to spill/fill everything whenever
    a branch or label is encountered).

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 9 21:05:44 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with 250+ local variables to make effective use of this, *, which probably isn't
    going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part of
    GPRs AND you have good access to constants.

    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice the number of CPU
    registers. With more local variables than this, the spill/fill rate
    goes up significantly; with fewer, the registers aren't utilized as
    effectively.

    Well, except in "tiny leaf" functions, where the criteria is instead
    that the number of local variables be less than the number of scratch registers. However, for many/most small leaf functions, the total number
    of variables isn't all that large either.

    The vast majority of leaf functions use fewer than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Where, function categories:
    Tiny Leaf:
    Everything fits in scratch registers, no stack frame, no calls.
    Leaf:
    No function calls (either explicit or implicit);
    Will have a stack frame.
    Non-Leaf:
    May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required
    to do try-throw-catch stuff as demanded by the source language.

    There is a "static assign everything" case in my case, where all of the variables are statically assigned to registers (for the scope of the function). This case typically requires that everything fit into callee
    save registers, so (like the "tiny leaf" category, requires that the
    number of local variables is less than the available registers).

    On a 32 register machine, if there are 14 available callee-save
    registers, the limit is 14 variables. On a 64 register machine, this
    limit might be 30 instead. This seems to have good coverage.

    The apparent number of registers goes up when one does not waste a register
    to hold a use-once constant.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Tue Apr 9 17:47:13 2024
    From Newsgroup: comp.arch

    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
    density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with
    250+ local variables to make effective use of this, *, which probably
    isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
    of GPRs AND you have good access to constants.


    On the main ISAs I had tried to generate code for, 16 GPRs was kind of
    a pain, as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee-save registers for
    variables in basic blocks which ended in a function call (in my compiler
    design, both function calls and branches terminate the current
    basic block).

    On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with code
    that has lots of spill-and-fill. This along with instructions having
    access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice the number of CPU
    registers. With more local variables than this, the spill/fill rate
    goes up significantly; with fewer, the registers aren't utilized
    as effectively.

    Well, except in "tiny leaf" functions, where the criteria is instead
    that the number of local variables be less than the number of scratch
    registers. However, for many/most small leaf functions, the total
    number of variables isn't all that large either.

    The vast majority of leaf functions use fewer than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc. are function calls in cases when
    not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle
    medium-size memcpy and memset style operations by branching into a
    slide.
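
    A hedged C sketch of the "slide" idea, using switch fallthrough in
    place of a computed branch into unrolled copy code (the sizes and the
    8-word cap are arbitrary, for illustration only):

        #include <stdint.h>
        #include <stddef.h>

        /* Copy n doublewords (n <= 8) by entering the slide at the
           right depth; each case falls through to the ones below it. */
        static void memcpy_slide8(uint64_t *d, const uint64_t *s, size_t n) {
            switch (n) {
            case 8: d[7] = s[7]; /* fall through */
            case 7: d[6] = s[6]; /* fall through */
            case 6: d[5] = s[5]; /* fall through */
            case 5: d[4] = s[4]; /* fall through */
            case 4: d[3] = s[3]; /* fall through */
            case 3: d[2] = s[2]; /* fall through */
            case 2: d[1] = s[1]; /* fall through */
            case 1: d[0] = s[0]; /* fall through */
            case 0: break;
            }
        }
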



    As noted, on a 32 GPR machine, most leaf functions can fit entirely in
    scratch registers. On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because TKRA-GL
    has a lot of functions with a large number of local variables (some
    exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
        There is no frame pointer, as BGBCC doesn't use one;
        All stack-frames are fixed size; VLAs and alloca use the heap;
        GOT: N/A in my ABI (stuff is GBR-relative, but GBR is not a GPR);
        TLS: accessed via TBR.

    Try/throw/catch:
        Mostly N/A for leaf functions.

    Any function that can "throw" is, in effect, no longer a leaf function.
    Implicitly, any function which uses "variant" or similar is also no
    longer a leaf function.

    The need for GBR save/restore effectively excludes a function from
    being tiny-leaf. This may happen, say, if a function accesses global
    variables and may be called as a function pointer.


    There is a "static assign everything" case in my case, where all of
    the variables are statically assigned to registers (for the scope of
    the function). This case typically requires that everything fit into
    callee save registers, so (like the "tiny leaf" category, requires
    that the number of local variables is less than the available registers).

    On a 32 register machine, if there are 14 available callee-save
    registers, the limit is 14 variables. On a 64 register machine, this
    limit might be 30 instead. This seems to have good coverage.

    The apparent number of registers goes up when one does not waste a register to hold a use-once constant.

    Possibly true. In the "static assign everything" case, each constant
    used is also assigned a register.


    One "TODO" here would be to merge constants with the same "actual" value
    into the same register. At present, they will be duplicated if the types
    are sufficiently different (such as integer 0 vs NULL).
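
    As a rough illustration of the merge (a sketch only; the names are
    hypothetical and not BGBCC's actual internals), the constant pool could
    be keyed on the raw 64-bit pattern rather than the IR type, so integer
    0 and NULL land in the same slot:

       #include <stdint.h>

       #define CPOOL_MAX 64

       typedef struct {
          uint64_t value;   /* raw bit pattern of the constant */
          int      reg;     /* register statically assigned to it */
       } CPoolEnt;

       static CPoolEnt cpool[CPOOL_MAX];
       static int cpool_n;

       /* Return the register holding 'value', assigning one if needed.
        * Keying on the bit pattern merges integer 0, NULL, 0.0, etc.
        * (bounds checking and spill handling elided). */
       int cpool_reg_for(uint64_t value, int (*alloc_reg)(void))
       {
          int i;
          for (i = 0; i < cpool_n; i++)
             if (cpool[i].value == value)
                return cpool[i].reg;
          cpool[cpool_n].value = value;
          cpool[cpool_n].reg = alloc_reg();
          return cpool[cpool_n++].reg;
       }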

    For functions with dynamic assignment, immediate values are more likely
    to be used. If the code-generator were clever, potentially it could
    exclude assigning registers to constants which are only used by
    instructions which can encode them directly as an immediate. Currently,
    BGBCC is not that clever.

    Or, say:
    y=x+31; //31 only being used here, and fits easily in an Imm9.
    Ideally, compiler could realize 31 does not need a register here.
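
    The check itself would be cheap; a sketch of the sort of test a code
    generator might apply (hypothetical interfaces, and assuming a signed
    9-bit immediate field; the real Imm9 encoding may differ):

       #include <stdint.h>

       static int fits_imm9(int64_t v)
       {
          return v >= -256 && v <= 255;   /* signed 9-bit range assumed */
       }

       /* 'use_count' and 'has_imm_form' would come from the IR; both
        * are stand-ins here, not BGBCC's real interfaces. */
       static int const_needs_register(int64_t v, int use_count,
                                       int has_imm_form)
       {
          if (use_count == 1 && has_imm_form && fits_imm9(v))
             return 0;   /* e.g. the 31 in y=x+31 encodes directly */
          return 1;
       }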


    Well, and another weakness is with temporaries that exist as function arguments:
    If static assigned, the "target variable directly to argument register" optimization can't be used (it ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the
    compiler breaks...).

    Though, I guess it could be possible that the compiler could try to
    partition temporaries that are used exclusively as function arguments
    into a different category from "normal" temporaries (or those whose
    values may cross a basic-block boundary), and then avoid
    statically-assigning them (and somehow not cause this to effectively
    break the full-static-assignment scheme in the process).

    Though, IIRC, I had also considered the possibility of a temporary
    "virtual assignment", allowing the argument value to be temporarily
    assigned to a function argument register, then going "poof" and
    disappearing when the function is called. Hadn't yet thought of a good
    way to add this logic to the register allocator though.


    But, yeah, compiler stuff is really fiddly...



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 10 00:28:02 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
    density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with
    250+ local variables to make effective use of this, *, which probably
    isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
    of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind of
    a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee-save registers for
    variables in basic blocks which ended in a function call (in my compiler
    design, both function calls and branches terminate the current
    basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with code
    that has lots of spill-and-fill. This along with instructions having
    access to 32-bit immediate values.

    Yes, x86 and any architecture with LD-Ops (IBM 360, S.E.L., Interdata,
    ...) act as if they have 4-6 more registers than they really have. x86
    with 16 GPRs acts like a RISC with 20-24 GPRs, as does the 360. It does
    not really take the place of universal constants, but goes a long way.


    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once one
    starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases when
    not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences; MM is a single
    instruction.

    Did end up with an intermediate "memcpy slide", which can handle medium
    size memcpy and memset style operations by branching into a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire system sees only the before or only the after state and nothing in between. This
    means one can start (queue up) a SATA disk access without obtaining a lock
    to the device--simply because one can fill in all the data of a command in
    a single instruction which smells ATOMIC to all interested 3rd parties.

    As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting totally screwed.

    On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    The data back in the R2000-3000 days indicated that 32 GPRs had a 15%+
    advantage over 16 GPRs; while 64 had only a 3% advantage.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because TKRA-GL
    has a lot of functions with a large number of local variables (some
    exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
    There is no frame pointer, as BGBCC doesn't use one;

    Can't do PASCAL and other ALGOL-derived languages with block structure.

    All stack-frames are fixed size, VLA's and alloca use the heap;

    longjmp() is at a serious disadvantage here.
    Destructors are sometimes hard to position on the stack.

    GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
    TLS, accessed via TBR.

    Try/throw/catch:
    Mostly N/A for leaf functions.

    Any function that can "throw" is, in effect, no longer a leaf function.
    Implicitly, any function which uses "variant" or similar is also no
    longer a leaf function.

    You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?

    Need for GBR save/restore effectively excludes a function from being tiny-leaf. This may happen, say, if a function accesses global variables
    and may be called as a function pointer.

    ------------------------------------------------------

    One "TODO" here would be to merge constants with the same "actual" value into the same register. At present, they will be duplicated if the types
    are sufficiently different (such as integer 0 vs NULL).

    In practice, the upper 48 bits of an extern variable are completely
    shared, whereas the lower 16 bits are unique.

    For functions with dynamic assignment, immediate values are more likely
    to be used. If the code-generator were clever, potentially it could
    exclude assigning registers to constants which are only used by
    instructions which can encode them directly as an immediate. Currently, BGBCC is not that clever.

    And then there are languages like PL/1 and FORTRAN where the compiler
    has to figure out how big an intermediate array is, allocate it, perform
    the math, and then deallocate it.

    Or, say:
    y=x+31; //31 only being used here, and fits easily in an Imm9.
    Ideally, compiler could realize 31 does not need a register here.


    Well, and another weakness is with temporaries that exist as function arguments:
    If static assigned, the "target variable directly to argument register" optimization can't be used (it ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the compiler breaks...).

    Though, I guess it could be possible that the compiler could try to
    partition temporaries that are used exclusively as function arguments
    into a different category from "normal" temporaries (or those whose
    values may cross a basic-block boundary), and then avoid statically-assigning them (and somehow not cause this to effectively
    break the full-static-assignment scheme in the process).

    Brian's compiler finds the largest argument list and the largest return
    value list and merges them into a single area on the stack used only
    for passing arguments and results across the call interface. And the
    <static> SP points at this area.

    Though, IIRC, I had also considered the possibility of a temporary
    "virtual assignment", allowing the argument value to be temporarily
    assigned to a function argument register, then going "poof" and
    disappearing when the function is called. Hadn't yet thought of a good
    way to add this logic to the register allocator though.


    But, yeah, compiler stuff is really fiddly...


    More orthogonality helps.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue Apr 9 22:01:00 2024
    From Newsgroup: comp.arch

    On 4/9/2024 3:47 PM, BGB-Alt wrote:
    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot of
    CPU time in functions that have large numbers of local variables all
    being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
    code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and 128
    GPRs is wasteful (would likely need lots of monster functions with
    250+ local variables to make effective use of this, *, which probably
    isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
    of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind of
    a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee-save registers for
    variables in basic blocks which ended in a function call (in my compiler
    design, both function calls and branches terminate the current
    basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with code
    that has lots of spill-and-fill. This along with instructions having
    access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions) if
    the number of local variables is roughly twice that of the number of
    CPU registers. If more local variables than this, then spill/fill
    rate goes up significantly, and if less, then the registers aren't
    utilized as effectively.

    Well, except in "tiny leaf" functions, where the criterion is instead
    that the number of local variables be less than the number of scratch
    registers. However, for many/most small leaf functions, the total
    number of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once
    one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases when
    not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle medium
    size memcpy and memset style operations by branching into a slide.



    As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers. On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because TKRA-GL
    has a lot of functions with a large number of local variables (some
    exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
      There is no frame pointer, as BGBCC doesn't use one;
        All stack-frames are fixed size, VLA's and alloca use the heap;
      GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
      TLS, accessed via TBR.[...]

    alloca using the heap? Strange to me...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 10 02:24:40 2024
    From Newsgroup: comp.arch

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
    code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and
    128 GPRs is wasteful (would likely need lots of monster functions
    with 250+ local variables to make effective use of this, *, which
    probably isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
    part of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind
    of a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee save registers for
    variables in basic blocks which ended in a function call (in my
    compiler design, both function calls and branches terminating the
    current basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via
    PC-relative memory loads, which kinda sucked.


    Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
    .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...



    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with
    code that has lots of spill-and-fill. This along with instructions
    having access to 32-bit immediate values.

    Yes, x86 and any architecture with LD-Ops (IBM 360, S.E.L., Interdata,
    ...) act as if they have 4-6 more registers than they really have. x86
    with 16 GPRs acts like a RISC with 20-24 GPRs, as does the 360. It does
    not really take the place of universal constants, but goes a long way.


    Yeah.



    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once
    one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences; MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...


    For small copies, can encode them inline, but past a certain size this
    becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high overhead
    for small to medium copies.


    So, there is a size range where doing it inline would be too bulky, but
    a loop carries an undesirable level of overhead.

    Ended up doing these with "slides", which end up eating roughly several
    kB of code space, but was more compact than using larger inline copies.


    Say (IIRC):
    128 bytes or less: Inline Ld/St sequence
    129 bytes to 512B: Slide
    Over 512B: Call "memcpy()" or similar.

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Though, this is only used for fixed-size copies (or "memcpy()" when
    value is constant).


    Say:

    __memcpy64_512_ua:
       MOV.Q  (R5, 480), R20
       MOV.Q  (R5, 488), R21
       MOV.Q  (R5, 496), R22
       MOV.Q  (R5, 504), R23
       MOV.Q  R20, (R4, 480)
       MOV.Q  R21, (R4, 488)
       MOV.Q  R22, (R4, 496)
       MOV.Q  R23, (R4, 504)

    __memcpy64_480_ua:
       MOV.Q  (R5, 448), R20
       MOV.Q  (R5, 456), R21
       MOV.Q  (R5, 464), R22
       MOV.Q  (R5, 472), R23
       MOV.Q  R20, (R4, 448)
       MOV.Q  R21, (R4, 456)
       MOV.Q  R22, (R4, 464)
       MOV.Q  R23, (R4, 472)

    ...

    __memcpy64_32_ua:
       MOV.Q  (R5), R20
       MOV.Q  (R5, 8), R21
       MOV.Q  (R5, 16), R22
       MOV.Q  (R5, 24), R23
       MOV.Q  R20, (R4)
       MOV.Q  R21, (R4, 8)
       MOV.Q  R22, (R4, 16)
       MOV.Q  R23, (R4, 24)
       RTS
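
    To sketch how such a slide gets entered (a C model only; the label
    name, the assumption of 32 code bytes per entry, and the function-
    pointer cast are all illustrative, since the real dispatch is emitted
    inline by the compiler):

       #include <stddef.h>
       #include <stdint.h>
       #include <string.h>

       extern uint8_t __memcpy64_slide[];   /* __memcpy64_512_ua label */

       typedef void (*slide_fn)(void *dst, const void *src);

       /* len assumed in the 129..512 byte range covered by the slide.
        * Each entry is assumed to be 32 bytes of code handling 32 data
        * bytes, so the entry copying main_len bytes sits at code offset
        * (512 - main_len) from the 512-byte entry point. */
       void memcpy_slide(void *dst, const void *src, size_t len)
       {
          size_t main_len = len & ~(size_t)31;  /* multiple of 32 */
          size_t tail     = len - main_len;     /* 0..31 leftover bytes */

          /* Tail bytes are handled before branching into the slide. */
          memcpy((uint8_t *)dst + main_len,
                 (const uint8_t *)src + main_len, tail);

          if (main_len)
             ((slide_fn)(uintptr_t)(__memcpy64_slide + (512 - main_len)))
                (dst, src);
       }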




    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire system sees only the before or only the after state and nothing in between. This means one can start (queue up) a SATA disk access without obtaining a lock
    to the device--simply because one can fill in all the data of a command in
    a single instruction which smells ATOMIC to all interested 3rd parties.


    My case, non-atomic, polling IO.


    Code fragment:
    while(ct<cte)
    {
       P_SPI_QDATA=0xFFFFFFFFFFFFFFFFULL;           // dummy bits to clock out
       P_SPI_CTRL=tkspi_ctl_status|SPICTRL_XMIT8X;  // start an 8-byte transfer
       v=P_SPI_CTRL;
       while(v&SPICTRL_BUSY)                        // spin until done
          v=P_SPI_CTRL;
       *(u64 *)ct=P_SPI_QDATA;                      // fetch the 8 bytes read
       ct+=8;
    }

    Where the MMIO interface allows sending/receiving 8 bytes at a time to
    avoid bogging down at around 500 K/s or so (with 8B transfers, could
    theoretically do 4 MB/s; though it is only ~ 1.5 MB/s with 12.5 MHz SPI,
    as 12.5 Mbit/s is only about 1.5 MB/s of raw bit rate).

    Though, this is part of why I had ended up LZ compressing damn near
    everything (LZ4 or RP2 being faster than sending ~ 3x as much data over
    the SPI interface).


    Hadn't generally used Huffman as the additional compression wasn't worth
    the fairly steep performance cost (with something like Deflate, it would barely be much faster than the bare SPI interface).



    Did recently come up with a "pseudo entropic" coding that seems
    promising in some testing:
       Rank symbols by probability, sending the most common 128 symbols;
       Send the encoded symbols as table indices via bytes, say:
         00..78: Pair of symbol indices, 00..0A;
         7F: Escaped Byte
         80..FF: Symbol Index

    Which, while it seems like this would likely fail to do much of
    anything, it "sorta works", and is much faster to unpack than Huffman.

    Though, if the distribution is "too flat", one needs to be able to fall
    back to raw bytes.
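
    As one concrete reading of the scheme (a minimal decoder sketch; the
    split of the 00..78 range into index pairs 0..0A is my interpretation,
    since 11*11 = 121 codes, and transmission of the ranked table is
    elided):

       #include <stddef.h>
       #include <stdint.h>

       /* table[] holds the 128 most common symbols, most common first. */
       size_t pse_decode(const uint8_t *src, size_t n,
                         uint8_t *dst, const uint8_t table[128])
       {
          uint8_t *d = dst;
          size_t i = 0;
          while (i < n) {
             uint8_t b = src[i++];
             if (b >= 0x80) {          /* 80..FF: one table index */
                *d++ = table[b - 0x80];
             } else if (b == 0x7F) {   /* 7F: escaped raw byte */
                *d++ = src[i++];
             } else {                  /* 00..78: pair of indices, 0..0A */
                *d++ = table[b / 11];
                *d++ = table[b % 11];
             }
          }
          return (size_t)(d - dst);
       }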

    Had experimentally written a compressor based around this scheme, and
    while not as fast as LZ4, it did give compression much closer to Deflate.


    Where, IME, on my current main PC:
       LZMA: ~ 35 MB/s
         Bitwise range coder.
       Deflate: ~ 200 MB/s
         Huffman based, symbols limited to 15 bits.
       TKuLZ: ~ 350 MB/s
         Resembles a Deflate / LZ4 hybrid.
         Huffman based, symbols limited to 12 bits.
       TKFLZH: ~ 500 MB/s
         Similar to a more elaborate version of TKuLZ.
         Huffman symbols limited to 13 bits.
       TKDELZ: ~ 700 MB/s
         Similar to the prior, but:
           Splits symbols into separately-coded blocks;
           Uses an interleaved encoding scheme, decoding 4 symbols at a time.
       PSELZ: ~ 1.0 GB/s
         Uses separate symbol blocks, with the "pseudo entropic" encoding.
       RP2: ~ 1.8 GB/s
         Byte oriented.
       LZ4: ~ 2.1 GB/s



    Though, RP2 and LZ4 switch places on BJX2, where RP2 is both slightly
    faster and gives slightly better compression.

    I suspect this is likely because of differences in the relative cost of
    byte loads and branch mispredicts.


    Note that TKuLZ/TKFLZH/TKDELZ/PSELZ used a similar structure for
    encoding LZ matches:
       TAG
       (Raw Length)
       (Match Length)
       Match Distance
       (Literal Bytes)
    Where, TAG has a structure like:
       (7:5): Raw Length (0..6, 7 = Separate length)
       (4:0): Match Length (3..33, 34 = Separate length)

    Though, the former 3 were using a combination nybble-stream and bitstream.

    Had considered a nybble stream for PSELZ, but ended up using bytes as
    bytes are faster.
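
    Read back as code, the match header might unpack roughly like so (the
    byte-oriented PSELZ case assumed; read_len() and read_dist() are
    stand-ins for however the separate lengths and the distance are
    actually coded):

       #include <stdint.h>

       typedef struct {
          uint32_t raw_len;    /* literal bytes carried with the match */
          uint32_t match_len;
          uint32_t dist;
       } LzTok;

       extern uint32_t read_len(const uint8_t **p);   /* assumed helper */
       extern uint32_t read_dist(const uint8_t **p);  /* assumed helper */

       LzTok lz_read_tok(const uint8_t **p)
       {
          LzTok t;
          uint8_t  tag = *(*p)++;
          uint32_t r = (tag >> 5) & 7;      /* (7:5): raw length field */
          uint32_t m = (tag & 31) + 3;      /* (4:0): match length, 3..34 */
          t.raw_len   = (r == 7)  ? read_len(p) : r;   /* 7 = separate */
          t.match_len = (m == 34) ? read_len(p) : m;   /* 34 = separate */
          t.dist      = read_dist(p);
          /* t.raw_len literal bytes follow the header */
          return t;
       }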



    As noted, on a 32 GPR machine, most leaf functions can fit entirely in
    scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting totally screwed.


    OK.


                        On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    The data back in the R2000-3000 days indicated that 32 GPRs had a 15%+
    advantage over 16 GPRs; while 64 had only a 3% advantage.


    Probably true enough.


    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because
    TKRA-GL has a lot of functions with a large number of local variables
    (some exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
       There is no frame pointer, as BGBCC doesn't use one;

    Can't do PASCAL and other ALGOL-derived languages with block structure.


    Nothing prevents having a frame pointer; BGBCC just doesn't use one, as
    it doesn't really gain anything if one has fixed-size stack frames (and
    it is another register to be saved/restored).

    Granted, it would make stack-walking easier.
    As is, one needs to use a similar strategy to how one does stack
    unwinding in the Win64 ABI (namely looking stuff up in a table, and then parsing the instruction sequence).


         All stack-frames are fixed size, VLA's and alloca use the heap;

    longjmp() is at a serious disadvantage here. Destructors are sometimes
    hard to position on the stack.


    Yeah... If you use longjmp, any VLA's or alloca's are gonna be leaked...

    Nothing really mandates that longjmp + alloca not result in a memory
    leak though (or that alloca can't be implemented as a fancy wrapper over malloc).


       GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
       TLS, accessed via TBR.

    Try/throw/catch:
       Mostly N/A for leaf functions.

    Any function that can "throw" is, in effect, no longer a leaf function.
    Implicitly, any function which uses "variant" or similar is also no
    longer a leaf function.

    You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?


    ?...

    Throw was implemented as a runtime call in my case.

    Though, try/catch involves arcane metadata and compiler magic (basically borrowing a similar design to the WinCE and Win64 exception handling).

    Though, could have in theory gone the SEH route and implemented it using hidden runtime calls and a linked list. There are pros/cons either way
    (and SEH would have an on-average lower overhead for C, since when it is
    not used, its cost would effectively drop to 0; unlike the cost of
    PE/COFF exception-handling tables, and needing to provide dummy
    exception catch/continue blobs in the off-chance exceptions were used
    and passed through C land).


    Need for GBR save/restore effectively excludes a function from being
    tiny-leaf. This may happen, say, if a function accesses global
    variables and may be called as a function pointer.

    ------------------------------------------------------

    One "TODO" here would be to merge constants with the same "actual"
    value into the same register. At present, they will be duplicated if
    the types are sufficiently different (such as integer 0 vs NULL).

    In practice, the upper 48 bits of an extern variable are completely
    shared, whereas the lower 16 bits are unique.

    For functions with dynamic assignment, immediate values are more
    likely to be used. If the code-generator were clever, potentially it
    could exclude assigning registers to constants which are only used by
    instructions which can encode them directly as an immediate.
    Currently, BGBCC is not that clever.

    And then there are languages like PL/1 and FORTRAN where the compiler
    has to figure out how big an intermediate array is, allocate it, perform
    the math, and then deallocate it.


    I don't expect BJX2 and FORTRAN to cross paths...


    Or, say:
       y=x+31;  //31 only being used here, and fits easily in an Imm9.
    Ideally, compiler could realize 31 does not need a register here.


    Well, and another weakness is with temporaries that exist as function
    arguments:
    If static assigned, the "target variable directly to argument
    register" optimization can't be used (it ends up needing to go into a
    callee-save register and then be MOV'ed into the argument register;
    otherwise the compiler breaks...).

    Though, I guess it could be possible that the compiler could try to
    partition temporaries that are used exclusively as function arguments
    into a different category from "normal" temporaries (or those whose
    values may cross a basic-block boundary), and then avoid
    statically-assigning them (and somehow not cause this to effectively
    break the full-static-assignment scheme in the process).

    Brian's compiler finds the largest argument list and the largest return
    value list and merges them into a single area on the stack used only
    for passing arguments and results across the call interface. And the
    <static> SP points at this area.


    The issue isn't with stack space, this part is straightforward.

    Rather, that in the IR stage, one has something like (pseudocode):
       t4 := t0 + 1;
       t5 := t1 + 2;
       t6 := func(t4, t5);
       t7 := t6 + 3;

    Where, at the ASM level, one could do, say:
       ADD R8, 1, R4
       ADD R9, 2, R5
       BSR func
       ADD R2, 3, R10

    But... Pulling this off without the compiler and/or compiled program
    exploding in the process, is easier said than done.

    OTOH:
       ADD R8, 1, R11
       ADD R9, 2, R12
       MOV R11, R4
       MOV R12, R5
       BSR func
       MOV R2, R11
       ADD R11, 3, R10

    This is not quite as efficient but, despite some efforts, is closer to
    the current situation than the version above. There are some cases where
    these optimizations can be performed, but only if the variables in
    question are not static-assigned, which forces the latter scenario.


    Though, IIRC, I had also considered the possibility of a temporary
    "virtual assignment", allowing the argument value to be temporarily
    assigned to a function argument register, then going "poof" and
    disappearing when the function is called. Hadn't yet thought of a good
    way to add this logic to the register allocator though.


    But, yeah, compiler stuff is really fiddly...


    More orthogonality helps.

    These parts of my compiler are a horrible mess, and rather brittle...

    Partial reason there is no RISC-V support in BGBCC, is because the ABI
    is different, and the current design of the register allocator can't
    deal with a different ABI design.




    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 10 02:41:01 2024
    From Newsgroup: comp.arch

    On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
    On 4/9/2024 3:47 PM, BGB-Alt wrote:
    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot
    of CPU time in functions that have large numbers of local variables
    all being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
    code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and
    128 GPRs is wasteful (would likely need lots of monster functions
    with 250+ local variables to make effective use of this, *, which
    probably isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
    part of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind
    of a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself (such
    as dealing with register allocation involving scratch registers while
    also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee save registers for
    variables in basic blocks which ended in a function call (in my
    compiler design, both function calls and branches terminate the
    current basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via
    PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory operands
    with most instructions, and the CPU tends to deal fairly well with
    code that has lots of spill-and-fill. This along with instructions
    having access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions)
    if the number of local variables is roughly twice that of the number
    of CPU registers. If more local variables than this, then spill/fill
    rate goes up significantly, and if less, then the registers aren't
    utilized as effectively.

    Well, except in "tiny leaf" functions, where the criteria is instead
    that the number of local variables be less than the number of
    scratch registers. However, for many/most small leaf functions, the
    total number of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one has
    a SP not part of GPRs {including arguments and return values}. Once
    one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a slide.



    As noted, on a 32 GPR machine, most leaf functions can fit entirely in
    scratch registers. On a 64 GPR machine, this percentage is slightly
    higher (but, not significantly, since there are few leaf functions
    remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because
    TKRA-GL has a lot of functions with a large number of local variables
    (some exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
       There is no frame pointer, as BGBCC doesn't use one;
         All stack-frames are fixed size, VLA's and alloca use the heap;
       GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
       TLS, accessed via TBR.[...]

    alloca using the heap? Strange to me...


    Well, in this case:
    The alloca calls are turned into calls which allocate the memory blob
    and add it to a linked list;
    when the function returns, everything in the linked list is freed;
    Then, it internally pulls this off via malloc and free.
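
    A minimal sketch of the runtime side of this (names hypothetical; the
    real calls are whatever BGBCC emits, and alignment details are glossed
    over):

       #include <stdlib.h>

       typedef struct AllocaBlk { struct AllocaBlk *next; } AllocaBlk;

       /* Each alloca becomes a call like this, with 'list' a hidden
        * per-function-invocation variable. */
       void *__alloca_push(AllocaBlk **list, size_t sz)
       {
          AllocaBlk *blk = malloc(sizeof(AllocaBlk) + sz);
          blk->next = *list;
          *list = blk;
          return blk + 1;   /* payload starts after the link header */
       }

       /* Emitted on every return path of a function that used alloca. */
       void __alloca_pop_all(AllocaBlk **list)
       {
          while (*list) {
             AllocaBlk *next = (*list)->next;
             free(*list);
             *list = next;
          }
       }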

    Also the typical default stack size in this case is 128K, so trying to
    put big allocations on the stack is more liable to result in a stack
    overflow.

    Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
    heap allocation is not too slow in this case.


    Though, at the same time, ideally one limits use of language features
    where the code-generation degenerates into a mess of hidden runtime
    calls. These cases are not ideal for performance...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 10 17:12:47 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??

    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
    .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences; MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........

    For small copies, can encode them inline, but past a certain size this becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high overhead
    for small to medium copies.


    So, there is a size range where doing it inline would be too bulky, but
    a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an instruction.

    Ended up doing these with "slides", which end up eating roughly several
    kB of code space, but was more compact than using larger inline copies.


    Say (IIRC):
    128 bytes or less: Inline Ld/St sequence
    129 bytes to 512B: Slide
    Over 512B: Call "memcpy()" or similar.

    Versus::
    1-infinity: use MM instruction.

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Does this remain sequentially consistent ??

    Though, this is only used for fixed-size copies (or "memcpy()" when
    value is constant).


    Say:

    __memcpy64_512_ua:
    MOV.Q (R5, 480), R20
    MOV.Q (R5, 488), R21
    MOV.Q (R5, 496), R22
    MOV.Q (R5, 504), R23
    MOV.Q R20, (R4, 480)
    MOV.Q R21, (R4, 488)
    MOV.Q R22, (R4, 496)
    MOV.Q R23, (R4, 504)

    __memcpy64_480_ua:
    MOV.Q (R5, 448), R20
    MOV.Q (R5, 456), R21
    MOV.Q (R5, 464), R22
    MOV.Q (R5, 472), R23
    MOV.Q R20, (R4, 448)
    MOV.Q R21, (R4, 456)
    MOV.Q R22, (R4, 464)
    MOV.Q R23, (R4, 472)

    ....

    __memcpy64_32_ua:
    MOV.Q (R5), R20
    MOV.Q (R5, 8), R21
    MOV.Q (R5, 16), R22
    MOV.Q (R5, 24), R23
    MOV.Q R20, (R4)
    MOV.Q R21, (R4, 8)
    MOV.Q R22, (R4, 16)
    MOV.Q R23, (R4, 24)
    RTS

    Duff's device in any other name.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Apr 10 17:29:22 2024
    From Newsgroup: comp.arch

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed Apr 10 11:57:14 2024
    From Newsgroup: comp.arch

    On 4/10/2024 12:41 AM, BGB wrote:
    On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
    On 4/9/2024 3:47 PM, BGB-Alt wrote:
    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot
    of CPU time in functions that have large numbers of local variables
    all being used at the same time.


    Seemingly:
    16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
    code density;
    32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
    performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and
    128 GPRs is wasteful (would likely need lots of monster functions
    with 250+ local variables to make effective use of this, *, which
    probably isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
    part of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind
    of a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to use
    all of the registers at the same time without stepping on itself
    (such as dealing with register allocation involving scratch registers
    while also not conflicting with the use of function arguments, ...).


    My code generators had typically only used callee save registers for
    variables in basic blocks which ended in a function call (in my
    compiler design, both function calls and branches terminate the
    current basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was via
    PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory
    operands with most instructions, and the CPU tends to deal fairly
    well with code that has lots of spill-and-fill. This along with
    instructions having access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions)
    if the number of local variables is roughly twice that of the
    number of CPU registers. If more local variables than this, then
    spill/fill rate goes up significantly, and if less, then the
    registers aren't utilized as effectively.

    Well, except in "tiny leaf" functions, where the criterion is
    instead that the number of local variables be less than the number
    of scratch registers. However, for many/most small leaf functions,
    the total number of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one
    has
    a SP not part of GPRs {including arguments and return values}. Once
    one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log()
    in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a
    slide.



    As noted, on a 32 GPR machine, most leaf functions can fit entirely
    in scratch registers. On a 64 GPR machine, this percentage is
    slightly higher (but, not significantly, since there are few leaf
    functions remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...). There
    are a whole lot more leaf functions that exceed a limit of 6 than of 14.
    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because
    TKRA-GL has a lot of functions with a large number of local
    variables (some exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and unrolled
    and uses lots of variables tending to perform better in my case (and
    tightly looping code, with lots of small functions, not so much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
       There is no frame pointer, as BGBCC doesn't use one;
         All stack-frames are fixed size, VLA's and alloca use the heap;
       GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
       TLS, accessed via TBR.[...]

    alloca using the heap? Strange to me...


    Well, in this case:
    The alloca calls are turned into calls which allocate the memory blob
    and add it to a linked list;
    when the function returns, everything in the linked list is freed;
    Then, it internally pulls this off via malloc and free.

    Also the typical default stack size in this case is 128K, so trying to
    put big allocations on the stack is more liable to result in a stack overflow.

    Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
    heap allocation is not too slow in this case.


    Though, at the same time, ideally one limits use of language features
    where the code-generation degenerates into a mess of hidden runtime
    calls. These cases are not ideal for performance...



    Sometimes alloca is useful wrt offsetting the stack to avoid false
    sharing between stacks. Intel wrote a little paper that addresses this:

    https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf

    Remember that one?
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Wed Apr 10 15:51:07 2024
    From Newsgroup: comp.arch

    On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.


    Yeah.

    This was why some of the first things I did when I started extending
    SH-4 were:
    Adding mechanisms to build constants inline;
    Adding Load/Store ops with a displacement (albeit with encodings
    borrowed from SH-2A);
    Adding 3R and 3RI encodings (originally Imm8 for 3RI).


    Did have a mess when I later extended the ISA to 32 GPRs, as (like with
    BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.


    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??


    I didn't design SuperH, Hitachi did...

    All of this stuff was apparently sufficient for the SEGA
    32X/Saturn/Dreamcast consoles... (vs the Genesis/MegaDrive using a
    M68000, and Master System using a Z80).


    I guess for a while it was also popular in CD-ROM and HDD controllers.
    I guess after SEGA left the game-console market, they had continued
    using it for a while in arcade machines, before apparently later jumping
    over to x86 via low-end PC motherboards (I guess it being cheaper since
    the mid/late 2000s to build an arcade machine with off-the-shelf PC parts).

    Saw a video where a guy was messing with one of these, where I guess
    despite being built with low-end PC parts (and an entry-level graphics
    card), the parts were balanced well enough that it still gave fairly
    decent gaming performance.



    But, with BJX1, I had added Disp16 branches.

    With BJX2, they were replaced with 20 bit branches. These have the merit
    of being able to branch anywhere within a Doom or Quake sized binary.


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
       .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!


    Yeah.
    This sort of stuff created strong incentive for ISA redesign...

    Granted, it is possible had I instead started with RISC-V instead of
    SuperH, it is probable BJX2 wouldn't exist.


    Though, at the time, the original thinking was that SuperH having
    smaller instructions meant it would have better code density than RV32I
    or similar. Turns out not really, as the penalty of the 16 bit ops was
    needing almost twice as many on average.


    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences; MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........


    But, at what cost...

    I had generally avoided anything that would have required microcode or
    shoving state machines into the pipeline or similar.

    Things like Load/Store-Multiple or similar.


    For small copies, can encode them inline, but past a certain size this
    becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high
    overhead for small to medium copies.


    So, there is a size range where doing it inline would be too bulky,
    but a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an instruction.


    This is an area where "slides" work well, the main cost is mostly the
    bulk that the slide adds to the binary (albeit, it is one-off).

    Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...


    For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
    iteration or so to try to limit looping overhead.

    Though, leveraging the memcpy slide for the interior part of the copy
    could be possible in theory as well.
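
    For the looping case, a plain C sketch of a 64-bytes-per-iteration
    inner loop (the function name is made up; alignment and tail handling
    are elided):

       #include <stddef.h>
       #include <stdint.h>

       /* Copy len bytes, 64 per iteration, to amortize loop overhead.
        * Assumes 8-byte alignment and len a multiple of 64. */
       void copy_loop64(uint64_t *dst, const uint64_t *src, size_t len)
       {
          size_t n = len >> 3;              /* count in 8-byte words */
          while (n >= 8) {                  /* 8 words = 64 bytes */
             dst[0] = src[0];  dst[1] = src[1];
             dst[2] = src[2];  dst[3] = src[3];
             dst[4] = src[4];  dst[5] = src[5];
             dst[6] = src[6];  dst[7] = src[7];
             dst += 8;  src += 8;  n -= 8;
          }
       }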



    For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
    shorter (a big part of LZ decoder performance mostly being in
    fine-tuning the logic for the match copies).

    Though, this is part of why my runtime library had added a
    "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
    which can consolidate this rather than needing to do it one-off for each
    LZ decoder (as I see it, it is a similar issue to not wanting code to endlessly re-roll stuff for functions like memcpy or malloc/free, *).


    *: Though, nevermind that the standard C interface for malloc is
    annoyingly minimal, and ends up requiring most non-trivial programs to
    roll their own memory management.


    Ended up doing these with "slides", which end up eating roughly
    several kB of code space, but was more compact than using larger
    inline copies.


    Say (IIRC):
       128 bytes or less: Inline Ld/St sequence
       129 bytes to 512B: Slide
       Over 512B: Call "memcpy()" or similar.

    Versus::
        1-infinity: use MM instruction.


    Yeah, but it makes the CPU logic more expensive.


    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Does this remain sequentially consistent ??


    Within a thread, it is fine.

    Main wonk is that it does start copying from the high address first.
    Presumably interrupts or similar won't be messing with application memory
    mid-memcpy.

    The looping memcpy's generally work from low to high addresses though.


    Though, this is only used for fixed-size copies (or "memcpy()" when
    the size is constant).


    Say:

    __memcpy64_512_ua:
       MOV.Q        (R5, 480), R20
       MOV.Q        (R5, 488), R21
       MOV.Q        (R5, 496), R22
       MOV.Q        (R5, 504), R23
       MOV.Q        R20, (R4, 480)
       MOV.Q        R21, (R4, 488)
       MOV.Q        R22, (R4, 496)
       MOV.Q        R23, (R4, 504)

    __memcpy64_480_ua:
       MOV.Q        (R5, 448), R20
       MOV.Q        (R5, 456), R21
       MOV.Q        (R5, 464), R22
       MOV.Q        (R5, 472), R23
       MOV.Q        R20, (R4, 448)
       MOV.Q        R21, (R4, 456)
       MOV.Q        R22, (R4, 464)
       MOV.Q        R23, (R4, 472)

    ....

    __memcpy64_32_ua:
       MOV.Q        (R5), R20
       MOV.Q        (R5, 8), R21
       MOV.Q        (R5, 16), R22
       MOV.Q        (R5, 24), R23
       MOV.Q        R20, (R4)
       MOV.Q        R21, (R4, 8)
       MOV.Q        R22, (R4, 16)
       MOV.Q        R23, (R4, 24)
       RTS

    Duff's device by any other name.

    More or less, though I think the idea of Duff's device is specifically
    in the way that it abuses the while-loop and switch constructs.

    This is basically just an unrolled slide.
    So, where one branches into it determines how much is copied.

    For small-to-medium copies, the advantage is mostly that this avoids
    looping overhead.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 10 21:19:20 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.


    Yeah.

    This was why some of the first things I did when I started extending
    SH-4 were:
    Adding mechanisms to build constants inline;
    Adding Load/Store ops with a displacement (albeit with encodings
    borrowed from SH-2A);
    Adding 3R and 3RI encodings (originally Imm8 for 3RI).

    My suggestion is that:: "Now that you have screwed around for a while,
    Why not take that experience and do a new ISA without any of those
    mistakes in it" ??

    Did have a mess when I later extended the ISA to 32 GPRs, as (like with
    BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.


    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??


    I didn't design SuperH, Hitachi did...

    But you did not fix them en masse, and you complain about them
    at least once a week. There comes a time when it takes less time
    and less courage to do that big switch and clean up all that mess.


    But, with BJX1, I had added Disp16 branches.

    With BJX2, they were replaced with 20 bit branches. These have the merit
    of being able to branch anywhere within a Doom or Quake sized binary.


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
       .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!


    Yeah.
    This sort of stuff created strong incentive for ISA redesign...

    Maybe consider now as the appropriate time to start.

    Granted, had I instead started with RISC-V rather than SuperH, it is
    probable BJX2 wouldn't exist.


    Though, at the time, the original thinking was that SuperH having
    smaller instructions meant it would have better code density than RV32I
    or similar. Turns out not really, as the penalty of the 16 bit ops was needing almost twice as many on average.

    My 66000 only requires 70% of the instruction count of RISC-V,
    Yours could too ................

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........


    But, at what cost...

    You would not have to spend hours a week defending the indefensible !!

    I had generally avoided anything that would have required microcode or
    shoving state-machines into the pipeline or similar.

    Things as simple as IDIV and FDIV require sequencers.
    But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!

    Things like Load/Store-Multiple or

    If you like polluted ICaches..............

    For small copies, can encode them inline, but past a certain size this
    becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high
    overhead for small to medium copy.


    So, there is a size range where doing it inline would be too bulky,
    but a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an
    instruction.


    This is an area where "slides" work well; the main cost is mostly the
    bulk that the slide adds to the binary (albeit, it is one-off).

    Consider that the predictor getting into the slide the first time
    always mispredicts !!

    Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...

    What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
    yet a HW sequencer only has to avoid asserting a single byte write enable
    once.

    For looping memcpy, it makes sense to copy 64 or 128 bytes per loop iteration or so to try to limit looping overhead.

    On low end machines, you want to operate at cache port width,
    On high end machines, you want to operate at cache line widths per port.
    This is essentially impossible using slides.....here, the same code is
    not optimal across a line of implementations.

    Though, leveraging the memcpy slide for the interior part of the copy
    could be possible in theory as well.

    What do you do when the SATA drive wants to write a whole page ??

    For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot shorter (a big part of LZ decoder performance mostly being in
    fine-tuning the logic for the match copies).

    Though, this is part of why my runtime library had added
    "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
    which can consolidate this rather than needing to do it one-off for each
    LZ decoder (as I see it, it is a similar issue to not wanting code to
    endlessly re-roll stuff for functions like memcpy or malloc/free, *).


    *: Though, nevermind that the standard C interface for malloc is
    annoyingly minimal, and ends up requiring most non-trivial programs to
    roll their own memory management.


    Ended up doing these with "slides", which end up eating roughly
    several kB of code space, but was more compact than using larger
    inline copies.


    Say (IIRC):
       128 bytes or less: Inline Ld/St sequence
       129 bytes to 512B: Slide
       Over 512B: Call "memcpy()" or similar.

    Versus::
        1-infinity: use MM instruction.


    Yeah, but it makes the CPU logic more expensive.

    By what, 37-gates ??

    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the last
    bytes need to be handled externally prior to branching into the slide.

    Does this remain sequentially consistent ??


    Within a thread, it is fine.

    What if a SATA drive is reading while you are writing !!
    That is, DMA is no different than multi-threaded applications--except
    DMA cannot perform locks.

    Main wonk is that it does start copying from the high address first.
    Presumably interrupts or similar won't be messing with application
    memory mid-memcpy.

    The only things wanting high-low access patterns are dumping stuff to the stack. The fact you CAN get away with it most of the time is no excuse.

    The looping memcpy's generally work from low to high addresses though.

    As does all string processing.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Wed Apr 10 16:53:32 2024
    From Newsgroup: comp.arch

    On 4/10/2024 1:57 PM, Chris M. Thomasson wrote:
    On 4/10/2024 12:41 AM, BGB wrote:
    On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
    On 4/9/2024 3:47 PM, BGB-Alt wrote:
    On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 1:24 PM, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, pure 32-register,
    21-bit instruction ISA might actually work better.


    For 32-bit instructions at least, 64 GPRs can work out OK.

    Though, the gain of 64 over 32 seems to be fairly small for most
    "typical" code, mostly bringing a benefit if one is spending a lot >>>>>> of CPU time in functions that have large numbers of local
    variables all being used at the same time.


    Seemingly:
      16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
      code density;
      32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
      performance.

    Where, 16 GPRs isn't really enough (lots of register spills), and
    128 GPRs is wasteful (would likely need lots of monster functions
    with 250+ local variables to make effective use of this, *, which
    probably isn't going to happen).

    16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
    part of GPRs AND you have good access to constants.


    On the main ISA's I had tried to generate code for, 16 GPRs was kind
    of a pain as it resulted in fairly high spill rates.

    Though, it would probably be less bad if the compiler was able to
    use all of the registers at the same time without stepping on itself
    (such as dealing with register allocation involving scratch
    registers while also not conflicting with the use of function
    arguments, ...).


    My code generators had typically only used callee save registers for
    variables in basic blocks which ended in a function call (in my
    compiler design, both function calls and branches terminate the
    current basic-block).

    On SH, the main way of getting constants (larger than 8 bits) was
    via PC-relative memory loads, which kinda sucked.


    This is slightly less bad on x86-64, since one can use memory
    operands with most instructions, and the CPU tends to deal fairly
    well with code that has lots of spill-and-fill. This along with
    instructions having access to 32-bit immediate values.


    *: Where, it appears it is most efficient (for non-leaf functions)
    if the number of local variables is roughly twice that of the
    number of CPU registers. If more local variables than this, then
    spill/fill rate goes up significantly, and if less, then the
    registers aren't utilized as effectively.

    Well, except in "tiny leaf" functions, where the criterion is
    instead that the number of local variables be less than the number
    of scratch registers. However, for many/most small leaf functions,
    the total number of variables isn't all that large either.

    The vast majority of leaf functions use less than 16 GPRs, given one
    has a SP not part of GPRs {including arguments and return values}.
    Once one starts placing things like memmove(), memset(), sin(), cos(),
    exp(), log() in the ISA, it goes up even more.


    Yeah.

    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a
    slide.



    As noted, on a 32 GPR machine, most leaf functions can fit entirely
    in scratch registers. On a 64 GPR machine, this percentage is
    slightly higher (but, not significantly, since there are few leaf
    functions remaining at this point).


    If one had a 16 GPR machine with 6 usable scratch registers, it is a
    little harder though (as typically these need to cover both any
    variables used by the function, and any temporaries used, ...).
    There are a whole lot more leaf functions that exceed a limit of 6
    than of 14.

    But, say, a 32 GPR machine could still do well here.


    Note that there are reasons why I don't claim 64 GPRs as a large
    performance advantage:
    On programs like Doom, the difference is small at best.


    It mostly affects things like GLQuake in my case, mostly because
    TKRA-GL has a lot of functions with a large number of local
    variables (some exceeding 100 local variables).

    Partly though this is due to code that is highly inlined and
    unrolled and uses lots of variables tending to perform better in my
    case (and tightly looping code, with lots of small functions, not so
    much...).



    Where, function categories:
       Tiny Leaf:
         Everything fits in scratch registers, no stack frame, no calls.
       Leaf:
         No function calls (either explicit or implicit);
         Will have a stack frame.
       Non-Leaf:
         May call functions, has a stack frame.

    You are forgetting about FP, GOT, TLS, and whatever resources are
    required
    to do try-throw-catch stuff as demanded by the source language.


    Yeah, possibly true.

    In my case:
       There is no frame pointer, as BGBCC doesn't use one;
         All stack-frames are fixed size, VLA's and alloca use the heap;
       GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
       TLS, accessed via TBR.[...]

    alloca using the heap? Strange to me...


    Well, in this case:
    The alloca calls are turned into calls which allocate the memory blob
    and add it to a linked list;
    when the function returns, everything in the linked list is freed;
    Then, it internally pulls this off via malloc and free.
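
    A minimal sketch of that lowering (helper names are hypothetical and
    error handling is omitted; the real BGBCC details may differ):

        #include <stdlib.h>

        typedef struct AllocaBlk { struct AllocaBlk *next; } AllocaBlk;

        /* Each lowered alloca() links its block onto a per-frame list. */
        void *__alloca_hp(AllocaBlk **frame_list, size_t sz)
        {
            AllocaBlk *blk = malloc(sizeof(AllocaBlk) + sz);
            blk->next = *frame_list;
            *frame_list = blk;
            return blk + 1;    /* usable space follows the header */
        }

        /* Emitted in the function epilog: free everything at once. */
        void __alloca_hp_free(AllocaBlk **frame_list)
        {
            AllocaBlk *p = *frame_list, *q;
            for (; p; p = q) { q = p->next; free(p); }
            *frame_list = NULL;
        }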

    Also the typical default stack size in this case is 128K, so trying to
    put big allocations on the stack is more liable to result in a stack
    overflow.

    Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
    heap allocation is not too slow in this case.


    Though, at the same time, ideally one limits use of language features
    where the code-generation degenerates into a mess of hidden runtime
    calls. These cases are not ideal for performance...



    Sometimes alloca is useful wrt offsetting the stack to avoid false
    sharing between stacks. Intel wrote a little paper that addresses this:

    https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf

    Remember that one?

    This seems mostly N/A in my case, as the cores use a weak memory model
    and there is no SMT.

    Also thread creation tends to offset stacks by a random amount as well
    as a form of ASLR (IIRC, roughly 0..256 bytes in a multiple of 16).


    As for reducing the cost of heap sharing between threads, there is
    another option here:
    Give each thread its own local version of the heap (essentially,
    per-thread free-lists and similar). This can avoid the need to use mutex locking and similar, though may have a penalty if one tries to free
    memory objects that weren't allocated in the same thread.

    In my case, the heap is split into multiple object sizes:
      Small:
        Under around 1K, allocated in terms of 16-byte cells;
        Allocated in chunks from the medium heap.
      Medium:
        Around 1K to 64K, allocated via subdividing a larger block (256K);
        Allocated via the large heap.
      Large:
        Bigger than 64K or so, allocated via pages (eg: "mmap()").

    For the most common sizes (small and medium), the free-list and similar
    could be thread local; global locking then being used for large
    allocation, or for allocating new heap blocks (for the medium heap).
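
    A small sketch of that size-class split (boundaries as described above,
    names hypothetical):

        #include <stddef.h>

        enum heap_class { HEAP_SMALL, HEAP_MEDIUM, HEAP_LARGE };

        static enum heap_class classify(size_t sz)
        {
            if (sz <= 1024)   return HEAP_SMALL;   /* 16-byte cells           */
            if (sz <= 65536)  return HEAP_MEDIUM;  /* carved from 256K blocks */
            return HEAP_LARGE;                     /* page-granular, mmap()   */
        }

    The small and medium paths would consult a thread-local free list; only
    the large path (and allocating new medium-heap chunks) would take the
    global lock.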

    As can be noted, objects in the small object heap tend to be padded to a
    multiple of 16 bytes, and normally have a 16-byte object header (the
    pointer to an allocated object points just after this header).

    Note that objects in the large heap may instead store this metadata externally.


    Granted, yeah, mutex locking is fairly expensive with a weak memory
    model, and shared memory is generally undesirable as there is little in
    the way of memory coherence (absent explicit flushes, accessing memory belonging to a different thread may result in stale data).


    A similar trick was used in the past for my BGBScript VMs, mostly
    because mutex locking is slow; and dynamic languages like this tend to
    involve a lot of rapid-fire small granularity allocations and frees
    (every object and array goes on the heap).


    In BGBCC, it is merely that large objects and arrays go on the heap
    (along with VLAs and similar). If one creates a lambda, this also goes
    on the heap.

    If one wants to support proper lexical closures (N/A for both C++ and
    Java style lambdas, *), the local variables may also end up on the heap.
    And, if one wanted to support Scheme style call/cc (call-with-current-continuation), the entire ABI frame needs to go on
    the heap. However, a decision was made early on not to bother with
    call/cc support in the BGBCC ABI (an ABI capable of supporting call/cc
    would impose a severe performance penalty).

    There was provision made for supporting exit-only continuations, which
    can effectively leverage the same mechanism as try/catch and throw (the continuation is effectively a self-throwing exception which will be
    caught at a predefined location).


    *: By default, lambdas in BGBCC do not use lexical capture, and instead
    either capture by-value or capture-by-reference (like in C++ style
    lambdas, or GCC inner functions, using C++ style syntax) but differ in
    that the lambdas are callable as normal function pointers (and
    heap-allocated, rather than RAII value-types, though will be auto-freed
    when the originating function returns in the capture-by-reference case).

    In my BGBScript2 language, capture-by-value had been the default, with
    lambdas having an indeterminate lifespan (they may continue to exist
    outside of the scope where the calling function had terminated; unlike
    GCC inner functions).

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Wed Apr 10 16:56:51 2024
    From Newsgroup: comp.arch

    On 4/10/2024 12:29 PM, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    FWIW, in my case:
    32K I$ + 32K D$ does fairly well IME;

    16K I$ + 32K D$ works well for Doom, but has a notably higher I$ miss
    rate for Quake and similar (and most other non-Doom programs); Doom
    itself seemingly being pretty much D$ bound.


    Constants are generally encoded inline in BJX2.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 10 23:30:02 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD instruction and 1 or 2 words in DCache, while consuming a GPR. So, overall, it takes
    fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have no
    direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Wed Apr 10 22:18:25 2024
    From Newsgroup: comp.arch

    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    I still feel that this atomicity should somehow be included with
    ESM just because they feel related, but the benefit seems likely
    to be extremely small. How often would software want to copy
    multiple regions atomically or combine region copying with
    ordinary ESM atomicity?? There *might* be some use for an atomic
    region copy and an updating of a separate data structure (moving a
    structure and updating one or a very few pointers??). For
    structures three cache lines in size where only one region
    occupies four cache lines, ordinary ESM could be used.

    My feeling based on "relatedness" is not a strong basis for such
    an architectural design choice.

    (Simple page masking would allow false conflicts when smaller
    memory moves are used. If there is a separate pair of range
    registers that is checked for coherence of memory moves, this
    issue would only apply for multiple memory moves _and_ all eight
    of the buffer entries could be used for smaller accesses.)

    [snip]
    As noted, on a 32 GPR machine, most leaf functions can fit
    entirely in scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
    getting totally screwed.

    I wonder how many instructions would have to have access to such a
    set of "special registers" and if a larger number of extra
    registers would be useful. (One of the issues — in my opinion —
    with PowerPC's link register and count register was that they
    could not be directly loaded from or stored to memory [or loaded
    with a constant from the instruction stream]. For counted loops,
    loading the count register from the instruction stream would
    presumably have allowed early branch determination even for deep
    pipelines and small loop counts.) SP, FP, GOT, and TLS hold
    "stable values", which might facilitate some microarchitectural
    optimizations compared to more frequently modified register names.

    (I am intrigued by the possibility of small contexts for some
    multithreaded workloads, similar to how some GPUs allow variable
    context sizes.)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 10 22:14:33 2024
    From Newsgroup: comp.arch

    On 4/10/2024 4:19 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of
    a basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache
    pollution.


    Yeah.

    This was why some of the first things I did when I started extending
    SH-4 were:
    Adding mechanisms to build constants inline;
    Adding Load/Store ops with a displacement (albeit with encodings
    borrowed from SH-2A);
    Adding 3R and 3RI encodings (originally Imm8 for 3RI).

    My suggestion is that:: "Now that you have screwed around for a while,
    Why not take that experience and do a new ISA without any of those
    mistakes in it" ??


    There was a reboot, it became BJX2.
    This, of course, has developed some of its own hair...


    Where, BJX1 was a modified SuperH, and BJX2 was a redesigned ISA design
    that was "mostly backwards compatible" at the ASM level.

    Granted, possibly I could have gone further, such as no longer having
    the stack pointer in R15, but alas...


    Though, in some areas, SH had features that I had dropped as well, such
    as auto-increment addressing and delay slots.



    Did have a mess when I later extended the ISA to 32 GPRs, as (like
    with BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.


    Usually they were spilled between basic-blocks, with the basic-block
    needing to branch to the following basic-block in these cases.

    Also 8-bit branch displacements are kinda lame, ...

    Why do that to yourself ??


    I didn't design SuperH, Hitachi did...

    But you did not fix them en masse, and you complain about them
    at least once a week. There comes a time when it takes less time
    and less courage to do that big switch and clean up all that mess.


    For the most part, BJX2 is using 20-bit branches for 32-bit ops.

    Exceptions being the Compare-and-Branch, and Compare-Zero-and-Branch
    ops, but this is mostly because there wasn't enough encoding space to
    give them larger displacements.

    BREQ.Q Rn, Disp11s
    BREQ.Q Rm, Rn, Disp8s

    There are Disp32s variants available, just that these involve using a
    Jumbo prefix.




    But, with BJX1, I had added Disp16 branches.

    With BJX2, they were replaced with 20 bit branches. These have the
    merit of being able to branch anywhere within a Doom or Quake sized
    binary.


    And, if one wanted a 16-bit branch:
       MOV.W (PC, 4), R0  //load a 16-bit branch displacement
       BRA/F R0
       .L0:
       NOP    // delay slot
       .WORD $(Label - .L0)

    Also kinda bad...

    Can you say Yech !!


    Yeah.
    This sort of stuff created strong incentive for ISA redesign...

    Maybe consider now as the appropriate time to start.


    The above was for SuperH; this sort of thing is N/A for BJX2.

    In this case, BJX2 can pull it off in a single instruction.


    Nonetheless, even with all this crap, the SuperH was still seen as
    sufficient for the Sega 32X/Saturn/Dreamcast (and the Naomi and Hikaru
    arcade machine boards, ...).

    Though, it seems Sega later jumped ship from SuperH to using low-end x86
    PC motherboards in later arcade machines.


    Granted, had I instead started with RISC-V rather than SuperH, it is
    probable BJX2 wouldn't exist.


    Though, at the time, the original thinking was that SuperH having
    smaller instructions meant it would have better code density than
    RV32I or similar. Turns out not really, as the penalty of the 16 bit
    ops was needing almost twice as many on average.

    My 66000 only requires 70% of the instruction count of RISC-V,
    Yours could too ................



    At this point, I suspect the main issue for me not (entirely) beating
    RV64G, is mostly compiler issues...


    So, the ".text" section is still around 10% bigger, with some amount of
    this being spent on Jumbo prefixes, and the rest due to cases where code generation falls short.


    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single
    instruction.


    I have no high-level memory move/copy/set instructions.
    Only loads/stores...

    You have the power to fix it.........


    But, at what cost...

    You would not have to spend hours a week defending the indefensible !!

    I had generally avoided anything that would have required microcode or
    shoving state-machines into the pipeline or similar.

    Things as simple as IDIV and FDIV require sequencers.
    But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!


    Not so much in my case.

    IDIV and FDIV:
    Feed inputs into Shift-Add unit;
    Stall pipeline for a predefined number of clock cycles;
    Grab result out of the other end (at which point, pipeline resumes).

    In this case, the FDIV was based on noting that if one lets the
    Shift-Add unit run for longer, it moves from doing an integer divide to
    doing a fractional divide, so I could make it perform an FDIV merely by feeding the mantissas into it (as two big integers) and doubling the
    latency. Then glue on some extra logic to figure out the exponents and pack/unpack Binary64, and, done.
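
    A hedged sketch of the underlying trick (restoring shift-subtract
    division; this is my illustration, not the actual hardware): running the
    same loop frac_bits longer turns an integer divider into a fixed-point
    one, which is all a mantissa divide needs:

        #include <stdint.h>

        /* Assumes num < 2*den and den < 2^63 (true for normalized
           mantissas), so the quotient fits: the result carries 1 integer
           bit plus frac_bits fraction bits. */
        static uint64_t div_extended(uint64_t num, uint64_t den, int frac_bits)
        {
            uint64_t rem = num, q = 0;
            for (int i = 0; i <= frac_bits; i++) {
                q <<= 1;
                if (rem >= den) { rem -= den; q |= 1; }
                rem <<= 1;   /* keep going: further bits are fractional */
            }
            return q;
        }

    E.g. div_extended(3, 2, 2) yields 0b110, i.e. 1.5 with 2 fraction bits.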


    Not really the same thing at all...


    Apart from it tending to get stomped every time one does an integer
    divide, it could possibly also be used as an RNG, as it basically churns
    over whatever random bits flow into it from the pipeline.


    Things like Load/Store-Multiple or

    If you like polluted ICaches..............

    For small copies, can encode them inline, but past a certain size
    this becomes too bulky.

    A copy loop makes more sense for bigger copies, but has a high
    overhead for small to medium copy.


    So, there is a size range where doing it inline would be too bulky,
    but a loop carries an undesirable level of overhead.

    All the more reason to put it (a highly useful unit of work) into an
    instruction.


    This is an area where "slides" work well; the main cost is mostly the
    bulk that the slide adds to the binary (albeit, it is one-off).

    Consider that the predictor getting into the slide the first time
    always mispredicts !!


    Possibly.

    But, note that the paths headed into the slide are things like structure assignment and "memcpy()" where the size is constant. So, in these
    cases, the compiler already knows where it is branching.

    So, say:
        memcpy(dst, src, 512);
    Gets compiled as, effectively:
        MOV   dst, R4
        MOV   src, R5
        BSR   __memcpy64_512_ua


    Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...

    What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably, yet a HW sequencer only has to avoid asserting a single byte write enable once.


    Two strategies:
    Compiler pads it to 64 bytes (typical for struct copy, where structs can always be padded up to their natural alignment);
    It emits the code for copying the last N bytes (modulo 32) and then
    branches into the slide (typical for memcpy).
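
    Roughly, for the memcpy case (entry naming per the earlier listing; the
    stand-in body below is my sketch):

        #include <string.h>
        #include <stddef.h>

        /* Stand-in for branching to __memcpy64_<body>_ua, which copies
           body bytes in 32-byte units from the high address downward. */
        static void __memcpy64_slide(void *d, const void *s, size_t body)
        {
            memcpy(d, s, body);
        }

        static void memcpy_via_slide(void *dst, const void *src, size_t n)
        {
            size_t tail = n & 31, body = n - tail;
            /* copy the (n % 32) tail first, then enter the slide */
            memcpy((char *)dst + body, (const char *)src + body, tail);
            __memcpy64_slide(dst, src, body);
        }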


    For variable memcpy, there is an extension:
    _memcpyf(void *dst, void *src, size_t len);

    Which is basically the "I don't care if it copies a little extra"
    version (say, where it may pad the copy up to a multiple of 16 bytes).
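
    A sketch of what that contract permits (illustrative only; the actual
    routine is surely tuned differently):

        #include <string.h>
        #include <stddef.h>

        /* May copy up to 15 bytes beyond len; the caller guarantees both
           buffers have that much slack past the end. */
        void *_memcpyf_sketch(void *dst, const void *src, size_t len)
        {
            size_t n = (len + 15) & ~(size_t)15;   /* round up to 16 */
            for (size_t i = 0; i < n; i += 16)     /* 16B chunk stands in
                                                      for a 128-bit MOV.X */
                memcpy((char *)dst + i, (const char *)src + i, 16);
            return dst;
        }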


    For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
    iteration or so to try to limit looping overhead.

    On low end machines, you want to operate at cache port width,
    On high end machines, you want to operate at cache line widths per port.
    This is essentially impossible using slides.....here, the same code is
    not optimal across a line of implementations.


    Possible.

    As is, it uses 64-bit load/store for unaligned copy, and 128-bit for
    aligned copy (support for unaligned "MOV.X" is still an optional feature).

    It mostly doesn't bother trying to sort this out for the slide, as for
    the size ranges dealt with by the slide, trying to separate aligned from unaligned at runtime will end up costing about as much as it saves.


    Though, for larger copies, it makes more sense to figure it out.


    Though, leveraging the memcpy slide for the interior part of the copy
    could be possible in theory as well.

    What do you do when the SATA drive wants to write a whole page ??


    ?...

    Presumably there aren't going to be that many pages being paged out mid-memcpy.


    For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
    shorter (a big part of LZ decoder performance mostly being in
    fine-tuning the logic for the match copies).

    Though, this is part of why my runtime library had added
    "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
    which can consolidate this rather than needing to do it one-off for
    each LZ decoder (as I see it, it is a similar issue to not wanting
    code to endlessly re-roll stuff for functions like memcpy or
    malloc/free, *).


    *: Though, nevermind that the standard C interface for malloc is
    annoyingly minimal, and ends up requiring most non-trivial programs to
    roll their own memory management.


    Ended up doing these with "slides", which end up eating roughly
    several kB of code space, but was more compact than using larger
    inline copies.


    Say (IIRC):
       128 bytes or less: Inline Ld/St sequence
       129 bytes to 512B: Slide
       Over 512B: Call "memcpy()" or similar.

    Versus::
         1-infinity: use MM instruction.


    Yeah, but it makes the CPU logic more expensive.

    By what, 37-gates ??


    I will assume it is probably a bit more than this given there is not
    currently any sort of mechanism that does anything similar.

    Would need to add some sort of "inject synthesized instructions into the
    pipeline" mechanism; my guess is this would probably be at least a few
    kLUT. Well, unless it is put in ROM, but this would have no real
    advantage over "just do it in software".


    FWIW:
    I had originally intended to put a page-table walker in ROM and then
    pretend like it had a hardware page-walker, but we all know how this
    turned out.

    Though, part of this was because it was competing against arguably more
    useful uses of ROM space, like the FAT driver, PE/COFF and ELF loaders,
    and the boot-time sanity checks (eg: verify early on that I hadn't
    broken fundamental parts of the CPU).



    The slide generally has entry points in multiples of 32 bytes, and
    operates in reverse order. So, if not a multiple of 32 bytes, the
    last bytes need to be handled externally prior to branching into the
    slide.

    Does this remain sequentially consistent ??


    Within a thread, it is fine.

    What if a SATA drive is reading while you are writing !!
    That is, DMA is no different than multi-threaded applications--except
    DMA cannot perform locks.


    Currently there is no DMA, only polling IO.
    Also no SATA interface, nor PCIE, nor ...

    IO to an SDcard is basically probing the MMIO interface and spinning in
    a loop until it is done. Most elaborate part of this interface is that
    there was a mechanism added to allow sending/receiving 8 bytes at a time
    over SPI.


    Main wonk is that it does start copying from the high address first.
    Presumably interrupts or similar won't be messing with application
    memory mid-memcpy.

    The only things wanting high-low access patterns are dumping stuff to
    the stack. The fact you CAN get away with it most of the time is no excuse.


    AFAIK, there is no particular requirement for which direction "memcpy()"
    goes.

    And, high to low was more effective for the copy slide.



    The looping memcpy's generally work from low to high addresses though.

    As does all string processing.

    Granted.

    The string handling functions are their own piles of fun...



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Wed Apr 10 22:21:33 2024
    From Newsgroup: comp.arch

    On 4/10/2024 9:18 PM, Paul A. Clayton wrote:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in cases
    when not directly transformed into register load/store sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a single
    instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.


    As noted, in my case, the whole thing of Ld/St sequences, and memcpy
    slides, mostly applies to constant cases.

    If the copy size is variable, the compiler merely calls "memcpy()",
    which will then generally figure out which loop to use, and one has to
    pay the penalty of the runtime overhead of memcpy needing to figure out
    what it needs to do.
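
    Schematically (thresholds from earlier in the thread; in the real
    compiler the test happens at compile time on the constant size, and the
    helpers here are stand-ins, not actual BGBCC names):

        #include <string.h>
        #include <stddef.h>

        /* Stand-ins for the inline Ld/St sequence and the slide entry: */
        static void __memcpy_inline(void *d, const void *s, size_t n)
            { memcpy(d, s, n); }
        static void __memcpy_slide(void *d, const void *s, size_t n)
            { memcpy(d, s, n); }

        static void memcpy_const(void *d, const void *s, size_t n)
        {
            if      (n <= 128) __memcpy_inline(d, s, n);  /* Ld/St pairs  */
            else if (n <= 512) __memcpy_slide(d, s, n);   /* slide entry  */
            else               memcpy(d, s, n);           /* library call */
        }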


    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into a
    slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    I still feel that this atomicity should somehow be included with
    ESM just because they feel related, but the benefit seems likely
    to be extremely small. How often would software want to copy
    multiple regions atomically or combine region copying with
    ordinary ESM atomicity?? There *might* be some use for an atomic
    region copy and an updating of a separate data structure (moving a
    structure and updating one or a very few pointers??). For
    structures three cache lines in size where only one region
    occupies four cache lines, ordinary ESM could be used.

    My feeling based on "relatedness" is not a strong basis for such
    an architectural design choice.

    (Simple page masking would allow false conflicts when smaller
    memory moves are used. If there is a separate pair of range
    registers that is checked for coherence of memory moves, this
    issue would only apply for multiple memory moves _and_ all eight
    of the buffer entries could be used for smaller accesses.)


    All seems a bit complicated to me.

    But, as noted, I went for a model of weak memory coherence and leaving
    most of this stuff for software to sort out.


    [snip]
    As noted, on a 32 GPR machine, most leaf functions can fit entirely
    in scratch registers.

    Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
    getting totally screwed.

    I wonder how many instructions would have to have access to such a
    set of "special registers" and if a larger number of extra
    registers would be useful. (One of the issues — in my opinion —
    with PowerPC's link register and count register was that they
    could not be directly loaded from or stored to memory [or loaded
    with a constant from the instruction stream]. For counted loops,
    loading the count register from the instruction stream would
    presumably have allowed early branch determination even for deep
    pipelines and small loop counts.) SP, FP, GOT, and TLS hold
    "stable values", which might facilitate some microarchitectural
    optimizations compared to more frequently modified register names.

    (I am intrigued by the possibility of small contexts for some
    multithreaded workloads, similar to how some GPUs allow variable context sizes.)

    In my case, yeah, there are two semi-separate register spaces here:
      GPRs: R0..R63
        R0, R1, and R15 are special:
          R0/DLR: Hard-coded register for some instructions;
            Assembler may stomp without warning for pseudo-instructions.
          R1/DHR:
            Was originally intended similar to DLR;
            Now mostly used as an auxiliary link register.
          R15/SP:
            Stack Pointer.
      CRs: C0..C63
        Various special purpose registers;
        Most are privileged only.
        LR, GBR, etc, are in CR space.


    Though, internally, GPRs and CRs both exist within a combined register
    space in the CPU:
      00..3F: Mostly GPR space
      40..7F: CR and SPR space.

    Generally, CRs may only be accessed by certain register ports though.


    By default, the only way to save/restore CRs is by shuffling them
    through GPRs. There is an optional MOV.C instruction for this, but
    generally it is not enabled as it isn't clear that it saves enough to be
    worth the added LUT cost.

    There is a subset version, where MOV.C exists, but is only really able
    to be used with LR and GBR and similar. Generally, this version exists
    as RISC-V Mode needs to be able to save/restore these registers (they
    exist in the GPR space in RISC-V).


    As I can note, if I did a new ISA, most likely the register assignment
    scheme would differ, say:
      R0: ZR / PC
      R1: LR / TP (TBR)
      R2: SP
      R3: GP (GBR)
    Where the interpretation of R0 and R1 would depend on context (ZR and LR
    for most instructions, PC and TP when used as a Ld/St base address).


    Though, some ideas had involved otherwise keeping a similar register
    space layout to my existing ABI, mostly because significant ABI changes
    would not be easy for my compiler as-is.



    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Apr 11 12:22:47 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:

    On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:


    Also the blob of constants needed to be within 512 bytes of the load
    instruction, which was also kind of an evil mess for branch handling
    (and extra bad if one needed to spill the constants in the middle of a
    basic block and then branch over it).

    In My 66000 case, the constant is the word following the instruction.
    Easy to find, easy to access, no register pollution, no DCache pollution.

    It does occupy some icache space, however; have you boosted the icache
    size to compensate?

    Except it pretty rarely does so (increases icache pressure):

    mov temp_reg, offset const_table
    mov reg,qword ptr [temp_reg+const_offset]

    looks to me like at least 5 bytes for the first instruction and probably
    6 for the second, for a total of 11 (could be as low as 8 for a very
    small offset), all on top of the 8 bytes of dcache needed to hold the
    64-bit value loaded.

    In My 66000 this should be a single 32-bit instruction followed by the
    8-byte const, so 12 bytes total and no lookaside dcache interference.

    It is only when you do a lot of 64-bit data loads, all gathered in a
    single 256-byte buffer holding up to 32 such values, and you can afford
    to allocate a fixed register pointing to the middle of that range, that
    you actually gain some total space: Each load can now just do a

    mov reg,qword ptr [fixed_base_reg+byte_offset]

    which, due to the need for a 64-bit prefix, will probably need 4
    instruction bytes on top of the 8 bytes from dcache. At this point we
    are touching exactly the same number of bytes (12) as My 66000, but from
    two different caches, so much more likely to suffer dcache misses.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Apr 11 14:13:24 2024
    From Newsgroup: comp.arch

    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the
    instruction. Easy to find, easy to access, no register pollution,
    no DCache pollution.

    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!
    Win-win under constraints of Load-Store Arch. Otherwise, it depends.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 11 14:30:27 2024
    From Newsgroup: comp.arch

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.



    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in an
    SMP processor...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 11 13:35:41 2024
    From Newsgroup: comp.arch

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    BGB wrote:


    In My 66000 case, the constant is the word following the
    instruction. Easy to find, easy to access, no register pollution,
    no DCache pollution.

    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.


    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

        MOV   Imm16, Rn
        SHORI Imm16, Rn
        SHORI Imm16, Rn
        SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
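
    Assuming SHORI shifts the register left 16 and ORs in the immediate (as
    on SH-5), building 0x123456789ABCDEF0 would go:

        MOV   0x1234, Rn    // Rn = 0x0000_0000_0000_1234
        SHORI 0x5678, Rn    // Rn = (Rn<<16)|0x5678 = 0x0000_0000_1234_5678
        SHORI 0x9ABC, Rn    // Rn = 0x0000_1234_5678_9ABC
        SHORI 0xDEF0, Rn    // Rn = 0x1234_5678_9ABC_DEF0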

    An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
    1-cycle, is preferable....




    In misc news:

    Some compiler fiddling has now dropped the ".text" overhead (vs RV64G)
    from 10% to 5%.

    This was mostly in the form of adding dependency tracking logic to ASM
    code (albeit in a form where it needs to use ".global" and ".extern" statements for things to work correctly), and no longer giving it a free
    pass.

    This in turn allowed it to effectively cull some parts of the dynamic typesystem runtime and a bunch of the Binary128 support code (shaving
    roughly 14K off of the Doom build).

    Does have a non-zero code impact (mostly in the form of requiring adding ".global" and ".extern" lines to the ASM code in some cases where they
    were absent).


    Looks like a fair chunk of the dynamic types runtime is still present
    though, which appears to be culled in the GCC build (since GCC doesn't
    use the dynamic typesystem at all). Theoretically, Doom should not need
    it, as Doom is entirely "plain old C".

    Main part that ended up culled with this change was seemingly most of
    the code for ex-nihilo objects and similar (which does not seem to be reachable from any of the Doom code).

    There is a printf extension for printing variant types, but this is
    still present in the RV64G build (this would mostly include code needed
    for the "toString" operation). I guess, one could debate whether printf actually needs support for variant types (as can be noted, most normal C
    code will not use it).

    Though, I guess one option could be to modify it to call toString via a function pointer which is only set if other parts of the dynamic
    typesystem are initialized (could potentially save several kB off the
    size of the binary it looks like). Might break stuff though if one tries
    to printf a variant but had not used any types much beyond fixnum and
    flonum, which would not have triggered the typesystem to initialize itself.
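
    A minimal sketch of that lazy hook (all names hypothetical, not BGBCC's
    actual internals):

        #include <stdio.h>
        #include <stdint.h>
        #include <stddef.h>

        typedef uintptr_t variant_t;   /* stand-in for the tagged value */
        typedef int (*variant_tostr_fn)(char *, size_t, variant_t);

        /* Stays NULL unless the dynamic typesystem initializes itself. */
        static variant_tostr_fn __variant_tostr;

        void __register_variant_tostr(variant_tostr_fn fn)
            { __variant_tostr = fn; }

        /* Inside printf's variant handler: */
        static void print_variant(variant_t v)
        {
            char buf[64];
            if (__variant_tostr)
                __variant_tostr(buf, sizeof(buf), v);
            else
                snprintf(buf, sizeof(buf), "#%p", (void *)v); /* fallback */
            fputs(buf, stdout);
        }

    With this, binaries that never touch the typesystem never pull in
    toString, which is where the estimated several-kB savings would come from.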

    Probably doesn't matter too much, as this code is not likely a factor in
    the delta between the ISAs.


    Note that if the size of Doom's ".text" section dropped by another 15K,
    it would reach parity with the RV64G build (which was around 290K in the relevant build ATM; goal being to keep the code fairly close to parity
    in this case, with the differences mostly allowed for ISA specific stuff).

    Though, this is ignoring that roughly 11K of this delta are Jumbo
    prefixes (so the delta in instruction count is now roughly 1.3% at the moment); and RV64G has an additional 24K in its ".rodata" section
    (beyond what could be accounted for in string literals and similar).


    So, in terms of text+rodata (+strtab *), my stuff is smaller at the moment.

    *: Where GCC rolls its string literals into '.rodata', vs BGBCC having a dedicated section for string literals.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 11 18:46:54 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

        MOV   Imm16, Rn
        SHORI Imm16, Rn
        SHORI Imm16, Rn
        SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.

    As compared to::

    CALK Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}

    An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and 1-cycle, is preferable....

    A consuming instruction where you don't even use a register is better
    still !!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Thu Apr 11 15:42:59 2024
    From Newsgroup: comp.arch

    On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the DCache
    so the overall hit rate goes up !! At typical sizes, ICache miss rate
    is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you have
    no direct route to either 64-bit constants or 64-bit address spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

       MOV Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock
    cycles.

    As compared to::

        CALK   Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}


    The main reason one might want SHORI is that it can fit into a
    fixed-length 32-bit encoding. It could also technically be retrofitted
    onto RISC-V without any significant change, unlike some other options
    (as noted, I don't argue for adding Jumbo prefixes to RV, on the basis
    that there is no real viable way to add them to RV, *).

    Sadly, the closest-to-viable option for RV would be to add the SHORI instruction and optionally pattern-match it in fetch/decode.

    Or, say:
    LUI Xn, Imm20
    ADD Xn, Xn, Imm12
    SHORI Xn, Imm16
    SHORI Xn, Imm16

    Then, combine LUI+ADD into a 32-bit load in the decoder (though probably
    only if the Imm12 is positive), and 2x SHORI into a combined "Xn=(Xn<<32)|Imm32" operation.

    This could potentially get it down to 2 clock cycles.
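
    As a rough C model of what the fused sequence computes (the "Imm12 is
    positive" caveat exists because the ADD immediate is sign-extended, so
    a negative value would borrow from the LUI part):

        #include <stdint.h>

        int64_t rv_const64(int32_t imm20, int32_t imm12,
                           uint16_t hi16, uint16_t lo16)
        {
            /* LUI+ADD pair, fused: materializes the upper 32 bits */
            int64_t x = (int64_t)(int32_t)(((uint32_t)imm20 << 12) + (uint32_t)imm12);
            /* 2x SHORI, fused in decode to Xn = (Xn << 32) | Imm32 */
            x = (x << 16) | hi16;
            x = (x << 16) | lo16;
            return x;
        }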



    *: To add a jumbo prefix, one needs an encoding that:
    Uses up a really big chunk of encoding space;
    Is otherwise illegal and unused.
    RISC-V doesn't have anything here.


    Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
    space that aren't yet used for anything, but aren't usable as normal
    encoding space mostly because if I put instructions in there (with the existing encoding schemes), I couldn't use all the registers (and they
    would not have predication or similar either). Annoyingly, the only
    types of encodings that would fit in there at present are 2RI Imm16 ops
    or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
    encodings for R0..R31 anyways, interpreting the LSB of the register
    field as encoding R32..R63).

    Though, 14x of these spaces would likely be alternate forms of Jumbo
    prefix (with another 14 in unconditional-scalar-op land). No immediate
    need to re-add an equivalent of the 40x2 encoding (from Baseline mode),
    as most of what 40x2 addressed can be encoded natively in XG2 Mode.


    Technically, I also have 2 unused bits in the Imm16 ops as well in XG2
    Mode. I "could" in theory, if I wanted, use them to extend the:
    MOV Imm17s, Rn
    case to:
    MOV Imm19s, Rn
    Though, the other option is to leave them reserved if I later want more
    Imm16 ops.

    For now, current plan is to leave this stuff as reserved.


    An encoding that can MOV a 64-bit constant in 96 bits (12 bytes) and
    1 cycle is preferable....

    A consuming instruction where you don't even use a register is better
    still !!


    Can be done, but thus far only with 33-bit immediate values. Luckily,
    Imm33s seems to address around 99% of uses (for normal ALU ops and similar).

    Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
    or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By itself, though, the difference doesn't seem enough to justify the cost.
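
    For reference, a hedged sketch of expanding one S.E5.F8 lane to Binary32
    (assuming a half-float-style exponent bias of 15, and ignoring the
    zero/Inf/NaN cases for brevity):

        #include <stdint.h>

        float fp14_to_f32(uint16_t lane)    /* S.E5.F8 in the low 14 bits */
        {
            uint32_t s = (uint32_t)((lane >> 13) & 1) << 31;
            uint32_t e = (uint32_t)((((lane >> 8) & 0x1F) - 15 + 127) << 23);
            uint32_t f = (uint32_t)(lane & 0xFF) << 15; /* left-justify fraction */
            union { uint32_t u; float v; } r = { s | e | f };
            return r.v;
        }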

    Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
    12 bytes (and allowing a 16-byte encoding would have too steep of a cost increase to be worthwhile).

    So, alas...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Thu Apr 11 16:25:54 2024
    From Newsgroup: comp.arch

    On 4/11/2024 9:30 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.
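
    Concretely, the case an immediate count would capture (illustrative C;
    the MM operand spelling below is hypothetical):

        typedef struct { double x, y, z; } vec3;

        void copy_vec3(vec3 *dst, const vec3 *src)
        {
            *dst = *src;   /* could lower to:  MM  dst,src,#24
                              with the 24-byte count as a 16-bit immediate */
        }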

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.


    Probably.
    One could argue that, likely, setting up a DMA'ed memmove would be
    expensive enough to make it impractical for small copies (in the
    category where I am using inline Ld/St sequences or slides).

    And, larger copies (where it is most likely to bring benefit) at present mostly seem to be bus/memory bound.



    Sort of reminds me of the thing with the external rasterizer module:
    The module itself draws stuff quickly, but setting it up is still
    expensive enough to limit its benefit. So the main benefit it
    could bring is seemingly just using it to pull off multi-textured
    lightmap rendering, which in this case can run at similar speeds to
    vertex lighting (lightmapped rendering being a somewhat slower option
    for the software rasterizer).

    Well, along with me recently realizing a trick to mimic the look of
    trilinear filtering without increasing the number of texture fetches
    (mostly by distorting the interpolation coords, *). This trick could potentially be added to the rasterizer module.

    *: Traditional bilinear needs 4 texel fetches and 3 lerps (or, a poor
    man's approximation with 3 fetches and 2 lerps). Traditional trilinear
    needs 8 fetches and 7 lerps. The "cheap trick" version only needs the
    same as bilinear.
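
    For concreteness, a minimal C sketch of the traditional bilinear case
    (4 fetches, 3 lerps; single-channel texture, no wrapping or clamping):

        static inline float lerpf(float a, float b, float t)
        { return a + (b - a) * t; }

        float bilinear(const float *tex, int w, float u, float v)
        {
            int   x0 = (int)u,  y0 = (int)v;
            float fx = u - x0,  fy = v - y0;
            float t00 = tex[y0*w + x0],     t10 = tex[y0*w + x0 + 1];     /* fetches 1,2 */
            float t01 = tex[(y0+1)*w + x0], t11 = tex[(y0+1)*w + x0 + 1]; /* fetches 3,4 */
            return lerpf(lerpf(t00, t10, fx),   /* lerp 1 */
                         lerpf(t01, t11, fx),   /* lerp 2 */
                         fy);                   /* lerp 3 */
        }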

    One thing that is still needed is a good, fast, and semi-accurate way to
    pull off the Z=1.0/Z' calculation, as needed for perspective-correct rasterization (affine requires subdivision, which adds cost to the
    front-end, and interpolating Z directly adds significant distortion for geometry near the near plane).
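
    Roughly where that reciprocal shows up: u/z and 1/z interpolate linearly
    in screen space, so every pixel pays for one divide (lerpf as in the
    sketch above):

        float persp_u(float u0, float z0, float u1, float z1, float t)
        {
            float u_oz = lerpf(u0 / z0, u1 / z1, t);     /* u/z: linear on screen */
            float o_oz = lerpf(1.0f / z0, 1.0f / z1, t); /* 1/z: linear on screen */
            return u_oz / o_oz;    /* the per-pixel Z = 1.0/Z' step */
        }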

    Granted, this would almost seem to create a need for an OpenGL
    implementation designed around the assumption of a hardware rasterizer
    module rather than software span drawing.


    The rasterizer module also has its own caching, where one sometimes
    needs to signal it to perform a cache flush (such as when updating the contents of a texture, or needing to access the framebuffer for some
    other reason, ...).

    Potentially, the module could be used to copy/transform images in a framebuffer (such as for GUI rendering), but would need to be somewhat generalized for this (such as supporting using non-power-of-2
    raster-images as textures).

    Though, another possibility could be adding a dedicated DMA module, or DMA+Image module, or gluing dedicated DMA and Raster-Copy functionality
    onto the rasterizer module (as a separate thing from its normal "walk
    edges and blend pixels" functionality).




    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in an
    SMP processor...


    Dunno there.

    My stuff doesn't guarantee atomicity in general.

    The only way to ensure that both parties agree on the contents of memory
    is for both to flush their L1 caches or similar.

    Or use "No Cache" memory accesses, which is basically implemented as the
    L1 cache auto-flushing the line as soon as the request finishes; for
    good effect one also needs to add a few NOPs after the memory access to
    be sure the L1 has a chance to auto-flush it. Though, another
    possibility could be to add dedicated non-caching memory access
    instructions.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 11 23:06:05 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
    BGB wrote:


    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in the
    Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
    and needs less encoding space than the LUI route.

       MOV Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock
    cycles.

    As compared to::

        CALK   Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at least
    5 cycles to use the loaded/built constant.}}


    The main reason one might want SHORI is that it can fit into a
    fixed-length 32-bit encoding.

    While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
    just simplest. Then, once you start widening the microarchitecture, it
    is better to fetch wider than decode-issue so that you suffer least
    from boundary conditions. Once you start fetching wide OR have wide
    decode-issue, you have ALL the infrastructure to do variable-length
    instructions. Thus, the complaint that VLE is hard has already been
    eradicated.

    It could also technically be retrofitted onto RISC-V without any
    significant change, unlike some other options (as noted, I don't argue
    for adding Jumbo prefixes to RV, on the basis that there is no real
    viable way to add them to RV, *).

    The issue is that once you do VLE, RISC-V's ISA is no longer helping you
    get the job done, especially when you have to execute 40% more
    instructions.

    Sadly, the closest-to-viable option for RV would be to add the SHORI instruction and optionally pattern-match it in fetch/decode.

    Or, say:
    LUI Xn, Imm20
    ADD Xn, Xn, Imm12
    SHORI Xn, Imm16
    SHORI Xn, Imm16

    Then, combine LUI+ADD into a 32-bit load in the decoder (though probably only if the Imm12 is positive), and 2x SHORI into a combined "Xn=(Xn<<32)|Imm32" operation.

    This could potentially get it down to 2 clock cycles.

    Universal constants gets this down to 0 cycles......

    *: To add a jumbo prefix, one needs an encoding that:
    Uses up a really big chunk of encoding space;
    Is otherwise illegal and unused.
    RISC-V doesn't have anything here.

    Which is WHY you should not jump ship from SH to RV, but jump to an
    ISA without these problems.

    Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
    space that aren't yet used for anything, but aren't usable as normal encoding space mostly because if I put instructions in there (with the existing encoding schemes), I couldn't use all the registers (and they
    would not have predication or similar either). Annoyingly, the only
    types of encodings that would fit in there at present are 2RI Imm16 ops
    or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
    encodings for R0..R31 anyways, interpreting the LSB of the register
    field as encoding R32..R63).

    Just another reason not to stay with what you have developed.

    In comparison, I reserve 6 major OpCodes so that a control transfer into
    data is highly likely to get Undefined OpCode exceptions rather than an
    attempt to execute what is in that data. Then, as it is, I still have 21
    slots in the major OpCode group free (27 if you count the permanently
    reserved).

    Much of this comes from side effects of Universal Constants.


    An encoding that can MOV a 64-bit constant in 96 bits (12 bytes) and
    1 cycle is preferable....

    A consuming instruction where you don't even use a register is better
    still !!


    Can be done, but thus far only with 33-bit immediate values. Luckily,
    Imm33s seems to address around 99% of uses (for normal ALU ops and similar).

    What do you do when accessing data that the linker knows is more than 4GB
    away from IP ?? or known to be outside of 0-4GB ?? externs, GOT, PLT, ...

    Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
    or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By itself, though, the difference doesn't seem enough to justify the cost.

    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.

    Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
    12 bytes (and allowing a 16-byte encoding would have too steep of a cost increase to be worthwhile).

    And yet I did.

    So, alas...

    Yes, alas..........
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 11 23:12:25 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:

    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:
    [snip]
    Things like memcpy/memmove/memset/etc, are function calls in
    cases when not directly transformed into register load/store
    sequences.

    My 66000 does not convert them into LD-ST sequences, MM is a
    single instruction.

    I wonder if it would be useful to have an immediate count form of
    memory move. Copying fixed-size structures would be able to use an
    immediate. Aside from not having to load an immediate for such
    cases, there might be microarchitectural benefits to using a
    constant. Since fixed-sized copies would likely be limited to
    smaller regions (with the possible exception of 8 MiB page copies)
    and the overhead of loading a constant for large sizes would be
    tiny, only providing a 16-bit immediate form might be reasonable.

    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches. Likewise memset.

    Effectively, that is what HW does: even on the lower-end machines,
    the AGEN unit of the cache-access pipeline is repeatedly cycled,
    and data is read and/or written. One can execute instructions not
    needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
    are in progress.

    Moving this sequencer farther out would still require it to consume
    all L1 BW in any event (snooping) for memory consistency reasons.
    {Note: cache accesses are performed line-wide, not register-width wide}


    Did end up with an intermediate "memcpy slide", which can handle
    medium size memcpy and memset style operations by branching into
    a slide.

    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in an
    SMP processor...

    The entire chunk of data traverses the interconnect as a single
    transaction. All interested 3rd parties (neither originator nor
    recipient) see either the memory state before the transfer or
    after the transfer.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Apr 12 02:19:04 2024
    From Newsgroup: comp.arch

    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the
    DCache so the overall hit rate goes up !! At typical sizes,
    ICache miss rate is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer
    instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you
    have no direct route to either 64-bit constants or 64-bit address
    spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.

    Maybe. But out of the 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think I had seen a LD-OP
    architecture that had a SUBR instruction. Maybe the TI TMS320C30?
    It was 30 years ago and my memory is not what it used to be.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Apr 11 23:22:16 2024
    From Newsgroup: comp.arch

    BGB-Alt wrote:

    On 4/11/2024 9:30 AM, Scott Lurndal wrote:
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:


    One thing that is still needed is a good, fast, and semi-accurate way to pull off the Z=1.0/Z' calculation, as needed for perspective-correct rasterization (affine requires subdivision, which adds cost to the front-end, and interpolating Z directly adds significant distortion for geometry near the near plane).

    I saw a 10-cycle-latency, 1-cycle-throughput divider at Samsung::
    10 stages of 3-bit at a time SRT divider with some exponent stuff
    on the side. 1.0/z is a lot simpler than that (float only). A lot
    of these great big complicated calculations can be beaten into
    submission with a clever attack of brute force HW.....FMUL and FMAC
    being the most often cited cases.
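
    In SW terms, the same brute-force idea reduces to roughly this (a sketch;
    crude exponent-flip seed, normal positive z only):

        #include <stdint.h>

        float recip_approx(float z)
        {
            union { float f; uint32_t u; } v = { z };
            v.u = 0x7F000000u - v.u;    /* seed: negate the exponent about the bias,
                                           within ~12% of 1/z for normal z */
            float r = v.f;
            r = r * (2.0f - z * r);     /* NR step 1: ~6 good bits  */
            r = r * (2.0f - z * r);     /* NR step 2: ~12 good bits;
                                           a 3rd step reaches full float precision */
            return r;
        }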
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Thu Apr 11 20:07:08 2024
    From Newsgroup: comp.arch

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
    BGB wrote:


    Win-win under constraints of Load-Store Arch. Otherwise, it depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.



    FWIW:
    The LDSH / SHORI mechanism does provide a way to get 64-bit
    constants, and needs less encoding space than the LUI route.

       MOV Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn
       SHORI Imm16, Rn

    Granted, if each is a 1-cycle instruction, this still takes 4 clock
    cycles.

    As compared to::

         CALK   Rd,Rs1,#imm64

    Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
    of the constant is free !! (0 cycles) !! {{The above example uses at
    least
    5 cycles to use the loaded/built constant.}}


    The main reason one might want SHORI is that it can fit into a
    fixed-length 32-bit encoding.

    While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
    just simplest. Then, once you start widening the microarchitecture, it
    is better to fetch wider than decode-issue so that you suffer least
    from boundary conditions. Once you start fetching wide OR have wide
    decode-issue, you have ALL the infrastructure to do variable-length
    instructions. Thus, the complaint that VLE is hard has already been
    eradicated.


    As noted, BJX2 is effectively VLE.
    Just now split into two sub-variants.

    So, as for lengths:
    Baseline: 16/32/64/96
    XG2: 32/64/96
    Original version was 16/32/48.


    But, the original 48-bit encoding was dropped, mostly to make the rest
    of the encoding more orthogonal, and these were replaced with Jumbo
    prefixes. An encoding space exists where 48-bit ops could in theory be
    re-added to Baseline, but I have not done so, as it does not seem to be
    justifiable in a cost/benefit sense (and they would still have some of
    the same drawbacks as the original 48-bit ops).

    Had also briefly experimented with 24-bit ops, but these were quickly
    dropped due to "general suckage" (though, an alternate 16/24/32/48
    encoding scheme could have theoretically given better code-density).


    However, RISC-V is either 32-bit, or 16/32.

    For now, I am not bothering with the 16-bit C extension, not so much for
    the sake of difficulty of dealing with VLE (the core can already deal with
    VLE), but more because the 'C' encodings are such a dog-chewed mess that
    I don't feel terribly inclined to bother with them.


    But, like, I can't really compare BJX2 Baseline with RV64G in terms of
    code density, because this wouldn't be a fair comparison. Would need to compare code-density between Baseline and RV64GC, which would imply
    needing to actually support the C extension.

    I could already claim a "win" here if I wanted, but as I see it, doing
    so would not be valid.


    Theoretically, encoding space exists for bigger ops in RISC-V, but no
    one has defined ops there yet as far as I know. Also, the way RISC-V represents larger ops is very different.

    However, comparing fixed-length against VLE when the VLE only has larger
    instructions is still acceptable as I see it (even if larger
    instructions can still allow a more compact encoding in some cases).


    Say, for example, as I see it, SuperH vs Thumb2 would still be a fair comparison, as would Thumb2 vs RV32GC, but Thumb2 vs RV32G would not.

    Unless one only cares about "absolute code density" irrespective of
    keeping parity in terms of feature-set.


    It could also technically be retrofitted
    onto RISC-V without any significant change, unlike some other options
    (as noted, I don't argue for adding Jumbo prefixes to RV, on the
    basis that there is no real viable way to add them to RV, *).

    The issue is that once you do VLE, RISC-V's ISA is no longer helping you
    get the job done, especially when you have to execute 40% more
    instructions.


    Yeah.

    As noted, I had already been beating RISC-V in terms of performance,
    only there was a shortfall in terms of ".text" size (for the XG2 variant).


    Initially this was around a 16% delta, now down to around 5%. Nearly all
    of the size reduction thus far, has been due to fiddling with stuff in
    my compiler.

    In theory, BJX2 (XG2) should be able to win in terms of code-density, as
    the only cases where RISC-V has an advantage do not appear to be
    statistically significant.


    As also noted, I am using "-ffunction-sections" and similar (to allow
    GCC to prune unreachable functions), otherwise there is "no contest"
    (easier to win against 540K than 290K...).


    Sadly, the closest-to-viable option for RV would be to add the SHORI
    instruction and optionally pattern-match it in fetch/decode.

    Or, say:
       LUI Xn, Imm20
       ADD Xn, Xn, Imm12
       SHORI Xn, Imm16
       SHORI Xn, Imm16

    Then, combine LUI+ADD into a 32-bit load in the decoder (though
    probably only if the Imm12 is positive), and 2x SHORI into a combined
    "Xn=(Xn<<32)|Imm32" operation.

    This could potentially get it down to 2 clock cycles.

    Universal constants gets this down to 0 cycles......


    Possibly.


    *: To add a jumbo prefix, one needs an encoding that:
       Uses up a really big chunk of encoding space;
       Is otherwise illegal and unused.
    RISC-V doesn't have anything here.

    Which is WHY you should not jump ship from SH to RV, but jump to an
    ISA without these problems.


    Of the options that were available at the time:
    SuperH: Simple encoding and decent code density;
    RISC-V: Seemed like it would have had worse code density.
    Though, it seems that RV beats SH in this area.
    Thumb: Uglier encoding and some more awkward limitations vs SH.
    Also, condition codes, etc.
    Thumb2: Was still patent-encumbered at the time.
    PowerPC: Bleh.
    ...


    The main reason for RISC-V support is not due to "betterness", but
    rather because RISC-V is at least semi-popular (and not as bad as I
    initially thought, in retrospect).


    Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
    space that aren't yet used for anything, but aren't usable as normal
    encoding space mostly because if I put instructions in there (with the
    existing encoding schemes), I couldn't use all the registers (and they
    would not have predication or similar either). Annoyingly, the only
    types of encodings that would fit in there at present are 2RI Imm16
    ops or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
    encodings for R0..R31 anyways, interpreting the LSB of the register
    field as encoding R32..R63).

    Just another reason not to stay with what you have developed.

    In comparison, I reserve 6 major OpCodes so that a control transfer into
    data is highly likely to get Undefined OpCode exceptions rather than an
    attempt to execute what is in that data. Then, as it is, I still have 21
    slots in the major OpCode group free (27 if you count the permanently
    reserved).

    Much of this comes from side effects of Universal Constants.


    An encoding that can MOV a 64-bit constant in 96 bits (12 bytes) and
    1 cycle is preferable....

    A consuming instruction where you don't even use a register is better
    still !!


    Can be done, but thus far only with 33-bit immediate values. Luckily,
    Imm33s seems to address around 99% of uses (for normal ALU ops and similar).

    What do you do when accessing data that the linker knows is more than
    4GB away from IP ?? or known to be outside of 0-4GB ?? externs, GOT,
    PLT, ...

    Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
    or 2x S.E8.F19), which would have indirectly allowed the Imm57s case.
    By itself, though, the difference doesn't seem enough to justify
    the cost.

    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.



    I do at least have some confidence that my stuff can be made usable on affordable FPGAs.

    Some of the stuff you argue for, I don't feel is viable on this class of hardware.


    Like, the challenge would be to, say, make a soft-processor and fit all
    of the stuff you are arguing for into an XC7S50 or similar (say, on one
    of the Arty boards or something).

    Or, some other sub $400 or so FPGA board (that can be targeted with the
    free version of Vivado or similar...).
    Something like a Lattice ECP5 is probably OK.

    Though, Cyclone-V or Zynq is probably not; too much room for "cheating"
    there by leveraging the ARM cores...




    Don't have enough bits in the encoding scheme to pull off a 3RI Imm64
    in 12 bytes (and allowing a 16-byte encoding would have too steep of a
    cost increase to be worthwhile).

    And yet I did.


    I am not saying it is impossible, only that I can't pull it off with my existing encoding.


    I guess it could be possible if I burnt all of the remaining encoding
    bits on it (effectively 27-bit jumbo prefixes, + the WI bit in the final
    instruction).

    This would preclude using these bits for anything else though.
    Debatable if it is "worth it".



    So, alas...

    Yes, alas..........

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 12 01:40:27 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    std #4614300636657501161,[sp,88] // a[0]
    std #4645348406721991307,[sp,104] // a[2]
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7
    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11
    ; %bb.3:
    fmul r3,r1,r1
    fdiv r3,#1,r3
    mov r4,#0x3F90B4FB18B485C7 // p[5]
    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]
    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header: Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1
    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail
    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header: Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function
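
    For reference, a C paraphrase of what the LBB141_10 common tail computes
    (the usual Cody-style erf split: x is broken at 1/16 granularity so that
    exp(-x*x) comes from two well-conditioned exponentials; the pflt sign
    fixup is omitted):

        #include <math.h>

        double common_tail(double x, double ratio)  /* ratio from the poly code */
        {
            double xn  = (double)(long)(x * 16.0) / 16.0; /* fmul, fmul, cvtds, cvtsd */
            double del = (x - xn) * (x + xn);
            double r   = exp(-xn * xn) * exp(-del);       /* the two fexp ops */
            return (0.5 - r * ratio) + 0.5;               /* the fmac/fadd pair */
        }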
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Apr 12 13:40:01 2024
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.

    Maybe. But out of the 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think I had seen a LD-OP
    architecture that had a SUBR instruction. Maybe the TI TMS320C30?

    ARM has LDADD - negate one argument and it becomes a subtract.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Apr 12 18:08:33 2024
    From Newsgroup: comp.arch

    On Fri, 12 Apr 2024 13:40:01 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.

    Maybe. But out of the 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think I had seen a LD-OP
    architecture that had a SUBR instruction. Maybe the TI TMS320C30?

    ARM has LDADD - negate one argument and it becomes a subtract.


    ARM LDADD is not a LD-OP instruction. It is RMW.
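
    That is, LDADD is an atomic read-modify-write on the memory location,
    equivalent to C11's atomic_fetch_add, not a form that feeds memory into
    an ALU op's source operand:

        #include <stdatomic.h>

        int fetch_add(atomic_int *p, int v)
        {
            return atomic_fetch_add(p, v);  /* returns old value; memory += v */
        }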


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Fri Apr 12 13:12:28 2024
    From Newsgroup: comp.arch

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    The patterns don't really fit observations for typical compiler output
    though (mostly in the FP constants; in particular, ones that fall
    outside the scope of what can be exactly represented as Binary16 or
    similar are rare).


    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    ADD -128, SP
    std #4614300636657501161,[sp,88] // a[0]
    MOV 0x400949FB3ED443E9, R3
    MOV.Q R3, (SP, 88)
    std #4645348406721991307,[sp,104] // a[2]
    MOV 0x407797C38897528B, R3
    MOV.Q R3, (SP, 104)
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    ... pattern is obvious enough.
    Each constant needs 12 bytes, plus a 4-byte store, so 16 bytes per store.

    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    FABS R5, R6
    FLDH 0x3780, R3 //A
    FCMPGT R3, R6 //A
    BT .LBB141_6 //A

    Or (FP-IMM extension):

    FABS R5, R6
    FCMPGE 0x0DE, R6 //B (FP-IMM)
    BF .LBB141_6 //B

    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7

    FCMPGE 0x110, R6
    BF .LBB141_7

    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11

    MOV 0x403A8B020C49BA5E, R3
    FCMPGT R3, R6
    BT .LBB141_11

    Where FP-IMM won't work with that value.


    ; %bb.3:
    fmul r3,r1,r1
    FMUL R5, R5, R7
    fdiv r3,#1,r3
    Skip, operation gives identity?...

    mov r4,#0x3F90B4FB18B485C7 // p[5]
    Similar.

    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]

    Turns into 4 constants and 7 FPU instructions (or 4 with the FMAC
    extension). Though, at present, FMAC is slower than separate FMUL+FADD.

    So, between 8 and 11 instructions.


    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]

    These can map 1:1.

    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header:
    Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1

    Could be mapped to a scalar loop, pretty close to 1:1.

    Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
    guessing 2 copies for a 4-element vector?...


    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail

    Same patterns as before.
    Would need ~ 10 ops.

    Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
    but this would only give "Binary32 truncated to 29 bits" precision for
    the immediate values.

    Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
    give ~ 53 bits of precision. But, if one needs full Binary64, this will
    not gain much in this case.
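
    A hedged sketch of the 29-bit FP-IMM expansion (layout assumed to be
    S.E8.F20, i.e. a Binary32 with the low 3 fraction bits dropped;
    zero/Inf/NaN handling omitted):

        #include <stdint.h>

        double fpimm29_to_f64(uint32_t imm29)
        {
            uint64_t s = (uint64_t)((imm29 >> 28) & 1) << 63;
            uint64_t e = (uint64_t)(((imm29 >> 20) & 0xFF) - 127 + 1023) << 52;
            uint64_t f = (uint64_t)(imm29 & 0xFFFFF) << 32; /* left-justify */
            union { uint64_t u; double d; } r = { s | e | f };
            return r.d;
        }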


    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header:
    Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function

    Don't really have time at the moment to comment on the rest of this...


    In other news, found a bug in the function dependency-walking code.

    Fixing this bug got things a little closer to break-even with RV64G GCC
    output regarding ".text" size (though it was still not sufficient to
    entirely close the gap).


    This was mostly based on noting that the compiler output had included
    some things that were not reachable from within the program being
    compiled (namely, that the Doom build had included a copy of the
    MS-CRAM video decoder and similar, which was not reachable from
    anywhere within Doom).
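
    The shape of the fix, illustratively (not the actual compiler code):

        #include <stddef.h>

        typedef struct Func Func;
        struct Func { int reached; size_t ncalls; Func **calls; };

        static void mark_reachable(Func *f)
        {
            if (!f || f->reached) return;
            f->reached = 1;
            for (size_t i = 0; i < f->ncalls; i++)
                mark_reachable(f->calls[i]); /* walk the direct-call graph */
        }
        /* The emit pass then skips any Func with reached == 0, e.g. the
           stray MS-CRAM decoder noted above. */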

    Some more analysis may be needed.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Fri Apr 12 23:46:33 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    The patterns don't really fit observations for typical compiler output
    though (mostly in the FP constants; in particular, ones that fall
    outside the scope of what can be exactly represented as Binary16 or
    similar are rare).


    .globl r8_erf ; -- Begin function r8_erf
    .type r8_erf,@function
    r8_erf: ; @r8_erf
    ; %bb.0:
    add sp,sp,#-128
    ADD -128, SP
    std #4614300636657501161,[sp,88] // a[0]
    MOV 0x400949FB3ED443E9, R3
    MOV.Q R3, (SP, 88)
    std #4645348406721991307,[sp,104] // a[2]
    MOV 0x407797C38897528B, R3
    MOV.Q R3, (SP, 104)
    std #4659275911028085274,[sp,112] // a[3]
    std #4595861367557309218,[sp,120] // a[4]
    std #4599171895595656694,[sp,40] // p[0]
    std #4593699784569291823,[sp,56] // p[2]
    std #4580293056851789237,[sp,64] // p[3]
    std #4559215111867327292,[sp,72] // p[4]
    std #4580359811580069319,[sp,80] // p[4]
    std #4612966212090462427,[sp] // q[0]
    std #4602930165995154489,[sp,16] // q[2]
    std #4588882433176075751,[sp,24] // q[3]
    std #4567531038595922641,[sp,32] // q[4]
    ... pattern is obvious enough.
    Each constant needs 12 bytes, plus a 4-byte store, so 16 bytes per store.

    But 2 instructions instead of 1 and 16 bytes instead of 12.

    fabs r2,r1
    fcmp r3,r2,#0x3EF00000 // thresh
    bnlt r3,.LBB141_6
    FABS R5, R6
    FLDH 0x3780, R3 //A
    FCMPGT R3, R6 //A
    BT .LBB141_6 //A

    Or (FP-IMM extension):

    FABS R5, R6
    FCMPGE 0x0DE, R6 //B (FP-IMM)
    BF .LBB141_6 //B

    ; %bb.1:
    fcmp r3,r2,#4 // xabs <= 4.0
    bnlt r3,.LBB141_7

    FCMPGE 0x110, R6
    BF .LBB141_7

    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E // xbig
    bngt r3,.LBB141_11

    MOV 0x403A8B020C49BA5E, R3
    FCMPGT R3, R6
    BT .LBB141_11

    Where FP-IMM won't work with that value.

    Value came from source code.

    ; %bb.3:
    fmul r3,r1,r1
    FMUL R5, R5, R7
    fdiv r3,#1,r3
    Skip, operation gives identity?...

    It is a reciprocal:: R3 = #1.0/R3

    mov r4,#0x3F90B4FB18B485C7 // p[5]
    Similar.

    fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
    fadd r5,r3,#0x40048C54508800DB // q[0]
    fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
    fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]

    Turns into 4 constants and 7 FPU instructions (or 4 with the FMAC
    extension). Though, at present, FMAC is slower than separate FMUL+FADD.

    So, between 8 and 11 instructions.

    Instead of 4.....

    fmul r4,r3,r4
    fmul r6,r3,r6
    mov r5,#2
    add r7,sp,#40 // p[*]
    add r8,sp,#0 // q[*]

    These can map 1:1.

    LBB141_4: ; %._crit_edge11
    ; =>This Inner Loop Header:
    Depth=1
    vec r9,{r4,r6}
    ldd r10,[r7,r5<<3,0] // p[*]
    ldd r11,[r8,r5<<3,0] // q[*]
    fadd r6,r6,r10
    fadd r4,r4,r11
    fmul r4,r3,r4
    fmul r6,r3,r6
    loop ne,r5,#4,#1

    Could be mapped to a scalar loop, pretty close to 1:1.

    I have 7 instructions in the loop, you would have 9.

    Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
    guessing 2 copies for a 4-element vector?...


    ; %bb.5:
    fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
    fmul r3,r3,r5
    fadd r4,r4,#0x3F632147A014BAD1 // q[4]
    fdiv r3,r3,r4
    fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
    fdiv r3,r3,r2
    br .LBB141_10 // common tail

    Same patterns as before.
    Would need ~ 10 ops.

    Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
    but this would only give "Binary32 truncated to 29 bits" precision for
    the immediate values.

    Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
    give ~ 53 bits of precision. But, if one needs full Binary64, this will
    not gain much in this case.


    LBB141_6: ; %._crit_edge
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
    sra r2,r2,<1:13>
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322 // a[4]
    fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
    fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
    ldd r4,[sp,104] // a[2]
    fmac r3,r2,r3,r4
    fadd r4,r2,#0x403799EE342FB2DE // b[0]
    fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
    fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
    fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
    fdiv r2,r1,r2
    mov r1,r2
    add sp,sp,#128
    ret // 68
    LBB141_7:
    fmul r3,r2,#0x3E571E703C5F5815 // c[8]
    mov r5,#0
    mov r4,r2
    LBB141_8: ; =>This Inner Loop Header:
    Depth=1
    vec r6,{r3,r4}
    ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
    fadd r3,r3,r7
    fmul r3,r2,r3
    ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
    fadd r4,r4,r7
    fmul r4,r2,r4
    loop ne,r5,#7,#1
    ; %bb.9:
    fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
    fadd r4,r4,#0x4093395B7FD35F61 // d[7]
    fdiv r3,r3,r4
    LBB141_10: // common tail
    fmul r4,r2,#0x41800000 // 16.0
    fmul r4,r4,#0x3D800000 // 1/16.0
    cvtds r4,r4 // (signed)double
    cvtsd r4,r4 // (double)signed
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4 // exp()
    fmul r2,r2,-r5
    fexp r2,r2 // exp()
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000 // 0.5
    fadd r2,r2,#0x3F000000 // 0.5
    pflt r1,0,T
    fadd r2,#0,-r2
    mov r1,r2
    add sp,sp,#128
    ret
    LBB141_11:
    fcmp r1,r1,#0
    sra r1,r1,<1:13>
    cvtsd r2,#-1 // (double)-1
    cvtsd r3,#1 // (double)+1
    mux r2,r1,r3,r2
    mov r1,r2
    add sp,sp,#128
    ret
    Lfunc_end141:
    .size r8_erf, .Lfunc_end141-r8_erf
    ; -- End function

    Don't really have time at the moment to comment on the rest of this...


    In other news, found a bug in the function dependency-walking code.

    Fixing this bug got things a little closer to break-even with RV64G GCC
    output regarding ".text" size (though it was still not sufficient to
    entirely close the gap).


    This was mostly based on noting that the compiler output had included
    some things that were not reachable from within the program being
    compiled (namely, noticing that the Doom build had included a copy of
    the MS-CRAM video decoder and similar, which was not reachable from
    anywhere within Doom).

    Some more analysis may be needed.

    ....
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Apr 13 03:17:43 2024
    From Newsgroup: comp.arch

    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

        .globl    r8_erf                          ; -- Begin function r8_erf
        .type    r8_erf,@function
    r8_erf:                                 ; @r8_erf
    ; %bb.0:
        add    sp,sp,#-128
        std    #4614300636657501161,[sp,88]    // a[0]
        std    #4645348406721991307,[sp,104]    // a[2]
        std    #4659275911028085274,[sp,112]    // a[3]
        std    #4595861367557309218,[sp,120]    // a[4]
        std    #4599171895595656694,[sp,40]    // p[0]
        std    #4593699784569291823,[sp,56]    // p[2]
        std    #4580293056851789237,[sp,64]    // p[3]
        std    #4559215111867327292,[sp,72]    // p[4]
        std    #4580359811580069319,[sp,80]    // p[4]
        std    #4612966212090462427,[sp]    // q[0]
        std    #4602930165995154489,[sp,16]    // q[2]
        std    #4588882433176075751,[sp,24]    // q[3]
        std    #4567531038595922641,[sp,32]    // q[4]
        fabs    r2,r1
        fcmp    r3,r2,#0x3EF00000        // thresh
        bnlt    r3,.LBB141_6
    ; %bb.1:
        fcmp    r3,r2,#4            // xabs <= 4.0
        bnlt    r3,.LBB141_7
    ; %bb.2:
        fcmp    r3,r2,#0x403A8B020C49BA5E    // xbig
        bngt    r3,.LBB141_11
    ; %bb.3:
        fmul    r3,r1,r1
        fdiv    r3,#1,r3
        mov    r4,#0x3F90B4FB18B485C7        // p[5]
        fmac    r4,r3,r4,#0x3FD38A78B9F065F6    // p[0]
        fadd    r5,r3,#0x40048C54508800DB    // q[0]
        fmac    r6,r3,r4,#0x3FD70FE40E2425B8    // p[1]
        fmac    r4,r3,r5,#0x3FFDF79D6855F0AD    // q[1]
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        mov    r5,#2
        add    r7,sp,#40            // p[*]
        add    r8,sp,#0            // q[*]
    LBB141_4:                              ; %._crit_edge11
                                           ; =>This Inner Loop Header: Depth=1
        vec    r9,{r4,r6}
        ldd    r10,[r7,r5<<3,0]        // p[*]
        ldd    r11,[r8,r5<<3,0]        // q[*]
        fadd    r6,r6,r10
        fadd    r4,r4,r11
        fmul    r4,r3,r4
        fmul    r6,r3,r6
        loop    ne,r5,#4,#1
    ; %bb.5:
        fadd    r5,r6,#0x3F4595FD0D71E33C    // p[4]
        fmul    r3,r3,r5
        fadd    r4,r4,#0x3F632147A014BAD1    // q[4]
        fdiv    r3,r3,r4
        fadd    r3,#0x3FE20DD750429B6D,-r3    // c[0]
        fdiv    r3,r3,r2
        br    .LBB141_10            // common tail
    LBB141_6:                              ; %._crit_edge
        fmul    r3,r1,r1
        fcmp    r2,r2,#0x3C9FFE5AB7E8AD5E    // xsmall
        sra    r2,r2,<1:13>
        cvtsd    r4,#0
        mux    r2,r2,r3,r4
        mov    r3,#0x3FC7C7905A31C322        // a[4]
        fmac    r3,r2,r3,#0x400949FB3ED443E9    // a[0]
        fmac    r3,r2,r3,#0x405C774E4D365DA3    // a[1]
        ldd    r4,[sp,104]            // a[2]
        fmac    r3,r2,r3,r4
        fadd    r4,r2,#0x403799EE342FB2DE    // b[0]
        fmac    r4,r2,r4,#0x406E80C9D57E55B8    // b[1]
        fmac    r4,r2,r4,#0x40940A77529CADC8    // b[2]
        fmac    r3,r2,r3,#0x40A912C1535D121A    // a[3]
        fmul    r1,r3,r1
        fmac    r2,r2,r4,#0x40A63879423B87AD    // b[3]
        fdiv    r2,r1,r2
        mov    r1,r2
        add    sp,sp,#128
        ret                // 68
    LBB141_7:
        fmul    r3,r2,#0x3E571E703C5F5815    // c[8]
        mov    r5,#0
        mov    r4,r2
    LBB141_8:                              ; =>This Inner Loop Header: Depth=1
        vec    r6,{r3,r4}
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
        fadd    r3,r3,r7
        fmul    r3,r2,r3
        ldd    r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
        fadd    r4,r4,r7
        fmul    r4,r2,r4
        loop    ne,r5,#7,#1
    ; %bb.9:
        fadd    r3,r3,#0x4093395B7FD2FC8E    // c[7]
        fadd    r4,r4,#0x4093395B7FD35F61    // d[7]
        fdiv    r3,r3,r4
    LBB141_10:                // common tail
        fmul    r4,r2,#0x41800000        // 16.0
        fmul    r4,r4,#0x3D800000        // 1/16.0
        cvtds    r4,r4                // (signed)double
        cvtsd    r4,r4                // (double)signed
        fadd    r5,r2,-r4
        fadd    r2,r2,r4
        fmul    r4,r4,-r4
        fexp    r4,r4                // exp()
        fmul    r2,r2,-r5
        fexp    r2,r2                // exp()
        fmul    r2,r4,r2
        fadd    r2,#0,-r2
        fmac    r2,r2,r3,#0x3F000000        // 0.5
        fadd    r2,r2,#0x3F000000        // 0.5
        pflt    r1,0,T
        fadd    r2,#0,-r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    LBB141_11:
        fcmp    r1,r1,#0
        sra    r1,r1,<1:13>
        cvtsd    r2,#-1                // (double)-1
        cvtsd    r3,#1                // (double)+1
        mux    r2,r1,r3,r2
        mov    r1,r2
        add    sp,sp,#128
        ret
    Lfunc_end141:
        .size    r8_erf, .Lfunc_end141-r8_erf
                                           ; -- End function

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output,
    though (mostly in the FP constants; constants that fall outside the
    scope of what can be exactly represented as Binary16 or similar are
    rare).

    You are N E V E R going to find the coefficients of a Chebyshev
    polynomial to fit in a small FP container; excepting the very
    occasional C0 or C1 term {which are mostly 1.0 and 0.0}
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB@cr88192@gmail.com to comp.arch on Sat Apr 13 01:12:53 2024
    From Newsgroup: comp.arch

    On 4/12/2024 10:17 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

    <snip: full r8_erf listing, quoted in the previous message>

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output,
    though (mostly in the FP constants; constants that fall outside the
    scope of what can be exactly represented as Binary16 or similar are
    rare).

    You are N E V E R going to find the coefficients of a Chebyshev
    polynomial to fit in a small FP container; excepting the very
    occasional C0 or C1 term {which are mostly 1.0 and 0.0}


    Some stats I have (for GLQuake):
    14.9% of constants are floating-point.
    10.99% are FP and can be expressed exactly as Binary16.
    7.3% as Fp5 (E3.F2)
    9.5% as Fp10 (S.E5.F6)
    1.3% can be expressed in Binary32
    2.7% need Binary64.

    If scaled so that this is only FP constants:
    73.0% are Binary16
    8.7% are Binary32
    18.1% are Binary64



    Granted, this is inexact, as the stat is based on pattern recognition
    rather than type. However, given that for Doom the total percentage of constants flagged as FP drops to around 1%, probably not too far off.
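
    For reference, that sort of pattern test can be done with round-trips
    and a few bit checks (a sketch, not BGBCC's actual classifier):

        #include <stdint.h>
        #include <string.h>

        // Exactly representable as Binary32: the float round-trip is lossless.
        static int fits_binary32(double d) {
            float f = (float)d;
            return (double)f == d;
        }

        // Exactly representable as (normal or zero) Binary16: must fit
        // Binary32, use <= 10 fraction bits, and have exponent in [-14,15].
        // Subnormals, infinities, and NaNs are ignored for brevity.
        static int fits_binary16(double d) {
            float f = (float)d;
            if ((double)f != d) return 0;
            if (f == 0.0f) return 1;
            uint32_t u;
            memcpy(&u, &f, sizeof u);
            int exp = (int)((u >> 23) & 0xFF) - 127;
            if (exp < -14 || exp > 15) return 0;
            return (u & 0x1FFF) == 0;   // low 13 of 23 fraction bits unused
        }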


    So, here, it seems it is common enough that the ability to load it into
    a register in 1 cycle is worthwhile, but not so much that I am all that
    worried about needing to spend an instruction to do so.

    More so, when the 1 cycle spent on the constant load is overshadowed by
    the 6 cycles it takes to do a Binary64 FADD or FMUL (faster only exists
    for low-precision ops, or for Binary16/Binary32 SIMD).


    Can also note, for integer immediate values:
      3RI Imm9un: 97% hit-rate
        2% turn into Jumbo Imm33s
        1% require a separate constant
      2RI Imm10un: 94% hit-rate
        4.4% turn into 2RI Imm16
        1.5% turn into Jumbo Imm33s
        0.1% require a separate constant
      Ld/St Disp9u: 96.4% hit-rate
        0.18% are negative
        3.42% turn into Jumbo Disp33s


    For RISC-V, the Imm12s case does result in a better hit rate for the
    basic instructions, albeit the fallback case is worse (LUI+ADD or a
    memory load).

    Whereas, in my case, it is more a question of whether it ends up better
    to load the immediate into a register or to use a jumbo prefix (where
    the compiler may look forward and make a guess).
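
    A toy version of that decision (my own naming and cut-offs, not
    BGBCC's): classify each integer immediate by the cheapest form that
    can hold it, along the lines of the cases in the stats above:

        #include <stdint.h>

        typedef enum { IMM9U, IMM16U, IMM33S, CONST_LOAD } ImmForm;

        static ImmForm pick_imm_form(int64_t v) {
            if (v >= 0 && v < (1 << 9))               return IMM9U;   // base 3RI form
            if (v >= 0 && v < (1 << 16))              return IMM16U;  // 2RI Imm16 (assumed unsigned)
            if (v >= -(1LL << 32) && v < (1LL << 32)) return IMM33S;  // jumbo-prefixed
            return CONST_LOAD;  // load a separate constant into a register
        }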



    Even in a world where I could directly store constants to memory, or
    glue full 64-bit constants onto any instruction, it doesn't seem likely
    that this would have all that large an impact on either program size or
    performance.





    Though, at the moment (in ongoing compiler fiddling), I am not seeing
    much more evidence of unrelated / unreachable code being included in the binary. Seems like this optimization case may be used up.

    This leaves roughly another 4% remaining...

    I guess, will see how much of a fight this last 4% puts up...


    Though, looks like I could in theory shave several kB off mostly by
    disabling the memcpy slide and making the limit for inline memcpy
    smaller and similar, but this comes at a performance impact (needs to go
    the slower route of "actually calling memcpy()" in more cases...). Will continue to look for other options.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Sat Apr 13 12:51:18 2024
    From Newsgroup: comp.arch

    On 4/9/24 13:24, Thomas Koenig wrote:
    I wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Thus, instead of having mode bits, one _could_ do the following:

    Usually, have 28 bit instructions that are shorter because there's
    only one opcode for each floating and integer operation. The first
    four bits in a block give the lengths of data to be used.

    But have one value for the first four bits in a block that indicates
    36-bit instructions instead, which do include type information, so
    that very occasional instructions for rarely-used types can be mixed
    in which don't fill a whole block.

    While that's a theoretical possibility, I don't view it as being
    worthwhile in practice.

    I played around a bit with another scheme: Encoding things into
    128-bit blocks, with either 21-bit or 42-bit or longer instructions
    (or a block header with six bits, and 20 or 40 bits for each
    instruction).

    Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
    destructive operand model for the 21-bit encodings. Yes :: no ??

    It was not very well developed, I gave it up when I saw there wasn't
    much to gain.

    Maybe one more thing: In order to justify the more complex encoding,
    I was going for 64 registers, and that didn't work out too well
    (missing bits).

    Having learned about M-Core in the meantime, a pure 32-register,
    21-bit instruction ISA might actually work better.
    If you want to know more about MCore, you can contact me.
    I was the initial designer of the mcore ISA. It was targeted
    at embedded processors, particularly control processors in phones
    and radios. It was extended and found its way into GPS receivers and
    set top boxes. Motorola licensed it to the Chinese and there it is
    known as CSky ISAv1 (there is a different ISAv2). There is even a
    supported Linux port of CSky v1.

    brian

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 14 22:58:22 2024
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    <snip>

    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data. But what happens with separate compilations? The called function probably doesn’t know the
    tag value for callee saved registers. Fortunately, the My 66000 architecture comes to the rescue here. You would modify the Enter and
    Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost). The same mechanism works for interrupts that take control away from a running process.

    I had missed this until now:: The stack remains 64-bit aligned at all times,
    so if you add 32-bits to the stack you actually add 64-bits to the stack.

    Given this, you can effectively use a 2-bit tag {integral, floating, pointing, describing}. The difference between pointing and describing is that pointing
    is C-like, while describing is dope-vector-like. {{Although others may find something else to put in the 4-th slot.}}

    Any comments are welcome.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Apr 14 23:25:52 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
    {
    uint64_t c = 0;
    for( int i = 0; i < n; i++ )
    {
    {c, sum[i]} = a[i] + b[i] + c; // pseudocode: 65-bit sum, carry back into c
    }
    return;
    }

    Assembly code::

    .global mpn_add_n
    mpn_add_n:
    MOV R5,#0 // c
    MOV R6,#0 // i

    VEC R7,{}
    LDD R8,[R2,Ri<<3]
    LDD R9,[R3,Ri<<3]
    CARRY R5,{{IO}}
    ADD R10,R8,R9
    STD R10,[R1,Ri<<3]
    LOOP LT,R6,#1,R4
    RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a
    performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does
    CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.
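
    The {c, sum[i]} line in the source is pseudocode for a 65-bit sum; in
    plain C (a sketch, not GMP's actual mpn_add_n) the same carry chain
    would look like:

        #include <stdint.h>

        void mpn_add_n_c(uint64_t *sum, const uint64_t *a,
                         const uint64_t *b, int n)
        {
            uint64_t c = 0;
            for (int i = 0; i < n; i++) {
                uint64_t t = a[i] + c;     // wraps only when c=1 and a[i]=~0
                uint64_t c1 = (t < c);
                sum[i] = t + b[i];
                c = c1 + (sum[i] < t);     // total carry out is 0 or 1
            }
        }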
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Apr 15 10:02:46 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    Source code:

    void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
    {
        uint64_t c = 0;
        for( int i = 0; i < n; i++ )
        {
             {c, sum[i]} = a[i] + b[i] + c; // pseudocode: 65-bit sum, carry back into c
        }
        return;
    }

    Assembly code::

        .global mpn_add_n
    mpn_add_n:
        MOV   R5,#0     // c
        MOV   R6,#0     // i

        VEC   R7,{}
        LDD   R8,[R2,Ri<<3]
        LDD   R9,[R3,Ri<<3]
        CARRY R5,{{IO}}
        ADD   R10,R8,R9
        STD   R10,[R1,Ri<<3]
        LOOP  LT,R6,#1,R4
        RET

    So, adding a few "bells and whistles" to RISC-V does give you a
    performance gain (1.38×); using a well designed ISA gives you a performance gain of 2.00× !! {{moral: don't stop too early}}

    Note that all the register bookkeeping has disappeared !! because
    of the indexed memory reference form.

    As I count executing instructions, VEC does not execute, nor does CARRY--CARRY causes the subsequent ADD to take C input as carry and
    the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
    BC sequence in a single instruction and in a single clock.
    ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
    xor rax,rax ;; Clear carry
    next:
    mov rax,[rsi+rcx*8]
    adc rax,[rdx+rcx*8]
    mov [rdi+rcx*8],rax
    inc rcx
    jnz next
    The code above is 5 instructions, or 6 if we avoid the load-op, doing
    two loads and one store, so it should only be limited by the latency of
    the ADC, i.e. one or two cycles.
    In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:
    next:
    adc eax,ebx
    mov ebx,[edx+ecx*4] ; First cycle
    mov [edi+ecx*4],eax
    mov eax,[esi+ecx*4] ; Second cycle
    inc ecx
    jnz next ; Third cycle
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Apr 15 11:16:15 2024
    From Newsgroup: comp.arch

    Terje Mathisen wrote:
    MitchAlsup1 wrote:
    Anton Ertl wrote:

    I have a similar problem for the carry and overflow bits in
    < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
    let those bits not survive across calls; if there was a cheap solution
    for the problem, it would eliminate this drawback of my idea.

    My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
    whereas RISC-V encodes the inner loop in 11 instructions.

    <snip: quoted mpn_add_n source, My 66000 assembly, and commentary>

      ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
      xor rax,rax ;; Clear carry
    next:
      mov rax,[rsi+rcx*8]
      adc rax,[rdx+rcx*8]
      mov [rdi+rcx*8],rax
      inc rcx
       jnz next

    The code above is 5 instructions, or 6 if we avoid the load-op, doing
    two loads and one store, so it should only be limited by the latency of
    the ADC, i.e. one or two cycles.

    In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

     next:
      adc eax,ebx
      mov ebx,[edx+ecx*4]    ; First cycle

      mov [edi+ecx*4],eax
      mov eax,[esi+ecx*4]    ; Second cycle

      inc ecx
       jnz next        ; Third cycle

    In the same bad old days, the standard way to speed it up would have
    used unrolling, but until we got more registers, it would have stopped
    itself very quickly. With AVX2 we could use 4 64-bit slots in a 32-byte register, but then we would have needed to handle the carry propagation manually, and that would take longer than a series of ADC/ADX instructions. next4:
    mov eax,[esi]
    adc eax,[esi+edx]
    mov [esi+edi],eax
    mov eax,[esi+4]
    adc eax,[esi+edx+4]
    mov [esi+edi+4],eax
    mov eax,[esi+8]
    adc eax,[esi+edx+8]
    mov [esi+edi+8],eax
    mov eax,[esi+12]
    adc eax,[esi+edx+12]
    mov [esi+edi+12],eax
    lea esi,[esi+16]
    dec ecx
    jnz next4
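
    For reference, the same carry chain can also be written with the x86
    _addcarry_u64 intrinsic (a sketch, not Terje's code); compilers
    typically turn this into an ADC loop, and ADX-aware ones can split
    independent chains:

        #include <stdint.h>
        #include <immintrin.h>

        static void add_n_adc(uint64_t *sum, const uint64_t *a,
                              const uint64_t *b, long n)
        {
            unsigned char c = 0;
            for (long i = 0; i < n; i++) {
                unsigned long long t;
                c = _addcarry_u64(c, a[i], b[i], &t);
                sum[i] = t;
            }
        }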
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Apr 15 19:03:34 2024
    From Newsgroup: comp.arch

    Michael S wrote:

    On Thu, 11 Apr 2024 18:46:54 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    On 4/11/2024 6:13 AM, Michael S wrote:
    On Wed, 10 Apr 2024 23:30:02 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    It does occupy some icache space, however; have you boosted the
    icache size to compensate?

    The space occupied in the ICache is freed up from being in the
    DCache so the overall hit rate goes up !! At typical sizes,
    ICache miss rate is about ¼ the miss rate of DCache.

    Besides:: if you had to LD the constant from memory, you use a LD
    instruction and 1 or 2 words in DCache, while consuming a GPR. So,
    overall, it takes fewer cycles, fewer GPRs, and fewer
    instructions.

    Alternatively:: if you paste constants together (LUI, AUIPC) you
    have no direct route to either 64-bit constants or 64-bit address
    spaces.

    It looks to be a win-win !!

    Win-win under constraints of Load-Store Arch. Otherwise, it
    depends.

    Never seen a LD-OP architecture where the inbound memory can be in
    the Rs1 position of the instruction.


    May be. But out of 6 major integer OPs it matters only for SUB.
    By now I don't remember for sure, but I think that I had seen a LD-OP architecture that had a SUBR instruction. Maybe TI TMS320C30?
    It was 30 years ago and my memory is not what it used to be.

    That a SUBR instruction exists does not disavow my statement that
    the inbound memory reference was never in the Rs1 position.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Mon Apr 15 20:55:53 2024
    From Newsgroup: comp.arch

    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    In the non-OoO (i.e Pentium) days, I would have inverted the loop in
    order to hide the latencies as much as possible, resulting in an inner
    loop something like this:

    next:
    adc eax,ebx
    mov ebx,[edx+ecx*4] ; First cycle

    mov [edi+ecx*4],eax
    mov eax,[esi+ecx*4] ; Second cycle

    inc ecx
    jnz next ; Third cycle

    Terje

    As opposed to::

    .global mpn_add_n
    mpn_add_n:
    MOV R5,#0 // c
    MOV R6,#0 // i

    VEC R7,{}
    LDD R8,[R2,Ri<<3] // Load 128-to-512 bits
    LDD R9,[R3,Ri<<3] // Load 128-to-512 bits
    CARRY R5,{{IO}}
    ADD R10,R8,R9 // Add pair to add octal
    STD R10,[R1,Ri<<3] // Store 128-to-512 bits
    LOOP LT,R6,#1,R4 // increment 2-to-8 times
    RET

    --------------------------------------------------------

    LDD R8,[R2,Ri<<3] // AGEN cycle 1
    LDD R9,[R3,Ri<<3] // AGEN cycle 2 data cycle 4
    CARRY R5,{{IO}}
    ADD R10,R8,R9 // cycle 4
    STD R10,[R1,Ri<<3] // AGEN cycle 3 write cycle 5
    LOOP LT,R6,#1,R4 // cycle 3

    OR

    LDD LDd
    LDD LDd
    ADD
    ST STd
    LOOP
    LDD LDd
    LDD LDd
    ADD
    ST STd
    LOOP

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !! without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Mon Apr 15 16:14:24 2024
    From Newsgroup: comp.arch

    On 4/12/2024 10:17 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 4/11/2024 6:06 PM, MitchAlsup1 wrote:


    While I admit that <basically> anything bigger than 50-bits will be fine
    as displacements, they are not fine for constants and especially FP
    constants and many bit twiddling constants.


    The number of cases where this comes up is not statistically
    significant enough to have a meaningful impact on performance.

    Fraction of a percent edge-cases are not deal-breakers, as I see it.

    Idle speculation::

    <snip: full r8_erf listing, quoted in full earlier in the thread>

    These patterns seem rather unusual...
    Don't really know the ABI.

    Patterns don't really fit observations for typical compiler output,
    though (mostly in the FP constants; constants that fall outside the
    scope of what can be exactly represented as Binary16 or similar are
    rare).

    You are N E V E R going to find the coefficients of a Chebyshev
    polynomial to fit in a small FP container; excepting the very
    occasional C0 or C1 term {which are mostly 1.0 and 0.0}

    FWIW:
    I went and was able to find some odd corners to fit full Binary64
    encodings into, but currently only in XG2 Mode (there will be
    equivalents in Baseline, albeit effectively truncated to 56 bits).

    So, went and burned the remaining bits in the Jumbo prefixes in XG2
    Mode, which now allows:
    ADD/SUB/AND/OR/XOR, and FADD/FMUL with full 64-bit immediate values.

    Ended up going and burning the MULS.L/MULU.L and ADDS.L/ADDU.L encodings
    on FADD and FMUL, so:
    FEii-iiii FEii-iiii F2nm-2Gii FMUL Rm, Imm56f, Rn
    FEii-iiii FEii-iiii F2nm-3Gii FADD Rm, Imm56f, Rn

    With Imm56f encoded like Imm57s (extended to 64-bit), but:
    Imm(55: 0) goes into (63: 8)
    Imm(62:56) goes into ( 7: 1)
    Imm( 63) goes into ( 0)
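
    In C terms, my reading of that bit shuffle (a hypothetical helper, not
    BGBCC's code):

        #include <stdint.h>

        // Reconstruct the 64-bit FP value from the Imm56f field.
        static uint64_t imm56f_decode(uint64_t imm) {
            uint64_t v = 0;
            v |= (imm & 0x00FFFFFFFFFFFFFFull) << 8;  // Imm(55: 0) -> (63:8)
            v |= ((imm >> 56) & 0x7F) << 1;           // Imm(62:56) -> ( 7:1)
            v |= (imm >> 63) & 1;                     // Imm( 63)   -> ( 0)
            return v;
        }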

    Effects on code-size and performance: Minimal.
    Saves roughly 600 bytes from GLQuake;
    No real effect on the size of Doom.



    As for my fiddling with getting ".text" size down (testing mostly with
    Doom):
    XG2 : 293K
    RV64G: 285K

    Of this, there is still effectively 10K in Jumbo prefixes, and RV64G has
    an additional 24K in ".rodata".

    This implies, at the moment, XG2 instruction count is ~ 1% less than
    RV64G, albeit text slightly larger due to average instruction size being slightly bigger (~ 4.14 bytes).


    Also noted that the BGBCC compiler output has ~ 40K worth of MOV RegReg instructions, so MOV RegReg is around 14% of the total size of the
    binary (I suspect a big chunk of the relative size reduction of Baseline
    may be due to having a 16-bit encoding for this).

    So, one ongoing goal on the compiler optimization front would be
    continuing to work towards a reduction in the number of MOV instructions.

    ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Apr 16 08:44:26 2024
    From Newsgroup: comp.arch

    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    <snip: re-quoted x86 loop, My 66000 mpn_add_n assembly, and pipeline diagram>

    10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM
    machine !!
    without code scheduling heroics.

    40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM
    machine !!
    It all comes down to the carry propagation, right?
    The way I understood the original code, you are doing a very wide
    unsigned add, so you need a carry to propagate from each and every block to the next, right?
    If you can do that at half a clock cycle per 64 bit ADD, then consider
    me very impressed!
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Apr 16 18:14:39 2024
    From Newsgroup: comp.arch

    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:


    <snip: re-quoted loop code and pipeline diagram>

    It all comes down to the carry propagation, right?

    The way I understood the original code, you are doing a very wide
    unsigned add, so you need a carry to propagate from each and every block
    to the next, right?

    Most ST pipelines have an align stage to align the data being stored to
    where it needs to go; one can extend the carry chain into this stage if
    needed, capturing both a+b and a+b+1 and using the carry in to select
    one or the other.
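
    In C terms, the select trick looks roughly like this (a sketch of the
    technique, not the actual pipeline logic):

        #include <stdint.h>

        // Compute both a+b and a+b+1 early; the late carry-in then only
        // drives a select, staying off the adder's critical path.
        static uint64_t add_select(uint64_t a, uint64_t b,
                                   int cin, int *cout)
        {
            uint64_t s0 = a + b;                 // assumes carry-in = 0
            uint64_t s1 = a + b + 1;             // assumes carry-in = 1
            *cout = cin ? (s1 <= a) : (s0 < a);  // wrap => carry out
            return cin ? s1 : s0;
        }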

    If you can do that at half a clock cycle per 64 bit ADD, then consider
    me very impressed!

    Terje
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Apr 16 15:02:13 2024
    From Newsgroup: comp.arch

    On 4/3/2024 10:24 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.). There would be two additional instructions, load single
    floating and load double floating, which work the same as the other 32-
    and 64-bit loads, but in addition to loading the value, set the tag bit
    for the destination register. Non-floating-point loads would clear the
    tag bit. As I show below, I don’t think you need any special "store
    tag" instructions.
    ...
    But we can go further. There are some opcodes that only make sense for
    FP operands, e.g. the transcendental instructions. And there are some
    operations that probably only make sense for non-FP operands, e.g. POP,
    FF1, probably shifts. Given the tag bit, these could share the same
    op-code. There may be several more of these.

    Certainly makes reading disassembler output fun (or writing the disassembler).

    Good point. It probably isn't too bad for the arithmetic operations,
    etc, but once you extend it as I suggested in the last paragraph it gets
    ugly. :-(


    big snip

    That is as far as I got. I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad. Is it
    worth it? To me, a major question is the effect on performance. What
    is the cost of having to decode the source registers and reading their
    respective tag bits before knowing which FU to use?

    In in OoO CPU, that's pretty heavy.

    OK, but in the vast majority of cases (i.e. unless there is something
    like a conditional branch that uses floating point or integer depending
    upon whether the branch is taken.) the flag bit that a register will
    have can be known well in advance. As I said, IANAHG, but that might
    make it easier.



    But actually, your idea does not need any computation results for
    determining the tag bits of registers (except during EXIT),

    But even here, you almost certainly know what the tag bit for any given register is long before you execute the EXIT instruction. And remember,
    on My 66000 EXIT is performed lazily, so you have time, and the mechanism
    is in place to wait if needed.


    so you
    probably can handle the tags in the front end (decoder and renamer).
    Then the tags are really separate and not part of the registers that
    have to be renamed, and you don't need to perform any waiting on
    ENTER.

    However, in EXIT the front end would have to wait for the result of
    the load/store unit loading the 32 bits, unless you add a special
    mechanism for that. So EXIT would become expensive, one way or the
    other.

    Yes.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Apr 16 15:06:48 2024
    From Newsgroup: comp.arch

    On 4/3/2024 11:44 AM, EricP wrote:
    Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s
    use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, >> it has several features that are “friendly” to the idea.  Second, I
    know Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any
    special "store tag" instructions.

    If you are adding a float/int data type flag you might as well
    also add operand size for floats at least, though some ISA's
    have both int32 and int64 ALU operations for result compatibility.

    Not needed for My 66000, as all floating point loads convert the loaded
    value to double precision.

    big snip

    Currently the opcode data type can tell the uArch how to route
    the  operands internally without knowing the data values.
    For example, FPU reservation stations monitor float operands
    and schedule for just the FPU FADD or FMUL units.

    Dynamic data typing would change that to be data dependent routing.
    It means, for example, you can't begin to schedule a uOp
    until you know all its operand types and opcode.

    Seems right.



    Looks like it makes such distributed decisions impossible.
    Probably everything winds up in a big pile of logic in the center,
    which might be problematic for those things whose complexity grows N^2.
    Not sure how significant that is.

    Could be. Again, IANAHG.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Apr 16 15:08:58 2024
    From Newsgroup: comp.arch

    On 4/3/2024 1:02 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag. If
    set, the bit indicates that the corresponding register contains a
    floating-point value. Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
    minor opcode for arithmetic instructions. Compare with
    https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    I think that is probably true for 32 bit instructions, but what about 16
    bit?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From BGB-Alt@bohannonindustriesllc@gmail.com to comp.arch on Tue Apr 16 17:46:00 2024
    From Newsgroup: comp.arch

    On 4/16/2024 5:08 PM, Stephen Fuld wrote:
    On 4/3/2024 1:02 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    [saving opcodes]


    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.  If >>> set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).

    I don't think this would save a lot of opcode space, which
    is the important thing.

    A typical RISC design has a six-bit major opcode.
    Having three registers takes away fifteen bits, leaving
    eleven, which is far more than anybody would ever want as
    minor opcode for arithmetic instructions.  Compare with
    https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
    where DEC actually left out three bits because they did not
    need them.

    I think that is probably true for 32 bit instructions, but what about 16 bit?


    At least, as I see it...


    If 4-bit registers:
    16-4-4 => 8 opcode bits
    If 5-bit registers:
    16-5-5 => 6 opcode bits

    Realistically, I don't think 6 bits of opcode is enough except if the
    purpose of the 16-bit ops is merely to shorten some common 32-bit ops.

    But, a subset of instructions can use 5-bit fields (say, MOV, EXTS.L,
    and common Load/Store ops).

    Say (in my notation):
    MOV Rm, Rn
    EXTS.L Rm, Rn
    MOV.L (SP, Disp), Rn
    MOV.Q (SP, Disp), Rn
    MOV.X (SP, Disp), Xn
    MOV.L Rn, (SP, Disp)
    MOV.Q Rn, (SP, Disp)
    MOV.X Xn, (SP, Disp)
    As, these tend to be some of the most commonly used instructions.

    For most everything else, one can limit things either to the first 16 registers, or the most commonly used 16 registers (if not equivalent to
    the first 16).

    Though, for 1R ops, it can make sense to have 5-bit registers.


    I don't really think 3-bit register fields are worth bothering with;
    even if limited to the most common registers. Granted, being limited to
    2R encodings is also limiting.

    Granted, both Thumb and RVC apparently thought 3-bit register fields
    were worthwhile, so...

    Similarly, not worth bothering (at all) with 6-bit register fields in
    16-bit ops.


    Though, if one has 16-bit VLE, a question is how is best to split up 16
    vs 32-bit encoding space.

    ...

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Apr 16 16:44:27 2024
    From Newsgroup: comp.arch

    On 4/3/2024 2:30 PM, MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 4/3/2024 11:43 AM, Stephen Fuld wrote:
    There has been discussion here about the benefits of reducing the
    number of op codes.  One reason not mentioned before is if you have
    fixed length instructions, you may want to leave as many codes as
    possible available for future use.  Of course, if you are doing a
    16-bit instruction design, where instruction bits are especially
    tight, you may save enough op-codes to save a bit, perhaps allowing a
    larger register specifier field, or to allow more instructions in the
    smaller subset.

    It is in this spirit that I had an idea, partially inspired by Mill’s >>> use of tags in registers, but not memory.  I worked through this idea
    using the My 6600 as an example “substrate” for two reasons.  First, it
                   66000
    Sorry. Typo.



    has several features that are “friendly” to the idea.  Second, I know >>> Mitch cares about keeping the number of op codes low.

    Please bear in mind that this is just the germ of an idea.  It is
    certainly not fully worked out.  I present it here to stimulate
    discussions, and because it has been fun to think about.

    The idea is to add 32 bits to the processor state, one per register
    (though probably not physically part of the register file) as a tag.
    If set, the bit indicates that the corresponding register contains a
    floating-point value.  Clear indicates not floating point (integer,
    address, etc.).  There would be two additional instructions, load
    single floating and load double floating, which work the same as the
    other 32- and 64-bit loads, but in addition to loading the value, set
    the tag bit for the destination register.  Non-floating-point loads
    would clear the tag bit.  As I show below, I don’t think you need any >>> special "store tag" instructions.

    What do you do when you want an FP bit pattern interpreted as an integer,
    or vice versa?

    As I said below, if you need that, you can use an otherwise "useless" instruction, such as ORing a register with itself, to modify the tag bits.





    When executing arithmetic instructions, if the tag bits of both
    sources of an instruction are the same, do the appropriate operation
    (floating or integer), and set the tag bit of the result register
    appropriately.
    If the tag bits of the two sources are different, I see several
    possibilities.

    1.    Generate an exception.
    2.    Use the sense of source 1 for the arithmetic operation, but
    perform the appropriate conversion on the second operand first,
    potentially saving an instruction

    Conversions to/from FP often require a rounding mode. How do you specify that?

    Good point.




    3.    Always do the operation in floating point and convert the
    integer operand prior to the operation.  (Or, if you prefer, change
    floating point to integer in the above description.)
    4.    Same as 2 or 3 above, but don’t do the conversions.

    I suspect this is the least useful choice.  I am not sure which is
    the best option.

    Given that, use the same op code for the floating-point and fixed
    versions of the same operations.  So we can save eight op codes, the
    four arithmetic operations, max, min, abs and compare.  So far, a net
    savings of six opcodes.

    But we can go further.  There are some opcodes that only make sense
    for FP operands, e.g. the transcendental instructions.  And there are
    some operations that probably only make sense for non-FP operands,
    e.g. POP, FF1, probably shifts.  Given the tag bit, these could share
    the same op-code.  There may be several more of these.

    Hands waving:: "Danger Will Robinson, Danger" more waving of hands.

    Agreed.


    I think this all works fine for a single compilation unit, as the
    compiler certainly knows the type of the data.  But what happens with
    separate compilations?  The called function probably doesn’t know the

    The compiler will certainly have a function prototype. In any event, if
    FP and integers share a register file, the lack of a prototype is much
    less stressful to the compiler/linking system.

    tag value for callee saved registers.  Fortunately, the My 66000
    architecture comes to the rescue here.  You would modify the Enter
    and Exit instructions to save/restore the tag bits of the registers
    they are saving or restoring in the same data structure it uses for
    the registers (yes, it adds 32 bits to that structure – minimal
    cost).  The same mechanism works for interrupts that take control
    away from a running process.

    Yes, but we do just fine without the tag and without the stuff mentioned
    above. Neither ENTER nor EXIT cares about the 64-bit pattern in the
    register.

    I think you need it for callee-saved registers, to ensure the tag is
    set correctly for the calling program upon return to it.
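
    As a sketch of how little state that adds, the save area might look
    like the struct below; the layout and names are invented, not the
    actual My 66000 ENTER/EXIT frame format:

        #include <stdint.h>

        /* Invented layout: the preserved registers plus one 32-bit word
           carrying their tag bits (the "minimal cost" mentioned above). */
        struct save_area {
            uint64_t saved_regs[32];   /* the callee-saved values */
            uint32_t saved_tags;       /* bit i = tag of saved_regs[i] */
        };

        /* EXIT restores values and tags together, so the caller's tags
           are intact on return. */
        void exit_restore(const struct save_area *sa,
                          uint64_t reg[32], uint32_t *tags)
        {
            for (int i = 0; i < 32; i++)
                reg[i] = sa->saved_regs[i];
            *tags = sa->saved_tags;
        }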

    I don’t think you need to set or clear the tag bits without doing
    anything else, but if you do, I think you could “repurpose” some
    other instructions to do this, without requiring another op-code.
    For example, Oring a register with itself could be used to set the
    tag bit and Oring a register with zero could clear it.  These should
    be pretty rare.
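
    In decoder terms, the repurposing could look like the sketch below,
    reusing the register/tag model from the earlier sketch; the two
    encodings follow the suggestion above, everything else is invented:

        #include <stdint.h>

        static uint64_t reg[32];    /* register file and tags, as before */
        static uint32_t tags;

        /* OR of a register with itself, or with zero, leaves the value
           unchanged, so those otherwise-useless forms are repurposed. */
        void or_insn(int rd, int rs1, int rs2, int src2_is_zero_imm)
        {
            if (src2_is_zero_imm && rs1 == rd) {
                tags &= ~(1u << rd);        /* OR rd,rd,#0: clear tag */
            } else if (rs1 == rd && rs2 == rd) {
                tags |= 1u << rd;           /* OR rd,rd,rd: set tag */
            } else {
                uint64_t b = src2_is_zero_imm ? 0 : reg[rs2];
                reg[rd] = reg[rs1] | b;     /* an ordinary OR */
                tags &= ~(1u << rd);        /* non-FP result clears tag */
            }
        }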

    That is as far as I got.  I think you could net save perhaps 8-12 op
    codes, which is about 10% of the existing op codes - not bad.  Is it
    worth it?

    No.

    To me, a major question is the effect on performance.  What is the
    cost of having to decode the source registers and read their
    respective tag bits before knowing which FU to use?

    The problem is that you have made decode dependent on dynamic pipeline
    information. I suggest you don't want to do that. Consider a change from
    int to FP as a predicated instruction: the pipeline cannot DECODE the
    instruction at hand until the predicate resolves. Yech.

    Good point.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Apr 17 01:11:12 2024
    From Newsgroup: comp.arch

    Stephen Fuld wrote:

    On 4/3/2024 11:44 AM, EricP wrote:


    If you are adding a float/int data type flag you might as well
    also add operand size for floats at least, though some ISAs
    have both int32 and int64 ALU operations for result compatibility.

    Not needed for My 66000, as all floating point loads convert the loaded value to double precision.

    Insufficient verbal precision::

    My 66000 only cares about the size of a value being loaded from memory
    (or ST into memory).

    While (float) LDs load the 32-bit value from memory, they remain (float)
    while residing in the register; and the High Order 32-bits are ignored.
    The (float) register can be consumed by a (float) FP calculation and it
    remains (float) after processing.

    Small immediates, when consumed by FP instructions, are converted from
    integer to <sized> FP during DECODE. So::

    FADD R7,R7,#1

    adds 1.0D0 to the (double) value in R7 (and takes one 32-bit instruction), while:

    FADDs R7,R7,#1

    Adds 1.0E0 to the (float) value in R7.
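
    In C terms, that DECODE-time conversion is just an integer-to-FP
    widening of the immediate; the sketch below (names invented) prints
    the standard IEEE 754 bit patterns for 1.0 in each width:

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* FADD widens its small immediate to double, FADDs to float,
           before the operand reaches the FP unit. */
        static uint64_t imm_as_double(int32_t imm)
        {
            double d = (double)imm;        /* #1 becomes 1.0D0 */
            uint64_t bits; memcpy(&bits, &d, sizeof bits);
            return bits;
        }

        static uint32_t imm_as_float(int32_t imm)
        {
            float f = (float)imm;          /* #1 becomes 1.0E0 */
            uint32_t bits; memcpy(&bits, &f, sizeof bits);
            return bits;
        }

        int main(void)
        {
            printf("%016llx\n", (unsigned long long)imm_as_double(1));
            printf("%08x\n", (unsigned)imm_as_float(1));
            /* prints 3ff0000000000000 and 3f800000 */
            return 0;
        }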
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Wed Apr 17 15:06:05 2024
    From Newsgroup: comp.arch

    On Mon, 8 Apr 2024 17:25:38 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Savard <quadibloc@servername.invalid> schrieb:

    Well, when the computer fetches a 256-bit block of code, the first
    four bits indicate whether it is composed of 36-bit instructions or
    28-bit instructions.

    Do you think that instructions which require a certain size (almost)
    always happen to be situated together so they fit in a block?

    Well, floating-point and integer instructions of one size each can be arbitrarily mixed. And when different sizes need to mix, going to
    36-bit instructions is low overhead.

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From John Savard@quadibloc@servername.invalid to comp.arch on Wed Apr 17 15:07:18 2024
    From Newsgroup: comp.arch

    On Mon, 8 Apr 2024 19:56:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    So, instead of using the branch target address, one rounds it down to
    a 256-bit boundary, reads 256 bits, looks at the first 4 bits to
    determine the format, and then uses the branch offset to pick a
    container which will become the first instruction executed.

    Sounds more complicated than necessary.

    Yes, I don't disagree. I'm just pointing out that it's possible to
    make the mini tags idea work that way, since it lets you easily turn
    mini tags off when you need to.
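
    A rough sketch of that redirection, assuming a block holds either
    seven 36-bit or nine 28-bit containers after the 4-bit header
    (4 + 7*36 = 4 + 9*28 = 256); the field conventions are guesses for
    illustration, not the actual encoding:

        #include <stdint.h>

        enum { BLOCK_BYTES = 32 };          /* 256-bit blocks */

        /* Round a branch target down to its block; keep the low bits
           as a container index rather than a byte offset. */
        void resolve_branch(uint64_t target, uint64_t *block,
                            unsigned *index)
        {
            *block = target & ~(uint64_t)(BLOCK_BYTES - 1);
            *index = (unsigned)(target & (BLOCK_BYTES - 1));
        }

        /* Once the block is fetched, its first 4 bits pick the format. */
        unsigned container_bits(unsigned header_nibble)
        {
            return (header_nibble & 1) ? 36 : 28;  /* invented convention */
        }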

    John Savard
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Apr 20 19:19:53 2024
    From Newsgroup: comp.arch

    On 4/11/24 7:12 PM, MitchAlsup1 wrote:
    Scott Lurndal wrote:
    [snip]
    It seems to me that an offloaded DMA engine would be a far
    better way to do memmove (over some threshold, perhaps a
    cache line) without trashing the caches.   Likewise memset.

    Effectively, that is what HW does. Even on the lower-end machines,
    the AGEN unit of the cache access pipeline is repeatedly cycled,
    and data is read and/or written. One can execute instructions not
    needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
    are in progress.

    Moving this sequencer farther out would still require it to consume
    all L1 BW in any event (snooping) for memory consistency reasons.
    {Note: cache accesses are performed line-wide, not register-width
    wide}
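
    Scott's suggestion, in source terms, amounts to something like the
    sketch below, where dma_submit() is an invented stand-in for whatever
    offload interface the platform provides:

        #include <stddef.h>
        #include <string.h>

        /* Hypothetical offload interface -- not a real API. */
        void dma_submit(void *dst, const void *src, size_t n);

        /* Route copies above a threshold (here one cache line) to the
           offloaded engine; small copies stay on the core. */
        void smart_memmove(void *dst, const void *src, size_t n)
        {
            if (n >= 64)
                dma_submit(dst, src, n);  /* copy without trashing L1 */
            else
                memmove(dst, src, n);
        }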

    If the data was not in L1 cache, only its absence would need to be
    determined by the DMA engine. A snoop filter, tag-inclusive L2/L3
    probing, or similar mechanism could avoid L1 accesses. Even if the
    source or destination for a memory copy was in L1, only one L1
    access per cache line might be needed.

    I also wonder if the cache fill and/or spill mechanism might be
    decoupled from the load/store such that if the cache had enough
    banks/subarrays some loads and stores might be done in parallel
    with a cache fill or spill/external-read-without-eviction. Tag
    checking would limit the utility of such, though tags might also
    be banked or access flexibly scheduled (at the cost of choosing a
    victim early for fills). Of course, if the cache has such
    available bandwidth, why not make it available to the core as well
    even if it was rarely useful? (Perhaps higher register bandwidth
    might be more difficult than higher cache bandwidth for banking-
    friendly patterns?)

    Deciding when to bypass cache seems difficult (for both software
    developers and hardware). Overwriting cache lines within the same
    memory copy is obviously silly. Filling a cache with a memory copy
    is also suboptimal, but L1 hardware copy-on-write would probably
    be too complicated even with page aligned copies. A copy from
    cacheable memory to uncacheable memory (I/O) might be a strong
    hint that the source should not be installed into L1 or L2 cache,
    but I would guess that not installing the source would often be
    the right choice.

    I could also imagine a programmer wanting to use memory copy as a
    prefetch *directive* for a large chunk of memory (by having source
    and destination be the same). This idiom would be easy to detect
    (from and to base registers being the same), but may be too niche
    to be worth detecting (for most implementations).
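
    The idiom itself is trivial to write down (sketch below); note that
    a C compiler is entitled to elide a self-copy, so a real directive
    would need an intrinsic or inline assembly rather than plain
    memmove:

        #include <string.h>

        /* Hypothetical prefetch directive: a move whose source and
           destination bases are equal.  Hardware that detects the
           matching base registers could treat it as a block prefetch. */
        void prefetch_range(void *p, size_t n)
        {
            memmove(p, p, n);   /* values unchanged, range touched */
        }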

    (My 66000 might use an idiom with a prefetch instruction preceding
    a memory move to indicate the cache level of the destination but
    that only manages [some of] the difficulty of the hardware
    choice.)

    For memset, compression is also an obvious possibility. A memset
    might not write any cache lines but rather cache the address range
    and the set value and perform hardware copy on access into cache
    lines.
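
    A sketch of the bookkeeping such a lazy memset implies: remember
    only the range and fill value, and materialize a line when it is
    actually accessed. Purely illustrative; the partial head/tail lines,
    which would need read-for-ownership, are ignored here:

        #include <stdint.h>
        #include <string.h>

        enum { LINE = 64 };

        /* The cached memset: an address range plus its fill byte. */
        struct pending_set {
            uint64_t base, len;
            uint8_t  value;
            int      valid;
        };

        /* On a demand access, synthesize the line instead of reading
           memory (hardware copy on access). */
        int fill_on_access(const struct pending_set *ps,
                           uint64_t line_addr, uint8_t line[LINE])
        {
            if (!ps->valid || line_addr < ps->base ||
                line_addr + LINE > ps->base + ps->len)
                return 0;               /* not fully covered */
            memset(line, ps->value, LINE);
            return 1;
        }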
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul A. Clayton@paaronclayton@gmail.com to comp.arch on Sat Apr 20 20:02:07 2024
    From Newsgroup: comp.arch

    On 4/11/24 10:30 AM, Scott Lurndal wrote:
    [snip]
    On 4/9/24 8:28 PM, MitchAlsup1 wrote:
    [snip]
    MMs and MSs that do not cross page boundaries are ATOMIC. The entire
    system sees only the before or only the after state and nothing in
    between.

    One might wonder how that atomicity is guaranteed in an
    SMP processor...

    While Mitch Alsup's response ("The entire chunk of data traverses
    the interconnect as a single transaction." — I am not certain how
    that would work given reading up to a page and writing up to a
    page) provides one mechanism and probably the best one,
    theoretically the *data* does not need to be moved atomically but
    only the "ownership" (the source does not have to be owned in the
    traditional sense but needs to be marked as readable by the copier).
    This is somewhat similar to My 66000's Exotic Synchronization
    Mechanism in that once all the addresses involved are known (the
    two ranges for the memory copy), NAKs can be used to answer remote
    requests for "owned" cache lines while the copy is made.

    Only the visibility needs to be atomic.

    Memory set provides optimization opportunities in that the source
    is small. In theory, the set value could be sent to L3 with the
    destination range and all monitoring could be done at L3 and
    requested cache line sent immediately from L3 (hardware copy on
    access) — the first and last part of the range might be partial
    cache lines requiring read-for-ownership.

    For cache line aligned copies, a cache which used indirection
    between tags and data might not even copy the data but only the
    tag-related metadata. Some forms of cache compression might allow
    partial cache lines to be cached such that even unaligned copies
    might partially share data by having one tag indicate lossy
    compression with an indication of where the stored data is not
    valid, but that seems too funky to be practical.
    --- Synchronet 3.20a-Linux NewsLink 1.114