There has been discussion here about the benefits of reducing the number
of op codes. One reason not mentioned before is if you have fixed
length instructions, you may want to leave as many codes as possible available for future use. Of course, if you are doing a 16-bit
instruction design, where instruction bits are especially tight, you may save enough op-codes to save a bit, perhaps allowing a larger register specifier field, or to allow more instructions in the smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory. I worked through this idea
using the My 66000 as an example “substrate” for two reasons. First, it has several features that are “friendly” to the idea. Second, I know Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea. It is
certainly not fully worked out. I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag. If set, the bit indicates that the corresponding register contains a floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load single floating and load double floating, which work the same as the other 32-
and 64-bit loads, but in addition to loading the value, set the tag bit
for the destination register. Non-floating-point loads would clear the
tag bit. As I show below, I don’t think you need any special "store
tag" instructions.
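As a concrete model of the proposal so far, the tag bookkeeping might look like this in C (all names here are hypothetical; this is only a sketch of the described semantics, not anything defined by My 66000):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical machine state: 32 registers plus one tag bit each.
   All 32 tag bits fit in a single 32-bit word, as proposed. */
typedef struct {
    uint64_t r[32];
    uint32_t fp_tag;   /* bit i set => r[i] holds a floating-point value */
} State;

/* An ordinary 64-bit integer load clears the destination's tag bit. */
static void load_int64(State *s, int rd, uint64_t v) {
    s->r[rd] = v;
    s->fp_tag &= ~(1u << rd);
}

/* The proposed "load double floating" loads the same 64 bits
   but additionally sets the destination's tag bit. */
static void load_f64(State *s, int rd, double v) {
    memcpy(&s->r[rd], &v, sizeof v);
    s->fp_tag |= 1u << rd;
}
```

A store only writes the 64-bit pattern back to memory, which is consistent with the claim that no separate "store tag" instruction is needed.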
When executing arithmetic instructions, if the tag bits of both sources
of an instruction are the same, do the appropriate operation (floating
or integer), and set the tag bit of the result register appropriately.
If the tag bits of the two sources are different, I see several possibilities.
1. Generate an exception.
2. Use the sense of source 1 for the arithmetic operation, but
perform the appropriate conversion on the second operand first,
potentially saving an instruction.
3. Always do the operation in floating point and convert the integer operand prior to the operation. (Or, if you prefer, change floating
point to integer in the above description.)
4. Same as 2 or 3 above, but don’t do the conversions.
I suspect this is the least useful choice. I am not sure which is the
best option.
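To make the options concrete, here is a sketch of option 2 (follow the sense of source 1, converting the other operand first); `Reg` and the helper names are invented for illustration:

```c
#include <stdint.h>
#include <string.h>

/* One 64-bit register plus its tag bit (1 = floating point). */
typedef struct { uint64_t bits; int fp; } Reg;

static double as_f64(Reg r)       { double d; memcpy(&d, &r.bits, 8); return d; }
static Reg    from_f64(double d)  { Reg r = {0, 1}; memcpy(&r.bits, &d, 8); return r; }
static Reg    from_i64(int64_t v) { Reg r = {(uint64_t)v, 0}; return r; }

/* A single shared ADD opcode, resolved by the tag bits.  Mixed tags
   follow source 1 (option 2): source 2 is converted first. */
static Reg add(Reg a, Reg b) {
    if (a.fp != b.fp)
        b = a.fp ? from_f64((double)(int64_t)b.bits)   /* int -> FP */
                 : from_i64((int64_t)as_f64(b));       /* FP -> int */
    return a.fp ? from_f64(as_f64(a) + as_f64(b))
                : from_i64((int64_t)a.bits + (int64_t)b.bits);
}
```

Option 3 would simply force the FP path whenever either tag is set.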
Given that, use the same op code for the floating-point and fixed
versions of the same operations. So we can save eight op codes, the
four arithmetic operations, max, min, abs and compare. So far, a net savings of six opcodes.
But we can go further. There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions. And there are some operations that probably only make sense for non-FP operands, e.g. POP,
FF1, probably shifts. Given the tag bit, these could share the same op-code. There may be several more of these.
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with separate compilations? The called function probably doesn’t know the
tag value for callee saved registers. Fortunately, the My 66000 architecture comes to the rescue here. You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are saving or restoring in the same data structure it uses for the registers (yes, it adds 32 bits to that structure – minimal cost). The same mechanism works for interrupts that take control away from a running process.
I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some other instructions to do this, without requiring another op-code. For
example, Oring a register with itself could be used to set the tag bit
and Oring a register with zero could clear it. These should be pretty
rare.
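That repurposing could be expressed as a decode-time rule, roughly as below (the encoding details are entirely hypothetical):

```c
#include <stdint.h>

/* Hypothetical decode rule for the repurposed OR forms:
   OR rd,rs1,rs2 with rd == rs1 == rs2  => set rd's tag bit
   OR rd,rs,#0   with rd == rs          => clear rd's tag bit
   Neither form changes any register value, so no new opcode is needed. */
static void decode_or(uint32_t *fp_tag, int rd, int rs1, int rs2,
                      int src2_is_imm_zero) {
    if (!src2_is_imm_zero && rd == rs1 && rs1 == rs2)
        *fp_tag |= 1u << rd;              /* OR r,r,r : set tag  */
    else if (src2_is_imm_zero && rd == rs1)
        *fp_tag &= ~(1u << rd);           /* OR r,r,#0: clear tag */
    /* otherwise decode as an ordinary OR */
}
```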
That is as far as I got. I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad. Is it
worth it? To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading their respective tag bits before knowing which FU to use? If it causes an
extra cycle per instruction, then it is almost certainly not worth it. IANAHG, so I don’t know. But even if it doesn’t cost any performance, I think the overall gains are pretty small, and probably not worth it
unless the op-code space is really tight (which, for My 66000 it isn’t).
Anyway, it has been fun thinking about this, so I hope you don’t mind
the, probably too long, post.
Any comments are welcome.
On 4/3/2024 11:43 AM, Stephen Fuld wrote:
FWIW:
This doesn't seem too far off from what would be involved with dynamic typing at the ISA level, but with many of the same sorts of drawbacks...
Say, for example, top 2 bits of a register:
00: Object Reference
Next 2 bits:
00: Pointer (with type-tag)
01: ?
1z: Bounded Array
01: Fixnum (route to ALU)
10: Flonum (route to FPU)
11: Other types
00: Smaller value types
Say: int/uint, short/ushort, ...
...
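One possible reading of that layout as C helpers; the field positions and names are my assumptions from the sketch above, not a defined format:

```c
#include <stdint.h>

/* Top two bits select the major class; the next two a subclass. */
typedef enum { OBJREF = 0, FIXNUM = 1, FLONUM = 2, OTHER = 3 } Major;

static Major    major_tag(uint64_t r) { return (Major)(r >> 62); }
static unsigned sub_tag(uint64_t r)   { return (unsigned)((r >> 60) & 3); }

/* Routing decision as described: fixnum -> ALU, flonum -> FPU. */
static const char *route(uint64_t r) {
    switch (major_tag(r)) {
    case FIXNUM: return "ALU";
    case FLONUM: return "FPU";
    default:     return "other";
    }
}
```

The obvious cost is that the payload shrinks to 60-62 bits, which is exactly the 'long long' range problem raised further down.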
One issue:
Decoding based on register tags would mean needing to know the register
tag bits at the same time the instruction is being decoded. In this
case, one is likely to need two clock-cycles to fully decode the opcode.
ID1: Unpack instruction to figure out register fields, etc.
ID2: Fetch registers, specialize variable instructions based on tag bits.
For timing though, one ideally doesn't want to do anything with the
register values until the EX stages (since ID2 might already be tied up
with the comparably expensive register-forwarding logic), but asking for
3 cycles for decode is a bit much.
Otherwise, if one does not know which FU should handle the operation
until EX1, this has its own issues.
Or, possibly, the FUs decide
whether to accept the operation:
ALU: Accepts operation if both are fixnum, FPU if both are Flonum.
But, a proper dynamic language allows mixing fixnum and flonum with the result being implicitly converted to flonum, but from the FPU's POV,
this would effectively require two chained FADD operations (one for the Fixnum to Flonum conversion, one for the FADD itself).
Many other cases could get hairy, but to have any real benefit, the CPU would need to be able to deal with them. In cases where the compiler
deals with everything, the type-tags become mostly moot (or potentially detrimental).
But, then, there is another issue:
C code expects C type semantics to be respected, say:
Signed int overflow wraps at 32 bits (sign extending);
Unsigned int overflow wraps at 32 bits (zero extending);
Variables may not hold values out-of-range for that type;
The 'long long' and 'unsigned long long' types are exactly 64-bit;
...
...
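For illustration, here is what those first two rules cost on a machine with 64-bit registers (helper names are mine); each 32-bit operation needs an extra truncate-and-extend step that a 64-bit fixnum add would not perform by itself:

```c
#include <stdint.h>

/* "int overflow wraps at 32 bits (sign extending)" */
static int64_t add_int32(int64_t a, int64_t b) {
    uint32_t w = (uint32_t)((uint64_t)a + (uint64_t)b);  /* wrap at 32 bits */
    return (int64_t)(int32_t)w;                          /* sign-extend */
}

/* "unsigned int overflow wraps at 32 bits (zero extending)" */
static int64_t add_uint32(int64_t a, int64_t b) {
    uint32_t w = (uint32_t)((uint64_t)a + (uint64_t)b);
    return (int64_t)w;                                   /* zero-extend */
}
```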
If one has tagged 64-bit registers, then fixnum might not hold the
entire range of 'long long'. If one has 66 or 68 bit registers, then
memory storage is a problem.
If one has untagged registers for cases where they are needed, one has
not saved any encoding space.
BGB-Alt wrote:
FWIW:
This doesn't seem too far off from what would be involved with dynamic
typing at the ISA level, but with many of same sorts of drawbacks...
Say, for example, top 2 bits of a register:
00: Object Reference
Next 2 bits:
00: Pointer (with type-tag)
01: ?
1z: Bounded Array
01: Fixnum (route to ALU)
10: Flonum (route to FPU)
11: Other types
00: Smaller value types
Say: int/uint, short/ushort, ...
...
So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
This solves nobody's problems; not even LISP.
One issue:
Decoding based on register tags would mean needing to know the
register tag bits at the same time the instruction is being decoded.
In this case, one is likely to need two clock-cycles to fully decode
the opcode.
Not good. But what if you don't know the tag until the register is
delivered from a latent FU, do you stall DECODE, or do you launch and
make the instruction
queue element have to deal with all outcomes.
ID1: Unpack instruction to figure out register fields, etc.
ID2: Fetch registers, specialize variable instructions based on tag bits.
For timing though, one ideally doesn't want to do anything with the
register values until the EX stages (since ID2 might already be tied
up with the comparably expensive register-forwarding logic), but
asking for 3 cycles for decode is a bit much.
Otherwise, if one does not know which FU should handle the operation
until EX1, this has its own issues.
Real-friggen-ely
Or, possible, the FU's decide
whether to accept the operation:
ALU: Accepts operation if both are fixnum, FPU if both are Flonum.
What if IMUL is performed in FMAC, IDIV in FDIV, ...? Int<->FP routing is
based on calculation capability. (Even the CDC 6600 performed int × in the FP
× unit -- not in Thornton's book, but via a conversation with a 6600 logic designer at Asilomar some time ago. All they had to do to get FP × to perform int × was disable 1 gate.)
But, a proper dynamic language allows mixing fixnum and flonum with
the result being implicitly converted to flonum, but from the FPU's
POV, this would effectively require two chained FADD operations (one
for the Fixnum to Flonum conversion, one for the FADD itself).
That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
a string to an integer and the string would be converted to int before.....
Many other cases could get hairy, but to have any real benefit, the
CPU would need to be able to deal with them. In cases where the
compiler deals with everything, the type-tags become mostly moot (or
potentially detrimental).
You are arguing that the added complexity would somehow pay for itself.
I can't see it paying for itself.
But, then, there is another issue:
maybe
C code expects C type semantics to be respected, say:
Signed int overflow wraps at 32 bits (sign extending);
Unsigned int overflow wraps at 32 bits (zero extending);
maybe
Variables may not hold values out-of-range for that type;
LLVM does this, GCC does not.
The 'long long' and 'unsigned long long' types are exactly 64-bit;
At least 64-bit, not exactly.
...
...
If one has tagged 64-bit registers, then fixnum might not hold the
entire range of 'long long'. If one has 66 or 68 bit registers, then
memory storage is a problem.
Ya think ?
If one has untagged registers for cases where they are needed, one has
not saved any encoding space.
I give up--not worth trying to teach a cosmologist why the color of the lipstick going on the pig is not the problem.....
On Thu, 4 Apr 2024 10:32:48 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
We do NOT make any attempt
Terje
Does the present tense mean that you are still involved in the Mill project?
Michael S wrote:
On Thu, 4 Apr 2024 10:32:48 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
We do NOT make any attempt
Terje
Does the present tense mean that you are still involved in the Mill
project?
I am much less active than I used to be, but I still get the weekly
conf call invites and respond to any interesting subject on our
mailing list.
So, yes, I do consider myself to still be involved.
Terje
MitchAlsup1 wrote:
BGB-Alt wrote:
On 4/3/2024 11:43 AM, Stephen Fuld wrote:
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.
If set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load
single floating and load double floating, which work the same as the
other 32- and 64-bit loads, but in addition to loading the value,
set the tag bit for the destination register. Non-floating-point
loads would clear the tag bit. As I show below, I don’t think you need any special "store tag" instructions.
What do you do when you want a FP bit pattern interpreted as an
integer, or vice versa.
This is why, if you want to copy Mill, you have to do it properly:
Mill does NOT care about the type of data loaded into a particular belt slot, only the size and if it is a scalar or a vector filling up the
full belt slot. In either case you will also have marker bits for
special types like None and NaR.
So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
last being the same as the scalar anyway).
Only load ops and explicit widening/narrowing ops sets the size tag
bits, from that point any op where it makes sense will do the right
thing for either a scalar or a short vector, so you can add 16+16 8-bit
vars with the same ADD encoding as you would use for a single 64-bit ADD.
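A rough model of that size-tag-driven behavior, shrunk to a single 64-bit word for illustration (a real belt slot is wider; this only sketches how one ADD can serve every lane width):

```c
#include <stdint.h>

/* Lane-wise add where the lane width comes from the operand's size
   tag rather than from the opcode.  lane_bits must divide 64. */
static uint64_t add_lanes(uint64_t a, uint64_t b, int lane_bits) {
    uint64_t mask = (lane_bits == 64) ? ~0ull : ((1ull << lane_bits) - 1);
    uint64_t r = 0;
    for (int sh = 0; sh < 64; sh += lane_bits) {
        uint64_t sum = ((a >> sh) + (b >> sh)) & mask;  /* wraps per lane */
        r |= sum << sh;
    }
    return r;
}
```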
We do NOT make any attempt to interpret the actual bit patterns stored within each belt slot; that is up to the instructions. This means that
there is no difference between loading a float or an int32_t, and it also
means that it is perfectly legal (and supported) to use bit operations
on a FP variable. This can be very useful, not just to fake exact
arithmetic by splitting a double into two 26-bit mantissa parts.
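The splitting alluded to is presumably the classic Veltkamp split; assuming IEEE-754 binary64 and round-to-nearest, it is just three operations:

```c
/* Veltkamp splitting: break a double (53-bit significand) into hi + lo,
   each fitting in at most 26 significand bits, so that products of the
   parts are exact.  The magic constant is 2^27 + 1. */
static void split(double x, double *hi, double *lo) {
    double c = 134217729.0 * x;   /* (2^27 + 1) * x */
    *hi = c - (c - x);
    *lo = x - *hi;
}
```

hi + lo reproduces x exactly, which is what makes Dekker-style exact multiplication work.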
Terje
On 4/4/2024 3:32 AM, Terje Mathisen wrote:
MitchAlsup1 wrote:
As I can note, in my actual ISA, any type-tagging in the registers was explicit and opt-in, generally managed by the compiler/runtime/etc; in
this case, the ISA merely providing facilities to assist with this.
The main exception would likely have been the possible "Bounds Check Enforce" mode, which would still need a bit of work to implement, and is
not likely to be terribly useful.
The most complicated and expensive part
is that it will require implicit register and memory tagging (to flag capabilities). Though, a cheaper option is simply to not enable it, in
which case things behave as before, with the new functionality essentially being a NOP. Much of the work still needed on this would be getting the 128-bit ABI working, and adding some new tweaks to the ABI
to play well with the capability addressing (effectively it requires
partly reworking how global variables are accessed).
The type-tagging scheme used in my case is very similar to that used in
my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
off of an early version of the BGBScript VM, and effectively using a lax hybrid typesystem masquerading as C). Though, it has long since moved to
a more proper C style typesystem, with dynamic types more as an optional extension.
BGB-Alt wrote:
On 4/4/2024 3:32 AM, Terje Mathisen wrote:
MitchAlsup1 wrote:
As I can note, in my actual ISA, any type-tagging in the registers was
explicit and opt-in, generally managed by the compiler/runtime/etc; in
this case, the ISA merely providing facilities to assist with this.
The main exception would likely have been the possible "Bounds Check
Enforce" mode, which would still need a bit of work to implement, and
is not likely to be terribly useful.
A while back (and maybe in the future) My 66000 had what I called the Foreign Access Mode. When the HoB of the pointer was set, the first
entry in the translation table was a 4-doubleword structure: a Root
pointer, the Lowest addressable Byte, the Highest addressable Byte,
and a DW of access rights, permissions,... While sort-of like a capability
I don't think it was close enough to actually be a capability or used as
one.
So, it fell out of favor, and it was not clear how it fit into the HyperVisor/SuperVisor model, either.
Most complicated and expensive parts
are that it will require implicit register and memory tagging (to flag
capabilities). Though, cheaper option is simply to not enable it, in
which case things either behave as before, with the new functionality
essentially being NOP. Much of the work still needed on this would be
getting the 128-bit ABI working, and adding some new tweaks to the ABI
to play well with the capability addressing (effectively it requires
partly reworking how global variables are accessed).
The type-tagging scheme used in my case is very similar to that used
in my previous BGBScript VMs (where, as I can note, BGBCC was itself a
fork off of an early version of the BGBScript VM, and effectively
using a lax hybrid typesystem masquerading as C). Though, it has long
since moved to a more proper C style typesystem, with dynamic types
more as an optional extension.
In general, any time one needs to change the type you waste an instruction compared to typeless registers.
On some older CPUs, there might be one set of integer opcodes and one
set of floating-point opcodes, with a status register containing the
integer precision, and the floating-point precision, currently in use.
The idea was that this would be efficient because most programs only
use one size of each type of number, so the number of opcodes would be
the most appropriate, and that status register wouldn't need to be
reloaded too often.
It's considered dangerous, though, to have a mechanism for changing
what instructions mean, since this could let malware alter what
programs do in a useful and sneaky fashion. Memory bandwidth is no
longer a crippling constraint the way it was back in the days of core
memory and discrete transistors - at least not for program code, even
if memory bandwidth for _data_ often limits the processing speed of computers.
This is basically because any program that does any real work, taking
any real length of time to do its job, is going to mostly consist of
loops that fit in cache. So letting program code be verbose if there
are other benefits obtained thereby is the current conventional
wisdom.
John Savard
Early in My 66000 LLVM development Brian looked at the cost of having
only 1 FP OpCode set--and it did not look good--so we went back to the
standard way of an OpCode for each FP size calculation.
On Fri, 5 Apr 2024 21:34:16 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
Early in My 66000 LLVM development Brian looked at the cost of having
only 1 FP OpCode set--and it did not look good--so we went back to the
standard way of an OpCode for each FP size × calculation.
I do tend to agree.
However, a silly idea has now occurred to me.
256 bits can contain eight instructions that are 32 bits long.
Or they can also contain seven instructions that are 36 bits long,
with four bits left over.
So they could contain *nine* instructions that are 28 bits long, also
with four bits left over.
Thus, instead of having mode bits, one _could_ do the following:
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
John Savard
John Savard <quadibloc@servername.invalid> schrieb:
Thus, instead of having mode bits, one _could_ do the following:
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Did that look promising? Not really; the 21 bits offered a lot
of useful opcode space for two-register operations and even for
a few of the often-used three-register, but 42 bits was really
a bit too long, so the advantage wasn't great. And embedding
32-bit or 64-bit instructions in the code stream does not really
fit the 21-bit raster well, so compared to an ISA which can do so
(like My 66000) it came out at a disadvantage. Might be possible
to beat RISC-V, though.
Thomas Koenig wrote:
John Savard <quadibloc@servername.invalid> schrieb:
[...]
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1 destructive operand model for the 21-bit encodings. Yes :: no ??
Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
for 64 OpCodes.
Now if you have floats and doubles and signed and
unsigned, you get 16 of each and we have not looked at memory
references or branching.
Did that look promising? Not really; the 21 bits offered a lot
of useful opcode space for two-register operations and even for
a few of the often-used three-register, but 42 bits was really
a bit too long, so the advantage wasn't great. And embedding
32-bit or 64-bit instructions in the code stream does not really
fit the 21-bit raster well, so compared to an ISA which can do so
(like My 66000) it came out at a disadvantage. Might be possible
to beat RISC-V, though.
But beating RISC-V is easy; try getting your instruction count down
to VAX counts without losing the ability to pipeline and parallelize
instruction execution.
At handwaving accuracy::
VAX has 1.0 instructions
My 66000 has 1.1 instructions
RISC-V has 1.5 instructions
To reach VAX instruction density::
How do you attach 32-bit or 64-bit constants to 28-bit instructions ??
How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
set of 256-bit instruction decodes ??
In complicated if-then-else codes (and switches) I often see one instruction followed by a branch to a common point. Does your encoding deal
with these efficiently ?? That is:: what happens when you jump to the
middle of a block of 36-bit instructions ??
On Sun, 7 Apr 2024 20:41:45 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
In complicated if-then-else codes (and switches) I often see one instruction followed by a branch to a common point. Does your encoding deal with these efficiently ?? That is:: what happens when you jump to the middle of a block of 36-bit instructions ??
Well, when the computer fetches a 256-bit block of code, the first
four bits indicates whether it is composed of 36-bit instructions or
28-bit instructions. So the computer knows where the instructions are;
and thus a convention can be applied, such as addressing each 36-bit instruction by the addresses of the first seven 32-bit positions in
the block.
In the case of 28-bit instructions, the first eight correspond to the
32-bit positions, the ninth corresponds to the last 16 bits of the
block.
John Savard
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
[...]
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
destructive operand model for the 21-bit encodings. Yes :: no ??
It was not very well developed, I gave it up when I saw there wasn't
much to gain.
I wrote:
[...]
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
destructive operand model for the 21-bit encodings. Yes :: no ??
It was not very well developed, I gave it up when I saw there wasn't
much to gain.
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
On 4/9/2024 1:24 PM, Thomas Koenig wrote:
I wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
[...]
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).
Having learned about M-Core in the meantime, a pure 32-register,
21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot of
CPU time in functions that have large numbers of local variables all
being used at the same time.
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for performance.
Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with 250+ local variables to make effective use of this, *, which probably isn't
going to happen).
*: Where, it appears it is most efficient (for non-leaf functions) if
the number of local variables is roughly twice that of the number of CPU registers. If more local variables than this, then spill/fill rate goes
up significantly, and if less, then the registers aren't utilized as effectively.
Well, except in "tiny leaf" functions, where the criteria is instead
that the number of local variables be less than the number of scratch registers. However, for many/most small leaf functions, the total number
of variables isn't all that large either.
Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
There is a "static assign everything" case in my case, where all of the variables are statically assigned to registers (for the scope of the function). This case typically requires that everything fit into callee-save registers, so (like the "tiny leaf" category) it requires that the
number of local variables is less than the available registers.
On a 32 register machine, if there are 14 available callee-save
registers, the limit is 14 variables. On a 64 register machine, this
limit might be 30 instead. This seems to have good coverage.
BGB wrote:
On 4/9/2024 1:24 PM, Thomas Koenig wrote:
[...]
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot of
CPU time in functions that have large numbers of local variables all
being used at the same time.
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with
250+ local variables to make effective use of this, *, which probably
isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
of GPRs AND you have good access to constants.
*: Where, it appears it is most efficient (for non-leaf functions) if
the number of local variables is roughly twice that of the number of
CPU registers. If more local variables than this, then spill/fill rate
goes up significantly, and if less, then the registers aren't utilized
as effectively.
Well, except in "tiny leaf" functions, where the criteria is instead
that the number of local variables be less than the number of scratch
registers. However, for many/most small leaf functions, the total
number of variables isn't all that large either.
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once one
starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
in the ISA, it goes up even more.
Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.
There is a "static assign everything" case in my case, where all of
the variables are statically assigned to registers (for the scope of
the function). This case typically requires that everything fit into
callee save registers, so (like the "tiny leaf" category, requires
that the number of local variables is less than the available registers).
On a 32 register machine, if there are 14 available callee-save
registers, the limit is 14 variables. On a 64 register machine, this
limit might be 30 instead. This seems to have good coverage.
The apparent number of registers goes up when one does not waste a register to hold a use-once constant.
On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
BGB wrote:
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with
250+ local variables to make effective use of this, *, which probably
isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
of GPRs AND you have good access to constants.
On the main ISA's I had tried to generate code for, 16 GPRs was kind of
a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself (such
as dealing with register allocation involving scratch registers while
also not conflicting with the use of function arguments, ...).
My code generators had typically only used callee-save registers for variables in basic blocks which ended in a function call (in my compiler design, both function calls and branches terminate the current basic block).
On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.
This is slightly less bad on x86-64, since one can use memory operands
with most instructions, and the CPU tends to deal fairly well with code
that has lots of spill-and-fill. This along with instructions having
access to 32-bit immediate values.
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once one
starts placing things like memove(), memset(), sin(), cos(), exp(), log()
in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases when
not directly transformed into register load/store sequences.
Did end up with an intermediate "memcpy slide", which can handle medium
size memcpy and memset style operations by branching into a slide.
As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers.
On a 64 GPR machine, this percentage is slightly
higher (but, not significantly, since there are few leaf functions
remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large
performance advantage:
On programs like Doom, the difference is small at best.
It mostly effects things like GLQuake in my case, mostly because TKRA-GL
has a lot of functions with a large numbers of local variables (some exceeding 100 local variables).
Partly though this is due to code that is highly inlined and unrolled
and uses lots of variables tending to perform better in my case (and
tightly looping code, with lots of small functions, not so much...).
Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
In my case:
There is no frame pointer, as BGBCC doesn't use one;
All stack-frames are fixed size, VLA's and alloca use the heap;
GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
TLS, accessed via TBR.
Try/throw/catch:
Mostly N/A for leaf functions.
Any function that can "throw" is in effect no longer a leaf function.
Implicitly, any function which uses "variant" or similar is also no
longer a leaf function.
Need for GBR save/restore effectively excludes a function from being tiny-leaf. This may happen, say, if a function accesses global variables
and may be called as a function pointer.
One "TODO" here would be to merge constants with the same "actual" value into the same register. At present, they will be duplicated if the types
are sufficiently different (such as integer 0 vs NULL).
For functions with dynamic assignment, immediate values are more likely
to be used. If the code-generator were clever, potentially it could
exclude assigning registers to constants which are only used by
instructions which can encode them directly as an immediate. Currently, BGBCC is not that clever.
Or, say:
y=x+31; //31 only being used here, and fits easily in an Imm9.
Ideally, compiler could realize 31 does not need a register here.
Well, and another weakness is with temporaries that exist as function arguments:
If statically assigned, the "target variable directly to argument register" optimization can't be used (the value ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the compiler breaks...).
Though, I guess possible could be that the compiler could try to
partition temporaries that are used exclusively as function arguments
into a different category from "normal" temporaries (or those whose
values may cross a basic-block boundary), and then avoid statically-assigning them (and somehow not cause this to effectively
break the full-static-assignment scheme in the process).
Though, IIRC, I had also considered the possibility of a temporary
"virtual assignment", allowing the argument value to be temporarily
assigned to a function argument register, then going "poof" and
disappearing when the function is called. Hadn't yet thought of a good
way to add this logic to the register allocator though.
But, yeah, compiler stuff is really fiddly...
BGB-Alt wrote:
On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
BGB wrote:
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and
128 GPRs is wasteful (would likely need lots of monster functions
with 250+ local variables to make effective use of this, *, which
probably isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
part of GPRs AND you have good access to constants.
On the main ISA's I had tried to generate code for, 16 GPRs was kind
of a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself (such
as dealing with register allocation involving scratch registers while
also not conflicting with the use of function arguments, ...).
My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my
compiler design, both function calls and branches terminating the
current basic-block).
On SH, the main way of getting constants (larger than 8 bits) was via
PC-relative memory loads, which kinda sucked.
This is slightly less bad on x86-64, since one can use memory operands
with most instructions, and the CPU tends to deal fairly well with
code that has lots of spill-and-fill. This along with instructions
having access to 32-bit immediate values.
Yes, x86 and any architecture (IBM 360, S.E.L., Interdata, ...) that has
LD-Ops acts as if it has 4-6 more registers than it really has. x86
with 16 GPRs acts like a RISC with 20-24 GPRs, as does 360. It does not really
take the place of universal constants, but it goes a long way.
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once
one starts placing things like memmove(), memset(), sin(), cos(),
exp(), log()
in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences; MM is a single instruction.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system sees only the before or only the after state and nothing in between. This means one can start (queue up) a SATA disk access without obtaining a lock
to the device--simply because one can fill in all the data of a command in
a single instruction which smells ATOMIC to all interested 3rd parties.
As noted, on a 32 GPR machine, most leaf functions can fit entirely in
scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting totally screwed.
On a 64 GPR machine, this percentage is slightly
higher (but, not significantly, since there are few leaf functions
remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.
The data back in the R2000-3000 days indicated that 32 GPRs has a 15%+
advantage over 16 GPRs, while 64 had only a 3% advantage.
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large
performance advantage:
On programs like Doom, the difference is small at best.
It mostly affects things like GLQuake in my case, mostly because
TKRA-GL has a lot of functions with a large number of local variables
(some exceeding 100 local variables).
Partly though this is due to code that is highly inlined and unrolled
and uses lots of variables tending to perform better in my case (and
tightly looping code, with lots of small functions, not so much...).
Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are
required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
In my case:
There is no frame pointer, as BGBCC doesn't use one;
Can't do PASCAL and other ALGOL-derived languages with block structure.
All stack-frames are fixed size, VLA's and alloca use the heap;
longjmp() is at a serious disadvantage here. Destructors are sometimes hard to position on the stack.
GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
TLS, accessed via TBR.
Try/throw/catch:
Mostly N/A for leaf functions.
Any function that can "throw", is in effect no longer a leaf function.
Implicitly, any function which uses "variant" or similar is also no
longer a leaf function.
You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?
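A minimal sketch of the kind of #define set meant here, assuming the classic setjmp/longjmp approach (Mitch's actual macros may well differ; TRY/CATCH/THROW and the single handler slot are illustrative, and real code would keep a stack of handlers):

```c
#include <setjmp.h>

/* One handler slot; a real macro set would chain these per nesting level. */
static jmp_buf *cur_handler;

#define TRY        { jmp_buf env, *prev = cur_handler; \
                     cur_handler = &env; \
                     int exc = setjmp(env); \
                     if (exc == 0) {
#define CATCH(e)   } cur_handler = prev; if (exc != 0) { int e = exc;
#define END_TRY    } }
#define THROW(e)   longjmp(*cur_handler, (e))

static int checked_div(int a, int b) {
    if (b == 0)
        THROW(1);              /* "throw" error code 1 on divide by zero */
    return a / b;
}

static int try_div(int a, int b) {
    volatile int r = -1;       /* volatile: survives the longjmp */
    TRY
        r = checked_div(a, b);
    CATCH(err)
        r = -err;              /* map the exception code to a negative result */
    END_TRY
    return r;
}
```

No subroutine is ever called for the control transfer itself; the "catch" is just the second return from setjmp.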
Need for GBR save/restore effectively excludes a function from being
tiny-leaf. This may happen, say, if a function accesses global
variables and may be called as a function pointer.
------------------------------------------------------
One "TODO" here would be to merge constants with the same "actual"
value into the same register. At present, they will be duplicated if
the types are sufficiently different (such as integer 0 vs NULL).
In practice, the upper 48 bits of an extern variable's address are completely shared, whereas the lower 16 bits are unique.
For functions with dynamic assignment, immediate values are more
likely to be used. If the code-generator were clever, potentially it
could exclude assigning registers to constants which are only used by
instructions which can encode them directly as an immediate.
Currently, BGBCC is not that clever.
And then there are languages like PL/1 and FORTRAN where the compiler
has to figure out how big an intermediate array is, allocate it, perform
the math, and then deallocate it.
Or, say:
y=x+31; //31 only being used here, and fits easily in an Imm9.
Ideally, compiler could realize 31 does not need a register here.
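The "don't burn a register on 31" idea can be sketched as a small predicate, assuming an Imm9 field that sign-extends to -256..255 (the exact range, and the names `fits_imm9` / `const_needs_register`, are illustrative assumptions, not BGBCC internals):

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed: a 9-bit sign-extended immediate field. */
static bool fits_imm9(int64_t v) {
    return v >= -256 && v <= 255;
}

/* A constant only needs a register if some use cannot encode it directly:
   either the value is out of immediate range, or a consuming instruction
   has no immediate form. */
static bool const_needs_register(int64_t v, int n_uses,
                                 const bool use_allows_imm[]) {
    if (!fits_imm9(v))
        return true;               /* too big for the Imm9 field */
    for (int i = 0; i < n_uses; i++)
        if (!use_allows_imm[i])
            return true;           /* some consumer lacks an Imm9 form */
    return false;                  /* every use can encode it inline */
}
```

For `y = x + 31` with ADD having an Imm9 form, this returns false and the 31 never touches the register allocator.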
Well, and another weakness is with temporaries that exist as function
arguments:
If static assigned, the "target variable directly to argument
register" optimization can't be used (it ends up needing to go into a
callee-save register and then be MOV'ed into the argument register;
otherwise the compiler breaks...).
Though, I guess possible could be that the compiler could try to
partition temporaries that are used exclusively as function arguments
into a different category from "normal" temporaries (or those whose
values may cross a basic-block boundary), and then avoid
statically-assigning them (and somehow not cause this to effectively
break the full-static-assignment scheme in the process).
Brian's compiler finds the largest argument list and the largest return
value list and merges them into a single area on the stack used only
for passing arguments and results across the call interface. And the
<static> SP points at this area.
Though, IIRC, I had also considered the possibility of a temporary
"virtual assignment", allowing the argument value to be temporarily
assigned to a function argument register, then going "poof" and
disappearing when the function is called. Hadn't yet thought of a good
way to add this logic to the register allocator though.
But, yeah, compiler stuff is really fiddly...
More orthogonality helps.
On 4/9/2024 3:47 PM, BGB-Alt wrote:
On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
BGB wrote:
On 4/9/2024 1:24 PM, Thomas Koenig wrote:
I wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
Thomas Koenig wrote:
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot
of CPU time in functions that have large numbers of local variables
all being used at the same time.
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and
128 GPRs is wasteful (would likely need lots of monster functions
with 250+ local variables to make effective use of this, *, which
probably isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
part of GPRs AND you have good access to constants.
On the main ISA's I had tried to generate code for, 16 GPRs was kind
of a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself (such
as dealing with register allocation involving scratch registers while
also not conflicting with the use of function arguments, ...).
My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my
compiler design, both function calls and branches terminating the
current basic-block).
On SH, the main way of getting constants (larger than 8 bits) was via
PC-relative memory loads, which kinda sucked.
This is slightly less bad on x86-64, since one can use memory operands
with most instructions, and the CPU tends to deal fairly well with
code that has lots of spill-and-fill. This along with instructions
having access to 32-bit immediate values.
*: Where, it appears it is most efficient (for non-leaf functions)
if the number of local variables is roughly twice that of the number
of CPU registers. If more local variables than this, then spill/fill
rate goes up significantly, and if less, then the registers aren't
utilized as effectively.
Well, except in "tiny leaf" functions, where the criteria is instead
that the number of local variables be less than the number of
scratch registers. However, for many/most small leaf functions, the
total number of variables isn't all that large either.
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once
one starts placing things like memmove(), memset(), sin(), cos(),
exp(), log()
in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into a slide.
As noted, on a 32 GPR machine, most leaf functions can fit entirely in
scratch registers. On a 64 GPR machine, this percentage is slightly
higher (but, not significantly, since there are few leaf functions
remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large
performance advantage:
On programs like Doom, the difference is small at best.
It mostly affects things like GLQuake in my case, mostly because
TKRA-GL has a lot of functions with a large number of local variables
(some exceeding 100 local variables).
Partly though this is due to code that is highly inlined and unrolled
and uses lots of variables tending to perform better in my case (and
tightly looping code, with lots of small functions, not so much...).
Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are
required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
In my case:
There is no frame pointer, as BGBCC doesn't use one;
All stack-frames are fixed size, VLA's and alloca use the heap;
GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
TLS, accessed via TBR.[...]
alloca using the heap? Strange to me...
On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Also the blob of constants needed to be within 512 bytes of the load instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
And, if one wanted a 16-bit branch:
MOV.W (PC, 4), R0 //load a 16-bit branch displacement
BRA/F R0
.L0:
NOP // delay slot
.WORD $(Label - .L0)
Also kinda bad...
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...
For small copies, can encode them inline, but past a certain size this becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high overhead
for small to medium copy.
So, there is a size range where doing it inline would be too bulky, but
a loop carries an undesirable level of overhead.
Ended up doing these with "slides", which end up eating roughly several
kB of code space, but was more compact than using larger inline copies.
Say (IIRC):
128 bytes or less: Inline Ld/St sequence
129 bytes to 512B: Slide
Over 512B: Call "memcpy()" or similar.
The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the last
bytes need to be handled externally prior to branching into the slide.
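The dispatch just described can be sketched in C, modeling the slide's 32-byte chunks with plain copies (on the real machine this is a computed branch into the assembly slide above; `memcpy_slide` here is an illustrative stand-in, not BGBCC's actual code):

```c
#include <string.h>
#include <stddef.h>

/* Tail (n % 32 bytes) is handled externally first; then the "slide" copies
   the remaining multiple of 32 bytes in reverse order, highest chunk first,
   exactly as the entry-point scheme above implies. */
static void memcpy_slide(unsigned char *dst, const unsigned char *src,
                         size_t n) {
    size_t tail = n % 32;
    size_t body = n - tail;

    /* last bytes handled prior to "branching into the slide" */
    memcpy(dst + body, src + body, tail);

    /* each iteration stands in for one 32-byte entry point of the slide */
    for (size_t off = body; off >= 32; off -= 32)
        memcpy(dst + off - 32, src + off - 32, 32);
}
```

Note the body runs high-to-low while the tail is at the high end, matching the reverse-order behavior described above.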
Though, this is only used for fixed-size copies (or "memcpy()" when
value is constant).
Say:
__memcpy64_512_ua:
MOV.Q (R5, 480), R20
MOV.Q (R5, 488), R21
MOV.Q (R5, 496), R22
MOV.Q (R5, 504), R23
MOV.Q R20, (R4, 480)
MOV.Q R21, (R4, 488)
MOV.Q R22, (R4, 496)
MOV.Q R23, (R4, 504)
__memcpy64_480_ua:
MOV.Q (R5, 448), R20
MOV.Q (R5, 456), R21
MOV.Q (R5, 464), R22
MOV.Q (R5, 472), R23
MOV.Q R20, (R4, 448)
MOV.Q R21, (R4, 456)
MOV.Q R22, (R4, 464)
MOV.Q R23, (R4, 472)
....
__memcpy64_32_ua:
MOV.Q (R5), R20
MOV.Q (R5, 8), R21
MOV.Q (R5, 16), R22
MOV.Q (R5, 24), R23
MOV.Q R20, (R4)
MOV.Q R21, (R4, 8)
MOV.Q R22, (R4, 16)
MOV.Q R23, (R4, 24)
RTS
BGB wrote:
On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
On 4/9/2024 3:47 PM, BGB-Alt wrote:
On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
BGB wrote:
On 4/9/2024 1:24 PM, Thomas Koenig wrote:
I wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
Thomas Koenig wrote:
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot
of CPU time in functions that have large numbers of local variables
all being used at the same time.
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and
128 GPRs is wasteful (would likely need lots of monster functions
with 250+ local variables to make effective use of this, *, which
probably isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
part of GPRs AND you have good access to constants.
On the main ISA's I had tried to generate code for, 16 GPRs was kind
of a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself
(such as dealing with register allocation involving scratch registers
while also not conflicting with the use of function arguments, ...).
My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my
compiler design, both function calls and branches terminating the
current basic-block).
On SH, the main way of getting constants (larger than 8 bits) was via
PC-relative memory loads, which kinda sucked.
This is slightly less bad on x86-64, since one can use memory
operands with most instructions, and the CPU tends to deal fairly
well with code that has lots of spill-and-fill. This along with
instructions having access to 32-bit immediate values.
*: Where, it appears it is most efficient (for non-leaf functions)
if the number of local variables is roughly twice that of the
number of CPU registers. If more local variables than this, then
spill/fill rate goes up significantly, and if less, then the
registers aren't utilized as effectively.
Well, except in "tiny leaf" functions, where the criteria is
instead that the number of local variables be less than the number
of scratch registers. However, for many/most small leaf functions,
the total number of variables isn't all that large either.
The vast majority of leaf functions use less than 16 GPRs, given one
has
a SP not part of GPRs {including arguments and return values}. Once
one starts placing things like memmove(), memset(), sin(), cos(),
exp(), log()
in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into a
slide.
As noted, on a 32 GPR machine, most leaf functions can fit entirely
in scratch registers. On a 64 GPR machine, this percentage is
slightly higher (but, not significantly, since there are few leaf
functions remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large
performance advantage:
On programs like Doom, the difference is small at best.
It mostly affects things like GLQuake in my case, mostly because
TKRA-GL has a lot of functions with a large number of local
variables (some exceeding 100 local variables).
Partly though this is due to code that is highly inlined and unrolled
and uses lots of variables tending to perform better in my case (and
tightly looping code, with lots of small functions, not so much...).
Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are
required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
In my case:
There is no frame pointer, as BGBCC doesn't use one;
All stack-frames are fixed size, VLA's and alloca use the heap;
GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
TLS, accessed via TBR.[...]
alloca using the heap? Strange to me...
Well, in this case:
The alloca calls are turned into calls which allocate the memory blob
and add it to a linked list;
when the function returns, everything in the linked list is freed;
Then, it internally pulls this off via malloc and free.
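The mechanism described (blob on a per-call linked list, bulk-freed on return) can be sketched as follows; the names `frame_alloca` / `frame_release` are hypothetical, and BGBCC's actual runtime calls surely differ:

```c
#include <stdlib.h>

/* Each heap-backed alloca blob is prefixed with a link node, chained onto
   the current call frame's list. Alignment here is only that of the node;
   a production version would round up to max_align_t. */
struct alloca_node { struct alloca_node *next; };

typedef struct { struct alloca_node *head; } alloca_frame;

static void *frame_alloca(alloca_frame *f, size_t n) {
    struct alloca_node *nd = malloc(sizeof *nd + n);
    if (!nd) return NULL;
    nd->next = f->head;
    f->head = nd;              /* add the blob to the frame's linked list */
    return nd + 1;             /* usable memory follows the link */
}

static void frame_release(alloca_frame *f) {
    while (f->head) {          /* on function return, free the whole list */
        struct alloca_node *nd = f->head;
        f->head = nd->next;
        free(nd);
    }
}
```

The compiler inserts the `frame_release` equivalent at every return path, which is what turns alloca into "freed when the function returns" semantics on the heap.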
Also the typical default stack size in this case is 128K, so trying to
put big allocations on the stack is more liable to result in a stack overflow.
Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
heap allocation is not too slow in this case.
Though, at the same time, ideally one limits use of language features
where the code-generation degenerates into a mess of hidden runtime
calls. These cases are not ideal for performance...
BGB wrote:
On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
Why do that to yourself ??
And, if one wanted a 16-bit branch:
MOV.W (PC, 4), R0 //load a 16-bit branch displacement
BRA/F R0
.L0:
NOP // delay slot
.WORD $(Label - .L0)
Also kinda bad...
Can you say Yech !!
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...
You have the power to fix it.........
For small copies, can encode them inline, but past a certain size this
becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high
overhead for small to medium copy.
So, there is a size range where doing it inline would be too bulky,
but a loop carries an undesirable level of overhead.
All the more reason to put it (a highly useful unit of work) into an instruction.
Ended up doing these with "slides", which end up eating roughly
several kB of code space, but was more compact than using larger
inline copies.
Say (IIRC):
128 bytes or less: Inline Ld/St sequence
129 bytes to 512B: Slide
Over 512B: Call "memcpy()" or similar.
Versus::
1-infinity: use MM instruction.
The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the last
bytes need to be handled externally prior to branching into the slide.
Does this remain sequentially consistent ??
Though, this is only used for fixed-size copies (or "memcpy()" when
value is constant).
Say:
__memcpy64_512_ua:
MOV.Q (R5, 480), R20
MOV.Q (R5, 488), R21
MOV.Q (R5, 496), R22
MOV.Q (R5, 504), R23
MOV.Q R20, (R4, 480)
MOV.Q R21, (R4, 488)
MOV.Q R22, (R4, 496)
MOV.Q R23, (R4, 504)
__memcpy64_480_ua:
MOV.Q (R5, 448), R20
MOV.Q (R5, 456), R21
MOV.Q (R5, 464), R22
MOV.Q (R5, 472), R23
MOV.Q R20, (R4, 448)
MOV.Q R21, (R4, 456)
MOV.Q R22, (R4, 464)
MOV.Q R23, (R4, 472)
....
__memcpy64_32_ua:
MOV.Q (R5), R20
MOV.Q (R5, 8), R21
MOV.Q (R5, 16), R22
MOV.Q (R5, 24), R23
MOV.Q R20, (R4)
MOV.Q R21, (R4, 8)
MOV.Q R22, (R4, 16)
MOV.Q R23, (R4, 24)
RTS
Duff's device by any other name.
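For comparison, the classic Duff's device: a switch that jumps into the middle of an unrolled copy loop to absorb the remainder, much as the slide branches into an entry point partway down (this variant copies memory-to-memory rather than to Duff's original fixed output register; count must be positive):

```c
/* Classic Duff's device, 8-way unrolled. The switch lands mid-loop to
   handle count % 8 leftover bytes, then the do/while finishes the rest. */
static void duff_copy(char *to, const char *from, int count) {
    int n = (count + 7) / 8;       /* total loop passes, rounded up */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

The slide gets the same effect with a computed branch target instead of a switch, and without the per-iteration loop test.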
On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
BGB wrote:
On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
Yeah.
This was why some of the first things I did when I started extending
SH-4 were:
Adding mechanisms to build constants inline;
Adding Load/Store ops with a displacement (albeit with encodings
borrowed from SH-2A);
Adding 3R and 3RI encodings (originally Imm8 for 3RI).
Did have a mess when I later extended the ISA to 32 GPRs, as (like with
BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.
Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
Why do that to yourself ??
I didn't design SuperH, Hitachi did...
But, with BJX1, I had added Disp16 branches.
With BJX2, they were replaced with 20 bit branches. These have the merit
of being able to branch anywhere within a Doom or Quake sized binary.
And, if one wanted a 16-bit branch:
MOV.W (PC, 4), R0 //load a 16-bit branch displacement
BRA/F R0
.L0:
NOP // delay slot
.WORD $(Label - .L0)
Also kinda bad...
Can you say Yech !!
Yeah.
This sort of stuff created strong incentive for ISA redesign...
Granted, it is possible had I instead started with RISC-V instead of
SuperH, it is probable BJX2 wouldn't exist.
Though, at the time, the original thinking was that SuperH having
smaller instructions meant it would have better code density than RV32I
or similar. Turns out not really, as the penalty of the 16 bit ops was needing almost twice as many on average.
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...
You have the power to fix it.........
But, at what cost...
I had generally avoided anything that would have required microcode or shoving state machines into the pipeline or similar.
Things like Load/Store-Multiple or
For small copies, can encode them inline, but past a certain size this
becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high
overhead for small to medium copy.
So, there is a size range where doing it inline would be too bulky,
but a loop caries an undesirable level of overhead.
All the more reason to put it (a highly useful unit of work) into an
instruction.
This is an area where "slides" work well, the main cost is mostly the
bulk that the slide adds to the binary (albeit, it is one-off).
Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...
For looping memcpy, it makes sense to copy 64 or 128 bytes per loop iteration or so to try to limit looping overhead.
Though, leveraging the memcpy slide for the interior part of the copy
could be possible in theory as well.
For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot shorter (a big part of LZ decoder performance mostly being in
fine-tuning the logic for the match copies).
Though, this is part of why my runtime library had added
"_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
which can consolidate this rather than needing to do it one-off for each
LZ decoder (as I see it, it is a similar issue to not wanting code to endlessly re-roll stuff for functions like memcpy or malloc/free, *).
*: Though, nevermind that the standard C interface for malloc is
annoyingly minimal, and ends up requiring most non-trivial programs to
roll their own memory management.
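The subtlety that makes a dedicated LZ copy helper worthwhile is overlap: a match may copy from bytes it has just written. A minimal sketch of the required semantics (the real `_memlzcpy` is surely more tuned; `lz_match_copy` is an illustrative name):

```c
#include <stddef.h>

/* Forward byte order is essential: when the match overlaps its own output
   (dst - src < len), each copied byte becomes input to a later one,
   replicating the pattern -- e.g. an offset-1 match fills a run with one
   byte value. memcpy/memmove must NOT be substituted here. */
static void lz_match_copy(unsigned char *dst, const unsigned char *src,
                          size_t len) {
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}
```

A tuned version would special-case large offsets (safe to block-copy) and small ones (pattern replication), which is exactly the "fine-tuning the logic for the match copies" mentioned above.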
Ended up doing these with "slides", which end up eating roughly
several kB of code space, but was more compact than using larger
inline copies.
Say (IIRC):
128 bytes or less: Inline Ld/St sequence
129 bytes to 512B: Slide
Over 512B: Call "memcpy()" or similar.
Versus::
1-infinity: use MM instruction.
Yeah, but it makes the CPU logic more expensive.
The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the last
bytes need to be handled externally prior to branching into the slide.
Does this remain sequentially consistent ??
Within a thread, it is fine.
Main wonk is that it does start copying from the high address first. Presumably interrupts or similar won't be messing with application memory
mid-memcpy.
The looping memcpy's generally work from low to high addresses though.
On 4/10/2024 12:41 AM, BGB wrote:
On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
On 4/9/2024 3:47 PM, BGB-Alt wrote:
On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
BGB wrote:
On 4/9/2024 1:24 PM, Thomas Koenig wrote:
I wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
Thomas Koenig wrote:
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot
of CPU time in functions that have large numbers of local
variables all being used at the same time.
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and
128 GPRs is wasteful (would likely need lots of monster functions
with 250+ local variables to make effective use of this, *, which
probably isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
part of GPRs AND you have good access to constants.
On the main ISA's I had tried to generate code for, 16 GPRs was kind
of a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to
use all of the registers at the same time without stepping on itself
(such as dealing with register allocation involving scratch
registers while also not conflicting with the use of function
arguments, ...).
My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my
compiler design, both function calls and branches terminating the
current basic-block).
On SH, the main way of getting constants (larger than 8 bits) was
via PC-relative memory loads, which kinda sucked.
This is slightly less bad on x86-64, since one can use memory
operands with most instructions, and the CPU tends to deal fairly
well with code that has lots of spill-and-fill. This along with
instructions having access to 32-bit immediate values.
*: Where, it appears it is most efficient (for non-leaf functions)
if the number of local variables is roughly twice that of the
number of CPU registers. If more local variables than this, then
spill/fill rate goes up significantly, and if less, then the
registers aren't utilized as effectively.
Well, except in "tiny leaf" functions, where the criteria is
instead that the number of local variables be less than the number
of scratch registers. However, for many/most small leaf functions,
the total number of variables isn't all that large either.
The vast majority of leaf functions use less than 16 GPRs, given
one has
a SP not part of GPRs {including arguments and return values}. Once
one starts placing things like memmove(), memset(), sin(), cos(),
exp(), log()
in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into a
slide.
As noted, on a 32 GPR machine, most leaf functions can fit entirely
in scratch registers. On a 64 GPR machine, this percentage is
slightly higher (but, not significantly, since there are few leaf
functions remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...).
There are a whole lot more leaf functions that exceed a limit of 6
than of 14.
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large
performance advantage:
On programs like Doom, the difference is small at best.
It mostly affects things like GLQuake in my case, mostly because
TKRA-GL has a lot of functions with a large number of local
variables (some exceeding 100 local variables).
Partly though this is due to code that is highly inlined and
unrolled and uses lots of variables tending to perform better in my
case (and tightly looping code, with lots of small functions, not so
much...).
Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are
required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
In my case:
There is no frame pointer, as BGBCC doesn't use one;
All stack-frames are fixed size, VLA's and alloca use the heap;
GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
TLS, accessed via TBR.[...]
alloca using the heap? Strange to me...
Well, in this case:
The alloca calls are turned into calls which allocate the memory blob
and add it to a linked list;
when the function returns, everything in the linked list is freed;
Then, it internally pulls this off via malloc and free.
Also the typical default stack size in this case is 128K, so trying to
put big allocations on the stack is more liable to result in a stack
overflow.
Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
heap allocation is not too slow in this case.
Though, at the same time, ideally one limits use of language features
where the code-generation degenerates into a mess of hidden runtime
calls. These cases are not ideal for performance...
Sometimes alloca is useful wrt offsetting the stack to avoid false
sharing between stacks. Intel wrote a little paper that addresses this:
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Remember that one?
mitchalsup@aol.com (MitchAlsup1) writes:
BGB wrote:
On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
It does occupy some icache space, however; have you boosted the icache
size to compensate?
mitchalsup@aol.com (MitchAlsup1) writes:
BGB wrote:
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
It does occupy some icache space, however; have you boosted the icache
size to compensate?
BGB-Alt wrote:[snip]
Things like memcpy/memmove/memset/etc, are function calls in
cases when not directly transformed into register load/store
sequences.
My 66000 does not convert them into LD-ST sequences, MM is a
single instruction.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The
entire system
sees only the before or only the after state and nothing in
between.
As noted, on a 32 GPR machine, most leaf functions can fit
entirely in scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
getting totally screwed.
BGB-Alt wrote:
On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
BGB wrote:
On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of
a basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache
pollution.
Yeah.
This was why some of the first things I did when I started extending
SH-4 were:
Adding mechanisms to build constants inline;
Adding Load/Store ops with a displacement (albeit with encodings
borrowed from SH-2A);
Adding 3R and 3RI encodings (originally Imm8 for 3RI).
My suggestion is that:: "Now that you have screwed around for a while,
Why not take that experience and do a new ISA without any of those
mistakes in it" ??
Did have a mess when I later extended the ISA to 32 GPRs, as (like
with BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.
Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
Why do that to yourself ??
I didn't design SuperH, Hitachi did...
But you did not fix them en masse, and you complain about them
at least once a week. There comes a time when it takes less time
and less courage to do that big switch and clean up all that mess.
But, with BJX1, I had added Disp16 branches.
With BJX2, they were replaced with 20 bit branches. These have the
merit of being able to branch anywhere within a Doom or Quake sized
binary.
And, if one wanted a 16-bit branch:
MOV.W (PC, 4), R0 //load a 16-bit branch displacement
BRA/F R0
.L0:
NOP // delay slot
.WORD $(Label - .L0)
Also kinda bad...
Can you say Yech !!
Yeah.
This sort of stuff created strong incentive for ISA redesign...
Maybe consider now as the appropriate time to start.
Granted, had I started with RISC-V rather than SuperH, it is probable
that BJX2 wouldn't exist.
Though, at the time, the original thinking was that SuperH having
smaller instructions meant it would have better code density than
RV32I or similar. Turns out not really, as the penalty of the 16 bit
ops was needing almost twice as many on average.
My 66000 only requires 70% the instruction count of RISC-V,
Yours could too ................
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...
You have the power to fix it.........
But, at what cost...
You would not have to spend hours a week defending the indefensible !!
I had generally avoided anything that would have required microcode or
shoving state-machines into the pipeline or similar.
Things as simple as IDIV and FDIV require sequencers.
But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!
Things like Load/Store-Multiple or
If you like polluted ICaches..............
For small copies, can encode them inline, but past a certain size
this becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high
overhead for small to medium copy.
So, there is a size range where doing it inline would be too bulky,
but a loop carries an undesirable level of overhead.
All the more reason to put it (a highly useful unit of work) into an
instruction.
This is an area where "slides" work well, the main cost is mostly the
bulk that the slide adds to the binary (albeit, it is one-off).
Consider that the predictor getting into the slide the first time
always mispredicts !!
Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...
What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably, yet a HW sequencer only has to avoid asserting a single byte write enable once.
For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
iteration or so to try to limit looping overhead.
On low end machines, you want to operate at cache port width,
On high end machines, you want to operate at cache line widths per port.
This is essentially impossible using slides.....here, the same code is
not optimal across a line of implementations.
Though, leveraging the memcpy slide for the interior part of the copy
could be possible in theory as well.
What do you do when the SATA drive wants to write a whole page ??
For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
shorter (a big part of LZ decoder performance mostly being in
fine-tuning the logic for the match copies).
Though, this is part of why my runtime library had added a
"_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
which can consolidate this rather than needing to do it one-off for
each LZ decoder (as I see it, it is a similar issue to not wanting
code to endlessly re-roll stuff for functions like memcpy or
malloc/free, *).
*: Though, nevermind that the standard C interface for malloc is
annoyingly minimal, and ends up requiring most non-trivial programs to
roll their own memory management.
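A hedged sketch of why such a helper differs from plain memcpy (the `_memlzcpy` name above is BGB's; this body is a generic illustration, not the actual library code): an LZ match may overlap its own output, so bytes must be copied strictly low-to-high.

```c
#include <stddef.h>

/* LZ match copy: src may overlap dst from behind (dst - src < len),
 * in which case bytes must be copied in ascending order so that
 * freshly written output bytes get replicated (run-length matches). */
void *lz_memcpy(void *dstv, const void *srcv, size_t len) {
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;
    for (size_t i = 0; i < len; i++)   /* strictly low-to-high */
        dst[i] = src[i];
    return dstv;
}
```

A real implementation would copy wider units when the match distance allows it; the byte loop is the behaviorally correct baseline.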
Ended up doing these with "slides", which end up eating roughly
several kB of code space, but was more compact than using larger
inline copies.
Say (IIRC):
128 bytes or less: Inline Ld/St sequence
129 bytes to 512B: Slide
Over 512B: Call "memcpy()" or similar.
Versus::
1-infinity: use MM instruction.
Yeah, but it makes the CPU logic more expensive.
By what, 37-gates ??
The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the
last bytes need to be handled externally prior to branching into the
slide.
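The C analogue of such a slide is a computed jump into an unrolled copy (Duff's device); a minimal byte-granularity sketch, ignoring the 32-byte units and reverse order described above:

```c
#include <stddef.h>

/* A software "slide": jump into the middle of an unrolled byte copy so
 * that exactly n bytes (0..7 here) are copied with no loop overhead.
 * Real slides work on larger units and many more entry points. */
void copy_tail(unsigned char *dst, const unsigned char *src, size_t n) {
    switch (n) {               /* the case label is the entry point */
    case 7: dst[6] = src[6];   /* fall through */
    case 6: dst[5] = src[5];   /* fall through */
    case 5: dst[4] = src[4];   /* fall through */
    case 4: dst[3] = src[3];   /* fall through */
    case 3: dst[2] = src[2];   /* fall through */
    case 2: dst[1] = src[1];   /* fall through */
    case 1: dst[0] = src[0];   /* fall through */
    case 0: break;
    }
}
```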
Does this remain sequentially consistent ??
Within a thread, it is fine.
What if a SATA drive is reading while you are writing !!
That is, DMA is no different than multi-threaded applications--except
DMA cannot perform locks.
Main wonk is that it does start copying from the high address first.
Presumably interrupts or similar won't be messing with application
memory mid memcpy.
The only things wanting high-low access patterns are dumping stuff to
the stack. The fact you CAN get away with it most of the time is no excuse.
The looping memcpy's generally work from low to high addresses though.
As does all string processing.
On 4/9/24 8:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:[snip]
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into a
slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire
system
sees only the before or only the after state and nothing in between.
I still feel that this atomicity should somehow be included with
ESM just because they feel related, but the benefit seems likely
to be extremely small. How often would software want to copy
multiple regions atomically or combine region copying with
ordinary ESM atomicity?? There *might* be some use for an atomic
region copy and an updating of a separate data structure (moving a
structure and updating one or a very few pointers??). For
structures three cache lines in size where only one region
occupies four cache lines, ordinary ESM could be used.
My feeling based on "relatedness" is not a strong basis for such
an architectural design choice.
(Simple page masking would allow false conflicts when smaller
memory moves are used. If there is a separate pair of range
registers that is checked for coherence of memory moves, this
issue would only apply for multiple memory moves _and_ all eight
of the buffer entries could be used for smaller accesses.)
[snip]
As noted, on a 32 GPR machine, most leaf functions can fit entirely
in scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
getting totally screwed.
I wonder how many instructions would have to have access to such a
set of "special registers" and if a larger number of extra
registers would be useful. (One of the issues — in my opinion —
with PowerPC's link register and count register was that they
could not be directly loaded from or stored to memory [or loaded
with a constant from the instruction stream]. For counted loops,
loading the count register from the instruction stream would
presumably have allowed early branch determination even for deep
pipelines and small loop counts.) SP, FP, GOT, and TLS hold
"stable values", which might facilitate some microarchitectural
optimizations compared to more frequently modified register names.
(I am intrigued by the possibility of small contexts for some
multithreaded workloads, similar to how some GPUs allow variable context sizes.)
mitchalsup@aol.com (MitchAlsup1) writes:
BGB wrote:
On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
It does occupy some icache space, however; have you boosted the icache
size to compensate?
Scott Lurndal wrote:
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
mitchalsup@aol.com (MitchAlsup1) writes:
BGB wrote:
In My 66000 case, the constant is the word following the
instruction. Easy to find, easy to access, no register pollution,
no DCache pollution.
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
Alternatively:: if you paste constants together (LUI, AUIPC) you have
no direct route to either 64-bit constants or 64-bit address spaces.
It looks to be a win-win !!
On 4/9/24 8:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:[snip]
Things like memcpy/memmove/memset/etc, are function calls in
cases when not directly transformed into register load/store
sequences.
My 66000 does not convert them into LD-ST sequences, MM is a
single instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The
entire system
sees only the before or only the after state and nothing in
between.
On Wed, 10 Apr 2024 23:30:02 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
BGB wrote:
In My 66000 case, the constant is the word following the
instruction. Easy to find, easy to access, no register pollution,
no DCache pollution.
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
Alternatively:: if you paste constants together (LUI, AUIPC) you have
no direct route to either 64-bit constants or 64-bit address spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
On 4/11/2024 6:13 AM, Michael S wrote:
On Wed, 10 Apr 2024 23:30:02 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
Alternatively:: if you paste constants together (LUI, AUIPC) you have
no direct route to either 64-bit constants or 64-bit address spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
FWIW:
The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
and needs less encoding space than the LUI route.
MOV Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and 1-cycle, is preferable....
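As a sanity check on the sequence above, SHORI's shift-and-OR dataflow can be modeled in C (hypothetical helper mirroring the described semantics):

```c
#include <stdint.h>

/* SHORI semantics as described above: shift the register left 16 and
 * OR in a 16-bit immediate. A full 64-bit constant thus takes
 * MOV (high 16 bits) followed by three SHORIs. */
static uint64_t shori(uint64_t r, uint16_t imm16) {
    return (r << 16) | imm16;
}

uint64_t build_const64(uint16_t i3, uint16_t i2, uint16_t i1, uint16_t i0) {
    uint64_t r = i3;          /* MOV   Imm16, Rn */
    r = shori(r, i2);         /* SHORI Imm16, Rn */
    r = shori(r, i1);         /* SHORI Imm16, Rn */
    r = shori(r, i0);         /* SHORI Imm16, Rn */
    return r;
}
```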
BGB wrote:
On 4/11/2024 6:13 AM, Michael S wrote:
On Wed, 10 Apr 2024 23:30:02 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
Alternatively:: if you paste constants together (LUI, AUIPC) you have
no direct route to either 64-bit constants or 64-bit address spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
Never seen a LD-OP architecture where the inbound memory can be in the
Rs1 position of the instruction.
FWIW:
The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
and needs less encoding space than the LUI route.
MOV Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
Granted, if each is a 1-cycle instruction, this still takes 4 clock
cycles.
As compared to::
CALK Rd,Rs1,#imm64
Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
of the constant is free !! (0 cycles) !! {{The above example uses at least
5 cycles to use the loaded/built constant.}}
An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
1-cycle, is preferable....
A consuming instruction where you don't even use a register is better
still !!
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
On 4/9/24 8:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:[snip]
Things like memcpy/memmove/memset/etc, are function calls in
cases when not directly transformed into register load/store
sequences.
My 66000 does not convert them into LD-ST sequences, MM is a
single instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches. Likewise memset.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The
entire system
sees only the before or only the after state and nothing in
between.
One might wonder how that atomicity is guaranteed in a
SMP processor...
On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
BGB wrote:
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
Never seen a LD-OP architecture where the inbound memory can be in the
Rs1 position of the instruction.
FWIW:
The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
and needs less encoding space than the LUI route.
MOV Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
Granted, if each is a 1-cycle instruction, this still takes 4 clock
cycles.
As compared to::
CALK Rd,Rs1,#imm64
Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
of the constant is free !! (0 cycles) !! {{The above example uses at least
5 cycles to use the loaded/built constant.}}
The main reason one might want SHORI is that it can fit into a
fixed-length 32-bit encoding.
Also technically could be retrofitted onto RISC-V without any significant change, unlike some other options (as
noted, I don't argue for adding Jumbo prefixes to RV on the basis
that there is no real viable way to add them to RV, *).
Sadly, the closest option to viable for RV would be to add the SHORI instruction and optionally pattern match it in the fetch/decode.
Or, say:
LUI Xn, Imm20
ADD Xn, Xn, Imm12
SHORI Xn, Imm16
SHORI Xn, Imm16
Then, combine LUI+ADD into a 32-bit load in the decoder (though probably only if the Imm12 is positive), and 2x SHORI into a combined "Xn=(Xn<<32)|Imm32" operation.
This could potentially get it down to 2 clock cycles.
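The arithmetic behind that fusion can be checked in C (a sketch; the Imm12-positive restriction exists because RISC-V's ADDI sign-extends its 12-bit immediate, and LUI's own sign extension of bit 31 is ignored here):

```c
#include <stdint.h>

/* LUI+ADDI pair: LUI places imm20 in bits 31:12, ADDI adds imm12.
 * With imm12 restricted to 0..2047, the fused pair is simply a
 * 32-bit constant load. */
static uint64_t lui_addi(uint32_t imm20, uint32_t imm12) {
    return (uint64_t)((imm20 << 12) + imm12);
}

/* Fused 2x SHORI, as described above: Xn = (Xn << 32) | imm32 */
static uint64_t shori_pair(uint64_t x, uint32_t imm32) {
    return (x << 32) | imm32;
}
```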
*: To add a jumbo prefix, one needs an encoding that:
Uses up a really big chunk of encoding space;
Is otherwise illegal and unused.
RISC-V doesn't have anything here.
Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
space that aren't yet used for anything, but aren't usable as normal encoding space mostly because if I put instructions in there (with the existing encoding schemes), I couldn't use all the registers (and they
would not have predication or similar either). Annoyingly, the only
types of encodings that would fit in there at present are 2RI Imm16 ops
or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
encodings for R0..R31 anyways, interpreting the LSB of the register
field as encoding R32..R63).
An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
1-cycle, is preferable....
A consuming instruction where you don't even use a register is better
still !!
Can be done, but thus far only 33-bit immediate values. Luckily, Imm33s seems
to address around 99% of uses (for normal ALU ops and similar).
Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By themselves though, the difference doesn't seem enough to justify the cost.
Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
12 bytes (and allowing a 16-byte encoding would have too steep of a cost increase to be worthwhile).
So, alas...
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
On 4/9/24 8:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:[snip]
Things like memcpy/memmove/memset/etc, are function calls in
cases when not directly transformed into register load/store
sequences.
My 66000 does not convert them into LD-ST sequences, MM is a
single instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches. Likewise memset.
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The
entire system
sees only the before or only the after state and nothing in
between.
One might wonder how that atomicity is guaranteed in a
SMP processor...
On 4/11/2024 6:13 AM, Michael S wrote:
On Wed, 10 Apr 2024 23:30:02 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the
DCache so the overall hit rate goes up !! At typical sizes,
ICache miss rate is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer
instructions.
Alternatively:: if you paste constants together (LUI, AUIPC) you
have no direct route to either 64-bit constants or 64-bit address
spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it
depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
On 4/11/2024 9:30 AM, Scott Lurndal wrote:
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
One thing that is still needed is a good, fast, and semi-accurate way to pull off the Z=1.0/Z' calculation, as needed for perspective-correct rasterization (affine requires subdivision, which adds cost to the front-end, and interpolating Z directly adds significant distortion for geometry near the near plane).
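For context (standard rasterization math, not from the post): 1/Z and attribute/Z are linear in screen space, so perspective-correct interpolation needs only one reciprocal per pixel, which is what a fast Z=1.0/Z' operation would accelerate. A sketch along one edge:

```c
/* Perspective-correct interpolation: interpolate 1/z and attr/z
 * linearly in screen space, then divide back per pixel. The per-pixel
 * divide is the 1.0/Z' reciprocal discussed above. */
static double persp_lerp(double t, double z0, double z1,
                         double a0, double a1) {
    double inv_z    = (1.0 - t) / z0 + t / z1;          /* linear */
    double a_over_z = (1.0 - t) * (a0 / z0) + t * (a1 / z1);
    return a_over_z / inv_z;                            /* recover attr */
}
```

Affine interpolation skips the divide and interpolates the attribute directly, which is why it distorts badly near the near plane unless triangles are subdivided.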
BGB-Alt wrote:
On 4/11/2024 1:46 PM, MitchAlsup1 wrote:
BGB wrote:
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
FWIW:
The LDSH / SHORI mechanism does provide a way to get 64-bit
constants, and needs less encoding space than the LUI route.
MOV Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
Granted, if each is a 1-cycle instruction, this still takes 4 clock
cycles.
As compared to::
CALK Rd,Rs1,#imm64
Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
of the constant is free !! (0 cycles) !! {{The above example uses at
least
5 cycles to use the loaded/built constant.}}
The main reason one might want SHORI is that it can fit into a
fixed-length 32-bit encoding.
While 32-bit encoding is RISC mantra, it has NOT been shown to be best
just simplest. Then, once you start widening the microarchitecture, it
is better to fetch wider than decode-issue so that you suffer least from boundary conditions. Once you start fetching wide OR have wide
decode-issue, you have ALL the infrastructure to do variable length instructions. Thus, complaining that VLE is hard has already been
eradicated.
Also technically could be retrofitted
onto RISC-V without any significant change, unlike some other options
(as noted, I don't argue for adding Jumbo prefixes to RV on the
basis that there is no real viable way to add them to RV, *).
The issue is that once you do VLE, RISC-V's ISA is no longer helping you
get the job done, especially when you have to execute 40% more instructions.
Sadly, the closest option to viable for RV would be to add the SHORI
instruction and optionally pattern match it in the fetch/decode.
Or, say:
LUI Xn, Imm20
ADD Xn, Xn, Imm12
SHORI Xn, Imm16
SHORI Xn, Imm16
Then, combine LUI+ADD into a 32-bit load in the decoder (though
probably only if the Imm12 is positive), and 2x SHORI into a combined
"Xn=(Xn<<32)|Imm32" operation.
This could potentially get it down to 2 clock cycles.
Universal constants gets this down to 0 cycles......
*: To add a jumbo prefix, one needs an encoding that:
Uses up a really big chunk of encoding space;
Is otherwise illegal and unused.
RISC-V doesn't have anything here.
Which is WHY you should not jump ship from SH to RV, but jump to an
ISA without these problems.
Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
space that aren't yet used for anything, but aren't usable as normal
encoding space mostly because if I put instructions in there (with the
existing encoding schemes), I couldn't use all the registers (and they
would not have predication or similar either). Annoyingly, the only
types of encodings that would fit in there at present are 2RI Imm16
ops or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
encodings for R0..R31 anyways, interpreting the LSB of the register
field as encoding R32..R63).
Just another reason not to stay with what you have developed.
In comparison, I reserve 6-major OpCodes so that a control transfer into
data is highly likely to get Undefined OpCode exceptions rather than a
try to execute what is in that data. Then, as it is, I still have 21-slots
in the major OpCode group free (27 if you count the permanently reserved).
Much of this comes from side effects of Universal Constants.
An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
1-cycle, is preferable....
A consuming instruction where you don't even use a register is better
still !!
Can be done, but thus far only 33-bit immediate values. Luckily, Imm33s
seems to address around 99% of uses (for normal ALU ops and similar).
What do you do when accessing data that the linker knows is more than
4GB away from IP ?? or known to be outside of 0-4GB ?? externs, GOT,
PLT, ...
Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
or 2x S.E8.F19), which would have indirectly allowed the Imm57s case.
By themselves though, the difference doesn't seem enough to justify
the cost.
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
Don't have enough bits in the encoding scheme to pull off a 3RI Imm64
in 12 bytes (and allowing a 16-byte encoding would have too steep of a
cost increase to be worthwhile).
And yet I did.
So, alas...
Yes, alas..........
On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically significant enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
On Thu, 11 Apr 2024 18:46:54 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it
depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
May be. But out of 6 major integer OPs it matters only for SUB.
By now I don't remember for sure, but I think that I had seen an LD-OP
architecture that had a SUBR instruction. Maybe TI TMS320C30?
Michael S <already5chosen@yahoo.com> writes:
On Thu, 11 Apr 2024 18:46:54 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it
depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
May be. But out of 6 major integer OPs it matters only for SUB.
By now I don't remember for sure, but I think that I had seen an LD-OP
architecture that had a SUBR instruction. Maybe TI TMS320C30?
ARM has LDADD - negate one argument and it becomes a subtract.
BGB wrote:
On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
While I admit that <basically> anything bigger than 50-bits will be fine >>> as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically
significant enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
Idle speculation::
.globl r8_erf ; -- Begin function r8_erf
.type r8_erf,@function
r8_erf: ; @r8_erf
; %bb.0:
add sp,sp,#-128
std #4614300636657501161,[sp,88] // a[0]
std #4645348406721991307,[sp,104] // a[2]
std #4659275911028085274,[sp,112] // a[3]
std #4595861367557309218,[sp,120] // a[4]
std #4599171895595656694,[sp,40] // p[0]
std #4593699784569291823,[sp,56] // p[2]
std #4580293056851789237,[sp,64] // p[3]
std #4559215111867327292,[sp,72] // p[4]
std #4580359811580069319,[sp,80] // p[4]
std #4612966212090462427,[sp] // q[0]
std #4602930165995154489,[sp,16] // q[2]
std #4588882433176075751,[sp,24] // q[3]
std #4567531038595922641,[sp,32] // q[4]
fabs r2,r1
fcmp r3,r2,#0x3EF00000 // thresh
bnlt r3,.LBB141_6
; %bb.1:
fcmp r3,r2,#4 // xabs <= 4.0
bnlt r3,.LBB141_7
; %bb.2:
fcmp r3,r2,#0x403A8B020C49BA5E // xbig
bngt r3,.LBB141_11
; %bb.3:
fmul r3,r1,r1
fdiv r3,#1,r3
mov r4,#0x3F90B4FB18B485C7 // p[5]
fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
fadd r5,r3,#0x40048C54508800DB // q[0]
fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
fmul r4,r3,r4
fmul r6,r3,r6
mov r5,#2
add r7,sp,#40 // p[*]
add r8,sp,#0 // q[*]
LBB141_4: ; %._crit_edge11
; =>This Inner Loop Header: Depth=1
vec r9,{r4,r6}
ldd r10,[r7,r5<<3,0] // p[*]
ldd r11,[r8,r5<<3,0] // q[*]
fadd r6,r6,r10
fadd r4,r4,r11
fmul r4,r3,r4
fmul r6,r3,r6
loop ne,r5,#4,#1
; %bb.5:
fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
fmul r3,r3,r5
fadd r4,r4,#0x3F632147A014BAD1 // q[4]
fdiv r3,r3,r4
fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
fdiv r3,r3,r2
br .LBB141_10 // common tail
LBB141_6: ; %._crit_edge
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
sra r2,r2,<1:13>
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322 // a[4]
fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
ldd r4,[sp,104] // a[2]
fmac r3,r2,r3,r4
fadd r4,r2,#0x403799EE342FB2DE // b[0]
fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
fdiv r2,r1,r2
mov r1,r2
add sp,sp,#128
ret // 68
LBB141_7:
fmul r3,r2,#0x3E571E703C5F5815 // c[8]
mov r5,#0
mov r4,r2
LBB141_8: ; =>This Inner Loop Header: Depth=1
vec r6,{r3,r4}
ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
fadd r3,r3,r7
fmul r3,r2,r3
ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
fadd r4,r4,r7
fmul r4,r2,r4
loop ne,r5,#7,#1
; %bb.9:
fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
fadd r4,r4,#0x4093395B7FD35F61 // d[7]
fdiv r3,r3,r4
LBB141_10: // common tail
fmul r4,r2,#0x41800000 // 16.0
fmul r4,r4,#0x3D800000 // 1/16.0
cvtds r4,r4 // (signed)double
cvtsd r4,r4 // (double)signed
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4 // exp()
fmul r2,r2,-r5
fexp r2,r2 // exp()
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000 // 0.5
fadd r2,r2,#0x3F000000 // 0.5
pflt r1,0,T
fadd r2,#0,-r2
mov r1,r2
add sp,sp,#128
ret
LBB141_11:
fcmp r1,r1,#0
sra r1,r1,<1:13>
cvtsd r2,#-1 // (double)-1
cvtsd r3,#1 // (double)+1
mux r2,r1,r3,r2
mov r1,r2
add sp,sp,#128
ret
Lfunc_end141:
.size r8_erf, .Lfunc_end141-r8_erf
; -- End function
On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
BGB wrote:
These patterns seem rather unusual...
Don't really know the ABI.
Patterns don't really fit observations for typical compiler output
though (mostly in the FP constants, and particular ones that fall
outside the scope of what can be exactly represented as Binary16 or
similar, are rare).
.globl r8_erf ; -- Begin function r8_erf
.type r8_erf,@function
r8_erf: ; @r8_erf
; %bb.0:
add sp,sp,#-128
ADD -128, SP
std #4614300636657501161,[sp,88] // a[0]
MOV 0x400949FB3ED443E9, R3
MOV.Q R3, (SP, 88)
std #4645348406721991307,[sp,104] // a[2]
MOV 0x407797C38897528B, R3
MOV.Q R3, (SP, 104)
std #4659275911028085274,[sp,112] // a[3]
... pattern is obvious enough.
std #4595861367557309218,[sp,120] // a[4]
std #4599171895595656694,[sp,40] // p[0]
std #4593699784569291823,[sp,56] // p[2]
std #4580293056851789237,[sp,64] // p[3]
std #4559215111867327292,[sp,72] // p[4]
std #4580359811580069319,[sp,80] // p[4]
std #4612966212090462427,[sp] // q[0]
std #4602930165995154489,[sp,16] // q[2]
std #4588882433176075751,[sp,24] // q[3]
std #4567531038595922641,[sp,32] // q[4]
Each constant needs 12 bytes, so 16 bytes/store.
fabs r2,r1
FABS R5, R6
fcmp r3,r2,#0x3EF00000 // thresh
bnlt r3,.LBB141_6
FLDH 0x3780, R3 //A
FCMPGT R3, R6 //A
BT .LBB141_6 //A
Or (FP-IMM extension):
FABS R5, R6
FCMPGE 0x0DE, R6 //B (FP-IMM)
BF .LBB141_6 //B
; %bb.1:
fcmp r3,r2,#4 // xabs <= 4.0
bnlt r3,.LBB141_7
FCMPGE 0x110, R6
BF .LBB141_7
; %bb.2:
fcmp r3,r2,#0x403A8B020C49BA5E // xbig
bngt r3,.LBB141_11
MOV 0x403A8B020C49BA5E, R3
FCMPGT R3, R6
BT .LBB141_11
Where FP-IMM won't work with that value.
; %bb.3:
fmul r3,r1,r1
FMUL R5, R5, R7
fdiv r3,#1,r3
Skip, operation gives identity?...
mov r4,#0x3F90B4FB18B485C7 // p[5]
Similar.
fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
fadd r5,r3,#0x40048C54508800DB // q[0]
fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
Turns into 4 constants, 7 FPU instructions (if no FMAC extension, 4 with FMAC). Though, at present, FMAC is slower than separate FMUL+FADD.
So, between 8 and 11 instructions.
fmul r4,r3,r4
fmul r6,r3,r6
mov r5,#2
add r7,sp,#40 // p[*]
add r8,sp,#0 // q[*]
These can map 1:1.
LBB141_4: ; %._crit_edge11
; =>This Inner Loop Header: Depth=1
vec r9,{r4,r6}
ldd r10,[r7,r5<<3,0] // p[*]
ldd r11,[r8,r5<<3,0] // q[*]
fadd r6,r6,r10
fadd r4,r4,r11
fmul r4,r3,r4
fmul r6,r3,r6
loop ne,r5,#4,#1
Could be mapped to a scalar loop, pretty close to 1:1.
Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
guessing 2 copies for a 4-element vector?...
; %bb.5:
fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
fmul r3,r3,r5
fadd r4,r4,#0x3F632147A014BAD1 // q[4]
fdiv r3,r3,r4
fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
fdiv r3,r3,r2
br .LBB141_10 // common tail
Same patterns as before.
Would need ~ 10 ops.
Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
but this would only give "Binary32 truncated to 29 bits" precision for
the immediate values.
Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
give ~ 53 bits of precision. But, if one needs full Binary64, this will
not gain much in this case.
LBB141_6: ; %._crit_edge
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
sra r2,r2,<1:13>
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322 // a[4]
fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
ldd r4,[sp,104] // a[2]
fmac r3,r2,r3,r4
fadd r4,r2,#0x403799EE342FB2DE // b[0]
fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
fdiv r2,r1,r2
mov r1,r2
add sp,sp,#128
ret // 68
LBB141_7:
fmul r3,r2,#0x3E571E703C5F5815 // c[8]
mov r5,#0
mov r4,r2
LBB141_8: ; =>This Inner Loop Header: Depth=1
vec r6,{r3,r4}
ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
fadd r3,r3,r7
fmul r3,r2,r3
ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
fadd r4,r4,r7
fmul r4,r2,r4
loop ne,r5,#7,#1
; %bb.9:
fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
fadd r4,r4,#0x4093395B7FD35F61 // d[7]
fdiv r3,r3,r4
LBB141_10: // common tail
fmul r4,r2,#0x41800000 // 16.0
fmul r4,r4,#0x3D800000 // 1/16.0
cvtds r4,r4 // (signed)double
cvtsd r4,r4 // (double)signed
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4 // exp()
fmul r2,r2,-r5
fexp r2,r2 // exp()
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000 // 0.5
fadd r2,r2,#0x3F000000 // 0.5
pflt r1,0,T
fadd r2,#0,-r2
mov r1,r2
add sp,sp,#128
ret
LBB141_11:
fcmp r1,r1,#0
sra r1,r1,<1:13>
cvtsd r2,#-1 // (double)-1
cvtsd r3,#1 // (double)+1
mux r2,r1,r3,r2
mov r1,r2
add sp,sp,#128
ret
Lfunc_end141:
.size r8_erf, .Lfunc_end141-r8_erf
; -- End function
Don't really have time at the moment to comment on the rest of this...
In other news, found a bug in the function dependency-walking code.
Fixing this bug got things a little closer to break-even with RV64G GCC
output regarding ".text" size (though it was still not sufficient to
entirely close the gap).
This was mostly based on noting that the compiler output had included
some things that were not reachable from within the program being
compiled (namely, noticing that the Doom build had included a copy of
the MS-CRAM video decoder and similar, which was not reachable from
anywhere within Doom).
Some more analysis may be needed.
....
BGB wrote:
On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
BGB wrote:
On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
These patterns seem rather unusual...
Don't really know the ABI.
Patterns don't really fit observations for typical compiler output
though (mostly in the FP constants, and particular ones that fall
outside the scope of what can be exactly represented as Binary16 or
similar, are rare).
You are N E V E R going to find the coefficients of a Chebyshev
polynomial to fit in a small FP container; excepting the very
occasional C0 or C1 term {which are mostly 1.0 and 0.0}
I wrote:
If you want to know more about MCore, you can contact me.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
John Savard <quadibloc@servername.invalid> schrieb:
Thus, instead of having mode bits, one _could_ do the following:
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
destructive operand model for the 21-bit encodings. Yes :: no ??
It was not very well developed, I gave it up when I saw there wasn't
much to gain.
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with
separate compilations? The called function probably doesn’t know the
tag value for callee-saved registers. Fortunately, the My 66000
architecture comes to the rescue here. You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring in the same data structure it uses for the registers
(yes, it adds 32 bits to that structure – minimal cost). The same
mechanism works for interrupts that take control away from a running
process.
Any comments are welcome.
I have a similar problem for the carry and overflow bits in
< http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
let those bits not survive across calls; if there was a cheap solution
for the problem, it would eliminate this drawback of my idea.
Anton Ertl wrote:
I have a similar problem for the carry and overflow bits in
< http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
let those bits not survive across calls; if there was a cheap solution
for the problem, it would eliminate this drawback of my idea.
My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
whereas RISC-V encodes the inner loop in 11 instructions.
Source code:
void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
{
uint64_t c = 0;
for( int i = 0; i < n; i++ )
{
{c, sum[i]} = a[i] + b[i] + c;
}
return;
}
Assembly code::
.global mpn_add_n
mpn_add_n:
MOV R5,#0 // c
MOV R6,#0 // i
VEC R7,{}
LDD R8,[R2,Ri<<3]
LDD R9,[R3,Ri<<3]
CARRY R5,{{IO}}
ADD R10,R8,R9
STD R10,[R1,Ri<<3]
LOOP LT,R6,#1,R4
RET
So, adding a few "bells and whistles" to RISC-V does give you a
performance gain (1.38×); using a well designed ISA gives you a performance gain of 2.00× !! {{moral: don't stop too early}}
Note that all the register bookkeeping has disappeared !! because
of the indexed memory reference form.
As I count executing instructions, VEC does not execute, nor does
CARRY; CARRY causes the subsequent ADD to take C input as carry, and
the carry produced by ADD goes back into C. LOOP performs the ADD-CMP-
BC sequence in a single instruction and in a single clock.
MitchAlsup1 wrote:
; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
xor rax,rax ;; Clear carry
next:
mov rax,[rsi+rcx*8]
adc rax,[rdx+rcx*8]
mov [rdi+rcx*8],rax
inc rcx
jnz next
The code above is 5 instructions, or 6 if we avoid the load-op, doing
two loads and one store, so it should only be limited by the latency of
the ADC, i.e. one or two cycles.
In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
loop something like this:
next:
adc eax,ebx
mov ebx,[edx+ecx*4] ; First cycle
mov [edi+ecx*4],eax
mov eax,[esi+ecx*4] ; Second cycle
inc ecx
jnz next ; Third cycle
On Thu, 11 Apr 2024 18:46:54 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
On 4/11/2024 6:13 AM, Michael S wrote:
On Wed, 10 Apr 2024 23:30:02 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the
DCache so the overall hit rate goes up !! At typical sizes,
ICache miss rate is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer
instructions.
Alternatively:: if you paste constants together (LUI, AUPIC) you
have no direct route to either 64-bit constants or 64-bit address
spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it
depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
Maybe. But out of the 6 major integer OPs it matters only for SUB.
By now I don't remember for sure, but I think I had seen a LD-OP
architecture that had a SUBR instruction. Maybe the TI TMS320C30?
It was 30 years ago and my memory is not what it used to be.
BGB wrote:
On 4/11/2024 8:40 PM, MitchAlsup1 wrote:
BGB wrote:
On 4/11/2024 6:06 PM, MitchAlsup1 wrote:
While I admit that <basically> anything bigger than 50-bits will be >>>>> fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically
significant enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
Idle speculation::
.globl r8_erf ; -- Begin function
r8_erf
.type r8_erf,@function
r8_erf: ; @r8_erf
; %bb.0:
add sp,sp,#-128
std #4614300636657501161,[sp,88] // a[0]
std #4645348406721991307,[sp,104] // a[2]
std #4659275911028085274,[sp,112] // a[3]
std #4595861367557309218,[sp,120] // a[4]
std #4599171895595656694,[sp,40] // p[0]
std #4593699784569291823,[sp,56] // p[2]
std #4580293056851789237,[sp,64] // p[3]
std #4559215111867327292,[sp,72] // p[4]
std #4580359811580069319,[sp,80] // p[4]
std #4612966212090462427,[sp] // q[0]
std #4602930165995154489,[sp,16] // q[2]
std #4588882433176075751,[sp,24] // q[3]
std #4567531038595922641,[sp,32] // q[4]
fabs r2,r1
fcmp r3,r2,#0x3EF00000 // thresh
bnlt r3,.LBB141_6
; %bb.1:
fcmp r3,r2,#4 // xabs <= 4.0
bnlt r3,.LBB141_7
; %bb.2:
fcmp r3,r2,#0x403A8B020C49BA5E // xbig
bngt r3,.LBB141_11
; %bb.3:
fmul r3,r1,r1
fdiv r3,#1,r3
mov r4,#0x3F90B4FB18B485C7 // p[5]
fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
fadd r5,r3,#0x40048C54508800DB // q[0]
fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
fmul r4,r3,r4
fmul r6,r3,r6
mov r5,#2
add r7,sp,#40 // p[*]
add r8,sp,#0 // q[*]
LBB141_4: ; %._crit_edge11
; =>This Inner Loop Header:
Depth=1
vec r9,{r4,r6}
ldd r10,[r7,r5<<3,0] // p[*]
ldd r11,[r8,r5<<3,0] // q[*]
fadd r6,r6,r10
fadd r4,r4,r11
fmul r4,r3,r4
fmul r6,r3,r6
loop ne,r5,#4,#1
; %bb.5:
fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
fmul r3,r3,r5
fadd r4,r4,#0x3F632147A014BAD1 // q[4]
fdiv r3,r3,r4
fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
fdiv r3,r3,r2
br .LBB141_10 // common tail
LBB141_6: ; %._crit_edge
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
sra r2,r2,<1:13>
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322 // a[4]
fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
ldd r4,[sp,104] // a[2]
fmac r3,r2,r3,r4
fadd r4,r2,#0x403799EE342FB2DE // b[0]
fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
fdiv r2,r1,r2
mov r1,r2
add sp,sp,#128
ret // 68
.LBB141_7:
fmul r3,r2,#0x3E571E703C5F5815 // c[8]
mov r5,#0
mov r4,r2
.LBB141_8: ; =>This Inner Loop Header: Depth=1
vec r6,{r3,r4}
ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
fadd r3,r3,r7
fmul r3,r2,r3
ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
fadd r4,r4,r7
fmul r4,r2,r4
loop ne,r5,#7,#1
; %bb.9:
fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
fadd r4,r4,#0x4093395B7FD35F61 // d[7]
fdiv r3,r3,r4
.LBB141_10: // common tail
fmul r4,r2,#0x41800000 // 16.0
fmul r4,r4,#0x3D800000 // 1/16.0
cvtds r4,r4 // (signed)double
cvtsd r4,r4 // (double)signed
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4 // exp()
fmul r2,r2,-r5
fexp r2,r2 // exp()
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000 // 0.5
fadd r2,r2,#0x3F000000 // 0.5
pflt r1,0,T
fadd r2,#0,-r2
mov r1,r2
add sp,sp,#128
ret
.LBB141_11:
fcmp r1,r1,#0
sra r1,r1,<1:13>
cvtsd r2,#-1 // (double)-1
cvtsd r3,#1 // (double)+1
mux r2,r1,r3,r2
mov r1,r2
add sp,sp,#128
ret
.Lfunc_end141:
.size r8_erf, .Lfunc_end141-r8_erf
; -- End function
These patterns seem rather unusual...
Don't really know the ABI.
The patterns don't really fit my observations of typical compiler output
though (mostly in the FP constants; in particular, constants that fall
outside the scope of what can be exactly represented as Binary16 or
similar are usually rare).
You are N E V E R going to find the coefficients of a Chebyshev
polynomial to fit in a small FP container; excepting the very
occasional C0 or C1 term {which are mostly 1.0 and 0.0}
Terje Mathisen wrote:
It all comes down to the carry propagation, right?
MitchAlsup1 wrote:
In the non-OoO (i.e Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
loop something like this:
next:
adc eax,ebx
mov ebx,[edx+ecx*4] ; First cycle
mov [edi+ecx*4],eax
mov eax,[esi+ecx*4] ; Second cycle
inc ecx
jnz next ; Third cycle
Terje
As opposed to::
.global mpn_add_n
mpn_add_n:
MOV R5,#0 // c
MOV R6,#0 // i
VEC R7,{}
LDD R8,[R2,Ri<<3] // Load 128-to-512 bits
LDD R9,[R3,Ri<<3] // Load 128-to-512 bits
CARRY R5,{{IO}}
ADD R10,R8,R9 // Add pair to add octal
STD R10,[R1,Ri<<3] // Store 128-to-512 bits
LOOP LT,R6,#1,R4 // increment 2-to-8 times
RET
--------------------------------------------------------
LDD R8,[R2,Ri<<3] // AGEN cycle 1
LDD R9,[R3,Ri<<3] // AGEN cycle 2 data cycle 4
CARRY R5,{{IO}}
ADD R10,R8,R9 // cycle 4
STD R10,[R1,Ri<<3] // AGEN cycle 3 write cycle 5
LOOP LT,R6,#1,R4 // cycle 3
OR
LDD LDd
LDD LDd ADD
ST STd
LOOP
LDD LDd
LDD LDd
ADD
ST STd
LOOP
10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM
machine !!
without code scheduling heroics.
40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM
machine !!
It all comes down to the carry propagation, right?
The way I understood the original code, you are doing a very wide
unsigned add, so you need a carry to propagate from each and every block
to the next, right?
If you can do that at half a clock cycle per 64 bit ADD, then consider
me very impressed!
Terje
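The serial carry chain both code sequences implement can be sketched in portable C (a stand-in for GMP-style mpn_add_n; the function name and layout below are illustrative, not GMP's actual API):

```c
#include <stdint.h>
#include <stddef.h>

/* Add two little-endian multi-word numbers: dst = a + b, n words each.
   Returns the final carry-out.  Each word's carry-in depends on the
   previous word's carry-out -- the serial dependence under discussion. */
static uint64_t bignum_add(uint64_t *dst, const uint64_t *a,
                           const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        carry = (s < carry);          /* carry out of a[i] + carry_in */
        dst[i] = s + b[i];
        carry += (dst[i] < s);        /* carry out of s + b[i] */
    }
    return carry;
}
```

The two branchless comparisons per word are what ADC (x86) or the CARRY prefix (My 66000) express directly in hardware.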
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
The idea is to add 32 bits to the processor state, one per register...
(though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
and 64-bit loads, but in addition to loading the value, set the tag bit
for the destination register. Non-floating-point loads would clear the
tag bit. As I show below, I don’t think you need any special "store
tag" instructions.
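A behavioral sketch of this tag mechanism in C (every name here is invented for illustration; none of it is part of any real My 66000 definition):

```c
#include <stdint.h>
#include <string.h>

/* Model of the proposed per-register FP tag bits: one bit per register,
   set by FP loads, cleared by all other loads.  Names are illustrative. */
typedef struct {
    uint64_t reg[32];
    uint32_t fp_tag;   /* bit r set => reg[r] holds a floating-point value */
} CpuState;

static void load_int64(CpuState *s, int r, uint64_t v)
{
    s->reg[r] = v;
    s->fp_tag &= ~(UINT32_C(1) << r);   /* non-FP load clears the tag */
}

static void load_f64(CpuState *s, int r, double v)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);     /* reinterpret, don't convert */
    s->reg[r] = bits;
    s->fp_tag |= UINT32_C(1) << r;      /* FP load sets the tag */
}

/* The decoder would pick the function unit from the source tags: */
static int is_fp(const CpuState *s, int r)
{
    return (s->fp_tag >> r) & 1;
}
```

A store needs no tag handling at all, which is why no "store tag" instruction is required: the 64-bit pattern goes to memory unchanged either way.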
But we can go further. There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions. And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
FF1, probably shifts. Given the tag bit, these could share the same
op-code. There may be several more of these.
Certainly makes reading disassembler output fun (or writing the disassembler).
That is as far as I got. I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad. Is it
worth it? To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use?
In an OoO CPU, that's pretty heavy.
But actually, your idea does not need any computation results for
determining the tag bits of registers (except during EXIT),
so you
probably can handle the tags in the front end (decoder and renamer).
Then the tags are really separate and not part of the registers that
have to be renamed, and you don't need to perform any waiting on
ENTER.
However, in EXIT the front end would have to wait for the result of
the load/store unit loading the 32 bits, unless you add a special
mechanism for that. So EXIT would become expensive, one way or the
other.
Stephen Fuld wrote:
There has been discussion here about the benefits of reducing the
number of op codes. One reason not mentioned before is if you have
fixed length instructions, you may want to leave as many codes as
possible available for future use. Of course, if you are doing a
16-bit instruction design, where instruction bits are especially
tight, you may save enough op-codes to save a bit, perhaps allowing a
larger register specifier field, or to allow more instructions in the
smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory. I worked through this idea
using the My 6600 as an example “substrate” for two reasons. First, it
has several features that are “friendly” to the idea. Second, I
know Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea. It is
certainly not fully worked out. I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.
If set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load
single floating and load double floating, which work the same as the
other 32- and 64-bit loads, but in addition to loading the value, set
the tag bit for the destination register. Non-floating-point loads
would clear the tag bit. As I show below, I don’t think you need any
special "store tag" instructions.
If you are adding a float/int data type flag you might as well
also add operand size for floats at least, though some ISA's
have both int32 and int64 ALU operations for result compatibility.
Currently the opcode data type can tell the uArch how to route
the operands internally without knowing the data values.
For example, FPU reservation stations monitor float operands
and schedule for just the FPU FADD or FMUL units.
Dynamic data typing would change that to be data dependent routing.
It means, for example, you can't begin to schedule a uOp
until you know all its operand types and opcode.
Looks like it makes such distributed decisions impossible.
Probably everything winds up in a big pile of logic in the center,
which might be problematic for those things whose complexity grows N^2.
Not sure how significant that is.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
[saving opcodes]
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.).
I don't think this would save a lot of opcode space, which
is the important thing.
A typical RISC design has a six-bit major opcode.
Having three registers takes away fifteen bits, leaving
eleven, which is far more than anybody would ever want as
minor opcode for arithmetic instructions. Compare with
https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
where DEC actually left out three bits because they did not
need them.
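That bit budget can be checked mechanically: a 32-bit instruction with a 6-bit major opcode and three 5-bit register specifiers leaves an 11-bit minor-opcode field, i.e. 2048 encodings per major opcode:

```c
/* Spelling out the bit budget for a typical 3-register RISC format. */
enum {
    INSN_BITS  = 32,
    MAJOR_BITS = 6,
    REG_BITS   = 3 * 5,                              /* rd, rs1, rs2 */
    MINOR_BITS = INSN_BITS - MAJOR_BITS - REG_BITS,  /* 11 bits remain */
    MINOR_OPS  = 1 << MINOR_BITS                     /* 2048 encodings */
};
```

Against 2048 spare encodings, saving 8-12 opcodes by tag-sharing buys very little, which is the point being made.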
On 4/3/2024 1:02 PM, Thomas Koenig wrote:
I think that is probably true for 32 bit instructions, but what about 16 bit?
BGB-Alt wrote:
Sorry. Typo. My 66000.
On 4/3/2024 11:43 AM, Stephen Fuld wrote:
There has been discussion here about the benefits of reducing the
number of op codes. One reason not mentioned before is if you have
fixed length instructions, you may want to leave as many codes as
possible available for future use. Of course, if you are doing a
16-bit instruction design, where instruction bits are especially
tight, you may save enough op-codes to save a bit, perhaps allowing a
larger register specifier field, or to allow more instructions in the
smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory. I worked through this idea
using the My 6600 as an example “substrate” for two reasons. First, it
has several features that are “friendly” to the idea. Second, I know
Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea. It is
certainly not fully worked out. I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.
If set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load
single floating and load double floating, which work the same as the
other 32- and 64-bit loads, but in addition to loading the value, set
the tag bit for the destination register. Non-floating-point loads
would clear the tag bit. As I show below, I don’t think you need any
special "store tag" instructions.
What do you do when you want a FP bit pattern interpreted as an integer,
or vice versa.
When executing arithmetic instructions, if the tag bits of both
sources of an instruction are the same, do the appropriate operation
(floating or integer), and set the tag bit of the result register
appropriately.
If the tag bits of the two sources are different, I see several
possibilities.
1. Generate an exception.
2. Use the sense of source 1 for the arithmetic operation, but
perform the appropriate conversion on the second operand first,
potentially saving an instruction
Conversions to/from FP often require a rounding mode. How do you specify that?
3. Always do the operation in floating point and convert the
integer operand prior to the operation. (Or, if you prefer, change
floating point to integer in the above description.)
4. Same as 2 or 3 above, but don’t do the conversions.
I suspect this is the least useful choice. I am not sure which is
the best option.
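The mismatch handling can be sketched as a pure decision on the two source tags; the version below implements option 2 above (the sense of source 1 wins, and a mismatched source 2 would first be converted). Purely illustrative:

```c
typedef enum { OP_INT, OP_FP } Unit;

/* Option 2: route to the unit named by source 1's tag; flag whether
   source 2 needs an int<->FP conversion first.  Illustrative only --
   note the conversion itself would still need a rounding mode. */
static Unit pick_unit(int tag_src1, int tag_src2, int *need_convert_src2)
{
    *need_convert_src2 = (tag_src1 != tag_src2);
    return tag_src1 ? OP_FP : OP_INT;
}
```

Option 1 (trap on mismatch) would instead raise an exception where this sets the conversion flag, and option 3 would always return OP_FP on any mismatch.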
Given that, use the same op code for the floating-point and fixed
versions of the same operations. So we can save eight op codes, the
four arithmetic operations, max, min, abs and compare. So far, a net
savings of six opcodes.
But we can go further. There are some opcodes that only make sense
for FP operands, e.g. the transcendental instructions. And there are
some operations that probably only make sense for non-FP operands,
e.g. POP, FF1, probably shifts. Given the tag bit, these could share
the same op-code. There may be several more of these.
Hands waving:: "Danger Will Robinson, Danger" more waving of hands.
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with
separate compilations? The called function probably doesn’t know the
The compiler will certainly have a function prototype. In any event, if FP
and Integers share a register file the lack of prototype is much less
stressful to the compiler/linking system.
tag value for callee saved registers. Fortunately, the My 66000
architecture comes to the rescue here. You would modify the Enter
and Exit instructions to save/restore the tag bits of the registers
they are saving or restoring in the same data structure it uses for
the registers (yes, it adds 32 bits to that structure – minimal
cost). The same mechanism works for interrupts that take control
away from a running process.
Yes, but we do just fine without the tag and without the stuff mentioned
above. Neither ENTER nor EXIT cares about the 64-bit pattern in the
register.
I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some
other instructions to do this, without requiring another op-code.
For example, Oring a register with itself could be used to set the
tag bit and Oring a register with zero could clear it. These should
be pretty rare.
That is as far as I got. I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad. Is it
worth it?
No.
To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading
their respective tag bits before knowing which FU to use?
The problem is you have made decode dependent on dynamic pipeline
information. I suggest you don't want to do that. Consider a change from
int to FP instruction as a predicated instruction, so the pipeline cannot
DECODE the instruction at hand until the predicate resolves. Yech.
On 4/3/2024 11:44 AM, EricP wrote:
If you are adding a float/int data type flag you might as well
also add operand size for floats at least, though some ISA's
have both int32 and int64 ALU operations for result compatibility.
Not needed for My 66000, as all floating point loads convert the loaded value to double precision.
John Savard <quadibloc@servername.invalid> schrieb:
Well, when the computer fetches a 256-bit block of code, the first
four bits indicates whether it is composed of 36-bit instructions or
28-bit instructions.
Do you think that instructions which require a certain size (almost)
always happen to be situated together so they fit in a block?
So, instead of using the branch target address, one rounds it down to
a 256-bit boundary, reads 256-bits and looks at the first 4-bits to
determine the format, and then uses the branch offset to pick a
container which will become the first instruction executed.
Sounds more complicated than necessary.
Scott Lurndal wrote:
[snip]
It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches. Likewise memset.
Effectively, that is what HW does, even on the lower end machines,
the AGEN unit of the Cache access pipeline is repeatedly cycled,
and data is read and/or written. One can execute instructions not
needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
are in progress.
Moving this sequencer farther out would still require it to consume
all L1 BW in any event (snooping) for memory consistency reasons.
{Note: cache accesses are performed line-wide not register width
wide}
[snip]
On 4/9/24 8:28 PM, MitchAlsup1 wrote:
MMs and MSs that do not cross page boundaries are ATOMIC. The
entire system
sees only the before or only the after state and nothing in
between.
One might wonder how that atomicity is guaranteed in an
SMP processor...