Been thinking some about the carry and overflow and what to do about
register spills and reloads during expression processing. My thought was
that on the machine with 256 registers, simply allocate a ridiculous
number of registers for expression processing, for example 25 or even
50. Then if the expression is too complex, have the compiler spit out an error message to the programmer to simplify the expression. Remnants of
the ‘expression too complex’ error in BASIC.
On Wed, 9 Oct 2024 10:44:08 +0000, Robert Finch wrote:
Been thinking some about the carry and overflow and what to do about
register spills and reloads during expression processing. My thought was
that on the machine with 256 registers, simply allocate a ridiculous
number of registers for expression processing, for example 25 or even
50. Then if the expression is too complex, have the compiler spit out an
error message to the programmer to simplify the expression. Remnants of
the ‘expression too complex’ error in BASIC.
Both completely unacceptable, and in your case completely unnecessary.
in 967 subroutines I read out of My 66000 LLVM compile, I only have
3 cases of spill-fill, and that is with only 32 registers with uni-
versal constants.
Of the RISC-V code I read alongside with 32+32 registers, I counted 8.
With those statistics and 256 registers, If you can't get to essentially
0 spill=fill the problem is not with your architecture but with your compiler.
On 2024-10-05 5:43 a.m., Anton Ertl wrote:
Robert Finch <robfi680@gmail.com> writes:
On 2024-10-04 2:19 a.m., Anton Ertl wrote:...
4) Keep the flags results along with GPRs: have carry and overflow as
bit 64 and 65, N is bit 63, and Z tells something about bits 0-63.
The advantage is that you do not have to track the flags separately
(and, in case of AMD64, track each of C, O, and NZP separately), but
instead can use the RAT that is already there for the GPRs. You can
find a preliminary paper on that on
<https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
One solution, not mentioned in your article, is to support arithmetic
with two bits less than the number of bit a register can support, so
that the carry and overflow can be stored. On a 64-bit machine have all
operations use only 62-bits. It would solve the issue of how to load or
store the carry and overflow bits associated with a register.
Yes, that's a solution, but the question is how well existing software
would react to having no int64_t (and equivalent types, such as long
long), but instead an int62_t (or maybe int63_t, if the 64th bit is
used for both signed and unsigned overflow, by having separate signed
and unsigned addition etc.). I expect that such an architecture would
have low acceptance. By contrast, in my paper I suggest an addition
to existing 64-bit architectures that has fewer of the same
disadvantages as the widely-used condition-code-register approach has,
but still has a few of them.
Sometimes
arithmetic is performed with fewer bits, as for pointer representation.
I wonder if pointer masking could somehow be involved. It may be useful
to have a bit indicating the presence of a pointer. Also thinking of how >>> to track a binary point position for fixed point arithmetic. Perhaps
using the whole upper byte of a register for status/control bits
would work.
There are some extensions for AMD64 in that direction.
It may be possible with Q+ to support a second destination register
which is in a subset of the GPRs. For example, one of eight registers
could be specified to holds the carry/overflow status. That effectively
ties up a second ALU though as an extra write port is needed for the
instruction.
Needing only one write port is an advantage of my approach.
- anton
Been thinking some about the carry and overflow and what to do about register spills and reloads during expression processing. My thought was that on the machine with 256 registers, simply allocate a ridiculous
number of registers for expression processing, for example 25 or even
50. Then if the expression is too complex, have the compiler spit out an error message to the programmer to simplify the expression. Remnants of
the ‘expression too complex’ error in BASIC. So, there are no spills or reloads during expression processing. I think the storextra / loadextra registers used during context switching would work okay. But in Q+ there
are 256 regs which require eight storextra / loadextra registers. I
think the store extra / load extra registers could be hidden in the
context save and restore hardware. Not even requiring access via CSRs or whatever. I suppose context loads and stores could be done in blocks of
32 registers. An issue is that the load extra needs to be done before registers are loaded. So, the extra word full of carry/overflow bits
would need to be fetched in a non-sequential fashion. Assuming for
instance, that saving register values is followed by a save of the CO
word. Then it is positioned wrong for a sequential load. It may be
better to have the wrong position for a store, so loads can proceed sequentially.
It strikes me that there is no real good solution, only perhaps an engineered one. Toyed with the idea of having 16 separate flags
registers, but not liking that as a solution as much as the store/load extra.
Another thought is to store additional info such as a CRC check of the register file on context save and restore.
*****
Finally wrote the SM to walk the ROB backwards and restore register
mappings for a checkpoint restore. Cannot get Q+ to do more than light
up one LED in SIM. Register values are not propagating properly.
On 2024-10-09 6:44 a.m., Robert Finch wrote:
Mulled over carry and overflow in arithmetic operations. Looked at
widening the datapath to 66-bits to hold carry and overflow bits.
Thinking it may increase the size of the design by over 3% just to
support carry and overflow. For now, an instruction, ADDGC, was added to generate the carry bit as a result. A 256-bit add looks like:
; 256 bit add
; A = r1,r2,r3,r4
; B = r5,r6,r7,r8
; S = r9,r10,r11,r12
add r9,r1,r5,r0
addgc r13,r1,r5,r0
add r10,r2,r6,r13
addgc r13,r2,r6,r13
add r11,r7,r3,r13
addgc r13,r7,r3,r13
add r12,r8,r4,r13
Not very elegant a solution, but it is simple. I think it requires--- Synchronet 3.20a-Linux NewsLink 1.114
minimal hardware. Three input ADD is already present and ADDGC just
routes the carry bit to the output.
On Wed, 9 Oct 2024 10:44:08 +0000, Robert Finch wrote:
Been thinking some about the carry and overflow and what to do about
register spills and reloads during expression processing. My thought was
that on the machine with 256 registers, simply allocate a ridiculous
number of registers for expression processing, for example 25 or even
50. Then if the expression is too complex, have the compiler spit out an
error message to the programmer to simplify the expression. Remnants of
the ‘expression too complex’ error in BASIC.
Both completely unacceptable, and in your case completely unnecessary.
in 967 subroutines I read out of My 66000 LLVM compile, I only have
3 cases of spill-fill, and that is with only 32 registers with uni-
versal constants.
Of the RISC-V code I read alongside with 32+32 registers, I counted 8.
With those statistics and 256 registers, If you can't get to essentially
0 spill=fill the problem is not with your architecture but with your compiler.
On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:
On 2024-10-09 6:44 a.m., Robert Finch wrote:
Mulled over carry and overflow in arithmetic operations. Looked at
widening the datapath to 66-bits to hold carry and overflow bits.
Thinking it may increase the size of the design by over 3% just to
support carry and overflow. For now, an instruction, ADDGC, was added to
generate the carry bit as a result. A 256-bit add looks like:
; 256 bit add
; A = r1,r2,r3,r4
; B = r5,r6,r7,r8
; S = r9,r10,r11,r12
add r9,r1,r5,r0
addgc r13,r1,r5,r0
add r10,r2,r6,r13
addgc r13,r2,r6,r13
add r11,r7,r3,r13
addgc r13,r7,r3,r13
add r12,r8,r4,r13
My 66000 version::
CARRY R8,{{IO}{IO}{IO}{O}}
ADD R4,R12,R16
ADD R5,R13,R17
ADD R6,R14,R18
ADD R7,R15,R19
// R{8,7,6,5,4} contain the 257-bit result.
256-bit add giving 257-bit result.
Not very elegant a solution, but it is simple. I think it requires
minimal hardware. Three input ADD is already present and ADDGC just
routes the carry bit to the output.
On 10/12/2024 1:50 PM, MitchAlsup1 wrote:
On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:
On 2024-10-09 6:44 a.m., Robert Finch wrote:
Mulled over carry and overflow in arithmetic operations. Looked at
widening the datapath to 66-bits to hold carry and overflow bits.
Thinking it may increase the size of the design by over 3% just to
support carry and overflow. For now, an instruction, ADDGC, was added to >>> generate the carry bit as a result. A 256-bit add looks like:
; 256 bit add
; A = r1,r2,r3,r4
; B = r5,r6,r7,r8
; S = r9,r10,r11,r12
add r9,r1,r5,r0
addgc r13,r1,r5,r0
add r10,r2,r6,r13
addgc r13,r2,r6,r13
add r11,r7,r3,r13
addgc r13,r7,r3,r13
add r12,r8,r4,r13
My 66000 version::
CARRY R8,{{IO}{IO}{IO}{O}}
ADD R4,R12,R16
ADD R5,R13,R17
ADD R6,R14,R18
ADD R7,R15,R19
// R{8,7,6,5,4} contain the 257-bit result.
256-bit add giving 257-bit result.
BJX2 / XG2, assuming in-register (A/D=R4..R7, B=R20..R23):
CLRT
ADDC R20, R4
ADDC R21, R5
ADDC R22, R6
ADDC R23, R7
Or, D=R16..R19
MOV.X R4, R16
MOV.X R6, R18
CLRT
ADDC R20, R16
ADDC R21, R17
ADDC R22, R18
ADDC R23, R19
ADDC is itself mostly a holdover from SH.
Could almost make sense to make it have a 3R form though and move it to updating SR.S instead, since SR.T is likely better left exclusively to predication (vs mostly predication, and obscure edge-case ops like ADDC/ SUBC/ROTCL/...).
Could almost add an ADDC.X op which operates 128 bits at a time, say:
CLRT
ADDC.X R4, R20, R16
ADDC.X R6, R22, R18
Except that it would be rarely used enough to make its existence
debatable at best.
Not very elegant a solution, but it is simple. I think it requires
minimal hardware. Three input ADD is already present and ADDGC just
routes the carry bit to the output.
On 2024-10-12 4:14 p.m., BGB wrote:
On 10/12/2024 1:50 PM, MitchAlsup1 wrote:BJX2 / XG2 has destroys the value of the one source operand, I noted the extra code to preserve the one operand. Is that only for the ADDC instruction?
On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:
On 2024-10-09 6:44 a.m., Robert Finch wrote:
Mulled over carry and overflow in arithmetic operations. Looked at
widening the datapath to 66-bits to hold carry and overflow bits.
Thinking it may increase the size of the design by over 3% just to
support carry and overflow. For now, an instruction, ADDGC, was added to >>>> generate the carry bit as a result. A 256-bit add looks like:
; 256 bit add
; A = r1,r2,r3,r4
; B = r5,r6,r7,r8
; S = r9,r10,r11,r12
add r9,r1,r5,r0
addgc r13,r1,r5,r0
add r10,r2,r6,r13
addgc r13,r2,r6,r13
add r11,r7,r3,r13
addgc r13,r7,r3,r13
add r12,r8,r4,r13
My 66000 version::
CARRY R8,{{IO}{IO}{IO}{O}}
ADD R4,R12,R16
ADD R5,R13,R17
ADD R6,R14,R18
ADD R7,R15,R19
// R{8,7,6,5,4} contain the 257-bit result.
256-bit add giving 257-bit result.
BJX2 / XG2, assuming in-register (A/D=R4..R7, B=R20..R23):
CLRT
ADDC R20, R4
ADDC R21, R5
ADDC R22, R6
ADDC R23, R7
Or, D=R16..R19
MOV.X R4, R16
MOV.X R6, R18
CLRT
ADDC R20, R16
ADDC R21, R17
ADDC R22, R18
ADDC R23, R19
ADDC is itself mostly a holdover from SH.
Could almost make sense to make it have a 3R form though and move it to
updating SR.S instead, since SR.T is likely better left exclusively to
predication (vs mostly predication, and obscure edge-case ops like ADDC/
SUBC/ROTCL/...).
Could almost add an ADDC.X op which operates 128 bits at a time, say:
CLRT
ADDC.X R4, R20, R16
ADDC.X R6, R22, R18
Except that it would be rarely used enough to make its existence
debatable at best.
Not very elegant a solution, but it is simple. I think it requires
minimal hardware. Three input ADD is already present and ADDGC just
routes the carry bit to the output.
What is the limit on the My66000 CARRY modifier for the number of
carries? Assuming the sequence is interruptible there must be a few bits
of state that need to be preserved.
I found incorporating modifiers have a tendency to turn my code into spaghetti. Maybe my grasp of implementation is not so great though.
The add, addgc can execute at the same time. So, it is 4 clocks at the--- Synchronet 3.20a-Linux NewsLink 1.114
worst to add two 256-bit numbers. (The first / last instructions may
execute at the same time as other instructions).
I wanted to avoid using instruction modifiers and special flags
registers as much as possible. It is somewhat tricky to have a carry
flag in flight. Q+ is not very code dense, but the add can be done. It
is also possible to put the carry bit in a predicate register.
On 2024-10-12 4:14 p.m., BGB wrote:
On 10/12/2024 1:50 PM, MitchAlsup1 wrote:BJX2 / XG2 has destroys the value of the one source operand, I noted the extra code to preserve the one operand. Is that only for the ADDC instruction?
On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:
On 2024-10-09 6:44 a.m., Robert Finch wrote:
Mulled over carry and overflow in arithmetic operations. Looked at
widening the datapath to 66-bits to hold carry and overflow bits.
Thinking it may increase the size of the design by over 3% just to
support carry and overflow. For now, an instruction, ADDGC, was
added to
generate the carry bit as a result. A 256-bit add looks like:
; 256 bit add
; A = r1,r2,r3,r4
; B = r5,r6,r7,r8
; S = r9,r10,r11,r12
add r9,r1,r5,r0
addgc r13,r1,r5,r0
add r10,r2,r6,r13
addgc r13,r2,r6,r13
add r11,r7,r3,r13
addgc r13,r7,r3,r13
add r12,r8,r4,r13
My 66000 version::
CARRY R8,{{IO}{IO}{IO}{O}}
ADD R4,R12,R16
ADD R5,R13,R17
ADD R6,R14,R18
ADD R7,R15,R19
// R{8,7,6,5,4} contain the 257-bit result.
256-bit add giving 257-bit result.
BJX2 / XG2, assuming in-register (A/D=R4..R7, B=R20..R23):
CLRT
ADDC R20, R4
ADDC R21, R5
ADDC R22, R6
ADDC R23, R7
Or, D=R16..R19
MOV.X R4, R16
MOV.X R6, R18
CLRT
ADDC R20, R16
ADDC R21, R17
ADDC R22, R18
ADDC R23, R19
ADDC is itself mostly a holdover from SH.
Could almost make sense to make it have a 3R form though and move it
to updating SR.S instead, since SR.T is likely better left exclusively
to predication (vs mostly predication, and obscure edge-case ops like
ADDC/ SUBC/ROTCL/...).
Could almost add an ADDC.X op which operates 128 bits at a time, say:
CLRT
ADDC.X R4, R20, R16
ADDC.X R6, R22, R18
Except that it would be rarely used enough to make its existence
debatable at best.
Not very elegant a solution, but it is simple. I think it requires
minimal hardware. Three input ADD is already present and ADDGC just
routes the carry bit to the output.
What is the limit on the My66000 CARRY modifier for the number of
carries? Assuming the sequence is interruptible there must be a few bits
of state that need to be preserved.
I found incorporating modifiers have a tendency to turn my code into spaghetti. Maybe my grasp of implementation is not so great though.
The add, addgc can execute at the same time. So, it is 4 clocks at the
worst to add two 256-bit numbers. (The first / last instructions may
execute at the same time as other instructions).
I wanted to avoid using instruction modifiers and special flags
registers as much as possible. It is somewhat tricky to have a carry
flag in flight. Q+ is not very code dense, but the add can be done. It
is also possible to put the carry bit in a predicate register.
BJX2 / XG2 has destroys the value of the one source operand, I noted the
extra code to preserve the one operand. Is that only for the ADDC
instruction?
What is the limit on the My66000 CARRY modifier for the number of
carries? Assuming the sequence is interruptible there must be a few bits
of state that need to be preserved.
CARRY casts its modification over 8 subsequent instructions using its
16-bit immediate.
I found incorporating modifiers have a tendency to turn my code into
spaghetti. Maybe my grasp of implementation is not so great though.
DECODE has a shift register to attach 2-bits to subsequent instructions
each. However, the Rd provided by CARRY carries 64-bits from instruction
to instruction--which makes 256×64 -bit multiplication straightforward.
The add, addgc can execute at the same time. So, it is 4 clocks at the
worst to add two 256-bit numbers. (The first / last instructions may
execute at the same time as other instructions).
I wanted to avoid using instruction modifiers and special flags
registers as much as possible. It is somewhat tricky to have a carry
flag in flight. Q+ is not very code dense, but the add can be done. It
is also possible to put the carry bit in a predicate register.
Been thinking some about the carry and overflow and what to do about >register spills and reloads during expression processing. My thought was >that on the machine with 256 registers, simply allocate a ridiculous
number of registers for expression processing, for example 25 or even
50. Then if the expression is too complex, have the compiler spit out an >error message to the programmer to simplify the expression. Remnants of
the ‘expression too complex’ error in BASIC. So, there are no spills or >reloads during expression processing.
I think the storextra / loadextra
registers used during context switching would work okay. But in Q+ there
are 256 regs which require eight storextra / loadextra registers. I
think the store extra / load extra registers could be hidden in the
context save and restore hardware. Not even requiring access via CSRs or >whatever.
I suppose context loads and stores could be done in blocks of
32 registers. An issue is that the load extra needs to be done before >registers are loaded.
Another thought is to store additional info such as a CRC check of the >register file on context save and restore.
BJX2 / XG2 has destroys the value of the one source operand, I noted the >>> extra code to preserve the one operand. Is that only for the ADDC
instruction?
What is the limit on the My66000 CARRY modifier for the number of
carries? Assuming the sequence is interruptible there must be a few bits >>> of state that need to be preserved.
CARRY casts its modification over 8 subsequent instructions using its
16-bit immediate.
I found incorporating modifiers have a tendency to turn my code into
spaghetti. Maybe my grasp of implementation is not so great though.
DECODE has a shift register to attach 2-bits to subsequent instructions
each. However, the Rd provided by CARRY carries 64-bits from instruction
to instruction--which makes 256×64 -bit multiplication straightforward.
Q+ has something using a similar approach, the ATOM instruction, which
sets the interrupt priority level for the next 11 instructions. It
shifts three bits per instruction at a time at the enqueue stage when
the instruction group is loaded into the ROB. The shift should maybe be
moved back to decode. It is a bit of spaghetti code ATM. I suspect could
be implemented better. The idea is ATOM allows temporarily disabling interrupts and automatically restoring the interrupt level to what it
was after a certain number of instructions.
I found writing code I was disabling then enabling interrupts at various points, which was tricky to do as the original interrupt status needed
to be recorded and restored. It took several instructions. Looking for a cleaner solution.
atom “77777” ; disable all but non-maskable interrupts
< instr. >
Currently, all Q+ instructions have only a single write port max. To use
two ports means using two ALUs at the same time, which would serialize
the machine. I think the CARRY modifier requires two write ports. The quad-float extender prefix (QFEXT) allows 128-bit floats by using an FPU
and ALU port at the same time.
There were a couple of other modifiers, PRED and ROUND, but they got--- Synchronet 3.20a-Linux NewsLink 1.114
removed as they were not needed when the instructions were enlarged to 64-bit. PRED is just a predicate register spec in every instruction now.
The add, addgc can execute at the same time. So, it is 4 clocks at the
worst to add two 256-bit numbers. (The first / last instructions may
execute at the same time as other instructions).
I wanted to avoid using instruction modifiers and special flags
registers as much as possible. It is somewhat tricky to have a carry
flag in flight. Q+ is not very code dense, but the add can be done. It
is also possible to put the carry bit in a predicate register.
On 2024-10-09 6:44 a.m., Robert Finch wrote:
On 2024-10-05 5:43 a.m., Anton Ertl wrote:Mulled over carry and overflow in arithmetic operations. Looked at
Robert Finch <robfi680@gmail.com> writes:
On 2024-10-04 2:19 a.m., Anton Ertl wrote:...
4) Keep the flags results along with GPRs: have carry and overflow as >>>>> bit 64 and 65, N is bit 63, and Z tells something about bits 0-63.
The advantage is that you do not have to track the flags separately
(and, in case of AMD64, track each of C, O, and NZP separately), but >>>>> instead can use the RAT that is already there for the GPRs. You can >>>>> find a preliminary paper on that on
<https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
One solution, not mentioned in your article, is to support arithmetic
with two bits less than the number of bit a register can support, so
that the carry and overflow can be stored. On a 64-bit machine have all >>>> operations use only 62-bits. It would solve the issue of how to load or >>>> store the carry and overflow bits associated with a register.
Yes, that's a solution, but the question is how well existing software
would react to having no int64_t (and equivalent types, such as long
long), but instead an int62_t (or maybe int63_t, if the 64th bit is
used for both signed and unsigned overflow, by having separate signed
and unsigned addition etc.). I expect that such an architecture would
have low acceptance. By contrast, in my paper I suggest an addition
to existing 64-bit architectures that has fewer of the same
disadvantages as the widely-used condition-code-register approach has,
but still has a few of them.
Sometimes
arithmetic is performed with fewer bits, as for pointer representation. >>>> I wonder if pointer masking could somehow be involved. It may be useful >>>> to have a bit indicating the presence of a pointer. Also thinking of
how
to track a binary point position for fixed point arithmetic. Perhaps
using the whole upper byte of a register for status/control bits
would work.
There are some extensions for AMD64 in that direction.
It may be possible with Q+ to support a second destination register
which is in a subset of the GPRs. For example, one of eight registers
could be specified to holds the carry/overflow status. That effectively >>>> ties up a second ALU though as an extra write port is needed for the
instruction.
Needing only one write port is an advantage of my approach.
- anton
Been thinking some about the carry and overflow and what to do about
register spills and reloads during expression processing. My thought
was that on the machine with 256 registers, simply allocate a
ridiculous number of registers for expression processing, for example
25 or even 50. Then if the expression is too complex, have the
compiler spit out an error message to the programmer to simplify the
expression. Remnants of the ‘expression too complex’ error in BASIC.
So, there are no spills or reloads during expression processing. I
think the storextra / loadextra registers used during context
switching would work okay. But in Q+ there are 256 regs which require
eight storextra / loadextra registers. I think the store extra / load
extra registers could be hidden in the context save and restore
hardware. Not even requiring access via CSRs or whatever. I suppose
context loads and stores could be done in blocks of 32 registers. An
issue is that the load extra needs to be done before registers are
loaded. So, the extra word full of carry/overflow bits would need to
be fetched in a non-sequential fashion. Assuming for instance, that
saving register values is followed by a save of the CO word. Then it
is positioned wrong for a sequential load. It may be better to have
the wrong position for a store, so loads can proceed sequentially.
It strikes me that there is no real good solution, only perhaps an
engineered one. Toyed with the idea of having 16 separate flags
registers, but not liking that as a solution as much as the store/load
extra.
Another thought is to store additional info such as a CRC check of the
register file on context save and restore.
*****
Finally wrote the SM to walk the ROB backwards and restore register
mappings for a checkpoint restore. Cannot get Q+ to do more than light
up one LED in SIM. Register values are not propagating properly.
widening the datapath to 66-bits to hold carry and overflow bits.
Thinking it may increase the size of the design by over 3% just to
support carry and overflow. For now, an instruction, ADDGC, was added to generate the carry bit as a result. A 256-bit add looks like:
; 256 bit add
; A = r1,r2,r3,r4
; B = r5,r6,r7,r8
; S = r9,r10,r11,r12
add r9,r1,r5,r0
addgc r13,r1,r5,r0
add r10,r2,r6,r13
addgc r13,r2,r6,r13
add r11,r7,r3,r13
addgc r13,r7,r3,r13
add r12,r8,r4,r13
Not very elegant a solution, but it is simple. I think it requires
minimal hardware. Three input ADD is already present and ADDGC just
routes the carry bit to the output.
Thinking about organizing a cache controller to fetch an entire 4kB page
of memory at a time on a cache miss. The reason being the memory system
is tremendously faster than the CPU clock as long as burst mode is used.
The longer the burst, the better. The entire 4kB page can be transferred
in < 40 CPU clocks. It takes about 4 CPU clocks to fetch one cache line. Cache is still needed as the memory latency prevents its direct use.
Sysop: | DaiTengu |
---|---|
Location: | Appleton, WI |
Users: | 991 |
Nodes: | 10 (0 / 10) |
Uptime: | 119:21:58 |
Calls: | 12,958 |
Files: | 186,574 |
Messages: | 3,265,634 |