• Re: Tonights Tradeoff - Background Execution Buffers

    From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Wed Oct 9 16:19:41 2024
    From Newsgroup: comp.arch

    On Wed, 9 Oct 2024 10:44:08 +0000, Robert Finch wrote:


    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought was
    that on the machine with 256 registers, simply allocate a ridiculous
    number of registers for expression processing, for example 25 or even
    50. Then if the expression is too complex, have the compiler spit out an
    error message to the programmer to simplify the expression. Remnants of
    the ‘expression too complex’ error in BASIC.

    Both completely unacceptable, and in your case completely unnecessary.
    In 967 subroutines I read out of My 66000 LLVM compile, I only have
    3 cases of spill-fill, and that is with only 32 registers with
    universal constants.

    Of the RISC-V code I read alongside with 32+32 registers, I counted 8.

    With those statistics and 256 registers, if you can't get to essentially
    0 spill-fill, the problem is not with your architecture but with your
    compiler.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 9 15:37:26 2024
    From Newsgroup: comp.arch

    On 2024-10-09 12:19 p.m., MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 10:44:08 +0000, Robert Finch wrote:


    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought was
    that on the machine with 256 registers, simply allocate a ridiculous
    number of registers for expression processing, for example 25 or even
    50. Then if the expression is too complex, have the compiler spit out an
    error message to the programmer to simplify the expression. Remnants of
    the ‘expression too complex’ error in BASIC.

    Both completely unacceptable, and in your case completely unnecessary.
    In 967 subroutines I read out of My 66000 LLVM compile, I only have
    3 cases of spill-fill, and that is with only 32 registers with
    universal constants.

    Of the RISC-V code I read alongside with 32+32 registers, I counted 8.

    With those statistics and 256 registers, if you can't get to essentially
    0 spill-fill, the problem is not with your architecture but with your
    compiler.

    Yes, that is sort of what I was thinking. The compiler does not generate
    very many spills and fills, using just 32 regs (possibly 10 temps + 9
    saved). Most functions do not have any spills or fills. Using just a
    few more registers might virtually guarantee they never happen. Or make
    the number of registers used a compiler option, in case there is a
    function without enough registers. I suppose the compiler could keep
    increasing the number of registers it uses until it has enough.
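
    The "keep increasing the number of registers until it has enough" idea
    can be sketched as a loop around a toy allocator. This is only a
    minimal Python sketch, assuming live ranges are already known as
    half-open (start, end) intervals. The names spills_with and
    smallest_budget are hypothetical and not taken from any real compiler.

```python
def spills_with(intervals, num_regs):
    """Count values that get no register under a simple linear scan.

    intervals: list of (start, end) half-open live ranges.
    Toy model: a value that finds no free register is counted as a spill
    and is never given a register later.
    """
    active = []  # end points of the values currently holding a register
    spills = 0
    for start, end in sorted(intervals):
        # Release registers whose values are dead by 'start'.
        active = [e for e in active if e > start]
        if len(active) < num_regs:
            active.append(end)
        else:
            spills += 1
    return spills


def smallest_budget(intervals, start_regs=1, max_regs=256):
    """Grow the register budget until the toy allocator reports no spills."""
    for n in range(start_regs, max_regs + 1):
        if spills_with(intervals, n) == 0:
            return n
    return None  # does not fit even with max_regs registers


ranges = [(0, 4), (1, 3), (2, 6), (5, 8), (7, 9)]
print(smallest_budget(ranges))
```

    A real allocator would pick which value to spill and would split live
    ranges; the sketch only shows the outer "retry with a larger budget"
    loop.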

    Not spilling and filling was to get around the issue of having to save
    extra carry-overflow bits for a register. So, maybe allowing the
    compiler to spill and fill is possible if CO bits are not needed for the
    app. So, one might have either compact and fast extended precision
    arithmetic or complex arithmetic expressions. Could make the compiler
    smart enough to detect the situation.



  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Oct 12 05:38:01 2024
    From Newsgroup: comp.arch

    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    On 2024-10-05 5:43 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-10-04 2:19 a.m., Anton Ertl wrote:
    4) Keep the flags results along with GPRs: have carry and overflow as
    bit 64 and 65, N is bit 63, and Z tells something about bits 0-63.
    The advantage is that you do not have to track the flags separately
    (and, in case of AMD64, track each of C, O, and NZP separately), but
    instead can use the RAT that is already there for the GPRs.  You can
    find a preliminary paper on that on
    <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
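
    The layout described in point 4 (carry in bit 64, overflow in bit 65)
    can be illustrated for a single add. This is a minimal Python sketch
    under that reading of the text; the function name is hypothetical and
    the paper itself may define further details (such as how Z is derived)
    that are not modelled here.

```python
MASK64 = (1 << 64) - 1
SIGN = 1 << 63


def add_with_flags(a, b):
    """Add two 64-bit values; return a 66-bit value with C in bit 64, V in bit 65."""
    a &= MASK64
    b &= MASK64
    total = a + b
    result = total & MASK64
    carry = total >> 64
    # Signed overflow: both inputs have the same sign and the result differs.
    overflow = 1 if ((a ^ result) & (b ^ result) & SIGN) else 0
    return result | (carry << 64) | (overflow << 65)


wide = add_with_flags(MASK64, 1)
print(hex(wide))
```

    Because the flags travel with the value, renaming the destination
    register renames the flags too, which is the point made about reusing
    the RAT.
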
    ...
    One solution, not mentioned in your article, is to support arithmetic
    with two bits fewer than the number of bits a register can hold, so
    that the carry and overflow can be stored. On a 64-bit machine have all
    operations use only 62 bits. It would solve the issue of how to load or
    store the carry and overflow bits associated with a register.

    Yes, that's a solution, but the question is how well existing software
    would react to having no int64_t (and equivalent types, such as long
    long), but instead an int62_t (or maybe int63_t, if the 64th bit is
    used for both signed and unsigned overflow, by having separate signed
    and unsigned addition etc.).  I expect that such an architecture would
    have low acceptance.  By contrast, in my paper I suggest an addition
    to existing 64-bit architectures that has fewer of the disadvantages
    of the widely-used condition-code-register approach, but still has a
    few of them.

    Sometimes arithmetic is performed with fewer bits, as for pointer
    representation. I wonder if pointer masking could somehow be involved.
    It may be useful to have a bit indicating the presence of a pointer.
    Also thinking of how to track a binary point position for fixed point
    arithmetic. Perhaps using the whole upper byte of a register for
    status/control bits would work.

    There are some extensions for AMD64 in that direction.

    It may be possible with Q+ to support a second destination register
    which is in a subset of the GPRs. For example, one of eight registers
    could be specified to hold the carry/overflow status. That effectively
    ties up a second ALU, though, as an extra write port is needed for the
    instruction.

    Needing only one write port is an advantage of my approach.

    - anton

    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought was
    that on the machine with 256 registers, simply allocate a ridiculous
    number of registers for expression processing, for example 25 or even
    50. Then if the expression is too complex, have the compiler spit out an
    error message to the programmer to simplify the expression. Remnants of
    the ‘expression too complex’ error in BASIC. So, there are no spills or
    reloads during expression processing. I think the storextra / loadextra
    registers used during context switching would work okay. But in Q+ there
    are 256 regs which require eight storextra / loadextra registers. I
    think the storextra / loadextra registers could be hidden in the
    context save and restore hardware, not even requiring access via CSRs or
    whatever. I suppose context loads and stores could be done in blocks of
    32 registers. An issue is that the loadextra needs to be done before the
    registers are loaded. So, the extra word full of carry/overflow bits
    would need to be fetched in a non-sequential fashion. Assuming, for
    instance, that saving register values is followed by a save of the CO
    word, then it is positioned wrong for a sequential load. It may be
    better to have the wrong position for a store, so loads can proceed
    sequentially.

    It strikes me that there is no really good solution, only perhaps an
    engineered one. Toyed with the idea of having 16 separate flags
    registers, but not liking that as a solution as much as the
    storextra / loadextra.

    Another thought is to store additional info such as a CRC check of the
    register file on context save and restore.

    *****

    Finally wrote the SM to walk the ROB backwards and restore register
    mappings for a checkpoint restore. Cannot get Q+ to do more than light
    up one LED in SIM. Register values are not propagating properly.


    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to
    generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

    add r9,r1,r5,r0
    addgc r13,r1,r5,r0
    add r10,r2,r6,r13
    addgc r13,r2,r6,r13
    add r11,r7,r3,r13
    addgc r13,r7,r3,r13
    add r12,r8,r4,r13

    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.
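
    The add / addgc pairing above can be modelled in a few lines. This is
    a minimal Python sketch, assuming 64-bit limbs stored least significant
    first and a third operand that is always 0 or 1. The helper names
    (add3, addgc, add256, to_limbs, from_limbs) are hypothetical and only
    mirror the sequence shown; unlike that sequence, the sketch also returns
    the carry out of the top limb.

```python
MASK64 = (1 << 64) - 1


def add3(a, b, c):
    """Three-input ADD: low 64 bits of a + b + c."""
    return (a + b + c) & MASK64


def addgc(a, b, c):
    """ADDGC as described: the carry out of the same three-input add."""
    return (a + b + c) >> 64


def add256(a_limbs, b_limbs):
    """Add two 256-bit values given as four 64-bit limbs each."""
    carry = 0  # plays the role of r0 on the first step, r13 afterwards
    out = []
    for a, b in zip(a_limbs, b_limbs):
        out.append(add3(a, b, carry))
        carry = addgc(a, b, carry)
    return out, carry


def to_limbs(x, n=4):
    """Split an integer into n 64-bit limbs, least significant first."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]


def from_limbs(limbs):
    """Reassemble an integer from 64-bit limbs, least significant first."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


limbs, carry_out = add256(to_limbs((1 << 256) - 1), to_limbs(1))
print(limbs, carry_out)
```

    Each add3 / addgc pair reads the same three inputs, which is why the
    two instructions of a pair can issue together.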


  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Oct 12 18:50:35 2024
    From Newsgroup: comp.arch

    On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:

    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to
    generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

    add r9,r1,r5,r0
    addgc r13,r1,r5,r0
    add r10,r2,r6,r13
    addgc r13,r2,r6,r13
    add r11,r7,r3,r13
    addgc r13,r7,r3,r13
    add r12,r8,r4,r13

    My 66000 version::

    CARRY R8,{{IO}{IO}{IO}{O}}
    ADD R4,R12,R16
    ADD R5,R13,R17
    ADD R6,R14,R18
    ADD R7,R15,R19
    // R{8,7,6,5,4} contain the 257-bit result.

    256-bit add giving 257-bit result.

    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 12 14:10:01 2024
    From Newsgroup: comp.arch

    On 10/9/2024 11:19 AM, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 10:44:08 +0000, Robert Finch wrote:


    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought was
    that on the machine with 256 registers, simply allocate a ridiculous
    number of registers for expression processing, for example 25 or even
    50. Then if the expression is too complex, have the compiler spit out an
    error message to the programmer to simplify the expression. Remnants of
    the ‘expression too complex’ error in BASIC.

    Both completely unacceptable, and in your case completely unnecessary.
    In 967 subroutines I read out of My 66000 LLVM compile, I only have
    3 cases of spill-fill, and that is with only 32 registers with
    universal constants.


    Tends to be a bit higher IME, but granted my compiler is a bit more naive:
    Either it can static-assign everything;
    Or, it needs to use spill-and-fill.

    In RISC-V mode:
      Static-assign everything, Leaf: 13%
      Partial assign, Leaf: 7.1%
      Static-assign everything, Non-Leaf: 1.8%
      Partial assign, Non-Leaf: 85%
      Average, ~ 4.6 variables static-assigned
        Out of 16.6 variables in a function.

    In XG2 mode:
      Static-assign everything, Leaf: 16%
      Partial assign, Leaf: 0.7%
      Static-assign everything, Non-Leaf: 1.9%
      Partial assign, Non-Leaf: 82%
      Average, ~ 4.8 variables static-assigned
        Out of 16.8 variables in a function.

    Theoretically, the number of static-assigned variables and fully
    static-assigned functions could be higher, but it looks like the
    compiler is excluding a lot of them for some reason (may need to look
    into it).



    Of the RISC-V code I read alongside with 32+32 registers, I counted 8.


    With 64 GPRs, there can be less spill/fill, and without any increase in
    the number of hardware registers vs RV64G's 32+32 scheme.

    Rarely is register pressure equally balanced in this way, and more often
    it is one of:
    High integer register pressure, little or no FP pressure (most code);
    Very high FP register pressure, low integer pressure (say, unrolled
    matrix multiply).

    Where, an even-split X/F scheme serves neither, and a bigger unified
    register space serves both.



    Though, I guess the usual argument for split GPR/FPR spaces is that with
    unified register spaces, both ALU and FPU need to use the same pipeline.


    But, if it is a shared register pipeline, one can also leverage ALU for
    a lot of edge cases, like FPU compare.

    If one uses a longer pipeline for FPU ops vs ALU, it seems like one will
    still need to pay the costs of the longer FPU pipeline regardless of
    whether they are a single or separate register file.



    Apparently, similar reasoning applies to the V extension using separate
    vector registers (vs just aliasing with the F registers), but I don't
    really want to implement the V extension.


    Almost more tempting to do a cut-down non-conforming "V in F" style
    implementation:
    * Aliases V to F register pairs;
    ** TBD if better to use V0..V15 or even-only numbering.
    ** Or, V0..V31 exist (if aliased) for 64b vectors,
    ** but only even for 128b.
    * Will drop mask bits and other more advanced features.
    * Trying to set up V properly would result in the instructions faulting.
    ** Could allow the possibility of adding proper V later.


    With those statistics and 256 registers, if you can't get to essentially
    0 spill-fill, the problem is not with your architecture but with your
    compiler.

    With 256 registers, probably 99% of functions could use a "statically
    assign every variable to a register" strategy (though, assuming a case
    where one can reuse registers for temporary values).

    Where, most temporary values are created and used within a single basic
    block, and if no references to that specific temporary exist outside of
    the basic block (and if not marked with a phi operator), the value of
    the temporary can simply be assumed to disappear at the end of a basic
    block. This can also allow temporaries to be allocated into scratch
    registers.
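
    The "temporary never leaves its basic block" test can be sketched on a
    toy IR. This is a minimal Python sketch, assuming instructions are
    (dest, [sources]) tuples and that temporaries are spelled with a
    leading "%" (a made-up convention for this example). A temporary
    referenced from another block, for instance by a phi, shows up in two
    blocks and is therefore excluded.

```python
def block_local_temps(blocks):
    """Return the temporaries that are mentioned in exactly one basic block.

    blocks: dict mapping block name -> list of (dest, [sources]) tuples.
    """
    where = {}  # temporary name -> set of blocks that mention it
    for name, instrs in blocks.items():
        for dest, srcs in instrs:
            for v in (dest, *srcs):
                if v.startswith("%"):
                    where.setdefault(v, set()).add(name)
    return {v for v, names in where.items() if len(names) == 1}


cfg = {
    "entry": [("%t0", ["a", "b"]), ("x", ["%t0", "c"]), ("%t1", ["x", "d"])],
    "exit": [("y", ["%t1", "x"])],
}
print(sorted(block_local_temps(cfg)))
```

    Temporaries in the returned set are candidates for scratch registers,
    since nothing needs their value once the block ends.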


    My own thought though is that going much bigger in terms of the main
    register file likely isn't worth it.

    Only real compelling use for a bigger register file (much over 64) at
    the moment would be more for optimizing interrupts and context switches.

  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 12 15:14:36 2024
    From Newsgroup: comp.arch

    On 10/12/2024 1:50 PM, MitchAlsup1 wrote:
    On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:

    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to
    generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

        add r9,r1,r5,r0
        addgc r13,r1,r5,r0
        add r10,r2,r6,r13
        addgc r13,r2,r6,r13
        add r11,r7,r3,r13
        addgc r13,r7,r3,r13
        add r12,r8,r4,r13

    My 66000 version::

          CARRY   R8,{{IO}{IO}{IO}{O}}
          ADD     R4,R12,R16
          ADD     R5,R13,R17
          ADD     R6,R14,R18
          ADD     R7,R15,R19
               // R{8,7,6,5,4} contain the 257-bit result.

    256-bit add giving 257-bit result.

    BJX2 / XG2, assuming in-register (A/D=R4..R7, B=R20..R23):
    CLRT
    ADDC R20, R4
    ADDC R21, R5
    ADDC R22, R6
    ADDC R23, R7

    Or, D=R16..R19
    MOV.X R4, R16
    MOV.X R6, R18
    CLRT
    ADDC R20, R16
    ADDC R21, R17
    ADDC R22, R18
    ADDC R23, R19

    ADDC is itself mostly a holdover from SH.

    Could almost make sense to make it have a 3R form though and move it to
    updating SR.S instead, since SR.T is likely better left exclusively to
    predication (vs mostly predication, and obscure edge-case ops like
    ADDC/SUBC/ROTCL/...).

    Could almost add an ADDC.X op which operates 128 bits at a time, say:
    CLRT
    ADDC.X R4, R20, R16
    ADDC.X R6, R22, R18

    Except that it would be rarely used enough to make its existence
    debatable at best.
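
    The CLRT / ADDC sequence can be modelled directly. This is a minimal
    Python sketch, assuming "ADDC Rm, Rn" computes Rn = Rn + Rm + T with
    the carry out written back to T, and assuming R4 and R20 hold the least
    significant words. ToyCpu and the helper functions are hypothetical
    names used only for this illustration.

```python
MASK64 = (1 << 64) - 1


class ToyCpu:
    """Just enough state to model CLRT and a destructive 2R ADDC."""

    def __init__(self):
        self.r = [0] * 64
        self.t = 0  # stands in for the SR.T bit

    def clrt(self):
        self.t = 0

    def addc(self, rm, rn):
        # ADDC Rm, Rn: Rn = Rn + Rm + T, and T receives the carry out.
        total = self.r[rn] + self.r[rm] + self.t
        self.r[rn] = total & MASK64
        self.t = total >> 64


def add256_in_place(cpu):
    """CLRT; ADDC R20,R4; ADDC R21,R5; ADDC R22,R6; ADDC R23,R7."""
    cpu.clrt()
    for i in range(4):
        cpu.addc(20 + i, 4 + i)


def load256(cpu, base, value):
    """Store a 256-bit value into four registers starting at 'base'."""
    for i in range(4):
        cpu.r[base + i] = (value >> (64 * i)) & MASK64


def read256(cpu, base):
    """Read a 256-bit value back from four registers starting at 'base'."""
    return sum(cpu.r[base + i] << (64 * i) for i in range(4))


cpu = ToyCpu()
load256(cpu, 4, (1 << 256) - 1)
load256(cpu, 20, 1)
add256_in_place(cpu)
print(read256(cpu, 4), cpu.t)
```

    The destination operand is overwritten, which is why the second
    variant above copies A into R16..R19 first when A must be preserved.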



    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.

  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Oct 12 18:20:50 2024
    From Newsgroup: comp.arch

    On 2024-10-12 4:14 p.m., BGB wrote:
    On 10/12/2024 1:50 PM, MitchAlsup1 wrote:
    On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:

    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to
    generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

        add r9,r1,r5,r0
        addgc r13,r1,r5,r0
        add r10,r2,r6,r13
        addgc r13,r2,r6,r13
        add r11,r7,r3,r13
        addgc r13,r7,r3,r13
        add r12,r8,r4,r13

    My 66000 version::

           CARRY   R8,{{IO}{IO}{IO}{O}}
           ADD     R4,R12,R16
           ADD     R5,R13,R17
           ADD     R6,R14,R18
           ADD     R7,R15,R19
                // R{8,7,6,5,4} contain the 257-bit result.

    256-bit add giving 257-bit result.

    BJX2 / XG2, assuming in-register (A/D=R4..R7, B=R20..R23):
      CLRT
      ADDC  R20, R4
      ADDC  R21, R5
      ADDC  R22, R6
      ADDC  R23, R7

    Or, D=R16..R19
      MOV.X R4, R16
      MOV.X R6, R18
      CLRT
      ADDC  R20, R16
      ADDC  R21, R17
      ADDC  R22, R18
      ADDC  R23, R19

    ADDC is itself mostly a holdover from SH.

    Could almost make sense to make it have a 3R form though and move it to
    updating SR.S instead, since SR.T is likely better left exclusively to
    predication (vs mostly predication, and obscure edge-case ops like
    ADDC/SUBC/ROTCL/...).

    Could almost add an ADDC.X op which operates 128 bits at a time, say:
      CLRT
      ADDC.X R4, R20, R16
      ADDC.X R6, R22, R18

    Except that it would be rarely used enough to make its existence
    debatable at best.



    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.

    BJX2 / XG2 destroys the value of one source operand; I noted the
    extra code to preserve that operand. Is that only for the ADDC
    instruction?

    What is the limit on the My66000 CARRY modifier for the number of
    carries? Assuming the sequence is interruptible there must be a few bits
    of state that need to be preserved.
    I found incorporating modifiers has a tendency to turn my code into
    spaghetti. Maybe my grasp of implementation is not so great though.

    The add, addgc can execute at the same time. So, it is 4 clocks at the
    worst to add two 256-bit numbers. (The first / last instructions may
    execute at the same time as other instructions).
    I wanted to avoid using instruction modifiers and special flags
    registers as much as possible. It is somewhat tricky to have a carry
    flag in flight. Q+ is not very code dense, but the add can be done. It
    is also possible to put the carry bit in a predicate register.


  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sat Oct 12 23:28:26 2024
    From Newsgroup: comp.arch

    On Sat, 12 Oct 2024 22:20:50 +0000, Robert Finch wrote:

    On 2024-10-12 4:14 p.m., BGB wrote:
    On 10/12/2024 1:50 PM, MitchAlsup1 wrote:
    On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:

    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to
    generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

        add r9,r1,r5,r0
        addgc r13,r1,r5,r0
        add r10,r2,r6,r13
        addgc r13,r2,r6,r13
        add r11,r7,r3,r13
        addgc r13,r7,r3,r13
        add r12,r8,r4,r13

    My 66000 version::

           CARRY   R8,{{IO}{IO}{IO}{O}}
           ADD     R4,R12,R16
           ADD     R5,R13,R17
           ADD     R6,R14,R18
           ADD     R7,R15,R19
                // R{8,7,6,5,4} contain the 257-bit result.

    256-bit add giving 257-bit result.

    BJX2 / XG2, assuming in-register (A/D=R4..R7, B=R20..R23):
      CLRT
      ADDC  R20, R4
      ADDC  R21, R5
      ADDC  R22, R6
      ADDC  R23, R7

    Or, D=R16..R19
      MOV.X R4, R16
      MOV.X R6, R18
      CLRT
      ADDC  R20, R16
      ADDC  R21, R17
      ADDC  R22, R18
      ADDC  R23, R19

    ADDC is itself mostly a holdover from SH.

    Could almost make sense to make it have a 3R form though and move it to
    updating SR.S instead, since SR.T is likely better left exclusively to
    predication (vs mostly predication, and obscure edge-case ops like ADDC/
    SUBC/ROTCL/...).

    Could almost add an ADDC.X op which operates 128 bits at a time, say:
      CLRT
      ADDC.X R4, R20, R16
      ADDC.X R6, R22, R18

    Except that it would be rarely used enough to make its existence
    debatable at best.



    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.

    BJX2 / XG2 destroys the value of one source operand; I noted the
    extra code to preserve that operand. Is that only for the ADDC
    instruction?

    What is the limit on the My66000 CARRY modifier for the number of
    carries? Assuming the sequence is interruptible there must be a few bits
    of state that need to be preserved.

    CARRY casts its modification over 8 subsequent instructions using its
    16-bit immediate.
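
    A 16-bit immediate spread over 8 instructions works out to 2 bits per
    instruction. This is a minimal Python sketch of packing and unpacking
    such an immediate, assuming (hypothetically) that the least significant
    pair belongs to the first following instruction, that bit 0 of a pair
    means carry-in and bit 1 means carry-out. The real My 66000 encoding is
    not given in this thread, so the bit assignment here is illustrative
    only.

```python
def unpack_carry_imm(imm16):
    """Split a 16-bit immediate into 8 (carry_in, carry_out) pairs."""
    pairs = []
    for slot in range(8):
        field = (imm16 >> (2 * slot)) & 0b11
        pairs.append((bool(field & 0b01), bool(field & 0b10)))
    return pairs


def pack_carry_imm(specs):
    """Build the immediate from per-instruction strings such as "IO" or "O"."""
    imm = 0
    for slot, spec in enumerate(specs[:8]):
        field = (0b01 if "I" in spec else 0) | (0b10 if "O" in spec else 0)
        imm |= field << (2 * slot)
    return imm


# Same per-instruction list as written in the CARRY example above.
imm = pack_carry_imm(["IO", "IO", "IO", "O"])
print(hex(imm), unpack_carry_imm(imm)[:4])
```

    Unused trailing slots decode as (False, False), so a shorter sequence
    simply leaves the remaining instructions unmodified.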

    I found incorporating modifiers has a tendency to turn my code into
    spaghetti. Maybe my grasp of implementation is not so great though.

    DECODE has a shift register to attach 2 bits to each subsequent
    instruction. However, the Rd provided by CARRY carries 64 bits from
    instruction to instruction, which makes 256×64-bit multiplication
    straightforward.
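
    Why a full 64-bit carry helps with multiplication can be seen in a
    small model. This is a minimal Python sketch, assuming 64-bit limbs
    stored least significant first; mul256x64 is a hypothetical helper, not
    My 66000 code. Each step produces a 128-bit product plus the incoming
    carry: the low half is the result limb and the high half is passed on
    to the next step.

```python
MASK64 = (1 << 64) - 1


def mul256x64(a_limbs, m):
    """Multiply a 256-bit value (four 64-bit limbs) by a 64-bit value.

    Returns five limbs (a 320-bit product), least significant first.
    """
    carry = 0  # 64-bit value handed from one step to the next
    out = []
    for a in a_limbs:
        prod = a * m + carry  # always fits in 128 bits
        out.append(prod & MASK64)
        carry = prod >> 64
    out.append(carry)
    return out


a = (1 << 256) - 1
limbs = [(a >> (64 * i)) & MASK64 for i in range(4)]
product = mul256x64(limbs, MASK64)
print([hex(limb) for limb in product])
```

    With only a 1-bit carry the high half of each product would need a
    separate instruction and register per step.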

    The add, addgc can execute at the same time. So, it is 4 clocks at the
    worst to add two 256-bit numbers. (The first / last instructions may
    execute at the same time as other instructions).
    I wanted to avoid using instruction modifiers and special flags
    registers as much as possible. It is somewhat tricky to have a carry
    flag in flight. Q+ is not very code dense, but the add can be done. It
    is also possible to put the carry bit in a predicate register.
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 12 20:36:43 2024
    From Newsgroup: comp.arch

    On 10/12/2024 5:20 PM, Robert Finch wrote:
    On 2024-10-12 4:14 p.m., BGB wrote:
    On 10/12/2024 1:50 PM, MitchAlsup1 wrote:
    On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:

    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to
    generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

        add r9,r1,r5,r0
        addgc r13,r1,r5,r0
        add r10,r2,r6,r13
        addgc r13,r2,r6,r13
        add r11,r7,r3,r13
        addgc r13,r7,r3,r13
        add r12,r8,r4,r13

    My 66000 version::

           CARRY   R8,{{IO}{IO}{IO}{O}}
           ADD     R4,R12,R16
           ADD     R5,R13,R17
           ADD     R6,R14,R18
           ADD     R7,R15,R19
                // R{8,7,6,5,4} contain the 257-bit result.

    256-bit add giving 257-bit result.

    BJX2 / XG2, assuming in-register (A/D=R4..R7, B=R20..R23):
       CLRT
       ADDC  R20, R4
       ADDC  R21, R5
       ADDC  R22, R6
       ADDC  R23, R7

    Or, D=R16..R19
       MOV.X R4, R16
       MOV.X R6, R18
       CLRT
       ADDC  R20, R16
       ADDC  R21, R17
       ADDC  R22, R18
       ADDC  R23, R19

    ADDC is itself mostly a holdover from SH.

    Could almost make sense to make it have a 3R form though and move it
    to updating SR.S instead, since SR.T is likely better left exclusively
    to predication (vs mostly predication, and obscure edge-case ops like
    ADDC/SUBC/ROTCL/...).

    Could almost add an ADDC.X op which operates 128 bits at a time, say:
       CLRT
       ADDC.X R4, R20, R16
       ADDC.X R6, R22, R18

    Except that it would be rarely used enough to make its existence
    debatable at best.



    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.

    BJX2 / XG2 destroys the value of one source operand; I noted the
    extra code to preserve that operand. Is that only for the ADDC
    instruction?


    In this case, ADDC/SUBC were destructive 2R because:
    Originally, in SH4 and BJX1, they were destructive 2R;
    They were not common enough to justify spending 3R encodings on them;
    But, still common enough to justify not dropping them entirely.

    So, for example:
    ADD is 3R;
    But, ADDC/SUBC (aka ADC/SBB), are only 2R.


    Early on, I ended up adding both 2R and 3R versions of many
    instructions, but ended up later dropping a lot of the 32-bit 2R
    encodings after noting that they were entirely redundant. ADDC/SUBC
    lived on as 2R as they were never given 3R variants.

    And, in turn, cases where one needs to implement large ALU types are
    infrequent, and usually for 256-bit integers or similar, one doesn't
    care that they are slow.


    For 128-bit types, they ended up with designated ALU instructions, and
    if one has a 128-bit ADD.X/SUB.X and friends, this eliminated much of
    the use-case for ADDC (so, less incentive to give it a 3R type).


    Where, ADD.X and friends reclaimed encoding space that had originally
    been used for "ADD Rm, Imm5u, Rn" and similar; but these were dropped
    after the "ADD Rm, Imm9u, Rn" and similar encodings were added.

    Similar was originally also true of the Imm5u Load/Store encodings, but
    these ended up coming back later, as some later encoding edge cases
    required them to exist.


    The original migration to Imm9u having been because Imm5u was not
    sufficient (in XG2, many of the Imm9u encodings became either Imm10u or
    Imm10s).

    Ironically, while 9u or 10s is still smaller than the Imm12s that RISC-V
    uses, the practical difference was smaller than one might expect:
    The hit/miss difference is a lot smaller;
    It dealt more gracefully with the cases where the immediate missed
    (RISC-V lacks any sort of "graceful" fallback; and a typical best case
    of "LUI+ADDI+OP" kinda sucks...).
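
    The LUI+ADDI fallback for an immediate that misses Imm12s can be made
    concrete. This is a minimal Python sketch for 32-bit constants only,
    assuming the usual rule that ADDI sign-extends its 12-bit immediate, so
    the upper part has to be bumped by one when the low 12 bits would read
    as negative. The function names are hypothetical.

```python
def fits_imm12s(value):
    """True if the value fits a signed 12-bit immediate."""
    return -2048 <= value <= 2047


def split_lui_addi(value):
    """Split a 32-bit constant into (hi20, lo12) for a LUI + ADDI pair.

    The pair satisfies ((hi20 << 12) + lo12) mod 2**32 == value mod 2**32.
    """
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000  # ADDI will sign-extend, so treat it as negative
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


hi, lo = split_lui_addi(0x12345FFF)
print(hex(hi), lo)
```

    So an operation with a constant outside the 12-bit range costs two
    extra instructions and a scratch register before the actual OP runs.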


    As can be noted, as-is, RISC-V also lacks any good way to deal with
    large integer arithmetic. But, then again, it is infrequent and usually
    not significant to performance.


    What is the limit on the My66000 CARRY modifier for the number of
    carries? Assuming the sequence is interruptible there must be a few bits
    of state that need to be preserved.
    I found incorporating modifiers has a tendency to turn my code into
    spaghetti. Maybe my grasp of implementation is not so great though.

    The add, addgc can execute at the same time. So, it is 4 clocks at the
    worst to add two 256-bit numbers. (The first / last instructions may
    execute at the same time as other instructions).
    I wanted to avoid using instruction modifiers and special flags
    registers as much as possible. It is somewhat tricky to have a carry
    flag in flight. Q+ is not very code dense, but the add can be done. It
    is also possible to put the carry bit in a predicate register.



  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Oct 13 02:46:14 2024
    From Newsgroup: comp.arch

    BJX2 / XG2 destroys the value of one source operand; I noted the
    extra code to preserve that operand. Is that only for the ADDC
    instruction?

    What is the limit on the My66000 CARRY modifier for the number of
    carries? Assuming the sequence is interruptible there must be a few bits
    of state that need to be preserved.

    CARRY casts its modification over 8 subsequent instructions using its
    16-bit immediate.

    I found incorporating modifiers has a tendency to turn my code into
    spaghetti. Maybe my grasp of implementation is not so great though.

    DECODE has a shift register to attach 2 bits to each subsequent
    instruction. However, the Rd provided by CARRY carries 64 bits from
    instruction to instruction, which makes 256×64-bit multiplication
    straightforward.

    Q+ has something using a similar approach, the ATOM instruction, which
    sets the interrupt priority level for the next 11 instructions. It
    shifts three bits per instruction at a time at the enqueue stage when
    the instruction group is loaded into the ROB. The shift should maybe be
    moved back to decode. It is a bit of spaghetti code ATM. I suspect it
    could be implemented better. The idea is that ATOM allows temporarily
    disabling interrupts and automatically restoring the interrupt level to
    what it was after a certain number of instructions.
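
    The three-bits-per-instruction shift can be modelled as a small shift
    register. This is a minimal Python sketch, assuming the operand is a
    string of up to 11 octal digits, that the first digit applies to the
    first following instruction, and that the level reverts to its previous
    value once the digits run out. atom_levels is a hypothetical name, and
    the real Q+ encoding may differ on all three points.

```python
def atom_levels(digits, base_level, n_instructions):
    """Return the interrupt level in effect for each following instruction.

    digits: octal digit string, one 3-bit level per instruction (max 11).
    base_level: the level in effect before ATOM, restored afterwards.
    """
    digits = digits[:11]
    shift_reg = 0
    for i, d in enumerate(digits):
        shift_reg |= (int(d, 8) & 0b111) << (3 * i)
    remaining = len(digits)

    levels = []
    for _ in range(n_instructions):
        if remaining > 0:
            levels.append(shift_reg & 0b111)  # level for this instruction
            shift_reg >>= 3                   # shift out three bits
            remaining -= 1
        else:
            levels.append(base_level)         # original level restored
    return levels


print(atom_levels("77777", 2, 7))
```

    No explicit save and restore of the interrupt status is needed in the
    code stream, since the level falls back on its own.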

    I found that when writing code I was disabling then enabling interrupts
    at various points, which was tricky to do as the original interrupt
    status needed to be recorded and restored. It took several
    instructions. Looking for a cleaner solution.

    atom “77777” ; disable all but non-maskable interrupts
    < instr. >

    Currently, all Q+ instructions have only a single write port max. To use
    two ports means using two ALUs at the same time, which would serialize
    the machine. I think the CARRY modifier requires two write ports. The
    quad-float extender prefix (QFEXT) allows 128-bit floats by using an FPU
    and ALU port at the same time.

    There were a couple of other modifiers, PRED and ROUND, but they got
    removed as they were not needed when the instructions were enlarged to
    64-bit. PRED is just a predicate register spec in every instruction now.


    The add, addgc can execute at the same time. So, it is 4 clocks at the
    worst to add two 256-bit numbers. (The first / last instructions may
    execute at the same time as other instructions).
    I wanted to avoid using instruction modifiers and special flags
    registers as much as possible. It is somewhat tricky to have a carry
    flag in flight. Q+ is not very code dense, but the add can be done. It
    is also possible to put the carry bit in a predicate register.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Oct 13 16:43:53 2024
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:

    [Context: carry and overflow in GPRs
    <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>]

    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought was
    that on the machine with 256 registers, simply allocate a ridiculous
    number of registers for expression processing, for example 25 or even
    50. Then if the expression is too complex, have the compiler spit out an
    error message to the programmer to simplify the expression. Remnants of
    the ‘expression too complex’ error in BASIC. So, there are no spills or
    reloads during expression processing.

    The first question is how carry and overflow are represented in the
    programming language.

    Currently there are programming languages with growable integers, and
    overflow is needed short-term for that, so spilling the overflow bit
    is probably not necessary for that (and indeed, the one overflow bit
    of AMD64 or ARM A64 that is not preserved across calls is good enough
    for that).

    For dealing with multiple-precision integers (e.g., when the growable
    integers actually grow to more than one word), typically library
    routines are used, but sure, one could also have a programming
    language that computes with multi-precision integers and then is
    compiled into either loops over the individual words of these numbers,
    or it unrolls these loops (if the length is known in advance). Yes,
    if you run out of registers there, you may want to spill and refill a
    register, including its carry bit. But that should be rare, so if
    it's an expensive operation, we can live with it.

    What we have now is things like the GNU C extension

    bool __builtin_add_overflow (type1 a, type2 b, type3 *res);

    This produces two different results, the return value, and res. With
    the kind of architecture I have in mind, these two results could be
    allocated into the same register. If at some point the register has
    to be spilled, the two results can be stored into different memory
    locations, and on refill they will land in different GPRs unless the
    compiler writer really puts a lot more work in than is merited (I
    don't expect many spills and refills).
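
    A minimal sketch of how those two results chain in a multi-word add,
    assuming GCC or Clang (the builtin is a compiler extension). The name
    mp_add and the least-significant-limb-first order are illustrative
    choices, not something from the thread:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-limb unsigned integers (least significant limb first).
 * Returns the carry out of the most significant limb. */
static bool mp_add(uint64_t *sum, const uint64_t *a, const uint64_t *b,
                   size_t n)
{
    bool carry = false;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t t;
        /* Each call yields two results: the wrapped sum (through the
         * pointer) and the overflow flag (the return value). */
        bool c1 = __builtin_add_overflow(a[i], b[i], &t);
        bool c2 = __builtin_add_overflow(t, (uint64_t)carry, &sum[i]);
        /* At most one of c1 and c2 can be set for a given limb. */
        carry = c1 | c2;
    }
    return carry;
}
```

    On an architecture that keeps the carry alongside the sum in one
    register, t and c1 would be natural candidates for sharing a register;
    on AMD64 or ARM A64 the compiler typically maps this to an add-with-carry
    chain instead.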

    I think the storextra / loadextra
    registers used during context switching would work okay. But in Q+ there
    are 256 regs which require eight storextra / loadextra registers. I
    think the store extra / load extra registers could be hidden in the
    context save and restore hardware. Not even requiring access via CSRs or
    whatever.

    Yes. In my paper I wanted to spell out an implementation that does
    not look like I am ignoring some hard problems and shove it over to
    the implementor. If a computer architect wants to pick my idea up,
    they are welcome to implement context-switching in any way they deem
    appropriate.

    I suppose context loads and stores could be done in blocks of
    32 registers. An issue is that the load extra needs to be done before
    registers are loaded.

    Maybe, with 256 GPRs, you would use 8 storeextra and 8 loadextra
    registers, each one associated with 32 registers. This avoids having
    to make the whole process a sequential operation working on 32-GPR
    blocks. Just store all 256 GPRs, sync (to get the storeextra
    registers up-to-date), then store the 8 storeextra registers. For
    context load, load the 8 loadextra registers, sync (so the loads of
    the loadextra registers are finished), then load the 256 GPRs.

    Or alternatively just have 8 extra registers that are used for both
    context stores and context loads. Then you cannot use the same sync
    for both storing and loading, but you may prefer a little more
    context-switch overhead to needing 16 extra registers.
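
    The save and restore ordering can be modelled in software. This is only
    a sketch under stated assumptions: two extra bits per GPR (carry and
    overflow) packed 32 registers to a 64-bit extra word, and a frame layout
    with the extra words first. The types gpr_t and context_frame_t and the
    bit positions are hypothetical, not taken from the paper or from Q+:

```c
#include <assert.h>
#include <stdint.h>

enum {
    NUM_GPRS = 256,
    GPRS_PER_EXTRA = 32,                    /* 32 regs x 2 bits = 64 bits */
    NUM_EXTRA = NUM_GPRS / GPRS_PER_EXTRA   /* 8 extra words */
};

typedef struct {
    uint64_t value;   /* architectural bits 0..63 */
    uint8_t  flags;   /* assumed: bit 0 = carry, bit 1 = overflow */
} gpr_t;

typedef struct {
    uint64_t extra[NUM_EXTRA];
    uint64_t gpr[NUM_GPRS];
} context_frame_t;

/* Context save: store all GPRs, collecting each register's two extra
 * bits into its storeextra word, then store the storeextra words. */
static void context_save(context_frame_t *f, const gpr_t *regs)
{
    uint64_t storeextra[NUM_EXTRA] = {0};

    assert(f && regs);
    for (int r = 0; r < NUM_GPRS; r++) {
        f->gpr[r] = regs[r].value;
        storeextra[r / GPRS_PER_EXTRA] |=
            (uint64_t)(regs[r].flags & 3u) << (2 * (r % GPRS_PER_EXTRA));
    }
    /* "sync" point: every storeextra word is now complete. */
    for (int i = 0; i < NUM_EXTRA; i++)
        f->extra[i] = storeextra[i];
}

/* Context load: fetch the loadextra words first, then load the GPRs,
 * each picking up its two extra bits from the matching word. */
static void context_load(gpr_t *regs, const context_frame_t *f)
{
    uint64_t loadextra[NUM_EXTRA];

    assert(f && regs);
    for (int i = 0; i < NUM_EXTRA; i++)
        loadextra[i] = f->extra[i];
    /* "sync" point: the loadextra words are available. */
    for (int r = 0; r < NUM_GPRS; r++) {
        regs[r].value = f->gpr[r];
        regs[r].flags = (uint8_t)((loadextra[r / GPRS_PER_EXTRA]
                                   >> (2 * (r % GPRS_PER_EXTRA))) & 3u);
    }
}
```

    With separate storeextra and loadextra sets, no per-block sequencing is
    needed: the only ordering constraints are the two sync points.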

    Another thought is to store additional info such as a CRC check of the
    register file on context save and restore.

    Typically ECC memory and something similar in bus protocols achieve
    what I guess you want to achieve with the CRC checks.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Sun Oct 13 18:19:47 2024
    From Newsgroup: comp.arch

    On Sun, 13 Oct 2024 6:46:14 +0000, Robert Finch wrote:

    BJX2 / XG2 destroys the value of the one source operand; I noted the
    extra code to preserve the one operand. Is that only for the ADDC
    instruction?

    What is the limit on the My66000 CARRY modifier for the number of
    carries? Assuming the sequence is interruptible there must be a few bits
    of state that need to be preserved.

    CARRY casts its modification over 8 subsequent instructions using its
    16-bit immediate.

    I found incorporating modifiers have a tendency to turn my code into
    spaghetti. Maybe my grasp of implementation is not so great though.

    DECODE has a shift register to attach 2 bits to each of the subsequent
    instructions. However, the Rd provided by CARRY carries 64 bits from
    instruction to instruction--which makes 256×64-bit multiplication
    straightforward.
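
    A rough software model of that decode-stage shift register, as a sketch
    only. The thread gives the 16-bit immediate and the 2 bits per
    instruction; the meaning of the two bits (carry-in, carry-out) and the
    low-pair-first order used here are assumptions, and carry_decoder_t,
    carry_begin and carry_next are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed meaning of each 2-bit field; not confirmed by the thread. */
enum { CARRY_NONE = 0u, CARRY_IN = 1u, CARRY_OUT = 2u };

typedef struct {
    uint16_t shadow;  /* remaining fields; next instruction in bits 1:0 */
} carry_decoder_t;

/* CARRY decoded: latch its 16-bit immediate. */
static void carry_begin(carry_decoder_t *d, uint16_t imm16)
{
    assert(d);
    d->shadow = imm16;
}

/* Called once per following instruction: hand out the low 2-bit field
 * and shift. After 8 instructions the register is all zeros, so later
 * instructions are unmodified. */
static unsigned carry_next(carry_decoder_t *d)
{
    unsigned field;

    assert(d);
    field = d->shadow & 3u;
    d->shadow >>= 2;
    return field;
}
```

    Under these assumptions a four-instruction 256-bit add would use the
    immediate 0x7E: carry-out only on the first add, carry-in and carry-out
    on the middle two, and carry-in only on the last.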

    Q+ has something using a similar approach, the ATOM instruction, which
    sets the interrupt priority level for the next 11 instructions. It
    shifts three bits per instruction at a time at the enqueue stage when
    the instruction group is loaded into the ROB. The shift should maybe be
    moved back to decode. It is a bit of spaghetti code ATM. I suspect it
    could be implemented better. The idea is that ATOM allows temporarily
    disabling interrupts and automatically restoring the interrupt level to
    what it was after a certain number of instructions.

    When writing code, I found I was disabling and then enabling interrupts
    at various points, which was tricky to do as the original interrupt
    status needed to be recorded and restored. It took several instructions.
    I am looking for a cleaner solution.

    atom "77777" ; disable all but non-maskable interrupts
    < instr. >

    Currently, all Q+ instructions have only a single write port max. To use
    two ports means using two ALUs at the same time, which would serialize
    the machine. I think the CARRY modifier requires two write ports. The
    quad-float extender prefix (QFEXT) allows 128-bit floats by using an FPU
    and ALU port at the same time.

    I have a clever implementation of CARRY where it is a result bus
    and an Operand port but it does not need a write register port.

    There were a couple of other modifiers, PRED and ROUND, but they got
    removed as they were not needed when the instructions were enlarged to
    64-bit. PRED is just a predicate register spec in every instruction now.


    The add, addgc can execute at the same time. So, it is 4 clocks at the
    worst to add two 256-bit numbers. (The first / last instructions may
    execute at the same time as other instructions).
    I wanted to avoid using instruction modifiers and special flags
    registers as much as possible. It is somewhat tricky to have a carry
    flag in flight. Q+ is not very code dense, but the add can be done. It
    is also possible to put the carry bit in a predicate register.
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 15 09:49:42 2024
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    On 2024-10-05 5:43 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-10-04 2:19 a.m., Anton Ertl wrote:
    4) Keep the flags results along with GPRs: have carry and overflow as
    bit 64 and 65, N is bit 63, and Z tells something about bits 0-63.
    The advantage is that you do not have to track the flags separately
    (and, in case of AMD64, track each of C, O, and NZP separately), but
    instead can use the RAT that is already there for the GPRs. You can
    find a preliminary paper on that on
    <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
    ...
    One solution, not mentioned in your article, is to support arithmetic
    with two bits less than the number of bits a register can support, so
    that the carry and overflow can be stored. On a 64-bit machine have all
    operations use only 62 bits. It would solve the issue of how to load or
    store the carry and overflow bits associated with a register.

    Yes, that's a solution, but the question is how well existing software
    would react to having no int64_t (and equivalent types, such as long
    long), but instead an int62_t (or maybe int63_t, if the 64th bit is
    used for both signed and unsigned overflow, by having separate signed
    and unsigned addition etc.). I expect that such an architecture would
    have low acceptance. By contrast, in my paper I suggest an addition
    to existing 64-bit architectures that has fewer of the same
    disadvantages as the widely-used condition-code-register approach has,
    but still has a few of them.

    Sometimes
    arithmetic is performed with fewer bits, as for pointer representation.
    I wonder if pointer masking could somehow be involved. It may be useful
    to have a bit indicating the presence of a pointer. Also thinking of how
    to track a binary point position for fixed point arithmetic. Perhaps
    using the whole upper byte of a register for status/control bits
    would work.

    There are some extensions for AMD64 in that direction.

    It may be possible with Q+ to support a second destination register
    which is in a subset of the GPRs. For example, one of eight registers
    could be specified to hold the carry/overflow status. That effectively
    ties up a second ALU though as an extra write port is needed for the
    instruction.

    Needing only one write port is an advantage of my approach.

    - anton

    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought
    was that on the machine with 256 registers, simply allocate a
    ridiculous number of registers for expression processing, for example
    25 or even 50. Then if the expression is too complex, have the
    compiler spit out an error message to the programmer to simplify the
    expression. Remnants of the ‘expression too complex’ error in BASIC.
    So, there are no spills or reloads during expression processing. I
    think the storextra / loadextra registers used during context
    switching would work okay. But in Q+ there are 256 regs which require
    eight storextra / loadextra registers. I think the store extra / load
    extra registers could be hidden in the context save and restore
    hardware. Not even requiring access via CSRs or whatever. I suppose
    context loads and stores could be done in blocks of 32 registers. An
    issue is that the load extra needs to be done before registers are
    loaded. So, the extra word full of carry/overflow bits would need to
    be fetched in a non-sequential fashion. Assuming for instance, that
    saving register values is followed by a save of the CO word. Then it
    is positioned wrong for a sequential load. It may be better to have
    the wrong position for a store, so loads can proceed sequentially.
    It strikes me that there is no real good solution, only perhaps an
    engineered one. Toyed with the idea of having 16 separate flags
    registers, but not liking that as a solution as much as the store/load
    extra.

    Another thought is to store additional info such as a CRC check of the
    register file on context save and restore.

    *****

    Finally wrote the SM to walk the ROB backwards and restore register
    mappings for a checkpoint restore. Cannot get Q+ to do more than light
    up one LED in SIM. Register values are not propagating properly.


    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to
    generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

    add r9,r1,r5,r0
    addgc r13,r1,r5,r0
    add r10,r2,r6,r13
    addgc r13,r2,r6,r13
    add r11,r7,r3,r13
    addgc r13,r7,r3,r13
    add r12,r8,r4,r13

    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.
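
    A C model of that sequence, as a sketch: add3 stands in for the
    three-input ADD and addgc for ADDGC, with the semantics inferred from
    the listing above (sum of three sources, and the carry out of that same
    sum). The function names and the limb ordering are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* ADD rd,ra,rb,rc: low 64 bits of the three-input sum. */
static uint64_t add3(uint64_t a, uint64_t b, uint64_t c)
{
    return a + b + c;
}

/* ADDGC rd,ra,rb,rc: carry out of the same three-input sum.
 * This is 0 or 1 when c is 0 or 1; a general c can give 2. */
static uint64_t addgc(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t t = a + b;
    uint64_t carry = (uint64_t)(t < a);   /* carry from a + b */

    carry += (uint64_t)(t + c < t);       /* carry from adding c */
    return carry;
}

/* 256-bit add, limbs least significant first. As in the listing, no
 * carry is generated for the top limb (seven operations in total). */
static void add256(uint64_t s[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t c = 0;

    assert(s && a && b);
    for (int i = 0; i < 4; i++) {
        /* Compute the carry before writing s[i], so s may alias a or b. */
        uint64_t next = (i < 3) ? addgc(a[i], b[i], c) : 0;
        s[i] = add3(a[i], b[i], c);
        c = next;
    }
}
```

    Each add/addgc pair reads the same three sources, which is why the two
    can issue in the same clock.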

    I started with that ADDGC approach too.
    But if you want to both generate and propagate carries in an integer
    register it implies a 3 source register ADD.
    Since a general 3 integer register ADD can generate 2 bits of carry-out,
    it implies two dest registers.
    Which is how I end up with various forms of double-wide adds.

    The carry bits propagate through back-to-back register forwarding.

    ; A = (r4,r3,r2,r1)
    ; B = (r8,r7,r6,r5)
    ; S = (r13,r12,r11,r10,r9)

    addw2 (r10,r9) = r1 + r5
    addw3 (r11,r10) = r2 + r6 + r10
    addw3 (r12,r11) = r3 + r7 + r11
    addw3 (r13,r12) = r4 + r8 + r12

    Unsigned carry out if r13 != 0.
    Signed overflow is still detected with the idiom
    overflow = (((r12 xor r4) and (r12 xor r8)) < 0);
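
    The same chain in C, as a sketch: addw3 is modelled with the
    unsigned __int128 extension of GCC and Clang, and pair64_t, add256w and
    signed_overflow are hypothetical names. The overflow test is the idiom
    above, written as a check of bit 63 rather than a signed compare:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } pair64_t;

/* addw3 (hi,lo) = a + b + c: full double-wide result of a 3-input add.
 * hi can be 0, 1 or 2 for arbitrary inputs. */
static pair64_t addw3(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 t = (unsigned __int128)a + b + c;
    pair64_t r = { (uint64_t)t, (uint64_t)(t >> 64) };
    return r;
}

/* 256-bit add with a 5-limb result; s[4] is the unsigned carry out. */
static void add256w(uint64_t s[5], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t carry = 0;   /* addw2 is addw3 with a zero third input */

    assert(s && a && b);
    for (int i = 0; i < 4; i++) {
        pair64_t r = addw3(a[i], b[i], carry);
        s[i] = r.lo;
        carry = r.hi;
    }
    s[4] = carry;
}

/* Signed overflow: the top sum limb differs in sign from both top
 * source limbs. */
static bool signed_overflow(uint64_t a_hi, uint64_t b_hi, uint64_t s_hi)
{
    return (((s_hi ^ a_hi) & (s_hi ^ b_hi)) >> 63) != 0;
}
```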


  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Oct 31 05:18:37 2024
    From Newsgroup: comp.arch

    Thinking about organizing a cache controller to fetch an entire 4kB page
    of memory at a time on a cache miss. The reason being the memory system
    is tremendously faster than the CPU clock as long as burst mode is used.
    The longer the burst, the better. The entire 4kB page can be transferred
    in < 40 CPU clocks. It takes about 4 CPU clocks to fetch one cache line.
    Cache is still needed as the memory latency prevents its direct use.

  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Thu Oct 31 19:11:47 2024
    From Newsgroup: comp.arch

    On Thu, 31 Oct 2024 9:18:37 +0000, Robert Finch wrote:

    Thinking about organizing a cache controller to fetch an entire 4kB page
    of memory at a time on a cache miss. The reason being the memory system
    is tremendously faster than the CPU clock as long as burst mode is used.
    The longer the burst, the better. The entire 4kB page can be transferred
    in < 40 CPU clocks. It takes about 4 CPU clocks to fetch one cache line.
    Cache is still needed as the memory latency prevents its direct use.

    Caches use the line as the replacement quantum because over-fetching
    has not proven to be valuable in the not-so-distant past.

    For similar reasons, one does not install a cache line of PTEs on
    a TLB miss.
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 5 23:30:50 2024
    From Newsgroup: comp.arch

    Reached a milestone: got the Q+ CPU to execute Fibonacci in SIM. The CPU
    is working in serial mode.

    Backout and restore handling for branches is disabled. Instead,
    instructions following the branch that should not be executed are turned
    into copy-targets, that copy the target register to the new target
    register. Since the ROB is only 16 entries this can be done in about the
    same length of time as a backout and restore.

    The target register is already being read for predication purposes.

    Converting instructions to copy-targets means that checkpoints are not
    needed, so the core size can be reduced. It may cost a little bit in
    performance. 3 or 4 instructions can be skipped over per clock with
    copy-targets, compared to single instruction backout per clock.
