• Extending GPRs with AF+CF (Was: is Vax addressing sane today)

    From Michael S@already5chosen@yahoo.com to comp.arch on Tue Oct 15 13:46:33 2024
    From Newsgroup: comp.arch

    On Sun, 13 Oct 2024 13:00:14 +0300
    Michael S <already5chosen@yahoo.com> wrote:

    On Sat, 12 Oct 2024 10:23:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    That's correct about intrinsics, but incorrect about ADCX/ADOX.
    The latter can be moderately helpful in special situations, esp.
    128b * 128b => 256b multiplication, but it is never necessary
    and for addition/subtraction is not needed at all.

    They are useful if there are two strings of additions. This happens
    naturally in wide multiplication (also beyond 256b results). But it
    also happens when you add three multi-precision numbers (say, X, Y,
    Z): You need C for the carry of XYi=X[i]+Y[i]+C, and O for the carry
    of XYZ[i]=XYi+Z[i]+O. If you have ADCX/ADOX, you can do both
    additions in one loop, so XYi can be in a register and does not need
    to be stored. If you don't have these instructions, only ADC, you
    need one loop to compute X+Y and store the result in memory, and one
    loop to compute XY+Z, i.e., the lack of ADCX/ADOX results in
    substantial additional cost.
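
    A minimal sketch of the two-chain idea above, modeled in Python with
    64-bit limbs rather than real ADCX/ADOX instructions. The names
    (add3_one_pass, add3_two_pass, limbs_to_int) are hypothetical; the
    variables c and o stand in for the C and O flags, and limb arrays
    are assumed to be little-endian and of equal length.

```python
MASK = (1 << 64) - 1  # model one 64-bit limb


def limbs_to_int(limbs):
    """Interpret a little-endian list of 64-bit limbs as one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def add3_one_pass(x, y, z):
    """X+Y+Z in a single loop with two independent carry chains."""
    c = 0  # carry of the X+Y chain (the role CF plays for ADCX)
    o = 0  # carry of the XY+Z chain (the role OF plays for ADOX)
    out = []
    for xi, yi, zi in zip(x, y, z):
        xy = xi + yi + c      # XYi = X[i] + Y[i] + C
        c = xy >> 64
        xy &= MASK            # XYi lives only in a "register"
        s = xy + zi + o       # XYZ[i] = XYi + Z[i] + O
        o = s >> 64
        out.append(s & MASK)
    return out, c + o         # top limb is 0, 1 or 2


def add2(a, b):
    """Plain ADC-style addition with a single carry chain."""
    c = 0
    out = []
    for ai, bi in zip(a, b):
        s = ai + bi + c
        c = s >> 64
        out.append(s & MASK)
    return out, c


def add3_two_pass(x, y, z):
    """Same result with only one carry: two loops and a stored X+Y."""
    xy, c1 = add2(x, y)       # intermediate array written to memory
    s, c2 = add2(xy, z)
    return s, c1 + c2
```

    Both functions return the same limbs and the same top limb; the
    one-pass version simply never materializes the intermediate X+Y.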

    If you add 4 multi-precision numbers, AMD64 with ADX runs out of
    carry bits, so you have to spend the overhead of an additional loop
    (but not of two additional loops as without ADCX/ADOX).

    With carry bits in the general purpose registers
    <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> and 30 GPRs
    (one is zero, one is sp), you can add 14 multi-precision numbers per
    loop: 14 GPRs for source addresses, 1 GPR for the target address, 1
    for the loop counter, 13 registers for loop-carried carry flags.
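
    The register budget and the k-operand loop can be sketched the same
    way. This is a Python model only, not the encoding from the linked
    proposal: each loop-carried carry is held in its own list slot,
    standing in for one register, and the names registers_needed and
    add_many are hypothetical.

```python
MASK = (1 << 64) - 1  # model one 64-bit limb


def limbs_to_int(limbs):
    """Interpret a little-endian list of 64-bit limbs as one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def registers_needed(k):
    """Registers for adding k numbers in one loop, per the count above:
    k source addresses, 1 target address, 1 loop counter, k-1 carries."""
    return k + 1 + 1 + (k - 1)


def add_many(operands):
    """Add k equal-length limb arrays in one loop with k-1 carry chains."""
    k = len(operands)
    carries = [0] * (k - 1)   # one loop-carried carry per addition chain
    out = []
    for limbs in zip(*operands):
        acc = limbs[0]
        for j in range(1, k):
            t = acc + limbs[j] + carries[j - 1]
            carries[j - 1] = t >> 64
            acc = t & MASK
        out.append(acc)
    return out, sum(carries)  # top limb is at most k-1
```

    Under this count, 14 operands need 29 registers and 15 would need
    31, which is why 14 is the limit with 30 usable GPRs.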

    Of course, the question is if this kind of computation is needed
    frequently enough to justify this kind of extension. For
    multi-precision multiplication and squaring, Intel considered the
    frequency relevant enough to introduce ADCX/ADOX/MULX.

    - anton

    That's not bad. I think you can see for yourself that the spill and
    context-switch parts could benefit from more work.
    But I suspect that the main opposition you'll face in the RISC-V
    organization will center not on that, but on fear of an increase in
    cycle time, whether or not it is backed by hard numbers.


    Second thought: why do we have to insist on 64 payload bits?
    64-bit format with 2 or 3 flag bits and 62 or 61 payload bits appears
    to simplify system issues at relatively small cost in storage density.
    If we want to use your proposal as a base for growable integers that
    start as a single 64-bit word and are expected to remain single-word
    in the overwhelming majority of use cases, then we need at least one
    flag bit anyway.
    Another point: I already figured out from our earlier discussions that
    you like your bigints in two's-complement format. But sign-magnitude
    format simplifies many issues and probably removes the need for
    separate signed-overflow and unsigned-overflow (carry) bits.
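
    One way to picture the 62-bit-payload variant, as a rough Python
    model. Everything about the layout here is an assumption made for
    illustration: the payload is taken to be a two's-complement value in
    bits 0..61, bit 62 a signed-overflow flag and bit 63 a carry flag.
    The post does not fix any of these choices, and the names are
    hypothetical.

```python
PAYLOAD_BITS = 62
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1
FLAG_OVERFLOW = 1 << 62   # assumed position: signed overflow of the payload
FLAG_CARRY = 1 << 63      # assumed position: unsigned carry out of bit 61
LIMIT = 1 << (PAYLOAD_BITS - 1)


def pack(value):
    """Store a signed value that fits in 62 bits, with both flags clear."""
    assert -LIMIT <= value < LIMIT
    return value & PAYLOAD_MASK


def payload(word):
    """Read the 62-bit payload back as a signed integer."""
    v = word & PAYLOAD_MASK
    return v - (1 << PAYLOAD_BITS) if v >= LIMIT else v


def add_words(a, b):
    """Add two payloads; record carry and overflow in the flag bits."""
    raw = (a & PAYLOAD_MASK) + (b & PAYLOAD_MASK)
    word = raw & PAYLOAD_MASK
    if raw >> PAYLOAD_BITS:
        word |= FLAG_CARRY
    exact = payload(a) + payload(b)
    if not -LIMIT <= exact < LIMIT:
        word |= FLAG_OVERFLOW   # a growable integer would go multi-word here
    return word
```

    In this model the flags travel inside the same 64-bit memory word as
    the value, which is the system-level simplification the 62-bit
    format is after; the price is two bits of range per word.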


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Oct 15 19:35:31 2024
    From Newsgroup: comp.arch

    On Tue, 15 Oct 2024 10:46:33 +0000, Michael S wrote:

    On Sun, 13 Oct 2024 13:00:14 +0300

    Second thought: why do we have to insist on 64 payload bits?

    Because 8-bit bytes won (over 6-bit bytes and 9-bit bytes.)
    Then 8×2^k became the size of each larger data type.

    64-bit format with 2 or 3 flag bits and 62 or 61 payload bits appears
    to simplify system issues at relatively small cost in storage density.

    Losing the easy ability to go to 128-bit or 256-bit arithmetic.