On Sat, 12 Oct 2024 10:23:18 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
That's correct about intrinsics, but incorrect about ADCX/ADOX.
The latter can be moderately helpful in special situations, esp.
128b * 128b => 256b multiplication, but they are never necessary,
and for addition/subtraction they are not needed at all.
They are useful if there are two strings of additions. This happens
naturally in wide multiplication (also beyond 256b results). But it
also happens when you add three multi-precision numbers (say, X, Y,
Z): You need C for the carry of XYi=X[i]+Y[i]+C, and O for the carry
of XYZ[i]=XYi+Z[i]+O. If you have ADCX/ADOX, you can do both
additions in one loop, so XYi can be in a register and does not need
to be stored. If you don't have these instructions, only ADC, you
need one loop to compute X+Y and store the result in memory, and one
loop to compute XY+Z, i.e., the lack of ADCX/ADOX results in
substantial additional cost.
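
To make the two carry chains concrete, here is a minimal portable C sketch of the single-loop version. It is not ADX code: the variables c and o only play the roles of the C and O flags, and the function name add3 and the limb layout (64-bit limbs, least significant first) are assumptions made for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: r = x + y + z over n 64-bit limbs,
   least significant limb first. Returns the carry-out (0..2). */
unsigned add3(uint64_t *r, const uint64_t *x, const uint64_t *y,
              const uint64_t *z, size_t n)
{
    unsigned c = 0; /* carry of the X+Y chain (role of the C flag) */
    unsigned o = 0; /* carry of the XY+Z chain (role of the O flag) */
    for (size_t i = 0; i < n; i++) {
        /* XYi = X[i] + Y[i] + C; at most one of the two adds can wrap */
        uint64_t t = x[i] + c;
        unsigned c1 = t < c;
        uint64_t xy = t + y[i];
        c = c1 | (xy < t);
        /* XYZ[i] = XYi + Z[i] + O; XYi stays in a local, never stored */
        uint64_t u = xy + o;
        unsigned o1 = u < o;
        uint64_t s = u + z[i];
        o = o1 | (s < u);
        r[i] = s;
    }
    return c + o;
}
```

The intermediate XYi never goes to memory, which is the point made above. With only one carry flag available, the same computation needs two passes and a stored intermediate.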
If you add 4 multi-precision numbers, AMD64 with ADX runs out of
carry bits, so you have to spend the overhead of an additional loop
(but not of two additional loops as without ADCX/ADOX).
With carry bits in the general purpose registers
<https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> and 30 GPRs
(one is zero, one is sp), you can add 14 multi-precision numbers per
loop: 14 GPRs for source addresses, 1 GPR for the target address, 1
for the loop counter, 13 registers for loop-carried carry flags.
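
The same pattern generalizes to k operands with k-1 loop-carried carries, which is where the 13 carry registers for 14 sources come from. The following C sketch only models that structure in software; it does not use the proposed extension, and the name addk, the carry array, and the limb layout are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: r = src[0] + ... + src[k-1], each n 64-bit limbs,
   least significant limb first, k >= 1. carry[] must hold k-1 entries,
   one loop-carried carry per addition chain.
   Returns the sum of the final carries (0..k-1). */
unsigned addk(uint64_t *r, const uint64_t *const *src, size_t k,
              unsigned *carry, size_t n)
{
    for (size_t j = 0; j + 1 < k; j++)
        carry[j] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t acc = src[0][i];
        for (size_t j = 1; j < k; j++) {
            /* chain j-1: acc = acc + src[j][i] + carry[j-1] */
            uint64_t t = acc + carry[j - 1];
            unsigned c1 = t < acc;
            acc = t + src[j][i];
            carry[j - 1] = c1 | (acc < t);
        }
        r[i] = acc; /* only the final sum is stored */
    }
    unsigned top = 0;
    for (size_t j = 0; j + 1 < k; j++)
        top += carry[j];
    return top;
}
```

For k = 14 this keeps 13 carries live across iterations, matching the register count given above.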
Of course, the question is if this kind of computation is needed
frequently enough to justify this kind of extension. For
multi-precision multiplication and squaring, Intel considered the
frequency relevant enough to introduce ADCX/ADOX/MULX.
- anton
That's not bad. I think you can see for yourself that the spill and
context switch parts could benefit from more work.
But I suspect that the main opposition you'll face in the RISC-V
organization will center not on that, but on fear of an increase in
cycle time, whether or not that fear is backed by hard numbers.
On Sun, 13 Oct 2024 13:00:14 +0300
Second thought: why do we have to insist on 64 payload bits?
A 64-bit format with 2 or 3 flag bits and 62 or 61 payload bits appears
to simplify system issues at a relatively small cost in storage density.
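
As a rough software analogue of that trade-off (not the register format itself), here is a C sketch of addition on limbs that carry 62 payload bits in a 64-bit word, so the carry lands inside the same word instead of in a separate flag. The names add_r62, PAYLOAD_BITS and PAYLOAD_MASK are hypothetical, and inputs are assumed to have their top two bits clear.

```c
#include <stdint.h>
#include <stddef.h>

#define PAYLOAD_BITS 62
#define PAYLOAD_MASK ((UINT64_C(1) << PAYLOAD_BITS) - 1)

/* Hypothetical helper: r = x + y over n limbs of 62 payload bits each,
   least significant limb first. Inputs must satisfy limb <= PAYLOAD_MASK.
   Returns the carry-out (0 or 1). */
unsigned add_r62(uint64_t *r, const uint64_t *x, const uint64_t *y, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* At most 2*(2^62 - 1) + 1 = 2^63 - 1, so the add cannot wrap. */
        uint64_t s = x[i] + y[i] + carry;
        r[i] = s & PAYLOAD_MASK;    /* keep the 62 payload bits */
        carry = s >> PAYLOAD_BITS;  /* carry sits in bit 62 of the word */
    }
    return (unsigned)carry;
}
```

The storage cost is 2 bits in 64, about 3%, which is the "relatively small cost in storage density" mentioned above.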