• Zen 5 FP latencies / throughput

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Sep 19 18:09:11 2025
    From Newsgroup: comp.arch

    I just looked at the latency / throughput for Zen 5 (the link I
    followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
    wants to see for themselves), and I found the performance quite
    impressive.

    They can execute two 512-bit AVX512 fp adds in parallel (either 64
    or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

    Latency for the floating point add is two (!) cycles, for the FMA
    it is four cycles, which is not a lot when running with a boost
    frequency of 5.7 GHz. The ratio is also interesting, they must
    have optimized the floating-point adder quite well.

    Let's see... the peak FP performance with 64-bit reals, with 16
    cores (to get an upper limit on FP performance) would be

    16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
    which is approximately 4.3 TFlops per CPU.
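
    The product above can be reproduced with a short sketch. The constant
    names are made up for illustration, and the inputs are simply the
    figures quoted in the post (16 cores, two FMA pipes, two FADD pipes,
    eight 64-bit lanes per 512-bit vector, 5.7 GHz), so the result is an
    upper bound rather than a measurement.

```python
# Peak double-precision rate from the figures quoted in the post.
# All names here are illustrative; nothing is taken from AMD documentation.
CORES = 16
FMA_PIPES = 2            # each FMA counts as 2 FLOPs (multiply + add)
FADD_PIPES = 2           # each FADD counts as 1 FLOP
LANES_F64 = 512 // 64    # 64-bit elements per 512-bit vector
BOOST_HZ = 5.7e9         # assumes every core sustains boost clock


def peak_flops(cores, fma_pipes, fadd_pipes, lanes, hz):
    """Upper bound on FLOP/s: per-lane FLOPs per cycle times lanes, cores, clock."""
    flops_per_lane_per_cycle = 2 * fma_pipes + 1 * fadd_pipes
    return cores * flops_per_lane_per_cycle * lanes * hz


peak = peak_flops(CORES, FMA_PIPES, FADD_PIPES, LANES_F64, BOOST_HZ)
print(f"{peak / 1e12:.2f} TFLOPS")
```

    Lowering BOOST_HZ or dropping the FADD pipes shows how sensitive the
    bound is to each assumption.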

    An interesting question: When (approximately) did the total
    installed floating point performance of all computers worldwide
    surpass that of a single 16-core Zen5 CPU? My guess would be
    somewhere in the late 1970s/early 1980s, before the PC and the
    8087 took off.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 19 19:49:12 2025


    Thomas Koenig <tkoenig@netcologne.de> posted:

    I just looked at the latency / throughput for Zen 5 (the link I
    followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
    wants to see for themselves), and I found the performance quite
    impressive.

    They can execute two 512-bit AVX512 fp adds in parallel (either 64
    or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

    Latency for the floating point add is two (!) cycles, for the FMA
    it is four cycles, which is not a lot when running with a boost
    frequency of 5.7 GHz. The ratio is also interesting, they must
    have optimized the floating-point adder quite well.

    MIPS R2000 did FADD in 2 cycles: the first thing you have to recognize
    is that when the operand exponents differ by more than ±1,
    normalization is not needed. So, you build 2 FADDs, one specializing in
    the case where the exponents are within ±1 of each other--which means
    pre-alignment is not needed (>>1, ×1, <<1) and you can start the
    fraction add immediately, and then use the second cycle for
    normalization. And you build a second FADD that aligns before addition
    but does not need to normalize.

    This is all 1983-stuff.
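
    A minimal sketch of the path selection described above, assuming
    finite, nonzero binary64 inputs. The function names are hypothetical
    and this models only the exponent test and the resulting normalization
    distance, not the actual MIPS datapath.

```python
import math


def fadd_path(a: float, b: float) -> str:
    """Choose the adder path from the exponent difference alone."""
    _, ea = math.frexp(a)
    _, eb = math.frexp(b)
    # Exponents within +-1: no real pre-alignment, but cancellation is possible.
    return "near" if abs(ea - eb) <= 1 else "far"


def normalization_shift(a: float, b: float) -> int:
    """Bit positions between the larger operand's exponent and the sum's exponent."""
    s = a + b
    if s == 0.0:
        return 0
    return max(math.frexp(a)[1], math.frexp(b)[1]) - math.frexp(s)[1]


# Near path: close exponents, opposite signs, many leading bits cancel.
print(fadd_path(1.0, -0.9999), normalization_shift(1.0, -0.9999))
# Far path: alignment is needed, but the result moves by at most one bit.
print(fadd_path(1024.0, 1.0), normalization_shift(1024.0, 1.0))
```

    In the far case the returned shift stays within -1..1, which is why
    that path can skip a full normalization stage.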

    MIPS did get FMUL into 3 cycles, but modern wire delay is pushing
    it to 4 cycles. An FMUL of 4 cycles is a fairly easy target to hit
    with a "throw Verilog over the wall" design style.

    Let's see... the peak FP performance with 64-bit reals, with 16
    cores (to get an upper limit on FP performance) would be

    16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
    which is approximately 4.3 TFlops per CPU.

    An interesting question: When (approximately) did the total
    installed floating point performance of all computers worldwide
    surpass that of a single 16-core Zen5 CPU? My guess would be
    somewhere in the late 1970s/early 1980s, before the PC and the
    8087 took off.

    CoPilot says they sold 600,000 VAX 11/780s and at 1 FLOP×5MHz we
    get* 3G FLOPs. At this point we are still about 1300× short, so it
    was definitely post RISC generation 1 when there were that many FLOPs
    worldwide.

    (*) a very generous number--probably 4-6× overstating VAX capabilities
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Sep 20 12:41:11 2025

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    I just looked at the latency / throughput for Zen 5 (the link I
    followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
    wants to see for themselves), and I found the performance quite
    impressive.

    They can execute two 512-bit AVX512 fp adds in parallel (either 64
    or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

    Latency for the floating point add is two (!) cycles, for the FMA
    it is four cycles, which is not a lot when running with a boost
    frequency of 5.7 GHz. The ratio is also interesting, they must
    have optimized the floating-point adder quite well.

    Let's see... the peak FP performance with 64-bit reals, with 16
    cores (to get an upper limit on FP performance) would be

    16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
    which is approximately 4.3 TFlops per CPU.

    I do not think you can run 16 cores at boost frequency for any
    reasonable period of time. And all processors that I looked at
    slowed down the clock when AVX FMA was present. And I doubt this
    "on top" claim: 2 FMAs + 2 fadds need 10 arguments.
    If the chip can provide that many arguments in a single cycle,
    it can probably do so only for some special combination of
    sources.

    And note that your mix is 2 multiplies and 4 adds per cycle.
    Normal FP mix is closer to 50% multiplies.
    --
    Waldek Hebisch
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 20 13:32:34 2025

    Waldek Hebisch <antispam@fricas.org> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    I just looked at the latency / throughput for Zen 5 (the link I
    followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
    wants to see for themselves), and I found the performance quite
    impressive.

    They can execute two 512-bit AVX512 fp adds in parallel (either 64
    or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

    Latency for the floating point add is two (!) cycles, for the FMA
    it is four cycles, which is not a lot when running with a boost
    frequency of 5.7 GHz. The ratio is also interesting, they must
    have optimized the floating-point adder quite well.

    Let's see... the peak FP performance with 64-bit reals, with 16
    cores (to get an upper limit on FP performance) would be

    16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
    which is approximately 4.3 TFlops per CPU.

    I do not think you can run 16 cores at boost frequency for any
    reasonable period of time. And all processors that I looked at
    slowed down the clock when AVX FMA was present.

    It slows down somewhat, but the behavior is still impressive.
    If you want to know the details, an analysis is at https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior .
    Unfortunately, they didn't run two FMA + two adds, but only
    two FMA + one add in parallel.

    And I doubt this
    "on top" claim: 2 FMAs + 2 fadds need 10 arguments.
    If the chip can provide that many arguments in a single cycle,
    it can probably do so only for some special combination of
    sources.

    Register to register

    And note that your mix is 2 multiplies and 4 adds per cycle.
    Normal FP mix is closer to 50% multiplies.

    I wrote about "peak performance", which is the speed that is
    guaranteed never to be exceeded :-) It's like the
    160 MFlops for the Cray-1, which people also could not realistically
    achieve.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Sep 20 22:10:57 2025

    On Sat, 20 Sep 2025 12:41:11 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:


    And note that your mix is 2 multiplies and 4 adds per cycle.
    Normal FP mix is closer to 50% multiplies.


    Normality is in the eye of the beholder.
    Consider FFT.
    The radix-2 butterfly that you find in the books consists of 2 FMUL,
    2 FMADD and 4 FADD/FSUB. The radix-4 butterfly that constitutes the
    bulk of modern high-perf implementations is a little less unbalanced,
    but only a little.
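
    The operation count for the radix-2 case can be read off a textbook
    butterfly. The sketch below is illustrative only: the function name
    and argument layout are assumptions, and plain Python does not fuse
    the multiply-add, so the comments mark which lines would map to FMADD
    on hardware that has it.

```python
def radix2_butterfly(ar, ai, br, bi, wr, wi):
    """One radix-2 butterfly: returns a + w*b and a - w*b as (re, im, re, im)."""
    # Complex product t = w * b: 2 plain multiplies plus 2 multiply-adds.
    tr = wr * br               # FMUL
    tr = tr - wi * bi          # FMADD (negated-multiply form)
    ti = wr * bi               # FMUL
    ti = ti + wi * br          # FMADD
    # Sum and difference of the two complex values: 4 FADD/FSUB.
    return ar + tr, ai + ti, ar - tr, ai - ti


# Check against Python's built-in complex arithmetic.
a, b, w = 1 + 2j, 3 + 4j, 0.5 - 0.5j
out = radix2_butterfly(a.real, a.imag, b.real, b.imag, w.real, w.imag)
print(out, a + w * b, a - w * b)
```

    Counting each FMADD as one multiply plus one add gives 4 multiplies
    against 6 additions per butterfly, which is the imbalance described
    above.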

