Forum: War Ensemble BBS

Zen 5 FP latencies / throughput

From Thomas Koenig <tkoenig@netcologne.de> to comp.arch on Fri Sep 19 18:09:11 2025

From Newsgroup: comp.arch

I just looked at the latency / throughput for Zen 5 (the link I
followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
wants to see for themselves), and I found the performance quite
impressive.

They can execute two 512-bit AVX-512 FP adds in parallel (either 64
or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

Latency for the floating point add is two (!) cycles, for the FMA
it is four cycles, which is not a lot when running with a boost
frequency of 5.7 GHz. The ratio is also interesting; they must
have optimized the floating-point adder quite well.

Let's see... the peak FP performance with 64-bit reals, with 16
cores (to get an upper limit on FP performance) would be

16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
which is approximately 4.4 TFlops per CPU.
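
The product above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post; the constant names are made up for illustration, and the 5.7 GHz all-core clock is the post's upper-bound assumption, not a sustained figure.

```python
# Peak FP64 throughput estimate, using only the numbers given in the post.
CORES = 16
FMA_PIPES = 2          # two 512-bit FMA instructions per cycle
FADD_PIPES = 2         # two 512-bit FP adds per cycle
FLOPS_PER_FMA = 2      # one multiply plus one add
FLOPS_PER_FADD = 1
LANES_FP64 = 512 // 64 # 8 doubles per 512-bit vector
CLOCK_HZ = 5.7e9       # boost clock, assumed here to hold on all cores

flops_per_lane_per_cycle = FMA_PIPES * FLOPS_PER_FMA + FADD_PIPES * FLOPS_PER_FADD
peak = CORES * flops_per_lane_per_cycle * LANES_FP64 * CLOCK_HZ

print(f"{peak / 1e12:.2f} TFLOPS")  # prints 4.38 TFLOPS
```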

An interesting question: When (approximately) did the total
installed floating point performace of all computers worldwide
surpass that of a single 16-core Zen5 CPU? My guess would be
somewhere in the late 1970s/early 1980s, before the PC and the
8087 took off.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.21a-Linux NewsLink 1.2

From MitchAlsup <user5857@newsgrouper.org.invalid> to comp.arch on Fri Sep 19 19:49:12 2025

Thomas Koenig <tkoenig@netcologne.de> posted:

I just looked at the latency / throughput for Zen 5 (the link I
followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
wants to see for themselves), and I found the performance quite
impressive.

They can execute two 512-bit AVX-512 FP adds in parallel (either 64
or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

Latency for the floating point add is two (!) cycles, for the FMA
it is four cycles, which is not a lot when running with a boost
frequency of 5.7 GHz. The ratio is also interesting; they must
have optimized the floating-point adder quite well.

MIPS R2000 did FADD in 2 cycles: the first thing you have to recognize
is that when the operand exponents differ by more than 1, normalization
is not needed. So, you build 2 FADDs, one specializing in the case
where the exponents are within ±1 of each other, which means a full
pre-alignment shift is not needed (only >>1, ×1, <<1) and you can start
the fraction add immediately, and then use the second cycle for
normalization. And you build a second FADD that aligns before the
addition but does not need to normalize.
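
The split described above can be checked on a toy format. The sketch below is not the R2000 design; it assumes a hypothetical 6-bit integer significand, ignores sign, rounding and special values, and simply measures how far the leading bit of the result moves relative to the larger operand for each exponent difference d.

```python
from itertools import product

P = 6  # significand width in bits (hypothetical toy format)

def norm_shift(ma, mb, d, subtract):
    """Leading-bit movement of the result relative to the larger operand."""
    a = ma << d                      # larger operand on the smaller one's scale
    r = a - mb if subtract else a + mb
    return r.bit_length() - a.bit_length()

def shift_range(d):
    """(min, max) leading-bit movement over all normalized operand pairs."""
    sigs = range(1 << (P - 1), 1 << P)   # normalized: top bit always set
    shifts = [norm_shift(ma, mb, d, sub)
              for ma, mb, sub in product(sigs, sigs, (False, True))
              if not (sub and (ma << d) <= mb)]  # skip zero/negative results
    return min(shifts), max(shifts)

for d in range(0, 5):
    print(d, shift_range(d))
```

For d of 0 or 1 the subtraction can cancel almost every bit, so a wide normalizing shift is needed but the alignment is at most one position. For d of 2 or more the leading bit moves by at most one position in either direction, so that path can spend its time on the alignment shift instead.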

This is all 1983-stuff.

MIPS did get FMUL into 3 cycles, but modern wire delay is pushing it
to 4 cycles. FMUL of 4 cycles is a fairly easy target to hit with a
"throw Verilog over the wall" design style.

Let's see... the peak FP performance with 64-bit reals, with 16
cores (to get an upper limit on FP performance) would be

16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
which is approximately 4.4 TFlops per CPU.

An interesting question: When (approximately) did the total
installed floating point performace of all computers worldwide
surpass that of a single 16-core Zen5 CPU? My guess would be
somewhere in the late 1970s/early 1980s, before the PC and the
8087 took off.

CoPilot says they sold 600,000 VAX 11/780s and at 1 FLOP×5MHz we
get* 3G FLOPs. At this point we are still a factor of 1300 short, so
it was definitely after RISC generation 1 that there were that many
FLOPs worldwide.

(*) a very generous number--probably 4-6× overstating VAX capabilities

From Waldek Hebisch <antispam@fricas.org> to comp.arch on Sat Sep 20 12:41:11 2025

Thomas Koenig <tkoenig@netcologne.de> wrote:

I just looked at the latency / throughput for Zen 5 (the link I
followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
wants to see for themselves), and I found the performance quite
impressive.

They can execute two 512-bit AVX-512 FP adds in parallel (either 64
or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

Latency for the floating point add is two (!) cycles, for the FMA
it is four cycles, which is not a lot when running with a boost
frequency of 5.7 GHz. The ratio is also interesting; they must
have optimized the floating-point adder quite well.

Let's see... the peak FP performance with 64-bit reals, with 16
cores (to get an upper limit on FP performance) would be

16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
which is approximately 4.4 TFlops per CPU.

I do not think you can run 16 cores at boost frequency for any
reasonable period of time. And all processors that I looked at
slowed down the clock when AVX FMA was present. And I doubt this
"on top" claim: 2 FMAs + 2 fadds need 10 arguments.
If the chip can provide that many arguments in a single cycle,
this is probably possible only for some special combination of
sources.

And note that your mix is 2 multiplies and 4 adds per cycle.
Normal FP mix is closer to 50% multiplies.
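
Both counts above follow from the claimed issue mix. A minimal tally, with illustrative names that do not come from AMD documentation:

```python
# Per-cycle issue mix claimed in the thread: 2 FMAs plus 2 FP adds.
FMA_PER_CYCLE = 2
FADD_PER_CYCLE = 2

# An FMA reads three source registers (a*b + c); an add reads two.
source_operands = FMA_PER_CYCLE * 3 + FADD_PER_CYCLE * 2

# Each FMA contributes one multiply and one add.
multiplies = FMA_PER_CYCLE
adds = FMA_PER_CYCLE + FADD_PER_CYCLE
multiply_fraction = multiplies / (multiplies + adds)

print(source_operands, multiplies, adds, round(multiply_fraction, 3))
# prints 10 2 4 0.333
```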
--
Waldek Hebisch

From Thomas Koenig <tkoenig@netcologne.de> to comp.arch on Sat Sep 20 13:32:34 2025

Waldek Hebisch <antispam@fricas.org> wrote:

Thomas Koenig <tkoenig@netcologne.de> wrote:

I just looked at the latency / throughput for Zen 5 (the link I
followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
wants to see for themselves), and I found the performance quite
impressive.

They can execute two 512-bit AVX-512 FP adds in parallel (either 64
or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

Latency for the floating point add is two (!) cycles, for the FMA
it is four cycles, which is not a lot when running with a boost
frequency of 5.7 GHz. The ratio is also interesting; they must
have optimized the floating-point adder quite well.

Let's see... the peak FP performance with 64-bit reals, with 16
cores (to get an upper limit on FP performance) would be

16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
which is approximately 4.4 TFlops per CPU.

I do not think you can run 16 cores at boost frequency for any
reasonable period of time. And all processors that I looked at
slowed down the clock when AVX FMA was present.

It slows down somewhat, but the behavior is still impressive.
If you want to know the details, an analysis is at https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior .
Unfortunately, they didn't run two FMA + two adds, but only
two FMA + one add in parallel.

And I doubt this
"on top" claim: 2 FMAs + 2 fadds need 10 arguments.
If the chip can provide that many arguments in a single cycle,
this is probably possible only for some special combination of
sources.

Register to register

And note that your mix is 2 multiplies and 4 adds per cycle.
Normal FP mix is closer to 50% multiplies.

I wrote about "peak performance", which is the speed that is
guaranteed never to be exceeded :-) It's like the 160 MFlops
for the Cray-1, which people also could not realistically
achieve.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From Michael S <already5chosen@yahoo.com> to comp.arch on Sat Sep 20 22:10:57 2025

On Sat, 20 Sep 2025 12:41:11 -0000 (UTC)
antispam@fricas.org (Waldek Hebisch) wrote:

And note that your mix is 2 multiplies and 4 adds per cycle.
Normal FP mix is closer to 50% multiplies.

Normality is in the eye of the beholder.
Consider FFT.
The radix-2 butterfly that you find in the books consists of 2 FMUL,
2 FMADD and 4 FADD/FSUB. The radix-4 butterfly that constitutes the bulk
of modern high-perf implementations is a little less unbalanced, but
only a little.
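
The radix-2 operation count can be read off a textbook butterfly written on split real and imaginary parts. The sketch below assumes a decimation-in-time butterfly computing a ± w·b; the `fma` helper is a hypothetical stand-in for a hardware fused multiply-add and is not fused in Python.

```python
def fma(x, y, z):
    """Stand-in for a hardware FMA (x*y + z); not actually fused here."""
    return x * y + z

def butterfly(ar, ai, br, bi, wr, wi):
    """Radix-2 butterfly: returns (a + w*b, a - w*b) as four reals."""
    # Complex multiply t = w * b: 2 FMUL + 2 FMADD.
    # Negating wi is free on hardware that has a negated-multiply FMA form.
    tr = fma(-wi, bi, wr * br)   # real part: wr*br - wi*bi
    ti = fma(wi, br, wr * bi)    # imaginary part: wr*bi + wi*br
    # Sum and difference: 4 FADD/FSUB.
    return ar + tr, ai + ti, ar - tr, ai - ti

# a = 1+2i, b = 3-i, w = -i
print(butterfly(1.0, 2.0, 3.0, -1.0, 0.0, -1.0))  # prints (0.0, -1.0, 2.0, 5.0)
```

That is 2 multiplies, 2 FMAs and 4 adds per butterfly, so additions outnumber multiplications two to one once each FMA is counted as one of each.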
