I just looked at the latency / throughput for Zen 5 (the link I
followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
wants to see for themselves), and I found the performance quite
impressive.
They can execute two 512-bit AVX-512 FP adds in parallel (with either
64-bit or 32-bit elements), plus two 512-bit AVX-512 FMA instructions
on top. Latency for the floating-point add is two (!) cycles, for the
FMA it is four cycles, which is not a lot when running at a boost
frequency of 5.7 GHz. The ratio is also interesting; they must
have optimized the floating-point adder quite well.
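To put those cycle counts into wall-clock terms, here is a quick sketch that converts them to nanoseconds. It assumes the core really is running at the 5.7 GHz boost clock quoted above; the function name is made up for illustration.

```python
BOOST_HZ = 5.7e9  # boost clock quoted in the post (assumed to be sustained)

def latency_ns(cycles, hz=BOOST_HZ):
    """Convert a latency in clock cycles to nanoseconds at clock rate hz."""
    return cycles / hz * 1e9

print(f"fadd: {latency_ns(2):.2f} ns, FMA: {latency_ns(4):.2f} ns")
# prints "fadd: 0.35 ns, FMA: 0.70 ns"
```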
Let's see... the peak FP performance with 64-bit reals, with 16
cores (to get an upper limit on FP performance) would be
16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
which is approximately 4.4 TFlops per CPU.
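For anybody who wants to play with the numbers, the estimate can be written out as a small script. This is only a back-of-the-envelope sketch: the constant names are invented for illustration, and it assumes all 16 cores sustain the boost clock and issue two FMAs plus two adds every cycle, which is an upper bound rather than something a real workload would reach.

```python
CORES = 16
FMA_PER_CYCLE = 2        # 512-bit FMAs issued per cycle, per core
FADD_PER_CYCLE = 2       # 512-bit FP adds issued per cycle, per core
FLOPS_PER_FMA = 2        # one multiply plus one add per element
FLOPS_PER_FADD = 1
LANES_FP64 = 512 // 64   # 8 double-precision elements per 512-bit vector
BOOST_HZ = 5.7e9         # assumes every core holds the boost clock

def peak_flops(cores=CORES, hz=BOOST_HZ):
    """Upper bound on FP64 operations per second for the whole chip."""
    per_cycle = (FMA_PER_CYCLE * FLOPS_PER_FMA
                 + FADD_PER_CYCLE * FLOPS_PER_FADD) * LANES_FP64
    return cores * per_cycle * hz

print(f"{peak_flops() / 1e12:.2f} TFlops")  # prints "4.38 TFlops"
```

Passing a lower `hz` (an all-core clock instead of the single-core boost) gives a less optimistic figure with the same formula.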
An interesting question: When (approximately) did the total
installed floating-point performance of all computers worldwide
surpass that of a single 16-core Zen 5 CPU? My guess would be
somewhere in the late 1970s/early 1980s, before the PC and the
8087 took off.
Thomas Koenig <tkoenig@netcologne.de> wrote:
> I just looked at the latency / throughput for Zen 5 (the link I
> followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
> wants to see for themselves), and I found the performance quite
> impressive.
> They can execute two 512-bit AVX-512 FP adds in parallel (with either
> 64-bit or 32-bit elements), plus two 512-bit AVX-512 FMA instructions
> on top. Latency for the floating-point add is two (!) cycles, for the
> FMA it is four cycles, which is not a lot when running at a boost
> frequency of 5.7 GHz. The ratio is also interesting; they must
> have optimized the floating-point adder quite well.
> Let's see... the peak FP performance with 64-bit reals, with 16
> cores (to get an upper limit on FP performance) would be
> 16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
> which is approximately 4.4 TFlops per CPU.
I do not think you can run 16 cores at boost frequency for any
reasonable period of time. And all processors that I looked at
slowed down the clock when AVX FMA instructions were executing.
And I doubt this "on top" claim: 2 FMAs + 2 fadds need 10 arguments
(source operands). If the chip can provide that many arguments in a
single cycle, this is probably possible only for some special
combination of sources.
And note that your mix is 2 multiplies and 4 adds per cycle.
A normal FP mix is closer to 50% multiplies.
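The two counts above are easy to check with a few lines. This is just a sketch of the arithmetic, assuming an FMA reads three source operands (a * b + c) and an add reads two; the helper name is hypothetical.

```python
FMA_SOURCES = 3   # a * b + c reads three registers
FADD_SOURCES = 2  # a + b reads two registers

def cycle_summary(fmas=2, fadds=2):
    """Return (source operands, multiplies, adds) for one cycle of issue."""
    sources = fmas * FMA_SOURCES + fadds * FADD_SOURCES
    multiplies = fmas          # each FMA contributes one multiply
    adds = fmas + fadds        # each FMA also contributes one add
    return sources, multiplies, adds

sources, multiplies, adds = cycle_summary()
print(sources, multiplies, adds)  # prints "10 2 4"
print(f"multiply share: {multiplies / (multiplies + adds):.0%}")  # prints "multiply share: 33%"
```

With FMAs only (`cycle_summary(fmas=2, fadds=0)`) the multiply share is 50%, which is the mix the peak figure would have to be compared against for typical code.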